
Google URL Index Hits 1 Trillion

mytrip points out news that Google's index of unique URLs has reached a milestone: one trillion. Google's blog provides some more information, noting, "The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we've seen a lot of big numbers about how much content is really out there. To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google's index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day."
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Re:Amazing (Score:5, Informative)

    by Freaky Spook ( 811861 ) on Saturday July 26, 2008 @12:39AM (#24345691)

    I couldn't agree more.

    Many of the clients I support are constantly asking me "Is there a program that does this?" or "Can you find me a program to do this?" etc.

    I used to be able to just use Google to help me get started, but these days the top results are all those bloody link farms peddling "free" software; even when you type in the word "review" you come up with link farms that offer no reviews.

  • Comment removed (Score:5, Informative)

    by account_deleted ( 4530225 ) on Saturday July 26, 2008 @01:06AM (#24345809)
    Comment removed based on user account deletion
  • Re:Amazing (Score:5, Informative)

    by arotenbe ( 1203922 ) on Saturday July 26, 2008 @01:27AM (#24345877) Journal

    Many of the clients I support are constantly asking me "Is there a program that does this? or Can you find me a program to do this" etc etc.

    I used to be able to just use google to help me get started but these days the top level searches are all those bloody link farms peddling "free" software

    Have you tried SourceForge [sourceforge.net]? That's what it's there for, you know.

  • Re:How long till.. (Score:5, Informative)

    by onedotzero ( 926558 ) on Saturday July 26, 2008 @02:05AM (#24346013) Homepage

    ... and cut out Experts-Exchange.com from your search results since their pages don't actually return the information you think they do.

    Perhaps you should try scrolling to the bottom of the page... :)

  • Re:How long till.. (Score:5, Informative)

    by cdrudge ( 68377 ) on Saturday July 26, 2008 @02:05AM (#24346015) Homepage

    It took me a while to realize it, but if you scroll clear to the bottom of an Experts-Exchange post, you'll find the comments unhidden and relevant.

  • Re:How long till.. (Score:5, Informative)

    by Anonymous Coward on Saturday July 26, 2008 @02:07AM (#24346021)

    ...and cut out Experts-Exchange.com from your search results since their pages don't actually return the information you think they do.

    If you block cookies from experts-exchange.com, you can actually see the answers on any E-E page. After your first visit, it normally sets a cookie so the results are hidden the next time, which is how they get Google to index their pages in the first place. With their cookies blocked, you can see the answers; you just have to scroll about 7/8ths of the way down the page past all the fake "Please sign up to see this result" boxes.
    (First AC post in years... tee hee. :)
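    A minimal Python sketch of the cookie-blocking idea above, using the standard library's `http.cookiejar` (the blocked domain list is illustrative; a browser cookie setting achieves the same thing):

    ```python
    import urllib.request
    from http.cookiejar import CookieJar, DefaultCookiePolicy

    # Refuse to store any cookies from the listed domains, so the site
    # never learns this is a repeat visit.
    policy = DefaultCookiePolicy(
        blocked_domains=["experts-exchange.com", ".experts-exchange.com"]
    )
    jar = CookieJar(policy=policy)

    # An opener built with this jar silently drops the site's
    # Set-Cookie headers on every response it handles.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    print(policy.is_blocked("experts-exchange.com"))  # True
    ```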

  • Re:Try "Live" search (Score:0, Informative)

    by Anonymous Coward on Saturday July 26, 2008 @02:12AM (#24346047)

    The first result for 'getfirefox' is http://www.getfirefox.net/ [getfirefox.net] which seems to be correct to me.
    The first result for 'Linux' is http://www.linux.org/ [linux.org] not some Microsoft website.
    The first result for 'Open Office' is http://www.openoffice.org/ [openoffice.org] and not some Microsoft website.

    You lie?

  • Re:Some numbers (Score:3, Informative)

    by Ihmhi ( 1206036 ) <i_have_mental_health_issues@yahoo.com> on Saturday July 26, 2008 @03:35AM (#24346313)

    You mean Googlewhacking [wikipedia.org], except not nearly as hard?

  • Re:How long till.. (Score:4, Informative)

    by Eddi3 ( 1046882 ) on Saturday July 26, 2008 @04:33AM (#24346487) Homepage Journal

    Actually, if you go to the cached version of those pages, you can see all the answers. You can also just send Googlebot's user agent via the User Agent Switcher [mozilla.org].
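    The same user-agent trick works outside the browser; a hedged Python sketch (the UA string and URL are assumptions, not verified against Google's current documentation):

    ```python
    import urllib.request

    # Googlebot's advertised user-agent string (assumed; check Google's
    # crawler documentation for the current value).
    GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)")

    # Build a request that identifies itself as Googlebot, just as the
    # User Agent Switcher extension would inside the browser.
    req = urllib.request.Request(
        "http://www.example.com/some-question.html",  # placeholder URL
        headers={"User-Agent": GOOGLEBOT_UA},
    )

    print(req.get_header("User-agent"))
    ```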

  • by Smauler ( 915644 ) on Saturday July 26, 2008 @04:40AM (#24346513)

    No one in the UK really uses the long scale system. For example, traditional UK billions are _never_ used in government budgets, and no one points out that the "American" billion is being used. A billion is just 1E9 here, like just about everywhere else.

    I guess some older people may be confused (what's new ;)), but I'll wager a large proportion of the younger UK population don't even know what a traditional English billion is. I'm 30, and I've never used 1E12 as a billion, or even been taught it could be.

  • by Coolhand2120 ( 1001761 ) on Saturday July 26, 2008 @05:52AM (#24346759)
    There are so many dynamic pages on the net now that one web site, like Slashdot as an earlier poster commented, can contain literally millions of pages. People use tools like mod_rewrite [apache.org], ISAPI_Rewrite [isapirewrite.com] and LinkFreeze [helicontech.com] to manipulate spiders into crawling pages that are nearly identical. For more than one customer I've built meta, title and content randomization, serialization and/or URL rewriting schemes to make damn sure spiders index every possible dynamic page, and it works. I have a single dynamic page that must have been indexed hundreds, maybe thousands of times with slightly different content, and they are all in the index.

    Google tries to detect a dynamic page by looking for ampersands and equal signs, as well as looking at the content of the page; it is really quite easy to fool.

    e.g.: http://somesite.com/itemlist.php?listmode=1&category=beds&orderby=7 [somesite.com]
    when 'rewritten' shows up as
    http://somesite.com/items/1/beds/7.html

    So a billion web pages could really be just a few hundred thousand dynamic pages, and I know of sites like this. Not that the pages don't have relevant information, though some of it can be redundant. For instance, when the spider crawls across "Records per page = 10" > "Records per page = 20" > "Records per page = 30" etc., or when lazy programmers don't use cookies and databases to store state but instead concatenate the user's selections onto the URL. Thank god for that GET limit [boutell.com]. People need to use POST!

    If someone knows how to stop this message board from creating links out of false URLs please, let me know.
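    The rewrite scheme described above can be sketched in Python; the site, query parameter names, and path layout are all hypothetical, mirroring the example URLs in the comment:

    ```python
    from urllib.parse import urlparse, parse_qs

    def rewrite(url: str) -> str:
        """Map a dynamic itemlist URL onto a 'static-looking' path,
        the reverse of what mod_rewrite-style tools do on the server."""
        parts = urlparse(url)
        qs = parse_qs(parts.query)  # e.g. {"listmode": ["1"], ...}
        return "{}://{}/items/{}/{}/{}.html".format(
            parts.scheme, parts.netloc,
            qs["listmode"][0], qs["category"][0], qs["orderby"][0],
        )

    print(rewrite("http://somesite.com/itemlist.php?"
                  "listmode=1&category=beds&orderby=7"))
    # -> http://somesite.com/items/1/beds/7.html
    ```

    Every distinct combination of the three parameters yields a distinct "static" URL, which is exactly why a single dynamic page can balloon into thousands of indexed entries.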

