Google Businesses The Internet

Google URL Index Hits 1 Trillion 249

mytrip points out news that Google's index of unique URLs has reached a milestone: one trillion. Google's blog provides some more information, noting, "The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we've seen a lot of big numbers about how much content is really out there. To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google's index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day."
This discussion has been archived. No new comments can be posted.

  • Amazing (Score:3, Interesting)

    by SoupIsGoodFood_42 ( 521389 ) on Saturday July 26, 2008 @12:13AM (#24345541)

    As someone who is partially engineering/analytically minded (but not a great programmer) it amazes me how Google has managed to index so much data, yet at the same time, serve up results in a fraction of a second to so many people.

  • Some numbers (Score:5, Interesting)

    Counts of words:

    the: 18.3 billion pages
    a: 23.9B
    0: 12.7B
    1: 25.4B
    in: 17.1B
    I: 10.2B

    I know these numbers aren't exact, but you'd think one of them would be over 100B if Google is really indexing a trillion pages. What's on them? Anyone find any keywords that produce more?

  • by bogaboga ( 793279 ) on Saturday July 26, 2008 @12:31AM (#24345663)
    This might be off-topic but I wonder what's going on with Sergey Brin and Larry Page's [PhD] education? Just wondering...did they give up?
  • Bandwidth (Score:1, Interesting)

    by Anonymous Coward on Saturday July 26, 2008 @01:18AM (#24345843)

    I wonder how much bandwidth the daily/continuous Google index process takes.

  • Re:Amazing (Score:3, Interesting)

    by SoupIsGoodFood_42 ( 521389 ) on Saturday July 26, 2008 @01:41AM (#24345931)

    Yeah, that's a problem. I'm sure they'll work it out. I don't find it to be a problem most of the time though, just on certain searches in certain places. They had a real spam problem if you searched for info on pharmaceuticals in their groups search last time I checked (about a month ago). The problem wasn't the Usenet groups, but their own special groups, and the worst thing is you can't filter out their groups and just search Usenet ones.

    I tried to contact them about it and discovered that they could also really do with making their site more accessible to general enquirers and feedback -- Apple make it easy to give feedback. Yeah, they probably ignore most of the stuff sent through there, but at least you can feel as if there is some possibility of them knowing about a certain issue, unlike when you can't find a way to send feedback and just end up with sales and PR contacts (who probably delete anything not related to their department) after 10 mins of browsing -- not good for customer relations.

  • Try "Live" search (Score:3, Interesting)

    by symbolset ( 646467 ) * on Saturday July 26, 2008 @01:45AM (#24345953) Journal

    And you'll be back faster than a Google search result. Weeding out the crap?

    Just for a sample, try this one: getfirefox [live.com]. If the first link on that search goes to a Mozilla mirror you will win one Internet. Try Linux [live.com]. Hey, this is fun. Spoiler: the first link there is always "www.Microsoft.com/Windows : Special Offers from Windows Vista® w/ the Purchase of Select Laptops." The first time I tried this I was looking for Open Office and wound up misdirected to a members only site where you had to register to download a probably spyware infested Open Office [live.com] and signing up for unlimited pharma spam. The scary part is that the text of the link misled me to believe I was headed for "OpenOffice.org". Try it and see. Let's find more horrifically inappropriate ad placements and query results, shall we? I'll bet you could come up with a really funny one.

    Note: Please don't go to any of the sites linked to those search results through live.com. Bad things might happen to your Windows box and there's nothing there of interest for your powerbook.

    Yeah, that's a good search result ad, don't you think? No wonder Google is becoming a verb.

  • by sweet_petunias_full_ ( 1091547 ) on Saturday July 26, 2008 @01:46AM (#24345959)

    "the web is something like 42% porn"

    That probably stopped being the case after namespace speculators started buying up expired domains in large numbers just to put up a mildly useless index on *each* and *every* site to collect ad revenue or marketing statistics off of unwary visitors. I would also include typosquatters in that category, and maybe someone else can name a few other examples of utter namespace hogging uselessness.

    Whatever it is, you can rest assured that it's mostly repetitive trash... no need to stand in awe of it.

  • Re:First Post (Score:4, Interesting)

    by Vectronic ( 1221470 ) on Saturday July 26, 2008 @01:54AM (#24345983)

    -1 Redundant sure...

    But that's sort of along the lines I was/am thinking... take txoof's post alone (or mine, or whoever may reply): there are 3 separate URLs for each Slashdot comment.

    The Header:
    http://search.slashdot.org/comments.pl?sid=626647&cid=24345519 [slashdot.org]

    The User:
    http://slashdot.org/~txoof [slashdot.org]

    The Score:
    http://search.slashdot.org/article.pl?sid=08/07/26/0036245# [slashdot.org]

    How many Slashdot comments are there? It's probably in the high millions (rhetorical, but I'm interested to know nonetheless). There's an average of about 250 comments per article and about 25 articles a day; that's about 2 million comments a year, so 6 million links. Then take into consideration stuff like Facebook, which bounces URLs (http://www.facebook.com/link=###/etc), or sites that generate a random identifier every few minutes, making those "unique", and it gets unexciting quite quickly. Although billions is still fairly high.
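
The back-of-the-envelope estimate above can be checked with a few lines of Python. This is only a sketch: the figures (250 comments per article, 25 articles a day, 3 URLs per comment) are the poster's rough guesses, and the constant names are made up for illustration.

```python
# Rough guesses taken from the comment above, not measured values.
COMMENTS_PER_ARTICLE = 250
ARTICLES_PER_DAY = 25
URLS_PER_COMMENT = 3      # header, user, and score links
DAYS_PER_YEAR = 365

# Comments posted in one year under these assumptions.
comments_per_year = COMMENTS_PER_ARTICLE * ARTICLES_PER_DAY * DAYS_PER_YEAR

# Distinct URLs those comments would contribute to an index.
urls_per_year = comments_per_year * URLS_PER_COMMENT

print(f"comments per year: {comments_per_year:,}")
print(f"URLs per year:     {urls_per_year:,}")
```

Under these guesses the totals come out a little above the rounded "2 million" and "6 million" figures, which is consistent with the poster's "about".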

  • Re:How long till.. (Score:5, Interesting)

    by blahplusplus ( 757119 ) on Saturday July 26, 2008 @02:22AM (#24346097)

    "I'm more interested in when Google starts returning relevant results to my queries.

    I can't believe that I'm the only one that finds Google's quality of service somewhat below par."

    You're not the only one, but for the most part it is better than most other search engines out there. The real problem is spammers and paid advertising; I think spammers have really made search frustrating for a lot of companies. And ad companies pay other people to promote their sites for them (digg, slashdot, etc). I've noticed the increase in spam-vertised websites in search results for a lot of things.

    Personally I think the idea of sharding and search being more specific for what you're looking for is needed. I'd like to see a google with 'tags' and a delicious interface, things like educational institutions and universities get lumped into their own search engine space for instance, this would help narrow down what one is looking for, although it would take time and feedback to design something well for other areas. The fact is that search results get diluted as you put more and more stuff online (numbers and geometric scale).

    For fun, I've noticed StumbleUpon and del.icio.us are not bad alternatives when looking for new and interesting sites without having to use search.

  • by Doug52392 ( 1094585 ) on Saturday July 26, 2008 @02:29AM (#24346119)

    On my home Web server, I accidentally left a copy of the PHP manual in a browsable folder, which was linked to the homepage. So when Google indexed my homepage, guess what it also checked for? Every single page the homepage linked to! Including that manual... and damn the PHP manual has a LOT of pages.

    So when I got back on the server and pulled up the logs (it was running strangely slow) I found Googlebot accessing page after page after page of the PHP manual. Thousands of pages. Lagging the server and Internet to hell.

  • Re:Some numbers (Score:5, Interesting)

    by Shaitan Apistos ( 1104613 ) on Saturday July 26, 2008 @02:40AM (#24346151)

    My Hobby

    Attributing my sources: http://xkcd.com/369/ [xkcd.com]

    In [xkcd.com] , my [xkcd.com] humble [xkcd.com] opinion [xkcd.com] my [xkcd.com] usage [xkcd.com] of [xkcd.com] "My [xkcd.com] Hobby" [xkcd.com] was [xkcd.com] sufficient [xkcd.com] attribution [xkcd.com], all [xkcd.com] by [xkcd.com] itself. [xkcd.com]

  • Re:Some numbers (Score:3, Interesting)

    by SnowZero ( 92219 ) on Saturday July 26, 2008 @05:17AM (#24346611)

    You can list all of them with less than a gigabyte: 10^8 * (8+1) ~= 858 MB
    The web is pretty big, so all of them are bound to happen *somewhere*.

    Plus, I just registered all8digits.net
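
For what it's worth, the size estimate above can be reproduced as follows. This sketch assumes the "+1" in the formula is a one-byte separator (such as a newline) after each 8-digit string, and that "MB" means 2^20 bytes; neither is stated in the comment, and the variable names are illustrative.

```python
NUM_DIGITS = 8

# Every 8-digit string from 00000000 to 99999999.
entry_count = 10 ** NUM_DIGITS

# Assumption: 8 digit characters plus 1 separator byte per entry.
bytes_per_entry = NUM_DIGITS + 1

total_bytes = entry_count * bytes_per_entry
total_mib = total_bytes / 2 ** 20  # assumption: "MB" here means 2^20 bytes

print(f"total bytes: {total_bytes:,}")
print(f"total size:  {total_mib:.1f} MiB")
```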

  • Re:Amazing (Score:1, Interesting)

    by Anonymous Coward on Saturday July 26, 2008 @05:32AM (#24346679)

    You are better off posting here, on blogoscoped, or blogging about it yourself (We're human after all). Any sort of normal feedback system doesn't work when there are individuals out there who don't *want* your feedback to make it through, and they can spend money to protect their "business" that you'd like to complain about.

    Can you give some specific example queries?

  • Comment removed (Score:4, Interesting)

    by account_deleted ( 4530225 ) on Saturday July 26, 2008 @06:12AM (#24346801)
    Comment removed based on user account deletion
  • Comment removed (Score:3, Interesting)

    by account_deleted ( 4530225 ) on Saturday July 26, 2008 @06:41AM (#24346879)
    Comment removed based on user account deletion
  • Re:How long till.. (Score:2, Interesting)

    by MPAB ( 1074440 ) on Saturday July 26, 2008 @08:13AM (#24347169)

    Try searching for a given product and the word "review" or something alike. You'll get endless pages of stores with no review whatsoever that must be scrolled away till you find a real technology site that has actually reviewed the product.

  • by blind biker ( 1066130 ) on Saturday July 26, 2008 @08:21AM (#24347207) Journal

    Thanks, but what I was trying to say (and I'll admit to bad wording), is that not only does google.com search return webstore fronts when I am actually looking for technical information about electronic components (this is the point I did not get across well - I am not looking for shops, but for info), but it returns the worst kind of webshops. The kind that isn't really a webshop at all, as in, you can't actually buy anything from them using the web.

    As for froogle: I just tried searching for "NAD 701" (without the quotes). The results I got were quite amusing. Not really related to the fine receiver/amp from NAD.

  • Re:How long till.. (Score:2, Interesting)

    by DiarmuidBourke ( 910868 ) on Saturday July 26, 2008 @01:07PM (#24349073)

    I'm more interested in when Google starts returning relevant results to my queries.

    I can't believe that I'm the only one that finds Google's quality of service somewhat below par. I guess they're better than randomly stabbing in the dark, and there certainly isn't any alternative that's obviously better, but Google sure isn't everything they think they are.

    I find this larger index rather unsettling as I feel my search results are becoming less relevant, mostly due to the following reasoning.

    Finding 1 page in a billion page index is relatively easier than finding 1 page in a trillion page index.

    Has the relevance of the results kept up with the growing index size? Or does the growing index cause more irrelevant pages to appear higher in the search results?

    In my opinion the next big advancement that is needed on the web is better auto-categorisation of content, so your search queries can better be matched to the content in the index. Simply relying on pagerank and text matching can't 'cut it' alone anymore IMO.
