The Internet Businesses Google

Google Index Doubles 324

geekfiend writes "Today Google updated their website to indicate over eight billion pages crawled, cached and indexed. They've also added an entry to their blog explaining that they still have tons of work to do."
This discussion has been archived. No new comments can be posted.

  • by xiando ( 770382 ) on Thursday November 11, 2004 @07:06AM (#10785970) Homepage Journal
    Personally I find that the lack of relevant pages is the biggest problem with search engines, not the lack of pages with information. It seems I always find what I'm looking for eventually; what I need improved is the time I spend looking through spam-bomb pages before I find a page with the correct information.

    These spam pages seem to be increasing; I mean those pages with just a bunch of keywords or the output of some search system.
  • by Jugalator ( 259273 ) on Thursday November 11, 2004 @07:07AM (#10785973) Journal
    I wonder if it'll take longer to index twice as many pages? Or if they, along with this change, improved their spider and/or added hardware. Otherwise I'm not sure this change is for the better, unless you like to search for really obscure topics.
  • by Sanity ( 1431 ) on Thursday November 11, 2004 @07:08AM (#10785986) Homepage Journal
    Does every minor Google or Apple related thing deserve a slashdot story? Can slashdot create a "Fanboy" section for insignificant stories advocating Google (with their software patent) and Apple (with their iTunes DRM)? That way I could filter them out more easily.
  • by seanyboy ( 587819 ) on Thursday November 11, 2004 @07:10AM (#10785991)
    Google needs to stop obsessing about the number of indexed pages, and start concentrating on the quality. Since pagerank was switched off, 2 out of 5 searches now seem to be jammed with pages full of nothing but random words and adverts. It's even more galling when the adverts are Google Ads. Much as I love Google, they're becoming increasingly less effective as a tool.
  • Re:This is news ? (Score:3, Insightful)

    by dotmike ( 829740 ) on Thursday November 11, 2004 @07:18AM (#10786018)

    Yeah, but it'd be news if the sun set twice in one night or rose twice as bright.

    It's more the exponential increase in the size of the index rather than the piecemeal addition.

  • by manmanic ( 662850 ) on Thursday November 11, 2004 @07:21AM (#10786029)
    Does this mean that I've been missing a huge amount of important information until now? I'd just assumed that Google covered the entire relevant web, but now it seems to cover the same amount again. My Google alerts [googlealert.com] also seem to have started producing a lot more results, which suggests that a lot of these new pages are rated quite highly. Who knows how much more quality content on the web we're just not seeing?
  • by Onionesque ( 455220 ) <spammie@pobox.com> on Thursday November 11, 2004 @07:21AM (#10786032) Homepage
    To paraphrase Churchill, Google is the worst system devised by the wit of man, except for all the others. Where else would you go? Yahoo? Hey, how about AltaVista?

    The problems faced by Google in their battle against the scumbags who would game the system are faced by every other search engine. Google, IMHO, handles them better.

  • by slavemowgli ( 585321 ) on Thursday November 11, 2004 @07:23AM (#10786041) Homepage
    I don't quite believe that Google would've limited themselves that way (using 32 bit identifiers for documents) - that would've been incredibly short-sighted.
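
Whether Google ever used 32-bit document identifiers is speculation in this thread, not something the story confirms. As a check on the arithmetic only, the sketch below compares the capacity of a 32-bit ID space with the "over eight billion" figure from the story. The names `MAX_32BIT_IDS`, `CLAIMED_PAGES` and `bits_needed` are made up for illustration.

```python
# Illustrative arithmetic only; nothing here describes Google's actual design.

MAX_32BIT_IDS = 2 ** 32          # distinct values an unsigned 32-bit ID can take
CLAIMED_PAGES = 8_000_000_000    # "over eight billion pages", rounded down


def bits_needed(count: int) -> int:
    """Smallest number of bits that can give `count` items distinct IDs."""
    if count < 1:
        raise ValueError("count must be positive")
    return max(1, (count - 1).bit_length())


if __name__ == "__main__":
    print(f"32-bit ID space: {MAX_32BIT_IDS:,}")
    print(f"Fits in 32 bits: {CLAIMED_PAGES <= MAX_32BIT_IDS}")
    print(f"Bits needed:     {bits_needed(CLAIMED_PAGES)}")
```

A 32-bit space tops out just under 4.3 billion IDs, so eight billion documents would need at least 33 bits per identifier.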
  • by Kithraya ( 34530 ) on Thursday November 11, 2004 @07:51AM (#10786142)
    I'm especially irritated by the increasing number of highly-ranked pages that are nothing more than another search engine's results. If Google could find some way to identify and remove these from my result set, Google's usefulness to me would increase 10 times over.
  • Does this mean...? (Score:4, Insightful)

    by jimicus ( 737525 ) on Thursday November 11, 2004 @07:51AM (#10786147)
    Does this mean twice as many pages with "Search for 'printer problem linux' on Kelkoo"?
  • Re:What? (Score:2, Insightful)

    by poohsuntzu ( 753886 ) on Thursday November 11, 2004 @07:58AM (#10786173) Homepage
    It isn't about having a better search engine, so much as it is knowing how to use it. If you are looking for information on a recipe for oriental rice using asian spice, how would you search?

    Bad search example:

    oriental rice recipe asian spice


    Good search example:

    recipe+"oriental rice"+spice


    See the difference? Google tries its best to get rid of the spam pages, but it won't ever combat them all. Half of the work has to be done by you, understanding the best way to describe to the search engine what it is you want to find. The better you explain it, the better it can search for you.
  • by Rakshasa Taisab ( 244699 ) on Thursday November 11, 2004 @08:03AM (#10786182) Homepage
    You can rant all you want, but Google still has a fair use right to your images. They are reduced-resolution images and therefore legal for non-commercial use.

    Not to mention robots.txt, but that is so obvious it shouldn't need to be mentioned.
  • by Mostly a lurker ( 634878 ) on Thursday November 11, 2004 @08:06AM (#10786190)
    the masses who use the "default" (MSN?) aren't bothering to answer

    I think it is more that many users of IE just do not twig that their failed page access resulted in an automatic query to MSN.

    In reality, most users make occasional deliberate queries to Google and more frequent accidental queries to MSN.

  • So, to sum up... (Score:5, Insightful)

    by kahei ( 466208 ) on Thursday November 11, 2004 @08:15AM (#10786225) Homepage

    I am feeding this troll because there are people who really _do_ think like that and I wish I could yell at them to their faces :)

    You put content in a place where it is publicly accessible. You explicitly and proactively made that content available to everyone, including 'the average surfer' and googlebots. You took no steps to make it available only to the select few of whom you approve.

    Now you are all cross and bothered because average surfers / googlebots have read / copied your content, such as it is.

    The solution is to drown yourself in a bucket. I have a bucket.

  • by Sai Babu ( 827212 ) on Thursday November 11, 2004 @08:19AM (#10786239) Homepage

    This is why I've been begging the Google folks to implement a NEAR [pandia.com] operator!

    Here is an example MSN search: http://search.msn.com/results.aspx?FORM=SMCRT&q=fish%20NEAR%20ahi%20NEAR%20recipe [msn.com]

  • by Jugalator ( 259273 ) on Thursday November 11, 2004 @08:24AM (#10786257) Journal
    Wow, Microsoft must have fixed it...
    It now no longer shows microsoft.com as top hit.

    Haha, I guess the joke reached MS headquarters. :-P
  • by PsychoSlashDot ( 207849 ) on Thursday November 11, 2004 @09:03AM (#10786413)
    What I've read on the Google help pages seems to indicate that they don't index punctuation or capitalization. When you search for something, your string is looked for within an existing index, and appropriate reference materials are shown. Including punctuation wouldn't result in any hits within their index, meaning no results.

    Now, obviously, it is theoretically possible to do just about anything. But in this case, with the architecture they have in place, anyone ever doing what you're asking would require a full-text search through their multi-TB dataset, which I suspect is highly impractical.

    My point is that, as I understand it, Google has coded a number of shortcut tricks which allow reasonable search times, and full-text string-exact searching would prevent them from using those shortcuts, resulting in search times they don't seem to think are reasonable.
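
The trade-off described above can be illustrated with a toy inverted index. This is a minimal sketch and not Google's architecture: it assumes a tokenizer that lowercases text and drops punctuation, as the comment suggests, and the names `tokenize`, `build_index`, `search` and `DOCS` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up sample documents.
DOCS = {
    1: "Printer problem on Linux: fixed!",
    2: "linux printer drivers",
    3: "C++ templates explained",
    4: "The C programming language",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(search(INDEX, "Linux PRINTER"))
    print(search(INDEX, "C++"))
```

Because punctuation and case are discarded before indexing, a query for "C++" looks up the same token as a query for "C". Telling the two apart would mean scanning the original text instead of consulting the index.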
  • by Mant ( 578427 ) on Thursday November 11, 2004 @11:18AM (#10787695) Homepage

    Robots.txt isn't something that only applies to Google; it is supposed to be honoured by all search engines, and it follows the Robots Exclusion Standard. So, when you claim these are Google's arbitrary rules, you are in fact wrong. They are neither Google's nor arbitrary (at least no more than any web standard).

    So your clue is not so much of a clue, as robots.txt doesn't fit your description.

    As for why you should know about it, you are putting up a web site, it is part of running a web site. You might as well complain why you need to know about HTML, CSS or registering a domain name. Quite what coming from the UK has to do with it (something I also do), I have no idea.

    "I simply do not want the average surfer to be able to visit my site, I am not interested in serving my pages to them, they simply would not appreciate or understand what it is I am showing."

    Then a publicly accessible website is the wrong place. It is not your personal space, and it isn't private. You made it available to the world; nobody made you. To turn around and complain when (some of) the world visits it is hypocrisy.

    It's like putting up posters around a town, then running around complaining that all these people are looking at them, won't appreciate them, and you don't want them to. It also comes across as condescending and arrogant, which probably explains the nastiness of some of the responses.

    You opted in when you put up the publicly accessible website. If all search engines had to be opt-in, nobody could find anything on the web, and it would lose a lot of its utility. You're assumed to want them crawling because the vast majority of people do; they want their site to be found. If you don't, though, no problem: just use the standards for stopping searches, or password-protect the site. No scandal at all, just hysterics.

    Showing the low res thumbnail of your image isn't violating your copyright either. The only legitimate claim you have is the amount of time it took to remove something from the cache.

    The "thieves" accusation is even more ridiculous. If you put something up on the web that people can see for free, you can't complain. There are options if you want to protect it. Google doesn't claim your work as theirs (which would be 'stealing' or at least copyright violation); they help people find your public web site.

    If you don't want a public website but made one, whose fault is that? If you are going to run a website and can't be bothered to find out how to do it properly, you can't blame Google.
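
For readers unfamiliar with the Robots Exclusion Standard mentioned above, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `/gallery/` path and the `example.com` URLs are assumptions made up for the example, and compliance is voluntary on the crawler's side.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every robot to stay out of /gallery/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /gallery/
"""


def make_parser(robots_txt):
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


PARSER = make_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/gallery/photo.jpg",
    ):
        # can_fetch() reports whether the named user agent may request the URL.
        print(url, PARSER.can_fetch("Googlebot", url))
```

The site owner only has to publish the two-line file; the check itself is the crawler's job, which is why the standard works for any search engine and not just Google.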

  • by PMuse ( 320639 ) on Thursday November 11, 2004 @01:38PM (#10789424)
    How about a NEAR operator? Sure, AND OR NOT are nice, but my results would be a lot more relevant if I could eliminate results where the search terms appeared a thousand words apart.
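
A proximity filter of the kind asked for here can be sketched in a few lines. This is an illustration under stated assumptions, not how MSN or any other engine defines NEAR: the `max_distance` window measured in words, and the function names `word_positions` and `near`, are hypothetical.

```python
import re


def word_positions(text):
    """Map each lowercased word to the list of positions where it occurs."""
    positions = {}
    for i, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        positions.setdefault(word, []).append(i)
    return positions


def near(text, first, second, max_distance):
    """True if some occurrence of `first` lies within `max_distance`
    words of some occurrence of `second`."""
    positions = word_positions(text)
    a = positions.get(first.lower(), [])
    b = positions.get(second.lower(), [])
    return any(abs(i - j) <= max_distance for i in a for j in b)


if __name__ == "__main__":
    page = "Grilled ahi recipe with fresh fish"
    print(near(page, "ahi", "recipe", 3))
    print(near(page, "ahi", "fish", 3))
```

A page where the two terms sit a thousand words apart would pass a plain AND query but fail `near()` with a small window, which is the filtering effect the comment is after.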
  • by cavemanf16 ( 303184 ) on Thursday November 11, 2004 @02:17PM (#10789894) Homepage Journal
    MSN's "msnbot" has been crawling/spidering my webserver (which runs Geeklog and is just another blog of my random crap) pretty extensively for weeks now. (Like 5 times a day, it seems.) Searching on Google for my site's name now reveals more results from my site, but not a lot of those circle-jerk style search results pages that are just trying to generate some ad revenue. However, using the beta.search.msn.com site DOES yield a lot more random crap (mostly blogs and personal webservers) that somehow generated some kind of link to my site because of the title of one of my articles, someone linking to my site in one of their blog posts, etc.

    I have a feeling MSN's new search site is gonna be mostly blogs and advertisements, not relevant information. I think it's good Google has indexed more pages, but I still believe their algorithm will continue to provide more USEFUL results than MSN. (BTW, the googlebot doesn't hit my site too frequently which tells me Google's bot understands that my site isn't updated too frequently, nor is it linked to from other important sites)
