
Google Index Doubles 324

geekfiend writes "Today Google updated their website to indicate over eight billion pages crawled, cached and indexed. They've also added an entry to their blog explaining that they still have tons of work to do."

  • by hanssprudel ( 323035 ) on Thursday November 11, 2004 @07:07AM (#10785977)

    What the article does not point out is why this is important. For just about forever, Google's store has been converging on 2**32 documents. Some people have speculated that Google simply could not update their 100,000+ servers with a new system that allowed more. Apparently they have now made the necessary architecture changes to identify documents with 64-bit (or larger) identifiers, and are back in the business of making their search more comprehensive.

    Good timing to coincide with MSN's attempt to start a new search engine, too!
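
    A quick back-of-the-envelope check of that 2**32 ceiling (plain Python, nothing Google-specific assumed):

        # Maximum number of documents addressable with 32-bit IDs
        max_32bit_ids = 2 ** 32
        print(max_32bit_ids)                    # 4294967296, i.e. about 4.3 billion

        # The newly announced index size no longer fits in that space,
        # which is why a move to wider (e.g. 64-bit) IDs would be needed.
        announced_pages = 8 * 10 ** 9
        print(announced_pages > max_32bit_ids)  # True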
  • Re:This is news ? (Score:3, Interesting)

    by PerpetualMotion ( 550623 ) on Thursday November 11, 2004 @07:12AM (#10785997)
    A bigger index does not equal better search results; with the press this will generate, however, it will equal profits.
  • by Jugalator ( 259273 ) on Thursday November 11, 2004 @07:16AM (#10786010) Journal
    Good timing to coincide with MSN's attempt to start a new search engine, too!

    Yes, they'd better fight back, as they now have a serious competitor in MSN.
    It's giving very accurate results [msn.com].

    Doesn't anyone find it strange that Google gave the same top result there a while back?

    MSN must be using a very similar algorithm.

    Maybe a bit too similar...?

    *tinfoil hat on*
  • by jlar ( 584848 ) on Thursday November 11, 2004 @07:29AM (#10786061)
    "Does this mean that I've been missing a huge amount of important information until now?"

    Maybe the steep increase is due to all the new file formats they are indexing now. That might be useful for some people (although I sometimes find it kind of annoying that a search returns MS-Word documents).
  • Re:This is news ? (Score:3, Interesting)

    by Ford Prefect ( 8777 ) on Thursday November 11, 2004 @07:32AM (#10786074) Homepage
    A bigger index does not equal better search results; with the press this will generate, however, it will equal profits.

    It would be terribly easy to get trillions of pages indexed. For instance, a site I've been working on has a public calendar system, with results fished out of a database. There are very few actual events in it at the moment, but with the 'Previous' and 'Next' links it'll run from 1970 to 2038. A naïve web-crawler would index every single month for every single year, but Google would appear to have crawled over just a few, presumably flagging the pages as too similar to warrant further investigation.

    With stuff like public web forums, Slashdot and the like, I can easily imagine comparatively small sites producing thousands of pages apiece. Is there useful information in there? Quite possibly, but it definitely needs treating in a different manner to an old-fashioned, static-pages-only site...
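
    A toy sketch of how a crawler could flag those calendar months as near-duplicates before indexing them all (generic shingling, not Google's actual method; the page texts are made up):

        def shingles(text, k=5):
            """Set of k-word shingles from a page's visible text."""
            words = text.lower().split()
            return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

        def jaccard(a, b):
            """Overlap between two shingle sets: 1.0 means identical."""
            return len(a & b) / len(a | b) if (a | b) else 1.0

        # Two adjacent, nearly empty calendar months differ only in the heading.
        june = "Site calendar for June 1987. There are no events scheduled this month. Use the Previous and Next links to browse other months."
        july = "Site calendar for July 1987. There are no events scheduled this month. Use the Previous and Next links to browse other months."
        other = "Contact page: our phone numbers, office hours, and directions to the venue are listed below."

        print(jaccard(shingles(june), shingles(july)))    # well above zero: the two months are near-duplicates
        print(jaccard(shingles(june), shingles(other)))   # zero overlap: a genuinely different page, worth indexing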
  • by Anonymous Coward on Thursday November 11, 2004 @07:34AM (#10786082)
    For just about forever, Google's store has been converging on 2**32 documents. Some people have speculated that Google simply could not update their 100,000+ servers with a new system that allowed more. Apparently they have now made the necessary architecture changes to identify documents with 64-bit (or larger) identifiers, and are back in the business of making their search more comprehensive.
    As someone who routinely follows these things, I couldn't agree more with your statement. My company operates a number of sites, and over the past 6 months, we've seen an obvious trend. Sites with, say, 5000+ pages, which used to be entirely indexed in Google, gradually had pages lost from Google. A search for site:somesite.com returned 5000 results 6 months ago. 3 or 4 months ago, the same search gave maybe 1000 results. This month, maybe 500 or 600. We were definitely of the opinion that Google's index was "maxed out" and was dropping large portions of indexed sites in favor of attempting to index new sites.

    Now after seeing this story, I did a search and found literally all 5000+ pages are indexed once again. This is a huge step forward for webmasters everywhere. If your site had been slowly edged out of Google's index it's most likely back in its entirety now.

    Thanks G.
  • Microsoft (Score:4, Interesting)

    by Cookeisparanoid ( 178680 ) on Thursday November 11, 2004 @07:38AM (#10786099) Homepage
    A lot of people have been asking what the point of the article is and why it matters. Well, possibly because Microsoft announced the launch of their search engine http://news.bbc.co.uk/1/hi/technology/4000015.stm and are claiming more pages indexed than Google (5 billion), so Google have responded by effectively doubling their pages indexed.
  • meta-no-archive (Score:3, Interesting)

    by Anonymous Coward on Thursday November 11, 2004 @07:54AM (#10786154)
    Apparently my sites will never get a good ranking on Google because I don't want the search engine to cache them, so I'm using meta no-archive tags. That's the only explanation I can come up with for why the sites rank so poorly on Google when they come up in the top 10-20 hits on Yahoo and other search engines. The keywords for the searches are valid and the sites are relevant to those searches, yet they don't show in the top 100-300 on Google.

    I've avoided all the usual spam-type tags (auto-refreshing, hidden text, cloaking, etc.) and the sites are legitimate and on the up and up, and yet the only page or two that Google is spidering are the few that appear to be without the no-archive tags and possibly the revisit/expire tags.

    Is Google's policy "let us cache your site, or get penalized"? Has anyone else run into a similar problem, or can anyone shed some light on this? The only other thing I can think of is the robots.txt file, which keeps Googlebot (and other spiders, via a *) out of the image directories. The spiders, including Googlebot, aren't restricted from entering any other directories; they are given free rein.

    Has anyone else had problems with no-cache, no-archive, tight revisit/expire times, or similar non-spam tags resulting in penalties to Google ranking?

    I've been using Google exclusively for a few years now, but the poor ranking of the sites on my server got me wondering about other sites relevant to my own searches that may be excluded or penalized by Google. So I've started using Yahoo search again, as much as I hate Yahoo (what they do with advertising in Yahoo Groups and Yahoo Mail is a shame). It appears that Yahoo is including better results, because other sites that actually are relevant show up with higher ranking. So I've learned that Google isn't as perfect as I thought it was, which was disappointing in itself. It was easy using one search site; now I have to use two to make sure I'm getting good results. Does anyone know if there is a plugin for Firefox with both Google and Yahoo search boxes on the toolbar?
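
    For reference, the setup being described boils down to the standard directives below (the directory name is just an example, not the poster's actual layout):

        <!-- per-page meta tag: the page may be indexed, but no cached copy is kept -->
        <meta name="robots" content="noarchive">

        # robots.txt: only the image directory is blocked; everything else stays crawlable
        User-agent: *
        Disallow: /images/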
  • by seanyboy ( 587819 ) on Thursday November 11, 2004 @07:58AM (#10786169)
    My bad. I'd skimmed a few things on the web, and assumed that it had been switched off. Looks instead as though Google have changed how it works. See PageRank is dead [zawodny.com]. I need to investigate further.
  • by metlin ( 258108 ) * on Thursday November 11, 2004 @08:08AM (#10786195) Journal
    Google has a problem with this because some of those searches are actually useful.

    For instance, when I search for something technical, I often run into search results from DBLP, arXiv, CiteSeer and the like -- although these are really search results within themselves, they're immensely useful to me.

    Since Google's interest and mine effectively conflict here, Google would need to figure out a way to strike a balance.
  • Re:What? (Score:3, Interesting)

    by LiquidCoooled ( 634315 ) on Thursday November 11, 2004 @08:10AM (#10786202) Homepage Journal
    I see the difference...

    Search terms: oriental rice recipe asian spice
    Search Results: Results 1 - 10 of about 254,000 for oriental rice recipe asian spice . (0.40 seconds)
    Search Effectiveness: REASONABLE. Good list of relevant items matched.

    Search terms: recipe+"oriental rice"+spice
    Search Results: Your search - recipe+"oriental rice"+spice - did not match any documents.
    Search Effectiveness: UTTER SHITE

    The user wants SIMPLICITY. If google cannot give decent results for simple search criteria, then people will go elsewhere.

    It's the KISS principle in effect.

  • by corrie ( 111769 ) on Thursday November 11, 2004 @08:27AM (#10786276)
    However, results from places like Starware Search are not useful, and they elevate my blood pressure with all the attempts at spamming me.

    Just because I use Firefox and Adblock doesn't mean I now want to visit all possible spam sites in existence.

    I don't care if Starware and friends make their money from advertising or not. The point is that Google is ALREADY a search engine, and a pretty good one at that. What is the point of returning results from another search engine, especially if the other one does not even have a specialised domain?
  • by jez9999 ( 618189 ) on Thursday November 11, 2004 @08:30AM (#10786287) Homepage Journal
    One thing that would really help me sometimes would be if Google allowed you to do an 'exact match' search. No, I don't mean enclosing something in double quotes, that still ignores capitalization, whitespace, and most non-letter characters. I'd like to be able to search for pages that have the EXACT string '#windows EFNET', for example, or '/usr/bin/' or whatever. '/Usr/biN' wouldn't match, and nor would '#windows^^EFNET' (where ^ is equal to a space :-) ).

    I sent an e-mail to Google about this and the guy who replied didn't seem to think it was possible... anyone know if it is?
  • by jmcmunn ( 307798 ) on Thursday November 11, 2004 @09:23AM (#10786531)
    Because every blogger in the universe has added at least 3 pages since the last index. I fail to see how it is significant to me that there are now 8 billion mostly worthless sites out there. The number of actually useful sites has not gone up considerably.
  • Re:Image Search (Score:3, Interesting)

    by BoldAC ( 735721 ) on Thursday November 11, 2004 @09:27AM (#10786551)
    While waiting for the update to their image search, everybody should optimize their web pages... google-style.

    For those of you who don't believe that having keywords in your URLs matters... just look at Google's own story, for example.

    http://www.google.com/googleblog/2004/11/googles-index-nearly-doubles.html

    "Google Index Nearly Doubles" is in the url and the first header. Look at how they do thinks... and your google traffic will increase.
  • by Erasmus Darwin ( 183180 ) on Thursday November 11, 2004 @09:39AM (#10786617)
    "But in this case, with the architecture they have in place, anyone ever doing what you're asking would require a full-text search through their multi-TB dataset, which I suspect is highly impractical."

    Actually, they could cut that down considerably. For example, say we were doing an exact search for '#windows EFNET' as in the original example. The first thing they could do is start with a traditional search on "#windows EFNET" [google.com]. At that point, they've cut their multi-TB dataset down to just a few megs or less of likely matches (in this case, only 10 pages matched). Then they could do a full-text check on each result, looking for an exact match and discarding all the rest.
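
    A minimal sketch of that two-pass idea (the candidate URL and the fetch step are hypothetical; this is not Google's actual pipeline):

        import urllib.request

        def exact_match_filter(candidate_urls, needle):
            """Second pass: keep only pages whose raw HTML contains the exact,
            case- and whitespace-sensitive string the user asked for."""
            hits = []
            for url in candidate_urls:
                try:
                    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
                except OSError:
                    continue                      # unreachable page: drop it
                if needle in html:                # plain substring test, nothing normalized
                    hits.append(url)
            return hits

        # First pass (the ordinary index lookup for "#windows EFNET") has already
        # narrowed billions of pages down to a handful of candidates.
        candidates = ["http://example.org/irc-channels.html"]   # made-up URL
        print(exact_match_filter(candidates, "#windows EFNET"))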

  • by bighoov ( 605325 ) on Thursday November 11, 2004 @09:40AM (#10786618) Homepage
    Probably not short-sighted, but rather a space and CPU efficiency issue. Space: if you have 64-bit doc IDs, even if you index 2^48 documents you're still wasting 16 bits per stemmed word per document. CPU: dealing with 64-bit integers on 32-bit hardware usually involves multiple loads, and decreases what can fit in the hardware data caches.
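
    Rough numbers behind the space point (the postings-per-document figure is a made-up average, not anything published by Google):

        DOC_COUNT        = 8 * 10 ** 9   # documents in the new index
        POSTINGS_PER_DOC = 500           # assumed average indexed word occurrences per document

        postings    = DOC_COUNT * POSTINGS_PER_DOC
        wasted_bits = 64 - 48            # 64-bit IDs when 48 bits would already cover 2**48 documents
        extra_bytes = postings * wasted_bits // 8

        print(f"{extra_bytes / 10 ** 12:.0f} TB of pure ID padding")   # ~8 TB under these assumptions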

"Protozoa are small, and bacteria are small, but viruses are smaller than the both put together."

Working...