
NCSA Compares Google and Yahoo Index Numbers

Posted by ScuttleMonkey
from the searching-for-truth dept.
chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to "over 20 billion items", compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own, independent, comparison of the two engines. "
This discussion has been archived. No new comments can be posted.

  • by ackthpt (218170) * on Monday August 15, 2005 @02:12PM (#13322982) Homepage Journal
    So the summary is that in all but 3% of cases, Yahoo finds fewer pages than Google, and the 18 bi1110nz Mayer claimed is a number he pulled right out of his own arse.

    Honestly, when I first heard the news over the weekend I thought "rubbish, they must be ignoring requests for spiders to go no further or something." I guess NCSA can either 1) Expect no gifts from Yahoo OR 2) Report significantly different results after a sizable gift to NCSA.

    75% less truth than other leading brand

  • Accurate results? (Score:5, Interesting)

    by bigwavejas (678602) * on Monday August 15, 2005 @02:12PM (#13322984) Journal
    Google sometimes returns some pretty interesting/entertaining results.

    Try searching for the word, "failure" in Google and check the results.

    This brings into question *accurate* results. In this case it appears that's left to interpretation.

  • The results (Score:5, Interesting)

    by Swamii (594522) on Monday August 15, 2005 @02:17PM (#13323034) Homepage
    For those that don't want to read the flippin' article:

    Based on this random sample, we found that on average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly less.


    In other words, they believe Google indexes more items based on their own tests of searching.
  • Hrmm (Score:3, Interesting)

    by T3kno (51315) on Monday August 15, 2005 @02:18PM (#13323041) Homepage
    Why wget instead of LWP?
  • by Whafro (193881) on Monday August 15, 2005 @02:19PM (#13323053) Homepage
    TFA notes that queries with greater than 1,000 results were dropped from the survey, because Google and Yahoo both truncate their results to 1,000.

    That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.

    Those could be the larger sites, where Yahoo is perhaps digging deeper, requesting data from forms, ignoring robots.txt, etc. It could be where they're getting those big claimed numbers of indexed documents.
  • by Anonymous Coward on Monday August 15, 2005 @02:24PM (#13323098)

    It seems to me that when Slashdot publishes an article that is favourable to Google, that was submitted by a Google staff member, one might question whether someone involved has a conflict of interest. It's not astroturfing, because his employment at Google was clearly mentioned. It might be an ad (or more correctly, a press release) masquerading as news. I wonder if the article would have been published had it been submitted anonymously...

  • by Iriel (810009) on Monday August 15, 2005 @02:24PM (#13323109) Homepage
    I think it is possible that Yahoo! has more items indexed than Google. It may not be true after all, but one has to give thought to the fact that Yahoo can search subscription based content. That has got to boost their numbers considerably beyond the range of queries that typically return less than one thousand results. It's possible that Yahoo! could have simply been fudging the numbers to get some press now that they're actually starting to get noticed again. I can't make a certain conjecture in either direction, but don't totally discredit Yahoo! without looking into everything.
  • More please! (Score:5, Interesting)

    by 2008 (900939) on Monday August 15, 2005 @02:25PM (#13323114) Journal
    This is a great article! I wish there were more like it on slashdot. It's scientific instead of an opinion piece, it has references, it's repeatable. It's also short and very readable, unlike a lot of science papers.

    OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".
  • by WoTG (610710) on Monday August 15, 2005 @02:30PM (#13323176) Homepage Journal
    Google started treating plurals as the same search about a year ago. Yahoo doesn't. So, if you google for "inkjet printers" and "inkjet printer" you will get the same result set; however, on Yahoo, you will get different results.

    The net result is that for the same index size, Google will return more results. (And, IMHO, more meaningful ones.)
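
A rough sketch of the effect described above, assuming a toy inverted index and a deliberately crude plural rule (strip one trailing "s"). Neither engine's real matching logic is known here; the three-page corpus and the helper names are made up for illustration.

```python
from collections import defaultdict


def naive_singular(word):
    """Very crude plural folding: strip one trailing 's' (illustrative only)."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_index(docs, normalize):
    """Map each normalized term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[normalize(word)].add(doc_id)
    return index


def search(index, query, normalize):
    """Return ids of docs containing every query term (AND semantics)."""
    sets = [index.get(normalize(w), set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()


# Hypothetical corpus, identical for both "engines".
DOCS = {
    1: "cheap inkjet printer review",
    2: "inkjet printers on sale",
    3: "laser printers compared",
}

exact = build_index(DOCS, str)              # no plural folding
folded = build_index(DOCS, naive_singular)  # plurals folded together

exact_hits = search(exact, "inkjet printers", str)
folded_hits = search(folded, "inkjet printers", naive_singular)
print(len(exact_hits), len(folded_hits))
```

Both indexes hold the same three pages, yet the folded one reports more hits for the same query, which is the point: hit counts alone do not pin down index size.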
  • Re:Accurate results? (Score:2, Interesting)

    by ArsonSmith (13997) on Monday August 15, 2005 @02:33PM (#13323220) Journal
    Hmm, bumbling idiot possibly, but since when has becoming the President of the US, then being elected again, been the mark of a failure?
  • Proper name samples (Score:5, Interesting)

    by jkauzlar (596349) * on Monday August 15, 2005 @02:39PM (#13323272) Homepage
    Let's try a few samples of proper names:

    Search: Valerie Plame
    Google: 908,000
    Yahoo: 2,580,000

    Search: "Boulder, Colorado"
    Google: 1,600,000
    Yahoo: 5,880,000

    Search: "Linus Torvalds"
    Google: 2,560,000
    Yahoo: 5,870,000

    I assume it goes on like this. Of course these exceed the 1000 maximum hit limit given in the study.

  • not so fast (Score:2, Interesting)

    by betsywetsy (12592) on Monday August 15, 2005 @02:48PM (#13323345)
    Looking at the first item in their result log, I'm unimpressed.
    Yahoo returns 0 results, and Google returns... 4 different links to the ispell dictionary (or variants thereof).
    ('carbolization clambers')
  • Re:Conclusion (Score:1, Interesting)

    by Anonymous Coward on Monday August 15, 2005 @02:52PM (#13323383)
    Hmmm, in my experience Google sometimes returns results that don't have the search terms in the page... but some other page that does contain the search term links to that result... I think that makes sense... but then again I might just be on crack
  • by jkauzlar (596349) * on Monday August 15, 2005 @02:52PM (#13323387) Homepage
    Okay, here are some unlikely proper names which stay well within the 1000 maximum hit limit:

    Search: "Dirk Bradford"
    Google: 11
    Yahoo: 15

    Search: "Ronald Hendrickson"
    Google: 170
    Yahoo: 418

    Search: "centerville baptist church" iowa
    Google: 43
    Yahoo: 37

    Well that's less certain. It's hard finding words that return over zero but less than a thousand results...
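
For anyone who wants to redo the arithmetic, here is a small sketch that takes the hit counts posted in this thread for the proper-name searches, drops any query where either engine reports more than 1,000 results (my reading of the study's cutoff, so treat it as an assumption), and averages the Yahoo/Google ratio. The counts are the engines' own estimates, and six queries are far too few to conclude anything.

```python
# Hit counts as posted in this thread: (query, google_hits, yahoo_hits).
SAMPLES = [
    ("Valerie Plame", 908_000, 2_580_000),
    ('"Boulder, Colorado"', 1_600_000, 5_880_000),
    ('"Linus Torvalds"', 2_560_000, 5_870_000),
    ('"Dirk Bradford"', 11, 15),
    ('"Ronald Hendrickson"', 170, 418),
    ('"centerville baptist church" iowa', 43, 37),
]

RESULT_CAP = 1000  # assumed cutoff: both engines truncate listings at 1,000


def within_cap(sample, cap=RESULT_CAP):
    """Keep a query only if neither engine reports more than `cap` hits."""
    _, google_hits, yahoo_hits = sample
    return google_hits <= cap and yahoo_hits <= cap


def mean_ratio(samples):
    """Average of per-query yahoo_hits / google_hits."""
    ratios = [yahoo / google for _, google, yahoo in samples]
    return sum(ratios) / len(ratios)


kept = [s for s in SAMPLES if within_cap(s)]
print(f"kept {len(kept)} of {len(SAMPLES)} queries")
print(f"mean Yahoo/Google ratio, capped queries: {mean_ratio(kept):.2f}")
print(f"mean Yahoo/Google ratio, all queries:    {mean_ratio(SAMPLES):.2f}")
```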

  • by Vellmont (569020) on Monday August 15, 2005 @03:18PM (#13323677)
    There's an inherent assumption in the Yahoo claim that more==better. Do I really care if a search returns 1 million results vs 6 million results?

    What I care about is actually getting the information I went out to find. There's only a certain number of hits I'm willing to explore. That's probably on the order of 100-200 or so if I _really_ need the information. The implication by Yahoo is that more hits == better top ranked hits. Is that true? Really what should be done is just compare the top few hundred hits between the two search engines and see how they differ. Those are the only ones that matter anyway.

    Where more results might prove useful is obscure searches with fewer than 100-200 hits. But if this study is true, Yahoo does a worse job on obscure searches than Google.

    The problem of course is the type of obscure searches that this study performed. Two random words out of a dictionary just isn't what your typical person conducting a search engine query is looking for.
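
One way to do the "compare the top few hundred hits" check suggested above is a plain set overlap on the first k results from each engine. This is only a sketch: the Jaccard measure is my choice, not something from the study, and the result lists below are placeholder URLs rather than real output.

```python
def top_k_overlap(results_a, results_b, k):
    """Jaccard overlap of the first k URLs from two ranked result lists."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


# Hypothetical ranked results for one query; the URLs are placeholders.
google_results = ["a.example", "b.example", "c.example", "d.example"]
yahoo_results = ["b.example", "a.example", "e.example", "f.example"]

print(top_k_overlap(google_results, yahoo_results, k=2))
print(top_k_overlap(google_results, yahoo_results, k=4))
```

The measure ignores ordering inside the top k, so two engines that return the same pages in a different order still score a full overlap.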
  • Re:Conclusion (Score:2, Interesting)

    by christor (663626) on Monday August 15, 2005 @03:19PM (#13323682)
    Instructions to build search engine with "largest number of indexed pages":

    1. Make a list of 999 sites.
    2. Set up website with a query input form.
    3. Upon query, return the entire list.

    A major problem with this study is that the number of results returned depends on two variables: (a) the number of sites in the index (so far so good) and (b) the accuracy and sensitivity of the search algorithm. The latter is the very point of a search engine. Yahoo may, who knows, be more selective in returning results.

    I'm a google fan, but these results prove nothing.
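
The three steps above fit in a few lines. Here is a sketch next to a pickier toy engine built on the very same list, to show that the result count says more about matching than about index size. The site names and both class names are invented for illustration.

```python
class ReturnEverythingEngine:
    """Toy engine from the recipe above: ignore the query, return the whole list."""

    def __init__(self, sites):
        self.sites = list(sites)

    def search(self, query):
        return list(self.sites)


class SubstringEngine:
    """Toy engine over the same list that returns only entries containing the query."""

    def __init__(self, sites):
        self.sites = list(sites)

    def search(self, query):
        return [site for site in self.sites if query in site]


# Hypothetical list of 999 sites, shared by both engines.
SITES = [f"site{n:03d}.example" for n in range(999)]

greedy = ReturnEverythingEngine(SITES)
picky = SubstringEngine(SITES)

print(len(greedy.search("site042")), len(picky.search("site042")))
```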
  • Flaws in methodology (Score:2, Interesting)

    by brokeninside (34168) on Monday August 15, 2005 @03:28PM (#13323814)
    1. Assumes that Yahoo's expansion is random. If the increase in Yahoo's pages is not random, then the results may be skewed. For example, Yahoo's expansion may have been mostly, or even entirely, in pages built of common words that all receive more than 1000 hits upon searching.

    2. As many people have stated, by using an English dictionary for its seeds, the study assumes that Yahoo's expansion has been in English. If Yahoo has expanded its database with non-English pages containing few words that overlap with English, those pages will not show up in the study.

    This study essentially determines that Google has a larger database of random, obscure English language words. Consequently, they demonstrate that Google is the superior search engine for finding obscure, random English words.

    One additional check that they could have thrown in would be how many of the pages in the links presently deliver 404 errors. That would have been far more interesting to me than how well the search engines do at finding obscure and random English words.
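
The 404 check suggested above is easy to sketch with the standard library. This version assumes a HEAD request is enough to tell a dead link from a live one (some servers reject HEAD), and it takes the status fetcher as a parameter so it can be tried offline. The URLs and statuses in the usage lines are made up.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions that carry the status code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def dead_link_fraction(urls, fetch_status=http_status):
    """Fraction of URLs whose status is 404, using the supplied fetcher."""
    urls = list(urls)
    if not urls:
        return 0.0
    dead = sum(1 for url in urls if fetch_status(url) == 404)
    return dead / len(urls)


# Offline usage with a stand-in fetcher; these URLs and statuses are made up.
FAKE_STATUSES = {
    "http://a.example/": 200,
    "http://b.example/": 404,
    "http://c.example/": 404,
    "http://d.example/": 200,
}
print(dead_link_fraction(FAKE_STATUSES, FAKE_STATUSES.get))
```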
  • Re:you are WRONG (Score:3, Interesting)

    by adpowers (153922) on Monday August 15, 2005 @04:28PM (#13324531)
    Google does use stemming, I see it all the time. The results are still different, though, because I'm sure they weight the main query higher than the stems.

    Also, you can see something similar to stemming when you search for certain acronyms. Try searching for [lotr] or [ada]. It also performs searches for the full version of the acronym, as you can see by the bold query in the snippets and title.
  • by freality (324306) on Monday August 15, 2005 @04:41PM (#13324714) Homepage Journal
    The most basic measure of performance in Information Retrieval is precision vs. recall.

    Precision is how many of the results that you return are correct. e.g. If Google returns 100 results and 10 of them are correct, then the precision on that query is 10%.

    Recall is how many of the correct results you return. e.g. If Yahoo returns 100 results out of a total 1000 correct matches, then the recall on that query is 10%.

    Information retrieval systems such as search engines balance these two metrics -- which are fundamentally at odds with each other -- to give the "best balance" in the eyes of the system's designers.

    The NCSA study basically misses the effect this decision would have on perceived size of index.

    A simple demonstration shows how it works.

    First let's say both search engines have the same index size: 10B pages. Second, let's say both search engines have exactly the same a priori capability for precision and recall, but can tune for a preferred performance. Yahoo decides it wants to favor more precise results over more results recalled, at a 2:1 relative ratio compared to Google.

    In that case, any given query will show half the hits from Yahoo as compared to Google. Concluding Yahoo's index to be half the size of Google's, given this result, would be incorrect.

    Furthermore, without knowing the precision/recall performance of either system, they can only demonstrate a lower-bound on index size, and that certainly doesn't predict average or max index size.
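
To make the argument concrete, here is a minimal sketch of the two definitions plus a toy version of the demonstration: one shared index, one scoring function, and two cutoffs. The scores, document names, and thresholds are invented; nothing here reflects how either engine actually ranks.

```python
def precision(returned, relevant):
    """Share of returned results that are correct."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0


def recall(returned, relevant):
    """Share of correct results that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0


def search(scores, threshold):
    """Return every indexed doc whose match score meets the threshold."""
    return [doc for doc, score in scores.items() if score >= threshold]


# Hypothetical shared index: the same 8 docs and match scores for both engines.
SCORES = {"d1": 0.95, "d2": 0.90, "d3": 0.80, "d4": 0.70,
          "d5": 0.60, "d6": 0.50, "d7": 0.40, "d8": 0.30}
RELEVANT = {"d1", "d2", "d3", "d5", "d7"}

loose = search(SCORES, threshold=0.35)   # tuned toward recall
strict = search(SCORES, threshold=0.75)  # tuned toward precision

for name, hits in (("loose", loose), ("strict", strict)):
    print(name, len(hits),
          round(precision(hits, RELEVANT), 2),
          round(recall(hits, RELEVANT), 2))
```

Both configurations search the same eight documents, yet one reports more than twice as many hits as the other, so the hit count by itself gives only a lower bound on index size.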
