NCSA Compares Google and Yahoo Index Numbers

Posted by ScuttleMonkey on Monday August 15, 2005 @02:11PM from the searching-for-truth dept.

chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to "over 20 billion items", compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own independent comparison of the two engines."

This discussion has been archived. No new comments can be posted.

Yahoo pants down, egg on face, no WMD either. (Score:3, Interesting)

by ackthpt ( 218170 ) * writes: on Monday August 15, 2005 @02:12PM (#13322982) Homepage Journal

So the summary is in all but 3% of the time, Yahoo finds less pages than Google and that 18 bi1110nz Mayer claimed are a number he pulled right out of his own arse.
Honestly, when I first heard the news over the weekend I thought "rubbish, they must be ignoring requests for spiders to go no further or something." I guess NCSA can either 1) Expect no gifts from Yahoo OR 2) Report significantly different results after a sizable gift to NCSA.
75% less truth than other leading brand

- Yahoo returns dupes... (Score:3, Insightful)
  
  by Marnhinn ( 310256 ) writes:
  
  Yahoo returns a lot of dupes.
  
  They may have more unique information simply further down the result list, but since the search engines terminate the results at not quite 1k (1,000), the researchers have no way of testing that out.
  
  All they can really show is that google returns more unique results per 1000 (which usually means that more items are indexed, but could be from Google's Pagerank also)...
  - Re:Yahoo returns dupes... (Score:5, Funny)
    
    by Anonymous Coward writes: on Monday August 15, 2005 @02:23PM (#13323089)
    
    Yahoo returns a lot of dupes.
    
    If that's the case, then why is Google the darling of slashdot? ;)
    
  - Re:Yahoo returns dupes... (Score:3, Insightful)
    
    by NickFortune ( 613926 ) writes:
    
    Yahoo returns a lot of dupes.
    Interestingly however, for the search results analysed, google performed noticeably better whether dupes were included or discarded.
    They may have more unique information simply further down the result list, but since the search engines terminate the results at not quite 1k (1,000), the researchers have no way of testing that out.
    That isn't actually what they did. They only analysed results that scored less than 1000 results on both google and yahoo. If either engine sco
    - - Re:Yahoo returns dupes... (Score:3, Insightful)
        
        by NickFortune ( 613926 ) writes:
        
        Could be that Google's "smaller" index is searched by a less picky search tool that gives more results because it doesn't successfully eliminate as many useless pages.
        Could be. And it could be that Google's results are both more numerous and of better quality. The tests did not, as you quite rightly point out, consider the relevance of the results. As is proper, the researchers make no claims regarding relevance.
        On the other hand, their findings do cast doubt upon Yahoo's claims regarding index si
        
        Re:Yahoo returns dupes... (Score:4, Insightful)
        
        by NickFortune ( 613926 ) writes: on Monday August 15, 2005 @06:49PM (#13325932) Homepage Journal
        
        On the other hand, their findings do cast doubt upon Yahoo's claims regarding index size.
        These findings don't do anything of the sort. In fact, Google could have only 999 pages in index, and if it returned all 999 for every query it would have won this test. There's too many assumptions here for the results to be useful.
        
        'Scuse me: I said "cast doubt upon" not "conclusively disproved".
        If Yahoo's indices are, as they claim, more than twice the size of Google's, then we might reasonably expect them to return more hits for an arbitrary query. That they do not do so suggests that Yahoo may well be telling fibs.
        Yes, there are other explanations, like for example, Google deliberately falsifying all sub 1000 hit queries, as you point out. However, one likely, arguably the most likely explanation is that Yahoo is being a bit sparing with the truth in its press releases.
        Hence "cast doubt upon".
        
        
        Re:Yahoo returns dupes... (Score:3, Insightful)
        
        by sam1am ( 753369 ) writes:
        
        As soon as I read this, I had the following thought...
        One interesting statistic would be the number of searches for which Google had over 1,000 results, compared to the number of searches for which Yahoo had more than 1,000 results.
        If Yahoo caused 80% of the "over-popular result" discards, well, I'd say that would be highly relevant.
        But then I read footnote [3]:
        [3] In a small number of cases, one search engine (almost always Google) will return results over 1,000 while the other search engine will
- Re:Yahoo pants down, egg on face, no WMD either. (Score:4, Interesting)
  
  by Iriel ( 810009 ) writes: on Monday August 15, 2005 @02:24PM (#13323109) Homepage
  
  I think it is possible that Yahoo! has more items indexed than Google. It may not be true after all, but one has to give thought to the fact that Yahoo can search subscription based content. That has got to boost their numbers considerably beyond the range of queries that typically return less than one thousand results. It's possible that Yahoo! could have simply been fudging the numbers to get some press now that they're actually starting to get noticed again. I can't make a certain conjecture in either direction, but don't totally discredit Yahoo! without looking into everything.
  
  - Re:Yahoo pants down, egg on face, no WMD either. (Score:3, Insightful)
    
    by dustmite ( 667870 ) writes:
    
    I think it is possible that Yahoo! has more items indexed than Google ... Yahoo can search subscription based content. That has got to boost their numbers considerably beyond the range of queries that typically return less than one thousand results
    If we assume that Yahoo has offered subscription-based content searching for about two years (not sure of the exact length of time), then to get even close to the difference they are citing here in their marketing (over 11 billion more items), they would have to
- Re:Yahoo pants down, egg on face, no WMD either. (Score:2, Insightful)
  
  by Sandor at the Zoo ( 98013 ) writes:
  
  Yeah, this "study" seems to be something whipped together over a weekend. Particularly:
  Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google.
  So, anything popular gets tossed. What if Yahoo! indexes all the pages with popular search terms, but Google only indexes the first 1,000? I doubt very much that it's the case, but this whole approach seems suspect at best.
  They threw out what is probably a huge
- not so fast (Score:2, Interesting)
  
  by betsywetsy ( 12592 ) writes:
  
  Looking at the first item in their result log, I'm unimpressed.
  Yahoo returns 0 results, and Google returns... 4 different links to the ispell dictionary (or variants thereof).
  ('carbolization clambers')
- Re:Yahoo pants down, egg on face, no WMD either. (Score:5, Insightful)
  
  by loose_cannon_gamer ( 857933 ) writes: on Monday August 15, 2005 @02:54PM (#13323399)
  
  After reading half the comments on this page, I'm amused at how many alert readers are making the same mistake that they accuse Yahoo of -- misstating results.
  Can we conclude from this study that Google has a bigger index than Yahoo? No. Can we conclude that when you pick two English words that when entered into both Google and Yahoo, both return less than 1000 results, that Google has consistently more results? Yes.
  The real question is, what can we infer from the actual indisputable findings of this study? I find no ready method of generalization. If you are inclined to believe google is better, you feel happy inside. If you think yahoo is better, you have many options to dispute the idea that the study result generalizes to search engine index size.
  As a google fan, I enjoy the warm fuzzies, but I don't see that much to get excited about either way.
  
  - Re:Yahoo pants down, egg on face, no WMD either. (Score:3, Insightful)
    
    by Iriel ( 810009 ) writes:
    
    I agree on that. Based on the methods used to test a general index size, I think it leaves a lot of holes. When you're talking about millions of items, a generalization can be woefully inaccurate.
    
    Rather than talking about indexed content, it seems like this test is actually more appropriate to use as some sort of analysis on the overall usefulness of the search engines. Even then, though, the results could be skewed to say that it's better to provide a wealth of pages (Google) or to have fine tuned and nar
Accurate results? (Score:5, Interesting)

by bigwavejas ( 678602 ) * writes: on Monday August 15, 2005 @02:12PM (#13322984) Journal

Google sometimes returns some pretty interesting/ entertaining results.
Try searching for the word, "failure" in Google and check the results.
This brings into question *accurate* results. In this case it appears that's left to interpretation.

- What would you want them to return? (Score:2)
  
  by brunes69 ( 86786 ) writes:
  
  All Google does is index the web. In this case, it seems like there are more web pages/more highly linked pages about GW being a failure than anyone else.
  
  Is this that hard to believe? What would you rather it return for such a query? A dictionary definition? If you want a dictionary definition, use the define: operator.
  
  Trust me - GW will not be on the top of the failure list forever. In another few years we will have a new most-hated person. This is the nature of a real web index, because it is the nature of
  - Re:What would you want them to return? (Score:2)
    
    by bigwavejas ( 678602 ) * writes:
    
    I have no opinion on it actually. I just found it interesting Google displayed GW as result 1 and Yahoo! as result 4. Obviously there's two different search methodologies used here.
    - Re:What would you want them to return? (Score:5, Insightful)
      
      by Intron ( 870560 ) writes: on Monday August 15, 2005 @02:29PM (#13323171)
      
      The top of the page return for Yahoo is
      
      "Failure on eBay Find failure items at low prices. "
      
      which illustrates the most important difference between Yahoo and Google.
      
  - Re:What would you want them to return? (Score:2)
    
    by robertjw ( 728654 ) writes:
    
    What would you rather it return for such a query?
    
    Something related to the word "failure". Did a search of the pages for GW, Jimmy Carter and Michael Moore that came up for a search word "failure". That word isn't even in any of those three pages. Seems to me there is something wrong when a search term lists pages that don't even have the actual word in it.
- Re:Accurate results? (Score:2, Funny)
  
  by DroopyStonx ( 683090 ) writes:
  
  Um, GW Bush is the first result.
  
  Seems fairly accurate to me...
  - Re:Accurate results? (Score:5, Insightful)
    
    by jrallison ( 857135 ) writes: on Monday August 15, 2005 @02:28PM (#13323160) Homepage
    
    It is odd however the #1 result for failure is a webpage without the word "failure" in it.
    
    - Re:Accurate results? (Score:2)
      
      by G27 Radio ( 78394 ) writes:
      
      I believe this came about from Michael Moore (and others) creating links that say miserable failure [whitehouse.gov] and link to GWB's biography.
      
      If you'll notice, Michael Moore's site shows up next on this list despite the fact that it no longer contains the "miserable failure" link.
      
      In short, if you create a link to a site, the words in your link will also be associated with that site.
    - Re:Accurate results? (Score:2)
      
      by gstoddart ( 321705 ) writes:
      
      It is odd however the #1 result for failure is a webpage without the word "failure" in it.
      
      Since Google is bringing up pages with an eye to how many other people link to you, clearly a lot of people have used 'failure' in reference to GW Bush.
  - Re:Accurate results? (Score:2, Interesting)
    
    by ArsonSmith ( 13997 ) writes:
    
    Hmm, bumbling idiot possibly, but since when has becoming the President of the US, then being elected again been the mark of a failure????
  - Re:Accurate results? (Score:2)
    
    by Krach42 ( 227798 ) writes:
    
    Well, Michael Moore is second place.
    
    So, I'd say that it's at least fair and non-partisan.
- Re:Accurate results? (Score:2)
  
  by ArsonSmith ( 13997 ) writes:
  
  That's funny I get POOP, DICK, VIGINAS, POOP, DICK, VAGINAS. Wonder why that would be a failure? You're right though it is kinda funny. Say that like 5 times out load at work.
- Re:Accurate results? (Score:5, Insightful)
  
  by MindStalker ( 22827 ) writes: <mindstalker&gmail,com> on Monday August 15, 2005 @02:30PM (#13323192) Journal
  
  Well, Google also indexes based on referring links and not just the content of the page itself. So if many websites refer to GW as a failure, GW's page itself will turn up as a high hit. Yahoo does this as well, but doesn't necessarily give it the same weight. This could strongly affect the number of results returned: if Google returns X pages for a search on term "y", many of those pages may not actually mention "y", which inflates the page count for "y". Yahoo's method, by contrast, will mainly return pages that mention "y" themselves, and possibly add some pages that links describe as containing "y". This can vastly alter the count.
  
- Re:Accurate results? (Score:2)
  
  by l3v1 ( 787564 ) writes:
  
  Try searching for the word, "failure" in Google and check the results.
  
  You can't honestly think that someone sane enough would use any kind of text-indexing database search engine for making a query like "query". That would render the whole concept of rdbms and some dozen years of cbir research instantly useless, since you would need to filter out all the relevant [relevant for you, that is] information all by yourself from the vast amounts of useless crap that a response for a query like "failure" would
- Don't even contain the search term (Score:3, Insightful)
  
  by Midnight Thunder ( 17205 ) writes:
  
  The interesting thing is that the top three results make no reference to the word failure. Of course it is probably based on pages linking to these three, but I wonder if they should even be included for the lack of the search term?
- Re:Accurate results? Bad example (Score:2)
  
  by SirSlud ( 67381 ) writes:
  
  Terrible example. Search for "http" ... MUCH more interesting. They don't even strip "http://" off the URLs when they do their scoring!
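
The anchor-text effect that G27 Radio and MindStalker describe above can be illustrated with a toy index. This is only a sketch: the page URLs, page text, and the `build_index` helper are all invented for illustration, and it makes no claim about how Google or Yahoo actually weight links. It shows how a page can match a term that never appears in its own text once the words of inbound links are indexed against it.

```python
from collections import defaultdict

# Hypothetical pages and one hypothetical link (source, target, anchor text).
pages = {
    "bio.example/president": "biography of the president",
    "blog.example/post": "my opinion on politics",
}
links = [("blog.example/post", "bio.example/president", "miserable failure")]

def build_index(pages, links, use_anchor_text):
    """Map each word to the set of URLs it is associated with."""
    index = defaultdict(set)
    for url, body in pages.items():
        for word in body.lower().split():
            index[word].add(url)
    if use_anchor_text:
        # Words in a link are credited to the page the link points at.
        for _source, target, anchor in links:
            for word in anchor.lower().split():
                index[word].add(target)
    return index

with_anchor = build_index(pages, links, use_anchor_text=True)
body_only = build_index(pages, links, use_anchor_text=False)

print(sorted(with_anchor["failure"]))  # the biography page matches via the link
print(sorted(body_only["failure"]))    # no page matches on body text alone
```

Under these assumptions the same collection of pages yields different hit counts for the same term, which is the reason result counts alone say little about index size.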
Conclusion (Score:3, Informative)

by mboverload ( 657893 ) writes: on Monday August 15, 2005 @02:13PM (#13322989) Journal

"Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results. It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Googles index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned less results than Google. "

- Re:Conclusion (Score:5, Insightful)
  
  by nutshell42 ( 557890 ) writes: on Monday August 15, 2005 @02:26PM (#13323132) Journal
  
  And Nutshell42's New Amazing Search Engine gives you even more results. Even though my index size is only 1.something million. I simply return every single wikipedia article in every language as result no matter what you search.
  Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous. Only a thorough study comparing results and how useful they were (which is hard to do, expensive and time consuming) has any meaning that goes beyond producing lots of funny numbers and percentages.
  96.34% of all percentages are completely useless.
  btw. I use google, not yahoo
  
  - Re:Conclusion (Score:3, Insightful)
    
    by rossifer ( 581396 ) writes:
    
    Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous.
    
    No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.
    
    What you're talking about is measuring the effectiveness of page ranking, which is a completely different measure of how good a search e
    - Re:Conclusion (Score:2)
      
      by Retric ( 704075 ) writes:
      
      No, google will add pages that don't include the word you searched for. Thus you can't assume that the page is not in yahoo's index because they did not return it.
      
      EX: Search for "Failure" on google and you get linked to a page that never uses that word. Granted a Biography of President George W. Bush might fit the search criteria but it might not be returned by all search engines even if it was in their database.
    - Re:Conclusion (Score:5, Insightful)
      
      by barawn ( 25691 ) writes: on Monday August 15, 2005 @02:55PM (#13323415) Homepage
      
      No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.
      
      Actually, it might not be, thanks to their methodology.
      
      They only used searches with less than 1000 results. They therefore got a lot of searches with small results numbers (because they were searching for bizarre word combinations, like "promotion bedabble"). The total number of results was something like 500,000 or so (order of magnitude) for 10,000 searches. That's an average of 50 results/search, and I'd bet there's a large, large tail, so the most common search is probably something like 10 results.
      
      The problem with this is that in their word list, the same sites are being returned over and over! For instance, sites containing dictionary lists appear in both "promotion bedabble" and "foliolate defecations" because, duh, that's the only place they'll appear. Since they're just searching the same type of site over and over, they get the same result magnified a lot: Google has more "dictionary lists" in its index than Yahoo. Most of the "dictionary list" word searches returned about 10-20 for Google, and few, if any, for Yahoo.
      
      It's a pretty serious flaw in the methodology, as far as I can tell - they're double counting huge numbers of results, and so they're not really getting a good statistical sample of the index.
      
      - Re:Conclusion (Score:3, Insightful)
        
        by barawn ( 25691 ) writes:
        
        Correction to myself: the total responses to their list was ~150,000 to ~10,000 searches for Yahoo, and ~400,000 for Google. So the average is 15 results for Yahoo and 40 for Google. Given that most "dictionary list" results were between 10 and 40, that should pretty much tell you that their entire result is just a massively multiplied reflection of those searches.
        
        As an interesting aside, though: if you dig through their log, you can see several interesting things. If you look at only results which return
- Re:Conclusion (Score:2)
  
  by sysadmn ( 29788 ) writes:
  
  To be pedantic [google.com], and I am, shouldn't it say that "Yahoo!'s search engine consistently returned fewer results than Google"?
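
barawn's double-counting objection earlier in this thread can be made concrete with a small sketch. The query strings come from the comments, but the URLs and result lists are made up for illustration and are not taken from the study's logs. The point is simply to compare the raw sum of per-query result counts with the number of distinct URLs behind them.

```python
# Hypothetical result lists keyed by query; the URLs are invented.
results_by_query = {
    "promotion bedabble": [
        "http://a.example/words.txt",
        "http://b.example/ispell.txt",
    ],
    "foliolate defecations": [
        "http://a.example/words.txt",
        "http://b.example/ispell.txt",
        "http://c.example/dict.txt",
    ],
    "inabilities hydrocephalic": [
        "http://a.example/words.txt",
        "http://c.example/dict.txt",
    ],
}

def raw_total(results):
    """Sum of per-query hit counts, the way a naive tally would count."""
    return sum(len(urls) for urls in results.values())

def unique_total(results):
    """Number of distinct URLs across all queries."""
    return len({url for urls in results.values() for url in urls})

print("raw hits:", raw_total(results_by_query))
print("distinct pages:", unique_total(results_by_query))
```

If the same few word-list pages answer most queries, the raw total grows with the number of queries while the distinct count barely moves, so the raw total mostly measures how many such pages each engine holds.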
They might have a larger index file (Score:4, Insightful)

by BlackCobra43 ( 596714 ) writes: on Monday August 15, 2005 @02:14PM (#13323006)

but they can't sift through it nearly as well as Google, so what does it matter? Even if you have a bigger dictionary, if you can't speak English at all it won't do you much good.

- Re:They might have a larger index file (Score:2)
  
  by babyrat ( 314371 ) writes:
  
  Even if you have a bigger dictionary, if you can't speak English at all it won't do you much good.
  
  What if the dictionary is a German dictionary and you speak German?
Flawed conclusion? (Score:5, Insightful)

by Prong_Thunder ( 572889 ) writes: on Monday August 15, 2005 @02:16PM (#13323025)

Sorry, but if Google consistently returns more results, it could just as easily mean that the filtering isn't as good.

I still prefer Google though.

- Re:Flawed conclusion? (Score:5, Insightful)
  
  by Ossifer ( 703813 ) writes: on Monday August 15, 2005 @02:24PM (#13323103)
  
  Exactly! I find the conclusions of the research to be quite specious. Yahoo may simply have tighter controls of what is considered a match, which, by the way, is no simple algorithm.
  
  In any case, I am usually not so interested in the numbers of matches, but in the quality of the list returned--hopefully one website will have exactly what I need...
  
  - Re:Flawed conclusion? (Score:2, Insightful)
    
    by Lewisham ( 239493 ) writes:
    
    Agreed, whoever conducted this "research" is pretty idiotic. The pages returned != pages available.
    
    This isn't worthy of the NCSA, or indeed any university, to be shown in any public format with any conclusions at *all*. You'd be laughed out of the conference hall if you presented this.
  - Re:Flawed conclusion? (Score:2)
    
    by Monkeyman334 ( 205694 ) writes:
    
    Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine.
    
    Read that and tell me where they conclude that Google returns better results. You people need to actually read the conclusion, kthnx.
- Interesting but... (Score:3, Insightful)
  
  by kf6auf ( 719514 ) writes:
  
  While it is true that more results could mean worse filtering, that is a separate test entirely.
  I tend to think that ordering is more important than filtering down to a small number of results, since having lots of results returned doesn't hurt if the search engine can order well so that what you want is most likely to be in the top 10-25. This is especially true when there will be at most a couple of results where I'd rather have the search engine try at the ordering and have me do most of the filtering
- Not really (Score:2)
  
  by Mr. Underbridge ( 666784 ) writes:
  
  In fact, all results that match a query are returned; it's the ranking that matters. Google is also more rigorous about excluding apparent duplicate results, and doesn't count those in the stats.
- Re:Flawed conclusion? (Score:2)
  
  by OpenYourEyes ( 563714 ) writes:
  
  An interesting solution to this problem could be to extend the test. For those pages that turn up in one results and not the other, query the other one for that exact page to see if it has it.
- Re:Flawed conclusion? (Score:2)
  
  by jhoger ( 519683 ) writes:
  
  I think they could fix this problem by discarding result URLs which do not actually have the searched for term.
  
  They aren't trying to infer how high quality the set of results is, just the relative proportion of sites indexed by either engine, so I think this would be a good solution.
  
  -- John.
- Or the other way around (Score:2)
  
  by jetkust ( 596906 ) writes:
  
  Maybe Yahoo indexes more useless pages than google does.
- Re:Flawed conclusion? (Score:3, Insightful)
  
  by barawn ( 25691 ) writes:
  
  Or it could mean that Google has more Ispell lists in its index.
  
  Which appears to be the case.
  
  A search for "inabilities hydrocephalic" returns almost all dictionary lists in Google, except 2. There's only 2 results in Yahoo, one of which is a dictionary list (or equivalent).
  
  But the official results for this? 16 for Google, 2 for Yahoo.
  
  The reason this is a problem is because almost every search returns the same dictionary lists, so it amounts to double (or probably around 5000-fold) weighting of those sites i
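
Two fixes proposed in this thread, jhoger's (discard result URLs whose page does not contain the search terms) and OpenYourEyes' (for pages one engine returns and the other does not, ask the other engine for that exact page), could be sketched roughly as follows. Everything here is hypothetical: the helper names, the `fetch_text` callable, and the toy pages are assumptions, no real search API is called, and the word matching is a naive whitespace split.

```python
def contains_all_terms(page_text, query):
    """True if every query word appears as a whole word in the page text."""
    words = set(page_text.lower().split())
    return all(term in words for term in query.lower().split())

def filter_hits(hits, query, fetch_text):
    """jhoger's idea: keep only hits whose text contains the searched-for terms.

    fetch_text is a hypothetical callable mapping a URL to its page text.
    """
    return [url for url in hits if contains_all_terms(fetch_text(url), query)]

def missing_from_other(hits_a, hits_b):
    """OpenYourEyes' idea, first step: URLs engine A returned that engine B did not.

    Each of these would then be looked up individually on engine B.
    """
    return sorted(set(hits_a) - set(hits_b))

# Toy data standing in for fetched pages.
pages = {
    "http://a.example/words.txt": "inabilities hydrocephalic zebra",
    "http://b.example/bio.html": "a biography with neither term",
}
hits = list(pages)
kept = filter_hits(hits, "inabilities hydrocephalic", lambda url: pages.get(url, ""))
print(kept)
print(missing_from_other(["x", "y", "z"], ["y"]))
```

Neither step measures relevance; both only try to make the two engines' counts comparable.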
The results (Score:5, Interesting)

by Swamii ( 594522 ) writes: on Monday August 15, 2005 @02:17PM (#13323034) Homepage

For those that don't want to read the flippin' article:

Based on this random sample, we found that on average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly less.

In other words, they believe Google indexes more items based on their own tests of searching.
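
As a quick sanity check on the figures quoted from the study in this discussion (10,012 test cases split 307 / 9,676 / 29, "166.9% more results", and "37.4% of the results"), the arithmetic can be reproduced as below. The one assumption, labeled in the code, is that "166.9% more" and "37.4%" describe the same average ratio; the study may have computed them differently.

```python
# Figures as quoted from the study in this discussion.
TOTAL_CASES = 10_012
YAHOO_MORE = 307
GOOGLE_MORE = 9_676
TIES = 29
GOOGLE_EXTRA_PCT = 166.9  # "166.9% more results" with Google

def share(count, total=TOTAL_CASES):
    """Percentage of test cases, rounded to two decimals."""
    return round(100 * count / total, 2)

def yahoo_fraction_of_google(extra_pct):
    """Assumption: if Google returns extra_pct% more than Yahoo,
    Yahoo returns 1 / (1 + extra_pct/100) of what Google does."""
    return 1 / (1 + extra_pct / 100)

print(YAHOO_MORE + GOOGLE_MORE + TIES == TOTAL_CASES)
print(share(YAHOO_MORE), share(GOOGLE_MORE), share(TIES))
print(round(100 * yahoo_fraction_of_google(GOOGLE_EXTRA_PCT), 2))
```

The three case counts do add up to the total, and under the stated assumption the ratio works out to a little under 37.5%, close to the 37.4% quoted, so the two headline numbers are at least consistent with each other.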

- Re:The results (Score:3, Insightful)
  
  by mi ( 197448 ) writes:
  
  Based on this random sample, we found that on average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly less.
  Informative. But do they also explain why this (Google's results) is a good thing? From my experience, Google's results beyond the second page are never useful, so they may as well not be there at all.
  I don't see how NCSA's findings can prove or disprove Yahoo's earlier claims.
English Language (Score:4, Insightful)

by morcheeba ( 260908 ) * writes: on Monday August 15, 2005 @02:18PM (#13323038) Journal

They only used words from the English Ispell word list. Besides the English-language bias, this is probably limited in other ways. News websites use a limited vocabulary, but a lot of proper names -- so if one engine indexed these better, they wouldn't necessarily get a better rating. News sites are also very dynamic and have a large number of webpages, so they would be influential in the count.

- Proper name samples (Score:5, Interesting)
  
  by jkauzlar ( 596349 ) * writes: on Monday August 15, 2005 @02:39PM (#13323272) Homepage
  
  Let's try a few samples of proper names:
  Search: Valerie Plame
  Google: 908,000
  Yahoo: 2,580,000
  Search: "Boulder, Colorado"
  Google: 1,600,000
  Yahoo: 5,880,000
  Search: "Linus Torvalds"
  Google: 2,560,000
  Yahoo: 5,870,000
  I assume it goes on like this. Of course these exceed the 1000 maximum hit limit given in the study.
  
  - Re:Proper name samples (Score:3, Interesting)
    
    by jkauzlar ( 596349 ) * writes:
    
    Okay, here are some unlikely proper names which stay well within the 1000 maximum hit limit:
    Search: "Dirk Bradford"
    Google: 11
    Yahoo: 15
    Search: "Ronald Hendrickson"
    Google: 170
    Yahoo: 418
    Search: "centerville baptist church" iowa
    Google: 43
    Yahoo: 37
    Well that's less certain. It's hard finding words that return over zero but less than a thousand results...
  - Re:Proper name samples (Score:3, Informative)
    
    by Zapdos ( 70654 ) writes:
    
    From the Article:
    However, in the case of Yahoo! the actual number of search results returned is only one-fifth the estimated total.
  - Those are estimates (Score:5, Insightful)
    
    by mcc ( 14761 ) writes: <amcclure@purdue.edu> on Monday August 15, 2005 @03:35PM (#13323906) Homepage
    
    Of course the study also demonstrates that on the searched terms, Yahoo's estimate numbers vastly overestimated the number of available results they actually found. So if the pages from the study are even close to representative in that regard then this would make the numbers you quote utterly meaningless.
    
    Which is the entire reason, of course, why they kept the limits under 1,000 in the first place-- that for any number over 1,000, if the search engine says, say, "I found 2.5 million results for 'Valerie Plame'", you have no way to tell whether it's telling the truth or not.
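
To see how much the estimate problem could matter, here is a rough what-if using the counts jkauzlar posted and the article's remark (quoted by Zapdos) that Yahoo's actual results were only one-fifth of its estimated total. Applying that one-fifth factor uniformly to million-scale estimates is purely an assumption for illustration; the study only observed it on queries under 1,000 results, and Google's estimates are left unadjusted here even though they may be off as well.

```python
# (Google estimate, Yahoo estimate) as posted in this thread.
ESTIMATES = {
    "Valerie Plame": (908_000, 2_580_000),
    '"Boulder, Colorado"': (1_600_000, 5_880_000),
    '"Linus Torvalds"': (2_560_000, 5_870_000),
}

# Assumption: the "one-fifth" overestimate applies uniformly to every query.
YAHOO_OVERESTIMATE_FACTOR = 5

def scaled_yahoo(estimates, factor=YAHOO_OVERESTIMATE_FACTOR):
    """Yahoo estimates divided by the assumed overestimate factor."""
    return {query: yahoo // factor for query, (_google, yahoo) in estimates.items()}

scaled = scaled_yahoo(ESTIMATES)
for query, (google, _yahoo) in ESTIMATES.items():
    print(query, "Google:", google, "Yahoo scaled:", scaled[query])
```

Under that assumption the ordering flips for all three queries, which supports the point that the raw estimates cannot settle the question either way.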
    
Hrmm (Score:3, Interesting)

by T3kno ( 51315 ) writes: on Monday August 15, 2005 @02:18PM (#13323041) Homepage

Why wget instead of LWP?

- Re:Hrmm (Score:2)
  
  by glwtta ( 532858 ) writes:
  
  Why not? Often wget is faster to set up, since it already has a whole lot of functionality rolled in that you'd have to do by hand with LWP.
- Re:Hrmm (Score:2)
  
  by molarmass192 ( 608071 ) writes:
  
  I don't know if this post is serious but it's probably because wget works standalone and provides a heck of a lot of functionality out of the box without coding anything. I'm not a big perl fan but I do think cpan is one impressive collection of work, I wish other prog langs would follow that example.
Queries with 1,000 results (Score:4, Interesting)

by Whafro ( 193881 ) writes: on Monday August 15, 2005 @02:19PM (#13323053) Homepage

TFA notes that queries with greater than 1,000 results were dropped from the survey, because Google and Yahoo both truncate their results to 1,000.

That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.

Those could be the larger sites, where Yahoo is perhaps digging deeper, requesting data from forms, ignoring robots.txt, etc. It could be where they're getting those big claimed numbers of indexed documents.
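
The sampling rule under discussion (keep a query only if it returned fewer than 1,000 results on both engines) is simple to state in code. This is a minimal sketch, not the study's actual script: the `usable` helper is hypothetical, and the sample counts are ones mentioned elsewhere in this discussion (the "Valerie Plame" figures being the engines' own estimates).

```python
RESULT_CAP = 1000  # both engines truncate result lists at roughly this point

def usable(google_count, yahoo_count, cap=RESULT_CAP):
    """A query is kept only if both engines report fewer than `cap` results."""
    return google_count < cap and yahoo_count < cap

# (query, Google count, Yahoo count) as quoted in this discussion.
samples = [
    ("carbolization clambers", 4, 0),
    ("inabilities hydrocephalic", 16, 2),
    ("Valerie Plame", 908_000, 2_580_000),
]

kept = [query for query, google, yahoo in samples if usable(google, yahoo)]
dropped = [query for query, google, yahoo in samples if not usable(google, yahoo)]
print("kept:", kept)
print("dropped:", dropped)
```

Anything popular lands in `dropped`, so whatever either engine indexes for common queries never enters the comparison.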

- Re:Queries with 1,000 results (Score:3, Insightful)
  
  by barawn ( 25691 ) writes:
  
  That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.
  
  Well, there's a worse bias. They're grabbing words from an Ispell word list.
  
  There are websites which contain the Ispell word list. There appear to be more of those returned in Google as results than in Yahoo. (here [nd.edu] is one returned in Google for "apprizers expense", but which is not returned in Yahoo.)
  
  This basic
- - Re:Queries with 1,000 results (Score:2)
    
    by Whafro ( 193881 ) writes:
    
    That's irrelevant in this case, certainly. This wasn't a judgment of what the best search engine is, but instead which search engine had more results. This was strictly quantity, and not quality.
This is what matters... (Score:2)

by Rolan ( 20257 ) * writes:

This boils down to the real numbers that matter. It doesn't really matter if your index is "bigger" or not, it is about the results that are returned. The other thing that matters (and can't really be measured in a scientific manner) is relevance. It's easy to return results for a set of words, it is hard to return relevant results for a set of words. My personal experience is that Google returns more relevant and better ordered results than Yahoo!.
- Exactly (Score:2)
  
  by mopslik ( 688435 ) writes:
  
  If $SEARCH_ENGINE returns 1,000,000 results, and assuming I can sift through each result at an astonishing rate of 1 per second, it will take me 1,000,000/(60*60) = 278 hours, or 11 1/2 days to wade through the junk.
  
  The number of results is largely irrelevant. Give me quality filtering instead. Fortunately, Google does that for the most part.
The ultimate test (Score:2)

by kevin_conaway ( 585204 ) writes:

To me, the test is googling myself and seeing what comes back. Google seems to favor mailing lists high in its results so all the stupid things I've said over the years are right up there on front. Of course, I think Google is more accurate because things actually attributed to me show up higher in the results, but is that actually correct? I don't know.
- Re:The ultimate test (Score:2)
  
  by Vegeta99 ( 219501 ) writes:
  
  Ha! Yeah. According to Google, anyway, Plug N' Play is satan, and I really despised MP3 players (in favor of MD players).
  - Re:The ultimate test (Score:2)
    
    by jandrese ( 485 ) * writes:
    
    If it's any consolation, I hated those early MP3 players too. I mean what's not to like about 8MB of fixed non-upgradable storage on your music player? Especially when 2.5MB of that is taken up by the OS.
    
    On the other hand I've always hated MD players. Closed proprietary formats suck.
Quality not quantity (Score:2)

by ngunton ( 460215 ) writes:

Surely it's the quality of the results that counts, rather than the quantity? Who needs 1,000,000 matches anyway, when most people don't go past the first page or two of the results? The article doesn't talk at all about how relevant the matches were. I'm not saying that it invalidates their study, but I would say that any search engine that returns millions of hits for any query is simply showing off. Give me a search engine that shows me fewer matches, but the best hits anyday. Lately, Google has increasi
Perl Code (Score:4, Funny)

by hayro ( 854797 ) writes: on Monday August 15, 2005 @02:24PM (#13323107)

I don't know about the study but that is the most readable perl code I have seen in a long time.

- - Re:Perl Code (Score:3, Insightful)
    
    by pfafrich ( 647460 ) writes:
    Readable code because:
    
    Well laid out and indented
    
    Long and meaningful function and variable names
    
    Good logical structure, no fancy tricks
    
    It looks like C!
interesting but inconclusive (Score:2)

by it0 ( 567968 ) writes:

It's a nice test but I fail to see how they can extrapolate this to be true for all searches.

Don't forget that a lot of queries also get hand-tuned at Google/Yahoo to give the proper result set.

Also keep in mind that size doesn't matter but relevancy does!

And they both cheat at that as well: they just give back the highly ranked pages for those words. Works OK for a lot of people, but it's hardly relevant.
Study has poor assumption (Score:2, Insightful)

by Anonymous Coward writes:

The study noted that although Yahoo says they have ~twice as many pages indexed as Google, when they queried each engine with two arbitrary words from the dictionary, they got fewer responses from Yahoo.
From this they concluded yahoo's claim of twice as many pages is suspicious.

What's suspicious is that these people consider themselves scientific. What if, for example, Yahoo just returns meaningful results, whereas google returns anything with those words in? For example, what if you search for "
Methodology (Score:5, Insightful)

by enjo13 ( 444114 ) writes: on Monday August 15, 2005 @02:26PM (#13323125) Homepage

The very methodology used in this case seems rather incorrect to me.

The assumption (as stated in the paper): Since Yahoo claims to have indexed twice as much as google, searches should return twice as many entries.

That assumption is flat out incorrect. There are actually multiple problems.

First, the scope of the search (based on index terms) is really up to the search engine itself. Since each search engine does not return the entire database as search results, it is very much up to the individual search algorithm to determine the depth of entries considered to 'match' a set of terms. That's what is really being reflected in these results.. it is not the overall size of the index, but simply how aggressive the search algorithm is in matching terms to entries.

Even if the algorithms were identical (same algorithm being run across both indexes), the nature of search does not scale in that way. If Yahoo has, for instance, become more aggressive in indexing message board and forum content, then only searches that play to those subjects should return more results than Google. Since searches are by definition narrowing on a data set, a methodology needs to be developed that more effectively tests the BREADTH of the results more than simply testing the depth.

International Listings (Score:5, Insightful)

by Dominatus ( 796241 ) writes: on Monday August 15, 2005 @02:27PM (#13323138)

The study only checked English words. Is it possible that the increase came from Yahoo expanding into more international website markets?

Just a thought
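
A toy illustration of this point, as a minimal Python sketch. The pages and both index lists are invented for illustration and are not taken from the study; the matching rule (a page matches when it contains every query word) is also an assumption. It only shows that doubling an index with non-English pages can leave hit counts for English two-word queries unchanged.

```python
def count_hits(index, query_words):
    """Count pages that contain every word of the query (simple AND matching)."""
    return sum(1 for page in index if all(w in page for w in query_words))


# Hypothetical toy pages, each represented as a set of words.
english_pages = [
    {"search", "engine", "index"},
    {"index", "size", "claim"},
    {"search", "results", "size"},
]

# Hypothetical non-English pages that an engine might add to its index.
other_pages = [
    {"moteur", "recherche", "taille"},
    {"suchmaschine", "groesse", "seiten"},
    {"buscador", "tamano", "paginas"},
]

small_index = english_pages
big_index = english_pages + other_pages  # twice as many pages

query = ["index", "size"]
small_hits = count_hits(small_index, query)
big_hits = count_hits(big_index, query)

print(len(small_index), len(big_index), small_hits, big_hits)
```

In this toy setup the larger index returns exactly as many hits as the smaller one, so an English-only sample could not tell the two apart.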

can we trust the methodology (Score:2)

by GabrielF ( 636907 ) writes:

Basically NCSA's method assumes that if a search engine indexes twice the number of pages, then it will return twice the number of results for a given search. However, in order for this to be the case, the 10 billion+ more pages that Yahoo indexes would have to be roughly equivalent to the pages that Google indexes. If Yahoo is indexing 20 billion pages, but ten billion of those are in Mandarin, then searching for random combinations of English words (which NCSA is doing) won't tell us which search engine i
This is what passes for CS research nowadays? (Score:5, Insightful)

by adrizk ( 137574 ) writes: on Monday August 15, 2005 @02:28PM (#13323151)

Seriously. 'We wrote a script and here are the results'? This would take an average PERL programmer what -- 30 minutes of work? Has academic research in computing really sunk to this level?

Maybe it's not even worth pointing out how badly flawed (and lazy) the underlying assumption of 'twice the results = twice the index size' probably is, as I'm sure we're going to see a few dozen posts to that effect (unless PageRank really means nothing), but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.

- Re:This is what passes for CS research nowadays? (Score:2)
  
  by DogDude ( 805747 ) writes:
  
  Has academic research in computing really sunk to this level?
  
  Considering that most people call a "fact" something that they found on Wikipedia or via Google, I'd have to say that the answer to your questions is "yes". The Net is a vast source of incorrect, incomplete, and otherwise bad data. There may be a lot of information out there, but the vast majority is wrong. This "cheapening" of information has and probably will lead to more of this crap "research".
- Re:This is what passes for CS research nowadays? (Score:2)
  
  by 99BottlesOfBeerInMyF ( 813746 ) writes:
  
  but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.
  
  Did you RTFA? Their conclusion is based upon their results which was the best they could do without access to the systems and with limited resources. And what is the conclusion that you complain about the spin on? The conclusion is that yahoo's claim is suspicious. I'd say that is a pretty solid claim. Yahoo's assertions are suspicious and while they could be true, are worth questioning in li
Interesting study... (Score:2)

by dracken ( 453199 ) writes:

...though flawed in many respects. The raw number of pages returned may not indicate the size of indices. Google is famous because it returns *relevant* pages but not necessarily *more* pages. A search engine that returns its entire index with each search isn't all that useful.

Secondly, results for all keywords may not increase with the size of the index. The pages which were indexed might correspond to popular searches (that return more than 1000 results, which were not considered if you RTFA) - so consider
methodology (Score:2)

by abde ( 136025 ) writes:

The assumption seems to be that search results are randomly distributed. But by the very nature of search (a targeted and subjective request for information), that is clearly the wrong model. I don't see why a 2x bigger index should be assumed to return 2x more results for any query with under 1000 results.

A better test would be to see how much overlap there was between queries. Do the top 50 returns on queries (of any size, not just limited to those with N < 1000 returns) match? To within what percentage?
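
One way to put a number on that overlap, as a rough Python sketch. The function name `top_k_overlap`, the use of Jaccard similarity, and the example URLs are all assumptions made for illustration; nothing like this appears in the NCSA study.

```python
def top_k_overlap(results_a, results_b, k=50):
    """Return the share of top-k URLs that two ranked result lists have in common.

    Computed as Jaccard similarity: size of the intersection divided by
    size of the union of the two top-k sets. Ranking order is ignored.
    """
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)


# Hypothetical ranked result lists from two engines for the same query.
engine_a = ["a.example", "b.example", "c.example", "d.example"]
engine_b = ["b.example", "a.example", "e.example", "f.example"]

overlap = top_k_overlap(engine_a, engine_b, k=4)
print(f"top-4 overlap: {overlap:.2f}")
```

Jaccard similarity ignores rank order; a rank-aware measure would be needed to ask whether the two engines also agree on which result comes first.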
Google parses plurals differently. (Score:3, Interesting)

by WoTG ( 610710 ) writes: on Monday August 15, 2005 @02:30PM (#13323176) Homepage Journal

Google started treating plurals as the same search about a year ago. Yahoo doesn't. So, if you google for "inkjet printers" and "inkjet printer" you will get the same result set; however, on Yahoo, you will get different results.

The net result is that for the same index size, Google will return more results. (And, IMHO, more meaningful ones.)
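
A minimal Python sketch of why folding plurals together can raise the hit count on an index of the same size. The `naive_stem` rule and the three toy pages are assumptions for illustration only; this is not how either engine actually handles plurals.

```python
def naive_stem(word):
    """Strip a trailing 's' from longer words (a crude stand-in for real stemming)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def search(pages, query, stem=False):
    """Return titles of pages containing every query word, optionally after stemming."""
    normalize = naive_stem if stem else (lambda w: w)
    wanted = {normalize(w) for w in query.lower().split()}
    hits = []
    for title, text in pages.items():
        words = {normalize(w) for w in text.lower().split()}
        if wanted <= words:  # every query word appears on the page
            hits.append(title)
    return hits


# Hypothetical three-page index.
pages = {
    "page1": "cheap inkjet printer for sale",
    "page2": "reviews of inkjet printers",
    "page3": "laser printers compared",
}

exact = search(pages, "inkjet printers")               # plural must match literally
stemmed = search(pages, "inkjet printers", stem=True)  # plural and singular both match

print(exact, stemmed)
```

With stemming on, the singular and plural queries return the same set, and that set is larger than the exact-match set even though the index has not changed.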

- - - Re:you are WRONG (Score:3, Interesting)
      
      by adpowers ( 153922 ) writes:
      
      Google does use stemming, I see it all the time. The results are still different, though, because I'm sure they weight the main query higher than the stems.
      
      Also, you can see something similar to stemming when you search for certain acronyms. Try searching for [lotr] or [ada]. It also performs searches for the full version of the acronym, as you can see by the bold query in the snippets and title.
Who cares about... (Score:2, Insightful)

by Ignignokt ( 803398 ) writes:

the number of results anyways? Who makes it to page 5000 when doing a search?
More results == better search engine? (Score:3, Insightful)

by RunzWithScissors ( 567704 ) writes: on Monday August 15, 2005 @02:30PM (#13323185)

So in the conclusion, the author writes that since Google displayed more results, based on their random test data, it was the superior search engine? That seems so wrong somehow...

Wouldn't a better search engine return less, but more appropriate results? I mean, how many of us have found the information we were actually looking for on page ten or twelve of a search. And, isn't less more, but better? %insert Linux geek laughs here%

One would think that volume of results would not a better search engine make, although it may indicate a larger engine index size; an explicit statement to that effect seems to be missing from the NCSA report.

-Runz

Quality > Quantity (Score:2, Insightful)

by hagrin ( 896731 ) writes:

This is just another example in the age old argument of which is better. IMO, the quality of the search results is what matters more than the sheer quantity of information. One relevant find is more valuable than 100 inaccurate results. A test of accuracy might be more valuable and one that would be difficult to engineer. For instance, if I type in a word that has a direct correlating .com domain, that should be the first result (assuming no other words in the title - i.e. "hagrin" brings me my home pag
Wait, wait, wait (Score:2)

by antifoidulus ( 807088 ) writes:

What's this? A concise and well written summary with a link directly to the well written article? No twisting/breaking of the truth in order to incite /. groupthink comments? No pointless plugs for unrelated topics? No ADS?!?!

Jesus, the editors keep that up they might actually have a worthwhile site going....never fear, I'm sure the next dupe and/or an article comparing spooning to unmanned space travel will surface before the day's end.
WMD flamebait? (Score:2)

by mi ( 197448 ) writes:

Or off-topic? Or troll?
The NCSA's test neither confirms nor disproves Yahoo's earlier claims. Their lesser average results may just indicate higher quality threshold -- Google's results beyond the second page are never useful either.
I'd say, it is kind'a early to claim "pants down, egg on face"...
Not only does Google do More, it does Better (Score:3, Informative)

by Ralph Spoilsport ( 673134 ) writes: on Monday August 15, 2005 @02:32PM (#13323210) Journal

In regards to a similar article last week, I posted my own personal results [slashdot.org] on what I found when I did a search on Kyzyl, the capital of Tuva.
Google not only gave MORE results, it gave BETTER results. The only bad results were some hairsplitting (if largely well meant) from fellow /.ers... (I mentioned Tuva as a suburb of Mongolia, and while it IS a part of the Russian Federation, it is Much More Mongolian than Russian. And if the rising tide of neoNazi scum in Russia get their way, Tuva could easily be cut adrift into the Mongolian/Chinese orbit...but I digress...)
The essential point is: Which Does the Job Better For Me? Google. Therefore, I use Google. Assuming the Copernican position that I am not atypical, I would therefore extrapolate that this is very true for most other people as well. Which means that Yahoo has a LONG way to go and A LOT more work to do.
RS

Not Convincing (Score:2)

by FreshFunk510 ( 526493 ) writes:

Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google.
In order to create a large number of queries that returned less than 1,000 results, we took the commonly available English Ispell Wordlist.. and wrote a PERL script to randomly select two words at a time from that list.
Is it just me or does this study not sound convincing enough? There are too many holes in the way the study was conducted, I
Nice and objective (Score:2)

by Gumber ( 17306 ) writes:

Nice to take an anti-yahoo submission from a Google employee. I guess I should be happy they at least disclosed the conflict. It's more than you can say for someone like Bob "rove-puppet" Novak.
Results of my own study... (Score:5, Funny)

by Locke2005 ( 849178 ) writes: on Monday August 15, 2005 @02:38PM (#13323261)

Google only reports "about 4,820,000" entries for Britney Spears, while Yahoo reports "about 67,100,000" entries! This makes Yahoo more than 12 times better than google! Yeah, my methodology is completely fucked up... but then, so is the NCSA's!

- Re:Results of my own study... (Score:2, Funny)
  
  by WillAffleckUW ( 858324 ) writes:
  
  Google only reports "about 4,820,000" entries for Britney Spears, while Yahoo reports "about 67,100,000" entries! This makes Yahoo more than 12 times better than google! Yeah, my methodology is completely fucked up... but then, so is the NCSA's!
  
  But that's because both Yahoo and Google cap results at 1000, so if you have more than that, it won't count for either engine.
Quality vs. Quantity (Score:2)

by Sigh Phi ( 324315 ) writes:

The study only addresses the issue of size of the indices and returned results. Understandable, and it certainly debunks Yahoo's claims, or at least, makes them irrelevant -- what good is a 19 billion-page index if you don't actually get any more search results?

But the real utility of a search engine is the relevance of those search results. Google has been successful because its search results are relevant to a large portion of its users. The real question when comparing search engines is, can one help you
Are more search results "better"? (Score:3, Interesting)

by Vellmont ( 569020 ) writes: on Monday August 15, 2005 @03:18PM (#13323677) Homepage

There's an inherent assumption in the Yahoo claim that more==better. Do I really care if a search returns 1 million results vs 6 million results?

What I care about is actually getting the information I went out to find. There's only a certain amount of hits I'm willing to explore. That's probably on the order of 100-200 or so if I _really_ need the information. The implication by Yahoo is that more hits == better top ranked hits. Is that true? Really what should be done is just compare the top few hundred hits between the two search engines and see how they differ. Those are the only ones that matter anyway.

Where more results might prove useful is obscure searches with less than 100-200 hits. But if this study is true, Yahoo does a worse job on obscure searches than Google.

The problem of course is the type of obscure searches that this study performed. Two random words out of a dictionary just isn't what your typical person conducting a search engine query is looking for.
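
For reference, the sampling step the study describes (two random words from the Ispell list, keeping only queries under 1,000 results on both engines) could look roughly like this in Python. The study used a Perl script; the word list, function names, and result counts below are hypothetical stand-ins, and no search engine is actually queried.

```python
import random


def make_queries(words, n_queries, seed=None):
    """Build n_queries two-word queries by sampling two distinct words each time."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, 2)) for _ in range(n_queries)]


def keep_small(result_counts, cap=1000):
    """Keep only queries that returned fewer than `cap` results on both engines."""
    return {
        query: (google, yahoo)
        for query, (google, yahoo) in result_counts.items()
        if google < cap and yahoo < cap
    }


# Hypothetical stand-in for the Ispell word list.
words = ["apprizers", "expense", "quixotic", "lathe", "ombudsman"]
queries = make_queries(words, 3, seed=42)

# Hypothetical (google, yahoo) result counts; real counts would come from the engines.
counts = {"apprizers expense": (12, 5), "search engine": (5000, 4000)}
filtered = keep_small(counts)

print(queries)
print(filtered)
```

The filter makes the sampling bias visible: anything a typical user would search for tends to exceed the cap and never enters the comparison.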

Holy lack of IR statistics understanding, Batman! (Score:5, Interesting)

by freality ( 324306 ) writes: on Monday August 15, 2005 @04:41PM (#13324714) Homepage Journal

The most basic measure of performance in Information Retrieval is precision vs. recall.

Precision is how many of the results that you return are correct. e.g. If Google returns 100 results and 10 of them are correct, then the precision on that query is 10%.

Recall is how many of the correct results you return. e.g. If Yahoo returns 100 results out of a total 1000 correct matches, then the recall on that query is 10%.

Information retrieval systems such as search engines balance these two metrics -- which are fundamentally at odds with each other -- to give the "best balance" in the eyes of the system's designers.

The NCSA study basically misses the effect this decision would have on perceived size of index.

A simple demonstration shows how it works.

First let's say both search engines have the same index size: 10B pages. Second, let's say both search engines have exactly the same a priori capability for precision and recall, but can tune for a preferred performance. Yahoo decides it wants to favor more precise results over more results recalled, at a 2:1 relative ratio compared to Google.

In that case, any given query will show half the hits from Yahoo as compared to Google. Concluding Yahoo's index to be half the size of Google's, given this result, would be incorrect.

Furthermore, without knowing the precision/recall performance of either system, they can only demonstrate a lower-bound on index size, and that certainly doesn't predict average or max index size.
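
The two definitions, and the effect of a stricter cutoff on hit counts, can be sketched in a few lines of Python. The integer relevance scores and the two thresholds are made-up assumptions chosen to give a 2:1 ratio; they are not measurements of either engine.

```python
def precision(returned, relevant):
    """Fraction of returned results that are relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0


def recall(returned, relevant):
    """Fraction of all relevant results that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0


# Precision example from the text: 100 results returned, 10 of them correct.
p = precision(range(100), range(10))

# Recall example from the text: 100 correct results returned out of 1000 correct matches.
r = recall(range(100), range(1000))


def hits(scores, threshold):
    """Count documents whose relevance score meets the engine's cutoff."""
    return sum(1 for s in scores if s >= threshold)


# Same hypothetical index for both engines: ten documents scored 0..9 for one query.
scores = list(range(10))
loose_hits = hits(scores, 2)   # engine tuned toward recall
strict_hits = hits(scores, 6)  # engine tuned toward precision

print(p, r, loose_hits, strict_hits)
```

Both cutoffs run over the identical list of scores, so the 2:1 difference in hits says nothing about how many documents are in the index.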

- More please! (Score:5, Interesting)
  
  by 2008 ( 900939 ) writes: on Monday August 15, 2005 @02:25PM (#13323114) Journal
  
  This is a great article! I wish there were more like it on slashdot. It's scientific instead of an opinion piece, it has references, it's repeatable. It's also short and very readable, unlike a lot of science papers.
  
  OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".
  
  - MODERATORS, look here!!! (Score:2)
    
    by Junior J. Junior III ( 192702 ) writes:
    
    Mod parent up.
  - Re:More please! (Score:3, Funny)
    
    by Overzeetop ( 214511 ) writes:
    
    Oh, please don't ask for "more like this". It just gives the editors a reason to think that there is a hardcore contingent of /. readers who crave dupes. I mean, how can they get more "like this" than to simply repost it in a couple of hours.
- Re:Good article (Score:2)
  
  by amliebsch ( 724858 ) writes:
  
  ERROR: LOGIC FAILURE
  Returning fewer pages does not necessarily mean poorer search results - after all, a good search will present the maximum number of relevant pages, but no others. Google only wins if all of the extra results it shows are actually relevant. By the methods of this test and your analysis, I could write a search engine that returns its entire index as the result set for every search, and it would be the best websearch ever! Billions of results on every search!
  I would like to see an objec
- Re:Conflict of interest? (Score:2)
  
  by 99BottlesOfBeerInMyF ( 813746 ) writes:
  
  It seems to me that when Slashdot publishes an article that is favourable to Google, that was submitted by a Google staff member, one might question whether someone involved has a conflict of interest.
  
  It just so happens that a lot of the news about a given company comes to the attention of the people in that company. Should Slashdot not allow submissions from posters that regard products or services they are working on? So long as it is news and affiliations are disclosed what's the problem?
  It might
- - Re:Automated querying is Illegal (Score:2)
    
    by Dachannien ( 617929 ) writes:
    
    And in this case, there's no contract to break. But Google can still IP ban you if they want.

