Slashdot
NCSA Compares Google and Yahoo Index Numbers
Posted by
ScuttleMonkey
on Mon Aug 15, 2005 01:11 PM
from the searching-for-truth dept.
chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to 'over 20 billion items', compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own independent comparison of the two engines."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Yahoo pants down, egg on face, no WMD either. (Score:3, Interesting)
Honestly, when I first heard the news over the weekend I thought "rubbish, they must be ignoring requests for spiders to go no further or something." I guess NCSA can either 1) Expect no gifts from Yahoo OR 2) Report significantly different results after a sizable gift to NCSA.
75% less truth than other leading brand
Yahoo returns dupes... (Score:3, Insightful)
They may have more unique information simply further down the result list, but since the search engines terminate the results at not quite 1k (1,000), the researchers have no way of testing that out.
All they can really show is that Google returns more unique results per 1000 (which usually means that more items are indexed, but could be from Google's PageRank also)...
Re:Yahoo returns dupes... (Score:5, Funny)
If that's the case, then why is Google the darling of slashdot?
Re:Yahoo returns dupes... (Score:3, Insightful)
Interestingly, however, for the search results analysed, Google performed noticeably better whether dupes were included or discarded.
They may have more unique information simply further down the result list, but since the search engines terminate the results at not quite 1k (1,000), the researchers have no way of testing that out.
That isn't actually what they did. They only analysed queries that returned fewer than 1000 results on both Google and Yahoo. If either engine sco
Re:Yahoo returns dupes... (Score:3, Insightful)
Could be. And it could be that Google's results are both more numerous and of better quality. The tests did not, as you quite rightly point out, consider the relevance of the results. As is proper, the researchers make no claims regarding relevance.
On the other hand, their findings do cast doubt upon Yahoo's claims regarding index si
Re:Yahoo returns dupes... (Score:4, Insightful)
If Yahoo's indices are, as they claim, more than twice the size of Google's, then we might reasonably expect them to return more hits for an arbitrary query. That they do not do so suggests that Yahoo may well be telling fibs.
Yes, there are other explanations, like, for example, Google deliberately falsifying all sub-1000-hit queries, as you point out. However, one likely explanation, arguably the most likely, is that Yahoo is being a bit sparing with the truth in its press releases.
Hence "cast doubt upon".
Re:Yahoo returns dupes... (Score:3, Insightful)
But then I read footnote [3]:
Re:Yahoo pants down, egg on face, no WMD either. (Score:4, Interesting)
Re:Yahoo pants down, egg on face, no WMD either. (Score:3, Insightful)
I think it is possible that Yahoo! has more items indexed than Google ... Yahoo can search subscription based content. That has got to boost their numbers considerably beyond the range of queries that typically return less than one thousand results
If we assume that Yahoo has offered subscription-based content searching for about two years (not sure of the exact length of time), then to get even close to the difference they are citing here in their marketing (over 11 billion more items), they would have to
Re:Yahoo pants down, egg on face, no WMD either. (Score:5, Insightful)
Can we conclude from this study that Google has a bigger index than Yahoo? No. Can we conclude that when you pick two English words for which both Google and Yahoo return fewer than 1000 results, Google consistently returns more results? Yes.
The real question is, what can we infer from the actual indisputable findings of this study? I find no ready method of generalization. If you are inclined to believe Google is better, you feel happy inside. If you think Yahoo is better, you have many options to dispute the idea that the study result generalizes to search engine index size.
As a Google fan, I enjoy the warm fuzzies, but I don't see that much to get excited about either way.
Re:Yahoo pants down, egg on face, no WMD either. (Score:3, Insightful)
Rather than talking about indexed content, it seems like this test is actually more appropriate to use as some sort of analysis of the overall usefulness of the search engines. Even then, though, the results could be skewed to say that it's better to provide a wealth of pages (Google) or to have fine-tuned and nar
Accurate results? (Score:5, Interesting)
Try searching for the word "failure" in Google and check the results.
This brings into question *accurate* results. In this case it appears that's left to interpretation.
Re:Accurate results? (Score:5, Insightful)
Don't even contain the search term (Score:3, Insightful)
Re:Accurate results? (Score:5, Insightful)
Re:What would you want them to return? (Score:5, Insightful)
"Failure on eBay Find failure items at low prices. "
which illustrates the most important difference between Yahoo and Google.
Conclusion (Score:3, Informative)
Re:Conclusion (Score:5, Insightful)
Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous. Only a thorough study comparing results and how useful they were (which is hard to do, expensive and time consuming) has any meaning that goes beyond producing lots of funny numbers and percentages.
96.34% of all percentages are completely useless.
btw. I use google, not yahoo
Re:Conclusion (Score:3, Insightful)
No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If Yahoo returns fewer unique pages, Yahoo has indexed fewer pages.
What you're talking about is measuring the effectiveness of page ranking, which is a completely different measure of how good a search e
Re:Conclusion (Score:5, Insightful)
Actually, it might not be, thanks to their methodology.
They only used searches with fewer than 1000 results. They therefore got a lot of searches with small result counts (because they were searching for bizarre word combinations, like "promotion bedabble"). The total number of results was something like 500,000 or so (order of magnitude) for 10,000 searches. That's an average of 50 results/search, and I'd bet there's a large, large tail, so the most common search is probably something like 10 results.
The problem with this is that in their word list, the same sites are being returned over and over! For instance, sites containing dictionary lists appear in both "promotion bedabble" and "foliolate defecations" because, duh, that's the only place they'll appear. Since they're just searching the same type of site over and over, they get the same result magnified a lot: Google has more "dictionary lists" in its index than Yahoo. Most of the "dictionary list" word searches returned about 10-20 for Google, and few, if any, for Yahoo.
It's a pretty serious flaw in the methodology, as far as I can tell - they're double counting huge numbers of results, and so they're not really getting a good statistical sample of the index.
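A minimal sketch of the double-counting concern, using invented URLs and result lists (nothing here is taken from the NCSA logs; the function names are hypothetical): summing per-query hit counts counts a recurring word-list page once per query, while collecting URLs into a set counts it once.

```python
from collections import Counter

# Hypothetical per-query result lists; the URLs are invented for illustration.
results_by_query = {
    "promotion bedabble": [
        "wordlist-a.example/words.txt",
        "wordlist-b.example/dict.txt",
    ],
    "foliolate defecations": [
        "wordlist-a.example/words.txt",
        "wordlist-b.example/dict.txt",
        "wordlist-c.example/all.txt",
    ],
    "inabilities hydrocephalic": [
        "wordlist-a.example/words.txt",
        "wordlist-c.example/all.txt",
    ],
}


def summed_hits(results):
    """Total hits when every query's result count is simply added up."""
    return sum(len(urls) for urls in results.values())


def distinct_pages(results):
    """Number of different pages actually seen across all queries."""
    return len({url for urls in results.values() for url in urls})


def repeat_counts(results):
    """How many queries each page showed up in."""
    return Counter(url for urls in results.values() for url in urls)


print(summed_hits(results_by_query))     # 7 hits in total
print(distinct_pages(results_by_query))  # but only 3 distinct pages
print(repeat_counts(results_by_query).most_common(1))
```

In this toy data, seven summed hits come from only three pages, which is the kind of magnification the comment describes.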
Re:Conclusion (Score:3, Insightful)
As an interesting aside, though: if you dig through their log, you can see several interesting things. If you look at only results which return
They might have a larger index file (Score:4, Insightful)
Flawed conclusion? (Score:5, Insightful)
I still prefer Google though.
Re:Flawed conclusion? (Score:5, Insightful)
In any case, I am usually not so interested in the numbers of matches, but in the quality of the list returned--hopefully one website will have exactly what I need...
Interesting but... (Score:3, Insightful)
While it is true that more results could mean worse filtering, that is a separate test entirely.
I tend to think that ordering is more important than filtering down to a small number of results, since having lots of results returned doesn't hurt if the search engine can order well so that what you want is most likely to be in the top 10-25. This is especially true when there will be at most a couple of results where I'd rather have the search engine try at the ordering and have me do most of the filtering
Re:Flawed conclusion? (Score:3, Insightful)
Which appears to be the case.
A search for "inabilities hydrocephalic" returns almost all dictionary lists in Google, except 2. There are only 2 results in Yahoo, one of which is a dictionary list (or equivalent).
But the official results for this? 16 for Google, 2 for Yahoo.
This is a problem because almost every search returns the same dictionary lists, so it amounts to double (or probably around 5000-fold) weighting of those sites i
The results (Score:5, Interesting)
In other words, they believe Google indexes more items based on their own tests of searching.
Re:The results (Score:3, Insightful)
I don't see how NCSA's findings can prove or disprove Yahoo's earlier claims.
English Language (Score:4, Insightful)
Proper name samples (Score:5, Interesting)
Search: Valerie Plame
Google: 908,000
Yahoo: 2,580,000
Search: "Boulder, Colorado"
Google: 1,600,000
Yahoo: 5,880,000
Search: "Linus Torvalds"
Google: 2,560,000
Yahoo: 5,870,000
I assume it goes on like this. Of course these exceed the 1000 maximum hit limit given in the study.
Re:Proper name samples (Score:3, Interesting)
Search: "Dirk Bradford"
Google: 11
Yahoo: 15
Search: "Ronald Hendrickson"
Google: 170
Yahoo: 418
Search: "centerville baptist church" iowa
Google: 43
Yahoo: 37
Well that's less certain. It's hard finding words that return over zero but less than a thousand results...
Re:Proper name samples (Score:3, Informative)
However, in the case of Yahoo! the actual number of search results returned is only one-fifth the estimated total.
Those are estimates (Score:5, Insightful)
Which is the entire reason, of course, why they kept the limits under 1,000 in the first place: for any number over 1,000, if the search engine says, say, "I found 2.5 million results for 'Valerie Plame'", you have no way to tell whether it's telling the truth or not.
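A rough sketch of that filtering idea, using a few of the counts quoted earlier in this thread. The `countable` helper and the exact cap of 1000 are assumptions for illustration, not the study's actual code: only queries where both engines report fewer than the cap can have every result listed and checked.

```python
CAP = 1000  # assumed cutoff; engines stop listing results just short of this

# (query, Google count, Yahoo count), figures as quoted by posters in this thread
reported = [
    ("Valerie Plame", 908_000, 2_580_000),
    ('"Boulder, Colorado"', 1_600_000, 5_880_000),
    ('"Dirk Bradford"', 11, 15),
    ('"Ronald Hendrickson"', 170, 418),
]


def countable(rows, cap=CAP):
    """Keep only queries where both engines report fewer than `cap` hits,
    so the full result lists can actually be retrieved and counted."""
    return [row for row in rows if row[1] < cap and row[2] < cap]


for query, google_hits, yahoo_hits in countable(reported):
    print(query, google_hits, yahoo_hits)
```

The large-count queries drop out because their totals are estimates that cannot be verified by paging through the results.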
Hrmm (Score:3, Interesting)
Queries with 1,000 results (Score:4, Interesting)
That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.
Those could be the larger sites, where Yahoo is perhaps digging deeper, requesting data from forms, ignoring robots.txt, etc. It could be where they're getting those big claimed numbers of indexed documents.
Re:Queries with 1,000 results (Score:3, Insightful)
That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.
Well, there's a worse bias. They're grabbing words from an Ispell word list.
There are websites which contain the Ispell word list. There appear to be more of those returned in Google as results than in Yahoo. (here [nd.edu] is one returned in Google for "apprizers expense", but which is not returned in Yahoo.)
This basic
Perl Code (Score:4, Funny)
Re:Perl Code (Score:3, Insightful)
Methodology (Score:5, Insightful)
The assumption (as stated in the paper): Since Yahoo claims to have indexed twice as much as Google, searches should return twice as many entries.
That assumption is flat out incorrect. There are actually multiple problems.
First, the scope of the search (based on index terms) is really up to the search engine itself. Since each search engine does not return the entire database as search results, it is very much up to the individual search algorithm to determine the depth of entries considered to 'match' a set of terms. That's what is really being reflected in these results: it is not the overall size of the index, but simply how aggressive the search algorithm is in matching terms to entries.
Even if the algorithms were identical (same algorithm being run across both indexes), the nature of search does not scale in that way. If Yahoo has, for instance, become more aggressive in indexing message board and forum content, then only searches that play to those subjects should return more results than Google. Since searches are by definition narrowing on a data set, a methodology needs to be developed that more effectively tests the BREADTH of the results rather than simply testing the depth.
International Listings (Score:5, Insightful)
Just a thought
This is what passes for CS research nowadays? (Score:5, Insightful)
Maybe it's not even worth pointing out how badly flawed (and lazy) the underlying assumption of 'twice the results = twice the index size' probably is, as I'm sure we're going to see a few dozen posts to that effect (unless PageRank really means nothing), but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.
Google parses plurals differently. (Score:3, Interesting)
The net result is that for the same index size, Google will return more results. (And, IMHO, more meaningful ones.)
Re:you are WRONG (Score:3, Interesting)
Also, you can see something similar to stemming when you search for certain acronyms. Try searching for [lotr] or [ada]. It also performs searches for the full version of the acronym, as you can see by the bold query in the snippets and title.
More results == better search engine? (Score:3, Insightful)
Wouldn't a better search engine return fewer, but more appropriate, results? I mean, how many of us have found the information we were actually looking for on page ten or twelve of a search? And, isn't less more, but better? %insert Linux geek laughs here%
One would think that volume of results would not a better search engine make, although it may indicate a larger engine index size; an explicit statement to that effect seems to be missing from the NCSA report.
-Runz
Not only does Google do More, it does Better (Score:3, Informative)
Google not only gave MORE results, it gave BETTER results. The only bad results were some hairsplitting (if largely well meant) from fellow /.ers... (I mentioned Tuva as a suburb of Mongolia, and while it IS a part of the Russian Federation, it is Much More Mongolian than Russian. And if the rising tide of neoNazi scum in Russia get their way, Tuva could easily be cut adrift into the Mongolian/Chinese orbit...but I digress...)
The essential point is: Which Does the Job Better For Me? Google. Therefore, I use Google. Assuming the Copernican position that I am not atypical, I would therefore extrapolate that this is very true for most other people as well. Which means that Yahoo has a LONG way to go and A LOT more work to do.
RS
Results of my own study... (Score:5, Funny)
Are more search results "better"? (Score:3, Interesting)
What I care about is actually getting the information I went out to find. There's only a certain number of hits I'm willing to explore. That's probably on the order of 100-200 or so if I _really_ need the information. The implication by Yahoo is that more hits == better top ranked hits. Is that true? Really what should be done is just compare the top few hundred hits between the two search engines and see how they differ. Those are the only ones that matter anyway.
Where more results might prove useful is obscure searches with fewer than 100-200 hits. But if this study is true, Yahoo does a worse job on obscure searches than Google.
The problem of course is the type of obscure searches that this study performed. Two random words out of a dictionary just isn't what your typical person conducting a search engine query is looking for.
Holy lack of IR statistics understanding, Batman! (Score:5, Interesting)
Precision is how many of the results that you return are correct. e.g. If Google returns 100 results and 10 of them are correct, then the precision on that query is 10%.
Recall is how many of the correct results you return. e.g. If Yahoo returns 100 results out of a total 1000 correct matches, then the recall on that query is 10%.
Information retrieval systems such as search engines balance these two metrics -- which are fundamentally at odds with each other -- to give the "best balance" in the eyes of the system's designers.
The NCSA study basically misses the effect this decision would have on perceived size of index.
A simple demonstration shows how it works.
First, let's say both search engines have the same index size: 10B pages. Second, let's say both search engines have exactly the same a priori capability for precision and recall, but can tune for a preferred performance. Yahoo decides it wants to favor more precise results over more results recalled, at a 2:1 relative ratio compared to Google.
In that case, any given query will show half as many hits from Yahoo as from Google. Concluding Yahoo's index to be half the size of Google's, given this result, would be incorrect.
Furthermore, without knowing the precision/recall performance of either system, they can only demonstrate a lower-bound on index size, and that certainly doesn't predict average or max index size.
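To make the definitions above concrete, here is a minimal sketch. The `precision` and `recall` functions follow the standard definitions; the score cutoff model, the toy 1000-document index, and the name `hits` are assumptions chosen only to illustrate how two engines with the same index can return different hit counts.

```python
def precision(returned, relevant):
    """Fraction of the returned results that are correct."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0


def recall(returned, relevant):
    """Fraction of all correct results that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0


def hits(index_scores, cutoff):
    """Documents whose (hypothetical) match score reaches the cutoff."""
    return [doc for doc, score in index_scores.items() if score >= cutoff]


# The two examples from the comment: 10 correct out of 100 returned,
# and 100 returned out of 1000 correct matches.
print(precision(range(100), range(10)))  # 0.1
print(recall(range(100), range(1000)))   # 0.1

# One toy index shared by both engines: 1000 documents with scores 0..999.
shared_index = {f"doc{i}": i for i in range(1000)}

strict = hits(shared_index, cutoff=500)  # tuned toward precision
loose = hits(shared_index, cutoff=0)     # tuned toward recall

print(len(strict), len(loose))  # 500 1000
```

Both engines here search the identical index, yet the stricter one reports half as many hits, so hit counts alone say little about index size.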
More please! (Score:5, Interesting)
OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".
Re:More please! (Score:3, Funny)