NCSA Issues Disclaimer on Google/Yahoo Study
Jean Veronis writes "NCSA has issued a strong disclaimer on the study announced recently on Slashdot that seemed to contradicted the fact that Yahoo's index size would be bigger than Google's: ' Staff at the NCSA noted several issues with the study'. This study conducted by students is 'not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA'. "
Disclaimer Text (Score:5, Interesting)
"The following study was completed by two of Professor Vernon Burton's students at the University of Illinois. Though one of the students previously worked with Professor Burton at the National Center for Supercomputing Applications (NCSA), the study was done outside the scope of any NCSA core projects. When first published online, staff at the NCSA noted several issues with the study, and some revisions have been made to the document to reflect several of these concerns. Changes are detailed at the bottom of the following page.
Please note again that this study is not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA.
A Comparison of the Size of the Yahoo and Google Indices [uiuc.edu] "
Re:Disclaimer Text (Score:1)
Comment removed (Score:3, Interesting)
Re:... so? (Score:2, Interesting)
Looks like they're still doing some checking to make sure their results are rock solid, but so far they seem to be. As such, the current state of reality is that Google has a much bigger index of the world wide web (or Internet, or whatever you want to call it) than Yahoo. Yahoo may have a bigger index squ
Re:... so? (Score:2, Interesting)
This is not the case. The methods depend on each search engine's algorithms and are very likely to differ greatly.
In any case, whether a particular query returns 40 results or 40000 does not matter -- only the first 20 are ever of any use...
"Could" & "might" (Re:... so?) (Score:1)
/. 503 error (Score:2, Interesting)
Anyone else get 503 errors when trying to reach Slashdot?
Where do you go to talk about Slashdot being Slashdotted?
Re:/. 503 error (Score:2, Informative)
I've been getting 500 errors the whole morning while trying to reach
Re:/. 503 error (Score:5, Funny)
The trick is to refresh as fast as you can, until the bad 500 errors go away.
Re:/. 503 error (Score:1)
Re:/. 503 error (Score:1)
Re:/. 503 error (Score:1)
Re:/. 503 error (Score:1)
Re:/. 503 error (Score:2)
Re:/. 503 error (Score:1)
First rule of
A crucial issue... (Score:5, Funny)
Wait... (Score:5, Funny)
Re:Wait... (Score:1)
The truth is: size is everything.
Re:Wait... (Score:1, Funny)
I think that applies to both situations too...
Re:Wait... (Score:1)
Re:Wait... (Score:1)
That's what they say to you
Re:Wait... (Score:1)
I thought that size didn't matter.
It's not the wand, it's the wizard...
Re:Hah, hah (Score:1)
I'll have you know Mambo is very professional. If you meant Mumbo, then yeah, those guys are a bunch of deadbeats.
But why publish it? (Score:3, Insightful)
Re:But why publish it? (Score:3, Insightful)
It was not "published" (Score:5, Insightful)
Dude, it was never published, it was posted on one web server that is part of the ncsa.uiuc.edu sub-domain (specifically, vburton.ncsa.uiuc.edu). There are probably hundreds of machines that are in this network, and posting something on a web server running there does not equate to NCSA formally publishing an article. What we're talking about here is a web page written by two students, they worked on a project, they wanted to post it for other people to see. So that's what they did, period.
Stupidly, everyone is claiming that NCSA backed this whole thing, like they (NCSA) are on some crusade to compare Yahoo and Google. But this must be taken for what it is - a project by two students. NCSA's disclaimer is just trying to make this clear for the idiots out there who think that every little thing a student says or does must have been funded, supported, backed, etc. by NCSA.
Re:But why publish it? (Score:2)
Yes, the web page was lacking in methodology and had a numbe
Re:But why publish it? (Score:2)
Notice that the summary was submitted by a well known Google employee, and that it states the study was conducted by the N
Filtering (Score:5, Insightful)
I don't know which disturbs me more: The possibility that this is the correct explanation for the discrepancy or the possibility that it isn't.
It seems to me that the correct solution to filtering results would be to put the "undesirable" results at the bottom of the list, not get rid of them entirely. One man's trash is another man's treasure after all.
Maybe those pages never were crawled by yahoo. (Score:5, Interesting)
Re:Maybe those pages never were crawled by yahoo. (Score:2)
Never mind that, look how low your ID is :-o (Score:1)
Re:Maybe those pages never were crawled by yahoo. (Score:2)
No it isn't. In principle yes, but I tested for this, and Yahoo is the one that returns more spam:
http://slashdot.org/comments.pl?sid=159703&cid=13
http://slashdot.org/comments.pl?sid=158453&cid=13
Trash and treasure. (Score:1)
A Study By Students... (Score:2)
Re:A Study By Students... (Score:2)
Covering Ones Rear (Score:3, Insightful)
Re:Covering Ones Rear (Score:2)
What I really love is the fact that the page used to have the professor listed on the list of authors, NCSA logos were on the page, UIUC was listed under the authors' affiliation, and it looked much more official. Now that it's been aired out as non-scientific, there are all sorts of disclaimers saying that it was his students' work, and shifting the blame. Too bad it was published on his webspace :P
Perhaps the professor of History and Sociology will think twice next time before attempting to put his nam
Why is the disclaimer needed? (Score:3, Interesting)
Aside from the URL, was there some sort of NCSA association implied or claimed in the original post then removed?
disclaimer: people believe everything they read (Score:2)
The dark web (Score:5, Insightful)
This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.
What'd therefore be relevant and interesting to know isn't how many hundreds of pages Google vs Yahoo get for "my job sucks", but rather how many it gets for "my weevil collection".
Re:The dark web (Score:1)
Your wish fulfilled:
Google: Approx 47,100
Yahoo: Approx 258,000
Both searches were for "my weevil collection" without quotes. With quotes the results are:
Google: 3
Yahoo: 4
Yahoo is champ.
Re:The dark web (Score:2)
google - 915,000
yahoo - 2,200,000
Search results for "my weevil collection":
google - 3
yahoo - 4
Yahoo returned the wikipedia page for "weevil" on the first page, so now I know what it is :)
Re:The dark web (Score:1, Funny)
google - 915,000
yahoo - 2,200,000
Search results for "my weevil collection":
google - 3
yahoo - 4
You're getting negative hits?
Re:The dark web (Score:3, Interesting)
The original search rating papers (Kleinberg's algorithm, PageRank) made the ground-breaking observation that links between pages contained lots of useful information that could be used for ranking in addition to the keywords contained in the pages themselves. This was in a time where websites were mostly personal, and there was an atmosphere of friendly sharing of information where people would link to other sites that they find interesting. However, how much have things changed today? Who s
Re:The dark web (Score:2)
Re:The dark web (Score:2)
Re:The dark web (Score:2)
As for the search engine, how about a little checkbox that says "No Business". What's a business? Someone who sells something (loosely). Anyway it'd take a heck of a lot of work to implement and define, but that's a checkbox I'd have thoroughly molested. I'd also want to make a thesaurus that edits words as you type.
Re:The dark web (Score:1)
http://research.yahoo.com/research/data_analytics
Re:The dark web (Score:2)
Re:The dark web (Score:2)
There is always the issue of abuse, no matter what the method.
Re:The dark web (Score:2)
I quite often find that Amazon's "people who bought this book also bought/viewed
Maybe Mozilla/Firefox could work with Google to implement this type of fee
Feedback system (Score:2)
That already exists as stumbleupon.com [stumbleupon.com] and del.icio.us [del.icio.us].
Re:The dark web (Score:3, Insightful)
I personally don't think Google is _excluding_ pages that somehow don't get enough links to them. Typically, good resources will get linked to, and thus taking into account the number of links to a page seems sensible.
From personal experience, I can't say I have anything
Re:The dark web (Score:2)
Dude, if your idea of the dark web is millions of spam infested blog pages that have been crawled by a million spam robots putting links in the comment pages, along with a smattering of Mediawiki sites that have similarly been "0wnz0r3d" by spam crawlers that edit the pages a
trust (Score:5, Funny)
Hit volume is a minor factor (Score:2)
Important thing to note: The general principle is MORE COMPLEX than "find all pages containing this term". You can ADD terms and get MORE hits.
As an example and as a t
Re: (Score:1)
Re:Hit volume is a minor factor (Score:2)
I say so what.... (Score:2)
Big deal about what some other corp. says. This is a Joe Schmoe study conducted by college students. This means it's an independent, non-funded (therefore non-corp influenced) study. Too bad they have seemingly been coerced into changing some things in their article. *sigh* Why can't they ever stick to their guns??
Re:NCSA? (Score:1)
National Center for Supercomputing Applications [uiuc.edu], a research branch of the University of Illinois at Urbana-Champaign [uiuc.edu].
Their claims to fame include having four of the 50 fastest supercomputers in the world [top500.org], and creating Mosaic [wikipedia.org], the first widely used graphical web browser.
If you have doubts about the influence of Mosaic, load up internet explorer, click "Help" in the menu, then click "About Internet Explorer" and read the blurb....
Accuracy of Google counts? (Score:5, Interesting)
On yahoo, this sort of thing doesn't seem to happen as much, but it still does happen. For example, searching for "laywer" returns 124,000,000 [yahoo.com] results, and searching for "lawyer lawyer" or "lawyer lawyer" returns 125,000,000 [yahoo.com] results.
So, it probably doesn't really make sense to judge the relative size of either index based on the estimated number of hits for any given set of keywords in their index. Right now, Google's numbers look a little more suspect since they seem to vary so greatly just based on the repetition of a keyword. However, the stability of Yahoo's numbers doesn't necessarily mean that they're correct either.
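One rough way to put a number on that stability is the relative spread between the lowest and highest estimate an engine gives for repetitions of the same keyword. The sketch below is only an illustration: `relative_spread` is a made-up helper name, and the two counts are the Yahoo estimates quoted in this thread for "lawyer" and "lawyer lawyer" (which figure belongs to which query does not change the result).

```python
def relative_spread(counts):
    """Return (max - min) / min for a list of estimated hit counts."""
    lowest, highest = min(counts), max(counts)
    if lowest <= 0:
        raise ValueError("counts must all be positive")
    return (highest - lowest) / lowest

# Yahoo estimates quoted in this thread for "lawyer" and "lawyer lawyer".
yahoo_counts = [124_000_000, 125_000_000]
yahoo_spread = relative_spread(yahoo_counts)

print(f"Yahoo relative spread: {yahoo_spread:.2%}")
```

A small spread only says the estimate is stable under repetition; as noted above, it says nothing about whether the estimate is right.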
Re:Accuracy of Google counts? (Score:1)
Here's the corrected version of the first one, "lawyer" [yahoo.com], which results in 125,000,000 estimated hits. The second one, "lawyer lawyer" [yahoo.com], results in 124,000,000 estimated hits.
Re:Accuracy of Google counts? (Score:2)
But yes, those numbers you show are quite strange.
Re:Accuracy of Google counts? (Score:2)
So the interesting question is: Why does it work with lawyer, but not with bla?
Re:Accuracy of Google counts? (oblig.) (Score:2, Funny)
lawyer lawyer - results 29,300,000
lawyer lawyer lawyer - results 62,000,000
lawyer lawyer lawyer lawyer - results 78,600,000
lawyer lawyer lawyer lawyer
lawyer lawyer lawyer lawyer
lawyer lawyer lawyer lawyer
LAW SUIT LAW SUIT!
lawyer lawyer lawyer lawyer ...
lawyer lawyer
This may be true (Score:3, Funny)
In contrast, Yahoo, unless I misunderstand, only compresses the file after it has been indexed. Since only the file is compressed and not the individual characters, they indeed have a larger index file as the study concluded. :)
More thoughts on a better test (Score:5, Interesting)
- Crawler Test
Do a search of "microsoft site:microsoft.com" in Google, Yahoo and MSN search. Assume that Microsoft knows how to crawl its own site completely and judge the relative strength of Google's and Yahoo's crawlers based on how many of those pages they find. Google easily wins.
Unfortunately, this test doesn't work with Yahoo as the reference site since Google returns no hits for "yahoo site:yahoo.com". This is very disappointing, as Yahoo is one of the largest sites on the Web. Neither Yahoo nor MSN shares this self-censorship policy.
Another test of the same kind is "amazon site:amazon.com", since amazon is a very big site which everyone is presumably very interested in crawling well. Unfortunately, amazon doesn't allow this kind of search on themselves via their A9 engine.
This is an interesting test because it compares the very likely actual size of a site (i.e. Microsoft's reported size for microsoft.com) with the size reported by a third-party search engine. This may be the best third-party test of a crawler's Recall on a per-site basis.
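The crawler test above boils down to a ratio: pages an engine reports for a site, divided by the count the site's own search reports. Here is a minimal sketch of that arithmetic, assuming you have already collected the counts by hand. The function name, the engine names, and every number are placeholders for illustration, not measurements.

```python
def per_site_recall(engine_count, reference_count):
    """Fraction of a site's pages (per the reference count) that an engine reports."""
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    # Reported counts are estimates, so the raw ratio can exceed 1.0; cap it.
    return min(engine_count / reference_count, 1.0)

# Placeholder numbers only, NOT real search results.
reference_count = 1_000_000
reported = {"engine_a": 800_000, "engine_b": 250_000, "engine_c": 1_200_000}

recall = {name: per_site_recall(count, reference_count)
          for name, count in reported.items()}

for name, value in sorted(recall.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.0%}")
```

The cap at 1.0 is a design choice for readability; an uncapped ratio above 1.0 is itself a hint that the engine's estimate, or the reference count, is off.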
- Common Word Test
Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.
This is an interesting test because the word "the" is probably as uniformly distributed in the source webpage population as the random phrases used in the NCSA test (or possibly more so), plus this word is more likely than any other to occur in every web document (at least in English). These two characteristics mean that finding the most pages with "the" may be the closest approximation of an actual Recall measurement for all sites on the web that can be done without a prior-knowledge testing set.
- Conclusion
Google and MSN seem to have high-quality crawlers, whereas Yahoo makes up for it with a much larger current index size. However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with.
Even assuming these results, Yahoo's bigger index isn't necessarily more desirable. A better crawler can yield a bigger and better index with relatively little work (just add more machines to store it on), while there is no easy fix for the meticulous work needed to ensure a crawler is getting to all the nooks and crannies of the Web.
Longer-term, it is these caveats that make an Open-Source approach to crawling shine by comparison. Take for example the Nutch search engine. Though it is in its early days, there need be no doubt about its crawling and ranking algorithms or its Precision/Recall tradeoff.
Re:More thoughts on a better test (Score:1)
A Google search for yahoo site:yahoo.com turns up over 57,000,000 hits, not zero.
Re:More thoughts on a better test (Score:2)
The problem with such searches is that if a search engine misindexes a mirror site or a 401 page and that is returned in the count, then that SE looks bigger.
On the other hand if you launch a query that has 5-10 answers you can actually examine every single page on both result pages and make sure that all hits are correct and distinct.
Using that technique Google comes ahead of Yahoo, by a large margin. [slashdot.org]
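The small-query check described above can be sketched as: take every hit both engines return, throw out the ones that are not real matches, collapse duplicates, and count what is left. This is only a sketch under assumptions: `normalize`, `distinct_verified`, and the example URLs are invented for illustration, and the "is this hit correct" judgment is passed in as a function because in practice a person makes it by opening each page.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"

def distinct_verified(hits, is_correct):
    """Return the set of distinct hits that pass the correctness check."""
    return {normalize(url) for url in hits if is_correct(url)}

# Invented example hits; example.com is a reserved documentation domain.
hits = [
    "http://www.example.com/a/",
    "http://example.com/a",       # same page as the first hit
    "http://example.com/b",
    "http://example.com/error",   # stands in for a misindexed error page
]
verified = distinct_verified(hits, lambda url: not url.endswith("/error"))

print(len(verified), sorted(verified))
```

Note that this normalization does not catch true mirror sites on different hosts; those still need a human to look at the page.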
Re:More thoughts on a better test (Score:2)
What you describe would be to me a "substantial structural difference". Which means I agree.
However, that doesn't change that I do think it's better to accept probable error in a huge population of samples than to choose a m
Re:More thoughts on a better test (Score:1)
Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.
From the article:
Interestingly, the actual total number of results returned varies dramatically from the estimated total number of results that both Google and Yahoo! provide users in the search results. In the case of Google, the number of actual results returned is about one third of the estimation that Google gives. However
Re:More thoughts on a better test (Score:2)
But even though I said take everything I said with a grain of salt, I would take that claim about estimated vs. actual hits with 2 grains of salt.
Here's my superb rationale why.
Consider that if it was easy or efficient to return the exact # of hits, they would probably do it, instead of, for instance, "Results 251 - 259 of about 284". I mean, consider the UI people at Google.. do they want to muddy their otherwise famously precise and clear interface with a guffaw like that? I'd bet Not unless th
Re:More thoughts on a better test (Score:2)
So again, looks like Yahoo has a good handle on what its index size is, but is simply filtering out lots of the results from its index. Perhaps there's not really that much difference between the two after all
Yahoo hits.. my error. (Score:2)
Darn students... (Score:2)
Oh, wait. Which students were these?
Study: Red Delicious apples != Fuji apples (Score:3, Interesting)
One major problem with the study, not really addressed by this article, is one of comparison. Yes, Yahoo! and Google are two search engines, but they perform their searches differently, and more importantly, use different criteria when returning matches! It is quite possible that doing the exact same search in both yields different results. Why? Because the two search engines have different criteria for providing output, or matches if you will, from said search. Perhaps Yahoo! does indeed have more pages indexed, but because of their search algorithm, or the program which displays results to the user, fewer matches are provided, even though more pages were looked at.
I'm not saying that Yahoo! does have more or fewer pages than Google. But the study that was executed and published did not account for many of the differences between the products they were comparing. The above is merely another interpretation of said results. I'm glad that some folks at NCSA agree and provided some clarification, lest we get another urban myth like storks bringing babies.
-Runz
Matt Cheney is a punk (Score:1)
Yahoo bigger, how? (Score:2)
Re:"Editor" seemed to contradicted someone's skill (Score:2)