Please create an account to participate in the Slashdot moderation system

NCSA Issues Disclaimer on Google/Yahoo Study 118

Posted by Hemos on Monday August 22, 2005 @12:15PM from the point-counterpoint dept.

Jean Veronis writes "NCSA has issued a strong disclaimer on the study announced recently on Slashdot that seemed to contradicted the fact that Yahoo's index size would be bigger than Google's: ' Staff at the NCSA noted several issues with the study'. This study conducted by students is 'not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA'. "

This discussion has been archived. No new comments can be posted.

NCSA Issues Disclaimer on Google/Yahoo Study

Load All Comments

Search 118 Comments Log In/Create an Account

Comments Filter:

Disclaimer Text (Score:5, Interesting)

by Stanistani ( 808333 ) writes: on Monday August 22, 2005 @12:24PM (#13372514) Homepage Journal

From http://vburton.ncsa.uiuc.edu/indexsize.html [uiuc.edu]:
"The following study was completed by two of Professor Vernon Burton's students at the University of Illinois. Though one of the students previously worked with Professor Burton at the National Center for Supercomputing Applications (NCSA), the study was done outside the scope of any NCSA core projects. When first published online, staff at the NCSA noted several issues with the study, and some revisions have been made to the document to reflect several of these concerns. Changes are detailed at the bottom of the following page.

Please note again that this study is not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA.

A Comparison of the Size of the Yahoo and Google Indices [uiuc.edu] "

Share
twitter facebook
- Re:Disclaimer Text (Score:1)
  
  by klept ( 895849 ) writes:
  
  Ok, so NCSA claims to not be associated with this paper, but several changes have been made to reflect several concerns of NCSA staff. So which is it, NCSA, are you involved or not? And from the fact that changes were made to reflect your concern, it sounds like you were involved. It also sounds like you pissed off Google.
Comment removed (Score:3, Interesting)

by account_deleted ( 4530225 ) writes: on Monday August 22, 2005 @12:24PM (#13372516)

Comment removed based on user account deletion

Share
twitter facebook
- Re:... so? (Score:2, Interesting)
  
  by Anonymous Coward writes:
  
  Preliminary results (from 7000 test queries) indicates that the results of this verification study confirms the conclusions of this study, but final results are still forthcoming.
  
  Looks like they're still doing some looking to make sure their results are rock solid, but that so far they seem to be. As such, the current state of reality is that the fact is that Google has a must bigger index of the world wide web (or Internet, or whatever you want to call it) than Yahoo. Yahoo may have a bigger index squ
  - Re:... so? (Score:2, Interesting)
    
    by mi ( 197448 ) writes:
    
    The whole method seems flawed. Trying to compare the sizes of two sets by the sizes of various subsets makes sense only if the method of selecting the subsets is the same.
    This is not the case. The methods depend on each search engine's algorithms and are very likely to differ greatly.
    In any case, whether a particular query returns 40 results or 40000 does not matter -- only the first 20 are ever of any use...
    - - "Could" & "might" (Re:... so?) (Score:1)
        
        by mi ( 197448 ) writes:
        
        Count the "coulds" and the "mights" in your post and agree with me, that NCSA's method can not be used to conclusively compare the sizes of Yahoo!'s and Google's indexes...
/. 503 error (Score:2, Interesting)

by dhasenan ( 758719 ) writes:

Off topic...

Anyone else get 503 errors when trying to reach Slashdot?

Where do you go to talk about Slashdot being Slashdotted?
- Re:/. 503 error (Score:2, Informative)
  
  by paulius_g ( 808556 ) writes:
  
  Glad you asked...
  
  I've been getting 500 errors the whole morning while trying to reach /. But not 503 ones. After one or two page refreshes, it starts working!
  - Re:/. 503 error (Score:5, Funny)
    
    by Anonymous Coward writes: on Monday August 22, 2005 @12:47PM (#13372640)
    
    I've been getting 500 errors the whole morning while trying to reach /. But not 503 ones. After one or two page refreshes, it starts working!
    
    The trick is to refresh as fast as you can, until the bad 500 errors go away.
    
    Parent Share
    twitter facebook
    - Re:/. 503 error (Score:1)
      
      by paulius_g ( 808556 ) writes:
      
      Yeah well, don't make me refresh YOUR site!
      - Re:/. 503 error (Score:1)
        
        by StarvingSE ( 875139 ) writes:
        
        In Soviet Russia, the sites refresh you!
        
        Re:/. 503 error (Score:1)
        
        by paulius_g ( 808556 ) writes:
        
        In Northern Siberia, you are refreshed until you are slashdotted!
- Re:/. 503 error (Score:1)
  
  by Analog Squirrel ( 547794 ) writes:
  
  Yeah, I got one this morning, too - although it is a pretty rare occurance for me - and the thought of "slashdot getting slashdotted" did go through my mind... and I found a bit of preverse humor in it. :-)
- Re:/. 503 error (Score:2)
  
  by antdude ( 79039 ) writes:
  
  I noticed their comments were not showing this morning either. I think they were related.
- Re:/. 503 error (Score:1)
  
  by yiantsbro ( 550957 ) writes:
  
  "Where do you go to talk about Slashdot being Slashdotted?"
  
  First rule of /. --- you don't talk about /.
A crucial issue... (Score:5, Funny)

by d3m057h3n35 ( 695460 ) writes: on Monday August 22, 2005 @12:26PM (#13372521)

Also pertinent was the discovery that Yahoo's claims to increased index size were based on the hope that buying products from companies which advertise "longer, thicker index size in two weeks, money-back guarantee, all-natural supplements" would yield actual results.

Share
twitter facebook
Wait... (Score:5, Funny)

by lbmouse ( 473316 ) writes: on Monday August 22, 2005 @12:28PM (#13372524) Homepage

I thought that size didn't matter.

Share
twitter facebook
- Re:Wait... (Score:1)
  
  by theotherlight ( 904426 ) writes:
  
  That's only what /.'ers tell themselves to feel better...
  
  The truth is: size is everything.
- Re:Wait... (Score:1, Funny)
  
  by thegamerformelyknown ( 868463 ) writes:
  
  Actually, it's not how big it is, but what you do with it:)
  
  I think that applies to both situations too...
  - Re:Wait... (Score:1)
    
    by Analog Squirrel ( 547794 ) writes:
    
    "Both"? You mean there's more than 1?
- Re:Wait... (Score:1)
  
  by chrisvdb ( 149510 ) writes:
  
  I thought that size didn't matter.
  
  That's what they say to you ;-)
- Re:Wait... (Score:1)
  
  by Barkmullz ( 594479 ) writes:
  
  I thought that size didn't matter.
  
  It's not the wand, it's the wizard...
- Re:Hah, hah (Score:1)
  
  by Fr05t ( 69968 ) writes:
  
  "such mambo-jumbo study. Very professional!"
  
  I'll have you know Mambo is very professional. If you ment Mumbo, then yeah those guys are a bunch of dead beats.
But why publish it? (Score:3, Insightful)

by ChrisF79 ( 829953 ) writes: on Monday August 22, 2005 @12:34PM (#13372549) Homepage

Although they don't say it in the disclaimer, their actions of posting a disclaimer after posting the article screams that they realize the article is flawed. If that's the case, why publish it in the first place? Shouldn't they have had some foresight and left this one on the cutting room floor? Maybe Finance is different, but I remember it being very difficult to get an article published unless it was groundbreaking and free from any minor flaws.

Share
twitter facebook
- Re:But why publish it? (Score:3, Insightful)
  
  by 'nother poster ( 700681 ) writes:
  
  From the disclaimer I would say thet the report was not a university sanctioned project, but a funtime project for a couple of students. They then published it in a manner that implied that it was offical work of the university, or at least sanctioned by the professor. Now, whether the study is right or wrong come peer review, the university wants it known that it wasn't their project. A peer reviewed research project is much different than throwing together a bad stats class midterm and putting the resu
- It was not "published" (Score:5, Insightful)
  
  by kaan ( 88626 ) writes: on Monday August 22, 2005 @01:04PM (#13372738)
  
  why publish it in the first place?
  
  Dude, it was never published, it was posted on one web server that is part of the ncsa.uiuc.edu sub-domain (specifically, vburton.ncsa.uiuc.edu). There are probably hundreds of machines that are in this network, and posting something on a web server running there does not equate to NCSA formally publishing an article. What we're talking about here is a web page written by two students, they worked on a project, they wanted to post it for other people to see. So that's what they did, period.
  
  Stupidly, everyone is claiming that NCSA backed this whole thing, like they (NCSA) are on some crusade to compare Yahoo and Google. But this must be taken for what it is - a project by two students. NCSA's disclaimer is just trying to make this clear for the idiots out there who think that every little thing a student says or does must have been funded, supported, backed, etc. by NCSA.
  
  Parent Share
  twitter facebook
- Re:But why publish it? (Score:2)
  
  by hackstraw ( 262471 ) * writes:
  
  Although they don't say it in the disclaimer, their actions of posting a disclaimer after posting the article screams that they realize the article is flawed. If that's the case, why publish it in the first place? Shouldn't they have had some foresight and left this one on the cutting room floor? Maybe Finance is different, but I remember it being very difficult to get an article published unless it was groundbreaking and free from any minor flaws.
  
  Yes, the web page was lacking in methodology and had a numbe
  - Re:But why publish it? (Score:2)
    
    by monkeydo ( 173558 ) writes:
    
    Yes, you are underreacting. Did you miss the original Slashdot posting:
    NCSA Compares Google and Yahoo Index Numbers
    chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to "over 20 billion items", compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own, independent, comparison of the two engines. "
    Notice that the summary was submitted by a well known Google employee, and that it states the study was conducted by the N
Filtering (Score:5, Insightful)

by Spazmania ( 174582 ) writes: on Monday August 22, 2005 @12:35PM (#13372555) Homepage

Readers can consult the list of search terms provided by the authors, and can see for themselves that, in the vast majority of cases retained (i.e. those with fewer than 1000 results), the results in question are lists and spam.

I don't know which disturbs me more: The possibility that this is the correct explanation for the discrepancy or the possibility that it isn't.

It seems to me that the correct solution to filtering results would be to put the "undesirable" results at the bottom of the list, not get rid of them entirely. One man's trash is another man's treasure after all.

Share
twitter facebook
- Maybe those pages never were crawled by yahoo. (Score:5, Interesting)
  
  by MushMouth ( 5650 ) writes: on Monday August 22, 2005 @12:51PM (#13372657)
  
  It is entirely possible that Yahoo not only crawls more content on a whole, but the percentage of the content that they crawl that is spam is smaller. Crawlers get to spam/SEO pages via other spam/SEO pages, thus if they have better filtering mechanisms they may simply stop earlier and put their (not completely unlimited) resources into crawling more useful data.
  
  Parent Share
  twitter facebook
  - Re:Maybe those pages never were crawled by yahoo. (Score:2)
    
    by rbarreira ( 836272 ) writes:
    
    Have you read the study?
  - Never mind that, look how low your ID is :-o (Score:1)
    
    by Low Slashdot ID Guy! ( 909023 ) writes:
    
    I, mean, like, cool, dude!
  - Re:Maybe those pages never were crawled by yahoo. (Score:2)
    
    by Alomex ( 148003 ) writes:
    
    No it isn't. In principle yes, but I tested for this, and it is Yahoo the one that returns more spam:
    
    http://slashdot.org/comments.pl?sid=159703&cid=133 74598 [slashdot.org]
    
    http://slashdot.org/comments.pl?sid=158453&cid=132 75737 [slashdot.org]
- Trash and treasure. (Score:1)
  
  by solomonrex ( 848655 ) writes:
  
  I think the point is that copies of the ispell dictionary and spam are repetitive, which are normally not included with search results. Why do you need more than one copy of an identical result?
A Study By Students... (Score:2)

by __aaclcg7560 ( 824291 ) writes:

The fact that a study conducted by students got mention on /. is impressive. Usually, most works done by students are ignored as class exercises. Now "retracted" can be added to the list.
- - Re:A Study By Students... (Score:2)
    
    by BitchKapoor ( 732880 ) writes:
    
    This so-called study was thought up by a well-known campus political player who recently completed a masters in library science and programmed by a friend who graduated a while a go, I think with a bachelor's in CS. I bet the whole thing was hacked up to try to get Cheney a job at Google.
Covering Ones Rear (Score:3, Insightful)

by gkozlyk ( 247448 ) writes: on Monday August 22, 2005 @12:39PM (#13372589) Homepage

Ah, the good old disclaimer added to cover ones rear. With litigation flying free as newspaper in the wind, one can't be to careful these days.

Share
twitter facebook
- Re:Covering Ones Rear (Score:2)
  
  by Stalus ( 646102 ) writes:
  
  What I really love is that fact that the page used to have the professor listed on the list of authors, NCSA logos were on the page, UIUC was listed under the authors' affiliation, and it looked much more official. Now that it's been aired out as non-scientific, there's all sorts of disclaimers saying that it was his student's work, and shifting the blame. Too bad it was published on his webspace :P
  
  Perhaps the professor of History and Sociology will think twice next time before attempting to put his nam
Why is the disclaimer needed? (Score:3, Interesting)

by frdmfghtr ( 603968 ) writes: on Monday August 22, 2005 @12:48PM (#13372647)

I didn't read the article the first time around, so maybe something was changed/removed that prompted the disclaimer. I read the report and couldn't find a single reference to NCSA, except in the URL and in the disclaimer itself.

Aside from the URL, was there some sort of NCSA association implied or claimed in the original post then removed?

Share
twitter facebook
- disclaimer: people believe everything they read (Score:2)
  
  by kaan ( 88626 ) writes:
  
  I think you're right, the original article had no visible association with NCSA other than the url. But this is just like the classic telephone game: I tell you something, you repeat it to somebody else with a minor addition/change, then that person tells somebody else, etc. By the time it goes 4 or 5 hops, it's been totally twisted around, and my original message has turned into something idiotic, and everyone thinks I said it. This is exactly what happened here, because it started showing up on blogs,
The dark web (Score:5, Insightful)

by SpinyNorman ( 33776 ) writes: on Monday August 22, 2005 @12:49PM (#13372652)

The Yahoo vs Google page count methodology of counting numbers of pages returned for various high-response queries seems to be completely ignoring the fact that Yahoo *might be* picking up some of the less highly linked-to "dark web" that Google's page rank alogorithm are going to rate lowly, and which their crawler may be ignoring.

This is the portion of the web that I'd like to see - not the commerical portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.

What'd therefore be relevant and interesting to know isn't how many hundreds of pages Google vs Yahoo get for "my job sucks", but rather how many it gets for "my weevil collection".

Share
twitter facebook
- Re:The dark web (Score:1)
  
  by squoozer ( 730327 ) writes:
  
  Your wish fulfilled:
  Google: Approx 47,100
  Yahoo: Approx 258,000
  
  Both searches we for "my weevil collection" without quotes. With quotes the results are:
  
  Google: 3
  Yahoo: 4
  
  Yahoo is champ.
- Re:The dark web (Score:2)
  
  by zarr ( 724629 ) writes:
  
  Search results for "weevil":
  google - 915,000
  yahoo - 2,200,000
  Search results for "my weevil collection":
  google - 3
  yahoo - 4
  Yahoo returned the wikipedia page for "weevil" on the first page, so now I know what it is :)
  - Re:The dark web (Score:1, Funny)
    
    by Anonymous Coward writes:
    
    Search results for "weevil":
    google - 915,000
    yahoo - 2,200,000
    
    Search results for "my weevil collection":
    google - 3
    yahoo - 4
    
    You're getting negative hits?
- Re:The dark web (Score:3, Interesting)
  
  by markov_chain ( 202465 ) writes:
  
  Interesting.
  
  The original search rating papers (Kleinberg's algorithm, PageRank) made the ground-breaking observation that links between pages contained lots of useful information that could be used for ranking in addition to the keywords contained in the pages themselves. This was in a time where websites were mostly personal, and there was an atmosphere of friendly sharing of information where people would link to other sites that they find interesting. However, how much have things changed today? Who s
  - Re:The dark web (Score:2)
    
    by MushMouth ( 5650 ) writes:
    
    At one time that was google's 'pages like this one', and then there are Alexa's "Related Links", which have been around since before Google. Unfortunately there are privacy issues, and there would be (and is for alexa) a whole industry built around gaming that system.
    - Re:The dark web (Score:2)
      
      by markov_chain ( 202465 ) writes:
      
      Good tip, thanks. The Alexa client is dead on. Abuse and privacy issues are inevitable, but I'm curious how a search engine using client-side information compares to a crawl based one.
  - Re:The dark web (Score:2)
    
    by Epistax ( 544591 ) writes:
    
    I have had a few items pop in my head which I thought were "totally awesome" and if successfully employed, worth many many monies.
    
    As for the search engine, how about a little checkbox that says "No Business". What's a business? Someone who sells something (loosely). Anyway it'd take a heck of a lot of work to implement and define, but that's a checkbox I'd have thoroughly molested. I'd also want to make a thesaurus that edits words as you type.
    - Re:The dark web (Score:1)
      
      by dogod ( 732331 ) writes:
      
      yahoo might have what your looking for, though it's only beta. they allow you to switch from more commercial or more informational sites (i.e., from academic, non-commercial, or research-oriented sources) i have yet to try it so i don't know how well it works.
      
      http://research.yahoo.com/research/data_analytics/ mindset__intent-driven_search.shtml [yahoo.com]
  - Re:The dark web (Score:2)
    
    by alienw ( 585907 ) writes:
    
    What makes you think your method is any better? It would be gamed just like PageRank (much worse, actually). Overreliance on any single method is not good if you want to have a decent search engine.
    - Re:The dark web (Score:2)
      
      by markov_chain ( 202465 ) writes:
      
      It uses a different source of reputation, one that seems more in tune with what the Web content looks like today.
      
      There is always the issue of abuse, no matter what the method.
  - Re:The dark web (Score:2)
    
    by SpinyNorman ( 33776 ) writes:
    
    Interesting idea. I imagine that Google would have the bandwidth and server capacity to capture and processs this data if browsers were able to make it available.
    
    I quite often find that Amazon's "people who bought this book also bought/viewed ..." section turns up useful stuff that a title search doesn't, so I expect the same may be true here too. One could even get a "user interest rating" of pages by how long they viewed them for...
    
    Maybe Mozilla/Firefox could work with Google to implement this type of fee
    - Feedback system (Score:2)
      
      by harmonica ( 29841 ) writes:
      
      Maybe Mozilla/Firefox could work with Google to implement this type of feedback system...
      
      That already exists as stumbleupon.com [stumbleupon.com] and deli.cio.us [deli.cio.us].
- Re:The dark web (Score:3, Insightful)
  
  by RAMMS+EIN ( 578166 ) writes:
  
  ``This is the portion of the web that I'd like to see - not the commerical portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.''
  
  I personally don't think Google is _excluding_ pages that somehow don't get enough links to them. Typically, good resources will get linked to, and thus taking into account the number of links to a page seems sensible.
  
  From personal experience, I can't say I have anything
- Re:The dark web (Score:2)
  
  by illumin8 ( 148082 ) writes:
  
  This is the portion of the web that I'd like to see - not the commerical portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.
  
  Dude, if your idea of the dark web is millions of spam infested blog pages that have been crawled by a million spam robots putting links in the comment pages, along with a smattering of Mediawiki sites that have similarly been "0wnz0r3d" by spam crawlers that edit the pages a
trust (Score:5, Funny)

by dioscaido ( 541037 ) writes: on Monday August 22, 2005 @01:01PM (#13372720)

If it made it through the Slashdot filters, then the study is good enough for me.

Share
twitter facebook
Hit volume is a minor factor (Score:2)

by Council ( 514577 ) writes:

I know it's been said before, but you cannot just measure search engines based on volume of hits returned. Clearly, when you get into the millions, it doesn't hurt the results to prune some crap off the end, and I'm sure they're both doing things -- either one could easily focus a little on breadth of hits per query and jump past the other.

Important thing to note: The general principal is MORE COMPLEX than "find all pages containing this term". You can ADD terms and get MORE hits.

As an example and as a t
- Re: (Score:1)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re:Hit volume is a minor factor (Score:2)
    
    by Council ( 514577 ) writes:
    
    Though note that I'm not really referring to 'index size' but to the size of the list of hits returned for a term. I'm pretty much in favor of the indexes being as large as possible, and that's a reasonable thing to demand. But saying that one engine returns 2,000,000 more hits for 'banana store' than the other is not measuring the same thing at all, and is in fact dumb.
I say so what.... (Score:2)

by Khyber ( 864651 ) writes:

DISCLAIMER: This comment is influenced by Colt 45 malt liquor...

Big deal about what some other corp. says. This is a Joe Schmoe study conducted by college students. This means they're an independent, non-funded (therefore non-corp influenced) study. Too bad they have seemingly been coerced into changing some things in their article. *sigh* Why can't they ever stick with their guns??
- Re:NCSA? (Score:1)
  
  by gauauu ( 649169 ) writes:
  
  National Center for Supercomputing Applications [uiuc.edu], a research branch of the University of Illinois at Urbana-Champaign [uiuc.edu].
  Their claims to fame include having four supercomputers of the 50 fastest in the world [top500.org], and creating Mosaic [wikipedia.org], the first graphical web browser.
  If you have doubts about the influence of Mosaic, load up internet explorer, click "Help" in the menu, then click "About Internet Explorer" and read the blurb....
Accuracy of Google counts? (Score:5, Interesting)

by xiaomonkey ( 872442 ) writes: on Monday August 22, 2005 @01:58PM (#13373169)
Try the following sets of key words on Google:
- lawyer [google.com] - results 29,300,000
- lawyer lawyer [google.com] - results 29,300,000
- lawyer lawyer lawyer [google.com] - results 62,000,000
- lawyer lawyer lawyer lawyer [google.com] - results 78,600,000
This trend appears to continue, as seen in that repeating the "lawyer" keyword 10 times results in Google estimating that there are 389,000,000 [google.com] hits in it's index.

On yahoo, this sort of thing doesn't seem to happen as much, but it still does happen. For example, searching for "laywer" returns 124,000,000 [yahoo.com] results, and searching for "lawyer lawyer" or "lawyer lawyer" returns 125,000,000 [yahoo.com] results.

So, it probably doesn't really make seen to judge the relative size of either index based on the estimated number of hits for any given set of keywords in their index. Right now, Google's numbers look a little more suspect since they seem to variety so greatly just based on the repetition of a keyword. However, the stability of Yahoo's numbers don't necessarily mean that they're correct either.
Share
twitter facebook
- Re:Accuracy of Google counts? (Score:1)
  
  by xiaomonkey ( 872442 ) writes:
  
  Sorry, the yahoo links I gave above are erroneous.
  
  Here's the corrected version of the first one, "lawyer" [yahoo.com] that results in 125,000,000 many estimated hits. The second one, "lawyer lawyer" [yahoo.com] results in 124,000,000 many estimated hits.
- Re:Accuracy of Google counts? (Score:2)
  
  by rbarreira ( 836272 ) writes:
  
  Those estimates are pretty irrelevant for this discussion, I think. When there are many results, those estimates aren't supposed to be accurate at all, that's why the study focused on queries with very few results.
  
  But yes, those numbers you show are quite strange.
- Re:Accuracy of Google counts? (Score:2)
  
  by Sique ( 173459 ) writes:
  
  For the search terms bla, bla bla, bla bla bla etc.pp. the numbers at Google remain pretty stable (starting out at 2.040 mio, and later somewhere between 1.870 und 1.900 mio).
  
  So the interesting question is: Why does it work with lawyer, but not with bla?
- Re:Accuracy of Google counts? (oblig.) (Score:2, Funny)
  
  by CycleMan ( 638982 ) writes:
  
  lawyer - results 29,300,000
  lawyer lawyer - results 29,300,000
  lawyer lawyer lawyer - results 62,000,000
  lawyer lawyer lawyer lawyer - results 78,600,000
  lawyer lawyer lawyer lawyer
  lawyer lawyer lawyer lawyer
  lawyer lawyer lawyer lawyer
  LAW SUIT LAW SUIT!
  lawyer lawyer lawyer lawyer
  lawyer lawyer ...
This may be true (Score:3, Funny)

by lcsjk ( 143581 ) writes: on Monday August 22, 2005 @02:22PM (#13373386)

I understand that Google uses a very efficient compression technology to compress documents before they are indexed, thereby making characters so small that they can only be read with a magnifying glass or microscope.
In contrast, Yahoo, unless I misunderstand, only compresses the file after it has been indexed. Since only the file is compressed and not the individual characters, they indeed have a larger index file as the study concluded. :)

Share
twitter facebook
More thoughts on a better test (Score:5, Interesting)

by freality ( 324306 ) writes: on Monday August 22, 2005 @02:23PM (#13373391) Homepage Journal

After criticising [slashdot.org] the study when I first saw it, I now have some constructive ideas on how to perform a better test of the relative search performance of the two engines.

- Crawler Test

Do a search of "microsoft site:microsoft.com" in Google, Yahoo and MSN search. Assume that Microsoft knows how to crawl its own site completely and judge the relative strength of Google's and Yahoo's crawlers based on how many of those pages they find. Google easily wins.

Unfortunately, his test doesn't work with Yahoo as the reference site since Google returns no hits for "yahoo site:yahoo.com". This is very disappointing, as Yahoo is one of the largest sites on the Web. Neither Yahoo or MSN share this self-censorship policy.

Another test of the same kind is "amazon site:amazon.com", since amazon is a very big site which everyone is presumably very interested in crawling well. Unfortunately, amazon doesn't allow this kind of search on themselves via their A9 engine.

This is an interesting test because it compares the very likely actual size of a site (i.e. Microsoft's reported size for microsoft.com) with the reported size by the second-party search engine. This may be the best 3rd party test of a crawler's Recall on a per-site basis.

- Common Word Test

Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc.. A search for "the" on each show Yahoo significantly in the lead.

This is an interesting test because the word "the" is probably as uniformly distributed in the source webpage population than the random phrases used in the NCSA test (or possibly moreso), plus this word is more likely than any other to occur in every web document (at least with English). These two characteristics mean that finding the most pages with "the" may be the closest approximation of an actual Recall measurement for all sites on the web that can be done without a prior-knowledge testing set.

- Conclusion

Google and MSN seem to have high-quality crawlers, whereas Yahoo makes up with a much larger current index size. However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with.

Even assuming these results, Yahoo's bigger index isn't necessarily more desirable. A better crawler can yield a bigger and better index with relatively little work (just add more machines to store it on), while there is no easy fix for the meticulous work needed to ensure a crawler is getting to all the nooks and crannies of the Web.

Longer-term, it is these caveats that make an Open-Source approach to crawling shine by comparisson. Take for example the Nutch search engine. Though it is in its early days, there need be no doubt to its cralwing and ranking algorithms as well as its Precision/Recall tradeoff.

Share
twitter facebook
- Re:More thoughts on a better test (Score:1)
  
  by knoebelsPT ( 883833 ) writes:
  
  What?
  
  A google search for yahoo site:yahoo.com turns up over 57,000,000 hits, not zero.
- Re:More thoughts on a better test (Score:2)
  
  by Alomex ( 148003 ) writes:
  
  A search for "the" on each show Yahoo significantly in the lead.
  
  The problem with such searches is that if a search engine misindexes a mirror site or a 401 page and that is returned in the count, then that SE looks bigger.
  
  On the other hand if you launch a query that has 5-10 answers you can actually examine every single page on both result pages and make sure that all hits are correct and distinct.
  
  Using that technique Google comes ahead of Yahoo, by a large margin. [slashdot.org]
  - Re:More thoughts on a better test (Score:2)
    
    by freality ( 324306 ) writes:
    
    Yeah, I agree there are many caveats. Like I said: "However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with."
    
    What you describe would be to me a "substantial structural difference". Which means I agree.
    
    However, that doesn't change that I do think it's better to accept probable error in a huge population of samples than to choose a m
- Re:More thoughts on a better test (Score:1)
  
  by Esteanil ( 710082 ) writes:
  
  Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc.. A search for "the" on each show Yahoo significantly in the lead.
  
  From the article:
  
  Interestingly, the actual total number of results returns varies dramatically from the estimated total number of results that both Google and Yahoo! provide users in the search results. In the case of Google, the number of actual results returned is about one third of the estimation that Google gives. However
  - Re:More thoughts on a better test (Score:2)
    
    by freality ( 324306 ) writes:
    
    Well, yeah.
    
    But even though I said take everything I said with a grain of salt, I would take that claim about estimated vs. actual hits with 2 grains of salt.
    
    Here's my superb rationale why.
    
    Consider that if it was easy or efficient return the exact # of hits, they would probably do it, instead of, for instance, "Results 251 - 259 of about 284". I mean, consider the UI people at Google.. do they want to muddly their otherwise famously precise and clear interface with a guffaw like that? I'd bet Not unless th
    - Re:More thoughts on a better test (Score:2)
      
      by freality ( 324306 ) writes:
      
      Actually, glad you brought this up. I turned off duplicate elimination on both sites for my test query and got an exact match between Yahoo's estimated and actual number of results pages. Google's actual number increased but still fell short of actual.
      
      So again, looks like Yahoo has a good handle on what its index size is, but is simply filtering out lots of the results from its index. Perhaps there's not really that much difference between the two after all :)
- - Yahoo hits.. my error. (Score:2)
    
    by freality ( 324306 ) writes:
    
    Yeah, I see that now too. I must have mistyped. My apologies to Google for publicly questioning their editorial policy without merit!
Darn students... (Score:2)

by jpellino ( 202698 ) writes:

Felonies for the whole lot of'em!

Oh, wait. Which students were these?
Study: Red Delicious apples != Fuji apples (Score:3, Interesting)

by RunzWithScissors ( 567704 ) writes: on Monday August 22, 2005 @02:32PM (#13373462)

I got flamed for proposing this theory when the article was first posted on /.

One major problem with the study, not really addressed by this problems article is one of comparison. Yes, Yahoo! and Google are two search engines, but they perform their searches differently, and more importantly, use different criteria when returning matches! It is quite possible that when doing the exact same search in both yeilds a difference in results. Why? Because the two different search engines have different criteria for providing output, or matches if you will, from said search. Perhaps Yahoo! does indeed have more pages indexed, but because of their search algorithm, or their program which displays results to the user, less matches are provided; even though more pages were looked at.

I'm not saying that Yahoo! does have more or less pages than Google. But the study that was executed and published did not account for many of the differences between the products they were comparing. The above is merely another interpretation of said results. I'm glad that some folks at NCSA agree and provided some clarification; lest we get another urban myth like storks bring babies.

-Runz

Share
twitter facebook
Matt Cheney is a punk (Score:1)

by vishesh ( 885709 ) writes:

I took a philosophy class with Matt Cheney at the University of Illinois. Let me just say for the record that he is a douchebag. I am really not surprised that he tried to pass off this study under the auspices of NCSA. I'm just glad to see that someone called him on this.
Yahoo bigger, how? (Score:2)

by SharpFang ( 651121 ) writes:

Google has great most of the web covered. While obeying robots.txt and such, they can't index much more of meaningful content. So how did Yahoo almost triple the Google's goal? Well, as long as you're looking for obvious stuff with "easy hits", the results will be similar. But if you enter REALLY obscure stuff, for which Google shows 3-5 hits, Yahoo will show the same 3-5 hits and 15 others, which are all different variants of 404, pages pointed to through broken links. Simply put, 2/3 of Yahoo index are "4
- Re:"Editor" seemed to contradicted someone's skill (Score:2)
  
  by ryanov ( 193048 ) writes:
  
  I don't see why this was modded down. It really took away from the summary. Granted, the sentence makes no sense WITHOUT the error.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Disclaimer Text (Score:5, Interesting)

Re:Disclaimer Text (Score:1)

Comment removed (Score:3, Interesting)

Re:... so? (Score:2, Interesting)

Re:... so? (Score:2, Interesting)

"Could" & "might" (Re:... so?) (Score:1)

/. 503 error (Score:2, Interesting)

Re:/. 503 error (Score:2, Informative)

Re:/. 503 error (Score:5, Funny)

Re:/. 503 error (Score:1)

Re:/. 503 error (Score:1)

Re:/. 503 error (Score:1)

Re:/. 503 error (Score:1)

Re:/. 503 error (Score:2)

Re:/. 503 error (Score:1)

A crucial issue... (Score:5, Funny)

Wait... (Score:5, Funny)

Re:Wait... (Score:1)

Re:Wait... (Score:1, Funny)

Re:Wait... (Score:1)

Re:Wait... (Score:1)

Re:Wait... (Score:1)

Re:Hah, hah (Score:1)

But why publish it? (Score:3, Insightful)

Re:But why publish it? (Score:3, Insightful)

It was not "published" (Score:5, Insightful)

Re:But why publish it? (Score:2)

Re:But why publish it? (Score:2)

Filtering (Score:5, Insightful)

Maybe those pages never were crawled by yahoo. (Score:5, Interesting)

Re:Maybe those pages never were crawled by yahoo. (Score:2)

Never mind that, look how low your ID is :-o (Score:1)

Re:Maybe those pages never were crawled by yahoo. (Score:2)

Trash and treasure. (Score:1)

A Study By Students... (Score:2)

Re:A Study By Students... (Score:2)

Covering Ones Rear (Score:3, Insightful)

Re:Covering Ones Rear (Score:2)

Why is the disclaimer needed? (Score:3, Interesting)

disclaimer: people believe everything they read (Score:2)

The dark web (Score:5, Insightful)

Re:The dark web (Score:1)

Re:The dark web (Score:2)

Re:The dark web (Score:1, Funny)

Re:The dark web (Score:3, Interesting)

Re:The dark web (Score:2)

Re:The dark web (Score:2)

Re:The dark web (Score:2)

Re:The dark web (Score:1)

Re:The dark web (Score:2)

Re:The dark web (Score:2)

Re:The dark web (Score:2)

Feedback system (Score:2)

Re:The dark web (Score:3, Insightful)

Re:The dark web (Score:2)

trust (Score:5, Funny)

Hit volume is a minor factor (Score:2)

Re: (Score:1)

Re:Hit volume is a minor factor (Score:2)

I say so what.... (Score:2)

Re:NCSA? (Score:1)

Accuracy of Google counts? (Score:5, Interesting)

Re:Accuracy of Google counts? (Score:1)

Re:Accuracy of Google counts? (Score:2)

Re:Accuracy of Google counts? (Score:2)

Re:Accuracy of Google counts? (oblig.) (Score:2, Funny)

This may be true (Score:3, Funny)

More thoughts on a better test (Score:5, Interesting)

Re:More thoughts on a better test (Score:1)

Re:More thoughts on a better test (Score:2)

Re:More thoughts on a better test (Score:2)

Re:More thoughts on a better test (Score:1)

Re:More thoughts on a better test (Score:2)

Re:More thoughts on a better test (Score:2)

Yahoo hits.. my error. (Score:2)

Darn students... (Score:2)

Study: Red Delicious apples != Fuji apples (Score:3, Interesting)

Matt Cheney is a punk (Score:1)

Yahoo bigger, how? (Score:2)

Re:"Editor" seemed to contradicted someone's skill (Score:2)