Is Microsoft Crawling Google?
triplecoil writes "Jason Dowdell over at WebProNews has written a piece questioning a tactic Microsoft might be using to beef up its new search engine. He thinks Microsoft might be dipping into Google's results to supplement its own. Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own."
Don't concern yourself with this crap... (Score:4, Insightful)
Sure, I see crawlers on my site all the time, sometimes hitting the same URL over and over again. Do I understand their repetitive behavior? No. Do I care what they are doing? No, as long as they are obeying my robots.txt.
I have complained before about MSNbot ignoring changes to robots.txt while Google happily changed its habits (I can't find the link, sorry). My recent fight with Googlebot came to a head when I had to disallow it access to my gallery completely because it refused to honor anything except Disallow:
Do I care if MSNbot is crawling Google and then finding sites and links to search? No, as it's none of OUR concern. What is OUR concern is our own robots.txt and how the spiders interact with our sites through that file. Let Google deal with Microsoft/MSNbot if that's what needs to be done, but don't concern yourself with it otherwise.
Google is Catholic? (Score:5, Funny)
Google is Catholic?
Re:Don't concern yourself with this crap... (Score:3, Insightful)
Crawling a gallery of images (and all image property links as well) all day for several days might be considered "DoSing"; I consider it rude.
You're right, they don't have to obey the robots.txt, but they should when they say they will.
Re:Don't concern yourself with this crap... (Score:4, Informative)
User-agent: Googlebot-Image
Disallow:
That should work? No?
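For what it's worth, the exact form of the Disallow line matters here. Below is a minimal sketch using Python's standard urllib.robotparser. The gallery URL on example.com is hypothetical, and the parser reflects how the Python standard library reads the file, not necessarily how any particular crawler does. It suggests that an empty "Disallow:" value permits everything, while "Disallow: /" blocks the whole site for the named agent.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URL, used only for illustration.
GALLERY_URL = "http://example.com/gallery/photo1.jpg"

# An empty Disallow value means "nothing is disallowed".
EMPTY_DISALLOW = "User-agent: Googlebot-Image\nDisallow:\n"
# A bare slash disallows every path on the site for that agent.
SLASH_DISALLOW = "User-agent: Googlebot-Image\nDisallow: /\n"

if __name__ == "__main__":
    for name, text in [("empty", EMPTY_DISALLOW), ("slash", SLASH_DISALLOW)]:
        print(name, can_fetch(text, "Googlebot-Image", GALLERY_URL))
```

Agents that are not named in the file (msnbot, say) are unaffected by either rule, so each crawler needs its own entry or a "User-agent: *" block.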
Re:Don't concern yourself with this crap... (Score:5, Interesting)
There's more to it than that. Google caches your pages and makes that cache of your copyrighted material available. Arguably, if you have used your robots.txt file to tell it not to index (and therefore cache) your pages and it still does, Google is breaching copyright. OK, the Google cache is the world's largest breach of copyright anyway, but if you have told its spider not to index and it does regardless, that's a different ballgame.
Putting it out there on the web does not give anyone the right to do with it as they please.
Re:Don't concern yourself with this crap... (Score:5, Interesting)
Re:Don't concern yourself with this crap... (Score:5, Informative)
Remove yourself from google [google.com]
"Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, your webmaster must first insert the appropriate meta tags into the page's HTML code. "
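The quoted note does not say which meta tags it means. Assuming it refers to the widely used robots meta tag (for example content="noindex, noarchive"), here is a rough sketch of how a page's directives could be read with Python's standard html.parser. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute names arrive lowercased; values keep their original case.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, noarchive"></head></html>'
    print(sorted(robots_directives(page)))
```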
Re:Don't concern yourself with this crap... (Score:5, Insightful)
I shouldn't need to go and fill out some form for every search engine to protect my rights. One accepted standard way to say "do not index this" should be sufficient. This is an automated system. There is an accepted automated method to stop crawlers indexing your site (robots.txt). If they (Google or anyone else) take your copyrighted content and reproduce it automatically when their automatic system could have automatically respected your explicitly stated and legally protected rights, they are knowingly making a flagrant copyright violation.
Re:Don't concern yourself with this crap... (Score:3, Funny)
Interweb? Is that the same as the 'Information superhighway'?
Re:Don't concern yourself with this crap... (Score:3, Funny)
They're very similar. One notable difference is that the Information Superhighway was invented by Al Gore.
Re:Don't concern yourself with this crap... (Score:4, Funny)
-kaplanfx
Re:Don't concern yourself with this crap... (Score:3, Interesting)
Sure, I see crawlers on my site all the time sometimes hitting the same URL over and over again. Do I understand their repetitive behavior? No.
Google gives a partial answer to this on their GoogleBot page [google.com]:
If they're playing around with new indexing alg
Re:Don't concern yourself with this crap... (Score:3, Interesting)
Googlebot (Google) 74 945.51 KB 11 Nov 2004 - 03:02
Netcraft Web Server Survey 13 0 10 Nov 2004 - 23:48
Mirago 6 76.44 KB 02 Nov 2004 - 04:13
MSNBot 6 76.44 KB 05 Nov 2004 - 05:58
It's interesting that Mirago and MSNBot have taken exactly the same bandwidth in the same number of visits. Are MS innov^H^H^H^H^H buying new technology again?
Bob
Re:More lies from garcia (Score:3, Insightful)
Difficult to do if Google doesn't want them to (Score:5, Insightful)
Re:Difficult to do if Google doesn't want them to (Score:5, Funny)
"Begun, this war of the corporations has!"
Re:Difficult to do if Google doesn't want them to (Score:2, Funny)
Remember, helping Microsoft is like helping yourself.
customer-provided IP addresses. (Score:2)
Re:Difficult to do if Google doesn't want them to (Score:5, Interesting)
Re:Difficult to do if Google doesn't want them to (Score:4, Interesting)
For google I get: crawl-66-249-64-167.googlebot.com [66.249.64.167]
for msn I get: fj1011.inktomisearch.com [66.196.91.16]
and msn beta I get: 65.54.188.83 (can't find associated domain)
So we can tell that at least this result wasn't stolen from Google.
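The check above can be repeated for any crawler IP in your logs. This is a sketch only: the two suffixes come from the hostnames quoted in this comment, the labels are illustrative, and a reverse lookup on its own can be spoofed, so a careful check would also resolve the returned name forward and compare addresses.

```python
import socket

# Suffixes taken from the hostnames quoted above; labels are illustrative.
KNOWN_CRAWLER_SUFFIXES = {
    ".googlebot.com": "Googlebot",
    ".inktomisearch.com": "Inktomi",
}


def classify_hostname(hostname):
    """Map a reverse-DNS hostname to a crawler label, or None if unknown."""
    host = hostname.lower().rstrip(".")
    for suffix, label in KNOWN_CRAWLER_SUFFIXES.items():
        if host.endswith(suffix):
            return label
    return None


def reverse_lookup(ip):
    """Return the PTR hostname for `ip`, or None when there is no record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


if __name__ == "__main__":
    # Needs network access; results depend on current DNS records.
    for ip in ["66.249.64.167", "66.196.91.16", "65.54.188.83"]:
        name = reverse_lookup(ip)
        print(ip, name, classify_hostname(name) if name else None)
```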
Re:Difficult to do if Google doesn't want them to (Score:3, Interesting)
Does it violate Google's Terms of Service (Score:5, Insightful)
If not, it's called doing business and gaining an advantage any legitimate way that you can.
I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.
Re:Does it violate Google's Terms of Service (Score:3, Insightful)
If not, it's called doing business and gaining an advantage any legitimate way that you can.
I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.
If I copy your work and take credit for it, does it violate your terms of service? If so, you have legal remedies. If not, it's called doing business and gaining an advantage any legitimat
Re:Does it violate Google's Terms of Service (Score:2)
But, I am not a judge. Or a lawyer. And I expect that if Google litigated here, they would be setting precedent.
uhh, law DOES matter (Score:2)
Comparing this to the MS/Google situation is not the same so the grandparent post still stands.
Re:Does it violate Google's Terms of Service (Score:5, Interesting)
Re:Does it violate Google's Terms of Service (Score:5, Informative)
From Google's Privacy Center (http://www.google.com/terms_of_service.html):
Personal Use Only
The Google Services are made available for your personal, non-commercial use only. You may not use the Google Services to sell a product or service, or to increase traffic to your Web site for commercial reasons, such as advertising sales. You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not "meta-search" Google. If you want to make commercial use of the Google Services, you must enter into an agreement with Google to do so in advance. Please contact us for more information.
Yea, and (Score:5, Funny)
Wow (Score:2)
Spork or foon? (Score:2, Offtopic)
So, what name do you favor for the combined fork and spoon utensil?
Spork or foon?
Re:Yea, and (Score:5, Funny)
"The new search engine's name will be Mooglesoft."
Which will subsequently be sued by SCOogle, the latest startup from The Canopy Group, after announcing they purchased the rights to the Internet in a complex transaction which is documented in a briefcase somewhere in Germany.
Re:Yea, and (Score:3, Funny)
Instead of clicking a button named Google Search, it simply says "KupoKupo!"
You are then returned a page where 100% of the text is the word "Kupo"
This is slightly less optimized than a Marklar search (which at least has some words other than 'Marklar').
But will this mean Google can crawl back? (Score:5, Funny)
Or something like that.
biffnix
Microsoft stealing someone else's technology??? (Score:5, Funny)
Re:Microsoft stealing someone else's technology??? (Score:3, Interesting)
I won't steal your oven, but I'll steal your food!
Re:Microsoft stealing someone else's technology??? (Score:2, Informative)
Re:Microsoft stealing someone else's technology??? (Score:5, Interesting)
Lo! Note how the review articles of the last few days mention the innovative NEW FEATURE of MSN search called, "Search Near Me" which stores the calculated lat/long of addresses on web pages and returns matches near you.
Note how Google's long-in-beta Google Local (http://local.google.com) [google.com] stores the calculated lat/long of addresses on pages and returns matches near you. Google Local works better.
Another Microsoft innovation! Let's hope WE remember who had it first!
They been crawling like mad lately (Score:5, Interesting)
Re:They been crawling like mad lately (Score:2)
Re:They been crawling like mad lately (Score:2)
Agent: RssReader/1.0.88.0 (http://www.rssreader.com) Microsoft Windows NT 5.1.2600.0
Syndic8/1.0 (http://www.syndic8.com/)
Agent: CoralWebPrx/0.1 (See http://www.scs.cs.nyu.edu/coral/)
Agent: SharpReader/0.9.5.1 (.NET CLR 1.1.4322.2032; WinNT 5.0.2195.0) not sure if this is a bot or an RSS reader; I am tempted to think it is an RSS reader because the next agent from the IP is Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0;
Try this term on MSN search (Score:5, Funny)
more evil than satan [msn.com]
ROOFLES!
Re:Try this term on MSN search (Score:5, Funny)
more evil than god [msn.com] and you get FIREFOX as the first result (then google, of course)
Re:Try this term on MSN search (Score:2, Informative)
I know which search engine I'm sticking with
Re:Try this term on MSN search (Score:5, Funny)
Re:Try this term on MSN search (Score:4, Informative)
Re:Try this term on MSN search (Score:2, Funny)
The Firefox page is fairly popular, and the words "more" and "than" appear over and over, as with Google. (Uh, Google's motto "do no evil" wouldn't hit another word, hmmmmmmmmm)
Try this one (seriously): more gay than slashdot
Re:Try this term on MSN search (Score:5, Interesting)
Before you mod me down for that, I'd like to mention that this isn't Microsoft bashing since I am an atheist too and so are Linus [celebatheists.com] and RMS [celebatheists.com].
Re:Try this term on MSN search (Score:4, Interesting)
Not anymore. They apparently hand-edited their own company out of the results about an hour ago.
Re:Try this term on MSN search (Score:2, Interesting)
Re:Try this term on MSN search (Score:3, Funny)
Re:Try this term on MSN search (Score:3, Informative)
But it seems like Google started it several years ago.
http://www.cnn.com/TECH/computing/9911/15/search.engine.ms.idg/ [cnn.com]
and
http://searchenginewatch.com/sereport/article.php/2167621 [searchenginewatch.com]
BTW, it doesn't seem to work on Google anymore...
Re:Try this term on MSN search (Score:2, Interesting)
They wouldn't... (Score:4, Funny)
Re:They wouldn't... (Score:5, Funny)
Come on, be serious. Google doesn't plan to buy Microsoft until *after* they reach the one-year post-IPO mark, silly.
Re:Shocked I tell you (Score:3, Funny)
Microsoft releases "Bob"
From the laugh-it's-funny Dept.
If this were true... (Score:2)
Doing this for, say, 100,000 domains would be noticeable but would not even scratch the surface of what's on the web.
Meta-search? (Score:4, Interesting)
In the first case, it's a slimy business practice. In the second, it's fairly cunning ( and has been tried before ).
In either case, I doubt google is in any real danger. They are to search engines what MS is to the desktop. And while MS has squandered that advantage in the desktop arena ( reader homework: 250 word essay as to why ), google is only improving on their work.
Re:Meta-search? (Score:2)
Block? (Score:2)
Probably (Score:2)
Msn Crawling (Score:4, Informative)
Starting last week, MSN has been pulling EVERY LINK in sequence from my site, even the larger Artist Index pages [clinko.com].
Seriously, I've had this same spider on my site for about 36 hours now.
Violates Google's TOS (Score:5, Informative)
Re:Violates Google's TOS (Score:3, Interesting)
Ahhh. So, let's see. If you use google at work, you should be going to jail. Sounds fair.
Can anybody take your comments seriously after you say something like "you should be going to jail?" I don't know when Google became a government agency that could send officers to your door for violating a TOS. No, at best it would be a civil issue. More likely, as you say, they have that clause as a justification if they choose to block usage.
However, of all the companies out there, Google would be the one of
Absurd (Score:5, Insightful)
1) His whole theory is based on the "fact" that the only way in the world to find his pages is to use site:www.sitename.com in Google, implying that Google has cached the results from an earlier crawl. Of course, there is no way that the Microsoft search couldn't have also cached it.
2) Then, he claims that Microsoft is probably screen-scraping Google's results (for all the millions of sites out there), and using these results to recrawl those sites? This doesn't even make any sense.
3) And last but not least, Microsoft is certainly basing its whole search architecture on the assumption that Google wouldn't ever notice MSN mirroring its whole index. Yeah right.
Re:Absurd (Score:2)
Probably Not.. (Score:2, Interesting)
If they are searching Google, they haven't done it recently, or else they haven't gotten to my site yet.
Spike the results, then sue (Score:5, Informative)
Re:Spike the results, then sue (Score:3, Informative)
A normal spider generally tries different IPs to see if they're running a webserver. Then it does a DNS lookup, fetches http:///robots.txt and reads that to decide whether indexing is allowed, and where. Then it just walks through the website. A number of places on the website might not be directly accessible, but also not disallowed for indexing by robots.txt.
If some other site has a link to that webserver in some disconnected region of the website
Limit (Score:2)
I think this is mentioned in Google Hacks by O'Reilly. Those with an online account there can check it out and mock me if I'm wrong
Yep. (Score:2)
They really only need to seed their crawler... (Score:5, Interesting)
I could imagine that Microsoft just needs a few thousand URLs evenly spread across the internet to seed their crawler, which they can get from Google by using a list of the most popular queries.
Once their crawler has so many starting points it can do the rest itself.
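To make the seeding idea concrete, here is a toy sketch of a breadth-first crawl over an in-memory link graph. The graph, the hostnames, and the function name are all made up for illustration, and nothing here reflects how MSN's crawler actually works. The point is only that pages nobody links to stay invisible unless they are among the seeds.

```python
from collections import deque


def crawl(seeds, links, limit=1000):
    """Breadth-first traversal from seed URLs over a link graph.

    `links` maps each URL to the list of URLs it links to. Returns the
    URLs in the order they were visited, stopping after `limit` pages.
    """
    start = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(start)
    queue = deque(start)
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Toy web: "island.example/" is not linked from anywhere else.
WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "b.example/": ["c.example/"],
    "c.example/": ["a.example/"],
    "island.example/": ["island.example/x"],
}

if __name__ == "__main__":
    print(crawl(["a.example/"], WEB))
    print(crawl(["a.example/", "island.example/"], WEB))
```

With one seed, the island site is never reached; adding it as a second seed brings in both of its pages.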
Hello!!... (Score:2)
msnbot IP address ranges? (Score:2)
Anybody know what IP address ranges msnbot is using? Might be possible to limit the rate of connection from those addresses using firewall rules (or, for that matter, forbid connection entirely if that's your preference) to avoid the "hammering" that msnbot is said to be doing...
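Nobody in this thread has posted the actual ranges, so the network below is purely a placeholder built around the single MSN beta address quoted earlier, not a verified msnbot range. With that caveat, here is a sketch of the rate-limiting idea at the application level rather than in the firewall: a sliding-window counter shared by every address in a listed network. The class name and parameters are invented for the example.

```python
import ipaddress
from collections import defaultdict, deque

# Placeholder range for illustration only; NOT a verified msnbot range.
THROTTLED_NETWORKS = [ipaddress.ip_network("65.54.188.0/24")]


class RangeRateLimiter:
    """Allows at most `max_hits` requests per `window_seconds` per network."""

    def __init__(self, networks, max_hits, window_seconds):
        self.networks = list(networks)
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # network -> timestamps of recent hits

    def _network_for(self, ip):
        addr = ipaddress.ip_address(ip)
        for net in self.networks:
            if addr in net:
                return net
        return None

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` should be served."""
        net = self._network_for(ip)
        if net is None:
            return True  # addresses outside the listed ranges are not limited
        recent = self.hits[net]
        # Forget hits that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_hits:
            return False
        recent.append(now)
        return True


if __name__ == "__main__":
    limiter = RangeRateLimiter(THROTTLED_NETWORKS, max_hits=2, window_seconds=10)
    for t, ip in [(0, "65.54.188.83"), (1, "65.54.188.90"), (2, "65.54.188.83")]:
        print(t, ip, limiter.allow(ip, now=t))
```

Counting per network rather than per address means a crawler cannot dodge the limit by rotating through neighbouring IPs.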
wow what a load of bullshit... (Score:2)
Yay, way to go, Slashdot. Thanks for posting the most blatant flamebait article ever. How about for your next post, you repost that Reuters article about a machine that makes more energy than it uses....
a company I worked for did this once... (Score:3, Insightful)
But...? (Score:2)
Look at it this way. If Google were to complain about someone searching their page/databases, they would be the largest hypocrites in the history of history.
Terrible article (Score:5, Insightful)
Uh.
Microsoft has been developing their internal search engine for quite a while now. Part of developing a search engine is using it to crawl and creating a large corpus of test data. It's hugely likely that M$ has had a working crawler system for much, much longer than would be indicated by their public announcement. Quite a few people who helped develop Altavista at HP/Compaq/DEC research joined Microsoft Research about two years ago - the kind of people who could write a high-performance crawler in their sleep and wake up feeling refreshed.
That article seems like baseless, uninformed speculation, to put it not-so-politely.
This could be entirely natural... (Score:5, Insightful)
Re:This could be entirely natural... (Score:3, Informative)
Find a link, fine
Follow the link, fine
Spider the link, not fine - google's Robots.txt [google.com] does not give them permission to.
Interesting (Score:2, Insightful)
In other news... (Score:3, Funny)
Sure, dipping...not swallowing whole. (Score:2)
I don't think so. You still have to have your own crawler (to use on the top ranked results of any query). And a good set of queries to hit google with (so you have an idea of what to index)...which changes constantly. Look at Google's zeitgeist som
Hey Google, please don't make us... (Score:5, Funny)
Hey Google, please don't make us read those wacky JPG/GIF letter scrambles with criss-cross lines and input the random characters into a field before submitting a search.
"Hold on a sec while I Goog- Huh? Grrrr.... H... P... 7... O... wait no, 7... zero... ummm...
Bogus article (Score:3, Insightful)
And I'm supposed to take this clown's "friend" seriously? That's not a good start, anyway.
But then there's the real howler: the site can allegedly only be found through site: on Google. How does the friend know that? Has he done a complete crawl of the web to find all forward links to any image in his site -- even broken ones? MSNBot, like all bots, recognizes that many anchors are broken, and tries plausible corrections around the broken links. That's particularly useful with a deep link, where the deep link may have timed out but the shallow link still exists.
Full Circle (Score:5, Interesting)
It's interesting to know that Bill Gates has been forced to go back to his roots...
Arg I hate M$ (Score:4, Interesting)
Highly unlikely (Score:4, Insightful)
How do I know? Because a friend of mine decided to find out how common all TLAs are (three-letter acronyms) by counting Google hits on each TLA. This was before the Google API, so he did it with good old fashioned HTTP/HTML. It didn't take long for Google to flag him as evil and block access from his IP block.
Sure, Microsoft could find some way around this-- using different enough IP addresses to conceal the source-- but that's more trouble than it's worth. Worse yet, it sets up a cat-and-mouse game and keeps M$ dependent on Google-- when their stated goal is to beat Google at its own game.
I've got a simpler explanation for what the author is seeing. His evidence is based on the fact that some pages being requested exist only in Google's cache. Well, spiders are supposed to do breadth-first searches so they don't hit the same site too often. Microsoft is probably going against data it collected a few weeks ago but hasn't put on its public servers yet. (Why not? Could be lots of things. Maybe they haven't put enough hardware on the front end to support the amount of data they have on the back end. Or maybe they're just slow.)
As much as I'd like to bash M$, there's nothing here that really looks suspicious to me.
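For a sense of scale on the TLA experiment mentioned above: assuming only the 26 uppercase letters A-Z count, one query per acronym works out to 26^3 = 17,576 automated requests, which goes some way to explaining why it got noticed. A tiny sketch that enumerates them (no requests are made):

```python
from itertools import product
from string import ascii_uppercase


def all_tlas():
    """Return every three-letter combination of A-Z, from AAA to ZZZ."""
    return ["".join(letters) for letters in product(ascii_uppercase, repeat=3)]


if __name__ == "__main__":
    tlas = all_tlas()
    print(len(tlas), tlas[0], tlas[-1])
```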
Not quite (Score:4, Insightful)
My garbage doesn't have a copyright statement, contain my patented technology, nor does it come with terms of service or licensing agreements.
what ridiculous logic... (Score:5, Funny)
microsoft is looking at old pages, google uses a cache...ergo microsoft must be using google.
if we're going to use that kind of logic, I could just as easily come up with "afghanistan is in the middle east and supports terrorists, iraq is in the middle east...ergo, iraq must support terrorists", and use it to make a case for invading iraq...but you don't see......oh wait
Nothing new... (Score:2)
Re:Legality (Score:2)
That's why it's technically illegal to go dumpster diving in dumpsters that are enclosed in those little brick cubes behind buildings. Although I've never really had a problem with them while dumpster diving. They can sure as hell, and probably would, get you for dumping your trash there.
Re:You don't say! (Score:3, Funny)
Re:You don't say! (Score:5, Funny)
Re:As long as it's legal (Score:2)
As long as it helps Microsoft, I highly doubt that Microsoft would be concerned about the ethics of doing such a thing.
Re:MSN and Google (Score:2)
If URLs on your site are old (i.e. 404s) and are only indexed in Google, and yet you find MSN crawling them, only to find that their index is updated with those results shortly thereafter, well, that qualifies as something more than "'ms sucks' rhetoric". "Who cares?" might be a more appropriate retort.
Bloggers are just people. So are reporters. Just because some dude said it in a blog doesn't make it unreliable, any more than a journalist saying it makes it rel
Re:Different Corporate Philosophies? (Score:2)
Re:just a comparison ... (Score:2)