Google Crawls The Deep Web 197
mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
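The select-menu half of that strategy is easy to picture: enumerate the option values of each menu into candidate GET URLs. A rough sketch, where the function name and form data are invented for illustration (Google's actual crawler logic is not public):

```python
from itertools import product
from urllib.parse import urlencode

def candidate_urls(action, selects):
    """Build one GET URL per combination of <select> option values.

    `action` is the form's action URL; `selects` maps each select menu's
    name to the values found in its <option> elements.
    """
    names = list(selects)
    urls = []
    for combo in product(*(selects[n] for n in names)):
        query = urlencode(dict(zip(names, combo)))
        urls.append(f"{action}?{query}")
    return urls

# Two menus with two options each yield four candidate result pages.
urls = candidate_urls("/search", {"make": ["ford", "honda"],
                                  "year": ["2007", "2008"]})
```

A real crawler would also need the text-box handling described in the summary, plus some cap on the combinatorial blow-up.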
Just think! (Score:5, Funny)
Re:Just think! (Score:4, Funny)
Re: (Score:2)
Re: (Score:2)
For those people, this move by Google is great news. You see, the delete links were all simple GET requests, so the spiders were able to delete content. However, the scheduling is all done via POST'ed forms, so nothing would ever get recorded. This move on Google's part is really just an attempt to combat this. The o
Re:Just think! (Score:5, Interesting)
Google wasn't too bad, because at least they spread the requests out over time. But other search sites hit our poor server with requests as fast as the Internet would deliver them. I ended up writing code that spotted this pattern of requests, and put the offending searcher on a blacklist. From then on, they only got back pages saying that they were blacklisted, with an email address to write if this was an error. That address never got any mail, and the problem went away.
Since then, I've done periodic scans of the server logs for other bursts of requests that look like an attempt to extract everything in every format. I've had to add a few more gimmicks (kludges) to spot these automatically and blacklist the clients.
I wonder if google's new code will get past my defenses? I've noticed that googlebot addresses are in the "no CGI allowed" portion of my blacklist, though they are allowed to retrieve the basic data. I'll be on the lookout for symptoms of a breakthrough.
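For the curious, the parent's burst detection amounts to a sliding-window counter per client. A minimal sketch of the idea, with invented thresholds (the original poster's heuristics are surely more involved):

```python
import time
from collections import defaultdict, deque

WINDOW = 10.0    # seconds -- both thresholds are invented for illustration
MAX_HITS = 50    # requests allowed per client within the window

hits = defaultdict(deque)
blacklist = set()

def check_client(ip, now=None):
    """Record one request from `ip`; blacklist it if it bursts too fast.

    Returns True if the client may be served, False if blacklisted.
    """
    now = time.monotonic() if now is None else now
    q = hits[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) > MAX_HITS:
        blacklist.add(ip)
    return ip not in blacklist
```

A real deployment would also expire blacklist entries and persist them across restarts, as the parent's "email us if this was an error" page implies.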
Re: (Score:2)
Re: (Score:3, Informative)
Yeah, maybe your machine... That SQL error looks more like bad session handling on the server hosting your Drupal installation than Google attempting an SQL injection. Actually, it looks nothing like an SQL injection at all: MySQL is merely being asked to insert a duplicate value into a column declared UNIQUE (`sid`), and refusing. Don't expect an answer, since it's almost certainly not an error on Google's end.
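For anyone who hasn't seen this class of error, it's easy to reproduce locally. Sketched here with SQLite instead of MySQL, and an invented table; the mechanism is the same:

```python
import sqlite3

# Inserting a duplicate value into a UNIQUE column fails at the database
# layer -- no injection attempt required, just a duplicate session id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (sid TEXT UNIQUE)")
conn.execute("INSERT INTO sessions (sid) VALUES ('abc123')")
try:
    conn.execute("INSERT INTO sessions (sid) VALUES ('abc123')")
    outcome = "inserted"
except sqlite3.IntegrityError as e:
    outcome = f"refused: {e}"
```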
A little more on topic though, what
Re:Just think! (Score:5, Interesting)
And a search engine (I think it was Google) crawled the site, hit the delete links, and deleted all the pages of the site. At the time, the advice was that any link performing an action, such as a delete, should be a POST submitted via a form, precisely so that search engines wouldn't do that very thing.
And now they're gonna start submitting forms? The fallout is gonna be entertaining.
Re: (Score:3, Informative)
Re: (Score:2)
Bright Planet's DQM (Score:4, Interesting)
Their stats on how much of the web they hit that Google missed were always impressive (true or not), but perhaps their days are numbered with this new venture by Google.
Quite an interesting concept if you think about it. I always presupposed that companies would hate it, but they never got 'blocked' from doing it to sites.
Here, suck up my bandwidth without generating any ad revenue! Sounds like a losing situation for the data provider, in my mind.
Re: (Score:3, Interesting)
Re: (Score:3, Interesting)
It's cases like that where doing a half-arsed job is worse than not trying at all.
Re: (Score:2)
The visitors *do* generate ad revenue.
Oops... (Score:5, Funny)
Re:Oops... (Score:5, Informative)
This won't post forms of that sort. In the blog post, they say that they are only doing this for GET forms, which are safe to automate as per the HTTP specification.
This is for things like product catalogue searches where you pick criteria from drop-down boxes. Not so common for run-of-the-mill e-commerce sites, but I've seen a lot on B2B sites.
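A conforming crawler can make that distinction by checking each form's method attribute before touching it. A sketch of the filter (the class name is mine):

```python
from html.parser import HTMLParser

class FormMethodScanner(HTMLParser):
    """Collect the action of each <form> that is safe to automate.

    Under the HTTP spec, only GET forms are safe (side-effect free), so
    POST forms are skipped entirely.
    """
    def __init__(self):
        super().__init__()
        self.get_forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            a = dict(attrs)
            # Per HTML, a form with no method attribute defaults to GET.
            if a.get("method", "get").lower() == "get":
                self.get_forms.append(a.get("action", ""))

scanner = FormMethodScanner()
scanner.feed('<form action="/search"><input name="q"></form>'
             '<form action="/delete" method="post"></form>')
```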
Re: (Score:3, Funny)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re:Oops... (Score:5, Insightful)
Re: (Score:2)
Unfortunately there are also tons of sites whose developers did not understand the part about POST being for creating new resources, and PUT being for making changes on the server.
HTTP verb semantics are a very dangerous thing for Google or any other third party to rely on, unless they are using a documented API where the developers have explicitly followed REST principles.
Re: (Score:2)
Re: (Score:3, Interesting)
What's it to Google (or a third party) if they mess up your pathetically-designed form?
That depends. If they effectively launch a denial-of-service attack and eat zilliabytes of people's limited bandwidth by attempting to submit with all possible combinations of form controls and large amounts of random data in text fields, would that be:
Re: (Score:2)
Google in a way is saying that if you fail to properly secure your site that they have a right to data mine it and generate profits from your data. Perhaps, mind you, just perhaps, that really, legally, is not appropriate and perhaps a legal investigation is required to clarify this before everyone starts do
Robots.txt is not the answer (Score:3, Insightful)
Every time anyone raises a question like this, someone trots out robots.txt as if it is some sort of magic solution to all the potential problems. It is not.
For one thing, it is voluntary. Google and some other major search engines may respect it today, but they are under no obligation to do so, nor to continue doing so in the future.
For another thing, depending on robots.txt makes the whole game opt-out. This is, IMHO, the wrong default for potentially unwelcome visits. We can't keep pretending it's O
Re: (Score:2)
HTTP is a documented API.
What makes you think somebody who's just fucked up HTTP isn't going to go right ahead and fuck up "REST principles" while they're at it?
Re: (Score:2)
Re: (Score:2)
Will it solve captchas? (Score:5, Interesting)
Re: (Score:2)
Re:Will it solve captchas? (Score:5, Funny)
Re:Will it solve captchas? (Score:5, Funny)
Forums? (Score:5, Funny)
On the plus side, this should enable Google to get past the "Must be 18 to view" buttons.
Re: (Score:3, Informative)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
The usual excuse for that is that they want a link — for aesthetic purposes, to put in an email, etc. If you're using a form anyway, those reasons disappear. I'm sure there are a few developers who screw this up, but it won't be anywhere near as common as the problems GWA uncovered.
Re: (Score:3, Funny)
HELLO I AM GOOGLEBOT (Score:5, Funny)
Re:HELLO I AM GOOGLEBOT (Score:5, Funny)
Re:HELLO I AM GOOGLEBOT (Score:4, Funny)
Re: (Score:2, Interesting)
Eeek, you found me! (Score:3, Funny)
Re: (Score:2)
Upgrade: Common courtesy/1.0, HTTP/1.1
Forums, and "web 2.0" sites. (Score:2)
So Googlebot will come across a web page.
It follows a link.
The link leads to a page with a form.
Googlebot fills out the form based on content already on the site.
Googlebot clicks submit.
Googlebot goes to the next page, and continues to follow links.
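Those steps can be sketched as a single function. Everything here is invented for illustration; the real crawler's logic is not public:

```python
from urllib.parse import urlencode

def crawl_form(fetch, form, site_words):
    """Fill out a parsed GET form the way the steps above describe.

    `form` holds the action URL plus (name, kind, options) per control;
    `site_words` are words already seen on the page hosting the form.
    """
    values = {}
    for name, kind, options in form["inputs"]:
        if kind == "text":
            values[name] = site_words[0]   # a word taken from the site
        else:
            values[name] = options[0]      # a select/radio/checkbox value
    next_url = form["action"] + "?" + urlencode(values)
    return fetch(next_url)                 # the result page is crawled next
```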
The problem comes when that form is a POST form, like the one I'm typing in right now on a forum, or some other type of form that creates user-generated content. This makes it seem like Google will see the text box an
Re: (Score:2)
Really, though, I don't think this will be a problem. People at Google are pretty smart, and I'm sure they've thought of this. Even if you believe Google is evil, there's no evil corporate benefit to spamming garbled text across the entire Internet.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Funny)
Re: (Score:3, Informative)
Seems to me it would be easy enough to detect the googlebot user agent, then if so, automatically redirect it to the page on the other end (or even send it to a random 404 page or something), all without processing the form data at all.
<?php
// Substring match on "Googlebot" rather than an exact string comparison,
// since the full user-agent string varies between crawler versions.
if (strpos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false) {
    header('Location: /landing_page.php');
} else {
    processtheform();
}
?>
Of course, this would have to be implemented, which would b
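One caveat with keying on the user-agent string: it's trivially forged. Google's documented advice is to reverse-resolve the client IP, check the hostname, then forward-resolve it back to the same IP. A hedged sketch (the helper names are mine):

```python
import socket

def looks_like_google_host(hostname):
    # Suffix check per Google's published guidance for verifying its crawlers.
    return hostname.endswith((".googlebot.com", ".google.com"))

def is_verified_googlebot(ip):
    """Reverse DNS, hostname check, then forward DNS back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not looks_like_google_host(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

The forward lookup matters: without it, anyone controlling reverse DNS for their own IP range could claim a googlebot.com hostname.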
good and bad (Score:4, Insightful)
Re:good and bad (Score:5, Insightful)
Re: (Score:3, Funny)
Re:good and bad (Score:5, Insightful)
They've indexed Flash for about four years now.
No doubt. There are a lot of clueless developers out there who insist on ignoring security and specifications time and time again. I have no sympathy for people bitten by this, you'd think they'd have learnt from GWA that GET is not off-limits to automated software.
Re: (Score:2)
I'm in your Intarwebs (Score:2, Funny)
robots.txt (Score:5, Funny)
Note to self... (Score:4, Funny)
Heisenberg for web (Score:2)
Now you can't see what is on the web, by crawling, without changing it.
The Internet is for Porn (Score:5, Funny)
Re: (Score:2)
directions like 'nofollow' are still respected (Score:5, Informative)
Maybe they shouldn't be, at least not in all cases. Several years back I ran many Google searches for some information that was very important to me, but could never find anything. A few months later (too late to be of use), through a fortunate combination of factors but with no help from Google, I came across the exact information on a .GOV website, in a publicly filed IPO document.
As far as I can tell, the US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record. When these noindex/nofollow directives are overused by mindless and unaccountable bureaucrats, perhaps someone needs to decide that these records should be public, and that the public interest isn't served by hiding them deep down a long chain of links where they are hard to locate.
In cases like this I would applaud any search engine that ignores the "suggestion" not to index public pages just because of an inappropriate tag in the HTML. In fact, if I knew of a search engine that indexed in spite of this tag, I would switch to it as my first-choice search engine in an instant. For starters, I would suggest that any .GOV or state TLD website have this tag ignored unless there were a darn good reason to do otherwise.
Re: (Score:2, Insightful)
Re: (Score:3, Interesting)
As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record.
I'd mod you up if I had some points. I'm sure there are ethical implications or something when it comes to respecting the website owner's wishes not to index, but it's all public information anyway. If it's on the web and I can look at it, then Google should be able to look at it and index it.
I had no idea that government sites don't allow themselves to be indexed. That is BULLSHIT. People often NEED information from .gov sites and ALL of it should be made easy to find. Refusing to allow indexing
Re: (Score:2)
Re: (Score:2)
I think Google is already there.
Re: (Score:2)
sites can still be excluded (Score:2, Flamebait)
Re:sites can still be excluded (Score:4, Insightful)
- Scripts with potentially infinite results. If you have a calendar script on your site that shows this month's calendar with links to "next month" and "previous month", then without robots.txt the search engine could index back into prehistoric times and past the death of the Sun, with a blank event calendar for each month. This is stupid. With your robots.txt file you tell the spider which URLs it is in both of your best interests not to crawl. You save server resources and bandwidth; Google saves time and resources.
- If you have a duplicate copy or copies of your site for development, or perhaps an experimental "beta" version of your site, you don't want it competing with the real site for search engine placement, or worse, making SE spiders think you're a filthy spammer with duplicate content all over the place. So you disallow the dupes with robots.txt. Sure, ideally that server would sit inside your firewall instead of on the Web, but that gets more challenging when your dev team is on a different continent.
- Temporary crap that has no value to the outside world; once again, it's a waste of both your time and the search engine's to index it.
The above are all reasons why you might want some or all of the content on a site not indexed.
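For the record, covering all three cases above takes only a few lines of robots.txt (the paths are invented examples):

```text
User-agent: *
Disallow: /calendar/   # open-ended archive: no point crawling past the death of the Sun
Disallow: /beta/       # dev mirror: avoid duplicate-content penalties
Disallow: /tmp/        # temporary junk with no value to searchers
```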
Re: (Score:2)
Fuzzing the world (Score:3, Insightful)
How will this work for forms that perform translations, validations, and similar operations on other websites? Will it try to pull the entire Internet through each such site it finds?
And not every web development environment forces GET requests to leave data unchanged. In Ruby on Rails, adding "?method=post" to the end of a URL fakes a POST even though the request is actually a GET; I disabled that at the company I work for. Not everyone is going to do that.
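The guard the parent describes can be as simple as refusing method-override parameters on safe requests. A sketch of the idea, using invented names and a framework-agnostic request dict (not actual Rails code):

```python
SAFE_METHODS = {"GET", "HEAD"}

def reject_unsafe_get(handler):
    """Refuse to run a state-changing handler when the real HTTP method is
    safe but a query parameter tries to fake an unsafe one."""
    def wrapped(request):
        faked = request.get("params", {}).get("_method", "").upper()
        if request["method"] in SAFE_METHODS and faked not in ("", *SAFE_METHODS):
            return {"status": 405, "body": "method override ignored on safe request"}
        return handler(request)
    return wrapped

@reject_unsafe_get
def update_record(request):
    # Stand-in for a handler that would change server state.
    return {"status": 200}
```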
Re: (Score:3, Insightful)
More precisely: Not everyone has been doing that. I'm sure when Google comes along and exposes all their bugs they will quickly take the hint.
I don't really see the problem. The developers who know what they are doing, like you, won't be adversely affected, while the incompetent developers have to scurry around fi
Re: (Score:2)
I wouldn't be surprised if they did that; after all, they did a similar thing with GWA and URLs with query strings. But I can't help thinking it's a silly path to take. It makes an "unwritten rule" of HTTP that certain magic strings are off-limits, and of course no specification contains a list of these magic strings; you have to reverse engineer other software to find them.
Evil Bot (Score:2)
And a few relevant URLs from helpful sponsors?
Now you just need to hire a few sweatshop workers to get past those pesky captchas...
Anecdote from Google (Score:5, Funny)
So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data on random forms on the web. Like the above guy, people may not design their site with such a spider in mind, and despite their lack of foresight this could kill a lot of goodwill if done improperly.
Re: (Score:2)
end and submitting random data on random forms
Sod worrying about zapping sites, what will happen when they crawl the nuclear launch site and enter random data into the authorisation field, and in a rare feat of sod's law end up getting the code just right....
(oh and what's the betting they'll put redmond in as a target string?)
Re: (Score:2)
(Or at least super similar.)
Re: (Score:2)
I bet I know what's next (Score:2)
In other news, (Score:2, Insightful)
On phone numbers where a VMS is detected, Google plans to dial "#0#" and other codes in order to determine how to reach a human.
"Since we are a big, rich e
"getindex" Google Keyword? (Score:2)
What could be even better would be if sites that don't want get that huge load just to have their data searchabl
Forms that create agreements (Score:2)
Re: (Score:3, Informative)
How to become a millionaire in four easy steps! (Score:2)
Title Correction (Score:2, Insightful)
They're in for a surprise when... (Score:2)
I can't wait (Score:2)
Re:Google, consider this... (Score:4, Insightful)
If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.
ROBOTS.TXT & CONTENT="NOINDEX", "NOFOLLOW" (Score:2)
Dang, that was hard. Damn you, GOOGLE! Damn you to HELL! You blew it up! You finally blew up the web!
Or not.
Re:Google, consider this... (Score:4, Funny)
If any forms which feed your DB are GET-style, aren't user-authenticated, and/or don't use a CAPTCHA, then you already have a huge trash-data problem. At least the googlebot won't offer to enlarge your penis.
Re: (Score:2)
Re: (Score:2)
Idempotence simply requires that:
f(STATE) == f(f(STATE))
It doesn't require that:
STATE == f(STATE)
So Idempotent actions can cause state changes, such as deleting an item.
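Concretely: deleting an item changes state the first time, but repeating the delete is a no-op, which is exactly the property above. A tiny illustration:

```python
def delete_item(state, key):
    # Idempotent but state-changing: f(f(s)) == f(s), even though s != f(s).
    return {k: v for k, v in state.items() if k != key}

s0 = {"a": 1, "b": 2}
s1 = delete_item(s0, "a")   # the delete changes state
s2 = delete_item(s1, "a")   # applying it again changes nothing
```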
Re: (Score:2)
Re: (Score:2)
And I hate it when a search result goes to... another page of search results. "You searched for 'perpetual motion engine'. Here are links to pages of us doing that search on other sites as well." Not very useful.
It isn't easy to tell the difference programmatically, but it seems like this would make that happen much more often.
Re: (Score:3, Informative)
Yes, if you require all your human visitors to read your robots.txt [robotstxt.org], and then require them to check a checkbox to mean that they clearly read and understood the entire body of your robots.txt. Then yes, you'll have to introduce some sort of almost impossible-to-read translucent captcha written in classical Chinese.
Re: (Score:2)
What about online stores with combos / search fields, but no direct index?
What about forums with a guest login?
Re: (Score:3, Informative)