Google Crawls The Deep Web
mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
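To make the quoted approach concrete, here is a rough sketch of the "choose from among the values of the HTML" part. It is not Google's code; `FormFieldCollector` and the `SAMPLE` form are made up for illustration. It only gathers the candidate values a crawler could pick from and submits nothing.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Hypothetical helper: collect candidate values for a form's controls."""

    def __init__(self):
        super().__init__()
        self.choices = {}      # control name -> values listed in the HTML
        self.text_boxes = []   # names of free-text inputs (need guessed words)
        self._select = None    # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "select":
            self._select = a.get("name")
            if self._select:
                self.choices.setdefault(self._select, [])
        elif tag == "option" and self._select and "value" in a:
            self.choices[self._select].append(a["value"])
        elif tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in ("radio", "checkbox"):
                self.choices.setdefault(a["name"], []).append(a.get("value", "on"))
            elif kind == "text":
                self.text_boxes.append(a["name"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


# Made-up form, only for demonstration.
SAMPLE = """
<form action="/search" method="get">
  <input type="text" name="q">
  <select name="state">
    <option value="CA">California</option>
    <option value="NY">New York</option>
  </select>
  <input type="radio" name="sort" value="date">
  <input type="radio" name="sort" value="price">
</form>
"""

collector = FormFieldCollector()
collector.feed(SAMPLE)
print(collector.choices)     # values available for menus and radio buttons
print(collector.text_boxes)  # text boxes that would need words from the site
```

Text boxes are the hard part: the sketch only records their names, since picking useful words from the site is a separate problem.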
Bright Planet's DQM (Score:4, Interesting)
Their stats on how much of the web they hit that Google missed were always impressive (true or not), but perhaps their days are numbered with this new venture by Google.
Quite an interesting concept if you think about it. I always presupposed that companies would hate it but never got 'blocked' from doing it to sites.
Here, suck up my bandwidth without generating ad revenue! Sounds like a losing situation for the data provider, in my mind.
Will it solve captchas? (Score:5, Interesting)
Re:HELLO I AM GOOGLEBOT (Score:2, Interesting)
Re:Bright Planet's DQM (Score:3, Interesting)
Re:directions like 'nofollow' are still respected (Score:3, Interesting)
I'd mod you up if I had some points. I'm sure there are ethical implications or something when it comes to respecting the website owner's wishes not to index, but it's all public information anyway. If it's on the web and I can look at it, then Google should be able to look at it and index it.
I had no idea that government sites don't allow themselves to be indexed. That is BULLSHIT. People often NEED information from .gov sites and ALL of it should be made easy to find. Refusing to allow indexing such information is akin to hiding or obfuscating it: you don't actually want anyone to read it or anything, but you can say it's available on the web so your ass is covered. IMO there should be a law stating that all of .gov MUST be indexed by search engines.
Is there a law saying that search engines MUST follow robots.txt, nofollow, etc.? If it's not breaking the law, then Google should have some serious competition. A new search engine that indexes ALL VIEWABLE SITES regardless of the owner's wishes would be fucking great.
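For context on how the exclusion works in practice: robots.txt is a convention that a crawler chooses to honor on its own side, and nothing on the server enforces it. A minimal sketch using Python's standard `urllib.robotparser` follows; the bot name, the rules, and the example.gov URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks before fetching; an impolite one simply skips this step.
allowed = rp.can_fetch("ExampleBot", "https://example.gov/reports/2008.html")
blocked = rp.can_fetch("ExampleBot", "https://example.gov/private/data.html")
print(allowed, blocked)
```

The check happens entirely in the crawler, which is why a search engine that ignores the owner's wishes is technically trivial to build.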
Re:Just think! (Score:5, Interesting)
And a search engine (I think it was Google) crawled the site, hit the delete links, and deleted all the pages of the site. At that time it was stated that any link that performs an action, such as delete, should be a POST via a form, so that search engines wouldn't do that very thing.
And now, they are gonna start submitting forms? The fallout is gonna be entertaining.
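The rule from that incident can be sketched in a few lines: a destructive action refuses anything that is not a POST, so a crawler following plain links cannot trigger it. `handle_delete` and the `pages` dict are hypothetical stand-ins, not any real framework's API.

```python
def handle_delete(method, page_id, pages):
    """Hypothetical handler: delete a page only when asked via POST."""
    if method.upper() != "POST":
        # Link-following crawlers issue GET requests; refuse to change state.
        return 405, "Method Not Allowed"
    pages.pop(page_id, None)
    return 200, "deleted"


pages = {"home": "<h1>Welcome</h1>"}

crawler_result = handle_delete("GET", "home", pages)   # a crawler following a link
still_there = "home" in pages
user_result = handle_delete("POST", "home", pages)     # a user submitting the form
print(crawler_result, still_there, user_result)
```

This only protects against link-following. If a crawler starts submitting forms with POST, a check like this is no longer enough on its own.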
Re:Oops... (Score:3, Interesting)
What's it to Google (or a third party) if they mess up your pathetically-designed form?
That depends. If they effectively launch a denial-of-service attack and eat zilliabytes of people's limited bandwidth by attempting to submit with all possible combinations of form controls and large amounts of random data in text fields, would that be:
Re:Just think! (Score:5, Interesting)
Google wasn't too bad, because at least they spread the requests out over time. But other search sites hit our poor server with requests as fast as the Internet would deliver them. I ended up writing code that spotted this pattern of requests, and put the offending searcher on a blacklist. From then on, they only got back pages saying that they were blacklisted, with an email address to write if this was an error. That address never got any mail, and the problem went away.
Since then, I've done periodic scans of the server logs for other bursts of requests that look like an attempt to extract everything in every format. I've had to add a few more gimmicks (kludges) to spot these automatically and blacklist the clients.
I wonder if Google's new code will get past my defenses? I've noticed that Googlebot addresses are in the "no CGI allowed" portion of my blacklist, though they are allowed to retrieve the basic data. I'll be on the lookout for symptoms of a breakthrough.
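For anyone wanting to try the same kind of defense, here is a minimal sketch of burst detection with a sliding window. It is not the poster's actual code; the class name, thresholds, and client labels are assumptions chosen for the example.

```python
from collections import defaultdict, deque


class BurstBlacklist:
    """Hypothetical sketch: blacklist clients that send requests in bursts."""

    def __init__(self, max_requests=20, window_seconds=10.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.recent = defaultdict(deque)  # client -> timestamps inside the window
        self.blacklisted = set()

    def allow(self, client, now):
        """Record a request at time `now`; return False if the client is blocked."""
        if client in self.blacklisted:
            return False
        hits = self.recent[client]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        if len(hits) > self.max_requests:
            self.blacklisted.add(client)
            return False
        return True


bl = BurstBlacklist(max_requests=3, window_seconds=1.0)

# A client hammering the server: four requests within 0.3 seconds.
fast = [bl.allow("greedy-bot", t) for t in (0.0, 0.1, 0.2, 0.3, 5.0)]

# A client spreading its requests out over time.
slow = [bl.allow("polite-bot", t) for t in (0.0, 2.0, 4.0, 6.0, 8.0)]
print(fast, slow)
```

Once blacklisted, a client stays blocked until someone removes it by hand, which matches the "write to this address if this was an error" approach described above.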
Re:Bright Planet's DQM (Score:3, Interesting)
It's cases like that where doing a half-arsed job is worse than not trying at all.