Google Crawls The Deep Web (197 comments)

mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e., the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to get past the forms automatically: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directives like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
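The quoted description is concrete enough to sketch: parse a page's form, fill each text box with a word taken from the site, pick among the values declared for select menus, and fetch the resulting GET URLs to see what comes back. The following is a minimal, standard-library Python illustration of that idea only; it is not Google's crawler code, and the probe_form helper, the caller-supplied word list, and the five-request cap are invented for the example (it also assumes a plain GET form; POST and login forms are deliberately out of scope).

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


class FormExtractor(HTMLParser):
    """Collects the first form's action, its text inputs, and select options."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.text_inputs = []        # names of <input type="text"> fields
        self.selects = {}            # select name -> list of option values
        self._in_form = False
        self._current_select = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action", "")
            self._in_form = True
        elif not self._in_form:
            return
        elif tag == "input" and attrs.get("type", "text") == "text":
            if attrs.get("name"):
                self.text_inputs.append(attrs["name"])
        elif tag == "select" and attrs.get("name"):
            self._current_select = attrs["name"]
            self.selects[self._current_select] = []
        elif tag == "option" and self._current_select and attrs.get("value"):
            self.selects[self._current_select].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False
        elif tag == "select":
            self._current_select = None


def probe_form(page_url, candidate_words, max_requests=5):
    """Build a few GET query URLs from the form's own option values and
    words taken from the page, in the spirit of the quoted description."""
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    parser = FormExtractor()
    parser.feed(html)
    if parser.action is None:
        return []

    probes = []
    for word in candidate_words[:max_requests]:
        params = {name: word for name in parser.text_inputs}
        # For select menus, just take the first value declared in the HTML.
        params.update({n: v[0] for n, v in parser.selects.items() if v})
        probes.append(urljoin(page_url, parser.action) + "?" + urlencode(params))
    return probes    # a real crawler would fetch, dedupe, and index these
```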
  • Bright Planet's DQM (Score:4, Interesting)

    by eldavojohn ( 898314 ) * <eldavojohn@noSpAM.gmail.com> on Wednesday April 16, 2008 @05:15PM (#23095788) Journal
    Several years ago, I tried a demo of Bright Planet's Deep Query Manager [brightplanet.com] that would essentially do these searches through a client on your machine in batch-like jobs. Oh, the bandwidth and resources you'll hog!

    Their stats on how much of the web they hit that Google missed were always impressive (true or not), but perhaps their days are numbered with this new venture by Google.

    Quite an interesting concept if you think about it. I always presupposed that companies would hate it, but I never got 'blocked' from doing it to any sites.

    Here, suck up my bandwidth without generating ad revenue! Sounds like a losing situation for the data provider, in my mind ...
  • by lastninja ( 237588 ) on Wednesday April 16, 2008 @05:17PM (#23095806)
    only half kidding
  • by Anonymous Coward on Wednesday April 16, 2008 @05:35PM (#23095972)
    I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
  • by menace3society ( 768451 ) on Wednesday April 16, 2008 @06:13PM (#23096436)
    You could build a really interesting "Deep Web" crawler by ignoring robots.txt. In fact, an index just of robots.txt files would be pretty cool in its own right. Call it "Sweet Sixteen" (10**100 in binary) or something.
  • by Christophotron ( 812632 ) on Wednesday April 16, 2008 @06:18PM (#23096520)

    As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record.

    I'd mod you up if I had some points. I'm sure there are ethical implications or something when it comes to respecting the website owner's wishes not to index, but it's all public information anyway. If it's on the web and I can look at it, then Google should be able to look at it and index it.

    I had no idea that government sites don't allow themselves to be indexed. That is BULLSHIT. People often NEED information from .gov sites, and ALL of it should be made easy to find. Refusing to allow indexing of such information is akin to hiding or obfuscating it: you don't actually want anyone to read it, but you can say it's available on the web so your ass is covered. IMO there should be a law stating that all of .gov MUST be indexed by search engines.

    Is there a law saying that search engines MUST follow robots.txt, nofollow, etc.? If it's not breaking the law, then Google should have some serious competition. A new search engine that indexes ALL VIEWABLE SITES regardless of the owner's wishes would be fucking great.

  • Re:Just think! (Score:5, Interesting)

    by Ariven ( 256118 ) on Wednesday April 16, 2008 @08:45PM (#23098378) Homepage
    I remember an article a while back where someone had cut and pasted some articles from one section of their site to another, and as a result had edit and delete links in the live content instead of only on their internal web interface.

    And a search engine (I think it was Google) crawled the site, hit the delete links, and deleted all the pages of the site. At the time it was stated that any link that performs an action, such as delete, should be a POST, submitted via a form, so that search engines wouldn't do that very thing (a small sketch of that convention appears after the comments).

    And now they are gonna start submitting forms? The fallout is gonna be entertaining.
  • Re:Oops... (Score:3, Interesting)

    by Anonymous Brave Guy ( 457657 ) on Wednesday April 16, 2008 @10:33PM (#23099346)

    What's it to Google (or a third party) if they mess up your pathetically-designed form?

    That depends. If they effectively launch a denial-of-service attack and eat zilliabytes of people's limited bandwidth by attempting to submit with all possible combinations of form controls and large amounts of random data in text fields, would that be:

    1. antisocial?
    2. negligent?
    3. the almost immediate end of their reign as most popular search engine as numerous webmasters blocked their robots?
    4. illegal?
    5. all of the above?
  • Re:Just think! (Score:5, Interesting)

    by jc42 ( 318812 ) on Wednesday April 16, 2008 @11:08PM (#23099630) Homepage Journal
    I had similar problems a few years ago. The database had a lot of data in a compact format, and I wrote some retrieval pages that would extract the data and run it through any of a list of formatters to give clients the output format they wanted. Very practical. Over time, the list of output formats slowly grew, as did the database. Then one day, the machine was totally bogged down with http requests. It turned out that a search site had figured out how to use my format-conversion form, and had requested all of our data in every format that my code delivered.

    Google wasn't too bad, because at least they spread the requests out over time. But other search sites hit our poor server with requests as fast as the Internet would deliver them. I ended up writing code that spotted this pattern of requests, and put the offending searcher on a blacklist. From then on, they only got back pages saying that they were blacklisted, with an email address to write if this was an error. That address never got any mail, and the problem went away.

    Since then, I've done periodic scans of the server logs for other bursts of requests that look like an attempt to extract everything in every format. I've had to add a few more gimmicks (kludges) to spot these automatically and blacklist the clients (a rough sketch of that kind of check appears after the comments).

    I wonder if Google's new code will get past my defenses. I've noticed that Googlebot addresses are in the "no CGI allowed" portion of my blacklist, though they are allowed to retrieve the basic data. I'll be on the lookout for symptoms of a breakthrough.

  • by enoz ( 1181117 ) on Thursday April 17, 2008 @01:26AM (#23100580)
    One time when I was deep-crawling a particular website, I decided to take a peek at their robots.txt file. To my amazement, they had listed all the folders that they didn't want anyone to find, yet had provided absolutely no security to prevent you from accessing the content if you knew the location.

    It's cases like that where doing a half-arsed job is worse than not trying at all.
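Ariven's point above — that anything destructive should sit behind a POST so that link-following (and now form-probing) crawlers cannot trigger it — is easy to demonstrate. The toy server below is an illustration, not anyone's production code; the /delete route and the in-memory "database" are invented for the example, and a real site would also require authentication and CSRF protection.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ARTICLES = {"1": "First post", "2": "Second post"}   # stand-in "database"


class SafeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Crawlers follow links and, per the article, now submit GET forms,
        # so a GET request must never change state: refuse /delete over GET.
        if self.path.startswith("/delete"):
            self.send_error(405, "Use POST for destructive actions")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write("\n".join(ARTICLES.values()).encode())

    def do_POST(self):
        # Only state-changing requests arrive here, and well-behaved
        # crawlers do not issue POSTs against arbitrary forms.
        article_id = self.path.rsplit("/", 1)[-1]
        if self.path.startswith("/delete/") and article_id in ARTICLES:
            del ARTICLES[article_id]
            self.send_response(204)
            self.end_headers()
        else:
            self.send_error(404, "Nothing to delete")


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), SafeHandler).serve_forever()
```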
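jc42's defence is essentially a sliding-window rate check that feeds a blacklist. Here is a rough sketch of that kind of gimmick; the one-minute window and the 30-request threshold are arbitrary assumptions for illustration, not jc42's actual numbers, and a real deployment would persist the blacklist and return the "you are blacklisted" page with a contact address.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # look at the last minute of requests (assumed value)
MAX_CGI_HITS = 30     # more hits than this per window looks like bulk extraction
_recent = defaultdict(deque)   # client IP -> timestamps of recent CGI requests
_blacklist = set()


def allow_cgi_request(client_ip, now=None):
    """Return True if the request should be served, False if the client has
    been blacklisted for burst-requesting every record in every format."""
    if client_ip in _blacklist:
        return False                 # keep serving only the blacklist notice
    now = time.time() if now is None else now
    hits = _recent[client_ip]
    hits.append(now)
    while hits and hits[0] < now - WINDOW_SECONDS:
        hits.popleft()               # drop requests that fell out of the window
    if len(hits) > MAX_CGI_HITS:
        _blacklist.add(client_ip)    # future requests get the blacklist page
        return False
    return True
```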
