Google Crawls The Deep Web
Posted by Zonk on Wednesday April 16, @05:13PM
from the delved-too-deeply dept.
mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."

Just think! (Score:5, Funny)
Re:Just think! (Score:4, Funny)
Re:Just think! (Score:5, Interesting)
And a search engine (I think it was Google) crawled the site, hit the delete links, and deleted all the pages of the site. At that time it was stated that any link that performs an action, such as delete, should be a POST via a form, so that search engines wouldn't do that very thing.
And now they are gonna start submitting forms? The fallout is gonna be entertaining.
Re:Just think! (Score:5, Interesting)
Google wasn't too bad, because at least they spread the requests out over time. But other search sites hit our poor server with requests as fast as the Internet would deliver them. I ended up writing code that spotted this pattern of requests, and put the offending searcher on a blacklist. From then on, they only got back pages saying that they were blacklisted, with an email address to write if this was an error. That address never got any mail, and the problem went away.
Since then, I've done periodic scans of the server logs for other bursts of requests that look like an attempt to extract everything in every format. I've had to add a few more gimmicks (kludges) to spot these automatically and blacklist the clients.
I wonder if Google's new code will get past my defenses? I've noticed that Googlebot addresses are in the "no CGI allowed" portion of my blacklist, though they are allowed to retrieve the basic data. I'll be on the lookout for symptoms of a breakthrough.
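For anyone wanting to try the same trick, here is a minimal sketch of that kind of burst detection. It is not the poster's actual code; the class name, the thresholds, and the sliding-window approach are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class BurstBlacklist:
    """Blacklist clients that send too many requests in a short window.

    Hypothetical helper: the limits below are illustrative, not tuned values.
    """

    def __init__(self, max_requests=20, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # client -> timestamps of recent requests
        self.blacklisted = set()

    def allow(self, client, now):
        """Record a request at time `now`; return False if the client is blocked."""
        if client in self.blacklisted:
            return False
        stamps = self.recent[client]
        stamps.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        if len(stamps) > self.max_requests:
            # Too many requests inside the window: block from now on.
            self.blacklisted.add(client)
            return False
        return True
```

A crawler that spreads its requests out never fills the window, while one that fires requests as fast as the network allows trips the limit and stays blocked until someone removes it from the set by hand.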
Bright Planet's DQM (Score:4, Interesting)
Their stats on how much of the web they hit that Google missed were always impressive (true or not), but perhaps their days are numbered with this new venture by Google.
Quite an interesting concept if you think about it. I always presupposed that companies would hate it, but I never got 'blocked' from doing it to sites.
Here, suck up my bandwidth without generating ad revenue! Sounds like a losing situation for the data provider, in my mind.
Oops... (Score:5, Funny)
Re:Oops... (Score:5, Informative)
This won't post forms of that sort. In the blog post, they say that they are only doing this for GET forms, which are safe to automate as per the HTTP specification.
This is for things like product catalogue searches where you pick criteria from drop-down boxes. Not so common for run-of-the-mill e-commerce sites, but I've seen a lot on B2B sites.
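To make that concrete, here is a rough sketch of what enumerating a GET form looks like from the crawler's side. This is not Google's code; the function name, the example URL, and the field names are made up, and it assumes the form action has no query string of its own.

```python
from itertools import product
from urllib.parse import urlencode


def get_form_urls(action, select_fields):
    """Yield one GET URL per combination of select-menu values.

    `select_fields` maps a field name to the list of option values
    found in the form's HTML (hypothetical input format).
    """
    names = sorted(select_fields)
    for combo in product(*(select_fields[name] for name in names)):
        yield action + "?" + urlencode(list(zip(names, combo)))


urls = list(get_form_urls(
    "http://example.com/catalogue",
    {"category": ["bolts", "nuts"], "size": ["M4", "M6"]},
))
```

Each resulting URL is an ordinary link, which is why a GET form is fair game for a crawler in the first place. Note how quickly the combinations multiply as fields are added.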
Re:Oops... (Score:5, Insightful)
Will it solve captchas? (Score:5, Interesting)
Re:Will it solve captchas? (Score:5, Funny)
Re:Will it solve captchas? (Score:5, Funny)
Forums? (Score:5, Funny)
On the plus side, this should enable Google to get by the "Must be 18 to view" buttons.
HELLO I AM GOOGLEBOT (Score:5, Funny)
Re:HELLO I AM GOOGLEBOT (Score:5, Funny)
Re:HELLO I AM GOOGLEBOT (Score:4, Funny)
good and bad (Score:4, Insightful)
Re:good and bad (Score:5, Insightful)
Re:good and bad (Score:5, Insightful)
They've indexed Flash for about four years now.
No doubt. There are a lot of clueless developers out there who insist on ignoring security and specifications time and time again. I have no sympathy for people bitten by this; you'd think they'd have learnt from GWA that GET is not off-limits to automated software.
robots.txt (Score:5, Funny)
Note to self... (Score:4, Funny)
The Internet is for Porn (Score:5, Funny)
directions like 'nofollow' are still respected (Score:5, Informative)
Maybe they shouldn't be, at least not in all cases. Several years back I did many Google searches for some information that was very important to me, but never could find anything. Then a few months later (too late to be of use), pretty much by a fortunate combination of factors but with no help from Google, I came across the exact information on a .GOV website, in a publicly filed IPO document. As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record. When these nofollow directives are overused by mindless and unaccountable bureaucrats, perhaps someone needs to make the decision that these records should be public, and that this isn't best served by hiding them deep down a long list of links where they are hard to locate. In cases like this I would applaud any search engine that ignores the "suggestion" not to index public pages just because of an inappropriate tag in the HTML. In fact, if I knew of any search engine that was indexing in spite of this tag, I would switch to it as my first-choice search engine in an instant. For starters, I would suggest that any .GOV and any State TLD website should have this tag ignored unless there were a darn good reason to do otherwise.
Anecdote from Google (Score:5, Funny)
So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data on random forms on the web. Like the above guy, people may not design their site with such a spider in mind, and despite their lack of foresight this could kill a lot of goodwill if done improperly.
Re:Google, consider this... (Score:4, Insightful)
If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.
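A quick sketch of the server-side guard this implies, for what it's worth. The action names and the function are hypothetical; the only real facts used are that GET and HEAD are defined as safe methods and that 405 means Method Not Allowed.

```python
# Hypothetical names of actions that change state on the server.
STATE_CHANGING = {"subscribe", "send_feedback", "delete_page"}


def check_request(method, action):
    """Return an HTTP status code for a request.

    405 (Method Not Allowed) if a state-changing action arrives via a
    safe method such as GET or HEAD; 200 otherwise.
    """
    if action in STATE_CHANGING and method.upper() in ("GET", "HEAD"):
        return 405
    return 200
```

With a check like this in front of the handlers, a spider following links or submitting GET forms can read pages but can never trigger a signup or a delete.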
Re:sites can still be excluded (Score:4, Insightful)
- Scripts with potentially infinite results. If you have a calendar script on your site that shows this month's calendar with links to "next month" and "previous month", then without robots.txt the search engine could index back into prehistoric times and past the death of the Sun, with blank event calendars for each month. This is stupid. With your robots.txt file you tell the spider which URLs it is in BOTH your best interests not to crawl. You save server resources and bandwidth; Google saves their time and resources.
- If you have a duplicate copy or copies of your site for development, or perhaps an experimental "beta" version of your site, you don't want it competing with the real site for search engine placement, or worse, causing SE spiders to think you're a filthy spammer with duplicate content all over the place. So you disallow the dupes with robots.txt. Now sure, ideally that server could be inside your firewall instead of on the Web, but it gets more challenging when your dev team is on a different continent.
- Temporary crap that has no value to the outside world. Once again, it's a waste of both your and the search engine's time to index it.
The above are all reasons why you might want some or all of the content on a site not indexed.
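Here is a small sketch of how those three cases might look in practice, checked with Python's standard `urllib.robotparser`. The paths (`/calendar/`, `/beta/`, `/tmp/`), the bot name, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt covering the three cases above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /beta/
Disallow: /tmp/
"""


def allowed(url, agent="ExampleBot"):
    """Return True if a well-behaved crawler may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)
```

A spider that honours the file would skip `http://example.com/calendar/2008/05` and still crawl `http://example.com/products`. Keep in mind that robots.txt is only a request; it does nothing against a crawler that chooses to ignore it.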