Google Crawls The Deep Web 197
mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to get past the forms automatically: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directives like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
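The approach the blog post describes can be sketched roughly as follows. This is an illustrative guess at the mechanics, not Google's actual implementation: fill each text input with words harvested from the page, enumerate the predefined values of each select menu, and emit one candidate query string per combination.

```python
# Hedged sketch of form probing as described above; names are illustrative.
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode

class FormScanner(HTMLParser):
    """Collect candidate values for each input in a form on a page."""
    def __init__(self, page_words):
        super().__init__()
        self.page_words = page_words   # words harvested from the page text
        self.fields = {}               # field name -> list of candidate values
        self._select = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type", "text") == "text" and "name" in a:
            # Text boxes get words chosen from the site itself.
            self.fields[a["name"]] = self.page_words
        elif tag == "select" and "name" in a:
            self._select = a["name"]
            self.fields[self._select] = []
        elif tag == "option" and self._select and "value" in a:
            # Select menus get the values already present in the HTML.
            self.fields[self._select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None

def candidate_queries(html, page_words):
    """Return one query string per combination of candidate field values."""
    scanner = FormScanner(page_words)
    scanner.feed(html)
    names = list(scanner.fields)
    return [urlencode(dict(zip(names, combo)))
            for combo in product(*scanner.fields.values())]
```

Each resulting query string would be appended to the form's action URL and fetched with a plain GET, which is why the `nofollow`/`noindex` escape hatches still matter.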
good and bad (Score:4, Insightful)
Re:good and bad (Score:5, Insightful)
Re:HELLO I AM GOOGLEBOT (Score:1, Insightful)
Re:Google, consider this... (Score:4, Insightful)
If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.
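To illustrate the distinction, here is a minimal, framework-free sketch of doing it Right. The handler and route names are made up for the example; the point is simply that state only changes on POST, so a crawler issuing GETs can never trigger the signup.

```python
# Hypothetical dispatcher: only POST is allowed to change state.
def handle_request(method, path, params, subscribers):
    """Route a request; GET must stay idempotent."""
    if path == "/newsletter/signup":
        if method != "POST":
            # A spider probing this URL with GET is harmlessly refused.
            return 405, "Method Not Allowed"
        subscribers.append(params.get("email", ""))
        return 200, "Signed up"
    return 404, "Not Found"
```

With this shape, Googlebot's form probing costs you nothing worse than a few 405s, whereas a GET-triggered signup would fire on every crawl.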
Fuzzing the world (Score:3, Insightful)
How will this work for forms that perform translations, validations, and similar operations on other websites? Will Google try to pull the entire internet through each such site it finds?
And not every web development environment forces GET not to change data. In Ruby on Rails, adding "?_method=post" to the end of a URL fakes a POST even though the request is actually a GET; I disabled that behaviour at the company I work for. Not everyone is going to do that.
Re:good and bad (Score:5, Insightful)
They've indexed Flash for about four years now.
No doubt. There are a lot of clueless developers out there who insist on ignoring security and specifications time and time again. I have no sympathy for people bitten by this; you'd think they'd have learnt from GWA (the Google Web Accelerator) that GET is not off-limits to automated software.
Re:directions like 'nofollow' are still respected (Score:2, Insightful)
Re:Fuzzing the world (Score:3, Insightful)
More precisely: Not everyone has been doing that. I'm sure when Google comes along and exposes all their bugs they will quickly take the hint.
I don't really see the problem. The developers who know what they are doing, like you, won't be adversely affected, while the incompetent developers have to scurry around fixing their bugs every time something like this happens.
In other news, (Score:2, Insightful)
On phone numbers where a VMS is detected, Google plans to dial "#0#" and other codes in order to determine how to reach a human.
"Since we are a big, rich entity, the laws don't apply to us. We can do black-hat hacking exploits that would cause law enforcement to raid your home if you did the same thing," said a Google spokesman.
Re:Oops... (Score:5, Insightful)
Title Correction (Score:2, Insightful)
Re:sites can still be excluded (Score:4, Insightful)
- Scripts with potentially infinite results. If your site has a calendar script that shows this month's calendar with links to "next month" and "previous month", then without robots.txt the search engine could index back into prehistoric times and past the death of the Sun, with a blank event calendar for each month. This is stupid. With your robots.txt file you tell the spider which URLs it is in BOTH your best interests not to crawl: you save server resources and bandwidth, and Google saves its time and resources.
- If you have a duplicate copy or copies of your site for development, or perhaps an experimental "beta" version of your site, you don't want it competing with the real site for search engine placement, or worse, causing SE spiders to think you're a filthy spammer with duplicate content all over the place. So you disallow the dupes with robots.txt. Now, sure, ideally that server would be inside your firewall instead of on the Web, but that gets more challenging when your dev team is on a different continent.
- Temporary crap that has no value to the outside world. Once again, it's a waste of both your time and the search engine's to index it.
The above are all reasons why you might want some or all of the content on a site not indexed.
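For the cases above, the robots.txt entries might look something like this (the paths are illustrative):

```
User-agent: *
Disallow: /calendar/   # endless next/previous-month pages
Disallow: /beta/       # experimental duplicate of the live site
Disallow: /tmp/        # temporary junk with no search value
```

Per the Robots Exclusion Protocol, well-behaved crawlers (Googlebot included) skip anything matching a `Disallow` prefix, though compliance is voluntary.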
Re:Just think! (Score:1, Insightful)
The way you put it, it sounds like Google has somehow done something wrong there, but it's not as though every visitor to a site is going to be courteous enough not to hit the delete link themselves. The responsibility for this problem lies entirely at the feet of the site's developer: even if the link was out there, the site should never have gone ahead and deleted the content without first checking who was asking.
There are certainly valid concerns about Google's plans, but incompetent web developers shouldn't be one of them, and as I pointed out originally, someone is going to take advantage of their incompetence eventually, whether it's the Googlebot or someone else.
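The fix the parent describes is twofold: make destructive actions POST-only, and check the requester's identity before acting. A framework-free sketch, with illustrative names:

```python
# Hypothetical delete handler: refuse GET links, then refuse the wrong user.
def handle_delete(method, user, owner, pages, page_id):
    """Delete a page only on POST, and only for the page's owner."""
    if method != "POST":
        return 405, "Method Not Allowed"   # GET links (and crawlers) refused
    if user != owner:
        return 403, "Forbidden"            # anonymous or wrong user refused
    pages.pop(page_id, None)
    return 200, "Deleted"
```

Either check alone would have stopped the crawler; together they also stop the malicious human the parent is warning about.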
Robots.txt is not the answer (Score:3, Insightful)
Every time anyone raises a question like this, someone trots out robots.txt as if it is some sort of magic solution to all the potential problems. It is not.
For one thing, it is voluntary. Google and some other major search engines may respect it today, but they are under no obligation to do so, nor to continue to do so if they do now.
For another thing, depending on robots.txt makes the whole game opt-out. This is, IMHO, the wrong default for potentially unwelcome visits. We can't keep pretending it's OK to hit sites with huge increases in traffic because "if it's on the Internet, they expect people to visit". Sure, they expect people to visit — not automated systems run by companies with vast resources that can push a typical small site into paying for extra bandwidth or being taken off-line in a matter of minutes.
It is not OK to just Slashdot a site out of the blue. It is not OK to aggressively attack every form on a site to see what you can find. It is not OK to set up a 1,000,000 computer botnet and then effect a DDoS attack against a web site your client doesn't like. It is not OK to send me so much spam that I have to waste hours of my life sorting through it to find legitimate e-mail. These are all variations of exactly the same principle: knowingly causing a huge, unexpected and potentially expensive or damaging increase in traffic to someone without their knowledge or consent. And most of them are already illegal in a lot of jurisdictions.
It doesn't take a genius to spot that this is unethical behaviour, and it's long past time we stopped pretending it was OK because Google can Do No Evil(TM) and we like Slashdot. The current approach is unsustainable, and since the Internet's days as an unmetered, untaxed medium appear to be numbered in the current political climate, the sooner the robots.txt advocates get it, the better.