Google Crawls The Deep Web 197
mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to get past the forms automatically: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directives like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
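The approach the blog post describes can be sketched roughly as follows. This is an illustrative guess at the mechanics, not Google's actual implementation: fill each text input with words harvested from the page, enumerate the predefined values of each select menu, and emit one candidate query string per combination.

```python
# Hedged sketch of form probing as described above; names are illustrative.
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode

class FormScanner(HTMLParser):
    """Collect candidate values for each input in a form on a page."""
    def __init__(self, page_words):
        super().__init__()
        self.page_words = page_words   # words harvested from the page text
        self.fields = {}               # field name -> list of candidate values
        self._select = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type", "text") == "text" and "name" in a:
            # Text boxes get words chosen from the site itself.
            self.fields[a["name"]] = self.page_words
        elif tag == "select" and "name" in a:
            self._select = a["name"]
            self.fields[self._select] = []
        elif tag == "option" and self._select and "value" in a:
            # Select menus get the values already present in the HTML.
            self.fields[self._select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None

def candidate_queries(html, page_words):
    """Return one query string per combination of candidate field values."""
    scanner = FormScanner(page_words)
    scanner.feed(html)
    names = list(scanner.fields)
    return [urlencode(dict(zip(names, combo)))
            for combo in product(*scanner.fields.values())]
```

Each resulting query string would be appended to the form's action URL and fetched with a plain GET, which is why the `nofollow`/`noindex` escape hatches still matter.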
good and bad (Score:4, Insightful)
Re:good and bad (Score:5, Insightful)
Re:HELLO I AM GOOGLEBOT (Score:1, Insightful)
Re:Google, consider this... (Score:4, Insightful)
If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.
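To illustrate the distinction, here is a minimal, framework-free sketch of doing it Right. The handler and route names are made up for the example; the point is simply that state only changes on POST, so a crawler issuing GETs can never trigger the signup.

```python
# Hypothetical dispatcher: only POST is allowed to change state.
def handle_request(method, path, params, subscribers):
    """Route a request; GET must stay idempotent."""
    if path == "/newsletter/signup":
        if method != "POST":
            # A spider probing this URL with GET is harmlessly refused.
            return 405, "Method Not Allowed"
        subscribers.append(params.get("email", ""))
        return 200, "Signed up"
    return 404, "Not Found"
```

With this shape, Googlebot's form probing costs you nothing worse than a few 405s, whereas a GET-triggered signup would fire on every crawl.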
Fuzzing the world (Score:3, Insightful)
How will this work for forms that perform translations, validations, and similar operations on other websites? Will Google try to pull the entire internet through each such site it finds?
And not every web development environment forces GET not to change data. In Ruby on Rails, adding "?_method=post" to the end of a URL fakes a POST even though the request is actually a GET; I disabled that behaviour at the company I work for. Not everyone is going to do that.
Re:good and bad (Score:5, Insightful)
They've indexed Flash for about four years now.
No doubt. There are a lot of clueless developers out there who insist on ignoring security and specifications time and time again. I have no sympathy for people bitten by this; you'd think they'd have learnt from GWA (the Google Web Accelerator) that GET is not off-limits to automated software.
Re:directions like 'nofollow' are still respected (Score:2, Insightful)
Re:Fuzzing the world (Score:3, Insightful)
More precisely: Not everyone has been doing that. I'm sure when Google comes along and exposes all their bugs they will quickly take the hint.
I don't really see the problem. The developers who know what they are doing, like you, won't be adversely affected, while the incompetent developers have to scurry around fixing their bugs every time something like this happens.
In other news, (Score:2, Insightful)
On phone numbers where a VMS is detected, Google plans to dial "#0#" and other codes in order to determine how to reach a human.
"Since we are a big, rich entity, the laws don't apply to us. We can do black-hat hacking exploits that would cause law enforcement to raid your home if you did the same thing," said a Google spokesman.
Re:Oops... (Score:5, Insightful)
Title Correction (Score:2, Insightful)
Re:sites can still be excluded (Score:4, Insightful)
- Scripts with potentially infinite results. If your site has a calendar script that shows this month's calendar with links to "next month" and "previous month", then without robots.txt the search engine could index back into prehistoric times and past the death of the Sun, with a blank event calendar for each month. This is stupid. With your robots.txt file you tell the spider which URLs it is in BOTH your best interests not to crawl: you save server resources and bandwidth, and Google saves its time and resources.
- If you have a duplicate copy or copies of your site for development, or perhaps an experimental "beta" version of your site, you don't want it competing with the real site for search engine placement, or worse, causing SE spiders to think you're a filthy spammer with duplicate content all over the place. So you disallow the dupes with robots.txt. Now, sure, ideally that server would be inside your firewall instead of on the Web, but that gets more challenging when your dev team is on a different continent.
- Temporary crap that has no value to the outside world. Once again, it's a waste of both your time and the search engine's to index it.
The above are all reasons why you might want some or all of the content on a site not indexed.
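For the cases above, the robots.txt entries might look something like this (the paths are illustrative):

```
User-agent: *
Disallow: /calendar/   # endless next/previous-month pages
Disallow: /beta/       # experimental duplicate of the live site
Disallow: /tmp/        # temporary junk with no search value
```

Per the Robots Exclusion Protocol, well-behaved crawlers (Googlebot included) skip anything matching a `Disallow` prefix, though compliance is voluntary.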
Re:Just think! (Score:1, Insightful)
The way you put it, it sounds like Google has somehow done something wrong there, but it's not as though every visitor to a site is going to be courteous enough not to hit the delete link themselves. The responsibility for this problem lies entirely at the feet of the site's developer: even if the link was out there, the site should never have gone ahead and deleted the content without first checking who was asking.
There are certainly valid concerns about Google's plans, but incompetent web developers shouldn't be one of them, and as I pointed out originally, someone is going to take advantage of their incompetence eventually, whether it's the Googlebot or someone else.
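The fix the parent describes is twofold: make destructive actions POST-only, and check the requester's identity before acting. A framework-free sketch, with illustrative names:

```python
# Hypothetical delete handler: refuse GET links, then refuse the wrong user.
def handle_delete(method, user, owner, pages, page_id):
    """Delete a page only on POST, and only for the page's owner."""
    if method != "POST":
        return 405, "Method Not Allowed"   # GET links (and crawlers) refused
    if user != owner:
        return 403, "Forbidden"            # anonymous or wrong user refused
    pages.pop(page_id, None)
    return 200, "Deleted"
```

Either check alone would have stopped the crawler; together they also stop the malicious human the parent is warning about.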
Robots.txt is not the answer (Score:3, Insightful)
Every time anyone raises a question like this, someone trots out robots.txt as if it is some sort of magic solution to all the potential problems. It is not.
For one thing, it is voluntary. Google and some other major search engines may respect it today, but they are under no obligation to do so, nor to continue to do so if they do now.
For another thing, depending on robots.txt makes the whole game opt-out. This is, IMHO, the wrong default for potentially unwelcome visits. We can't keep pretending it's OK to hit sites with huge increases in traffic because "if it's on the Internet, they expect people to visit". Sure, they expect people to visit — not automated systems run by companies with vast resources that can push a typical small site into paying for extra bandwidth or being taken off-line in a matter of minutes.
It is not OK to just Slashdot a site out of the blue. It is not OK to aggressively attack every form on a site to see what you can find. It is not OK to set up a 1,000,000 computer botnet and then effect a DDoS attack against a web site your client doesn't like. It is not OK to send me so much spam that I have to waste hours of my life sorting through it to find legitimate e-mail. These are all variations of exactly the same principle: knowingly causing a huge, unexpected and potentially expensive or damaging increase in traffic to someone without their knowledge or consent. And most of them are already illegal in a lot of jurisdictions.
It doesn't take a genius to spot that this is unethical behaviour, and it's long past time we stopped pretending it was OK because Google can Do No Evil(TM) and we like Slashdot. The current approach is unsustainable, and since the Internet's days as an unmetered, untaxed medium appear to be numbered in the current political climate, the sooner the robots.txt advocates get it, the better.