Google Crawls The Deep Web (197 comments)

Posted by Zonk on Wednesday April 16, 2008 @05:13PM from the delved-too-deeply dept.

mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
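
The form handling described in the summary can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not Google's code: the names `FormParser` and `candidate_urls`, the `site_words` list, and the `limit` cap are all hypothetical, and only text inputs and select menus are handled. It skips any form whose method is not GET, which matches what commenters below report from the blog post.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collects the method, action, text inputs and select options of a form."""

    def __init__(self):
        super().__init__()
        self.method = "get"      # HTML default when no method is given
        self.action = ""
        self.text_fields = []    # names of text inputs
        self.selects = {}        # select name -> list of option values
        self._current = None     # name of the select currently being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.method = (a.get("method") or "get").lower()
            self.action = a.get("action") or ""
        elif tag == "input" and a.get("type", "text") == "text" and a.get("name"):
            self.text_fields.append(a["name"])
        elif tag == "select" and a.get("name"):
            self._current = a["name"]
            self.selects[self._current] = []
        elif tag == "option" and self._current and a.get("value"):
            self.selects[self._current].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


def candidate_urls(page_url, html, site_words, limit=20):
    """Build GET URLs from a form: site words for text boxes, listed values for selects."""
    parser = FormParser()
    parser.feed(html)
    if parser.method != "get":
        return []  # never auto-submit POST forms
    names = parser.text_fields + list(parser.selects)
    choices = [site_words] * len(parser.text_fields) + list(parser.selects.values())
    base = urljoin(page_url, parser.action)
    urls = []
    for combo in product(*choices):
        urls.append(base + "?" + urlencode(dict(zip(names, combo))))
        if len(urls) >= limit:  # cap the combinatorial blow-up
            break
    return urls
```

The `limit` cap is there because the number of combinations grows multiplicatively with every extra field, which is the bandwidth worry several commenters raise.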

This discussion has been archived. No new comments can be posted.

Just think! (Score:5, Funny)

by scubamage ( 727538 ) writes: on Wednesday April 16, 2008 @05:14PM (#23095774)

Soon, they'll start injecting SQL too to help map databases! Google is so useful indeed! :)

- Re:Just think! (Score:4, Funny)
  
  by AKAImBatman ( 238306 ) writes: <[akaimbatman] [at] [gmail.com]> on Wednesday April 16, 2008 @05:28PM (#23095906) Homepage Journal
  
  Hmm... that reminds me of this DailyWTF [thedailywtf.com]. Who knew that Mr. Test User was such a big customer? :-P
  
- Re: (Score:2)
  
  by CastrTroy ( 595695 ) writes:
  
  Actually, we (the web) have had problems with this before. Web accelerators started following links on pages before you clicked them. If the link happened to link to an action deleting something, it would delete it just by visiting a page with the delete link on it. Granted you should never do anything destructive with a GET request, but now Google is starting to submit forms. I wonder how much stuff they will end up deleting with their program that automatically submits forms with values it think shoul
  - Re: (Score:2)
    
    by LordKronos ( 470910 ) writes:
    
    I've seen a number of users come crying in the mythtv forum that somehow all of their recordings mysteriously disappeared. Seems having your mythweb completely unsecured isn't such a good thing.
    
    For those people, this move by Google is great news. You see, the delete links were all simple GET requests, so the spiders were able to delete content. However, the scheduling is all done via POST'ed forms, so nothing would ever get recorded. This move on Google's part is really just an attempt to combat this. The o
  - Re:Just think! (Score:5, Interesting)
    
    by jc42 ( 318812 ) writes: on Wednesday April 16, 2008 @11:08PM (#23099630) Homepage Journal
    
    I had similar problems a few years ago. The database had a lot of data in a compact format, and I wrote some retrieval pages that would extract the data and run it through any of a list of formatters to give clients the output format they wanted. Very practical. Over time, the list of output formats slowly grew, as did the database. Then one day, the machine was totally bogged down with http requests. It turned out that a search site had figured out how to use my format-conversion form, and had requested all of our data in every format that my code delivered.
    
    Google wasn't too bad, because at least they spread the requests out over time. But other search sites hit our poor server with requests as fast as the Internet would deliver them. I ended up writing code that spotted this pattern of requests, and put the offending searcher on a blacklist. From then on, they only got back pages saying that they were blacklisted, with an email address to write if this was an error. That address never got any mail, and the problem went away.
    
    Since then, I've done periodic scans of the server logs for other bursts of requests that look like an attempt to extract everything in every format. I've had to add a few more gimmicks (kludges) to spot these automatically and blacklist the clients.
    
    I wonder if google's new code will get past my defenses? I've noticed that googlebot addresses are in the "no CGI allowed" portion of my blacklist, though they are allowed to retrieve the basic data. I'll be on the lookout for symptoms of a breakthrough.
    
- Re: (Score:2)
  
  by Arancaytar ( 966377 ) writes:
  
  Or just deleting those databases in order to reduce the set of information it has to index. Google "Google Purge onion"! :P
- - Re: (Score:3, Informative)
    
    by Lillesvin ( 797939 ) writes:
    
    ... maybe a borked machine?
    
    Yeah, maybe your machine... That SQL-error looks more like bad session handling on the server hosting your Drupal installation than Google trying to do an SQL-injection... Actually, it looks nothing like an SQL-injection at all. MySQL is merely being asked to insert a duplicate value in a column specified as unique (`sid`), which it refuses because it's not unique. Don't expect an answer, since it's most likely not an error on Google's end.
    A little more on topic though, what
    - Re:Just think! (Score:5, Interesting)
      
      by Ariven ( 256118 ) writes: <ariven@@@gmail...com> on Wednesday April 16, 2008 @08:45PM (#23098378) Homepage
      
      I remember an article a while back where someone had cut and pasted some articles from one section of their site to another, and as a result had edit and delete links in the live content instead of on their internal web interface.
      
      And a search engine (I think it was Google) crawled the site, hit the delete links and deleted all the pages of the site. At that time it was stated that any link that performs an action, such as delete, should be a POST via a form, so that search engines wouldn't do that very thing.
      
      And now they are gonna start submitting forms? The fallout is gonna be entertaining.
      
      - Re: (Score:3, Informative)
        
        by dartarrow ( 930250 ) writes:
        
        I think you mean this: http://thedailywtf.com/Articles/The_Spider_of_Doom.aspx [thedailywtf.com]
- - Re: (Score:2)
    
    by Nullav ( 1053766 ) writes:
    
    +1, Informative!
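
The defence jc42 describes in this thread (spot a client that requests everything in a burst, then blacklist it) can be sketched as a sliding-window counter. This is a minimal Python sketch, not jc42's actual code: the class name `BurstDetector` and the threshold values are assumptions chosen for illustration, and timestamps are passed in explicitly so the logic is easy to test.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Blacklists a client that makes too many requests inside a time window."""

    def __init__(self, max_requests=30, window_seconds=10.0):
        self.max_requests = max_requests      # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window length
        self.recent = defaultdict(deque)      # client -> timestamps in the window
        self.blacklist = set()

    def allow(self, client, now):
        """Record one request at time `now`; return False if the client is blocked."""
        if client in self.blacklist:
            return False
        times = self.recent[client]
        times.append(now)
        # Drop timestamps that have fallen out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        if len(times) > self.max_requests:
            self.blacklist.add(client)
            return False
        return True
```

A crawler that spreads its requests out over time, as jc42 says Google did, stays under the threshold, while one that fires requests as fast as the network allows trips it quickly.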
Bright Planet's DQM (Score:4, Interesting)

by eldavojohn ( 898314 ) * writes: <eldavojohn&gmail,com> on Wednesday April 16, 2008 @05:15PM (#23095788) Journal

Several years ago, I tried a demo of Bright Planet's Deep Query Manager [brightplanet.com] that would essentially do these searches through a client on your machine in batch-like jobs. Oh, the bandwidth and resources you'll hog!

Their stats on how much of the web they hit that Google missed were always impressive (true or not), but perhaps their days are numbered with this new venture by Google.

Quite an interesting concept if you think about it. I always presupposed that companies would hate it but never got 'blocked' from doing it to sites.

Here, suck up my bandwidth without generating ad revenue! Sounds like a lose situation for the data provider in my mind ...

- Re: (Score:3, Interesting)
  
  by menace3society ( 768451 ) writes:
  
  You could build a really interesting "Deep Web" crawler by ignoring robots.txt. In fact, an index just of robots.txt files would be pretty cool in its own right. Call it "Sweet Sixteen" (10**100 in binary) or something.
  - Re: (Score:3, Interesting)
    
    by enoz ( 1181117 ) writes:
    
    One time when I was Deep Crawling a particular website I decided to take a peek at their robots.txt file. To my amazement they had listed all the folders that they didn't want anyone to find, yet had provided absolutely no security to prevent you accessing the content if you knew the location.
    
    It's cases like that where doing a half-arsed job is worse than not trying at all.
- Re: (Score:2)
  
  by cheater512 ( 783349 ) writes:
  
  The more content they have off your site, the more visitors they send.
  
  The visitors *do* generate ad revenue. :)
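
enoz's point in this thread is that robots.txt is itself a public file, so listing a folder there advertises it to anyone who looks. A minimal Python sketch of how little effort that takes, assuming a hypothetical helper named `disallowed_paths` and made-up folder names:

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path found in a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # strip trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt used only for illustration.
SAMPLE = """\
User-agent: *
Disallow: /private-reports/
Disallow: /old-admin/  # do not look
Disallow:
"""
```

Calling `disallowed_paths(SAMPLE)` yields the two folder names, which is why a Disallow line is no substitute for access control on the server.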
Oops... (Score:5, Funny)

by JohnnyDanger ( 680986 ) writes: on Wednesday April 16, 2008 @05:16PM (#23095790)

They just bought everything on Amazon.

- Re:Oops... (Score:5, Informative)
  
  by Bogtha ( 906264 ) writes: on Wednesday April 16, 2008 @05:57PM (#23096200)
  
  This won't post forms of that sort. In the blog post, they say that they are only doing this for GET forms, which are safe to automate as per the HTTP specification.
  
  This is for things like product catalogue searches where you pick criteria from drop-down boxes. Not so common for run-of-the-mill e-commerce sites, but I've seen a lot on B2B sites.
  
  - Re: (Score:3, Funny)
    
    by Firehed ( 942385 ) writes:
    
    HTTP spec be damned - has IE taught you nothing?
    - - Re: (Score:2)
        
        by Jarjarthejedi ( 996957 ) writes:
        
        IE's horridness transcends the mere concept of acronyms.
      - Re: (Score:2)
        
        by cheater512 ( 783349 ) writes:
        
        No, IE ignores chunks of HTTP as well as HTML.
        
        Re: (Score:2)
        
        by enoz ( 1181117 ) writes:
        
        Methinks the developers were sniffing too much MIME
  - Re:Oops... (Score:5, Insightful)
    
    by orkysoft ( 93727 ) writes: <orkysoft@my r e a l b o x . c om> on Wednesday April 16, 2008 @07:17PM (#23097390) Journal
    
    Unfortunately, there are tons of sites whose developers did not understand the part about GET being for looking up stuff, and POST being for making changes on the server.
    
    - Re: (Score:2)
      
      by jrumney ( 197329 ) writes:
      
      Unfortunately there are also tons of sites whose developers did not understand the part about POST being for creating new resources, and PUT being for making changes on the server.
      
      HTTP verb semantics are a very dangerous thing for Google or any other third party to rely on, unless they are using a documented API where the developers have explicitly followed REST principles.
      - Re: (Score:2)
        
        by FooAtWFU ( 699187 ) writes:
        
        What's it to Google (or a third party) if they mess up your pathetically-designed form? It's not like they're going to "accidentally purchase something" (like some people suggested) unless they have their robots equipped with billing information submission functions (somehow I doubt it).
        
        Re: (Score:3, Interesting)
        
        by Anonymous Brave Guy ( 457657 ) writes:
        
        What's it to Google (or a third party) if they mess up your pathetically-designed form?
        That depends. If they effectively launch a denial-of-service attack and eat zilliabytes of people's limited bandwidth by attempting to submit with all possible combinations of form controls and large amounts of random data in text fields, would that be:
        
        antisocial?
        negligent?
        the almost immediate end of their reign as most popular search engine as numerous webmasters blocked their robots?
        illegal?
        all of the above?
        
        Re: (Score:2)
        
        by rtb61 ( 674572 ) writes:
        
        Well that does bring up a point. Should you have to include extra coding in your html to block google, or should google only be allowed to deep search sites that have extra coding that invites them in.
        Google in a way is saying that if you fail to properly secure your site that they have a right to data mine it and generate profits from your data. Perhaps, mind you, just perhaps, that really, legally, is not appropriate and perhaps a legal investigation is required to clarify this before everyone starts do
        
        Robots.txt is not the answer (Score:3, Insightful)
        
        by Anonymous Brave Guy ( 457657 ) writes:
        
        Every time anyone raises a question like this, someone trots out robots.txt as if it is some sort of magic solution to all the potential problems. It is not.
        
        For one thing, it is voluntary. Google and some other major search engines may respect it today, but they are under no obligation to do so, nor to continue to do so if they do now.
        For another thing, depending on robots.txt makes the whole game opt-out. This is, IMHO, the wrong default for potentially unwelcome visits. We can't keep pretending it's O
      - Re: (Score:2)
        
        by jlarocco ( 851450 ) writes:
        
        HTTP is a documented API.
        
        What makes you think somebody who's just fucked up HTTP isn't going to go right ahead and fuck up "REST principles" while they're at it?
  - Re: (Score:2)
    
    by exp(pi*sqrt(163)) ( 613870 ) writes:
    
    Do you think the DoD uses GET or POST for launching nuclear warheads? Is there a guideline about that?
    - Re: (Score:2)
      
      by enoz ( 1181117 ) writes:
      
      Obviously they should be using DELETE [w3.org]
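
The rule argued over in this thread (GET is for looking things up, anything that changes state needs another method) can be enforced on the server side. A minimal Python sketch, assuming a hypothetical `dispatch` function and a made-up set of action names; a real application would wire this into its own routing:

```python
SAFE_METHODS = {"GET", "HEAD"}                             # defined as safe by HTTP
STATE_CHANGING_ACTIONS = {"delete", "edit", "purchase"}    # hypothetical action names


def dispatch(method, action):
    """Return (status, message); refuse state changes requested via a safe method."""
    method = method.upper()
    if action in STATE_CHANGING_ACTIONS and method in SAFE_METHODS:
        return 405, "Method Not Allowed: use POST for " + action
    return 200, "ok: " + action
```

With a guard like this, a spider or prefetcher that follows a "delete" link gets a 405 response instead of deleting anything.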
Will it solve captchas? (Score:5, Interesting)

by lastninja ( 237588 ) writes: on Wednesday April 16, 2008 @05:17PM (#23095806)

only half kidding

- Re: (Score:2)
  
  by fishybell ( 516991 ) writes:
  
  Just what we need, some 'bot adding its insightful comments based on other words in the same document...then again, on most sites, would you be able to tell the difference between Google posting something and some 1337 kiddiez?!?!!1eleven?
  - Re:Will it solve captchas? (Score:5, Funny)
    
    by skraps ( 650379 ) writes: on Wednesday April 16, 2008 @05:39PM (#23096004)
    
    Just what we need, some 'bot adding its insightful comments based on other words in the same document.
    Are such questions on your mind often?
    ..then again, on most sites, would you be able to tell the difference between Google posting something and some 1337 kiddiez?!?!!1eleven?
    What does that suggest to you?
    
    - Re:Will it solve captchas? (Score:5, Funny)
      
      by urcreepyneighbor ( 1171755 ) writes: on Wednesday April 16, 2008 @06:05PM (#23096328)
      
      You whore! You told me you loved me, Eliza! You said you'd call!
      
Forums? (Score:5, Funny)

by fishybell ( 516991 ) writes: <fishybell@hotmaE ... m minus math_god> on Wednesday April 16, 2008 @05:18PM (#23095814) Homepage Journal

Well, I certainly hope that they put in some decent smarts to prevent it from making posts onto forums, blogs, /., etc.

On the plus side, this should enable Google to get by the "Must be 18 to view" buttons ;)

- Re: (Score:3, Informative)
  
  by brunascle ( 994197 ) writes:
  
  As TFA states, it's only GET requests, not POSTs, so it would mostly be search queries.
  - Re: (Score:2)
    
    by fishybell ( 516991 ) writes:
    
    ...and porn. You can't forget the porn.
  - Re: (Score:2)
    
    by MenTaLguY ( 5483 ) writes:
    
    Unfortunately a lot of developers misuse GET requests for actions which modify state. (I suppose this'll teach them...)
    - Re: (Score:2)
      
      by Bogtha ( 906264 ) writes:
      
      The usual excuse for that is that they want a link — for aesthetic purposes, to put in an email, etc. If you're using a form anyway, those reasons disappear. I'm sure there are a few developers who screw this up, but it won't be anywhere near as common as the problems GWA uncovered.
- Re: (Score:3, Funny)
  
  by spintriae ( 958955 ) writes:
  
  Google's only 12 years old. It shouldn't be visiting those sites.
HELLO I AM GOOGLEBOT (Score:5, Funny)

by Anonymous Coward writes: on Wednesday April 16, 2008 @05:19PM (#23095828)

I am just submitting this form to see what's behind it. PLEASE IGNORE ME.

- Re:HELLO I AM GOOGLEBOT (Score:5, Funny)
  
  by Anonymous Coward writes: on Wednesday April 16, 2008 @05:21PM (#23095848)
  
  I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
  
  - Re:HELLO I AM GOOGLEBOT (Score:4, Funny)
    
    by Anonymous Coward writes: on Wednesday April 16, 2008 @05:26PM (#23095896)
    
    I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
    
    - Re: (Score:2, Interesting)
      
      by Anonymous Coward writes:
      
      I am just submitting this form to see what's behind it. PLEASE IGNORE ME.
    - - Eeek, you found me! (Score:3, Funny)
        
        by CheeseTroll ( 696413 ) writes:
        
        n/t
- Re: (Score:2)
  
  by Anonymous Brave Guy ( 457657 ) writes:
  
  HTTP/1.1 426 Upgrade Required
  Upgrade: Common courtesy/1.0, HTTP/1.1
Forums, and "web 2.0" sites. (Score:2)

by PyroMosh ( 287149 ) writes:

This brings up a concern from the description.

So Googlebot will come across a web page.
It follows a link.
The link leads to a page with a form.
Googlebot fills out the form based on content already on the site.
Googlebot clicks submit.
Googlebot goes to the next page, and continues to follow links.

The problem comes when that form was a post form like the one I am typing on right now for a forum, or some other type of form to create user generated content. This makes it seem like google will see the text box an
- Re: (Score:2)
  
  by Idiomatick ( 976696 ) writes:
  
  Google indexes more than any other search engine by expanding the web themselves. It was moving too slow for them.
  
  Really though, I don't think this will be a problem. People at Google are pretty smart and I'm sure they've thought of this. Even if you believe Google is evil, there's no evil corporate benefit to spamming garbled text to the entire internet.
- Re: (Score:2)
  
  by mmkkbb ( 816035 ) writes:
  
  They will use Markov chains which may end up sounding more intelligent than many forum denizens. Fark, Free Republic, LGF, etc. won't even notice.
- Re: (Score:2)
  
  by Simon (S2) ( 600188 ) writes:
  
  This makes it seem like google will see the text box and input random content from the site, then post it.
  No. Googlebot will only do gets, not posts.
- Re: (Score:2)
  
  by menace3society ( 768451 ) writes:
  
  I am tempted to copy and paste that and post it as my reply, but I think that would be insufferably clever. So, too, is referring to the fact that I could be insufferably clever, but choose not to be. Etc...
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re: (Score:3, Funny)
    
    by enoz ( 1181117 ) writes:
    
    Any forum that can't stop a "good" bot is going to have spam all over it anyway from the "bad" ones...
    C'mon there's no point in Google launching a war against phpBB, there are more than enough spambots doing that already.
- Re: (Score:3, Informative)
  
  by Z80xxc! ( 1111479 ) writes:
  
  Seems to me it would be easy enough to detect the googlebot user agent, then if so, automatically redirect it to the page on the other end (or even send it to a random 404 page or something), all without processing the form data at all.
  
  <?php if ($_SERVER['HTTP_USER_AGENT'] == "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)") { header('Location: /landing_page.php'); exit; } else { processtheform(); } ?>
  Of course, this would have to be implemented, which would b
good and bad (Score:4, Insightful)

by ILuvRamen ( 1026668 ) writes: on Wednesday April 16, 2008 @05:20PM (#23095838)

Well first of all, it's about time they learn how to read advanced sites! If your site is dependent on input from the user to display content, you're basically invisible to google. Now all they need is something to read text in flash files and they've got something going. But on the other hand, this is almost auto-fuzzing which could be considered hacking and I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.

- Re:good and bad (Score:5, Insightful)
  
  by QuoteMstr ( 55051 ) writes: <dan.colascione@gmail.com> on Wednesday April 16, 2008 @05:40PM (#23096022)
  
  And should we not make any progress because we might step on a few toes while doing it? If Google can get into your uber-secret-private-database, so can a random user, or a random Russian cracker. Fix your damn site if you're worried about this particular attack.
  
  - Re: (Score:3, Funny)
    
    by martin-boundary ( 547041 ) writes:
    
    Fix your damn site if you're worried about this particular attack.
    
    Nope. I'll just refer them to the DMCA anti circumvention provisions [chillingeffects.org]. Let those damn phd kids fix their damn algorithms or get the hell off my damn lawn :)
- Re:good and bad (Score:5, Insightful)
  
  by Bogtha ( 906264 ) writes: on Wednesday April 16, 2008 @06:02PM (#23096264)
  
  Now all they need is something to read text in flash files and they've got something going.
  
  They've indexed Flash for about four years now.
  
  I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.
  
  No doubt. There are a lot of clueless developers out there who insist on ignoring security and specifications time and time again. I have no sympathy for people bitten by this, you'd think they'd have learnt from GWA that GET is not off-limits to automated software.
  
- Re: (Score:2)
  
  by Metasquares ( 555685 ) writes:
  
  expose data that's supposed to be protected and private
  Ugh, it's the friend class of the entire Internet!
I'm in your Intarwebs (Score:2, Funny)

by Mathus ( 941922 ) writes:

Cracking your forms. Sorry, could not help myself.
robots.txt (Score:5, Funny)

by B3ryllium ( 571199 ) writes: on Wednesday April 16, 2008 @05:37PM (#23095986) Homepage

Okay, so how long until the spec for robots.txt is updated to have a "DontBeStupid" directive?

Note to self... (Score:4, Funny)

by fahrbot-bot ( 874524 ) writes: on Wednesday April 16, 2008 @05:38PM (#23095990)

our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML...
...post invoice forms ordering expensive items to be shipped to Google. Be sure to log incoming IP addresses for verification.

Heisenberg for web (Score:2)

by gmuslera ( 3436 ) writes:

While you can have links that perform actions and change information, submitting forms is a good recipe for massive changes, from comment spam to anything; the sky is the limit.

Now you can't see what is on the web, by crawling, without changing it.
The Internet is for Porn (Score:5, Funny)

by kiehlster ( 844523 ) writes: on Wednesday April 16, 2008 @05:49PM (#23096118) Homepage

If you haven't already noticed, AdSense has features now to tell Google how to log into your website so it can catalog your user-only pages. You know what that means. Porn sites are going to start using this so that Googlebot can confirm that its age is over 18. We'll be showered with a gigantic wave of pornographic information. We will soon have to press juvenile charges against a corporate entity because it lied about its age on web forms to gain access to pornography and forum discussions.

- Re: (Score:2)
  
  by CheeseTroll ( 696413 ) writes:
  
  Didn't you learn during the '90s that dotcoms age in Internet Years [webxpress.com]? :-)
directions like 'nofollow' are still respected (Score:5, Informative)

by frovingslosh ( 582462 ) writes: on Wednesday April 16, 2008 @05:53PM (#23096160)

Nevertheless, directions like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search.
Maybe they shouldn't be, at least not in all cases. Several years back I had done many Google searches for some information that was very important to me, but never could find anything. Then a few months later (too late to be of use), pretty much by a fortunate combination of factors but with no help from Google, I came across the exact information, on a .GOV website in a publicly filed IPO document. As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record. When these nofollow directives are overused by mindless and unaccountable bureaucrats, perhaps someone needs to make the decision that these records should be public and that isn't best served by hiding them deep down a long list of links where they are hard to locate. In cases like this I would applaud any search engine that ignores the "suggestion" not to index public pages just because of an inappropriate tag in the HTML. In fact, if I knew of any search engine that was indexing in spite of this tag, I would switch to them as my first choice search engine in an instant. For starters, I would suggest that any .GOV and any State TLD website should have this tag ignored unless there were a darn good reason to do otherwise.

- Re: (Score:2, Insightful)
  
  by QuantumHobbit ( 976542 ) writes:
  
  But they don't want you to find out that the moon landing was faked and that Jimmy Hoffa shot Kennedy while driving a car that runs on water. I agree with you. If you don't want people to know something don't put it on the web. If you want people to know put it on the web and let google send the people to you. It's all bureaucracy inaction.
- Re: (Score:3, Interesting)
  
  by Christophotron ( 812632 ) writes:
  
  As far as I can tell, our US government aggressively marks websites not to be indexed, even when they contain information that is posted there to be public record.
  
  I'd mod you up if I had some points. I'm sure there are ethical implications or something when it comes to respecting the website owner's wishes not to index, but it's all public information anyway. If it's on the web and I can look at it, then Google should be able to look at it and index it.
  I had no idea that government sites don't allow themselves to be indexed. That is BULLSHIT. People often NEED information from .gov sites and ALL of it should be made easy to find. Refusing to allow indexing
  - Re: (Score:2)
    
    by STrinity ( 723872 ) writes:
    
    Is there a law saying that search engines MUST follow these robots.txt, nofollow, etc?
    
    No, only Internet standards. No need to follow those antiquated things. Google can become the search equivalent of IE.
    - Re: (Score:2)
      
      by enoz ( 1181117 ) writes:
      
      The search equivalent to IE.... so being the dominant player, using a feature-limited interface, and prone to leaking private information?
      
      I think Google is already there.
- Re: (Score:2)
  
  by bigbigbison ( 104532 ) writes:
  
  While I don't see Google doing it because of the backlash I'm a bit surprised that no other search engine has touted ignoring "nofollow" and "noindex" as a "feature" of their search engine in the attempt to look like they are better than Google.
sites can still be excluded (Score:2, Flamebait)

by nurb432 ( 527695 ) writes:

Wimps. Index it all, who cares if the site doesn't want it. If it's public-facing it deserves to be indexed.
- - Re:sites can still be excluded (Score:4, Insightful)
    
    by danielsfca2 ( 696792 ) writes: on Thursday April 17, 2008 @12:44AM (#23100314) Journal
    
    I never understood the point of robots.txt crap. Why put the site up if you don't want people to find it?
    Well I'm glad you asked. The presence (and continued following) of the robots.txt standard is crucial for these reasons:
    
    - Scripts with potentially infinite results. If you have a calendar script on your site, that shows this month's calendar with a link to "next month" and "previous month" then without Robots.txt, the search engine could index back into prehistoric times and past the death of the Sun, with blank event calendars for each month. This is stupid. With your robots.txt file you tell the spider what URLs it's in BOTH your best interests not to crawl. You save server resources and bandwidth, Google saves their time and resources.
    
    - If you have a duplicate copy or copies of your site for development, or perhaps an experimental "beta" version of your site, you don't want it competing with the real site for search engine placement, or worse, causing SE spiders to think you're a filthy spammer with duplicate content all over the place. So you disallow the dupes with robots.txt. Now sure ideally that server could be inside your firewall instead of on the Web, but it gets more challenging when your dev team is on a different continent.
    
    - Temporary crap that has no value to the outside world, once again, it's a waste of both yours and the search engine's time to index it.
    
    The above are all reasons why you might want some or all of the content on a site not indexed.
    
  - - Re: (Score:2)
      
      by nurb432 ( 527695 ) writes:
      
      Then don't put your site public facing.
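
danielsfca2's calendar and beta-site cases in this thread can be checked from the crawler's side with Python's standard `urllib.robotparser`. This is a minimal sketch: the robots.txt body, the `/calendar/` and `/beta/` paths, the `ExampleBot` agent name and the helper names are all assumptions for illustration, and a well-behaved spider would run a check like this before every fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep spiders out of the endless calendar
# and out of the duplicate beta copy of the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /beta/
"""


def build_parser(text):
    """Parse a robots.txt body that has already been downloaded."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


def may_crawl(rp, url, agent="ExampleBot"):
    """True if the rules allow `agent` to fetch `url`."""
    return rp.can_fetch(agent, url)
```

Nothing here stops a crawler that simply skips the check, which is the "it is voluntary" objection raised elsewhere in the discussion.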
Fuzzing the world (Score:3, Insightful)

by corsec67 ( 627446 ) writes: on Wednesday April 16, 2008 @05:59PM (#23096222) Homepage Journal

Sweet, now Google will be Fuzzing [wikipedia.org] the entire web.

How will this work for forms that perform translations, validations and similar kinds of operations on other websites? Try to pull the entire internet through each such site it finds?

And then not every web development environment forces GET to not change data. In Ruby on Rails, adding "?method=post" to the end of a URL fakes a POST, even though it is actually a GET, which I disabled in the company I work for. Not everyone is going to do that.

- Re: (Score:3, Insightful)
  
  by Bogtha ( 906264 ) writes:
  
  In Ruby on Rails, adding "?method=post" to the end of a URL fakes a POST, even though it is actually a GET, which I disabled in the company I work for. Not everyone is going to do that.
  More precisely: Not everyone has been doing that. I'm sure when Google comes along and exposes all their bugs they will quickly take the hint.
  I don't really see the problem. The developers who know what they are doing, like you, won't be adversely affected, while the incompetent developers have to scurry around fi
  - - Re: (Score:2)
      
      by Bogtha ( 906264 ) writes:
      
      I wouldn't be surprised if they did that, after all they did a similar thing with GWA and URLs with query strings. But I can't help but think it's a silly path to take. It makes an "unwritten rule" of HTTP that certain magic strings are off-limits, and of course, no specification contains a list of these magic strings, you have to reverse engineer other software for them.
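
The fix corsec67 describes in this thread (not letting a query parameter turn a GET into something else) amounts to honouring a method override only when the real request is already a POST. A minimal Python sketch, not Rails code: the function name `effective_method` is hypothetical, and the parameter name `method` is taken from the comment rather than from any framework's documentation.

```python
def effective_method(real_method, params, override_key="method"):
    """Return the HTTP method the application should act on.

    An override parameter is honoured only when the request really was a
    POST, so a crawler following a plain GET link can never escalate it.
    """
    real = real_method.upper()
    override = params.get(override_key, "").upper()
    if real == "POST" and override in {"PUT", "DELETE", "PATCH"}:
        return override
    return real
```

Under this rule a GET carrying the override parameter is still treated as a GET, so automated form submission over GET cannot reach handlers meant for POST.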
Evil Bot (Score:2)

by Arancaytar ( 966377 ) writes:

For text boxes, our computers automatically choose words from the site that has the form

And a few relevant URLs from helpful sponsors?

Now you just need to hire a few sweatshop workers to get past those pesky captchas...
Anecdote from Google (Score:5, Funny)

by arrrrg ( 902404 ) writes: on Wednesday April 16, 2008 @06:12PM (#23096424)

When I interned at Google, someone told me a funny anecdote about a guy who emailed their tech support insisting that the Google crawler had deleted his web site. At first, I think he was told that "Just because we download a copy of your site, doesn't mean your local copy is gone." (à la obligatory bash [bash.org].) But, the guy insisted, and finally they double checked and his site was in fact gone. Turns out that it was a home-brewed wiki-style site, and each page had a "delete" button. The only problem was, the "delete" button sent its query via GET, not POST, and so the Google spider happily followed those links one-by-one and deleted the poor guy's entire site. The Google guys were feeling charitable and so they sent him a backup of his site, but told him he wouldn't be so lucky the next time, and he should change any forms that make changes to POSTs; GETs are only for queries.

So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data on random forms on the web. Like the above guy, people may not design their site with such a spider in mind, and despite their lack of foresight this could kill a lot of goodwill if done improperly.
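To make the GET-versus-POST point concrete, here is a minimal sketch of a request handler that refuses to delete anything unless the request is a POST. The `handle` function, the `/delete` and `/view` paths, and the `page` parameter are all hypothetical names made up for illustration; this is not the code of the site in the anecdote.

```python
from urllib.parse import urlparse, parse_qs


def handle(method, url, pages):
    """Return an HTTP-style status code; only POST is allowed to delete a page.

    `pages` is a hypothetical in-memory store mapping page names to content.
    """
    parsed = urlparse(url)
    name = parse_qs(parsed.query).get("page", [""])[0]

    if parsed.path == "/delete":
        if method != "POST":
            # A crawler that merely follows a link (GET) ends up here.
            return 405  # Method Not Allowed
        pages.pop(name, None)
        return 200

    if parsed.path == "/view":
        return 200 if name in pages else 404

    return 404


pages = {"Home": "welcome", "About": "about us"}
crawler_status = handle("GET", "/delete?page=Home", pages)   # spider follows the link
owner_status = handle("POST", "/delete?page=About", pages)   # owner submits the form
```

With a handler shaped like this, a spider that follows every link gets a 405 on the delete URL and the page survives, while a deliberate form submission still works.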

- Re: (Score:2)
  
  by Arimus ( 198136 ) writes:
  
  [blockquote]
  end and submitting random data on random forms
  [/blockquote]
  
  Sod worrying about zapping sites, what will happen when they crawl the nuclear launch site and enter random data into the authorisation field, and in a rare feat of sod's law end up getting the code just right....
  
  (oh and what's the betting they'll put redmond in as a target string?)
- Re: (Score:2)
  
  by RyoShin ( 610051 ) writes:
  
  This could be the incident you speak of. [thedailywtf.com] :)
  
  (Or at least super similar.)
- Re: (Score:2)
  
  by sootman ( 158191 ) writes:
  
  That happened to me on a database demo site that I did. The 'edit,' 'details,' and, yes, 'delete' buttons were just plain old text links. I posted the URL of the page to a mailing list, Google came in through that, and methodically 'clicked' on each link, including the 'delete' ones. (There was even a confirmation page with 'Are you sure you want to delete this? _Yes_ or _No_'--as links, of course.) I went to show it to someone one day and all the data was gone. It was just sample data, so no great loss. I fig
I bet I know what's next (Score:2)

by 93 Escort Wagon ( 326346 ) writes:

In a few months, there'll be a new blog post - Google will attempt to access and index all sites' password-protected pages by matching usernames found elsewhere on the site (e.g. from email addresses) with intelligent guesses at passwords, based on information it's gleaned regarding those individuals. Failing that, it'll run through entries found in various cracker dictionaries.
In other news, (Score:2, Insightful)

by mbstone ( 457308 ) writes:

Google has announced that Google Phones (beta) will soon unveil the results of its having wardialed all 6,800,000,000 U.S. telephone numbers. Visitors to the Google Phones site will be able to search individual phone numbers to determine (without personally dialing the number) whether the number belongs to a landline telephone, cell phone, fax, or modem.

On phone numbers where a VMS is detected, Google plans to dial "#0#" and other codes in order to determine how to reach a human.

"Since we are a big, rich e
"getindex" Google Keyword? (Score:2)

by Doc Ruby ( 173196 ) writes:

The result of repeatedly querying to extract every permutation of their API could be much larger than their underlying data (think of the combinatorics of only 5 query fields of only 5 values each, against only a couple of hundred values in the database, as at many sites), and far too much traffic for small sites (and probably for big sites, too, if their combinations of queries at all match their traffic).

What could be even better would be if sites that don't want to get that huge load just to have their data searchabl
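The combinatorics mentioned above can be worked out in a few lines. This is only an illustrative sketch: the field names and option values are invented, and "a couple of hundred values" is taken as 200 for the sake of the arithmetic.

```python
from itertools import product

# Hypothetical form: 5 query fields, each offering 5 possible values.
fields = {f"field{i}": [f"value{j}" for j in range(5)] for i in range(5)}

# Every distinct way of filling in the form is one potential crawler request.
queries = list(product(*fields.values()))

rows_in_database = 200  # assumed stand-in for "a couple of hundred values"
requests_per_row = len(queries) / rows_in_database
```

Here `len(queries)` is 5**5 = 3125, so exhaustively submitting the form would cost more than fifteen requests for every row actually stored.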
Forms that create agreements (Score:2)

by Russ Nelson ( 33911 ) writes:

The problem with their searching is a form like this one: http://quaker.org/users.cgi [quaker.org] It's *meant* to keep people out unless they've entered into a legal agreement.
- Re: (Score:3, Informative)
  
  by dave420 ( 699308 ) writes:
  
  That is a POST form, which Google have said they will not mess with.
How to become a millionaire in four easy steps! (Score:2)

by PAjamian ( 679137 ) writes:
1. Set up a shopping cart which is lax on security and uses GET forms instead of POST forms
2. Put one item in the shopping cart, a used tic tac box for 1 million dollars (it's a collector's item)
3. Wait for the google bot to buy the tic tacs with the corporate credit card
4. Profit!!!!
Title Correction (Score:2, Insightful)

by awyeah ( 70462 ) * writes:

"Technology: Google fills your backend database with garbage"
They're in for a surprise when... (Score:2)

by Dr. Zowie ( 109983 ) writes:

... they hit the Solar Dynamics Observatory database next year. It'll be collecting several petabytes of images...
I can't wait (Score:2)

by sentientbrendan ( 316150 ) writes:

until the google trawler starts making its own first posts.
- Re:Google, consider this... (Score:4, Insightful)
  
  by poot_rootbeer ( 188613 ) writes: on Wednesday April 16, 2008 @05:51PM (#23096134)
  
  Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create? I speak mainly of feedback forms, e-mail signups, and the like.
  
  If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.
  
- ROBOTS.TXT & CONTENT="NOINDEX", "NOFOLLOW" (Score:2)
  
  by Chyeld ( 713439 ) writes:
  
  http://www.robotstxt.org/ [robotstxt.org]
  
  Dang, that was hard. Damn you, GOOGLE! Damn you to HELL! You blew it up! You finally blew up the web!
  
  Or not.
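For anyone who wants to check what a given robots.txt actually blocks, Python's standard library ships a parser. The sketch below is an assumption-laden example: the rules, the `/signup` and `/feedback` paths, and the example.com URLs are invented for illustration and say nothing about how Googlebot itself evaluates the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers away from two form handlers.
rules = """\
User-agent: *
Disallow: /signup
Disallow: /feedback
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse in-memory lines instead of fetching a live file

signup_allowed = parser.can_fetch("Googlebot", "http://example.com/signup")
article_allowed = parser.can_fetch("Googlebot", "http://example.com/articles/1")
```

In this sketch `signup_allowed` comes out False and `article_allowed` comes out True, which is the exclusion the summary says is still respected.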
- Re:Google, consider this... (Score:4, Funny)
  
  by Kristoph ( 242780 ) writes: on Thursday April 17, 2008 @01:26AM (#23100578)
  
  Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create?
  
  If any forms which feed your DB are GET style, aren't user authenticated and/or don't use a CAPTCHA then you already have a huge trash data problem. At least the googlebot won't offer to enlarge your penis.
  
  ]{
  
- Re: (Score:2)
  
  by profplump ( 309017 ) writes:
  
  They are only submitting forms with a GET method. According to the HTTP specs, GET requests should always be idempotent. If you've got forms that use the GET method and aren't idempotent you should *already* be taking extra precautions to avoid accidental use by bots and other automated tools.
  - Re: (Score:2)
    
    by GryMor ( 88799 ) writes:
    
    You do realize that 'delete' is idempotent, right?
    
    Idempotence simply requires that:
    f(STATE) == f(f(STATE))
    It doesn't require that:
    STATE == f(STATE)
    
    So Idempotent actions can cause state changes, such as deleting an item.
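The distinction can be checked directly. In this small sketch, `delete_item` and `append_item` are hypothetical stand-ins for f, and the state is just a Python collection; it only illustrates the definition and is not tied to any particular web framework.

```python
def delete_item(state, key):
    """Return a new state with `key` removed; deleting a missing key is a no-op."""
    return frozenset(state) - {key}


def append_item(state, item):
    """Return a new state with `item` appended; repeating this keeps growing the state."""
    return tuple(state) + (item,)


state = frozenset({"a", "b", "c"})
deleted_once = delete_item(state, "b")
deleted_twice = delete_item(deleted_once, "b")

log = ("x",)
appended_once = append_item(log, "y")
appended_twice = append_item(appended_once, "y")
```

Deleting twice gives the same state as deleting once (idempotent), yet that state differs from the original, so the action is still destructive. Appending twice differs from appending once, so it is not idempotent.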
- Re: (Score:2)
  
  by TheRaven64 ( 641858 ) writes:
  
  I'm surprised it isn't doing this already, from the number of 'search results' pages an average Google search turns up.
- Re: (Score:2)
  
  by corsec67 ( 627446 ) writes:
  
  ... I didn't like our search results showing up in theirs.
  
  And I hate it when a search result goes to... another page of search results. "You searched for 'perpetual motion engine'. Here are links to pages of us doing that search on other sites as well." Not very useful.
  
  It isn't easy to programmatically tell the difference, but it seems like this would make that happen much more often.
- Re: (Score:3, Informative)
  
  by stephanruby ( 542433 ) writes:
  
  Does that mean I'll have to introduce methods that waste people's time in order to prevent google from registering on my site multiple times?
  Yes, if you require all your human visitors to read your robots.txt [robotstxt.org], and then require them to check a checkbox to mean that they clearly read and understood the entire body of your robots.txt. Then yes, you'll have to introduce some sort of almost impossible-to-read translucent captcha written in classical Chinese.
- Re: (Score:2)
  
  by aXis100 ( 690904 ) writes:
  
  I'd say there is plenty of value behind forms. They're not just for submitting an application, some places use them as a navigation front end.
  
  What about online stores with combos / search fields, but no direct index?
  What about forums with a guest login?
- Re: (Score:3, Informative)
  
  by dave420 ( 699308 ) writes:
  
  Of course they could link to a site and make the browser perform a POST. That's trivial. A form and some javascript will do that no problem. They seem to not be doing that because GET forms should be non-destructive, whereas POST forms can be quite destructive.
