Apache Software

Robotcop: It's the Law

Voivod writes: "Inspired by the recent Slashdot and Evolt discussions about Blocking Bad Spiders, we set out to write an Apache module that solves this problem. The result is Robotcop and it's ready for action. We believe that it's the best solution to protecting Apache webservers from spiders currently available. Install it and help us make life hell for e-mail harvesting software!"
This discussion has been archived. No new comments can be posted.

  • ...and they'll just invent a better spider.
    • ...and is it just me, or are ads starting to appear on comment pages?
    • What, you don't think we can win an arms race against the degenerates who write email harvesting software? :-)

      Right now it provides pretty good protection, especially if the spider needs to get in and out of the site within a set period of time. If you can think of ways to circumvent how Robotcop works, please point them out so we can figure out a solution!

      • Re:Arms Races (Score:4, Insightful)

        by friscolr ( 124774 ) on Tuesday March 12, 2002 @01:15PM (#3150600) Homepage
        If you can think of ways to circumvent how Robotcop works, please point them out so we can figure out a solution!

        looking over the technical review [robotcop.org] and the readme [robotcop.org], a few initial, random, and sporadic thoughts:

        the blocking of valid users seems rather annoying (NAT users, some proxy users) and a bad spider could get around the short interval by increasing its sleep time.

        IPv6 could screw your implementation. If i have access to a huge number of IP addresses then i could access your website through any one of those addresses. A spider could run an initial probe of a few million websites through one ip, change ips, then grab a second page from all those websites, change ips, grab webpage, etc etc.

        if i know a website is running robotcop, can i screw over valid users by forging my ip address, accessing robots.txt, then accessing a honeypot dir? can i screw over all users by cycling through all ips and doing this (yeah that's time consuming, maybe i could just screw over users from one range?)?

        The main problems i see with the robotcop approach are that it assumes everyone who accesses robots.txt is a robot and that it assumes valid users will not follow certain paths through the website.
        This is different for email poisoners b/c if i'm a user and i get to page with a bunch of (invalid) email addresses, it doesn't matter. i click back and continue on my way. but for something that actually *blocks* users, it's a bit different.

        As it stands now, i could go to an internet cafe (often they use nat) and block every other user from seeing any site protected by robotcop.

        How about tying both User-Agent and IP address to form valid/invalid users? that way a bad user behind NAT might get blocked while a good user could go on. The more information you can tie to one particular thread of access, the more likely you are to single out one particular user.

        Instead of only blocking ips that seem to be bad spiders, why not feed them specific information? that way if it is a user you can let them go on - "if you are a valid user, enter the word in the graphic below in this text field and click 'ok'!"

        It really seems that whatever you do, it is possible to work around. Set cookies? i write a bot that keeps track of cookies. hidden webbugs/urls? my bot avoids these.
        I can see robotcop as working in small cases, like for a limited number of servers on the internet, b/c then it is not worth the bot writer's time to implement work arounds. But once it becomes worth their time, you have a game of evolution.

        Not that that's bad; keep a small enough base of users and you probably won't need to update methods all that often.

        • How about tying both User-Agent and IP address to form valid/invalid users?

          on second thought if you did this then a spider could generate varying user_agents to get around robotcop.

          maybe once you've served someone a page asking for identification of sentience you could use these and other params (ACCEPT_LANGUAGE, etc) to identify a valid user, but not to identify a robot.

        • > How about tying both User-Agent and IP
          > address to form valid/invalid users?

          This is a *fantastic* improvement. I'm going to add this to the next version.

          Thanks for your other comments as well. I agree that some kind of, "We think you may be an evil spider. Do this to prove that you aren't..." would also be a good improvement. It's just a matter of coming up with the best method.

          > hidden webbugs/urls? my bot avoids these.

          Not to, uh, ask you to reveal trade secrets or anything, but does your bot actually fetch button images to make sure they're not transparent or something? I agree that it's possible to detect and avoid most of the common "fake link" techniques but 100% seems very unlikely.
          • Re:Arms Races (Score:2, Interesting)

            by J'raxis ( 248192 )
            What about a bot set to change user-agents on the fly? Just collect the few most-popular UAs from other people's website logs, and use each one at random. Add in a list of open proxies to bounce through and you have a nearly undetectable spider at work. I believe I can do this in about a dozen lines of Perl.

            Maybe you could thwart this by seeing if there are traversal patterns coming from all over the place ("GET /a" from 1.2.3.4, "GET /a/a.html" from 6.7.8.9, "GET /b" from 45.56.56.67, and so on), but that seems like a lot of work and could again be defeated by some randomization.
          • Not to, uh, ask you to reveal trade secrets or anything, but does your bot actually fetch button images to make sure they're not transparent or something?

            heh. i actually don't have any such bot, but meant to imply that it would be possible to avoid them.
            the most i do with robots on my website is feed them special links - you can load robots.txt on my site and note the minute changes.

            do bad spiders follow predictable patterns? like always entering at /, as opposed to ever having referers like from a google search? pretty much any patterns you can identify will help develop heuristics to distinguish between valid users and robots.

            my coworker [rexroof.com] points out you only need 2 ips to get around just robots mode - one to get robots.txt, other to traverse dirs that robots shouldn't be traversing.

            you should focus most on easily modified identification methods, as well as ease of configuration. maybe eventually there would be room for a distributed fingerprint database of known spiders, or known spider ips, that you could use in conjunction with current methods.
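
A minimal Python sketch of the identification idea from this sub-thread, not Robotcop's actual code: clients are keyed either by IP alone or by the (IP, User-Agent) pair, and a client that fetches robots.txt and then requests a disallowed path gets blocked, roughly what the thread calls robots mode. The `Tracker` class, the `/honeypot/` prefix, and the addresses are all made up for illustration.

```python
from dataclasses import dataclass, field

ROBOTS_PATH = "/robots.txt"
DISALLOWED_PREFIXES = ("/honeypot/",)  # hypothetical trap directory


@dataclass
class Tracker:
    """Toy request tracker; keys clients by IP or by (IP, User-Agent)."""
    use_user_agent: bool = True
    read_robots: set = field(default_factory=set)
    blocked: set = field(default_factory=set)

    def _key(self, ip, user_agent):
        return (ip, user_agent) if self.use_user_agent else (ip,)

    def handle(self, ip, user_agent, path):
        """Return True if the request is served, False if it is blocked."""
        key = self._key(ip, user_agent)
        if key in self.blocked:
            return False
        if path == ROBOTS_PATH:
            self.read_robots.add(key)  # remember who has seen the rules
            return True
        if key in self.read_robots and path.startswith(DISALLOWED_PREFIXES):
            self.blocked.add(key)  # read the rules, then broke them
            return False
        return True


def demo(use_user_agent):
    """Replay the shared-NAT scenario and return the three outcomes."""
    t = Tracker(use_user_agent=use_user_agent)
    nat_ip = "203.0.113.7"  # documentation address standing in for a NAT
    return [
        t.handle(nat_ip, "BadBot/1.0", "/robots.txt"),
        t.handle(nat_ip, "BadBot/1.0", "/honeypot/a.html"),
        t.handle(nat_ip, "Mozilla/4.0", "/index.html"),
    ]
```

Keyed on IP alone, the human behind the same NAT is locked out along with the spider. Keyed on the pair, the human gets through, but a spider that switches User-Agent between the robots.txt fetch and the trap request is never matched, which is the weakness raised in the replies.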

    • I don't think so, have you read the page?
      The only way not to get trapped by robotcop is to obey the robots.txt rules file.

      what's the difference between a bad robot obeying rules and a good one?

      sure ppl may try to work on robots that don't fall in the honey pot, but those dirs on real servers are not going to be as obvious as in the example. i do think however the random pages that a trapped bot gets are too much of a giveaway that you're trapped, but they're now in a losing fight!
  • by regen ( 124808 ) on Monday March 11, 2002 @05:52PM (#3145545) Homepage Journal
    It's an interesting idea, but it looks like the spiders have to be well behaved to get caught. If the spider never reads the robots.txt file and it claims to be a friendly user agent (not a spider) it seems the only way it could get caught is if it falls into a trap directory. This doesn't seem likely.

    How do you prevent users from finding and falling into a trap directory? It seems like it wouldn't be that difficult to write a spider to get around the restrictions imposed by robotcop.

    Am I missing something?

    • by Voivod ( 27332 ) <cryptic@g m a il.com> on Monday March 11, 2002 @06:03PM (#3145601)
      You just put hidden links in your HTML which only a spider's HTML parser would notice and follow. This technique is already widely used by wpoison [monkeys.com] which is a Perl CGI solution to the spider problem.

      Check out the robotcop.org [robotcop.org] site. It has examples of how to set all this up.
      • The idea of a hidden link that the spider would follow seems like a good idea, but I don't see an example on the robotcop site. Do you have a direct link?

        I guess I could see setting this up similar to a web bug, 1x1 pixel image with link same color as background, but you could then modify the spider not to follow those types of links.

        I think this will be better than nothing, but you start to enter a robotcop / spider arms race.

        • Sorry, you're right - I'll add some example honeypot links to the documentation. The simplest example I can think of is something like this:

          <a href="/honeypot" alt=""><b></b></a>

          There are infinite variations on this theme, like the transparent gif you mentioned, which makes it very difficult for evil spiders to avoid them. Just make sure you test with lynx/w3m first. :-)

          • There are infinite variations on this theme, like the transparent gif you mentioned, which makes it very difficult for evil spiders to avoid them. Just make sure you test with lynx/w3m first.

            Just make sure that it takes at least two steps to get from content to the honeypot. This way, it becomes much more difficult to accidentally tab to a link and activate it, shutting off an entire ISP's proxied access to the web server.

      • I can just see some legit, curious AOL user viewing source and going to that "forbidden" link...then, bam, you've declared AOL a "spider".
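
To see both sides of the hidden-link trick discussed in this thread, here is a rough Python sketch using the standard library's `html.parser`. It splits the links on a page into those with visible text and those without. A naive harvester follows both lists; a smarter one skips the second, which is the arms race described above. The sample page and the `split_links` helper are illustrative assumptions, not anything taken from Robotcop or from a real spider.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, visible_text) for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def split_links(html):
    """Return (links with visible text, links without any text)."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    visible = [href for href, text in parser.links if text]
    invisible = [href for href, text in parser.links if not text]
    return visible, invisible


# Sample page: one normal link plus the honeypot link quoted in the thread.
PAGE = ('<p><a href="/about">About us</a> '
        '<a href="/honeypot" alt=""><b></b></a></p>')
```

This heuristic would also skip ordinary image-only links, so a spider that relies on it misses real content, and a trap built from a transparent GIF looks no different to it than a legitimate button.
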
  • Perhaps not google per se but there are a lot of legit uses for spiders. Uses that are legal and good for your site and the internet in general. I would be concerned that this is going to cause them some undue issues.

    • Re:Google (Score:1, Informative)

      by Anonymous Coward
      Yes, there are probably some good spiders out there, but the good ones should pay attention to the robots file to see if they're allowed to harvest there. If the spider is written to follow some basic rules of good internet behavior it won't encounter any problems.
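
For reference, "paying attention to the robots file" can be as small as the following on the spider side. This sketch uses Python's standard `urllib.robotparser`; the robots.txt contents, the `ExampleBot` name, and the example.com URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that fences off a trap directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /honeypot/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A well-behaved spider would call something like `allowed("ExampleBot", url)` before each fetch and skip any URL for which it returns False.
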
  • The project says that it feeds malignant spiders poisoned addresses. Don't people check their addresses for addresses that don't deliver? Is this useful? I like the teergrube [tuxedo.org] idea better. Can you modify apache to do this?
    • Yeah, poisoned addresses are stupid. We put the feature in because it seemed like the previous solutions focused on this, but really what Robotcop is about is protecting the website. Screwing with the spider is a second priority.

      It's trivial for email harvesters to run MX lookups on the collected domains to see if they're valid, and spammers are used to "test runs" of millions of addresses to weed out the bad ones, so giving them garbage e-mail addresses probably has little impact on their operation.

      • The point of feeding millions of bad addresses to people isn't to get rid of spammers per se, it's to make the 'millions of addresses' CDs wronger.

        In other words, while a fake address wouldn't be that useful to give to one spammer, a lot of the harvester bots are just people selling the addresses to other people. Hence, a single bad address might go to a thousand people. If they get sold 'a million addresses', and 50% of those are crap, then we've just cut the amount of spam sent in half. (And, just as importantly, we've cut the number of responses in half.)

        Some email address sellers claim the addresses are 'verified', but they usually aren't.

        Some people consider it an added bonus that this makes email address harvesters look unreliable, which they are anyway, though that is really counteracted by the fact that the people being 'helped' by this are the spammers. But sowing distrust in that industry is a good thing, even if it does indirectly help some spammers realize that address sellers are lying.

        • 3 things...

          1- why not add valid addresses that get sent to /dev/null? e.g. aaaaaa@example.com through zzzzzz@example.com. you'd get a substantial amount of traffic for that many addresses, but you could modify the amount of addresses to whatever your bandwidth/server could handle.

          2- what are the legalities involved in creating a webpage with a specific email address, perhaps send.mail.here.to.be.charged@example.com, and placing it *only* on a webpage with a blatant notice of "If you mail this address you allow me to charge you $100/byte sent to this address" or a more specific terms of use (in order to encompass selling the address to others) and then charge once you get mail to that address? could a terms of use be created that would make getting money legal?

          3 - how many of you use Matt Wright's (*shudder* when you hear his name) formmail? how many of you use fake formmail scripts?

          for a while now i've been using a fake formmail script that only prints out a webpage saying "thank you for using this script" but doesn't actually send mail. Some people see that output (ignoring the html comment that says "I HATE YOU YOU STUPID PIECE OF SHIT"), think the script has worked, and run a program to submit spam to the script to "send" mail to a few thousand addresses.

          so far my fake script has saved thousands of addresses from getting spam. some people test the script with their address first and then don't come back when they don't get the mail, but i could modify the script to send out the first mail from an ip, but not the subsequent mail.

          but i'm wondering, has anyone else done work on this or heard of work like this?

          If you don't understand what i'm talking about in point 3- "Matt Wright" (is he a real person?) has a series of scripts, one is formmail.pl which allows mail to be sent to any address. some people search for servers with formmail.pl on them and use those scripts to pseudonymously send mail to other people. We had seen this quite a bit at work, which inspired me to create the fake formmail.pl.

          are there any other common scripts like formmail.pl that could be faked in the same manner?

          • >but i'm wondering, has anyone else done work on this or heard of work like this?

            I wrote one a few weeks ago that catches FormMail probes and mails a warning message to the person who's probing. Since putting the script in place across several domains, I've seen a significant decrease in repeat offenders. I used to get scanned by the same people day in and day out (i.e. the recipient value in the GET requests was the same), and I had a few who'd scan me weekly. Not anymore.

            FWIW, the script is here [shat.net]. It's written in PHP, so you'll have to either redirect requests for formmail.pl to the PHP version, or use a CGI wrapper (hence the shebang line at the top of a PHP script :)

            I also like the idea of a FormMail honeypot. Basically such a script would accept and deliver the first message received from any IP address; this would be the test message indicating that the probe was successful, so you'd want to make sure it was actually sent. Subsequent accesses to the script from the same /8 over the next 24 hours would generate log entries but not actually send mail, the spammer would be spamming into a black hole, [complete the honeypot analogy here]. I'd do it myself but I don't care for the idea of someone hammering me with thousands and thousands of requests. I'd love to know if someone else sets something like this up, though.

            Shaun
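
The FormMail honeypot described just above can be sketched in a few lines of Python. This is only an illustration of the stated rule (deliver the first message, then log but drop anything else from the same /8 for 24 hours); the class name, the in-memory storage, and the caller-supplied timestamps are assumptions, and it handles IPv4 addresses only.

```python
import ipaddress

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour window from the comment


class FormMailHoneypot:
    """Decide whether a FormMail submission should really be mailed."""

    def __init__(self):
        self._first_seen = {}  # /8 network -> time of the delivered probe
        self.log = []          # (timestamp, ip) for swallowed submissions

    def should_deliver(self, ip, now):
        # strict=False lets us derive the enclosing /8 from a host address.
        net = ipaddress.ip_network(f"{ip}/8", strict=False)
        first = self._first_seen.get(net)
        if first is None or now - first >= WINDOW_SECONDS:
            self._first_seen[net] = now  # let the "test message" through
            return True
        self.log.append((now, ip))       # black hole: log it, send nothing
        return False
```

The script behind the fake formmail.pl would call `should_deliver(client_ip, time.time())` and print the same "thank you" page either way, so the prober cannot tell which submissions were dropped.
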
    • Don't people check their addresses for addresses that don't deliver?

      No, they don't. Many of the people gathering the lists of addresses are also selling the lists, and the longer the list, the more impressive it sounds to the people who buy the lists. By the time the buyer realizes how many of the addresses are bogus, the list seller has already been paid. Remember, these are spammers, not ethical people.

      Even if they're gathering lists for their own use, consider this: when they send the spam with a forged email address and route it through some unsuspecting schmuck's open relay, the bounces are a problem for whoever owns the domain in the bogus From header and/or the unsuspecting schmuck. It poses no problem for the spammer, so there's no motivation to verify. Why should they? If they actually cared about clogging the network, they wouldn't be spamming in the first place.

      Check this link [whatever.net] to see the 'user unknown' messages rejected by my server in the last couple/few days. Many of the addresses you see have been showing up in that list for years.

      You cannot overestimate the laziness of a spammer, nor can you underestimate their concern for anyone or anything else.

  • Unfortunately some good robots have been known to ignore robots.txt. Fast has in the past fallen into my test honeypot, I would hate to accidentally block someone like google.

    *shudders*
    • I would hate to accidentally block someone like google.

      does robotcop have an 'Allow' feature to always allow certain ips? that and some whois's would solve this problem - it looks like google is 216.239.32.0 - 216.239.63.255

      • does robotcop have an 'Allow' feature to always allow certain ips?

        Nope, but I can definitely add one. I did a lot of testing on my servers to make sure the big search engines or tools like wget wouldn't get into fights with Robotcop. Also, part of the *point* of Robotcop is to bring in enforcement behind the robots.txt file, so any "white hat" vendor currently ignoring that file should be bullied into obeying it.

        But I'll add a RobotcopAllow directive anyway. :-).
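
As a sketch of what such an allow check has to do (this is not the syntax or behavior of any real RobotcopAllow directive, which did not exist when this was posted), Python's standard `ipaddress` module can turn the address range quoted above into CIDR form and test membership. The range itself comes from the comment above and should be treated as a historical example, not a current list of Google addresses.

```python
import ipaddress

# Range quoted in the thread: 216.239.32.0 - 216.239.63.255
FIRST = ipaddress.IPv4Address("216.239.32.0")
LAST = ipaddress.IPv4Address("216.239.63.255")

# Collapse the start/end pair into the smallest list of CIDR networks.
ALLOWED_NETWORKS = list(ipaddress.summarize_address_range(FIRST, LAST))


def is_allowed(ip):
    """Return True if ip falls inside any allow-listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```

The quoted range collapses to the single network 216.239.32.0/19, so an allow list like this stays short even when it is written as start and end addresses.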

  • Spiders that follow the rules, of course, can be detected, so what you need is something more to stop spiders that don't, or those that know how not to get stuck in tarpits, and those spoofing other clients and not reading robots.txt. The easiest way I can think to do this would be to count how many hits a particular IP has to your server in relation to individual pages. The more unique pages it pulls in a minute, the slower (geometrically) the connection should get. That way even the most hardcore human reader (or group of more casual readers behind a NAT) can click on 30-40 links in a minute and only see a 100ms slowdown, probably not even noticeable, but a spider pulling 100 pages will see a 1000ms slowdown, and pulling 200 pages will result in a 10000ms slowdown per page. Sure they can eventually download all the pages, but make it take a week to do it. Combine that with what you already have and it will make for a very unpleasant spidering experience.
    • That way even the most hardcore human reader (or group of more casual readers behind a NAT) can click on 30-40 links in a minute

      What about over 20 Million Members on one ISP's [aol.com] proxy? A story circulating around several tech news sites (about the high likelihood of AOL 8 using Mozilla's Gecko engine) places AOL's U.S. market share at about 30%. Do you really want to drive away 30% of your audience? What about the billion-plus people behind China's NAT?
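
One way to read the geometric slowdown proposed at the top of this thread is a delay that grows tenfold for every 100 unique pages pulled in a minute. The Python sketch below assumes exactly that curve; the 100ms base and the function name are illustrative choices fitted to the numbers in the comment, not anything Robotcop implements.

```python
def delay_ms(unique_pages_last_minute, base_ms=100.0, pages_per_tenfold=100):
    """Per-page delay that grows geometrically with recent page count."""
    return base_ms * 10 ** (unique_pages_last_minute / pages_per_tenfold)
```

This reproduces the 1000ms and 10000ms figures at 100 and 200 pages. At 35 pages it gives a bit over 200ms instead of the 100ms mentioned, so a real tuning would need a different base or a threshold. It does nothing about the objection in the reply: a large proxy looks like one very busy client.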

  • Pretty cool idea, and useful, too. There are some mod_perl modules available, too. There's an example in the mod_perl Developer's Cookbook [amazon.com] and I wrote a simple one here [mooresystems.com].


    They really seem to catch some weird things that I never thought might be wandering around on my website. I recommend lifting the ban on anyone after a while, though, because you can (almost) never be too certain what you've banned.
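
The advice to lift bans after a while amounts to giving every ban an expiry time. A small Python sketch of that idea follows; the `BanList` class and the one-hour default are assumptions for illustration and are not taken from either of the modules linked above.

```python
import time


class BanList:
    """Ban clients for a fixed period, then forget them automatically."""

    def __init__(self, ban_seconds=3600, clock=time.monotonic):
        self.ban_seconds = ban_seconds
        self._clock = clock    # injectable so the expiry can be tested
        self._expires = {}     # client key -> time the ban ends

    def ban(self, client):
        self._expires[client] = self._clock() + self.ban_seconds

    def is_banned(self, client):
        expiry = self._expires.get(client)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[client]  # ban has lapsed, lift it
            return False
        return True
```

With an expiry in place, a mistaken ban on a shared proxy or a legitimate crawler heals itself instead of waiting for someone to notice it in the logs.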

  • Some large search engines have their spiders spread across multiple hosts. Google is one example. What would happen if crawler-01.nastybot.com grabs the robots.txt file, then crawler-02.nastybot.com violates it? I think with all the open proxies out there, spammers would easily adapt to this. Proxy through someone to grab robots.txt, and then through someone else to make use of the file. This would make IP tracking useless; you couldn't even match by subnet (like you could with *.nastybot.com), since the first request could be from 12.34.56.78 and the second from 31.3.3.7.

    Solution?
    • What would happen if crawler-01.nastybot.com grabs the robots.txt file, then crawler-02.nastybot.com violates it?

      If a spider can use multiple IPs then it can avoid being tracked, but it still has to worry about falling into trap directories and noticing that it is trapped. If the spider is in no hurry then this is no big deal either, as it can just wait out your ban or return as another IP. Robotcop, or any other software for that matter, can't protect you against a spider like that.

      One thing to keep in mind is that much email address harvesting is done from cheap dialup accounts over short periods. Harvester vendors want to sell software to dumb users who think they are collecting "free" lists to spam at. These spiders don't have the option of working slowly or jumping IPs and are easy prey for Robotcop.

      • One thing to keep in mind is that much email address harvesting is done from cheap dialup accounts over short periods. Harvester vendors want to sell software to dumb users who think they are collecting "free" lists to spam at. These spiders don't have the option of working slowly or jumping IPs and are easy prey for Robotcop.

        Actually, they do have the option [rosinstrument.com], they just need to be rewritten to take advantage of it [samair.ru]. What we have here is an arms race; it's only a matter of time before email siphoning bots have proxy-bouncing built into them. :\ Then we'll need to do something else to keep them away, which they'll find a way through once again.
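
To make the subnet point from the top of this thread concrete, here is a short Python sketch that buckets clients by network prefix instead of by exact address. The `subnet_key` helper and the /24 default are assumptions for illustration.

```python
import ipaddress


def subnet_key(ip, prefix=24):
    """Return the enclosing network of ip, for use as a tracking key."""
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
```

Two crawlers on the same /24 share a key, so tracking by subnet would link them. The two proxy addresses from the example, 12.34.56.78 and 31.3.3.7, land in different buckets even at /8, which is why proxy bouncing defeats this kind of matching.
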
  • good for FTP sites (Score:2, Interesting)

    by eufaula ( 163352 )

    on my network we have a http [lyon.edu] and ftp [lyon.edu] mirror of the Linux Kernel Archives (ftp3.us.kernel.org/www1.us.kernel.org), OpenBSD, Project Gutenberg, and ProFTPD. we have several distros in both ISOs and loose files, and all told, over 100gb of data. these damn webbots crawl our site and index it, which takes DAYS, and seriously interferes with our ability to provide a useful http mirror. at one point recently, an altavista webbot was using about 10% of our T-1 and filled up /var with access_logs in a few hours. it's only gotten worse. i started blocking the bots based on their browser match type and it has helped a ton. but the only problem with this method is that i have to go through it every day or 2 to keep current. this module looks to do exactly what i need. it won't be foolproof, but will save me and countless others a ton of grunt work.

    if you do happen to visit the site, the stupidspider and dumbbot dir/file are part of my current spider trap. just thought that i'd warn you.

  • This reminds me of a humorous day at work. I had done a bit of php programming in my spare time on a site [utah.edu] where medical students could get sample problems (or entire quizzes) based on what class they were in. About a week after I handed it over to a medical student to maintain, he had emailed me that some people had been trying to access php pages that were nonexistent (basically passing variables in the URL that weren't valid database entries). He was worried that somebody was trying to hack his system. He logged some offending IP addresses and sent them to me in an email.

    This is where it gets sorta funny

    So I headed over to network-tools.com [network-tools.com] and looked up the IPs. Each one of them resolved to a webcrawler for a search engine. So I emailed him back explaining that it was just the search engines indexing his pages. It took several more emails to convince him that it was harmless.

    Now that I look back at this anecdote, it doesn't seem that humorous, but I guess at the time I was pretty amused by the fact that a medical student was panicking thinking that a webcrawler was 'hacking' his system (if you're wondering, btw, these online quizzes have absolutely no weight in the medical school courses -- it's just for practice).

    -Sou|cuttr
  • 10 LET M$ = "Microsoft"

    The Robotcop download page [robotcop.org] states that no binaries are available for versions of Apache HTTP Server designed for M$ Windows, and the binaries that do exist (for Red Hat Linux x86 and FreeBSD x86) aren't very compatible with mod_ssl.

    "So compile it yourself!" For one thing, according to the compilation instructions [robotcop.org], those who want to compile Robotcop for Windows will have to wait a year (estimated) until Apache 2.0 is no longer beta but Released. For another, not everybody can afford a license for M$ Visual Studio, which is required to build Apache HTTP Server [apache.org]; apparently, this popular Win32 version of GCC [mingw.org] doesn't cut it.

    In other words, Robotcop won't work for consumers who serve web pages from their home workstation that runs Windows.

    • Apache 1.3 on Unix is multiprocess, and Robotcop uses libMM for sharing things like spider lists between these processes. Apache 1.3 on Win32 is multithreaded, and would require a totally different approach to this. At the time I thought Apache 2 was closer and let me avoid this work, but I'm thinking your estimate is more accurate now. I'm not too familiar with the Apache on Win32 community, but your comment is request #2 for Robotcop to support it, so I might go ahead and do it.

      Regarding your src vs binary comments, you might be interested to know that about 95% of visitors to the Robotcop [robotcop.org] site choose to download the source instead of the much easier to install binary. It'll be interesting to see how Win32 is different once I have something up for that.

      P.S. Don't call people "consumers". Even if they are Windows users, it's not nice. :-)

      • P.S. Don't call people "consumers". Even if they are Windows users, it's not nice. :-)

        Then what is the correct term for people who go into Best Buy, buy a PC, and use only the operating system that Microsoft forced the PC vendor to pre-install because the buyer doesn't know better? I used "consumer" to refer to those who use Windows on their home computers not by choice but by ignorance of other options or by lack of drivers for proprietary devices.

        • I'm sorry, but Microsoft or the PC vendor has never made anyone keep the original OS on it. Yes, it might void your warranty if you take off Windows. But now you're talking about a not so tech-savvy user, and you're assuming that they'll use Apache for a www server? C'mon now. Don't blame ignorance in the OS type but then not in the http-daemon. I'm not a (100%) MS basher, but please don't be an anti-MS basher (i.e. don't use Microsoft as an excuse). Everyone has options.
          • I'm sorry, but Microsoft or the PC vendor has never made anyone keep the original OS on it.

            Yes they have. A PC has been sold whose BIOS verifies that the OS is Windows. It's called the Xbox. I know the Xbox is a game console, but it's only the beginning.

            But now you're talking about a not so tech-savvy user, and you're assuming that they'll use Apache for a www server?

            What about a tech-savvy user telling a not-so-tech-savvy user "if you want to share a few files, and you don't want to be subject to the security holes in both Windows's built-in file sharing and IIS Personal Edition, why not use this program [apache.org]?" That's the audience I was talking about.

  • The problem has never been about being able to stop illegitimate programs, but rather to ONLY stop illegitimate programs, and not authentic ones as well. Let's name one that could find problems: Google. Okay, they wise up and allow accesses from Google (and some other select few) to go through. Problem 1: Smarter spiders can take advantage of this. Problem 2: Anyone who wants to start up a new and similar service would first have to make sure that it is registered so that it is not blocked out. This could turn into a bureaucratic nightmare, and may restrict competition by disallowing new, small, and innovative contenders.

    Please don't cure the illness by killing the patient.
  • To hide our email addresses or hinder their harvesting is like trying not to be seen by the wrong people when you go to a party. It's useless, every girl can tell you that.

    Instead just learn who is right for you and say 'no' to the others.

    The effort to install and test that module is wasted - and better put into quick and effective spam-blocking techniques, backed up by proper site policies.
