Robotcop: It's the Law
Voivod writes: "Inspired by the recent Slashdot and Evolt discussions about Blocking Bad Spiders, we set out to write an Apache module that solves this problem. The result is Robotcop, and it's ready for action. We believe it's the best solution currently available for protecting Apache webservers from spiders. Install it and help us make life hell for e-mail harvesting software!"
Arms Races (Score:1)
Right now it provides pretty good protection, especially if the spider needs to get in and out of the site within a set period of time. If you can think of ways to circumvent how Robotcop works, please point them out so we can figure out a solution!
Re:Arms Races (Score:4, Insightful)
looking over the technical review [robotcop.org] and the readme [robotcop.org], a few initial, random, and sporadic thoughts:
the blocking of valid users seems rather annoying (NAT users, some proxy users) and a bad spider could get around the short interval by increasing its sleep time.
IPv6 could screw your implementation. If i have access to a huge number of IP addresses then i could access your website through any one of those addresses. A spider could run an initial probe of a few million websites through one IP, change IPs, grab a second page from all those websites, change IPs again, grab another page, and so on.
if i know a website is running robotcop, can i screw over valid users by forging my ip address, accessing robots.txt, then accessing a honeypot dir? can i screw over all users by cycling through all ips and doing this (yeah that's time consuming, maybe i could just screw over users from one range?)?
The main problems i see with the robotcop approach are that it assumes everyone who accesses robots.txt is a robot, and that it assumes valid users will never follow certain paths through the website.
This is different from the email-poisoner case, b/c if i'm a user and i get to a page with a bunch of (invalid) email addresses, it doesn't matter. i click back and continue on my way. but for something that actually *blocks* users, it's a bit different.
As it stands now, i could go to an internet cafe (often they use nat) and block every other user from seeing any site protected by robotcop.
How about tying both User-Agent and IP address to form valid/invalid users? that way a bad user behind NAT might get blocked while a good user could go on. The more information you can tie to one particular thread of access, the more likely you are to single out one particular user.
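something like this, maybe - a rough python sketch of the idea, not robotcop's actual code, and the window/threshold numbers are invented:

import time
from collections import defaultdict

WINDOW = 60        # seconds of history to keep per client
MAX_HITS = 40      # more than a human is likely to click in that window

hits = defaultdict(list)   # (ip, user_agent) -> recent request timestamps

def is_suspect(ip, user_agent, now=None):
    """Track requests per (IP, User-Agent) pair instead of per IP alone,
    so one bad client behind a NAT doesn't get the whole cafe banned."""
    now = now if now is not None else time.time()
    key = (ip, user_agent)
    hits[key] = [t for t in hits[key] if now - t < WINDOW]
    hits[key].append(now)
    return len(hits[key]) > MAX_HITS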
Instead of only blocking IPs that seem to be bad spiders, why not feed them specific information? that way, if it is a real user, you can let them go on - "if you are a valid user, enter the word in the graphic below in this text field and click 'ok'!"
It really seems that whatever you do, it is possible to work around. Set cookies? i write a bot that keeps track of cookies. hidden webbugs/urls? my bot avoids these.
I can see robotcop as working in small cases, like for a limited number of servers on the internet, b/c then it is not worth the bot writer's time to implement workarounds. But once it becomes worth their time, you have a game of evolution.
Not that that's bad; keep a small enough base of users and you probably won't need to update methods all that often.
Re:Arms Races (Score:1)
on second thought, if you did this then a spider could generate varying User-Agents to get around robotcop.
maybe once you've served someone a page asking for identification of sentience, you could use these and other params (ACCEPT_LANGUAGE, etc.) to identify a valid user, but not to identify a robot.
Re:Arms Races (Score:1)
> How about tying both User-Agent and IP address to form valid/invalid users?
This is a *fantastic* improvement. I'm going to add this to the next version.
Thanks for your other comments as well. I agree that some kind of, "We think you may be an evil spider. Do this to prove that you aren't..." would also be a good improvement. It's just a matter of coming up with the best method.
> hidden webbugs/urls? my bot avoids these.
Not to, uh, ask you to reveal trade secrets or anything, but does your bot actually fetch button images to make sure they're not transparent or something? I agree that it's possible to detect and avoid most of the common "fake link" techniques but 100% seems very unlikely.
Re:Arms Races (Score:2, Interesting)
Maybe you could thwart this by seeing if there are traversal patterns coming from all over the place (the same sequence of "GET" requests for a site arriving from many different IPs).
Re:Arms Races (Score:1)
heh. i actually don't have any such bot, but meant to imply that it would be possible to avoid them.
the most i do with robots on my website is feed them special links - you can load robots.txt on my site and note the minute changes.
do bad spiders follow predictable patterns? like always entering at /, as opposed to ever arriving with a referer, like from a google search? pretty much any pattern you can identify will help develop heuristics to distinguish between valid users and robots.
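e.g. a half-baked heuristic sketch - the field names are just what you'd pull from the CGI environment or the access log, and the weights are invented:

def bot_score(req):
    """Score a request on a few robot-ish patterns; higher = more robot-like."""
    score = 0
    if req.get("path") == "/robots.txt":
        score += 2      # humans almost never ask for this directly
    if not req.get("referer"):
        score += 1      # entered cold, no link or search engine referer
    if not req.get("accept_language"):
        score += 1      # real browsers send ACCEPT_LANGUAGE
    if "libwww" in req.get("user_agent", "").lower():
        score += 2      # classic toolkit signature
    return score

# bot_score({"path": "/robots.txt", "referer": "", "accept_language": "", "user_agent": ""}) == 4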
my coworker [rexroof.com] points out that you only need 2 IPs to get around just the robots mode - one to fetch robots.txt, the other to traverse the dirs that robots shouldn't be traversing.
you should focus most on easily modified identification methods, as well as ease of configuration. maybe eventually there would be room for a distributed fingerprint database of known spiders, or known spider ips, that you could use in conjunction with current methods.
Re:make something spider-proof... (Score:1)
The only way not to get trapped by robotcop is to obey the robots.txt rules file.
what's the difference between a bad robot obeying the rules and a good one?
sure, people may try to work on robots that don't fall into the honeypot, but those dirs on real servers are not going to be as obvious as in the example. i do think, however, that the random pages a trapped bot gets are too much of a giveaway that you're trapped - but by then they're already in a losing fight!
What about spoofing spiders? (Score:3, Interesting)
How do you prevent users from finding and falling into a trap directory? It seems like it wouldn't be that difficult to write a spider to get around the restrictions imposed by robotcop.
Am I missing something?
Re:What about spoofing spiders? (Score:4, Informative)
Check out the robotcop.org [robotcop.org] site. It has examples of how to set all this up.
Re:What about spoofing spiders? (Score:2)
I guess I could see setting this up like a web bug - a 1x1 pixel image, or a link the same color as the background - but you could then modify the spider not to follow those types of links.
I think this will be better than nothing, but you start to enter a robotcop / spider arms race.
Re:What about spoofing spiders? (Score:1)
Sorry, you're right - I'll add some example honeypot links to the documentation. The simplest example I can think of is something like this:
<a href="/honeypot" alt=""><b></b></a>

There are infinite variations on this theme, like the transparent gif you mentioned, which makes it very difficult for evil spiders to avoid them. Just make sure you test with lynx/w3m first. :-)
Protect lynx/links/w3m users by using two steps (Score:2)
> There are infinite variations on this theme, like the transparent gif you mentioned, which makes it very difficult for evil spiders to avoid them. Just make sure you test with lynx/w3m first.
Just make sure that it takes at least two steps to get from content to the honeypot. This way, it becomes much more difficult to accidentally tab to a link and activate it, shutting off an entire ISP's proxied access to the web server.
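A minimal sketch of what I mean, using Python's standard http.server just for illustration (the paths and page text here are made up - this is not how Robotcop itself is implemented):

from http.server import BaseHTTPRequestHandler, HTTPServer

banned = set()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        if ip in banned:
            self.send_error(403, "Banned: you followed the spider trap.")
            return
        if self.path == "/honeypot/":
            # Step one: a human who tabs here by accident sees a warning and a
            # way back.  Only the second hidden link actually triggers the ban.
            body = (b"<html><body>This area is a spider trap for robots that "
                    b'ignore robots.txt.  <a href="/">Take me back to the site.</a>'
                    b'<a href="/honeypot/step2"><b></b></a></body></html>')
        elif self.path == "/honeypot/step2":
            # Step two: nothing human-visible links here, so ban the client.
            banned.add(ip)
            self.send_error(403, "Banned: you followed the spider trap.")
            return
        else:
            body = b"<html><body>Normal content goes here.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()

A lynx or w3m user who stumbles into /honeypot/ gets a readable warning and an obvious way back; only something blindly following the second, invisible link ends up banned.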
But aren't poisoned addresses just stupid? (Score:2)
Re:But aren't poisoned addresses just stupid? (Score:1)
It's trivial for email harvesters to run MX lookups on the collected domains to see if they're valid, and spammers are used to "test runs" of millions of addresses to weed out the bad ones, so giving them garbage e-mail addresses probably has little impact on their operation.
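The "verification" really is trivial - something like this sketch (using the third-party dnspython module) is all a harvester needs to weed out domains that can't receive mail at all:

import dns.exception
import dns.resolver   # third-party: pip install dnspython (2.x API)

def domain_accepts_mail(domain):
    """True if the domain publishes at least one MX record."""
    try:
        dns.resolver.resolve(domain, "MX")
        return True
    except dns.exception.DNSException:
        return False

harvested = ["victim@example.com", "poison@no-such-domain-xyzzy.invalid"]
keepers = [a for a in harvested if domain_accepts_mail(a.split("@", 1)[1])]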
Re:But aren't poisoned addresses just stupid? (Score:1)
In other words, while a fake address wouldn't be that useful to give to one spammer, a lot of the harvester bots are run by people selling the addresses to other people. Hence, a single bad address might go to a thousand people. If they get sold 'a million addresses', and 50% of those are crap, then we've just cut the amount of spam sent in half. (And, just as importantly, we've cut the number of responses in half.)
Some email address sellers claim the addresses are 'verified', but they usually aren't.
Some people consider it an added bonus that this makes email address harvesters look unreliable (which they are anyway), but that's counteracted by the fact that the people being 'helped' by that discovery are the spammers. Still, sowing distrust in that industry is a good thing, even if it does indirectly help some spammers realize that address sellers are lying to them.
Re:But aren't poisoned addresses just stupid? (Score:3, Informative)
1- why not add valid addresses that get sent to /dev/null? e.g. aaaaaa@example.com through zzzzzz@example.com. you'd get a substantial amount of traffic for that many addresses, but you could trim the number of addresses to whatever your bandwidth/server could handle.
2- what are the legalities involved in creating a specific email address, perhaps send.mail.here.to.be.charged@example.com, and placing it *only* on a webpage with a blatant notice of "If you mail this address you allow me to charge you $100/byte sent to this address" (or a more specific terms of use, in order to cover the address being sold on to others), and then charging once you get mail at that address? could a terms of use be written that would make collecting the money legal?
3 - how many of you use Matt Wright's (*shudder* when you hear his name) formmail? how many of you use fake formmail scripts?
for a while now i've been using a fake formmail script that only prints out a webpage saying "thank you for using this script" but doesn't actually send mail. Some people see that output (ignoring the html comment that says "I HATE YOU YOU STUPID PIECE OF SHIT"), think the script has worked, and run a program to submit spam to the script to "send" mail to a few thousand addresses.
so far my fake script has saved thousands of addresses from getting spam. some people test the script with their own address first and then don't come back when they don't get the mail, but i could modify the script to send out the first mail from an IP and not the subsequent mail (see the sketch at the end of this comment).
but i'm wondering, has anyone else done work on this or heard of work like this?
If you don't understand what i'm talking about in point 3: "Matt Wright" (is he a real person?) has a series of scripts; one is formmail.pl, which allows mail to be sent to any address. some people search for servers with formmail.pl on them and use those scripts to pseudonymously send mail to other people. We had seen this quite a bit at work, which inspired me to create the fake formmail.pl.
are there any other common scripts like formmail.pl that could be faked in the same manner?
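here's roughly what i mean, as an untested python CGI sketch instead of perl - the form field names are guesses based on formmail, and the seen-IP file is made up:

#!/usr/bin/env python3
# Fake formmail: looks like it works, but only the *first* submission from any
# IP (usually the spammer's own test message) is actually delivered; everything
# after that is silently dropped.
import os
import sys
import smtplib
import urllib.parse
from email.message import EmailMessage

SEEN_FILE = "/var/tmp/fake-formmail-seen-ips"   # hypothetical state file

def already_seen(ip):
    try:
        with open(SEEN_FILE) as f:
            return ip in f.read().split()
    except FileNotFoundError:
        return False

form = urllib.parse.parse_qs(sys.stdin.read())
ip = os.environ.get("REMOTE_ADDR", "unknown")

if not already_seen(ip):
    with open(SEEN_FILE, "a") as f:
        f.write(ip + "\n")
    msg = EmailMessage()
    msg["Subject"] = form.get("subject", ["(no subject)"])[0]
    msg["From"] = "nobody@example.com"
    msg["To"] = form.get("recipient", ["postmaster@example.com"])[0]
    msg.set_content(form.get("message", [""])[0])
    with smtplib.SMTP("localhost") as smtp:   # let the one "test" message through
        smtp.send_message(msg)

# either way, tell the client it "worked"
print("Content-Type: text/html")
print()
print("<html><!-- I HATE YOU YOU STUPID PIECE OF SHIT -->"
      "<body>Thank you for using this script!</body></html>")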
Re:But aren't poisoned addresses just stupid? (Score:2)
I wrote one a few weeks ago that catches FormMail probes and mails a warning message to the person who's probing. Since putting the script in place across several domains, I've seen a significant decrease in repeat offenders. I used to get scanned by the same people day in and day out (i.e. the recipient value in the GET requests was the same), and I had a few who'd scan me weekly. Not anymore.
FWIW, the script is here [shat.net]. It's written in PHP, so you'll have to either redirect requests for formmail.pl to the PHP version, or use a CGI wrapper (hence the shebang line at the top of the PHP script).
I also like the idea of a FormMail honeypot. Basically such a script would accept and deliver the first message received from any IP address; this would be the test message indicating that the probe was successful, so you'd want to make sure it was actually sent. Subsequent accesses to the script from the same IP could then be silently discarded.
Shaun
Re:But aren't poisoned addresses just stupid? (Score:1)
No, they don't. Many of the people gathering the lists of addresses are also selling the lists, and the longer the list, the more impressive it sounds to the people who buy the lists. By the time the buyer realizes how many of the addresses are bogus, the list seller has already been paid. Remember, these are spammers, not ethical people.
Even if they're gathering lists for their own use, consider this: when they send the spam with a forged email address and route it through some unsuspecting schmuck's open relay, the bounces are a problem for whoever owns the domain in the bogus From header and/or for the unsuspecting schmuck. It poses no problem for the spammer, so there's no motivation to verify. Why should they? If they actually cared about clogging the network, they wouldn't be spamming in the first place.
Check this link [whatever.net] to see the 'user unknown' messages rejected by my server in the last couple/few days. Many of the addresses you see have been showing up in that list for years.
You cannot overestimate the laziness of a spammer, nor can you underestimate their concern for anyone or anything else.
From experience... (Score:2)
*shudders*
Re:From experience... (Score:1)
does robotcop have an 'Allow' feature to always allow certain IPs? that and some whois lookups would solve this problem - it looks like google is 216.239.32.0 - 216.239.63.255
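e.g. (python's standard ipaddress module; the range is the google one from whois, which works out to 216.239.32.0/19):

import ipaddress

ALWAYS_ALLOW = [ipaddress.ip_network("216.239.32.0/19")]   # googlebot, per whois

def always_allowed(ip):
    """Skip the spider-trap logic entirely for whitelisted crawler ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALWAYS_ALLOW)

# always_allowed("216.239.46.10") -> True
# always_allowed("10.0.0.1")      -> False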
Re:From experience... (Score:1)
Nope, but I can definitely add one. I did a lot of testing on my servers to make sure the big search engines or tools like wget wouldn't get into fights with Robotcop. Also, part of the *point* of Robotcop is to bring in enforcement behind the robots.txt file, so any "white hat" vendor currently ignoring that file should be bullied into respecting it.
But I'll add a RobotcopAllow directive anyway. :-).
Over 20 Million Members(tm) on one proxy (Score:2)
> That way even the most hardcore human reader (or group of more casual readers behind a NAT) can click on 30-40 links in a minute
What about over 20 Million Members on one ISP's [aol.com] proxy? A story circulating around several tech news sites (about the high likelihood of AOL 8 using Mozilla's Gecko engine) places AOL's U.S. market share at about 30%. Do you really want to drive away 30% of your audience? What about the billion-plus people behind China's NAT?
perl examples (Score:2)
They really seem to catch some weird things that I never thought might be wandering around on my website. I recommend lifting the ban on anyone after a while, though, because you can (almost) never be certain exactly what you've banned.
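Something like this is what I mean by lifting bans (a rough sketch in Python rather than Perl; the one-week figure is arbitrary):

import time

BAN_SECONDS = 7 * 24 * 3600     # forgive after a week
banned = {}                     # ip -> time the ban was imposed

def ban(ip):
    banned[ip] = time.time()

def is_banned(ip):
    imposed = banned.get(ip)
    if imposed is None:
        return False
    if time.time() - imposed > BAN_SECONDS:
        del banned[ip]          # ban has aged out; give the address another chance
        return False
    return True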
Robots.txt (Score:1)
Solution?
Re:Robots.txt (Score:1)
If a spider can use multiple IPs then it can avoid being tracked, but it still has to worry about falling into trap directories and noticing that it is trapped. If the spider is in no hurry then this is no big deal either, as it can just wait out your ban or return as another IP. Robotcop, or any other software for that matter, can't protect you against a spider like that.
One thing to keep in mind is that much email address harvesting is done from cheap dialup accounts over short periods. Harvester vendors want to sell software to dumb users who think they are collecting "free" lists of addresses to spam. These spiders don't have the option of working slowly or jumping IPs and are easy prey for Robotcop.
Re:Robots.txt (Score:1)
Actually, they do have the option [rosinstrument.com], they just need to be rewritten to take advantage of it [samair.ru]. What we have here is an arms race; it's only a matter of time before email-siphoning bots have proxy-bouncing built into them.
good for FTP sites (Score:2, Interesting)
on my network we have an http [lyon.edu] and ftp [lyon.edu] mirror of the Linux Kernel Archives (ftp3.us.kernel.org/www1.us.kernel.org), OpenBSD, Project Gutenberg, and ProFTPD. we have several distros as both ISOs and loose files - all told, over 100 GB of data. these damn webbots crawl our site and index it, which takes DAYS and seriously interferes with our ability to provide a useful http mirror. at one point recently, an altavista webbot was using about 10% of our T-1 and filled up /var with access_logs in a few hours. it's only gotten worse. i started blocking the bots based on their browser match type and it has helped a ton, but the only problem with this method is that i have to go through it every day or 2 to keep current. this module looks to do exactly what i need. it won't be foolproof, but it will save me and countless others a ton of grunt work.
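the grunt work looks roughly like this (a python sketch - the log path and the user-agent substrings are just examples, and it assumes apache's combined log format):

# scan the access log for user-agents that look like bots and that i haven't
# already blocked, so the browser-match block list stays current
KNOWN = {"scooter", "slurp", "webcopier"}          # substrings already blocked
BOTISH = ("bot", "crawler", "spider", "harvest")   # what new offenders tend to contain
LOG = "/var/log/httpd/access_log"

new_offenders = set()
with open(LOG) as log:
    for line in log:
        parts = line.split('"')
        if len(parts) < 6:
            continue                                # not a combined-format line
        agent = parts[5].lower()                    # sixth quote-split field is the UA
        if any(b in agent for b in BOTISH) and not any(k in agent for k in KNOWN):
            new_offenders.add(agent)

for agent in sorted(new_offenders):
    print(agent)                                    # candidates for the block list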
if you do happen to visit the site, the stupidspider and dumbbot dir/file are part of my current spider trap. just thought that i'd warn you.
Re:wget (Score:1)
robots.txt isn't just used to stop search engines from indexing confidential information; it can also be used to prevent robots from falling into things that they can't get out of. This can range from intentional tarpits to some genuine CGIs.
Re:wget (Score:1)
Remember that robots.txt is there for the protection of the client as well as the server. There are infinite rabbit holes on many websites, and the robots.txt format was invented to stop software from falling in. IMHO anything that is not a browser being controlled interactively by a human should obey it.
Hmm. There's a robots civil rights lawsuit of the future waiting to happen... :-)
hax0red (Score:1)
This is where it gets sorta funny
So I headed over to network-tools.com [network-tools.com] and looked up the IPs. Each one of them resolved to a webcrawler for a search engine. So I emailed him back explaining that it was just the search engines indexing his pages. It took several more emails to convince him that it was harmless.
Now that I look back at this anecdote, it doesn't seem that humorous, but I guess at the time I was pretty amused by the fact that a medical student was panicking thinking that a webcrawler was 'hacking' his system (if you're wondering, btw, these online quizzes have absolutely no weight in the medical school courses -- it's just for practice).
-Sou|cuttr
Not compatible with Windows Apache (Score:2)
The Robotcop download page [robotcop.org] states that no binaries are available for versions of Apache HTTP Server designed for M$ Windows, and the binaries that do exist (for Red Hat Linux x86 and FreeBSD x86) aren't very compatible with mod_ssl.
"So compile it yourself!" For one thing, according to the compilation instructions [robotcop.org], those who want to compile Robotcop for Windows will have to wait a year (estimated) until Apache 2.0 is no longer eta but Released. For another, not everybody can afford a license for M$ Visual Studio, which is required to build Apache HTTP Server [apache.org]; apparently, this popular Win32 version of GCC [mingw.org] doesn't cut it.
In other words, Robotcop won't work for consumers who serve web pages from their home workstation that runs Windows.
Re:Not compatible with Windows Apache (Score:1)
Apache 1.3 on Unix is multiprocess, and Robotcop uses libMM for sharing things like spider lists between those processes. Apache 1.3 on Win32 is multithreaded and would require a totally different approach. At the time I thought Apache 2 was closer and would let me avoid this work, but I'm thinking your estimate is more accurate now. I'm not too familiar with the Apache-on-Win32 community, but your comment is request #2 for Robotcop to support it, so I might go ahead and do it.
Regarding your src vs binary comments, you might be interested to know that about 95% of visitors to the Robotcop [robotcop.org] site choose to download the source instead of the much easier to install binary. It'll be interesting to see how Win32 is different once I have something up for that.
P.S. Don't call people "consumers". Even if they are Windows users, it's not nice. :-)
"Consumers" meaning home Windows users (Score:1)
> P.S. Don't call people "consumers". Even if they are Windows users, it's not nice. :-)
Then what is the correct term for people who go into Best Buy, buy a PC, and use only the operating system that Microsoft forced the PC vendor to pre-install, because the buyer doesn't know better? I used "consumer" to refer to those who use Windows on their home computers not by choice but out of ignorance of other options, or for lack of drivers for proprietary devices.
Re:"Consumers" meaning home Windows users (Score:1)
Re:"Consumers" meaning home Windows users (Score:1)
> I'm sorry, but Microsoft or the PC vendor has never made anyone keep the original OS on it.
Yes they have. A PC has been sold whose BIOS verifies that the OS is Windows. It's called the Xbox. I know the Xbox is a game console, but it's only the beginning.
> But now you're talking about a not-so-tech-savvy user, and you're assuming that they'll use Apache for a www server?
What about a tech-savvy user telling a not-so-tech-savvy user "if you want to share a few files, and you don't want to be subject to the security holes in both Windows's built-in file sharing and IIS Personal Edition, why not use this program [apache.org]?" That's the audience I was talking about.
Legit uses (Score:1)
Please don't cure the illness by killing the patient.
Wrong attitude. (Score:1)
Instead, just learn which robots are right for you and say 'no' to the others.
The effort to install and test that module is wasted - it would be better put into quick and effective spam-blocking techniques, backed up by proper site policies.