
Gmail Spam Filter Testing

An anonymous reader writes "What can you do with 1000MB of e-mail space on your Gmail account? One guy, by the name of Aaron Pratt ( prattboy@gmail.com ), has decided to test the spam filters of Google's Gmail service by having his Gmail account blasted with every kind of spam imaginable. He is testing to see how well Gmail's spam filters can sort out the spam from legitimate email (yes, he does get personal emails from people). As of May 25th, he was at about 30% of his Gmail account's 1GB capacity. You can track his progress on his website, http://gmail.prattboy.net (Google cache of this site: cache: gmail.prattboy.net). There is also an article about Aaron's efforts at webpronews.com"
This discussion has been archived. No new comments can be posted.

  • Not that impressive (Score:5, Informative)

    by chrisgeleven ( 514645 ) on Monday June 14, 2004 @10:42AM (#9419829) Homepage
    Seems like Gmail only filters approx. 50% of spam. That is not very impressive, since the top anti-spam software and e-mail clients (such as Outlook 2003 and Mozilla Thunderbird) can easily reach 95% accuracy in spam filtering.

    I am starting to second guess whether I should transfer everything to my Gmail account.
  • by kryptkpr ( 180196 ) on Monday June 14, 2004 @10:45AM (#9419854) Homepage
    Spammers have thought of this already, and they send nearly-identical messages.. Ever notice the random strings of letters and/or numbers at the bottom/in the subjects of spams?
  • by XO ( 250276 ) <blade.eric@NospAM.gmail.com> on Monday June 14, 2004 @10:47AM (#9419879) Homepage Journal
    Sure, but those will also mark virtually every legitimate email as spam, as WELL. Yeah, you can have 95% accuracy... but then you have to go through your hundreds of messages marked spam just to find your real email!

    (For example, after two weeks of using SpamAssassin, it decided that every e-mail sent to me was spam. I no longer received anything in my Inbox; everything was transferred to the Spambox. It took me another two weeks of tweaking SpamAssassin's kill rate down to about 50% accuracy, and now I actually receive all my emails.)

  • My own gmail testing (Score:5, Informative)

    by Twid ( 67847 ) on Monday June 14, 2004 @10:48AM (#9419889) Homepage
    I did some testing of my own. I forwarded a ton of spam from my personal account to my gmail account, just to see what would get through and what would be filtered. For me, gmail was really effective, but strangely, one Nigerian e-mail scam mail didn't get tagged.

    It was from " Mr Jubril Udeh Manager of Credit and Accounts Department of North Atlantic Securities Sarls Lome-Togo Republic."

    Now, the funny part is not that the mail made it through, but that Google also decided to show me contextual ads on that account. Currently, the ads are:
    - Payroll Cards a Poor Substitute for Checking Account
    - Tips for Tackling Check Fraud
    - Sophos hoax description: Ethiopian airline letter
    - FAP non-US Investment FAQs

    In the past the mail has also shown me ads on how to open an off-shore bank account. I'm glad Google is willing to help me with the $10.5 million that I'm about to receive! :)

  • by Sulka ( 4250 ) <sulka@@@iki...fi> on Monday June 14, 2004 @10:49AM (#9419902) Homepage Journal
    Checksums are nearly useless against spam. Changing a single byte changes the checksum value, and probably more than 90% of spam contains a personalization code to check which addresses are functional. Different code = different checksum.

    This doesn't mean it wouldn't be possible to create a system which would automatically detect individual spam messages based on tagging known spam; you just have to be smarter about the detection than just plain MD5ing the email body.
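
To illustrate the point about checksums, here is a minimal Python sketch. The message bodies, the `ref:` tracking codes, and the word-shingle similarity measure are all assumptions made up for this example; nothing here is claimed to be how Gmail or any real filter works. It shows that two spams differing only in a short code get unrelated MD5 digests, while a fuzzier comparison still sees them as near-duplicates.

```python
import hashlib

def md5_hex(body: str) -> str:
    """Exact checksum of the message body."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()

def shingles(body: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences (hypothetical fuzzy fingerprint)."""
    words = body.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Share of k-word shingles the two bodies have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

# Hypothetical spam body with two different personalization codes appended.
BASE = ("Cheap meds online now, no prescription needed, "
        "order today and save big on every refill")
spam_a = BASE + " ref:x7Qp2"
spam_b = BASE + " ref:k9Lm4"
ham = ("Hi Aaron, are we still on for lunch on Friday? "
       "Let me know what time works for you")

print(md5_hex(spam_a) == md5_hex(spam_b))  # exact checksums do not match
print(jaccard(spam_a, spam_b))             # high overlap despite the code
print(jaccard(spam_a, ham))                # no overlap with the real mail
```

A real system would need something far more robust than word shingles, but the contrast between exact and approximate matching is the idea being described.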
  • by Kredal ( 566494 ) on Monday June 14, 2004 @10:56AM (#9419964) Homepage Journal
    tikora@gmail.com, I think.

    Mine is kredal@gmail.com, if you're interested. (:
  • by aismail3 ( 735831 ) on Monday June 14, 2004 @11:02AM (#9420018)
    When I add up the figures from May 13 to 19, I get that 4869 messages were received. 4717 of those were spam, and 1820 were marked, so Gmail's success rate was 38.6%.
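
The arithmetic above can be checked with a few lines of Python. The three totals are the ones quoted in the comment; treating every marked message as genuine spam (no false positives) is an assumption the comment makes implicitly, and the variable names are only for illustration.

```python
# Totals for May 13 to 19, as quoted in the comment above.
received = 4869   # all messages received
spam = 4717       # messages that were spam
marked = 1820     # messages Gmail marked as spam

# Assumption: every marked message really was spam.
catch_rate = marked / spam
legit = received - spam

print(f"caught {catch_rate:.1%} of spam; {legit} messages were not spam")
```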
  • by wo1verin3 ( 473094 ) on Monday June 14, 2004 @11:07AM (#9420060) Homepage
    It's a good thing you're not using Outlook. :)

    I get those in Eudora and they don't seem to do much, my friends with Outlook however... not so lucky. :)
  • by Satai ( 111172 ) * on Monday June 14, 2004 @11:11AM (#9420092)
    No, you inverted it. You want MB/message, not message/MB.

    3778 messages / 213 MB = 17.74 messages / MB
    213 MB / 3778 messages = 0.0564 MB / message

    So that's pretty reasonable.
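
Both ratios come from the same two numbers, so they are easy to recompute. A small Python sketch, assuming 1 MB = 1024 KB for the per-message size (that conversion is not stated in the thread):

```python
# Figures quoted in the comment above.
messages = 3778
used_mb = 213

mb_per_message = used_mb / messages       # average size of one message
messages_per_mb = messages / used_mb      # the inverse ratio
kb_per_message = mb_per_message * 1024    # assumes 1 MB = 1024 KB

print(f"{mb_per_message:.4f} MB/message, about {kb_per_message:.0f} KB each")
print(f"{messages_per_mb:.2f} messages/MB")
```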
  • by xandroid ( 680978 ) on Monday June 14, 2004 @11:21AM (#9420186) Homepage Journal
    Try looking at the source -- when this happens to me, I see that the random words are plaintext, and the intended advertisement is in HTML (which I've blocked).
  • by Halo1 ( 136547 ) on Monday June 14, 2004 @11:21AM (#9420195)
    Most of the time, these messages contain both a text/plain section with only random words, and then a text/html part with the real payload. If you use mutt or so, you most likely only see the text/plain stuff. Another trick is using just a text/html section with random text, but also with an image that contains the real payload.
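
The two-part trick described above can be reproduced with Python's standard `email` package. This is only a sketch: the decoy words and the HTML payload are invented for the example, and `parts_by_type` is a hypothetical helper, not part of any mail client.

```python
from email.message import EmailMessage

# Build a message the way the comment describes: a text/plain part with
# random words and a text/html part carrying the real advertisement.
msg = EmailMessage()
msg["Subject"] = "hello"
msg.set_content("lantern orbit velvet cactus marble")
msg.add_alternative("<html><body><b>Buy now!</b></body></html>", subtype="html")

def parts_by_type(message):
    """Map each leaf part's content type to its decoded text."""
    out = {}
    for part in message.walk():
        if part.get_content_maintype() == "multipart":
            continue  # skip the container, keep only the leaf parts
        out[part.get_content_type()] = part.get_content().strip()
    return out

parts = parts_by_type(msg)
print(parts["text/plain"])  # what a plain-text reader would display
print(parts["text/html"])   # what an HTML-rendering client would display
```

A text-only reader shows the first part, which is why the message looks like nothing but random words.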
  • Re:whining? (Score:4, Informative)

    by cmacb ( 547347 ) on Monday June 14, 2004 @11:27AM (#9420255) Homepage Journal
    Actually the TOS for Gmail says that doing things to attract spam is a violation, so they could just close the account on that basis. Also, if you don't sign on for a certain period of time (a few months I think) the account gets deleted. I had a Yahoo ID for years before I ever knew there was an e-mail address associated with it. I never read the mail associated with my AIM id and I probably still have free hotmail and a few other things like that floating around. Failure of these companies to delete idle accounts is what causes all the good names to be taken. I think Google is more on top of this than many of the others.
  • by ryen ( 684684 ) on Monday June 14, 2004 @11:37AM (#9420355)
    Those emails could also contain embedded image tags (known as web beacons). When you open an email and your client 'downloads' the image, some server on the net knows it was you who retrieved the image and has just verified that your email address is active and spammable.
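
A mail reader can spot candidate web beacons by looking for images that would be fetched from a remote server. The sketch below uses Python's standard `html.parser`; the sample HTML and the `tracker.example` URL are made up for illustration, and a remote image is only a hint, since not every one is a beacon.

```python
from html.parser import HTMLParser

class RemoteImageFinder(HTMLParser):
    """Collects the src of every <img> that points at an http(s) URL."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            if src.lower().startswith(("http://", "https://")):
                self.urls.append(src)

def find_remote_images(html: str) -> list:
    finder = RemoteImageFinder()
    finder.feed(html)
    return finder.urls

# Hypothetical message body: the id in the query string is what would tie
# the image request back to one recipient.
SAMPLE = ('<p>Hi!</p><img src="http://tracker.example/pixel.gif?id=abc123" '
          'width="1" height="1">')

print(find_remote_images(SAMPLE))
```

Blocking remote images, as some of the posters above do by blocking HTML, means the request is never made.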
  • by cks3 ( 698800 ) <ck&sampletheweb,com> on Monday June 14, 2004 @11:39AM (#9420369) Homepage Journal
    Oh, wait, it was me! http://slashdot.org/comments.pl?sid=105335&cid=8965252

    Eh, I only got 180MB worth of email and spam out of the deal though, before I decided to delete the account. The Gmail spam filter was rather horrible at the time, catching only the most tried and true SPAM, letting tons of other SPAM through, and then randomly flagging legitimate messages from people whom it had not flagged before. I think it has improved some since then.

  • Re:whining? (Score:5, Informative)

    by Beryllium Sphere(tm) ( 193358 ) on Monday June 14, 2004 @11:43AM (#9420411) Journal
    >The whole Gmail "get a gigabyte of memory free" business model is predicated on most people using only a small fraction of that gigabyte

    Why?

    Google uses commodity IDE drives. Those retail for about fifty cents a gigabyte. Google's not paying retail.

    I read a quote from a Googleperson that by the time the drive is installed in a system, powered, cooled, backed up and administered, Google is paying two dollars for a gigabyte.

    Good point about the problem of abandoned accounts, which won't bring Google any ad revenue. Wouldn't be surprised if they start euthanizing inactive accounts.
  • by FooAtWFU ( 699187 ) on Monday June 14, 2004 @12:00PM (#9420570) Homepage
    >>It's cheaper to just send mail to everyone
    >no it's not.
    It doesn't matter how cheap it is when 80% of spam supposedly comes from infected zombie computers. (I'm too lazy to actually LINK to the recent story on this.)
  • by ravydavygravy ( 230429 ) on Monday June 14, 2004 @12:01PM (#9420572) Homepage
    Sure, but those will also mark virtually every legitimate email as spam, as WELL. Yeah, you can have 95% accuracy... but then you have to go through your hundreds of messages marked spam just to find your real email!

    Rubbish - I've used Thunderbird for many months now, with an account that gets quite a bit of spam. I have yet to see Thunderbird make a wrong guess at what's spam and what's not. If anything, Thunderbird is more likely to go the other way - allowing spam through - than deleting real email.
  • by Thuktun ( 221615 ) on Monday June 14, 2004 @12:04PM (#9420604) Journal
    gzip it and compare the files. a short tracking code will make a negligible difference.

    Not necessarily.

    Lempel-Ziv based algorithms, like the one used by gzip, build a compression dictionary on the fly. Any "personalization" added to the message will affect the dictionary to varying degrees from then onward. If it's near the beginning, the personalization would greatly skew the selected dictionary identifiers. Though probably this would have little effect on the actual compression of the data, it would radically change the representation of the compressed image. The farther this personalization is from the start of the data to be compressed, the less effect it will have.
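
The effect is easy to measure rather than argue about. Here is a hedged Python sketch using `zlib`, which implements DEFLATE, the same compression format gzip uses. The message body and the `id=` codes are invented. DEFLATE also applies Huffman coding after the Lempel-Ziv step, so the code simply reports what it finds instead of assuming where the outputs will diverge.

```python
import zlib

def common_prefix_len(x: bytes, y: bytes) -> int:
    """Number of leading bytes the two byte strings share."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n

def compare(msg_a: str, msg_b: str) -> dict:
    """Compress both messages and summarize how the outputs differ."""
    ca = zlib.compress(msg_a.encode("utf-8"))
    cb = zlib.compress(msg_b.encode("utf-8"))
    return {
        "size_a": len(ca),
        "size_b": len(cb),
        "identical": ca == cb,
        "common_prefix": common_prefix_len(ca, cb),
    }

# Hypothetical spam body with a personalization code at the front or the back.
BODY = "Refinance your mortgage today at the lowest rates. " * 8
front_a, front_b = "id=x7Qp2 " + BODY, "id=k9Lm4 " + BODY
back_a, back_b = BODY + " id=x7Qp2", BODY + " id=k9Lm4"

print("code at start:", compare(front_a, front_b))
print("code at end:  ", compare(back_a, back_b))
```

Comparing `size_a` with `size_b` shows how much the code costs in compressed length, while `identical` and `common_prefix` show how far the byte streams themselves agree.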
  • by Anonymous Coward on Monday June 14, 2004 @01:49PM (#9421773)
    false positive : spam getting past the filter ratio...

    A false positive is not spam getting past the filter; it's non-spam getting blocked.

    I.e. the filter says it's spam, and it isn't - in the same way that a false-positive medical test says you have a virus even when you don't.

  • Re:Not a fair test (Score:4, Informative)

    by SWroclawski ( 95770 ) <serge@wroclaws[ ]org ['ki.' in gap]> on Monday June 14, 2004 @02:01PM (#9421891) Homepage
    Any evidence that they reject mail for various reasons? I'm sure there is. You can go ahead and see which RFCs they're in compliance with and which they aren't.

    Try sending mail to them from a host that has no PTR record associated with it, or malform your EHLO, or something else.

    You don't need to be "really sure" mail is spam. I'm talking about doing things like standards compliance checking, which will result in mail being rejected at delivery time.

    Is this just random theorizing, or does GMail really fail to deliver some emails it thinks are spam?

    There's no reason to get insulting. RFC 2821 has a number of requirements for delivery of mail that many services ignore.
  • by einTier ( 33752 ) * on Monday June 14, 2004 @04:09PM (#9423183)
    False positive = condition you are testing for comes up positive, when it should be negative.

    False negative = condition you are testing for comes up negative, when it should be positive.

    Put in the context of a spam filter, it depends on whether you are testing for spam or for legitimate emails. If you are testing for spam (if spam then...), a false positive would be an email that is not spam getting sent to the spam folder or deleted. A false negative would be spam that lands in your inbox.
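
The definitions above, with the filter testing for spam, translate directly into a small counting routine. This Python sketch uses made-up sample data; `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(messages):
    """messages: iterable of (is_spam, flagged_as_spam) boolean pairs."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for is_spam, flagged in messages:
        if flagged and is_spam:
            counts["true_positive"] += 1    # spam caught by the filter
        elif flagged and not is_spam:
            counts["false_positive"] += 1   # real mail sent to the spam folder
        elif not flagged and is_spam:
            counts["false_negative"] += 1   # spam that landed in the inbox
        else:
            counts["true_negative"] += 1    # real mail delivered normally
    return counts

# Invented sample: three spams (one missed) and three real mails (one blocked).
SAMPLE = [(True, True), (True, True), (True, False),
          (False, True), (False, False), (False, False)]
result = confusion_counts(SAMPLE)
print(result)
```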

  • Re:Cache? (Score:2, Informative)

    by Calamity Jane ( 223787 ) on Monday June 14, 2004 @11:34PM (#9426613) Homepage
    The cache link is pointing to the cache of his website, not of Google's.

