
Gmail Now Rejects Emails With Misleading Combinations of Unicode Characters

Posted by Soulskill
from the we-look-forward-to-being-caught-in-your-new-web dept.
An anonymous reader writes: Google today announced it is implementing a new effort to thwart spammers and scammers: the Unicode Consortium's open "Highly Restricted" specification. In short, Gmail now rejects emails from domains that use what the Unicode community has identified as potentially misleading combinations of letters. The news today follows Google's announcement last week that Gmail has gained support for accented and non-Latin characters. The company is clearly okay with international domains, as long as they aren't abused to trick its users.

  • by Anonymous Coward on Tuesday August 12, 2014 @05:32PM (#47658379)

    ...

    • by Wootery (1087023)

      Could've sworn Slashdot had zero support for unicode characters.

      (I appear to be unable to paste in a 'Trademark' symbol. What is this magic, AC?!)

  • ...of the e-mail. Any attempt to block spam or phishing on the basis of mixing character sets would have to confront the fact that some people do need to mix character sets. Typically representations of Mari in the Latin alphabet, for example, also make use of the Greek letters beta and eta. In fact, eta is used in Latin representations of several minority languages of Russia. And the Reddit crowd loves making weird smilies in their English-language writing by means of symbols drawn from Indian scripts.
    • by Russ1642 (1087959) on Tuesday August 12, 2014 @05:36PM (#47658415)

      If this spells death to those ridiculous smilies then it's ok with me.

    • I routinely substitute Cyrillic letters for Latin on Disqus and other forums to get around their filters (which block for more than mere "profanity").

      Slashdot does not allow non-ASCII characters — although it does not attempt to screen out profanity either.

      • by Ichijo (607641)

        Slashdot does not allow non-ASCII characters...

        ...unless they're in code page 1252.

      • by hondo77 (324058)

        Slashdot does not allow non-ASCII characters...

        Óh réällý?

        • by mi (197448)

          Óh réällý?

          That's pretty cool. I guess, the entire ISO-8859-15 is Ok? But not Cyrillics :-( Or else, you would've seen some Ukrainian-Russian conflict right here...

    • by tlhIngan (30335)

      ...of the e-mail. Any attempt to block spam or phishing on the basis of mixing character sets would have to confront the fact that some people do need to mix character sets. Typically representations of Mari in the Latin alphabet, for example, also make use of the Greek letters beta and eta. In fact, eta is used in Latin representations of several minority languages of Russia. And the Reddit crowd loves making weird smilies in their English-language writing by means of symbols drawn from Indian scripts.

      Or pe

      • by Dutch Gun (899105)

        Heuristics could pretty easily determine if someone communicates only in English in their e-mails, and as such, any supposedly legitimate e-mails that contain large amounts of non-English words or characters should be viewed with greater suspicion. For those that routinely communicate in more than one language and use non-ASCII sets, the heuristic should be able to account for that fact.

        These sorts of rules are always fuzzy by nature. Obviously, whether an e-mail is determined to be legitimate or not is due to many d

    • by TubeSteak (669689)

      Good that this applies to from: and not the body of the e-mail.

      That's not at all good, and filtering the body is exactly what I want.
      Spammers already spoof the from: domain and then link you out to exactly the type of domain that Gmail is now filtering.

      There's no reason Gmail can't flag [body] links to domains that use mixed character sets.

  • OK, good. Now if ICANN applied that tougher standard to domain name registrars, we'd make progress. But no, ICANN still allows registrars to register domain names without forcing them to comply with the most restrictive profile.

  • This looks like fun, I probably wouldn't catch that bank example and family certainly wouldn't. Looks like pretty much any word could substitute one letter.

    No idea exactly what these "combinations" are. The example used one letter substitution. Using this example and the little display of new letters there would appear to be billions of potentially misleading combinations.

    • The "restrictive profile" that Google is using for the filtering is defined in Unicode as any combination of the Latin character set with another set or sets, with the exception of very specific combinations (selected legitimate combinations of Asian sets that contain radically different letter forms and thus are unlikely to cause confusion).

    • by godrik (1287354)

      I'd like to see the precise rules (but too lazy to RTFA now). There are many non-English words that can be highly confusing. In French "telephone" is "téléphone", which could be thought of as a way to trick users. Also, Turkish has a dotless i; I would not be surprised if it appears in words with similar spelling in English.

  • Sounds bad (Score:2, Insightful)

    by Anonymous Coward

    If I start a business with a unicode domain, and if later a scammer registers an ascii domain that is similar looking, then Gmail will blackhole my business, not the scammer, because I'm the one using unicode.

    • In the Russian borrowed word "radio", the Cyrillic characters a and o look identical to the same English letters (the rest are completely different).

      The Russian word "radio" should be in (the specific Russian Cyrillic subset of) Unicode/UTF, while the English word "radio" should be in Unicode/ASCII.

      Mixing and matching character sets in URLs or email address typically indicates "intent to confuse". Within text, it usually just confuses translators and spell checkers.
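To make the "radio" point above concrete, here is a minimal Python sketch. The spoofed string is a made-up illustration, and guessing a script from the first word of a character's Unicode name is a rough assumption, not the official Unicode Script property. `describe` and `scripts_by_name` are hypothetical helper names.

```python
import unicodedata

latin = "radio"
# Hypothetical spoof: Cyrillic 'а' (U+0430) and 'о' (U+043E) stand in for Latin 'a' and 'o'.
mixed = "r\u0430di\u043e"

def describe(text):
    """Return (code point, Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def scripts_by_name(text):
    """Rough script guess: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text}

print(latin == mixed)   # the two strings differ even if they render alike
print(describe(mixed))
print(scripts_by_name(latin), scripts_by_name(mixed))
```

A string that looks like plain "radio" but reports more than one script is the kind of mix the comment calls "intent to confuse".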
  • whack-a-mole 3.0 (Score:2, Insightful)

    by Anonymous Coward

    And the latest round of whack-a-mole begins...

  • ...absolutely nothing! The scammers will just find some other way to create their automated email garbage.
    • by pla (258480)
      This is going to do...absolutely nothing! The scammers will just find some other way to create their automated email garbage.

      You kidding? Thanks to allowing these new email addresses, I have an entirely new category of auto-deletable spam. These won't "confuse" me because I'll never see them. Win/Win!

      Go ahead, spammers, get cute. Just makes my life thaaat much easier.
    • by AHuxley (892839)
      It looks after working with ads in English.
      It looks after other interested parties looking for expected keywords.
  • Damn, now I see it's just domains, I thought they killed all my German and French spammers.

  • by Anonymous Coward on Tuesday August 12, 2014 @05:56PM (#47658573)

    90% of the population would be better off with a white listed email account, i.e. if you are not on their list the email does not get through. END OF STORY.

    It would seem to be more efficient to filter mail IN than to filter it out. Most people would have 20 or so people they actually want mail from.
    I have mail accounts strictly for family, and my local email rules enforce this.
    I have mail accounts for "sign up" sessions for competitions that I know are going to get spammed to hell.
    I have a mail account for work, another for my business, etc., all with differing contacts.

    White listing would pretty much kill off spam: if there is zero chance of it getting through, what is the point? Currently spammers get through because of outdated spam lists, new tricks to get around Bayesian filters, etc. White lists would negate the need.

    Google, if you set up a white listed email system, my friends and family will happily sign up.

    • by Anonymous Coward

      Seriously,
      most filters are now "very good". And, I make new acquaintances, connections and friends. They have new email addresses that aren't in the whitelist. But, the filters pretty much just work.

      • by Anonymous Coward

        One way you could make whitelists work is to have a "secret handshake", a word that you require in the subject of mail from addresses that aren't whitelisted yet. You would regularly change that word and give it to new acquaintances along with your email address.

        The problem with the whitelist approach is something else: A lot of spam already pretends to be from someone you know. Spammers don't just collect individual email addresses anymore. They collect email address pairs: Who knows who.
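The "secret handshake" idea above could be sketched roughly like this in Python. This is only an illustration under stated assumptions: `accept`, the example addresses, and the handshake word are all hypothetical, and the sender address is trusted as given, which is exactly the weakness the comment points out when the From: header is forged.

```python
def accept(sender, subject, whitelist, handshake):
    """Accept mail from whitelisted senders, or from anyone quoting the handshake word."""
    sender = sender.strip().lower()
    if sender in whitelist:
        return True
    # Assumption: the handshake must appear as a whole whitespace-separated word.
    if handshake.lower() in subject.lower().split():
        whitelist.add(sender)  # promote the new acquaintance for future mail
        return True
    return False

whitelist = {"mom@example.com"}
print(accept("mom@example.com", "hi", whitelist, "pineapple"))
print(accept("new@example.org", "pineapple lunch?", whitelist, "pineapple"))
print(accept("spam@example.net", "cheap pills", whitelist, "pineapple"))
```

Changing the handshake word only affects senders who are not yet on the list, so existing contacts keep working.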

    • by Dutch Gun (899105)

      E-mail authentication seems like a better solution than whitelisting in the long term. Whitelisting can kill off spam, but that's sort of like saying you can fix a broken arm by amputation. It's technically true, but removes a lot of useful functionality.

      The big problem with e-mail spam is that the e-mail sender can be trivially forged. If we employed ubiquitous authentication systems that proved a specific domain was used, and blocked non-authenticated users (or at the very least, flag them with a big w

    • Google, if you set up a white listed email system, my friends and family will happily sign up.

      They already happily sign up. Gmail is the largest email provider in the world.

      BTW the Gmail spam filter, like any good one, does have per-user whitelists. If you reply to mail or mark mail from a sender as not spam, the filter will leave mail from those senders alone (modulo caveats like the sender properly authenticating). Thus the filter spends almost all of its effort on email from senders you haven't interacte

    • by AmiMoJo (196126) *

      A whitelist would break site sign-up and password reset emails. You could never whitelist every legit site as hundreds are launched every day. Users will never figure out how to add sites to their whitelists before signing up, and can barely cope with such emails ending up in their spam folders.

      Having said that, Gmail filters 99.9% of spam for me, and I can tolerate hitting delete for the 1 in 1000 that gets through.

    • Google, if you set up a white listed email system, my friends and family will happily sign up.

      They did, it's called Google+. Nobody seems to like it.

  • by Anonymous Coward

    If you use Unicode for domains, addresses, certificates and whatnot you are begging for an endless cascade of support problems and glitches, not to mention security vulnerabilities. Let others exercise all these broken code paths for you while you avoid the fail. Eventually, after most of the broken code gets cycled out of use, many years from now, you may then safely allow this stuff into real systems.

    Unicode breaks all sorts of stuff in subtle and unfixed ways. A fine example from a widely used Micros

    • by Immerman (2627577)

      But why would anyone waste resources properly fixing a bug that doesn't affect anyone? The only way these things will get fixed properly, is if they start causing a lot of problems. And the only way they'll cause problems is if people start using them.

      Meanwhile, why should most of the world's population have to deal with an internet incapable of handling addresses in their language? How would you like it if you woke up tomorrow to discover that all web addresses could only be written in Arabic? The Web m

  • As much as I can appreciate the intent and the fact that this will solve 99.999% of people's problems for this type of spamming and create 00.0000000001% of problems for legitimate users, it still feels a little like Google is trying to be the thought police on this one; you know free speech and all.

  • IME, Gmail is rejecting a lot of legitimate mail nowadays.

    Their filters used to be good, but they completely fucked it up lately.

  • by jones_supa (887896)

    As an interesting background fact, I heard that Google has an advanced Al doing all this stuff completely autonomously.

    His real name is Albert, by the way.

  • GMail doesn't accept all comers. Get too many complaints and they'll reject you... this is just new ideas to add to that filter. There's a list of words you can't say on GMail without it getting read, they don't publish those lists because that'll never be said to them.

  • And so this "standard" was designed in this way because country A didn't want its script mixed up with country B's, introducing vulnerabilities into the DNS system in the process. As in '' '' and 'A' all encode to different unicode er .. codes.
    • by Anonymous Coward

      They're called "code points" actually. A particular code point can be encoded in different ways (for example, the encoding of 'ß' in UTF-8 is different from the encoding in UTF-16, but they both represent the same code point.) Yeah, something like that ought to be used for network addresses...
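A short Python illustration of the distinction drawn above between a code point and its encodings, using the same 'ß' example:

```python
ch = "ß"

print(f"U+{ord(ch):04X}")            # the code point, independent of any encoding
print(ch.encode("utf-8").hex())      # bytes of the UTF-8 encoding
print(ch.encode("utf-16-be").hex())  # bytes of the UTF-16 (big-endian) encoding

# Both byte sequences decode back to the same single code point.
same = ch.encode("utf-8").decode("utf-8") == ch.encode("utf-16-be").decode("utf-16-be")
print(same)
```
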

  • They are right doing so. There are letters in different alphabets whose typing is very very similar -- or in fact they are written exactly the same, depending on the font used.

    This can be exploited for interesting uses. For example, "E" and "Ε"** are respectively the latin "e" and the greek "epsilon" vowels, but they are indistinguishable in caps, at least in Arial font. The second one is the UTF 395 code. My name has an "E" on it, and for my email signature I spell my name using the traditional latin

  • by Chrisq (894406) on Wednesday August 13, 2014 @03:30AM (#47661003)

    It allows combinations of Latin + Han + Hiragana + Katakana; Latin + Han + Bopomofo; or Latin + Han + Hangul.

    There are a lot of equally safe combinations - what about Latin + Devanagari + Tamil? There would be no look-alike characters and it would allow a lot of people to put their name in multiple scripts that are likely to be meaningful to certain audiences (e.g. someone from Tamil Nadu sending an email to people throughout India and internationally). I'm sure that there are many other combinations that wouldn't have "look alike" issues but which would be useful
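As a rough sketch of how the script combinations listed above might be checked, the Python below accepts a label if it uses a single script or if its scripts fit inside one of the three listed combinations. Several things here are assumptions rather than Gmail's or Unicode's actual rules: scripts are guessed from Unicode character names instead of the real Script property, digits and a few separators are treated as script-neutral, and `ALLOWED`, `script_of` and `passes` are hypothetical names.

```python
import unicodedata

# The three mixed-script combinations quoted in the comment above.
ALLOWED = [
    {"LATIN", "HAN", "HIRAGANA", "KATAKANA"},
    {"LATIN", "HAN", "BOPOMOFO"},
    {"LATIN", "HAN", "HANGUL"},
]

def script_of(ch):
    """Heuristic script guess from the character's Unicode name (an assumption)."""
    if ch.isdigit() or ch in "-.":
        return None  # treat digits and separators as script-neutral
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "HAN"
    return name.split()[0] if name else "UNKNOWN"

def passes(label):
    """True if the label is single-script or fits one of the ALLOWED combinations."""
    scripts = {s for s in map(script_of, label) if s}
    if len(scripts) <= 1:
        return True
    return any(scripts <= combo for combo in ALLOWED)

print(passes("mybank"))               # Latin only
print(passes("myb\u0430nk"))          # Latin plus a Cyrillic 'а'
print(passes("abc\u65e5\u672c"))      # Latin plus Han
print(passes("abc\u0928\u0ba4"))      # Latin plus Devanagari plus Tamil
```

Under this sketch, the Latin + Devanagari + Tamil mix suggested in the comment is rejected simply because it is not on the list, which is the limitation the comment is complaining about.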

  • The "highly restricted" spec is meant to catch suspicious combos like in the mybank example, but it does not catch full-ASCII (which is an even more restrictive level) trickery like tvvitter.com (notice the two "v" chars). That combo in particular is now known, but it goes to demonstrate that trickery does not need charsets larger than 7-bit... some people simply get caught by hsbc.net...
