AI Software

Can Author Obfuscation Trump Forensic Linguistics? (webis.de)

An anonymous reader writes: Everyone possesses their own writing style, which may be used to identify authors even if they wish to remain anonymous: linguists employ stylometry to settle disputes over the authorship of historic texts as well as more recent cases, and are called to verify the authors of suicide notes or threatening letters. Computer linguists carry out research on software for forensic text analyses, and a recent study shows many of these approaches to be reproducible. Now, a competition has been announced to develop obfuscation software to hide an author's style with the task: "Given a document, paraphrase it so that its writing style does not match that of its original author, anymore." We'll see what comes out of that. Meanwhile, the question remains: Who will win in the long run? Forensic linguists, or obfuscation technology?
  • Ummm (Score:5, Funny)

    by wbr1 ( 2538558 ) on Thursday January 21, 2016 @11:00AM (#51343985)
    Want to obfuscate text? Just run it through a language or 5, then back to the original language using something like google translate. No paraphrasing needed.
    • by Anonymous Coward
      I do something similar to encrypt my e-mails, I run it through ROT-13 6 times. It's foolproof.
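For what it's worth, the joke checks out: ROT-13 is its own inverse, so any even number of passes hands back the original text. A minimal sketch using the `rot_13` codec from Python's standard `codecs` module; the helper name `rot13_n` and the sample message are made up for illustration.

```python
import codecs


def rot13_n(text: str, passes: int) -> str:
    """Apply ROT-13 to text the given number of times."""
    for _ in range(passes):
        text = codecs.encode(text, "rot_13")
    return text


message = "Meet me at noon"  # made-up sample text
once = rot13_n(message, 1)       # scrambled
six_times = rot13_n(message, 6)  # back where we started
```

Six passes is three pairs of "scramble, unscramble", so `six_times` equals `message`.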
    • by Anonymous Coward

      Well, with the current quality of machine translation you'll lose a lot of content too.

      • Which is why you proofread it after the final pass, and make adjustments then. Yes, it might be possible that your adjustments can still be enough to identify you, but it seems much less likely to me.
    • Re: (Score:3, Interesting)

      by ohieaux ( 2860669 )
      English
      Want to obfuscate text? Just run it through a language or 5, then back to the original language using something like google translate. No paraphrasing needed.
      Afrikaans
      Wil teks verduisteren ? Net hardloop dit deur 'n taal of 5 , dan terug na die oorspronklike taal gebruik van iets soos Google vertaal. Geen parafrasering nodig .
      Albanian
      Dëshironi tekstin errët ? Vetëm të drejtuar atë nëpërmjet një gjuhe ose 5 , pastaj kthehet për të përdorur gj
      • That's pretty close, actually. Hmm ... are there languages with syntax sufficiently different from Romance languages to overcome this?
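The round trip suggested at the top of this thread could be wired up roughly as follows. This is only a sketch: the `translate` callable is a placeholder for whatever translation service you have (no real API is assumed here), and `tag_hops` is a fake translator that just records each hop so the chaining can be checked without network access.

```python
from typing import Callable, Sequence

# Assumed shape of a translator: (text, source_lang, target_lang) -> text
Translate = Callable[[str, str, str], str]


def round_trip(text: str, pivots: Sequence[str], translate: Translate,
               home: str = "en") -> str:
    """Translate text through each pivot language in turn, then back home."""
    current = home
    for lang in list(pivots) + [home]:
        text = translate(text, current, lang)
        current = lang
    return text


def tag_hops(text: str, src: str, dst: str) -> str:
    """Fake translator (hypothetical): appends a marker instead of translating."""
    return f"{text}[{src}>{dst}]"


trace = round_trip("hello", ["af", "sq"], tag_hops)
```

With a real translator plugged in, the proofreading pass mentioned above would still be needed afterwards.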

  • because of reasons which are not obvious and which I will not reveal although you already know them.
    • by ranton ( 36917 )

      Well in this case the reason is fairly obvious. Since the question asked about the long run, it is safe to assume machines which can comprehend natural language will be used to obfuscate text in the long run. Once that happens, I would assume obfuscation will easily win. It could not only win, but it could almost certainly be able to produce false positives.

  • I doubt this is possible to do very well. Consider [1], where they were able to identify authors from compiled code. Not with close to 100% accuracy, but it's still surprising that your source code style is identifiable with optimization enabled and symbols stripped out.

    [1] ftp://ftp.cs.wisc.edu/paradyn/... [wisc.edu]

    • Someone should write an English compiler.
      • They'd fail utterly.

        Remember that Star Trek episode where the robots kept saying "Norman, coordinate" up until Kirk and Spock made his brain explode? Picture that.

        English is far too malleable and imprecise.

      • Comment removed based on user account deletion
    • They succeeded with nothing like 100% accuracy, and they used a small sample set, which has the side effect of avoiding confusion.

      Put another way: face recognition seems promising with similar accuracy rate when limited to a small set of faces. But once you open the flood gates the accuracy goes way down.

      Proponents fall back on the "it works as a pre-filter" which, depending on the size of the population you are working with, might have sufficient true positive with a low enough false positive to make it workable. But it is a

  • by I'm New Around Here ( 1154723 ) on Thursday January 21, 2016 @11:21AM (#51344193)

    Back in the 1970's Stephen King wrote some novels under the pseudonym Richard Bachman. It worked for a while, but people were able to figure out that Bachman wrote in the same style as the famous Stephen King. Eventually the secret broke.

    I wonder if those novels written under the pseudonym would make a good test of the system. Run them through the process, give the results to newer readers of King's known works, and see if they notice the similarities others did in the past.

    • and then there are authors who have a diverse writing style. Try author identifying software | reader identification of anonymized works on a corpus including the work of Walter Jon Williams -- and I doubt that he is the only author to vary style.

      • Heck, separate Lord of the Rings into narrative and dialog and compare those. Tolkien used different styles there. The time I remember that he tried using the dialog-type language in narrative and description, at the first formal dinner Frodo attends in Rivendell, it sounded ridiculous.

  • If someone is serious about obfuscating their writing, they will be able to. Especially once they get access to the software that would be used to examine it.

    However, most people are not going to even bother attempting to obfuscate.

    • This seems pretty true, if I was writing something that I would not want traced back to me I would not trust some program anyway.

      Maybe if I was super paranoid about the NSA or Google somehow linking my random internet comments all to me, then a program might have some use.

      It would be interesting to see if the program could go through /. AC postings and see if they can match them up to a user.

      • I've actually done some obfuscation of my own communications. Years ago, I worked for a tech company where most of my co-workers were about half my age at best, and their word usage, grammar and syntax often made them look like high school dropouts, especially when compared to my writing. (No, I'm not bragging; it's just that unlike them, I cared about such things and tried harder than they did to get it right.)

        One of the ways we had for giving feedback was an internal website where we could "ask the su
  • To quantify the degree of obfuscation, they have precise computational metrics based on their stylometric algorithms. But to judge the quality of the obfuscation, there are no objective metrics. Instead:

    To measure soundness and properness, obfuscations will be sampled and handed out to participants for peer-review.

    which seems to me to make the contest rather less meaningful. Why not just peer review the quality of all obfuscations exceeding some minimum standard?

  • by Anonymous Coward

    As a trained linguist, though not an expert on forensic linguistics, I believe that successful automated obfuscation will win and be essentially unbeatable, but probably also detectable. By rewriting a text automatically, valuable information is destroyed that a forensic linguist has to rely upon. (When humans try to obfuscate text, on the other hand, they tend to add such information, potentially even making the task easier for the forensic linguist. For example, blackmailers commonly imitate foreign ac

  • You are looking for a tool that extracts the meaning from a text then re-writes it in a standardized, canonical format, or at least "washes" it into one of a list of possible formats such that if you take a bunch of random input from a bunch of different authors, you can't tell from the output who wrote what.

    I expect this will be successful within 10 years if we work hard on it.

  • This strikes me as an extremely difficult task, assuming the tolerance for losing meaning is low. Maybe IBM Watson work applies.

    • by umghhh ( 965931 )
      That may be true, but most of what is written anywhere in this world is meaningless drivel, so the problem of losing meaning does not really exist. Nor does the need to obfuscate, I admit.
      This leaves us with people who actually have something to say. I reckon there would be a tiny minority among them who would want such a service, but that also means producing it would most likely be economically unfeasible.
      Then there are trolls and 50c soldiers which could use the service of course but I guess it
  • This will depend heavily on which language the original and end documents are in. Or: Success relies strongly on source and target vernacular.

    English has numerous words for the same thing. Try to say a guy is cute, handsome, beautiful, or hot in Portuguese and it all translates to "Bonito".

  • by Anonymous Coward on Thursday January 21, 2016 @12:05PM (#51344529)

    This has nothing to do with Trump.

    • This has nothing to do with Trump.

      Yes it does, the guy has a built-in obfuscation engine that sits between the part of his brain that handles rational thought and his mouth, but it only kicks in when he is giving political speeches.

  • We should all just move to newspeak to eliminate the detection / obfuscation arms race entirely.

  • This is very true. I can identify every single post by apk, even if he posts Anonymously. I must be a genius.
  • by jheath314 ( 916607 ) on Thursday January 21, 2016 @12:12PM (#51344581)

    The TFA assumes that stylometry gives somewhat reliable results. It doesn't. Something as simple as an editor cleaning up a work can throw off the analysis.

    Even in the optimal scenario (an unedited work by a single author who isn't trying to hide or imitate a different style), the best algorithms have abysmally high failure rates.

    (KNN), 50 neighbors: 0.69 success, 0.28 fail
    Decision Tree: 0.58 success, 0.42 fail
    Mean Margins Tree: 0.65 success, 0.36 fail

    Stylometry is reasonably effective at correctly identifying when two works by the same author have the same style. It is garbage when it comes to determining when two works have different authors. If I were to guess, I'd say the problem is that the variation in style between authors (compared to the variation within a single author's work) is not always wide enough to allow for reliable identification.

    Stylometry is interesting, certainly, but the prospect of such an unreliable method being used for anything important is alarming.
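To make the point about between-author versus within-author variation concrete, here is a toy nearest-neighbor attributor. It is a sketch under loud assumptions: the feature set (relative frequencies of ten arbitrary function words), the cosine measure, and the sample texts are all invented for illustration. It is not the method behind the figures quoted above, and it is far cruder than real stylometric software.

```python
import math
import re
from collections import Counter

# Small, arbitrary list of function words (an assumption, not a standard set).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "i"]


def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(unknown: str, known: dict[str, str]) -> tuple[str, float]:
    """Return the candidate whose sample is most similar, plus the score."""
    target = profile(unknown)
    scores = {name: cosine(target, profile(text)) for name, text in known.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up samples for two hypothetical candidates.
known = {
    "a": "the cat of the house and the dog of the yard",
    "b": "i think it is fine that i said it is so",
}
guess, score = attribute("the bird of the tree and the fish of the sea", known)
```

Note that `attribute` always names its nearest candidate, even when the true author is not in `known` at all. Nothing in the score says "none of the above", which is the weakness described above.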

    • Indeed. I've been reading H. Beam Piper's "Fuzzy" stories to my kids and it is quite amusing to have the "veridicator" play such a prominent role as an infallible method of separating truth from lies (although the narration admits the possibility of unintentional deception, wherein someone truly believes what they are saying, it emphatically rejects the possibility of deliberate deception).

      To the topic at hand, it is certainly interesting and even useful when applied intelligently. For example, it is well est

  • All the obfuscation software has to do is change things so it casts enough doubt. I assume the stylometry analysis doesn't return a 1 or 0, it probably returns a probability. Once the probability is below a certain threshold, the job is done. An example of obfuscating: How about a simple machine translation to another language?
    • An example of obfuscating: How about a simple machine translation to another language?

      That would certainly obfuscate it for people who didn't speak the other language.

      Were you suggesting a round trip? Things may have moved on, but I remember playing with this some years back and the results were changed way beyond style.
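The "get the probability below a threshold and stop" idea from the top of this thread can be sketched as a simple loop. Everything here is hypothetical: `toy_score` is a stand-in for whatever number a stylometry tool would return (it just counts a few invented "tell" words), and the `swaps` are hand-written rewrites, not a real paraphraser.

```python
from typing import Callable, Iterable


def obfuscate_until(text: str,
                    transforms: Iterable[Callable[[str], str]],
                    score: Callable[[str], float],
                    threshold: float) -> tuple[str, float, bool]:
    """Apply transforms one at a time until score(text) drops below threshold.

    Returns (text, final_score, reached_threshold).
    """
    current = score(text)
    for transform in transforms:
        if current < threshold:
            break
        text = transform(text)
        current = score(text)
    return text, current, current < threshold


# Invented "tell" words for an imaginary author.
TELLS = {"whilst", "indeed", "rather"}


def toy_score(text: str) -> float:
    """Fraction of words that are tells; a stand-in, not a real probability."""
    words = text.lower().split()
    return sum(w.strip(".,") in TELLS for w in words) / max(len(words), 1)


swaps = [
    lambda t: t.replace("whilst", "while"),
    lambda t: t.replace("Indeed, ", ""),
    lambda t: t.replace("rather ", ""),
]

sample = "Indeed, it was rather late whilst we spoke."
result, final, ok = obfuscate_until(sample, swaps, toy_score, threshold=0.1)
```

The loop stops as soon as the score is low enough, so it changes no more of the text than it has to. That matters if, as noted above, heavy-handed rewriting changes things "way beyond style".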

  • Could it do anything for Trump's linguistics?

  • Supposing it works (not saying it's likely), this would be a big problem for catching plagiarists. Copy somebody's text, run it through this, and then hand it in: boom, you're done. You could certainly have anti-plagiarism software that runs this in reverse (or you take your database of comparison docs and run them all through the obfuscator, something along those lines) but if they do it right and there's some degree of randomness, it introduces a massive dose of plausible deniability to any plagiarism cas

  • If the intent is to obfuscate the style, just run it through a few languages and back as someone already suggested. But I'm guessing they want something that doesn't look like word salad.

    We call an obfuscation software

      1. safe, if a forensic analysis does not reveal the original author of its obfuscated texts,
      2. sound, if its obfuscated texts are textually entailed with their originals, and
      3. proper, if its obfuscated texts are inconspicuous.

    Yup, right there: proper. They're basically asking for someone to write t

  • Enough with all the Trump articles jeez!
  • For one example, see

    "Obfuscating Document Stylometry to Preserve Author Anonymity"
    Gary Kacmarcik & Michael Gamon

    This technique is not an automated one, but hey, all you need is more software.

  • I hope part of the competition is to retain meaning and have correct grammar. Because if not, you might as well just do content spinning and declare it done.
