Can Author Obfuscation Trump Forensic Linguistics? (webis.de) 84

Posted by timothy on Thursday January 21, 2016 @11:57AM from the abe-lincoln-predicted-this dept.

An anonymous reader writes: Everyone possesses their own writing style, which may be used to identify authors even if they wish to remain anonymous: linguists employ stylometry to settle disputes over the authorship of historic texts as well as more recent cases, and are called to verify the authors of suicide notes or threatening letters. Computer linguists carry out research on software for forensic text analyses, and a recent study shows many of these approaches to be reproducible. Now, a competition has been announced to develop obfuscation software to hide an author's style with the task: "Given a document, paraphrase it so that its writing style does not match that of its original author, anymore." We'll see what comes out of that. Meanwhile, the question remains: Who will win in the long run? Forensic linguists, or obfuscation technology?

This discussion has been archived. No new comments can be posted.

Ummm (Score:5, Funny)

by wbr1 ( 2538558 ) writes: on Thursday January 21, 2016 @12:00PM (#51343985)

Want to obfuscate text? Just run it through a language or 5, then back to the original language using something like google translate. No paraphrasing needed.

- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  I do something similar to encrypt my e-mails, I run it through ROT-13 6 times. It's foolproof.
  - Re: (Score:3)
    
    by s.petry ( 762400 ) writes:
    
    I can one up you, because I use ROT6 13 times. Way more better!
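
For anyone who wants to check the arithmetic behind the two jokes above, here is a minimal sketch. The `rot` and `repeat` helpers are made up for illustration, and the sketch assumes plain ASCII letters only. Six rounds of ROT-13 and thirteen rounds of ROT6 both shift by 78 positions in total, and 78 is a multiple of 26, so both schemes hand back the original text.

```python
import string

def rot(text, shift):
    """Caesar-shift ASCII letters by `shift` positions, leaving other characters alone."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    k = shift % 26
    table = str.maketrans(lower + upper, lower[k:] + lower[:k] + upper[k:] + upper[:k])
    return text.translate(table)

def repeat(text, shift, times):
    """Apply the same shift `times` times in a row."""
    for _ in range(times):
        text = rot(text, shift)
    return text

message = "Attack at dawn"
print(repeat(message, 13, 6) == message)  # ROT-13 six times: 78 positions in total
print(repeat(message, 6, 13) == message)  # ROT6 thirteen times: also 78 positions
```

Both comparisons come out equal, which is the punchline: neither scheme changes the message at all.
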
- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  Well, with the current quality of machine translation you'll lose a lot of content too.
  - Re: (Score:2)
    
    by Golddess ( 1361003 ) writes:
    
    Which is why you proofread it after the final pass, and make adjustments then. Yes, it might be possible that your adjustments can still be enough to identify you, but it seems much less likely to me.
- Re: (Score:3, Interesting)
  
  by ohieaux ( 2860669 ) writes:
  
  English
  Want to obfuscate text? Just run it through a language or 5, then back to the original language using something like google translate. No paraphrasing needed.
  Afrikaans
  Wil teks verduisteren ? Net hardloop dit deur 'n taal of 5 , dan terug na die oorspronklike taal gebruik van iets soos Google vertaal. Geen parafrasering nodig .
  Albanian
  Dëshironi tekstin errët ? Vetëm të drejtuar atë nëpërmjet një gjuhe ose 5 , pastaj kthehet për të përdorur gj
  - Re: (Score:2)
    
    by drew_kime ( 303965 ) writes:
    
    That's pretty close, actually. Hmm ... are there languages with syntax sufficiently different from Romance languages to overcome this?
- Re: (Score:3)
  
  by bluefoxlucid ( 723572 ) writes:
  
  I read Trump as a noun and thought the title was nonsense.
  - Re: (Score:3, Funny)
    
    by bondsbw ( 888959 ) writes:
    
    Of course it is, that's what obfuscation does.
  - Re: (Score:2)
    
    by Anne Thwacks ( 531696 ) writes:
    
    The sooner Trump is obfuscated, the better!
  - Re: (Score:2)
    
    by mspohr ( 589790 ) writes:
    
    Maybe a computer could make sense of Palin's word salad.
    - Re: (Score:1)
      
      by balbeir ( 557475 ) writes:
      
      Clearly the English language has deteriorated into a hybrid of hillbilly, valleygirl, inner-city slang and various grunts.
      - Re: (Score:2)
        
        by bluefoxlucid ( 723572 ) writes:
        
        It is for this reason I've started a style guide to clear English. This guide includes communicative, informative, and persuasive styles, with a subsection on expletives for persuasive writing and speaking.
        Essentially, it's just Strunk and White, Dale Carnegie, and a few other pieces of broad research brought together. Informative style will provide the greatest difficulty, as I'll need to cobble it together from experience and abstract concepts, rather than other research. For example: SQ3R and its d
        
        - Re: (Score:2)
          
          by TheRealHocusLocus ( 2319802 ) writes:
          
          The book *does* target general consumption
          I look forward to devouring it!
          I, for one, also saw "Trump" in the title.
          That makes two.
Obfuscation always wins (Score:2)

by deathcloset ( 626704 ) writes:

because of reasons which are not obvious and which I will not reveal although you already know them.
- Re: (Score:3)
  
  by ranton ( 36917 ) writes:
  
  Well in this case the reason is fairly obvious. Since the question asked about the long run, it is safe to assume machines which can comprehend natural language will be used to obfuscate text in the long run. Once that happens, I would assume obfuscation will easily win. It could not only win, but it could almost certainly be able to produce false positives.
Unlikely (Score:2)

by DeathToBill ( 601486 ) writes:

I doubt this is possible to do very well. Consider [1], where they were able to identify authors from compiled code. Not with close to 100% accuracy, but it's still surprising that your source code style is identifiable with optimization enabled and symbols stripped out.
[1] ftp://ftp.cs.wisc.edu/paradyn/... [wisc.edu]
- Re: (Score:2)
  
  by avandesande ( 143899 ) writes:
  
  Someone should write an English compiler.
  - Re: (Score:2)
    
    by gstoddart ( 321705 ) writes:
    
    They'd fail utterly.
    Remember that Star Trek episode where the robots kept saying "Norman, coordinate" up until Kirk and Spock made his brain explode? Picture that.
    English is far too malleable and imprecise.
  - Re: (Score:2)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
- Re: (Score:2)
  
  by thoromyr ( 673646 ) writes:
  
  they succeeded with nothing like 100% using a small sample set which has the side effect of avoiding confusion.
  Put another way: face recognition seems promising with similar accuracy rate when limited to a small set of faces. But once you open the flood gates the accuracy goes way down.
  Proponents fall back on the "it works as a pre-filter" which, depending on the size of the population you are working with, might have sufficient true positive with a low enough false positive to make it workable. But it is a
Stephen King is not dead. (Score:5, Interesting)

by I'm New Around Here ( 1154723 ) writes: on Thursday January 21, 2016 @12:21PM (#51344193)

Back in the 1970's Stephen King wrote some novels under the pseudonym Richard Bachman. It worked for a while, but people were able to figure out that Bachman wrote in the same style as the famous Stephen King. Eventually the secret broke.
I wonder if those novels written under the pseudonym would make a good test of the system. Run them through the process, give the results to newer readers of King's known works, and see if they notice the similarities others did in the past.

- Re: (Score:2)
  
  by thoromyr ( 673646 ) writes:
  
  and then there are authors who have a diverse writing style. Try author identifying software | reader identification of anonymized works on a corpus including the work of Walter Jon Williams -- and I doubt that he is the only author to vary style.
  - Re: (Score:2)
    
    by david_thornley ( 598059 ) writes:
    
    Heck, separate Lord of the Rings into narrative and dialog and compare those. Tolkien used different styles there. The time I remember that he tried using the dialog-type language in narrative and description, at the first formal dinner Frodo attends in Rivendell, it sounded ridiculous.
- Re: (Score:2)
  
  by techno-vampire ( 666512 ) writes:
  
  German also puts the verb(s) at the end of the sentence. Translate your work into proper German, have a computer make a literal translation back to English and you'll get much the same thing as Yoda-speak.
  - Re: (Score:2)
    
    by HornWumpus ( 783565 ) writes:
    
    An average sentence, in a German newspaper, is a sublime and impressive curiosity; it occupies a quarter of a column; it contains all the ten parts of speech -- not in regular order, but mixed; it is built mainly of compound words constructed by the writer on the spot, and not to be found in any dictionary -- six or seven words compacted into one, without joint or seam -- that is, without hyphens; it treats of fourteen or fifteen different subjects, each inclosed in a parenthesis of its own, with here and t
Only for noobs (Score:2)

by penguinoid ( 724646 ) writes:

If someone is serious about obfuscating their writing, they will be able to. Especially once they get access to the software that would be used to examine it.
However, most people are not going to even bother attempting to obfuscate.
- Re: (Score:1)
  
  by EdwardFurlong ( 3697195 ) writes:
  
  This seems pretty true, if I was writing something that I would not want traced back to me I would not trust some program anyway.
  Maybe if I was super paranoid about the NSA or Google somehow linking my random internet comments all to me, then a program might have some use.
  It would be interesting to see if the program could go through /. AC postings and see if they can match them up to a user.
  - Re: (Score:2)
    
    by techno-vampire ( 666512 ) writes:
    
    I've actually done some obfuscation of my own communications. Years ago, I worked for a tech company where most of my co-workers were about half my age at best, and their word usage, grammar and syntax often made them look like high school dropouts, especially when compared to my writing. (No, I'm not bragging; it's just that unlike them, I cared about such things and tried harder than they did to get it right.)
    
    One of the ways we had for giving feedback was an internal website where we could "ask the su
Lacking objective quality metric (Score:2)

by GlobalEcho ( 26240 ) writes:

To quantify the degree of obfuscation, they have precise computational metrics based on their stylometric algorithms. But to judge the quality of the obfuscation, there are no objective metrics. Instead:
To measure soundness and properness, obfuscations will be sampled and handed out to participants for peer-review.
which seems to me to make the contest rather less meaningful. Why not just peer review the quality of all obfuscations exceeding some minimum standard?
Obfuscation will win...if it works (Score:1)

by Anonymous Coward writes:

As a trained linguist, though not an expert on forensic linguistics, I believe that successful automated obfuscation will win and be essentially unbeatable, but probably also detectable. By rewriting a text automatically, valuable information is destroyed that a forensic linguist has to rely upon. (When humans try to obfuscate text, on the other hand, they tend to add such information, potentially even making the task easier for the forensic linguist. For example, blackmailers commonly imitate foreign ac
Basically you are looking for a translator (Score:1)

by davidwr ( 791652 ) writes:

You are looking for a tool that extracts the meaning from a text then re-writes it in a standardized, canonical format, or at least "washes" it into one of a list of possible formats such that if you take a bunch of random input from a bunch of different authors, you can't tell from the output who wrote what.
I expect this will be successful within 10 years if we work hard on it.
hard (Score:2)

by bigdavex ( 155746 ) writes:

This strikes me as an extremely difficult task, assuming the tolerance for losing meaning is low. Maybe IBM Watson work applies.
- Re: (Score:1)
  
  by umghhh ( 965931 ) writes:
  
  That may be true but most of what is written anywhere in this world is meaningless drivel so the problem of losing meaning does not exist really. The need to obfuscate neither I admit.
  This leaves us with people that actually have something to say. I reckon there would be a tiny minority among them, that would want to have such service but that also means its production would most likely be economically unfeasible.
  Then there are trolls and 50c soldiers which could use the service of course but I guess it
Depends (Score:2)

by spaceman375 ( 780812 ) writes:

This will depend heavily on which language the original and end documents are in. Or: Success relies strongly on source and target vernacular.
English has numerous words for the same thing. Try to say a guy is cute, handsome, beautiful, or hot in Portuguese and it all translates to "Bonito".
Click bait title (Score:4, Funny)

by Anonymous Coward writes: on Thursday January 21, 2016 @01:05PM (#51344529)

This has nothing to do with Trump.

- Re: (Score:1)
  
  by Feral Nerd ( 3929873 ) writes:
  
  This has nothing to do with Trump.
  Yes it does, the guy has a built-in obfuscation engine that sits between the part of his brain that handles rational thought and his mouth, but it only kicks in when he is giving political speeches.
Newspeak (Score:2)

by techsoldaten ( 309296 ) writes:

We should all just move to newspeak to eliminate the detection / obfuscation arms race entirely.
apk (Score:2)

by 110010001000 ( 697113 ) writes:

This is very true. I can identify every single post by apk, even if he posts Anonymously. I must be a genius.
Polygraph 2.0 (Score:3)

by jheath314 ( 916607 ) writes: on Thursday January 21, 2016 @01:12PM (#51344581)

The TFA assumes that stylometry gives somewhat reliable results. It doesn't. Something as simple as an editor cleaning up a work can throw off the analysis.
Even in the optimal scenario (an unedited work by a single author who isn't trying to hide or imitate a different style), the best algorithms have abysmally high failure rates.
(KNN), 50 neighbors: 0.69 success, 0.28 fail
Decision Tree: 0.58 success, 0.42 fail
Mean Margins Tree: 0.65 success, 0.36 fail
Stylometry is reasonably effective at correctly identifying when two works by the same author have the same style. It is garbage when it comes to determining when two works have different authors. If I were to guess, I'd say the problem is that the variation in style between authors (compared to the variation within a single author's work) is not always wide enough to allow for reliable identification.
Stylometry is interesting, certainly, but the prospect of such an unreliable method being used for anything important is alarming.
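
To make the within-author versus between-author point concrete, here is a rough sketch of one common stylometric feature, function-word frequencies. It is not the method behind the numbers quoted above. The word list, the helper names (`profile`, `distance`, `separable`), and the sample distances are all assumptions made up for illustration.

```python
import math
from collections import Counter

# Illustrative function-word list; real stylometric feature sets are much larger.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    words = [w for w in words if w]
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def separable(within, between):
    """True only if every same-author distance is smaller than every
    different-author distance, i.e. the two ranges do not overlap."""
    return max(within) < min(between)

# Hypothetical distances, not measurements from any study.
print(separable(within=[0.02, 0.05], between=[0.08, 0.11]))  # clean gap
print(separable(within=[0.02, 0.09], between=[0.08, 0.11]))  # ranges overlap
```

In the second call one author's own texts sit further apart (0.09) than two different authors do (0.08), so no single cutoff can tell "same author" from "different author". That is the failure mode guessed at above.
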

- Re: (Score:2)
  
  by thoromyr ( 673646 ) writes:
  
  Indeed. I've been reading H. Beam Piper's "Fuzzy" stories to my kids and it is quite amusing to have the "veridicator" play such a prominent role as an infallible method of separating truth from lies (although the narration admits the possibility of unintentional deception wherein someone truly believes what they are saying it emphatically rejects the possibility of deliberate deception).
  To the topic at hand, it is certainly interesting and even useful when applied intelligently. For example, it is well est
- Re: (Score:2)
  
  by turning in circles ( 2882659 ) writes:
  
  Brevity is the soul of obfuscation. "Can a program designed to obfuscate author identity defeat a program designed to verify author identity?"
Seems simple to me (Score:2)

by Billy the Mountain ( 225541 ) writes:

All the obfuscation software has to do is change things so it casts enough doubt. I assume the stylometry analysis doesn't return a 1 or 0, it probably returns a probability. Once the probability is below a certain threshold, the job is done. An example of obfuscating: How about a simple machine translation to another language?
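
The "get the probability under a threshold" idea can be sketched roughly as below. Everything here is hypothetical: `score` stands in for whatever attribution tool is in use (assumed to return a probability between 0 and 1), `rewrites` stands in for paraphrasing steps, and the toy scorer that counts the word "whilst" exists only so the loop has something to run against.

```python
from typing import Callable, Iterable, Tuple

def obfuscate_until_doubtful(
    text: str,
    score: Callable[[str], float],             # hypothetical: P(text is by the suspected author)
    rewrites: Iterable[Callable[[str], str]],  # hypothetical paraphrasing steps, tried in order
    threshold: float = 0.5,
) -> Tuple[str, float]:
    """Apply rewrites one at a time until the attribution score drops below threshold."""
    current = text
    current_score = score(current)
    for rewrite in rewrites:
        if current_score < threshold:
            break  # enough doubt already, stop changing the text
        current = rewrite(current)
        current_score = score(current)
    return current, current_score

# Toy stand-ins, for illustration only: the "style marker" is how often "whilst" appears.
def toy_score(text: str) -> float:
    words = text.lower().split()
    return min(1.0, 10 * words.count("whilst") / max(1, len(words)))

def swap_whilst(text: str) -> str:
    return text.replace("whilst", "while")

result, p = obfuscate_until_doubtful(
    "I wrote this whilst waiting and whilst bored", toy_score, [swap_whilst]
)
print(result, p)
```

Stopping as soon as the score is under the threshold keeps the text as close to the original as possible, which matters if the meaning has to survive.
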
- Re: (Score:2)
  
  by Hognoxious ( 631665 ) writes:
  
  An example of obfuscating: How about a simple machine translation to another language?
  That would certainly obfuscate it for people who didn't speak the other language.
  Were you suggesting a round trip? Things may have moved on, but I remember playing with this some years back and the results were changed way beyond style.
- Re: (Score:3)
  
  by dgatwood ( 11270 ) writes:
  
  You touched on a key point there, without actually saying it, which is that the ability of forensic linguistics to recognize a person is inversely proportional to the number of people who could have written the content.
  For example, let's say that you're a native Russian speaker, and that your English grammar has certain linguistic quirks that are typical of Russian speakers writing English, e.g. missing all the definite and indefinite articles ("We read book, da?"). If exactly one Russian has access to some
  - Re: (Score:3)
    
    by dgatwood ( 11270 ) writes:
    
    Alternatively, delete all the definite and indefinite articles. Then they'll blame your one Russian coworker.
Yes, but... (Score:2)

by Bearhouse ( 1034238 ) writes:

Could it do anything for Trump's linguistics?
Supposing it works... (Score:2)

by werepants ( 1912634 ) writes:

Supposing it works (not saying it's likely), this would be a big problem for catching plagiarists. Copy somebody's text, run it through this, and then hand it in: boom, you're done. You could certainly have anti-plagiarism software that runs this in reverse (or you take your database of comparison docs and run them all through the obfuscator, something along those lines) but if they do it right and there's some degree of randomness, it introduces a massive dose of plausible deniability to any plagiarism cas
What spam house is funding this? (Score:2)

by drew_kime ( 303965 ) writes:
If the intent is to obfuscate the style, just run it through a few languages and back as someone already suggested. But I'm guessing they want something that doesn't look like word salad.
We call an obfuscation software

safe, if a forensic analysis does not reveal the original author of its obfuscated texts,

sound, if its obfuscated texts are textually entailed with their originals, and

proper, if its obfuscated texts are inconspicuous.
Yup, right there: proper. They're basically asking for someone to write t
Slashdot is going down the toilet (Score:1)

by Soccerguy1832 ( 2926149 ) writes:

Enough with all the Trump articles jeez!
Yes, plenty of research out there already (Score:1)

by SlideRuleGuy ( 987445 ) writes:

For one example, see
"Obfuscating Document Stylometry to Preserve Author Anonymity"
Gary Kacmarcik & Michael Gamon
This technique is not an automated one, but hey, all you need is more software.
Isn't This Content Spinning? (Score:2)

by Tsu Dho Nimh ( 663417 ) writes:

I hope part of the competition is to retain meaning and have correct grammar. Because if not, you might as well just do content spinning and declare it done.
