How Journalists Data-Mined the Wikileaks Docs
meckdevil writes "Associated Press developer-journalist extraordinaire Jonathan Stray gives a brilliant explanation of the use of data-mining strategies to winnow massive numbers of documents and wring journalistic sense out of them, using the Iraq and Afghanistan war logs released by Wikileaks as a case in point. The concepts for focusing on certain groups of documents and ignoring others are hardly new; they underlie the algorithms used by the major Web search engines. Their use in a journalistic context is on the cutting edge, though, and it raises a fascinating quandary: by choosing the parameters under which documents are considered similar enough to pay attention to, journalist-programmers effectively choose the frame in which a story will be told. This type of data mining holds great potential for investigative revelation — and great potential for journalistic abuse."
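The "parameters of similarity" the summary mentions can be made concrete. A common approach (not necessarily the exact one Stray used) is to weight terms by TF-IDF and compare documents by cosine similarity; the threshold you pick for "similar enough" decides which reports end up in the same pile. A minimal pure-Python sketch, with made-up report snippets:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute a TF-IDF weight vector (sparse dict) for each tokenized document."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical war-log snippets, tokenized by whitespace.
docs = [
    "ied attack on convoy near baghdad".split(),
    "convoy struck by ied near baghdad".split(),
    "routine meeting with local council".split(),
]
vecs = tf_idf_vectors(docs)
# Reports 0 and 1 share distinctive terms, so they score as similar;
# report 2 shares none, so it lands in a different cluster.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # → True
```

The framing problem in the summary is visible right here: tokenize differently, weight terms differently, or move the similarity cutoff, and different documents get grouped (and therefore read) together.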
I just used grep -P (Score:4, Interesting)
Worked miracles once I'd gotten around the ugly HTML format they used to release all that INFORMATION. Still, there was very little new or worthwhile in the heap of news clips and rumour aggregations. Frankly, the more I grep it, the less it looks like the "largest leak in history" and the more it seems like the "largest controlled release of information" in history.
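The poster's workflow — strip the ugly HTML, then pattern-match the remainder — can be sketched without grep at all. A rough Python equivalent (the sample page below is invented, and the tag stripper is a crude regex adequate for grepping, not real parsing):

```python
import html
import re

def strip_tags(markup):
    """Crudely drop <...> tags and decode entities; good enough for line-grepping."""
    return html.unescape(re.sub(r"<[^>]+>", " ", markup))

def grep(pattern, text):
    """Return the lines of `text` matching `pattern`, like `grep -P` on cleaned input."""
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]

# Hypothetical snippet standing in for one of the released HTML pages.
page = ("<tr><td>2010-07-25</td><td>IED &amp; small-arms fire reported</td></tr>\n"
        "<tr><td>2010-07-26</td><td>routine patrol, nothing significant</td></tr>")

print(grep(r"IED|small-arms", strip_tags(page)))
```

The same effect on the command line would be piping the files through an HTML-to-text converter before `grep -P`, which spares you matching patterns against tag soup.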
/ takes off conspiracy theory hat // flame on
Comment removed (Score:3, Interesting)
Re:We're Not Limited to Only One Context (Score:2, Interesting)
If memory serves, and I'm not missing something in my quick re-read of the Wikipedia page, the leaked cables were not all made available to everyone. They were distributed to five major news organizations so more than one editorial staff could reasonably decide which material was newsworthy and which was too sensitive to publish (sarcastic example: the GPS coordinates of Obama's real long-form birth certificate). This is a reasonably good idea, but it does mean that there are only a handful of people who have access to all the documents.
Have you ever heard that when you find something, it's always in the last place you look? That's because you stop looking once you're satisfied. Similarly, an editor searching for terms that might confirm a previously unsubstantiated rumor he's got tucked away in a story on the shelf may find what he's looking for, but he won't find the really juicy stuff he didn't know to look for.
In a perfect world, the system would correct for this because some enterprising young journalists who are willing to "pound the pavement" and read the whole thing would uncover the stuff they missed. But because of the limited set of people who have access, that won't happen for a decade or two at the earliest. It's a necessary evil to prevent information like the locations of and personnel at sensitive sites from falling into the wrong hands.
Re:But this isn't reasoned analysis (Score:5, Interesting)