Giant, Free Index To World's Research Papers Released Online
In a project that could unlock the world's research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles -- including many paywalled papers. Nature reports: The catalogue, which was released on October 7 and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California, that he founded. Malamud says that because his index doesn't contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers' copyright restrictions on the re-use of paywalled articles. However, one legal expert says that publishers might question the legality of how Malamud created the index in the first place.
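For readers wondering what "tables of words and sentence fragments listed next to the articles in which they appear" means in practice, here is a minimal Python sketch of the general idea: every snippet of up to five consecutive words is mapped to the identifiers of the articles that contain it. The real index's file format, fields, and identifiers are not described above, so the DOIs, tokenization, and data layout below are purely illustrative assumptions.

```python
# Illustrative sketch only: the General Index's real format is not shown in
# the summary above, so the document IDs and tokenization here are made up.
from collections import defaultdict

def snippets(text, max_len=5):
    """Yield every run of 1 to max_len consecutive words in the text."""
    words = text.lower().split()
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            if i + n <= len(words):
                yield " ".join(words[i:i + n])

def build_index(articles):
    """articles: dict mapping a document ID (e.g. a DOI) to its full text."""
    index = defaultdict(set)
    for doc_id, text in articles.items():
        for snip in snippets(text):
            index[snip].add(doc_id)
    return index

# Toy corpus with invented DOIs; note the index stores snippets, never full text.
idx = build_index({
    "10.1000/alpha": "aspirin inhibits cyclooxygenase in human platelets",
    "10.1000/beta": "cyclooxygenase expression in tumour cells",
})
print(idx["cyclooxygenase"])  # {'10.1000/alpha', '10.1000/beta'}
```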
Some researchers who have had early access to the index say it's a major development in helping them to search the literature with software -- a procedure known as text mining. [...] Computer scientists already text mine papers to build databases of genes, drugs and chemicals found in the literature, and to explore papers' content faster than a human could read. But they often note that publishers ultimately control the speed and scope of their work, and that scientists are restricted to mining only open-access papers, or those articles they (or their institutions) have subscriptions to. Some publishers have said that researchers looking to mine the text of paywalled papers need their authorization. And although free search engines such as Google Scholar have -- with publishers' agreement -- indexed the text of paywalled literature, they only allow users to search with certain types of text queries, and restrict automated searching. That doesn't allow large-scale computerized analysis using more specialized searches, Malamud says.
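As an illustration of the kind of bulk, automated querying that the restricted search engines mentioned above do not permit, here is a hedged sketch of text mining against a local copy of such a snippet table: programmatically tallying which papers mention both a gene and a drug. The file name, column layout, and terms are assumptions made for the example, not the index's actual schema.

```python
# Hedged sketch: bulk co-occurrence mining over a local snippet table.
# "general_index_snippets.tsv" and its (snippet, doc_id) column layout are
# invented for illustration; the real distribution's files may differ.
import csv
from collections import defaultdict
from itertools import product

def papers_mentioning(terms, snippet_file="general_index_snippets.tsv"):
    """Return {term: set of doc IDs} for every term found in any snippet."""
    hits = defaultdict(set)
    with open(snippet_file, newline="") as fh:
        for snippet, doc_id in csv.reader(fh, delimiter="\t"):
            for term in terms:
                if term in snippet:
                    hits[term].add(doc_id)
    return hits

genes = ["brca1", "tp53"]          # example gene names
drugs = ["cisplatin", "olaparib"]  # example drug names
hits = papers_mentioning(genes + drugs)

# Crude gene/drug co-occurrence counts of the sort text miners assemble.
for gene, drug in product(genes, drugs):
    print(gene, drug, len(hits[gene] & hits[drug]))
```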
Hopefully he doesn't get Swartz'ed (Score:3, Informative)
This project sounds similar to what Aaron Swartz [wikipedia.org] was trying to do. And to think, he's doing it right under the return of the Obiden administration... Good luck with this.
Why are we still using closed journals? (Score:5, Interesting)
It is ridiculous that academia is still largely tied to the commercial journal industry, even though the internet has existed for almost 40 years and HTML for almost 30, both having started in the very institutions that depend on shared knowledge and access to published papers.
I can see no good reason why researchers and the institutions they work in haven't turned to open access publishing for their work, especially since there is absolutely no financial incentive for them to stay with commercial publishers.
This new index is a great effort and will undoubtedly be a boon to all those researchers who have to dig through existing work that is locked away in closed journals, but a better step yet would be to ditch those journals entirely going forward and not remain tied to this stranglehold on science.
Re: (Score:1)
The review coordinators, content managers, distributors, admin staff, etc. at a journal do not necessarily work for free. Not every author will know how to lay out their article into an HTML or LaTeX template, etc.
Re: (Score:2)
You do understand that any NIH-funded work is freely available after one year, right? That would be by far the majority of published papers. The work is not "locked away in closed published journals" as you suggest.
The whole shtick about publishers being inherently evil is myopic and misses substantial, important detail.
With regard to this new index, it will certainly be interesting to see how it plays out in the courts, as it appears to have been designed to circumvent copyright laws.
An incredible resource (Score:3)
An incredible resource, and it does say "In memoriam Shamnad Basheer; Aaron Swartz". The file is 4.7TB zipped, which unzipped expands to 37.9TB. Wow, I was thinking how on earth to get that much storage when it turns out it only costs $1700 these days! W00t! :)
(WD 40TB My Cloud Pro Series PR4100 Network Attached Storage - NAS on AMZ)
I wonder if this is better than what Google can provide. The value of it being libre and potentially on your desktop, or on every person's desktop and phone in a university, is incalculable. I see a DOI but I'm not sure if arXiv links are included... anyway, interested to see who will host it!