Extracting Meaning From Millions of Pages 138

freakshowsam writes "Technology Review has an article on a software engine, developed by researchers at the University of Washington, that pulls together facts by combing through more than 500 million Web pages. TextRunner extracts information from billions of lines of text by analyzing basic relationships between words. 'The significance of TextRunner is that it is scalable because it is unsupervised,' says Peter Norvig, director of research at Google, which donated the database of Web pages that TextRunner analyzes. The prototype still has a fairly simple interface and is not meant for public search so much as to demonstrate the automated extraction of information from 500 million Web pages, says Oren Etzioni, a University of Washington computer scientist leading the project." Try the query "Who has Microsoft acquired?"
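To make the idea concrete, here is a deliberately naive sketch of TextRunner-style "open information extraction": pulling (argument, relation, argument) triples out of raw text without hand-labeled training data. The single regex and tiny relation list below are illustrative assumptions only; the actual system learns an extractor over parsed text rather than matching fixed patterns.

```python
import re

# Toy open-IE sketch: match "NounPhrase verb NounPhrase" and emit a triple.
# The capitalized-word pattern and the fixed verb list are simplifications;
# TextRunner itself is unsupervised and not limited to known relations.
TRIPLE = re.compile(
    r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"    # capitalized noun phrase
    r" (acquired|founded|invented) "            # tiny illustrative relation set
    r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"    # capitalized noun phrase
)

def extract_triples(text):
    """Return (arg1, relation, arg2) tuples found in the text."""
    return [m.groups() for m in TRIPLE.finditer(text)]

corpus = (
    "Microsoft acquired Skype in 2011. "
    "Paul Allen founded Vulcan. "
    "Microsoft acquired GitHub in 2018."
)
print(extract_triples(corpus))
# → [('Microsoft', 'acquired', 'Skype'),
#    ('Paul Allen', 'founded', 'Vulcan'),
#    ('Microsoft', 'acquired', 'GitHub')]
```

Scaled up across hundreds of millions of pages, even a noisy extractor like this accumulates enough redundant triples that frequently repeated facts stand out.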

  • by msbmsb ( 871828 ) on Friday June 12, 2009 @11:12AM (#28308611)
    Semantic processing systems like this (it's not something new) usually can't determine correctness. The truth of a statement is assumed; the best these NLP [wikipedia.org] engines can do at the moment is identify conflicts, perhaps use reputation metrics to assign a veracity rating to a particular statement, or notify the user that sources reach differing conclusions. These systems really are just, as the summary states, "information extraction [wikipedia.org]" systems. Just as a regular search engine returns results from its data set, so do these semantic extraction engines, except that the data is processed in a semantically organized way, so you can query with semantic/natural-language constraints instead of just keywords and boolean operators.

    There are some that incorporate intention or opinion-polarity detection, but even those are not capable of sorting "truth" from "conspiracy".

    Additionally, semantic extraction outputs, like named entities [wikipedia.org] and semantic relations [wikipedia.org], are useful for many other applications.
  • by tepples ( 727027 ) <tepples.gmail@com> on Friday June 12, 2009 @12:16PM (#28309511) Homepage Journal

    Damn my correct spelling of English words!

    Because the World Trade Center was located on American soil, its name is spelled in American dialect.
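The query model msbmsb describes, semantic constraints over extracted facts rather than bare keywords, can be sketched with a simple triple store. The triples below are made up for illustration; they stand in for whatever an extractor like TextRunner would actually emit.

```python
from collections import defaultdict

# Hypothetical extracted triples (subject, relation, object).
triples = [
    ("Microsoft", "acquired", "Skype"),
    ("Microsoft", "acquired", "GitHub"),
    ("Google", "acquired", "YouTube"),
]

# Index by (subject, relation) so a question like
# "Who has Microsoft acquired?" becomes a single lookup.
index = defaultdict(list)
for subj, rel, obj in triples:
    index[(subj, rel)].append(obj)

def query(subject, relation):
    """Return all objects matching the (subject, relation) constraint."""
    return index[(subject, relation)]

print(query("Microsoft", "acquired"))  # → ['Skype', 'GitHub']
```

The point is that once facts are stored as structured relations, a natural-language question reduces to a constrained lookup, which is what distinguishes this from keyword search.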
