Extracting Meaning From Millions of Pages 138

freakshowsam writes "Technology Review has an article on a software engine, developed by researchers at the University of Washington, that pulls together facts by combing through more than 500 million Web pages. TextRunner extracts information from billions of lines of text by analyzing basic relationships between words. 'The significance of TextRunner is that it is scalable because it is unsupervised,' says Peter Norvig, director of research at Google, which donated the database of Web pages that TextRunner analyzes. The prototype still has a fairly simple interface and is not meant for public search so much as to demonstrate the automated extraction of information from 500 million Web pages, says Oren Etzioni, a University of Washington computer scientist leading the project." Try the query "Who has Microsoft acquired?"
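To make the idea concrete, here is a deliberately naive sketch of TextRunner-style "open information extraction": pulling (argument, relation, argument) triples out of raw text without hand-labeled training data. The single regex and tiny relation list below are illustrative assumptions only; the actual system learns an extractor over parsed text rather than matching fixed patterns.

```python
import re

# Toy open-IE sketch: match "NounPhrase verb NounPhrase" and emit a triple.
# The capitalized-word pattern and the fixed verb list are simplifications;
# TextRunner itself is unsupervised and not limited to known relations.
TRIPLE = re.compile(
    r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"    # capitalized noun phrase
    r" (acquired|founded|invented) "            # tiny illustrative relation set
    r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"    # capitalized noun phrase
)

def extract_triples(text):
    """Return (arg1, relation, arg2) tuples found in the text."""
    return [m.groups() for m in TRIPLE.finditer(text)]

corpus = (
    "Microsoft acquired Skype in 2011. "
    "Paul Allen founded Vulcan. "
    "Microsoft acquired GitHub in 2018."
)
print(extract_triples(corpus))
# → [('Microsoft', 'acquired', 'Skype'),
#    ('Paul Allen', 'founded', 'Vulcan'),
#    ('Microsoft', 'acquired', 'GitHub')]
```

Scaled up across hundreds of millions of pages, even a noisy extractor like this accumulates enough redundant triples that frequently repeated facts stand out.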

  • by msbmsb ( 871828 ) on Friday June 12, 2009 @11:12AM (#28308611)
    Semantic processing systems like this (it's not something new) usually can't determine correctness. The truth of a statement is assumed; the best these NLP [wikipedia.org] engines can do at the moment is identify conflicts, perhaps use reputation metrics to assign a veracity rating to a particular statement, or notify the user that sources reach differing conclusions. These systems really are just, as the summary states, "information extraction [wikipedia.org]" systems. Just as a regular search engine returns results from its data set, so do these semantic extraction engines, except that the data is processed in a semantically organized way, so you can query with semantic/natural-language constraints instead of just keywords and boolean operators.

    There are some that incorporate intention or opinion-polarity detection, but even those are not capable of sorting "truth" from "conspiracy".

    Additionally, semantic extraction outputs, like named entities [wikipedia.org] and semantic relations [wikipedia.org], are useful for many other applications.
  • by tepples ( 727027 ) <tepples.gmail@com> on Friday June 12, 2009 @12:16PM (#28309511) Homepage Journal

    Damn my correct spelling of English words!

    Because the World Trade Center was located on American soil, its name is spelled in American dialect.
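The query model msbmsb describes, semantic constraints over extracted facts rather than bare keywords, can be sketched with a simple triple store. The triples below are made up for illustration; they stand in for whatever an extractor like TextRunner would actually emit.

```python
from collections import defaultdict

# Hypothetical extracted triples (subject, relation, object).
triples = [
    ("Microsoft", "acquired", "Skype"),
    ("Microsoft", "acquired", "GitHub"),
    ("Google", "acquired", "YouTube"),
]

# Index by (subject, relation) so a question like
# "Who has Microsoft acquired?" becomes a single lookup.
index = defaultdict(list)
for subj, rel, obj in triples:
    index[(subj, rel)].append(obj)

def query(subject, relation):
    """Return all objects matching the (subject, relation) constraint."""
    return index[(subject, relation)]

print(query("Microsoft", "acquired"))  # → ['Skype', 'GitHub']
```

The point is that once facts are stored as structured relations, a natural-language question reduces to a constrained lookup, which is what distinguishes this from keyword search.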
