Extracting Meaning From Millions of Pages
freakshowsam writes "Technology Review has an article on a software engine, developed by researchers at the University of Washington, that pulls together facts by combing through more than 500 million Web pages. TextRunner extracts information from billions of lines of text by analyzing basic relationships between words. 'The significance of TextRunner is that it is scalable because it is unsupervised,' says Peter Norvig, director of research at Google, which donated the database of Web pages that TextRunner analyzes. The prototype still has a fairly simple interface and is not meant for public search so much as to demonstrate the automated extraction of information from 500 million Web pages, says Oren Etzioni, a University of Washington computer scientist leading the project." Try the query "Who has Microsoft acquired?"
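The summary says the engine analyzes basic relationships between words and pulls the results together across pages. A minimal sketch of that idea follows, under loud assumptions: the regular expression is a naive stand-in for whatever extractor TextRunner really uses, and the corpus and the names `SENTENCES`, `extract`, and `who` are hypothetical. It only illustrates extracting (subject, relation, object) triples and tallying how often each subject is asserted.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the real system reads hundreds of millions of pages.
SENTENCES = [
    "Egyptians built the pyramids.",
    "Aliens built the pyramids.",
    "Egyptians built the pyramids.",
    "Microsoft acquired Hotmail.",
]

# Naive stand-in for a real extractor: "<subject> <verb> <object>."
TRIPLE = re.compile(r"^(?P<arg1>\w+) (?P<rel>\w+) (?P<arg2>[\w ]+)\.$")


def extract(sentences):
    """Return (arg1, relation, arg2) tuples for sentences the pattern matches."""
    triples = []
    for sentence in sentences:
        match = TRIPLE.match(sentence)
        if match:
            triples.append(
                (match["arg1"].lower(), match["rel"].lower(), match["arg2"].lower())
            )
    return triples


def who(relation, arg2, triples):
    """Tally the subjects asserted for a relation/object pair, most frequent first."""
    counts = Counter(a1 for a1, rel, a2 in triples if rel == relation and a2 == arg2)
    return counts.most_common()


if __name__ == "__main__":
    for subject, count in who("built", "the pyramids", extract(SENTENCES)):
        print(f"{subject} ({count})")
```

Note that the tally counts assertions, not facts: a claim repeated on many pages ranks high whether or not it is true.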
Try the query.... (Score:4, Funny)
Re: (Score:1)
Re: (Score:1)
What happens to TextRunner once it is slashdotted?
Re: (Score:1)
Re: (Score:2)
Oh man, it's the new sucks-rules-o-meter for sure. Who hates Vista: 55 results. Who loves Vista: 11 results. Obviously, Vista blows hairy goats. It becomes even more clear when you look at the actual results: somehow ":D Bookmark Islamic Screensaver download-All people (12) love screensaver-Windows Vista Downloads" counts as a hit... Ahh, there: a reload with JS and the spam disappears, leaving 9.
Re: (Score:3, Funny)
I tried to read your comment, but I did not attempt to understand it.
Re: (Score:1)
"Who has not been acquired by Microsoft"
doesn't return Yahoo? ...
Actually, it doesn't return any results.
Re: (Score:2)
Better yet: "Why does Windows suck?"
Retrieved 0 results for Why does Windows suck?.
Being in Washington, MSFT has obviously paid them off to filter out unpleasant results.
Comment removed (Score:5, Interesting)
Re: (Score:1)
Wikipedia tried and failed (Score:1)
That is how Wikipedia was meant to be. A group of statements about subjects, all of which can be referenced to some original source. So that people can look up something quickly and then look at the sources for more definite information....
Seeing how many people cite Wikipedia directly, use it as the main source for their research, and the number of newspapers that have been reported to directly quote inaccurate facts from Wikipedia... I don't think it is working properly. It requires a lot of optimism to believe "People will use that as an initial source and then verify the information."
Re:Wikipedia tried and failed (Score:4, Insightful)
That is how Wikipedia was meant to be. A group of statements about subjects, all of which can be referenced to some original source. So that people can look up something quickly and then look at the sources for more definite information....
Seeing how many people cite Wikipedia directly, use it as the main source for their research, and the number of newspapers that have been reported to directly quote inaccurate facts from Wikipedia... I don't think it is working properly. It requires a lot of optimism to believe "People will use that as an initial source and then verify the information."
That's not Wikipedia's failure. Without it, those same people would just be referencing nothing, or a web site with zero public review and commenting.
Re: (Score:3, Insightful)
The major problem is that it assumes the presence of meaning in Web pages in the first place.
Re: (Score:1)
Re: (Score:2, Interesting)
Actually, just like any other search, it just shows ALL of the likely results and you are still responsible for determining for yourself which of the statements is true. It says "CIA killed JFK" but the first result it returns is "Lee Harvey Oswald killed JFK". It also seems to pare down the results somewhat, because I know I've seen conspiracies also suggesting that the KGB killed JFK, or that the Mafia killed JFK. I'm guessing that more people think the CIA killed JFK than the KGB or the Mafia.
Re:Not entirely helpful (Score:5, Funny)
So much like Wikipedia then?
Re: (Score:2)
Re: (Score:2)
[/tinfoilhat]
Re: (Score:2)
That would be because "centre" is spelled center. The correct spelling yields plenty of results.
Re: (Score:2)
Re: (Score:2)
Damn those search engines that presume the exact spelling of proper nouns!
Why WTC name is spelled in American (Score:3, Informative)
Damn my correct spelling of English words!
Because the World Trade Center was located on American soil, its name is spelled in American dialect.
Re: (Score:2)
That's why Moscow is actually called Moskva in English-speaking countries, right?
I was talking about names within a language, not names across languages. For instance, in YTMND-land, it's spelled Moskau [ytmnd.com] in German.
Re: (Score:2)
I'll bite. What's the correct English pronunciation for "chauvinism"?
Re: (Score:2)
Not entirely helpful predicting the future? (Score:1)
"Who killed obama?"
Re: (Score:3, Insightful)
it just repeats what other people have said
I don't see anything new here, most people have done this since the beginning of time.
Re: (Score:3, Funny)
it just repeats what other people have said
I don't see anything new here, most people have done this since the beginning of time.
Yeah, Textrunner just repeats what other people have said, like most people since the beginning of time.
Re: (Score:3, Interesting)
I suppose the major problem with this is that it cannot tell the difference between truth and lies or urban legends
Most humans can't either; how do you expect a search engine to?
There will be a lot of false positives and negatives that will be hard to identify as such unless it directly works with something like snopes.com, which kind of defeats the purpose because it means someone has had to research every question anyway.
If a project like this simply scoured the whole 'net, you wouldn't really be able to verify anything beyond people's opinions or beliefs, which may or may not be 'true'.
I think something like t
Re:Exactly (Score:2, Funny)
"The query "Who killed JFK?" suggests the CIA did it"
Hmmm... And now it's not responding because it's "slashdotted"
Re: (Score:2)
And you suggest they didn't?
Re: (Score:2, Informative)
Re: (Score:2)
Nascent AI? (Score:5, Funny)
I'll start stockpiling food and armor piercing rounds for the moment Skynet goes live.
Re: (Score:2)
I've always viewed intelligence as the ability to take unrelated facts and create new and original ideas from their synthesis.
Intelligence, like insanity, is finding links between seemingly unrelated facts. It can also be keen observation and recognition of interactions between things where others see chaos. Either way, truly unrelated things are just that: unrelated.
Re: (Score:2)
Re: (Score:2)
I generally view intelligence as the ability to detect and recognize patterns. If you are good at exact patterns, that is math/logic/science. If you are good at general patterns, then we are talking art/creativity/language.
Computers have ALWAYS been good at recognizing exact patterns. But they generally need a human to first detect the pattern. They have never been good at re
500 million web pages can't be wrong (Score:5, Funny)
Yet strangely, I get a result of:
TextRunner took 9 seconds.
Retrieved 0 results for what is the airspeed velocity of an unladen swallow?.
Meh, call me when this stuff can answer the really USEFUL questions in life.
Re: (Score:2)
Simply because grepping 500 million pages is slow.
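Whether TextRunner actually scans its pages at query time is not stated anywhere in the thread; that is the commenter's guess. For contrast, here is a minimal sketch of the usual alternative, an inverted index built once so that a query touches only the pages containing its words. The names `build_index` and `lookup` and the sample pages are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in enumerate(pages):
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def lookup(index, words):
    """Return ids of pages containing every query word, without rescanning pages."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    # Hypothetical sample pages, for illustration only.
    pages = [
        "unladen swallow airspeed",
        "african swallow",
        "airspeed velocity",
    ]
    index = build_index(pages)
    print(lookup(index, ["swallow", "airspeed"]))
```

Building the index costs one pass over every page, but each later query is a handful of set lookups instead of a full scan.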
Re: (Score:3, Funny)
Retrieved 0 results for what is the answer to life, the universe and everything?.
Re: (Score:2)
Re:500 million web pages can't be wrong (Score:5, Funny)
Obviously it's not indexing http://www.style.org/unladenswallow/ [style.org]
meters per second or miles per hour? what? (Score:1, Interesting)
I would go with...
But meters per second and miles per hour? WHY?!
Re: (Score:2)
Re: (Score:2)
It's in reference to Monty Python... you're lucky it was comprehensible at all.
and now for something completely different...
Re: (Score:1)
Well, at least you got 0 results. With "where is New York" I got -1 result!
Re: (Score:1)
Re: (Score:1)
Just found out: If you just type "airspeed velocity", you'll get as first two results:
It seems to have trouble understanding units, but otherwise the information is found.
Re: (Score:2)
Try: "where is the colleseum"
Zero results (Score:3, Interesting)
I tried half a dozen queries of the sort I often use Google for (example: "What is the velocity of sound in hydraulic fluid?"). No answers.
Re: (Score:1)
Re: (Score:2)
I didn't find Wolfram Alpha much help with such queries either. Besides, I just followed their advice on how to use it.
Concise (Score:2, Interesting)
Towards a web with only one page: Google (Score:1, Insightful)
Are we moving towards a web in which Google centralises everything on its own pages? These new engines present content without the need to visit the pages it originates from. Is Google basically mooching off other people's websites with hardly anything - if anything at all - in return?
It could be dangerous if the only visitor a web site can expect is the Google bot.
A million monkeys on keyboards... (Score:2)
But AOL is nothing like Shakespeare.
Re: (Score:1)
what causes cancer? (Score:5, Funny)
I learned that
> smoking (387) causes cancer.
I was also surprised to learn that
> girls and women (11) cause most cases of cervical cancer
This is a great resource if you need to cite a reference for a Wikipedia article.
TextRunner confirms it: (Score:5, Funny)
Who is at Area 51
aliens (3), Carter (2), Colonel Sanders (2), Hi Group (2) is at Area 51
Who bombed WTC
Al Qaeda (5), Bush (5), Clinton (2), 4 more... bombed the WTC
Who built the pyramids (example on site):
Egyptians (298), aliens (73), Pharaohs (40), 77 more... built the pyramids
What contains antioxidants (example on site):
Coffee (17), Recent scientific research (15), food (6), 5 more... contain significant amounts of antioxidants
-- man, I gotta get me some more recent scientific research.
Re: (Score:2)
Re: (Score:2)
Who bombed WTC
Al Qaeda (5), Bush (5), Clinton (2), 4 more... bombed the WTC
So, what I really want to know is who those "4 more" were...
Slashdot is not ... (Score:2)
Slashdot isn't
a professional news site
a normal news site
a social news site
a News Site
a valid source
a reputable source
the right source
a healthy online community
a goddamn online community
a Terrorist Organization
Re: (Score:2)
Slashdot is the single most important english site (8), another extremely sophisticated example (4), another online community (3), 15 more...
Bah useless (Score:1)
User invalid, deleting user (Score:1)
Re: (Score:2)
Re: (Score:2)
online as in real life (Score:1)
"Who is your daddy?" got 0 results.
'How do I shot web?' (Score:1)
Oops (Score:2)
That's what they said about SkyNet.
Re: (Score:1)
Actually Skynet is already being built. [army-technology.com]
Human cities torture? (Score:2)
Apparently Mount Marcy, Mount Elbrus, Mount Kilimanjaro and Mount Etna are all the highest mountain. Then again, I was also informed that "high mountains are the hum of human cities torture", so I think I'll just steer clear of mountains altogether.
"What is Slasdot?" (Score:2)
Try
"What is Slasdot?"
Answer
Digg is Slashdot
I asked the obvious.... (Score:1)
Retrieved 0 results for what is the answer to life, the universe and everything.
FAIL!
Re: (Score:2)
Something I've never heard of (Score:1)
Correction.... (Score:5, Insightful)
"...that pulls together facts by combing through more than 500 million Web pages."
Correction:
"...that pulls together assertions by combing through more than 500 million Web pages."
Whether those assertions are correct or even reasonable is a completely different issue.
It might be interesting to then take those assertions and have some means to validate or invalidate them, but currently that's going to require meat, not metal.
Now, if you could come up with some form of AI^Walgorithm to do that automatically, then you would have something.
Re: (Score:2)
Correction:
"...that pulls together assertions by combing through more than 500 million Web pages."
I suspect it just pulls together *sentences*.
"Who has Microsoft acquired?" (Score:2)
What is the meaning of life (Score:1)
love (53), song (19), Life (16), 81 more... is the meaning of life
1) of the 81 more, 42 doesn't show up anywhere
2) the stupid javascript hiding makes copy and paste a pain
Who Framed Roger Rabbit? (Score:2)
same fuckers(2) that Framed Roger Rabbit
Who killed Kennedy? (Score:1)
I knew it...
Wow, impressive, but prior art... (Score:2)
Wow, incredible. Because doing a search of "kills bacteria" with the quotes on Google won't get you those kind of results. Oh wait, yeah it will.
Re: (Score:2)
Query: What kills Microsoft:
First in the list: Linux, Sony, Apple
Other notable: Steve Jobs
Re: (Score:2)
Query: What kills Linux
On the list, Microsoft, Dell, Apple
And... Steve Jobs.
Re: (Score:2)
This Steve Jobs sounded like a pretty good killer, so I did a query:
What kills Steve Jobs, the result was:
Retrieved 0 results for what kills Steve Jobs.
Re: (Score:3, Interesting)
I think you're missing the point. This is an AI project - it's research. Presumably, the questions you are typing in haven't been processed by a complicated nest of if-thens written by someone who knows English; instead, statistical models of language and meaning were extracted from the internet. Some people claim this is the equivalent of "teaching" a computer.
The first example, which is what most search engines do, leads to impressive search results but is limited by the logic people can code up. This AI,
I'm impressed (Score:2)
This has to be played with to be appreciated. On request, it delivered a set of interesting papers about US-EPA misrepresentation of science. And it returned a null result for "Has any climate model been validated?"
This is going to be fun
Carmen San Diego? (Score:1)
I asked "Where in the world is Carmen San Diego?". The page threw up a Java error.
I guess nobody really knows.
In Soviet Russia... (Score:2)
... you extract millions from the meaning of pages! ;)
Sorry, couldn't resist.
Re: (Score:2)
No, the correct Russian reversal would be,
"In Soviet Russia, millions of pages extract meaning from YOU!"
Not too Smart: "What is TextRunner" (Score:2)
produces 0 results :P
Retrieved 1 result for does god exist (Score:2, Funny)
Well, that answers that question.
Slashdotted (Score:2)
'The significance of TextRunner is that it is scalable because it is unsupervised,' says Peter Norvig, director of research at Google,
I really wondered what he was getting at with this. It seems almost nonsensical, like something someone in marketing would come up with.
Now that the site is slashdotted I know that he means if only a few people use it, it's very scalable, but if a bunch of people are directed to use it (say, through Slashdot) then it doesn't scale very well.
Re: (Score:2)
There's nothing nonsensical about it. Just because you don't know what an unsupervised learning algorithm is doesn't mean it's just a random string of words he threw together to sound fancy.
I must admit it's rather entertaining... (Score:2)
General Draper was George Bush's guru
Hurricane Katrina is George Bush's Monica Lewinsky
Tony Blair is George Bush's poodle
democratic Iraq is George Bush's formidable legacy
Iraq is George Bush's waterloo
Hillary is the democratic version of good old George W. Bush
blue socks are Critics of George W. Bush
Bruce Bartlett is George W. Bush Bankrupted America
biggest terrorist is George W. Bush
Re: (Score:1)
The same copyrighted pages that you allowed Google to crawl, since you obviously didn't protect them with a robots.txt?
Re: (Score:2, Interesting)
Allowing a search engine to visit a site and allowing somebody to pass your web page content around are two completely different things.
Re: (Score:2)
You did.
Re: (Score:1)
Even worse: I asked "What is Slashdot?" and the first result was "Digg is Slashdot" ...
Re: (Score:1)
#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
# SIGSEGV (0xb) at pc=0xb77acafa, pid=21855, tid=1833073568
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_14-b03 mixed mode)
# Problematic frame:
# V [libjvm.so+0x23dafa]
#
# An error report file with more information is saved as hs_err_pid21855.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp [sun.com]
#
Abort