Deriving Semantic Meaning From Google Results
Posted by
michael
on Sat Jan 29, 2005 04:35 PM
from the can-also-use-tea-leaves-if-google-not-available dept.
prostoalex writes "New Scientist talks about Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam and their work to extract the meaning of words from Google's index. The pair demonstrates an unsupervised clustering algorithm, which can 'distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return', according to New Scientist."
The elephant in the living room. (Score:4, Insightful)
The article mentions English-Spanish translation. When one language is ambiguous (from a bit of Spanish I had in HS I'm guessing English is far more ambiguous), there is no hope of easy translation. And it's worse because the bigger application may be translating the many English pages (ambiguous) to Spanish.
Re:The elephant in the living room. (Score:4, Insightful)
Language is more than words (Score:2, Insightful)
"Right" isn't really a good example of a word that might "trip even humans." A human (translator) will parse not just by word but will attempt to extract a word's meaning from the surrounding phrases, sentences or even paragraphs. The syntax of the language may also come into play. In spoken language, additional "clues" can be derived from the situation in which the word is spoken, and often
Re:Language is more than words (Score:2)
Re:Language is more than words (Score:2, Funny)
Re:The elephant in the living room. (Score:5, Interesting)
Every language has "ambiguity", but ambiguity can come in different flavors (phonological, morphological, syntactic, semantic, pragmatic). Some of the chief instigators of language change can be thought of as ambiguity on these levels. So firstly, it's hard to imagine the existence of a function mapping languages to "ambiguity levels".
The motivation for your comment about English versus Spanish probably comes from the fact that you know of more English homophones than Spanish ones. Indeed, most literate people think of their language in terms of written words, so your take on the matter is common.
(As a slight digression, your example of right the direction versus right as in 'correct, just' is pretty interesting. We can understand the semantic similarity between the two when we notice that most humans are right-handed. Thus it is extraordinarily common, cross-linguistically and cross-culturally, for the word meaning the direction 'right' to have similar meanings as dextrous, just, well-guided and so on, whereas the word meaning the direction 'left' also has meanings such as worthless, stupid. (In fact, the word dextrous was borrowed through French from the Latin word dexter meaning 'right, dexterous' or dextra meaning 'right hand'.) So the given example is one where, historically, a word had no ambiguity, but gained ambiguity because speakers started using it differently.)
Getting back to the main topic, more problematic about Section 7 of TFA is the implicit assertion that, at some point in the future, their techniques can be applied to create a function mapping words in a particular language to words in another language. Anybody who has studied more than one language has seen cases where this is difficult to do on the word-level. For instance, the French equivalent of English river is often given as riviere or fleuve. But riviere is only used by French speakers to mean 'river or stream that runs into another river or stream' whereas fleuve means 'river or stream that runs into the sea'. English breaks up river-like things by size: rivers are bigger than streams. So, in the strictest sense, there is no English word for fleuve, just as there's no French word for stream (unless there has been a recent borrowing I don't know about). This certainly does not imply that French people can't tell the difference between big rivers and small rivers; their lexicon just breaks things up differently.
These little problems can be remedied lexically, as I've just done. So fleuve is denotationally equivalent to river or stream that runs into the sea, although the latter is obviously much bulkier than its French equivalent. The real problem is that there are words in some languages whose meanings are not encoded at all in other languages. English, for example, has a lexical past-progressive tense marker, was, used in the first person singular (e.g. I was running to the store). Some languages have no notion of tense. What, then, does was mean in the context of such a language?
It's pretty well-known that Slashdotters' general policy is to tear apart every article we read, and half of those we don't. This is certainly not my intent here. Languages are complicated beasties, and everyone seems to understand that, including the writers of the article. So, we should interpret their result in Section 7 as them saying, "Well, maybe this has gotten us a baby-step closer to creating the hypothetical Perfect Natural Language Translator, but someone's gonna have to do a lot more work to see where this thing goes".
Re:The elephant in the living room. (Score:2)
Plainly it means nothing significant. In most cases the distinction it makes in English contains no useful information. We think such words are useful, because a sentence sounds "wrong" without them, or because we know that a certain type of information is being lost wit
Re:The elephant in the living room. (Score:3, Interesting)
But we must consider the fact that most humans can't produce a decent translation either, even if they think they understand both languages. I've been professionally translating movies (EN->RU) and I know to what extent the scripts are riddled with linguistic traps. An average professional human translator would
Re:wARTIME? (Score:2)
"Right."
It turns out that the right flank was the dangerous one, and attacking on the left would've guaranteed a victory for the army that unfortunately spoke English.
Re:wARTIME? (Score:4, Informative)
Re:wARTIME? (Score:2)
Don't know about the US but over here in Australia they dropped taxi dispatching by voice about 15yrs ago. Now instead of arguing with the dispatcher they listen to music....
Re:wARTIME? (Score:2, Interesting)
The lore also contains an interesting anecdote about the '92 riots in LA. Apparently a group of Marines were dispatched to assist the police. Two officers were approaching a house when someone opened up with a shotgun at them. One officer shouted "cover me" -- so the Marines proceeded to lay down
Re:wARTIME? (Score:2)
The NAVY would turn out the lights and lock the doors.
The ARMY would surround the building with defensive fortifications, tanks and concertina wire.
The MARINE CORPS would assault the building, using overlapping fields of fire from all appropriate points on the perimeter.
The AIR FORCE would take out a three-year lease with an option to buy the building.
Re:wARTIME? (Score:2)
Once past that, the natural language problem, or elephant you speak of, is really just a machine that defines words in context.
Google is actually a pretty powerful means for deriving context by some means of comparing word frequencies around various results.
Semantic meaning? (Score:2, Interesting)
Re:Semantic meaning? (Score:4, Funny)
Re:Semantic meaning? (Score:2)
I don't understand your meaning, and I'm really happy about it!
"I don't want to get into semantics" (Score:2)
"I don't want to get into semantics".
I always want to yell - "why worry about the meaning of things - it'll just cloud things".
Re:Semantic meaning? (Score:2)
In language theory/compilers, semantic meaning is information that cannot be obtained by a lexer (a.k.a. information that cannot be gained through regular expressions, a.k.a. non-regular language components).
The part that can be recognized by a lexer is still part of the meaning, which is the reason for the name.
Scientology (Score:3, Insightful)
Google is already starting to show signs of intelligence higher than some people.
Re:Scientology (Score:2)
Extend this to the library of congress... (Score:4, Interesting)
This obviously has its failings, but theoretically, you could use a sufficiently large database of common human language coupled with simple algorithms to perform operations like grammar checking.
An internet search would not be quite so useful for that, but I would really be interested in what would be possible with full digital access to the library of congress. I would imagine you could do things like automatically generate books based on existing material.
Would that be 'semantic meaning'... (Score:3, Insightful)
Compression is a stricter test for AI than Turing (Score:4, Informative)
This is basically what I was referring to in my response [slashdot.org] to "Using The Web For Linguistic Research" [slashdot.org] when I said:
followed by the explanation: and
Prize classes (Score:2)
Good point. However it is difficult to value time in a single competitive metric whereas compression ratio (where the initial and compressed sizes include the size of the algorithm/knowledge of the AI) is a single number.
Perhaps the way around this is to have different prizes for different time classes, varying by an exponential. You'd have, say, 3 competitions with tim
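To make the "single number" concrete, here is a minimal sketch of a score that charges the compressor for its own size. It assumes zlib as a stand-in compressor and a hypothetical `program_size` parameter for the size of the algorithm/knowledge; neither comes from the proposal discussed above.

```python
import zlib

def score(data: bytes, program_size: int, level: int = 9) -> float:
    """(compressed size + program size) / original size; lower is better.

    program_size is a hypothetical figure for the size of the
    decompressor and any knowledge it carries, in bytes.
    """
    compressed = zlib.compress(data, level)
    return (len(compressed) + program_size) / len(data)

# Made-up sample text, for illustration only.
text = b"the cat sat on the mat. " * 200
print(score(text, program_size=0))
print(score(text, program_size=1000))
```

Running time does not appear anywhere in this number, which is why separate time classes would each need their own leaderboard.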
Good for scholars, bad for geeks (Score:2, Interesting)
Yea (Score:2)
WoW! = I can't believe I paid $50 for this crap and can't even logon!
not many will get this (Score:3, Insightful)
Since it seems that so few really understand the term "intelligence", it is really not surprising that even fewer grasp the meaning of the term "artificial intelligence", is it?
One: intelligence is not awareness.
Although we cannot prove the existence of or even seem to really define self-awareness, it seems self-evident, at least to me, that intelligence is clearly defined and can be measured.
Therefore, I believe that we will have "artificial intelligence" soon; in fact, I'd bet Google may well be the first AI or "self intelligent" engine.
However, I suspect it will be quite a while before we are mature enough to build a self-aware engine.
Lastly, in regards to some of the other comments, it seems to me that this paper is about using the "intelligence" included in the language we use, that Google crawls. This repository is the single largest collection of semantic weighting, therefore, algorithms could be developed that reflect this "intelligence", therefore appear themselves intelligent, even though they themselves are simply deterministic.
Whew
Intelligence vs. awareness (Score:2)
Intelligence does, however, imply the ability to perform self-directed learning. Without that, all you have is preprogrammed behavior, which is not intelligence. Given the ability to learn, an intelligent entity is likely to draw conclusions about its own existence ("I think therefore I am"), and will thus essentially be self-aware.
Of course, the builders of an artificially intelligent machine might restrict its ability to gather facts about itself - it wouldn't
Ant's vs. Nest (Score:2)
Re:not many will get this (Score:2)
What you build is a substrate. (Score:2, Interesting)
You're quite correct that cowboy-loose definitions of terms make this a very difficult discussion to have. For example, when you say "self awareness," it's unlikely that you actually mean "self awareness" in the literal sense; after all, if a computer is capable of detecting when its processor is overheating (and perhaps turning on a fan in response), it is basically "self aware," though we wouldn't confuse that with intelligence.
Rather, I think by "self awareness" here you mean, possessing narrativity; that i
Unsupervised but Reflective of Human Preferences (Score:4, Interesting)
I will give you an example. If you search news (i.e., either Google News [google.com] or Yahoo! News [yahoo.com]) for stories about the recent federal action (by Washington) involving Chinese companies and Iranian weapons improved by Chinese technology, you will discover that one of the popular news articles about this topic comes from the "New York Times". Several other newspapers redistributed the Times article, written by David Sanger (spelling?).
I read that article, but I also read articles from less popular Web news sites, e.g. the "Taipei Times". The "Taipei Times" article does mention that a Taiwanese company was also implicated in the sale of weapons technology to Taiwan. Yet the "New York Times" article made no mention of this fact.
Is the "Taipei Times" telling the truth? It claims that Ecoma Enterprise Company, a Taiwanese company, was one of the culprits.
At this point, I fired up both Yahoo! Search and Google. Only on Google was I successful in locating the ORIGINAL source of the information about American penalties against the 7 Chinese companies and the 1 Taiwanese company [gpo.gov]. The information is on page 133 of the "Federal Register" (volume 70, number 1). So, I discovered that the "Taipei Times" was telling the truth.
Guess how long I took on Google to find this information? 5 minutes. I kid you not. Even though I hate Google's employment practices, I am quite impressed with their technology.
Using Yahoo! Search, I was not able to locate the desired information.
Apparently, Google has an algorithm that, although it is unsupervised (i.e. without the kind of human interaction that corrupts Yahoo! Search), it captures the notion of what the typical person wants to find. The Google algorithm, dare I say "it", is on the verge of acquiring human sentience. THAT is, indeed, impressive.
Pray to Buddha that the middle name of the CEO is not "666" or Beelzebub. Just kidding.
Re:Unsupervised but Reflective of Human Preference (Score:2)
Limitations of NGD (Normalized Google Distance) (Score:5, Insightful)
More interesting are analyses on n-Tuples (co-occurrences and orderings of n words at a time). Anyone who does ER (Entity-Relationship) diagrams for relational databases will appreciate that many relationships involve multiple entities that are decomposable into pairwise relationships.
Another limit is that Google is atrocious on its estimates of the number of hits. The actual number of hits is only a fraction (about 60%?) of the estimate, in my experience. This suggests that Google has a pairwise estimator built in that may be only partially empirical. If Google simply reports an estimated number of hits based on products of probabilities, then there is no information about the pair in the NGD. Obviously, these scientists have gotten useful results, but NGD may not be as good an estimate of the co-occurrence of the words as the scientists assume.
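The product-of-probabilities worry can be checked with a few lines. This is a sketch, assuming the NGD formula as published by Cilibrasi and Vitanyi, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))); the index size `N` and all hit counts below are made up for illustration, and `independent_pair_hits` is a hypothetical model of an estimator that never looks at real co-occurrence.

```python
import math

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance from hit counts.

    fx, fy: hits for each term alone; fxy: hits for both together;
    n: assumed total number of indexed pages.
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def independent_pair_hits(fx: float, fy: float, n: float) -> float:
    """Pair count a pure product-of-probabilities estimator would report."""
    return fx * fy / n

N = 8e9  # hypothetical index size
# Made-up single-term hit counts, for illustration only.
for fx, fy in [(1e6, 2e6), (5e4, 3e7), (9e5, 9e5)]:
    print(fx, fy, ngd(fx, fy, independent_pair_hits(fx, fy, N), N))
```

Substituting f(x, y) = f(x) f(y) / N into the formula makes the numerator equal the denominator, so every pair comes out at the same distance regardless of the terms. Any real signal in NGD has to come from the part of the pair count that is actually measured.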
Re:Limitations of NGD (Normalized Google Distance) (Score:2, Interesting)
Reconstructing semantic space (Score:2)
I've often wondered if one can use simple pair-wise distance estimates to reconstruct a polytope or distorted simplex for the set of items within a multidimensional space. In theory, an N-object system, with non-zero pairwise distances, requires (N-1) dimensions. But in practice, many real systems don't fill the space -- being
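One way to probe how many dimensions a set of pairwise distances really needs is the classical multidimensional scaling trick: double-center the squared distances and look at the rank of the resulting Gram matrix. The sketch below assumes the distances are exactly Euclidean, which NGD values need not be, and uses made-up point sets; the helper names are illustrative.

```python
import math

def distances(points):
    """Pairwise Euclidean distance matrix for a list of coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]

def gram_from_distances(d):
    """Double-center squared distances: B = -1/2 * J * D^2 * J."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]  # equals column means, since sq is symmetric
    total = sum(row) / n
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
            for i in range(n)]

def rank(m, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [r[:] for r in m]
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        p = max(range(r, rows), key=lambda i: abs(m[i][c]))
        if abs(m[p][c]) < tol:
            continue  # no usable pivot in this column
        m[r], m[p] = m[p], m[r]
        for i in range(r + 1, rows):
            f = m[i][c] / m[r][c]
            for j in range(c, cols):
                m[i][j] -= f * m[r][j]
        r += 1
    return r

line = distances([(0,), (1,), (3,), (6,)])            # four points on a line
square = distances([(0, 0), (1, 0), (1, 1), (0, 1)])  # four points in a plane
simplex = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
for name, d in [("line", line), ("square", square), ("simplex", simplex)]:
    print(name, rank(gram_from_distances(d)))
```

Only the equidistant case uses the full N-1 dimensions; the other two sets fit in fewer. With noisy, non-Euclidean distances the rank would not drop cleanly, and one would look at the size of the eigenvalues instead.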
Kabbalah (Score:2)
Been working on similar (Score:3, Interesting)
Re:Been working on similar (Score:2)
I accept your apology for relating relevant information about the subject matter of the article.
For future reference, to avoid this, it helps not to read the article. If you must read it, you can always pick out a short phrase and take it out of context. If you are absolutely at a loss on how to comment on a story without presenting useful/interesting information, generally you can get away with "FRIsT POST!!!" or one of the popular Slashdot memes.
Don't worry,
Re:Been working on similar (Score:2)
Limits to semantic derivations from Google (Score:5, Interesting)
For example, searching on Google for "tom cruise" brings up pages upon pages of links, but -- from a cursory glance at the results -- it is impossible to learn anything about Tom Cruise unless one visits those results.
Our software visits each of those results (for example, the first 100) and looks for the most significant keywords and phrases used over all the data. As you might expect, these typically end up being the names of people (e.g. Nicole Kidman, Penelope Cruz) or movies (e.g. Top Gun, Color of Money) that are associated with Tom Cruise. As far as our software goes, this is ample for doing keyphrase analysis.
But the problem with deriving any additional meaning from the Internet web space is this: the biases that exist due to the very reasons for mentioning Tom Cruise (namely those things he is famous for) simply outweigh -- by a wide margin -- any other quite relevant interesting data about Tom Cruise. So, in fact, the web, in general, is an awful corpus of valid semantic data.
If you want a rough model of popular ideas then perhaps Google and the web en masse are useful (they are for our software). But if you want any real meaning at all you come to the same conclusion that has given rise to sites like Wiki: the web, to be blunt, has a whole lot of shit in it. Coming up with a perfect (and rational) filter is quite a task.
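The keyphrase step described above can be sketched roughly as follows. This is not the poster's software; it assumes the result pages have already been fetched as plain text, treats runs of capitalized words as candidate phrases, and ranks them by how many pages mention them. The sample pages are invented.

```python
import re
from collections import Counter

# Runs of two or more capitalized words, e.g. "Nicole Kidman" or "Top Gun".
PHRASE = re.compile(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b")

def top_phrases(pages, query, k=5):
    """Rank capitalized phrases by the number of pages that mention them."""
    counts = Counter()
    for text in pages:
        found = set(PHRASE.findall(text))  # count each phrase once per page
        found.discard(query)               # the query itself is not informative
        counts.update(found)
    return counts.most_common(k)

# Invented page texts, for illustration only.
pages = [
    "Tom Cruise starred in Top Gun with Val Kilmer.",
    "Nicole Kidman and Tom Cruise married in 1990.",
    "Top Gun made Tom Cruise a star.",
]
print(top_phrases(pages, "Tom Cruise", k=3))
```

The crude pattern misses titles with lowercase words such as "Color of Money". It also shows the bias complained about above: whatever the pages repeat most wins, whether or not it is the most meaningful fact.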
On the bright side... (Score:2, Informative)
Pretentiously titled (Score:2, Insightful)
I've perused the abstract and skimmed the body of the paper. They're fine. But the title is misleading: Automatic Meaning Discovery Using Google.
Their software has discovered meaning no more than paper has when the lexicographer is done writing her dictionary. Meaning is not the grouping of symbols.
For systems that step towards encoding meaning as human brains do, consider the Neural Theory of Language [berkeley.edu].
understanding relationships is intelligence (Score:2, Insightful)
A baby hears the word mom spoken by his mom. Gradually, the baby knows there is a relationship between that sound and a smiling face.
The child, growing up, starts to see relationships. Intense pain, which is rare, when correlated with a hot stove, has strong meaning in his mind.
Everything is learned initially through correlations. The advantage of human bei
Re:understanding relationships is intelligence (Score:2)
Google isn't very good anymore, is it?
TWW
I agree. (Score:2)
It's being able to understand the statement that cow is to grass in a similar way that baleen whales are to krill.
And then now knowing something about krill from that even if you didn't know what krill was at all.
It's not just knowing the "absolute value" of the meaning - or even that these two objects are linked o
Re:Huh? (Score:2)
In English, what they are trying to do is define a word by context. For example, enter a single word in Google and you will get all the various contexts in which it is used.
Then, by using some algorithm (serious academic handwaving here), you place the meaning of the word via context as determined by Google. Thus you effectively create the potential for a program that could distinguish between there and their, and do it across languages. It could also translate sayings like, the spir
Re:Huh? (Score:2)
Re:Huh? (Score:2, Funny)
I think you might've meant:
http://www.google.com/search?hl=en&q=what+is+the+answer+to+life%2C+the+universe%2C+and+everything%3F [google.com]