Using The Web For Linguistic Research


Posted by timothy on Sunday January 23, 2005 @04:50AM from the that's-rediculous dept.

prostoalex writes "The Economist says linguists are gradually adopting the World Wide Web as a useful corpus for linguistic research. Google is used, among other resources, to research how the written language evolves and how some non-standard examples of usage become more or less acceptable (The Economist quotes the phrase 'He far from succeeded,' where 'far from' is used as an adverb). LanguageLog is a resource linked in the article, where linguists discuss current peculiarities of the English language."


I've used the web for corpus linguistics research (Score:2, Informative)

by Anonymous Coward writes: on Sunday January 23, 2005 @05:23AM (#11446744)

I've used the web for corpus linguistics research. My last big project was to look at a lot of web pages with Mexican and Chilean slang Spanish, and see if there was a difference in vocabulary usage. There was a significant difference; I could, 70% of the time, tell if a given passage was Chilean or Mexican Spanish.

I could have gotten a higher accuracy rate, but this was just a simple undergraduate project.

Non-official English (Score:2, Informative)

by Anonymous Coward writes: on Sunday January 23, 2005 @05:36AM (#11446776)

Unlike French and Italian, English has no official institution that defines 'correct' usage. Essentially, the English-speaking world just 'makes it up' as it goes. Thus when I see the adverb 'really' butchered into 'real', I must try not to get annoyed, e.g. "It's real hard to use your mother tongue." vs. "It's really hard to use your mother tongue." Please help me here: is the misuse/non-use of 'really' something that's taught in school?

Popular usage != wanted usage (Score:3, Informative)

by KiloByte ( 825081 ) writes: on Sunday January 23, 2005 @06:33AM (#11446892)

Yes, we can record the errors made by the uneducated public (and even those done by, uhm, me). The question is: should we do that or not?

I was pretty taken aback when a council of linguists in Poland suddenly declared some widely-chastised and not even very popular errors to be valid usage. I was brought up among people who not only put a lot of stress on the language you use, but also cruelly point out every incorrect word or phrase -- and this made me quite intolerant of bad speech.

Being but a dirty foreigner, I know that my English can sound bad in the ears of native English speakers -- that's why I sometimes ask people to correct me if they spot errors.

In other words: some people find careless speech repulsive. Thus, we should do whatever we can to promote correct usage as opposed to legalising incorrect uses.

Reminds me of "Meme Tree"... (Score:4, Informative)

by Slur ( 61510 ) writes: on Sunday January 23, 2005 @06:49AM (#11446935) Homepage Journal

...which was this little program I wrote around the nascence of the internet. It took any sentence as input and kept a record of which words preceded and which words followed each unique word. The idea was to build up a simple map of which words could precede or follow others, completely without context. From this you could follow paths that made sentences, paths that looped forever, paths that made no sense, and some interesting paths that made unintended sense.
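A minimal Python sketch of the kind of map described above, not the original program. The function names `build_word_map` and `walk`, the whitespace tokenization, and the rule of always taking the alphabetically first successor are all assumptions made for illustration.

```python
from collections import defaultdict


def build_word_map(sentences):
    """Record, for each unique word, the words seen directly before and after it."""
    before = defaultdict(set)  # word -> words that preceded it
    after = defaultdict(set)   # word -> words that followed it
    for sentence in sentences:
        # Assumption: lowercase and split on whitespace; no punctuation handling.
        words = sentence.lower().split()
        for left, right in zip(words, words[1:]):
            after[left].add(right)
            before[right].add(left)
    return before, after


def walk(after, start, max_steps=10):
    """Follow one path through the map, stopping at a dead end or after max_steps.

    Assumption: the alphabetically first successor is always chosen, so the
    walk is repeatable. A looping path is cut off by max_steps.
    """
    path = [start]
    current = start
    for _ in range(max_steps):
        choices = sorted(after.get(current, ()))
        if not choices:
            break
        current = choices[0]
        path.append(current)
    return path


before, after = build_word_map(["the cat sat", "the dog sat down"])
path = walk(after, "the")
```

Here `path` joins words from both input sentences ("the cat sat down"), a sentence that was never entered, which is the "unintended sense" effect in miniature. Picking a random successor instead of the first one would give more varied paths.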

Why a tree? Language and genealogy seem to have a common thread. Meaning is like genetics. Language is expressive. Information is a kind of tree whose branches grow as reality elaborates and past events accumulate. New terms need to be invented for the dynamics we perceive in reality, just as new names are given to individuals as they emerge into the world. Patterns, continuity, periodicity. Such things lie at the heart of material existence and provide the hooks for consciousness itself. Information theory is the next great frontier, along with particle physics. Already they have converged and diverged and converged again. And playing with artificial trees turns out to be a lot of fun.

As for the "Meme Tree" program ... The next iteration built up a more discrete map by scoring the proximity of unique words within sentences and their inclusion in sentences together. Again, the idea was to build a simple statistical map free of any context, simply to get a sense of pure lexical association.
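One way such scoring could look in Python, as a sketch under stated assumptions: the name `score_associations`, the use of 1/distance as the proximity weight, and the whitespace tokenization are illustrative choices, not details of the original program.

```python
from collections import Counter
from itertools import combinations


def score_associations(sentences):
    """Score word pairs by shared sentences and by closeness within a sentence."""
    together = Counter()   # pair -> number of sentences containing both words
    proximity = Counter()  # pair -> sum of 1/distance over all position pairs
    for sentence in sentences:
        words = sentence.lower().split()
        # Inclusion: count each unordered pair once per sentence.
        for pair in combinations(sorted(set(words)), 2):
            together[pair] += 1
        # Proximity (assumed weighting): adjacent words add 1, words two apart add 1/2, etc.
        for i, j in combinations(range(len(words)), 2):
            if words[i] != words[j]:
                pair = tuple(sorted((words[i], words[j])))
                proximity[pair] += 1.0 / (j - i)
    return together, proximity


together, proximity = score_associations(["the cat sat", "the cat ran"])
```

Pairs are stored as alphabetically sorted tuples, so `("cat", "the")` and `("the", "cat")` are the same key. With this input, "cat" and "the" score highest on both measures because they appear side by side in both sentences.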

The theory is that the internal consistency of these various lexical maps should roughly reflect many aspects of associative meaning. You could think of the statistical map as a Godelian bubble whose "truth" - if you will - is imposed by the laws governing the statistical associations. We don't derive the laws of language and meaning from these exercises, but we create an internally-complete map that reflects something about the nature of meaning.

There is a practical aim as well. If you can derive the strength of equivalence and the various levels and colors of associative meaning you could in theory build a "Truth Machine" capable of answering any question with a high degree of accuracy. The result of any question could be computed as any other information retrieval problem would be.

I never got around to having my little Meme Tree programs scrape the internet for random sentences. However, this should be a very simple thing to do. Google has had programming contests in the past - programs that use the Google database in interesting ways. Statistical analysis of language is basically what they do. Research projects on their data could provide stunning insights into the nature of information itself, its relation to language and to reality, and likely into our very nature as linguistic beings.

BBC voices (Score:2, Informative)

by matt me ( 850665 ) writes: on Sunday January 23, 2005 @06:50AM (#11446937)

Link on front page of bbc.co.uk - bbc.co.uk/voices/ [bbc.co.uk] - their attempt at tracking accents and dialects across the UK.

Another use of Google in Linguistics (Score:1, Informative)

by Anonymous Coward writes: on Sunday January 23, 2005 @06:56AM (#11446952)

Just a month ago I finished a paper exploring in great detail the use of Google counts for language analysis and other forms of meaning extraction.
"Automatic Meaning Discovery Using Google": http://arxiv.org/abs/cs.CL/0412098/ [arxiv.org]

Comments welcome, -Rudi.
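
For readers who want a feel for how page counts can be turned into a measure of relatedness, the linked paper defines a "normalized Google distance" between two search terms. The Python sketch below follows that formula as I understand it. The function name, the parameter names, and the counts passed in are made up for illustration, and no search engine is queried.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Distance between two terms computed from page counts.

    fx, fy -- pages containing term x, term y (hypothetical inputs)
    fxy    -- pages containing both terms
    n      -- assumed total number of indexed pages; must exceed min(fx, fy)
    """
    if min(fx, fy, fxy) <= 0:
        return math.inf  # terms never seen, or never seen together
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Made-up counts: two terms that always co-occur vs. two that rarely do.
same = normalized_google_distance(1000, 1000, 1000, 1_000_000)
far = normalized_google_distance(1000, 1000, 10, 1_000_000)
```

Smaller values mean the terms tend to appear on the same pages; terms that always appear together get a distance of 0.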

Re:I rue the day... (Score:3, Informative)

by JustKidding ( 591117 ) writes: on Sunday January 23, 2005 @07:02AM (#11446966)

You may be unaware that "lol" actually is a correct word in the Dutch language, meaning (having) fun.

lol (de ~) 1 [inf.] plezier
(taken from www.vandale.nl, an authoritative Dutch dictionary)
