Rest In Peas — the Death of Speech Recognition

An anonymous reader writes "Speech recognition accuracy flatlined years ago. It works great for small vocabularies on your cell phone, but basically, computers still can't understand language. Prospects for AI are dimmed, and we seem to need AI for computers to make progress in this area. Time to rewrite the story of the future. From the article: 'The language universe is large, Google's trillion words a mere scrawl on its surface. One estimate puts the number of possible sentences at 10^570. Through constant talking and writing, more of the possibilities of language enter into our possession. But plenty of unanticipated combinations remain, which force speech recognizers into risky guesses. Even where data are lush, picking what's most likely can be a mistake because meaning often pools in a key word or two. Recognition systems, by going with the "best" bet, are prone to interpret the meaning-rich terms as more common but similar-sounding words, draining sense from the sentence.'"

  • by bit trollent ( 824666 ) on Monday May 03, 2010 @05:15PM (#32077254) Homepage

    I hardly type anything into my HTC Incredible. Google's voice recognition, which is enabled on every textbox, works just about perfectly.

    Seriously, get an Android phone, try out the speech recognition text entry, and then tell me speech recognition is dead.

  • World model (Score:2, Informative)

    by Anonymous Coward on Monday May 03, 2010 @05:20PM (#32077330)

    Speech recognition mechanisms/algorithms are not the whole problem. What needs to back them up is called a "world model," and, as the name implies, this can be large and open-ended. Humans can correct spoken/heard errors on the fly because they have an underlying world model.

  • Mod parent up (Score:3, Informative)

    by idiot900 ( 166952 ) * on Monday May 03, 2010 @05:21PM (#32077338)

    Would that I had mod points today.

    The above is a valid English sentence and a poignant example of how difficult it is to parse language without knowledge of semantics.

  • Re:Windows 7 (Score:4, Informative)

    by adonoman ( 624929 ) on Monday May 03, 2010 @05:29PM (#32077484)

    People underestimate the value of training - we do it subconsciously when we meet people with different accents or vocal tones. At first they're hard to understand, but after an hour or so of talking with someone, you stop noticing their accent. Windows 7 seems to do a really good job of learning from use (it learns even without explicit training when you make corrections). I have a Windows 7 tablet, and the voice recognition is impressive. Its handwriting recognition is even better than mine when it comes to my own writing (it benefits from knowing the direction and order of strokes) - I just scratch out something vaguely resembling what I want to write and it recognizes it almost 100% of the time.
  • Re:Buffalo buffalo (Score:5, Informative)

    by hoggoth ( 414195 ) on Monday May 03, 2010 @05:37PM (#32077634) Journal

    Buffalo bison whom other Buffalo bison bully, themselves bully Buffalo bison.

  • Re:Buffalo buffalo (Score:5, Informative)

    by Anonymous Coward on Monday May 03, 2010 @05:38PM (#32077648)

    For those that don't know:
    http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo

      'Buffalo bison whom other Buffalo bison bully, themselves bully Buffalo bison'.

  • Re:Mod parent up (Score:1, Informative)

    by Anonymous Coward on Monday May 03, 2010 @05:40PM (#32077676)

    That's why some of the words are capitalized.

  • Re:Well duh. (Score:1, Informative)

    by Anonymous Coward on Monday May 03, 2010 @05:57PM (#32077944)

    The article acknowledges this... it mentions that speech recognition topped out at a 20% word error rate, while humans have an error rate of 2%-4%.
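
    For anyone who wants to check numbers like that: word error rate is just the word-level Levenshtein (edit) distance between the recognizer's output and a reference transcript, divided by the reference length. A minimal sketch in Python - the function and the toy sentences are mine, not from the article:

        def word_error_rate(reference, hypothesis):
            """(substitutions + insertions + deletions) / reference length,
            computed as a word-level Levenshtein distance."""
            ref, hyp = reference.split(), hypothesis.split()
            # d[i][j] = edit distance between ref[:i] and hyp[:j]
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                  d[i][j - 1] + 1,         # insertion
                                  d[i - 1][j - 1] + cost)  # substitution
            return d[len(ref)][len(hyp)] / len(ref)

        # One wrong word out of six, i.e. roughly the 20%-ballpark WER above:
        print(word_error_rate("the cat sat on the mat", "the cat sat in the mat"))  # 0.1666...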

  • by jaavaaguru ( 261551 ) on Monday May 03, 2010 @06:41PM (#32078548) Homepage

    There is nothing wrong with that phrase.

  • Re:Watermelon Box (Score:3, Informative)

    by frank_adrian314159 ( 469671 ) on Monday May 03, 2010 @07:27PM (#32079244) Homepage

    People were doing symbolic context recognition in the 60's-80's (look up frames). That went out of vogue with the rise of neural nets and statistical recognition in the late 80's, which continue to dominate to this day. The problem is that getting better now probably needs new probabilistic models for symbolic context recognition, fed from below by statistical recognition of phonemes and words, and feeding forward to the parsing of later phrases. This would require either two teams, or one team with expertise in both areas - and, in the past, the symbolists fought the statisticians like cats and dogs.

    The bottom line is that (a) we can do better, but (b) it will be more expensive to fund, and (c) it requires academics to admit that their deep specialization in a given area does not provide the entire solution. Plus, funders like the NSF, DoD, etc. are seldom interested in funding large integrated projects, preferring smaller, focused projects that reduce risk and spread research funds around more broadly. As such, I predict this technology will stay stalled indefinitely (or until someone else does it).

  • by kindbud ( 90044 ) on Monday May 03, 2010 @07:33PM (#32079318) Homepage

    The word "data" pluralizes "datum." "Data are lush" correctly pluralizes the singular form of the sentence.

    Now who sounds stupid?

  • by Theovon ( 109752 ) on Monday May 03, 2010 @08:33PM (#32079936)

    When I began my Ph.D., I started out majoring in AI. One of several reasons I changed to computer architecture (CPU design, etc.) is that I just couldn't stand the broken ways people were doing things. Computer vision actually isn't so bad - at least there's room for advancement. But the state of the art in speech recognition is just awful. I couldn't stand the way they did much of anything in pursuit of human language understanding.

    With automatic speech recognition (ASR), the first problem is the MFCCs (Mel-frequency cepstral coefficients). What these essentially do is take a Fourier transform of a Fourier transform of the data. This filters out not only amplitude but also absolute frequency, leaving you only with the relative pattern of frequencies. Think of it as analogous to taking a second derivative, where all you get is acceleration, leaving out position and velocity. You lose a LOT of information. Once the MFCCs are computed, they're reduced to the top 13 (or so) dominant coefficients, plus their first and second step-wise derivatives, giving you a 39-D vector. Then the top N most common vectors are tallied and code-booked, mapping the rest to the nearest codes, leaving you with a relatively small number of codes (maybe a few hundred).
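
    For the curious, that front end is only a few lines with off-the-shelf tools. A sketch using librosa and scikit-learn - the 16 kHz rate, the file name, and the 256-entry codebook are typical choices picked for illustration, not anything specified above:

        import librosa
        import numpy as np
        from sklearn.cluster import KMeans

        # 13 MFCCs per frame: a "spectrum of the log spectrum" of the audio
        y, sr = librosa.load("utterance.wav", sr=16000)  # illustrative file
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

        # First and second step-wise derivatives -> the 39-D vectors
        delta1 = librosa.feature.delta(mfcc)
        delta2 = librosa.feature.delta(mfcc, order=2)
        features = np.vstack([mfcc, delta1, delta2]).T  # shape (frames, 39)

        # Vector-quantize: map every frame to one of a few hundred codes
        codebook = KMeans(n_clusters=256, random_state=0).fit(features)
        codes = codebook.predict(features)  # one discrete code per frame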

    So to start with, the signal processing is half deaf, throwing away most of the information. I get why they do it, because it's speaker independent, but you completely lose some VERY valuable information, like prosodic stress, which would be very useful to help with word segmentation. Instead, they try to guess it from statistical models.

    Next, they apply a hidden Markov model (HMM). Instead of inferring phones directly from the signal, they model them as a sequence of hidden states (the phones) that cause the observations (the codes). This statistical model seems kind of backwards, although it works quite well when trained properly. To train it, you need a lot of labeled data, where people have taken lots of speech recordings and manually labeled the phonetic segments. What is usually learned is a simple model: the a priori probabilities of each phone label (the hidden states), plus the probability of each phone given the previous phone (the transition probabilities). Given a sequence of codes, you find the most likely sequence of phones by computing the Viterbi path through the HMM.
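
    To make the decoding step concrete, here is a toy Viterbi decode over a three-phone HMM - the states and all the probabilities below are invented for illustration:

        import numpy as np

        phones = ["sil", "ah", "t"]            # hidden states
        init  = np.log([0.8, 0.1, 0.1])        # a priori phone probabilities
        trans = np.log([[0.6, 0.3, 0.1],       # trans[i][j] = P(phone j | phone i)
                        [0.1, 0.6, 0.3],
                        [0.2, 0.2, 0.6]])
        emit  = np.log([[0.7, 0.2, 0.1],       # emit[i][k] = P(code k | phone i)
                        [0.1, 0.7, 0.2],
                        [0.2, 0.1, 0.7]])

        def viterbi(codes):
            """Most likely phone sequence for an observed code sequence."""
            score = init + emit[:, codes[0]]   # best log-prob ending in each state
            back = []                          # backpointers, one array per step
            for c in codes[1:]:
                cand = score[:, None] + trans  # extend every path by one transition
                back.append(cand.argmax(axis=0))
                score = cand.max(axis=0) + emit[:, c]
            state = int(score.argmax())        # best final state, then trace back
            path = [state]
            for bp in reversed(back):
                state = int(bp[state])
                path.append(state)
            return [phones[s] for s in reversed(path)]

        print(viterbi([0, 1, 1, 2]))  # -> ['sil', 'ah', 'ah', 't']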

    Honestly, I can't complain too much about the HMM. What I do complain about is that the "cutting edge" is to replace the HMM with a Markov random field (just remove the arrows from the HMM) or a conditional random field (a Markov random field with extra inputs).

    My response to MRFs and CRFs is "big whoop": all you're doing is swapping out the statistical model, and that doesn't dramatically improve recognition performance because the underlying problem with the signal processing hasn't been fixed.

    Then, on top of the phone HMM, they layer ANOTHER HMM to infer words and word boundaries from a highly inaccurate phone sequence.

    The main problem with all of this is not that the researchers are idiots. They're not. The problem is that the people with the funding are totally unwilling to fund anything really interesting or revolutionary. The existing methods "work," so the funding sources figure we can just make incremental changes to existing technologies. Which is wrong. Unfortunately, any radically new technology would be highly experimental, with a high risk of failure, and would take a long time to develop. No one wants to fund anything that iffy. As a result, all the scientists working in this area spend their time on nothing but boring tweaks of a broken but "proven," reasonably effective technology.

    So I don't blame people for the conundrum, but since I saw no opportunity to do anything interesting, I just couldn't stand studying it.

  • Re:AI (Score:3, Informative)

    by wurp ( 51446 ) on Monday May 03, 2010 @08:44PM (#32080036) Homepage

    Google voice recognition already does exactly that. It matches words against the database of commonly co-occurring words that they've built via their search engine.

    This message was composed using Android voice recognition on my Nexus One phone. I had to manually correct 2 words out of the whole post.
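
    For what it's worth, that kind of language-model rescoring can be sketched in a few lines - the bigram counts below are made up, standing in for Google's web-scale data:

        from math import log

        # Toy counts standing in for a web-scale n-gram database
        bigrams  = {("recognize", "speech"): 900, ("wreck", "a"): 50,
                    ("a", "nice"): 400, ("nice", "beach"): 60}
        unigrams = {"recognize": 1000, "speech": 1200, "wreck": 80,
                    "a": 50000, "nice": 900, "beach": 300}
        VOCAB = 100_000  # assumed vocabulary size, for add-one smoothing

        def sentence_logprob(words):
            """Bigram log-probability with add-one smoothing."""
            lp = 0.0
            for w1, w2 in zip(words, words[1:]):
                lp += log((bigrams.get((w1, w2), 0) + 1) /
                          (unigrams.get(w1, 0) + VOCAB))
            return lp

        # Pick whichever acoustically plausible hypothesis the model likes best
        hypotheses = ["recognize speech".split(), "wreck a nice beach".split()]
        print(max(hypotheses, key=sentence_logprob))  # ['recognize', 'speech']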
