
Rest In Peas — the Death of Speech Recognition

An anonymous reader writes "Speech recognition accuracy flatlined years ago. It works great for small vocabularies on your cell phone, but basically, computers still can't understand language. Prospects for AI are dimmed, and we seem to need AI for computers to make progress in this area. Time to rewrite the story of the future. From the article: 'The language universe is large, Google's trillion words a mere scrawl on its surface. One estimate puts the number of possible sentences at 10^570. Through constant talking and writing, more of the possibilities of language enter into our possession. But plenty of unanticipated combinations remain, which force speech recognizers into risky guesses. Even where data are lush, picking what's most likely can be a mistake because meaning often pools in a key word or two. Recognition systems, by going with the "best" bet, are prone to interpret the meaning-rich terms as more common but similar-sounding words, draining sense from the sentence.'"

  • Key words (Score:3, Interesting)

    by flaming error ( 1041742 ) on Monday May 03, 2010 @05:15PM (#32077240) Journal

    > meaning often pools in a key word or two
    It's true.

    My own hearing is not great. I often miss just a word or two in a sentence. But they are often key words, and missing them leaves the sentence meaningless. If I counted the words I understand correctly I'd probably have a 95% success rate. But if I counted the sentences I understand correctly, I'd be around 80%. So I get by, but I tend to annoy people when I ask for repeats over one missed word.
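
    A rough back-of-the-envelope illustration of that gap, assuming (unrealistically) that word errors are independent of one another:

```python
# If word errors were independent, 95% per-word accuracy would still
# leave a lot of sentences with at least one missed word.
word_accuracy = 0.95

for length in (4, 8, 12, 20):
    sentence_accuracy = word_accuracy ** length
    print(f"{length:2d}-word sentence: {sentence_accuracy:.0%} fully correct")

# Prints roughly 81%, 66%, 54%, 36%; and if the missed word is a key
# word, the whole sentence can become meaningless.
```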

  • Windows 7 (Score:3, Interesting)

    by Anonymous Coward on Monday May 03, 2010 @05:19PM (#32077324)

    I've been using voice recognition in Win7 for a few weeks now. I can honestly say that after a few training sessions, I'm near 100% accuracy. Which is 15% better than my typing!

  • No it doesn't (Score:3, Interesting)

    by Colin Smith ( 2679 ) on Monday May 03, 2010 @05:27PM (#32077444)

    It works great for small vocabularies on your cell phone

    No. It doesn't.

    It works great for small vocabularies on your cell phone if you happen to live in the same neighbourhood as the developer, where "everyone talks this way". For the rest of the world, attempting to talk with a nasal American twang in order to get the phone to understand you is shit.

     

  • Re:Mod parent up (Score:5, Interesting)

    by x2A ( 858210 ) on Monday May 03, 2010 @05:27PM (#32077448)

    There's nothing special about computers though, people have to do that with other people... let's not kid ourselves into thinking that humans are immune to misunderstandings. No, the more you get to know someone, the way they think and express themselves, the better you can become at communicating with them. Different words have different connotations to different people. It can take a lot of work to get all these down, and it'd be no different with a computer. For effective communication, you'd train and build up a common language with it, one that might seem like nonsense to outsiders... and I, for one, welcome this.

  • by Anonymous Coward on Monday May 03, 2010 @05:28PM (#32077468)
    The radiology voice dictation transcription system at my former employer was horrible. Having to read the dictated reports was equally appalling, considering a radiologist was signing off on their accuracy, and they were certainly not completely accurate. The irony is that the things the system frequently had trouble with were simple words like "not" and recognizing quantities appropriately, whereas more complicated things such as "gastroschisis" would come out correctly.

    I never understood it, but since I was not the radiologist, I didn't care either. I mostly was entertained by listening to them repeat the same stupid, simple word over and over trying to get the dictation system to behave, when it would have taken a fraction of the time to manually edit the document with a keyboard.
  • Re:Sorry what? (Score:2, Interesting)

    by Ethanol-fueled ( 1125189 ) * on Monday May 03, 2010 @05:30PM (#32077518) Homepage Journal

    When talking to someone else, we can politely stop them and ask : "Sorry, what did you say?"

    That doesn't always work. When someone's accent and command of the language are poor enough, you only get a few chances to ask, "Sorry, what did you say?" After asking three times, you either look like an asshole or give up and spend the next few minutes nodding and smiling, trying to parse what they said and hoping you got it right.

    Which is why we need good speech-recognition and translation software. It's easy to infer the meaning of "come to me give the diagram" because there are at least intelligible words to work with. And no, I'm not being racist -- the situation applies to all cultures and languages.

  • Re:Buffalo buffalo (Score:1, Interesting)

    by Anonymous Coward on Monday May 03, 2010 @05:33PM (#32077558)

    Has anyone really been far even as decided to use even go want to do look more like?

  • Wrong problem (Score:2, Interesting)

    by slasho81 ( 455509 ) on Monday May 03, 2010 @05:45PM (#32077748)
    There won't be any meaningful development in speech recognition (or machine translation) until context is taken seriously. Context is an inseparable part of speech.
    Right now the problem being solved is audio->text. This is the wrong problem, which is why the results are so lame. The real problem is audio+context->text+new context. That takes some pretty intelligent computing, not the same old probabilistic approaches.
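
    A minimal sketch of that shape, (audio, context) in and (text, new context) out, with a toy keyword-overlap scorer standing in for a real acoustic model; all names, words, and scores below are invented for illustration:

```python
# Toy sketch of "audio + context -> text + new context". A real acoustic
# model is out of scope, so the "recognizer" here just rescores
# ready-made candidate transcripts.

def transcribe(candidates, context):
    # Prefer the candidate that best fits the running conversation.
    def score(candidate):
        acoustic_score, text = candidate
        return acoustic_score + 0.5 * len(set(text.split()) & context)

    best = max(candidates, key=score)[1]
    new_context = context | set(best.split())  # carry what was said forward
    return best, new_context

context = {"beach", "holiday", "sand"}
candidates = [(1.2, "it is hard to recognize speech"),
              (1.0, "it is hard to wreck a nice beach")]
text, context = transcribe(candidates, context)
print(text)  # the context tips the choice toward the "beach" reading
```
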
  • Totally Not Dead Yet (Score:5, Interesting)

    by RingDev ( 879105 ) on Monday May 03, 2010 @05:50PM (#32077834) Homepage Journal

    A few years back I worked for an awesome company that built IVR (interactive voice response) systems.

    We had voice-driven interactive systems that would give the caller a variety of mental health tests (we worked a lot on identifying depression, early-onset dementia, Alzheimer's, and other cognitive issues).

    The voice recognition wasn't perfect, but we had a review system built around a "gold standard". I wrote a tool that let a human listener identify and label the individual words. Then we would run a number of different voice recognition systems against the same audio chunk and compare their output to the human version (a sketch of that kind of scoring follows this comment). It effectively allowed us to unit test our changes to the voice recognition software.

    Dialing in a voice recognition system is an amazing process. The number of properties, dictionaries, scripts, and sentence-forming engines involved is mind-blowing.

    Two of the hardest tests for our system were things like: Count from 1 to 20 alternating between numbers and letters as fast as you can, for example 1-A-2-B-3-C. And list every animal you can think of.

    The 1-A-2-B was killer because when people speak quickly, their words merge. You literally start creating the sound of the A while the end of the 1 is still coming. It makes it extremely difficult to identify word breaks and actual words. And if you dial in a system specifically to parse that, you'll wind up with issues parsing slower sentences.

    The all-animals question had a similar issue: people would slur their words together, and the dictionary was huge. It was even more challenging during one of the studies, which was nationwide; we had to handle everything from northeast-coast to southern-state accents. What was even worse was that there were no sentences, so we couldn't count on predictive dictionary work to pick the most likely word out of those matching the phonetics.

    That said, getting voice recognition to work on pre-scripted commands and sentences was pretty easy.

    And I can only imagine the process has kept improving in the years since. We were looking into SMS-based options, though, not out of any dislike of IVR, but because our usage studies with children showed most of them skipped the voice system and used the keypad anyway. So why bother with IVR if the study's target demographic was kids?

    -Rick
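
    The comparison described above, lining recognizer output up against a human "gold standard" transcript, is usually scored as word error rate. This is not RingDev's actual tool, just a minimal sketch of the standard edit-distance computation:

```python
# Minimal word-error-rate (WER) calculation: edit distance between the
# recognizer's words and a human "gold standard" transcript, as a
# fraction of the gold transcript's length.

def wer(gold: str, hypothesis: str) -> float:
    ref, hyp = gold.split(), hypothesis.split()
    # dist[i][j] = edits (substitutions, insertions, deletions) needed to
    # turn the first i gold words into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("one a two b three c", "one way to bee three sea"))  # ~0.67
```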

  • by bertok ( 226922 ) on Monday May 03, 2010 @05:54PM (#32077886)

    I hardly type anything into my HTC Incredible. Google's voice recognition, which is enabled on every textbox, works just about perfectly.

    Seriously, get an Android phone, try out the speech recognition text entry, and then tell me speech recognition is dead.

    I've tried Google voice recognition, but I found that it just detected gibberish unless I spoke with a fake American accent.

  • Re:IBM? (Score:5, Interesting)

    by N1ck0 ( 803359 ) on Monday May 03, 2010 @05:55PM (#32077904)

    IBM closed many of their speech research offices 1-2 years ago and transferred most of the research/data to Nuance's Dragon Naturally Speaking research.

    Full Disclosure: I work for Nuance

  • by luther349 ( 645380 ) on Monday May 03, 2010 @05:57PM (#32077940)
    Speech software has been evolving at a steady pace, but the issue isn't the tech, it's the fact that 90% of the users out there don't use it. If you live in a loud place with kids or other noise, it will not work well. Windows 7 has built-in speech software, and how many people use that? I played with the latest Dragon speech software and I gotta admit it's very good even without training it; I did emails with it without any issue. But as I said, speech software is more a toy than anything useful. As people have said, it will probably find a better use on a cell phone than on a PC, since it would be an easier way to chat than the phone's keypad.
  • Re:Mod parent up (Score:1, Interesting)

    by RockoTDF ( 1042780 ) on Monday May 03, 2010 @05:58PM (#32077960) Homepage
    You raise a good point about learning. A problem with AI researchers is that they are scared of neural networks for reasons that were solved back in the 1980s, and so they're stuck with expert systems and other "symbol manipulating" programs. The problem with these programs is that they *suck* at learning. I really think that if the AI community looked at neural nets more often they would get closer to figuring this language thing out. With billions and billions of sentences it is hard to build a good system using the aforementioned techniques.
  • Re:AI (Score:3, Interesting)

    by ircmaxell ( 1117387 ) on Monday May 03, 2010 @06:01PM (#32078002) Homepage
    Exactly... In order to do anything more than just "the word that was just spoken was 'x'", you need contextual and object clues. Hofstadter did a great job talking about this in his book Gödel, Escher, Bach: An Eternal Golden Braid. Right now, computers can do nothing more than simple symbol lookups. Speech recognition tries to find the word that matches the vocal pattern, so when it stumbles, the result is useless (the same goes for OCR). With contextual recognition, it can more accurately guess at what was said (that's all we do: when we hear an address that ends with "United States", we automatically know that it's the same as if we heard "USA"). That's something that I do believe is possible; we just haven't gotten to that point yet. The problem is that right now, we don't have any kind of real contextual analysis. We do have some hard-coded context clues, but nothing that represents a system that can "learn". The interesting thing, though, is that to teach an AI program to "speak" a language you need to give it a vast amount of input. Who has lots of input, and gets constant information regarding the accuracy of said input? Search engines. So if anyone can do it, I'd bet that Google is in a position to do it (along with the other major engines, it just seems like it would be one of Google's projects)...
  • Rest in Peas (Score:3, Interesting)

    by CODiNE ( 27417 ) on Monday May 03, 2010 @06:05PM (#32078058) Homepage

    I know it's just an imaginary example of how bad speech-to-text is... but it is realistic and disappointing.

    Even an idiot like me knows what a Markov chain [wikipedia.org] is. Perhaps the standard voice apps are so entrenched that they haven't been recoded to take advantage of the huge leaps in memory capacity since they first went on sale.
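
    For what it's worth, even a toy bigram Markov chain built from a handful of made-up sentences is enough to prefer "rest in peace" over the similar-sounding "rest in peas":

```python
# Toy bigram Markov chain: word-pair counts from a tiny, made-up corpus
# are enough to rank "peace" above "peas" after the word "in".
from collections import Counter

corpus = ("may he rest in peace . "
          "rest in peace grandma . "
          "we shelled the peas in the garden .").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    # Add-one smoothing so unseen pairs don't get probability zero.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

for candidate in ("peace", "peas"):
    print(candidate, bigram_prob("in", candidate))
# "peace" comes out more likely after "in", even though "peas" is a real word.
```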

  • Re:Mod parent up (Score:5, Interesting)

    by brian_tanner ( 1022773 ) on Monday May 03, 2010 @06:16PM (#32078196)
    I think you're probably about 10-20 years out of date with your criticism. AI these days is *all about* statistical machine learning which is *all about* data and not about formal or expert systems at all. This is what Google and others are doing. The AI you are describing is from the late 80s and early 90s.

    Neural networks are part of the story, but many of the ideas from ANNs have been improved upon when more structured settings are available. There is actually a resurgence in deep neural networks right now, though.
  • YouTube (Score:2, Interesting)

    by MaXimillion ( 856525 ) on Monday May 03, 2010 @06:41PM (#32078546)
    Considering Google is now offering automatic transcription of all YouTube videos, I'd say they certainly haven't given up on speech recognition yet.
  • Re:Well duh. (Score:5, Interesting)

    by Chris Burke ( 6130 ) on Monday May 03, 2010 @06:42PM (#32078582) Homepage

    Or have an ounce of poetry in you... ;)

    Hmm... I guess I don't have that since I don't know what it is. That's okay, I can find out with the help of my AI using the latest in voice recognition software! Computer, what is "poetry"?

    Computer: "Poetry" is a form of literary art, frequently using an organized metric and rhyme scheme, that attempts to evoke an emotional response in the reader through the use of metaphor.

    Huh, okay, that's interesting. But computer, what is a metaphor?

    Computer: A "meta" is for people who lack the capabilities to contribute directly to a field or endeavor, but who still wish to sound educated and useful by discussing the nature of the field or endeavor itself. Example: "Physics has way too much math for me, but meta-physics is right up my alley!"

    Yeah, now I'm just confused.

  • by maxwells_deamon ( 221474 ) on Monday May 03, 2010 @07:09PM (#32079016) Homepage

    By definition, the second phrase eliminates any remaining use cases after the countdown finishes.

  • Re:Mod parent up (Score:2, Interesting)

    by Anonymous Coward on Monday May 03, 2010 @07:44PM (#32079444)

    The main difference right now between human speech recognition and computer speech recognition is how the results are handled.

    If I said "I had a hard time staying a wake", both a person and a computer would misunderstand and think I said "I had a hard time staying awake." However, if it was in the middle of a discussion of funeral ceremonies I had conducted, both a person and a computer could figure out what I really meant.

    A person would likely hedge their response though, as either of the two meanings would be possible -- they'd probably respond with laughter and judge my physical reaction to that to identify which sentence I meant -- or, they might make some leading comment that forced me to add context.

    Computers, however, are expected to "know the answer" with no further cues, and as such are designed to "best guess" between the two options (a toy sketch of hedging instead of guessing follows this comment). They're probably better at this than a person would be in the same situation -- especially if the person didn't know what the verb "to stay" actually meant, or what a "wake" was. People give many context cues through tone and non-verbal interaction that a computer is just never designed to pick up on. Add to that the fact that tonal cues are extremely tribal, and the complexity balloons.

    For artificial verbal content recognition to really take off, the computer needs to be trained not only on words, grammars, and other parts of speech and lexical context, but also on tribal uses of tone -- most people have a pretty good grasp of the tones used in their own "tribe", and can identify many of the neighbouring "tribes" to the extent that they know what other cues to look for to complete the context. If a computer was trained in all the major inflections for all systems of language in the world, it would likely be better than most humans in a random sampling of sentences.

    A Chinese ESL individual who learned English in Alabama would have an extremely difficult time understanding what someone from Newfoundland or the Hebrides was trying to say to them -- but a computer properly trained should be able to translate between the two with no difficulty.
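
    A toy sketch of that "hedge instead of best-guess" behaviour: if the recognizer's top two hypotheses score too closely, ask a clarifying question rather than committing. The scores, candidates, and margin here are all invented:

```python
# If the top two hypotheses are too close, ask instead of guessing.

def respond(hypotheses, margin=0.1):
    ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)
    (best_score, best), (runner_score, runner) = ranked[:2]
    if best_score - runner_score < margin:
        return f'Did you mean "{best}" or "{runner}"?'
    return f'Heard: "{best}"'

print(respond([(0.51, "I had a hard time staying awake"),
               (0.49, "I had a hard time staying a wake")]))
# Prints a clarifying question instead of silently picking one reading.
```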

  • by bmo ( 77928 ) on Monday May 03, 2010 @08:35PM (#32079962)

    Only you talk like you. There is no archive of speech large enough to encompass every speaker of a language, short of one with a record of each and every speaker. And it still doesn't solve the teaching problem. The shotgun approach is problematic in many ways, most of all the size of the database, and you'd still wind up teaching the speech platform to figure out which accent you're using, because if you ask most people, they don't have any accent at all.

    Actually, I think the solution would be to make personal datasets portable, built to standards, so when you go from one device to another all you need to do is plug in your own dataset (or access it from the network) and, et voila, instant voice recognition wherever you go, on any system designed to use that dataset standard. Sort of like an ODF for speech datasets (a hypothetical sketch of what such a profile might contain follows this comment). This way it's distributed, you don't have a humongously unwieldy database to manage, and it's personalized.

    But that requires standards which don't yet exist, because every speech recognition platform reinvents the wheel every single time.

    --
    BMO
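
    No such standard exists, as the comment says, so the following is purely a hypothetical sketch of the kinds of fields a portable speaker profile might carry:

```python
# Purely hypothetical sketch of a portable per-speaker dataset of the
# kind described above; every field name here is invented.
import json

profile = {
    "format_version": "0.1",
    "speaker_id": "example-speaker",
    "locale": "en-GB",
    "custom_vocabulary": ["Slashdot", "ODF", "Nuance"],
    "pronunciations": {          # the speaker's own variants, word -> phonemes
        "tomato": "T AH M AA T OW",
    },
    "adaptation": {              # opaque per-engine acoustic adaptation blob
        "engine": "example-engine",
        "data_base64": "",
    },
}

# The idea: carry this file with you (or fetch it over the network) and
# hand it to whatever recognizer you happen to sit down at.
print(json.dumps(profile, indent=2))
```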

  • It's obviously tuned for that, but it wouldn't be fair to ask it to understand Scottish, now, would it? ;) Seriously, though, in my expat group we had at least ten English-speaking countries represented, and I had little trouble in most cases. There was still one New Zealander who I never understood, even after a year; I generally gave up asking him to repeat himself after the third time in a row and just tried to fake it. I'd get maybe 10-20% of any sentence from him.

    Speech recognition of random accents, even in one language, is virtually impossible. I think the computer needs to be given a clue about the accent of the speaker.

  • Re:Buffalo buffalo (Score:3, Interesting)

    by ClosedSource ( 238333 ) on Monday May 03, 2010 @09:19PM (#32080340)

    If only speech recognition's problems were limited to these low-probability sentences. I've had a number of SR systems fail to recognize my "yes" and "no" responses.

  • Re:Buffalo buffalo (Score:3, Interesting)

    by asc99c ( 938635 ) on Tuesday May 04, 2010 @08:45AM (#32083654)

    I'd never heard this one before, guess it's the American version! The one I was taught was a complaint by a pub landlord to their sign writer:
    You've left too much space between pig and and and and and whistle
