The Future of Speech Technologies 101
prostoalex writes "PC Magazine is running an interview with two of the research leaders in IBM's speech recognition group, Dr. David Nahamoo, manager of Human Language Technologies, and Dr. Roberto Sicconi, manager of Multimodal Conversational Solutions. They mainly discuss the status quo of speech technologies, which prototypes exist in IBM Labs today, and where the industry is headed." From the article: "There has to be a good reason to use speech, maybe your hands are full [like in the case of driving a car]. ... Speech has to be important enough to justify the adoption. I'd like to go back to one of your original questions. You were saying, 'What's wrong with speech recognition today?' One of the things I see missing is feedback. In most cases, conversations are one-way. When you talk to a device, it's like talking to a one- or two-year-old child. He can't tell you what's wrong, and you just wait for the time when he can tell you what he wants or what he needs."
Language Acquisition... (Score:5, Interesting)
You see, the problem right now is that there's really not much data in the public domain for linguists/psychologists/what-have-you to study, because it's incredibly, incredibly laborious to do longitudinal studies of children's utterances, or of input to the child. People spend hours and hours and hours transcribing 20 minutes of tape. They're understandably reluctant to just share their data out of the goodness of their hearts. Even when they do, it's never a large sampling of children-and-their-interlocutors from-birth-to-age-X; it's usually just one child and maybe his or her parents from age 8 months to 3 years.
So we have arguments about whether or not kids hear certain forms of input (Have you used passive voice with your child recently? Where's your child going to learn subjacency?) that go back and forth between psychologists and linguists, and people perform corpus studies on 3 children and feel that that's representative -- never mind the fact that these three kids were all harvested from the MIT daycare centre, and were the children of grad students or faculty members, and thus may not be representative of the population at large.
Speech recognition would make it much, much easier to amass large corpora of data for larger samples of the population. It'd make it much more likely for people to share their data. And, what's more, it'd likely be possible to have a phonetic and syntactic-word-stub (for lack of a better word) transcription made from the same recording. We'd have a better idea of how the input determines how language is acquired by children, and what sorts of stages children go through.
IBM Speech - Needs Superhuman sales to survive? (Score:5, Interesting)
Scansoft, who earlier all but cornered the market for Optical Character Recognition (OCR) technology, did the same with speech recognition by acquiring the largest players in this space, SpeechWorks and Nuance. Scansoft changed their name to Nuance as a part of that last acquisition.
IBM, meanwhile, has been struggling to find a market for their "Superhuman" (sneer) speech reco technology. A few years ago, they sold distribution of their retail desktop product, ViaVoice, to (wait for it) Scansoft. Their commercial product was RS/6000-AIX-only until a couple of years ago, when they ported it to more platforms, including Windows and Linux, and integrated it more tightly with their Rational and WebSphere marketing platforms.
The current enterprise product sounds really sexy, at least for Rational-WebSphere shops. You can develop your WebSphere VXML application in Eclipse and leverage all those groovy WebSphere services you've built. No (or not much) special skill required!
The problem is that their target market is Telecom Managers, who face a choice between IBM, with a few hundred ports installed, and Nuance (-ScanSoft-SpeechWorks), with tens or hundreds of thousands of installed speech reco ports. Telecom Managers live in a world where their clients expect six-sigma/five-nines reliability. This is a hard sell to make.
The question is, how long can IBM keep pouring money into speech R&D and product development in the face of dismal sales? Some in the industry expect the answer is, "Not too much longer." And that, of course, makes nervous enterprise buyers even more nervous and less likely to buy.
integration (Score:4, Interesting)
can it replace court reporters? (Score:4, Interesting)
In any case, I warned her about the potential for voice recognition technology to render court reporters obsolete. It probably won't happen, but the mere prospect tipped her in the direction of forgoing the opportunity. Was that a mistake?
The same concern applies also to medical transcription.
Re:its been a while (Score:4, Interesting)
Using it when you have a cold or sore throat, or when you have been indulging in your favorite alcoholic beverage, can corrupt your voice profile and set you back considerably.
Never let someone else use it under your voice profile.
Will voice rec systems ever be 100% accurate and speaker independent? Maybe, but I don't expect to see it for a long time.
Re:Language Acquisition... (Score:2, Interesting)
Very interesting. Since you're a linguist, I wonder if you might address a concern I've had about speech recognition technology in general.
I've dabbled a bit with Dragon Naturally Speaking in the past (v.7) and frankly found it still too immature to be of much use to me. I find it still far easier to deal with an accurate yet artificial interface (keyboard and mouse) than an inaccurate but more "organic" interface (speech recognition).
But one of the things that stood out from the experience was the way in which I found myself quickly (if frustratedly) adapting my speech patterns to comply with the machine's ability to interpret me.
Is anyone out there considering the consequences of speech recognition technology on the evolution of human speech? It seems to me that any speech technology is going to be imperfect to some extent, but the better it gets, the more people are going to use it, and those people will inevitably end up adapting their speech patterns to the machine.
Could this technology end up homogenizing human speech patterns to fit the computer's speech recognition model? Is this even a valid concern in your opinion, and if so, is anyone in the linguistics field considering these implications?
So why is voice input in decline? (Score:4, Interesting)
Try TellMe. Call 1-800-555-TELL. It's a voice portal. Buy movie tickets. Get driving directions. News, weather, stock quotes, and sports. All without looking at the phone. So what's the problem?
Don't question my intelligence, it's fake. (Score:1, Interesting)
Screw speech recognition (Score:2, Interesting)
Babblin' all over the place is dumb.
Instead of speech recognition, let's work on better speech synthesis. Here we are in 2006 and the average synthesized voice sounds hardly better than my freakin' Phasor card I had for my Apple.
Re:MOD PARENT UP (Score:1, Interesting)
I don't think too many people realized how much more useful Apple's speech technology became when S.I. was teamed with AppleScript. I'm sure IBM's technology is/was superior, but so was Token Ring. Sometimes the better approach is to start at 'cheap and appealing' before jumping to advanced.
Re:its been a while (Score:2, Interesting)
We have all the problems mentioned (except drinking). There are also some others that you might not consider. For example:
As the day wears on, the radiologist will get tired, and the recognition will become worse.
Also:
A radiologist who started at 6:30AM will see the sound characteristics of the room change dramatically as more people begin working and activity in the reading room increases. Even environmental systems cycling on & off can affect the recognition.
Despite this, when we receive a complaint about the voice recognition and we observe the user in action, they usually achieve 90-95% accuracy. That is really the most the vendor ever claimed was possible.
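For anyone curious what a "90-95% accuracy" claim actually measures: speech recognition vendors typically quote word-level accuracy, i.e. one minus the word error rate (WER), which is the word-level edit distance between what was said and what the recognizer produced. A minimal sketch (the transcripts here are made-up examples, not vendor data):

```python
# Sketch: word error rate (WER) as commonly used to score speech
# recognition output. WER = word-level Levenshtein distance between
# the reference transcript and the recognizer's hypothesis, divided
# by the number of reference words. Accuracy = 1 - WER.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: one substituted word in a ten-word report.
ref = "no acute cardiopulmonary process is seen on this chest film"
hyp = "no acute cardiopulmonary process is seen on the chest film"
accuracy = 1.0 - word_error_rate(ref, hyp)  # 9 of 10 words correct
```

Note that at a dictation pace of 100+ words per minute, even 95% accuracy still means several corrections per minute, which is why the complaints keep coming.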
It is my understanding that for radiology practices in which the doctors share the profits, the voice recognition systems are a hit. You can see why when you look at the numbers. When we adopted the system, we had been using transcriptionists at a cost of about $600,000/year. After the change, the annual cost of the speech recognition system was about $100,000. That doesn't take into account the greatly decreased turn-around time. Now we can have your report emailed to your doctor before you get your pants back on.
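The economics above are worth spelling out; a back-of-the-envelope sketch using only the figures stated in the comment:

```python
# Back-of-the-envelope comparison using the figures from the comment.
transcription_cost = 600_000  # annual transcriptionist cost, USD
speech_rec_cost = 100_000     # annual speech recognition system cost, USD

annual_savings = transcription_cost - speech_rec_cost  # $500,000/year
cost_ratio = transcription_cost / speech_rec_cost      # 6x reduction
print(f"Annual savings: ${annual_savings:,} "
      f"({cost_ratio:.0f}x cost reduction)")
```

Savings on this scale explain why profit-sharing practices tolerate the 5-10% error rate: the doctors correcting the output are the same people pocketing the difference.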