
Speech Recognition in Silicon

Ben Sullivan writes "NSF-funded researchers are working to develop a silicon-based approach to speech recognition. 'The goal is to create a radically new and efficient silicon chip architecture that only does speech recognition, but does this 100 to 1,000 times more efficiently than a conventional computer.' Good use of $1 million?"
  • by CrazyJim1 ( 809850 ) on Tuesday September 14, 2004 @10:58AM (#10246002) Journal
    My friend and I were talking about this. In countries that are more totalitarian, it could be used to root out "dangerous people". www.geocities.com/James_Sager_PA
  • accuracy (Score:5, Insightful)

    by tubbtubb ( 781286 ) * on Tuesday September 14, 2004 @10:59AM (#10246009)

    100 to 1000 times more efficient worth $1M? meh. maybe.
    100 to 1000 times more accurate worth $1M? definitely.
  • by Anonymous Coward on Tuesday September 14, 2004 @10:59AM (#10246010)
    Damned straight it is! In government terms, that's a pittance. In government-funded science terms, it's downright INFINITESIMAL. It isn't even couch change, it's more like the stale pretzel under the couch cushion.

    But, of course, cue the armchair blogging fanatics without a formal science education, waxing poetic about the infinite power and glory of x86 hardware running clever open source software. Maybe we could do it in perl!
  • Sarcasm? (Score:2, Insightful)

    by Anonymous Coward on Tuesday September 14, 2004 @10:59AM (#10246014)
    Good use of $1 million?
    For something that would be worth hundreds of times that in the form of a finished product, I would hope so. The only dispute might be that the researchers' efforts would be better spent on other things.
  • by Oxy the moron ( 770724 ) on Tuesday September 14, 2004 @11:00AM (#10246028)

    On the one hand, it is obvious how much more efficient this would make our day-to-day tasks. Being able to "jot" notes with speech instead of writing, schedule tasks in seconds, the list goes on and on...

    This is certainly beneficial... but think about the impact on the economy! Imagine all the "Administrative Professionals" who could, almost instantly, be out of work. I for one would rather pay even $5,000 for a good piece of software to take all my notes than pay a secretary $28,000/year or so.

    Then again, when I posed this situation at my wife's office (she's a paralegal) one of the attorneys responded, "Until they come up with software that can find my lost keys and bring me coffee, the secretary's job is secure."

  • by MankyD ( 567984 ) on Tuesday September 14, 2004 @11:00AM (#10246032) Homepage
    I'm curious to see if their research will improve Natural Language Queries, as opposed to just improving speech recognition. There is an important difference between having to say: SELECT name FROM users WHERE id=12345 and saying: Pull up the name of employee number 12345.
  • by handy_vandal ( 606174 ) on Tuesday September 14, 2004 @11:01AM (#10246041) Homepage Journal
    Speech recognition on a chip, yes.

    But only "silicon" in the sense that every other silicon chip is silicon.

    No magical "silicon" breakthroughs to see here, keep moving.

    -kgj
  • Only 1million? (Score:3, Insightful)

    by Gyorg_Lavode ( 520114 ) on Tuesday September 14, 2004 @11:03AM (#10246052)
    That's impressive for just 1 million. Working in defense and knowing our contractors, 1 million dollars is barely enough to get them to tell you how much it would cost for them to do the initial research to tell you whether they can actually build what you want.

    (I did not read the article as it is slashdotted so I am relying on the summary's statement of 1 million dollars.)

  • by Fortress ( 763470 ) on Tuesday September 14, 2004 @11:03AM (#10246054) Homepage
    ...is always underestimate your costs and run over budget later. That $1 million will turn into $1 billion before anything comes of this. Hell, it'll take over a million to get the development organization up and running.

  • by L0neW0lf ( 594121 ) on Tuesday September 14, 2004 @11:04AM (#10246072)
    I once did a lot of work with speech recognition software, having a former significant other who was disabled. I tested a number of programs, and found the biggest problem to be the wide variance in users' dialects. The programs all have to be trained initially to recognize a single user's voice. This means that a program trained for a Bostonian may not work for someone from Arkansas, Texas, or Louisiana. Also, the programs' effectiveness decreased over time if you did not use them regularly.

    I don't know how possible it will be to make a program that can recognize all English users. Will someone who speaks Oxford English be recognized as well as a surfer from California? I doubt it.
  • by GMail Troll ( 811342 ) on Tuesday September 14, 2004 @11:04AM (#10246077)
    "People who are serious about software should make their own hardware" - Alan Kay

    This seems like a situation where a hardware-accelerated approach is pretty sensible. I'm guessing there are large amounts of signal processing involved in speech recognition. It probably helps greatly to offload some of that onto a dedicated chip like this, in the same way GPUs are used on graphics cards. The only problem I can see is that there might not be much of a market for it. GPUs have an obvious market (games), but there is less demand for speech processing. Star Trek-style interfaces are nice to dream of, but for most common tasks a keyboard and mouse will probably give you a faster and more accurate interface.
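
    For illustration, the FFT-heavy front end such a chip would offload looks roughly like this; a minimal NumPy sketch (frame sizes and function names are illustrative, not from the article):

        import numpy as np

        def spectral_frames(signal, sample_rate=16000, frame_ms=25, step_ms=10):
            """Slice a waveform into overlapping windowed frames and return
            per-frame magnitude spectra -- the repetitive inner loop a
            dedicated chip could offload, much as a GPU offloads rasterization."""
            frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples
            step = int(sample_rate * step_ms / 1000)        # 160 samples
            window = np.hamming(frame_len)
            frames = [signal[i:i + frame_len] * window
                      for i in range(0, len(signal) - frame_len, step)]
            return np.abs(np.fft.rfft(np.array(frames), axis=1))

        # One second of silence stands in for real audio here.
        print(spectral_frames(np.zeros(16000)).shape)  # (frames, frequency bins)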

    gmail invite [google.com]

  • by Darkon06 ( 714661 ) on Tuesday September 14, 2004 @11:05AM (#10246087)
    I'd like to see some results. So far there have been quite a few attempts at speech recognition. Generally they all fall short: they don't like accents, and often misinterpret. I know because a while back we looked at something for my grandfather; he can't keep his hand steady enough to write anymore... *shrug*
  • by randombit ( 87792 ) on Tuesday September 14, 2004 @11:12AM (#10246151) Homepage

    - Voice controlled robots ("You missed a corner, vacuum cleaner")
    - Data search by voice ("Find me a channel that plays Star Trek")


    Kinda jumping ahead of yourself, aren't you? There are two steps to an operation like this: speech to text, and understanding the text you get out. Speech recognition gives you the first part, but you still have to be able to pull apart the sentence and figure out what it means.

    Also, the article didn't say more accurate than software, it said more efficient. You know, uses less power and stuff like that? If the applications you mention (like search via voice) were possible/usable, you could run them today on an upper-end PC no problem.
  • Re:Funny... (Score:4, Insightful)

    by loginx ( 586174 ) <xavier&wuug,org> on Tuesday September 14, 2004 @11:12AM (#10246155) Homepage
    I want to sing the general tone of a song I heard on the radio in a microphone and have google direct me to that album on froogle.

    THAT would be awesome!
  • Re:accuracy (Score:3, Insightful)

    by SillyNickName4me ( 760022 ) <dotslash@bartsplace.net> on Tuesday September 14, 2004 @11:15AM (#10246182) Homepage
    > 100 to 1000 times more efficient worth $1M? meh. maybe.
    > 100 to 1000 times more accurate worth $1M? definitely.

    Accuracy does not have to be a problem with modern speech-to-text systems, but the need to 'train' them to get that accuracy, and the need to talk to them in a somewhat distinctive way, make them far less efficient.

    I'd rather say that the time it takes to get used to a speech recognition system (and to get it used to you, where applicable), together with the somewhat heavy CPU requirements, is what currently stops use. To me that means that the first thing required is efficiency; the accuracy is already there.

    (I have been using speech to text for over a decade now, starting out with another hardware solution in the first half of the '90s: IBM's VoiceType Dictation, back then called Personal Dictation System if I'm not mistaken. Even that system already had accuracy almost as good as I manage myself.)
  • by Anonymous Coward on Tuesday September 14, 2004 @11:25AM (#10246266)
    so we should never try to advance society until what you see as basic problems (that WILL NEVER BE SOLVED) are fixed?

    bravo
    let's go back to living in mud huts too, because there was energy spent on making better walls while some people were starving.

    not to mention: 10,000 people, what is $10 going to do for them?

    wow, they can have half a dozen ultra-cheap meals.
    that really helps a lot
  • by Jeff DeMaagd ( 2015 ) on Tuesday September 14, 2004 @11:32AM (#10246391) Homepage Journal
    Agreed. Secretaries are needed to handle paper, take calls, and do filing too. A business that prides itself on professionalism and service would IMO not rely on shortcuts like the voice-mail maze. So they aren't just a personal refreshment gopher. Any business should still need that sort of thing.

    Even if dictation is taken away from secretaries, they still need to check the grammar and arrangement, as dictation is almost always free-form, without the same structure as a good written letter.
  • Re:History.. (Score:2, Insightful)

    by giblfiz ( 125533 ) on Tuesday September 14, 2004 @11:41AM (#10246526)
    An excellent point. However, if one were to make something along the lines of a PDA or phone with voice recognition, the dedicated hardware would stay useful for much longer, because you not only need to wait for the CPUs to catch up, they need to pull so far ahead that they can compete in power consumption as well. (Which may be entirely impossible.)

    Task-specific silicon becomes very useful when you don't have as much space/power/heat dissipation as you want.
  • Re:History.. (Score:3, Insightful)

    by Jeff DeMaagd ( 2015 ) on Tuesday September 14, 2004 @11:41AM (#10246529) Homepage Journal
    These chips wouldn't go into a computer, there are numerous non-computer devices that could use good, low power speech recognition.

    Will a general purpose CPU fit or operate in a phone that can be on for a week? I almost never shut off the phone and it still lasts a week, and I don't want to sacrifice that run time for speech recognition.

    Granted, ARM chips are getting more powerful but the power consumption is still a limiting factor for their designs.
  • Re:To, two, too (Score:2, Insightful)

    by j_cavera ( 758777 ) on Tuesday September 14, 2004 @11:45AM (#10246575)
    Speech recognition is a two-part process. The silicon is to speed up part one: word recognition. The first thing to do is to figure out that the person is saying:

    Computer, set timer for (to|too|two) (ours|hours).

    Step two changes that into: ... two hours.

    based on context. That's where the AI programmers get their turn at the problem.
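
    To make that second step concrete, a context model as crude as bigram counts can already pick the right homophone; a toy sketch (the counts are invented for the example):

        # Toy disambiguator: choose among homophones using bigram counts
        # gathered from some text corpus (numbers invented here).
        BIGRAM_COUNTS = {
            ("for", "two"): 50, ("for", "to"): 5, ("for", "too"): 1,
            ("two", "hours"): 80, ("two", "ours"): 1,
        }

        def best_candidate(prev_word, candidates):
            """Return the candidate most likely to follow prev_word."""
            return max(candidates,
                       key=lambda w: BIGRAM_COUNTS.get((prev_word, w), 0))

        # "Computer, set timer for (to|too|two) (ours|hours)."
        first = best_candidate("for", ["to", "too", "two"])  # -> "two"
        second = best_candidate(first, ["ours", "hours"])    # -> "hours"
        print(first, second)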
  • Re:Funny... (Score:5, Insightful)

    by Christopher Thomas ( 11717 ) on Tuesday September 14, 2004 @11:48AM (#10246610)
    I work on product X and think of all the possibilities (list slightly feasible but most likely never going to happen features).

    If this is really true what they're saying then people should put tons more money into product X!


    Actually, use of speech recognition technology to index video clips for search engines _is_ both a very desirable technology, and something that can be done fairly easily (most professionally produced video, at least, takes great pains to have one speaker at a time and keep noise to a minimum). There's a fair bit of video content accessible via the web right now, and this will only increase (most new digital cameras can take video clips now - remember how quickly still pictures flooded the web when digicams first became available?).
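
    As a sketch of the indexing idea: transcribe each clip, then build an ordinary inverted index over the words. The transcribe() function below is hypothetical, a placeholder for whatever recognizer is available.

        from collections import defaultdict

        def transcribe(clip_path):
            """Hypothetical speech-to-text call; stands in for a real recognizer."""
            raise NotImplementedError

        def build_index(clip_paths):
            """Map each spoken word to the set of clips it occurs in --
            a plain inverted index, as used by text search engines."""
            index = defaultdict(set)
            for path in clip_paths:
                for word in transcribe(path).lower().split():
                    index[word].add(path)
            return index

        # index = build_index(["interview.mpg", "lecture.mpg"])
        # index["hurricane"] -> clips in which someone says "hurricane"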

    Speech recognition technology has trouble when it's trying to sort out a noisy environment or a degraded communications channel, and has trouble holding useful open-ended conversations (as opposed to task-driven), but it's very capable in most other contexts. After all, the field has been under study for decades.

    In summary, your mocking of the parent post is premature.
  • by Armchair Dissident ( 557503 ) * on Tuesday September 14, 2004 @12:04PM (#10246806)
    Every time a dollar value is placed on a piece of research, some idiot comes along and says "Hey! This could be spent providing clean drinking water, and food and shelter", as if only research that directly provides clean drinking water or food or shelter is worth funding. Quite frequently the idiot making this statement is in a perfect position to provide money to ensure that more people have access to these facilities, and just as frequently that idiot isn't doing so.

    I'm sure that when America and Russia were engaged in the space race there were people saying "Hey! This money could be better spent on disaster relief!". And where are we now? Only a few short decades later, we have satellites that tell us where hurricanes are going, so that we can evacuate areas and people who would otherwise die survive. We have reliable global telecommunications satellites, so that disaster relief agencies in third-world countries can inform people of what supplies are required, and people who would otherwise die survive.

    Without the massive investment in jet airline technology that could otherwise have been spent "saving the starving", we would not be able to travel to disaster areas within hours of an incident. And so the list goes on.

    If you personally want to see more money invested in agencies that provide disaster relief, or reliable shelter or clean water then you only have to donate to the right charities, and encourage others to do the same. It doesn't take many people to donate out of their pockets to provide $1 million. You can start here [savethechildren.org].
  • Re:To, two, too (Score:2, Insightful)

    by CyberLord Seven ( 525173 ) on Tuesday September 14, 2004 @12:14PM (#10246931)
    Exactly, and that's where the real problem lies. If people think it's going to be difficult to identify the same word spoken by people from different regions, then they probably have not given much thought to the fact that many words with different meanings sound the same in English, and that there are phrases such as "fat chance" and "slim chance" that mean exactly the same thing.
  • by TheSync ( 5291 ) on Tuesday September 14, 2004 @12:21PM (#10247016) Journal
    So far, analog neuromorphic VLSI has hit a dead end in terms of real applications. Also, digital signal processing has been speeding up to the point where it can go almost as fast as a lot of the parallel analog models.

    The one exception is that the work on analog retina models led to the development of the Foveon X3 [foveon.com] technology, which is just packing R, G, and B CMOS sensors into a single vertical column on a chip. But again, the neuromorphic part of the retina model is not the X3 technology; the X3 technology is stacked CMOS sensors.

    Analog neuromorphic VLSI did have one big result: the electrical engineers managed to teach the biologists a lot about signal processing, and the cross-pollination of this knowledge has led to discoveries such as ripple analysis [iop.org] in auditory cortex.

  • by Masker ( 25119 ) on Tuesday September 14, 2004 @12:23PM (#10247043)
    Natural language processing and speech recognition are two entirely separate problem spaces.

    Natural language processing tasks involve parsing strings of tokens and mapping them to commands to be executed. So, from your example, "Pull up the name of employee number 12345", the natural language system must map "Pull up" to "SELECT", "the name" to "name", and "of employee number 12345" to "FROM users WHERE id = 12345". Really, it's largely a problem of context, and your example shows an excellent problem: the "of employee number 12345" to "FROM..." map requires the contextual information of where to pull this information from. Surely multiple tables of a database could have an "employee number" field in them. Do you want all of the tuples that match, or just those from a certain table? Now, in the context of looking up a bunch of other employees, maybe I know what table you've been hitting a lot and can determine what you're asking, but without that context, I have no idea.
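
    A minimal sketch of that mapping, with the table deliberately hard-wired to sidestep exactly the context problem described above (the pattern and schema are invented for the example):

        import re

        # Nothing in the sentence says which table to use, so it is
        # hard-wired here -- the context problem in miniature.
        PATTERN = re.compile(r"pull up the (\w+) of employee number (\d+)", re.I)

        def to_sql(utterance):
            """Map one fixed phrasing to SQL; anything else is rejected."""
            m = PATTERN.match(utterance.strip())
            if not m:
                raise ValueError("utterance not understood")
            column, emp_id = m.groups()
            return f"SELECT {column} FROM users WHERE id = {int(emp_id)}"

        print(to_sql("Pull up the name of employee number 12345"))
        # SELECT name FROM users WHERE id = 12345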

    In fact, everyday speech has a lot more ambiguity in it than could be handled without keeping large amounts of state, be it contextual or experiential/situational. For example, if I overhear two people in a conversation, and the first thing I hear is "Yeah, but he's been lying all through his campaign, and I for one don't support him," I have no idea which political candidate he might be speaking of. However, if I saw that person wearing a shirt for a political campaign last week, then I have enough context to make a reasonable guess that he's talking about that person's opponent.

    Speech recognition is a "lower level" than that: it's about matching acoustic information into speech sounds and then using the speech sounds to determine the word that was said. This is a hugely complex task that has a number of unsolved problems (of which these are the 3 that I can think of off the top of my head):

    1) "speech sounds" are fuzzy categories, and are not canonical targets.
    2) salient "features" of phonemes are disputed, contradictory and large amounts of redundancy/conflicting info are built into the speech signal
    3) idiosyncratic speaker-to-speaker differences make the phoneme categories even fuzzier and can complicate the task even for the one speech recognition system that we know works: the human brain.

    At any rate, the problems that need to be solved for speech recognition are not the same problems as in natural language processing. While there may be some crossover in pattern matching, the specifics of the problem spaces make it unlikely that you will get much benefit for NLS (natural language systems) from just making the algorithms faster.

    Which, in fact, is my main criticism of this article: the algorithms that we have now are piss-poor, and making them faster doesn't intrinsically make them better. Unless there's been some huge advance in the field that I'm unaware of, you'd still have to train an SRS (speech recognition system) on your idiolect by reading some pre-selected passages to it. This model has lots of problems, most especially that it's tailored to an individual. Imagine if you had to have each person you spoke with read some canned paragraphs to you the first time you met so that you could interact....

    [sorry I don't have sources for all of this; I'm AFB, and I don't have time to dredge up info right now. But, apparently, I have time to write one long-ass entry...]
  • by CrudPuppy ( 33870 ) on Tuesday September 14, 2004 @12:26PM (#10247079) Homepage

    Making quantum leaps in speech recognition has tremendous potential for the deaf and hard-of-hearing (I am the latter).

    Imagine being in a meeting (almost always a problem for hearing impaired people) and having real-time subtitles.

    $1 million is a TINY price considering upwards of 20% of the nation has some hearing loss and hearing aids cost on the order of $4000 a pair.
  • by Anonymous Coward on Tuesday September 14, 2004 @01:24PM (#10247699)
    Whoa! Not so fast! Voice RECOGNITION is one thing, UNDERSTANDING and translating is something different...
  • by obiquity ( 658885 ) on Tuesday September 14, 2004 @01:33PM (#10247790)
    I am an assistant prof at a major research institution, and $1,000,000 is not as much as you would imagine. First, most universities take ~50% of grants immediately as overhead; you're down to 500K. Second, this is spread out over 4-5 years; now you're down to about 125K a year. Third, if we have grants, we profs are required to pay our own summer salaries; on average this could be 25K, so you're down to 100K/year. In science and engineering we are expected to pay our grad students if we have grants; yearly salary with additional overhead (in the US; Canada is a bit less) comes to almost 50K/year. A post-doctoral researcher would be hard to find for less than 50K/year with overhead. So really it supports a grad student and a post-doc and maybe some equipment for four years. Compared to the resources of industry it sometimes seems kind of puny. But the freedom is worth it. Just some info, OBQT
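
    The arithmetic above, written out (all figures are the post's own rough estimates):

        grant = 1_000_000
        after_overhead = grant * 0.50   # university takes ~50% off the top
        per_year = after_overhead / 4   # spread over ~4 years -> 125K/year
        per_year -= 25_000              # PI's own summer salary -> 100K/year
        grad_student = 50_000           # per year, salary plus overhead
        post_doc = 50_000               # per year, with overhead
        print(per_year - grad_student - post_doc)  # ~0: little left for equipment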
  • by slobber ( 685169 ) on Tuesday September 14, 2004 @02:02PM (#10248123)
    This should be about algorithms, not architecture. Anything they can do in silicon can and should be implemented and perfected in PC software first. I don't care if it takes a PC 10 minutes to recognize a 10-second sentence, as long as it does it accurately. As soon as that happens, then by all means cut its power consumption and speed it up 1000x by doing it in silicon. If all they are doing is speeding up existing, relatively low-accuracy algorithms, then their effort is of limited use.

    To be honest, I doubt that putting a few clever algorithms together will ever achieve any respectable accuracy, no matter how fast those algorithms are. Sure, it might accurately recognize words from a limited vocabulary when spoken clearly and/or in simple sentences. If this is their goal, then it is quite achievable. It sounds to me, though, that they are aiming much higher, as in "dictating a detailed email". So many things have to happen, from effective noise filtering to proper phonetic model representation to parsing to content-based correction. The latter step is especially problematic, since it requires a huge knowledge database that takes humans years to accumulate. I am not saying that these difficulties are insurmountable, but simply that their goals are too ambitious for the current state of our technology and knowledge. I'd love to be proven wrong on that account, though.
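
    The stages listed above, as a pipeline skeleton (every stage is a stub; the names are illustrative):

        def denoise(audio):        # effective noise filtering
            return audio

        def to_phonemes(audio):    # phonetic model: acoustics -> phoneme sequence
            return []

        def parse_words(phonemes): # parsing: phonemes -> word hypotheses
            return []

        def correct(words):        # content-based correction: the step that
            return words           # needs a huge knowledge base

        def recognize(audio):
            """Chain the stages; making one stage faster in silicon
            does not make a weak stage more accurate."""
            return correct(parse_words(to_phonemes(denoise(audio))))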
  • by mOdQuArK! ( 87332 ) on Wednesday September 15, 2004 @12:36AM (#10253174)
    If you're going for adding things to your field of vision, why not overwrite the Japanese version of the sign with the text written in plain english?

    Well, for those of us who actually like seeing the thing that is being translated, covering everything up would make the experience a little less rich. Also, over time, if you always see the two things together, you might be able to recognize patterns (hey, that set of ideograms always means Tokyo!), so if your batteries go dead, you still have a chance of navigating.

    Top this off with the audio translation playing the sound back of the translated words of someone speaking to you

    I prefer subtitles for similar reasons as for the signs, plus there is the added issue of cognitive modality - it is harder for you to concentrate on an audio translation if you can hear the person speaking to you at the same time (brain has to filter out similar sensory information), whereas I find it fairly easy to follow subtitles for meaning even while using the audio from the person only as an emotional "channel" (brain can use complementary sensory info).

    The other stuff you mention (colloquialisms, vernacular, etc) I agree with, except that I actually like to see the Babelfish-like (straight) translations in some of those instances, perhaps with a background notation of the slang's translation, its probable meanings & maybe its origin (although I doubt you would look at all that stuff while in the middle of conversation :-).
