Even the Best Speech Recognition Systems Exhibit Bias, Study Finds (venturebeat.com) 142

An anonymous reader quotes a report from VentureBeat: Even state-of-the-art automatic speech recognition (ASR) algorithms struggle to recognize the accents of people from certain regions of the world. That's the top-line finding of a new study published by researchers at the University of Amsterdam, the Netherlands Cancer Institute, and the Delft University of Technology, which found that an ASR system for the Dutch language recognized speakers of specific age groups, genders, and countries of origin better than others. The coauthors of this latest research set out to investigate how well an ASR system for Dutch recognizes speech from different groups of speakers. In a series of experiments, they observed whether the ASR system could contend with diversity in speech along the dimensions of gender, age, and accent.

The researchers began by having an ASR system ingest sample data from CGN, an annotated corpus used to train AI language models to recognize the Dutch language. [...] When the researchers ran the trained ASR system against a test set derived from the CGN, they found that it recognized female speech more reliably than male speech regardless of speaking style. Moreover, the system struggled to recognize speech from older people compared with younger, potentially because the former group's speech was less clearly articulated. And it had an easier time detecting speech from native speakers versus non-native speakers. Indeed, the worst-recognized native speech -- that of Dutch children -- had a word error rate around 20% better than that of the best non-native age group. In general, the results suggest that teenagers' speech was most accurately interpreted by the system, followed by seniors' (over the age of 65) and children's. This held even for non-native speakers who were highly proficient in Dutch vocabulary and grammar.
One proposed solution is to mitigate the bias at the algorithmic level. "[We recommend] framing the problem, developing the team composition and the implementation process from a point of anticipating, proactively spotting, and developing mitigation strategies for affective prejudice [to address bias in ASR systems]," the researchers wrote in a paper detailing their work.

"A direct bias mitigation strategy concerns diversifying and aiming for a balanced representation in the dataset. An indirect bias mitigation strategy deals with diverse team composition: the variety in age, regions, gender, and more provides additional lenses of spotting potential bias in design. Together, they can help ensure a more inclusive developmental environment for ASR."
  • This isn't bias (Score:5, Insightful)

    by Lordpidey ( 942444 ) on Friday April 02, 2021 @09:05AM (#61227806) Homepage

    Having difficulty understanding people who aren't speaking the language as it's intended to be spoken, such as old people with a stutter, is not bias.

    • by DeplorableCodeMonkey ( 4828467 ) on Friday April 02, 2021 @09:17AM (#61227848)

      Having difficulty understanding people who aren't speaking the language as it's intended to be spoken

      There is no single regional variation of English that's authoritative in 2021. Midwestern-accented American English has so many native speakers (or speakers of close-enough variations) that they probably outnumber the entire population of the British Isles. Is English with an Irish accent not speaking as intended? How about Yankee accents that make statements like "pahk da cah?" What about people who speak otherwise flawless, native-proficiency English with an Indian accent? Then you get into the whole ball of wax that is "urban black English," euphemistically called "ebonics."

      Iterating on support is acceptable, but excluding these variations because they're hard to support is not a luxury actual speakers get on a daily basis if they live in an area with a confluence of regional influences.

      • Well, this is machines we're talking about, where the choice is up to the user. So if it's not perfect, that keeps translators in a job. Yay, job security.

      • by Dan East ( 318230 ) on Friday April 02, 2021 @09:37AM (#61227906) Journal

        Iterating on support is acceptable, but excluding these variations because they're hard to support is not a luxury actual speakers get on a daily basis if they live in an area with a confluence of regional influences.

        That isn't the same thing at all. There are people who, whether due to regional dialect, laziness, or physical limitations, simply do not speak the language accurately, distorting it to the point that some words cannot be understood correctly. If someone pronounces the words "cot" and "cat" exactly the same, the ability of any AI or other human being to accurately delineate between those words is vastly reduced. Now you have to "think harder" and analyze the context of what they said and attempt to discern what they probably meant. It doesn't matter if you train an AI on that variation; the fact remains that the distortion of the pronunciations affects the ability to discern what is being said.

        • I don't know of any one person who speaks the language accurately all the time. It's too much work. If someone attempts to do this at all times, you will still find flaws, and even determining the flaws is difficult because *there are no rules*! Is the H stressed too much, or too little? Maybe we should consult the ISO standard on H's? Even the Queen of England does not speak English the same way as her father did, or her great-grandmother. And the English she spoke in 1940 is different from the English…

        • by Calydor ( 739835 )

          The cot was on the cat. The cat was on the cot. The cot fell on the cat. The cat fell on the cot. There is NO way an AI is going to figure out which story is being told unless it was actually there and looking at the event.

        • Who decides what is "accurate" speech? Every language has multiple ways to pronounce things. There is no "correct" standardization (in pronunciation and grammar). Many attempts have been made, but in the end the question is: is it understandable to the people you interact with?

          Accents and other variations exist because language evolves over time (similar to genetic variation: many random mutations, but in the case of language much more variation is "acceptable" to the filtering function). American English is…

        • If someone pronounces the words "cot" and "cat" exactly the same, the ability of any AI or other human being to accurately delineate between those words is vastly reduced. Now you have to "think harder" and analyze the context of what they said and attempt to discern what they probably meant.

          "Not" and "knot" sound exactly the same, and you can't redesign English to make your algorithm cleaner.

          If Bostonians want to pronounce cart like cat, then deal with it the same way your algorithm deals with the rest of our ugly language, stop being a baby about it and get back to coding. I'm so sorry you have to "think harder"... not.

        • by AmiMoJo ( 196126 )

          People cope with accents and poor quality audio (e.g. phones) all the time. They cope because they can tell if the word "cot" or "cat" is more likely from context, and because people use many common phrases.

          AI does the same thing, improving recognition by both listening and considering likely next words.
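
          A toy sketch of that idea in Python (the acoustic scores and bigram counts below are invented, and a real decoder is far more involved; this only shows the shape of "acoustics plus likely next words"):

            # Toy sketch: breaking an acoustic tie between "cot" and "cat" using
            # the following word. All numbers here are invented for illustration.
            acoustic_scores = {"cot": 0.5, "cat": 0.5}  # a dead heat, acoustically

            # Tiny bigram counts from hypothetical training text.
            bigram_counts = {
                ("cat", "meowed"): 25,
                ("cot", "meowed"): 0,
                ("cat", "folded"): 1,
                ("cot", "folded"): 12,
            }

            def next_word_prob(candidate: str, next_word: str) -> float:
                """P(next_word | candidate), add-one smoothed over the two
                next-word types so unseen pairs are unlikely, not impossible."""
                total = sum(c for (w, _), c in bigram_counts.items() if w == candidate)
                return (bigram_counts.get((candidate, next_word), 0) + 1) / (total + 2)

            def best_guess(next_word: str) -> str:
                # Combine the acoustic score with the language-model context score.
                return max(acoustic_scores,
                           key=lambda w: acoustic_scores[w] * next_word_prob(w, next_word))

            print(best_guess("meowed"))  # -> cat ("the cat meowed")
            print(best_guess("folded"))  # -> cot ("the cot folded")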

          Of course these things are all dependent on the training data, and if it doesn't include samples of Scottish accents or know that "weans" is another word for children, it's going to struggle. Unlike a human…

          • by stdarg ( 456557 )

            Don't forget one of the most important techniques of real humans to understand -- asking.

            "The cat is on the cot"
            -- "What??"
            "The cat is on the cot!"
            -- "Are you saying cot, like a sleeping cot?"
            "Yeah obviously"

            If you want the AI to understand all variants of English without interaction it probably will never achieve mastery of any of them.

      • I have difficulty recognising English as spoken around Glasgow in Scotland. I also often have difficulty with English as spoken by many Indians (even those who regard English as their mother tongue) if they were not educated overseas. I do not regard this as bias. It is a fact that the accents are very different from those that are familiar to me. I do not regard it as strange, at all, that ASR, trained from easily available sources, would have a similar issue with interpreting unfamiliar accents. Continuing…

      • by Z80a ( 971949 )

        Yes, but the article is written in a way that implies the speech recognition systems are self-aware and able to hate specific ethnic groups, instead of just not being very good at dealing with the variation languages have.

        • by AmiMoJo ( 196126 )

          No it isn't. TFA is pointing out that if speech recognition doesn't understand some people, that can be quite annoying for them; e.g., they can't use automated phone services.

      • I'm pretty sure there is a bell curve of the most common pronunciations for the majority of words. Most speech recognition systems are going to be global, not regional, so managing outlier pronunciations is going to be a lot more expensive than managing common ones.

      • There is no single regional variation of English that's authoritative in 2021

        Yes there is, it's called Received Pronunciation.

      • by quenda ( 644621 )

        speaking as intended? How about Yankee accents that make statements like "pahk da cah?"

        That is proper non-rhotic English, as spoken in the UK, Australia, etc. It is the other parts of the US, who "parrrrk" their "carrrr" while talking like pirates, that are using an old and outdated accent :-)

    • It is, in AI terms.
      Early speech recognition systems were biased toward a single speaker. Then they handled a particular dialect, then more dialects.
      The AI that translates speech now seems to be having trouble with older people, who often speak a mix of previous generations' dialect and the current one; on top of that, as people get older, the muscles and bones in their face (and often missing teeth) change how they say things as well.

      In terms of AI, it will…

    • Sure it is. It's biased towards the able.

      That it's much harder to create speech recognition without bias doesn't mean it's not biased. It just means that the bias is difficult to avoid. Humans have the same problem, mostly for the same reason. We're biased towards wanting to hear speech by people who are good at it, because we're better at understanding it.

      • by jythie ( 914043 )
        Or more specifically, we are biased towards wanting to hear speech by people with the same dialect we speak; or, more accurately, we are biased towards people who speak the same way the people we interact with the most speak.
      • by kqs ( 1038910 )

        We're biased towards wanting to hear speech by people who are good at it, because we're better at understanding it.

        Nope. We're biased towards speech from people who talk like us, because we're better at understanding it. Trying to claim right vs. wrong just shows that you are terribly confused about linguistics in general (and English in particular). Your "good" speech would be harshly criticized by an English teacher born 50 years before you, and also by an English teacher born 50 years after you. The rules change all the time, and "the rules" are just a consensus of multiple regional dialects all moving in roughly the same direction…

    • It is bias once this becomes the only method to use your vehicle's entertainment system while driving, or while trying to get help from a bank. I used to just press '0' to get past the automated system but that hardly works any more.
    • I’d suggest that it actually is a bias, at least in the most clinical sense of the word. This algorithm favors understanding certain speech over others. That’s a bias, strictly speaking.

      It’s not unfair. There’s nothing deliberate about it. It’s not an example of bigotry, prejudicial beliefs, or any other -isms. While each of these may be typical of biases, none of them are necessary for a bias to exist, and this is one such case where I’d agree that none of them apply.

    • "Having difficulty understanding people who aren't speaking the language as it's intended to, such as old people with a stutter, is not bias."

      Nor is it not understanding Bavarians.

    • Having difficulty understanding people who aren't speaking the language as it's intended to be spoken, such as old people with a stutter, is not bias.

      Agreed. Bias in the sense of statistical inaccuracy is not the same as bias in the sense of prejudice. An obvious question for those who might be concerned about bias in this case is what distribution of accuracies across demographic groups would be more desirable. While a target of the maximum achievable accuracy for all groups is obvious, a goal of completely equal accuracies would be nonsensical.

    • No, this is not about "as intended", it's about accents. *Everyone* has an accent. Even within a single family living in the same house, the members will speak words differently. There is no international language body that says "this is the one true way to speak this word in this language", so the phrase "as intended" is ridiculous nonsense. People in London laugh at New York for its ridiculous accent, and also laugh at York for its ridiculous accent; meanwhile people in New York laugh at those…

  • Hi, my name is Werner Brandes. My voice is my passport. Verify Me

  • by Karganeth ( 1017580 ) on Friday April 02, 2021 @09:06AM (#61227810)
    Cars use more gas when transporting Dutch people than Japanese people. Engineers claim that it's because Dutch people are a lot taller than Japanese people and hence heavier, but I know that the real reason is they're racist and intentionally made it that way to hold Asian people down.
    • That makes no sense. Using less gas benefits Asians, while tall and fat Dutch people are harmed by higher fuel consumption.

      • That makes no sense. Using less gas benefits Asians, while tall and fat Dutch people are harmed by higher fuel consumption.

        Ah... let me help you understand this:

        • Using less gas is GOOD if Asians use less gas AND Asians are oppressed.
        • Using less gas is BAD if Asians use less gas AND Asians are not oppressed.
        • Using less gas is GOOD if Dutch use less gas AND Dutch are oppressed.
        • Using less gas is BAD if Dutch use less gas AND Dutch are not oppressed.

        Always remember that "Truth" is fundamentally political. Any physical/objective truth is about powerful people bullying the downtrodden.

        Now you know everything relevant about woke.

  • In other words, you don't have to put in the work to speak the language clearly. It's up to the rest of us to train our ears to understand your fucked up language skillz

    I think I'd rather just not talk to you.
    • by Merk42 ( 1906718 )
      I'm sure however you speak is the one true dialect for English.
      • by microbox ( 704317 ) on Friday April 02, 2021 @12:22PM (#61228526)
        Competent people strive for clarity, in whatever language, and whatever dialect. The postmodernists specifically went after clarity as white hegemonic male power. That's part of the reason why postmodern discourse is so absurdly abstruse.
        • by Merk42 ( 1906718 )
          Yeah I guess if you're elderly and don't have the muscle strength/dexterity to be eloquent, you should just shut the fuck up and never speak.

          It would solve the problem of Boomers running and ruining everything too.
        • by kqs ( 1038910 )

          Competent people strive for clarity, in whatever language, and whatever dialect.

          Competent people strive for clarity towards their audience, and may speak somewhat differently based on their audience. If I give a presentation to a room of Very Important People, and the next day ask my friends at the bar what beer they want, I'm speaking very differently.

          Now, enunciation is a technique we could all use that would improve our clarity towards almost everyone. No matter your dialect, poor enunciation will make it harder for anyone of a different dialect to understand. But enunciation…

    • In other words, you don't have to put in the work to speak the language clearly. It's up to the rest of us to train our ears to understand your fucked up language skillz

      I had to call to check on an order which I hadn't received. The person who answered the call had a thick accent that I couldn't immediately place. It wasn't Sajay calling himself Bob, either. The call went along well enough, even though the store I ordered from said the delivery company dropped off the package two days before at my…
  • by bobstreo ( 1320787 ) on Friday April 02, 2021 @09:19AM (#61227856)

    It's bias in the training data used for the speech recognition systems.

    • by Merk42 ( 1906718 )
      It's this.

      So many see the word 'bias' here and want to launch into their "stupid woke SJW" tirade, but it's as simple as the AI (for lack of a better term) not being fed a wide enough variety of use cases.
      This is a basic concept of becoming proficient in anything.
      • by oblom ( 105 )

        Except more variety of cases is likely to:
        1. Cost much more in training time.
        2. Make the model more generic, yet less accurate overall.

        • 1. Cost much more in training time.

          Training time is relatively cheap.

          The cost barrier is collecting and cleaning the training data.

          Where do you get ten million samples of properly labeled Indian-accented speech?

          2. Make the model more generic, yet less accurate overall.

          Boosting [wikipedia.org] can solve this problem. Use a meta-NN to categorize the accent, then pass it on to a specialized NN trained on that accent.
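
          Strictly speaking, that routing scheme is closer to a gated mixture of experts than classical boosting, but it sketches easily. A minimal Python sketch; the classifier and per-accent recognizers below are hypothetical stand-ins, not any real ASR API:

            # Minimal sketch of "meta-model routes to an accent specialist".
            # Every component here is a dummy stand-in for illustration.
            from typing import Callable, Dict

            def make_recognizer(accent: str) -> Callable[[bytes], str]:
                """Stand-in for an ASR model fine-tuned on one accent."""
                def recognize(audio: bytes) -> str:
                    return f"<transcript from {accent}-tuned model>"
                return recognize

            class AccentRouter:
                def __init__(self, classify: Callable[[bytes], str],
                             specialists: Dict[str, Callable[[bytes], str]],
                             fallback: Callable[[bytes], str]):
                    self.classify = classify        # the "meta" model
                    self.specialists = specialists  # accent -> specialist model
                    self.fallback = fallback        # general-purpose model

                def transcribe(self, audio: bytes) -> str:
                    accent = self.classify(audio)
                    # Route to a specialist if we have one, else the general model.
                    return self.specialists.get(accent, self.fallback)(audio)

            router = AccentRouter(
                classify=lambda audio: "indian_english",  # pretend classifier
                specialists={a: make_recognizer(a)
                             for a in ("indian_english", "scottish_english")},
                fallback=make_recognizer("general"),
            )
            print(router.transcribe(b"\x00\x01"))  # transcript from the indian_english-tuned model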

          • Where does one get that much data of Indian-accented speech? I'm certain any call centre would gladly sell you their recorded conversations at bargain basement prices.

            • I'm certain any call centre would gladly sell you their recorded conversations at bargain basement prices.

              You need more than recorded conversations. You also need a transcript.

              Creating the transcripts is expensive.

              Even if you pay Indians on Mechanical Turk 10 cents a sample to clean the data, ten million samples will cost $1M.

              You still have the problem that most Indian call centers are in Tamil-speaking regions, and native Tamil speakers have a different English accent than Hindi speakers.

          • by oblom ( 105 )

            Thank you, very good points. I'll dig into boosting.

        • 1. Hardly.
          2. False. Models benefit so much from more data, they may well be able to generalize better (and so perform better on all groups) with more data from minority groups.

          The differences between classes in this model are not large, and not unexpected (it is likely simply harder to understand what non-native speakers or children mean). The concern should in any case be its worst-case performance, not equality.

          Sometimes it's right to sacrifice performance on the groups a model does "best" on for the sake of…

      • There's a threshold here: the point where you have to train the software for hours before it has a low enough error rate to understand your dialect. You can't have speech recognition software pick up one random sentence and interpret it correctly if you generalize it without limits.

      • Woke discourse is founded on equivocation fallacies [logicallyfallacious.com].
    • The very fact that researchers effectively gave up on phoneme-level speech recognition and glossed over the issue by instead throwing gigabytes of training data at a neural net doesn't help matters.

      • This is kind of my thought; does their speech-recognition system have no extrapolation at all?

        Is this not a fundamental limitation of training-set, best-fit "learning" systems, that they can't guess at what something means, and instead are just stuck in some local extremum of fitness that is likely nonsense?

        • by jythie ( 914043 )
          Nope, no extrapolation at all, just lots of cheap processing power, large training sets, and the hope that as those two increase the problems will go away.
      • by jythie ( 914043 )
        Yeah, not that long ago I was listening to a respected ML researcher describing phonemes as 'made up' and computational linguistics as 'a hack' that they hope will eventually be eliminated as GPU prices continue to drop. Even the use of phoneme detection for use in transfer learning is being seen as 'undesirable' and 'sooo last decade!', so they have kinda fetishized NOT knowing about what is going on inside the speech...
        • by kqs ( 1038910 )

          The problem here is that we've been trying phoneme-based systems for decades, and even the best ones are laughably bad compared to modern ML systems. Turns out that phonemes are not nearly as constant as we hoped between dialects, people, and situations. But if you want to write a phoneme-based system that beats the current ML systems, feel free; you'll become very rich if you succeed!

    • If a road is littered with potholes that render it treacherous for anyone to drive, would you immediately blame the training data set for why a self-driving car does poorly on that road, or would you instead consider that there may be something fundamentally broken in the world that both the car and other drivers are struggling to deal with?

      In this case, the speech patterns themselves are measurably less comprehensible, either due to diminished faculties or a lack of skill (i.e. age or a non-native tongue)…

      • In this case, the speech patterns themselves are measurably less comprehensible,

        Are they? People who speak English with an Indian accent have no problem understanding each other.

        I work in Silicon Valley and have many Indian co-workers. I had difficulty understanding them when I first moved here, but not anymore. I am now used to the chopped vowels, monotone enunciation, fast pace, and quirky vocabulary.

        Indian-English isn't wrong, just different.

        • Indian-English isn't wrong, just different.

          Agreed. Having a different accent is not what I had in mind when discussing non-native speakers. For many Indians, English is a native tongue. Sure, they may have a different accent than you or I do, but their accent is just as “English” as the accents of native speakers from London or Boston or Houston, and they have just as much command of the language as any other native speaker. Your experience learning how to understand them in SV mirrors my own experience in grad school.

          Those are NOT the same…

    • It may not be possible to train a system to handle all cases simultaneously. Feed it data from all subgroups and it may just end up doing a crappy job on all of them. Say you train an AI to recognize cats. It does great with domestic short-hair cats, but not long-hair. So you add long-hair training data. Now long-hair works, but we want it to recognize big cats too. So we give it pictures of lions and tigers. Now it can recognize all the cats... but it misidentifies a lot of dogs as cats, because the training…

    • by tomhath ( 637240 )
      Listen to what the guy dipping water out of this well says. [youtu.be] He's English, speaking English, but you will probably have trouble understanding most of what he says.
    • Well, you gotta start from somewhere.

    • by jythie ( 914043 )
      Considering that the trained weights are what actually make up these systems, bias in the data means bias in the system.
    • There is unfortunately no such thing as a complete training data set, since there are as many speech patterns as there are humans, meaning there will always be bias.

    • Not necessarily. It's perfectly possible for a model to be biased on its own. The parameters of the model encode a bias, it will always be better at learning some things than others. This has been proven to be true for all learning algorithms, a learning algorithm which is equally good at learning any possible pattern can't learn at all.

      It's true that it's unlikely that the model architecture itself is terribly biased in this particular context. But in general, it may certainly be.

      Instead of arguing…

    • You'd need a huge dataset to get the bias out.

      I'm a native Flemish (== Belgian Dutch) speaker and vocabulary, pronunciation and grammar can change a lot even over small distances. When I went to college 25 miles away there were people there I didn't understand for years.

      I grew up 3 miles outside the city centre and already some of the diphthongs and words were different. See https://en.wikipedia.org/wiki/... [wikipedia.org] for an example, though that page only scratches the surface. I know 2 people who made their ma…
  • bias? (Score:2, Informative)

    by Anonymous Coward

    Ok, so I'm to understand that, under the current sociopolitical climate, when a tool (here, an algorithm) works better under some circumstances than others, that's considered bias?

    Please, go away. Your ideas are stupid.

  • Even humans find speech that is variant enough from what they are used to hard to understand. It's not "bias"; it's just reality.
  • This topic reminds me of this [youtube.com].
  • by OrangeTide ( 124937 ) on Friday April 02, 2021 @10:25AM (#61228052) Homepage Journal

    or a message that our respective public education systems need an elocution program? It seems like a worthy exercise in primary school and would give people a more equal footing later in life no matter their socioeconomic background.

    • or a message that our respective public education systems need an elocution program? It seems like a worthy exercise in primary school and would give people a more equal footing later in life no matter their socioeconomic background.

      This will go over swimmingly: tell Texas to speak more like Ohio, so that the computers can understand them.

  • Next time you talk to Alexa, think about how you are speaking to it. It's not conversational. Talking to AIs has become its own dialect. The AIs train us just as much as we train them. I'd hardly call that bias.

  • When Sundar Pichai and Satya Nadella became the big enchiladas, the local quip was, "Finally! Google and Microsoft will fix the speech recognition bugs on south Indian accents." The belief is that developers will demo any speech recognition features to their bosses, and the bosses will insist, "Let's make sure we use voice samples that cover Sundar/Satya. We don't want this thing barfing if they test it during the demo." This kind of actual workplace dynamic probably explains why the accuracy of speech recognition…
  • What amazes me is that "AI experts" still call it AI even though these systems can't do everything within the goal as well as a human. I have trouble with accents too, but generally, by asking the other person to speak a little slower, I can get by. It's the same with self-driving. There are so many situations that these experts want to discount as "not necessary" because they are beyond the ability of the current technology, but in my mind that's what makes it look very much like a calculator…
    • Perhaps it's not that unreasonable to ask some speakers to slow down for the AI, if that's what native speakers of a language sometimes need as well.

      • But is the AI able to recognize that slowing down will help the problem? Does it know what slowing down is? Back to human ability again.
  • The training data is incomplete, which is understandable. Start the training for the speakers most likely to be encountered, then work on the less common cases as time allows. Otherwise they would never release a product because some non-native speaker who already had a speech impediment before they had a stroke will complain that it's biased.
  • A better headline would be..
    Even the Best Speech Recognition Systems Aren't Perfect Yet, Study Finds

  • The only _real_ solution to bias is to drag all other groups down to the level of the worst-recognized group.

    Carry the school system race to the bottom into machine learning tasks! It's the only way to be equal.

  • In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

    "What are you doing?", asked Minsky.
    "I am training a randomly wired neural net to play Tic-tac-toe", Sussman replied.
    "Why is the net wired randomly?", asked Minsky.
    "I do not want it to have any preconceptions of how to play", Sussman said.

    Minsky then shut his eyes.
    "Why do you close your eyes?" Sussman asked his teacher.
    "So that the room will be empty."
    At that moment, Sussman was enlightened.

    There is no way to avoid bias in machine learning.

    • The training itself is literally bias.

      The weights of the neural net are *literally* biases. That's what they mean.

      The "algorithm" can't neutralize the input it's been given.
      Let alone if it doesn't even know what bias is considered the neutral one to the reader. (Hint: Thee is no absolute neutral. That is invalid in physics and unscientific nonsense.)

  • What is the best speech-to-text software for English?

    Nuance Dragon Speech recognition software? [nuance.com]

    Does Dragon Professional Individual allow several different recorded voices? For example, husband and wife?
  • When I first heard Jimmy Carter speak, I only understood about one word in five. My accent is a LOT different. That a machine speech recognizer should have a similar problem is not a surprise.

  • Speak like the grammar model or get involved in creating one that encompasses your speech pattern.

  • I am a native English speaker with 4 decades of experience with it as my native tongue. I had to rewatch the first 10 minutes of the movie Snatch because I had to retrain my neural network to process the language in it. I had never had exposure to that accent or speech style, and I had to learn it. That type of bias in a training set will be an issue for AIs just as it is for people. Even a logistically infeasible planet-scale training dataset will not solve this problem because languages and dialects evolve…

  • The whole point of pattern matchers is bias.

    You mean "Doesn't PERFECTLY align with *my* bias."
    Guess what: No human does either. Thankfully.
    We're just usually not perfectly honest to each other. Social grease, yadda yadda.

    What you listed sounds perfectly normal for any human too.
    You're just one of those prejudiced hateful bullies who automatically assume bias means hate. Which tells us more about you than about that software.

    The "problem" here is:
    You forgot to add one factor to your model: The reader!
    Once the…
