AI / Google

Google Researchers Boost Speech Recognition Accuracy With More Datasets

What if the key to improving speech recognition accuracy is simply mixing all available speech datasets together to train one large AI model? That's the hypothesis behind a recent study published by a team of researchers affiliated with Google Research and Google Brain. They claim that SpeechStew, an AI model trained on a range of speech corpora, achieves state-of-the-art or near-state-of-the-art results on a variety of speech recognition benchmarks. VentureBeat reports: In pursuit of a solution, the Google researchers combined all available labeled and unlabeled speech recognition data curated by the community over the years. They drew on AMI, a dataset containing about 100 hours of meeting recordings, as well as corpora including Switchboard (approximately 2,000 hours of telephone calls), Broadcast News (50 hours of television news), LibriSpeech (960 hours of audiobooks), and Mozilla's crowdsourced Common Voice. Their combined dataset had over 5,000 hours of speech -- none of which was adjusted from its original form. With the assembled dataset, the researchers used Google Cloud TPUs to train SpeechStew, yielding a model with more than 100 million parameters. In machine learning, parameters are the values of a model that are learned from the data during training. The researchers also trained a 1-billion-parameter model, but it suffered from degraded performance.
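As a rough illustration of the "mix everything" recipe described above, here is a minimal sketch of pooling several labeled speech corpora into one training set with no per-corpus rebalancing. It uses PyTorch for convenience (the paper does not necessarily use it), and the ToyAudioDataset class, corpus sizes, and feature shapes are placeholders rather than real dataset loaders.

```python
# Hedged sketch of dataset pooling in the spirit of SpeechStew; not Google's code.
# ToyAudioDataset stands in for real corpus loaders (AMI, Switchboard,
# LibriSpeech, ...); everything it returns is fake data.
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class ToyAudioDataset(Dataset):
    """Placeholder for a corpus loader; returns (waveform, transcript ids)."""
    def __init__(self, name: str, num_utterances: int):
        self.name = name
        self.num_utterances = num_utterances

    def __len__(self) -> int:
        return self.num_utterances

    def __getitem__(self, idx: int):
        waveform = torch.randn(16000)          # 1 second of fake 16 kHz audio
        transcript = torch.tensor([1, 2, 3])   # fake token ids
        return waveform, transcript

# Pool the corpora as-is: no resampling weights, no per-corpus preprocessing,
# mirroring the summary's point that nothing was adjusted from its original form.
corpora = [
    ToyAudioDataset("AMI", 100),
    ToyAudioDataset("Switchboard", 2000),
    ToyAudioDataset("LibriSpeech", 960),
]
speech_stew = ConcatDataset(corpora)

# A real pipeline would pad and batch variable-length utterances; the identity
# collate_fn here just keeps the sketch runnable.
loader = DataLoader(speech_stew, batch_size=8, shuffle=True,
                    collate_fn=lambda batch: batch)
print(f"pooled utterances: {len(speech_stew)}")
```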

Once the team had a general-purpose SpeechStew model, they tested it on a number of benchmarks and found that it not only outperformed previously developed models but also demonstrated an ability to adapt to challenging new tasks. Leveraging CHiME-6, a 40-hour dataset of distant conversations recorded by microphones in homes, the researchers fine-tuned SpeechStew to achieve accuracy in line with a much more sophisticated model. Transfer learning entails transferring knowledge from one domain to a different domain with less data, and it has shown promise in many subfields of AI. By taking a model like SpeechStew that's designed to understand generic speech and refining it at the margins, it's possible for AI to, for example, understand speech in different accents and environments.
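The transfer-learning step described above amounts to continuing the training of an already-trained general model on a small amount of in-domain data, usually with a small learning rate. The sketch below only shows the shape of that step; the model, feature size, and data are toy placeholders and are not the actual SpeechStew/Conformer setup.

```python
# Hedged sketch of fine-tuning a pretrained model on a small in-domain set
# (standing in for the 40-hour CHiME-6 data). Everything here is a toy
# placeholder; it only illustrates the pattern of the transfer-learning step.
import torch
from torch import nn

pretrained = nn.Sequential(        # stand-in for a pretrained SpeechStew-like model
    nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 32)
)
optimizer = torch.optim.Adam(pretrained.parameters(), lr=1e-5)  # small LR for fine-tuning
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                      # a handful of fine-tuning steps
    features = torch.randn(8, 80)            # fake in-domain acoustic features
    targets = torch.randint(0, 32, (8,))     # fake target labels
    loss = loss_fn(pretrained(features), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final toy loss: {loss.item():.3f}")
```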
This discussion has been archived. No new comments can be posted.

  • Mixing together machine-learning datasets like this is only going to lead to the model becoming globally fuzzier. Sure, it may accept more input as valid, but it'll also increase the chance of false positives. I guess there's no way to solve this problem except thoughtfully.
    • by Rei ( 128717 )

      Speech recognition is also context-dependent - indeed, the meaning can vary even when people are saying the exact same thing (let alone when they're saying different-but-similar things). For example, if someone says to Spotify, "Play Salt", but the person has an Icelandic accent, or has previously liked Icelandic bands, they probably mean the song "Salt" by Mammút rather than the song "Salt" by Ava Max. "Play Fairytale" - Have they liked Eurovision songs? They probably want the Alexander Rybak song. D

      • by rtb61 ( 674572 )

        SHHH, don't tell anyone, but for speech recognition to be accurate it has to be individualised. So for Google to claim what they claim, you know what they did: they data mined your voice, so they can data mine everything you say within range of an internet-connected microphone.

        The reality is this is pretty fucking evil and nothing to brag about. What should happen in a moral and sound world is that an app should be provided to locally train speech recognition to you and not listen in on everything you say to data

        • by shanen ( 462549 )

          Basically concurrence with the thread, but I want to note that the newest version works much more poorly than it used to. The dictation results are worse and cannot be corrected as easily as the previous versions.

          What makes that seem especially weird to me is that the old approach made it quite easy to give corrective feedback to the recognition engine. Especially for customizing for each individual.

          I used to use the dictation for first drafts (though not for Slashdot), and it was quite good. The latest is

    • The thing is, speech happens in a variety of situations. Being able to recognize a particular word and how it sounds different coming from a studio recording than from a phone call than from a voice memo on a windy day than from a voice assistant in a room where a child in the background is banging on a xylophone while screaming the lyrics to “Let It Go” is a useful skill. Sure, making it “globally fuzzier” may make your speech recognition system less useful in a carefully controlled

    • by AmiMoJo ( 196126 )

      Speech recognition doesn't just rely on understanding what it is hearing now; it also looks at whether the words make sense when strung together or in the context in which they are being spoken. That's how humans do it too: even if we don't hear every word clearly, we can usually guess what is being said (or at least the general meaning) from experience.
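      A toy illustration of that point, assuming nothing about how any production recognizer or its language model is actually built: an add-one-smoothed bigram score prefers the candidate transcript whose words have been seen strung together before, which is enough to separate acoustically similar hypotheses.

      ```python
      # Toy bigram "does this word sequence make sense?" score; purely
      # illustrative, not how real recognizers or their language models work.
      import math
      from collections import Counter

      corpus = "we can recognize speech we can recognize words".split()
      bigrams = Counter(zip(corpus, corpus[1:]))
      unigrams = Counter(corpus)
      vocab = len(unigrams)

      def lm_score(sentence: str) -> float:
          # add-one smoothed bigram log-probability; higher means the words
          # look more like sequences seen before
          words = sentence.split()
          return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                     for a, b in zip(words, words[1:]))

      candidates = ["we can recognize speech", "we can wreck a nice beach"]
      print(max(candidates, key=lm_score))  # prints "we can recognize speech"
      ```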

    • That's too general a statement. Apparently [arxiv.org] the model performs well. The thoughtful part of the solution may already be there in the form of the Conformer RNN-T architecture.

      By the way, 5,000 hours is not that much. It is just the public datasets; I am sure Google has (and uses) many times more data.

  • by bill_mcgonigle ( 4333 ) * on Thursday April 15, 2021 @10:22PM (#61279274) Homepage Journal

    > dataset of distant conversations in homes recorded by microphones,

    hey, Wiretap, what's a recipe for pancakes?

  • Google Voice Search is changing the way Google handles search queries. Here is a step-by-step guide you can follow to use Google voice searches and actions. Earlier, Google voice search supported only two languages, Hindi and English. But to connect Indian users more to the web, Google added more Indian languages to voice typing and voice search on Android phones, such as Bengali, Gujarati, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu. So that users can now use voice input in their
  • But if you're black then they don't understand anything you say.

    When asked what their ethics division thought about this, Google said they raised objections, so they fired them all.

  • It is Google, after all; you have to wonder.
  • After making a few YouTube videos, I checked the automatic closed captioning just for fun. I just use a cell phone, so the audio is not that great, plus I tend to mumble anyway, but the speech recognition almost always gets it right, even with pretty obscure, obsolete technical terms. Really, it already does much better than the average human.
  • By taking a model like SpeechStew that's designed to understand generic speech and refining it at the margins, it's possible for AI to, for example, understand speech in different accents and environments.

    Having been raised in the northeast US, I find that strong accents of the deep south and southwest are nearly impenetrable. It is considerably easier for me to understand a native Spanish or Chinese speaker attempting my region's version of English and mangling the hell out of it than it is to understand s

  • Spying on people's distant conversations in their own homes via microphones??!! Now, THAT IS PURE EVIL AND MALICIOUS! PURE, VILE EVIL!
