Microsoft Claims Its Speech Transcription AI is Now Better Than Human Professionals (qz.com) 98
Microsoft announced today a system that can transcribe the content of a phone call with "the same or fewer errors" than real actual human professionals trained in transcription -- even when the human transcript is double-checked by a second human for accuracy. As you can imagine, this is a huge milestone for speech recognition. From a Quartz report: The team doesn't attribute this achievement to any breakthrough in algorithm or data, but the careful tuning of existing AI architectures. To test how their algorithm stacked up against humans, first researchers had to get a baseline. Microsoft hired a third-party service to tackle a piece of audio for which they had a confirmed 100 percent accurate transcription. The service worked in two stages: one person types up the audio, and then a second person listens to the audio and corrects any errors on the transcript. Based on the correct transcript for the standardized tests, the professionals had 5.9 percent and 11.3 percent error rates. After learning from 2,000 hours of human speech, Microsoft's system went after the same audio file -- and scored 5.9 percent and 11.1 percent error rates. That minute difference ends up being about a dozen fewer errors.
Microsoft's next challenge is making this level of speech recognition work in noisier environments, like in a car or at a party. This implementation is crucial for Microsoft, and goes well beyond just transcription.
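For readers wondering how an "error rate" against a known-correct transcript might be scored: the summary doesn't say which metric was used, so the following is only a sketch of one common approach, a word-level edit distance divided by the length of the reference transcript. The function name and the lowercase/whitespace tokenization are assumptions made for illustration, not details from the report.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript.

    Assumption: words are compared case-insensitively after splitting
    on whitespace. A real scoring pipeline may normalize differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("that is a fast car", "that is a fat car"))
```

Under this kind of scoring, one wrong word in a five-word sentence is a 20 percent error rate, which is why every error counts the same regardless of how much it changes the meaning.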
Re: (Score:1)
... has just wasted all of our time...
My apologies for my error. I should have asked: Isn't that the dying operating system company?
Re: (Score:1)
This daily MS-bashing circle jerk, the nauseating fanboyism, and politically motivated anti-science goons are the three reasons I stopped using /.
Re: (Score:1)
Gotta love irony.
Yeah, I drop in once or twice a year. Nothing's changed. Other tech sites have more straightforward news and Quora's comments have a much better signal to noise ratio.
Re: Self-Reflection (Score:2)
Have my criticism and observations upset you, AC? Struck a nerve?
Re: (Score:1)
I thought it was bad the day I had to train some foreign workers up to replace me.
At least they were human. It'd be worse having to train up an AI to take your job...
Re:Microsoft? (Score:4, Funny)
Hush! As long as MS exists, I have total job security!
Re: (Score:2)
It's less them having trouble understanding me, it's more me having trouble understanding them. If MS built a speech recognition software that can translate the output of an Indian call center, my hat is off to them!
11.1 vs. 11.3 percent (Score:1)
That minute difference ends up being about a dozen fewer errors.
If 0.2% is a dozen, then 1% is sixty, so 100% is six thousand errors.
Yikes.
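The arithmetic above can be checked with a few lines. This is only a sketch under stated assumptions: "a dozen" is taken as exactly 12, and the 11.3 vs. 11.1 percent gap is assumed to apply to a single test set. On that reading, the 100 percent figure works out to the implied size of the test set in words, and the error counts are the 11.x percent slices of it. The function name is made up for illustration.

```python
from fractions import Fraction


def implied_totals(error_gap: int, human_rate_pct: str, machine_rate_pct: str):
    """Back out the test-set size implied by a gap in error counts.

    Rates are passed as strings so Fraction keeps them exact
    (no floating-point rounding on 11.3 - 11.1).
    """
    human = Fraction(human_rate_pct) / 100
    machine = Fraction(machine_rate_pct) / 100
    total_words = error_gap / (human - machine)
    return {
        "total_words": int(total_words),
        "human_errors": int(total_words * human),
        "machine_errors": int(total_words * machine),
    }


if __name__ == "__main__":
    # Assumption: "about a dozen" is treated as exactly 12.
    print(implied_totals(12, "11.3", "11.1"))
```

With those assumptions the numbers come out to roughly 6,000 words in the test set, with about 678 human errors against about 666 machine errors.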
Right ... (Score:5, Funny)
--
This comment was transcribed by Microsoft's new AI transcription software.
Re: (Score:2)
I've taken to typing and saying "ducking" all the time anyway. Soon to be added as a new meaning in the dictionaries.
Those ducks, always up to something nasty.
Re: (Score:2)
I've taken to typing and saying "ducking" all the time anyway. Soon to be added as a new meaning in the dictionaries.
Those ducks, always up to something nasty.
I used to have an office that overlooked a river. I can't speak for all ducks, but the resident mallards ... yes, they were almost always up to those types of things.
Re: (Score:2)
Homosexual necrophiliac rape, if I recall correctly.
Yes, I do remember correctly, and it was indeed a Mallard doing the deed (and being done-unto, too).
Almost unremarkable that it was a Dutch report, and was considered so remarkable that it took 6 years from event to publication.
I'd not a
Voice Control (Score:5, Insightful)
If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.
Re: (Score:3)
Agreed, however people down south don't move their mouse with "the typical hospitallllity of us folk 'round here" as opposed to the people up north who couldn't give a rat's ass.
Speech is incredibly dense to parse. Where a near perfect operation is required for a mouse, voice control can have a couple bumps in its road before (and while) being highly adopted.
Re: (Score:3)
If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.
Wrong example. Mouse usability requires constant visual feedback and almost constant human correction. That is the reason why we can't really use a mouse without looking directly at the screen.
In any case, flawless transcription accuracy of one single human voice out of 7.5 billion voices already happens with Google Voice. The problem occurs when Google Voice is not tuned to the voices of the other 7.49999 billion people. Do you think that's what Microsoft is using in the backend this second time [blogspot.com] around?
Re: (Score:2)
This!
We have input today that is perfect. More important, we sometimes have to do input that can break hours if not days of work if executed wrongly. Hitting the wrong key at the wrong time can at least be chalked up to human error. Saying "down" to scroll and having it interpreted as "shutdown" (along with the frustrated "NO, dammit" being interpreted as the answer to "save work (y/n)?") is more a problem of the input parser than the human in front of the screen.
Unless it is AT LEAST at par with other mean
Re: (Score:2)
If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.
And yet touchpads are still vastly more common on laptops than trackpoints...
any better than "Show me to buy milk"? (Score:3)
Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".
Re: (Score:2)
"Show me my most at-risk opportunities".
Huh, you mean Xiaomi is coming out with moist asterisks? How very interesting!
Say what? (Score:1)
Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".
Eye thin queue meant two say:
Lie canny hue man wood thin cab out mill core "owe pen reminders" when he ring "Show meme I Moe stat risk copper tune it ease.
Re: (Score:2)
Hey! It looks like you have obtained illegal access to the system used to caption news broadcasts!
Strange success criteria (Score:2, Interesting)
Dialog windows: "Do you want to register for your FREE Windows 10 Upgrade?"
Me (vocally): "No, no... oh, for the love of all that's sacred, NO!"
Windows: "This may take a while. Please do not power down your computer ..."
Re: (Score:2)
customer relations record: the customer loves windows as if it's the most sacred thing to him
Now put it to good use! (Score:5, Informative)
Automated closed captioning for the hearing impaired would be one. I'm not hearing impaired, but I use the CC system with the volume low when I am watching TV while everyone else in the house is sleeping. I also use it when everyone is awake and noisy. It is amazing how awful some CC can be.
Re: (Score:2)
It is amazing how awful some CC can be.
At first I thought, based on your post, that you'd really meant to say: "It is amazing how awesome CC can be."
Interestingly, both are true.
Re: (Score:1)
Re: (Score:2)
Based on Spanish-language soundtracks, there's no doubt that some CC is human-generated. On my Stargate discs, the foreign-language captions aren't even saying the same sentences as the alternate-language voices.
Re: (Score:2)
No. This is a whole new technology: artificial stupidity. It's going to change the world, I tell you. (Mostly for the worse, I suspect!)
Re: (Score:2)
It will be decades before artificial stupidity is anywhere near natural stupidity on any metric.
Natural stupidity is surprisingly flexible and resilient---it can crop up anywhere and is almost impossible to stop.
Artificial stupidity requires significant investment and evolutionary design before it can approach the persistence and impact we see naturally.
Re: (Score:2)
Oh yes, my body is ready.
And please make an API for all those horrible podcast and audioblog sites out there that make me miss out on industry trends.
And maybe... talk to Google about YouTube CC.
*blech!*
Re: (Score:2)
I'd love to see a YouTube feature that allows you to get the automatically generated transcript of a video without having to actually watch the video. For videos that are intended to be informative, having the transcript and grepping it for keywords and the context they are used in would help you determine if it's worth watching a lengthy video. It might even just outright give you the information you want without having to sit through a half hour video.
Re: (Score:2)
I've begun to suspect that YouTube is often used by the lazy and illiterate to avoid actually taking the effort to type and format what should realistically have been text articles.
Re: (Score:2)
Make a short skit, act it out, take the CC output and redo the skit with the new words.
https://www.youtube.com/watch?... [youtube.com]
Re: (Score:2)
No good when you live with a nutter who thinks they cause cancer.
Re: (Score:2)
I wished more of those CCs were manually typed out by humans.
Re: (Score:2)
It should really improve YouTube too. Having an accurate transcription of a video means it becomes much more searchable than if all you have is the title and summary text. The current automatic transcription on YouTube is nearly useless.
Re: (Score:2)
Yeah, came here to say that. We usually have ours on, and I can't seem to resist reading it. The frequency of errors and quirks is such that I've nearly started making a list of the worst ones. Any show from England tends to have "[indecipherable]" stuck in repeatedly, even when I would have said the language was perfectly clear.
One of my favorites was "read my copy of At Last Shrub" which turned out to be "Atlas Shrugged".
C'mon guys (Score:1)
Re: (Score:2)
Question: how did they find the errors that the two-human team missed? Presumably with a third human. Does this mean a three-person team can beat out both a two-person team and ASR? Or was there a script that was used to generate the audio? That would raise other questions, such as the accuracy of the speakers.
I had the same question. We ran into a similar problem in a school project making an AI that interpreted results from a polysomnogram. In theory we got over ~90% accuracy, but different humans would score the same sleep study differently, which basically meant that humans got 90% accuracy compared to each other too.
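To make the point about scorer disagreement concrete, here is a minimal sketch of pairwise percent agreement between several scorers. The sleep-stage labels, the scorer names, and the ten-epoch example are all made up for illustration and are not from the school project. The idea is just that if two humans only agree with each other about 90 percent of the time, a model that agrees with one of them 90 percent of the time is already at the ceiling that the "ground truth" allows.

```python
from itertools import combinations


def percent_agreement(a, b):
    """Fraction of positions where two label sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_agreement(scorers):
    """Agreement for every pair of scorers, keyed by (name, name)."""
    return {
        (m, n): percent_agreement(scorers[m], scorers[n])
        for m, n in combinations(scorers, 2)
    }


if __name__ == "__main__":
    # Hypothetical per-epoch sleep-stage labels, invented for this example.
    scorers = {
        "human_a": ["W", "N1", "N2", "N2", "N3", "N3", "N2", "REM", "REM", "W"],
        "human_b": ["W", "N2", "N2", "N2", "N3", "N3", "N2", "REM", "REM", "W"],
        "model":   ["W", "N1", "N2", "N2", "N2", "N3", "N2", "REM", "REM", "W"],
    }
    for pair, score in pairwise_agreement(scorers).items():
        print(pair, score)
```

Plain percent agreement ignores agreement that would happen by chance, so a chance-corrected statistic may be a better fit when one label dominates.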
They have a 100% accurate translation? (Score:2)
How do you get 100% accurate translation anyway? These things are up to interpretation, not all words have an exact translation, meaning is more important than the actual correct words. Language is ambiguous.
Also every error is not necessarily equal, some errors are irrelevant, while some are more important, e.g. the quick car, or the fast car, mean the same thing.
The subject is "transcription" not "translation". (Score:2)
Transcription is obviously a lot more straightforward, and the goalposts should be pretty easy to set.
Re: (Score:2)
Quick and fast are easier to discriminate than "fast" and "fat".
Consider the following iterative algorithm...
"That is a fast car" - is translated to
That is a fat car *Context filter - strict vs slang - replace fat with phat*
*Context filter - apply ghetto style - replace "That" with "Dat"*
*Context filter - apply ghetto style - replace "is" with "be"*
*Context filter - apply ghetto style - replace "a" with "one"*
That be one phat car.
Re: (Score:2)
How do you get 100% accurate translation anyway? These things are up to interpretation, not all words have an exact translation, meaning is more important than the actual correct words. Language is ambiguous.
Also every error is not necessarily equal, some errors are irrelevant, while some are more important, e.g. the quick car, or the fast car, mean the same thing.
Who gave you free reign to make such assertions? You need to tow the line or we'll see to it that you loose your posting privileges here!
Microsoft Lies. Case Closed. (Score:1)
Why should anyone trust Microsoft? They lie about surveillance, they lie about being "open", they lied about Windows 10 install options, they lied about Windows Tablet having a bigger screen than iPad, they lied to Steve Jobs about their GUI plans in the 80s, etc.
640 lies oughtta be enough for anyone. Ignore them by now.
Re: (Score:2)
Why should anyone trust Microsoft? They lie about surveillance, they lie about being "open", they lied about Windows 10 install options, they lied about Windows Tablet having a bigger screen than iPad, they lied to Steve Jobs about their GUI plans in the 80s, etc.
640 lies oughtta be enough for anyone. Ignore them by now.
Just to begin with, they have been working on this for a while...
https://www.youtube.com/watch?... [youtube.com]
Govt Surveillance (Score:3, Insightful)
Finally (Score:2)
Re: (Score:2)
Is it.... (Score:1)
based on that twitter chat-bot that turned racist and trollish in a matter of hours? I have been looking for a way to UTF-TRUMP encode my documents!
"Humans" (Score:1)
Everything depends on how dumb the transcriber and/or checker is.
Defused (Score:4, Interesting)
Re: (Score:2)
Re: (Score:1)
What gets to me more is the choice of how to pronounce the value...
No self-respecting scientist would ever say "one point twenty one". That's "one point two one." Or is 1.201 "one point two hundred and one" and thus more?
Re: (Score:2)
My test goes like this:
Dear Aunt
Let's set so double the killer delete select all.
I'm sure the NSA will be happy (Score:2)
Well, there's a whole bunch (Score:2)
What humans were these? (Score:1)
The humans had a 5.9% error rate AFTER proofreading by another person? That's either a lousy speaker, a terrible recording, or really bad transcription. That's not something to brag about, frankly. I used to get an error rate of under 2% with IBM ViaVoice back in 1994. This doesn't seem like progress to me.
Classic speech recognition failure (Score:2)
https://www.youtube.com/watch?... [youtube.com]
Speech Injection (Score:1)
Bot: Good day sir.
Jim: Semi colon drop table language
Bot:???????????
ROTFLMAO! (Score:2)
I have just this to say about that: folks, I wouldn't let alpha software out to users.
They brought in "hybrid" phones here last year (VOIP). For voicemail, it sends an mp3, and a "transcription". Frequently, the "transcription", "powered by Microsoft speech technology", resembles early "computer poetry". And by "early", I'm talking 1960s or '70s.... with significant portions bearing zero resemblance to what was said.
mark