Microsoft AI IT Technology

Microsoft Claims Its Speech Transcription AI is Now Better Than Human Professionals (qz.com) 98

Microsoft announced today a system that can transcribe the content of a phone call with "the same or fewer errors" than real actual human professionals trained in transcription -- even when the human transcript is double-checked by a second human for accuracy. As you can imagine, this is a huge milestone for speech recognition. From a Quartz report: The team doesn't attribute this achievement to any breakthrough in algorithm or data, but to the careful tuning of existing AI architectures. To test how their algorithm stacked up against humans, researchers first had to get a baseline. Microsoft hired a third-party service to tackle a piece of audio for which they had a confirmed 100 percent accurate transcription. The service worked in two stages: one person types up the audio, and then a second person listens to the audio and corrects any errors on the transcript. Based on the correct transcript for the standardized tests, the professionals had 5.9 percent and 11.3 percent error rates. After learning from 2,000 hours of human speech, Microsoft's system went after the same audio file -- and scored 5.9 percent and 11.1 percent error rates. That minute difference ends up being about a dozen fewer errors. Microsoft's next challenge is making this level of speech recognition work in noisier environments, like in a car or at a party. This implementation is crucial for Microsoft, and goes well beyond just transcription.
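
The summary does not say how the transcripts were scored. A common convention for figures like these, and the one assumed here, is word error rate: the word-level edit distance (substitutions, insertions, deletions) between a transcript and the reference, divided by the number of words in the reference. The sketch below is only an illustration of that metric; the function name `word_error_rate` is hypothetical and this is not Microsoft's scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length (assumed metric)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of five reference words.
    print(word_error_rate("that is a fast car", "that is a fat car"))  # 0.2
```

Under this definition every wrong word counts the same, whether or not it changes the meaning of the sentence.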

  • by Anonymous Coward

    That minute difference ends up being about a dozen fewer errors.

    If 0.2% is a dozen, then 1% is sixty, so 100% is six thousand errors.

    Yikes.

  • Right ... (Score:5, Funny)

    by scunc ( 4201789 ) on Tuesday October 18, 2016 @02:01PM (#53102317)
    I'll believe that when I ducking see it.
    --
    This comment was transcribed by Microsoft's new AI transcription software.
    • by Bongo ( 13261 )

      I've taken to typing and saying "ducking" all the time anyway. Soon to be added as a new meaning in the dictionaries.

      Those ducks, always up to something nasty.

      • by Quirkz ( 1206400 )

        I've taken to typing and saying "ducking" all the time anyway. Soon to be added as a new meaning in the dictionaries.

        Those ducks, always up to something nasty.

        I used to have an office that overlooked a river. I can't speak for all ducks, but the resident mallards ... yes, they were almost always up to those types of things.

      • Those ducks, always up to something nasty.

        Homosexual necrophiliac rape, if I recall correctly.

        Moeliker, C.W., 2001 - The first case of homosexual necrophilia in the mallard Anas platyrhynchos (Aves: Anatidae) - DEINSEA 8: 243-247 [ISSN 0932-9308]. Published 9 November 2001

        Yes, I do remember correctly, and it was indeed a Mallard doing the deed (and being done-unto, too).

        Almost unremarkable that it was a Dutch report, and was considered so remarkable that it took 6 years from event to publication.

        I'd not a

  • Voice Control (Score:5, Insightful)

    by Rockoon ( 1252108 ) on Tuesday October 18, 2016 @02:02PM (#53102327)
    If you want voice input to be more than just a toy, then getting near flawless accuracy here seems to be a required first step.

    If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.
    • Agreed, however people down south don't move their mouse with "the typical hospitallllity of us folk 'round here" as opposed to the people up north who couldn't give a rat's ass.

      Speech is incredibly dense to parse. Where near-perfect operation is required for a mouse, voice control can have a couple of bumps in its road before (and while) being highly adopted.

    • If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.

      Wrong example. Mouse usability requires constant visual feedback and almost constant human correction. That is the reason why we can't really use a mouse without looking directly at the screen.

      In any case, flawless transcription accuracy of one single human voice out of 7.5 billion voices already happens with Google Voice. The problem occurs when Google Voice is not tuned to the voices of the other 7.49999 billion people. Do you think that's what Microsoft is using in the backend this second time [blogspot.com] around?

    • This!

      We have input today that is perfect. More important, we sometimes have to do input that can break hours if not days of work if executed wrongly. Hitting the wrong key at the wrong time can at least be chalked up to human error. Saying "down" to scroll and having it interpreted as "shutdown" (along with the frustrated "NO, dammit" being interpreted as the answer to "save work (y/n)?") is more a problem of the input parser than of the human in front of the screen.

      Unless it is AT LEAST at par with other mean

    • If your mouse occasionally sent an erroneous input to the computer no matter how careful you were, you wouldn't use it so much.

      And yet touchpads are still vastly more common on laptops than trackpoints...

  • by itsme1234 ( 199680 ) on Tuesday October 18, 2016 @02:02PM (#53102329)

    Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".

    • "Show me my most at-risk opportunities".

      Huh, you mean Xiaomi is coming out with moist asterisks? How very interesting!

    • by Anonymous Coward

      Like any human would think about milk or "open reminders" when hearing "Show me my most at-risk opportunities".

      Eye thin queue meant two say:
      Lie canny hue man wood thin cab out mill core "owe pen reminders" when he ring "Show meme I Moe stat risk copper tune it ease.

      • Eye thin queue meant two say: Lie canny hue man wood thin cab out mill core "owe pen reminders" when he ring "Show meme I Moe stat risk copper tune it ease.

        Hey! It looks like you have obtained illegal access to the system used to caption news broadcasts!

  • Dialog windows: "Do you want to register for your FREE Windows 10 Upgrade?"

    Me (vocally): "No, no... oh for the love of all that's sacred, NO!"

    Windows: "This may take a while. Please do not power down your computer ..."

    • customer relations record: the customer loves windows as if it's the most sacred thing to him

  • by cmiller173 ( 641510 ) on Tuesday October 18, 2016 @02:04PM (#53102349)

    Automated closed captioning for the hearing impaired would be one. I'm not hearing impaired, but I use the CC system with the volume low when I am watching TV while everyone else in the house is sleeping. I also use it when everyone is awake and noisy. It is amazing how awful some CC can be.

    • by yagu ( 721525 )

      It is amazing how awful some CC can be.

      At first I thought, based on your post you'd really meant to say: "It is amazing how awesome CC can be."

      Interestingly, both are true.

    • Yes, I've noticed this too. I've often wondered if some CC is done by machine or just illiterates.
      • by machine or just illiterates

        No. This is a whole new technology: artificial stupidity. It's going to change the world, I tell you. (Mostly for the worse, I suspect!)

        • It will be decades before artificial stupidity is anywhere near natural stupidity on any metric.

          Natural stupidity is surprisingly flexible and resilient---it can crop up anywhere and is almost impossible to stop.

          Artificial stupidity requires significant investment and evolutionary design before it can approach the persistence and impact we see naturally.

    • by CODiNE ( 27417 )

      Oh yes, my body is ready.

      And please make an API for all those horrible podcast and audioblog sites out there that make me miss out on industry trends.

      And maybe... talk to Google about YouTube CC.
      *blech!*

      • I'd love to see a YouTube feature that allows you to get the automatically generated transcript of a video without having to actually watch the video. For videos that are intended to be informative, having the transcript and grepping it for keywords and the context they are used in would help you determine if it's worth watching a lengthy video. It might even just outright give you the information you want without having to sit through a half hour video.
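
The "grep the transcript for keywords and their context" idea can be sketched in a few lines, assuming the transcript is already available as plain text (how to obtain it from a video site is not covered here). The function name `keyword_in_context` and the `width` parameter are hypothetical, chosen only for this illustration.

```python
def keyword_in_context(transcript: str, keyword: str, width: int = 5) -> list[str]:
    """Return each occurrence of keyword with up to `width` words on either side."""
    words = transcript.split()
    target = keyword.lower()
    punctuation = ".,;:!?\"'()"
    hits = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring punctuation attached to the word.
        if word.lower().strip(punctuation) == target:
            start = max(0, i - width)
            hits.append(" ".join(words[start:i + width + 1]))
    return hits


if __name__ == "__main__":
    text = "We tuned the model. The model then improved a lot."
    for snippet in keyword_in_context(text, "model", width=1):
        print(snippet)
```

A small window is usually enough to judge whether a long video is worth watching; a larger `width` trades brevity for more context.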

        • I've begun to suspect that YouTube is often used by the lazy and illiterate to avoid actually taking the effort to type and format what should realistically have been text articles.

      • by iczer1 ( 991037 )
        Caption fails (old but funny):

        Make a short skit, act it out, take the CC output and redo the skit with the new words.

        https://www.youtube.com/watch?... [youtube.com]

    • by antdude ( 79039 )

      I wished more of those CCs were manually typed out by humans.

    • by AmiMoJo ( 196126 )

      It should really improve YouTube too. Having an accurate transcription of a video means it becomes much more searchable than if all you have is the title and summary text. The current automatic transcription on YouTube is nearly useless.

    • by Quirkz ( 1206400 )

      Yeah, came here to say that. We usually have ours on, and I can't seem to resist reading it. The frequency of errors and quirks is such that I've nearly started making a list of the worst ones. Any show from England tends to have "[indecipherable]" stuck in repeatedly, even when I would have said the language was perfectly clear.

      One of my favorites was "read my copy of At Last Shrub" which turned out to be "Atlas Shrugged".

  • Say what you want about Microsoft (and some of it is true) but this is progress, even if they (maybe) cherry picked the one trial that had the lowest difference in error rate between the algorithm and a human...
  • How do you get 100% accurate translation anyway? These things are up to interpretation, not all words have an exact translation, meaning is more important than the actual correct words. Language is ambiguous.

    Also every error is not necessarily equal, some errors are irrelevant, while some are more important, e.g. the quick car, or the fast car, mean the same thing.

    • Transcription is obviously a lot more straightforward, and the goalposts should be pretty easy to set.

    • by saider ( 177166 )

      Quick and fast are easier to discriminate than "fast" and "fat".

      Consider the following iterative algorithm...

      "That is a fast car" - is translated to
      That is a fat car *Context filter - strict vs slang - replace fat with phat*
      *Context filter - apply ghetto style - replace "That" with "Dat"*
      *Context filter - apply ghetto style - replace "is" with "be"*
      *Context filter - apply ghetto style - replace "a" with "one"*
      That be one phat car.

    • How do you get 100% accurate translation anyway? These things are up to interpretation, not all words have an exact translation, meaning is more important than the actual correct words. Language is ambiguous.

      Also every error is not necessarily equal, some errors are irrelevant, while some are more important, e,g. the quick car, or the fast car, mean the same thing.

      Who gave you free reign to make such assertions? You need to tow the line or we'll see to it that you loose your posting privileges here!

  • Why should anyone trust Microsoft? They lie about surveillance, they lie about being "open", they lied about Windows 10 install options, they lied about Windows Tablet having a bigger screen than iPad, they lied to Steve Jobs about their GUI plans in the 80s, etc.

    640 lies oughtta be enough for anyone. Ignore them by now.

    • Why should anyone trust Microsoft? They lie about surveillance, they lie about being "open", they lied about Windows 10 install options, they lied about Windows Tablet having a bigger screen than iPad, they lied to Steve Jobs about their GUI plans in the 80s, etc.

      640 lies oughtta be enough for anyone. Ignore them by now.

      Just to begin with, they have been working on this for a while...

      https://www.youtube.com/watch?... [youtube.com]

  • Govt Surveillance (Score:3, Insightful)

    by mcolgin ( 818580 ) on Tuesday October 18, 2016 @02:40PM (#53102675) Homepage
    I assume this is so the Govt agencies can transcribe cell-phone communications to text and then perform analysis to find all the "bad guys" ?
  • The machines can finally interpret our speech. Next step: launch all the missiles.
    • I'm sorry, the missus can't do launch today. It's laundry day. Clippy says she might have time to come round at 11:45 tomorrow. Would that work?
  • based on that twitter chat-bot that turned racist and trollish in a matter of hours? I have been looking for a way to UTF-TRUMP encode my documents!

  • "even when the human transcript is double-checked by a second human for accuracy"

    Everything depends on how dumb the transcriber and/or checker is.
  • Defused (Score:4, Interesting)

    by John Jorsett ( 171560 ) on Tuesday October 18, 2016 @03:22PM (#53103053)
    The acid test for transcription for me is if the transcriptionist gets the word "defuse" right, as in "He defused the tense situation." Every, and I mean EVERY, closed caption I've seen transcribes it as, "He diffused the tense situation." It seems to be the universal mistake.
    • My test goes like this:

      Dear Aunt
      Let's set so double the killer delete select all.

  • Now the NSA can store text transcripts of your conversations instead of having to store the audio files. This will leave so much more room for video! Hey - why did you put tape on your webcam, citizen?
  • Of middle class jobs about to go kaput.
  • by Anonymous Coward

    The humans had a 5.9% error rate AFTER proofreading by another person? That's either a lousy speaker, a terrible recording, or really bad transcription. That's not something to brag about, frankly. I used to get an error rate of under 2% with IBM ViaVoice back in 1994. This doesn't seem like progress to me.

  • Dear aunt, let's set so double the killer delete select all

    https://www.youtube.com/watch?... [youtube.com]

  • Jim: Hey there
    Bot: Good day sir.
    Jim: Semi colon drop table language
    Bot:???????????
  • I have just this to say about that: folks, I wouldn't let alpha software out to users.

    They brought in "hybrid" phones here last year (VOIP). For voicemail, it sends an mp3, and a "transcription". Frequently, the "transcription", "powered by Microsoft speech technology", resembles early "computer poetry". And by "early", I'm talking 1960s or '70s.... with significant portions bearing zero resemblance to what was said.

            mark
