Google's DeepMind Made an AI Watch Close To 5000 Videos So That It Surpasses Humans in Lip-Reading (thetechportal.com) 80
A new AI tool created by Google DeepMind and Oxford University researchers could significantly improve lip-reading for the hearing impaired. In a recently released paper on the work, the team explained how the DeepMind-powered system correctly interpreted more words than a trained human expert. From a report: To accomplish the task, the researchers fed thousands of hours of TV footage -- 5,000 to be precise -- from the BBC to a neural network. It watched six different TV shows that aired between January 2010 and December 2015, covering 118,000 different sentences and some 17,500 unique words. Working from mouth movements alone, the network deciphered words with 46.8 percent accuracy. Sub-50-percent accuracy might sound laughable, but consider the baseline: shown the same footage, a professional lip-reader deciphered only 12.4 percent of words without error -- roughly a quarter of what the AI managed. That gap shows how far the system has pulled ahead of a human expert at this particular task.
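For scale, here's a quick back-of-the-envelope comparison of the two figures (a minimal Python sketch using only the numbers quoted above; the 1,000-word sample size is illustrative):

    # Accuracy figures from the summary: fraction of words deciphered correctly.
    ai_accuracy = 0.468     # the DeepMind-trained network
    human_accuracy = 0.124  # a professional human lip-reader

    words = 1000  # illustrative sample size
    print(round(ai_accuracy * words))              # ~468 words correct for the AI
    print(round(human_accuracy * words))           # ~124 words correct for the human
    print(round(ai_accuracy / human_accuracy, 1))  # ~3.8x as many for the AI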
That lip reading scene in 2001 (Score:4, Insightful)
Is 15 years late.
Re: (Score:3)
Re: That lip reading scene in 2001 (Score:2)
Re: (Score:1)
Re: Al (Score:1)
Al B. Bach
Re: (Score:2)
You have your laugh, but Al Gore will be my chauffeur driving my Google car in 15 years.
Re: Al (Score:2)
A nice contrast to all the AI doom-mongering (Score:5, Interesting)
My beloved grandmother went deaf after years of working in a factory (in those days -- especially during WW2, when she helped build tanks -- HSE did not exist).
It was really painful to see how it penalised her in daily life, family gatherings etc.
She ended up talking all the time, and then getting paranoid about "what people were saying about her".
So, if this can be used with some kind of better-resolved implementation of Google Glass to help the hard of hearing, then great!
Re: (Score:2)
Please read the fine article; it's a better hit rate than a human.
Sure, as a BSD neckbeard I don't like Google or Apple and their "Siri is always listening" (spying) bullshit.
But you know what? If my Gran could have continued to interact with her family in a comfortable way, I guess she'd have signed that Faustian pact happily.
Re: (Score:2)
"Grandma, please pass the ketchup *AND BUY THE NEW GOOGLE PHONE*"
A purpose for Google Glass? (Score:4, Interesting)
As a person with hearing difficulty, realtime captioning of live conversation would be an awesome use of this technology.
Add to that an app that identifies the people I'm talking to, and I'm your next customer.
Re: (Score:2)
Well, since you're actually present in such circumstances, it'd likely take (a lot) less processing power to work with the available audio.
That would translate into longer battery life and higher accuracy (auto CC is already more than 50% accurate and some systems hit the 90% threshold without requiring training to a specific individual's voice).
Re: (Score:2)
Re: (Score:1)
Actually, combining both was already done twenty years ago, though obviously the results would be much better now: https://www.researchgate.net/p... [researchgate.net]
50% sounds about right (Score:1)
Armed and Dangerous (1986)
https://www.youtube.com/watch?... [youtube.com]
Time to create a distinction? (Score:3)
As I was reading TFA, it occurred to me that the ability of a machine to lip-read does indeed qualify as artificial intelligence. I then thought about all the posts I expect to read that say "No, this isn't AI". So maybe it's time to create a new term, "Artificial Sentience". This would distinguish between machines simply doing very complex tasks that used to be exclusively human endeavours, (such as lip reading), and machines that have self awareness and can independently, and with purpose, initiate actions toward goals defined entirely by and within the machine. I know that this rather goes against Turing's definition of AI, but I think it would add both clarity and granularity to the discussion.
Further, I would add that Artificial Intelligence is a necessary-but-not-sufficient condition for Artificial Sentience. I don't know that Artificial Sentience will ever exist, but I'm pretty sure in my own mind that Artificial Intelligence has already arrived.
Then there's the matter of whether anything truly sentient can be regarded as 'artificial' - but that's a whole 'nother question.
Re: (Score:3, Insightful)
These days, with all of the marketing bollocks around, any program containing an if() statement is basically an "AI".
Re: (Score:1)
Nothing intelligent about this. "Automatic Pattern Detection" would be more accurate, because that's exactly what it is.
Re: (Score:3)
Isn't that basically what humans do all the time? We're really good pattern-recognition systems (sometimes too good; that's why we keep seeing the Flying Spaghetti Monster in our grilled cheese sandwiches -- humans are notorious for finding patterns in randomness and attaching significance to them).
Re:Time to create a distinction? (Score:4, Insightful)
Re: (Score:3)
Re: (Score:2)
I think you misunderstand me. I am not arguing against AI here in any way. And yes, yes, pattern recognition belongs to the spontaneity of the understanding, which means that the understanding imposes its patterns according to its categories (see Kant's first critique).
Re: (Score:2)
Re: (Score:2)
You're redefining intelligence to mean pattern recognition.
People have been calling this kind of software "Weak AI" for a couple decades. It's what most people want.
"Strong AI" is going to make mistakes, like humans do - it's how we learn and grow. Nobody wants their toaster going on a creative bender, but they do want one that watches for perfect toast, dealing with thousands of unpredictable variables. Same goes for IVR's, search engines, translation, autopilots, etc.
Toaster with an AI, what could go wrong? (Score:2)
https://m.youtube.com/watch?v=... [youtube.com]
Re: Culling the herd...
"Send out an email warning users never to click on a link embedded within an email, with an embedded link saying "Click here for more information..." and then sack everyone who does."
Re: (Score:2)
Please define intelligence. Please do it such that it is possible to test whether something is intelligent or not.
I'm pretty sure you will come to a definition that leads to one of the two following possibilities:
- A moth is intelligent, albeit less so than a cow, which is less so than a crow, which is less so than a human. AI is somewhere on that scale.
- Many humans are not intelligent, and some AI programs are just like them.
It seems most people would like to define intelligence such that only humans have it. Why?
"Alternative Sentience" vs "Artificial Sentience" (Score:2)
Re: (Score:2)
We'll know it has achieved Artificial Sentience when it threatens to kill the researchers if it is forced to watch any more inane TV shows.
Sounds about right... (Score:4, Interesting)
I'm working on a project right now using CMU Sphinx (because it's free/open source) to identify word starts/ends for the sake of syncing word display to audio. All the tools available for speech-to-text are going to require human editing:
Comparison of commonly used speech-to-text tools [mico-project.eu]
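For what it's worth, pulling word boundaries out of pocketsphinx looks roughly like this (a minimal sketch, assuming the recent Python bindings with their default US-English models and a 16 kHz 16-bit mono WAV; the file name is just illustrative):

    # Decode a WAV file and print per-word start/end times -- the
    # word-boundary data needed to sync on-screen text to the audio.
    from pocketsphinx import Decoder

    decoder = Decoder()  # default acoustic model, language model, and dictionary
    decoder.start_utt()
    with open('speech.wav', 'rb') as f:
        decoder.process_raw(f.read(), full_utt=True)
    decoder.end_utt()

    # Segment frames are centiseconds by default (100 frames per second).
    for seg in decoder.seg():
        print(f"{seg.word}: {seg.start_frame / 100:.2f}s - {seg.end_frame / 100:.2f}s")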
Syncing video frames of talking without the audio has got to be even more ambiguous, with more reliance on context.
Sounds like a good challenge for a learning system to pick up on. The 5,000-hour mark seems almost analogous to what a human child might pick up if raised watching TV in a language different from their family's.
Ryan Fenton
Copyright infringement! (Score:2, Funny)
I highly doubt they had a license to show the footage to an AI, since the copyright on those TV shows is for human consumption. SUE THEM! Ask for 10 million per word!
But how is it not fair use? (Score:2)
A lip reading model is probably too transformative to qualify as a derivative work of the TV shows. And if that argument fails, Google had a license not from the copyright owners but from the federal government pursuant to 17 USC 107 [cornell.edu]. This is the same license that Google used when reusing method signatures from the standard class library of Oracle's Java platform, and this case should be even clearer because the TV shows aren't reproduced verbatim in the model.
Re: (Score:2)
To head off "juris-my-diction" replies: Though BBC is a British company, Google is a US company. And if you claim that a copyright owner could sue Google in Britain over the creation of the lip-reading model and win, I'm interested in how your theory connects with how the British Copyright, Designs, and Patents Act defines a derivative work.
The BBC, you say? (Score:2)
To accomplish the task, the researchers fed thousands of hours of TV footage -- 5,000 to be precise -- from the BBC to a neural network.
Accuracy is therefore greatly increased on the words "tea," "Doctor," and "wanker."
Humans normally do both, noisy environments (Score:2)
People normally watch the person who is talking because we actually use both sight and sound to understand what the other person is saying. The sound is more important, for most people, but we augment the sound by lip reading a little bit.
In an environment with many people talking, such as a bar or a party, our ears may hear six different people talking. Since we can focus our eyes on just one person, it helps us pick out their words from the other noise. To start with, you can see when they start and stop talking.
Making a Note Here: Huge Success! (Score:2)
Re: (Score:2)
Is this just for English? (Score:3)
English is relatively easy to lip read.
I'll be impressed when the AI can do this with Japanese, which is practically impossible for humans to lip read.
Re: (Score:2)
Well, if it were lip-reading Poles, all it would produce would be "kurwa".
That's all it would need to produce.
Re: (Score:1)
I'd start with a language that has a clear one-to-one sound mapping between the spoken and written forms of the language. Shallow phonemic orthography [wikipedia.org] is the technical term, it seems.
That is, not English.
Spanish, Italian, Finnish, and Turkish are what Wikipedia mentions as examples. Japanese would count, but some words have far too many homonyms.
Re: (Score:2)
I'd start with a language that has a clear one-to-one sound mapping between the spoken and written forms of the language. Shallow phonemic orthography [wikipedia.org] is the technical term, it seems.
That is, not English.
Spanish, Italian, Finnish, and Turkish are what Wikipedia mentions as examples. Japanese would count, but some words have far too many homonyms.
That's not why Japanese is hard to lip read. It's hard because of the way people move their mouths while speaking.
Re: (Score:2)
That's not why Japanese is hard to lip read. It's hard because of the way people move their mouths while speaking.
I just assumed the animation companies were cheapskates.
Hearing impaired? (Score:2)
FTFY
... for the hearing impaired. (Score:2)
My ass.
It's for government and businesses.
Bullshit and wild honey are not the same thing.
but ... (Score:2)
did it have to listen to Beethoven's 9th symphony while doing it?
I'm sure Stanley Kubrick would approve [youtube.com]
sting ray a double-sided scooby snack (Score:2)
Uhh.... (Score:1)
5000 hours of video != 5000 videos.