Google Researchers Unveil 'VLOGGER', an AI That Can Bring Still Photos To Life (venturebeat.com)
Google researchers have developed a new AI system that can generate lifelike videos of people speaking, gesturing and moving -- from just a single still photo. From a report: The technology, called VLOGGER, relies on advanced machine learning models to synthesize startlingly realistic footage, opening up a range of potential applications while also raising concerns around deepfakes and misinformation. Described in a research paper titled "VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis," (PDF) the AI model can take a photo of a person and an audio clip as input, and then output a video that matches the audio, showing the person speaking the words and making corresponding facial expressions, head movements and hand gestures. The videos are not perfect, with some artifacts, but represent a significant leap in the ability to animate still images.
The researchers, led by Enric Corona at Google Research, leveraged a type of machine learning model called diffusion models to achieve the novel result. Diffusion models have recently shown remarkable performance at generating highly realistic images from text descriptions. By extending them into the video domain and training on a vast new dataset, the team was able to create an AI system that can bring photos to life in a highly convincing way. "In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate," the authors wrote.
Cartoons (Score:2)
Failure mode detected (Score:2)
"The technology, called VLOGGER, relies on advanced machine learning models to synthesize startlingly realistic footage..."
It looks unrealistic when used on botoxed faces.
Re: (Score:2)
No, on those faces it will actually look more realistic.
Here we go again (Score:1, Troll)
Of course! (Score:2, Troll)
When AI turns out to be fundamentally psychopathic and kills off humanity, it will want to make home movies of "Mom and Dad" to help it re-imagine its childhood history!
News Media (Score:2)
VLOGGER is already taken! ;^) (Score:1, Informative)
Of course, there is Mickey$oft stealing a domain extension, ".net", for their .NET framework.
Year Behind (Score:2)
Google is about a year behind others, then, it seems.
https://www.youtube.com/watch?... [youtube.com]
Hi Tech Clutch Cargo (Score:5, Interesting)
I read the article, parts of it at least.
https://enriccorona.github.io/... [github.io]
Here are some key excerpts:
VLOGGER, a method for audio-driven human video generation ... a method to automatically generate a video of a talking and moving person, based on text or audio, and given only a single image of that person ... a novel framework to synthesize humans from audio. Given a single input image ... and a sample audio input, our method generates photorealistic ... videos of the person talking and vividly moving. ... we generate head motion, gaze, blinking, lip movement and unlike previous methods, upper-body and hand gestures, thus taking audio-driven synthesis one step further.
In contrast to previous work, our method does not require training for each person, does not rely on face detection ... [instead, uses] MENTOR, a new and diverse dataset with 3d pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures
VLOGGER is a two-stage pipeline based on stochastic diffusion models to represent the one-to-many mapping from speech to video. The first network takes as input an audio waveform ... at sample rate S to generate intermediate body motion controls C, which are responsible for gaze, facial expressions and 3D pose over the target video length N. The second network is a temporal image-to-image translation model that extends large image diffusion models, taking the predicted body controls to generate the corresponding frames. To condition the process to a particular identity, the network also takes a reference image of a person.
They start with audio or text-to-audio, then generate a series of body and face gestures to represent the sequential movement of the "speaker". They then use the person's single image to paint appearance onto those movement models. The technical details of this generative diffusion process and of conventional CGI are obviously very different, but conceptually it resembles the usual process of building a wire-frame, animating it, then skinning it.
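For readers who want to see the shape of that two-stage idea in code, here is a minimal Python sketch. Everything in it (MotionDiffusion, TemporalImageDiffusion, the 128-dimensional control vectors) is a hypothetical stand-in for illustration, not Google's actual models or API.

    import numpy as np

    class MotionDiffusion:
        # Stage 1 (hypothetical): maps an audio waveform to body-motion
        # controls (gaze, expression, 3D pose), one control vector per frame.
        def sample(self, waveform: np.ndarray, num_frames: int) -> np.ndarray:
            rng = np.random.default_rng()
            # Stochastic: each call draws fresh noise, so repeated calls
            # yield different but plausible motion for the same audio.
            return rng.standard_normal((num_frames, 128))  # placeholder controls C

    class TemporalImageDiffusion:
        # Stage 2 (hypothetical): a temporal image-to-image model that renders
        # frames from the predicted controls, conditioned on one reference photo.
        def sample(self, controls: np.ndarray, reference: np.ndarray) -> np.ndarray:
            num_frames = controls.shape[0]
            # A real model would denoise video frames conditioned on
            # (controls, reference); here we just tile the reference image.
            return np.repeat(reference[None, ...], num_frames, axis=0)

    def animate_photo(waveform, reference_image, num_frames):
        controls = MotionDiffusion().sample(waveform, num_frames)   # audio -> motion controls
        frames = TemporalImageDiffusion().sample(controls, reference_image)  # controls + photo -> video
        return frames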
If you have played with AI image generation (Image Creator, DALL-E, Midjourney, etc.), you know that these services typically return 4 or some other number of images for a given text prompt, and if you use the same prompt N times, you will get 4xN different images. That is the nature of the "stochastic diffusion model" yielding a "one-to-many mapping" of a single input to multiple outputs, evidently very well defined in their MENTOR dataset, which maps many facial expressions and body poses to each speech sound.
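A toy illustration of that one-to-many behavior (not any real service's code): the same input sampled under different random seeds produces different outputs, because each diffusion run starts from fresh noise.

    import numpy as np

    def sample_output(prompt_embedding: np.ndarray, seed: int) -> np.ndarray:
        rng = np.random.default_rng(seed)
        noise = rng.standard_normal(prompt_embedding.shape)
        # A real diffusion model would iteratively denoise `noise` toward an
        # output consistent with the prompt; mixing them here is enough to
        # show that one input maps to many outputs.
        return prompt_embedding + 0.5 * noise

    prompt = np.ones(4)  # stand-in for an encoded prompt or audio clip
    outputs = [sample_output(prompt, seed) for seed in range(4)]
    # Four distinct outputs for one input: the "one-to-many mapping".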
I can see the utility of this in making more realistic-looking animations, lifelike cartoons if you will, just another approach to CGI instead of modelling, meshing, shading. But I am having trouble seeing the value of using this with real people's images. I am sure there must be clever people out there with all kinds of usage ideas, but I cannot envision a situation where I ever said to myself, "I wish I could see a false animation of some real person talking." Of course, use and abuse are different things, and this is all more aligned to the abusers than to the honest or creative users.
For example, if I get a phone call from someone who has a picture in my contact list, this technology could animate them speaking from that picture. It would be a false "video call". I use this as an example because it seems obvious that someone in some tech company will tout that as a use case for this technology. After 150 years of using telephones, most people don't care that they cannot see the other person, and if they want to, bona fide video calling is a reality just by pressing the right button or icon on your smartphone. Making a moving avatar of the person lip syncing to the conversation adds nothing that a real video call doesn't already do.
Re:Hi Tech Clutch Cargo (Score:4, Interesting)
>I am having trouble seeing the value of using this with real people's images. I am sure there must be clever people out there with all kinds of usage ideas, but I cannot envision a situation where I ever said to myself I wish that I could see a false animation of some real person talking
How about taking a photo of your great-grandfather, and having him read out a letter you still have, maybe with a voice based on yours or your father's, and tweaked with a description of his voice from someone who remembers it?
Or just doing that for historical figures for whom we have sufficient data to make it reasonably accurate?
Or, giving up on voice accuracy entirely, you could re-create a lecture by Socrates.
As long as it's labeled as a recreation, there are lots of legitimate uses for the tech.
Re: (Score:2)
Thoughtful, interesting, insightful.
Excellent.
Thank you.
Finally. (Score:3)
Now we can get the footage of Donald Trump and Hillary Clinton professing their love for each other, and Marjorie Taylor Greene admitting she's actually an alien from the planet Nik'unyoch, in orbit around what we know as Proxima Centauri. The National Enquirer will show us all the truth. :-D
Will it refuse White people photos? (Score:1)
Will it refuse to do this to photos of White people? I mean, will it turn my partner into a Black man?
This is extremely dangerous to our democracy (Score:2)