When generating videos from text prompts, directly mapping language to high-resolution video tends to produce inconsistent, blurry results: the output space is so high-dimensional that a text prompt alone is too weak a signal to pin it down.

Researchers at Meta took a different approach - first generate a high-quality image from the text, then generate a video conditioned on both image and text.

The image acts like a “starting point” that the model can imagine moving over time based on the text prompt. This stronger conditioning signal produces way better videos.
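To make the two-stage idea concrete, here is a minimal sketch using open models from the diffusers library: SDXL for the text-to-image step and Stable Video Diffusion for the image-to-video step. Emu Video itself isn't released this way, and Stable Video Diffusion conditions on the image only (Emu Video keeps the text prompt in the second stage too), so treat this as an illustration of the factorized structure rather than a reproduction of the paper.

```python
import torch
from diffusers import AutoPipelineForText2Image, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

prompt = "A fluffy dog running through shallow waves at sunset"

# Step 1: text -> image. Any strong text-to-image model can provide the "starting point".
t2i = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
image = t2i(prompt=prompt).images[0].resize((1024, 576))

# Step 2: image -> video. SVD conditions only on the image; Emu Video also
# keeps the text prompt as conditioning in this stage.
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")
frames = i2v(image, decode_chunk_size=8).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```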

They built a model called Emu Video using diffusion models. It sets a new SOTA for text-to-video generation:

  • “In human evaluations, our generated videos are strongly preferred in quality compared to all prior work – 81% vs. Google’s Imagen Video, 90% vs. Nvidia’s PYOCO, and 96% vs. Meta’s Make-A-Video.”
  • “Our factorizing approach naturally lends itself to animating images based on a user’s text prompt, where our generations are preferred 96% over prior work.”

The key was “factorizing” generation into two steps: image first, then video.

Being able to condition on both the text AND a generated image makes the video task much easier. The model only has to figure out how the image should move, instead of hallucinating both appearance and motion from scratch.
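How might “condition on the image” actually look inside the video diffusion model? One natural implementation, and roughly the scheme the paper describes, is to zero-pad the starting image across time, add a binary mask marking which frame is given, and concatenate both with the noised video along the channel dimension, while the text embedding enters through cross-attention as usual. The sketch below is a rough PyTorch illustration of that idea; shapes, names, and the surrounding denoiser are placeholders, not Meta’s code.

```python
import torch

def build_denoiser_input(noisy_video, cond_image, text_emb):
    """Sketch of image+text conditioning for a video diffusion denoiser.

    noisy_video: (B, C, T, H, W) noised video latents at the current step
    cond_image:  (B, C, H, W)    latent of the text-generated starting image
    text_emb:    (B, L, D)       text encoder output, used via cross-attention
    """
    B, C, T, H, W = noisy_video.shape

    # Put the conditioning image at t=0 and zeros elsewhere, so the model knows
    # exactly which frame is given and which frames it must imagine.
    cond_video = torch.zeros_like(noisy_video)
    cond_video[:, :, 0] = cond_image

    # A binary mask channel marks the conditioned frame.
    mask = torch.zeros(B, 1, T, H, W, device=noisy_video.device, dtype=noisy_video.dtype)
    mask[:, :, 0] = 1.0

    # Channel-concatenate: the denoiser sees noise + image + mask at every step,
    # while the text embedding conditions it through cross-attention layers.
    denoiser_input = torch.cat([noisy_video, cond_video, mask], dim=1)
    return denoiser_input, text_emb

# Example shapes: batch 2, 4 latent channels, 16 frames, 32x32 latents
x, ctx = build_denoiser_input(
    torch.randn(2, 4, 16, 32, 32), torch.randn(2, 4, 32, 32), torch.randn(2, 77, 768)
)
print(x.shape)  # torch.Size([2, 9, 16, 32, 32]): 4 noise + 4 image + 1 mask channels
```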

They can also animate user-uploaded images by supplying the image as the conditioning signal. Again, this is reported to be far better than previous techniques.
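As a usage note, animating a user image with the same open-model stand-in as above just means skipping step 1 and feeding the uploaded image straight into the image-to-video stage; the file name below is a placeholder, and this is again Stable Video Diffusion illustrating the idea, not Emu Video itself.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

# Condition the video model directly on a user-provided image (placeholder path).
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

user_image = load_image("my_photo.png").resize((1024, 576))
frames = i2v(user_image, decode_chunk_size=8).frames[0]
export_to_video(frames, "animated.mp4", fps=7)
```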

It’s cool to see research pushing text-to-video generation forward. Emu Video shows how stronger conditioning through images sets a new quality bar. This is a nice complement to the Emu Edit model they released as well.

TLDR: By first generating an image conditioned on text, then generating video conditioned on both image and text, you can get better video generation.

Full summary is here. Paper site is here.