When generating videos from text prompts, directly mapping language to high-resolution video tends to produce inconsistent, blurry results: the output space is so high-dimensional that a text prompt alone is too weak a signal to pin it down.

Researchers at Meta took a different approach - first generate a high-quality image from the text, then generate a video conditioned on both image and text.

The image acts like a “starting point” that the model can imagine moving over time based on the text prompt. This stronger conditioning signal produces way better videos.
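To make the two-stage idea concrete, here is a minimal sketch using open models from the diffusers library: SDXL for the text-to-image step and Stable Video Diffusion for the image-to-video step. Emu Video itself isn't released this way, and Stable Video Diffusion conditions on the image only (Emu Video keeps the text prompt in the second stage too), so treat this as an illustration of the factorized structure rather than a reproduction of the paper.

```python
import torch
from diffusers import AutoPipelineForText2Image, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

prompt = "A fluffy dog running through shallow waves at sunset"

# Step 1: text -> image. Any strong text-to-image model can provide the "starting point".
t2i = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
image = t2i(prompt=prompt).images[0].resize((1024, 576))

# Step 2: image -> video. SVD conditions only on the image; Emu Video also
# keeps the text prompt as conditioning in this stage.
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")
frames = i2v(image, decode_chunk_size=8).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```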

They built a model called Emu Video using diffusion models. It sets a new SOTA for text-to-video generation:

  • “In human evaluations, our generated videos are strongly preferred in quality compared to all prior work – 81% vs. Google’s Imagen Video, 90% vs. Nvidia’s PYOCO, and 96% vs. Meta’s Make-A-Video.”
  • “Our factorizing approach naturally lends itself to animating images based on a user’s text prompt, where our generations are preferred 96% over prior work.”

The key was “factorizing” generation into two steps: image first, then video.

Being able to condition on both the text AND a generated image makes the video task much easier. The model only has to figure out how the image should move, instead of hallucinating both appearance and motion from scratch.
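How might “condition on the image” actually look inside the video diffusion model? One natural implementation, and roughly the scheme the paper describes, is to zero-pad the starting image across time, add a binary mask marking which frame is given, and concatenate both with the noised video along the channel dimension, while the text embedding enters through cross-attention as usual. The sketch below is a rough PyTorch illustration of that idea; shapes, names, and the surrounding denoiser are placeholders, not Meta’s code.

```python
import torch

def build_denoiser_input(noisy_video, cond_image, text_emb):
    """Sketch of image+text conditioning for a video diffusion denoiser.

    noisy_video: (B, C, T, H, W) noised video latents at the current step
    cond_image:  (B, C, H, W)    latent of the text-generated starting image
    text_emb:    (B, L, D)       text encoder output, used via cross-attention
    """
    B, C, T, H, W = noisy_video.shape

    # Put the conditioning image at t=0 and zeros elsewhere, so the model knows
    # exactly which frame is given and which frames it must imagine.
    cond_video = torch.zeros_like(noisy_video)
    cond_video[:, :, 0] = cond_image

    # A binary mask channel marks the conditioned frame.
    mask = torch.zeros(B, 1, T, H, W, device=noisy_video.device, dtype=noisy_video.dtype)
    mask[:, :, 0] = 1.0

    # Channel-concatenate: the denoiser sees noise + image + mask at every step,
    # while the text embedding conditions it through cross-attention layers.
    denoiser_input = torch.cat([noisy_video, cond_video, mask], dim=1)
    return denoiser_input, text_emb

# Example shapes: batch 2, 4 latent channels, 16 frames, 32x32 latents
x, ctx = build_denoiser_input(
    torch.randn(2, 4, 16, 32, 32), torch.randn(2, 4, 32, 32), torch.randn(2, 77, 768)
)
print(x.shape)  # torch.Size([2, 9, 16, 32, 32]): 4 noise + 4 image + 1 mask channels
```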

They can also animate user-uploaded images by supplying the image as the conditioning signal. Again, this is reported to be far better than previous techniques.
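As a usage note, animating a user image with the same open-model stand-in as above just means skipping step 1 and feeding the uploaded image straight into the image-to-video stage; the file name below is a placeholder, and this is again Stable Video Diffusion illustrating the idea, not Emu Video itself.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

# Condition the video model directly on a user-provided image (placeholder path).
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

user_image = load_image("my_photo.png").resize((1024, 576))
frames = i2v(user_image, decode_chunk_size=8).frames[0]
export_to_video(frames, "animated.mp4", fps=7)
```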

It’s cool to see research pushing text-to-video generation forward. Emu Video shows how stronger conditioning through images sets a new quality bar. This is a nice complement to the Emu Edit model they released as well.

TLDR: By first generating an image conditioned on text, then generating video conditioned on both image and text, you can get better video generation.

Full summary is here. Paper site is here.