Researchers at Meta AI announced Emu Edit today. It can edit images precisely based on text instructions. It’s a big advance for “instructable” image editing.
Existing systems often struggle to interpret instructions correctly, making imprecise edits or changing the wrong parts of an image. Emu Edit tackles this through multi-task training.
The researchers trained it on 16 diverse image editing and vision tasks, such as object removal, style transfer, and segmentation.
Emu Edit also learns a distinct “task embedding” for each task, which steers it toward the right kind of edit for a given instruction, for example a “texture change” versus an “object removal”.
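To make the task-embedding idea concrete, here is a toy sketch in plain Python. It is not the paper's code: the task names, the vector size, and combining the two vectors by simple addition are all assumptions made for illustration. It only shows the general pattern of keeping one learnable vector per task and mixing it into the conditioning signal.

```python
import random

# Hypothetical task names; the post says there are 16 tasks but names only a few.
TASKS = ["object_removal", "style_transfer", "segmentation", "texture_change"]
EMBED_DIM = 8  # toy size chosen for illustration


def make_task_table(tasks, dim, seed=0):
    """Create one vector per task (randomly initialised; training would update them)."""
    rng = random.Random(seed)
    return {task: [rng.gauss(0.0, 0.02) for _ in range(dim)] for task in tasks}


def condition(instruction_vec, task, table):
    """Combine an instruction embedding with its task embedding by addition (an assumption)."""
    task_vec = table[task]
    if len(instruction_vec) != len(task_vec):
        raise ValueError("instruction and task embeddings must have the same size")
    return [a + b for a, b in zip(instruction_vec, task_vec)]


table = make_task_table(TASKS, EMBED_DIM)
instruction_vec = [0.1] * EMBED_DIM  # stand-in for a text-encoder output

# The same instruction vector yields different conditioning depending on the task.
removal = condition(instruction_vec, "object_removal", table)
texture = condition(instruction_vec, "texture_change", table)
```

The point of the sketch is that the task vector gives the model an explicit signal about what kind of edit is wanted, separate from the wording of the instruction itself.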
In evaluations, Emu Edit significantly outperformed prior systems like InstructPix2Pix on following instructions faithfully while preserving unrelated image regions.
With just a few examples, it can adapt to wholly new tasks like image inpainting by updating only the task embedding rather than retraining the full model.
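The few-shot adaptation idea can be illustrated with a very small toy, again in plain Python. Everything here is hypothetical: the "model" is a three-weight linear function standing in for a large pretrained network, and the learning rate and step count are arbitrary. What it demonstrates is the mechanism described above: the model weights stay frozen and gradient descent touches only the new task vector.

```python
FROZEN_W = [0.5, -1.0, 2.0]  # stand-in for pretrained weights; never updated below


def model(x, task_vec, w=FROZEN_W):
    """Toy frozen model: a linear readout of the input shifted by the task vector."""
    return sum(wi * (xi + ei) for wi, xi, ei in zip(w, x, task_vec))


def loss(examples, task_vec):
    """Mean squared error over the few-shot examples."""
    return sum((model(x, task_vec) - y) ** 2 for x, y in examples) / len(examples)


def fit_task_vec(examples, dim, lr=0.05, steps=200):
    """Gradient descent on the task vector only; FROZEN_W is read but never written."""
    task_vec = [0.0] * dim
    n = len(examples)
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in examples:
            err = model(x, task_vec) - y
            for i in range(dim):
                # d/d(task_vec[i]) of err**2 is 2 * err * w[i]
                grad[i] += 2.0 * err * FROZEN_W[i] / n
        task_vec = [e - lr * g for e, g in zip(task_vec, grad)]
    return task_vec


# A handful of examples from a "new task", generated with a hidden task vector.
hidden = [1.0, 0.5, -0.5]
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
examples = [(x, model(x, hidden)) for x in inputs]

weights_before = list(FROZEN_W)
loss_before = loss(examples, [0.0, 0.0, 0.0])
learned = fit_task_vec(examples, dim=3)
loss_after = loss(examples, learned)
```

Because only a small vector is being optimised, a few examples are enough to fit it, which is the appeal of this approach compared with fine-tuning the whole network.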
There’s still room for improvement on complex instructions. But Emu Edit demonstrates how multi-task training can substantially improve AI editing abilities, and it brings instruction-based editing closer to human-level performance at translating natural language into precise visual edits.
TLDR: Emu Edit combines multi-task training on diverse editing and vision tasks with learned task embeddings to achieve big improvements in instruction-based image editing fidelity.
Full summary is here. Paper here.
Looks like too much work to recreate easily.