What are your thoughts on the DALL-E 3 “paper,” which doesn’t cover any technical or architectural details? The only useful takeaways seem to be “higher-quality data is better” and “image-captioning models that produce highly detailed captions can build good training datasets.”
Honestly, I’m surprised we even got that, and I suspect we might not have if other researchers hadn’t independently figured out synthetic captions around the same time.
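For what it’s worth, the recaptioning takeaway is easy to make concrete. A minimal sketch of the idea, assuming a hypothetical `captioner` callable standing in for a fine-tuned captioning model (the report describes blending a high fraction of synthetic descriptive captions with the original alt-text captions):

```python
import random

def recaption_dataset(samples, captioner, synthetic_frac=0.95, seed=0):
    """Blend synthetic and original captions for a text-to-image dataset.

    samples: list of (image, original_caption) pairs.
    captioner: any callable image -> detailed caption (hypothetical stand-in
        for a descriptive image-captioning model).
    synthetic_frac: fraction of samples whose caption is replaced with a
        synthetic one; the DALL-E 3 report uses a high blending ratio.
    """
    rng = random.Random(seed)
    out = []
    for image, original in samples:
        # Replace the caption with a synthetic detailed one most of the time,
        # keeping some original captions so the model still handles them.
        caption = captioner(image) if rng.random() < synthetic_frac else original
        out.append((image, caption))
    return out
```

The mixing (rather than replacing every caption) is the report’s hedge against the model overfitting to one caption style at inference time.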