What are your thoughts on the DALL·E 3 “paper,” which doesn’t cover technical or architectural details? The only useful takeaways seem to be “higher-quality data is better” and “image captioning models that produce very detailed captions can create good datasets.”
All these models build on one another, and the paper does cite the prior work it builds on: the T5 text encoder (from Imagen), data recaptioned with GPT-4V, and an improved SD VAE that they also open-sourced.
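The recaptioning idea is simple enough to sketch. Below is a minimal, hypothetical version of the caption-blending step: each image gets a detailed synthetic caption with high probability, falling back to the original alt-text otherwise (the paper reports mixing synthetic and ground-truth captions at roughly a 95/5 ratio). The function name, example captions, and exact ratio here are illustrative assumptions, not the paper's code.

```python
import random

def blend_captions(original, synthetic, synthetic_ratio=0.95, rng=None):
    """For each image, use the detailed synthetic caption with probability
    synthetic_ratio, otherwise keep the original alt-text caption.
    (Illustrative sketch; the ~95% figure is what the paper reports.)"""
    rng = rng or random.Random(0)  # seeded for reproducibility in this demo
    return [s if rng.random() < synthetic_ratio else o
            for o, s in zip(original, synthetic)]

# Placeholder captions standing in for a real (image, alt-text) dataset.
orig = ["a dog", "a cat"]
synth = ["a golden retriever lying on a sunlit wooden porch",
         "a tabby cat curled up on a blue armchair"]
print(blend_captions(orig, synth))
```

The blend matters because training on 100% synthetic captions would bias the model toward the captioner’s phrasing, while the small fraction of original captions keeps it robust to terse prompts.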
I wish they had published their hyperparameters, but alas.
What else did you want to see from the paper?