Diffusion and iterative mask/predict feel pretty conceptually similar to me. My hunch is that diffusion might have a higher ceiling because it can precisely traverse a continuous space, but operating on discrete tokens could plausibly converge to something semantically valid in fewer iterations.
Also, BERT is trained with MLM, which technically is predicting the original text from a "noisy" version, but the noise is only introduced via masking, and it's limited to a single forward pass rather than iterative refinement!
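To make the contrast concrete, here's a toy sketch of iterative mask/predict decoding (Mask-Predict-style confidence-based re-masking, per Ghazvininejad et al. 2019), abusing an off-the-shelf bert-base-uncased checkpoint via Hugging Face transformers. This is purely illustrative and my own assumption of how a minimal loop would look: BERT was only trained for single-pass MLM, so don't expect coherent output, and the linear unmasking schedule is just one common choice.

```python
# Toy sketch: iterative mask/predict with a vanilla BERT MLM.
# Illustrative only -- BERT was trained for single-pass MLM, not this loop.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mask_predict(text: str, num_iterations: int = 4) -> str:
    # The input text here only fixes the sequence length; every non-special
    # token is masked out before the first iteration.
    ids = tokenizer(text, return_tensors="pt")["input_ids"].clone()
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            ids[0].tolist(), already_has_special_tokens=True
        )
    ).bool()
    ids[0, ~special] = tokenizer.mask_token_id  # start fully masked
    n = int((~special).sum())

    for t in range(num_iterations):
        with torch.no_grad():
            logits = model(input_ids=ids).logits[0]   # [seq_len, vocab]
        conf, pred = logits.softmax(-1).max(-1)       # per-token confidence
        ids[0, ~special] = pred[~special]             # commit all predictions

        # Linear schedule: re-mask the least confident tokens, shrinking the
        # masked fraction each iteration until everything is committed.
        num_to_mask = int(n * (1 - (t + 1) / num_iterations))
        if num_to_mask > 0:
            conf = conf.masked_fill(special, float("inf"))  # keep [CLS]/[SEP]
            ids[0, conf.argsort()[:num_to_mask]] = tokenizer.mask_token_id

    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(mask_predict("the cat sat on the mat"))
```

The interesting knob is the re-masking schedule: each pass commits the model's most confident predictions and re-noises the rest, which is the discrete analogue of a diffusion denoising step.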
This has been explored a little for NLP and even audio tasks (using acoustic tokens)!
https://aclanthology.org/2022.findings-acl.25/ and https://arxiv.org/abs/2307.04686 both come to mind