mgostIH@alien.top to Machine Learning@academy.garden • [R] Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
> because there’s no obvious way to parallelize the causal self-attention with a FF
You can just use triangular matrices; autoregressive language modelling can be done even with linear-only layers. See page 12 of https://arxiv.org/abs/2309.06979
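A minimal sketch of the triangular-matrix idea (not the method from the linked paper): a linear layer that mixes token positions through a lower-triangular weight matrix, so the output at position t only depends on positions ≤ t. The class name and shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CausalLinearMix(nn.Module):
    """Mix token positions with a lower-triangular weight matrix,
    so position t only sees positions <= t (causal/autoregressive)."""

    def __init__(self, seq_len: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len ** 0.5)
        # Lower-triangular mask enforces the causal structure.
        self.register_buffer("mask", torch.tril(torch.ones(seq_len, seq_len)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); mix along the sequence dimension only.
        w = self.weight * self.mask
        return torch.einsum("ts,bsd->btd", w, x)


if __name__ == "__main__":
    layer = CausalLinearMix(seq_len=8)
    x = torch.randn(2, 8, 16)
    y = layer(x)
    # Perturbing future tokens must not change earlier outputs.
    x2 = x.clone()
    x2[:, 5:] += 1.0
    y2 = layer(x2)
    assert torch.allclose(y[:, :5], y2[:, :5])
    print(y.shape)  # torch.Size([2, 8, 16])
```

Training such a layer is still fully parallel over the sequence, since the mask is applied to the weights rather than requiring any sequential loop.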
Previous discussion of Fast Feed Forward, an earlier paper by the same author that this one builds on.