[R] Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

APaperADay@alien.top · 1 year ago

ganzzahl@alien.top · 1 year ago

The reasons this isn’t done:

Fixed maximum sequence length (shorter sequences don’t need any less computation)
Very short maximum sequence length (50 tokens in this paper!)
Very inefficient training (for a target sequence with N tokens, this requires N forward passes through the decoder, as opposed to 1 with attention, because there’s no obvious way to parallelize the causal self-attention with a FF layer)
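
To make the first and third points concrete, here is a rough sketch in plain Python. It assumes a replacement layer that only accepts inputs of one fixed length and has no causal masking; MAX_LEN, PAD and padded_prefixes are hypothetical names for illustration, not anything taken from the paper. Each target position then needs its own padded prefix, so a 4-token target turns into 4 separate full-length inputs.

```python
MAX_LEN = 6  # hypothetical fixed maximum sequence length
PAD = 0      # hypothetical padding token id


def padded_prefixes(tokens, max_len=MAX_LEN, pad=PAD):
    """Build one fixed-length input per prefix of `tokens`.

    Without a causal mask the layer cannot hide future tokens,
    so every prefix has to go through its own forward pass.
    """
    if len(tokens) > max_len:
        raise ValueError("sequence exceeds the fixed maximum length")
    inputs = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        # Every input is padded to max_len, so a short prefix costs
        # the same as a long one.
        inputs.append(prefix + [pad] * (max_len - len(prefix)))
    return inputs


target = [11, 12, 13, 14]
batch = padded_prefixes(target)
num_forward_passes = len(batch)  # one pass per target token
```

With causally masked attention, the same 4 positions would be handled in a single pass over one sequence.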

mgostIH@alien.top · 1 year ago

“because there’s no obvious way to parallelize the causal self-attention with a FF”

You can just use triangular matrices: autoregressive language modelling can be done even with linear-only layers. See page 12 of https://arxiv.org/abs/2309.06979
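
A minimal sketch of the triangular idea, in plain Python. This is only an illustration, not the construction from the linked paper, and tril_mask and mix are hypothetical helper names. The layer mixes tokens along the sequence dimension with a weight matrix whose entries above the diagonal are zeroed, so output position i depends only on inputs 0..i and all positions come out of one pass.

```python
import random


def tril_mask(weights):
    """Zero every entry above the diagonal (keep j <= i)."""
    n = len(weights)
    return [[weights[i][j] if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


def mix(weights, xs):
    """Token mixing: out[i] = sum_j weights[i][j] * xs[j].

    `xs` is a list of n token vectors, each a list of d floats.
    """
    n, d = len(xs), len(xs[0])
    return [[sum(weights[i][j] * xs[j][k] for j in range(n))
             for k in range(d)]
            for i in range(n)]


random.seed(0)
n, d = 5, 3  # sequence length and feature size, chosen arbitrarily
W = tril_mask([[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)])
x = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

y = mix(W, x)  # all n positions computed in a single pass

# Change only the last token: earlier outputs must not move.
x_changed = [row[:] for row in x]
x_changed[-1] = [9.0] * d
y_changed = mix(W, x_changed)

earlier_unchanged = all(
    abs(a - b) < 1e-12
    for row_a, row_b in zip(y[:-1], y_changed[:-1])
    for a, b in zip(row_a, row_b)
)
```

Here earlier_unchanged comes out True: editing a later token leaves every earlier output untouched, which is the property teacher-forced training needs in order to run in one pass.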