@MichalO19

MichalO19@alien.top · 2 years ago

If I understood the original explanation on GitHub for RWKV correctly, BlinkDL agrees that softmax attention is very capable in theory, but he thinks Transformers are not using it to its full potential, so theoretically less capable architectures can beat them.

This might be true, but I kind of doubt it. I played a bit with the 3B RWKV with a prompt like:

User: What is the word directly after "bread" in the following string "[like 20 random words]" 
Assistant: The word directly after "bread" is "

(note that the question comes before the data, which is the ordering RWKV prefers, but I tested the other way around too) and unless the query word is very early in the string, it gives me a random word. Even 1.3B transformer models seem to answer this correctly much more often (though not always).
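
For anyone who wants to repeat this kind of probe, here is a minimal sketch of how such prompts could be generated. The word pool, the `make_probe` helper and its parameters are all made up for illustration (the exact words used in the test above are not recorded), and the model call itself is left out.

```python
import random

# Hypothetical word pool; the words actually used in the test are not recorded.
WORDS = ["apple", "river", "stone", "cloud", "tiger", "piano", "lemon", "chair",
         "ocean", "candle", "rocket", "garden", "mirror", "pencil", "window",
         "forest", "butter", "silver", "planet", "bridge", "pillow", "marble"]

def make_probe(n_words=20, query="bread", position=None, seed=0):
    """Build a 'word directly after X' prompt and return (prompt, expected_answer)."""
    rng = random.Random(seed)
    words = rng.sample(WORDS, n_words)
    # The query word must not be last, otherwise there is no word after it.
    if position is None:
        position = rng.randrange(n_words - 1)
    words[position] = query
    expected = words[position + 1]
    text = " ".join(words)
    prompt = (
        f'User: What is the word directly after "{query}" in the following string "{text}"\n'
        f'Assistant: The word directly after "{query}" is "'
    )
    return prompt, expected

# Put the query word deep into the string, where the failures showed up.
prompt, expected = make_probe(position=12)
print(prompt)
print("expected:", expected)
```

Sweeping `position` from 0 to 18 and comparing the model's completion with `expected` would give a rough accuracy-by-position curve for each model.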

MichalO19@alien.top · 2 years ago

If I am reading this RWKV_v5_demo.py right, this is essentially a Retentive Network (so a Linear Transformer) but without the positional encoding, with the token shifts from previous RWKVs, and with trainable matrix-valued decay factors (instead of fixed decay factors like in RetNet).
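
To make that reading concrete, here is a rough NumPy sketch of a linear attention layer with token shift and a decay on the state. This is not the code from RWKV_v5_demo.py: the function names are invented, the decay is assumed to be one value per key channel, and things like multiple heads, gating and normalization are left out. It only checks that the recurrent form and the attention-like form without softmax give the same output.

```python
import numpy as np

def token_shift(x, mu):
    """Mix each token with the previous one: mu * x_t + (1 - mu) * x_{t-1}."""
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return mu * x + (1.0 - mu) * prev

def decayed_linear_attention_recurrent(q, k, v, w):
    """Recurrent form: S_t = diag(w) S_{t-1} + k_t v_t^T, o_t = q_t S_t."""
    T, d_k = k.shape
    S = np.zeros((d_k, v.shape[1]))  # the whole memory is this one matrix
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        S = w[:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

def decayed_linear_attention_parallel(q, k, v, w):
    """Same computation written as causal attention with no softmax."""
    T = k.shape[0]
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        for s in range(t + 1):
            # Each key channel is decayed by w ** (distance between t and s).
            score = np.sum(q[t] * w ** (t - s) * k[s])
            out[t] += score * v[s]
    return out

rng = np.random.default_rng(0)
T, d = 6, 4
x = token_shift(rng.normal(size=(T, d)), mu=0.5)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
w = rng.uniform(0.5, 1.0, size=d)  # stand-in for the trainable decay

a = decayed_linear_attention_recurrent(q, k, v, w)
b = decayed_linear_attention_parallel(q, k, v, w)
assert np.allclose(a, b)
```

In this sketch a fixed `w` shared across channels would correspond to the RetNet-style fixed decay, while a learned per-channel `w` is my simplified stand-in for the trainable decay.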

Gotta say it’s a pretty clean architecture but I will believe it surpasses Mistral when I see it. I don’t think a linear transformer has a serious chance to beat a standard transformer with the same number of parameters.

It might have a chance for general 0-shot question answering, but I expect it to be much worse at in-context learning and memory tasks in particular, simply because softmax attention is way more capable than linear attention as a learning algorithm. Theoretically it can learn any key->value mapping in context, while linear attention by definition can only learn linear key->value mappings (whatever that means in the embedding space), and it also risks double-writing into memory things it already knows.
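
A toy illustration of what I mean, under heavy assumptions: random unit-norm keys, random values, no decay, no training, and a sharp softmax (a large `beta`, standing in for large query/key norms). The helpers `softmax_read` and `linear_read` are made up for this sketch. It stores more pairs than the key dimension, reads one back with its own key, and then writes the same pair a second time.

```python
import numpy as np

def softmax_read(q, K, V, beta=1.0):
    """Softmax attention read: a weighted average of values, sharp for large beta."""
    scores = beta * (K @ q)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def linear_read(q, K, V):
    """Linear attention read: the memory is the matrix S = sum_s k_s v_s^T."""
    S = K.T @ V
    return q @ S  # a linear function of q, so stored pairs interfere

rng = np.random.default_rng(0)
n, d = 32, 8  # more stored pairs than key dimensions
K = rng.normal(size=(n, d))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(n, d))

q = K[5]  # query with a key that was stored
err_soft = np.linalg.norm(softmax_read(q, K, V, beta=100.0) - V[5])
err_lin = np.linalg.norm(linear_read(q, K, V) - V[5])
print("softmax read error:", err_soft)
print("linear read error: ", err_lin)

# Double-writing: store pair 5 a second time.
K2 = np.vstack([K, K[5:6]])
V2 = np.vstack([V, V[5:6]])
extra = linear_read(q, K2, V2) - linear_read(q, K, V)
err_soft_dup = np.linalg.norm(softmax_read(q, K2, V2, beta=100.0) - V[5])
print("linear read gained a full extra copy of the value:", np.allclose(extra, V[5]))
print("softmax read error after the duplicate write:", err_soft_dup)
```

The softmax read is an average, so a duplicate entry just shares the weight, while the linear memory adds the value in again. A trained linear model could of course learn keys that interfere less than random ones, so this is only the failure mode, not a measurement.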

But hey, let’s see.