So RWKV 7B v5 is 60% trained now. The multilingual parts are better than Mistral now, and the English capabilities are close to Mistral, except for HellaSwag and ARC, where it's a little behind. All the benchmarks are on the RWKV Discord, and you can google the pros/cons of RWKV, though most of those are about v4.

Thoughts?

  • cztomsik@alien.top
    10 months ago

    I have my doubts too. RWKV4 was great, but in practice it was always worse than any LLaMA. I think it might be because it's way more sensitive to sampling: every token completely overwrites the previous state, so once generation goes the wrong way, it can never recover. This happens with other architectures too, but there all the data is still in the context and the model can still recover; RWKV has no (previous) context to fall back on, so it can't.
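    The difference can be sketched with a toy comparison (not real RWKV code, and the mixing rule is made up purely for illustration): a recurrent model squashes the whole history into one state, while an attention model keeps the raw context around and can always re-read it.

    ```python
    # Recurrent-style: the entire history is squashed into one state
    # vector. A bad token update is destructive; nothing else survives.
    def recurrent_step(state, token):
        # hypothetical mixing rule, for illustration only
        return [0.9 * s + 0.1 * token for s in state]

    # Attention-style: the raw context is kept verbatim, so the model
    # can always re-read earlier tokens and "recover" from a bad one.
    def attention_step(context, token):
        return context + [token]

    state = [0.0]
    context = []
    for tok in [1.0, 2.0, -99.0, 3.0]:  # -99.0 plays the "wrong" token
        state = recurrent_step(state, tok)
        context = attention_step(context, tok)

    # The recurrent state has permanently absorbed the bad token,
    # while the full context still contains every original token:
    print(state)
    print(context)  # [1.0, 2.0, -99.0, 3.0]
    ```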

    That said, RWKV is awesome and I am super-excited about it. Either we can solve this problem on the sampling side, or we can just slap a small attention block on top of it and then fine-tune them together. Either way, the future is bright in my opinion.

    Also, if you think about it, it's a miracle that such an architecture even works and manages to learn instruction following.

    Also, RWKV is great because you can "freeze" the state, save it, and later just restore it and continue the conversation (or whatever). Together with its small memory requirements, that makes it very compelling for serving multiple users without occupying a lot of GPU memory; and instead of "engineering the prompt" you are really engineering the initial state. Obviously a saved state is more fragile than a fine-tune: the model will "revert" to its default mood sooner.
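    A minimal sketch of that serving pattern, assuming a hypothetical RWKV-like interface where `forward(tokens, state)` returns `(logits, new_state)` (the real `rwkv` pip package exposes a similar signature, but this stand-in just counts tokens so it runs anywhere):

    ```python
    def forward(tokens, state):
        # Stand-in "model": here the state is just the number of tokens
        # seen so far; a real RWKV state would be a small tensor.
        new_state = (state or 0) + len(tokens)
        logits = [0.0]  # placeholder
        return logits, new_state

    # Serve many users with only a tiny state per user, instead of
    # re-processing each user's whole history on every request.
    states = {}

    def chat(user_id, tokens):
        _, states[user_id] = forward(tokens, states.get(user_id))
        return states[user_id]

    # "Prompt engineering" becomes "state engineering": run the system
    # prompt once, save the resulting state, and hand copies of it to
    # new users instead of re-running the prompt for each of them.
    _, initial_state = forward([101, 102, 103], None)  # system prompt
    states["alice"] = initial_state                    # restore, don't re-run
    ```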