So RWKV 7B v5 is 60% trained now. The multilingual parts are better than Mistral now, and the English capabilities are close to Mistral, except for HellaSwag and ARC, where it's a little behind. All the benchmarks are on the RWKV Discord, and you can google the pros/cons of RWKV, though most of those are about v4.

Thoughts?

  • cztomsik@alien.top
    10 months ago

    I have my doubts too. RWKV4 was great, but in practice it was always worse than any LLaMA. I think it might be because it's way more sensitive to sampling: every token completely overwrites the previous state, so once generation goes the wrong way, it can never recover. This happens with other architectures too, but there all the data is still in the context and the model can still recover; RWKV has no (previous) context to fall back on, so it can't.
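    The difference can be sketched with a toy comparison (not real RWKV code, and the mixing rule is made up purely for illustration): a recurrent model squashes the whole history into one state, while an attention model keeps the raw context around and can always re-read it.

    ```python
    # Recurrent-style: the entire history is squashed into one state
    # vector. A bad token update is destructive; nothing else survives.
    def recurrent_step(state, token):
        # hypothetical mixing rule, for illustration only
        return [0.9 * s + 0.1 * token for s in state]

    # Attention-style: the raw context is kept verbatim, so the model
    # can always re-read earlier tokens and "recover" from a bad one.
    def attention_step(context, token):
        return context + [token]

    state = [0.0]
    context = []
    for tok in [1.0, 2.0, -99.0, 3.0]:  # -99.0 plays the "wrong" token
        state = recurrent_step(state, tok)
        context = attention_step(context, tok)

    # The recurrent state has permanently absorbed the bad token,
    # while the full context still contains every original token:
    print(state)
    print(context)  # [1.0, 2.0, -99.0, 3.0]
    ```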

    That said, RWKV is awesome and I am super-excited about it. Either we can solve this problem on the sampling side, or we can just slap a small attention block on top of it and then fine-tune them together. Either way, the future is bright in my opinion.

    Also, if you think about it, it's a miracle that such an architecture even works and manages to learn instruction following.

    Also, RWKV is great because you can "freeze" the state, save it, and later just restore it and continue the conversation (or whatever). Together with its small memory requirements, that makes it very compelling for serving multiple users without occupying a lot of GPU memory; and instead of "engineering the prompt" you are really engineering the initial state. Obviously a saved state is more fragile than a fine-tune: the model will "revert" to its default mood sooner.
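    A minimal sketch of that serving pattern, assuming a hypothetical RWKV-like interface where `forward(tokens, state)` returns `(logits, new_state)` (the real `rwkv` pip package exposes a similar signature, but this stand-in just counts tokens so it runs anywhere):

    ```python
    def forward(tokens, state):
        # Stand-in "model": here the state is just the number of tokens
        # seen so far; a real RWKV state would be a small tensor.
        new_state = (state or 0) + len(tokens)
        logits = [0.0]  # placeholder
        return logits, new_state

    # Serve many users with only a tiny state per user, instead of
    # re-processing each user's whole history on every request.
    states = {}

    def chat(user_id, tokens):
        _, states[user_id] = forward(tokens, states.get(user_id))
        return states[user_id]

    # "Prompt engineering" becomes "state engineering": run the system
    # prompt once, save the resulting state, and hand copies of it to
    # new users instead of re-running the prompt for each of them.
    _, initial_state = forward([101, 102, 103], None)  # system prompt
    states["alice"] = initial_state                    # restore, don't re-run
    ```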