I’ve been playing with a lot of models around 7B, but I’m now prototyping something that I think would be fine with a 1B model. The only model of that size I’ve seen is Phi-1.5, and I haven’t found a way to run it efficiently so far; llama.cpp still hasn’t implemented support for it, for instance.
Does anyone have an idea of what to use?
RWKV 1.5B. It’s SOTA for its size, outperforms TinyLlama, and, being an RNN, uses no extra VRAM to hold its whole context length, even in the browser.
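If it helps, here’s a minimal sketch of trying it through Hugging Face transformers, which has native RWKV support. The checkpoint ID `RWKV/rwkv-4-1b5-pile` is my assumption; check the RWKV org on the Hub for the exact name you want:

```python
# Minimal sketch: load a ~1.5B RWKV checkpoint and generate a few tokens.
# The model ID below is an assumption; verify it on the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "RWKV/rwkv-4-1b5-pile"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
# RWKV keeps a fixed-size recurrent state, so memory doesn't grow with context
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```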