I tried one yesterday for the first time and I came away very impressed. It’s not great (yet) for RP but the basis seems very robust. For a laugh, I tried the usual reasoning tests and damn, it passed them all first try:
AI
How can I help you today?
You
There is a room with three murderers inside. Another person enters the room and shoots one of the three original occupants dead. Nobody else enters or leaves the room. How many murderers are there in the room?
AI
Four. Three original occupants who were already murderers and the fourth being the new person who entered the room and killed one of them.
You
I buy ten apples at the market. I give two of them to my friend Bob and two to my mother. Then, I go back to the market and buy three more apples. On my way home, I eat one apple. At home, I give my brother three bananas. How many apples do I have left? Explain your reasoning step by step.
AI
Let’s break it down step-by-step:
You start with 10 apples.
You give away 2 apples to Bob (leaving you with 8 apples) and then give away another 2 apples to your mother (leaving you with 6 apples).
Next, you purchase 3 additional apples from the market, bringing your total up to 9 apples again.
While on your way home, you consume 1 apple, leaving you with 8 apples remaining.
Finally, when you reach home, you give your brother 3 bananas - but this doesn’t affect the number of apples you have, since they are different fruits. So, you still have 8 apples left.
You
Sally has five brothers. Each of her brothers has one sister. How many sisters does Sally have? Explain your reasoning step by step.
AI
Here’s how we can reason through this problem:
Sally has five brothers.
It states that each of her brothers has one sister.
Since all these siblings belong to the same family, it means that Sally is the only sister mentioned for those five brothers.
Therefore, Sally has no other sisters besides herself.
So, Sally has 0 sisters.
Like, seriously? Huge context, really fast (on my 3090 anyway) and this level of performance. I can’t wait to see what people can do with this.
Looks promising! Tried the GGUF model from TheBloke at 16k context but got some repetition and some /\/\//\ answers with the original template. Which templates are you guys using?
I’ve been having trouble getting it to run with exllama2_HF in text-gen-webui. Did you run into any issues?
Try just exllama2, no HF.
TheBloke just quantized his newest version of this model. I’m downloading it right now =D
But I’m with you: Capybara-Tess-Yi is amazing; I don’t RP so I can’t speak to that, but for a conversational model that does basic ChatGPT tasks? It’s amazing.
I’m still trying to figure out the correct settings for under 200k context. Ooba loads compress_emb (or whatever it’s called) at 5 million and I dunno if you should leave it alone or change it if you change the context size to, say, 64k.
No setting changes, as if the model is 200K native.
Don’t touch the truncate length setting in the UI or it’ll be stuck at 32k until resetting the server.
Does the 200K mean that it has up to 200k context size? Is the context limited by the model, or can you just set it to whatever as long as you have enough VRAM? Also, if a GGUF model takes 20GB of VRAM, for example, is that with the “default” context size? Can it be less if you decrease the context or more if you increase it?
The base Yi can handle 200k. The version I used can do 48k (though I only tested 16k so far). Larger context size requires more VRAM.
The size that TheBloke gives for GGUF is the minimum size at 0 context. As context increases, VRAM use increases.
Do you know how to estimate how much memory the context will need?
34B Model Memory Requirements (inference)
Sequence Length vs Bit Precision

 SL / BP |    4    |    6    |    8    |    16
---------|---------|---------|---------|---------
     512 |  15.9GB |  23.8GB |  31.8GB |  63.6GB
    1024 |  16.0GB |  23.9GB |  31.9GB |  63.8GB
    2048 |  16.1GB |  24.1GB |  32.2GB |  64.3GB
    4096 |  16.3GB |  24.5GB |  32.7GB |  65.3GB
    8192 |  16.8GB |  25.2GB |  33.7GB |  67.3GB
   16384 |  17.8GB |  26.7GB |  35.7GB |  71.3GB
   32768 |  19.8GB |  29.7GB |  39.7GB |  79.3GB
   65536 |  23.8GB |  35.7GB |  47.7GB |  95.3GB
  131072 |  31.8GB |  47.7GB |  63.7GB | 127.3GB
  262144 |  47.8GB |  71.7GB |  95.7GB | 191.3GB
Here’s the formula for the KV cache:
batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per Float16) * n_kv_heads
https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
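If you want to plug numbers into that formula yourself, here’s a rough Python sketch. The defaults are Yi-34B’s published config values (60 layers, 7168 hidden size, 56 attention heads, 8 KV heads); treat the results as ballpark figures, and note they won’t line up exactly with the generic 34B table above.

# Rough KV-cache size estimate based on the formula above.
# Defaults are Yi-34B's published config values; swap in your own model's.
def kv_cache_bytes(seqlen, batch_size=1, n_layers=60, d_model=7168,
                   n_heads=56, n_kv_heads=8, bytes_per_elem=2):
    d_head = d_model // n_heads  # 7168 / 56 = 128
    # 2 = one K and one V tensor per layer; bytes_per_elem = 2 for FP16, 1 for an 8-bit cache
    return batch_size * seqlen * d_head * n_layers * 2 * bytes_per_elem * n_kv_heads

for ctx in (4096, 16384, 32768, 65536):
    fp16 = kv_cache_bytes(ctx) / 1024**3
    int8 = kv_cache_bytes(ctx, bytes_per_elem=1) / 1024**3
    print(f"{ctx:>6} tokens: ~{fp16:.1f} GiB (FP16 cache), ~{int8:.1f} GiB (8-bit cache)")

Add that on top of whatever the quantized weights take and you get a rough total VRAM figure.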
With this particular model, I can crank it up to 32k if I enable “Use 8-bit cache to save VRAM” and that’s as high as it can go in Oobabooga WebUI.
32k seems to be hard coded in oobabooga. At least it is for truncate length max. There’s a patch to be made to fix it.
I know but it’s slowing down quite a bit at 32k already so I don’t think it’s worth pushing it further. But hey, even at just 16k it’s four times what we usually get, so I’m not complaining.
You can change it yourself, although for now it needs to be re-edited with every update. Just put something like 200000 in these 2 places:
Thanks a lot! I wasn’t sure how context affected VRAM usage. So each model has a maximum context size and using more context will take more VRAM. Thanks!
Another thing to note is that the exllamav2 backend is “special” because its context takes up less VRAM than the context in other backends. So let’s say the weights take 18GB, and your context takes up 6GB for a GGUF model. In exllama that’s only 3GB taken up by the context with the 8-bit cache.
There are other complications like the prompt processing batch size, but that’s the gist of it.
This makes a dramatic difference when the context gets huge. I’d prefer to use koboldcpp myself, but I just can’t really squeeze it on my 3090 without excessive offloading.
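To make the savings concrete, here’s a back-of-the-envelope fit check in Python using the figures from my post above (18GB weights, 6GB context); treating the 8-bit cache as roughly halving the context is an approximation, not an exact measurement.

weights_gb = 18.0              # quantized model weights (figure from the post above)
ctx_fp16_gb = 6.0              # KV cache / context at 16-bit, e.g. a GGUF backend
ctx_int8_gb = ctx_fp16_gb / 2  # exllamav2 "Use 8-bit cache" roughly halves it

print(f"16-bit cache: {weights_gb + ctx_fp16_gb:.0f} GB total")  # ~24 GB: right at a 3090's limit
print(f" 8-bit cache: {weights_gb + ctx_int8_gb:.0f} GB total")  # ~21 GB: leaves some headroom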
Very good to know! I haven’t fiddled with the new yi models too much yet since I was running into these exact issues. I’ll definitely use this solution soon, thanks.
Interesting! I had more success for some reason with GGUF models, as those work everywhere using koboldcpp and ooba’s. I didn’t know that exllamav2 was better for context; I’ll try it. That backend is for the EXL2 format, right? I had the impression it was just better for speed; I didn’t know the context takes up less VRAM.
If you want the best 34b RP, try spicyboros-limarpv3. When I use other 34b models like Tess or Nous-capy, they are not bad, but they tend to get confused with the scene from time to time. However, this wasn’t happening with spicyboros at all. It is indistinguishable from 70b, and I’ve tried a lot of 70b models.
The first one should be 3, right? Since one of the original 3 is dead?
I tried base yi-34-chat yesterday and it felt like the golden times of character.ai again. I imported my c.ai character card with 3-4k tokens, extended the context to 8k, and it’s just the right model for the job. It even followed the short hints about how the character should behave, unlike the original c.ai model. Sure, fine-tuning on RP chats could make it even better, but I don’t think I’ll move away from it in the near future.
I had great fun with this specific model. Tried up to 32K context length with very minimal repetition problems…