Nouse-Capybara-34B 200K

mcmoose1900@alien.top · 2 years ago

Probably a Vulkan driver quirk. I would report it to Huawei, somehow, if you can. They probably have some business interest in running LLMs on their phones, and would fix it if they knew.

mcmoose1900@alien.top · 2 years ago

In addition to what others said, exl2 is very sensitive to the quantization dataset, which it uses to choose where to assign those “variable” bits.

Most online quants use wikitext. But I believe if you quantize models yourself on your own chats, you can get better results, especially below 4bpw.

mcmoose1900@alien.top · 2 years ago

Yeah. I was just thinking the Twitter corpus is, on average, a much more trollish and spammier dataset than the general internet (which is a high bar).

mcmoose1900@alien.top · 2 years ago

Have you considered running a Yi 200K model instead?

mcmoose1900@alien.top · 2 years ago

See the end result of this: https://chirper.ai/

Its… actually pretty pleasant without any humans, lol.

mcmoose1900@alien.top · 2 years ago

Lol, TheBloke should probably be manually excluded from this.

mcmoose1900@alien.top · 2 years ago

There are actually TSVs for 3D Cache on the AMD 7900 series, but AMD doesn’t use them. Presumably because it makes the chip run hotter, so they’d have to downclock it.

But I think it would be a great candidate for an ML card. Not for directly accelerating models, but for basically fitting any kind of intermediate calculations in cache to preserve all the RAM bandwidth for model weights.

mcmoose1900@alien.top · 2 years ago

I got excited when Gaudi was listed… But it was on sale precisely nowhere, lol.

mcmoose1900@alien.top · 2 years ago

Git rebasin claims to do this.

But its untested on large models. There is a branch for it in mergekit, as well as a stable diffusion implementation (which works fantastically as a regular merger).

mcmoose1900@alien.top · 2 years ago

Many reasons:

AutoModelForCausalLM is extremely slow compared to other backends/quantizations, even with augmentations like BetterTransformers.
It also uses much more VRAM than other quantization, especially at high context.
Its size is inflexible.
Loads slower
No CPU offloading
Its potentially lower quality than other quantization at the same bpw

mcmoose1900@alien.top · 2 years ago

Try exui instead of ooba.

mcmoose1900@alien.top · 2 years ago

No exui?

https://github.com/turboderp/exui

Its blazing fast, vram efficient, supports minp and has a notebook mode… what else could I ask for.

I was using ooba before, but I have dumped it because its so much slower.

mcmoose1900@alien.top · 2 years ago

Another thing to note is that the exllamav2 backend is “special” because its context takes up less vram than the context in other backends. So lets say the weights take 18GB, and your context takes up 6GB for a gguf model. In exllama thats only 3GB taken up by the context with the 8 bit cache.

There are other complications like the prompt processing batch size, but thats the jist of it.

This makes a dramatic difference when the context gets huge. I’d prefer to use koboldcpp myself, but I just can’t really squeeze it on my 3090 without excessive offloading.

mcmoose1900@alien.top · 2 years ago

No setting changes, as if the model is 200K native.

mcmoose1900@alien.top · 2 years ago

I’m sick of Nvidia’s VRAM business model

At the top end, they are actually limited by how much they can physically hang off the die (48GB for current silicon, or 196GB(?) for the interposer silicon).

But yeah, below that its price gouging. What are ya gonna do, buy an Arc?

AMD is going along with this game too. You’d see a lot more 7900s on this sub, and on GitHub, if AMD let their manufacturers double up the VRAM to 48GB.

mcmoose1900@alien.top · 2 years ago

vLLM is way faster, but its pretty barebones and VRAM spikes hard.

mcmoose1900@alien.top · 2 years ago

If you must run at high precision… the best system in that budget is probably a compromise?

Grab a 3090 or 3060 and slap it on the most RAM bandwidth you can get, with a more modest CPU. The GPU will offload prompt processing and enough response layers to help.

mcmoose1900@alien.top · 2 years ago

Did you try disabling the BOS token?

mcmoose1900@alien.top · 2 years ago

Yeah. I am running the exl2 quant though, it lets you fit a ton of context.

Tess 1.1 by itself is also said to be good: https://huggingface.co/models?sort=modified&search=tess+1.1

mcmoose1900@alien.top · 2 years ago

Tess and Capybara 34B are both excellent, I’m just running them merged. There’s really nothing else with such long context, and that is so smart at handling the long context.

mcmoose1900@alien.top · 2 years ago