  • Most of these are (parts of) EOS (end-of-sequence) tokens. The model is supposed to emit an EOS token to signal that inference is done; without one, it would keep generating until the max new tokens limit is hit.

    Unfortunately, some models, especially merges of models with different prompt formats, can get confused and output the wrong token or turn the special token into a regular string. In that case, adding that string (or a part of it) to the custom stopping strings list ensures that inference still concludes properly.

    In addition to that, I put the asterisk followed by the username there to catch the model trying to act as the user, just like the software by default already includes the username followed by a colon to catch the model trying to talk as the user (see the sketch below).
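
    As a minimal sketch of the idea (not the software’s actual implementation; the stop strings below are only examples), matching custom stopping strings against the generated text can look like this:

    ```python
    # Hypothetical sketch: truncate generated text at the earliest custom
    # stopping string, mirroring what inference backends do internally.
    # The stop strings here are illustrative examples, not real defaults.
    STOP_STRINGS = ["</s>", "<|im_end|>", "*username", "username:"]

    def apply_stopping_strings(text: str, stops=STOP_STRINGS):
        """Return (text truncated at the earliest stop string, hit flag)."""
        earliest = None
        for stop in stops:
            idx = text.find(stop)
            if idx != -1 and (earliest is None or idx < earliest):
                earliest = idx
        if earliest is None:
            return text, False          # no stop string found; keep going
        return text[:earliest], True    # conclude inference early

    # Example: the model starts acting as the user.
    print(apply_stopping_strings("Sure, I can help!\n*username waves*"))
    # -> ('Sure, I can help!\n', True)
    ```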


  • My AI Workstation:

    • 2 GPUs (48 GB VRAM): Asus ROG STRIX RTX 3090 O24 Gaming White Edition (24 GB VRAM) + EVGA GeForce RTX 3090 FTW3 ULTRA GAMING (24 GB VRAM)
    • 13th Gen Intel Core i9-13900K (24 Cores, 8 Performance-Cores + 16 Efficient-Cores, 32 Threads, 3.0-5.8 GHz)
    • 128 GB DDR5 RAM (4x 32GB Kingston Fury Beast DDR5-6000 MHz) @ 4800 MHz ☹️
    • ASUS ProArt Z790 Creator WiFi
    • 1650W Thermaltake ToughPower GF3 Gen5
    • Noctua NH-D15 Chromax.Black (super silent)
    • ATX-Midi Fractal Meshify 2 XL
    • Windows 11 Pro 64-bit

    I’m still on NVIDIA driver 531.79. If you have a newer one, did you set it up to crash instead of swapping to system RAM when VRAM is full?
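
    To catch silent spilling (with newer drivers it shows up as a sudden slowdown rather than an out-of-memory error), a quick VRAM check with the nvidia-ml-py (pynvml) bindings might look like the sketch below; it simply polls every visible GPU:

    ```python
    # Hypothetical sketch: poll VRAM usage on all visible GPUs via
    # nvidia-ml-py (pynvml). If "used" sits at the 24 GB ceiling while
    # generation slows down, the driver is likely swapping to system RAM.
    import pynvml

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {mem.used / 2**30:.1f} / "
              f"{mem.total / 2**30:.1f} GiB VRAM used")
    pynvml.nvmlShutdown()
    ```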


  • Already saw and read your post, saved it, and added Misted-7B to the top of my 7B TODO list. :)

    I’m not sure what causes the misspellings; probably the low quant and the frankenmerging combined.

    I do see misspellings and grammar mistakes when using the English models in German, even the biggest ones, but it’s worse with smaller models. They understand full well what is said but can’t write German as flawlessly as English, and that’s apparently the case at any quant. It’s probably because there’s less high-quality German in the training data compared to English, and the fewer parameters a model has, the less (language) understanding and knowledge it retains, so it makes more mistakes.


  • I did a speed benchmark months ago and picked Q4_0 because of that. Nowadays I’d prefer to use Q4_K_M, but I try to minimize differences between tests for maximum comparability, so I’ve intentionally stayed on this quant level. (I did make some exceptions: EXL2 because it’s so much faster than GGUF, and Airoboros at Q4_K_M because its Q4_0 was broken.) A quick speed-test sketch follows after this comment.

    Now that I’m done with these tests (they go back weeks/months and allow comparisons between different sizes, too, since they were all run the same way and with as similar a setup as possible), I’m free to change the tests and setup. I’d like to expand into harder questions so it’s not as crowded at the top (I’m still convinced GPT-4 is far ahead of our local models, but the gap seems to be narrowing, and more demanding tests could show that more clearly).
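
    For the speed side of such a benchmark, a minimal tokens-per-second check with llama-cpp-python could look like this; the model path, prompt, and settings are placeholders, not my actual test setup:

    ```python
    # Hypothetical sketch: rough tokens/sec for a GGUF quant via
    # llama-cpp-python. Model path, prompt, and settings are placeholders.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/model.Q4_K_M.gguf",  # swap in Q4_0 etc. to compare
        n_gpu_layers=-1,   # offload all layers to the GPU(s)
        n_ctx=4096,
        verbose=False,
    )

    start = time.time()
    out = llm("Explain EOS tokens in one paragraph.", max_tokens=200)
    elapsed = time.time() - start

    generated = out["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.1f}s "
          f"-> {generated / elapsed:.1f} tokens/sec")
    ```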