@brobruh211

brobruh211@alien.top · 11 months ago

Thanks for the detailed answer! Ubuntu does seem to be much more memory-efficient compared to Windows. However, the problem just fixed itself seemingly overnight. Now I’m not running into out of memory errors. 8-bit cache is a godsend for vram efficiency.
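
For a rough sense of why the 8-bit cache helps so much, here is a back-of-the-envelope sketch. The layer and head counts are the Llama 2 70B config as I understand it (80 layers, 8 KV heads, head dimension 128), the 4k context is just an example, and `kv_cache_bytes` is a made-up helper name, not anything from ExLlamaV2 itself. Real usage will differ a bit because of loader overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed Llama 2 70B shape (grouped-query attention with 8 KV heads).
llama2_70b = dict(n_layers=80, n_kv_heads=8, head_dim=128)

fp16_cache = kv_cache_bytes(**llama2_70b, context_len=4096, bytes_per_value=2)
int8_cache = kv_cache_bytes(**llama2_70b, context_len=4096, bytes_per_value=1)

print(f"16-bit cache: {fp16_cache / 2**30:.3f} GiB")
print(f" 8-bit cache: {int8_cache / 2**30:.3f} GiB")
```

Under those assumptions the cache drops from about 1.25 GiB to about 0.625 GiB at 4k context, which matters a lot when the weights already fill most of a 24GB card.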

brobruh211@alien.top · 11 months ago

Horniness aside, is Goliath really the best model right now for roleplaying? I’m getting a bit of fomo from not being able to run this model locally, so I would like to know if there are 70B or 34B models that hold their own against Goliath in terms of RP. I have 24GB vram so a 2.6bpw 70B (a little unstable) or a 5bpw 34B is the best I can run.
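
In case it helps anyone checking what fits in 24GB, this is the rough arithmetic behind those two options. It is only a sketch: it treats "70B" and "34B" as exact parameter counts (they are nominal), counts weights only, and ignores the cache and any loader overhead. `weight_size_gb` is just an illustrative helper.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for label, params, bpw in [("70B at 2.6bpw", 70, 2.6), ("34B at 5bpw", 34, 5.0)]:
    print(f"{label}: ~{weight_size_gb(params, bpw):.2f} GB of weights")
```

That gives roughly 22.75 GB and 21.25 GB, so both leave very little of the 24GB for context, which would explain why the 70B feels unstable at that size.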

brobruh211@alien.top · 1 year ago

Hi! What are your settings for Ooba to get this to work? On Windows 11 with a single 3090, I keep getting a CUDA out of memory error when trying to load a 2.4bpw 70B model with just 4k context. It’s annoying because this used to work, but after a recent update it just won’t load anymore.

brobruh211@alien.top · 1 year ago

The options you are seeing are different quants of the same model. For 7Bs, you generally want to stick to Q4_K_M and up. Generally, the bigger the file size, the closer its quality is to the original unquantized model.

For 7B models, your 16GB unified memory should be able to run the Q6_K variant with 8192 context size no problem. The model you’re looking at is good but it’s slightly dated at this point. Hard to recommend models without knowing your specific use case for it, but here goes nothing:

TheBloke/OpenHermes-2.5-Mistral-7B-GGUF (creative, decent at following instructions, good for roleplaying but also as an all-around model).
TheBloke/zephyr-7B-beta-GGUF (great at following instructions, good prose, less creative than the above for roleplaying purposes.)
TheBloke/Synatra-7B-v0.3-RP-GGUF (creative model that seems specialized for roleplaying purposes).

I recommend trying out some 13Bs as well. In my experience, a good 13B is still better than a good 7B (for roleplaying purposes at least). With 13Bs, I recommend using Q5_K_M variants with 6144 context size. KoboldCpp sets the RoPE scaling automatically, but I’m not sure how LMStudio handles it. Here are some models you can try out:

KoboldAI/LLaMA2-13B-Tiefighter-GGUF (great all-around model for its intelligence and creativity).
TheBloke/X-NoroChronos-13B-GGUF (creative merged model that seems specialized for roleplaying purposes).
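
If your frontend does not set RoPE scaling for you, the simplest variant (linear scaling) can be worked out by hand. This is a sketch under a couple of assumptions: Llama 2 models have a native context of 4096, and the loader takes either a stretch factor or its inverse as a frequency scale. It is not necessarily the method KoboldCpp picks automatically, and `linear_rope_scale` is a hypothetical helper.

```python
def linear_rope_scale(target_ctx: int, native_ctx: int = 4096):
    """Return (stretch_factor, frequency_scale) for linear RoPE scaling.

    stretch_factor is how far positions are compressed; frequency_scale is its inverse.
    No scaling is needed when the target fits in the native context.
    """
    factor = max(target_ctx / native_ctx, 1.0)
    return factor, 1.0 / factor


factor, freq_scale = linear_rope_scale(6144)
print(f"stretch factor: {factor}, frequency scale: {freq_scale:.4f}")
```

For 6144 context on a 4096-native model that works out to a factor of 1.5. Check which of the two values your loader actually expects before plugging it in.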

brobruh211@alien.top · 1 year ago

The latest version of KoboldCpp v1.50.1 now loads this model properly.

brobruh211@alien.top · 1 year ago

Glad to see dolphin 2.1 still getting love! I too prefer it to even its newer versions when it comes to creative storytelling.

Have you tried any 34Bs? I found even just the Q3_K_M quant of Nous-Capybara 34B to be more creative and coherent than any smaller model I’ve tried. Try using a low-temp preset like Kobold’s Liminal Drift. If on SillyTavern, I found the Roleplay context template and instruct mode to work well with this.

brobruh211@alien.top · 1 year ago

I haven’t tried to run a model that big on CPU RAM only, but running a Q4_0 GGUF of Causal 14B was already mind-numbingly slow on my rig.

As a general rule of thumb, always use as much of your VRAM (GPU RAM) as possible, since running from CPU RAM is much slower. I’m guessing your connection timed out because it just took too long to load/run.
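
If you do end up splitting a GGUF model between GPU and CPU, a crude way to pick how many layers to offload is sketched below. Everything here is an assumption for illustration: the 40 GB file size and 80 layers are round numbers for a 70B Q4-class quant, the 2 GB reserve for cache and overhead is a guess, and the helper pretends every layer is the same size. Treat the result as a starting point and lower it if you still hit out-of-memory errors.

```python
import math


def gpu_layers_that_fit(model_file_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Illustrative numbers only: ~40 GB file, 80 layers, 24 GB card.
print(gpu_layers_that_fit(model_file_gb=40.0, n_layers=80, vram_gb=24.0))
```

With those numbers only about half the layers land on the GPU, which is why a smaller quant that fits entirely in VRAM tends to be the better experience.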

With a 4090, you can actually run lzlv 70B fully on your 24GB VRAM. Let’s not let your amazing GPU go to waste! Try these steps and let me know if it works out for you:

Paste this on the Download box of text-gen-ui: waldie/lzlv-limarpv3-l2-70b-2.4bpw-h6-exl2
Hit download. This should download an ExLlamav2 quant of lzlv that fits in your VRAM.
Select the model from the drop down and just hit Load using the default settings. (Optional) You can tick “Use 8-bit cache to save VRAM”
Enjoy! The perplexity of the file I suggested isn’t as good as lzlv_Q4_K_M’s, but at least you should be able to run it with no problems and get decent outputs as well.