@llama_in_sunglasses

llama_in_sunglasses@alien.top · 11 months ago

With Llama-2-70b-chat-E8P-2Bit from their zoo, quip# seems fairly promising. I’d have to try l2-70b-chat in exl2 at 2.4 bpw to compare but this model does not really feel like a 2 bit model so far, I’m impressed.

llama_in_sunglasses@alien.top · 11 months ago

To create quants of new models, one has to create Hessians for it and it uses several GB of RedPajama to calibrate these. Generating Hessians for Mistral is taking 17 minutes per LAYER on my 3090. I’ll see if it can even finish later. Much later. That’s over 16 hours just to quantize a 7B model, yikes.

The paper for this is one of the worst for me in years, full on “I know some of these words.” I didn’t think 8-dimensional sphere packing was going to be in my attempted light reading for the night.

P…S.: Rollback to transformers 4.34.0 or edit the code in hessian_offline_llama.py and change all instances of

attention_mask = model.model._prepare_decoder_attention_mask(

to

attention_mask = _prepare_4d_causal_attention_mask(

and add an import to the top of the same file.

from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask

llama_in_sunglasses@alien.top · 11 months ago

Were you involved? I think this has a pretty good chance of winding up a library, HF transformers is a legit overwrought mess and given that I scanned through most of the code just taking a look inside, that’s an impressively low line count for something that looks like it can load all of the llama family members.

llama_in_sunglasses@alien.top · 11 months ago

LoneStriker has a 2.4 bpw quant up: https://huggingface.co/LoneStriker/deepseek-llm-67b-chat-2.4bpw-h6-exl2

llama_in_sunglasses@alien.top · 11 months ago

I use dolphin-yi because it listens the best of the Yi finetunes, but I find myself screwing around with the settings for Yi more than most. I pick a different preset and tweak it if it starts looping itself.

llama_in_sunglasses@alien.top · 11 months ago

I mean, voodoo and all forms of magic with a k are basically art in my opinion.

llama_in_sunglasses@alien.top · 11 months ago

load-in-4bit takes a long time to load a model and the performance is poor in both speed and output quality.

I have compared a bunch of quant methods at https://desync.xyz/ for Mistral, llama-7b, orca2-13b if you are interested.

llama_in_sunglasses@alien.top · 11 months ago

If I prompt a frankenmerge with the usual instruct dreck I use, they fail to answer numerous questions in a useful manner. However, it’s a different story using them in chat mode or probably anything creative - the outputs can be coherent but feel way less AI-like.

llama_in_sunglasses@alien.top · 11 months ago

https://huggingface.co/chargoddard/llama2-22b/blob/main/frankenllama_22b.py shows how the tensors are padded up to fit

llama_in_sunglasses@alien.top · 11 months ago

At very least, you should be able to merge any 2 models with the same tokenizer via element-wise addition of the log probs just before sampling. This would also unlock creative new samplers. IE instead of adding logprobs, maybe one model’s logprobs constrains the other’s in interesting ways.

What, run two models at once? This doesn’t seem cost-effective for what you’d get.

Most merges that are popular are weight mixes, where portions of different models are averaged in increasingly complex ways. Goliath is a layer splice, sections of Xwin and Euryale were chopped up and interweaved together. This is the kind of merge I’m interested in but getting useful models out of the process is way more art than science.

llama_in_sunglasses@alien.top · 11 months ago

I made one too, but 34B Yi output is probably better. This model is worse at 2.9bpw compared to regular Tess-M at 4.6bpw and all of the usual Yi issues like repetition are worse. I uploaded it but I find it personally lacking. Also, uploading 50B+ models to HF is seriously a pain in the ass.

https://huggingface.co/lodrick-the-lafted/Kaiju-A-57B

llama_in_sunglasses@alien.top · 1 year ago

Don’t shuffle well, keep it in chunks.

llama_in_sunglasses@alien.top · 1 year ago

I agree, perplexity isn’t a good metric for comparing output quality. I did some deterministic tests and it’s interesting to see how far into the generation it takes before the answers start to diverge from the original model in fp16.

https://desync.xyz/mistral.html

https://desync.xyz/openhermes2.5.html

llama_in_sunglasses@alien.top · 1 year ago

I’ve tested pretty much all of the available quantization methods and I prefer exllamav2 for everything I run on GPU, it’s fast and gives high quality results. If anyone wants to experiment with some different calibration parquets, I’ve taken a portion of the PIPPA data and converted it into various prompt formats, along with a portion of the synthia instruction/response pairs that I’ve also converted into different prompt formats. I’ve only tested them on OpenHermes, but they did make coherent models that all produce different generation output from the same prompt.

https://desync.xyz/calsets.html

llama_in_sunglasses@alien.top · 1 year ago

I had okayish results blowing up layers from 70b… but messing with the first or last 20% lobotomizes the model, and I didn’t snip more than a couple layers from any one place. By the time I got the model far enough down in size that q2_K could load in 24GB of VRAM it fell apart, so I didn’t consider mergekit all that useful of a distillation/parameter reduction process.

llama_in_sunglasses@alien.top · 1 year ago

Sorry to hear that. This thread is pretty wild, almost every other model thread on LocalLlama has at most a few crazies and they get downvoted. Your Synthia models are fairly popular, so the reactions you got seems pretty out of place to me.

llama_in_sunglasses@alien.top · 1 year ago

Thanks for the model, it’s really nice to have some synthia magic on a Yi-34B 200K base.

Part of the generation from your suggested prompt:

The magnetic field of our planet is generated by an iron-nickel core that rotates like a dynamo, creating electric currents which in turn produce the magnetic force we experience as compass needles pointing northward when held still relative to this field’s direction over time periods measured in years rather than seconds or minutes because it varies slightly due to solar wind interactions with upper layers known collectively as “ionosphere.”

I found this particular output unintentionally hilarious because it reminds me a lot of the reddit comments I type out then delete because it’s just some overexplainy run-on gibberish.

llama_in_sunglasses@alien.top · 1 year ago

GGUF k-quants are really good at making sure the most important parts of the model are not x bit but q6_k if possible. GPTQ and AWQ models can fall apart and give total bullshit at 3 bits while the same model in q2_k / q3_ks with around 3 bits usually outputs sentences.

llama_in_sunglasses@alien.top · 1 year ago

If you’re planning on running the models entirely on the GPUs, your choice of CPU won’t really affect the speeds you are getting. I’d go with the Intel since this is your first PC build, I built a 7950X rig a couple months ago. I didn’t have problems getting it to boot, but it absolutely had a fit over running 4 sticks of DDR5-6000 at their rated speed. The rated speed is really only valid for 2 sticks.

llama_in_sunglasses@alien.top · 1 year ago

The lopsided CCUs in X3D parts are not the same as the ones on 7900X/7950X. The cache ensures that you need a scheduler that can put loads that need it on the cache-enabled portion and that’s asking a lot from a scheduler. The AMD parts without extra cache don’t suffer from this issue… it’s why I got a 7950X, but the 7900X is also fine and all three of these CPUs will be entirely limited by memory bandwidth if used for CPU inference.