How much does Quantization actually impact models? - KL Divergence Tests

kindacognizant@alien.top · 2 years ago

How much does Quantization actually impact models? - KL Divergence Tests

A_for_Anonymous@alien.top · 2 years ago

Thanks, this is interesting. This all said, it still looks like B is a much more important factor than quantisation down to Q3, meaning a 20B Q3 is going to write better than a 13B fp16. And such it seemed to me personally but I haven’t done any rigorous testing.

slippery@alien.top · 2 years ago

Nice work, thank you.

opi098514@alien.top · 2 years ago

Ok. So im basically an idiot. What does this mean and which one should i use?

JealousAmoeba@alien.top · 2 years ago

Would I get better results in general by running a 7B model with Q8, or a 13B model with Q4/Q5? My laptop can do either.

I’m guessing the quantized 13B model will be better but has anyone ever benchmarked 7B vs 13B for different levels of quantization?

LOLatent@alien.top · 2 years ago

I’m in the exact same boat, if you get an answer, pls lettus know! 7b q8 or 13b q4?

erikqu_@alien.top · 2 years ago

Reminds me of pruning, pruning has been shown to have little impact on model performance in other areas, although I haven’t seen it applied to this space much (afaik)

llama_in_sunglasses@alien.top · 2 years ago

I agree, perplexity isn’t a good metric for comparing output quality. I did some deterministic tests and it’s interesting to see how far into the generation it takes before the answers start to diverge from the original model in fp16.

https://desync.xyz/mistral.html

https://desync.xyz/openhermes2.5.html

kpodkanowicz@alien.top · 2 years ago

you are on fire. This is your yet another great post - btw. i changed perplexity scripts to only measure responses after the instruction and using for example, the evol dataset. The preset is configured accordingly to the model - i got completely different results than normal perplexity - interestingly, when running code isntructions on normal model and for instance roleplay instructions on coding model not just perpelxity is around 1 vs. 3 but also degradate differently

dnsod_si666@alien.top · 2 years ago

You could also use this to measure different models against each other right? And just in general, use this as a model benchmark.

Get dataset of text.
Tokenize dataset.
Measure true probabilities straight from the dataset.
Train model number 1 on tokenized dataset.
Measure KL divergence of model from true probabilities.
Repeat steps 4,5 for model number 2
Compare KL divergence of model 1 to model 2.

-Separate Idea- Also isn’t getting the true probabilities useful anyway, because then we could have the training process be:

Get dataset.
Tokenize.
Get true probabilities.
Train on probabilities instead of directly on the tokens.

Like instead of training twice (sequence to probabilities):

sequence1 -> [1, 0]
sequence1 -> [0, 1] You train it once with:
sequence1 -> [0.5, 0.5]

So you are training on less data which would reduce training costs and whatnot.

SpeedingTourist@alien.top · 2 years ago

Noob question, but can someone please help me understand the difference between K_M and K_S?

CardAnarchist@alien.top · 2 years ago

Hi there, you seem like the man to ask on this somewhat related topic to the OP,

I’ve recently found out that models output different results based on the number of layers loaded into GPU. I’ve been told that more layers loaded in = better output.

How does the loss asociated with layers not in GPU compare to the loss say between quants?

kindacognizant@alien.top · 2 years ago

That doesn’t seem correct in the slightest.

CardAnarchist@alien.top · 2 years ago

I thought it odd myself. So much so that I thought SillyTavern was bugged but that wasn’t the case.

It’s pretty easy to test yourself. Just use Koboldcpp to load in say 31 layers generate some output on seed 1 then, restart Koboldcpp with 30 layers.

Example of 31 layers of a 7B vs 30 layers on the same seed.

Each seed works the same if the layers are close enough it seems like. The output starts exactly the same before branching off.

It’s worth mentioning that the person who told me the quality was “better” with more layers loaded in simply said it was as far as he recalled.