Your settings are (probably) hurting your model - Why sampler settings matter

kindacognizant@alien.top · 3 years ago

psi-love@alien.top · 3 years ago

Really nice explanation, thank you!

So if I only want min_p sampling of 0.05 to work with llama.cpp for example, which values should other sampling parameters like top_k (0?), top_p (1.0?) and temperature (1.0?) use, so they have no influence?
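For reference, here is a minimal sketch of why the values guessed in the question (top_k 0, top_p 1.0, temperature 1.0) act as "off" switches. It assumes the textbook definitions of each sampler and is not llama.cpp's actual code; the function names are made up for illustration. llama.cpp's help text does describe top-k 0 and top-p 1.0 as disabled.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature divides the logits before softmax; 1.0 leaves them unchanged.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    # Assumption: k <= 0 means "disabled" (the llama.cpp convention).
    if k <= 0 or k >= len(probs):
        return list(probs)
    cutoff = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p(probs, p_limit):
    # Keep the smallest set of most likely tokens whose total mass reaches p_limit.
    if p_limit >= 1.0:
        return list(probs)  # 1.0 keeps everything, so nothing is filtered
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [0.0] * len(probs), 0.0
    for i in order:
        kept[i] = probs[i]
        mass += probs[i]
        if mass >= p_limit:
            break
    return [p / mass for p in kept]

probs = softmax([2.0, 1.0, 0.5, -1.0], temperature=1.0)
neutral = top_p(top_k(probs, 0), 1.0)
print(neutral == probs)  # the neutral settings pass the distribution through untouched
```

With those neutral values in place, min_p is the only filter that changes the distribution.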

drifter_VR@alien.top · 3 years ago

Just tried Min-P with the latest versions of SillyTavern and KoboldCpp and… the outputs were pretty chaotic… not sure if KoboldCpp supports Min-P yet

SillyTavern has Min-P support, but I’m not sure if it works with all backends yet. In 1.10.9’s changelog, Min-P was hidden behind a feature flag for KoboldCPP 1.48 or Horde.

Haiart@alien.top · 3 years ago

It’s working perfectly fine for me in KoboldCPP.

Check whether you forgot to disable any other sampling methods. You have to disable everything (Top-p at 1, Top-K at 0, Top-A at 0, Typ. at 1, TFS at 1, Seed at -1 and Mirostat Mode OFF) and leave ONLY Min-p enabled. If you NEED it, you can enable Repetition Penalty at 1.05~1.20 at most. I personally use RpRng. 2048 and RpSlp. 0.9, but those two only matter if Repetition Penalty is enabled.

Also, with Min-p you should be using a higher Temperature. Start with Temperature at 1.5 and Min-p at 0.05, then fine-tune these two numbers at will. Read the post to understand why.
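A rough sketch of how Min-p and a higher Temperature interact. Min-p keeps every token whose probability is at least min_p times the top token's probability. The order used here (Min-p first, Temperature afterwards) is an assumption made for this illustration, and the function name is hypothetical; check the sampler order of your own backend.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_then_temperature(logits, min_p=0.05, temperature=1.5):
    probs = softmax(logits)
    # The cutoff scales with the model's confidence in its top token.
    threshold = min_p * max(probs)
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    # Assumed order: temperature only reshapes the tokens that survived Min-p.
    scaled = softmax([logits[i] / temperature for i in kept])
    out = [0.0] * len(logits)
    for i, p in zip(kept, scaled):
        out[i] = p
    return out

logits = [5.0, 4.0, 3.0, 0.0, -2.0]
print(min_p_then_temperature(logits, min_p=0.05, temperature=1.5))
```

In this sketch the two unlikely tokens are removed before the temperature flattens the rest, so raising the temperature spreads probability over plausible tokens only.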

drifter_VR@alien.top · 3 years ago

Well I tried the settings given by OP with temp=1.0, will try with higher temps, thanks.

nixudos@alien.top · 3 years ago

I’m having a lot of fun with it on the following settings for story writing.
I feel like there is loads of great potential in min_P, once I get it dialed in!

https://preview.redd.it/9in73daoix1c1.png?width=619&format=png&auto=webp&s=3f51101d0a40c02ef46de163a707164d28a68f7f

Haiart@alien.top · 3 years ago

Great. Also, remember to keep an eye on the KoboldCPP GitHub for updates. When you said two days ago that you were using 1.48, I noticed they already had version 1.50 there.

CardAnarchist@alien.top · 3 years ago

Hi thanks a lot for this, I haven’t seen a good guide to these settings until now.

As someone who always runs Mistral 7B models I have two questions:

1. For a general default for all Mistral models would you recommend a Repetition Penalty setting of 1.20?
2. I run Mistral models at 8192 context. What should I set the Repetition Penalty Range at?

Thanks again for the great info and of course for making Min P!

Broadband-@alien.top · 3 years ago

I’ve experimented with turning repetition penalty off completely and haven’t noticed much of a change so far.

CardAnarchist@alien.top · 3 years ago

I set up everything exactly as OP’s example showed, but with 1.20 Repetition Penalty. The output was… quite bad, worse than I was getting before tampering with all the settings.

I changed Repetition Penalty Range to match my context (8192) and that improved the output but it was still pretty bad.

I tried Repetition Penalty of 1.0 and that was much better but it tended to repeat after a bit (A common Mistral problem).

I tried 1.1 Repetition Penalty and it was close but still a bit too dumb / random.

1.05 Repetition Penalty seems to be a nice sweet spot for me atm. I do think the output is now better than what I had previously.

Strange that you don’t see much difference with the Repetition Penalty setting. It massively alters my outputs (when set up like OP’s example).

I’m using OpenChat 3.5 7B for reference.
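A minimal sketch of what Repetition Penalty and its Range do, assuming the common CTRL-style rule (divide positive logits by the penalty, multiply negative ones) applied to tokens seen in the last `penalty_range` positions. The function and parameter names are hypothetical, and the slope setting (RpSlp.) is left out.

```python
def apply_repetition_penalty(logits, history, penalty=1.05, penalty_range=2048):
    # logits: one score per token id; history: token ids generated so far.
    # Only tokens inside the most recent window are penalised.
    recent = set(history[-penalty_range:]) if penalty_range > 0 else set()
    out = list(logits)
    for tok in recent:
        if out[tok] > 0:
            out[tok] /= penalty   # make a likely token less likely
        else:
            out[tok] *= penalty   # push an unlikely token further down
    return out

logits = [2.0, -1.0, 0.5]
history = [0, 1, 0]
print(apply_repetition_penalty(logits, history, penalty=1.2, penalty_range=2))
print(apply_repetition_penalty(logits, history, penalty=1.0, penalty_range=2))
```

A penalty of 1.0 changes nothing, which matches the "off" behaviour described above. Because the change scales with the size of the logit, the gap between 1.05 and 1.20 can be large for tokens the model is confident about.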

Wooden-Potential2226@alien.top · 3 years ago

Thanks! V informative, will keep for reference👍🏼

ProperShape5918@alien.top · 3 years ago

Needed to use a language model just to read this.

ReMeDyIII@alien.top · 3 years ago

I find it comical it took this long to get a proper dissection of what these settings meant and to no surprise it spikes to 387 upvotes in 13 hours.

Excessive_Etcetra@alien.top · 3 years ago

I could have used this four months ago, lol. Thank you OP for finally making it make sense.

FPham@alien.top · 3 years ago

Proof is in the pudding - blind tests, just like ooba did a while ago with the older samplings.

Language is way too complex to approach it from the math side and assert “this should work better”. In theory yes, but we need blind tests.

berzerkerCrush@alien.top · 3 years ago

That’s a high quality post!

Super_Pole_Jitsu@alien.top · 3 years ago

This is absolutely golden, and is probably the reason for the absolutely shit performance I got on my local models. You should definitely write a paper about this!

nsfw_throwitaway69@alien.top · 3 years ago

min P seems similar to tail free sampling. I think the difference is that TFS tries to identify the “tail” by computing the derivative of the token probability function.
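To make the comparison concrete, here is a simplified sketch of both filters on a toy distribution. The tail free version follows the usual description (sort the probabilities, take the absolute second differences, normalise them, cut where their running sum passes z). The exact cut-off index differs between implementations, so treat this as an approximation and not any backend's code. Function names are made up for illustration.

```python
def tail_free_keep(probs, z=0.95):
    # Returns the token indices kept, most likely first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    p = [probs[i] for i in order]
    if len(p) <= 2:
        return order
    first = [p[i] - p[i + 1] for i in range(len(p) - 1)]
    second = [abs(first[i] - first[i + 1]) for i in range(len(first) - 1)]
    total = sum(second)
    if total == 0:
        return order  # perfectly linear curve: no tail to find
    cumulative = 0.0
    keep = len(p)
    for i, d in enumerate(second):
        cumulative += d / total
        if cumulative > z:
            keep = i + 1  # assumed cut-off convention
            break
    return order[:keep]

def min_p_keep(probs, min_p=0.05):
    # Min P only compares each token against the top token's probability.
    threshold = min_p * max(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [i for i in order if probs[i] >= threshold]

probs = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
print(tail_free_keep(probs, z=0.95))
print(min_p_keep(probs, min_p=0.05))
```

The practical difference in this sketch is that TFS depends on the shape of the whole sorted curve, while Min P depends only on the ratio to the top token.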

sophosympatheia@alien.top · 3 years ago

Awesome post! Thanks for investing the time into this, u/kindacognizant.

I have been playing around with your suggested Min-P settings and they kick butt. It feels close to mirostat subjectively, certainly no worse, and you made some convincing arguments for the Min-P approach. I like the simplicity of it too. I think I’ll be using Min-P primarily from now on.

_Andersinn@alien.top · 3 years ago

Thank you - I used to think I was the only one who had no idea how any of this works.

dnsod_si666@alien.top · 3 years ago

This may be a dumb question, but why do we use any sampling modifications at all? Is that not defeating the purpose of the model training to learn those probabilities?

extopico@alien.top · 3 years ago

Which koboldcpp version allows you to set the sampler order? The latest main branch does not have this available on Linux.

Dead_Internet_Theory@alien.top · 3 years ago

OP, this post is fantastic.

I wonder, is this a case of the community doing free R&D for OpenAI, or do they truly have a good reason for using naive sampling?

Also the graph comes from here, a bunch of other graphs there too.

kindacognizant@alien.top · 3 years ago

I posted that GitHub issue. That original Top K vs Top P graph wasn’t made by me, I can’t find the original source, but I made the Min P one and others.