Proposed Alternative to Repetition Penalty - Noisy Sampling

kindacognizant@alien.top · 10 months ago

Proposed Alternative to Repetition Penalty - Noisy Sampling

drifter_VR@alien.top · 10 months ago

Looks great. Your method would also have the advantage of not hurting the syntax - how many models forget the last * or " because of RepPen?

andrewlapp@alien.top · 10 months ago

Very interesting idea. If you can create a simple benchmark (really just any prompt applied to a noisy 7B model) and demonstrate a reduction in repetition compared to baseline, this method will proliferate across the open LLM development ecosystem.

Looking forward to seeing your implementation!

kindacognizant@alien.top · 10 months ago

Adding Gaussian noise to the logits with a standard deviation of 1.0 is totally coherent at top k = 1 (i.e., it picks the top token after randomization) on a LoRA I trained and am using for testing, and I haven’t seen repetition issues so far. How might I test this? What are your best benchmark ideas?
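A minimal sketch of what is described above, assuming the noise is added directly to the raw logits and the top token is taken afterwards. The function names `noisy_greedy` and `pick_rates`, the toy logits, and the exact form of the noise are illustrative assumptions, not the code from the test build.

```python
import random

def noisy_greedy(logits, sigma=1.0, rng=random):
    """Add independent Gaussian noise to every logit, then take the argmax (top k = 1)."""
    noisy = [x + rng.gauss(0.0, sigma) for x in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)

def pick_rates(logits, sigma=1.0, trials=10_000, seed=0):
    """Estimate how often each token index wins after noise is applied."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(trials):
        counts[noisy_greedy(logits, sigma, rng)] += 1
    return [c / trials for c in counts]

# Toy logits (assumed values): one clear winner vs. three near-ties.
confident = pick_rates([10.0, 2.0, 1.0])
uncertain = pick_rates([2.0, 1.9, 1.8])
print("confident:", confident)
print("uncertain:", uncertain)
```

With a large logit gap the noise almost never changes the winner, while near-tied candidates trade places often, which is one way to read "scales with confidence".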

aseichter2007@alien.top · 10 months ago

Turn rep penalty off, repeat a ton of text over and over, use the wrong instruct format to make it go haywire, and watch for deviations in the regular output. If I understand correctly from my quick look, you should eventually get some outliers as you increase the strength of the deviation, even with top k = 1. Am I sane or out of my depth?

andrewlapp@alien.top · 10 months ago

Here are some factors that may help induce repetition:

1. Use a Llama 2 7B, Mistral 7B, or Yi 6B variant
2. Use a lossy quantization such as Q2_K (2 bit), Q4_0 (4 bit), or GPTQ (4 bit)
3. Use a sequence length of at least 1024 tokens, if not 2048
4. Use a text corpus with a lot of repetition, e.g. https://github.com/Lyrics/lyrics

Additionally, you should use lm-evaluation-harness to test for any degradation in performance in common benchmarks.
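For the repetition side of such a benchmark, one possible metric is the share of n-grams in a generation that duplicate an earlier n-gram. The sketch below is only an assumption about how repetition could be scored; the name `repeated_ngram_rate` and the choice of n = 4 are hypothetical and not taken from lm-evaluation-harness or any other tool mentioned here.

```python
def repeated_ngram_rate(tokens, n=4):
    """Fraction of n-grams that duplicate an earlier n-gram (0.0 means no repetition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# Toy inputs: a looping generation vs. one with no repeated 4-grams.
looping = "la la la oh baby la la la oh baby la la la oh baby".split()
varied = [str(i) for i in range(15)]
print(repeated_ngram_rate(looping), repeated_ngram_rate(varied))
```

Running the same prompts with and without noise and comparing the average rate would give a baseline-versus-method number.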

involviert@alien.top · 10 months ago

Certainly interesting! But in the end, there’s something wrong with the model if anything like that is needed. Like obviously it isn’t really fully capable of writing proper answers if it somehow thinks that writing in circles would be the best thing to do.

WolframRavenwolf@alien.top · 10 months ago

> Imagine a language model that was tasked to do trivial math problems, and a user always involved the number 3 in his first 5 questions. After a certain amount of context, it will bias against using the number 3 in the solution even if it is correct.

I used to think that, but one of the Transformers devs (Joao Gante from HF) told me that it is “only applied at most once per token” within the repetition penalty range, so it doesn’t matter how often the number 3 appears in the first 5 questions, as long as the repetition penalty is a “reasonable value (e.g. 1.2 or 1.3)”, it won’t have a negative impact on tokens the model is reasonably sure about. So for trivial math problems, and other such situations, repetition penalty is not a problem.

Same with other tokens like EOS, newlines, punctuation, etc. - if the repetition penalty would affect them negatively, we’d quickly see lots of problems. So it’s not preventing the output of tokens the model is sure about, it’s trying to prevent repetition in cases the token isn’t that predetermined.

Just something non-obvious to keep in mind.
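To illustrate the "at most once per token" point, here is a minimal sketch of a multiplicative repetition penalty over plain Python lists. It follows the common rule of dividing positive logits and multiplying negative ones by the penalty; the function name `apply_repetition_penalty` and the toy values are assumptions for illustration, not the Transformers source.

```python
def apply_repetition_penalty(logits, recent_tokens, penalty=1.2):
    """Penalize every token id seen in the penalty range, once per distinct id."""
    out = list(logits)
    for tok in set(recent_tokens):  # set(): repeated occurrences collapse to one
        if out[tok] > 0:
            out[tok] /= penalty     # shrink positive logits
        else:
            out[tok] *= penalty     # push negative logits further down
    return out

logits = [1.0, -2.0, 0.5, 6.0]      # token id 3 is the confident choice
seen_once = apply_repetition_penalty(logits, [3])
seen_five_times = apply_repetition_penalty(logits, [3, 3, 3, 3, 3])
print(seen_once, seen_five_times)
```

Both calls give the same result, and token 3 still has by far the highest logit after the penalty.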

Sabin_Stargem@alien.top · 10 months ago

Someone, get Kalomaze a grant with many digits. This sort of thing is the key to bringing AI to the masses.

kindacognizant@alien.top · 10 months ago

https://preview.redd.it/0q3tus2e5y2c1.png?width=600&format=png&auto=webp&s=27eae043a0f11fab4e52853abd3888243093c8c7

out_of_touch@alien.top · 10 months ago

One question I have regarding this: if we improve the way we randomize the next token, does that increase the likelihood of the “thesaurus” problem occurring? I.e., where the model keeps reaching for more “flowery” words because it doesn’t want to keep reusing the same ones. I find that becomes a problem with a long enough context in a chat when using some of the other settings designed to avoid repetition. Sometimes my characters start out talking normally and slowly progress into talking like college professors giving poetry lectures.

FPham@alien.top · 10 months ago

On a somewhat similar note, adding noise during finetuning to help with generalization: if you are using oobabooga, you can look at Training PRO

https://github.com/FartyPants/Training_PRO

And then experiment with NEFtune noise scale.

It is a somewhat similar idea, but on the other end: training. I assume you are talking about adding noise at inference time, in the sampler. Worth pursuing for sure, though the results are unpredictable until you try it…

pseudonerv@alien.top · 10 months ago

I’m too lazy to try this, but since you are likely the right person, here is my idea.

Equalize the probability in each accumulated probability bucket.

Just as min-P essentially sets the probability of the last few percent of tokens to 0, you can give the first few tokens that make up the top 50% of accumulated probability equal probability, then the tokens in the next 30% equal probability, then the tokens in the next 15% equal probability, and set the last 5% to 0.

For example, if the candidate next tokens and their normalized probabilities are

fantastic 0.3
good  0.2
great  0.2
awesome  0.1
normal  0.1
bad  0.05
sad  0.05

The first 50% includes “fantastic” and “good”. The next 30% includes “great” and “awesome”. The next 15% includes “normal” and “bad”. And the last 5% is “sad”. Give the tokens in each bucket the same probability (the bucket’s average), set the last bucket to 0, renormalize, and you get

fantastic  0.263
good  0.263
great  0.158
awesome  0.158
normal  0.079
bad  0.079
sad  0.0

You might try different widths of the bucket and see how it goes. Let me know if you actually try this.
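The bucket idea above can be sketched as follows. The function name `equalize_buckets`, the rule that a token belongs to the bucket in which its cumulative probability ends, and the fallback when everything would be dropped are assumptions made for this sketch.

```python
def equalize_buckets(probs, widths=(0.50, 0.30, 0.15), eps=1e-9):
    """Flatten probabilities inside cumulative-probability buckets; drop the remainder."""
    # Rank tokens from most to least likely (ties keep their input order).
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    # Upper edges of the buckets on the cumulative scale, e.g. 0.50, 0.80, 0.95.
    edges, total = [], 0.0
    for w in widths:
        total += w
        edges.append(total)

    buckets = [[] for _ in widths]
    cumulative = 0.0
    for token, p in ranked:
        cumulative += p
        for i, edge in enumerate(edges):
            if cumulative <= edge + eps:   # eps absorbs float rounding at the edges
                buckets[i].append((token, p))
                break                      # tokens past the last edge stay at 0

    out = {token: 0.0 for token in probs}
    for bucket in buckets:
        if bucket:
            mean = sum(p for _, p in bucket) / len(bucket)
            for token, _ in bucket:
                out[token] = mean

    z = sum(out.values())
    if z == 0.0:                           # nothing fit in any bucket: leave input unchanged
        return dict(probs)
    return {token: v / z for token, v in out.items()}

example = {"fantastic": 0.3, "good": 0.2, "great": 0.2, "awesome": 0.1,
           "normal": 0.1, "bad": 0.05, "sad": 0.05}
result = equalize_buckets(example)
for token, p in result.items():
    print(f"{token:10s} {p:.3f}")
```

On the example distribution this reproduces the figures listed above (0.263, 0.158, 0.079, and 0 for the last token). Passing a different `widths` tuple is how the bucket sizes would be varied.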

nuvalab@alien.top · 10 months ago

Thanks for writing this. It’s an interesting idea and very relevant to the problem I am trying to solve too: creative writing, which definitely suffers from repetition. I’m very interested in trying out what you proposed once it’s available :)

One technical question about this approach: wouldn’t it change the original output distribution learned from the training data, especially in cases where there is one obviously good next token to choose? I can see the value when multiple next tokens are all considered great with close probabilities, but I’m curious how it would behave otherwise in terms of consistency and correctness.

EvokerTCG@alien.top · 10 months ago

Aside from repetition, isn’t this effectively a new sampling method? You could call it Fuzzed Greedy Sampling.

mybitcoinpro@alien.top · 10 months ago

Looks cool, but how do I use it on Linux? Will changing the branch be enough, or do I need to do something else?

kindacognizant@alien.top · 10 months ago

The changes have been pushed to the exp-dynatemp-minp-latest branch.

mybitcoinpro@alien.top · 10 months ago

Thanks! ;)

CardAnarchist@alien.top · 10 months ago

So do you think this approach is better than Dynatemp?

Or are you planning to put forward both modifications, leaving Dynatemp out of this Kobold build to better test just the noise modification?

kindacognizant@alien.top · 10 months ago

DynaTemp is still available in the test build.

I’m not sure which method is superior yet; it needs more testing and opinions. But it looks promising because it scales well.

a_beautiful_rhind@alien.top · 10 months ago

Hope you can do another patch for exllamav2, with tabbyAPI it kicks.

Noisy Sampling

- Context Free

- Scales with Confidence