Temperature last is the assumed default in llama.cpp, which means it is working.
Unfortunately, the HF loader seems to have a bug with Min P in Ooba.
That doesn’t seem correct in the slightest.
Play with your sampler settings. The impact on creativity changes pretty significantly.
See this, for example:
The important elements are:
- Min P, which sets a minimum probability cutoff relative to the top token’s probability. Go no lower than 0.03 if you want coherence at higher temps.
- Temperature, which controls how much the lower-probability options are considered; higher values make them relatively more likely. (A rough sketch of this order of operations follows below.)
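Here’s a minimal sketch of that pipeline (not llama.cpp’s actual code; the function name and defaults are just illustrative): Min P filters relative to the top token, and temperature is applied last to whatever survives.

```python
import numpy as np

def sample_min_p(logits, min_p=0.05, temperature=1.0, rng=None):
    """Min P filtering followed by temperature-last sampling (sketch)."""
    rng = np.random.default_rng() if rng is None else rng

    # Probabilities at temperature 1.0, used only for the Min P filter.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Keep tokens whose probability is at least min_p * the top probability.
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, logits, -np.inf)

    # Temperature last: rescale the surviving logits, then sample.
    scaled = (filtered - filtered[keep].max()) / temperature
    final = np.exp(scaled)
    final /= final.sum()
    return int(rng.choice(len(logits), p=final))
```

Because temperature is applied after the filter, you can crank it up without the tail tokens that Min P already removed ever coming back.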
DynaTemp is still available in the test build.
I’m not sure which method is superior yet; it needs more testing and opinions, but it looks promising because it scales well.
The changes have been pushed to the exp-dynatemp-minp-latest branch.
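For anyone curious, my rough reading of the entropy-based dynamic temperature idea is something like the sketch below; this is an assumption about the approach, not the code that’s actually on the branch.

```python
import numpy as np

def dynatemp_sketch(logits, min_temp=0.5, max_temp=1.5):
    """Assumed behavior: confident (low-entropy) distributions get a lower
    temperature, flat ones a higher one. Not the test-branch implementation."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Normalized Shannon entropy in [0, 1].
    entropy = -np.sum(probs * np.log(probs + 1e-10))
    norm_entropy = entropy / np.log(len(probs))

    # Interpolate the temperature, then rescale the logits with it.
    temp = min_temp + (max_temp - min_temp) * norm_entropy
    return logits / temp, temp
```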
Elon Musk has made it profitable
Applying Gaussian noise randomization to the logits with a deviation factor of 1.0 is totally coherent at top k = 1 (aka it’s picking the top token post-randomization) on a LoRA I trained and am testing with, and I haven’t seen repetition issues thus far. How might I test this? What are your best benchmark ideas?
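For reference, the setup I’m describing boils down to something like this (the deviation factor is just the Gaussian sigma; the function name is mine):

```python
import numpy as np

def noisy_greedy_pick(logits, noise_dev=1.0, rng=None):
    """Add zero-mean Gaussian noise to the logits, then take the argmax,
    i.e. top k = 1 over the post-randomization scores."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = logits + rng.normal(0.0, noise_dev, size=logits.shape)
    return int(np.argmax(noisy))
```

My guess is it stays coherent because noise at a deviation of 1.0 only really flips the pick when two logits are already close to each other.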
Considering there’s an implementation of the cosine scheduler with warmup steps, is there any implementation of a scheduler that starts slow, then rapidly accelerates, and finally stabilizes to learn the subtle features (like a sigmoidal function)? The idea being to avoid starting too high in the first place.
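I’m not aware of a stock scheduler with that shape, but as a rough sketch, a sigmoid multiplier on top of PyTorch’s LambdaLR would give the slow-start, fast-middle, flat-finish curve I mean (everything here is hypothetical, not an existing named scheduler):

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def sigmoid_lr_schedule(optimizer, num_training_steps,
                        midpoint_frac=0.3, steepness=12.0):
    """Hypothetical sigmoid-shaped schedule: the LR multiplier starts near
    zero, ramps up quickly around midpoint_frac of training, then levels
    off near the peak LR."""
    def lr_lambda(step):
        progress = step / max(1, num_training_steps)
        return 1.0 / (1.0 + math.exp(-steepness * (progress - midpoint_frac)))
    return LambdaLR(optimizer, lr_lambda)
```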
So you’re saying my intuition isn’t wrong, necessarily, that slow training to learn the small subtle details could work as long as the dataset wasn’t *too* limited in scope?
https://huggingface.co/datasets/kalomaze/MiniSymposium-Demo-Dataset
Feel free to submit examples to the community tab
I am inclined to believe GPT-4 since it consistently claims this across both the API and your comment… but I’m not sure
GPT-4 is claiming this comment’s claim is wrong, but I can’t trust it blindly of course; I’ll look into my initial claim to verify.
> Multiple passes at lower learning rates isn’t supposed to produce different results.
Oh, I was wrong on this, then, my bad.
So would my interpretation be correct that this is essentially causing the overfitting to still happen, just significantly more slowly, and that a higher LR would work? The problem before was that the average loss tanked to near zero within roughly a single epoch, which overfit, but this LR didn’t have the same effect.
I posted that GitHub issue. The original Top K vs Top P graph wasn’t made by me and I can’t find its source, but I made the Min P one and the others.
What kind of sampler settings are you using? You can force models to get really out there in terms of creativity depending on what you use.
I am of the opinion that security through obscurity (of model weights) does not work.
The capabilities of these models would have to be consistently more powerful than the current state of the art, and not just marginally so but by orders of magnitude, to carry out the threats that have been proposed as pseudo-realistic risks.
Having to use your own compute instead of scraped GPT API keys, while open models’ generalized performance still isn’t directly comparable, greatly diminishes the threat from bad actors. I’d maybe start to sweat if GPT-4 were getting better instead of worse every time they do a rollout.
This is also another alignment paper that cites theoretical examples of biochemical terrorism. We live in a post-internet era where that kind of information has already landed in the hands of the people most capable of carrying it out, but the post-internet era has consequently also made those kinds of attacks much more difficult to carry out.
As the number of routes for possible attack vectors increases, the number of ways for that attack to be circumvented also increases.
I’m guessing GQA helped. Llama 2 70b and 34b used Grouped Query Attention, but it wasn’t used for the Llama 2 7b/13b models.
What frontends do you use?
Koboldcpp! Single exe, runs with very little dependency bloat, and is still blazing fast as long as you can offload the whole model.