@StaplerGiraffe

StaplerGiraffe@alien.top · 10 months ago

Honestly, no idea. I have more theoretical than practical understanding. But my idea of the warmup phase is to arrange the initial totally random weights of a network into something where you can optimize on. When finetuning you don’t start from randomness, you start from a trained checkpoint, so I expect that the warmup phase is pointless (at least for SGD, no idea if it helps adaptive optimizers). So believe you should go from high learning rate to low learning rate, unless somebody knows better.

Oh, and when training Loras, remember that changing alpha also changes the learning rate by the same factor if I remember right. So many tests about optimal alpha are probably invalid, because people didn’t adjust the learning rate.

StaplerGiraffe@alien.top · 10 months ago

You are correct. Small learning rate allows to do fine adjustments to parameters and thereby learning subtle features. However, initially learning subtle features is useless, since you need to learn the coarse features first. That’s why learning rate schedulers go from large learning rate to small learning rate. The tricky bit is doing the minimal amount of training on a large learning rate. That is where various optimizers come in, which try do automate these kinds of things.

You could try to do this by hand by saving checkpoints periodically, and try to find the point where you go from undertrained to overtrained. Then pick a checkpoint which is slightly undertrained, and start training from there with a lower learning rate.

StaplerGiraffe@alien.top · 11 months ago

Thanks for the writeup. What’s your subjective experience with 2.4bpw or 2.5bpw models? Are they severely degraded, or still quite smart?

StaplerGiraffe@alien.top · 11 months ago

Sure, it provides the same API as KoboldAI.

StaplerGiraffe@alien.top · 11 months ago

Perhaps you are using a wrong fork of KobolAI, I get much more tokens per second. Did you open the task manager and check that the GPU memory used indeed increases when loading and using the model?

Otherwise try out Koboldcpp. It needs gguf instead gptq, but needs no special fork. With cublas enabled you should get good token speeds for a 13B model.