Why can't we just run local reinforcement learning?

Revolutionalredstone@alien.top · 2 years ago

Why can't we just run local reinforcement learning?

LuluViBritannia@alien.top · 2 years ago

I have been generating art with AI. There is an extension meant for exactly that: you literally tell the AI “good” or “bad” for each result, and it affects the weights of the model.

Sadly, it’s nearly impossible to run. Reinforcement learning isn’t just about “picking a random weight and changing it”. It’s rewriting the entire model to take your feedback into account, and doing that while also running the model, which by itself already takes most of your compute resources.

You need a shitton of VRAM and a very powerful GPU to run Reinforcement Learning for images. It’s even worse for LLMs, which are much more power-hungry.

Who knows, maybe there will be optimizations in the next few years, but as of right now, reinforcement learning is just too demanding.

Void_0000@alien.top · 2 years ago

How hard can it be?

Seriously though, what makes it require more VRAM than regular inference? You’re still loading the same model, aren’t you?

ihexx@alien.top · 2 years ago

there’s lots of different kinds of RL algos with different requirements

In general though, the tradeoff you’re making is: data efficiency vs compute complexity

On one end, evolutionary methods & gradient-free optimization methods are simple, but data hungry.

On the other end, things like model-based RL (e.g. building reward models to train your generator model) are more data efficient, but more complex, since they have more moving parts and more live models to train.

So to answer:

Seriously though, what makes it require more VRAM than regular inference? You’re still loading the same model, aren’t you?

No, on the model-based end, you’re training at least 2 models: the generator and the reward model.

On the evolutionary & gradient-free end, you need far more data than supervised learning, since reinforcement learning doesn’t tell the agent what to do at every time step, only after N time steps, so you’re getting basically 1/Nth of the training signal for each step compared to supervised learning.

Basically, we as GPU poors are in the weird position where anything we can train under these limitations would probably have worse performance than just training a larger model off supervised datasets.
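To make the gradient-free end concrete, here is a minimal sketch of random-perturbation search. It is only an illustration: `episode_reward` is a hypothetical stand-in for "run the model and score the whole result", and the three-number `target` replaces what would be millions of weights in a real model. The point is that each candidate costs a full evaluation and returns a single number.

```python
import random

def episode_reward(weights, target):
    # Hypothetical scoring function: one scalar for the whole episode,
    # with no per-step hint about which weight was wrong.
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

def random_search(target, steps=2000, sigma=0.1, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * len(target)
    best = episode_reward(weights, target)
    evaluations = 1
    for _ in range(steps):
        # Jiggle every weight a little, then pay for a full evaluation.
        candidate = [w + rng.gauss(0.0, sigma) for w in weights]
        reward = episode_reward(candidate, target)
        evaluations += 1
        # Keep the jiggle only if the overall reward improved.
        if reward > best:
            weights, best = candidate, reward
    return weights, best, evaluations

target = [0.5, -1.0, 2.0]
weights, best, evaluations = random_search(target)
print(f"best reward {best:.4f} after {evaluations} evaluations")
```

Even this three-weight toy spends a couple of thousand evaluations, which is the "simple but data hungry" tradeoff in miniature.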

LuluViBritannia@alien.top · 2 years ago

Well, first of all, this is something you do while running the model. Sure, it’s the same model, but it’s still two different processes to run in parallel.

Then, from what I gather, it’s closer to model finetuning than it is to inference. And if you look up the figures, finetuning requires a lot more power and VRAM. As I said, it’s rewriting the neural network, which is the definition of finetuning.

So in order to get a more specific answer, we should look up why finetuning requires more than inference.
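A rough back-of-the-envelope for that question, as a sketch under stated assumptions: inference only has to hold the weights, while full finetuning with Adam also holds gradients and optimizer state. The byte counts below assume a common mixed-precision setup (fp16 weights and gradients, plus fp32 master weights and two fp32 Adam moments); activations, KV cache and framework overhead are ignored, so real numbers will differ.

```python
GIB = 2 ** 30

def inference_gib(n_params, bytes_per_weight=2):
    # Inference: just the weights (fp16 assumed, 2 bytes each).
    return n_params * bytes_per_weight / GIB

def full_finetune_gib(n_params):
    # Assumed mixed-precision Adam layout, in bytes per parameter.
    fp16_weights = 2
    fp16_grads = 2
    fp32_master_weights = 4
    adam_first_moment = 4
    adam_second_moment = 4
    per_param = (fp16_weights + fp16_grads + fp32_master_weights
                 + adam_first_moment + adam_second_moment)
    return n_params * per_param / GIB

n_params = 7_000_000_000  # a hypothetical 7B-parameter model
print(f"inference:     {inference_gib(n_params):.1f} GiB")
print(f"full finetune: {full_finetune_gib(n_params):.1f} GiB")
```

Under these assumptions a 7B model needs about 13 GiB to run and about 104 GiB to fully finetune, an 8x gap before counting activations.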

Bod9001@alien.top · 2 years ago

what’s the name of the extension?

UnignorableAnomaly@alien.top · 2 years ago

Don’t know if it’s the same one, but I’ve played with sd-webui-fabric (https://github.com/dvruette/sd-webui-fabric). It doesn’t use much VRAM at all and works decently once you get enough Likes and Dislikes. However, as you add more likes/dislikes, generation will slow considerably.

Chaosdrifer@alien.top · 2 years ago

Please look up fine-tuning and LoRA; those are the methods to “evolve” a model after it is born.
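For anyone looking it up, the core LoRA idea can be sketched in a few lines: freeze the original weight matrix W and train only a small low-rank pair A and B, so the layer computes W x + (alpha / r) · B (A x). This is a pure-Python illustration, not any library's API; the helper names and the tiny dimensions are made up for the example.

```python
import random

def matvec(M, x):
    # Plain matrix-vector product on nested lists.
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    # W stays frozen; only A (r x d_in) and B (d_out x r) would be trained.
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    # Weights updated by full finetuning vs. by a rank-r adapter.
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}

rng = random.Random(0)
d_in, d_out, r, alpha = 6, 4, 2, 4
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: adapter starts as a no-op
x = [1.0] * d_in

print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))
print(trainable_params(4096, 4096, 8))
```

For a single 4096 x 4096 layer at rank 8, that is 65,536 trainable values instead of 16,777,216, which is why LoRA fits on consumer GPUs where full finetuning does not.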

Revolutionalredstone@alien.top · 2 years ago

Oh Awesome!

Thank you!

ShengrenR@alien.top · 2 years ago

There are a few important things here:
1 - You CAN do this. You CAN just go into the network and modify a random value… but then how do you evaluate whether your change made the network ‘better’ or ‘worse’? You’d have to run just about every input that touches the value you changed through the network to see whether the overall effect was ‘better’ or ‘worse’. This isn’t really ‘reinforcement learning’, though, because here the random ‘jiggle’ itself is the network change. RL looks at a string of actions, sees whether an action was beneficial, then computes the changes to the network needed to account for that action being ‘good’ vs ‘bad’. So it isn’t really a description of what RL is, but it’s an interesting idea… the catch being that it’s prohibitively expensive to evaluate which changes are beneficial and which are a hindrance.

2 - So, rather than trying to jiggle the values stored in the network and compute the overall change across all potential outcomes… let’s just look at individual outcomes from the current network, see how they compare against a ‘truth’ sequence we want to emulate (a string of tokens), then do that against a ton of different strings. OH. That’s what pretraining already is :) The difference between what the model thought the next token should be and what the true token in the comparison string actually is gets measured as a cross-entropy loss (‘perplexity’ is the exponential of that loss), and that’s how they train the initial foundation model.

3 - RL would be generating an entire output and then feeding a ‘yes/no’ signal back into the network to encourage/discourage that overall output. This is what OpenAI does after they’ve released their initial model: they run ‘RLHF’ on it using human feedback from users. The issue here, for you ‘at home’, is that unless you’re the level of mad scientist that can automate evaluation of ‘better vs worse’ without a human there to say yes/no, you need to be the head that evaluates yes/no for each output… and to move the model very far without dramatically overshooting something useful, you have to move it in very, very small steps… which means you, yourself, need to sit around and say yes/no to a LOT of outputs. It would be nice to get a lot of friends in on that to help guide the thing, so you didn’t have to spend five thousand years just saying yes/no to move it along… https://github.com/LAION-AI/Open-Assistant oh hey, some folks started doing just that.
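The yes/no loop in point 3 can be sketched as a toy REINFORCE update. This is a simplified illustration, not how any production RLHF pipeline works: the "model" is just four logits over four possible outputs, and `human_says_yes` is a hypothetical stand-in for the person clicking yes or no. Each round samples one output, gets one bit of feedback, and nudges the logits by a small step.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def human_says_yes(output_index):
    # Hypothetical rater: only output 2 is ever approved.
    return output_index == 2

def train(n_outputs=4, rounds=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_outputs
    for _ in range(rounds):
        probs = softmax(logits)
        # Generate one output, then collect a single yes/no for it.
        choice = rng.choices(range(n_outputs), weights=probs)[0]
        reward = 1.0 if human_says_yes(choice) else -1.0
        # REINFORCE: step along reward * grad of log-prob of the choice.
        for i in range(n_outputs):
            grad_logp = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * reward * grad_logp
    return softmax(logits)

probs = train()
print([round(p, 3) for p in probs])
```

Note that it takes hundreds of yes/no answers to teach a preference among just four options with small steps, which is the labeling burden described above, scaled down enormously.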

Revolutionalredstone@alien.top · 2 years ago

WOW

amazing info thank you kindly my dude!

Gonna be reading this for a while…