Dynamic LoRAs -- Crazy idea?

BrainSlugs83@alien.top · 10 months ago

Slightly off-topic: I’ve been testing 13b and 7b models for a while now, and I’m really interested if people have a good one to check out, because at least for now I’ve settled on a 7b model that seems to work better than most of the 13b models I’ve tried.

Specifically, I’ve been using OpenChat 3.5 7b (Q8 and Q4), and it’s been really good for my work so far, punching well above its weight class. Much better than any of the 13b models I’ve tried. (I’m not running any formal tests; it just seems to understand what I want better than the others. I’m not doing any function calling, but even the 4-bit 7b model is able to generate JSON as well as respond coherently.)

Note: specifically using the original (non-16k) models; the 16k models seem to be borked or something?

Link: https://huggingface.co/TheBloke/openchat_3.5-GGUF

BrainSlugs83@alien.top · 10 months ago

Interesting, is Partner in Crime (PIC) like an open source co-pilot type project? I haven’t heard of it before (did you coin this phrase yourself, or is it well known)?

I ask because the tasks you describe (json/md/function calling/empathy) and then the name itself, all basically make it sound like the “open source” models equivalent of a co-pilot model.

BrainSlugs83@alien.top · 10 months ago

He’s not wrong, but there are lots of things that can throw a wrench into the predictability. For example, if you’re using a Hugging Face model and the weights file changes out from under your nose.

Or if the hardware you’re executing on has a bug (like the FDIV floating-point bug on early Pentiums back in the day).

Or if the model has the precision reduced or increased by the hardware it’s running on in a significant way.

Or if the random bits used for sampling are unobservable, etc.

In these cases it still is deterministic, it’s just not easy to determine, especially when small hardware changes (as opposed to algorithmic ones) can change the output.
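To make that concrete, here’s a tiny sketch (plain Python, not any real inference stack). The `sample_tokens` helper and the three-token distribution are made up for illustration. It shows two things: with the seed pinned, "random" sampling repeats exactly, and simply changing the order of a floating-point sum (which different hardware or kernels may well do) changes the result.

```python
import random

def sample_tokens(probs, n, seed):
    """Draw n token ids from a fixed distribution using a private, seeded RNG."""
    rng = random.Random(seed)  # pinning the seed makes the random bits reproducible
    ids = list(range(len(probs)))
    return [rng.choices(ids, weights=probs, k=1)[0] for _ in range(n)]

probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
run_a = sample_tokens(probs, 20, seed=1234)
run_b = sample_tokens(probs, 20, seed=1234)
print(run_a == run_b)  # same seed and inputs give the same tokens

# Same three numbers, different summation order, different float result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False
```

So the process is deterministic in both cases; you just need to know the seed and the exact order of operations to reproduce it.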

BrainSlugs83@alien.top · 10 months ago

It looks like a neat project, and correct me if I’m wrong, but it looks like their goal is just doing MoE and blending. And not really any dynamic context sliding?

BrainSlugs83@alien.top · 10 months ago

Neat, that’s really similar to what I was thinking of. – I know SD is transformer based, but has anyone done this with LLMs?

BrainSlugs83@alien.top · 10 months ago

That’s a very interesting project which is similar in many ways to what I’m thinking of. – They’re doing something a little different than what I was thinking of, but it’s still really neat. – I’m going to check that out. Thanks for sharing it!

BrainSlugs83@alien.top · 10 months ago

No. I’m not advocating for creating a text-to-LoRA model. Though that would be a neat project, I think you’d have a monumental training task on your hands… and really… it just doesn’t seem that practical. Fine-tuning isn’t expensive enough to merit trying to train or build that network anyway, so “the juice wouldn’t be worth the squeeze”.

Picking the right LoRA for a response is basically what an MoE (Mixture of Experts) system does.

What I’m proposing is training a regular LLM to occasionally spit out tokens which signal another ML network to periodically run, which will make minor runtime adjustments to the current LoRA to keep it “on track”.

Like a thousand tiny micro adjustments over the course of a long conversation. – Which could be used to shift the current latent space into one where the model has an “intuitive” or “latent” understanding of much of what is currently in the context – so that the actual context and attention tokens could be freed up for later use.

Basically, if the network is already in the optimal LoRA, the ML network would just spit out an identity (no-op) tensor for the LoRA so that it never changes.

But as the LLM realizes it’s no longer in the realm of its current latent space, it spits out a special “think-harder” token, which signals the ML network to run.

The ML network takes the current context and pushes it into a weighted vectorized embedding that is representative of the current “state”, and spits out a tensor which makes micro adjustments to the LoRA / PEFT adapter.

That was one such application for this that I was proposing.
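Here’s a rough toy sketch of that control loop in plain Python, just to pin down the moving parts. Everything in it is an assumption for illustration: the `<think-harder>` token string, the `embed` and `controller` functions (stand-ins for a learned network), and the tiny matrix sizes. It is not a working training setup. The no-op case here is a zero delta on the adapter, and clearing the context after an update only stands in for the hope that the adapter has absorbed it.

```python
THINK_HARDER = "<think-harder>"  # hypothetical special token

def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def effective_weight(W, A, B):
    """Frozen base weight plus the low-rank LoRA update B @ A."""
    return add(W, matmul(B, A))

def embed(context, d_in):
    """Toy 'state' vector: normalized bucket counts of the context tokens."""
    vec = [0.0] * d_in
    for tok in context:
        vec[sum(map(ord, tok)) % d_in] += 1.0
    return [v / max(len(context), 1) for v in vec]

def controller(state, A, B, step=0.01):
    """Stand-in for the second network: maps the state to small deltas for A and B."""
    dA = [[step * s for s in state] for _ in A]
    dB = [[step * sum(state) for _ in row] for row in B]
    return dA, dB

def run(tokens, W, A, B):
    """Walk a token stream, nudging the adapter whenever the signal token appears."""
    context, updates = [], 0
    for tok in tokens:
        if tok == THINK_HARDER:
            dA, dB = controller(embed(context, len(W[0])), A, B)
            A, B = add(A, dA), add(B, dB)  # micro adjustment to the adapter
            updates += 1
            context.clear()  # speculative: the absorbed context could be freed
        else:
            context.append(tok)
    return A, B, updates

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base weight (2 x 3)
A0 = [[0.1, -0.2, 0.3]]                  # rank-1 LoRA factors; B starts at zero
B0 = [[0.0], [0.0]]

A1, B1, n_updates = run(["the", "cat", THINK_HARDER, "sat", THINK_HARDER], W, A0, B0)
print(n_updates)                 # 2
print(effective_weight(W, A1, B1))
```

If the stream never contains the signal token, the adapter is left alone and the effective weight stays exactly where it was.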

BrainSlugs83@alien.top · 10 months ago

Can you explain more? – I thought this would make it so that context and attention could be freed up and reused for new tokens.

Like… it would also allow for a much larger context size without the quadratic memory consumption – possibly even static memory consumption.
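Back-of-the-envelope version of what I mean, with numbers I picked for illustration (fp16 values, a 4096-wide layer, rank 8). A naively materialized attention score matrix grows with the square of the context length, while a LoRA pair stays the same size no matter how long the conversation gets. This only counts one head in one layer, and real implementations can avoid storing the full matrix, so treat it as an illustration of the scaling and nothing more.

```python
def attention_score_bytes(n_tokens, bytes_per_value=2):
    """Memory for one naive n x n attention score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_value

def lora_bytes(d_in, d_out, rank, bytes_per_value=2):
    """Memory for one LoRA pair (A: rank x d_in, B: d_out x rank); no dependence on context length."""
    return rank * (d_in + d_out) * bytes_per_value

for n in (2048, 4096, 8192):
    print(n, attention_score_bytes(n), lora_bytes(4096, 4096, 8))
```

Doubling the context quadruples the first number; the second never moves.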

BrainSlugs83@alien.top · 10 months ago

Yeah, it was an industrial TENS unit applied to my upper neck and back to help with some pain I’ve been dealing with since I was rear ended a few months ago.

You can’t really do much but just sit there and think, so that’s what I did. Maybe I’ll bring an audiobook next time. 😅

BrainSlugs83@alien.top · 10 months ago

How does the q8_0 version of that model do?

BrainSlugs83@alien.top · 10 months ago

Why do you need an LLM for this? Just use any NER model. It will be blazing fast and run locally.

BrainSlugs83@alien.top · 10 months ago

Ollama is not cross platform (yet), so it’s off the table for me. Looks neat, but I don’t really see the point when there’s already a bunch of cross platform solutions based on llama.cpp.

BrainSlugs83@alien.top · 10 months ago

It’s just easier to run (and deploy!) cross platform compiled code than to set up 10 different python envs and cross your fingers that it might work this time.