Most quantized models on the Hub use GPTQ, AWQ, or similar techniques. Those are optimized for inference and run faster than load_in_4bit. load_in_4bit uses the bitsandbytes library and is more useful when you want to train LoRAs on a limited amount of VRAM.
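If you want the bitsandbytes route, here's a minimal sketch of loading a model in 4-bit with transformers. It assumes transformers, accelerate, and bitsandbytes are installed and a CUDA GPU is available; the model ID is just a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder, swap in whatever you're using

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the usual choice for QLoRA-style training
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the actual compute in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers across available devices
)
```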
Hard to say. You’d probably be better off trying a model that’s been fine-tuned for use as an assistant. It also helps to add guidance in a system prompt, assuming you pick an instruction fine-tuned one. I’d be surprised if that failed, but try not to judge the models too harshly if their views align with an average of the training data. In my (admittedly limited) experience, none of the models are ‘woke’ as you say. They’re very average, which makes sense given what they were trained on. Perhaps you will find that human bias is a user error, not a model error.
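For what it’s worth, here’s roughly what steering a model with a system prompt looks like in code. This continues from the 4-bit loading sketch above and assumes the model’s tokenizer ships a chat template; the prompt text is just an illustration of the kind of guidance you might add.

```python
messages = [
    {"role": "system", "content": "You are a concise, neutral assistant."},
    {"role": "user", "content": "Summarize the trade-offs of 4-bit quantization."},
]

# Render the conversation with the model's own chat template
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```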