Cheapest way to run local LLMs?

ClassroomGold6910@alien.top · 2 years ago

Cheapest way to run local LLMs?

ThinkExtension2328@alien.top · 2 years ago

Honestly, the M1 is probably the cheapest solution you have. Get yourself LM Studio and try out a 7B `K_M` model; you're going to struggle with anything larger than that. But that will let you experience what we are all playing with.

ClassroomGold6910@alien.top · 2 years ago

3B models work amazingly and super smoothly. 7B models run at a fair 15 tokens per second, but they prevent me from using any other application at the same time and occasionally freeze my mouse and screen until the response is finished.

ClassroomGold6910@alien.top · 2 years ago

What's different about `K_M` models? Also, why is `Q_4` legacy but not `Q_4_1`? It would be great if someone could explain that lol

ThinkExtension2328@alien.top · 2 years ago

Not sure about the K, but the M means medium loss of info during the quantisation phase, afaik.

Sea_Particular_4014@alien.top · 2 years ago

Q4_0 and Q4_1 would both be legacy.

`K_M` is the new "k-quant" format (I guess it's not that new anymore; it's been around for months now).

The idea is that the more important layers are done at a higher precision, while the less important layers are done at a lower precision.
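To see why the bit width matters at all, here is a toy sketch in Python. It is not how llama.cpp or any real k-quant works; it is plain round-to-nearest quantization on a made-up list of weights, and the names `quantize` and `mean_abs_error` are just for illustration. It only shows that one extra bit noticeably shrinks the rounding error, which is why spending the extra bits on the layers that matter most is a reasonable trade.

```python
def quantize(values, bits):
    """Toy symmetric round-to-nearest quantization, returned in dequantized form."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit, 15 for 5-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between the original and quantized values."""
    approx = quantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, approx)) / len(values)


# Made-up weights, purely for illustration.
weights = [0.91, -0.42, 0.07, -0.88, 0.33, 0.15, -0.61, 0.5]

err4 = mean_abs_error(weights, 4)
err5 = mean_abs_error(weights, 5)
print(f"4-bit mean abs error: {err4:.4f}")
print(f"5-bit mean abs error: {err5:.4f}")
```

On this sample the 5-bit error comes out smaller than the 4-bit error, at the cost of one more bit per weight.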

It seems to work well, which is why it has become the new standard for the most part.

Q4_K_M does the most important layers at 5-bit and the less important ones at 4-bit.

It is closer in quality/perplexity to q5_0, while being closer in size to q4_0.
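A rough back-of-the-envelope sketch of the size side of that trade-off, in Python. The 5-bit/4-bit split is taken from the description above; the fraction of weights kept at 5-bit (`HIGH_FRACTION`) is a made-up number for illustration, and the helper names are hypothetical. Real quantized files also store per-block scales and other metadata, so actual file sizes will be somewhat larger than these figures.

```python
def effective_bits(high_bits, low_bits, high_fraction):
    """Average bits per weight when a fraction of the weights uses the higher precision."""
    return high_fraction * high_bits + (1.0 - high_fraction) * low_bits


def approx_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB, ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9         # a 7B model
HIGH_FRACTION = 0.25   # assumption: share of weights kept at 5-bit (illustrative only)

mixed_bits = effective_bits(5, 4, HIGH_FRACTION)
size_4bit = approx_size_gb(N_PARAMS, 4)
size_mixed = approx_size_gb(N_PARAMS, mixed_bits)
size_5bit = approx_size_gb(N_PARAMS, 5)

print(f"pure 4-bit : {size_4bit:.2f} GB")
print(f"mixed      : {size_mixed:.2f} GB ({mixed_bits:.2f} bits/weight)")
print(f"pure 5-bit : {size_5bit:.2f} GB")
```

Under these assumptions the mixed model lands much nearer the pure 4-bit size than the pure 5-bit size, which matches the "closer in size to q4_0" point.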