Was wondering if there's any way to use a bunch of old equipment like this to build an at-home crunch center for running your own LLM, and whether it would be worth it.
I tried it. I got something like 1.2 tokens/sec inference on Llama 70B with a mix of cards (four of them 1080s). The process would crash occasionally. Ideally every card would have the same VRAM.
Going to try it with 1660 Tis. I think they may be the ‘sweet spot’ for power-to-price-to-performance.
Did you use some q3 gguf quant with this?
Another consideration: I was told by someone with multiple cards that if you split your layers across them, they don't all process the layers simultaneously.
So with 3 cards you don't get a parallel benefit from all of them working at the same time. It processes the layers on card 1, then card 2, then card 3.
The slowest card drags everything down to its speed. I'm also not sure what this does to model load times or your electricity bill, and you still need a case and PSU big enough to fit them all in.
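To make that concrete, here's a minimal PyTorch sketch of the idea (hypothetical layer counts and dims, assuming 3 CUDA devices are visible). It's not what llama.cpp literally does, just an illustration of why split layers run one card at a time:

```python
# Each GPU holds a contiguous chunk of the model's layers (layer counts/dims are made up).
import torch
import torch.nn as nn

chunks = [
    nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(20)]).to("cuda:0"),
    nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(20)]).to("cuda:1"),
    nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(20)]).to("cuda:2"),
]

def forward(x: torch.Tensor) -> torch.Tensor:
    # The hidden state hops from card to card. While cuda:1 is working,
    # cuda:0 and cuda:2 sit idle for this token, so the slowest card
    # (plus the PCIe transfers) sets your tokens/sec.
    for i, chunk in enumerate(chunks):
        x = x.to(f"cuda:{i}")
        x = chunk(x)
    return x
```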
Those series of Nvidia GPUs didn't have tensor cores yet; I believe those started with the 20xx series. I'm not sure how much that matters for inference vs. training/fine-tuning, but it's worth doing more research. From what I gathered the answer is "no", unless you use a 10xx for something like monitor output, TTS, or another small co-LLM job that you don't want taking VRAM away from your main LLM GPUs.
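A rough sketch of that "10xx as a side GPU" idea: pin the small helper model to one specific card so it never touches VRAM on your main LLM GPUs. The device index here is an assumption, check nvidia-smi for your own layout:

```python
import os

# Hide every GPU except the old 10xx (say it shows up as index 2) before any
# CUDA library initializes, so this helper process can only see that card.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch
assert torch.cuda.is_available()
device = torch.device("cuda:0")  # inside this process, the 10xx is now device 0

# Load whatever small co-model (TTS, embeddings, a 7B quant, ...) and .to(device) it;
# the cards running your main LLM in another process stay untouched.
```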
I wish they'd come up with some extendable tensor chips that could work with old laptops.
Currently 7B is the only model we can run comfortably. Even 13B is slower and needs quite a bit of adjustment.
You might as well use the cards if you have them already. I'm currently getting around 5-6 tokens per second running nous-capybara 34B Q4_K_M on a 2080 Ti 22GB and a P102 10GB (basically a semi-lobotomized 1080 Ti). The P102 does bottleneck the 2080 Ti, but hey, at least it runs at a near-usable speed! If I try running on CPU (I have an R9 3900) I get something closer to 1 token per second.
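For anyone wanting to try a similar mismatched-card split, here's a hedged llama-cpp-python sketch. The model filename is a placeholder and the split ratio is just a guess based on the 22GB + 10GB pair above, not the poster's actual settings:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="nous-capybara-34b.Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=-1,             # offload all layers to GPU if they fit
    tensor_split=[22.0, 10.0],   # rough proportion of the model per card (device 0, device 1)
    n_ctx=4096,
)

print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```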
How did you get your 2080 Ti to 22GB of VRAM?
Modded cards are quite easy to obtain in China.
Hopefully the proposed S-LoRA will let us do more with less.
The ONLY Pascal card worth bothering with is the P40. It's not fast, but it's the cheapest way to get a whole bunch of usable VRAM. Nothing else from that generation is worth the effort.