I have a server with 512GB RAM and 2x Intel Xeon 6154. It has a spare x16 PCIe 3.0 slot once I get rid of my current GPU.
I’d like to add a better GPU so I can generate paper summaries (the responses can take a few minutes to come back) that are significantly better than the quality I get now with 4-bit Llama 2 13B. Anyone know what’s the minimum GPU I should be looking at with this setup to upgrade to the 70B model? Will hybrid CPU+GPU inference with an RTX 4090 24GB be enough?
Your 512GB of RAM is overkill. Those Xeons are probably pretty mediocre for this sort of thing due to their slow memory bandwidth, unfortunately.
With a 4090 or 3090, you should get about 2 tokens per second on 70B with GGUF q4_K_M hybrid inference. That’s what I do and I find it tolerable, but it depends on your use case.
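If it helps, here’s roughly what that hybrid setup looks like with llama-cpp-python. This is just a sketch: the model filename is a placeholder, and the n_gpu_layers value is something you’d tune until the 24GB card is nearly full; whatever doesn’t fit runs on the CPU.

```python
# Sketch of hybrid CPU+GPU inference via llama-cpp-python
# (pip install llama-cpp-python, built with CUDA support).
# Model path and n_gpu_layers are placeholders, not a recipe:
# raise n_gpu_layers until VRAM is nearly full; the remaining
# layers run on the CPU and are bound by system memory bandwidth.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=40,   # ~half of the 70B's 80 layers fits in 24GB at q4_K_M
    n_ctx=4096,        # context window; the KV cache also eats VRAM
    n_threads=16,      # physical cores working the CPU-side layers
)

out = llm("Summarize the following paper:\n...", max_tokens=512)
print(out["choices"][0]["text"])
```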
You’d need a 48GB GPU, or fast DDR5 RAM to get faster generation than that.
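Rough back-of-envelope on why, assuming ballpark figures: a ~40GB q4_K_M 70B, ~1TB/s of bandwidth on a 4090, and ~128GB/s per socket from six-channel DDR4-2666 on those Xeons. Decoding is memory-bandwidth-bound, so every token reads roughly all the weights once:

```python
# Back-of-envelope token-rate ceiling for bandwidth-bound decoding.
# Each generated token streams (roughly) all model weights once, so
# time/token ~= bytes_on_gpu/gpu_bw + bytes_on_cpu/cpu_bw.
# All constants below are ballpark assumptions, not measurements.
MODEL_BYTES = 40e9    # ~40GB for a 70B at q4_K_M
GPU_BW      = 1000e9  # RTX 4090, ~1TB/s
CPU_BW      = 128e9   # 6-channel DDR4-2666, per socket

def tokens_per_sec(gpu_bytes: float) -> float:
    cpu_bytes = MODEL_BYTES - gpu_bytes
    return 1 / (gpu_bytes / GPU_BW + cpu_bytes / CPU_BW)

print(f"24GB card (~20GB of weights on GPU): {tokens_per_sec(20e9):.1f} t/s ceiling")
print(f"48GB card (all weights on GPU):      {tokens_per_sec(40e9):.1f} t/s ceiling")
```

Real throughput lands well under those ceilings (overhead, KV cache, prompt processing), which is how you end up around 2 t/s on the hybrid setup.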
OP seems to want 5-10 t/s on a budget with 70B… Not going to happen, I think.