I want to run a 70B LLM locally with more than 1 T/s. I have a 3090 with 24GB VRAM and 64GB RAM on the system.
What I managed so far:
- Found instructions to run 70B entirely in VRAM with a ~2.5-bit quant; it ran fast, but the perplexity was unbearable and the LLM was barely coherent.
- I somehow got 70B running with some mix of RAM/VRAM offloading, but it only managed 0.1 T/s.
I've seen people claiming reasonable T/s speeds. Since I'm a newbie, I barely speak the domain language, and most instructions I found assume implicit knowledge I don't have.
I need explicit instructions: exactly which 70B model to download, which model loader to use, and how to set the parameters that matter in this context.
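For context, the kind of offloading setup I've been fumbling with is roughly the following llama.cpp invocation (the model file name, layer count, and thread count here are my guesses, not a known-good recipe):

```shell
# Sketch of partial GPU offload in llama.cpp: put as many layers as fit
# into the 24GB of VRAM, leave the rest in system RAM.
# -m   : path to a 70B GGUF quant (this exact file name is a guess)
# -ngl : number of layers offloaded to the GPU; raise until VRAM is nearly full
# -c   : context length
# -t   : CPU threads used for the layers left in system RAM
./llama-cli -m ./models/llama-70b.Q4_K_M.gguf -ngl 40 -c 4096 -t 8 -p "Hello"
```

Is this even the right general approach, and if so, what values should those parameters take on a 3090 + 64GB RAM box?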
Have you tried FP4 quantization combined with RAM offloading?