Question about the possibility of running large models on a 3070 Ti with 32 GB of system RAM: what's the best way to run them, if it's possible without quality loss?

Speed isn’t an issue; I just want to be able to run such models ambiently.

  • FullOf_Bad_Ideas@alien.top · 10 months ago

    I believe that GPU offloading in llama.cpp can be used to combine your VRAM and system RAM. I would suggest trying an Airoboros Llama 2 70B Q3_K_M quant, and Tess-M-1.3 at Q5_K_M once TheBloke makes quants. There will be some leftover space in your RAM after loading Tess, but it’s a model with 200K context, so you will need that headroom for the context. Max out your VRAM, and maybe use a batch size of -1 to trade prompt-processing speed for more VRAM space. Try offloading with both cuBLAS and CLBlast; last time I checked, CLBlast seemed to let you offload more layers to the GPU in the same memory footprint.
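
    If it helps, here is a minimal sketch of that kind of partial offload using the llama-cpp-python bindings. The file name, layer count, and context size are placeholder assumptions you would adjust for your own download and for how much of the 3070 Ti's 8 GB of VRAM is free; it's an illustration of the technique, not a tuned configuration.

    ```python
    # Sketch: offload some transformer layers to the GPU, keep the rest in system RAM.
    # Model path and numbers below are assumptions, adjust for your own setup.
    from llama_cpp import Llama

    llm = Llama(
        model_path="airoboros-l2-70b.Q3_K_M.gguf",  # whatever GGUF quant you downloaded
        n_gpu_layers=20,   # raise until VRAM is nearly full, lower if you hit out-of-memory
        n_ctx=4096,        # context window; larger contexts need more memory
        n_batch=512,       # prompt-processing batch size; smaller values save VRAM
    )

    out = llm("Q: What does GPU offloading do in llama.cpp? A:", max_tokens=64)
    print(out["choices"][0]["text"])
    ```

    The equivalent with the llama.cpp command-line tools is the `-ngl` / `--n-gpu-layers` option; whether cuBLAS or CLBlast is used depends on how the library was built.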