Question about the possibility of running large models on a 3070 Ti with 32 GB of system RAM: what's the best way to run them, if it's possible without quality loss?

Speed isn’t an issue; I just want to be able to run such models ambiently.

  • FullOf_Bad_Ideas@alien.top · 10 months ago

    I believe that GPU offloading in llama.cpp can be used to combine your VRAM and system RAM. I would suggest trying an Airoboros Llama 2 70B Q3_K_M quant, and Tess-M-1.3 at Q5_K_M once TheBloke makes quants. There will be some leftover space in your RAM after loading Tess, but it’s a model with 200K context, so you will need that headroom for the context. Max out your VRAM, and maybe use a batch size of -1 to trade prompt-processing speed for more VRAM space. Try offloading with both cuBLAS and CLBlast; last time I checked, CLBlast seemed to let you offload more layers to the GPU in the same memory footprint.
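
    If it helps, here is a minimal sketch of that kind of partial offload using the llama-cpp-python bindings. The file name, layer count, and context size are placeholder assumptions you would adjust for your own download and for how much of the 3070 Ti's 8 GB of VRAM is free; it's an illustration of the technique, not a tuned configuration.

    ```python
    # Sketch: offload some transformer layers to the GPU, keep the rest in system RAM.
    # Model path and numbers below are assumptions, adjust for your own setup.
    from llama_cpp import Llama

    llm = Llama(
        model_path="airoboros-l2-70b.Q3_K_M.gguf",  # whatever GGUF quant you downloaded
        n_gpu_layers=20,   # raise until VRAM is nearly full, lower if you hit out-of-memory
        n_ctx=4096,        # context window; larger contexts need more memory
        n_batch=512,       # prompt-processing batch size; smaller values save VRAM
    )

    out = llm("Q: What does GPU offloading do in llama.cpp? A:", max_tokens=64)
    print(out["choices"][0]["text"])
    ```

    The equivalent with the llama.cpp command-line tools is the `-ngl` / `--n-gpu-layers` option; whether cuBLAS or CLBlast is used depends on how the library was built.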