Hi. I have LLaMA2-13B-Tiefighter-exl2_5bpw and (probably) the same model as LLaMA2-13B-Tiefighter.Q5_K_M.

I run it on a 1080 Ti and an old Threadripper with 64 GB of 4-channel DDR4-3466. I use oobabooga (for GGUF and exl2) and LMStudio. I have the 531.68 Nvidia driver (so I get an OOM, not RAM swapping, when VRAM overflows).

1st question: I read that exl2 consumes less VRAM and runs faster than GGUF. I loaded it in oobabooga (ExLlamaV2_HF) and it fits in my 11 GB of VRAM (consumes ~10 GB), but it produces only 2.5 t/s, while GGUF (llama.cpp backend) with 35 layers offloaded to the GPU gives 4.5 t/s. Why? Am I missing some important settings? (My load setup is sketched below.)
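
For reference, this is roughly what the llama.cpp-side load looks like, sketched with llama-cpp-python (not the exact oobabooga loader; the path, context size, and thread count are just placeholders for my setup):

    from llama_cpp import Llama

    # Load the GGUF with 35 of the model's layers offloaded to the 1080 Ti;
    # the remaining layers stay on the CPU/RAM.
    llm = Llama(
        model_path="LLaMA2-13B-Tiefighter.Q5_K_M.gguf",  # placeholder path
        n_gpu_layers=35,  # same offload count I use in oobabooga/LMStudio
        n_ctx=4096,       # context length (assumed)
        n_threads=8,      # CPU threads for the non-offloaded layers (assumed)
    )

    out = llm("Once upon a time", max_tokens=64)
    print(out["choices"][0]["text"])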

2nd question: In LMStudio (llama.cpp backend?) with the same settings and the same 35 layers offloaded to the GPU, I get only 2.3 t/s. Why? Same backend, same GGUF, same sampling and context settings.

  • tntdeez@alien.top · 10 months ago
    exl2 processes most things in FP16, which the 1080 Ti, being from the Pascal era, is veryyy slow at. GGUF/llama.cpp, on the other hand, can use an FP32 pathway when required on older cards; that's why it's quicker on those cards.
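
    If you want to see the gap directly, here's a minimal sketch (my own example, not from exl2 or llama.cpp; assumes PyTorch with CUDA installed) that compares FP32 vs FP16 matmul throughput on your card. On a Pascal card like the 1080 Ti the FP16 number should come out much lower:

        import time
        import torch

        def bench(dtype, n=4096, iters=20):
            # Time square n x n matmuls in the given precision.
            a = torch.randn(n, n, device="cuda", dtype=dtype)
            b = torch.randn(n, n, device="cuda", dtype=dtype)
            torch.cuda.synchronize()
            start = time.time()
            for _ in range(iters):
                a @ b
            torch.cuda.synchronize()
            elapsed = time.time() - start
            # Each matmul is roughly 2*n^3 floating-point operations.
            return 2 * n**3 * iters / elapsed / 1e12  # TFLOPS

        print(f"FP32: {bench(torch.float32):.1f} TFLOPS")
        print(f"FP16: {bench(torch.float16):.1f} TFLOPS")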