Hi. I have LLaMA2-13B-Tiefighter-exl2_5bpw and (probably) the same model in GGUF form, LLaMA2-13B-Tiefighter.Q5_K_M.
I run them on a 1080 Ti and an old Threadripper with 64 GB of 4-channel DDR4-3466. I use oobabooga (for GGUF and exl2) and LM Studio. I'm on Nvidia driver 531.68 (so I get OOM errors rather than RAM swapping when VRAM overflows).
1st question: I've read that exl2 consumes less VRAM and runs faster than GGUF. I loaded it in oobabooga (ExLlamaV2_HF) and it fits in my 11 GB of VRAM (consuming ~10 GB), but it produces only 2.5 t/s, while GGUF (llama.cpp backend) with 35 layers offloaded to the GPU gives 4.5 t/s. Why? Am I missing some important setting?
2nd question: In LM Studio (llama.cpp backend?) with the same settings and the same 35 layers offloaded to the GPU, I get only 2.3 t/s. Why? Same backend, same GGUF, same sampling and context settings.
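For reference, here's what the same 35-layer offload looks like driven directly through llama-cpp-python, bypassing both GUIs (a minimal sketch, assuming the CUDA build of llama-cpp-python; the model path is hypothetical):

```python
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

# Hypothetical local path to the quant named above.
llm = Llama(
    model_path="./LLaMA2-13B-Tiefighter.Q5_K_M.gguf",
    n_gpu_layers=35,  # same 35-layer GPU offload as in oobabooga/LM Studio
    n_ctx=4096,       # context window; raising it raises VRAM use
)

out = llm("Once upon a time", max_tokens=64)
print(out["choices"][0]["text"])
```

When two frontends wrapping the same backend give different speeds, a bare invocation like this helps isolate which frontend is adding overhead or silently changing a setting.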
exl2 does most of its compute in FP16, which the 1080 Ti, being a Pascal-era consumer card, is veryyy slow at (on GP102, FP16 runs at roughly 1/64 the FP32 rate). GGUF/llama.cpp, on the other hand, can fall back to an FP32 pathway on older cards, which is why it's quicker there.
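You can measure the Pascal FP16 penalty directly with a quick matmul microbenchmark (a sketch assuming PyTorch with CUDA; exact numbers will vary by card, driver, and matrix size):

```python
import time
import torch

def matmul_tflops(dtype: torch.dtype, n: int = 4096, iters: int = 20) -> float:
    """Time n x n matmuls on the GPU and return effective TFLOPS."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(3):  # warm-up so cuBLAS picks its kernels
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    return 2 * n**3 * iters / elapsed / 1e12  # 2*n^3 FLOPs per matmul

print(f"FP32: {matmul_tflops(torch.float32):.2f} TFLOPS")
print(f"FP16: {matmul_tflops(torch.float16):.2f} TFLOPS")
```

On a GP102 card like the 1080 Ti the FP16 figure should come out far below the FP32 one, while on Turing or newer cards FP16 wins comfortably.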