Is this accurate?
Too bad Windows support for it was lacking, at least the last time I checked. It needs a separate dependency to work properly, and that dependency was Linux-only.
It works fine for me. I am also using a 3090 and text-gen-webui like Liquiddandruff.
In my experience it’s the fastest and llama.cpp is the slowest.
I think ExLlama (and ExLlamaV2) is great. EXL2’s ability to quantize to arbitrary bpw and its incredibly fast prefill processing generally make it the best real-world choice for modern consumer GPUs. However, from testing on my workstations (5950X CPU and 3090/4090 GPUs), llama.cpp actually edges out ExLlamaV2 for inference speed (with a q4_0 even beating a 3.0bpw), so I don’t think it’s quite so cut and dry.
For those looking for max batch=1 perf, I’d highly recommend running your own benchmarks at home on your own system to see what works (and pay attention to prefill speeds if you often have long context)!
My benchmarks from a month or two ago: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1788227831
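For anyone wanting to reproduce this at home, here’s a minimal, backend-agnostic timing sketch. The `generate` callable is a stand-in for whichever loader you’re testing (ExLlamaV2, llama.cpp, GPTQ, …) and is assumed to return the generated text plus the number of new tokens:

```python
import time

def benchmark(generate, prompt, max_new_tokens=256, runs=3):
    """Average tokens/second for batch=1 generation with a given backend."""
    generate(prompt, max_new_tokens)  # warm-up so kernel compilation/caching doesn't skew timing
    speeds = []
    for _ in range(runs):
        start = time.perf_counter()
        _, n_tokens = generate(prompt, max_new_tokens)
        speeds.append(n_tokens / (time.perf_counter() - start))
    return sum(speeds) / len(speeds)

# Run it once with a short prompt (generation speed) and once with a very
# long prompt (prefill speed) to see both numbers that matter in practice.
```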
Thanks for sharing! I have been struggling with the llama.cpp loader and GGUF (using oobabooga and the same LLM model); no matter how I set the parameters or how many layers I offload to the GPUs, llama.cpp is way slower than ExLlama (v1 & v2), not just a bit slower but an order of magnitude slower. I really don’t know why.
Can you offload layers with this like GGUF?
I don’t have much VRAM / RAM so even when running a 7B I have to partially offload layers.
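For reference, partial offload with a GGUF model looks roughly like this via llama-cpp-python. The model path and layer count here are made up for illustration; `n_gpu_layers` is the knob that decides how many layers live in VRAM while the rest run on CPU:

```python
# Partial GPU offload of a GGUF model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical file
    n_gpu_layers=20,   # layers kept in VRAM; the remainder run on CPU
    n_ctx=4096,        # context window
)

out = llm("Q: What is ExLlamaV2? A:", max_tokens=64)
print(out["choices"][0]["text"])
```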
I’ve tested pretty much all of the available quantization methods and I prefer exllamav2 for everything I run on GPU, it’s fast and gives high quality results. If anyone wants to experiment with some different calibration parquets, I’ve taken a portion of the PIPPA data and converted it into various prompt formats, along with a portion of the synthia instruction/response pairs that I’ve also converted into different prompt formats. I’ve only tested them on OpenHermes, but they did make coherent models that all produce different generation output from the same prompt.
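If anyone wants to roll their own calibration file, a parquet of raw text rows is easy to produce with pandas. The "text" column name below matches the wikitext parquet commonly used for calibration, but that’s an assumption here, so double-check what your quantization script actually expects:

```python
# Build a small calibration parquet from your own prompt-formatted samples.
# Requires pyarrow (or fastparquet) for DataFrame.to_parquet.
import pandas as pd

samples = [
    "### Instruction:\nSummarize the plot of Hamlet.\n### Response:\n...",
    "USER: What is tail-free sampling?\nASSISTANT: ...",
    # ... more rows in the prompt formats you actually use at inference time
]

pd.DataFrame({"text": samples}).to_parquet("calibration.parquet", index=False)
```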
So how much VRAM would be required for a 34B or a 14B model? I assume no CPU offloading, right? With my 12 GB of VRAM, I guess I could only fit a 14-billion-parameter model, and maybe not even that.
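Back-of-envelope, under the simplifying assumption that weight memory is just parameters × bits per weight (KV cache and overhead add more on top):

```python
# Rough lower bound on EXL2 weight size: params * bits-per-weight / 8.
# Ignores KV cache and activation overhead, which grow with context length.
def weight_gb(params_billions, bpw):
    return params_billions * 1e9 * bpw / 8 / 1e9

for params, bpw in [(34, 4.0), (34, 3.0), (14, 4.0), (14, 5.0)]:
    print(f"{params}B @ {bpw} bpw ≈ {weight_gb(params, bpw):.1f} GB")

# 34B @ 4.0 bpw ≈ 17.0 GB  -> too big for 12 GB even before cache
# 14B @ 4.0 bpw ≈ 7.0 GB   -> plausible on 12 GB with modest context
```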
No chance of running this on P40s any time soon?
It’s not just great. It’s a piece of art.
Does it run on Apple Silicon?
Based on the releases, doesn’t look like it. https://github.com/turboderp/exllamav2/releases
I wish there was support for metal with ExLlamav2. :(
I’m the author of this article, thank you for posting it! If you don’t want to use Medium, here’s the link to the article on my blog: https://mlabonne.github.io/blog/posts/ExLlamaV2_The_Fastest_Library_to_Run%C2%A0LLMs.html
I’m a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn’t really require flash-attn-2 to run “properly”, it just runs a little better that way. But it’s perfectly usable without it. Great article, though. Thanks. :)
Thanks for your excellent library! It makes sense because I started writing this article about two months ago (chatcode.py is still mentioned in the README.md, by the way). I had very low throughput using ExLlamaV2 without flash-attn-2. Do you know if that’s still the case? I updated these two points, thanks for your feedback.
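If anyone wants to check whether flash-attn-2 is actually installed in their environment, a quick probe:

```python
# Check whether the flash-attn package is importable in the current env.
try:
    import flash_attn
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; ExLlamaV2 falls back to its own attention path")
```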
Hey he finally gets some recognition.
Is this able to use CPU (similar to llama.cpp)?
God, I can’t wait until we’re past the command-line era of this stuff.
I’m the opposite. I shun everything LLM-related that isn’t command line when I can. Everything has its place. When dealing with media, a GUI is the way to go. But when dealing with text, the command line is fine. I don’t need animated pop-up bubbles.
Agreed. Best performance running GPTQs. Missing the HF samplers, but that’s OK.
I recently added Mirostat, min-P (the new one), tail-free sampling, and temperature-last as an option. I don’t personally put much stock in having an overabundance of sampling parameters, but they are there now for better or worse. So for the exllamav2 (non-HF) loader in TGW, it can’t be long before there’s an update to expose those parameters in the UI.
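For anyone wiring this up in a script rather than the UI, a rough sketch of the sampler settings object: the temperature/top_k/top_p/token_repetition_penalty attributes follow the repo’s example scripts, while the min_p and temperature_last names are my assumption and may differ between exllamav2 versions.

```python
# Sketch of sampler settings with ExLlamaV2's sampler object.
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_k = 50
settings.top_p = 0.9
settings.token_repetition_penalty = 1.05

# Newer options mentioned above (attribute names assumed; check your version):
settings.min_p = 0.05              # min-P sampling
settings.temperature_last = True   # apply temperature after the other filters
```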