I’m the author of this article. Thank you for posting it! If you don’t want to use Medium, here’s the link to the article on my blog: https://mlabonne.github.io/blog/posts/ExLlamaV2_The_Fastest_Library_to_Run%C2%A0LLMs.html
I’m a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn’t really require flash-attn-2 to run “properly”; it just runs a little better that way. But it’s perfectly usable without it. Great article, though. Thanks. :)
Thanks for your excellent library! It makes sense, because I started writing this article about two months ago (chatcode.py is still mentioned in the README.md, by the way). I had very low throughput using ExLlamaV2 without flash-attn-2. Do you know if that’s still the case? I’ve updated these two points; thanks for your feedback.
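For readers following this exchange: whether flash-attn-2 is installed is easy to probe from Python before running ExLlamaV2. Below is a minimal sketch; has_flash_attn is a hypothetical helper name, not part of ExLlamaV2’s own API.

```python
# Minimal sketch: check whether the flash-attn package is importable.
# ExLlamaV2 runs fine without it; flash-attn 2 mainly improves throughput.
import importlib.util


def has_flash_attn() -> bool:
    """Return True if the flash_attn module (flash-attn 2) can be imported."""
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    if has_flash_attn():
        print("flash-attn 2 detected: the faster attention kernels can be used.")
    else:
        print("flash-attn 2 not found: ExLlamaV2 will still run, just somewhat slower.")
```

If the check comes back negative, the package is typically installed with pip install flash-attn, though building it requires a working CUDA toolchain.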