• 2 Posts
  • 14 Comments
Joined 1 year ago
Cake day: October 26th, 2023

  • Thanks to the hard work of kingbri, Splice86, and turboderp, we have a new API server for LLMs using the exllamav2 loader: TabbyAPI! It is in a very alpha state, so if you want to test it, expect things to change and break.

    TabbyAPI also works with SillyTavern! With some extra configuration, it can be used as a backend there as well.

    As a reminder, exllamav2 recently added the mirostat, TFS, and min-p samplers, so if you were only using exllama_hf/exllamav2_hf in ooba for those, those loaders aren't needed anymore (a rough request sketch showing them is included at the end of this post).

    Enjoy!
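
    For anyone who wants to poke at it, here is a minimal sketch of what a request could look like. It assumes TabbyAPI is running locally on port 5000 with an OpenAI-style /v1/completions endpoint and that the sampler fields (min_p, mirostat) are passed through in the request body as named below; since this is alpha, check the repo for the actual endpoint names, field names, and auth setup.

    ```python
    # Minimal sketch: query a local TabbyAPI instance.
    # Assumptions (verify against the TabbyAPI docs/repo):
    #   - the server listens on http://localhost:5000
    #   - an OpenAI-style /v1/completions endpoint exists
    #   - the auth header and sampler field names below match the server
    import requests

    resp = requests.post(
        "http://localhost:5000/v1/completions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
        json={
            "prompt": "Once upon a time",
            "max_tokens": 200,
            "temperature": 0.8,
            "min_p": 0.05,       # min-p sampling, now supported by exllamav2
            "mirostat_mode": 2,  # hypothetical field name for mirostat
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["text"])
    ```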



  • The major reason I use exl2 is speed: on 2x4090 I get 15-20 t/s at 70b depending on the quant size, while with GGUF I get 4-5 t/s at most.

    When using 3 GPUs (2x4090 + 1x3090), it is 11-12 t/s at 6.55bpw, versus GGUF Q6_K which runs at 2-3 t/s.

    Though I agree with you: for model comparisons and such, you need deterministic results and also the best quality.

    If you can, sometime try 70b at 6bpw or more; IMO it is pretty consistent and doesn't have the issues that 5bpw quants do.

    The performance hit on multi-GPU systems is just too big with GGUF. If the speed reaches the same level in the future, I guess I would use it most of the time (a rough way to measure this yourself is sketched below).
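
    For anyone who wants to reproduce this kind of number, here is a minimal tokens-per-second sketch using exllamav2's basic generator. The model path and the per-GPU split (in GB) are placeholders for your own setup, and the API follows the examples in the exllamav2 repo at the time of writing, so double-check against the current version.

    ```python
    # Rough tokens/sec measurement with exllamav2.
    # Placeholders: model_dir and the gpu_split values are for illustration.
    import time

    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    config = ExLlamaV2Config()
    config.model_dir = "/models/llama2-70b-6.55bpw-exl2"  # placeholder path
    config.prepare()

    model = ExLlamaV2(config)
    model.load(gpu_split=[20, 20, 20])  # GB per GPU, e.g. 2x4090 + 1x3090

    tokenizer = ExLlamaV2Tokenizer(config)
    cache = ExLlamaV2Cache(model)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.8
    settings.min_p = 0.05  # one of the newly added samplers

    num_tokens = 256
    start = time.time()
    output = generator.generate_simple("Once upon a time", settings, num_tokens)
    elapsed = time.time() - start

    print(output)
    print(f"{num_tokens / elapsed:.1f} tokens/sec")  # includes prompt processing time
    ```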