I want to use ExLlama because it lets me run the Llama 70B model on my two RTX 4090s. I managed to get it working pretty easily via text-generation-webui, and inference is really fast! So far so good…
However, I need the model in Python to do some large-scale analyses, and I can't seem to find any guide or tutorial explaining how to use ExLlama in the usual Python/Hugging Face setup.
Is this just not possible? If it is, can someone point me to some example code that uses ExLlama in Python?
Much appreciated!
Check out turbo’s project https://github.com/turboderp/exui
He put it up not long ago, and it already has speculative decoding working. I tried it with Goliath 120B (4.85 bpw exl2) and was getting 11-13 t/s vs 6-8 t/s without it. It's barebones, but it works.
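If you'd rather call ExLlama directly from Python instead of going through exui or the webui, the exllamav2 repo also exposes a plain Python API. Here's a minimal sketch, loosely based on the inference examples in that repo; the class names, the `load_autosplit` call, and the `generate_simple` signature are taken from its example scripts and may differ between versions, and the model path is a placeholder:

```python
# Minimal sketch based on the examples in turboderp/exllamav2.
# Assumes: pip install exllamav2, plus an exl2-quantized model on disk.
# Names/signatures below follow that repo's examples and may change
# between versions -- treat this as a starting point, not gospel.

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/your-exl2-model"  # hypothetical local path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache lets autosplit place layers
model.load_autosplit(cache)               # splits weights across both 4090s

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

output = generator.generate_simple("The capital of France is", settings, 64)
print(output)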
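```

For large-scale analyses you can just loop your prompts through `generate_simple`; I believe it also accepts a list of prompts for batched generation, but check the repo's batching example to be sure.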