Nice, a lightweight loader. It will free us from Gradio.
Gradio is a 70 MB requirement, FYI. It has become common to see people calling text-generation-webui “bloated”, when most of the installation size is in fact due to PyTorch and the CUDA runtime libraries.
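If you want to see where the disk usage in your own environment actually goes, here is a minimal sketch that sums package directory sizes in site-packages. The package prefixes (`gradio`, `torch`, and the `nvidia` CUDA runtime wheels) are just examples; adjust for your install:

```python
# Hedged sketch: roughly measure installed package footprints.
import sysconfig
from pathlib import Path

site = Path(sysconfig.get_paths()["purelib"])

def dir_size_mb(path: Path) -> float:
    """Total size of all files under path, in megabytes."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / (1024 * 1024)

for prefix in ("gradio", "torch", "nvidia"):  # 'nvidia' holds CUDA runtime wheels
    for candidate in sorted(site.glob(f"{prefix}*")):
        if candidate.is_dir():
            print(f"{candidate.name}: {dir_size_mb(candidate):.1f} MB")
```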
> Gradio is a 70MB requirement
That doesn’t make it fast, just small. Inefficient code can be compact.
I think there is room for everyone - Text Gen is a piece of art - it's the only thing in the whole space that always works and is reliable. However, if I'm building an agent and shipping a Docker build, I can't afford to pull in the whole text-gen stack.
Thanks to the hard work of kingbri, Splice86 and turboderp, we have a new API server for LLMs built on the exllamav2 loader! It is still in a very alpha state, so if you want to test it, be aware that things may change.
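For anyone who wants to poke at it, here is a minimal sketch of hitting the server over HTTP. It assumes TabbyAPI exposes an OpenAI-style /v1/completions endpoint on localhost port 5000 with bearer-token auth - verify all of that against the README and your config, since the project is alpha:

```python
# Hedged sketch: query a locally running TabbyAPI instance.
# URL, port, auth header, and response shape are assumptions -- check
# your config and the project README for the actual values.
import requests

API_URL = "http://127.0.0.1:5000/v1/completions"  # assumed default host/port
API_KEY = "your-api-key-here"                     # assumed: key from server config

payload = {
    "prompt": "Explain speculative decoding in one sentence.",
    "max_tokens": 128,
    "temperature": 0.7,
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
resp.raise_for_status()
# Assumes an OpenAI-style response body with a "choices" list.
print(resp.json()["choices"][0]["text"])
```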
TabbyAPI also works with SillyTavern! It just takes a bit of extra configuration.
As a reminder, exllamav2 recently added mirostat, TFS and min-p sampling, so if you were using exllama_hf/exllamav2_hf in ooba just for those samplers, those loaders are no longer needed.
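For reference, here is a rough sketch of setting those samplers directly through exllamav2's Python API. The attribute names (min_p, tfs, mirostat) follow the release notes, and the model path is hypothetical; the library moves fast, so double-check against the repo:

```python
# Hedged sketch: using the newly added samplers via exllamav2's Python API.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/my-model-exl2"  # hypothetical path to an EXL2 model
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.min_p = 0.05        # min-p: drop tokens below 5% of the top token's prob
settings.tfs = 0.95          # tail-free sampling
settings.mirostat = True     # mirostat adaptive sampling
settings.mirostat_tau = 5.0

print(generator.generate_simple("Hello,", settings, 64))
```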
Enjoy!
Does anyone know if they expose all the hooks that Guidance uses for its guided generation and speedups? This plus Guidance (KV cache reuse, grammar control, etc.) would be seriously fast!