I’ve been out of the loop for a bit, so despite this thread coming back again and again, I’m finding it useful/relevant/timely.
What I’m having a hard time figuring out is whether I’m still SOTA running text-generation-webui with exllama_hf. Thus far I’ve ALWAYS used GPTQ on Ubuntu, and I like to keep everything in VRAM on 2x3090s. (I also run my own custom chat front-end, so all I really need is an API.)
I know exllamav2 is out, the exl2 format is a thing, and GGUF has supplanted GGML. I’ve also noticed a ton of quants from TheBloke in AWQ format (often *only* AWQ, with no GPTQ available), but I’m not clear on which front-ends support AWQ. (I looked at vLLM, but it seems more like a library/package than a front-end.)
edit: Just checked, and it looks like text-generation-webui supports AutoAWQ. Guess I should have checked that earlier.
I guess I’m still curious whether others are using something besides text-generation-webui for all-VRAM model loading. My only issue with text-generation-webui (that comes to mind, anyway) is that it’s single-threaded; for experimenting with agents, it would be nice to be able to fire off requests multi-threaded.
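To illustrate what I mean by multi-threaded agent experiments, here's a minimal sketch assuming an OpenAI-compatible chat endpoint like the one text-generation-webui exposes when launched with `--api` (the URL, port, and generation parameters below are placeholders — adjust for your setup):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder: default OpenAI-compatible endpoint for text-generation-webui --api
API_URL = "http://127.0.0.1:5000/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    # Standard OpenAI-style chat payload; the server uses whatever model is loaded.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        "temperature": 0.7,
    }

def query(prompt: str) -> str:
    # Synchronous single request against the local endpoint.
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def run_agents(prompts: list[str]) -> list[str]:
    # Fan several agent prompts out concurrently. If the backend is
    # single-threaded, generations still serialize server-side, but the
    # client-side agents don't block each other while waiting.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(query, prompts))
```

The point is that with a single-threaded backend, `run_agents` still ends up queuing server-side; a backend with real batched/concurrent serving (which is where something like vLLM comes in) is what would actually make this parallel.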
Looks really nice. I watched the video demo, but I can’t say my coding experience really calls for any of the things it shows. Most of what I deal with is managing the integration of a large set of data models; the actual coding is the easy part, and figuring out what to code is the hard part.