I know ooba supposedly works on Windows; I had it up and running in Ubuntu, but Windows "error corrected" the boot record, so I can't access that environment anymore.
But I'm not too interested in roleplay chat, so I'm fine with (and might actually prefer) running it through a Python script. (I'd like to get more than one model up and running simultaneously for an "LLM village" NPC interaction experiment, but I digress.)
Looking at HF I see some code snippets, but there's a variety of libraries and approaches. Is there anything considered a "gold standard" as of late for local Windows LLMs that isn't a pain in the ass to set up and supports the latest quantization flavors? I'll aim to run on 24GB VRAM, but I also have 64GB system RAM, so the option to split across both would be appreciated; primarily, though, I'm aiming for GPU.
Usually it's going to depend on what format of models you're using.
I’m a big GGUF user, so I would use https://github.com/abetlen/llama-cpp-python.git.
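If you go that route, the basic pattern looks something like this (a minimal sketch; the model path is just a placeholder for whatever GGUF file you download):

```python
from llama_cpp import Llama

# Load a local GGUF file; the path here is a placeholder.
llm = Llama(
    model_path="./models/your-model.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU; use a smaller number to split with system RAM
    n_ctx=4096,       # context window size
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```

Since it's just llama.cpp underneath, dialing n_gpu_layers down lets you spill into your 64GB of system RAM whenever a model doesn't fit in 24GB of VRAM.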
If you're a big GPTQ user, you might use https://github.com/PanQiWei/AutoGPTQ or https://github.com/turboderp/exllama.
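The AutoGPTQ route looks roughly like this (an untested sketch based on its README; the repo name is hypothetical, swap in an actual GPTQ quant):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/SomeModel-GPTQ"  # hypothetical repo; point at a real GPTQ quant
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the pre-quantized weights directly onto the GPU.
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0", use_safetensors=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```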
If you're just looking for non-quantized models, or you simply prefer it anyway, you could use https://huggingface.co/docs/transformers/index.
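With plain transformers it would look something like this (a sketch; the model ID is just an example, and device_map="auto", which needs the accelerate package installed, is what lets it overflow into system RAM):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model; pick whatever fits your VRAM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 so a 7B model fits comfortably in 24GB
    device_map="auto",          # place layers on GPU first, spill to CPU RAM if needed
)

inputs = tokenizer("The village blacksmith said:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```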