If you’re on Windows, I’d download KoboldCPP and one of TheBloke’s GGUF models from HuggingFace.
Then you just launch KoboldCPP, select the .gguf file, pick your GPU, set the number of layers to offload, set the context size (4096 for those models), and hit launch.
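If you’d rather skip the GUI, the same settings can be passed as command-line flags. Here’s a rough sketch launching it from Python’s subprocess (flag names should match what KoboldCPP’s --help shows, but double-check against your version; the model filename is just a placeholder):

```python
import subprocess

# Launch KoboldCPP headlessly with the same settings as the GUI.
# Flag names per KoboldCPP's --help; the model filename is a placeholder.
subprocess.run([
    "koboldcpp.exe",                      # or: python koboldcpp.py
    "--model", "your-model.Q5_K_M.gguf",  # the GGUF file you downloaded
    "--usecublas",                        # NVIDIA GPU backend
    "--gpulayers", "35",                  # number of layers to offload to VRAM
    "--contextsize", "4096",              # context size for these models
])
```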
Then you’re good to start messing around. You can use the Kobold interface that pops up, or go through the API with something like SillyTavern.
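The same API SillyTavern talks to is easy to hit directly too. A minimal sketch against KoboldCPP’s Kobold-compatible generate endpoint, assuming the default port 5001 (check the console output for the actual address):

```python
import requests

# POST a prompt to KoboldCPP's generate endpoint and print the completion.
# Endpoint and payload follow the KoboldAI API; defaults are assumed here.
resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={
        "prompt": "Once upon a time,",
        "max_length": 80,       # tokens to generate
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["results"][0]["text"])
```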
Well… none at all if you’re happy with ~1 token per second or less using GGUF CPU inference.
I have a single 3090 (24GB) and get about 2 tokens per second with partial offload. That’s usable for most stuff in my experience, but many people find it too slow.
You’d need 2 x 3090s or an A6000 or similar to run it quickly.
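For intuition on why partial offload lands around that speed, here’s a back-of-envelope estimate (all the numbers are assumptions; actual sizes depend on the quant and architecture):

```python
# Rough estimate of how many layers fit in VRAM for partial offload.
# All numbers are assumptions: roughly a 70B model at ~Q4 quantization.
model_size_gb = 40        # approximate size of the GGUF in memory
n_layers = 80             # layer count for a model of that class
vram_budget_gb = 24 - 2   # one 3090, minus headroom for context/KV cache

gb_per_layer = model_size_gb / n_layers
offload_layers = int(vram_budget_gb / gb_per_layer)
print(f"~{gb_per_layer:.2f} GB/layer -> offload about {offload_layers} layers")
# The remaining ~half of the layers run on CPU, which is what drags
# generation down to a couple of tokens per second.
```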