Can I run Llama 2 13B locally on my GTX 1070? I read somewhere that the minimum suggested VRAM is 10 GB, but since the 1070 has 8 GB, would it just run a little slower? Or could I use quantization, with bitsandbytes for example, to make it fit and run more smoothly?
Edit: Also, how much storage will the model take up?
I have a GTX 1080 with 8 GB VRAM and 16 GB RAM. I can run 13B Q6_K.gguf models locally if I split them between CPU and GPU (20 of 41 layers on the GPU with koboldcpp / llama.cpp). Compared to models that run completely on the GPU (like Mistral), it's very slow as soon as the context gets a little larger: a response might take a minute or more.
You might want to consider running a Mistral fine-tune instead.
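On the storage question, here is a rough back-of-the-envelope sketch rather than an exact answer. It assumes the file size is roughly parameters × bits per weight, ignores metadata and the KV cache, and treats all layers as equal in size (they are not quite). The bits-per-weight figures are approximations I'm assuming: about 6.56 for Q6_K, 16 for fp16, and a nominal 4 for 4-bit quantization such as bitsandbytes. The function names are made up for illustration.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring file metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def gpu_share_gb(model_gb: float, gpu_layers: int, total_layers: int) -> float:
    """Rough VRAM used by offloaded layers, assuming equal-sized layers."""
    return model_gb * gpu_layers / total_layers


if __name__ == "__main__":
    n_params = 13e9  # nominal parameter count for a 13B model

    # Assumed approximate bits per weight for each format.
    for name, bpw in [("fp16", 16.0), ("Q6_K", 6.5625), ("4-bit", 4.0)]:
        print(f"{name}: ~{model_size_gb(n_params, bpw):.1f} GB")

    # The 20-of-41-layer split mentioned above, with a Q6_K model.
    q6 = model_size_gb(n_params, 6.5625)
    print(f"20/41 layers on GPU: ~{gpu_share_gb(q6, 20, 41):.1f} GB of weights in VRAM")
```

By this estimate a 13B Q6_K file is a bit under 11 GB, so it cannot fit entirely in 8 GB of VRAM, which is why the layers have to be split. The context (KV cache) needs additional VRAM on top of the weights, so the real headroom is smaller than these numbers suggest.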