Can I run Llama 2 13B locally on my GTX 1070? I read somewhere that the minimum suggested VRAM is 10 GB, but since the 1070 has 8 GB, would it just run a little slower? Or could I use some quantization, with bitsandbytes for example, to make it fit and run more smoothly?

Edit: also how much storage will the model take up?
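
For reference, the bitsandbytes route mentioned in the question would look roughly like the sketch below. It assumes access to the gated meta-llama/Llama-2-13b-chat-hf weights and the transformers, accelerate, and bitsandbytes packages; 4-bit support on a Pascal card like the 1070 may be slow or limited, so treat this as a starting point rather than a recommendation.

```python
# Rough sketch: loading Llama-2 13B in 4-bit (NF4) with bitsandbytes.
# Assumes the gated meta-llama/Llama-2-13b-chat-hf weights are available;
# Pascal GPUs such as the GTX 1070 may handle this slowly or not at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"

# 4-bit NF4 brings the 13B weights down to roughly 7 GB, which is still tight
# on an 8 GB card once the CUDA context and KV cache are counted.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate spill layers to CPU RAM if VRAM runs out
)

prompt = "How much VRAM does a 13B model need?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```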

  • frontenbrecher@alien.topB · 1 year ago

    Use koboldcpp to split the model between GPU and CPU with the GGUF format, preferably a Q4_K_S quantization for better speed. It will still be slow, though; expect maybe 1-2 tokens per second.
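
    The same GPU/CPU split can also be scripted with llama-cpp-python if you would rather not use the koboldcpp launcher; a minimal sketch, where the GGUF file name and layer count are placeholders to tune for an 8 GB card:

    ```python
    # Minimal sketch of partial GPU offload with llama-cpp-python; koboldcpp
    # exposes the same idea through its GPU layers setting. The model path and
    # n_gpu_layers value below are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-2-13b-chat.Q4_K_S.gguf",  # Q4_K_S 13B is roughly 7-8 GB on disk
        n_gpu_layers=25,  # layers kept in VRAM; the rest run on the CPU
        n_ctx=2048,       # context window; larger contexts cost more memory and speed
    )

    out = llm("Q: Why use a Q4_K_S quantization? A:", max_tokens=64)
    print(out["choices"][0]["text"])
    ```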

  • nhbis0n@alien.topB · 1 year ago

    I run 7B models on my 1070. ollama run llama2 gives me between 20 and 30 tokens per second on Ubuntu.
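
    If you prefer calling it from code, the official ollama Python client wraps the same local server; a small sketch, assuming ollama is already running and the llama2 model has been pulled:

    ```python
    # Sketch using the ollama Python client (pip install ollama); assumes the
    # local ollama server is running and `ollama pull llama2` was done already.
    import ollama

    resp = ollama.generate(model="llama2", prompt="Why is the sky blue?")
    print(resp["response"])  # generated text from the local 7B model
    ```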

  • Pusteblumenschnee@alien.topB · 1 year ago

    I have a GTX 1080 with 8 GB VRAM and 16 GB RAM. I can run 13B Q6_K GGUF models locally if I split them between CPU and GPU (20/41 layers on GPU with koboldcpp / llama.cpp; see the rough layer-budget sketch below). Compared to models that run entirely on the GPU (like Mistral), it gets very slow as soon as the context grows a little: a response can take a minute or more.

    You might want to consider running a Mistral fine-tune instead.
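
    For a rough sense of why only about half the layers fit, here is a back-of-the-envelope sketch; the sizes are approximate and vary a little between GGUF builds:

    ```python
    # Back-of-the-envelope estimate of how many Q6_K layers fit on an 8 GB card.
    # All numbers are approximate and differ slightly between GGUF builds.
    params = 13e9            # Llama-2 13B parameter count
    bits_per_weight = 6.56   # Q6_K is about 6.56 bits per weight
    model_gb = params * bits_per_weight / 8 / 1e9   # ~10.7 GB for the whole model
    n_layers = 41            # offloadable layers llama.cpp reports for 13B (the 41 above)
    per_layer_gb = model_gb / n_layers

    vram_budget_gb = 8.0 - 2.0   # leave headroom for the CUDA context and KV cache
    print(f"model ≈ {model_gb:.1f} GB, per layer ≈ {per_layer_gb:.2f} GB")
    print(f"layers that fit ≈ {int(vram_budget_gb / per_layer_gb)}")
    # ≈ 23 layers; dropping to 20/41 leaves extra room for a growing KV cache
    # and whatever else the desktop keeps in VRAM.
    ```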