ae_dataviz@alien.topB to LocalLLaMA@poweruser.forum · English · 2 years ago

Quantizing 70b models to 4-bit, how much does performance degrade?

The title, pretty much.

I’m wondering whether a 70b model quantized to 4-bit would perform better than a 7b/13b/34b model at fp16. Would be great to get some insights from the community.
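For the memory side of this comparison, a rough back-of-the-envelope sketch in Python may help. It only counts the bytes needed for the weights and says nothing about output quality. The helper name `weight_gb` is made up for illustration, and it assumes exactly 4 or 16 bits per weight; real 4-bit formats store extra scale data, so actual files tend to be somewhat larger, and the KV cache and activations come on top.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Hypothetical helper: weights only, ignoring quantization
    scales, KV cache, and runtime overhead.
    """
    return params_billions * bits_per_weight / 8


# Sizes and precisions mentioned in the question above.
candidates = [
    ("70b @ 4-bit", 70, 4),
    ("34b @ fp16", 34, 16),
    ("13b @ fp16", 13, 16),
    ("7b @ fp16", 7, 16),
]

for label, params, bits in candidates:
    print(f"{label}: ~{weight_gb(params, bits):.0f} GB of weights")
```

Under these assumptions the 4-bit 70b model needs roughly half the weight memory of a 34b model at fp16, and a bit more than a 13b model at fp16.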

Dry-Vermicelli-682@alien.topB
2 years ago
So you have 2 GPUs on a single motherboard… and llama.cpp knows to use both? Does this work with AMD GPUs too?
- harrro@alien.topB
  2 years ago
  Yes, llama.cpp will automatically split the model to work across GPUs. You can also specify how much of the full model should be on each GPU.
  
  Not sure about AMD support, but for NVIDIA it’s pretty easy to do.
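
To make "how much of the model on each GPU" concrete, here is a small Python sketch of a proportional split. It is only an illustration of the idea, not llama.cpp's actual code: the function `split_layers` is hypothetical, and the 80-layer figure is an assumption for a 70b-class model. In llama.cpp itself the proportions are given with the `--tensor-split` option (for example `--tensor-split 3,1`); check `--help` on your build for the exact behavior.

```python
def split_layers(n_layers: int, proportions: list[float]) -> list[int]:
    """Divide n_layers across devices in the given proportions.

    Hypothetical helper: uses largest-remainder rounding so the
    per-device counts always add up to n_layers.
    """
    total = sum(proportions)
    exact = [n_layers * p / total for p in proportions]
    counts = [int(x) for x in exact]  # round down first
    leftover = n_layers - sum(counts)
    # Hand the remaining layers to the devices with the largest fractional part.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Assumed example: 80 layers, with GPU 0 taking three times as much as GPU 1.
print(split_layers(80, [3, 1]))
# An even split across two GPUs.
print(split_layers(80, [1, 1]))
```

A skewed split like this is mainly useful when the cards have different amounts of VRAM.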