Hi. I’m using Llama-2 for my project in Python with the transformers library. There is an option to enable quantization when loading any normal model:
from transformers import AutoModelForCausalLM

# quantize the fp16 weights to 4-bit at load time (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_4bit=True,
)
If it’s just a matter of a single flag, and nothing is recomputed, why are there so many already-quantized models on the Hub? Are they better than adding this one line?
Most quantized models on the Hub are quantized with GPTQ, AWQ, or similar techniques. Unlike load_in_4bit, these methods do recompute the weights: they run an offline calibration pass over sample data to minimize quantization error, and their kernels are optimized for inference, so they are generally faster at generation time. load_in_4bit uses the bitsandbytes library to quantize the original fp16 weights on the fly as the model is loaded; it is most useful for training LoRAs (QLoRA) on a limited amount of VRAM.
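To make the difference concrete, here is a sketch of the two loading paths. The GPTQ repo name below (TheBloke/Llama-2-13B-chat-GPTQ) is only an example of a pre-quantized upload, the bitsandbytes settings are one common configuration rather than the only one, and loading GPTQ weights assumes you have optimum and auto-gptq installed.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Path 1: pre-quantized GPTQ weights from the Hub. The calibration and
# quantization were already done offline by the uploader, so you just load the repo.
gptq_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    device_map="auto",
)

# Path 2: on-the-fly 4-bit quantization with bitsandbytes (what load_in_4bit does
# under the hood). The full fp16 checkpoint is downloaded and quantized at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 is the QLoRA default
    bnb_4bit_compute_dtype=torch.float16,
)
bnb_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

If you only need inference, a pre-quantized GPTQ/AWQ repo also saves you downloading the full fp16 checkpoint; if you plan to fine-tune with LoRA, the bitsandbytes path is the standard QLoRA setup.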