@lone_striker

lone_striker@alien.top · 1 year ago

I’m not sure how this would apply in the other scenarios you’ve mentioned, though anything is possible. There may be other uses for this novel decoding method, but being "X percent faster than transformers" in a practically useful way isn’t one of them.

lone_striker@alien.top · 1 year ago

It’s an innovative approach, but the practical real-world use cases where it is beneficial are very narrow:
https://twitter.com/joao_gante/status/1727985956404465959

TL;DR: you need massive spare compute to get a modest gain in speed, and in most cases you get slower inference. They are also comparing against relatively slow native transformers inference. The speedups that ExLlamaV2, GPTQ, and llama.cpp get over base transformers are much more impressive.

lone_striker@alien.top · 1 year ago

For the 2.4bpw and 2.6bpw exl2 models, you have to change a setting in ooba to get them to generate coherent text. Disable this setting:

Add the bos_token to the beginning of prompts

https://preview.redd.it/4v8m7ciu0y1c1.png?width=356&format=png&auto=webp&s=785837b8466a3bcda3e49477424b7c377a8d542f

The very low bpw models need the above setting disabled, and you also need to be stricter with the prompt format. The higher bpw models are more flexible and can deal with prompt formats they were not specifically tuned for.

I would also set the VRAM allocation for the 2.4bpw model to use only a single GPU. Spreading it over two GPUs is not needed and will slow it down. The main reason I generate 2.4bpw (and 2.6bpw) versions is to allow people with only a single 3090 or 4090 to run 70B models at full speed, though obviously quality will be lower than with the higher-bit models. For 2.6bpw to fit on a single 24 GB VRAM GPU, you will need to enable the cache_8bit option.
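
For a rough feel of why 2.4bpw fits comfortably while 2.6bpw needs cache_8bit, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: a nominal 70e9 parameters, a Llama-2-70B-style layout (80 layers, 8 KV heads, head dimension 128), a 4096-token context, and the function names themselves. It ignores activations, CUDA context, and other loader overhead, so real usage will be higher.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bpw):
    """Approximate size of the quantized weights: parameters * bits per weight / 8."""
    return n_params * bpw / 8


def kv_cache_bytes(n_tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size: one key and one value vector per layer per token.

    The defaults assume a Llama-2-70B-style architecture (an assumption, not
    something stated in the post). bytes_per_elem=2 is FP16, 1 is an 8-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def estimate_gib(bpw, n_tokens=4096, cache_bytes_per_elem=2, n_params=70e9):
    """Weights plus KV cache, in GiB. Overheads are deliberately left out."""
    total = weight_bytes(n_params, bpw) + kv_cache_bytes(
        n_tokens, bytes_per_elem=cache_bytes_per_elem
    )
    return total / GIB


if __name__ == "__main__":
    for bpw in (2.4, 2.6):
        for label, width in (("FP16 cache", 2), ("8-bit cache", 1)):
            gib = estimate_gib(bpw, cache_bytes_per_elem=width)
            print(f"{bpw}bpw, {label}: ~{gib:.1f} GiB before overhead")
```

Under these assumptions the weights alone come to roughly 19.6 GiB at 2.4bpw and 21.2 GiB at 2.6bpw, and halving the cache element size halves the cache. That is only a small saving in absolute terms, but at 2.6bpw there is little headroom left on a 24 GB card, so it can be the difference between fitting and not fitting.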