GPT-Fast: A fast and hackable implementation of transformer inference in <1000 lines of native PyTorch with support for quantization, speculative decoding, TP, Nvidia/AMD support, and more!

programmerChilli@alien.top · 11 months ago

GPT-Fast: A fast and hackable implementation of transformer inference in <1000 lines of native PyTorch with support for quantization, speculative decoding, TP, Nvidia/AMD support, and more!

llama_in_sunglasses@alien.top · 11 months ago

Were you involved? I think this has a pretty good chance of winding up as a library. HF transformers is a legit overwrought mess, and having scanned through most of this code just to take a look inside, that’s an impressively low line count for something that looks like it can load all of the llama family members.
