We’re happy to release GPT-Fast, a fast and hackable implementation of transformer inference in <1000 lines of native PyTorch, with support for quantization, speculative decoding, tensor parallelism, both Nvidia and AMD GPUs, and more!

Check out the blog post describing the techniques here: https://pytorch.org/blog/accelerating-generative-ai-2/

And check out the code here: https://github.com/pytorch-labs/gpt-fast

To be clear, this is intended more as a minimal “tutorial” on how to get really good inference performance than as a library. Hopefully y’all find it useful!
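For anyone curious what the speculative decoding piece boils down to, here’s a stripped-down greedy sketch of the idea (not the exact code in the repo: it assumes `target` and `draft` are callables mapping token ids `[batch, seq]` to logits `[batch, seq, vocab]`, and it omits the KV caching, sampling, and compilation that make the real thing fast):

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    """One speculative decoding step with greedy verification.

    `target` and `draft` are assumed to be callables mapping token ids
    [batch, seq] -> logits [batch, seq, vocab] (a simplification, not
    the gpt-fast API). Returns `ids` extended by the accepted draft
    tokens plus one token from the target model.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = ids
    for _ in range(k):
        next_tok = draft(proposal)[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, next_tok], dim=-1)

    # 2. A single target forward pass scores every proposed position at
    #    once: k verification positions plus one "bonus" position.
    target_tokens = target(proposal)[:, -k - 1 :].argmax(-1)  # [B, k+1]
    draft_tokens = proposal[:, -k:]                           # [B, k]

    # 3. Greedy acceptance rule: keep the longest prefix where draft and
    #    target agree, then append the target's token at the first
    #    disagreement (or the bonus token if everything matched).
    agree = (draft_tokens == target_tokens[:, :k]).long().cumprod(-1)
    n_accept = int(agree.sum(-1).min())  # batch-min keeps shapes rectangular
    return torch.cat([ids, target_tokens[:, : n_accept + 1]], dim=-1)
```

The win comes from step 2: verifying k draft tokens costs one target forward pass instead of k, so as long as the draft model’s guesses are usually accepted, you decode several tokens per big-model call.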

Happy to answer any questions.

  • llama_in_sunglasses@alien.top · 11 months ago

    Were you involved? I think this has a pretty good chance of winding up as a library. HF transformers is a legit overwrought mess, and having scanned through most of the code just to take a look inside, that’s an impressively low line count for something that looks like it can load all of the Llama family members.