I’m using an A100 PCIe 80GB, CUDA 11.8 toolkit, driver 525.x.

But when I run inference with CodeLlama 13B in oobabooga (web UI),

it only gets about 5 tokens/s.

It is so slow.

Is there any config or anything else I need to set for the A100?

  • opi098514@alien.topB · 1 year ago

    Sounds like you might be using the standard Transformers loader. Try ExLlama or ExLlamaV2.

  • uti24@alien.topB · 1 year ago

    Sounds like you’re running it on the CPU. If you’re using oobabooga you have to explicitly set how many layers you offload to the GPU; by default everything runs on the CPU (at least for GGUF models).
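
    A minimal sketch of the same idea outside the UI with llama-cpp-python, assuming a local CodeLlama 13B Q4 GGUF file (the path below is hypothetical):

    ```python
    from llama_cpp import Llama

    # n_gpu_layers=-1 offloads every layer to the GPU; the default (0) keeps
    # everything on the CPU, which is what produces single-digit tokens/s.
    llm = Llama(
        model_path="./codellama-13b.Q4_K_M.gguf",  # hypothetical local file
        n_gpu_layers=-1,
        n_ctx=4096,
    )

    out = llm("def quicksort(arr):", max_tokens=128)
    print(out["choices"][0]["text"])
    ```

    In the web UI the equivalent is the GPU-layers setting for the llama.cpp loader; leaving it at 0 keeps the whole model on the CPU.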

  • hudimudi@alien.topB · 1 year ago

    Uhmmm, where did you buy that A100? Was it a good deal? lol. Just kidding, you probably set something up wrong or the drivers are messing up. Is the card working fine otherwise in benchmarks?

  • a_beautiful_rhind@alien.topB · 1 year ago

    Something is wrong with your environment. Even P40s give more than that.

    The other possibility is that you’re not generating enough tokens to get a proper tokens/s reading. What was the total inference time?
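
    For a clean number, here is a rough sketch that times only the generation step with plain Hugging Face transformers in fp16 (the model id and prompt are just illustrative assumptions):

    ```python
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "codellama/CodeLlama-13b-hf"  # assumed HF repo; use a local path if you prefer
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda:0")

    inputs = tok("def fibonacci(n):", return_tensors="pt").to("cuda:0")
    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
    ```

    A 13B fp16 model is roughly 26 GB, so it fits entirely in 80 GB of VRAM; an A100 should land well above 5 tokens/s here.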

  • easyllaama@alien.topB · 1 year ago

    Try using GGUF; this format works well on a single GPU, especially since you have 80GB of VRAM. I think you could even run a 70B GGUF with all layers on the GPU.

  • nuvalab@alien.topB · 1 year ago

    That sounds like CPU speed. What do you see from `watch -d -n 0.1 nvidia-smi` while you’re running inference?
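
    If you’d rather log it than watch it, here is a small sketch using the pynvml bindings (assuming the pynvml package is installed) that polls utilization and memory while a generation runs:

    ```python
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    # Run this in a second terminal while the web UI is generating.
    for _ in range(60):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU util {util.gpu:3d}% | VRAM used {mem.used / 1024**3:5.1f} GiB")
        time.sleep(0.5)

    pynvml.nvmlShutdown()
    ```

    Near-zero utilization and only a few hundred MB of VRAM in use during generation would mean the model is sitting on the CPU.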

  • henk717@alien.topB · 1 year ago

    Tried a 13B model with KoboldCpp on one of the RunPod A100s; both its Q4 and FP16 speeds clocked in at around 20 T/s at 4K context, topping out at 60 T/s for smaller generations.