Hi. I’m currently running a 3060 12GB | R7 2700X | 32GB 3200 | Windows 10 with the latest NVIDIA drivers (VRAM>RAM overflow disabled). Loading a 20B Q4_K_M model (offloading 50/65 layers seems to be the fastest from my tests), I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.

Are these values expected for my setup? Or is there something I can do to improve speeds without changing the model?

It’s pretty much unusable in this state, and since it’s hard to find information on this topic, I figured I would ask here.

EDIT: I’m running the model on the latest version of text-generation-webui.
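For what it's worth, the 50/65 split lines up with rough arithmetic. A minimal sketch, assuming a 20B Q4_K_M GGUF is roughly 12 GB on disk and reserving an assumed ~2.5 GB for KV cache and CUDA/driver overhead (both figures are illustrative guesses, not measurements):

```python
# Back-of-envelope: how many layers of a 20B Q4_K_M fit in 12 GB of VRAM.
# Layer count (65) is from the post; model size and overhead are assumptions.

model_size_gb = 12.0   # approx. size of a 20B Q4_K_M GGUF (assumption)
n_layers = 65          # total layers reported in the post
per_layer_gb = model_size_gb / n_layers

vram_gb = 12.0         # RTX 3060 12GB
reserved_gb = 2.5      # assumed KV cache at 4k ctx + CUDA/driver overhead

layers_that_fit = int((vram_gb - reserved_gb) / per_layer_gb)
print(f"~{per_layer_gb:.2f} GB/layer, roughly {layers_that_fit} layers fit")
```

With those assumptions this lands right around 50 layers, so going higher likely spills into shared memory (or system RAM) and tanks throughput, while going lower leaves VRAM idle.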

  • -Ellary-@alien.topB · 11 months ago

    R5 5500 (stock, 3600 MHz RAM) | 3060 12GB | 32GB 3600, Win10 v2004.
    I’m using LM Studio for heavy models: 34B (Q4_K_M) and 70B (Q3_K_M) GGUF.
    On 70B I’m getting around 1-1.4 tokens/s depending on context size (4k max);
    I’m offloading 25 layers to the GPU (trying not to exceed the 11 GB VRAM mark).
    On 34B I’m getting around 2-2.5 tokens/s depending on context size (4k max);
    I’m offloading 30 layers to the GPU (trying not to exceed the 11 GB VRAM mark).
    On 20B I was getting around 4-5 tokens/s, but I’m not a huge user of 20B right now.

    So I can recommend LM Studio for models heavier than 13B; it works better for me.
    Here is a 34b YI Chat generation speed:

    https://preview.redd.it/h4d0lbm5u63c1.png?width=903&format=png&auto=webp&s=fdc161b136879d1c1de6ef065cb80f35f188e46f