I want to run a 70B LLM locally at more than 1 T/s (tokens per second). I have a 3090 with 24GB VRAM and 64GB of system RAM.

What I managed so far:

  • Found instructions to run a 70B entirely in VRAM using a 2.5-bit (2.5 bpw) quantization. It ran fast, but the perplexity was unbearable: the LLM was barely coherent.
  • Somehow, more or less by accident, got a 70B running with some combination of RAM/VRAM offloading, but it ran at 0.1 T/s.
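
For context, here is a rough back-of-the-envelope sketch of the memory arithmetic behind those two attempts. It is only an estimate under stated assumptions: the 80-layer count assumes a Llama-2-70B-style architecture, the 20 GiB VRAM budget is a guess that leaves headroom on a 24GB card for context, the 4.5 and 8.0 bits-per-weight values are illustrative, and the function names are made up for this example. It ignores KV cache and runtime overhead and treats all layers as equal in size.

```python
def weights_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def layers_on_gpu(model_gib: float, n_layers: int, vram_budget_gib: float) -> int:
    """How many layers fit in the VRAM budget, assuming equally sized layers."""
    per_layer_gib = model_gib / n_layers
    return min(n_layers, int(vram_budget_gib // per_layer_gib))


if __name__ == "__main__":
    N_PARAMS_B = 70     # model size from the post
    N_LAYERS = 80       # assumption: Llama-2-70B-style architecture
    VRAM_BUDGET = 20.0  # assumption: 24GB card minus headroom for context

    # 2.5 bpw is the case from the post; 4.5 and 8.0 are illustrative.
    for bpw in (2.5, 4.5, 8.0):
        size = weights_gib(N_PARAMS_B, bpw)
        n_gpu = layers_on_gpu(size, N_LAYERS, VRAM_BUDGET)
        print(f"{bpw} bpw: ~{size:.1f} GiB of weights, ~{n_gpu}/{N_LAYERS} layers on GPU")
```

The point of the estimate: only a very aggressive quantization comes close to fitting 70B of weights into 24GB, while higher-quality quantizations have to keep a large share of the layers in system RAM, and that share is what the offloading setting controls.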

I saw people claiming reasonable T/s speeds. Since I am a newbie, I can barely speak the domain language, and most instructions I found assume implicit knowledge I don’t have.

I need explicit instructions on exactly which 70B model to download, which model loader to use, and how to set the parameters that matter for this setup.

  • silenceimpaired@alien.topB
    10 months ago

    I could never get up and running on Linux with Nvidia. I used Kobold on Windows, but boy is it painful on Linux.

    • TuuNo_@alien.topB
      10 months ago

      Well, I have never used Linux before, since the main purpose of my PC is gaming. But I have heard that running LLMs on Linux is faster overall.

    • giblesnot@alien.topB
      10 months ago

      I don’t know what you were running into, but I’m running Pop_OS 22.04 (a modified version of Ubuntu) on a 3090, and for everything I have tried, I just follow the basic install instructions on the home page and it works: Oobabooga, Automatic1111, Tortoise TTS, Whisper STT, Bark, Kobold, etc. I just follow the “run these commands” Linux instructions and everything is groovy.