  • From a hardware perspective, a computer today has 3 types of storage: the internal drive, RAM, and VRAM. As you probably know, the difference is that RAM and VRAM only hold what running applications need. What matters here is that, because both serve running apps, they are built to be fast.

    Their hardware is designed for speed. In terms of speed: internal drive < RAM < VRAM. So where the AI is stored affects how fast it runs.

    How fast, you wonder? Well, on my machine, running an LLM from VRAM (on the GPU) gives me 30 it/s. On my CPU, it's 1 it/s. The GPU is literally 30x faster, which makes sense.

    Did you follow? Now the interesting part: today you can run an AI with its weights loaded in VRAM, in RAM, or even partly on the internal drive. When you load an AI (be it an LLM or Stable Diffusion), you can usually turn on memory-saving options. In the Stable Diffusion web UI, for example, they are called medvram and lowvram. Medvram keeps only part of the model in VRAM at a time and parks the rest in RAM; lowvram does the same in smaller pieces, so it needs even less VRAM. Some LLM loaders can additionally offload part of the model to the internal drive. As a result, medvram makes things slower BUT lets you run an AI you couldn't run otherwise; similarly, lowvram makes things even slower BUT gets it working with very little VRAM, and offloading to the drive is the slowest option of all.

    So don't worry about whether it uses the VRAM, the RAM, or even the internal drive. The only two questions are: can you store the AI's weights, and if you can, does it run fast enough? For both questions, the answer is easy to get: just try.

    If you get an Out of Memory error, it means the AI is too big for the current settings. In that case, try again with the optimization parameters. If it still doesn't work even with lowvram, then it's too big ^^'.
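    The "just try, then fall back" routine above can be sketched in a few lines. This is only an illustration: `first_setting_that_fits` and `fake_load` are made-up names, the loader is a stand-in that pretends only "lowvram" fits, and real tools raise their own out-of-memory exception types rather than Python's built-in `MemoryError`.

```python
from typing import Callable, Optional, Sequence

def first_setting_that_fits(load: Callable[[str], object],
                            settings: Sequence[str]) -> Optional[str]:
    """Try each setting in order; return the first one that loads, else None."""
    for setting in settings:
        try:
            load(setting)
            return setting
        except MemoryError:
            # Assumption: the loader signals "too big" with MemoryError.
            print(f"{setting}: out of memory, trying the next one")
    return None

# Hypothetical stand-in loader for the demo: pretends only "lowvram" fits.
def fake_load(setting: str) -> object:
    if setting != "lowvram":
        raise MemoryError(setting)
    return object()

print(first_setting_that_fits(fake_load, ["default", "medvram", "lowvram"]))
```

    If the function returns None, you are in the "it's too big" case.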

    Now, before you try models at random among the thousands available today: look at the size of a model, which is usually given in its name. For example, Synthia 7B models have 7 billion parameters. The parameters (or weights) are what is stored on the drive and loaded into memory. So the bigger the number, the bigger the memory requirement.
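    To turn the number in the name into a rough memory figure, multiply the parameter count by the bytes used per weight. Here is a back-of-the-envelope sketch; the function name and the table of precisions are my own, and it only counts the weights, so real usage will be somewhat higher.

```python
# Rough size of a model's weights: parameter count x bytes per parameter.
# Illustrative only: actual memory use is higher (context, activations, overhead).

BYTES_PER_PARAM = {
    "fp32": 4.0,  # 32-bit floats
    "fp16": 2.0,  # 16-bit floats
    "int8": 1.0,  # 8 bits per weight
    "int4": 0.5,  # 4 bits per weight
}

def weights_size_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for name, billions in [("7B", 7), ("13B", 13)]:
    for precision in ("fp16", "int4"):
        size = weights_size_gb(billions, precision)
        print(f"{name} at {precision}: ~{size:.1f} GB")
```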

    Another thing to look at (and that's where things get even more confusing, lol): there are base models, and then there are quantized versions of those models. A quantized model stores its weights at lower precision (fewer bits per weight), so it needs much less memory and usually runs faster, with similar capabilities. The quantized models are what you should look for, especially the GPTQ versions.
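    To get a feel for what quantizing does, here is a toy round-trip: squeeze floats into 4-bit integers plus one shared scale, then expand them again. To be clear, this is plain round-to-nearest on random numbers, not GPTQ (which is considerably smarter about limiting the error), and all the names are made up for the example.

```python
import random

def quantize(weights, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits
    # One shared scale for the whole list; fall back to 1.0 if all weights are 0.
    scale = max(abs(w) for w in weights) / levels or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    """Turn the small integers back into approximate floats."""
    return [c * scale for c in codes]

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(1000)]  # stand-in "weights"
codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)

worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max rounding error: {worst:.3f} (bounded by scale / 2 = {scale / 2:.3f})")
print("storage: 4 bits per weight instead of 16, so about 4x smaller")
```

    Each weight loses a little precision, and in exchange the whole thing takes a fraction of the memory.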

    With 6GB VRAM, I can run a 7B model. I guess you can try 13B with 12GB. But since you have 64GB RAM, you can even try a bigger model and have it run from the RAM instead of the VRAM! See if it's not too slow for you. If it is, come back, I have another secret to share.
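    If you want to guess before you try, here is a quick sanity check for a 12GB VRAM / 64GB RAM machine. Everything in it is an assumption on my part: the weight sizes assume 4-bit quantized models (about half a byte per parameter), and the 1.2 headroom factor is a rough allowance for whatever is needed besides the weights.

```python
def pick_memory(weights_gb: float, vram_gb: float, ram_gb: float,
                headroom: float = 1.2) -> str:
    """Return where the weights would plausibly fit: 'VRAM', 'RAM' or 'too big'.

    headroom is an assumed fudge factor for everything besides the weights.
    """
    needed = weights_gb * headroom
    if needed <= vram_gb:
        return "VRAM"  # fastest option
    if needed <= ram_gb:
        return "RAM"   # works, but slower
    return "too big"

# Assumed sizes for 4-bit quantized models (about 0.5 bytes per parameter).
for name, size_gb in [("7B", 3.5), ("13B", 6.5), ("70B", 35.0)]:
    print(name, "->", pick_memory(size_gb, vram_gb=12, ram_gb=64))
```

    It's only a guess, so the real test is still to load the model and see.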

    As for Macs, somebody correct me if I'm wrong, but I think they have unified memory: a single pool that serves as both the RAM and the VRAM.