From what I’ve read, Mac somehow uses system RAM and Windows uses the GPU? It doesn’t make any sense to me. Any help appreciated.
Macs with M1 / M2 / M3 etc. chips use a new system called “unified memory”, in which the onboard RAM is shared between the CPU and GPU.
On other systems, the CPU’s RAM is separate from the GPU’s VRAM.
Not true. Any PC using integrated graphics uses unified memory.
Anything smaller than 64+12-5.
Do you mean
(RAM gb + VRAM gb - 5gb) * billion parameters?
And what is the 5 for? Thanks
Just curious: any chance to run something on Intel Arc?
Yes. MLC Chat runs great with no fuss, the same as running it on NVIDIA or AMD. After that, things get more fussy. There’s ooba, fastchat and of course Intel’s own BigDL. The Arcs actually run llama.cpp too, with OpenCL and Vulkan, but it’s dog slow. Like half the speed of the CPU. Considering it happens in both OpenCL and Vulkan, there’s something about llama.cpp that isn’t friendly to the Arc architecture. Vulkan under MLC is fast.
In general, you have two options:
Running the model on your graphics card, or running it using your CPU.
On your graphics card, you put the model in your VRAM, and your graphics card does the processing.
If you use your CPU, you put the model in your normal RAM and the cpu does all the processing.
The graphics card will be faster, but graphics cards are more expensive.
You can even mix the two if you can’t fit your model in your VRAM: you put as much as possible in your VRAM and whatever is left over in RAM, where the CPU processes it. But this won’t be as fast as running it fully on your graphics card.
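To make the mixed setup concrete, here is a rough sketch of how one might estimate the split. It assumes every layer is the same size and that some VRAM has to stay free for the context and the desktop. The function name `split_layers` and the `reserve_gb` default are made up for illustration and are not taken from any particular tool.

```python
def split_layers(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers go to the GPU and how many stay on the CPU.

    Assumption: all layers are equally large, which is only roughly true.
    reserve_gb is a guess for VRAM kept free for context and display.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Whole layers only; never more than the model actually has.
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    cpu_layers = n_layers - gpu_layers
    return gpu_layers, cpu_layers


if __name__ == "__main__":
    # Hypothetical 20 GB model with 40 layers on a 12 GB card.
    gpu, cpu = split_layers(model_gb=20.0, n_layers=40, vram_gb=12.0, reserve_gb=2.0)
    print(f"GPU layers: {gpu}, CPU layers: {cpu}")
```

The more layers end up on the CPU side, the closer the overall speed gets to CPU-only speed, so treat the result as a starting point and adjust by trying it.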
Mac has a different kind of architecture where it’s a mix of the two, and can be incredibly fast, but I haven’t used it so can’t speak too much on it.
From a hardware perspective, a computer today has 3 types of storage: internal, RAM, and VRAM. As you probably know, the difference is that RAM and VRAM only store what running applications need. But what we have to understand here is that, since both are needed for running apps, they are built to be fast.
Thus, their hardware architecture has been designed for speed. In terms of speed: internal < RAM < VRAM. This will affect the speed of the AI you run depending on where it’s stored.
How fast, you wonder? Well on my machine, running an LLM in my VRAM gives me 30 it/s. On my CPU, it’s 1 it/s. The GPU is literally 30x faster, which makes sense.
Did you follow? Now the interesting part is that today you can run an AI while loading it in either the VRAM, or the RAM, or even the internal drive. When you load an AI (be it an LLM or Stable Diffusion), you can choose to activate parameters called lowvram or medvram. Medvram stores part of the AI in the RAM, and lowvram stores part of it in the internal drive. As a result, medvram makes things slow BUT allows you to run the AI if you couldn’t without it; similarly, lowvram makes things even slower BUT makes it work, by temporarily occupying your internal space.
So don’t worry about whether it uses the VRAM, the RAM or even the internal. The only two questions are: can you store an AI’s weights, and if you can, can it run fast enough? For both those questions, the answer is easy to get: just try.
If you get an Out of Memory Error, it means the AI is too big for current settings. Then try with optimization parameters. If it still doesn’t work even with lowvram, then, it’s too big ^^'.
Now before you try some models randomly among the thousands available today: look at the parameter count of a model, which is given in the name. For example, Synthia 7B models have 7 billion parameters. The parameters (or weights) are what is stored on the drive. So the bigger the number, the bigger the memory requirement.
Another thing to look at (and that’s where things get even more confusing, lol) : there are base models, then there are quantized versions of those models. A quantized model is much more performant (=faster and needs less memory) with similar capabilities. The quantized models are what you should look for, especially the GPTQ versions.
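As a rough sketch of why the parameter count and the quantization matter so much, the size of the weights alone is about the parameter count times the bits per weight. The helper `weights_gb` below is hypothetical and deliberately ignores the extra memory needed for context and runtime overhead. Real quantized files also tend to be a bit larger than the plain 4-bit figure, since they store scaling data too.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for params in (7, 13, 70):
        for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
            print(f"{params}B at {label}: ~{weights_gb(params, bits):.1f} GB")
```

By this estimate a 7B model needs about 14 GB at 16 bits but only about 3.5 GB at 4 bits, which is why quantized versions fit on small cards.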
With 6GB VRAM, I can run a 7B model. I guess you can try 13B with 12GB. But since you have 64GB RAM, you can even try a bigger model, and have it run on the RAM instead of the VRAM! See if it’s not too slow for you. If it is, come back, I have another secret to share.
As for Mac, somebody correct me if I’m wrong, but I think they have unified RAM, a system that serves as both the RAM and the VRAM.
Uu this was nice thanks. Can you tell me where do i put --lowram with the text webui repo?
> From what I’ve read mac somehow uses system ram and windows uses the gpu?
Learn reading.
SOME Macs - some very specific models - do not have a GPU in the classical sense but an on-chip GPU and super fast RAM. You could essentially say they are a graphics card with CPU functionality and only VRAM - that would come close to the technical implementation side.
They are not using “sometimes this, sometimes that”; just SOME models (M1, M2, M3 chips) have basically a GPU that also is the CPU. The negative? Not expandable.
Normal RAM is a LOT - seriously a lot - slower than the VRAM or HBM you find on high end cards. Those are not only way faster (GDDR6 now, while current computers use DDR5) but also not 64 bits wide but 384 or WAY higher (2048 I think), so their transfer speed in GB/s makes normal computers puke.
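To put numbers on that, peak bandwidth is roughly the bus width in bytes times the transfer rate. The sketch below uses illustrative figures (dual-channel DDR5-4800 as a 128-bit bus, and a 384-bit bus at 16000 MT/s per pin as a GDDR6-class example), not measurements of any specific product. The function name is made up for this example.

```python
def bandwidth_gb_s(bus_width_bits, transfer_rate_mt_s):
    """Theoretical peak bandwidth in GB/s.

    bus_width_bits: total width of the memory bus in bits.
    transfer_rate_mt_s: transfers per second per pin, in millions (MT/s).
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mt_s * 1e6 / 1e9


if __name__ == "__main__":
    system_ram = bandwidth_gb_s(128, 4800)    # dual-channel DDR5-4800
    video_ram = bandwidth_gb_s(384, 16000)    # illustrative 384-bit GDDR6-class bus
    print(f"System RAM: ~{system_ram:.1f} GB/s")
    print(f"VRAM:       ~{video_ram:.1f} GB/s")
    print(f"Ratio:      ~{video_ram / system_ram:.0f}x")
```

These are theoretical peaks, and real throughput is lower, but the gap of roughly an order of magnitude is the point.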
That, though, comes at a price. Which, as I pointed out, starts with being non flexible - no way to expand RAM, it is all soldered on. Some of the fast RAM is reserved - but essentially on an M2 Pro you get a LOT of RAM usable for LLM.
You now say you have 64GB RAM - unless you bought a crappy card for a modern computer, that means you also have RAM that is way slower than what is normal today. So, you likely are stuck with the 12GB VRAM to run fast. Models come in layers, and you can offload some to the normal RAM, but it is a LOT slower than the VRAM - so not good to use a lot of it.
I have the same specs and basically any of the 7b and 13b models run great using Ollama.
7B models will work just fine with FP4 or INT4. Similarly 13B as well with little offloading if needed.
Imo running any AI on anything other than VRAM makes it so slow it’s unusable. So, you can play around with 13b quantized models.
7b and 13b fully in vram. 7b models have 35, 13b have 43 layers iirc
70b involving ram