Macs with 32GB of memory can run 70B models with the GPU.

fallingdowndizzyvr@alien.top · 2 years ago

Macs with 32GB of memory can run 70B models with the GPU.

pulse77@alien.top · 2 years ago

You may try and run one of Q4 models without problems: because llama.cpp uses mmap to map files into memory, you can go above available RAM and because many models are sparse it will not use all mapped pages and even if it needs it, it will swap it out with other pages on demand… I was able to run falcon-180b-chat.Q6_K which uses about 141GB on a 128GB Windows PC with less than 1% SSD reads during inference… I could even run falcon-180b-chat.Q8 which uses about 182GB but in this case SSD was working heavily during inference and it was unbearably slow (0.01 tokens/second)…

fallingdowndizzyvr@alien.top · 2 years ago

Yes. I’ve done that before on my other machines. Llama.cpp in fact defaults to that. The hope for me was that since the models are sparse that the OS would cache the relevant parts of the models in RAM. So the first run through would be slow but subsequent runs would be fast since those pages are cached in RAM. How well that works or not really depends on how much RAM the OS is willing to use to cache mmap and how smartly it does it. My hope was that if it did it smarty with sparse data that it would be pretty fast. So far, my hopes haven’t been realized.