M1/M2/M3: increase VRAM allocation with `sudo sysctl iogpu.wired_limit_mb=12345` (i.e. amount in mb to allocate)

farkinga@alien.top · 2 years ago

M1/M2/M3: increase VRAM allocation with `sudo sysctl iogpu.wired_limit_mb=12345` (i.e. amount in mb to allocate)

bebopkim1372@alien.top · 2 years ago

My M1 Max Mac Studio has 64GB of RAM. Running `sudo sysctl iogpu.wired_limit_mb=57344` worked like magic!

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/****/****/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 57344.00 MiB
ggml_metal_init: maxTransferRate               = built-in GPU

Yay!
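For anyone wondering where 57344 comes from: it is 56 GB expressed in MB (56 × 1024), i.e. the 64GB total minus 8GB left for the system. A minimal shell sketch of that arithmetic follows. The helper name `wired_limit_mb` is hypothetical, and the 8GB reserve is only the value implied by the number above, not a tested recommendation.

```shell
# Hypothetical helper: convert "total RAM minus a reserve" (both in GB) to MB.
wired_limit_mb() {
  total_gb=$1
  reserve_gb=$2
  echo $(( (total_gb - reserve_gb) * 1024 ))
}

# 64 GB machine, 8 GB held back for macOS (assumed reserve)
wired_limit_mb 64 8

# To apply the result on Sonoma (requires sudo):
#   sudo sysctl iogpu.wired_limit_mb="$(wired_limit_mb 64 8)"
```

Running `sysctl iogpu.wired_limit_mb` with no `=value` prints the current setting, which is a cheap way to confirm the change took effect.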

farkinga@alien.top · 2 years ago

Yeah! That’s what I’m talking about. Would you happen to remember what it was reporting before? If it’s like the rest, I’m assuming it said something like 40 or 45GB, right?

CheatCodesOfLife@alien.top · 2 years ago

64GB M1 Max here. Before running the command, if I tried to load up goliath-120b: (47536.00 / 49152.00) - fails

And after `sudo sysctl iogpu.wired_limit_mb=57344`: (47536.00 / 57344.00)

So I guess the default is: 49152

fallingdowndizzyvr@alien.top · 2 years ago

> So I guess the default is: 49152

It is. To be more clear, llama.cpp tells you what the recommendedMaxWorkingSetSize is, which should match that number.

bebopkim1372@alien.top · 2 years ago

Maybe 47536MB is the net model size. For LLM inference, memory for context and optional context cache memory are also needed.
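A rough sketch of that point: the model size alone can sit under the limit while model plus context memory does not. The helper name `fits_in_limit` and the 4096 MB context figure are made up for illustration; the real overhead depends on context length and settings.

```shell
# Hypothetical check: does model size plus extra (context/cache) memory
# fit under the GPU wired limit? All values are in MB.
fits_in_limit() {
  model_mb=$1
  extra_mb=$2
  limit_mb=$3
  if [ $(( model_mb + extra_mb )) -le "$limit_mb" ]; then
    echo "fits"
  else
    echo "too big"
  fi
}

# 47536 MB model, assumed 4096 MB for context
fits_in_limit 47536 4096 49152   # default limit on a 64GB machine
fits_in_limit 47536 4096 57344   # raised limit
```

With these placeholder numbers the first call reports "too big" and the second "fits", which matches the before/after behaviour reported above.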

bebopkim1372@alien.top · 2 years ago

It was 48GB and now I can use 12GB more!

FlishFlashman@alien.top · 2 years ago

Machines with ≥64GB allow 75% to be used by the GPU. With ≤32GB it’s ~66%. Not sure about the 36GB machines.
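Those percentages can be turned into a quick estimate of the default limit. This is only a sketch of the rule of thumb as stated in this thread, not something taken from Apple documentation; the function name `default_gpu_limit_mb` is hypothetical, and sizes between 32GB and 64GB are deliberately left as unknown.

```shell
# Sketch of the default GPU share reported in this thread (not from Apple docs).
# Argument: total RAM in GB. Output: estimated default limit in MB.
default_gpu_limit_mb() {
  total_mb=$(( $1 * 1024 ))
  if [ "$1" -ge 64 ]; then
    echo $(( total_mb * 3 / 4 ))   # 75% for 64GB and up
  elif [ "$1" -le 32 ]; then
    echo $(( total_mb * 2 / 3 ))   # roughly 66% for 32GB and below
  else
    echo "unknown"                 # e.g. 36GB machines: not established here
  fi
}

default_gpu_limit_mb 64   # compare with the 49152 default reported above
default_gpu_limit_mb 36
```

For 64GB this gives 49152, the same number llama.cpp reported as the default earlier in the thread.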

Zugzwang_CYOA@alien.top · 2 years ago

How is the prompt processing time on a mac? If I were to work with a prompt that is 8k in size for RP, with big frequent changes in the prompt, would it be able to read my ever-changing prompt in a timely manner and respond?

I would like to use SillyTavern as my front end, and that can result in big prompt changes between replies.

bebopkim1372@alien.top · 2 years ago

On M1, prompt evaluation uses BLAS operations and the speed is terrible. I also have a PC with a 4060 Ti 16GB, and cuBLAS is the speed of light compared with BLAS on my M1 Max. BLAS speeds for models under 30B are acceptable, but above 30B it is really slow.

Zugzwang_CYOA@alien.top · 2 years ago

Good to know. It sounds like Macs are great at asking simple questions of powerful LLMs, but not so great at roleplaying with large context stories. I had hoped that an M2 Max would be viable for RP at 70b or 120b, but I guess not.

Jelegend@alien.top · 2 years ago

I am getting the following error when running this command on a Mac Studio M2 Max with 64GB RAM:

sysctl: unknown oid 'iogpu.wired_limit_mb'

Can someone help me out on what to do here?

bebopkim1372@alien.top · 2 years ago

Do you use macOS Sonoma? Mine is Sonoma 14.1.1 - Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000 arm64.

CheatCodesOfLife@alien.top · 2 years ago

That totally worked. I can run goliath 120b on my m1 max laptop now. Thanks a lot.

Zestyclose_Yak_3174@alien.top · 2 years ago

Which quant did you use and how was your experience?

CheatCodesOfLife@alien.top · 2 years ago

46G goliath-120b.Q2_K

So the smallest one I found (I didn’t quantize this one myself, found it on HF somewhere)

And it was very slow: about 13t/s prompt_eval and then 2.5t/s generating text, so only really useful for me when I need to run it on my laptop (I get like 15t/s with a 120b model on my 2x3090 rig at 3bpw exl2).
As for the model itself, I like it a lot and use it frequently.

TBH, this ram thing is more helpful for me because it lets me run Q5 70b models instead of just Q4 now.

Agusx1211@alien.top · 2 years ago

fallingdowndizzyvr@alien.top · 2 years ago

As per the latest developments in that discussion, `iogpu.wired_limit_mb` only works on Sonoma. So if you are on an older version of macOS, try `debug.iogpu.wired_limit` instead.
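A small sketch of picking the key by OS version. The helper name `wired_limit_key` is hypothetical. Sonoma is macOS 14; treating anything newer the same way is an assumption, and this thread does not say what units the older key expects, so check that before setting it.

```shell
# Hypothetical helper: choose the sysctl key from the macOS major version.
wired_limit_key() {
  if [ "$1" -ge 14 ]; then
    echo "iogpu.wired_limit_mb"      # Sonoma (14); later versions assumed
  else
    echo "debug.iogpu.wired_limit"   # older releases, per this thread
  fi
}

wired_limit_key 14
wired_limit_key 13

# On a Mac, the major version could be read from sw_vers:
#   wired_limit_key "$(sw_vers -productVersion | cut -d. -f1)"
```

If the key does not exist on your system, sysctl answers with the "unknown oid" error shown earlier, so nothing gets changed by trying the wrong one.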
