Maybe 47536MB is just the net model size. LLM inference also needs memory for the context and, optionally, a context cache.
On the M1, prompt evaluation uses BLAS operations and the speed is terrible. I also have a PC with a 4060 Ti 16GB, and cuBLAS is lightning fast compared with BLAS on my M1 Max. BLAS speeds with models under 30B are acceptable, but beyond 30B it is really slow.
It was 48GB and now I can use 12GB more!
My M1 Max Mac Studio has 64GB of RAM. Running sudo sysctl iogpu.wired_limit_mb=57344 worked like magic:
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/****/****/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 57344.00 MiB
ggml_metal_init: maxTransferRate = built-in GPU
Yay!
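For anyone curious where that number comes from, here is a hedged sketch: 57344 MiB is 56 GiB, which leaves about 8 GiB of the 64 GiB for macOS itself. The exact split is my own choice, so adjust it to taste.

```shell
# Read the current GPU wired-memory limit (0 means the macOS default cap)
sysctl iogpu.wired_limit_mb

# Raise it to 56 GiB, expressed in MiB: 56 * 1024 = 57344
sudo sysctl iogpu.wired_limit_mb=$(( 56 * 1024 ))
```

As far as I know, the setting does not survive a reboot, so you have to re-run the command after restarting.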
Another question is about memory and context length. Does a large memory let you increase the context length with smaller models whose parameters don't fill it? I feel a big context would be useful for writing books and the like.
Of course. Long context also requires VRAM, so more VRAM is always good for LLMs and other AI work.
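To get a feel for why long context eats memory: the KV cache grows linearly with context length. A rough back-of-the-envelope sketch, assuming a Llama-2-7B-style model (32 layers, 32 KV heads, head dimension 128 — my assumptions, check your model's config) with an f16 cache:

```shell
# KV cache bytes = 2 (K and V) * n_layers * n_kv_heads * head_dim * ctx_len * 2 (f16 bytes)
n_layers=32; n_kv_heads=32; head_dim=128; ctx_len=4096
echo "$(( 2 * n_layers * n_kv_heads * head_dim * ctx_len * 2 / 1024 / 1024 )) MiB"
```

That works out to about 2 GiB at a 4096-token context, on top of the model weights, and it doubles every time you double the context length.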
I’m using an M1 Max Mac Studio with 64GB of memory, and I can use up to 48GB of it as VRAM. I don’t know how much memory your M3 Pro has, so I’m speaking from my own case: 7B models are easy, 13B and 20B models are okay, maybe 30B models are also okay, and anything bigger than 30B is tough.
One thing is certain: you must use llama.cpp or one of its variants (oobabooga with the llama.cpp loader, or koboldcpp, which is derived from llama.cpp) for Metal acceleration. llama.cpp and GGUF will be your friends. llama.cpp is the only program that properly supports Metal acceleration together with model quantization.
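For reference, building llama.cpp with Metal and running a model looks roughly like this. The model filename is a placeholder, -ngl sets how many layers are offloaded to the GPU, and -c sets the context size; treat this as a sketch, not exact instructions for your version.

```shell
# Build with Metal support (the default on Apple Silicon in recent versions)
LLAMA_METAL=1 make

# Run with all layers offloaded to the GPU; the model path is hypothetical
./main -m ./models/model-q4_K_M.gguf -ngl 99 -c 4096 -p "Hello"
```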
Using llama.cpp or its variants, I found that prompt evaluation (the BLAS matrix calculations) is very slow, especially compared with cuBLAS from the NVIDIA CUDA Toolkit. The bigger the model, the longer prompt evaluation takes.
I heard the M3 GPU design has changed quite a bit, so I guess BLAS may get faster, but I’m not sure…
Among 7B models, my recommendation for RP is https://huggingface.co/maywell/Synatra-7B-v0.3-RP. Of course, TheBloke quantized this model too.
My main computer is an M1 Max Mac Studio. It has 64GB of memory, and I can use up to 48GB of it as video memory. However, it is difficult to use because the modules, libraries, and software support are not very good. If you’re a software developer, you will have a tough time getting everything working well.
I bought a 4060 Ti 16GB two months ago, and I found it very easy to get everything running with the CUDA Toolkit. With Metal on Apple Silicon, I had quite a tough time: there were almost always minor problems, and sometimes no solution at all. With NVIDIA’s GPU, things like that never happened; the only problem is the small VRAM.
I have no experience with the A770, but I guess it is similar to Metal.
Do you use macOS Sonoma? Mine is Sonoma 14.1.1 - Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000 arm64.