So I’m considering getting a good LLM rig, and the M2 Ultra seems to be a good option for large memory, with much lower power usage/heat than 2 to 8 3090s or 4090s, albeit with lower speeds.
I want to know if anyone is using one, and what it’s like. I’ve read that it is less supported by software which could be an issue. Also, is it good for Stable Diffusion?
Another question is about memory and context length. Does a big memory let you increase the context length with smaller models where the parameters don’t fill the memory? I feel a big context would be useful for writing books and things.
Is there anything else to consider? Thanks.
M2 Ultra user here. I threw some numbers up for token counts: https://www.reddit.com/r/LocalLLaMA/comments/183bqei/comment/kaqf2j0/?context=3
With the 147GB of VRAM I have available, I’m pretty sure I could use all 200k tokens available in a Yi 34b model, but I’d be waiting half an hour for a result. I’ve done up to 50k in CodeLlama, and it took a solid 10 minutes to get a response.
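On the memory versus context question, here is a rough back-of-the-envelope sketch of how the KV cache grows with context length. The model shape below (60 layers, 8 KV heads, head dimension 128) is an assumption for a Yi-34B-style model, so check the model's own config before relying on it. The function name `kv_cache_bytes` is made up for illustration, and the estimate covers only an fp16 KV cache, not the weights or any runtime overhead.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# ASSUMPTION: Yi-34B-style shape; verify against the model's config.json.
YI_34B = dict(n_layers=60, n_kv_heads=8, head_dim=128)

for ctx in (4_096, 50_000, 200_000):
    gib = kv_cache_bytes(ctx, **YI_34B) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB of fp16 KV cache")
```

Under those assumptions the cache at 200k tokens comes to roughly 46 GiB on top of the weights, which is consistent with a large-memory machine being able to hold it. It says nothing about how long the prompt takes to process.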
The M2 Ultra’s big draw is its big RAM; it’s not worth it unless you get the 128GB model or higher. You have to understand that the speed of the M2 Ultra doesn’t remotely compare to something like a 4090; CUDA cards are gonna leave us in the dust.
Another thing to consider is that we can only use GGUFs via llama.cpp; there’s no support for anything else. In that regard, I’ve seen people put together builds with 3 or more Tesla P40s that have the exact same limitation (can only use llama.cpp) but cost half the price or less.
I chose the M2 Ultra because it was easy. Big VRAM, and it took me less than 30 minutes from the moment I got the box to be chatting to a 70b q8 on it. But if speed or price is a major consideration, more so than the level of effort to set up, then the M2 Ultra would not be the answer.
This is something I’ve noticed with large context as well. This is why the platform built around LLMs is what will be the major differentiator for the foreseeable future. I’m cooking up a workflow to insert remote LLMs as part of a chat workflow, and about an hour ago I successfully tested running inference on a fast Mistral-7B model and a large Dolphin-Yi-70B on different servers from a single chat view. This will unlock the capability to have multiple LLMs working together to manage context by providing summaries, offloading realtime embedding/retrieval to a remote LLM, and a ton of other possibilities.
I got it working on a 64GB M2 and a 128GB M3. Tonight I will insert the RTX 4090 into the mix. The plan is to have the 4090 run small LLMs, think 13B and smaller. These run at light speed on my 4090. Its job can be to provide summaries of the context by using LLMs finetuned for that purpose. The new Orca13B is a promising little agent that so far follows instructions really well for these types of workflows. Then we can have all 3 servers working together on a solution. Ultimately, all of the responses would be merged into the “ideal response” and output as the “final answer”.
I am not concerned with speed for my use case as I use LLMs for highly technical work. I need correctness above all, even if this means waiting a while for the next step.
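To make the summarize-then-answer idea concrete, here is a minimal sketch of one way it could be wired up. It is not the poster's actual implementation. The names `answer_with_summary`, `summarizer`, and `main_model` are hypothetical, and each model is represented by a plain Python callable. In a real setup each callable would presumably wrap a network request to one of the servers.

```python
from typing import Callable, List

# A "model" here is just prompt in, completion out (hypothetical interface).
Generate = Callable[[str], str]


def answer_with_summary(
    history: List[str],
    question: str,
    summarizer: Generate,
    main_model: Generate,
    keep_recent: int = 4,
) -> str:
    """Compress older turns with a small, fast model, then send the
    summary plus the most recent turns to the large model."""
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]

    summary = ""
    if older:
        summary = summarizer("Summarize this conversation briefly:\n" + "\n".join(older))

    parts = []
    if summary:
        parts.append("Summary of earlier conversation:\n" + summary)
    if recent:
        parts.append("Recent turns:\n" + "\n".join(recent))
    parts.append("Question:\n" + question)
    return main_model("\n\n".join(parts))


# Stand-in models for a dry run; replace with calls to real servers.
def fake_small(prompt: str) -> str:
    return "SUMMARY(%d chars)" % len(prompt)


def fake_large(prompt: str) -> str:
    return "ANSWER based on:\n" + prompt


turns = ["turn %d" % i for i in range(10)]
print(answer_with_summary(turns, "What next?", fake_small, fake_large))
```

The large model only ever sees the summary and the last few turns, so its prompt stays short no matter how long the chat gets. The trade-off is that anything the small model drops from the summary is lost to the large one.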
I’m also going to implement a mesh VPN so we can do this over WAN and scale it even more with a trusted group of peers.
The magic behind ChatGPT is the tooling and how much compute they can burn. My belief is the model is less relevant than folks think. It’s the best model, no doubt, but if we were allowed to run it on the CLI as a pure prompt/response workflow between user and model with no tooling in between, my belief is it would be a lot like the best open source models…
Is it not possible to port ExLlamaV2 to Metal? At least on a 4090, it’s much (much) faster at processing the input than llama.cpp.
I imagine there’s a lot of work to do so, but I can’t imagine it’s impossible. Probably just not something folks are working on.
I don’t particularly mind too much, because the quality difference between exl2 and gguf is hard for me to work past. Just last night I was trying to run this NeuralChat 7b everyone is talking about on my Windows machine in 8bpw exl2, and it was SUPER fast, but the model was so easily confused. Before giving up on it, I grabbed the q8 gguf and swapped to it (with no other changes) and suddenly saw why everyone was saying that model is so good.
I don’t mind speed loss if I get quality, but I can’t handle quality loss to get speed. So for now, I really don’t mind only using gguf, because it’s perfect for me.
Hmm, I didn’t notice a major quality loss when I swapped from mistral-7b-openorca.Q8_0.gguf (running in koboldcpp) to Mistral-7B-OpenOrca-8.0bpw-h6-exl2 (running in text-gen-webui). Maybe I should try again. Are you sure you were using comparable sampling settings for both? I noticed, for example, that SillyTavern has entirely different presets per backend.
I still need to try the new NeuralChat myself too. I was just going to go for the exl2, so this could be a good tip!