So I’m considering getting a good LLM rig, and the M2 Ultra seems to be a good option for large memory, with much lower power usage/heat than 2 to 8 3090s or 4090s, albeit with lower speeds.

I want to know if anyone is using one, and what it’s like. I’ve read that it is less supported by software which could be an issue. Also, is it good for Stable Diffusion?

Another question is about memory and context length. Does a big memory let you increase the context length with smaller models where the parameters don’t fill the memory? I feel a big context would be useful for writing books and things.

Is there anything else to consider? Thanks.

  • LocoMod@alien.topB
    1 year ago

    This is something I’ve noticed with large context as well. This is why the platform built around LLMs is what will be the major differentiator for the foreseeable future. I’m cooking up a workflow to insert remote LLMs into a chat workflow, and about an hour ago I successfully ran inference on a fast Mistral-7B model and a large Dolphin-Yi-70B on different servers from a single chat view. This will unlock the capability to have multiple LLMs working together to manage context by providing summaries, offloading realtime embedding/retrieval to a remote LLM, and a ton of other possibilities.

    I got it working on a 64GB M2 and a 128GB M3. Tonight I will add the RTX 4090 to the mix. The plan is to have the 4090 run small LLMs, think 13B and smaller. These run at light speed on my 4090. Its job can be to provide summaries of the context using LLMs finetuned for that purpose. The new Orca13B is a promising little agent that so far follows instructions really well for these types of workflows. Then we can have all 3 servers working together on a solution. Ultimately, all of the responses would be merged into the “ideal response” and output as the “final answer”.

    I am not concerned with speed for my use case, as I use LLMs for highly technical work. I need correctness above all, even if this means waiting a while for the next step.
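
    A minimal sketch of the summarize, fan out, and merge flow described above, assuming each server is wrapped in a plain Python callable. The names `Backend` and `run_pipeline`, the character threshold, and the stub models are all hypothetical and only illustrate the idea; this is not the actual implementation, and a real setup would replace the lambdas with network calls to each inference server.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Backend:
    """Hypothetical wrapper around one inference server."""
    name: str
    # In a real setup this would send the prompt to a remote server.
    generate: Callable[[str], str]


def run_pipeline(history: List[str], question: str,
                 summarizer: Backend, workers: List[Backend],
                 merger: Backend, max_context_chars: int = 500) -> str:
    context = "\n".join(history)
    # Step 1: if the context is too long, let the small, fast model compress it.
    if len(context) > max_context_chars:
        context = summarizer.generate("Summarize:\n" + context)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    # Step 2: ask every worker model the same question.
    drafts = [f"[{w.name}] {w.generate(prompt)}" for w in workers]
    # Step 3: have one model merge the drafts into a single final answer.
    return merger.generate(
        "Merge these drafts into one answer:\n" + "\n".join(drafts)
    )


# Stub models so the sketch runs without any servers (assumptions, not real models).
small = Backend("small-13b", lambda p: f"summary({len(p)} chars)")
fast = Backend("mistral-7b", lambda p: "fast draft")
large = Backend("yi-70b", lambda p: "careful draft")
merger = Backend("merger", lambda p: "FINAL: " + p.splitlines()[-1])

answer = run_pipeline(["old message " * 100], "What next?",
                      small, [fast, large], merger)
print(answer)
```

    The stub merger just echoes the last draft; in practice that step would be its own prompt to whichever model is trusted most for correctness.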

    I’m also going to implement a mesh VPN so we can do this over WAN and scale it even more with a trusted group of peers.

    The magic behind ChatGPT is the tooling and how much compute they can burn. My belief is the model is less relevant than folks think. It’s the best model no doubt, but if we were allowed to run it on the CLI as a pure prompt/response workflow between user and model with no tooling in between, my belief is it would be a lot like the best open source models…