Hey all,
So I am trying to run various local models, partly to learn and partly for some specific research purposes. I have my trusty 16-core Threadripper (Gen 1) with 64GB RAM, an SSD, and an AMD 6700XT GPU.
I installed Ubuntu Server with no GUI/desktop, hoping to keep as much of the hardware free for AI work as possible. It runs Docker on boot and auto-starts Portainer, which I access via the web from another machine. Through that I've deployed a couple of containers: ollama and ollama-webui.
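For reference, I deployed ollama basically with the stock command from its README; if memory serves, my Portainer stack boils down to something like this (volume name and port are the defaults, my exact setup may differ slightly):

    docker run -d --name ollama \
      -v ollama:/root/.ollama \
      -p 11434:11434 \
      ollama/ollama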
Those work, and I am able to load a model and run it, but it is insanely slow. My Windows machine with an 8-core 5800 CPU and 32GB RAM (but a 6900XT GPU) running LM Studio loads and responds much faster with the same model (though still somewhat slow).
I understand now, after some responses and digging, that a GPU is obviously much faster than a CPU for this. But I would have hoped a 16-core CPU with 64GB RAM would still offer decent performance on the DeepSeek Coder 33b model or Meta's latest CodeLlama model (34b). Instead, both take 4+ minutes to even start responding to a simple "show me a hello world app in …" prompt, and then the output crawls at 2 or 3 characters per second.
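One thing I worked out while digging: CPU inference is mostly memory-bandwidth-bound, not core-bound. Back-of-envelope, assuming a ~4-bit quant of a 33b model (roughly 19-20GB of weights) and first-gen Threadripper's quad-channel DDR4 at maybe 60-80GB/s in practice:

    tokens/sec ≈ bandwidth / bytes read per token
               ≈ 70 GB/s / ~19 GB
               ≈ 3-4 tokens/sec, best case

So low single digits is about the ceiling on this CPU no matter how many cores are busy, which lines up with the crawl I'm seeing.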
So first, I would have thought it would run much faster on a 16-core machine with 64GB RAM. But more to the point: is it even using my 6700XT and its 12GB of VRAM? Is there something I need to configure in Docker for the ollama container to give it more RAM, more CPUs, and access to the GPU?
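From my digging so far: Docker does not cap CPU or RAM by default, so the container should already see all 16 cores and 64GB, but the default ollama image is CPU-only and never touches the GPU. For AMD there is a separate ROCm build of the image that needs the kernel devices passed through, and since the 6700XT (gfx1031) isn't on ROCm's officially supported list, it apparently also needs an HSA override to be treated as gfx1030. Something like this, though I haven't verified it on my box yet:

    docker run -d --name ollama \
      --device /dev/kfd --device /dev/dri \
      -e HSA_OVERRIDE_GFX_VERSION=10.3.0 \
      -v ollama:/root/.ollama \
      -p 11434:11434 \
      ollama/ollama:rocm

If that works, `docker exec -it ollama ollama ps` should report the loaded model as running on GPU instead of "100% CPU".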
Or is there a better option to run on Ubuntu Server that mimics the OpenAI API so the web GUI still works with it? Or perhaps a better overall solution that would load and run models much faster on this hardware?
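One candidate that keeps coming up is llama.cpp's built-in server, which exposes an OpenAI-compatible API and has a Vulkan backend that reportedly works on AMD cards without needing ROCm at all. Roughly like this (the model path is just a placeholder, and this assumes Vulkan drivers and build tools are installed):

    # build llama.cpp with the Vulkan backend
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release

    # serve a GGUF model with all layers offloaded to the GPU;
    # the OpenAI-compatible API comes up on port 8080
    ./build/bin/llama-server -m ./models/some-model.gguf -ngl 99 --port 8080

But if anyone has a better-proven stack for this hardware, I'm all ears.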
Thank you.