tl;dr: I’m considering building a budget machine for tinkering with LLMs, but I’m not sure whether this is a good idea or how to go about it.
For context: I work in a university department. I currently have access to a 2080 Ti on a shared machine, and we’re in the process of acquiring a small server with 2 L40 cards. So for any larger experiments, I will be able to use this shared machine.
However, I think I would like to have my own small machine for tinkering: trying different models and techniques, just playing around, and preparing larger experiments to be run on the server. My focus is on teaching and education, not on state-of-the-art research.
Aiming for a good amount of VRAM, the 4060 Ti 16GB seems like the most obvious choice; I also like its low power requirements (for both energy and cooling). But this card seems to have a poor reputation overall. I’m also not sure where the current sweet spot for CPU and memory is; I completely lost track of Intel’s and AMD’s generations over the last few years.
Some additional comments regarding common opinions:
- I simply like having my own hardware, and cloud services seem to be more expensive in the long run.
- There is not really a good market for used GPUs where I’m located (Singapore), so the common suggestion “go with a used 3090” doesn’t really work.
Any good suggestions, or am I naive with my idea of a budget machine? Thanks a lot!
You can absolutely do interesting and useful things with very little hardware by running quantized models, especially if you don’t mind slow inference. My preferred quantization is q4_K_M (with GGUF and llama.cpp).
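As a minimal sketch of what that looks like in practice (assuming you use the llama-cpp-python bindings rather than the llama.cpp CLI, and you’ve already downloaded a q4_K_M GGUF file; the model path below is just a placeholder):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Placeholder path to a q4_K_M-quantized 7B model in GGUF format.
llm = Llama(
    model_path="./models/some-7b-instruct.Q4_K_M.gguf",
    n_ctx=2048,      # context window
    n_threads=8,     # CPU threads; tune to your machine
    n_gpu_layers=0,  # 0 = pure CPU inference; raise this if you add a GPU
)

out = llm("Explain GGUF quantization in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```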
I started with a spare Lenovo T560 ThinkPad with 8GB of RAM, which handled 7B models no problem. That’s a $120 eBay purchase. Once I was hooked, I shifted to one of the Dell T7910s in my homelab and moved up to larger models.
I’m still not using a GPU for anything. It’s been CPU inference, which is slow but otherwise great.
You could get just about any $300 desktop, put a decent GPU in it, and enjoy fast inference: 16GB of VRAM allows fast inference with 13B models, and 24GB should handle a heavily quantized 30B. The most expensive bit is the GPU.
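If you want a rough sanity check on those VRAM numbers, here’s a back-of-the-envelope sketch; the ~4.5 bits per weight for q4_K_M and the flat 2GB allowance for KV cache and runtime buffers are my own rough assumptions, not exact figures:

```python
def fits_in_vram(params_billion: float, bits_per_weight: float = 4.5,
                 overhead_gb: float = 2.0, vram_gb: float = 16.0) -> bool:
    """Very rough estimate: quantized weights plus a flat allowance for
    KV cache and runtime buffers. Not a precise calculation."""
    weights_gb = params_billion * bits_per_weight / 8  # billions of bytes ~ GB
    return weights_gb + overhead_gb <= vram_gb

# 13B at ~4.5 bits/weight is ~7.3 GB of weights -> fits comfortably in 16GB
print(fits_in_vram(13, vram_gb=16))  # True
# 33B at ~4.5 bits/weight is ~18.6 GB -> fits in 24GB, but with little headroom
print(fits_in_vram(33, vram_gb=24))  # True
```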
See this sub’s wiki for more detailed hardware tips.