Just wondering if anyone with more knowledge of server hardware could point me in the direction of getting an 8-channel DDR4 server up and running (estimated memory bandwidth is around 200 GB/s), which I would think would be plenty for inferencing LLMs.
I'd prefer to go with used server hardware because of the price, and compared to a bunch of P40s with the same amount of memory, the power consumption is drastically lower. I'm just not sure how fast a slightly older server CPU can handle inference.

If I were looking to run 80-120 GB models, would 200 GB/s and dual 24-core CPUs get me 3-5 tokens a second?

  • mcmoose1900@alien.topB · 1 year ago

    A big issue for CPU only setups is prompt processing. They’re kind of OK for short chats, but if you give them full context the processing time is miserable. Nowhere close to 5 tok/sec.

    There is one exception: the Xeon Max with HBM. It is not cheap.

    So if you get a server, at least get a small GPU with it to offload prompt processing.
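
    As a rough sketch of what that looks like with llama-cpp-python (assuming a cuBLAS/CUDA-enabled build; the model path and sizes below are placeholders): keeping n_gpu_layers at 0 leaves the weights in system RAM, while the GPU still picks up the batched prompt (prefill) matmuls through BLAS.

    ```python
    # Sketch: CPU inference with a small GPU assisting prompt processing.
    # Assumes a cuBLAS/CUDA build of llama-cpp-python; paths and sizes are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/my-120b.Q4_K_M.gguf",  # hypothetical file
        n_ctx=4096,        # the full context you actually plan to use
        n_threads=24,      # roughly the physical core count of one socket
        n_gpu_layers=0,    # weights stay in system RAM; the GPU still helps with
                           # batched prompt evaluation via BLAS
    )

    out = llm("Summarize the following document:\n...", max_tokens=256)
    print(out["choices"][0]["text"])
    ```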

    • fallingdowndizzyvr@alien.topB · 1 year ago

      > A big issue for CPU only setups is prompt processing. They’re kind of OK for short chats, but if you give them full context the processing time is miserable. Nowhere close to 5 tok/sec.

      That’s where context shifting comes into play: the entire context doesn’t have to be reprocessed over and over again, just the changes.
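
      A toy sketch of that bookkeeping (illustrative only, not any particular library's API): only the suffix that differs from the cached prefix gets processed again.

      ```python
      # Toy illustration of context shifting / prefix reuse: only the part of the
      # prompt that changed since the last call gets re-"processed".
      def common_prefix_len(a, b):
          n = 0
          for x, y in zip(a, b):
              if x != y:
                  break
              n += 1
          return n

      class PrefixCachingRunner:
          def __init__(self):
              self.cached = []  # tokens whose KV entries we pretend are cached

          def process(self, tokens):
              keep = common_prefix_len(self.cached, tokens)
              reprocessed = tokens[keep:]   # only the changed suffix costs compute
              self.cached = list(tokens)
              return len(reprocessed)

      runner = PrefixCachingRunner()
      chat = ["sys", "user: hi", "bot: hello"]
      print(runner.process(chat))                          # 3: cold start
      print(runner.process(chat + ["user: how are you"]))  # 1: only the new turn
      ```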

  • FaustBargain@alien.topB · 1 year ago

    My setup:

    EPYC Milan-X 7473X 24-Core 2.8GHz 768MB L3

    512GB of HMAA8GR7AJR4N-XN HYNIX 64GB (1X64GB) 2RX4 PC4-3200AA DDR4-3200MHz ECC RDIMMs

    MZ32-AR0 Rev 3.0 motherboard

    6x 20tb WD Red Pros on ZFS with zstd compression

    SABRENT Gaming SSD Rocket 4 Plus-G with Heatsink 2TB PCIe Gen 4 NVMe M.2 2280

    You can probably get away with a non-X without much of a performance difference. It might make a difference with very tiny models, but that’s not the point of getting such a beastly machine.

    I got the Milan-X because I also use it for CAD, circuit board development, gaming, and video editing, so it’s an all-in-one for me.

    Also, my electric bill went from $40 a month to $228 a month, but some of that is because I haven’t set up the suspend states yet and the machine isn’t sleeping the way I want it to; I just haven’t gotten around to it. I imagine that would cut the bill in half, and choosing the right fan manager and governors might save me another $30 a month.

    I can run Falcon 180B unquantized and still have tons of RAM left over.
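
    For scale, a quick back-of-the-envelope check (assuming "unquantized" means fp16/bf16 weights at 2 bytes per parameter):

    ```python
    # Rough weight-memory estimate for Falcon-180B at fp16/bf16.
    params = 180e9
    bytes_per_param = 2
    weights_gb = params * bytes_per_param / 1e9
    print(f"~{weights_gb:.0f} GB of weights")  # ~360 GB, leaving ~150 GB of 512 GB free
    ```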

    • fallingdowndizzyvr@alien.topB · 1 year ago

      > Also, my electric bill went from $40 a month to $228 a month

      I take it you live in a low-cost electricity area if your bill was $40 before that. Where I live, people can pay ten times that even if they just live in an apartment. So in high-cost areas like mine, the power savings, and thus the electricity cost savings, of something like a Mac would end up paying for it.

    • Aaaaaaaaaeeeee@alien.topB · 1 year ago

      No way, you’re that one guy I uploaded the f16 airoboros for! I was hoping you’d get the model, and I think you did :)

  • Aphid_red@alien.topB · 1 year ago

    Getting 3-5 tokens a second on 120 GB models requires a minimum of 360-600 GB/s of memory throughput (just multiply the numbers), and likely about 30% more due to various inefficiencies, since you never quite reach the theoretical maximum RAM throughput and there is more to evaluating an LLM than just the matmuls. So 468-780 GB/s.
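
    The arithmetic behind that, as a quick sketch (the 1.3 factor is the ~30% overhead mentioned above):

    ```python
    # Generation speed is roughly bandwidth / bytes-read-per-token, since the whole
    # model is streamed from RAM once per generated token.
    model_gb = 120   # GB of weights read per token
    overhead = 1.3   # ~30% for never reaching peak bandwidth, non-matmul work, etc.

    for target_tps in (3, 5):
        needed = model_gb * target_tps * overhead
        print(f"{target_tps} tok/s on {model_gb} GB needs ~{needed:.0f} GB/s")  # 468 / 780

    print(f"200 GB/s gives roughly {200 / (model_gb * overhead):.1f} tok/s")    # ~1.3
    ```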

    This might be what you’re looking for, as a platform base:

    https://www.gigabyte.com/Enterprise/Server-Motherboard/MZ73-LM1-rev-10

    24 channels of DDR5 gets you up to about 920 GB/s of total memory throughput, so that meets the criterion; about as much as a high-end GPU, actually. The numbers on Genoa look surprisingly good. Well, maybe not the power consumption: ~1100 W for CPU and RAM is a lot more than the ~300 W an A100 would use, and you could probably power-limit the A100 to 150 W and it would still be faster.
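
    The 920 GB/s figure falls straight out of the channel math (assuming DDR5-4800, which is what Genoa officially supports):

    ```python
    # Theoretical peak bandwidth for a dual-socket Genoa board:
    # channels x transfer rate x 8 bytes per 64-bit transfer.
    channels = 24        # 12 per socket, two sockets
    mt_per_s = 4800e6    # DDR5-4800
    peak_gbps = channels * mt_per_s * 8 / 1e9
    print(f"{peak_gbps:.0f} GB/s")  # ~922 GB/s theoretical peak
    ```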

    Of course, during prompt processing you’ll be bottlenecked by CPU speed. I’d estimate a 32-core Genoa CPU does roughly 2 TFLOPS of FP64 (based on the 9654’s figure of 5.4 TFLOPS, it should be a bit more than a third of that thanks to higher clocks), so perhaps 4 TFLOPS of FP32 (FP16 isn’t a native instruction on Genoa yet as far as I know, and FP32 should be 2x FP64 with AVX). Compare that to ~36 TFLOPS for a 3090: prompt processing, which is compute-limited, would run at about 1/5th the speed (with two CPUs), or 1/10th if the code isn’t optimized for NUMA. Honestly, that’s not too bad.

    But if you want the best of both worlds, add a 3090, 4090, or 7900 XTX and offload prompt processing with BLAS. You get decent inference speed for a huge model (roughly equal to or better than anything except an A100/H100) and also good prompt processing, since the KV cache should fit in the GPU’s memory.
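
    To see why the KV cache fits comfortably on a consumer card, a rough estimate (the layer/head numbers are Llama-2-70B-ish with GQA, purely as an example; other large models will differ):

    ```python
    # KV cache size: 2 (K and V) x layers x kv_heads x head_dim x context x bytes/elem.
    # Example dimensions resemble Llama-2-70B (80 layers, 8 KV heads, head_dim 128).
    layers, kv_heads, head_dim = 80, 8, 128
    context = 4096
    bytes_per_elem = 2   # fp16

    kv_gb = 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9
    print(f"~{kv_gb:.1f} GB")  # ~1.3 GB, easily within a 24 GB GPU
    ```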

    As far as CPU prices go: the 9334 seems to range from about $700 (used, qualification samples) to $2,700 (new), and has the core count. A step up is the 9354, which has the full cache size; that might be relevant for inference.

    • jasonmbrown@alien.topOPB · 1 year ago

      I appreciate the info; this is probably the closest to what I am asking for. It seems that no matter what I look at, unless I have $10,000 to fork over, I am going to be restricted in some way or another.