Huh, it's not really faster than Tesla P40s then, for some reason.
There are no new 3090s anymore, so comparing the cost to a new 3090 is pointless; what's left is basically just scalped, overpriced "new" 3090s.
Not sure where they got 694GB/s for the Tesla P40; it only has 347GB/s of memory bandwidth.
What kind of token/s do you get with 2x3090 for the 70B models?
Dual CPUs would have terrible performance. This is because the processor reads the whole model every time it generates a token, and if you spread half the model onto the second CPU's memory, the cores in the first CPU would have to read that part of the model through the slow inter-CPU link (and vice versa for the second CPU's cores). llama.cpp would have to implement a way to spread the workload across multiple CPUs, like it already does across multiple GPUs, for this to work.
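A rough back-of-the-envelope sketch of why this is bandwidth-bound (the bandwidth and model-size numbers below are illustrative assumptions, not measurements):

```python
def tokens_per_second(model_size_gb: float, usable_bandwidth_gbs: float) -> float:
    """Upper bound on tokens/s if each generated token streams all weights once."""
    return usable_bandwidth_gbs / model_size_gb

# Assume a ~70B model quantized to ~4 bits is roughly 40 GB of weights.
model_gb = 40.0

# Assume a single socket with ~100 GB/s of realistically usable memory bandwidth.
print(tokens_per_second(model_gb, 100.0))  # ~2.5 t/s best case

# Splitting the model across two sockets doesn't double this, because the half
# of the weights sitting in the other socket's memory has to come over the
# inter-CPU link (assume ~40 GB/s), which becomes the bottleneck for those reads.
print(tokens_per_second(model_gb / 2, 40.0))  # the cross-socket half caps throughput
```

The point of the sketch: token generation speed tracks how fast the weights can be read, so adding a second CPU mostly adds a slower path to half the model rather than doubling throughput.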
A V100 16GB is like $700 on eBay. An RTX 3090 24GB can be had for a similar amount.
Wait what? I am getting 2-3t/s on 3x P40 running Goliath GGUF Q4KS.
Wonder what card you have that’s 20GB?
Definitely thought this was for his homelab
You don’t NEED 3090/4090s. A 3x Tesla P40 setup still streams at reading speed running 120B models.