My AMD 7950X3D (16 cores, 32 threads) with 64GB DDR5 and a single RTX 4090 runs 13B Xwin GGUF q8 at 45 T/s. With exllamav2, 2x 4090 can run 70B q4 at 15 T/s. The motherboard is an Asus ProArt AM5. For local LLaMA, I think you can get similar speeds with RTX 3090s. In SD, though, the 4090 is 70% better.
‘The performance hit is too much on multigpu systems when using GGUF’
I agree. GGUF has a multi-GPU penalty, but it's the friendliest format for Apple silicon. I have the same setup as you. One 4090 can run Xwin 13B at 40 t/s, but when 2 cards are present it gets only 1/4 of that speed, 10 t/s. So to keep it fast, I have to restrict the CUDA device to a single card while both cards are installed.
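For anyone wondering how to pin a process to one card: a minimal sketch, assuming a CUDA-based backend that respects the standard `CUDA_VISIBLE_DEVICES` environment variable. The helper name `pin_to_single_gpu` is made up for illustration, and the variable has to be set before the inference library initializes CUDA.

```python
import os


def pin_to_single_gpu(device_index: int = 0) -> str:
    """Expose only one GPU to CUDA libraries loaded later in this process.

    Hypothetical helper: it only sets CUDA_VISIBLE_DEVICES, which the CUDA
    runtime reads when it initializes.
    """
    value = str(device_index)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this BEFORE importing or initializing any CUDA-backed library,
# otherwise the second card may already be visible to it.
pin_to_single_gpu(0)
```

The same thing can be done from the shell by setting `CUDA_VISIBLE_DEVICES=0` in front of the command that launches the model.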
Since GGUF likes a single GPU, those who have a 3090/4090 will find 34B the sweet spot with this format.
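A rough back-of-envelope check of why 34B lands there on a 24GB card. This is only a sketch: the 4.5 bits per weight (a typical ballpark for a 4-bit quant) and the 2 GB allowance for context and buffers are my assumptions, not measured numbers, and the function names are made up.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (decimal)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed overhead fit on a single card."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


# Assumed ~4.5 bits/weight quant on a 24 GB 3090/4090.
print(weights_gb(34, 4.5))        # size of a 34B model's weights in GB
print(fits_in_vram(34, 4.5, 24))  # 34B on one card
print(fits_in_vram(70, 4.5, 24))  # 70B on one card
```

Under these assumptions 34B comes out at about 19 GB of weights and fits on one 24GB card, while 70B at the same quant does not.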
Try GGUF. This format likes a single GPU, especially if you have 80GB of VRAM. I think you can run a 70GB GGUF with all layers on the GPU.