I have tried running 13B GGUF q8 on a single 4090 or 3090 and get 40-45 t/s. It's so fun to play with at that speed. When I run 70B GGUF, I have to activate both cards and only get 5 t/s. Is that a multi-GPU penalty? I know exllamav2 can be a lot faster; however, it seems I can't run exllamav2 with the latest Chinese models in ooga UI, for some unknown reason. So upset!
So, for those who have been running NVLinked 2x3090: how fast does 70B GGUF run at q4-q8, in tokens/s? Does it simply behave like a single 3090 with 48GB?