I ran an experiment with eight 33-34B models that I use for code evaluation and technical assistance, to see what effect GPU power limiting has on inference speed on an RTX 3090.
All models were GGUF Q4 quants. Each model was run only once per power limit due to time constraints. Every model was given the identical prompt: generate a bash script according to instructions.
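The post does not say which tool was used to apply the power limits. As a rough sketch only, assuming the limits are set with `nvidia-smi -pl` and that `run_benchmark` is a hypothetical callable which loads a model, sends the prompt, and returns tokens per second, a sweep like this one could be scripted as follows:

```python
import subprocess
from typing import Callable, Iterable


def power_limit_cmd(watts: int, gpu_index: int = 0) -> list[str]:
    """Build the nvidia-smi call that sets the board power limit in watts."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def sweep(
    limits: Iterable[int],
    run_benchmark: Callable[[], float],  # hypothetical: returns tokens/s
    run=subprocess.run,                  # injectable so the loop can be dry-run
    gpu_index: int = 0,
) -> dict[int, float]:
    """Apply each power limit in turn and record the benchmark result."""
    results: dict[int, float] = {}
    for watts in limits:
        # Changing the limit normally needs root; check=True raises
        # CalledProcessError if nvidia-smi reports a failure.
        run(power_limit_cmd(watts, gpu_index), check=True)
        results[watts] = run_benchmark()
    return results
```

Calling `sweep(range(300, 100, -20), run_benchmark)` would step through the same set points as the table (300 W down to 120 W). Measured power draw would have to be sampled separately while generation is running.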
I’ll abstain from attempting an analysis; you can draw your own conclusions.
Test data below:
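
| Set (W) | Meas (W) | GPU% | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 | Avg T/s | Efficiency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 300 | 291 | 79.50% | 27.26 | 26.14 | 33.21 | 27.54 | 34.83 | 32.56 | 24.58 | 31.1 | 29.65 | 101 |
| 280 | 274 | 80.50% | 26.78 | 27.18 | 33.25 | 27.39 | 34.19 | 30.48 | 26.31 | 31.34 | 29.62 | 108 |
| 260 | 253 | 81.50% | 26.03 | 23.61 | 29.91 | 26.33 | 31.73 | 30.48 | 26.27 | 30.39 | 28.09 | 111 |
| 240 | 233 | 82.00% | 23.71 | 23.13 | 31.49 | 23.64 | 30.12 | 29.72 | 22.5 | 30.93 | 26.91 | 115 |
| 220 | 217 | 84.50% | 19.76 | 20.04 | 25.34 | 19.99 | 26.89 | 24.93 | 22.18 | 25.06 | 23.02 | 106 |
| 200 | 197 | 87.50% | 15.46 | 14.35 | 19.86 | 15.63 | 20.45 | 19.56 | 16.42 | 19.43 | 17.65 | 89 |
| 180 | 179 | 89.50% | 11.39 | 10.57 | 14.58 | 11.03 | 14.67 | 14.13 | 12.32 | 13.65 | 12.79 | 71 |
| 160 | 161 | 93.00% | 7.93 | 6.79 | 9.1 | 7.42 | 9.45 | 8.82 | 7.94 | 8.78 | 8.28 | 51 |
| 140 | 160 | 95.00% | 7.31 | 6.8 | 8.9 | 6.78 | 9.14 | 7.52 | 7.37 | 8.27 | 7.76 | 48 |
| 120 | 160 | 95.00% | 6.81 | 6.31 | 8.19 | 6.97 | 8.46 | 7.56 | 6.93 | 8.24 | 7.43 | 46 |
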
M1 51L airoboros-c34b-3.1.2.Q4_K_M.gguf
M2 51L Zephyrus-L1-33B.q4_K_M.gguf
M3 51L codellama-34b-instruct.Q4_0.gguf
M4 51L phind-codellama-34b-v2.Q4_K_M.gguf
M5 51L tora-code-34b-v1.0.Q4_0.gguf
M6 51L wizardcoder-python-34b-v1.0.Q4_0.gguf
M7 64L yi-34b.Q4_K_M.gguf
M8 51L ziya-coding-34b-v1.0.Q4_0.gguf
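
The post does not define the last two table columns. Recomputing them from the per-model numbers suggests that T/s is the plain mean of M1 to M8 and that EFFICIENCY is that mean divided by the measured watts, times 1000, truncated to an integer (in other words, tokens per kilojoule). That reading is an inference from the data, not something the post states. A minimal sketch that checks it:

```python
# (set W, measured W, [M1..M8 tokens/s]) copied from the table above.
ROWS = [
    (300, 291, [27.26, 26.14, 33.21, 27.54, 34.83, 32.56, 24.58, 31.10]),
    (280, 274, [26.78, 27.18, 33.25, 27.39, 34.19, 30.48, 26.31, 31.34]),
    (260, 253, [26.03, 23.61, 29.91, 26.33, 31.73, 30.48, 26.27, 30.39]),
    (240, 233, [23.71, 23.13, 31.49, 23.64, 30.12, 29.72, 22.50, 30.93]),
    (220, 217, [19.76, 20.04, 25.34, 19.99, 26.89, 24.93, 22.18, 25.06]),
    (200, 197, [15.46, 14.35, 19.86, 15.63, 20.45, 19.56, 16.42, 19.43]),
    (180, 179, [11.39, 10.57, 14.58, 11.03, 14.67, 14.13, 12.32, 13.65]),
    (160, 161, [7.93, 6.79, 9.10, 7.42, 9.45, 8.82, 7.94, 8.78]),
    (140, 160, [7.31, 6.80, 8.90, 6.78, 9.14, 7.52, 7.37, 8.27]),
    (120, 160, [6.81, 6.31, 8.19, 6.97, 8.46, 7.56, 6.93, 8.24]),
]


def mean_tps(per_model: list[float]) -> float:
    """Average tokens/s across the eight models."""
    return sum(per_model) / len(per_model)


def efficiency(per_model: list[float], measured_watts: float) -> int:
    """Tokens/s per kW of measured draw, truncated (assumed definition)."""
    return int(mean_tps(per_model) / measured_watts * 1000)


if __name__ == "__main__":
    for set_w, meas_w, tps in ROWS:
        print(f"{set_w:>3} W  avg {mean_tps(tps):6.3f} T/s  "
              f"efficiency {efficiency(tps, meas_w)}")
```

Under these assumptions the recomputed efficiency values match the table's EFFICIENCY column for all ten rows, and the means agree with the T/s column to two decimals.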