I am not able to reproduce the ~2x speedup that others have reported with a 70B target model and a 1B draft model, using a 7B target and a 1.1B draft; a back-of-envelope estimate of what I could expect is sketched after the model list.
target model: dolphin-llama2-7b.Q4_K_S.gguf
draft model: tinyllama-1.1b-1t-openorca.Q4_K_S.gguf
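For reference, this is the back-of-envelope I had in mind: a sketch of the usual geometric-acceptance model of speculative decoding, not of llama.cpp's actual scheduling. The acceptance rate and per-token costs are taken from the runs below, and the 70B cost ratio is only a rough guess.

```python
# Idealized speculative-decoding estimate: each drafted token is accepted
# independently with probability p_accept, and one batched verification pass
# by the target costs about one ordinary target token (batch_cost = 1.0).
def expected_speedup(p_accept, n_draft, cost_ratio, batch_cost=1.0):
    # Expected tokens emitted per verification round (geometric series).
    tokens_per_round = (1 - p_accept ** (n_draft + 1)) / (1 - p_accept)
    # Round cost in units of one target forward pass:
    # n_draft draft passes plus one batched verification pass.
    round_cost = n_draft * cost_ratio + batch_cost
    return tokens_per_round / round_cost

# 1.1B draft vs 7B target on my machine: ~38 ms vs ~200 ms per token.
print(expected_speedup(p_accept=0.66, n_draft=16, cost_ratio=38 / 200))  # ~0.7
# 1B draft vs 70B target: the cost ratio is roughly 10x smaller (a guess).
print(expected_speedup(p_accept=0.66, n_draft=16, cost_ratio=0.02))      # ~2.2
```

So even in the ideal case where batched verification is as cheap as a single token, a 1.1B draft against a 7B target with n_draft = 16 leaves little room for a 2x win, whereas against a 70B target it does.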
Here are the results:
-------------------------
main -m …/models/tinyllama-1.1b-1t-openorca.Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
llama_print_timings: load time = 278.28 ms
llama_print_timings: sample time = 110.42 ms / 400 runs ( 0.28 ms per token, 3622.56 tokens per second)
llama_print_timings: prompt eval time = 641.88 ms / 20 tokens ( 32.09 ms per token, 31.16 tokens per second)
llama_print_timings: eval time = 15281.09 ms / 399 runs ( 38.30 ms per token, 26.11 tokens per second)
llama_print_timings: total time = 16221.94 ms
Log end
------------------
main -m …/models/dolphin-llama2-7b.Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
llama_print_timings: load time = 1429.41 ms
llama_print_timings: sample time = 108.39 ms / 400 runs ( 0.27 ms per token, 3690.24 tokens per second)
llama_print_timings: prompt eval time = 3139.63 ms / 20 tokens ( 156.98 ms per token, 6.37 tokens per second)
llama_print_timings: eval time = 79913.13 ms / 399 runs ( 200.28 ms per token, 4.99 tokens per second)
llama_print_timings: total time = 83348.57 ms
Log end
------------------
speculative -m …/models/dolphin-llama2-7b.Q4_K_S.gguf -md …/models/tinyllama-1.1b-1t-openorca.Q4_K_S.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
encoded 19 tokens in 3.412 seconds, speed: 5.568 t/s
decoded 402 tokens in 115.028 seconds, speed: 3.495 t/s
n_draft = 16
n_predict = 402
n_drafted = 301
n_accept = 198
accept = 65.781%
draft:
llama_print_timings: load time = 213.69 ms
llama_print_timings: sample time = 1597.32 ms / 1 runs ( 1597.32 ms per token, 0.63 tokens per second)
llama_print_timings: prompt eval time = 421.24 ms / 19 tokens ( 22.17 ms per token, 45.11 tokens per second)
llama_print_timings: eval time = 19697.97 ms / 505 runs ( 39.01 ms per token, 25.64 tokens per second)
llama_print_timings: total time = 118450.52 ms
target:
llama_print_timings: load time = 1342.55 ms
llama_print_timings: sample time = 107.07 ms / 402 runs ( 0.27 ms per token, 3754.48 tokens per second)
llama_print_timings: prompt eval time = 78435.09 ms / 431 tokens ( 181.98 ms per token, 5.49 tokens per second)
llama_print_timings: eval time = 17902.46 ms / 92 runs ( 194.59 ms per token, 5.14 tokens per second)
llama_print_timings: total time = 117198.22 ms
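And here is the same sanity check done directly from the timings above, assuming the target's "prompt eval" line in the speculative run is the batched verification of the drafted tokens and its "eval" line is the single-token steps:

```python
# Per-token cost of the 7B target, from the logs above.
verify_ms_per_tok = 78435.09 / 431   # batched verification in the speculative run
plain_ms_per_tok  = 79913.13 / 399   # ordinary decoding in the standalone run
print(f"verify ~{verify_ms_per_tok:.0f} ms/token vs plain ~{plain_ms_per_tok:.0f} ms/token")

# Total compute behind the 402 emitted tokens.
target_s = (78435.09 + 17902.46) / 1000   # ~96 s of 7B work (523 token evaluations)
draft_s  = (19697.97 + 1597.32) / 1000    # ~21 s of 1.1B work (drafting + sampling)
print(f"speculative ~{402 / (target_s + draft_s):.2f} t/s")   # ~3.4, close to the logged 3.5
print(f"plain 7B    ~{1000 / plain_ms_per_tok:.2f} t/s")      # ~5.0, as in the baseline run
```

If that reading of the timings is right, batched verification here is only about 10% cheaper per token than plain decoding (~182 vs ~200 ms), so pushing 431 + 92 = 523 tokens through the 7B model to end up with 402 has to be a net loss, regardless of how cheap the draft is.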