- cross-posted to:
- localllama@poweruser.forum
What would happen if you replaced the decoder during finetuning? Would you also see a speedup, but at the expense of VRAM?
Hmm, it looks like such a standard linear algebra optimisation that I’m surprised GPUs don’t do it automatically. But yep, looks good either way.
Any chance P40s can benefit from this through llama.cpp?
It seems like this approach could also be useful in situations where the goal isn’t speed, but rather “quality” (by a variety of metrics).