I understand that more memory means you can run a model with more parameters or less compression, but how does context size factor in? I believe it’s possible to increase the context size, and that this will increase the initial processing time before the model starts outputting tokens, but does anyone have numbers?
Is memory for context independent of the model size, or does a bigger model mean that each bit of extra context ‘costs’ more memory?
I’m considering an M2 Ultra for the large memory and low energy per token, although the speed is behind RTX cards. Is this the best option for tasks like writing novels, where quality and comprehension of lots of text beat speed?
Formula to calculate KV cache size, i.e. the memory used by context:
batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per Float16) * n_kv_heads
https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
This blog post is really good; I recommend reading it.
Bigger models usually have more layers, more heads, and a larger hidden dimension, but I’m not sure whether heads or dimensions grow faster. It’s something you can look up, though.
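To make the arithmetic concrete, here’s a minimal sketch of that formula in Python. The Llama-2-7B-style config values (32 layers, d_model 4096, 32 heads, n_kv_heads = 32) are illustrative assumptions, not measurements:

```python
def kv_cache_bytes(batch_size, seqlen, d_model, n_heads,
                   n_kv_heads, n_layers, bytes_per_param=2):
    """KV cache size in bytes, assuming a float16 cache."""
    head_dim = d_model // n_heads
    # The factor of 2 accounts for storing both K and V per layer.
    return (batch_size * seqlen * head_dim * n_layers
            * 2 * bytes_per_param * n_kv_heads)

# Assumed Llama-2-7B-like config: 32 layers, d_model 4096,
# 32 attention heads, no grouped-query attention (n_kv_heads = 32).
size = kv_cache_bytes(batch_size=1, seqlen=4096, d_model=4096,
                      n_heads=32, n_kv_heads=32, n_layers=32)
print(f"{size / 2**30:.2f} GiB")  # 2.00 GiB
```

So a full 4k context on a 7B-class model works out to roughly 2 GiB of cache in float16, on top of the weights themselves.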
Thanks. I would guess the seqlen is the sum of the input and output lengths, since the model feeds back on itself.
Yeah, it’s the total number of tokens the next token is generated from. I don’t know how often the KV cache is updated, though.
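In the common decode loop (an assumption; the thread doesn’t pin down an implementation), K and V for each new token are appended to the cache once per forward pass, so the cache grows linearly with every prompt and generated token. A rough per-token cost, using the same illustrative Llama-2-7B-style numbers as above:

```python
# Sketch of per-token KV-cache growth, assuming K and V are
# appended once per token (prompt tokens during prefill, then
# one per decode step). Config values are assumptions.
d_model, n_heads, n_kv_heads, n_layers = 4096, 32, 32, 32
head_dim = d_model // n_heads

# Bytes added to the cache per token: layers * (K and V) *
# KV heads * head dim * 2 bytes for float16.
bytes_per_token = n_layers * 2 * n_kv_heads * head_dim * 2
print(f"{bytes_per_token / 2**20:.2f} MiB per token")  # 0.50 MiB
```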