I understand that more memory means you can run a model with more parameters or less compression, but how does context size factor in? I believe it's possible to increase the context size, and that this will increase the initial prompt-processing time before the model starts outputting tokens, but does anyone have numbers?
Is the memory needed for context independent of model size, or does a bigger model mean that each token of extra context 'costs' more memory?
I'm considering an M2 Ultra for the large memory and low energy per token, although its speed is behind RTX cards. Is this the best option for tasks like writing novels, where quality and comprehension of lots of text beat speed?
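To make my question concrete, here's my back-of-envelope understanding of how the KV cache (the memory that context consumes) scales. The architecture numbers below are hypothetical examples, not any specific model:

```python
# Back-of-envelope KV-cache sizing. The per-token cost depends on the
# model's architecture (layers, KV heads, head dimension), so context
# memory is NOT independent of the model -- but it tracks the attention
# layout, not the parameter count directly.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    # 2x for the separate key and value tensors; fp16 = 2 bytes/element
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Hypothetical 7B-class model: 32 layers, full multi-head attention
# (32 KV heads), head dim 128, at a 4K context:
small = kv_cache_bytes(32, 32, 128, 4096)
# Hypothetical 70B-class model: 80 layers but grouped-query attention
# (only 8 KV heads), head dim 128, same 4K context:
big = kv_cache_bytes(80, 8, 128, 4096)
print(f"7B-class:  {small / 2**30:.2f} GiB")   # 2.00 GiB
print(f"70B-class: {big / 2**30:.2f} GiB")     # 1.25 GiB
```

If that's right, a bigger model doesn't automatically mean a bigger per-token context cost: grouped-query attention can make a 70B model's cache cheaper per token than a 7B model with full multi-head attention. Happy to be corrected.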
Potentially dumb but related question:
I know the Apple M-series chips can use up to ~70% of their unified memory for "GPU" (VRAM) purposes. The 20GB used to load up a Yi-34B model just about uses all of that up.
So: given I still have maybe 8GB of RAM left to work with (assuming I leave 4GB for the system), could I allocate a 128K context buffer and have it live in "normal" RAM?
I'm assuming the heavy computational load is the inference itself, with the model loaded into "VRAM" and handled by the GPU side of the chip. But can the context buffer sit in the remaining regular RAM and still work at a decent speed, or do both the model and the context buffer have to be in "VRAM" to perform well?
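As a sanity check on whether 8GB is even enough, here's my rough fp16 KV-cache estimate for a 128K context. I'm assuming Yi-34B's config is 60 layers, 8 KV heads (GQA), head dim 128; please correct me if those numbers are off:

```python
# Rough KV-cache size for a 128K context at fp16, assuming Yi-34B uses
# 60 layers, 8 KV heads (grouped-query attention), head dim 128.
n_layers, n_kv_heads, head_dim = 60, 8, 128
n_tokens = 128 * 1024
bytes_per_elem = 2  # fp16
# 2x for the separate key and value tensors
cache = 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem
print(f"{cache / 2**30:.1f} GiB")  # 30.0 GiB
```

If I've done that right, a full 128K cache would need ~30 GiB even with GQA, so the 8GB of leftover RAM wouldn't come close regardless of where it lives.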