That 32k is theoretical. Base Mistral hasn't actually been trained to work with contexts that large. You might want to look at Amazon's finetune of it for long context: https://huggingface.co/amazon/MistralLite
I suspect it'll amount to trading off some quality for the longer context, though.
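If you want to just try it, something like the standard transformers loading path should work (this is a minimal sketch from memory; the prompt wrapper and generation settings are placeholders, so check the model card rather than trusting me):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amazon/MistralLite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load weights in whatever precision the checkpoint uses
    device_map="auto",    # spread layers across available GPUs / CPU (needs accelerate)
)

# The model card shows a prompt wrapper along these lines; double-check the
# exact format there before relying on it.
prompt = "<|prompter|>Summarize the document below...</s><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```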
I don't think it's as simple as "the nature of the beast," either.
From my own experiments, you can maintain coherence by scaling positions more aggressively the further back a token is, at some cost to accuracy. Stuff further back gets fuzzier but stays accessible, while the recent context still grounds the generation.
I haven’t tested super thoroughly though.
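To make the idea concrete, here's a toy sketch of that kind of position remapping (the numbers and the helper name are made up for illustration, and this is not something base Mistral does out of the box): tokens in a recent window keep their real distances, and everything older gets squeezed into whatever remains of the trained position range.

```python
import numpy as np

def squashed_positions(seq_len, trained_len=8192, keep_recent=2048):
    """Remap 0..seq_len-1 so the last `keep_recent` tokens keep real spacing
    and older tokens are compressed to fit inside `trained_len` positions."""
    pos = np.arange(seq_len, dtype=np.float64)
    if seq_len <= trained_len:
        return pos                         # everything already fits as-is
    dist = (seq_len - 1) - pos             # distance back from the newest token
    old = np.maximum(dist - keep_recent, 0.0)
    budget = trained_len - 1 - keep_recent
    # old.max() > 0 here because seq_len > trained_len > keep_recent
    squashed = keep_recent + old / old.max() * budget
    new_dist = np.where(dist <= keep_recent, dist, squashed)
    new_pos = (seq_len - 1) - new_dist
    return new_pos - new_pos.min()         # shift so the oldest token sits at 0

# e.g. round these and feed them as position_ids instead of 0..seq_len-1
print(squashed_positions(32_000)[:3], squashed_positions(32_000)[-3:])
```

The effect is exactly the tradeoff above: everything stays inside the position range the model was trained on, but the older a token is, the less precisely the model can tell where it sits relative to its neighbours.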