In no particular order! Don’t forget to use each model’s specific prompt format for the best generations!
AWQ and GGUF versions are also available.
https://huggingface.co/NurtureAI/zephyr-7b-beta-16k
https://huggingface.co/NurtureAI/neural-chat-7b-v3-16k
https://huggingface.co/NurtureAI/neural-chat-7b-v3-1-16k
https://huggingface.co/NurtureAI/SynthIA-7B-v2.0-16k
Have fun LocalLLaMA fam <3 ! Let us know what you find! <3
First, thank you for sharing. However, I was a bit puzzled by these finetunes, since many Mistral-based finetunes can simply support longer context out of the box by using NTK scaling, see here. Alas, I couldn’t find any information in the model cards about what NurtureAI did to extend the context.
I tested NurtureAI’s synthia-7b-v2-16k-q8_0.gguf with koboldcpp v1.49, using the model’s native RoPE configuration (rope base frequency 1000000), in an existing conversation of 14,971 tokens, asking it to generate a stand-up comedy routine about the preceding conversation; it produced incoherent babbling. The original model, synthia-7b-v2.0.Q8_0.gguf (rope base frequency 10000), run with --ropeconfig 1.0 45000, gave me a coherent stand-up routine that makes sense.
How well NTK scaling works on Mistral-based finetunes depends on the finetune; for some it works better than for others. For example, when I ask the original zephyr-7b-beta.Q8_0.gguf, in an existing conversation of 25,872 tokens, to produce a rhyming poem about the preceding conversation, the resulting poem actually mostly rhymes. Other original finetunes, like synthia-7b-v2.0.Q8_0.gguf, still seem coherent at this context size but can no longer produce rhyming poems.
Anyway, based on my experiments, these extended-context models from NurtureAI do not work for me, while simply applying NTK scaling to the original Mistral-based finetunes does.
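For reference, here is a minimal sketch of how the rope base I used above relates to NTK-aware scaling. The formula and the alpha value are my own assumption about what a reasonable setting looks like, not anything documented by NurtureAI:

```python
# NTK-aware RoPE scaling keeps the frequency scale at 1.0 and raises the base instead,
# roughly base' = base * alpha^(head_dim / (head_dim - 2)) for a context stretch factor alpha.
def ntk_rope_base(base: float = 10000.0, alpha: float = 4.0, head_dim: int = 128) -> float:
    """Estimate the RoPE base frequency for an NTK-aware context extension by factor alpha."""
    return base * alpha ** (head_dim / (head_dim - 2))

# alpha ≈ 4 yields roughly 41000, i.e. the same ballpark as `--ropeconfig 1.0 45000` above.
print(round(ntk_rope_base(alpha=4.0)))
```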
I also released an AWQ version of Chupacabra 7B, to get extra crispy.
I’m not sure who told whom that Mistral models are only 8k or 4k. The sliding window is not the context size; the position embeddings define the context size, and that is 32k.
I’m not sure who told whom that Mistral models are only 8k
The official Mistral product information.
Our very first foundational model: 7B parameters, fast-deployed and easily customisable. Small, yet powerful for a variety of use cases. Supports English and code, and a 8k context length. link
Does Mistral itself actually mention 32k anywhere?
It has 32k; they state it in the config: “max_position_embeddings”: 32768. That is the sequence length.
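A quick way to check those fields yourself (assuming the stock mistralai/Mistral-7B-v0.1 config on the Hub):

```python
from transformers import AutoConfig

# Inspect the stock Mistral-7B config fields being discussed here.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(cfg.max_position_embeddings)  # 32768 -> positional embeddings / sequence length
print(cfg.sliding_window)           # 4096  -> each layer only attends over a 4096-token window
```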
But “true” 16K-32K models like MistralLite seem to perform much better at long context than the default Mistral config.
There is nothing more “true” about MistralLite’s context length. What Amazon (and Yarn) do essentially amounts to removing the sliding window.
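If I understand that approach correctly, “removing the sliding window” boils down to a config change along these lines (the values are illustrative of the MistralLite-style recipe, not NurtureAI’s method, and the output path is hypothetical):

```python
from transformers import AutoConfig

# Sketch: disable sliding-window attention and raise the RoPE base, MistralLite-style.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
cfg.sliding_window = None       # attend over the full sequence instead of a 4096-token window
cfg.rope_theta = 1000000.0      # larger RoPE base, as used by long-context Mistral finetunes
cfg.save_pretrained("./mistral-7b-long-context-config")  # hypothetical output path
```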
is this a scam or what? none of the models above are from NurtureAI:
- zephyr-beta is trained by HuggingFace and is 32K by default
- neural-chat is from Intel
- synthia is from migtissera
Original links:
https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
NurtureAI extended the context size to 16k
So, assuming this release does anything at all, the only thing I can think of is that instead of the “hidden size” being 4k (giving a 4k sliding window into the 32k context), it would be a hidden size of 16k, giving a 16k window into the 32k context.
However, that’s just speculation on my part, because otherwise the release means nothing… which would be weird.
That’s not what hidden size does.
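To spell out the distinction (values from the stock Mistral-7B config; the two fields just happen to share the same number):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(cfg.hidden_size)     # 4096 -> width of each token's embedding vector, not a token count
print(cfg.sliding_window)  # 4096 -> attention span in tokens; matching hidden_size is coincidence
```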