So Mistral-7b is a pretty impressive 7B param model … but why is it so capable? Do we have any insights into its dataset? Was it trained very far beyond the scaling limit? Any attempts at open reproductions or merges to scale up # of params?
Having used it a lot, I can say for sure that without much prompting it readily produces junk web text, URLs, etc., so it is not a fully filtered or fully synthetic dataset.
My guess would be that it's just 'a bit better filtered than Llama-2': a slightly better-quality dataset, trained on slightly longer.
My intuition based on this is that, at a given parameter size, EVERYTHING open source could be optimized considerably more.
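For a sense of what "trained beyond the scaling limit" means here, a minimal back-of-the-envelope sketch using the Chinchilla ~20 tokens/parameter rule of thumb and Llama 2's published ~2T-token budget (Mistral has not disclosed its own token count, so its actual overtraining factor is unknown):

    # Rough Chinchilla-style arithmetic. Numbers for Mistral itself are
    # unknown; Llama 2's 2T-token figure is from its paper.

    PARAMS = 7e9                      # 7B parameters
    CHINCHILLA_TOKENS_PER_PARAM = 20  # compute-optimal rule of thumb

    optimal_tokens = PARAMS * CHINCHILLA_TOKENS_PER_PARAM
    print(f"Compute-optimal tokens for 7B: ~{optimal_tokens / 1e9:.0f}B")  # ~140B

    # Llama-2-7B was trained on ~2T tokens, i.e. well past that point.
    llama2_tokens = 2e12
    print(f"2T tokens is ~{llama2_tokens / optimal_tokens:.0f}x the compute-optimal budget")

So "far beyond the scaling limit" is already the norm for small open models; the open question is how much further data quality and extra training can push a 7B model before it saturates.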