So Mistral-7b is a pretty impressive 7B param model … but why is it so capable? Do we have any insights into its dataset? Was it trained very far beyond the scaling limit? Any attempts at open reproductions or merges to scale up # of params?
I assume the progress comes from well-structured, high-quality training data combined with an incremental "learning schedule" (i.e., curriculum learning). At least that's what several reports of big gains seem to attribute their progress to, and it's intuitive that it would help a lot. A rough sketch of what such a schedule might look like is below.
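Purely as illustration of what an "incremental learning schedule" could mean in practice, here's a toy curriculum-learning sketch. Everything in it (the difficulty proxy, the pool growth rate, the function names) is a hypothetical assumption on my part, not anything Mistral has disclosed:

```python
# Toy curriculum-learning sketch: train on the "easiest" examples first
# and gradually widen the eligible pool over training. All values here
# are made up for illustration.

import random

def difficulty(example: str) -> float:
    # Toy proxy: longer texts count as harder. A real pipeline might
    # instead use perplexity under a small reference model.
    return len(example.split())

def curriculum_batches(corpus: list[str], num_steps: int, batch_size: int):
    # Sort the corpus once, easiest first.
    ordered = sorted(corpus, key=difficulty)
    for step in range(num_steps):
        # Linearly grow the eligible slice from 10% to 100% of the data.
        frac = 0.1 + 0.9 * step / max(1, num_steps - 1)
        pool = ordered[: max(batch_size, int(frac * len(ordered)))]
        yield random.sample(pool, min(batch_size, len(pool)))

if __name__ == "__main__":
    corpus = [f"example {'word ' * n}" for n in range(1, 101)]
    for step, batch in enumerate(curriculum_batches(corpus, num_steps=5, batch_size=4)):
        print(step, [difficulty(x) for x in batch])
```

The idea is just that early batches are drawn from a small easy subset, and later batches from the full distribution; whether anything like this was actually used for Mistral-7B is pure speculation.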