Wondering what everyone thinks, assuming this is true. It seems they're already beating all open-source models, including Llama-2 70B. Is this all down to data quality? Will Mistral be able to beat it next year?
Edit: Link to the paper -> https://arxiv.org/abs/2310.17680
In all honesty… I don't know. I've used Turbo for roleplaying purposes A LOT, and to me the model just seems to… get things better than most others, and by that I mostly mean following instructions about how to behave. If I told it to generate 150 words, it generated 150 words (or close to that amount). If I told it to avoid generating something specific, it avoided it (for example, when I told Turbo to avoid roleplaying from the user's point of view, it did just that, while lower-parameter models tend to ignore that instruction). This is behavior usually noticeable only in higher-parameter models, as lower-parameter models visibly struggle with following very specific instructions, so that's why I have a hard time believing Turbo is only 20B. It MIGHT be a dataset-quality issue preventing lower-parameter models from following more specific and complex instructions, but what Turbo displayed in my experience doesn't scream "low-parameter model" at all.
The 20B may not be quantized, and the amount of training done on top may not be the same either.
What has your experience with Mistral been? Going from Llama 13B finetunes to Mistral 7B, I found it remarkably better at following instructions (prompt engineering finally felt like more than just guessing and checking). Considering that's just a 7B, a 20B might well be that good (it could also just be an MoE of 20B models).
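For what it's worth, here's a rough back-of-the-envelope on the MoE idea. All the numbers are assumptions on my part (expert count, top-2 routing), not anything from the paper, and real MoEs usually only route the FFN layers, so this is crude. The point is just that "20B" could describe the per-token active compute while the total model is far bigger:

```python
# Back-of-the-envelope MoE sizing. Every number here is a guess, not a leaked spec.
expert_params = 20e9   # hypothetical size of each expert
num_experts   = 8      # hypothetical number of experts
top_k         = 2      # experts activated per token (a common MoE routing choice)

total_params  = num_experts * expert_params   # what you'd have to store
active_params = top_k * expert_params         # what each token actually passes through

print(f"total:  {total_params / 1e9:.0f}B parameters")   # -> 160B
print(f"active: {active_params / 1e9:.0f}B parameters")  # -> 40B
```

So a model that "counts" as 20B-per-expert could still have way more total capacity than a dense 20B, which would go some way toward explaining why it doesn't feel like a small model.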
I only really use Mistral Claude and Collective Cognition, but from the perspective of a roleplayer who uses LLMs mostly for just that, my overall experience with Mistral (finetunes) has been mostly positive. A 7B's speed is undeniable, so that's a major benefit it has over 13Bs, and for a 7B its prose is excellent as well. What I also noticed about Mistral models is that, unlike 13Bs such as MythoMax or ReMM-SLERP, they tend to pay closer attention to character cards as well as your own user description and will more often mention things stated in that description. (For example, my user description in SillyTavern had a note saying that my persona is commonly stalked by ghosts, and the model actually made a little joke about it, asking "how are your ghostly friends doing these days," which is something NO 13B I've used has done before.) Still, a 7B IS just a 7B, so the model tends to hallucinate quite a bit, constantly tries to change the formatting of the roleplay, and tends to roleplay as you unless you REALLY fine-tune the settings to borderline perfection, so I have to swipe and/or edit responses quite a bit.