Wondering what everyone thinks, assuming this is true. It seems they’re already beating all open-source models, including Llama-2 70B. Is this all due to data quality? Will Mistral be able to beat it next year?
Edit: Link to the paper -> https://arxiv.org/abs/2310.17680
Given how good Mistral 7B is in my personal experience, the idea that a model 3x its size could BE GPT-3.5 Turbo is no longer implausible.
I think so too. I really hope those guys are funded to improve the model. Serious talent in that team.
It is, given its age. If you built it today, with what research has shown since, then yes, but GPT-3.5 predates that. It would imply a brutal knowledge advantage for OpenAI over what has been published.
GPT-3.5 Turbo was released on March 1, 2023, for what it’s worth, which makes it not a very old model.
Only if you assume that 3.5 TURBO is not a TURBO version of GPT-3.5. That would put the original release in March 2022, likely with six months or more of training and tuning before it. So you’re saying that when they made the Turbo version, they started fresh, with new training data and an approach based on the MS Orca papers (which were released in June), and still didn’t change the version number?
Let me just say your assumption barely hangs on a thread of logic.
Oh it’s a TURBO version you say? Is that a technical term? I never said whatever you seem to think I said.
In all honesty… I don’t know. I’ve used Turbo for roleplaying purposes A LOT, and to me the model just seems to… get things better than most others, mostly in terms of instructing it to behave a certain way. If I told it to generate 150 words, it generated 150 (or close to that many) words. If I told it to avoid generating something specific, it avoided generating that thing (for example, when I told Turbo to avoid roleplaying from the user’s point of view, it did just that, while lower-parameter models seem to ignore that). This is behavior usually noticeable only in higher-parameter models, since lower-parameter models visibly struggle with following very specific instructions, so that’s why I have a hard time believing Turbo is only 20B. It MIGHT be a dataset-quality issue preventing lower-parameter models from following more specific and complex instructions, but what Turbo displayed in my experience doesn’t scream “low-parameter model” at all.
The 20B may not be quantized, and the amount of training done on top may not be the same either.
What has your experience with Mistral been? Because going from Llama 13B finetunes to Mistral 7B, I found it remarkably better at following instructions (prompt engineering finally felt like it wasn’t just guessing and checking). Considering it’s just a 7B, a 20B might well be that good (it could also just be an MoE of 20B models).
I only really use Mistral-Claude and Collective Cognition, but from the perspective of a roleplayer who uses LLMs mostly for just that, my overall experience with Mistral (finetunes) has been mostly positive. The 7B’s speed is undeniable, so that’s a very major benefit it has over 13Bs, and for a 7B its prose is excellent as well. What I also noticed about Mistral models is that, unlike 13Bs such as MythoMax or ReMM-SLERP, they tend to pay closer attention to character cards as well as your own user description, and will more often mention things stated in said description. (For example, my user description in SillyTavern had a note saying my persona is commonly stalked by ghosts, and the model actually made a little joke about it, asking “how are your ghostly friends doing these days”, which is something NO 13B I’ve used has done before.) Still, a 7B IS just a 7B, so the model tends to hallucinate quite a bit, constantly tries to change the formatting of the roleplay, and tends to roleplay as you unless you REALLY fine-tune the settings to borderline perfection, so I have to swipe and/or edit responses quite a bit.
13 billion parameters for instruction following, and 7 billion for safety
No fucking way. GPT-3 has 175B params. In no way, shape, or form could they have discovered the “secret sauce” to make an ultra-smart 20B model. The TruthfulQA paper suggests that bigger models are more likely to score worse, and ChatGPT’s TQA score is impressively bad. I think the papers responsible for impressive open-source models are at most 12-20 months old. The Turbo version is probably quantized, that’s all.
Have you read the orca paper?
The main question is why it’s priced so far below Davinci, which is 175B.
There’s still a lot of room for models to be trained on more data. Take a look at the Llama papers: at the time training was stopped, the loss was still going down. Mistral is on par with L2 13B to L1 30B, and it’s a measly 7B model. If GPT-4 truly has a dataset of 13T tokens, the scaling-law equations from the Chinchilla paper illustrate that a 20B model trained on 13T tokens would reach lower loss than a 70B model trained on 2T tokens. Llama 1 already showed that a 7B model could outperform previous open-source models (GPT-J-6B, Fairseq-13B, GPT-NeoX-20B, OPT-66B) just by virtue of training on more data, and that’s the reason the Llamas are so good to begin with.
Model size matters, sure, but there’s a lot more than size that goes into training a good model.
I think it’s plausible. GPT-3.5 isn’t ultra smart. It’s very good most of the time, but it has clear limitations.
Seeing what Mistral achieved with 7B, I’m sure we can get something similar to GPT-3.5 at 20B given state-of-the-art training and data. I’m sure OpenAI is also using some tricks that aren’t released to the public.
Scaling laws suggest that you can reduce parameter count by increasing the number of tokens. There is a limit, however, and it seems to be at around 32% of the original model size: https://www.harmdevries.com/post/model-size-vs-compute-overhead/
So that would put the resulting model at around 56B. Not sure how they got it down further, maybe through quantization.
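For anyone curious where a limit like that comes from, here’s a rough sketch using the parametric Chinchilla loss fit quoted further down the thread (eq. 10 from Hoffmann et al.). The compute budget, the 1B-parameter search grid, and the shrink fractions are all illustrative assumptions on my part, nothing OpenAI has confirmed; the point is just that the A/N^a term puts a hard floor on how small a model can match a given loss, and the extra training compute blows up as you approach that floor:

```python
# Rough sketch of the "how small can you go" argument, using the parametric
# Chinchilla fit from Hoffmann et al. 2022 (eq. 10): L(N, D) = E + A/N^a + B/D^b.
# The budget, search grid, and shrink fractions below are illustrative assumptions.

E, A, a, B, b = 1.69, 406.4, 0.34, 410.7, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**a + B / n_tokens**b

def tokens_to_match(target_loss, n_params):
    """Tokens a model of n_params needs to reach target_loss (None if impossible)."""
    gap = target_loss - E - A / n_params**a
    if gap <= 0:
        return None  # the A/N^a term alone exceeds the target: no amount of data helps
    return (B / gap) ** (1 / b)

C = 6 * 70e9 * 2e12  # example training budget, roughly a Llama-2-70B-sized run

# Compute-optimal size for this budget, found by brute force over a 1B-parameter grid.
candidates = [x * 1e9 for x in range(1, 501)]
opt_loss, opt_n = min((loss(n, C / (6 * n)), n) for n in candidates)
print(f"compute-optimal size ~{opt_n / 1e9:.0f}B, loss {opt_loss:.4f}")

# Shrink the model and see how much extra compute it takes to match that same loss.
for frac in (0.75, 0.5, 0.4, 0.3):
    n = frac * opt_n
    d = tokens_to_match(opt_loss, n)
    overhead = 6 * n * d / C - 1 if d else float("inf")
    print(f"{frac:4.0%} of optimal size -> compute overhead {overhead:+.0%}")
```

With these particular constants the overhead stays modest down to roughly half the compute-optimal size and climbs past +100% somewhere around the 30% mark, which is about where the linked post draws its line; the exact percentage depends on how much extra training compute you’re willing to eat.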
The scaling laws have quite a bit more wiggle room if you’re willing to accept less benefit for your buck at training time. They mention that it isn’t a hard threshold but more like a region where you can expect diminishing returns, which is true. What the original Chinchilla paper didn’t emphasize is that those diminishing returns aren’t really “diminishing”: yes, you have to put in more training compute to reach a given level of quality, but more often than not training compute pales in comparison to inference compute. The former is a large cost you pay once and then you’re done; the latter is a continuous cost you pay for as long as you host your LLM. Given enough time, inference compute will always pull ahead of training compute.
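To put a rough number on that last point (pure back-of-envelope, using the standard ~6ND FLOPs for training and ~2N FLOPs per generated token for inference; the 20B and 13T figures are just this thread’s guesses, not anything confirmed):

```python
# Back-of-envelope for "inference eventually dwarfs training". Standard rough
# approximations: training takes ~6*N*D FLOPs, inference ~2*N FLOPs per token.
# The 20B / 13T figures below are just the guesses floating around this thread.

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def inference_flops(n_params, tokens_served):
    return 2 * n_params * tokens_served

n_params, n_tokens = 20e9, 13e12
train = training_flops(n_params, n_tokens)

# Inference matches training once 2*N*tokens_served = 6*N*D, i.e. after serving
# 3*D tokens: three times the training set, independent of model size.
breakeven_tokens = train / (2 * n_params)
print(f"training compute:   {train:.3e} FLOPs")
print(f"break-even at about {breakeven_tokens:.3e} tokens served")
print(f"inference by then:  {inference_flops(n_params, breakeven_tokens):.3e} FLOPs")
```

By that estimate inference catches up once you’ve served about three times as many tokens as you trained on, regardless of model size, and a service at ChatGPT’s scale gets there quickly.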
If you take a look at the scaling equations they used (the exact constants may vary between model architectures and datasets, but they still give a reasonably good approximation), then for a model with N parameters and a dataset of D tokens the loss is given by (see eq. 10 in arXiv:2203.15556):
L(N, D) = 1.69 + 406.4 / N^0.34 + 410.7 / D^0.28
If you were to take Llama 2 70B’s values and plug them in, we’d end up with:
L(70*10^9, 2*10^12) = 1.69 + 406.4 / (70*10^9)^0.34 + 410.7 / (2*10^12)^0.28 = 1.9211
By comparison, if we take Turbo’s values and plug them in (here I’ll use 13T training tokens, since that’s the popular estimate for GPT-4’s training set size, so I’ll assume they used it for Turbo as well), we end up with:
L(20*10^9, 13*10^12) = 1.69 + 406.4 / (20*10^9)^0.34 + 410.7 / (13*10^12)^0.28 = 1.905
So in this case Turbo actually does end up coming out ahead of Llama 2, by virtue of the larger training corpus. It also means that if future models significantly increase the pretraining dataset size (whether that’s Llama 3, Llama 4, Mistral, or some other one), there’s a very real chance that smaller models can reach this level of quality.
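If anyone wants to sanity-check those two numbers, here’s eq. 10 as a couple of lines of Python (same caveats as above: the constants are fit-dependent and the 13T figure is an estimate):

```python
# Eq. 10 from the Chinchilla paper (arXiv:2203.15556), constants as quoted above.
def chinchilla_loss(n_params, n_tokens):
    return 1.69 + 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28

print(chinchilla_loss(70e9, 2e12))   # Llama-2-70B-style run (70B, 2T)     -> ~1.921
print(chinchilla_loss(20e9, 13e12))  # rumored Turbo (20B, 13T is a guess) -> ~1.905
```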
There is strong evidence in the literature that you can reduce parameter count if you increase the number of training tokens (and the compute time). Not saying that’s what they did here, but I also wouldn’t be surprised, given how important it is for inference to be as efficient as possible.
Can someone explain what exactly #p is?
The future is looking exciting! Let’s hope that people like Max Tegmark don’t succeed in convincing governments to stop companies from sharing weights with open source.
Your reminder that OpenAI also has access to an enormous amount of hand-annotated and human-generated data to train on: https://www.theverge.com/features/23764584/ai-artificial-intelligence-data-notation-labor-scale-surge-remotasks-openai-chatbots
We’ve seen multiple times that data quality matters a lot. Not surprising if they can fine-tune a 20b model into a high-quality chatbot.
That’s insane. Medium-to-high-end Macs could run it locally.
It looks weird going from 75B text-davinci-003 to 20B gpt-3.5-turbo. But a) we don’t know how they count this (quantization effectively halves the number of parameters), and b) we don’t know anything about how they made it.
Except c) they threw much more money at it, using humans to clean the dataset. A clean dataset can make a 20B sing. We’re using Meta’s chaos in Llama-2 70B, with everything thrown at it…
text-davinci-003 is 175B. You missed a 1 there
wtf? Really? I mean, I kind of thought that too because of the way GPT-3.5 compares to Falcon 180B. Even though Falcon has more parameters, GPT-3.5 still works way better. I credited all of this to the data used to train the model. I believe that not just more parameters but more quality data will help AI models improve proportionally in quality and performance.
Can’t believe that ChatGPT is just 20B; I always thought it was a 175B model. What about the actual 175B+ models? Are they going to be AGI? lol.
If this is true, then it means all open-source models are trained cheaply and are nothing compared to what OpenAI did. People used to confuse the different GPT-__s all the time. The author probably read something about NeoX-20B and thought it was 3.5.
This paper has been withdrawn.
Contains inappropriately sourced conjecture of OpenAI’s ChatGPT parameter count from this http URL, a citation which was omitted. The authors do not have direct knowledge or verification of this information, and relied solely on this article, which may lead to public confusion