Look at the board members. It’s an Open Philanthropy/Effective Altruism coup to prevent AI progress, not about mundane concerns like accounting.
Maybe, if it doesn’t get killed over safety concerns. I also think it’s implausible that it’ll be much better than what we have, or even fully catch up to 3.5.
I think this is ass-covering. Microsoft Research don’t know the scale of ChatGPT? What are the odds?
They have to deny the leak with a non-credible attribution instead of saying “lmao, we just talked to OpenAI engineers over dinner”, sure. But that doesn’t mean they are wrong, or that Forbes is, or the multiple people who tested Turbo’s speed, compared costs, and concluded it’s in the 20B range. I’d sooner believe that Forbes got an insider leak about the model as it was being readied.
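For what it’s worth, the speed argument is easy to sanity-check with a back-of-envelope calculation. The sketch below is my own, not Forbes’s or the testers’ methodology: it assumes single-stream decoding is memory-bandwidth-bound and plugs in illustrative A100-class numbers, so it only gives rough ceilings.

```python
# Back-of-envelope sanity check: single-stream decode is roughly memory-bandwidth-
# bound, so every generated token has to stream the full weight set through the
# memory system at least once. All numbers are illustrative assumptions
# (A100-class ~2 TB/s bandwidth, int8 weights), not measurements of OpenAI's stack.

def max_tokens_per_sec(params_billions: float, bytes_per_param: float,
                       bandwidth_tb_per_s: float) -> float:
    """Rough upper bound on single-stream decode speed."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_per_s * 1e12 / model_bytes

for size_b in (20, 70, 175):
    rate = max_tokens_per_sec(size_b, bytes_per_param=1, bandwidth_tb_per_s=2.0)
    print(f"{size_b:>3}B int8 on ~2 TB/s: <= {rate:.0f} tok/s per stream")

# A ~20B int8 model sits comfortably under the bandwidth ceiling at Turbo-like
# speeds; a 175B-class model would need aggressive parallelism to get there.
```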
We know that Turbo is quantized, at least.
And it really started even with GPT-3. We built this model. And I actually did the initial productionization of it. And so you have this research model that takes all these GPUs. We compressed it down to basically end up running on one machine. So that was effectively a 10 to 20x reduction in footprint. And then over time, we’ve just been improving inference technology on a lot of angles. And so we do a lot of quantization, we do a lot of, honestly, there’s a lot of just like systems work because you have all these requests coming in, you need to batch them together efficiently.
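For a concrete (if toy) picture of the kind of quantization the quote alludes to, here is a minimal sketch using PyTorch’s post-training dynamic quantization. The layer sizes and the specific technique are my own illustrative assumptions, not whatever OpenAI actually runs in production.

```python
import torch
import torch.nn as nn

# Minimal sketch of post-training dynamic quantization in PyTorch -- one standard
# way to shrink an fp32 "research model" for serving. Layer sizes are made up;
# this is an illustration, not OpenAI's production pipeline.

model = nn.Sequential(            # stand-in for a trained MLP block
    nn.Linear(4096, 16384),
    nn.GELU(),
    nn.Linear(16384, 4096),
)

# Swap fp32 Linear weights for int8, dequantized on the fly at matmul time;
# roughly a 4x cut in weight memory with a one-line change.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 4096)          # a "batch" of 8 incoming requests
print(quantized(x).shape)         # torch.Size([8, 4096])
```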
What are you surprised about? Yes, it’s 5 versions newer; they’re running a data flywheel, and are definitely scraping GPT-4 interactions when people use it through their website. v2 was just a shout-out to the community, to build goodwill and bring attention to their project. They’ve said on HN that they might release newer models, once those are no longer close to SoTA.
Thank you. Yes, it is the 7th iteration that started with our open-source models. We do plan to open source this model as well down the road, once we’ve released a few more generations.
It’s business, not charity.
According to Skywork-13B’s paper, it did train on the training split of GSM8K, but it also scores practically the same (and highly) on the test set and on the GPT-4-synthesized reference set (at least in terms of loss), so I think it did learn something useful rather than just memorizing. And we know from the Chinchilla scaling laws that, assuming a dataset of the same quality, you get roughly the same loss with 10B params × 2T tokens as with 6B params × 6T tokens. Most currently known <10B models (probably not Mistral, though) have not much more than 2T tokens in them, so it’s not very hard to beat that bar.
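To make that Chinchilla arithmetic explicit, here is a quick check using the parametric loss form and the fitted constants reported by Hoffmann et al. (2022); treat the exact values as ballpark, since the fit assumes a particular data distribution.

```python
# Chinchilla parametric loss fit, L(N, D) = E + A / N**alpha + B / D**beta,
# with the fitted constants reported by Hoffmann et al. (2022). Quick check
# that 10B params x 2T tokens and 6B params x 6T tokens predict ~the same loss.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**ALPHA + B / n_tokens**BETA

print(f"10B x 2T tokens: {predicted_loss(10e9, 2e12):.3f}")  # ~2.00
print(f" 6B x 6T tokens: {predicted_loss(6e9, 6e12):.3f}")   # ~1.99
```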
Still, it’s 6B, and its data quality is inherently reduced for our use cases, because it’s trained on predominantly Chinese data, and we know there is only lossy transfer between languages (the greater the difference between them, the worse it gets, and probably the smaller the model, the worse it is at learning the similarities). It can’t reason deeply and it won’t blow your mind; it sure didn’t blow mine, and it produced some very strange hallucinations, as often happens with Chinese models. And on English text, its loss is way higher than that of, say, LLaMA-13B.
Because they’re very concerned about LLMs being used to help create bioweapons, and a small portion of the data will go a long way there. I believe this will lead to datasets being scrutinized.
All of this is a red herring. The bigger issue is going to be checking the data for biological sequences and the like.
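As a toy illustration of what that kind of check might look like (my own sketch, not anyone’s actual biosecurity filter), crude regexes can already flag documents containing long raw nucleotide or amino-acid runs for human review:

```python
import re

# Toy sketch of screening pretraining text for raw biological sequences. A real
# biosecurity filter would screen against curated databases of sequences of
# concern; these crude regexes merely flag long runs of nucleotide or amino-acid
# alphabet characters for human review.

DNA_RUN = re.compile(r"[ACGTU]{60,}")                     # long nucleotide run
PROTEIN_RUN = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]{80,}")  # long amino-acid run

def flag_document(text: str) -> bool:
    compact = re.sub(r"\s+", "", text.upper())            # ignore whitespace/line breaks
    return bool(DNA_RUN.search(compact) or PROTEIN_RUN.search(compact))

docs = [
    "Ordinary web text about cooking and travel.",
    "atg" + "gattaca" * 20,   # synthetic nucleotide-looking string
]
print([flag_document(d) for d in docs])  # [False, True]
```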
The 34B was apparently a train-on-test champ, so I’m less hyped about this one than about other big Chinese models.