• 0 Posts
  • 8 Comments
Joined 1 year ago
Cake day: October 30th, 2023


  • I think this is ass-covering. Microsoft Research doesn't know the scale of ChatGPT? What are the odds?

    They have to deny the leak by offering a non-credible attribution instead of saying "lmao, we just talked to OpenAI engineers over dinner", sure. But this doesn't mean they are wrong, nor Forbes, nor the multiple people who tested Turbo's speed, compared costs, and concluded it's in the 20B range. I'd rather believe that Forbes got an insider leak about the model as it was being readied.

    We know that Turbo is quantized, at least.

    And it really started even with GPT-3. We built this model. And I actually did the initial productionization of it. And so you have this research model that takes all these GPUs. We compressed it down to basically end up running on one machine. So that was effectively a 10 to 20x reduction in footprint. And then over time, we’ve just been improving inference technology on a lot of angles. And so we do a lot of quantization, we do a lot of, honestly, there’s a lot of just like systems work because you have all these requests coming in, you need to batch them together efficiently.
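
    For a rough sense of what that kind of footprint reduction means, here's a minimal back-of-the-envelope sketch of weight memory at different precisions. The parameter counts are illustrative assumptions (a 175B GPT-3-class model and the rumored ~20B Turbo figure), not confirmed numbers, and it ignores activations and KV cache entirely.

    ```python
    # Rough weight-memory footprint at different precisions.
    # Parameter counts are illustrative assumptions, not confirmed figures.
    BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

    def weight_footprint_gb(n_params: float, precision: str) -> float:
        """Approximate memory for the weights alone, in GB."""
        return n_params * BYTES_PER_PARAM[precision] / 1e9

    for n_params, label in [(175e9, "175B (GPT-3-class)"), (20e9, "20B (rumored Turbo)")]:
        for precision in ("fp16", "int8", "int4"):
            gb = weight_footprint_gb(n_params, precision)
            print(f"{label:>20} @ {precision}: {gb:6.1f} GB")
    ```

    At fp16, 175B parameters are ~350 GB of weights; at int4 that drops to ~87.5 GB, the kind of number that fits on a single multi-GPU server rather than a cluster, while a 20B model at int8 is only ~20 GB.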


  • What are you surprised about? Yes, it's 5 versions behind; they're running a data flywheel and are definitely scraping GPT-4 interactions when people use it through their website. v2 was just a shout-out to the community, to build goodwill and bring attention to their project. They've said on HN that they might release newer models – when they're no longer close to SoTA.

    Thank you. Yes, it is the 7th iteration that started with our open-source models. We do plan to open source this model as well down the road, once we’ve released a few more generations.

    It’s business, not charity.


  • According to Skywork-13B's paper, it did train on the training split of GSM8K, but it also scores about the same (and well) on the test set and the GPT-4-synthesized reference set (at least in terms of loss), so I think it did learn something useful rather than just memorize. And we know from the Chinchilla scaling laws that, assuming a dataset of the same quality, you get roughly the same loss from 10B params x 2T tokens as from 6B params x 6T tokens (see the sketch at the end of this comment). Most currently known <10Bs (probably not Mistral, though) have not much more than 2T tokens in them, so it's not a very hard bar to beat.

    Still, it's 6B, and its data quality is inherently reduced for our use cases, because it's trained predominantly on Chinese data, and we know transfer between languages is only lossy (the more the languages differ, the worse it gets, and the smaller the model, the worse it probably is at learning the similarities). It can't reason deeply and it won't blow your mind; it sure didn't blow mine, and it produced some very strange hallucinations, as often happens with Chinese models. And on English text, its loss is way higher than that of, say, LLaMA-13B.
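
    To sanity-check the 10B x 2T vs 6B x 6T claim above, here's a quick sketch using the parametric loss fit from the Chinchilla paper (Hoffmann et al., 2022), L(N, D) = E + A/N^alpha + B/D^beta, with their reported coefficients; it of course assumes identical data quality, which real datasets won't have.

    ```python
    # Chinchilla parametric loss fit: L(N, D) = E + A / N**alpha + B / D**beta
    # Coefficients as reported in Hoffmann et al. (2022).
    E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

    def chinchilla_loss(n_params: float, n_tokens: float) -> float:
        """Predicted pretraining loss for a model of n_params trained on n_tokens."""
        return E + A / n_params**ALPHA + B / n_tokens**BETA

    print(f"10B x 2T: {chinchilla_loss(10e9, 2e12):.3f}")  # ~2.00
    print(f" 6B x 6T: {chinchilla_loss(6e9, 6e12):.3f}")   # ~1.99
    ```

    Both come out around a loss of 2.0, so the "same loss" point holds under that fit.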