Clearing up confusion: GPT 3.5-Turbo may not be 20b after all

SomeOddCodeGuy@alien.top · 2 years ago

Clearing up confusion: GPT 3.5-Turbo may not be 20b after all

RiotNrrd2001@alien.top · 2 years ago

My guess, pulled from deep within my ass, is that it is a cluster of models, many possibly in the 20b range. The results we get aren’t from a single 20b model, but from one of many models that have been “optimized” (whatever that means) for particular areas. Some router function tries to match input prompts to the best model for that prompt and then sends it to that model.

Totally making things up, here, but I can see benefits to doing it this way.
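For what it's worth, the router idea above can be sketched in a few lines. This is a toy under loud assumptions: the specialist names, the keyword lists, and the `route` function are all hypothetical, and nothing here reflects how OpenAI actually serves its models.

```python
# Toy prompt router. All names and keyword lists are hypothetical, for illustration only.
SPECIALISTS = {
    "code": ("python", "function", "bug", "compile"),
    "math": ("integral", "equation", "prove", "solve"),
    "translation": ("translate", "french", "german"),
}
DEFAULT = "general"


def route(prompt: str) -> str:
    """Return the name of the specialist whose keywords overlap the prompt most."""
    words = set(prompt.lower().split())
    scores = {name: len(words & set(keywords)) for name, keywords in SPECIALISTS.items()}
    best = max(scores, key=scores.get)
    # Fall back to a general model when no keyword matches at all.
    return best if scores[best] > 0 else DEFAULT


print(route("Fix the bug in this Python function"))
print(route("Tell me a story"))
```

A real router would more likely be a small learned classifier than a keyword match, but the shape is the same: score the prompt against each specialist, then dispatch.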

kuzheren@alien.top · 2 years ago

GPT 3.5 probably has more than 20b parameters, but then why is its API several times cheaper than text-davinci-003?

At the same time, GPT 3.5 is good at facts and great at writing text in many languages, while the open-source models aren’t always good even at English. It’s hard to store that much data in 20b parameters, so there’s probably a lot more than 20b.

Kep0a@alien.top · 2 years ago

I agree. It’s possible it’s that small but I just think that’s unlikely.

Acceptable_Can5509@alien.top · 2 years ago

Probably heavily quantized and uses a smaller gpt-3 model.

Ilforte@alien.top · 2 years ago

I think this is ass-covering. Microsoft Research don’t know the scale of ChatGPT? What are the odds?

They have to deny the leak by providing a non-credible attribution instead of saying “lmao we just talked to OpenAI engineers over dinner”, sure. But this doesn’t mean that they, or Forbes, or the multiple people who tested Turbo’s speed, compared costs and concluded it’s in the 20B range, or anyone else, are wrong. I’d rather believe that Forbes got an insider leak about the model as it was being readied.

We know that Turbo is quantized, at least.

“And it really started even with GPT-3. We built this model. And I actually did the initial productionization of it. And so you have this research model that takes all these GPUs. We compressed it down to basically end up running on one machine. So that was effectively a 10 to 20x reduction in footprint. And then over time, we’ve just been improving inference technology on a lot of angles. And so we do a lot of quantization, we do a lot of, honestly, there’s a lot of just like systems work because you have all these requests coming in, you need to batch them together efficiently.”

ttkciar@alien.top · 2 years ago

Perhaps someone heard “10x reduction in footprint” and didn’t realize that meant a reduction in bytes, not a reduction in parameters, and concluded it had a tenth as many parameters?
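To make the bytes-versus-parameters distinction concrete, here is a minimal sketch. It assumes the published 175B parameter count of the original GPT-3 purely as an example size (not a claim about Turbo), and it counts weights only, ignoring activations, KV cache, and file overhead.

```python
def footprint_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes) for a model with `params` weights."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 175e9  # example size only: the published GPT-3 parameter count

fp32 = footprint_gb(PARAMS, 32)
int4 = footprint_gb(PARAMS, 4)

# The parameter count is identical in both cases; only the bytes per weight change.
print(f"fp32: {fp32:.1f} GB, int4: {int4:.1f} GB, reduction: {fp32 / int4:.1f}x")
```

Under these assumptions, quantization alone shrinks the footprint several times over without removing a single parameter, which is the kind of confusion being suggested here.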

ambient_temp_xeno@alien.top · 2 years ago

So they, as big-shot Microsoft scientists, just decided that was good enough to stick it in a table in their paper?

2muchnet42day@alien.top · 2 years ago

“Yes, we made a mistake, we totally don’t have direct knowledge about this”

Tight_Range_5690@alien.top · 2 years ago

Looking at Hugging Face models, a raw 20b is ~42 GB, which is not a lot of space to fit quants of a big model. A Q4_K_M of 70b Llama fits in that (q2 is 30 GB), and the smallest Falcon 180b quantization is 74 GB.

That would make more sense while still being really impressive. Not sure if someone wants to math it out, but what’s the biggest B model that would fit in that on the lowest quants (q2-q3)?
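A rough way to math it out, using only the file sizes quoted in this comment: back out the effective bits per weight of each quant, then ask how many parameters fit in 42 GB at that rate. The helper names are made up for this sketch, and it counts weights only (no KV cache or runtime overhead), so treat the result as an upper bound.

```python
def bits_per_weight(file_gb: float, params_b: float) -> float:
    """Effective bits per weight implied by a file size (GB) and parameter count (billions)."""
    return file_gb * 8 / params_b


def max_params_b(budget_gb: float, bpw: float) -> float:
    """Largest parameter count (billions) whose weights fit in `budget_gb` at `bpw` bits each."""
    return budget_gb * 8 / bpw


# File sizes as quoted in the comment above.
q4_llama70 = bits_per_weight(42, 70)     # Q4_K_M of a 70b Llama
q2_llama70 = bits_per_weight(30, 70)     # q2 of a 70b Llama
q_falcon180 = bits_per_weight(74, 180)   # smallest Falcon 180b quant

for label, bpw in [("q4", q4_llama70), ("q2", q2_llama70), ("falcon", q_falcon180)]:
    print(f"{label}: {bpw:.2f} bits/weight -> ~{max_params_b(42, bpw):.0f}b fits in 42 GB")
```

By these numbers the lowest quants land at a bit over 3 bits per weight, so the ceiling for a 42 GB budget is roughly 100b parameters, not 180b.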

disclaimer: bees are not everything, maybe they have great dataset/money/lies

Auto_Luke@alien.top · 2 years ago

Try the latest and best models under 20 billion parameters (Mistral, Qwen). Then consider that the training set of those models is much smaller and less optimized than that of 3.5-turbo (I assume the current version of 3.5-turbo was trained on more than 10 trillion tokens of partially synthetic data). Also, I don’t find 3.5-turbo all that good, to be honest. It’s realistic for it to be in this size range. I think that, with a maximally optimized latent space, it is possible to achieve similar results with around 10 billion parameters.

Monkey_1505@alien.top · 2 years ago

I tend to disagree that it’s less optimized. Generally, more data and more compute reduce the need for heavy data refinement, whereas smaller models with less available compute benefit more from it.

Auto_Luke@alien.top · 2 years ago

It’s very true that a small amount of high-quality data is better than a lot of garbage, but even better would be a large amount of high-quality data optimized in a way that we haven’t figured out yet. However, OpenAI could be as much as a year ahead. Unfortunately, it is ClosedAI now.

Monkey_1505@alien.top · 2 years ago

That’s true, but they still have less impetus to do that. They are being fairly heavily subsidized by Microsoft, so running costs and compute aren’t much of a concern. It’s only at the point where more data and more compute hit a wall that they have to worry much about data refinement.

FeltSteam@alien.top · 2 years ago

20B parameters makes sense. Compared with GPT-3’s 175B, that is about a 9x reduction in parameter count, and the API cost was reduced by 10x.
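The two ratios can be checked in a few lines. The 175B figure is GPT-3's published size; the prices are assumptions from memory of the list prices at Turbo's launch (about $20 per million tokens for text-davinci-003 versus about $2 for gpt-3.5-turbo), so check them before relying on this.

```python
GPT3_PARAMS_B = 175     # published GPT-3 parameter count, in billions
CLAIMED_TURBO_B = 20    # the disputed figure from the paper

# Assumed list prices in USD per 1M tokens (from memory, not from this thread).
DAVINCI_003_PRICE = 20
TURBO_PRICE = 2

param_ratio = GPT3_PARAMS_B / CLAIMED_TURBO_B
price_ratio = DAVINCI_003_PRICE / TURBO_PRICE

print(f"parameters: {param_ratio:.2f}x smaller, price: {price_ratio:.0f}x cheaper")
```

The ratios lining up is suggestive at best: price also reflects quantization, batching, and margin decisions, so it does not pin down the parameter count on its own.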

caphohotain@alien.top · 2 years ago

Whether it’s 20b or .2b or 200000000b doesn’t bother me at all.

TheTerrasque@alien.top · 2 years ago

ITT: People explaining why it can still be 20b

PookaMacPhellimen@alien.top · 2 years ago

We haven’t approached saturation yet with tokens versus parameters on models which disclose their training. 20B is highly plausible, particularly given the success of Mistral at 7B.

Feztopia@alien.top · 2 years ago

So does Forbes give a source for that claim or is it just the usual “the media is allowed to lie to the public” story?

a_beautiful_rhind@alien.top · 2 years ago

They’re also morons so they’re often simply wrong.