Why is no one releasing 70b models?

Longjumping-Bake-557@alien.top · 3 years ago

Why is no one releasing 70b models?

jeffwadsworth@alien.top · 3 years ago

The 13b’s don’t surpass the 70b Airoboros model. Not even close.

BeginningMacaroon374@alien.top · 3 years ago

You need at least 4 A100s for inference
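For a rough sense of where a number like that comes from, here is a back-of-the-envelope sketch. It counts the weights only and ignores the KV cache, activations, and framework overhead, so real requirements are higher. The helper names and the 40 GB / 80 GB card sizes are illustrative assumptions, not something stated in the thread.

```python
import math

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def gpus_needed(num_params: float, bits_per_param: int, gpu_memory_gb: float) -> int:
    """Minimum number of cards whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / gpu_memory_gb)

PARAMS_70B = 70e9  # nominal parameter count for a "70b" model

fp16_gb = weight_memory_gb(PARAMS_70B, 16)  # 16-bit weights
int4_gb = weight_memory_gb(PARAMS_70B, 4)   # 4-bit quantized weights

print(f"fp16 weights: {fp16_gb:.0f} GB, 4-bit weights: {int4_gb:.0f} GB")
print("40 GB cards for fp16:", gpus_needed(PARAMS_70B, 16, 40))
print("80 GB cards for fp16:", gpus_needed(PARAMS_70B, 16, 80))
```

Under these assumptions the "4 A100" figure lines up with 16-bit weights on 40 GB cards. Quantizing to 4 bits shrinks the weights to roughly a quarter of that.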

arekku255@alien.top · 3 years ago

No point in releasing a model that hardly anyone can run.

13B and 7B can be run by the majority of users, 70B not so much…

zBlackVision11@alien.top · 3 years ago

Qwen 72b is coming in 2 days 👍 Will be a real beast.

ninjasaid13@alien.top · 3 years ago

2 days? Bro if they said November and haven’t released it by now, it’s not two days.

FaustBargain@alien.top · 3 years ago

Qwen 72b

I can’t seem to find anything about Qwen 72b except two tweets from a month ago that said it was coming out. Who makes it? What’s it trained on? Any details?

Thireus@alien.top · 3 years ago

Curiously, none of the previous comment’s upvoters has provided an answer to your question.

a_beautiful_rhind@alien.top · 3 years ago

I heard that if it comes out, it might finally be worth exllama supporting it. I heard the 14b was fairly strong.

zBlackVision11@alien.top · 3 years ago

Yes, I also hope it gets exllamav2 support. Here is an issue regarding it: (Qwen model not supported) · Issue #160 · turboderp/exllamav2 (github.com)

Antique_Elk9380@alien.top · 3 years ago

Diminishing returns and cost of compute.

If people saw better returns from larger models, there would be more.

thereisonlythedance@alien.top · 3 years ago

I’ve been training a lot lately, mostly on RunPod, a mix of fine-tuning Mistral 7B and training LoRAs and QLoRAs on 34Bs and 70Bs. My main takeaway is that the LoRA outcomes are just… not so great. Whereas I’m very happy with the Mistral fine-tunes.

I mean, it’s fantastic we can tinker with a 70B at all, but it doesn’t matter how good your dataset is, you just can’t have the same impact as you can with a full finetune. I think this is why model merging/frankensteining has become popular, it’s an expression of the limitations of LoRA training.

Personally, I have high hopes for a larger Mistral model (in the 13-20B range) that we can still do a full fine-tune on. Right now, between my own specific tunes of Mistral and some of the recent external tunes like Starling, I feel like I’m close to having the tools I want/need. But Mistral is still 7B; it doesn’t matter how well it’s tuned, it will still get a little muddled at times, particularly with longer-term dependencies.

Armym@alien.top · 3 years ago

Do you think that finetuning models with more parameters requires more data to actually do something?

thereisonlythedance@alien.top · 3 years ago

With a full finetune I don’t think so – the LIMA paper showed that 1000 high quality samples is enough with a 65B model. With QLoRA and LoRA, I don’t know. The number of parameters you’re affecting is set by the rank you choose. It’s important to get the balance between the rank, dataset size, and learning rate right. Style and structure are easy to impart, but other things not so much. I often wonder how clean the merge process actually is. I’m still learning.
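To make the point about rank concrete, here is a minimal sketch of how the trainable parameter count scales under the standard LoRA formulation, where a frozen weight matrix W (d_out x d_in) gets a low-rank update B @ A with B of shape d_out x r and A of shape r x d_in. The 4096 x 4096 layer size and the ranks below are illustrative assumptions, not settings from anyone's actual training run.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def full_finetune_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out

D_IN = D_OUT = 4096  # hypothetical square projection layer

for rank in (8, 16, 64):
    lora = lora_trainable_params(D_IN, D_OUT, rank)
    full = full_finetune_params(D_IN, D_OUT)
    print(f"rank {rank:>2}: {lora:>9,} trainable vs {full:,} full ({lora / full:.2%})")
```

Even at rank 64 the adapter touches only a few percent as many parameters as a full finetune of the same layer, which is one plausible reason the two behave so differently.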

Vilzuh@alien.top · 3 years ago

I have been trying to learn about fine-tuning and lora training for the past couple weeks but I’m having trouble finding easy enough resources to learn from. Could you give me some pointers to what I can read to get started with finetuning llama2 or mistral?

I have tried training quantized models locally with oobabooga and llama.cpp and I also have access to runpod. Really appreciate any info!

a_beautiful_rhind@alien.top · 3 years ago

What do you mean? Someone just posted 100b, 200b and 600b models, and several 120b models have been released in the past couple of weeks.

Slimxshadyx@alien.top · 3 years ago

Those models can’t be accessed, they say it’s “too dangerous to be released”

__JockY__@alien.top · 3 years ago

It took 3,311,616 GPU hours to train the llama2 base models (that figure is the total for the whole family; the 70b alone accounts for about 1.7M of it). At $1/hour for an A100 GPU you’d spend just over $3M, and on a single GPU it would take approximately 380 years.

Scale that across 10,000 GPUs and you’re looking at 2 weeks and a couple of million dollars.

Fine tune training is much, much faster and cheaper.
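The arithmetic behind those numbers can be sketched as below. The 3,311,616 GPU-hour figure and the $1/hour rate are the ones quoted in this comment. The function name is made up for illustration, and the model assumes perfectly linear scaling across GPUs, which real clusters do not achieve.

```python
HOURS_PER_YEAR = 24 * 365

def training_estimate(gpu_hours: float, usd_per_gpu_hour: float, num_gpus: int):
    """Return (total cost in USD, wall-clock hours), assuming ideal linear scaling."""
    cost = gpu_hours * usd_per_gpu_hour
    wall_clock_hours = gpu_hours / num_gpus
    return cost, wall_clock_hours

GPU_HOURS = 3_311_616  # figure quoted in the comment above

cost, hours_single = training_estimate(GPU_HOURS, 1.0, 1)
_, hours_cluster = training_estimate(GPU_HOURS, 1.0, 10_000)

print(f"cost at $1/hr: ${cost:,.0f}")
print(f"one GPU: {hours_single / HOURS_PER_YEAR:.0f} years")
print(f"10,000 GPUs: {hours_cluster / 24:.1f} days")
```

The total cost does not depend on how many GPUs you spread the work over; only the wall-clock time changes. Swapping in a different hourly rate scales the bill proportionally.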

__JockY__@alien.top · 3 years ago

I’ll reply to myself!

It’s not just about GPU expense. You need a small team of ML data scientists. You need access to (or a way to scrape/generate) a mind-bogglingly broad dataset. You need to clean, normalize, and prepare the dataset. All of this takes a huge amount of expertise, time and money. I wouldn’t be at all surprised if the auxiliary costs surpassed the GPU rental cost.

So the main answer to your question “Why is no one releasing 70b models?” is: it’s really, really, really expensive. Other parts of the answer are: lack of expertise, difficulty of generating a good dataset, and probably a hundred things I haven’t thought of.

But mainly it just comes down to cost. I bet you wouldn’t see any change from $5,000,000 if you wanted to make your own new 70b base model.

ninjasaid13@alien.top · 3 years ago

How much would that be in H100s or H200s?

MerePotato@alien.top · 3 years ago

About tree fiddy

__JockY__@alien.top · 3 years ago

A bushel.

Exotic-Estimate8355@alien.top · 3 years ago

$1/hour for an A100? Where? I can barely get one in GCE, and it’s almost $4/hr

toothpastespiders@alien.top · 3 years ago

I’d like to know too if there’s one for exactly $1. Even half a buck or so difference builds up over time.

But runpod’s close at least, at $1.69/hour.

__JockY__@alien.top · 3 years ago

Yes, but you don’t have Meta’s purchasing power to rent 10,000 GPUs for a month. Economies of scale, my friend!

WaterPecker@alien.top · 3 years ago

Who pays for all this training on all these models we see knocking about? I don’t mean the ones released by the big companies. Like, who has the resources to train a 70b model? One of the guys below said 1.7 million GPU hours, for example. That’s pretty friggin expensive, no?

ChiefBigFeather@alien.top · 3 years ago

13b models magically being better than 70b models is a myth. Most of the 7b or 13b model headlines are just clickbait, the models being good at benchmarks because they were trained on benchmark data.

Try Airo 70b 3.1.2, it is much, much better (for general purposes) than 99% of models out there. Yi based models are strong if you want the larger context.

ambient_temp_xeno@alien.top · 3 years ago

Orca still memeing strong.

Markon101@alien.top · 3 years ago

Google just released a 1.8T model that’s partially trained. Would need a ton of H100s though just to run it, forget training it lol.

SativaSawdust@alien.top · 3 years ago

Look at the market share of video cards with more than 100GB of VRAM.

candre23@alien.top · 3 years ago

It’s adorable that you think any 13b model is anywhere close to a 70b llama2 model.

alcalde@alien.top · 3 years ago

Oooh! Model fight! I’ll try it out and post results later.

extopico@alien.top · 3 years ago

The problem with 70B is that it is incrementally better than smaller models, but is still nowhere near competitive with GPT-4, so it is stuck in no man’s land.

Once we finally get an open source model or architecture that can spar even with GPT-4, let alone 5, there will be much more interest in large models.

Regarding Falcon Chat 180B, it’s no better in my tests and for my use cases than fine tuned Llama 2 70B, which is a shame. It makes me think that there is something fundamentally wrong with Falcon, besides the laughably small context window.