Why is there no analog to Napster/BitTorrent/Bitcoin with LLMs?

Is there a technical reason that there is no open-source LLM that we can all install on our local hosts, which contributes computing power to answering prompts and rewards contributors by allowing them to enter more prompts?

Obviously, there must be some technical reason that prevents distributed LLMs, or else they would already exist.

    • xqoe@alien.topB · 10 months ago

      When they say that you host your part of the load in exchange for access to the model: how much is that part (RAM, CPU, GPU, HDD, …)?
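      As a rough back-of-the-envelope answer (all numbers below are assumptions for illustration, not figures from any particular project): if a model's weights are sharded evenly across volunteer nodes, each node's share is roughly the total weight size divided by the node count.

```python
# Rough estimate of the per-node share when a model's layers are
# sharded across volunteer nodes. Illustrative numbers only, not
# taken from any specific distributed-inference project.

def per_node_gb(n_params_billion: float, bytes_per_param: float, n_nodes: int) -> float:
    """RAM/VRAM a node needs just for its slice of the weights, in GB."""
    total_gb = n_params_billion * 1e9 * bytes_per_param / 1e9
    return total_gb / n_nodes

# A 70B-parameter model in 4-bit quantization (~0.5 bytes/param) is
# ~35 GB of weights; split across 10 nodes, each hosts ~3.5 GB.
print(per_node_gb(70, 0.5, 10))  # → 3.5
```

      Real systems also need headroom for the KV cache and activations, so the actual per-node footprint is somewhat higher than the weight slice alone.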

      • JackRumford@alien.topB · 10 months ago

        It’s terribly inefficient in many ways. Data centers with the best GPUs are the most efficient, both hardware- and energy-wise, and they’re often built in places with access to cheap or green energy and subsidies. Also, for research and development cash is cheap, so there’s little incentive to play with decentralized inference, which adds a layer of technical complexity and requires building a community. For the vast majority of use cases, the opportunity cost far outweighs just running the workload in a data center.

        • Prudent-Artichoke-19@alien.topB · 10 months ago

          Distributed inference IS indeed slower, BUT it’s definitely not too slow for production use. I’ve used it, and with a proper cluster it’s still faster than GPT-4.
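          A toy model of why it's slower but still usable: in pipeline-parallel inference over the internet, each generated token pays the compute time of every stage plus one network hop per stage. Every number below is an assumption for illustration, not a measurement.

```python
# Toy latency model for pipeline-parallel inference over the internet:
# each token traverses every pipeline stage, paying per-stage compute
# time plus one network hop per stage. All numbers are assumptions.

def tokens_per_second(n_stages: int, compute_ms_per_stage: float, hop_ms: float) -> float:
    per_token_ms = n_stages * (compute_ms_per_stage + hop_ms)
    return 1000.0 / per_token_ms

# 8 stages, 10 ms compute each, 40 ms internet round-trip per hop:
# 400 ms/token, i.e. 2.5 tokens/s — slow, but usable for chat.
print(tokens_per_second(8, 10.0, 40.0))  # → 2.5
```

          The same formula shows why co-located clusters win: dropping `hop_ms` from 40 ms to under 1 ms raises throughput by an order of magnitude.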

        • ColorlessCrowfeet@alien.topB · 10 months ago

          > some niche community uses where the budget is none and people will just distribute the electricity/GPU cost

          Aren’t there a lot of people who don’t run their GPUs 24/7? That would put the marginal cost of equipment at zero, and electricity costs what, something around $1/W-yr?

      • ExTrainMe@alien.topB · 10 months ago

        Bad marketing. I’ve only seen it recently.

        Plus you get one model and no LoRAs (unless something has changed recently).

        • lordpuddingcup@alien.topB · 10 months ago

          It runs a few models, and if others decide to host models it runs those too. Just try the chat web app or the dashboard to see what’s currently running. The issue is that not enough people are donating compute.

      • ortegaalfredo@alien.topB · 10 months ago

        Because Llama-2-70B is similar or better on most metrics, and it’s small enough not to need distributed inference.