Why is there no analog to Napster/BitTorrent/Bitcoin with LLMs?
Is there a technical reason that there is not some kind of open source LLM that we can all install on our local host which contributes computing power to answering prompts, and rewards those who contribute computing power by allowing them to enter more prompts?
Obviously, there must be a technical reason which prevents distributed LLMs or else it would have already been created by now.
The cost of splitting the task across workers and combining the results is too high. Distributed computing only makes sense if the cost of data transfer is small enough to be ignored.
I’ve seen projects along these lines getting going, it’s coming.
EDIT: hah, I didn’t know Petals existed either.
There are going to be multiple attempts.
I was thinking of distributed MoEs as well.
The question I have is how do you route queries? I don’t know how to do that even if all the experts are in the same cluster, let alone distributed.
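For what it's worth, the routing step itself is usually just learned top-k gating (Switch/Mixtral style). Here's a minimal sketch of that, not tied to any particular distributed project; the hard distributed part is dispatching tokens to wherever each chosen expert lives:

```python
# Minimal sketch of standard top-k expert routing (Switch/Mixtral-style gating).
# Names and shapes are illustrative; this shows the single-node recipe only.
import torch
import torch.nn.functional as F

def route_tokens(hidden, gate_weight, top_k=2):
    """hidden: (num_tokens, d_model); gate_weight: (d_model, num_experts)."""
    logits = hidden @ gate_weight                        # per-token score for each expert
    probs = F.softmax(logits, dim=-1)
    weights, expert_ids = probs.topk(top_k, dim=-1)      # pick the k best experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the mixture weights
    return expert_ids, weights  # dispatch each token to expert_ids, combine outputs with weights
```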
Yeah, it's a work in progress. It's not trivial to set up. It's easy to imagine a way it could be done, but it all has to be built, tested, and refined.
llama.cpp is out there. I'm a C++ person but I don't have deep experience with LLMs generally (how to fine-tune, etc.) and have other projects in progress. But if you look around in the usual places with some search terms you'll find the attempts in progress, and they could probably use volunteers.
My aspirations are more toward the vision side; I'm a graphics person and need to get on with producing synthetic data or something.
I don’t know if there’s much value there when LoRAs are easily portable: you can just select the right LoRA as needed. One base model instance on one machine, many potential experts. This has been demonstrated.
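A rough sketch of what that looks like with PEFT's adapter-swapping calls; the model and adapter names below are placeholders, not real repos:

```python
# Hedged sketch of "one base model, many LoRA experts". Model/adapter names are
# placeholders; the PEFT calls are the usual adapter load/set API.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("some-base-7b")          # loaded once, stays resident
tok = AutoTokenizer.from_pretrained("some-base-7b")
model = PeftModel.from_pretrained(base, "lora-code", adapter_name="code")
model.load_adapter("lora-medical", adapter_name="medical")           # extra "experts" are tiny adapters

def answer(prompt: str, task: str) -> str:
    model.set_adapter(task)                       # pick the right LoRA for this request
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)
```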
I think the primary concern is user security and data privacy: there's no assurance your data is safe when it passes through a network accessible to anyone, including potential malicious actors.
I believe an experimental model could work where users contribute their GPU to host a specific model and receive credits in return, which they can spend to use other models instead of downloading terabytes worth of data.
No, the primary concern is that network latency kills the serial performance of LLMs.
You can have a distributed LLM getting decent throughput in total across many slow generations. You can’t have a distributed LLM with throughput for a single generation competitive with running in a single cluster.
The latencies involved make it tricky. You can’t just split it across them due to latency, which means both computers need to do their compute independently and then get combined somehow, which means you need to be able to break up inference into two completely distinct tasks.
I’m not sure if this is possible, but if it is, it hasn’t been invented yet.
Maybe some BitTorrent version of RingAttention?
It doesn’t really make that much sense at runtime. By the time you get to running large enough models (think GPT-4) you will already have infrastructure built up from training, which you can then use for inference. Why not run queries through that one data center, to minimize latency? For pooled computing resources (prompts are run through one member of a pool, kinda like the SheepIt render farm) it would make more sense, but you’re still limited by varying user hardware and software availability. People might have 1060s or 4090s, Mistral-7Bs or Llama-70Bs. Providing a service to end users means either (1) forcing users to accept quality inconsistency, or (2) forcing providers to maintain very specific software and hardware, plus limiting users to a few models.
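The pool idea boils down to a coordinator matching each prompt to whoever happens to host the requested model; a toy sketch (all names made up) that also shows where the inconsistency comes from:

```python
# Rough sketch of a render-farm-style pool: volunteers advertise what they can run,
# a coordinator forwards each prompt to a matching worker. Everything here is hypothetical.
workers = [
    {"url": "http://worker-a:8000", "model": "mistral-7b", "gpu": "1060"},
    {"url": "http://worker-b:8000", "model": "llama-70b",  "gpu": "4090"},
]

def dispatch(prompt: str, wanted_model: str):
    candidates = [w for w in workers if w["model"] == wanted_model]
    if not candidates:
        raise RuntimeError("no volunteer is hosting that model right now")
    # speed and quality depend entirely on whichever box picks this up
    return candidates[0]["url"], prompt
```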
Why does no one use it?
Bad marketing. I only saw it recently.
Plus you get one model and no LoRAs (unless something changed recently).
It runs a few models, and if others decide to run models it runs with them. Just try the chat web app or the dashboard to see what’s currently running. The issue is not enough people donating compute.
check out chat.petals.dev
It’s terribly inefficient in many ways. Data centers with best GPUs are the most efficient hardware and energy wise. They are often built in places with access to cheap/green energy and subsidies. Also for research/development cash is cheap, so there’s little incentive to play with some decentralized stuff which adds a level of technical abstraction + needing a community. Opportunity cost wayyy outweighs running this in a data center for the vast majority of use cases.
Distributed inference IS indeed slower BUT it's definitely not too slow for production use. I’ve used it and it’s still faster than GPT-4 with the proper cluster.
There are some niche community uses where the budget is nothing and people will just share the electricity/GPU cost.
Aren’t there a lot of people who don’t run their GPUs 24/7? That would put the marginal cost of equipment at zero, and electricity costs what, something around $1/W-yr?
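Taking that $1/W-yr figure at face value (it works out to roughly $0.11/kWh), the back-of-envelope for a single card looks like this; the wattage is an assumption, not a measurement:

```python
# Back-of-envelope for the "$1 per watt-year" figure (1 W for a year is 8.76 kWh,
# so this implies ~$0.11/kWh). Numbers are rough assumptions.
gpu_watts = 300                      # e.g. a 3090-class card under load
dollars_per_watt_year = 1.0
hours_per_year = 24 * 365

yearly_cost = gpu_watts * dollars_per_watt_year   # ~$300/year if run 24/7
hourly_cost = yearly_cost / hours_per_year        # ~$0.034/hour of GPU time
print(f"~${yearly_cost:.0f}/yr, ~${hourly_cost:.3f}/hr")
```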
Transferring the state over the internet so the next card can take over is sloooow. You’d want cards that can take a lot of layers to minimize that.
In other words, you want a few big GPUs in the network, not a bunch of small ones.
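Some rough numbers on why: the thing being shipped between machines each hop is the per-token hidden state, and residential links make that painful. All the figures below are assumptions for illustration, not benchmarks:

```python
# Rough cost of shipping the hidden state between machines during inference.
hidden_size = 8192                    # ~70B-class model, fp16
bytes_per_token = hidden_size * 2     # ~16 KB activation per token per hop
uplink_mbps = 10                      # assumed residential upload speed
rtt_s = 0.05                          # ~50 ms internet round trip
hops = 8                              # model split across 8 volunteer machines

per_hop = bytes_per_token * 8 / (uplink_mbps * 1e6) + rtt_s
print(f"~{per_hop * hops * 1000:.0f} ms of pure transfer per generated token")
# ~500 ms/token here, versus microseconds over NVLink; hence "few and big GPUs"
```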
Yes, for actually dividing models across machines, which was the original idea. I’d shifted to a different (and less technically interesting) question of sharing GPUs without dividing the model.
For dividing training, though, see this paper:
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
I’ve used petals a ton.
Because Llama-2-70B is similar or better on most metrics, and it's small enough not to need distributed inference.
When they say that you host your part of the load to get access to the model, how much is that part (RAM, CPU, GPU, HDD…)?
Nice post. This got me thinking…
While many commenters are discussing the computation aspect, which leads to Petals and the Horde, I am thinking about BitTorrent (since you mentioned it).
We do need a hub for torrenting LLMs. HF is amazing for their bandwidth (okay for the UI) - but once that VC money dries up, we’ll be on our own. So, distributing the models - just the data, not the computation - is also important.
Hopefully the community will transition to LoRAs instead of passing barely changed model weights around.
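The size difference is the whole argument. A rough comparison under assumed shapes (70B-ish base, rank-16 LoRA on the four attention projections), just to show the orders of magnitude:

```python
# Why shipping LoRAs beats re-seeding full checkpoints: rough sizes, assumed shapes.
hidden, layers, rank, targets = 8192, 80, 16, 4       # 70B-ish model, LoRA on q/k/v/o
full_weights_gb = 70e9 * 2 / 1e9                      # fp16 base model ~140 GB
lora_params = layers * targets * 2 * hidden * rank    # two low-rank factors per target matrix
lora_mb = lora_params * 2 / 1e6                       # fp16 adapter ~170 MB
print(f"base: ~{full_weights_gb:.0f} GB, adapter: ~{lora_mb:.0f} MB")
```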
You guys are all talking about inference but how about using distributed computing strictly for training. That alone would save developers some serious moolah assuming somebody is able to solve all the technical problems like security & privacy.
That’s how pretraining is already done. You would have the same issue, orders of magnitude greater latency. Given the number of calculations per training epoch, you don’t want to be bound by the slowest worker in the cluster. OpenAI etc. use 40 Gbps (or 100 Gbps nowadays) backplanes between A100/H100 GPU servers. Sending data over the Internet to an Nvidia 1080 is simply slow.
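To put numbers on the bandwidth gap: naive data-parallel training has to exchange a full set of gradients every step. The figures below are assumptions, but the ratio is the point:

```python
# Gradient sync time per step: datacenter backplane vs. a home uplink (assumed figures).
params = 7e9                            # a 7B-parameter model
grad_bytes = params * 2                 # fp16 gradients, ~14 GB per sync
links = {"100 Gbps backplane": 100e9, "35 Mbps home uplink": 35e6}

for name, bps in links.items():
    seconds = grad_bytes * 8 / bps
    print(f"{name}: ~{seconds:.0f} s per gradient sync")
# ~1 s vs. ~50 minutes per step; schemes like SWARM exist precisely to work around this
```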
The simple answer is that you can’t parallelize LLM work.
It generates answers word by word (or token by token, to be more precise), so it’s impossible to split the task into 10-100-1000 different pieces that you could send into this distributed network. Each word in the LLM's answer also serves as part of the input to calculate the next one, so LLMs are actually counter-distributed systems.
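The serial dependency in a nutshell, as a toy greedy decoding loop (the math inside each step parallelizes fine, as pointed out below, but the steps themselves can't be split up):

```python
# Each new token depends on all previous ones, so generation is a loop you
# cannot chop into independent chunks. `model` is any callable that returns
# next-token logits as a plain list of floats; this is a sketch, not a real decoder.
def generate(model, tokens, n_new):
    for _ in range(n_new):
        logits = model(tokens)                                        # forward pass over the whole context
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens = tokens + [next_token]                                # output feeds back in as input
    return tokens
```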
You’d better tell the GPU manufacturers that LLM workloads can’t be parallelized.
The point of Transformers is that the matrix operations can be parallelized, unlike in standard RNNs.
The issue with distributing those parallel operations is that for every partition of the workload, you introduce latency.
If you offload a layer at a time, then you are introducing both the latency of the slowest worker and the network latency, plus the latency of combining results back into one set.
If you’re partitioning at a finer grain, e.g. parts of a layer, then you add even more latency.
Latency can go from 1ms per layer in a monolithic LLM to >1s. That means response times measured in multiple minutes.
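Even with a much milder assumption than that worst case, say 50 ms of network overhead per offloaded layer, the arithmetic already lands in minutes per answer. Assumed figures only:

```python
# Back-of-envelope for the per-layer latency penalty (illustrative values).
layers = 80                  # e.g. a 70B-class transformer
local_ms_per_layer = 1       # everything resident on one node
net_ms_per_layer = 50        # one internet round trip per offloaded layer
answer_tokens = 300

local_token_ms = layers * local_ms_per_layer
remote_token_ms = layers * (local_ms_per_layer + net_ms_per_layer)
print(f"local: ~{local_token_ms} ms/token, remote: ~{remote_token_ms / 1000:.1f} s/token")
print(f"a {answer_tokens}-token answer: ~{answer_tokens * remote_token_ms / 60000:.0f} minutes")
```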
It does exist, but it really only works when you have very high speed, low latency connections between the machines. Like InfiniBand.
I mean, they get distributed over multiple GPU cores… what’s it matter if they’re local or not?