Hi, you wonderful people!
Here’s a thought that came to my mind: Since training LLMs involves a degree of randomness, is there potentially a way to create an architecture for LLMs (or other AI) that would be somewhat deterministic in its training instead?
What I mean is, could a theoretical architecture exist where everyone could train their own separate checkpoints on different datasets, which, after combining, would result in a checkpoint with combined learning from all these different smaller checkpoints?
This would let thousands of people create their own checkpoints that, when combined, produce something greater than the individual parts. And since training is the most time-consuming part of developing LLMs (or any AI), this approach would let almost everyone contribute their share of processing power toward creating something together.
If viable, this could have huge potential implications for Open Source Software.
I’m looking forward to hearing what all of you smart people have to say about it!
Not sure if that is possible, but even if it is, it would be very inefficient: each model would have to learn the basics from scratch (grammar, sentence structure, etc.). You could pre-train a central model on a large corpus and fine-tune it for a specific task, but then it's just a standard GPT-like model. Aggregating such models, like a mixture of experts or an ensemble, could potentially do what you're describing.
Combining multiple independently trained models into one huge model isn't possible, because each model learns something different for its task, and because of the inherently stochastic nature of LLM training (which is actually desirable for aggregating information). Unless you're only looking to purely "retrieve information", what you describe isn't possible with the current standard training regime.
I’m not aware of any way to accomplish what you’re describing besides those you’ve ruled out (federated learning and mixtures of experts). Naively averaging weights of models trained on disjoint datasets won’t work for LLMs or 1+ hidden layer DNNs (though it will for logistic or linear models). This sounds to me like an open research question.
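For what it's worth, the parenthetical claim about linear models is easy to check: prediction is linear in the weights, so averaging the weights of linear models trained on disjoint shards is exactly the same as averaging their predictions. A quick NumPy sketch on toy data (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression data split into 3 disjoint shards.
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(300, 3))
y = X @ true_w + rng.normal(scale=0.1, size=300)

# Fit one least-squares model per shard, independently.
shard_ws = []
for shard in np.array_split(np.arange(300), 3):
    w = np.linalg.lstsq(X[shard], y[shard], rcond=None)[0]
    shard_ws.append(w)

w_avg = np.mean(shard_ws, axis=0)

# Because prediction is linear in w, averaging weights
# equals averaging predictions -- so the merge is sound here.
assert np.allclose(X @ w_avg, np.mean([X @ w for w in shard_ws], axis=0))
assert np.linalg.norm(w_avg - true_w) < 0.1
```

For a network with hidden layers, that first assertion no longer holds, which is exactly where the naive merge breaks down.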
Naively averaging weights of models trained on disjoint datasets won’t work for LLMs or 1+ hidden layer DNNs
Why would simply aggregating the weights like this categorically fail to produce a reasonable model? Assuming of course that the datasets are all “the same” in some meaningful sense (e.g., equally representative of the same underlying X→Y mappings).
Here’s a simple intuition as to why averaging the weights of a 1+ hidden layer NN won’t work: pick a hidden layer in your model and apply a permutation matrix to its weights (along the input axis) and the inverse permutation to the previous layer (along the output axis). Obviously the model is unchanged from an input/output perspective. Repeat that with N distinct cyclic permutations (where N is the width of the hidden layer you picked). You now have N models that are identical from an input/output perspective. If you average those model weights, every hidden unit ends up with the same weights (each is the mean of all the original units), so the whole layer computes one repeated value and the model collapses. This averaged-weights model is completely broken even though it’s the average of N models that are “identical” from an input/output perspective. QED.
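The argument above is easy to check numerically. A minimal NumPy sketch with a toy two-layer ReLU network (everything here is made up for illustration): build functionally identical copies by permuting hidden units, average the weights, and watch the layer collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: x -> relu(x @ W1) @ W2
d_in, d_h, d_out = 4, 5, 3
W1 = rng.normal(size=(d_in, d_h))
W2 = rng.normal(size=(d_h, d_out))

def forward(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

x = rng.normal(size=(10, d_in))
y_ref = forward(x, W1, W2)

# d_h functionally identical models: cyclically permute the hidden
# units (columns of W1 together with the matching rows of W2).
models = []
for k in range(d_h):
    perm = np.roll(np.arange(d_h), k)
    models.append((W1[:, perm], W2[perm, :]))

# Every permuted copy computes exactly the same function...
for W1p, W2p in models:
    assert np.allclose(forward(x, W1p, W2p), y_ref)

# ...but averaging the weights makes every hidden unit identical
# (each column of W1_avg is the mean of all original columns),
# so the averaged model is broken.
W1_avg = np.mean([m[0] for m in models], axis=0)
W2_avg = np.mean([m[1] for m in models], axis=0)
assert np.allclose(W1_avg, W1_avg[:, :1])  # collapsed hidden layer
assert not np.allclose(forward(x, W1_avg, W2_avg), y_ref)
```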
Interesting. I love a good thought experiment :)
But what about the idea of bagging? As in aggregating multiple models together that have all been trained on different examples, and thus learned different things. Why is that not subject to similar criticism?
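For concreteness, bagging as usually defined sidesteps the weight-averaging problem entirely: each member keeps its own weights and only the predictions are averaged. A toy sketch with bootstrap-resampled linear fits (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: noisy linear targets.
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=200)

def fit_linear(X, y):
    # Closed-form least-squares fit on one bootstrap sample.
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Bagging: each member trains on its own bootstrap resample
# and keeps its own weights; only outputs are combined.
preds = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
    w = fit_linear(X[idx], y[idx])
    preds.append(X @ w)

# The ensemble output is the mean of the member *predictions*.
ensemble_pred = np.mean(preds, axis=0)
```

Because no member's internal weights are ever mixed with another's, the permutation problem from the comment above never arises.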
Would it be possible to create a system where every model's training uses a fixed seed and records its exact state, and then shares this information along with the dataset it was trained on, so the training can be reproduced? This method could help manage the randomness in training.
Using a set seed means we can make sure that the way the model starts and how it learns during training is the same every time. Essentially, if we restart the training from a certain point with this seed, the model should learn in the same way it did before. Also, by saving and sharing details like the model’s structure, which training stage it’s in, and the training step, along with the seed, we’re essentially taking a ‘snapshot’ of where the model is at that moment.
Others could use this snapshot to pick up the training right where it was left off, under the same conditions. For merging different models, this technique could help line up how they learn, making it easier and more predictable to combine their training.
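The seed-and-snapshot idea is easy to demonstrate at toy scale. Here's a sketch (plain NumPy SGD on a toy linear model; note that real LLM training would also need identical hardware and kernel behavior for bit-exact reproducibility, which this toy doesn't capture):

```python
import numpy as np

def train(seed, steps=200, lr=0.05):
    # Everything random (init + sampling order) flows from one seed,
    # so the whole trajectory is reproducible from (seed, steps, lr).
    rng = np.random.default_rng(seed)
    w = rng.normal(size=3)              # seeded initialization
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -1.0, 2.0])
    for _ in range(steps):
        i = rng.integers(0, len(X))     # seeded sampling order
        grad = (X[i] @ w - y[i]) * X[i]
        w -= lr * grad
    return w

w1 = train(seed=42)
w2 = train(seed=42)
assert np.array_equal(w1, w2)           # bit-identical trajectories

w3 = train(seed=7)
assert not np.array_equal(w1, w3)       # different seed, different run
```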
Am I thinking right about this or am I missing something? This is just theoretical thinking and I am not an expert on the subject.
You could use fixed seeds and checkpoints to serially continue training one model across different contributors. But I don't know how you would "merge" models that were trained independently. I think the challenge here is the merging, not the determinism.
Folding@home on the PlayStation, for the next generation.
What does "checkpoint" mean here? A service?
SingularityNET was working on something like that.