Can LLMs stack more layers than the largest ones currently have, or is there a bottleneck? Is it because the gradients can't propagate properly back to the beginning of the network? Or because inference would be too slow?
If anyone could point me to a paper on how scaling the number of layers behaves, I would love to read it!
The bottleneck is the total compute budget devoted to training. I'm quite certain that stacking a few more layers can be done and would have some benefit, but spending the same extra compute on a larger context window, on 'wider' layers, or simply on more training iterations over the same data might well give a larger benefit than more depth. If the people training the very large models think so, they will do those other things instead of stacking more layers.
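To make the trade-off concrete, here is a rough back-of-envelope sketch. It uses two common approximations (parameters per transformer block ≈ 12·d_model², training FLOPs ≈ 6·params·tokens); the specific layer counts, width, and token counts are illustrative numbers I picked, not any particular model's configuration.

```python
# Back-of-envelope: trading extra depth against extra training data at a fixed
# compute budget. Approximations: params per block ~ 12 * d_model^2,
# training FLOPs ~ 6 * params * tokens. Config numbers are illustrative.

def params(n_layers: int, d_model: int) -> float:
    """Approximate non-embedding parameter count of a decoder-only transformer."""
    return 12 * n_layers * d_model ** 2

def train_flops(n_layers: int, d_model: int, tokens: float) -> float:
    """Approximate training compute via the 6*N*D rule of thumb."""
    return 6 * params(n_layers, d_model) * tokens

# Baseline: a deep, wide model trained on 300B tokens (hypothetical config).
baseline = train_flops(n_layers=96, d_model=12288, tokens=300e9)

# Option A: stack 24 more layers, keep width and data fixed -> costs more compute.
deeper = train_flops(n_layers=120, d_model=12288, tokens=300e9)

# Option B: keep the depth and spend that same extra compute on more tokens instead.
extra_compute = deeper - baseline
extra_tokens = extra_compute / (6 * params(96, 12288))

print(f"baseline:    {baseline:.2e} FLOPs")
print(f"+24 layers:  {deeper:.2e} FLOPs ({deeper / baseline - 1:.0%} more compute)")
print(f"that extra compute buys ~{extra_tokens / 1e9:.0f}B more training tokens instead")
```

Running this shows the extra depth costs about 25% more compute, which could alternatively buy roughly 75B additional training tokens; which option wins is an empirical question the big labs answer with scaling-law experiments.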