@Brudaks

Brudaks@alien.top · 1 year ago

Cross-validation is a reasonable alternative, however, it does increase your compute cost 5-10 times, or, more likely, means that you generate 5-10 times smaller model(s) which are worse than you could have made if you’d just made a single one.

Brudaks@alien.top · 1 year ago

The bottleneck is the total compute budget devoted to training, so while I’m quite certain that stacking a few more layers can be done and would have some benefit, it might well be that spending the same extra compute on a larger context window or ‘wider’ layers or simply doing more iterations on the same data would have a larger benefit than more layers, and if the people training the very large models think so, they would do these other things instead of stacking more layers.