• 0 Posts
  • 2 Comments
Joined 1 year ago
cake
Cake day: November 14th, 2023

help-circle

  • The bottleneck is the total compute budget devoted to training, so while I’m quite certain that stacking a few more layers can be done and would have some benefit, it might well be that spending the same extra compute on a larger context window or ‘wider’ layers or simply doing more iterations on the same data would have a larger benefit than more layers, and if the people training the very large models think so, they would do these other things instead of stacking more layers.