I’m a bit confused by the motivation for removing the skip connections. I’ve seen people in the mechanistic interpretability community refer to the “residual stream” not just as a training hack but as an important part of why transformer models work well. The fact that information can propagate through the layers past the nonlinearities is seen as a feature, not a bug.