The point is that the frozen, adapted layers account for the vast majority of the model's parameters, and that is where the large memory savings come from: you never take gradients with respect to the frozen layers, only with respect to the adapter layers and whatever parts of the original model remain trainable, so no gradients or optimizer state need to be stored for the bulk of the weights.
This is of course not necessarily true for smaller models, where the adapters can make up a larger fraction of the total parameter count.
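As a rough illustration of the parameter-count disparity, here is a minimal NumPy sketch of a LoRA-style adapted linear layer (the class and its names are hypothetical, not from any particular library): the large base weight `W` is frozen, while only the small low-rank factors `A` and `B` would receive gradients during training.

```python
import numpy as np

class LowRankAdaptedLinear:
    """Hypothetical LoRA-style layer: frozen base weight W plus
    trainable low-rank adapter factors A and B."""

    def __init__(self, d_in, d_out, rank, rng):
        self.W = rng.standard_normal((d_out, d_in))        # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable adapter
        self.B = np.zeros((d_out, rank))                   # trainable adapter

    def forward(self, x):
        # Effective weight is W + B @ A; W itself never changes in training.
        return (self.W + self.B @ self.A) @ x

    def trainable_params(self):
        return self.A.size + self.B.size

    def frozen_params(self):
        return self.W.size


rng = np.random.default_rng(0)
layer = LowRankAdaptedLinear(d_in=4096, d_out=4096, rank=8, rng=rng)
print(layer.frozen_params())     # 16777216 frozen weights
print(layer.trainable_params())  # 65536 adapter weights, under 0.4% of the base
```

Since gradients and optimizer state (e.g. Adam's two moment buffers) are only kept for the ~0.4% of parameters in `A` and `B`, the training-time memory footprint shrinks dramatically even though the frozen weights still occupy memory for the forward pass.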