metagradient descent
- From *Optimizing ML Training with Metagradient Descent*
- We want to take the gradient of the final (trained) model's output with respect to the training hyperparameters; this is the metagradient.
- We could compute this with reverse-mode automatic differentiation through the whole training run, but naively that requires a lot of storage: propagating the hyperparameters' influence back from the final model output means keeping the intermediate state of every optimizer step, and there can be many thousands of steps (see the first sketch after this list).
- Instead, we only ever backpropagate through one optimizer step at a time, and to avoid saving every intermediate optimizer state we recompute states on demand by re-starting the (deterministic) network training over and over again, as in the second sketch below. (TODO: understand this in more detail.) This technique is more broadly known as rematerialization.
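
A minimal JAX sketch of the naive approach on a toy one-parameter problem (the data, `sgd_step`, and `train_then_evaluate` are made up for illustration, not taken from the paper): `jax.grad` is applied straight through the unrolled training loop, so every intermediate weight stays in memory for the backward pass.

```python
import jax
import jax.numpy as jnp

# Toy problem: fit y = w * x with SGD, then differentiate the trained model's
# validation loss with respect to the learning rate (the hyperparameter).
x_train, y_train = jnp.array([1.0, 2.0, 3.0]), jnp.array([2.0, 4.0, 6.0])
x_val, y_val = jnp.array([4.0]), jnp.array([8.0])

def train_loss(w):
    return jnp.mean((w * x_train - y_train) ** 2)

def val_loss(w):
    return jnp.mean((w * x_val - y_val) ** 2)

def sgd_step(w, lr):
    return w - lr * jax.grad(train_loss)(w)

def train_then_evaluate(lr, num_steps=100):
    w = jnp.array(0.0)
    for _ in range(num_steps):  # fully unrolled: every intermediate w is stored
        w = sgd_step(w, lr)
    return val_loss(w)

# Metagradient: d(final validation loss) / d(learning rate), obtained by
# backpropagating through the entire training trajectory in one shot.
metagrad = jax.grad(train_then_evaluate)(jnp.array(0.01))
print(metagrad)
```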
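
And a rematerialized version of the same sketch (again illustrative; it assumes training is deterministic and, unlike the paper, always replays from step 0 rather than from saved checkpoints): the reverse pass walks the optimizer steps backwards, backpropagating through one step at a time and recomputing the weights that step needs by replaying training, so memory no longer grows with the number of steps.

```python
import jax
import jax.numpy as jnp

# Same toy setup as the sketch above.
x_train, y_train = jnp.array([1.0, 2.0, 3.0]), jnp.array([2.0, 4.0, 6.0])
x_val, y_val = jnp.array([4.0]), jnp.array([8.0])

def train_loss(w):
    return jnp.mean((w * x_train - y_train) ** 2)

def val_loss(w):
    return jnp.mean((w * x_val - y_val) ** 2)

def sgd_step(w, lr):
    return w - lr * jax.grad(train_loss)(w)

def replay_to(lr, t):
    # Deterministically re-run training from scratch to rematerialize w_t.
    w = jnp.array(0.0)
    for _ in range(t):
        w = sgd_step(w, lr)
    return w

def metagradient(lr, num_steps=100):
    # Forward: one full training run; no intermediate states are kept.
    w_final = replay_to(lr, num_steps)
    # Reverse: rematerialize w_t on demand and backprop through one optimizer
    # step at a time. Memory is O(1) in num_steps; compute is O(num_steps^2)
    # here because every replay starts from step 0.
    w_bar = jax.grad(val_loss)(w_final)           # d val_loss / d w_final
    lr_bar = jnp.zeros_like(lr)
    for t in reversed(range(num_steps)):
        w_t = replay_to(lr, t)                    # recompute state before step t
        _, step_vjp = jax.vjp(sgd_step, w_t, lr)  # backprop through step t only
        w_bar, lr_bar_t = step_vjp(w_bar)
        lr_bar = lr_bar + lr_bar_t
    return lr_bar

# Agrees with jax.grad through the fully unrolled loop (up to numerical error).
print(metagradient(jnp.array(0.01)))
```

Replaying from the nearest saved model checkpoint instead of from step 0 would trade this quadratic recomputation for a modest amount of storage.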