metagradient descent
- From *Optimizing ML Training with Metagradient Descent*
- We want to take the gradient of the final (trained) model's output with respect to the training hyperparameters; this is the metagradient.
- We could compute this with reverse-mode automatic differentiation through the whole training run, but naively that requires a lot of storage: propagating the hyperparameters' influence back from the final model output means keeping the intermediate state of every optimizer step, and there can be many thousands of steps (see the first sketch after this list).
- Instead, we only ever backpropagate through one optimizer step at a time, and to avoid saving every intermediate optimizer state we recompute states on demand by re-starting the (deterministic) network training over and over again, as in the second sketch below. (TODO: understand this in more detail.) This technique is more broadly known as rematerialization.
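
A minimal JAX sketch of the naive approach on a toy one-parameter problem (the data, `sgd_step`, and `train_then_evaluate` are made up for illustration, not taken from the paper): `jax.grad` is applied straight through the unrolled training loop, so every intermediate weight stays in memory for the backward pass.

```python
import jax
import jax.numpy as jnp

# Toy problem: fit y = w * x with SGD, then differentiate the trained model's
# validation loss with respect to the learning rate (the hyperparameter).
x_train, y_train = jnp.array([1.0, 2.0, 3.0]), jnp.array([2.0, 4.0, 6.0])
x_val, y_val = jnp.array([4.0]), jnp.array([8.0])

def train_loss(w):
    return jnp.mean((w * x_train - y_train) ** 2)

def val_loss(w):
    return jnp.mean((w * x_val - y_val) ** 2)

def sgd_step(w, lr):
    return w - lr * jax.grad(train_loss)(w)

def train_then_evaluate(lr, num_steps=100):
    w = jnp.array(0.0)
    for _ in range(num_steps):  # fully unrolled: every intermediate w is stored
        w = sgd_step(w, lr)
    return val_loss(w)

# Metagradient: d(final validation loss) / d(learning rate), obtained by
# backpropagating through the entire training trajectory in one shot.
metagrad = jax.grad(train_then_evaluate)(jnp.array(0.01))
print(metagrad)
```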
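
And a rematerialized version of the same sketch (again illustrative; it assumes training is deterministic and, unlike the paper, always replays from step 0 rather than from saved checkpoints): the reverse pass walks the optimizer steps backwards, backpropagating through one step at a time and recomputing the weights that step needs by replaying training, so memory no longer grows with the number of steps.

```python
import jax
import jax.numpy as jnp

# Same toy setup as the sketch above.
x_train, y_train = jnp.array([1.0, 2.0, 3.0]), jnp.array([2.0, 4.0, 6.0])
x_val, y_val = jnp.array([4.0]), jnp.array([8.0])

def train_loss(w):
    return jnp.mean((w * x_train - y_train) ** 2)

def val_loss(w):
    return jnp.mean((w * x_val - y_val) ** 2)

def sgd_step(w, lr):
    return w - lr * jax.grad(train_loss)(w)

def replay_to(lr, t):
    # Deterministically re-run training from scratch to rematerialize w_t.
    w = jnp.array(0.0)
    for _ in range(t):
        w = sgd_step(w, lr)
    return w

def metagradient(lr, num_steps=100):
    # Forward: one full training run; no intermediate states are kept.
    w_final = replay_to(lr, num_steps)
    # Reverse: rematerialize w_t on demand and backprop through one optimizer
    # step at a time. Memory is O(1) in num_steps; compute is O(num_steps^2)
    # here because every replay starts from step 0.
    w_bar = jax.grad(val_loss)(w_final)           # d val_loss / d w_final
    lr_bar = jnp.zeros_like(lr)
    for t in reversed(range(num_steps)):
        w_t = replay_to(lr, t)                    # recompute state before step t
        _, step_vjp = jax.vjp(sgd_step, w_t, lr)  # backprop through step t only
        w_bar, lr_bar_t = step_vjp(w_bar)
        lr_bar = lr_bar + lr_bar_t
    return lr_bar

# Agrees with jax.grad through the fully unrolled loop (up to numerical error).
print(metagradient(jnp.array(0.01)))
```

Replaying from the nearest saved model checkpoint instead of from step 0 would trade this quadratic recomputation for a modest amount of storage.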