- L2 regularization simply adds \(\frac{\lambda}{2}\sum w^2\) to the loss, where the sum runs over the model weights \(w\)
- in practice, L2 regularization is often implemented by adding \(\lambda w\) to the gradient before the update (runnable sketch below):
grad = grad_w + lambda * w
w = w - learning_rate * grad # or, expanded:
w = w - learning_rate * grad_w - learning_rate * lambda * w
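A minimal NumPy sketch of this update, assuming a generic `grad_w` from backprop; `lam` stands in for \(\lambda\) since `lambda` is a Python keyword, and the toy loss is only for illustration:

```python
import numpy as np

def sgd_step_l2(w, grad_w, learning_rate=0.1, lam=1e-4):
    """One SGD step with L2 regularization folded into the gradient."""
    grad = grad_w + lam * w          # gradient of loss + (lam/2) * sum(w**2)
    return w - learning_rate * grad  # plain SGD update on the combined gradient

# toy usage: loss 0.5*(w - 3)**2 per weight, so grad_w = w - 3
w = np.zeros(4)
for _ in range(100):
    w = sgd_step_l2(w, grad_w=w - 3.0)
print(w)  # settles just below 3 because of the L2 pull toward zero
```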
- in weight decay, we instead shrink the weights directly, subtracting a small fraction of them at each step (sketch below):
w = w - learning_rate * grad_w - learning_rate * lambda * w
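The same step written with the decay applied to the weights separately from the gradient; again a sketch, with `lam` as the decay coefficient:

```python
import numpy as np

def sgd_step_weight_decay(w, grad_w, learning_rate=0.1, lam=1e-4):
    """One SGD step with weight decay: gradient step, then shrink the weights."""
    w = w - learning_rate * grad_w   # gradient step on the unregularized loss
    w = w - learning_rate * lam * w  # decay: subtract a small fraction of the weights
    return w
```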
- For vanilla SGD, L2 regularization and weight decay produce exactly the same update (compare the two expanded lines above). But for optimizers with momentum or adaptive scaling, like Adam, they differ because of where the \(\lambda w\) term enters: with L2 regularization it is folded into the gradient and passed through the moment estimates and per-parameter scaling, while decoupled weight decay is subtracted from the weights directly, outside that machinery (sketch below)
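A minimal sketch of where \(\lambda w\) enters an Adam step under each convention; the function name, `state` dict, and hyperparameters are illustrative assumptions, and the decoupled branch mirrors the AdamW-style formulation:

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              lam=0.0, decoupled=False):
    """One Adam step; lam enters either as L2 (folded into grad) or as
    decoupled weight decay (applied directly to w)."""
    if not decoupled:
        grad = grad + lam * w                 # L2: the lam*w term flows through m and v
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * lam * w                  # decay applied outside the adaptive scaling
    return w

# usage sketch with a hypothetical toy gradient (grad = 2*w)
state = {"t": 0, "m": np.zeros(4), "v": np.zeros(4)}
w = np.ones(4)
w = adam_step(w, grad=2 * w, state=state, lam=1e-2, decoupled=True)
```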