weight decay and L2 regularization

1. difference between L2 regularization and weight decay

L2 regularization adds the penalty \(\frac{\lambda}{2}\sum w^2\) to the loss, where \(w\) are the model weights.
The gradient of this penalty with respect to \(w\) is \(\lambda w\), so in practice L2 regularization is often implemented as:

grad = grad_w + lambda * w
w = w - learning_rate * grad  # or, expanded:
w = w - learning_rate * grad_w - learning_rate * lambda * w

In weight decay, the penalty is not added to the loss; instead, a small fraction of each weight is subtracted directly in the update step:

w = w - learning_rate * grad_w - learning_rate * lambda * w
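
As a quick check of the two update rules above, here is a minimal sketch for a single scalar weight under plain SGD. The function names (`sgd_l2_step`, `sgd_weight_decay_step`) and the numeric values are made up for illustration, and `lam` stands in for lambda because `lambda` is a reserved word in Python.

```python
def sgd_l2_step(w, grad_w, learning_rate, lam):
    """One plain SGD step with the L2 penalty folded into the gradient."""
    grad = grad_w + lam * w
    return w - learning_rate * grad


def sgd_weight_decay_step(w, grad_w, learning_rate, lam):
    """One plain SGD step that shrinks the weight directly in the update."""
    return w - learning_rate * grad_w - learning_rate * lam * w


# Illustrative values only (not taken from any real model).
w, grad_w, learning_rate, lam = 2.0, 0.5, 0.1, 0.01

l2_result = sgd_l2_step(w, grad_w, learning_rate, lam)
wd_result = sgd_weight_decay_step(w, grad_w, learning_rate, lam)

# The two results should agree up to floating-point rounding.
print(l2_result, wd_result)
```

With plain SGD the two functions are the same expression with the terms distributed differently, so any difference in the printed values is floating-point rounding only.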

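For vanilla SGD, L2 regularization and weight decay produce the same update. For optimizers that transform the gradient before applying it, such as SGD with momentum or Adam, the results differ: with L2 regularization the \(\lambda w\) term is part of the gradient and so passes through the momentum and adaptive scaling, whereas with weight decay it is subtracted from the weights directly, outside those statistics.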
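
To make the Adam case concrete, here is a hedged sketch of one Adam step on a single scalar weight, written from scratch rather than taken from any library. The helper `adam_step`, its `decoupled` flag, and all hyperparameter values are assumptions for illustration. The decoupled branch follows the form usually called AdamW, where the decay term uses the weight from before the step.

```python
import math


def adam_step(w, grad_w, m, v, t, lr=0.1, lam=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8, decoupled=False):
    """One Adam step on a scalar weight (illustrative sketch).

    decoupled=False: L2 regularization, lam * w is added to the gradient.
    decoupled=True:  weight decay, lam * w is subtracted from w directly.
    """
    grad = grad_w if decoupled else grad_w + lam * w

    # Running estimates of the first and second moments of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad

    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decay is applied outside the adaptive scaling.
        new_w = new_w - lr * lam * w
    return new_w, m, v


# Same starting point for both variants (illustrative values only).
w0, grad_w = 2.0, 0.5
w_l2, _, _ = adam_step(w0, grad_w, m=0.0, v=0.0, t=1, decoupled=False)
w_wd, _, _ = adam_step(w0, grad_w, m=0.0, v=0.0, t=1, decoupled=True)
print(w_l2, w_wd)
```

On the very first step Adam's ratio `m_hat / sqrt(v_hat)` has magnitude close to 1, so in the L2 variant the size of the move is about `lr` no matter what `lam` is. The decoupled variant makes the same Adam move and then shrinks the weight by an extra `lr * lam * w`, so the two results differ.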

2. sources

Created: 2024-07-15 Mon 01:28