
weight decay and l2 regularization

1. difference between l2 regularization and weight decay

  • L2 regularization is simply adding \(\frac{\lambda}{2}\sum w^2\) to the loss, where \(w\) are the model weights
  • in practice, l2 regularization is often implemented by adding \(\lambda w\) (the gradient of the penalty) to the gradient before taking the step:
grad = grad_w + lambda * w   # gradient of the loss plus the gradient of the L2 penalty
w = w - learning_rate * grad # or, equivalently, expanded:
w = w - learning_rate * grad_w - learning_rate * lambda * w
  • in weight decay, we instead shrink the weights directly at each step, subtracting a small fraction of the weights alongside the usual gradient update:
w = w - learning_rate * grad_w - learning_rate * lambda * w
  • For vanilla SGD, L2 regularization and weight decay produce exactly the same update, as the expansion above shows. For adaptive optimizers like Adam they differ: with L2 regularization the \(\lambda w\) term passes through the momentum and adaptive scaling, whereas decoupled weight decay (as in AdamW) is subtracted from the weights directly; see the sketches below
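
below is a minimal numerical sketch of the first claim (numpy only; lr, lambda_, w, and grad_w are illustrative toy values, not from any particular framework): for vanilla SGD, folding the L2 penalty into the gradient and decaying the weights directly give the same update

import numpy as np

lr = 0.1          # learning rate
lambda_ = 0.01    # regularization / decay strength

w = np.array([1.0, -2.0, 3.0])          # toy weights
grad_w = np.array([0.5, 0.5, -1.0])     # toy gradient of the unregularized loss

# l2 regularization: add lambda * w to the gradient, then take the SGD step
grad = grad_w + lambda_ * w
w_l2 = w - lr * grad

# weight decay: take the plain SGD step and shrink the weights directly
w_wd = w - lr * grad_w - lr * lambda_ * w

print(np.allclose(w_l2, w_wd))  # True: identical for vanilla SGD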
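
and a rough single-step sketch of the Adam case (bias correction omitted, hyperparameters are illustrative): with L2 regularization the lambda * w term is rescaled by the adaptive denominator along with the rest of the gradient, while decoupled weight decay (AdamW-style) is applied to the weights directly, so the two updates no longer match

import numpy as np

lr, lambda_, beta1, beta2, eps = 0.1, 0.01, 0.9, 0.999, 1e-8
w = np.array([1.0, -2.0, 3.0])
grad_w = np.array([0.5, 0.5, -1.0])
m = np.zeros_like(w)   # first-moment (momentum) state
v = np.zeros_like(w)   # second-moment state

# adam with the L2 penalty folded into the gradient
g = grad_w + lambda_ * w
m_l2 = beta1 * m + (1 - beta1) * g
v_l2 = beta2 * v + (1 - beta2) * g**2
w_adam_l2 = w - lr * m_l2 / (np.sqrt(v_l2) + eps)

# adamw-style decoupled weight decay: the moment estimates see only grad_w
m_wd = beta1 * m + (1 - beta1) * grad_w
v_wd = beta2 * v + (1 - beta2) * grad_w**2
w_adamw = w - lr * m_wd / (np.sqrt(v_wd) + eps) - lr * lambda_ * w

print(np.allclose(w_adam_l2, w_adamw))  # False: the updates differ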

2. sources

Created: 2024-07-15 Mon 01:28