Layer normalization

Let \(\mathbf{x}_i\) be the \(i\)-th sample, with a \(d\)-dimensional representation. For each sample, we compute the mean \(\mu_i = \frac{1}{d}\sum_{k=1}^{d}\mathbf{x}_{i,k}\) and the standard deviation \(\sigma_i\) over its \(d\) features, where \(\sigma_i^2 = \frac{1}{d}\sum_{k=1}^{d}(\mathbf{x}_{i,k} - \mu_i)^2\). Then, for each sample \(i\) and feature \(k\), we compute:

\[ \mathbf{\hat{x}}_{i,k} = \frac{\mathbf{x}_{i,k} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} \]

After this step, the features of each sample have mean 0 and unit standard deviation, up to the small constant \(\epsilon\), which guards against division by zero. Contrast this with batch normalization, where the mean and standard deviation are computed for each feature across the samples in the batch, so that each feature has mean 0 and unit standard deviation across the batch.
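
The contrast can be made concrete with a minimal sketch in plain Python. It assumes the batch is a list of samples, each a list of \(d\) features, and that normalization subtracts the mean and divides by \(\sqrt{\sigma^2 + \epsilon}\), using the population variance. The function names, the example batch, and the default `eps` value are illustrative assumptions. Learnable scale and shift parameters, and the running statistics that batch normalization typically uses at inference time, are left out.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize each sample (row) over its own d features."""
    out = []
    for row in x:
        d = len(row)
        mu = sum(row) / d
        # Population variance (divide by d, not d - 1).
        var = sum((v - mu) ** 2 for v in row) / d
        out.append([(v - mu) / math.sqrt(var + eps) for v in row])
    return out


def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) over the samples in the batch."""
    # Transpose so each row holds one feature across the batch,
    # reuse the row-wise routine, then transpose back.
    columns = [list(col) for col in zip(*x)]
    normed = layer_norm(columns, eps)
    return [list(row) for row in zip(*normed)]


# Illustrative batch: 3 samples, 4 features each.
batch = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [-1.0, 0.0, 1.0, 8.0],
]

ln = layer_norm(batch)   # statistics per sample (over features)
bn = batch_norm(batch)   # statistics per feature (over samples)
```

With `layer_norm`, every row of `ln` has mean close to 0 and standard deviation close to 1, regardless of the other samples in the batch. With `batch_norm`, the same holds for every column of `bn` instead, so the output for one sample depends on the rest of the batch.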

1. Useful links

Article that contrasts layer normalization with batch normalization