# layer normalization

Let \(\mathbf{x}_i\) be the \(i\)th sample, with a \(d\)-dimensional representation. We compute the per-sample mean \(\mu_i = \frac{1}{d}\sum_{k=1}^{d}\mathbf{x}_{i,k}\) and standard deviation \(\sigma_i = \sqrt{\frac{1}{d}\sum_{k=1}^{d}(\mathbf{x}_{i,k} - \mu_i)^2}\). Then, for each sample \(i\) and feature \(k\), we compute:

\[ \mathbf{\hat{x}}_{i,k} = \frac{\mathbf{x}_{i,k} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} \]
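A minimal NumPy sketch of the formula above (the function name `layer_norm` and the choice `eps=1e-5` are illustrative; real implementations such as PyTorch's `nn.LayerNorm` also apply learnable scale and shift parameters, which are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # x: (batch, d). Normalize each sample over its d features (last axis).
    mu = x.mean(axis=-1, keepdims=True)       # per-sample mean mu_i
    var = x.var(axis=-1, keepdims=True)       # per-sample variance sigma_i^2
    return (x - mu) / np.sqrt(var + eps)      # (x_{i,k} - mu_i) / sqrt(sigma_i^2 + eps)

x = np.random.randn(4, 8)
y = layer_norm(x)
# each row of y now has mean ~0 and standard deviation ~1
```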

After this step, for each *sample*, the features have mean 0 and unit standard deviation. Contrast this with **batch** normalization, where the mean and standard deviation are computed across the batch, so that each *feature* has mean 0 and unit standard deviation across the samples in the batch.
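The contrast is just a change of axis: batch normalization reduces over the batch dimension (axis 0) instead of the feature dimension. A sketch under the same assumptions as before (`eps=1e-5` is illustrative; training-time batch norm in practice also tracks running statistics and learnable parameters, omitted here):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x: (batch, d). Normalize each feature over the batch (first axis).
    mu = x.mean(axis=0, keepdims=True)        # per-feature mean across samples
    var = x.var(axis=0, keepdims=True)        # per-feature variance across samples
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(32, 8)
y = batch_norm(x)
# each *column* of y now has mean ~0 and standard deviation ~1
```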