# Residual blocks

Let \(x\) be the input to a block of \(k\) layers, and let \(\mathcal{F}(x)\) be the result of passing \(x\) through those layers. The output of the block — and hence the input to the layer that follows it — is \(\mathcal{F}(x) + x\).

This is done to mitigate vanishing gradients. \(\mathcal{F}(x)\) typically includes non-linear activations and weight layers whose derivatives can shrink the gradient toward zero, but the identity path \(x\) passes the gradient through unchanged.

The term "residual" is used because the network learns the residual component \(\mathcal{F}(x)\) of the desired mapping \(H(x) = \mathcal{F}(x) + x\), i.e. \(\mathcal{F}(x) = H(x) - x\).
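A minimal sketch of the forward pass in numpy, assuming (as one common choice) that \(\mathcal{F}\) is two linear layers with a ReLU in between; the weight shapes and names here are illustrative, not from any particular implementation:

```python
import numpy as np

def residual_block(x, W1, W2):
    # F(x): two linear layers with a ReLU in between (an illustrative choice)
    h = np.maximum(0.0, W1 @ x)   # first layer + ReLU
    fx = W2 @ h                   # second layer
    return fx + x                 # skip connection: output is F(x) + x

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(scale=0.1, size=(4, 4))
W2 = rng.normal(scale=0.1, size=(4, 4))
y = residual_block(x, W1, W2)
```

Note that if the weights are all zero, \(\mathcal{F}(x) = 0\) and the block reduces to the identity, which is part of why deep residual networks are easy to optimize early in training.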

## 1. Intuition

- the block now only needs to represent the difference \(\mathcal{F}(x) = H(x) - x\), not the entire mapping \(H(x)\)
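The gradient benefit can be checked numerically. In this hypothetical scalar example, \(\mathcal{F}(x) = wx\) with a tiny weight, so the plain layer's gradient nearly vanishes while the residual version's gradient stays close to 1 thanks to the identity term:

```python
import numpy as np

w = 1e-3  # tiny weight, so F'(x) = w is nearly zero

def f(x):
    # plain layer: F(x) = w * x, gradient is w
    return w * x

def res(x):
    # residual version: gradient is w + 1
    return f(x) + x

def num_grad(fn, x, eps=1e-6):
    # central-difference numerical gradient
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

g_plain = num_grad(f, 2.0)   # close to w, i.e. nearly vanished
g_res = num_grad(res, 2.0)   # close to 1 + w, dominated by the identity path
```

Stacking many such blocks keeps a product of factors near 1 in the backward pass, rather than a product of near-zero factors.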