# Residual blocks

Let \(x\) be the input to a block of \(k\) layers, and let \(\mathcal{F}(x)\) be the result of passing \(x\) through those layers. The output of the block — and hence the input to the layer that follows it — is \(\mathcal{F}(x) + x\).

This is done to mitigate vanishing gradients. \(\mathcal{F}(x)\) typically includes non-linear activations and weight layers whose derivatives can shrink the gradient toward zero, but the identity path \(x\) passes the gradient through unchanged.

The term "residual" is used because the network learns the residual component \(\mathcal{F}(x)\) of the desired mapping \(H(x) = \mathcal{F}(x) + x\), i.e. \(\mathcal{F}(x) = H(x) - x\).
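A minimal sketch of the forward pass in numpy, assuming (as one common choice) that \(\mathcal{F}\) is two linear layers with a ReLU in between; the weight shapes and names here are illustrative, not from any particular implementation:

```python
import numpy as np

def residual_block(x, W1, W2):
    # F(x): two linear layers with a ReLU in between (an illustrative choice)
    h = np.maximum(0.0, W1 @ x)   # first layer + ReLU
    fx = W2 @ h                   # second layer
    return fx + x                 # skip connection: output is F(x) + x

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(scale=0.1, size=(4, 4))
W2 = rng.normal(scale=0.1, size=(4, 4))
y = residual_block(x, W1, W2)
```

Note that if the weights are all zero, \(\mathcal{F}(x) = 0\) and the block reduces to the identity, which is part of why deep residual networks are easy to optimize early in training.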

## 1. Intuition

- the block now only needs to represent the difference \(\mathcal{F}(x) = H(x) - x\), not the entire mapping \(H(x)\)
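The gradient benefit can be checked numerically. In this hypothetical scalar example, \(\mathcal{F}(x) = wx\) with a tiny weight, so the plain layer's gradient nearly vanishes while the residual version's gradient stays close to 1 thanks to the identity term:

```python
import numpy as np

w = 1e-3  # tiny weight, so F'(x) = w is nearly zero

def f(x):
    # plain layer: F(x) = w * x, gradient is w
    return w * x

def res(x):
    # residual version: gradient is w + 1
    return f(x) + x

def num_grad(fn, x, eps=1e-6):
    # central-difference numerical gradient
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

g_plain = num_grad(f, 2.0)   # close to w, i.e. nearly vanished
g_res = num_grad(res, 2.0)   # close to 1 + w, dominated by the identity path
```

Stacking many such blocks keeps a product of factors near 1 in the backward pass, rather than a product of near-zero factors.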