automatic differentiation

1. differences with other methods

  • symbolic differentiation
    • this is the derivative you would work out by hand in high school; it yields a closed-form expression.
    • automatic differentiation is like this in that we also get the exact gradient of the function. But we are limited in the set of functions we can handle: they must be built up from small primitive operations whose derivatives we know exactly.
  • numerical methods
    • you only need to know how to evaluate the function, e.g. with finite differences.
    • the gradient is approximate, not exact (see the sketch after this list).
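
A minimal sketch of the contrast, assuming an arbitrary example function \(f(x) = x^2 \sin x\) and an arbitrary step size: the closed-form derivative is what symbolic (or automatic) differentiation gives, while the central finite difference only needs evaluations of \(f\) and carries truncation and round-off error.

  import math

  def f(x):
      return x ** 2 * math.sin(x)

  def df_exact(x):
      # closed-form derivative, as symbolic differentiation would produce
      return 2 * x * math.sin(x) + x ** 2 * math.cos(x)

  def df_numeric(x, h=1e-5):
      # central finite difference: only needs evaluations of f, but is approximate
      return (f(x + h) - f(x - h)) / (2 * h)

  x = 1.3
  print(df_exact(x))    # exact value
  print(df_numeric(x))  # close, but not exact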

2. forward mode

  • for a given \(x_i\) input, you can compute \(\frac{\partial f}{\partial x_i}\) in one sweep along with the forward pass.
  • imagine the computational graph. In topological order, for each node \(w_i\), compute
    • the value of the node, and store it. You will need the values of its input nodes; these have already been computed and stored.
    • \(\frac{\partial w_i}{\partial x}\). For this, you can set up rules to compute the partial derivative of \(w_i\) with respect to its input nodes. Then, by the chain rule, multiply by the partial derivatives of those input nodes with respect to \(x\); these have already been computed and stored.
  • if there are \(m\) input variables, you have to run \(m\) sweeps to get \(\frac{\partial f}{\partial x_i}\) for all \(m\) inputs (see the dual-number sketch after this list)
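
A minimal forward-mode sketch using dual numbers, one common way to implement the sweep above; the Dual class, the primitives covered, and the example function are assumptions for illustration. Each primitive propagates the value together with the derivative with respect to the one seeded input, so \(\frac{\partial f}{\partial x_1}\) and \(\frac{\partial f}{\partial x_2}\) take two separate sweeps.

  import math

  class Dual:
      """Carries a node's value and its derivative with respect to one chosen input."""
      def __init__(self, val, dot=0.0):
          self.val = val   # value of the node, computed in the same sweep
          self.dot = dot   # d(node)/d(chosen input), built up via the chain rule

      def __add__(self, other):
          other = other if isinstance(other, Dual) else Dual(other)
          return Dual(self.val + other.val, self.dot + other.dot)

      def __mul__(self, other):
          other = other if isinstance(other, Dual) else Dual(other)
          return Dual(self.val * other.val,
                      self.dot * other.val + self.val * other.dot)

  def sin(u):
      return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

  # f(x1, x2) = x1 * x2 + sin(x1); seed the input you are differentiating with respect to
  x1, x2 = 2.0, 3.0
  out = Dual(x1, 1.0) * Dual(x2, 0.0) + sin(Dual(x1, 1.0))  # seed x1
  print(out.val, out.dot)                                   # f and df/dx1
  out = Dual(x1, 0.0) * Dual(x2, 1.0) + sin(Dual(x1, 0.0))  # a second sweep for df/dx2
  print(out.dot)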

3. backward mode

  • in the forward pass, compute and store the partial derivative of each node with respect to its child nodes. For this, the only ingredients you need are the values of the child nodes. Note that because you are not computing the derivative all the way back to \(x\) at this time, you do not need the partial derivatives of the child nodes with respect to \(x\).
  • in the backward pass, starting from the output node, to find \(\frac{\partial f}{\partial w_{n-1}}\), compute \(\frac{\partial f}{\partial w_{n}}\frac{\partial w_{n}}{\partial w_{n-1}}\).
    • The first term is part of the running product you are computing as you move backwards.
    • The second term was computed in the forward pass.
  • As you move backwards, if a node \(w_i\) was the input to two later nodes \(w_j\) and \(w_k\), you have to use the multivariate chain rule and sum the contributions: \(\frac{\partial f}{\partial w_i} = \frac{\partial f}{\partial w_j}\frac{\partial w_j}{\partial w_i} + \frac{\partial f}{\partial w_k}\frac{\partial w_k}{\partial w_i}\) (see the sketch after this list)
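
A minimal backward-mode sketch along the same lines; the Node class and the helper functions are assumptions for illustration, not a reference implementation. The forward pass stores each node's local partials with respect to its child nodes, and the backward pass pushes the running product back in reverse topological order, summing contributions where a node feeds more than one later node.

  import math

  class Node:
      def __init__(self, val, children=()):
          self.val = val            # value, computed in the forward pass
          self.children = children  # [(child node, local partial d(self)/d(child)), ...]
          self.grad = 0.0           # accumulates df/d(self) in the backward pass

  def add(a, b):
      return Node(a.val + b.val, [(a, 1.0), (b, 1.0)])

  def mul(a, b):
      return Node(a.val * b.val, [(a, b.val), (b, a.val)])

  def sin(a):
      return Node(math.sin(a.val), [(a, math.cos(a.val))])

  def backward(output):
      # collect nodes in topological order with a depth-first post-order walk
      order, seen = [], set()
      def visit(n):
          if id(n) not in seen:
              seen.add(id(n))
              for child, _ in n.children:
                  visit(child)
              order.append(n)
      visit(output)
      output.grad = 1.0
      for n in reversed(order):             # walk from the output back to the inputs
          for child, local in n.children:
              child.grad += n.grad * local  # multivariate chain rule: sum over consumers

  # f(x1, x2) = x1 * x2 + sin(x1): x1 feeds two nodes, so its contributions are summed
  x1, x2 = Node(2.0), Node(3.0)
  f = add(mul(x1, x2), sin(x1))
  backward(f)
  print(x1.grad, x2.grad)   # df/dx1 = x2 + cos(x1), df/dx2 = x1, in one backward pass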

4. sources
