Forward-Forward
Instead of backpropagating errors through the whole network, train each layer with its own local objective (see the sketch after this list):
- maximize a "goodness" measure for the positive examples (in Hinton's paper, goodness is the sum of squared activations)
- minimize that same goodness for the negative examples
Note that gradient descent is still used within each layer; what is avoided is backpropagation through the network as a whole.
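A minimal PyTorch sketch of this idea, assuming the paper's goodness definition (sum of squared activations) and a logistic loss that pushes positive goodness above a threshold and negative goodness below it. The layer sizes, threshold, and learning rate here are illustrative, not taken from the source.

```python
import torch
import torch.nn as nn

class FFLayer(nn.Module):
    """One layer trained with a local Forward-Forward objective."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.act = nn.ReLU()
        self.threshold = threshold  # positives should exceed this goodness
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalize the input so only its direction (not the previous
        # layer's goodness magnitude) is passed forward.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return self.act(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)  # goodness of positives
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)  # goodness of negatives
        # Logistic loss: high positive goodness and low negative goodness.
        loss = torch.log1p(torch.exp(torch.cat([
            self.threshold - g_pos,  # large when positives fall short
            g_neg - self.threshold,  # large when negatives are too good
        ]))).mean()
        self.opt.zero_grad()
        loss.backward()  # gradients stay inside this layer
        self.opt.step()
        # Detach outputs so no gradient crosses the layer boundary,
        # i.e. no backprop through the whole network.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()

# Usage: feed positive and negative data through the stack, training
# each layer locally on the (detached) outputs of the previous one.
layers = [FFLayer(784, 500), FFLayer(500, 500)]
x_pos, x_neg = torch.rand(32, 784), torch.rand(32, 784)
for layer in layers:
    x_pos, x_neg = layer.train_step(x_pos, x_neg)
```

Because each `train_step` detaches its outputs, `loss.backward()` only ever touches one layer's parameters, which is exactly the per-layer gradient descent described above.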