Bellman equation
1. Setting
We start in state \(x_0\). At every time-step \(t\), we are in state \(x_t\) and take an action \(a_t \in \Gamma(x_t)\), where \(\Gamma(x_t)\) is the set of actions available in state \(x_t\). As a result of taking action \(a_t\), we move to the new state \(x_{t+1} = T(x_t, a_t)\) and receive reward \(F(x_t, a_t)\), where \(T\) is our transition function and \(F\) is our reward function.
Our goal is to maximize the total reward we can collect starting at \(x_0\), discounting later rewards by a factor \(0 < \beta < 1\). This maximal discounted reward is the value function: \[ V(x_0) = \max_{\{a_t\}_{t=0}^{\infty}} \sum_{t=0}^{\infty} \beta^t F(x_t, a_t) \] subject to \(a_t \in \Gamma(x_t)\) and \(x_{t+1} = T(x_t, a_t)\). (This is the same value function used in Policy Gradient, policy iteration, and value iteration.)
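As a minimal sketch of this setup (the chain environment below is an invented toy example, not from the source): states are the integers 0 through 4, \(\Gamma(x)\) is the set of step sizes we may take to the right, \(T\) adds the step, and \(F\) pays the step size. `discounted_return` computes \(\sum_t \beta^t F(x_t, a_t)\) for a fixed action sequence.

```python
# A minimal sketch of the setting, using an invented toy environment:
# states are integers 0..4, actions are step sizes allowed from each state.
BETA = 0.9

def Gamma(x):
    """Feasible actions in state x: step right by 1 or 2, staying within 0..4."""
    return [a for a in (1, 2) if x + a <= 4] or [0]  # 0 = stay put at the end

def T(x, a):
    """Deterministic transition: move right by a."""
    return x + a

def F(x, a):
    """Reward for taking action a in state x (here: just the step size)."""
    return float(a)

def discounted_return(x0, actions):
    """Sum of beta^t * F(x_t, a_t) along the trajectory induced by `actions`."""
    total, x = 0.0, x0
    for t, a in enumerate(actions):
        assert a in Gamma(x)
        total += BETA**t * F(x, a)
        x = T(x, a)
    return total

print(discounted_return(0, [2, 2]))     # 2 + 0.9*2        = 3.8
print(discounted_return(0, [1, 1, 2]))  # 1 + 0.9 + 0.81*2 = 3.52
```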
2. Bellman equation
We are trying to find the best policy \(\{a_t\}_{t=0}^{\infty}\). Say we select an action \(a_0\) and want to know how much value we can still collect. Bellman, observing the optimal substructure of this problem (see /Dynamic Programming), noticed that no matter which action \(a_0\) we take, if we want maximal reward then the policy we follow afterwards must be optimal from \(x_1\) onwards. So we can re-write the equation above as: \[ V(x_0) = \max_{a_0} \left\{ F(x_0, a_0) + \beta \left[ \max_{\{a_t\}_{t=1}^{\infty}} \sum_{t=1}^{\infty} \beta^{t-1} F(x_t, a_t) \right] \right\} \] The quantity in square brackets is just \(V(x_1)\), which gives us the Bellman equation: \[ V(x_0) = \max_{a_0}\{F(x_0, a_0) + \beta V(x_1)\}, \qquad x_1 = T(x_0, a_0) \]
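The Bellman equation also suggests an algorithm: treat it as a fixed-point update \(V_{k+1}(x) = \max_{a \in \Gamma(x)}\{F(x, a) + \beta V_k(T(x, a))\}\) and iterate until convergence, which is exactly value iteration. Below is a minimal sketch on the same kind of invented toy chain environment as above; the converged \(V(0)\) matches the best discounted return computed earlier.

```python
# Value iteration: repeatedly apply the Bellman update
#   V(x) <- max_a { F(x, a) + beta * V(T(x, a)) }
# on a toy deterministic environment (states 0..4, step-right actions).
BETA = 0.9
STATES = range(5)

def Gamma(x):
    return [a for a in (1, 2) if x + a <= 4] or [0]

def T(x, a):
    return x + a

def F(x, a):
    return float(a)

def value_iteration(tol=1e-8):
    V = {x: 0.0 for x in STATES}
    while True:
        V_new = {x: max(F(x, a) + BETA * V[T(x, a)] for a in Gamma(x)) for x in STATES}
        if max(abs(V_new[x] - V[x]) for x in STATES) < tol:
            return V_new
        V = V_new

V = value_iteration()
# Greedy (optimal) action in each state, read off from the converged V:
policy = {x: max(Gamma(x), key=lambda a: F(x, a) + BETA * V[T(x, a)]) for x in STATES}
print(V)       # V[0] ~= 3.8, matching the best trajectory above
print(policy)
```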