# policy iteration

## 1. context

The goal is to learn an optimal policy for an agent. The agent lives in an environment; it can take actions, and each action has a probability-weighted set of outcomes.

Policy iteration is a *model-based* algorithm: it requires a model of the environment's dynamics, e.g. a state-action transition matrix and a reward function.
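As a concrete (and entirely made-up) example of such a model, a small finite MDP can be stored as two numpy arrays: transition probabilities `P[s, a, s']` and rewards `R[s, a, s']`. The specific numbers below are illustrative, not from the text.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP, chosen purely for illustration.
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# R[s, a, s'] = reward received for that transition.
n_states, n_actions = 2, 2

P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]   # action 0 in state 0 usually stays put
P[0, 1] = [0.2, 0.8]   # action 1 in state 0 usually moves to state 1
P[1, 0] = [0.8, 0.2]
P[1, 1] = [0.1, 0.9]

R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0       # landing in state 1 pays a reward of 1

# Each row P[s, a] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```

With the model in this form, the evaluation and improvement updates below reduce to a few vectorized array operations.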

## 2. algorithm

Alternate between policy evaluation and policy improvement until the policy stops changing. During policy evaluation, the policy \(\pi\) is held fixed and a new value function \(V\) is found. During policy improvement, \(V\) is held fixed and a new \(\pi\) is found.

- Policy evaluation: An iterative procedure. For each state \(s\), \(V(s)\) is updated to the expected value of taking the action recommended by the policy, \(\pi(s)\). That is, set \[ V(s) \leftarrow \sum_{s'} p(s' \mid s, \pi(s))\left[ R(s,\pi(s),s') + \gamma V(s')\right] \] (Compare this step with value iteration and the Bellman equation.) Iterate until no \(V(s)\) changes by more than some threshold \(\Delta\).
- Policy improvement: Find, for each state \(s\), the action that maximizes the expected value, where value is given by the function \(V\) from policy evaluation. So, set \[ \pi(s) \leftarrow \arg\max_{a} \sum_{s'} p(s' \mid s, a)\left[ R(s, a, s') + \gamma V(s') \right] \]
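The two steps above can be sketched in numpy. This is a minimal illustration, not a reference implementation: the model format (`P[s, a, s']`, `R[s, a, s']`), the discount \(\gamma = 0.9\), and the tiny 2-state MDP at the bottom are all assumptions made for the example.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the fixed-policy Bellman update until V stops changing."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # V(s) <- sum_{s'} p(s'|s,pi(s)) [ R(s,pi(s),s') + gamma V(s') ]
        V_new = np.array([
            P[s, pi[s]] @ (R[s, pi[s]] + gamma * V) for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(P, R, V, gamma=0.9):
    """pi(s) <- argmax_a sum_{s'} p(s'|s,a) [ R(s,a,s') + gamma V(s') ]"""
    # Q[s, a] = expected one-step return of action a in state s under V.
    Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    return Q.argmax(axis=1)

def policy_iteration(P, R, gamma=0.9):
    n_states = P.shape[0]
    pi = np.zeros(n_states, dtype=int)      # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)
        pi_new = policy_improvement(P, R, V, gamma)
        if np.array_equal(pi_new, pi):      # policy stable: we are done
            return pi, V
        pi = pi_new

# Sanity check on a hypothetical 2-state, 2-action MDP where landing
# in state 1 pays a reward of 1 and action 1 tends to move there.
P = np.zeros((2, 2, 2))
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.8, 0.2]; P[1, 1] = [0.1, 0.9]
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

pi, V = policy_iteration(P, R)
```

Note that the outer loop terminates because there are finitely many deterministic policies and each improvement step is non-decreasing in value; the inner evaluation loop converges because the update is a \(\gamma\)-contraction.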