Proximal Policy Optimization
PPO is an on-policy policy gradient method that scales well to distributed training. It can be thought of as a simpler variant of Trust Region Policy Optimization (TRPO), where the objective is clipped so that the importance sampling ratios stay bounded. Bounding those ratios matters because the advantage estimates are only trustworthy near the policy that collected the data: a large ratio lets a single update push the new policy far from the old one, which is exactly the kind of destructive step TRPO guards against with its KL-divergence constraint. Clipping achieves a similar effect with a much simpler first-order objective.
Recall that the TRPO surrogate objective is: \[ J(\theta) = \mathbb{E}_{t} \left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t \right] \] Here the subscript \(t\) indexes timesteps: \(\mathbb{E}_{t}\) denotes the empirical average over timesteps of trajectories collected with the old policy \(\pi_{\theta_{old}}\), and \(\hat{A}_t\) is the estimated advantage at timestep \(t\).
Let \(r_{t}(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\) denote this probability ratio. Then the clipped objective is:
\[ J^{\text{clip}}(\theta) = \mathbb{E}_{t} \left[ \min\left( r_t(\theta)\hat{A}_t,\; \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t \right) \right] \] where \(\epsilon\) is a small clipping hyperparameter (e.g. \(\epsilon = 0.2\)). Taking the minimum makes this a pessimistic bound on the unclipped objective: the policy gains nothing by pushing \(r_t(\theta)\) outside \([1-\epsilon, 1+\epsilon]\).
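To make the objective concrete, here is a minimal sketch of how the clipped surrogate loss is often computed in practice, assuming a PyTorch setup; the function name ppo_clip_loss, the default clip_eps = 0.2, and the dummy tensors in the usage lines are illustrative assumptions, not anything defined above.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, negated so it can be minimized by gradient descent.

    logp_new:   log pi_theta(a_t | s_t) under the current policy
    logp_old:   log pi_theta_old(a_t | s_t), recorded when the data was collected
    advantages: advantage estimates A_hat_t
    """
    # r_t(theta) = pi_theta / pi_theta_old, computed from log-probabilities
    # for numerical stability.
    ratio = torch.exp(logp_new - logp_old)

    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Elementwise minimum, averaged over the batch of timesteps (the E_t above);
    # the sign is flipped because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()

# Illustrative usage with dummy values; in a real setup these come from rollouts.
logp_old = torch.tensor([-1.2, -0.8, -2.1])
logp_new = torch.tensor([-1.0, -0.9, -1.5])
adv = torch.tensor([0.5, -0.3, 1.2])
loss = ppo_clip_loss(logp_new, logp_old, adv)
```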