Trust Region Policy Optimization
1. motivation
Now, say that you're doing off-policy policy gradient, where you have the policy \(\pi_{\theta}\) that you're learning and also a behavior policy \(\beta\) that you are drawing samples from. We start with the objective: \[ J(\theta) = \sum_{s\in S} p^{\pi_{\theta_{old}}}(s) \sum_{a\in \mathcal{A}} \left( \pi_{\theta}(a \mid s) \hat{A}_{\theta_{old}} (s, a) \right) \] (side note: why does the objective look like this? Intuitively, we want to maximize the expected advantage of the actions the new policy would take, with states weighted by how often the old policy visits them. Here \(p^{\pi_{\theta_{old}}}\) is the stationary distribution of \(\pi_{\theta_{old}}\).)
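To make the sum concrete, here is a minimal NumPy sketch that evaluates this objective exactly for a tiny discrete problem, assuming the distribution \(p^{\pi_{\theta_{old}}}\) and the advantage estimates are already given (all arrays below are made-up numbers for illustration):

```python
import numpy as np

# Toy setup: 3 states, 2 actions (all numbers are illustrative).
p_old = np.array([0.5, 0.3, 0.2])              # p^{pi_theta_old}(s)
adv_old = np.array([[ 1.0, -0.5],              # A_hat_{theta_old}(s, a)
                    [ 0.2,  0.4],
                    [-1.0,  2.0]])
pi_new = np.array([[0.6, 0.4],                 # candidate policy pi_theta(a | s)
                   [0.5, 0.5],
                   [0.1, 0.9]])

# J(theta) = sum_s p(s) sum_a pi_theta(a|s) * A_hat(s, a)
J = np.sum(p_old[:, None] * pi_new * adv_old)
print(J)  # the quantity we want to maximize over theta
```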
Then, after incorporating the \(\beta\) term and using importance sampling to reweight: \[ J(\theta) = \sum_{s\in S} p^{\pi_{\theta_{old}}}(s) \sum_{a\in\mathcal{A}} \beta(a\mid s) \frac{\pi_{\theta}(a \mid s)}{\beta(a\mid s)} \hat{A}_{\theta_{old}}(s,a) \] which can be written as: \[ J(\theta) = \mathbb{E}_{s\sim p^{\pi_{\theta_{old}}},\, a \sim \beta} \left[\frac{\pi_{\theta}(a\mid s)}{\beta(a\mid s)}\hat{A}_{\theta_{old}}(s,a)\right] \]
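As a sanity check on the importance-sampling form, here is a small sketch that estimates the same quantity from samples drawn under a hypothetical behavior policy \(\beta\); with enough samples the estimate should approach the exact sum computed above:

```python
import numpy as np

rng = np.random.default_rng(0)

p_old = np.array([0.5, 0.3, 0.2])                         # p^{pi_theta_old}(s)
beta = np.array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]])     # behavior policy beta(a|s)
pi_new = np.array([[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]])   # pi_theta(a|s)
adv_old = np.array([[1.0, -0.5], [0.2, 0.4], [-1.0, 2.0]])  # A_hat_{theta_old}(s, a)

# Sample s ~ p^{pi_theta_old}, then a ~ beta(.|s) (vectorized for the 2-action case).
n = 50_000
states = rng.choice(3, size=n, p=p_old)
actions = (rng.random(n) >= beta[states, 0]).astype(int)

# Reweight each sample by pi_theta / beta and average.
weights = pi_new[states, actions] / beta[states, actions]
J_hat = np.mean(weights * adv_old[states, actions])
print(J_hat)  # converges to the exact J(theta) as n grows
```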
Note that when training is distributed across many workers, \(\pi_{\theta}\) is the policy being learned by an individual worker, and \(\pi_{\theta_{old}}\) is the shared policy (I presume), which may have gone stale.
2. key idea
Now, let's treat \(\pi_{\theta_{old}}\) as our exploration policy \(\beta\), so that we have \[ J(\theta) = \mathbb{E}_{s\sim p^{\pi_{\theta_{old}}},\, a\sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \hat{A}_{\theta_{old}}(s, a) \right] \]
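A minimal sketch of this surrogate as a sample average, assuming the probabilities \(\pi_{\theta_{old}}(a_t\mid s_t)\) and the advantage estimates were stored when the data was collected (variable names and values are hypothetical):

```python
import numpy as np

# Per-sample quantities gathered while acting with pi_theta_old (illustrative values).
pi_old_a = np.array([0.50, 0.20, 0.70, 0.40])   # pi_theta_old(a_t | s_t) at collection time
pi_new_a = np.array([0.55, 0.25, 0.60, 0.45])   # pi_theta(a_t | s_t) under current parameters
adv = np.array([0.8, -0.3, 1.2, 0.1])           # A_hat_{theta_old}(s_t, a_t)

# Surrogate objective: E[(pi_theta / pi_theta_old) * A_hat], estimated by a sample mean.
ratio = pi_new_a / pi_old_a
J_surrogate = np.mean(ratio * adv)
print(J_surrogate)
```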
And let's also add in a constraint that prevents \(\theta\) from straying too far from \(\theta_{old}\) by constraining the Kullback-Leibler divergence between the two: \[ \mathbb{E}_{s\sim p^{\pi_{\theta_{old}}}}\left[D_{KL} \left( \pi_{\theta}(\cdot \mid s) \,\|\, \pi_{\theta_{old}}(\cdot \mid s) \right) \right] \leq \delta \]
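For a discrete action space, the constraint can be checked directly; the sketch below takes the KL in the same order as the expression above, with made-up policy tables and a hypothetical trust-region size \(\delta\):

```python
import numpy as np

p_old_s = np.array([0.5, 0.3, 0.2])                       # state weights from p^{pi_theta_old}
pi_new = np.array([[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]])   # pi_theta(a|s)
pi_old = np.array([[0.5, 0.5], [0.6, 0.4], [0.2, 0.8]])   # pi_theta_old(a|s)
delta = 0.01                                              # trust-region size (illustrative)

# D_KL(pi_theta(.|s) || pi_theta_old(.|s)) for each state, then averaged over states.
kl_per_state = np.sum(pi_new * np.log(pi_new / pi_old), axis=1)
mean_kl = np.sum(p_old_s * kl_per_state)
print(mean_kl, mean_kl <= delta)  # does the candidate theta satisfy the constraint?
```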