# Policy Gradient

## 1. Terminology

- The framework in which the agent operates is a Markov Decision Process (MDP), which has an action space \(\mathcal{A}\), a state space \(\mathcal{S}\), a reward function \(R\), and the dynamics \(p\) of the environment
- \(p(s_{t+1} \mid s_{t}, a_{t})\) models the dynamics of the environment: "Given that I am in state \(s_t\) and take action \(a_{t}\) at time \(t\), what is the probability that I end up in \(s_{t+1}\) at time \(t+1\)?"
- \(\tau\) is a trajectory, which consists of the states and actions at every time step: \(\{s_1, a_1, s_2, a_2, \ldots\}\)
- \(r(\tau)\) is the reward given to a trajectory
- \(\pi_{\theta}(a \mid s)\) is the policy we are trying to learn. It gives a distribution over actions \(a\), given a state \(s\). \(\theta\) are the parameters of the model
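As a concrete (hypothetical) instance of \(\pi_{\theta}(a \mid s)\), a minimal tabular softmax policy can be sketched as follows; any differentiable parameterization (e.g. a neural network) plays the same role:

```python
import numpy as np

def softmax_policy(theta, s):
    """A minimal tabular softmax policy pi_theta(a | s).

    theta is a hypothetical (n_states, n_actions) array of logits.
    Returns a probability distribution over the actions for state s.
    """
    logits = theta[s]
    z = np.exp(logits - logits.max())   # subtract max for numerical stability
    return z / z.sum()

theta = np.zeros((4, 2))                # 4 states, 2 actions: uniform policy
probs = softmax_policy(theta, s=0)      # array([0.5, 0.5])
```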

## 2. Gradient

The objective function is: \(J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[r(\tau)\right]\)

The gradient is given by: \(\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta}}\left[ r(\tau) \sum_{t=1}^{|\tau|} \nabla_{\theta} \log \pi_{\theta} (a_{t} \mid s_{t}) \right]\)

### 2.1. Comments on the gradient

Note that the above gradient does not require us to know anything about the environment dynamics \(p(s_{t+1} \mid s_t, a_t)\) or the probability \(p(s_1)\) that we start a trajectory in any given state. Algorithms that use this result are called *model-free*.
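As a sanity check on this result, the score-function estimator can be compared against the exact gradient in a one-step setting: a two-armed bandit with a softmax policy, where \(J(\theta) = \sum_a \pi_\theta(a) r(a)\) has a known closed-form gradient. All numbers below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

r = np.array([1.0, 0.0])          # hypothetical per-action rewards
theta = np.array([0.2, -0.3])     # logits of a softmax policy

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Exact gradient of J(theta) = sum_a pi(a) r(a) for a softmax policy:
# dJ/dtheta_k = pi(k) * (r(k) - J)
p = pi(theta)
J = p @ r
exact = p * (r - J)

# Score-function Monte-Carlo estimate: mean of r(a) * grad_theta log pi(a),
# where grad_theta log pi(a) = e_a - p for a softmax.
N = 200_000
a = rng.choice(2, size=N, p=p)
grad_log_pi = np.eye(2)[a] - p
mc = (r[a][:, None] * grad_log_pi).mean(axis=0)
```

With enough samples the Monte-Carlo estimate `mc` matches `exact` to within sampling noise, without ever differentiating through the sampling itself.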

## 3. Derivation of gradient

\[\begin{align*} \nabla_{\theta} J(\theta) &= \nabla_{\theta} \mathbb{E}_{\tau\sim\pi_{\theta}}\left[ r(\tau) \right]\\ &= \int \nabla_{\theta} \pi_{\theta}(\tau) r(\tau) d\tau\\ &= \int \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau) r(\tau) d\tau\\ &= \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(\tau) r(\tau)\right] \end{align*}\]

The third line comes from the fact that \(\pi_{\theta}(\tau)\nabla_{\theta} \log \pi_{\theta}(\tau) = \pi_{\theta}(\tau) \frac{1}{\pi_{\theta}(\tau)}\nabla_{\theta} \pi_{\theta}(\tau) = \nabla_{\theta} \pi_{\theta}(\tau)\)

But what is \(\pi_{\theta}(\tau)\)? It is given by: \[\begin{align*} \pi_{\theta}(\tau) &= \pi_{\theta}(s_1,a_1,s_2,a_2,\ldots) \\ &= p(s_1)\prod_{t=1}^{|\tau|} \pi_{\theta}(a_{t} \mid s_{t}) p(s_{t+1} \mid s_t, a_t) \end{align*}\] where \(p(s_1)\) is the probability of starting in state \(s_1\).

We can take the log of the above to get: \[\log \pi_{\theta}(\tau) = \log p(s_1) + \sum_{t=1}^{|\tau|} \log \pi_{\theta}(a_{t} \mid s_{t}) + \sum_{t=1}^{|\tau|} \log p(s_{t+1} \mid s_t, a_t)\]

Let's substitute this into the gradient: \[\begin{align*} \mathbb{E}_{\tau\sim\pi_{\theta}} \left[\nabla_{\theta}\log \pi_{\theta}(\tau) r(\tau) \right] & = \mathbb{E}_{\tau\sim\pi_{\theta}} \left[ r(\tau) \nabla_{\theta}\left(\log p(s_1) + \sum_{t=1}^{|\tau|} \log \pi_{\theta}(a_{t} \mid s_{t}) + \sum_{t=1}^{|\tau|} \log p(s_{t+1} \mid s_t, a_t) \right) \right]\\ & = \mathbb{E}_{\tau\sim\pi_{\theta}} \left[r(\tau)\sum_{t=1}^{|\tau|} \nabla_{\theta}\log \pi_{\theta}(a_{t} \mid s_{t}) \right] \end{align*}\]

And we are done! The last line comes from the fact that any term that doesn't depend on \(\theta\) disappears when we take the gradient \(\nabla_{\theta}\).
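The factorization of \(\pi_{\theta}(\tau)\) used in this derivation can be sketched numerically for a tiny tabular MDP (all probabilities below are hypothetical): the trajectory probability is just the product of the start-state probability, the policy terms, and the dynamics terms.

```python
import numpy as np

p_s1 = np.array([1.0, 0.0])                      # always start in state 0
policy = np.array([[0.7, 0.3], [0.4, 0.6]])      # pi(a | s)
dynamics = np.array([[[0.9, 0.1], [0.2, 0.8]],   # p(s' | s, a)
                     [[0.5, 0.5], [0.6, 0.4]]])

tau = [(0, 1), (1, 0)]                           # [(s1, a1), (s2, a2)]
prob = p_s1[tau[0][0]]
for t, (s, a) in enumerate(tau):
    prob *= policy[s, a]                         # policy term
    if t + 1 < len(tau):                         # dynamics term for each
        s_next = tau[t + 1][0]                   # observed transition
        prob *= dynamics[s, a, s_next]
```

Here `prob` = \(p(s_1)\,\pi(a_1 \mid s_1)\,p(s_2 \mid s_1, a_1)\,\pi(a_2 \mid s_2)\) = \(1.0 \times 0.3 \times 0.8 \times 0.4 = 0.096\).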

## 4. Algorithms

The expectation for the gradient can be estimated using Monte-Carlo methods. This is the approach taken by the REINFORCE algorithm.
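A minimal REINFORCE sketch on a hypothetical two-armed bandit (one-step episodes), where each sampled trajectory contributes \(r(\tau)\nabla_\theta \log \pi_\theta(a)\) to the gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                    # softmax logits (the policy parameters)
true_rewards = np.array([1.0, 0.0])    # hypothetical rewards per action
lr = 0.1

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for step in range(500):
    p = pi(theta)
    a = rng.choice(2, p=p)             # sample a trajectory (here: one action)
    r = true_rewards[a]                # observed return r(tau)
    grad_log = np.eye(2)[a] - p        # grad_theta log pi(a) for a softmax
    theta += lr * r * grad_log         # gradient-ascent step on J(theta)

final_p = pi(theta)                    # puts most probability on action 0
```

Each update nudges the policy toward actions that yielded high reward; with one sample per step this estimator is unbiased but high-variance, which is what baselines and critics later address.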

## 5. Actor-Critic

- See Actor Critic

## 6. Off-policy

Instead of sampling trajectories according to the policy being learned, you can sample according to a separate exploration policy. See off-policy policy gradient.
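The core trick this relies on is importance sampling: reweight samples drawn from a behavior policy \(\beta\) by \(\pi_{\theta}(a)/\beta(a)\) so that the estimate is still an expectation under \(\pi_{\theta}\). A minimal sketch with hypothetical numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_probs = np.array([0.8, 0.2])     # target policy pi_theta (hypothetical)
beta = np.array([0.5, 0.5])         # exploration / behavior policy
f = np.array([1.0, 3.0])            # any per-action quantity, e.g. r * grad log pi

a = rng.choice(2, size=100_000, p=beta)   # sample from beta, not pi
weights = pi_probs[a] / beta[a]           # importance weights
estimate = (weights * f[a]).mean()        # unbiased estimate of E_pi[f]
exact = pi_probs @ f                      # = 0.8*1.0 + 0.2*3.0 = 1.4
```

The estimate stays unbiased as long as \(\beta\) gives nonzero probability to every action \(\pi_{\theta}\) can take, though the weights can blow up the variance when the two policies differ a lot.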

## 7. Comparison to maximum likelihood for sequence learning

For sequences:

- All of the input is consumed by an encoder, before the decoder begins producing output
- Let \(x\) be the source sequence, and \(y\) be the target sequence

For sequences, the goal of Maximum Likelihood Estimation is to find parameters \(\theta^*\): \[\theta^* = \arg\max_{\theta} p_{\theta}(y \mid x)\] where \[p_{\theta}(y\mid x) = \prod_t p_{\theta}(y_t \mid x)\] When maximizing the log likelihood of a dataset \(\mathcal{D}\), the gradient of our objective is then: \[\frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}} \nabla_{\theta} \log p_{\theta}(y\mid x)\]

Compare this to a Monte-Carlo estimator of the RL gradient, which samples \(N\) trajectories: \[\frac{1}{N}\sum_{i=1}^{N} r(\tau_i) \nabla_{\theta} \log p_{\theta}(a_{\tau_i} \mid x)\] Note that in the above, the actions \(a_{\tau_i}\) are the output, which lies in the target domain. In contrast to the RL framework discussed above, the state \(s_{t}\) is not conditioned on at every time-step. Instead, the input sequence \(x\) is fed to an encoder, which passes an embedding of the input to the decoder. The decoder produces the actions/output tokens.

The two objectives are closely related in the case where \(r(\cdot)\) is 1 for the target output and 0 otherwise: the expected RL gradient becomes \(p_{\theta}(y \mid x)\,\nabla_{\theta}\log p_{\theta}(y \mid x) = \nabla_{\theta}\, p_{\theta}(y \mid x)\), which points in the same direction as the MLE gradient for each example (RL ascends \(p_{\theta}\) directly, MLE ascends \(\log p_{\theta}\)). When the reward is not an exact-match indicator, I think of RL as a fuzzy, trial-and-error approximation of MLE.
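As a numerical check of this comparison, take a softmax decoder over three output tokens for a single input \(x\), with \(r\) equal to 1 only at the target token \(y\): the exact RL gradient works out to \(p_{\theta}(y \mid x)\) times the MLE gradient. The setup below is hypothetical:

```python
import numpy as np

theta = np.array([0.5, -0.2, 0.1])    # hypothetical decoder logits for input x
y = 2                                 # target token

def softmax(t):
    z = np.exp(t - t.max())
    return z / z.sum()

p = softmax(theta)

# MLE gradient of log p(y | x) for a softmax model: e_y - p
mle_grad = np.eye(3)[y] - p

# Exact RL gradient E[r(a) grad log p(a | x)] with r(a) = 1 iff a == y;
# only the a == y term of the expectation survives.
rl_grad = sum(p[a] * (1.0 if a == y else 0.0) * (np.eye(3)[a] - p)
              for a in range(3))
```

So `rl_grad` equals `p[y] * mle_grad`: the same direction, but scaled down when the model currently assigns the target low probability, which is one reason sampling-based RL training is slower than teacher-forced MLE.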