Actor Critic
1. Terminology
- Let \(\tau\) be a trajectory of \(T\) timesteps, \((s_0, a_0, s_1, a_1, \ldots, s_{T})\), with \(r_t\) the reward received at timestep \(t\).
- Let \(R(\tau) = \sum_{t=0}^{T-1} r_t\).
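Concretely, a trajectory is collected by rolling the policy out in an environment. Below is a minimal sketch; `env` and `policy` are hypothetical placeholders (an environment with `reset()`/`step()` returning `(next_state, reward, done)`, and a policy that samples an action from a state), neither of which is defined in these notes.

```python
# Collect one trajectory tau = (s_0, a_0, ..., s_T) and its return R(tau).
def rollout(env, policy, T):
    states, actions, rewards = [], [], []
    s = env.reset()
    for t in range(T):
        a = policy(s)                       # a_t ~ pi_theta(. | s_t)
        s_next, r, done = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if done:
            break
    R = sum(rewards)                        # R(tau) = sum_{t=0}^{T-1} r_t
    return states, actions, rewards, R
```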
2. Re-writing the gradient
For a Policy Gradient method, the gradient of the objective is given by: \[\begin{align*} \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left(\sum_{t=0}^{T-1} r_t \right) \left(\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right) \right]\\ &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_{\theta} \log\pi_{\theta}(a_t \mid s_t) \left( \sum_{t'=t}^{T-1} r_{t'} \right) \right] \end{align*} \]
For a full explanation of why, see Daniel Takeshi's notes (see also: Blogs). Roughly, the second line follows from the first because, for \(t' < t\), \(\mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'}\right] = 0\): conditioned on the trajectory up to \(s_t\), the reward \(r_{t'}\) is already fixed and the score function \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) has zero conditional mean, so iterated expectation kills those cross terms and only the reward-to-go survives.
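As a sanity check on the reward-to-go form, here is a sketch of the estimator written as a PyTorch surrogate loss (my own sketch, not Takeshi's code). `policy_net` is an assumed network mapping a batch of states to action logits for a discrete action space; `states`, `actions`, `rewards` come from a single sampled trajectory.

```python
import torch

# reward_to_go[t] = sum_{t'=t}^{T-1} r_{t'}
def reward_to_go(rewards):
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return list(reversed(out))

# Surrogate loss whose gradient is the single-trajectory estimate of the
# policy gradient above (minimizing it ascends the objective).
def pg_loss(policy_net, states, actions, rewards):
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions)
    dist = torch.distributions.Categorical(logits=policy_net(s))
    logp = dist.log_prob(a)                                   # log pi_theta(a_t | s_t)
    weights = torch.as_tensor(reward_to_go(rewards), dtype=torch.float32)
    return -(logp * weights).sum()
```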
3. Baseline
We can reduce the variance of the estimated gradient without introducing any bias using a baseline function \(b(s)\): \[ \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t'=t}^{T-1} r_{t'} - b(s_t) \right) \right] \]
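The same surrogate with the baseline subtracted, again as a rough sketch. `baseline_fn` is an assumed state-dependent function returning a 1-D tensor of \(b(s_t)\) values (e.g. a small value network, or simply the mean reward-to-go); it is detached so no gradient flows into it through the policy loss.

```python
def pg_loss_with_baseline(policy_net, baseline_fn, states, actions, rewards):
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions)
    dist = torch.distributions.Categorical(logits=policy_net(s))
    logp = dist.log_prob(a)
    r2g = torch.as_tensor(reward_to_go(rewards), dtype=torch.float32)
    weights = r2g - baseline_fn(s).detach()                   # sum_{t'>=t} r_{t'} - b(s_t)
    return -(logp * weights).sum()
```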
4. Value functions
We introduce two value functions: \[ Q(s,a) = \mathbb{E}_{\tau \sim \pi_\theta} \left[\sum_{t=0}^{T-1} r_t \mathrel{\Big|} s_0=s, a_0=a\right] \] and \[V(s) = \mathbb{E}_{\tau\sim \pi_\theta} \left[ \sum_{t=0}^{T-1} r_t \mathrel{\Big|} s_0=s\right]\]
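In practice \(V(s)\) is usually approximated by a learned critic. A minimal sketch, using the Monte Carlo reward-to-go from the earlier snippet as the regression target; the architecture, hidden size, and loss are my own choices here, not something fixed by the definitions above.

```python
import torch.nn as nn

# A small MLP critic approximating V(s).
class Critic(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)      # V(s_t), shape (T,)

# Regress V(s_t) onto the observed reward-to-go.
def value_loss(critic, states, rewards):
    s = torch.as_tensor(states, dtype=torch.float32)
    targets = torch.as_tensor(reward_to_go(rewards), dtype=torch.float32)
    return ((critic(s) - targets) ** 2).mean()
```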
5. Advantage
Then, note the similarity between \(Q(s,a)\) and the reward-to-go \(\sum_{t'=t}^{T-1} r_{t'}\). We are permitted to write: \[ \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t \mid s_t) \left( Q(s_t,a_t) - b(s_t) \right) \right] \] Why? Neither I nor Takeshi give a full chain of derivations, but the rough reason is iterated expectation once more: conditioned on \((s_t, a_t)\), the expected reward-to-go is exactly \(Q(s_t, a_t)\), so swapping one for the other inside the outer expectation changes nothing.
Now, if we use \(V(s)\) as our baseline, we have the advantage actor critic form \[ \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t \mid s_t) \left( Q(s_t,a_t) - V(s_t) \right) \right] \]
where the advantage \(A(s_t,a_t) = Q(s_t,a_t) - V(s_t)\) measures how much better taking action \(a_t\) in state \(s_t\) is than acting according to the policy's average behaviour in \(s_t\).
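Putting the pieces together, here is a sketch of one advantage-actor-critic update on a single trajectory, using the Monte Carlo reward-to-go as a stand-in for \(Q(s_t,a_t)\), so that \(A(s_t,a_t)\) is estimated by \(\sum_{t'\ge t} r_{t'} - V(s_t)\). The `actor`, `critic`, and their optimizers are assumed to exist (e.g. the `Critic` above plus a logits-producing actor network).

```python
def a2c_update(actor, critic, actor_opt, critic_opt, states, actions, rewards):
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions)
    r2g = torch.as_tensor(reward_to_go(rewards), dtype=torch.float32)

    # Critic step: fit V(s_t) to the observed reward-to-go.
    critic_opt.zero_grad()
    v_loss = ((critic(s) - r2g) ** 2).mean()
    v_loss.backward()
    critic_opt.step()

    # Actor step: weight each log pi_theta(a_t | s_t) by the advantage estimate,
    # with the critic detached so it acts purely as a baseline.
    actor_opt.zero_grad()
    adv = (r2g - critic(s)).detach()
    logp = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    pi_loss = -(logp * adv).sum()
    pi_loss.backward()
    actor_opt.step()
```

Detaching the advantage keeps the critic out of the actor's computation graph, so subtracting \(V(s_t)\) only reduces the variance of the gradient estimate without changing its expectation.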