Actor Critic
1. Terminology
- Let \(\tau\) be a trajectory of \(T\) timesteps, \((s_0, a_0, s_1, a_1, \ldots, s_{T})\), with \(r_t\) the reward received at timestep \(t\).
- Let \(R(\tau) = \sum_{t=0}^{T-1} r_t\).
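Concretely, a trajectory is collected by rolling the policy out in an environment. Below is a minimal sketch; `env` and `policy` are hypothetical placeholders (an environment with `reset()`/`step()` returning `(next_state, reward, done)`, and a policy that samples an action from a state), neither of which is defined in these notes.

```python
# Collect one trajectory tau = (s_0, a_0, ..., s_T) and its return R(tau).
def rollout(env, policy, T):
    states, actions, rewards = [], [], []
    s = env.reset()
    for t in range(T):
        a = policy(s)                       # a_t ~ pi_theta(. | s_t)
        s_next, r, done = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if done:
            break
    R = sum(rewards)                        # R(tau) = sum_{t=0}^{T-1} r_t
    return states, actions, rewards, R
```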
2. Re-writing the gradient
For a Policy Gradient method, the gradient of the objective is given by: \[\begin{align*} \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left(\sum_{t=0}^{T-1} r_t \right) \left(\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right) \right]\\ &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_{\theta} \log\pi_{\theta}(a_t \mid s_t) \left( \sum_{t'=t}^{T-1} r_{t'} \right) \right] \end{align*} \]
For a full explanation of why, see Daniel Takeshi's notes (see also: Blogs). Roughly, the second line follows from the first because, for \(t' < t\), \(\mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'}\right] = 0\): conditioned on the trajectory up to \(s_t\), the reward \(r_{t'}\) is already fixed and the score function \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) has zero conditional mean, so iterated expectation kills those cross terms and only the reward-to-go survives.
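As a sanity check on the reward-to-go form, here is a sketch of the estimator written as a PyTorch surrogate loss (my own sketch, not Takeshi's code). `policy_net` is an assumed network mapping a batch of states to action logits for a discrete action space; `states`, `actions`, `rewards` come from a single sampled trajectory.

```python
import torch

# reward_to_go[t] = sum_{t'=t}^{T-1} r_{t'}
def reward_to_go(rewards):
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return list(reversed(out))

# Surrogate loss whose gradient is the single-trajectory estimate of the
# policy gradient above (minimizing it ascends the objective).
def pg_loss(policy_net, states, actions, rewards):
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions)
    dist = torch.distributions.Categorical(logits=policy_net(s))
    logp = dist.log_prob(a)                                   # log pi_theta(a_t | s_t)
    weights = torch.as_tensor(reward_to_go(rewards), dtype=torch.float32)
    return -(logp * weights).sum()
```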
3. Baseline
We can reduce the variance of the estimated gradient without introducing any bias using a baseline function \(b(s)\): \[ \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t'=t}^{T-1} r_{t'} - b(s_t) \right) \right] \]
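The same surrogate with the baseline subtracted, again as a rough sketch. `baseline_fn` is an assumed state-dependent function returning a 1-D tensor of \(b(s_t)\) values (e.g. a small value network, or simply the mean reward-to-go); it is detached so no gradient flows into it through the policy loss.

```python
def pg_loss_with_baseline(policy_net, baseline_fn, states, actions, rewards):
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions)
    dist = torch.distributions.Categorical(logits=policy_net(s))
    logp = dist.log_prob(a)
    r2g = torch.as_tensor(reward_to_go(rewards), dtype=torch.float32)
    weights = r2g - baseline_fn(s).detach()                   # sum_{t'>=t} r_{t'} - b(s_t)
    return -(logp * weights).sum()
```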
4. Value functions
We introduce two value functions: \[ Q(s,a) = \mathbb{E}_{\tau \sim \pi_\theta} \left[\sum_{t=0}^{T-1} r_t \mathrel{\Big|} s_0=s, a_0=a\right] \] and \[V(s) = \mathbb{E}_{\tau\sim \pi_\theta} \left[ \sum_{t=0}^{T-1} r_t \mathrel{\Big|} s_0=s\right]\]
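In practice \(V(s)\) is usually approximated by a learned critic. A minimal sketch, using the Monte Carlo reward-to-go from the earlier snippet as the regression target; the architecture, hidden size, and loss are my own choices here, not something fixed by the definitions above.

```python
import torch.nn as nn

# A small MLP critic approximating V(s).
class Critic(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)      # V(s_t), shape (T,)

# Regress V(s_t) onto the observed reward-to-go.
def value_loss(critic, states, rewards):
    s = torch.as_tensor(states, dtype=torch.float32)
    targets = torch.as_tensor(reward_to_go(rewards), dtype=torch.float32)
    return ((critic(s) - targets) ** 2).mean()
```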
5. Advantage
Then, note the similarity between \(Q(s,a)\) and the reward-to-go \(\sum_{t'=t}^{T-1} r_{t'}\). We are permitted to write: \[ \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t \mid s_t) \left( Q(s_t,a_t) - b(s_t) \right) \right] \] Why? Neither I nor Takeshi give a full chain of derivations, but the rough reason is iterated expectation once more: conditioned on \((s_t, a_t)\), the expected reward-to-go is exactly \(Q(s_t, a_t)\), so swapping one for the other inside the outer expectation changes nothing.
Now, if we use \(V(s)\) as our baseline, we have the advantage actor critic form \[ \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t \mid s_t) \left( Q(s_t,a_t) - V(s_t) \right) \right] \]
where the advantage \(A(s_t,a_t) = Q(s_t,a_t) - V(s_t)\) measures how much better taking action \(a_t\) in state \(s_t\) is than acting according to the policy's average behaviour in \(s_t\).
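Putting the pieces together, here is a sketch of one advantage-actor-critic update on a single trajectory, using the Monte Carlo reward-to-go as a stand-in for \(Q(s_t,a_t)\), so that \(A(s_t,a_t)\) is estimated by \(\sum_{t'\ge t} r_{t'} - V(s_t)\). The `actor`, `critic`, and their optimizers are assumed to exist (e.g. the `Critic` above plus a logits-producing actor network).

```python
def a2c_update(actor, critic, actor_opt, critic_opt, states, actions, rewards):
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions)
    r2g = torch.as_tensor(reward_to_go(rewards), dtype=torch.float32)

    # Critic step: fit V(s_t) to the observed reward-to-go.
    critic_opt.zero_grad()
    v_loss = ((critic(s) - r2g) ** 2).mean()
    v_loss.backward()
    critic_opt.step()

    # Actor step: weight each log pi_theta(a_t | s_t) by the advantage estimate,
    # with the critic detached so it acts purely as a baseline.
    actor_opt.zero_grad()
    adv = (r2g - critic(s)).detach()
    logp = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    pi_loss = -(logp * adv).sum()
    pi_loss.backward()
    actor_opt.step()
```

Detaching the advantage keeps the critic out of the actor's computation graph, so subtracting \(V(s_t)\) only reduces the variance of the gradient estimate without changing its expectation.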