Mnih et al. 2013 – Playing Atari with Deep Reinforcement Learning
Notes on Deep Q Learning and Deep Q Networks from mnih13_playin_atari_with_deep_reinf_learn
1. background
The agent interacts with the environment \(\mathcal{E}\). At each time-step \(t\), the agent:
- receives as input a representation of the environment \(x_t \in \mathbb{R}^d\)
- selects an action \(a_t \in \mathcal{A}\)
- subsequently receives a reward \(r_t\)
1.1. Markov Decision Process
The agent makes all of its decisions in the context of a Markov Decision Process where the states are: \[ s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_{t} \] That is, each state is itself a sequence of observations and actions. Note that this makes the state space very large.
The goal of the agent is to maximize its collected reward. With a discount factor \(\gamma\), the return of a sequence of rewards starting at time \(t\) is \(R_t = \sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}\), where the game ends at time step \(T\).
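As a quick sanity check of this definition, here is a minimal Python sketch that computes \(R_t\) for every \(t\) with a single backward pass; the reward values and \(\gamma\) are made-up toy numbers.

```python
# Discounted return R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, computed for
# every t with one backward pass over a toy reward sequence.
gamma = 0.99
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]         # r_1 .. r_T (made-up values)

returns = [0.0] * len(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running  # R_t = r_t + gamma * R_{t+1}
    returns[t] = running

print(returns)                              # returns[0] is the return from the start
```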
What we're trying to find is the optimal action-value function \(Q^*(s,a)\), which tells us the maximum expected return that can follow after taking action \(a\) in state \(s\), where the maximum is taken across all possible policies \(\pi\): \[ Q^*(s,a) = \max_{\pi} \mathbb{E}\left[R_t \mid s_t = s, a_t = a, \pi\right] \]
The optimal policy follows the Bellman equation: \[ Q^*(s,a) = \mathbb{E}_{s'\sim \mathcal{E}}\left[r + \gamma \max_{a'} Q^*(s',a') \mid s,a \right] \] where \(r\) is the reward you get from doing \(a\) in \(s\). Note that we have an expectation taken over \(s' \sim \mathcal{E}\), because the outcome of a given transition is stochastic.
Finding this function can be done by value iteration: \[ Q_{i+1}(s,a) = \mathbb{E}\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a \right] \] However, this is impractical for very large state spaces. So instead, we will use an approximator, parameterized by \(\theta\): \(Q(s,a; \theta) \approx Q^*(s,a)\).
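For intuition, here is a tabular value-iteration sketch in NumPy. It assumes a small, fully known model (a transition tensor `P` and reward table `R`, both invented here), which is exactly the setting the approximator is meant to replace when the state space is too large.

```python
import numpy as np

# Tabular Q value iteration on a toy MDP with a *known* model:
# P[s, a, s'] are transition probabilities, R[s, a] expected rewards.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S')
R = rng.normal(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for i in range(200):
    # Q_{i+1}(s, a) = E_{s'}[ r + gamma * max_{a'} Q_i(s', a') ]
    Q = R + gamma * P @ Q.max(axis=1)

print(np.round(Q, 3))
```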
We train this Q-network using a loss that changes at each iteration \(i\): \[ L_i(\theta_i) = \mathbb{E}_{s, a\sim \rho(\cdot)}\left[ \left(y_i - Q(s,a;\theta_i)\right)^2 \right] \] where \(y_i = \mathbb{E}_{s'\sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q(s',a'; \theta_{i-1}) \mid s, a \right]\) is the target and \(\rho(s,a)\) is the behavior distribution (see off-policy policy gradient). Compare this with the value iteration equation above: the same update is happening, except that the approximator is now a neural network, which will hopefully generalize to states it has not seen.
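A minimal PyTorch sketch of one such iteration, with a small MLP standing in for the Q-network. The network sizes, the random minibatch, and the name `q_prev` (a frozen copy playing the role of \(\theta_{i-1}\)) are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn

# Sketch of L_i(theta_i): squared error between Q(s, a; theta_i) and a target
# y_i built from the *previous* parameters theta_{i-1}.
obs_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_prev = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_prev.load_state_dict(q_net.state_dict())   # frozen copy standing in for theta_{i-1}

# A fake minibatch drawn "from rho": states, actions, rewards, next states.
s = torch.randn(32, obs_dim)
a = torch.randint(n_actions, (32,))
r = torch.randn(32)
s_next = torch.randn(32, obs_dim)

with torch.no_grad():
    y = r + gamma * q_prev(s_next).max(dim=1).values   # single-sample estimate of y_i

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
loss = ((y - q_sa) ** 2).mean()
loss.backward()                                        # gradient w.r.t. theta_i
```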
The gradient of this loss also contains expectations. Instead of computing them exactly, we approximate them by sampling. The Q-learning algorithm simply uses single samples drawn from \(\rho\) and \(\mathcal{E}\).
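In the tabular setting, that single-sample approximation is the familiar Q-learning update; a small sketch, where the step size `alpha` and the transition values are placeholders introduced for illustration:

```python
import numpy as np

# Single-sample Q-learning update for one observed transition (s, a, r, s'):
# move Q[s, a] toward the sampled target r + gamma * max_a' Q[s', a'].
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

s, a, r, s_next = 0, 1, 1.0, 2          # one made-up transition
target = r + gamma * Q[s_next].max()    # single sample standing in for the expectation
Q[s, a] += alpha * (target - Q[s, a])
```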
1.2. deep q learning
Finally, for deep Q-learning there is one more ingredient: experience replay. For experience replay, store each transition \((s_t, a_t, r_t, s_{t+1})\) in a dataset \(\mathcal{D}\). Then, during training, we interleave collecting new experience with sampling minibatches from that dataset for gradient updates.
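A minimal replay-memory sketch (uniform random sampling from a bounded deque); the class name, capacity, and batch size are arbitrary choices for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions and sample them uniformly at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```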
More precisely, let \(\phi_j = \phi(s_j)\) be the (preprocessed) representation of state \(s_j\). Then, say that we have sampled \((\phi_j, a_j, r_j, \phi_{j+1})\) from our dataset \(\mathcal{D}\) of stored experiences. We then set: \[ y_j := \begin{cases} r_j & \text{if $\phi_{j+1}$ is terminal}\\ r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta) & \text{if $\phi_{j+1}$ is non-terminal} \end{cases} \] Then, perform a gradient update according to the loss given above. What is going on here? Recall that \(y_j\) is supposed to be an expectation, but instead we're approximating it with a single sample. Also recall that each \(s_j\) actually represents a sequence of observations and actions.
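Putting the pieces together, a sketch of one training step that reuses the `q_net` and `ReplayBuffer` objects from the earlier sketches; the optimizer choice, learning rate, and tensor plumbing are assumptions for illustration, not the paper's exact recipe.

```python
import torch

# One training step: sample a minibatch of (phi_j, a_j, r_j, phi_{j+1}, done_j)
# from the replay memory, build the targets y_j, and descend the squared error.
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def train_step(buffer, batch_size=32, gamma=0.99):
    batch = buffer.sample(batch_size)
    phi      = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
    a        = torch.tensor([b[1] for b in batch])
    r        = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    phi_next = torch.stack([torch.as_tensor(b[3], dtype=torch.float32) for b in batch])
    done     = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    with torch.no_grad():
        # y_j = r_j                                       if phi_{j+1} is terminal
        # y_j = r_j + gamma * max_a' Q(phi_{j+1}, a')     otherwise
        y = r + gamma * (1.0 - done) * q_net(phi_next).max(dim=1).values

    q_sa = q_net(phi).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((y - q_sa) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```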
Why go through all this trouble? It decorrelates the samples used for gradient updates: consecutive transitions along a trajectory are strongly correlated, whereas sampling uniformly from the replay memory breaks that correlation, and each stored transition can be reused in many updates, possibly long after it was originally collected.
Noteworthy aspects:
- this algorithm is model-free – we never observe the environment dynamics \(\mathcal{E}\) directly, nor do we explicitly construct a model of them (see Policy Gradient for more discussion)
- it is also off-policy – training samples are drawn from the behavior distribution \(\rho\). In practice, the behavior policy follows the current Q-estimates greedily but picks a random action with probability \(\epsilon\) (\(\epsilon\)-greedy), so that we keep exploring (see off-policy policy gradient); a sketch follows the list
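A sketch of that \(\epsilon\)-greedy behavior policy, reusing the `q_net` from the sketches above; the function name and default values are placeholders.

```python
import random
import torch

def select_action(phi, epsilon=0.1, n_actions=4):
    """Act greedily w.r.t. q_net, but pick a random action with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(phi, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```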