Mnih et al 2013 – Playing Atari with Deep Reinforcement Learning
Notes on Deep Q Learning and Deep Q Networks from mnih13_playin_atari_with_deep_reinf_learn
1. background
The agent interacts with the environment:
- receives as input a representation of the environment
- selects an action
- subsequently receives a reward
1.1. Markov Decision Process
The agent makes all of its decisions in the context of a Markov Decision Process where the states are sequences of observations and actions, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$ (a single screen $x_t$ does not fully determine the emulator's state, but the full sequence does).
The goal of the agent is to maximize its collected reward. With a discount factor $\gamma$, the future discounted return at time $t$ is $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $T$ is the step at which the game terminates.
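As a quick worked example with made-up numbers (not from the paper): take $\gamma = 0.9$ and a three-step episode with rewards $r_1 = 1$, $r_2 = 0$, $r_3 = 2$; the return from $t = 1$ is

$$R_1 = r_1 + \gamma r_2 + \gamma^2 r_3 = 1 + 0.9 \cdot 0 + 0.81 \cdot 2 = 2.62.$$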
What we're trying to find is the optimal action-value function $Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right]$, i.e. the maximum expected return achievable by any policy after seeing state $s$ and taking action $a$.
The optimal action-value function obeys the Bellman equation: $Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$. Intuitively, if the optimal values $Q^*(s', a')$ of the next state are known for every action $a'$, the best strategy is to pick the action that maximizes the expected value of $r + \gamma Q^*(s', a')$.
Finding this function can be done by value iteration: $Q_{i+1}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \mid s, a \right]$, which converges to $Q^*$ as $i \to \infty$.
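As a toy illustration only, here is a minimal tabular sketch of this value iteration in Python, assuming a small, fully known MDP described by `P[s][a] = [(prob, next_state, reward), ...]`; these names are assumptions of the sketch, and in the Atari setting there is no such explicit model and far too many states for a table.

```python
import numpy as np

def q_value_iteration(P, n_states, n_actions, gamma=0.99, n_iters=100):
    """Tabular Q value iteration for a small, fully known MDP (toy sketch).

    P[s][a] is a list of (prob, next_state, reward) tuples: an assumed
    model representation, not something available in the Atari setting.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        Q_next = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # Q_{i+1}(s, a) = E[ r + gamma * max_a' Q_i(s', a') | s, a ]
                Q_next[s, a] = sum(
                    prob * (reward + gamma * Q[s_next].max())
                    for prob, s_next, reward in P[s][a]
                )
        Q = Q_next
    return Q
```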
Estimating $Q$ separately for every sequence is impractical, so instead we use a neural network function approximator with weights $\theta$, a Q-network, with $Q(s, a; \theta) \approx Q^*(s, a)$. We train this Q-network by minimizing a loss that changes for each iteration $i$: $L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right]$, where the target is $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right]$ and $\rho(s, a)$ is the behavior distribution over states and actions.
The gradient of this loss ends up having an expectation in it too: $\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right]$. Instead of computing this expectation exactly, we can approximate it by sampling: the Q-learning algorithm simply uses single samples drawn from the behavior distribution $\rho$ and the emulator $\mathcal{E}$, updating the weights after each one.
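A minimal sketch of that sampled gradient step, written here with PyTorch; `q_net` (parameters $\theta_i$), the frozen copy `q_prev` (standing in for $\theta_{i-1}$), the optimizer, and the tensor shapes are all assumptions of this sketch rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def q_learning_step(q_net, q_prev, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One gradient step on sampled transitions (s, a, r, s_next).

    Assumed shapes: s, s_next are [batch, obs_dim]; a is a long tensor of
    action indices [batch]; r and done are float tensors [batch].
    """
    # Target y = r + gamma * max_a' Q(s', a'; theta_{i-1});
    # for terminal transitions (done = 1) the target is just r.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_prev(s_next).max(dim=-1).values

    # Prediction Q(s, a; theta_i) for the action actually taken.
    pred = q_net(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)

    # Squared error between target and prediction; backprop gives the
    # sampled estimate of the gradient written above.
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```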
1.2. deep Q-learning
Finally, for deep Q-learning there is one more ingredient: experience replay. At each time step, store the transition $e_t = (s_t, a_t, r_t, s_{t+1})$ in a replay memory $\mathcal{D} = e_1, \ldots, e_N$, pooled over many episodes. During training, Q-learning updates are then applied to minibatches of experience $e \sim \mathcal{D}$ drawn uniformly at random from this memory, rather than to the most recent transition only.
Why go through all this trouble? Apparently it helps rid the procedure of auto-correlated samples, which makes sense: consecutive transitions are strongly correlated, and because the update for a transition may occur long after that experience was originally collected, randomly sampled minibatches break this correlation (and let each transition be reused in many updates).
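A minimal sketch of such a replay memory, assuming a fixed capacity, uniform random minibatch sampling, and an extra `done` flag appended to each transition (the flag, capacity, and batch size are choices of this sketch, not the paper's settings):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of transitions (s_t, a_t, r_t, s_{t+1}, done)."""

    def __init__(self, capacity=100_000):
        # A deque silently drops the oldest transition once full.
        self.memory = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random minibatch: sampling across episodes is what
        # breaks the correlation between consecutive transitions.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```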
Noteworthy aspects:
- this algorithm is model-free: we don't ever see the environment dynamics directly, nor do we explicitly construct a model of them (see Policy Gradient for more discussion)
- it is also off-policy: training samples are drawn from the behavior distribution $\rho$, while what we learn about is the greedy policy $a = \max_{a} Q(s, a; \theta)$. In practice, we usually follow the policy greedily, but with $\epsilon$-dithering ($\epsilon$-greedy) to ensure that we are exploring (see off-policy policy gradient, and the sketch below)
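For completeness, a small sketch of the $\epsilon$-greedy behavior just described, reusing the assumed `q_net` interface from the earlier sketch (one value per action for a given state):

```python
import random
import torch

def epsilon_greedy_action(q_net, s, n_actions, epsilon=0.1):
    """With probability epsilon take a uniformly random action,
    otherwise act greedily with respect to the current Q-network."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        # Greedy action a = argmax_a Q(s, a; theta).
        return int(q_net(s).argmax(dim=-1).item())
```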