Dyna-Q Learning
1. setting
- We want to learn the Q values for an MDP. That is, we want to find the value \(Q(s,a)\) of taking an action \(a\) in a state \(s\)
2. solution
- Recall value iteration, where at each iteration we randomly select a state and update the value function at that state
- Now, we do Q-learning, where at each iteration, we
- randomly select a state \(s\) and an action \(a\) in that state
- record the reward \(R\) and resulting state \(s'\)
- update the \(Q\) value \(Q(s,a)\) according to the Bellman equation (see DQN): \(Q(s,a) \leftarrow Q(s,a) + \alpha \left( R + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)\)
- update our model of the environment dynamics, \(Model(s,a) = (R, s')\) (let's assume that the model dynamics are deterministic); both updates are sketched in code after this list
- The last bullet is key to Dyna-Q learning
- for each iteration we also run a sub-routine \(n\) times where:
- we randomly sample a previously seen state and action \((s,a)\)
- since we've previously seen them, we have their resulting states and rewards stored
- update \(Q(s,a)\) using that stored reward and next state, exactly as in the outer loop (see the second sketch after this list)
- why run this sub-routine? presumably because querying our (approximate) model of the environment dynamics is cheaper/faster than querying the real environment. Remember that for normal \(Q\)-learning, even though \(Q(s,a)\) is being updated in the outer loop, these values won't be propagated to the rest of the table unless the neighboring states are visited. Imagine a state \(s'\) such that \(T(s',a) = s\). If we've updated \(Q(s,a)\), this information won't reach \(Q(s',a)\) until we run the update rule after visiting \(s'\). This sub-routine increases the odds of that happening.
- This is very similar to experience replay in DQN
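
Below is a minimal tabular sketch of the per-iteration update described above. The names (`Q`, `model`, `alpha`, `gamma`, `direct_update`) are illustrative assumptions, not from the original notes: one Bellman update of \(Q(s,a)\) from a real transition, plus recording that transition in a deterministic model.

```python
from collections import defaultdict

# Hypothetical tabular setup (names are illustrative).
alpha, gamma = 0.1, 0.95   # learning rate and discount factor
Q = defaultdict(float)     # Q[(s, a)] -> estimated value of taking action a in state s
model = {}                 # model[(s, a)] -> (R, s'), assuming deterministic dynamics

def direct_update(s, a, r, s_next, actions):
    """One real-experience step: Bellman update of Q(s, a), then record the transition."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    model[(s, a)] = (r, s_next)  # store what the environment returned, for later planning
```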
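
And a sketch of the planning sub-routine, continuing the code above (it reuses the assumed `Q`, `model`, `alpha`, and `gamma`): it replays \(n\) stored transitions and applies the same update, with no calls to the real environment.

```python
import random

def planning(n, actions):
    """Replay n previously seen (s, a) pairs from the learned model and update Q."""
    # Assumes at least one real transition has already been recorded in the model.
    for _ in range(n):
        s, a = random.choice(list(model.keys()))  # a previously seen state-action pair
        r, s_next = model[(s, a)]                 # stored reward and resulting state
        best_next = max(Q[(s_next, a_next)] for a_next in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

A full loop would then interleave the two: act in the environment, call `direct_update` on the observed transition, then call `planning(n, actions)` before acting again.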