
Dyna-Q Learning

1. setting

  • We want to learn the Q values for an MDP. That is, we want to find the value of taking an action \(a\) in state \(s\)
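  • Concretely, the target is the optimal action-value function \(Q^*\), which, for a finite MDP with expected reward \(R(s,a)\), transition distribution \(P(s' \mid s,a)\), and discount factor \(\gamma\) (notation assumed here, not from the note), satisfies the Bellman optimality equation:

    \[ Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^*(s',a') \]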

2. solution

  • Recall value iteration, where at each iteration we randomly select a state and update the value function at that state
  • Now, we do Q-learning, where at each iteration, we:
    • randomly select a state \(s\) and an action \(a\) in that state
    • record the reward \(R\) and resulting state \(s'\)
    • update the \(Q\) value \(Q(s,a)\) according to the Bellman equation (see DQN): \(Q(s,a) \leftarrow Q(s,a) + \alpha\big(R + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)\)
    • update our model of the environment dynamics \(Model(s,a) = (R, s')\) (let's assume that the model dynamics are deterministic)
  • The last bullet is key to Dyna-Q learning
    • for each iteration we also run a sub-routine \(n\) times where:
      • we randomly sample a previously seen state-action pair \((s,a)\)
      • since we've seen this pair before, its reward and resulting state \((R, s')\) are stored in the model
        • we use them to update \(Q(s,a)\) exactly as in the outer loop
    • why run this sub-routine? Presumably because querying our (approximate) model of the environment dynamics is cheaper/faster than querying the real environment. Remember that in plain \(Q\)-learning, even though \(Q(s,a)\) is updated in the outer loop, those values won't propagate to the rest of the table unless the neighboring states are visited. Imagine a state \(s'\) such that \(T(s',a) = s\): if we've updated \(Q(s,a)\), this information won't reach \(Q(s',a)\) until we run the update rule after visiting \(s'\). This sub-routine increases the odds of that happening.
  • This is very similar to experience replay in DQN (a minimal code sketch of the full loop follows below)
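
A minimal tabular sketch of the loop above, in Python. The interface is an assumption for illustration, not from this note: env.step(s, a) is taken to return (reward, next_state), states and actions are finite lists, and the dynamics are deterministic as assumed above.

    import random
    from collections import defaultdict

    def dyna_q(env, states, actions, n_iters=10000, n_planning_steps=50,
               alpha=0.1, gamma=0.95):
        # Q[(s, a)] -> current action-value estimate (defaults to 0.0)
        Q = defaultdict(float)
        # model[(s, a)] -> (reward, next_state); deterministic dynamics assumed
        model = {}

        def q_update(s, a, r, s_next):
            # one-step Q-learning / Bellman update
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])

        for _ in range(n_iters):
            # outer loop: take a real step in the environment
            s = random.choice(states)
            a = random.choice(actions)
            r, s_next = env.step(s, a)      # assumed interface: returns (reward, next_state)

            q_update(s, a, r, s_next)       # learn from the real transition
            model[(s, a)] = (r, s_next)     # record it in the (deterministic) model

            # planning sub-routine: replay n transitions from the stored model
            for _ in range(n_planning_steps):
                s_p, a_p = random.choice(list(model.keys()))
                r_p, s_next_p = model[(s_p, a_p)]
                q_update(s_p, a_p, r_p, s_next_p)

        return Q

Note that the classic Dyna-Q of Sutton & Barto selects actions \(\epsilon\)-greedily from the agent's current state rather than sampling \((s,a)\) uniformly at random; the sketch follows the random-sampling description used in this note.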

3. sources

Created: 2024-07-15 Mon 01:26