A limitation of DQNs is that they learn a mapping from only a limited number of past states (e.g., the last 4 frames in Atari)
Some tasks require remembering information from more distant states
Future states + rewards then depend on more than just the DQN's input
MDP becomes POMDP
Real-world tasks have noisy state due to partial observability
E.g., one image doesn't tell you the velocity of the ball, only its position
DQN performance declines with incomplete state observations
We can use a Deep Recurrent Q-Network (DRQN): combines an LSTM with a Deep Q-Network
Deep Q Learning
Avoids feedback loops (achieves stability) through 1) experience replay, 2) a separate target network, and 3) adaptive learning rates
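A minimal sketch of how these pieces fit together (PyTorch; the network, buffer size, and learning rate are illustrative stand-ins, not the paper's values):

```python
import copy
import random
from collections import deque

import torch
import torch.nn.functional as F

replay = deque(maxlen=100_000)                    # experience replay buffer of (s, a, r, s2, done) tensors
q_net = torch.nn.Linear(4, 2)                     # stand-in for the real Q-network
target_net = copy.deepcopy(q_net)                 # separate target network
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)  # adaptive learning rate
gamma = 0.99

def train_step(batch_size=32):
    # Random minibatch from replay -> decorrelated samples
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.stack, zip(*batch))   # a: long, r/done: float
    with torch.no_grad():
        # Targets come from the frozen target network, not the network being updated
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Periodically copy weights into the target network
    target_net.load_state_dict(q_net.state_dict())
```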
Partial Observability
Sensations received by agent are only partial glimpses of underlying system state
Vanilla DQNs can't recover the underlying system state of a POMDP from a single observation
$Q(o, a \vert \theta) \neq Q(s, a \vert \theta)$
Recurrency helps narrow the gap $Q(o, a \vert \theta) \rightarrow Q(s, a \vert \theta)$ using sequences of observations.
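One way to make this concrete (notation mine, not the paper's): maintain a recurrent hidden state $h_t = \mathrm{LSTM}(h_{t-1}, o_t)$ and estimate $Q(h_t, a \vert \theta)$, so the value estimate conditions on the whole observation history $o_{1:t}$ rather than a single observation and can approach $Q(s, a \vert \theta)$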
DRQN Architecture
Take DQN and replace the first fully connected layer (the one following the convolutional layers) with a recurrent LSTM
Each frame is convolved first and then passed through the LSTM through time
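A minimal PyTorch sketch of a DRQN-style network, assuming the usual DQN conv stack and an 84×84 single-channel input (exact layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Sketch: DQN's convolutional stack, with the first fully connected
    layer replaced by an LSTM that is unrolled through time."""
    def __init__(self, n_actions, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)  # replaces the first FC layer
        self.head = nn.Linear(hidden, n_actions)                   # Q-value per action

    def forward(self, frames, hidden_state=None):
        # frames: (batch, time, 1, 84, 84) -- one frame per timestep
        b, t = frames.shape[:2]
        x = self.conv(frames.reshape(b * t, *frames.shape[2:]))    # convolve each frame
        x = x.reshape(b, t, -1)                                    # regroup into sequences
        x, hidden_state = self.lstm(x, hidden_state)               # integrate through time
        return self.head(x), hidden_state                          # Q-values at every timestep
```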
Stable Recurrent Updates
Bootstrapped Sequential Updates: episodes are selected randomly from replay memory; updates begin at the start of the episode and proceed until its conclusion
Targets generated by target Q network
Advantage: Carry hidden state forward from beginning of episode
Disadvantage: Violates DQN's random-sampling policy (adjacent experiences in an episode are correlated)
Bootstrapped Random Updates: episodes are selected randomly from replay memory; updates start at random points in the episode and unroll (backpropagate through time) for a set number of timesteps
Targets generated by target Q network
Advantage: Follows DQN random sample policy
Disadvantage: Hidden state must be zeroed at the start of each update → harder to learn dependencies spanning longer time scales than the unroll length
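A sketch of the two sampling strategies over stored episodes (names and the episode format are illustrative):

```python
import random

def sample_random_update(episodes, unroll=10):
    """Bootstrapped random updates: random episode, random sub-sequence of
    `unroll` steps; the LSTM hidden state is zeroed (None) at its start."""
    ep = random.choice(episodes)                     # episode = list of transitions
    start = random.randrange(0, max(1, len(ep) - unroll))
    return ep[start:start + unroll], None            # None -> zero hidden state

def sample_sequential_update(episodes):
    """Bootstrapped sequential updates: replay a whole episode from the start,
    carrying the hidden state forward, but samples are no longer i.i.d."""
    return random.choice(episodes), None             # hidden state zeroed only at episode start
```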
Atari Games: MDP or POMDP
Regular DQN used a stack of 4 frames as its observation → enough for Atari games to be effectively fully observable (an MDP rather than a POMDP)
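For reference, a sketch of the 4-frame stacking that makes standard Atari effectively Markov (illustrative, not the paper's code):

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keep the last k frames; the stacked array is the DQN's observation,
    restoring information (e.g., velocity) that a single frame lacks."""
    def __init__(self, k=4, shape=(84, 84)):
        self.frames = deque([np.zeros(shape, dtype=np.float32)] * k, maxlen=k)

    def push(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames)   # shape (k, 84, 84)
```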
Flickering Atari Games
To test on a POMDP, modify Atari so that at each timestep there is a 50% chance of the frame being fully obscured
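A sketch of that modification (function name is mine):

```python
import numpy as np

def flicker(frame, p_obscure=0.5):
    """At each timestep, show the true screen with probability 1 - p_obscure,
    otherwise return a fully blank (obscured) frame."""
    if np.random.random() < p_obscure:
        return np.zeros_like(frame)
    return frame
```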
Need to integrate information across frames to estimate state variables such as velocity
DRQN performs well even with one input per timestep
The LSTM can integrate noisy single-frame information through time to detect high-level Pong events
DRQN given 1 frame per timestep performs comparably to DQN given a stack of 10 frames
Evaluation on Standard Atari Games
TL;DR: DRQN does roughly as well as DQN overall - it outperforms DQN on some games and does much worse on others
MDP to POMDP Generalization
DRQN and DQN policies are trained on the games without flickering
When the policies are transferred to the flickering games, DRQN retains more of its performance than DQN
Discussion & Conclusion
DRQN integrates information across time
DRQN trained on partial observability generalizes policies to complete observability and vice versa
No systematic benefit to using recurrence (DRQN) over stacking frames in the input (DQN)