A limitation of DQNs is that they learn a mapping from only a limited number of past states (e.g., the last 4 frames in Atari)
Some tasks require remembering information from more distant states
Future states + rewards then depend on more than just the DQN's input
MDP becomes POMDP
Real-world tasks have noisy state due to partial observability
E.g., one image doesn't tell you the velocity of the ball, only its position
DQN performance declines with incomplete state observations
We can use a Deep Recurrent Q-Network (DRQN): combines an LSTM with a Deep Q-Network
Deep Q Learning
Avoids feedback loops (achieves stability) through 1) experience replay, 2) a separate target network, and 3) adaptive learning rates
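A minimal sketch of how these pieces fit together (PyTorch; the network, buffer size, and learning rate are illustrative stand-ins, not the paper's values):

```python
import copy
import random
from collections import deque

import torch
import torch.nn.functional as F

replay = deque(maxlen=100_000)                    # experience replay buffer of (s, a, r, s2, done) tensors
q_net = torch.nn.Linear(4, 2)                     # stand-in for the real Q-network
target_net = copy.deepcopy(q_net)                 # separate target network
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)  # adaptive learning rate
gamma = 0.99

def train_step(batch_size=32):
    # Random minibatch from replay -> decorrelated samples
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.stack, zip(*batch))   # a: long, r/done: float
    with torch.no_grad():
        # Targets come from the frozen target network, not the network being updated
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Periodically copy weights into the target network
    target_net.load_state_dict(q_net.state_dict())
```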
Partial Observability
Sensations received by agent are only partial glimpses of underlying system state
Vanilla DQNs can't recover the underlying system state of a POMDP from a single observation
$Q(o, a \vert \theta) \neq Q(s, a \vert \theta)$
Recurrency helps narrow the gap $Q(o, a \vert \theta) \rightarrow Q(s, a \vert \theta)$ using sequences of observations.
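One way to make this concrete (notation mine, not the paper's): maintain a recurrent hidden state $h_t = \mathrm{LSTM}(h_{t-1}, o_t)$ and estimate $Q(h_t, a \vert \theta)$, so the value estimate conditions on the whole observation history $o_{1:t}$ rather than a single observation and can approach $Q(s, a \vert \theta)$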
DRQN Architecture
Take DQN and replace the first fully connected layer (the one following the convolutional layers) with a recurrent LSTM
Each frame is convolved first and then passed through the LSTM through time
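A minimal PyTorch sketch of a DRQN-style network, assuming the usual DQN conv stack and an 84×84 single-channel input (exact layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Sketch: DQN's convolutional stack, with the first fully connected
    layer replaced by an LSTM that is unrolled through time."""
    def __init__(self, n_actions, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)  # replaces the first FC layer
        self.head = nn.Linear(hidden, n_actions)                   # Q-value per action

    def forward(self, frames, hidden_state=None):
        # frames: (batch, time, 1, 84, 84) -- one frame per timestep
        b, t = frames.shape[:2]
        x = self.conv(frames.reshape(b * t, *frames.shape[2:]))    # convolve each frame
        x = x.reshape(b, t, -1)                                    # regroup into sequences
        x, hidden_state = self.lstm(x, hidden_state)               # integrate through time
        return self.head(x), hidden_state                          # Q-values at every timestep
```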
Stable Recurrent Updates
Bootstrapped Sequential Updates: episodes are selected randomly from replay memory; updates begin at the start of the episode and proceed until its conclusion
Targets generated by target Q network
Advantage: Carry hidden state forward from beginning of episode
Disadvantage: Violates DQN's random-sampling policy (adjacent experiences in an episode are correlated)
Bootstrapped Random Updates: episodes are selected randomly from replay memory; updates start at random points in the episode and unroll (backpropagate through time) for a set number of timesteps
Targets generated by target Q network
Advantage: Follows DQN random sample policy
Disadvantage: Hidden state must be zeroed at the start of each update → harder to learn dependencies spanning longer time scales than the unroll length
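A sketch of the two sampling strategies over stored episodes (names and the episode format are illustrative):

```python
import random

def sample_random_update(episodes, unroll=10):
    """Bootstrapped random updates: random episode, random sub-sequence of
    `unroll` steps; the LSTM hidden state is zeroed (None) at its start."""
    ep = random.choice(episodes)                     # episode = list of transitions
    start = random.randrange(0, max(1, len(ep) - unroll))
    return ep[start:start + unroll], None            # None -> zero hidden state

def sample_sequential_update(episodes):
    """Bootstrapped sequential updates: replay a whole episode from the start,
    carrying the hidden state forward, but samples are no longer i.i.d."""
    return random.choice(episodes), None             # hidden state zeroed only at episode start
```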
Atari Games: MDP or POMDP
Regular DQN used a stack of 4 frames as its observation → enough for Atari games to be effectively fully observable (an MDP rather than a POMDP)
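For reference, a sketch of the 4-frame stacking that makes standard Atari effectively Markov (illustrative, not the paper's code):

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keep the last k frames; the stacked array is the DQN's observation,
    restoring information (e.g., velocity) that a single frame lacks."""
    def __init__(self, k=4, shape=(84, 84)):
        self.frames = deque([np.zeros(shape, dtype=np.float32)] * k, maxlen=k)

    def push(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames)   # shape (k, 84, 84)
```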
Flickering Atari Games
To test on a POMDP, modify Atari so that at each timestep there is a 50% chance of the frame being fully obscured
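A sketch of that modification (function name is mine):

```python
import numpy as np

def flicker(frame, p_obscure=0.5):
    """At each timestep, show the true screen with probability 1 - p_obscure,
    otherwise return a fully blank (obscured) frame."""
    if np.random.random() < p_obscure:
        return np.zeros_like(frame)
    return frame
```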
Need to integrate information across frames to estimate state variables such as velocity
DRQN performs well even with one input per timestep
The LSTM can integrate noisy single-frame information through time to detect high-level Pong events
DRQN given 1 frame per timestep performs comparably to DQN given a stack of 10 frames
Evaluation on Standard Atari Games
TL;DR: DRQN does roughly as well as DQN overall - it outperforms DQN on some games and does much worse on others
MDP to POMDP Generalization
DRQN and DQN policies are trained on the games without flickering
When the policies are transferred to the flickering games, DRQN retains more of its performance than DQN
Discussion & Conclusion
DRQN integrates information across time
DRQN trained on partial observability generalizes policies to complete observability and vice versa
No systematic benefit to using recurrence (DRQN) over stacking frames in the input (DQN)