PixelCNN provides an exploration bonus to a DQN agent (bonus construction sketched after this list)
Used a mixed Monte Carlo update
Compared DQN-PixelCNN to DQN and DQN-CTS
CTS and PixelCNN both outperform the baseline agent on Montezuma's Revenge
PixelCNN is SOTA on other hard exploration games
PixelCNN outperforms CTS on 52/57 games
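A minimal sketch of how a density model such as PixelCNN can be turned into a count-based bonus via its prediction gain; `log_prob` and `update` are assumed interface names, and the constants mirror the generic pseudo-count construction rather than the paper's exact settings.

```python
import math

def exploration_bonus(density_model, state):
    """Sketch: derive a pseudo-count bonus from a density model's prediction gain.

    The density model (e.g. PixelCNN) is assumed to expose `log_prob(state)`
    and a single-step `update(state)`; these names are illustrative.
    """
    log_p_before = density_model.log_prob(state)   # log rho_n(x)
    density_model.update(state)                    # train once on x
    log_p_after = density_model.log_prob(state)    # recoding probability log rho'_n(x)

    # Prediction gain: how much more likely x became after one update.
    prediction_gain = max(log_p_after - log_p_before, 0.0)

    # Pseudo-count implied by the prediction gain: large gain -> rarely seen state.
    pseudo_count = 1.0 / (math.expm1(prediction_gain) + 1e-8)

    # Count-based bonus added to the environment reward.
    return (pseudo_count + 0.01) ** -0.5
```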
A Multi-Step RL Agent with PixelCNN
Combined PixelCNN with Reactor
Only perform updates on 25% of steps to reduce computational burden
Prediction gain is scaled by a decaying constant $c_n = 0.1\,n^{-1/2}$ (see the sketch after this list)
PixelCNN improves baseline Reactor, which itself improves on baseline DQN
On hard exploration games, Reactor can’t take advantage of the full exploration bonus
Across long horizons in sparse reward settings, propagation of reward signal is crucial
Reactor relies on $\lambda$ and the truncated importance sampling ratio, which discards off-policy trajectories $\rightarrow$ cautious learning
This cautious learning prevents it from taking full advantage of the bonus
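A rough sketch of the decayed prediction-gain bonus as summarized above: the $0.1\,n^{-1/2}$ scale and the 25% update rate come from these notes, while the skipped-step behaviour and the `log_prob`/`update` interface names are assumptions.

```python
import math
import random

class DecayedPGBonus:
    """Sketch of the PixelCNN bonus used with a Reactor-style agent.

    `density_model` is assumed to expose `log_prob(state)` and `update(state)`
    (illustrative names). Returning 0 on skipped steps is a simplification.
    """

    def __init__(self, density_model, c=0.1, update_fraction=0.25):
        self.model = density_model
        self.c = c                        # prediction-gain scale
        self.update_fraction = update_fraction
        self.n = 0                        # number of density-model updates so far

    def __call__(self, state):
        # Train the density model on only ~25% of steps to cut compute cost.
        if random.random() > self.update_fraction:
            return 0.0

        log_p_before = self.model.log_prob(state)
        self.model.update(state)
        log_p_after = self.model.log_prob(state)
        self.n += 1

        prediction_gain = max(log_p_after - log_p_before, 0.0)
        c_n = self.c / math.sqrt(self.n)            # decaying scale 0.1 * n^{-1/2}
        # Bonus ~ (pseudo-count)^{-1/2}, written directly in terms of the gain.
        return math.sqrt(math.expm1(c_n * prediction_gain))
```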
Quality of the Density Model
PixelCNN has lower and smoother prediction gain than CTS (lower variance)
Shows pronounced peaks at infrequent states
Per-step prediction gain never vanishes because the step size isn't decaying (toy illustration after this list)
Model remains mildly surprised by significant state changes
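A toy illustration of the step-size point above, using a hypothetical Bernoulli model rather than the paper's PixelCNN: with a constant step size the per-step prediction gain stays bounded away from zero, while a decaying, count-like step size drives it to zero.

```python
import math

def final_prediction_gain(constant_step=True, steps=100_000):
    """Toy Bernoulli model trained online on alternating observations 0, 1, 0, 1, ...

    With a constant step size the model keeps getting nudged back and forth,
    so its per-step prediction gain never vanishes; with a decaying step size
    the gain goes to zero once enough data has been seen.
    """
    p = 0.5                                    # model probability of observing 1
    gain = 0.0
    for n in range(1, steps + 1):
        x = n % 2                              # alternating "significant state changes"
        step = 0.01 if constant_step else 1.0 / (n + 1)
        prob_before = p if x == 1 else 1.0 - p
        p += step * (x - p)                    # one online update toward x
        prob_after = p if x == 1 else 1.0 - p
        gain = math.log(prob_after) - math.log(prob_before)
    return gain                                # prediction gain at the final step

print(final_prediction_gain(constant_step=True))   # stays clearly above 0
print(final_prediction_gain(constant_step=False))  # ~0: the model has "counted" enough
```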
Importance of the Monte Carlo Return
We need the learning algorithm to understand the transient nature of exploration bonuses
Mixed Monte Carlo (MMC) updates help do this (sketched after this list)
MMC also helps in long horizon sparse settings where rewards are far apart
The Monte Carlo return is on-policy and increases variance in the learning algorithm $\rightarrow$ can prevent convergence when training off-policy
MMC speeds up training and improves final performance on some games when used with PixelCNN over base DQN
MMC can also hurt performance in some games when using PixelCNN over base DQN
MMC + PixelCNN bonuses have a compounding effect
On hard exploration games, DQN fails completely but PixelCNN + DQN does well
Reward bonus creates denser rewards
Because bonuses are transient, the agent needs to learn the policy faster than 1-step methods allow $\rightarrow$ MMC is the solution!
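A minimal sketch of the mixed Monte Carlo update: the usual 1-step bootstrap target is interpolated with the full discounted return of the episode (which includes the exploration bonuses). The mixing coefficient value here is an illustrative assumption, not the paper's reported setting.

```python
def mixed_monte_carlo_target(reward, gamma, next_q_values, mc_return, beta=0.5):
    """Mixed Monte Carlo (MMC) target for a Q-learning-style update.

    `reward` should already include the exploration bonus, and `mc_return`
    is the discounted sum of (environment + bonus) rewards from this
    transition to the end of the episode. `beta=0.5` is an illustrative
    mixing coefficient.
    """
    one_step_target = reward + gamma * max(next_q_values)
    return (1.0 - beta) * one_step_target + beta * mc_return
```

The TD error is then `target - Q(s, a)`; mixing in the Monte Carlo return propagates the transient bonus along the whole trajectory in one update rather than one step at a time.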
Pushing the Limits of Intrinsic Motivation
In experiments, the prediction gain was clamped to avoid adverse effects on easy exploration games (knobs sketched at the end of this section)
Increasing this scale leads to stronger exploration on hard exploration games
Reaches peak performance rapidly
But it also deteriorates training stability and long-term performance
With reward clipping, exploration bonus becomes essentially constant (no prediction gain decay) $\rightarrow$ no longer useful signal for intrinsic motivation
Training on the exploration bonus only (no extrinsic reward) is another way to get a high-performing agent!
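A sketch of the knobs discussed in this section: a cap on the prediction gain, a scale on the bonus, and optional DQN-style reward clipping. The specific constants and the exact interaction with clipping are assumptions based on the notes, not the paper's reported settings.

```python
import math

def scaled_bonus(prediction_gain, n, scale=1.0, pg_cap=None, reward_clip=False):
    """Sketch of the intrinsic-reward knobs discussed above (values illustrative).

    - `pg_cap` clamps the prediction gain so easy-exploration games are not hurt.
    - `scale` strengthens exploration when increased, at the cost of training
      stability and long-term performance.
    - `reward_clip` squashes the bonus into [-1, 1]; if the scaled bonus mostly
      sits at the clip boundary, its decay over time is lost as a signal.
    """
    pg = prediction_gain if pg_cap is None else min(prediction_gain, pg_cap)
    c_n = 0.1 / math.sqrt(n)                   # decaying prediction-gain scale
    bonus = scale * math.sqrt(math.expm1(c_n * pg))
    if reward_clip:
        bonus = max(-1.0, min(1.0, bonus))     # DQN-style reward clipping
    return bonus
```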