Uses the distributional RL (KL divergence) loss → but replaces the 1-step distributional target with a multi-step variant
Target distribution: $d_t^{(n)} = (R_{t}^{(n)} + \gamma_{t}^{(n)} z,\ p_{\bar\theta}(S_{t+n}, a^*_{t+n}))$
Combine with double Q learning
Online network chooses action at $S_{t+n}$
Target network evaluates the action
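A minimal sketch (not the paper's code) of building this multi-step distributional target with double-Q action selection, in PyTorch: `online_net` and `target_net` are assumed to map a batch of states to per-action atom probabilities over a fixed support `z`, `n_step_return` is $R_t^{(n)}$, and `gamma_n` is $\gamma_t^{(n)}$ (zero if the episode ended within n steps).

```python
import torch

def rainbow_target(online_net, target_net, s_tpn, n_step_return, gamma_n, z, v_min, v_max):
    """Projected target distribution over the fixed support, shape [batch, num_atoms]."""
    num_atoms = z.shape[0]
    delta_z = (v_max - v_min) / (num_atoms - 1)

    with torch.no_grad():
        # Double Q-learning: the online network picks the greedy action at S_{t+n} ...
        q_online = (online_net(s_tpn) * z).sum(dim=-1)        # [batch, num_actions]
        a_star = q_online.argmax(dim=-1)                      # [batch]
        # ... and the target network's distribution for that action is the raw target.
        p_target = target_net(s_tpn)                          # [batch, num_actions, num_atoms]
        batch_idx = torch.arange(a_star.shape[0], device=a_star.device)
        p_a = p_target[batch_idx, a_star]                     # [batch, num_atoms]

        # Shift the support by R_t^(n), discount by gamma_t^(n), clamp to [v_min, v_max].
        tz = (n_step_return.unsqueeze(-1) + gamma_n.unsqueeze(-1) * z).clamp(v_min, v_max)

        # Categorical (C51) projection of the shifted atoms back onto the fixed support.
        b = (tz - v_min) / delta_z
        lower, upper = b.floor().long(), b.ceil().long()
        exact = (lower == upper).float()                      # atom landed exactly on the grid
        target = torch.zeros_like(p_a)
        target.scatter_add_(1, lower, p_a * (upper.float() - b + exact))
        target.scatter_add_(1, upper, p_a * (b - lower.float()))
    return target

# The training loss is then the KL divergence between this target and the online
# network's predicted distribution for (S_t, A_t).
```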
Uses proportional prioritized replay with priorities given by the KL loss → possibly more robust to noisy stochastic environments, since that loss can keep decreasing even when the returns are stochastic
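A minimal sketch of the proportional prioritization keyed on that KL loss; a flat array stands in for the sum tree used in practice, and class/argument names and defaults are illustrative.

```python
import numpy as np

class ProportionalReplay:
    """Proportional prioritized replay; priorities are the (absolute) KL losses."""

    def __init__(self, capacity, alpha=0.5, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New transitions get the current max priority so they are replayed at least once.
        max_p = self.priorities[:len(self.data)].max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta):
        # P(i) ∝ p_i^alpha; beta is the importance-sampling exponent (annealed 0.4 → 1).
        scaled = self.priorities[:len(self.data)] ** self.alpha
        probs = scaled / scaled.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, kl_losses):
        # Rainbow uses the KL loss of the distributional update as the priority signal.
        self.priorities[idx] = np.abs(kl_losses) + self.eps
```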
Dueling network architecture
Shared representation fed into separate value and advantage streams
Streams aggregated, then a per-action softmax gives the return distributions
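A minimal PyTorch sketch of such a dueling distributional head; layer sizes are illustrative, and the shared representation would come from the usual convolutional torso.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingDistributionalHead(nn.Module):
    def __init__(self, feature_dim, num_actions, num_atoms, hidden=512):
        super().__init__()
        self.num_actions, self.num_atoms = num_actions, num_atoms
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, num_atoms))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_actions * num_atoms))

    def forward(self, features):
        v = self.value(features).view(-1, 1, self.num_atoms)
        a = self.advantage(features).view(-1, self.num_actions, self.num_atoms)
        # Aggregate the streams (subtracting the mean advantage), then softmax over
        # atoms to get one return distribution per action.
        logits = v + a - a.mean(dim=1, keepdim=True)
        return F.softmax(logits, dim=-1)          # [batch, num_actions, num_atoms]
```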
Replace all linear layers with noisy linear layers (factorised Gaussian noise)
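A minimal sketch of a noisy linear layer with factorised Gaussian noise (NoisyNet-style); the initialisation details are illustrative, with the $\sigma_0 = 0.5$ default matching the value noted under hyperparameter tuning below.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights/biases get learnable, factorised Gaussian noise."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)
        nn.init.constant_(self.weight_sigma, sigma0 / math.sqrt(in_features))
        nn.init.constant_(self.bias_sigma, sigma0 / math.sqrt(in_features))

    @staticmethod
    def _f(x):
        # Factorised-noise transform f(x) = sign(x) * sqrt(|x|).
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Fresh noise each forward pass; the learned sigmas drive exploration,
        # so no epsilon-greedy schedule is needed.
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        weight = self.weight_mu + self.weight_sigma * torch.outer(eps_out, eps_in)
        bias = self.bias_mu + self.bias_sigma * eps_out
        return F.linear(x, weight, bias)
```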
Experimental Methods
Evaluation Methodology
Tested on 57 Atari games
Scores normalized per game against human expert and random baselines (normalization sketched below)
Test with random starts (up to 30 random no-op actions inserted at the start of each episode)
Test with human starts (start points sampled from human expert trajectories)
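A minimal sketch of the per-game normalization referenced above; the random/human baseline scores come from published tables and are not reproduced here.

```python
def human_normalized_score(agent_score, random_score, human_score):
    """0% = random-policy level, 100% = human-expert level on a given game."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# The headline numbers in the paper are medians of this score across the 57 games,
# e.g. numpy.median([human_normalized_score(a, r, h) for (a, r, h) in per_game_scores]),
# where per_game_scores is a hypothetical list of (agent, random, human) score tuples.
```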
Hyperparameter Tuning
Number of hyperparameters too large for search
Perform limited tuning
DQN waits 200K frames before learning starts to avoid temporal correlations → with prioritized replay, learning can start after 80K frames
DQN uses annealing to decrease exploration rate from 1 to 0.1
With noisy nets, act greedily ($\epsilon = 0$), with $\sigma_0 = 0.5$ used to initialise the noisy layers' standard deviations
Without noisy nets, use epsilon greedy but decrease $\epsilon$ faster
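A small sketch of these exploration schedules; the frame counts and the faster-decay endpoint are illustrative placeholders, since the notes only fix the 1 → 0.1 range and $\epsilon = 0$ with noisy nets.

```python
def linear_epsilon(frame, eps_start=1.0, eps_end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_frames."""
    frac = min(frame / anneal_frames, 1.0)
    return eps_start + frac * (eps_end - eps_start)

dqn_eps   = lambda f: linear_epsilon(f)                                       # DQN: 1 -> 0.1
fast_eps  = lambda f: linear_epsilon(f, eps_end=0.01, anneal_frames=250_000)  # faster decay (illustrative numbers)
noisy_eps = lambda f: 0.0                                                     # noisy nets: act greedily
```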
Adam Optimizer
For prioritized replay, use proportional variant with importance sampling exponent increased from 0.4 to 1 over training
The number of steps n for multi-step targets was set to 3 (both n = 3 and n = 5 performed better than single-step targets)
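A minimal sketch of the truncated n-step return $R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1}$ that feeds the multi-step target above; the discount value is the usual Atari setting and is illustrative here.

```python
def n_step_return(rewards, dones, gamma=0.99, n=3):
    """rewards[k]/dones[k] correspond to step t+k+1; returns (R_t^(n), gamma_t^(n))."""
    g, discount = 0.0, 1.0
    for r, done in zip(rewards[:n], dones[:n]):
        g += discount * r
        discount *= gamma
        if done:                 # stop bootstrapping past the episode boundary
            discount = 0.0
            break
    return g, discount           # discount is gamma^n, or 0 if the episode ended
```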
Analysis
Rainbow outperforms all of the baselines (A3C, DQN, DDQN, Prioritized DDQN, Dueling DDQN, Distributional DQN, and Noisy DQN)
Both in data efficiency + final performance
Matches DQN's final performance after 7M frames (DQN needs 44M)
153% human performance with human starts and 223% with no-ops
Learning speed varied by at most ~20% across the variants
Ablation Studies
PER + multi-step learning were the most important parts
Removing either hurt early performance
Removing multi-step hurt final performance
Distributional Q learning 3rd most important
Little difference early in training, but the ablation lags behind later on
Noisy nets also helped in most games → removing them caused a drop in performance in some games but an increase in others
No significant difference when removing dueling networks
Removing double Q-learning also caused no significant difference in median performance (clipping values to the distribution's support may already limit overestimation)
Discussion
Rainbow is based on value-based methods
Has not considered policy-based methods such as TRPO, or actor-critic methods
Value Alternatives
Optimality tightening: uses multi-step returns to construct additional inequality bounds, instead of using them to replace the 1-step targets in Q-learning
Eligibility traces combine n-step returns across many n
These require more computation per gradient than a single n-step target + it is unclear how best to combine them with prioritized replay
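To make "combine n-step returns across many n" concrete, a minimal sketch of the $\lambda$-return, a geometrically weighted mixture of n-step returns (assumes $\lambda < 1$ and that the last entry is the longest available return).

```python
def lambda_return(n_step_returns, lam=0.9):
    """n_step_returns[i] is G_t^(i+1); returns (1 - lam) * sum_n lam^(n-1) * G_t^(n),
    with the leftover geometric weight assigned to the final (longest) return."""
    total, weight = 0.0, 1.0 - lam
    for g in n_step_returns[:-1]:
        total += weight * g
        weight *= lam
    total += (weight / (1.0 - lam)) * n_step_returns[-1]
    return total
```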
Episodic control: better data efficiency + improves learning by using episodic memory as a complementary system (can re-enact successful action sequences)
Other Exploration Schemes
Bootstrapped DQN
Intrinsic Motivation
Count based exploration
Computational Architecture
Asynchronous learning from parallel environments (A3C)