Rainbow - Combining Improvements in Deep Reinforcement Learning

Resources
- Paper
Introduction
- DQN: Combination of CNNs + Experience Replay
- DDQN: Addresses overestimation bias
- PER: Improves data efficiency
- Dueling Architecture: Generalizes across actions by separating value and action advantages
- A3C: Multi-step bootstrap targets to shift bias-variance trade off + propagate newly observed rewards
- Distributional Q-Learning: Learns categorical distribution of discounted returns instead of estimating mean
- Noisy DQN: Stochastic network layers for exploration
- Rainbow combines all aforementioned improvements
Background
- MDP: Formalizes the interaction between agent and environment
- Agent tries to maximize expected discounted reward
- Estimate value function and use epsilon greedy policy for exploration
- Deep RL + DQN
  - Intractable state / action spaces require us to use neural nets and use gradient descent to minimize some loss
  - Use squared TD error for loss + gradients propagated backwards through online network (selects actions)
  - Target network has parameters copied over every so often → used to estimate current q-value → stability
  - Experience Replay → stability
Extensions to DQN
- Double Q Learning
  - Action selection and evaluation decoupled to avoid maximization bias
- Prioritized Replay
  - Samples transitions with a probability that is based on the magnitude of TD error
- Dueling Network
  - Two streams of computation - one for advantages and one for state-value functions
- Multistep Learning
  - Instead of just using immediate reward + discounted bootstrap, you can use n steps of rewards for target:
    - $R_t^{(n)} = \sum_{k=0}^{n-1} = \gamma_t^{(k)}R_{t+k+1}$
- Distributional RL
  - You try to learn the distribution of returns instead of expected return
    - Distribution: $d_t = (R_{t+1} + \gamma_{t+1}z, p_\theta(S_t, A_t))$
  - Approximate a distribution and update parameters to match actual distribution as closely as possible
  - Learn probability masses - return distribution satisfy variant of bellman equation
    - Take a distribution of returns, contract it by discount factor, and shift it by new distribution of rewards
    - Minimize KL divergence between target and current distribution
- Noisy Nets
  - Epsilon greedy policies limited in sparse reward settings
  - Noisy nets incorporate deterministic and noisy stream in linear layer
  - $y = (b + Wx) + ({noisy} \odot \epsilon^b + W_{noisy} \odot \epsilon^w)x$
    - Epsilons are random variables
  - Over time, network learns to ignore noise
The Integrated Agent
- Using distributional RL KL divergence loss → makes it multi-step instead of 1 step
  - New distribution: $d_t = (R_{t}^{(n)} + \gamma_{t}^{(n)}z, p_\theta(S_{t+n}, a^*_{t+n}))$
- Combine with double Q learning
  - Online network chooses action at $S_{t+n}$
  - Target network evaluates the action
- Uses prioritized proportional replay based on DL loss → more robust to noisy stochastic environments
- Dueling network architecture
  - Shared representation fed into value and advantage stream
  - Streams aggregated and fed into softmax for return distribution
  - Replace all layers with noisy layers with Gaussian noise
Experimental Methods
- Evaluation Methodology
  - Tested on 57 atari games
  - Scores normalized and compared to human expert baselines
  - Test with random starts (insert 30 no-op actions)
  - Test with human starts (sample points from human expert data)
- Hyperparameter Tuning
  - Number of hyperparameters too large for search
  - Perform limited tuning
  - DQN uses 200k learning starts to ensure no temporal correlations → with prioritized replay, we can learn after 80k
  - DQN uses annealing to decrease exploration rate from 1 to 0.1
    - With noisy nets, we act greedily ($\epsilon = 0$) with value of 0.5 for standard deviation
    - Without noisy nets, use epsilon greedy but decrease $\epsilon$ faster
  - Adam Optimizer
  - For prioritized replay, use proportional variant with importance sampling exponent increased from 0.4 to 1 over training
  - Multi-step number of steps was set to 3 (both 3 and 5 performed better than single step)
Analysis
- Rainbow is better than any of the baselines ( A3C, DQN, DDQN, Prioritized DDQN, Dueling DDQN, Distributional DQN, and Noisy DQN)
  - Both in data efficiency + final performance
  - Match DQN in 7M frames (vs 44M)
  - 153% human performance with human starts and 223% with no-ops
- Learning speed varied by 20% across all variants
- Ablation Studies
  - PER + multi-step learning were the most important parts
    - Removing both hurt early performance
    - Removing multi-step hurt final performance
  - Distributional Q learning 3rd most important
    - Not much performance difference early on but lags later on in training
  - Noisy nets also helped in most games → drop in performance in some but increase in performance in others
  - No significant difference when removing dueling networks
  - Double Q learning caused significant difference in median performance
Discussion
- Rainbow is based on value-based methods
  - Have not considered policy based methods like TRPO or actor-critic methods
- Value Alternatives
  - Optimality tightening: constructs additional inequality bounds instead of replacing 1-step targets in Q learning
  - Eligibility traces combine n-step returns across many n
  - Single step methods = more computation per gradient than n-step + need to determine how to store in PER
  - Episodic control = better data efficiency + improves learning using episodic memory as complementary system (can re-enact successful actions sequences)
- Other Exploration Schemes
  - Bootstrapped DQN
  - Intrinsic Motivation
  - Count based exploration
- Computational Architecture
  - Asynchronous learning from parallel environments (A3C)
  - Gorila
  - Evolution Strategies
  - Hierarchal RL
  - State Representation via pixel + feature control

April 5, 2025 · research