We are trying to adjust $f_\theta$ to match a random function $f_{\theta^\ast}$ sampled from a prior
If we view an ensemble of networks as samples from a posterior, then minimizing the loss corresponds to approximating the posterior
Distillation error is just the special case where the regression targets are $y_i = 0$
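A compact way to see this, following the ensemble-as-posterior view above (the regression notation below is an assumption made to spell out the argument): given data $\{(x_i, y_i)\}$, each ensemble member $k$ draws $\theta^\ast_k$ from the prior and solves

$$\theta_k = \arg\min_{\theta} \sum_i \left\| f_\theta(x_i) + f_{\theta^\ast_k}(x_i) - y_i \right\|^2 + \mathcal{R}(\theta),$$

where $\mathcal{R}$ is a regularization term coming from the prior. Setting $y_i = 0$ for all $i$ leaves exactly the problem of distilling the randomly drawn function $f_{\theta^\ast_k}$ (up to sign), so the distillation error can be read as the same kind of uncertainty estimate.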
Combining Intrinsic and Extrinsic Returns
Treating the problem as non-episodic resulted in better exploration (the return is not truncated at game over)
Intrinsic return should be related to all novel states across all episodes
Treating the intrinsic reward as episodic could also leak information about the task to the agent
A non-episodic extrinsic reward stream, however, can be exploited by repeatedly collecting a reward near the beginning of the game and deliberately resetting
Decompose reward into $R = R_E + R_I$
Fit two value heads and get the combined value function as the sum: $V = V_E + V_I$
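A minimal sketch of how the two streams might be combined in practice (NumPy; `gae`, `r_ext`, `r_int`, `v_ext`, `v_int`, and the hyperparameters are illustrative names and values, not taken from the paper):

```python
import numpy as np

def gae(rewards, values, dones, gamma, lam=0.95, episodic=True):
    """Generalized advantage estimation for one reward stream.
    rewards and dones have length T; values has length T + 1 (bootstrap value
    included). For the intrinsic stream, pass episodic=False so the return is
    not truncated at 'game over'."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = (1.0 - dones[t]) if episodic else 1.0
        delta = rewards[t] + gamma * nonterminal * values[t + 1] - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv

# Two value heads V_E and V_I; the combined advantage is the sum of the streams:
# adv = gae(r_ext, v_ext, dones, gamma=0.999, episodic=True) \
#     + gae(r_int, v_int, dones, gamma=0.99,  episodic=False)
```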
Reward and Observation Normalization
Reward Normalization:
Scale of reward can vary between environments and through time
Normalize the intrinsic reward by dividing it by a running estimate of the standard deviation of the intrinsic returns
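A sketch of the running-statistics bookkeeping this implies (a standard parallel-variance update; class and variable names are illustrative):

```python
import numpy as np

class RunningStd:
    """Running mean/variance estimate; only the standard deviation is used
    to rescale the intrinsic reward."""
    def __init__(self, eps=1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, eps

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        b_mean, b_var, b_count = x.mean(), x.var(), x.size
        delta = b_mean - self.mean
        total = self.count + b_count
        new_mean = self.mean + delta * b_count / total
        m2 = self.var * self.count + b_var * b_count \
             + delta ** 2 * self.count * b_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    @property
    def std(self):
        return float(np.sqrt(self.var))

# r_int_normed = r_int / (running.std + 1e-8)
```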
Observation Normalization:
In RND, the target network's parameters are frozen and cannot adjust to the scale of different datasets
Without normalization, the variance of the embedding can be extremely low, carrying little information about the inputs
Normalize by whitening each dimension: subtract the running mean, divide by the running standard deviation, and clip observations to $[-5, 5]$
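A corresponding sketch for the observation side (assumes per-dimension running mean and standard deviation are tracked the same way as the reward statistics above; names are illustrative):

```python
import numpy as np

def normalize_obs(obs, obs_mean, obs_std, clip=5.0):
    """Whiten each observation dimension with running statistics and clip,
    so the frozen target network sees inputs on a consistent scale."""
    return np.clip((obs - obs_mean) / (obs_std + 1e-8), -clip, clip)
```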
Experiments
Experiments run on Montezuma’s Revenge
Pure Exploration
Comparing episodic and non-episodic exploration, non-episodic has better exploration performance
Mean episodic return: the agent is not optimizing it directly, but as it explores more rooms, the return goes up anyway
Combining Episodic and Non-Episodic Returns
Non-episodic reward stream increases the number of rooms explored
Effect is less dramatic than in pure exploration because the extrinsic reward preserves useful behaviors
Two value heads didn't show a benefit over a single head in the episodic setting
Discount Factors
Extrinsic discount factor: Increasing this from $0.99 \rightarrow 0.999$ improves performance
Intrinsic discount factor: Increasing this from $0.99 \rightarrow 0.999$ hurts performance
Scaling Up Training
To hold the rate at which the intrinsic reward decreases over time constant across experiments with different numbers of parallel environments, downsample the predictor's batch size to match that of 32 parallel environments (sketched below)
More environments = larger policy batch size, but constant predictor network batch size
Policy needs to quickly learn to find and exploit intrinsic rewards, since they disappear over time
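A minimal sketch of the downsampling idea (assuming the predictor loss is a per-sample squared error over features; function and argument names are illustrative):

```python
import numpy as np

def predictor_loss(pred_feats, target_feats, num_envs, ref_envs=32, rng=None):
    """Mean squared error between predictor and frozen target features,
    averaged over a random subset of samples so that the effective predictor
    batch matches that of `ref_envs` parallel environments."""
    rng = rng or np.random.default_rng()
    per_sample = ((pred_feats - target_feats) ** 2).mean(axis=-1)
    keep_prob = min(1.0, ref_envs / num_envs)
    mask = (rng.random(per_sample.shape[0]) < keep_prob).astype(np.float64)
    return (per_sample * mask).sum() / np.maximum(mask.sum(), 1.0)
```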
Recurrence
Montezuma’s Revenge is a partially observable environment
With a larger discount factor, recurrent policies performed better than CNNs
Across multiple games, recurrent policies do better more frequently than CNNs
Comparison to Baselines
Compare RND to PPO on various games
Gravitar:
RND does not consistently exceed PPO performance
Both exceed average human performance with an RNN policy, as well as prior SOTA
Montezuma’s Revenge + Venture: RND outperforms PPO, SOTA, and average human performance
Pitfall: Both algorithms fail to find positive rewards
PrivateEye: RND exceeds PPO
Solaris: RND is comparable to PPO
Exploration bonus based on forward dynamics error:
Change the RND loss so that the predictor predicts the random features of the next observation given the current observation and action (sketched below)
Performs significantly worse than RND on Montezuma's Revenge, PrivateEye, and Solaris, and similarly on Venture, Pitfall, and Gravitar
The agent oscillates between two rooms in Montezuma's Revenge, where non-determinism keeps the prediction error high
Similar behavior in PrivateEye and Pitfall
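A rough sketch of this variant (PyTorch; the architecture, layer sizes, and input shapes are illustrative, not the paper's exact networks):

```python
import torch
import torch.nn as nn

class ForwardDynamicsPredictor(nn.Module):
    """Variant predictor: given current observation features and action, predict
    the frozen random target network's features of the *next* observation."""
    def __init__(self, obs_dim, num_actions, feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + num_actions, 512), nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, obs, action_onehot):
        return self.net(torch.cat([obs, action_onehot], dim=-1))

def dynamics_bonus(predictor, target_net, obs, action_onehot, next_obs):
    """Per-transition bonus = squared error against the target features of the
    next observation; stays high wherever the dynamics are hard to predict
    (e.g. the non-determinism behind the two-room oscillation noted above)."""
    with torch.no_grad():
        phi_next = target_net(next_obs)  # frozen random features of s_{t+1}
    pred = predictor(obs, action_onehot)
    return ((pred - phi_next) ** 2).mean(dim=-1)
```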
Qualitative Analysis: Dancing with Skulls
Once the agent has obtained all the extrinsic rewards it knows how to obtain, it keeps interacting with dangerous objects
Dangerous states are difficult to reach and hence rare in its past experience