We are trying to adjust $f_\theta$ to match a random function $f_{\theta^\ast}$ sampled from a prior
If we view an ensemble of networks as samples from a posterior, then minimizing the loss corresponds to approximating the posterior
Distillation error is just the special case where the regression targets are $y_i = 0$
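A compact way to see this, following the ensemble-as-posterior view above (the regression notation below is an assumption made to spell out the argument): given data $\{(x_i, y_i)\}$, each ensemble member $k$ draws $\theta^\ast_k$ from the prior and solves

$$\theta_k = \arg\min_{\theta} \sum_i \left\| f_\theta(x_i) + f_{\theta^\ast_k}(x_i) - y_i \right\|^2 + \mathcal{R}(\theta),$$

where $\mathcal{R}$ is a regularization term coming from the prior. Setting $y_i = 0$ for all $i$ leaves exactly the problem of distilling the randomly drawn function $f_{\theta^\ast_k}$ (up to sign), so the distillation error can be read as the same kind of uncertainty estimate.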
Combining Intrinsic and Extrinsic Returns
Treating the problem as non-episodic resulted in better exploration (the return is not truncated at game over)
Intrinsic return should be related to all novel states across all episodes
Treating the intrinsic reward as episodic could also leak information about the task to the agent
A non-episodic extrinsic reward stream, however, can be exploited by repeatedly collecting a reward near the beginning of the game and deliberately resetting
Decompose reward into $R = R_E + R_I$
Fit two value heads and get the combined value function as the sum: $V = V_E + V_I$
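A minimal sketch of how the two streams might be combined in practice (NumPy; `gae`, `r_ext`, `r_int`, `v_ext`, `v_int`, and the hyperparameters are illustrative names and values, not taken from the paper):

```python
import numpy as np

def gae(rewards, values, dones, gamma, lam=0.95, episodic=True):
    """Generalized advantage estimation for one reward stream.
    rewards and dones have length T; values has length T + 1 (bootstrap value
    included). For the intrinsic stream, pass episodic=False so the return is
    not truncated at 'game over'."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = (1.0 - dones[t]) if episodic else 1.0
        delta = rewards[t] + gamma * nonterminal * values[t + 1] - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv

# Two value heads V_E and V_I; the combined advantage is the sum of the streams:
# adv = gae(r_ext, v_ext, dones, gamma=0.999, episodic=True) \
#     + gae(r_int, v_int, dones, gamma=0.99,  episodic=False)
```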
Reward and Observation Normalization
Reward Normalization:
Scale of reward can vary between environments and through time
Normalize the intrinsic reward by dividing it by a running estimate of the standard deviation of the intrinsic returns
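A sketch of the running-statistics bookkeeping this implies (a standard parallel-variance update; class and variable names are illustrative):

```python
import numpy as np

class RunningStd:
    """Running mean/variance estimate; only the standard deviation is used
    to rescale the intrinsic reward."""
    def __init__(self, eps=1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, eps

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        b_mean, b_var, b_count = x.mean(), x.var(), x.size
        delta = b_mean - self.mean
        total = self.count + b_count
        new_mean = self.mean + delta * b_count / total
        m2 = self.var * self.count + b_var * b_count \
             + delta ** 2 * self.count * b_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    @property
    def std(self):
        return float(np.sqrt(self.var))

# r_int_normed = r_int / (running.std + 1e-8)
```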
Observation Normalization:
In RND, the target network's parameters are frozen and cannot adjust to the scale of different datasets
Without normalization, the variance of the embedding can be extremely low, carrying little information about the inputs
Normalize by whitening each dimension: subtract the running mean, divide by the running standard deviation, and clip observations to $[-5, 5]$
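A corresponding sketch for the observation side (assumes per-dimension running mean and standard deviation are tracked the same way as the reward statistics above; names are illustrative):

```python
import numpy as np

def normalize_obs(obs, obs_mean, obs_std, clip=5.0):
    """Whiten each observation dimension with running statistics and clip,
    so the frozen target network sees inputs on a consistent scale."""
    return np.clip((obs - obs_mean) / (obs_std + 1e-8), -clip, clip)
```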
Experiments
Experiments run on Montezuma’s Revenge
Pure Exploration
Comparing episodic and non-episodic exploration, non-episodic has better exploration performance
Mean episodic return: the agent is not optimizing it directly, but as it explores more rooms, the return goes up anyway
Combining Episodic and Non-Episodic Returns
Non-episodic reward stream increases the number of rooms explored
Effect is less dramatic than in pure exploration because the extrinsic reward preserves useful behaviors
Two value heads didn't show a benefit over a single head in the episodic setting
Discount Factors
Extrinsic discount factor: Increasing this from $0.99 \rightarrow 0.999$ improves performance
Intrinsic discount factor: Increasing this from $0.99 \rightarrow 0.999$ hurts performance
Scaling Up Training
To hold the rate at which the intrinsic reward decreases over time constant across experiments with different numbers of parallel environments, downsample the predictor's batch size to match that of 32 parallel environments (sketched below)
More environments = larger policy batch size, but constant predictor network batch size
Policy needs to quickly learn to find and exploit intrinsic rewards, since they disappear over time
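A minimal sketch of the downsampling idea (assuming the predictor loss is a per-sample squared error over features; function and argument names are illustrative):

```python
import numpy as np

def predictor_loss(pred_feats, target_feats, num_envs, ref_envs=32, rng=None):
    """Mean squared error between predictor and frozen target features,
    averaged over a random subset of samples so that the effective predictor
    batch matches that of `ref_envs` parallel environments."""
    rng = rng or np.random.default_rng()
    per_sample = ((pred_feats - target_feats) ** 2).mean(axis=-1)
    keep_prob = min(1.0, ref_envs / num_envs)
    mask = (rng.random(per_sample.shape[0]) < keep_prob).astype(np.float64)
    return (per_sample * mask).sum() / np.maximum(mask.sum(), 1.0)
```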
Recurrence
Montezuma’s Revenge is a partially observable environment
With a larger discount factor, recurrent policies performed better than CNNs
Across multiple games, recurrent policies do better more frequently than CNNs
Comparison to Baselines
Compare RND to PPO on various games
Gravitar:
RND does not consistently exceed PPO performance
Both exceed average human performance with an RNN policy, as well as prior SOTA
Montezuma’s Revenge + Venture: RND outperforms PPO, SOTA, and average human performance
Pitfall: Both algorithms fail to find positive rewards
PrivateEye: RND exceeds PPO
Solaris: RND is comparable to PPO
Exploration bonus based on forward dynamics error:
Change the RND loss so that the predictor predicts the random features of the next observation given the current observation and action (sketched below)
Performs significantly worse than RND on Montezuma's Revenge, PrivateEye, and Solaris, and similarly on Venture, Pitfall, and Gravitar
The agent oscillates between two rooms in Montezuma's Revenge, where non-determinism keeps the prediction error high
Similar behavior in PrivateEye and Pitfall
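A rough sketch of this variant (PyTorch; the architecture, layer sizes, and input shapes are illustrative, not the paper's exact networks):

```python
import torch
import torch.nn as nn

class ForwardDynamicsPredictor(nn.Module):
    """Variant predictor: given current observation features and action, predict
    the frozen random target network's features of the *next* observation."""
    def __init__(self, obs_dim, num_actions, feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + num_actions, 512), nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, obs, action_onehot):
        return self.net(torch.cat([obs, action_onehot], dim=-1))

def dynamics_bonus(predictor, target_net, obs, action_onehot, next_obs):
    """Per-transition bonus = squared error against the target features of the
    next observation; stays high wherever the dynamics are hard to predict
    (e.g. the non-determinism behind the two-room oscillation noted above)."""
    with torch.no_grad():
        phi_next = target_net(next_obs)  # frozen random features of s_{t+1}
    pred = predictor(obs, action_onehot)
    return ((pred - phi_next) ** 2).mean(dim=-1)
```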
Qualitative Analysis: Dancing with Skulls
Once the agent has obtained all the extrinsic rewards it knows how to obtain, it keeps interacting with dangerous objects
Dangerous states are difficult to reach and hence rare in its past experience