Curiosity-driven Exploration by Self-supervised Prediction
-
Resources
-
Introduction
- Random exploration under sparse rewards only works well in small environments
- Curiosity: Learning new skills that may be useful for pursuing rewards later
- Classes of intrinsic rewards:
- Encourage agent to explore novel states
- Encourage agent to take actions that reduce its uncertainty about the environment (by predicting the consequences of its actions)
- Measuring novelty requires building dynamics model
- Hard in high dimensional state spaces (e.g., images)
- Hard to deal with stochasticity
- Another issue is dealing with states that are visually distinct but functionally similar
- Key insight of ICM: predict changes to environment that are a consequence of the action; ignore everything else
- Transform raw states to latent state
- Predict agent’s action given current and next state
- Since we predict action, network has no incentive to include factors of variation not induced by the action
- Train forward dynamics model; predict next state given current state and action
- Prediction error acts as intrinsic reward
- Curiosity
- Helps explore environment for new knowledge
- Helps learn skills for future scenarios
-
Curiosity-Driven Exploration
- Two subsystems
- Reward generator that outputs curiosity driven reward signal
- Policy that outputs sequence of actions to maximize reward signal
- Maximize rewards $r_t = r^i_t + r^e_t$ (sum of extrinsic and intrinsic rewards)
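A minimal sketch of how the two subsystems interact at each step; `env`, `policy`, and `icm` are hypothetical objects standing in for the environment, the policy, and the reward generator (none of these names come from the paper):

```python
def rollout_step(env, policy, icm, s_t):
    """One interaction step: the policy acts, the ICM scores the transition
    with a curiosity bonus, and the learner trains on the summed reward."""
    a_t = policy.act(s_t)
    s_next, r_ext, done = env.step(a_t)             # extrinsic reward r^e_t
    r_int = icm.intrinsic_reward(s_t, a_t, s_next)  # intrinsic reward r^i_t
    return s_next, r_ext + r_int, done              # r_t = r^i_t + r^e_t
```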
-
Prediction error as curiosity reward
- Predicting pixels is both difficult and may not be the right objective
- E.g., inherent changes in the environment not caused by the agent's actions keep the prediction error high
- Parts of state space can’t be modeled
- Agent is unaware of this and can fall into an artificial curiosity trap, stalling exploration
- Things that can change environment
- Controllable factors
- Uncontrollable factors that can impact agent
- Uncontrollable factors that cannot impact agent
- A good feature space should model the first two and be invariant to the third
-
Self-supervised prediction for exploration
- 2 sub-modules of a deep neural network
- One that encodes $s_t$ into feature $\phi(s_t)$
- One that takes $\phi(s_t), \phi(s _{t+1})$ and predicts $a_t$
- Inverse dynamics model: $\hat{a}_t = g(s_t, s _{t+1}; \theta_I)$
- Optimize $\min _{\theta_I} L_I(\hat{a}_t, a_t)$
- Train another network that predicts $\phi(s _{t+1})$ given $\phi(s_t), a_t$
- Forward dynamics model: $\hat{\phi}(s _{t+1}) = f(\phi(s_t), a_t; \theta_F)$
- Loss: $L_F(\phi(s _{t+1}), \hat{\phi}(s _{t+1})) = \frac{1}{2}\vert \vert \hat{\phi}(s _{t+1}) - \phi(s _{t+1}) \vert \vert^2_2$
- Intrinsic reward: $r^i_t = \frac{\eta}{2}\vert \vert \hat{\phi}(s _{t+1}) - \phi(s _{t+1}) \vert \vert^2_2$
- Intrinsic Curiosity Module (ICM); see the code sketch at the end of this section:
- Inverse model learns feature space
- Forward model makes prediction in feature space
- Overall optimization problem
- $\min _{\theta_P, \theta_I, \theta_F}\left[- \lambda \mathbb{E} _{\pi(s_t; \theta_P)}[\sum_t r_t] + (1 - \beta)L_I + \beta L_F\right]$
- $\beta$: Scalar that weighs inverse model loss against forward model loss
- $\lambda$: Weights importance of policy gradient loss vs importance of learning intrinsic reward
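A compact PyTorch reconstruction of the ICM losses from the equations above. This is my own sketch, not the authors' code: the MLP encoder, layer sizes, discrete-action assumption, and the choice to stop gradients from $L_F$ into the encoder are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ICM(nn.Module):
    """Sketch of the Intrinsic Curiosity Module for discrete actions:
    an encoder phi, an inverse model g(phi(s_t), phi(s_{t+1})) -> a_t,
    and a forward model f(phi(s_t), a_t) -> phi(s_{t+1})."""

    def __init__(self, obs_dim: int, n_actions: int,
                 feat_dim: int = 288, eta: float = 0.01):
        super().__init__()
        self.eta = eta  # scaling factor for the intrinsic reward
        self.n_actions = n_actions
        # Encoder phi: a plain MLP for brevity (the paper uses a conv net on pixels).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ELU(), nn.Linear(256, feat_dim))
        # Inverse model: predicts a_t from the concatenated features.
        self.inverse = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ELU(), nn.Linear(256, n_actions))
        # Forward model: predicts phi(s_{t+1}) from phi(s_t) and one-hot a_t.
        self.fwd = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 256), nn.ELU(),
            nn.Linear(256, feat_dim))

    def forward(self, s_t, s_next, a_t):
        # a_t: LongTensor of action indices, shape (batch,)
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        # L_I: cross-entropy between predicted and taken action. Here it is the
        # only loss shaping phi, so phi keeps just what the action influences.
        logits = self.inverse(torch.cat([phi_t, phi_next], dim=1))
        L_I = F.cross_entropy(logits, a_t)
        # L_F: squared error in feature space. phi is detached so the forward
        # model cannot collapse the features to make itself trivially
        # predictable (an implementation choice, not spelled out above).
        a_onehot = F.one_hot(a_t, self.n_actions).float()
        phi_pred = self.fwd(torch.cat([phi_t.detach(), a_onehot], dim=1))
        L_F = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=1).mean()
        # Intrinsic reward r^i_t = (eta / 2) * ||phi_pred - phi_next||_2^2
        r_int = (self.eta * 0.5 * (phi_pred - phi_next).pow(2).sum(dim=1)).detach()
        return L_I, L_F, r_int
```

These outputs then plug into the overall objective as `(1 - beta) * L_I + beta * L_F` added to the negated, λ-weighted policy objective, with β = 0.2 and λ = 0.1 as in the paper.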
-
Experimental Setup
- Environments: VizDoom and Super Mario Bros
- Training details:
- Visual inputs
- Preprocessed to 42x42 and grayscale
- Concatenate 4 frames at a time to capture temporal information
- Use action repeat during training but not inference
- A3C algorithm with 20 workers
- A3C Architecture
- 4 convolutional layers
- Exponential linear unit after each convolution
- LSTM after convolution
- 2 linear layers to predict value and action from LSTM representation
- ICM Architecture
- Inverse Model
- 4 convolution layers
- Exponential linear unit after each convolution
- Current and next state feature vectors concatenated for single representation
- 2 fully connected layers to predict the action
- Forward Model
- Concatenate $a_t$ and $\phi(s_t)$
- 2 fully connected layers
- $\beta = 0.2, \lambda = 0.1$ for overall optimization equation
- Compared methods
- ICM + A3C (the proposed approach)
- A3C with $\epsilon$-greedy exploration
- ICM-pixels + A3C (ICM without the inverse model)
- Remove inverse model and add deconvolutions to forward model
- Doesn’t learn invariant embeddings for uncontrollable parts
- Represents information gain based on direct observations
- VIME + TRPO
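A sketch of the observation pipeline described above (42x42 grayscale, 4-frame stack), assuming RGB uint8 input frames and using OpenCV for resizing; the library choice and class names are mine, not the paper's:

```python
from collections import deque

import cv2
import numpy as np


def preprocess(frame: np.ndarray) -> np.ndarray:
    """RGB uint8 frame (H, W, 3) -> normalized grayscale float32 (42, 42)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (42, 42), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0


class FrameStack:
    """Keeps the last k preprocessed frames as the model input, giving the
    otherwise single-frame observation some temporal context."""

    def __init__(self, k: int = 4):
        self.frames: deque = deque(maxlen=k)

    def reset(self, frame: np.ndarray) -> np.ndarray:
        first = preprocess(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(first)  # pad the stack with the first frame
        return np.stack(self.frames)   # shape (4, 42, 42)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess(frame))
        return np.stack(self.frames)
```

Action repeat (executing each chosen action for several consecutive frames during training) would wrap the environment's step function one level above this.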
-
Experiments
-
Sparse Extrinsic Reward Setting
- Vary the distance between spawn and goal (the farther apart, the lower the chance of stumbling on the goal, so the sparser the reward)
- Dense: agent spawned uniformly at random over all possible spawn locations
- Not hard and goal can be reached with $\epsilon$-greedy
- Sparse and very sparse: spawn in rooms 270 and 350 steps away from the goal, respectively
- Requires long directed sequence of actions
- A3C degrades with sparser rewards
- Curious A3C agents are better in all cases + learn faster, indicating more efficient exploration
- ICM-pixels not as good because it's hard to learn pixel-prediction models as the number of textures increases
- In sparse, only baseline A3C fails
- In very sparse, A3C and ICM-pixels fail
- Uncontrollable Dynamics
- Augment the state with a region of white noise that the agent's actions cannot influence
- ICM-pixels fails here but ICM succeeds
- Comparison to TRPO-VIME
- TRPO-VIME has better sample efficiency
- ICM has better convergence rates and accuracy
-
No Reward Setting
- ICM performs well even in the absence of extrinsic reward
- VizDoom: Coverage during Exploration
- Able to learn to navigate corridors, walk between rooms, explore rooms
- Mario: Learning to play with no rewards
- Learns to cross over 30% of level-1
- Discovers behaviors like killing / dodging enemies
-
Generalization to Novel Scenarios
- To test generalization, train no-reward exploratory behavior in one scenario and evaluate it in three ways:
- Apply the learned policy as-is to the new scenario
- Adapt the policy by fine-tuning with curiosity reward only
- Adapt the policy to maximize extrinsic reward
- Evaluate as is
- No-reward policy generalizes well to other levels despite their different structures
- Those levels share a similar global appearance
- Fine-tuning with curiosity only
- No reward policy from level 1 can adapt to level 2 with some fine-tuning of the policy
- Training a policy from scratch actually leads to worse results
- Level 2 is harder so learning skills is harder $\rightarrow$ starting with level 1 is a form of curriculum learning
- Fine-tuning the level 1 policy on level 3 deteriorates performance
- Getting across a certain point in level 3 is very hard $\rightarrow$ curiosity blockade
- With near-zero intrinsic reward in the already-explored region, the policy degenerates and stops exploring (analogous to boredom)
- Fine-tuning with extrinsic rewards
- ICM pre-trained with only curiosity and then fine-tuned with external rewards learns faster + achieves higher reward than ICM trained from scratch
-
Discussion
- Future research: use the learned exploration behavior as a low-level policy in a more complex system