Curiosity-driven Exploration by Self-supervised Prediction
-
Resources
-
Introduction
- Random exploration under sparse rewards only works well in small environments
- Curiosity: Learning new skills that may be useful for pursuing rewards later
- Classes of intrinsic rewards:
- Encourage agent to explore novel states
- Encourage agent to take actions that reduce its uncertainty about the environment (by predicting the consequences of its actions)
- Measuring novelty requires building dynamics model
- Hard in high dimensional state spaces (e.g., images)
- Hard to deal with stochasticity
- Another issue is dealing with states that are visually distinct but functionally similar
- Key insight of ICM: predict changes to environment that are a consequence of the action; ignore everything else
- Transform raw states to latent state
- Predict agent’s action given current and next state
- Since we predict action, network has no incentive to include factors of variation not induced by the action
- Train forward dynamics model; predict next state given current state and action
- Prediction error acts as intrinsic reward
- Curiosity
- Helps explore environment for new knowledge
- Helps learn skills for future scenarios
-
Curiosity-Driven Exploration
- Two subsystems
- Reward generator that outputs curiosity driven reward signal
- Policy that outputs sequence of actions to maximize reward signal
- Maximize rewards $r_t = r^i_t + r^e_t$ (sum of extrinsic and intrinsic rewards)
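A minimal sketch of how the two subsystems interact at each step; `env`, `policy`, and `icm` are hypothetical objects standing in for the environment, the policy, and the reward generator (none of these names come from the paper):

```python
def rollout_step(env, policy, icm, s_t):
    """One interaction step: the policy acts, the ICM scores the transition
    with a curiosity bonus, and the learner trains on the summed reward."""
    a_t = policy.act(s_t)
    s_next, r_ext, done = env.step(a_t)             # extrinsic reward r^e_t
    r_int = icm.intrinsic_reward(s_t, a_t, s_next)  # intrinsic reward r^i_t
    return s_next, r_ext + r_int, done              # r_t = r^i_t + r^e_t
```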
-
Prediction error as curiosity reward
- Predicting pixels is both difficult and may not be the right objective
- E.g., inherent changes in the environment not caused by the agent's actions keep the prediction error high
- Parts of state space can’t be modeled
- Agent is unaware of this and can fall into an artificial curiosity trap, stalling exploration
- Things that can change environment
- Controllable factors
- Uncontrollable factors that can impact agent
- Uncontrollable factors that cannot impact agent
- A good feature space should model the first two and be invariant to the third
-
Self-supervised prediction for exploration
- 2 sub-modules of a deep neural network
- One that encodes $s_t$ into feature $\phi(s_t)$
- One that takes $\phi(s_t), \phi(s _{t+1})$ and predicts $a_t$
- Inverse dynamics model: $\hat{a}_t = g(s_t, s _{t+1}; \theta_I)$
- Optimize $\min _{\theta_I} L_I(\hat{a}_t, a_t)$
- Train another network that predicts $\phi(s _{t+1})$ given $\phi(s_t), a_t$
- Forward dynamics model: $\hat{\phi}(s _{t+1}) = f(\phi(s_t), a_t; \theta_F)$
- Loss: $L_F(\phi(s _{t+1}), \hat{\phi}(s _{t+1})) = \frac{1}{2}\vert \vert \hat{\phi}(s _{t+1}) - \phi(s _{t+1}) \vert \vert^2_2$
- Intrinsic reward: $r^i_t = \frac{\eta}{2}\vert \vert \hat{\phi}(s _{t+1}) - \phi(s _{t+1}) \vert \vert^2_2$
- Intrinsic Curiosity Module (ICM); see the code sketch at the end of this section:
- Inverse model learns feature space
- Forward model makes prediction in feature space
- Overall optimization problem
- $\min _{\theta_P, \theta_I, \theta_F}\left[- \lambda \mathbb{E} _{\pi(s_t; \theta_P)}[\sum_t r_t] + (1 - \beta)L_I + \beta L_F\right]$
- $\beta$: Scalar that weighs inverse model loss against forward model loss
- $\lambda$: Weights importance of policy gradient loss vs importance of learning intrinsic reward
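A compact PyTorch reconstruction of the ICM losses from the equations above. This is my own sketch, not the authors' code: the MLP encoder, layer sizes, discrete-action assumption, and the choice to stop gradients from $L_F$ into the encoder are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ICM(nn.Module):
    """Sketch of the Intrinsic Curiosity Module for discrete actions:
    an encoder phi, an inverse model g(phi(s_t), phi(s_{t+1})) -> a_t,
    and a forward model f(phi(s_t), a_t) -> phi(s_{t+1})."""

    def __init__(self, obs_dim: int, n_actions: int,
                 feat_dim: int = 288, eta: float = 0.01):
        super().__init__()
        self.eta = eta  # scaling factor for the intrinsic reward
        self.n_actions = n_actions
        # Encoder phi: a plain MLP for brevity (the paper uses a conv net on pixels).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ELU(), nn.Linear(256, feat_dim))
        # Inverse model: predicts a_t from the concatenated features.
        self.inverse = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ELU(), nn.Linear(256, n_actions))
        # Forward model: predicts phi(s_{t+1}) from phi(s_t) and one-hot a_t.
        self.fwd = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 256), nn.ELU(),
            nn.Linear(256, feat_dim))

    def forward(self, s_t, s_next, a_t):
        # a_t: LongTensor of action indices, shape (batch,)
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        # L_I: cross-entropy between predicted and taken action. Here it is the
        # only loss shaping phi, so phi keeps just what the action influences.
        logits = self.inverse(torch.cat([phi_t, phi_next], dim=1))
        L_I = F.cross_entropy(logits, a_t)
        # L_F: squared error in feature space. phi is detached so the forward
        # model cannot collapse the features to make itself trivially
        # predictable (an implementation choice, not spelled out above).
        a_onehot = F.one_hot(a_t, self.n_actions).float()
        phi_pred = self.fwd(torch.cat([phi_t.detach(), a_onehot], dim=1))
        L_F = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=1).mean()
        # Intrinsic reward r^i_t = (eta / 2) * ||phi_pred - phi_next||_2^2
        r_int = (self.eta * 0.5 * (phi_pred - phi_next).pow(2).sum(dim=1)).detach()
        return L_I, L_F, r_int
```

These outputs then plug into the overall objective as `(1 - beta) * L_I + beta * L_F` added to the negated, λ-weighted policy objective, with β = 0.2 and λ = 0.1 as in the paper.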
-
Experimental Setup
- Environments: VizDoom and Super Mario Bros
- Training details:
- Visual inputs
- Preprocessed to 42x42 and grayscale
- Concatenate 4 frames at a time to capture temporal information
- Use action repeat during training but not inference
- A3C algorithm with 20 workers
- A3C Architecture
- 4 convolutional layers
- Exponential linear unit after each convolution
- LSTM after convolution
- 2 linear layers to predict value and action from LSTM representation
- ICM Architecture
- Inverse Model
- 4 convolution layers
- Exponential linear unit after each convolution
- Current and next state feature vectors concatenated for single representation
- 2 fully connected layers to predict the action
- Forward Model
- Concatenate $a_t$ and $\phi(s_t)$
- 2 fully connected layers
- $\beta = 0.2, \lambda = 0.1$ for overall optimization equation
- Compared methods
- ICM + A3C (the proposed approach)
- A3C with $\epsilon$-greedy exploration
- ICM-pixels + A3C (ICM without the inverse model)
- Remove inverse model and add deconvolutions to forward model
- Doesn’t learn invariant embeddings for uncontrollable parts
- Represents information gain based on direct observations
- VIME + TRPO
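A sketch of the observation pipeline described above (42x42 grayscale, 4-frame stack), assuming RGB uint8 input frames and using OpenCV for resizing; the library choice and class names are mine, not the paper's:

```python
from collections import deque

import cv2
import numpy as np


def preprocess(frame: np.ndarray) -> np.ndarray:
    """RGB uint8 frame (H, W, 3) -> normalized grayscale float32 (42, 42)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (42, 42), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0


class FrameStack:
    """Keeps the last k preprocessed frames as the model input, giving the
    otherwise single-frame observation some temporal context."""

    def __init__(self, k: int = 4):
        self.frames: deque = deque(maxlen=k)

    def reset(self, frame: np.ndarray) -> np.ndarray:
        first = preprocess(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(first)  # pad the stack with the first frame
        return np.stack(self.frames)   # shape (4, 42, 42)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess(frame))
        return np.stack(self.frames)
```

Action repeat (executing each chosen action for several consecutive frames during training) would wrap the environment's step function one level above this.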
-
Experiments
-
Sparse Extrinsic Reward Setting
- Vary the distance between spawn and goal (the farther apart, the lower the chance of stumbling on the goal, so the sparser the reward)
- Dense: agent spawned uniformly at random over all possible spawn locations
- Not hard and goal can be reached with $\epsilon$-greedy
- Sparse and very sparse: spawn in rooms 270 and 350 steps away from the goal, respectively
- Requires long directed sequence of actions
- A3C degrades with sparser rewards
- Curious A3C agents are better in all cases + learn faster, indicating more efficient exploration
- ICM-pixels not as good because it's hard to learn pixel-prediction models as the number of textures increases
- In sparse, only baseline A3C fails
- In very sparse, A3C and ICM-pixels fail
- Uncontrollable Dynamics
- Augment the state with a region of white noise that the agent's actions cannot influence
- ICM-pixels fails here but ICM succeeds
- Comparison to TRPO-VIME
- TRPO-VIME has better sample efficiency
- ICM has better convergence rates and accuracy
-
No Reward Setting
- ICM performs well even in the absence of extrinsic reward
- VizDoom: Coverage during Exploration
- Able to learn to navigate corridors, walk between rooms, explore rooms
- Mario: Learning to play with no rewards
- Learns to cross over 30% of level-1
- Discovers behaviors like killing / dodging enemies
-
Generalization to Novel Scenarios
- To test generalization, train no-reward exploratory behavior in one scenario and evaluate it in three ways:
- Apply the learned policy as-is to the new scenario
- Adapt the policy by fine-tuning with curiosity reward only
- Adapt the policy to maximize extrinsic reward
- Evaluate as is
- No-reward policy generalizes well to other levels despite their different structures
- Those levels share a similar global appearance
- Fine-tuning with curiosity only
- No reward policy from level 1 can adapt to level 2 with some fine-tuning of the policy
- Training a policy from scratch actually leads to worse results
- Level 2 is harder so learning skills is harder $\rightarrow$ starting with level 1 is a form of curriculum learning
- Fine-tuning the level 1 policy on level 3 deteriorates performance
- Getting across a certain point in level 3 is very hard $\rightarrow$ curiosity blockade
- With near-zero intrinsic reward in the already-explored region, the policy degenerates and stops exploring (analogous to boredom)
- Fine-tuning with extrinsic rewards
- ICM pre-trained with only curiosity and then fine-tuned with external rewards learns faster + achieves higher reward than ICM trained from scratch
-
Discussion
- Future research: use the learned exploration behavior as a low-level policy in a more complex system