Use an LSTM to jointly approximate the policy and the value function
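A minimal PyTorch sketch of such a shared trunk (the layer sizes and the linear stand-in for the CNN encoder are illustrative assumptions, not from these notes):

```python
import torch
import torch.nn as nn

class LSTMActorCritic(nn.Module):
    """Shared LSTM trunk with a policy (actor) head and a value (critic) head."""
    def __init__(self, obs_dim, n_actions, hidden_size=256):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_size)         # stand-in for a CNN encoder
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)
        self.policy_head = nn.Linear(hidden_size, n_actions)   # action logits
        self.value_head = nn.Linear(hidden_size, 1)            # state-value estimate

    def forward(self, obs, hx, cx):
        z = torch.relu(self.encoder(obs))
        hx, cx = self.lstm(z, (hx, cx))
        return self.policy_head(hx), self.value_head(hx), (hx, cx)
```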
Auxiliary Tasks for Reinforcement Learning
Auxiliary Control Tasks
Auxiliary Control Tasks: Additional pseudo-reward functions defined in the environment the agent interacts with
$r^{(c)}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$
Set of auxiliary control tasks $\mathcal{C}$
Summing over all auxiliary tasks $c \in \mathcal{C}$, we want to find: $\arg\max _{\theta} \mathbb{E} _{\pi}[R _{1:\infty}] + \lambda_c \sum _{c \in \mathcal{C}} \mathbb{E} _{\pi^{(c)}}[R^{(c)} _{1:\infty}]$
$R^{(c)} _{1:\infty}$: discounted return for $r^{(c)}$
$\theta$: Shared parameters across $\pi, \pi^{(c)}$
Sharing parameters forces the agent to balance performance on the global reward with performance on the auxiliary tasks
To efficiently learn many pseudo-rewards in parallel, use off-policy Q-learning
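A hedged sketch of an off-policy n-step Q-learning loss for one auxiliary control task on a replayed sequence (the tensor shapes, the target network, and `gamma` are assumptions; the notes do not specify the exact update):

```python
import torch
import torch.nn.functional as F

def aux_q_loss(q_net, target_q_net, obs_seq, act_seq, pseudo_r_seq, gamma=0.99):
    """n-step off-policy Q-learning loss for one auxiliary task c with
    pseudo-rewards r^(c). obs_seq holds T+1 observations, act_seq and
    pseudo_r_seq hold T actions/rewards."""
    T = pseudo_r_seq.shape[0]
    with torch.no_grad():
        # Bootstrap from the target network at the final state of the sequence.
        returns = target_q_net(obs_seq[-1]).max(dim=-1).values
    loss = 0.0
    for t in reversed(range(T)):
        returns = pseudo_r_seq[t] + gamma * returns                      # R^(c)_t
        q_taken = q_net(obs_seq[t]).gather(-1, act_seq[t].unsqueeze(-1)).squeeze(-1)
        loss = loss + F.mse_loss(q_taken, returns)
    return loss / T
```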
Types of auxiliary reward functions
Pixel Control: Changes in the perceptual stream often correspond to important events. Train a policy that maximizes pixel change (pseudo-reward sketched after this list)
Feature Control: Networks extract high-level features. Use the activation of hidden units as an auxiliary reward. Train a separate policy to maximize the activation of hidden units in a given layer
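A rough illustration of the pixel-control pseudo-reward: average absolute pixel change within each cell of a grid over the observation. The cell size and frame layout are assumptions for the sketch:

```python
import torch
import torch.nn.functional as F

def pixel_change_reward(frame_prev, frame_next, cell=4):
    """Pseudo-reward grid for pixel control: mean absolute intensity change
    per spatial cell. frames: [C, H, W] tensors, H and W divisible by `cell`."""
    diff = (frame_next - frame_prev).abs().mean(dim=0, keepdim=True)   # [1, H, W]
    # Average-pool per-pixel change into an (H/cell) x (W/cell) grid of pseudo-rewards.
    return F.avg_pool2d(diff.unsqueeze(0), kernel_size=cell).squeeze()
```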
Auxiliary Reward Tasks
Agent also needs to maximize global reward stream
Needs to recognize states that lead to high reward and high value
However sparse reward environments make this difficult
Want to remove sparsity while keeping policy unbiased
Reward Prediction: Require agent to predict reward attained in a subsequent unseen frame
Helps shape the agent's features $\rightarrow$ any bias introduced affects only the reward predictor and the shared features, not the policy or value function
Train reward prediction on $S _{\tau} = (s _{\tau - k}, s _{\tau - k + 1}, \dots, s _{\tau -1})$ to predict $r _{\tau}$
Sample $S _{\tau}$ in skewed manner to over-represent rewarding events
Zero rewards and non-zero rewards equally represented ($P(r _{\tau} \neq 0) = 0.5$)
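A minimal sketch of this skewed sampling (the replay-buffer layout, with per-step `obs`/`reward` entries, is an assumption):

```python
import random

def sample_reward_prediction_batch(replay, k=3):
    """Sample a history of k frames whose following frame has a non-zero
    reward with probability ~0.5, over-representing rewarding events."""
    rewarding = [i for i in range(k, len(replay)) if replay[i]["reward"] != 0]
    non_rewarding = [i for i in range(k, len(replay)) if replay[i]["reward"] == 0]
    pool = rewarding if (rewarding and random.random() < 0.5) else (non_rewarding or rewarding)
    tau = random.choice(pool)
    frames = [replay[t]["obs"] for t in range(tau - k, tau)]   # s_{tau-k} .. s_{tau-1}
    return frames, replay[tau]["reward"]                       # predict r_tau
```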
Use different architecture from policy network
Concatenate the stacked states after they are encoded by the CNN
Focuses on immediate reward prediction rather than long-term returns by looking only at the immediately preceding states instead of the entire history
These features are shared with the LSTM
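A sketch of such a reward-prediction head, assuming a shared CNN encoder and a three-way classification of the reward sign (negative / zero / positive); `feat_dim` and `k` are assumptions:

```python
import torch
import torch.nn as nn

class RewardPredictor(nn.Module):
    """Predicts the sign of r_tau from the concatenated CNN features
    of the k preceding frames (no LSTM, unlike the policy network)."""
    def __init__(self, cnn_encoder, feat_dim, k=3):
        super().__init__()
        self.encoder = cnn_encoder                    # shared with the main agent
        self.classifier = nn.Linear(k * feat_dim, 3)  # logits over {-, 0, +}

    def forward(self, frames):                        # frames: list of k [B, ...] tensors
        feats = [self.encoder(f) for f in frames]     # each [B, feat_dim]
        return self.classifier(torch.cat(feats, dim=-1))
```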
Experience Replay
Uses prioritized replay that oversamples rewarding states
Also does value function replay: off-policy regression of the value function on data from the replay buffer
Randomly varies truncation window for returns (i.e., uses random n for n-step returns)
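A hedged sketch of value function replay with a randomly truncated n-step return (the sequence layout, `gamma`, and `max_n` are assumptions):

```python
import random
import torch
import torch.nn.functional as F

def value_replay_loss(value_fn, obs_seq, reward_seq, gamma=0.99, max_n=5):
    """Off-policy value regression on a replayed sequence: pick a random n,
    form the n-step return, and bootstrap from V(s_n)."""
    n = random.randint(1, min(max_n, reward_seq.shape[0]))
    with torch.no_grad():
        target = value_fn(obs_seq[n]).squeeze(-1)      # bootstrap value V(s_n)
        for t in reversed(range(n)):
            target = reward_seq[t] + gamma * target    # accumulate n-step return
    return F.mse_loss(value_fn(obs_seq[0]).squeeze(-1), target)
```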