Diversity is All You Need: Learning Skills without a Reward Function
Resources
Introduction
Agents can learn skills without supervision and use these skills to satisfy goals later
Good for sparse reward environments
Helps with exploration
Primitives for hierarchical RL
Reduces amount of supervision necessary for a task
Challenging to determine what tasks an agent should learn
Skill: Latent conditioned policy that alters environment in consistent way
Setting: reward function unknown; maximize utility of skills
Learning objective: Each skill is distinct and skills explore large parts of the state space
Use discriminability between skills as objective
Skills should be as diverse as possible
“Pushes” skills away from each other
Diversity is All You Need
Setting: Unsupervised RL paradigm where agent has unsupervised exploration stage followed by supervised stage
How it Works
3 Ideas
Skills distinguishable
Use states to distinguish skills, not actions
Learn skills that act as randomly as possible
Objective:
$S$: Random variable for states
$A$: Random variable for actions
$Z \sim p(z)$: latent variable on which the policy is conditioned
$I(\cdot ; \cdot)$: Mutual information
$\mathcal{H}(\cdot)$: Shannon entropy
$I(S; Z)$: Maximize the mutual information between skills and states
Encodes a dependency between skills and states
To ensure states, not actions, are used to distinguish skills, minimize the mutual information between actions and skills given the state: $I(A; Z \vert S)$
Maximize: $\mathcal{F}(\theta) \triangleq I(S; Z) + \mathcal{H}(A \vert S) - I(A; Z \vert S)$
$= (\mathcal{H}(Z) - \mathcal{H}(Z \vert S)) + \mathcal{H}(A \vert S) - (\mathcal{H}(A \vert S) - \mathcal{H}(A \vert S, Z))$
$= \mathcal{H}(Z) - \mathcal{H}(Z \vert S) + \mathcal{H}(A \vert S, Z)$
First term encourages prior, $p(z)$, to have high entropy
Fix $p(z)$ to be uniform to maximize entropy
Second term: Easy to infer skill Z from state
Third term: Each skill should act as randomly as possible
We can’t compute $p(z \vert s)$ exactly, so we approximate the posterior with a learned discriminator $q _\phi (z \vert s)$, which gives a lower bound on $\mathcal{F}(\theta)$
$\mathcal{F}(\theta) = \mathcal{H}(A \vert S, Z) - \mathcal{H}(Z \vert S) + \mathcal{H}(Z)$
$= \mathcal{H}(A \vert S, Z) + \mathbb{E} _{z \sim p(z), s \sim \pi(z)}[\log p(z \vert s)] - \mathbb{E} _{z \sim p(z)}[\log p(z)]$
By Jensen’s inequality: $\geq \mathcal{H}(A \vert S, Z) + \mathbb{E} _{z \sim p(z), s \sim \pi(z)}[\log q _{\phi}(z \vert s) - \log p(z)]$
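A tiny numeric check of why replacing $p(z \vert s)$ with any approximation $q _\phi(z \vert s)$ can only lower the expectation: the gap is exactly $D_{KL}(p \Vert q)$, which Jensen’s inequality guarantees is non-negative. A minimal sketch with made-up probabilities:

```python
# Sanity check of the variational lower bound on a 2-skill example.
# The probability vectors below are made up for illustration.
import numpy as np

p = np.array([0.8, 0.2])  # "true" posterior p(z|s) at some state s
q = np.array([0.6, 0.4])  # a discriminator's approximation q_phi(z|s)

exact = np.sum(p * np.log(p))  # E_{z~p(.|s)}[log p(z|s)]
bound = np.sum(p * np.log(q))  # E_{z~p(.|s)}[log q_phi(z|s)]
print(exact >= bound)          # True: exact - bound = KL(p||q) >= 0
```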
Implementation
Implemented DIAYN with SAC; learned a policy $\pi _\theta(a \vert s, z)$
Scale SAC’s entropy regularizer to balance exploration and discriminability
Use pseudo-reward to maximize the variational lower bound: $r_z(s,a) \triangleq \log q _{\phi}(z \vert s) - \log p(z)$
Use categorical distribution for $p(z)$
Agent is rewarded for visiting states that are easy to discriminate; the discriminator is updated to better infer the skill from visited states
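A condensed sketch of how these pieces might fit together (not the authors’ code): a discriminator trained with cross-entropy to infer $z$ from visited states, and a pseudo-reward that replaces the task reward in SAC. `N_SKILLS`, `OBS_DIM`, and the network sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SKILLS, OBS_DIM = 8, 4  # illustrative sizes

# Discriminator q_phi(z | s): maps a state to logits over skills
discriminator = nn.Sequential(
    nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_SKILLS))
opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
log_p_z = torch.log(torch.tensor(1.0 / N_SKILLS))  # fixed uniform prior

def pseudo_reward(states, z):
    """r_z(s) = log q_phi(z|s) - log p(z); fed to SAC in place of a task reward."""
    log_q = F.log_softmax(discriminator(states), dim=-1)[..., z]
    return (log_q - log_p_z).detach()

def discriminator_update(states, zs):
    """Train q_phi to infer the skill from states the policy visited."""
    loss = F.cross_entropy(discriminator(states), zs)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```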
Stability
DIAYN is cooperative (unlike prior adversarial RL methods)
On grid worlds, the optimum is to partition states evenly between skills
In continuous domains, DIAYN is robust to random seeds
Experiments
Analysis of Learned Skills
Skills learned
2D Navigation Env
Skills learned move away from each other to remain distinguishable
Classical Control Tasks
Learns multiple skills for solving the task
Continuous Control Tasks
Learns primitive behaviors for all tasks
Running forwards and backwards
Doing flips, falling over
Jumping, walking, diving
Skill distribution becomes increasingly diverse during training
DIAYN favors skills that don’t overlap but is not limited to learning skills over disjoint sets of states
I.e., skills may visit the same initial states and then visit different states later
DIAYN vs VIC
In DIAYN, we do not learn a prior $p(z)$ whereas in VIC we do
In VIC, the learned prior samples the most easily discriminated skills more frequently, collapsing the effective number of skills
The fixed distribution in DIAYN causes it to discover more diverse skills
Harnessing Learned Skills
Accelerating Learning with Policy Initialization
We can adapt skills for a desired task
DIAYN can serve as unsupervised pre-training for more sample-efficient fine-tuning on a specific task
By taking the skill with the highest reward on the benchmark task and fine-tuning it, we speed up learning (a minimal sketch follows)
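A minimal sketch of this recipe, assuming a Gymnasium-style `env` and an `agent` exposing `act(obs, z)` (both hypothetical interfaces): evaluate each frozen skill on the task reward and keep the best one for fine-tuning.

```python
def episode_return(env, agent, z, max_steps=1000):
    """Run one episode with the policy conditioned on skill z; sum task reward."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, r, terminated, truncated, _ = env.step(agent.act(obs, z))
        total += r
        if terminated or truncated:
            break
    return total

def pick_best_skill(env, agent, n_skills, episodes=5):
    """Pick the skill with the highest average task reward, then fine-tune it."""
    def avg(z):
        return sum(episode_return(env, agent, z) for _ in range(episodes)) / episodes
    return max(range(n_skills), key=avg)
```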
Using Skills for Hierarchical RL
Introduce a meta-controller whose actions choose which skill to execute for $k$ steps (see the sketch at the end of this section)
VIME significantly underperforms DIAYN
DIAYN skills partition the state space
VIME tries to learn a single policy that visits many states
DIAYN outperforms TRPO, SAC, and VIME on challenging robotics environments
Skill learning helps in exploration and sparse-reward scenarios
We can bias DIAYN to discover particular types of skills
Condition the discriminator on a subset of the observation space
Maximize: $\mathbb{E}[\log q _{\phi}(z \vert f(s))]$
$f(s)$: Could compute center of mass $\rightarrow$ skills learn to change the center of mass
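A minimal sketch of the meta-controller loop, again with hypothetical interfaces: a Gymnasium-style `env`, frozen skills behind `agent.act(obs, z)`, and a `meta_policy(obs)` that returns a skill index.

```python
def run_hierarchical_episode(env, agent, meta_policy, k=100, max_steps=1000):
    """Meta-policy picks a skill; the frozen skill runs for k low-level steps."""
    obs, _ = env.reset()
    total, t = 0.0, 0
    while t < max_steps:
        z = meta_policy(obs)       # meta-action: which skill to execute next
        for _ in range(k):         # execute skill z for k environment steps
            obs, r, terminated, truncated, _ = env.step(agent.act(obs, z))
            total, t = total + r, t + 1
            if terminated or truncated or t >= max_steps:
                return total
    return total
```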
Imitating an Expert
Replaying human actions fails in stochastic environments (closed-loop control necessary)
Imitation learning replaces this with a differentiable policy that we can adjust
Given an expert trajectory $\tau^\ast$, we can use the discriminator to determine which skill most likely generated it
$\hat{z} = \arg\max_z \prod _{s_t \in \tau^\ast} q _{\phi}(z \vert s_t)$
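A short sketch of this inference step, reusing the torch discriminator from the implementation sketch above; `expert_states` is assumed to be a `(T, OBS_DIM)` tensor of the states in $\tau^\ast$:

```python
import torch
import torch.nn.functional as F

def infer_skill(discriminator, expert_states):
    """hat{z} = argmax_z sum_t log q_phi(z | s_t) over the expert trajectory."""
    log_q = F.log_softmax(discriminator(expert_states), dim=-1)  # (T, N_SKILLS)
    return int(log_q.sum(dim=0).argmax())
```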