Diversity is All You Need: Learning Skills without a Reward Function
Resources
Introduction
Agents can learn skills without supervision and use these skills to satisfy goals later
Good for sparse reward environments
Helps with exploration
Primitives for hierarchical RL
Reduces amount of supervision necessary for a task
Challenging to determine what tasks an agent should learn
Skill: Latent conditioned policy that alters environment in consistent way
Setting: reward function unknown; maximize utility of skills
Learning objective: Each skill is distinct and skills explore large parts of the state space
Use discriminability between skills as objective
Skills should be as diverse as possible
“Pushes” skills away from each other
Diversity is All You Need
Setting: Unsupervised RL paradigm where agent has unsupervised exploration stage followed by supervised stage
How it Works
3 Ideas
Skills distinguishable
Use states to distinguish skills, not actions
Learn skills that act as randomly as possible
Objective:
$S$: Random variable for states
$A$: Random variable for actions
$Z \sim p(z)$: latent variable on which the policy is conditioned
$I(\cdot ; \cdot)$: Mutual information
$\mathcal{H}(\cdot)$: Shannon entropy
$I(S; Z)$: Maximize the mutual information between skills and states
Encodes a dependency between skills and states
To ensure states, not actions, are used to distinguish skills, minimize the mutual information between actions and skills given the state: $I(A; Z \vert S)$
Maximize: $\mathcal{F}(\theta) \triangleq I(S; Z) + \mathcal{H}(A \vert S) - I(A; Z \vert S)$
$= (\mathcal{H}(Z) - \mathcal{H}(Z \vert S)) + \mathcal{H}(A \vert S) - (\mathcal{H}(A \vert S) - \mathcal{H}(A \vert S, Z))$
$= \mathcal{H}(Z) - \mathcal{H}(Z \vert S) + \mathcal{H}(A \vert S, Z)$
First term encourages prior, $p(z)$, to have high entropy
Fix $p(z)$ to be uniform to maximize entropy
Second term: Easy to infer skill Z from state
Third term: Each skill should act as randomly as possible
We can’t compute $p(z \vert s)$ exactly, so we approximate the posterior with a learned discriminator $q _\phi (z \vert s)$, which gives a lower bound on $\mathcal{F}(\theta)$
$\mathcal{F}(\theta) = \mathcal{H}(A \vert S, Z) - \mathcal{H}(Z \vert S) + \mathcal{H}(Z)$
$= \mathcal{H}(A \vert S, Z) + \mathbb{E} _{z \sim p(z), s \sim \pi(z)}[\log p(z \vert s)] - \mathbb{E} _{z \sim p(z)}[\log p(z)]$
By Jensen’s inequality: $\geq \mathcal{H}(A \vert S, Z) + \mathbb{E} _{z \sim p(z), s \sim \pi(z)}[\log q _{\phi}(z \vert s) - \log p(z)]$
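A tiny numeric check of why replacing $p(z \vert s)$ with any approximation $q _\phi(z \vert s)$ can only lower the expectation: the gap is exactly $D_{KL}(p \Vert q)$, which Jensen’s inequality guarantees is non-negative. A minimal sketch with made-up probabilities:

```python
# Sanity check of the variational lower bound on a 2-skill example.
# The probability vectors below are made up for illustration.
import numpy as np

p = np.array([0.8, 0.2])  # "true" posterior p(z|s) at some state s
q = np.array([0.6, 0.4])  # a discriminator's approximation q_phi(z|s)

exact = np.sum(p * np.log(p))  # E_{z~p(.|s)}[log p(z|s)]
bound = np.sum(p * np.log(q))  # E_{z~p(.|s)}[log q_phi(z|s)]
print(exact >= bound)          # True: exact - bound = KL(p||q) >= 0
```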
Implementation
Implemented DIAYN with SAC; learned a policy $\pi _\theta(a \vert s, z)$
Scale SAC’s entropy regularizer to balance exploration and discriminability
Use pseudo-reward to maximize the variational lower bound: $r_z(s,a) \triangleq \log q _{\phi}(z \vert s) - \log p(z)$
Use categorical distribution for $p(z)$
Agent is rewarded for visiting states that are easy to discriminate; the discriminator is updated to better infer the skill from visited states
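A condensed sketch of how these pieces might fit together (not the authors’ code): a discriminator trained with cross-entropy to infer $z$ from visited states, and a pseudo-reward that replaces the task reward in SAC. `N_SKILLS`, `OBS_DIM`, and the network sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SKILLS, OBS_DIM = 8, 4  # illustrative sizes

# Discriminator q_phi(z | s): maps a state to logits over skills
discriminator = nn.Sequential(
    nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_SKILLS))
opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
log_p_z = torch.log(torch.tensor(1.0 / N_SKILLS))  # fixed uniform prior

def pseudo_reward(states, z):
    """r_z(s) = log q_phi(z|s) - log p(z); fed to SAC in place of a task reward."""
    log_q = F.log_softmax(discriminator(states), dim=-1)[..., z]
    return (log_q - log_p_z).detach()

def discriminator_update(states, zs):
    """Train q_phi to infer the skill from states the policy visited."""
    loss = F.cross_entropy(discriminator(states), zs)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```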
Stability
DIAYN is cooperative (unlike prior adversarial RL methods)
On grid worlds, the optimum is to partition states evenly between skills
In continuous domains, DIAYN is robust to random seeds
Experiments
Analysis of Learned Skills
Skills learned
2D Navigation Env
Skills learned move away from each other to remain distinguishable
Classical Control Tasks
Learns multiple skills for solving the task
Continuous Control Tasks
Learns primitive behaviors for all tasks
Running forwards and backwards
Doing flips, falling over
Jumping, walking, diving
Skill distribution becomes increasingly diverse during training
DIAYN favors skills that don’t overlap but is not limited to learning skills over disjoint sets of states
I.e., skills may visit the same initial states and then visit different states later
DIAYN vs VIC
In DIAYN, we do not learn a prior $p(z)$ whereas in VIC we do
In VIC, the learned prior samples the most easily discriminated skills more frequently, collapsing the effective number of skills
The fixed distribution in DIAYN causes it to discover more diverse skills
Harnessing Learned Skills
Accelerating Learning with Policy Initialization
We can adapt skills for a desired task
DIAYN can serve as unsupervised pre-training for more sample-efficient fine-tuning on a specific task
By taking the skill with the highest reward on the benchmark task and fine-tuning it, we speed up learning (a minimal sketch follows)
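A minimal sketch of this recipe, assuming a Gymnasium-style `env` and an `agent` exposing `act(obs, z)` (both hypothetical interfaces): evaluate each frozen skill on the task reward and keep the best one for fine-tuning.

```python
def episode_return(env, agent, z, max_steps=1000):
    """Run one episode with the policy conditioned on skill z; sum task reward."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, r, terminated, truncated, _ = env.step(agent.act(obs, z))
        total += r
        if terminated or truncated:
            break
    return total

def pick_best_skill(env, agent, n_skills, episodes=5):
    """Pick the skill with the highest average task reward, then fine-tune it."""
    def avg(z):
        return sum(episode_return(env, agent, z) for _ in range(episodes)) / episodes
    return max(range(n_skills), key=avg)
```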
Using Skills for Hierarchical RL
Introduce a meta-controller whose actions choose which skill to execute for $k$ steps (see the sketch at the end of this section)
VIME significantly underperforms DIAYN
DIAYN skills partition the state space
VIME tries to learn a single policy that visits many states
DIAYN outperforms TRPO, SAC, and VIME on challenging robotics environments
Skill learning helps in exploration and sparse-reward scenarios
We can bias DIAYN to discover particular types of skills
Condition the discriminator on a subset of the observation space
Maximize: $\mathbb{E}[\log q _{\phi}(z \vert f(s))]$
$f(s)$: Could compute center of mass $\rightarrow$ skills learn to change the center of mass
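A minimal sketch of the meta-controller loop, again with hypothetical interfaces: a Gymnasium-style `env`, frozen skills behind `agent.act(obs, z)`, and a `meta_policy(obs)` that returns a skill index.

```python
def run_hierarchical_episode(env, agent, meta_policy, k=100, max_steps=1000):
    """Meta-policy picks a skill; the frozen skill runs for k low-level steps."""
    obs, _ = env.reset()
    total, t = 0.0, 0
    while t < max_steps:
        z = meta_policy(obs)       # meta-action: which skill to execute next
        for _ in range(k):         # execute skill z for k environment steps
            obs, r, terminated, truncated, _ = env.step(agent.act(obs, z))
            total, t = total + r, t + 1
            if terminated or truncated or t >= max_steps:
                return total
    return total
```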
Imitating an Expert
Replaying human actions fails in stochastic environments (closed-loop control necessary)
Imitation learning replaces this with a differentiable policy that we can adjust
Given an expert trajectory $\tau^\ast$, we can use the discriminator to determine which skill most likely generated it
$\hat{z} = \arg\max_z \prod _{s_t \in \tau^\ast} q _{\phi}(z \vert s_t)$
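A short sketch of this inference step, reusing the torch discriminator from the implementation sketch above; `expert_states` is assumed to be a `(T, OBS_DIM)` tensor of the states in $\tau^\ast$:

```python
import torch
import torch.nn.functional as F

def infer_skill(discriminator, expert_states):
    """hat{z} = argmax_z sum_t log q_phi(z | s_t) over the expert trajectory."""
    log_q = F.log_softmax(discriminator(expert_states), dim=-1)  # (T, N_SKILLS)
    return int(log_q.sum(dim=0).argmax())
```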