Reward-free Option Discovery: Discover skills without rewards
Variational option discovery: Option discovery based on variational inference
Policy: Encoder that translates context from noise distribution into trajectories
Decoder: Recovers context from trajectories
Context: Random vectors that have no association with trajectories but become associated with trajectories through training
VIC and DIAYN are variations of this algorithm but use individual states instead of trajectories
Variational Autoencoding Learning of Options by Reinforcement (VALOR): Encourages learning dynamics modes instead of goal-attaining ones (e.g., move in a circle vs. go to a specific state)
Uses curriculum learning (number of contexts increases over time)
Variational Option Discovery Algorithms
Policy conditioned on state ($s_t$) and context ($c$)
Context specifies a specific skill
Context arbitrarily assigned/discovered during training
Context sampled from noise distribution ($G$)
Encoded into a trajectory ($\tau$) by a policy
Context then decoded from trajectory using a decoder ($D$)
If $\tau$ unique to $c$, decoder gives high probability to $c$ + policy should be reinforced
Update policy using any RL algorithm to maximize the training objective (Equation 2, below)
Update decoder to maximize $\mathbb{E} [\log P_D(c \vert \tau)]$
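For reference, the objective these notes call Equation 2 (following the paper's numbering) is, up to notation, the decoder log-likelihood plus entropy regularization on the policy:

$$\max_{\pi, D} \; \mathbb{E}_{c \sim G}\Big[\mathbb{E}_{\tau \sim \pi(\cdot \vert c)}\big[\log P_D(c \vert \tau)\big] + \beta \, \mathcal{H}(\pi \vert c)\Big]$$

A minimal sketch of one training iteration, assuming hypothetical `env`, `policy`, and `decoder` interfaces (none of these names are from the paper):

```python
import numpy as np

def train_iteration(env, policy, decoder, num_contexts, batch_size=32):
    """One iteration of variational option discovery (sketch).

    Assumed interfaces (hypothetical):
      policy.act(s, c)            -> action in state s under context c
      policy.reinforce(trajs, rs) -> any RL update (e.g., vanilla policy gradient)
      decoder.log_prob(c, traj)   -> log P_D(c | traj)
      decoder.fit(cs, trajs)      -> supervised step maximizing E[log P_D(c | tau)]
    """
    contexts, trajs, rewards = [], [], []
    for _ in range(batch_size):
        c = np.random.randint(num_contexts)   # c ~ G (noise distribution)
        s, traj, done = env.reset(), [], False
        while not done:
            a = policy.act(s, c)              # policy "encodes" c into a trajectory
            s, _, done, _ = env.step(a)       # environment reward is ignored
            traj.append(s)
        rewards.append(decoder.log_prob(c, traj))  # pseudo-reward from the decoder
        contexts.append(c)
        trajs.append(traj)
    policy.reinforce(trajs, rewards)          # RL step on the variational objective
    decoder.fit(contexts, trajs)              # maximize E[log P_D(c | tau)]
```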
Connections to Prior Work
Variational Intrinsic Control
VIC optimizes variational lower bound of mutual information between context and final state conditioned on initial state
Differs from VALOR because
$G$ can be optimized
$G$ depends on initial state
$G$ is entropy regularized
$\pi$ is not entropy regularized
Decoder only looks at first and last state of the trajectory
VIC is a form of VALOR
Keep $G$ fixed + let $\log P_D(c \vert \tau) = \log P_D(c \vert s_T)$ with no entropy regularization on policy
Diversity is All You Need (DIAYN)
Optimizes variational lower bound of mutual information between context and every state in a trajectory
Minimizes mutual information between actions and contexts conditioned on states
Maximizes entropy of mixture policy over contexts
DIAYN is a form of VALOR with $\log P_D(c \vert \tau) = \sum _{t = 0}^T \log P_D(c \vert s_t)$
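The three decoder factorizations side by side, as a sketch; `decoder.log_prob(c, x)` is a hypothetical interface scoring context $c$ given input $x$:

```python
def vic_log_prob(decoder, c, states):
    # VIC: decoder sees only the final state, log P_D(c | s_T)
    return decoder.log_prob(c, states[-1])

def diayn_log_prob(decoder, c, states):
    # DIAYN: per-timestep decomposition, sum_t log P_D(c | s_t)
    return sum(decoder.log_prob(c, s) for s in states)

def valor_log_prob(decoder, c, states):
    # VALOR: decoder conditions on the whole trajectory, log P_D(c | tau)
    return decoder.log_prob(c, states)
```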
VALOR
Optimizes Equation 2 (above) with the following caveats
Decoder never sees actions; if it could, the policy could communicate the context directly through its actions, ignoring the environment
Forces agent to manipulate environment to communicate with decoder
Decoder does not decompose as sum of per-timestep computations (unlike DIAYN)
Per-timestep decomposition would inhibit the decoder's ability to distinguish between behaviors which share states
Implemented with a recurrent architecture (bidirectional LSTM)
Recurrent layer unrolls over 11 evenly-spaced points in the trajectory
Efficient + encodes only low-frequency behaviors instead of high-frequency ones (e.g., jitters)
Computes differences between every $k$ states rather than using raw states
Encodes the prior that the agent should be moving (staying in the same state is indistinguishable to the decoder)
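A minimal PyTorch sketch of such a decoder; the class name, layer sizes, and subsampling details are illustrative assumptions, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    """Bidirectional-LSTM decoder over differences of subsampled states (sketch)."""

    def __init__(self, obs_dim, num_contexts, hidden=64, num_points=11):
        super().__init__()
        self.num_points = num_points
        self.lstm = nn.LSTM(obs_dim, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_contexts)

    def forward(self, states):
        # states: (batch, T, obs_dim), the full trajectory of observations
        T = states.shape[1]
        idx = torch.linspace(0, T - 1, self.num_points).long()  # 11 evenly-spaced points
        pts = states[:, idx]                      # subsample; ignores high-frequency jitter
        deltas = pts[:, 1:] - pts[:, :-1]         # state differences, not raw states
        out, _ = self.lstm(deltas)                # (batch, num_points - 1, 2 * hidden)
        logits = self.head(out[:, -1])            # final hidden state -> context logits
        return torch.log_softmax(logits, dim=-1)  # log P_D(c | tau)
```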
Curriculum Approach
VIC and DIAYN use discrete contexts $c \sim \mathrm{Uniform}(K)$
Worked poorly for large $K$
Instead, start with a small $K$ where learning was easy and gradually increase it over time
Increase $K$ once $\mathbb{E}[\log P_D(c \vert \tau)]$ passes a threshold
$K \leftarrow \min(\mathrm{int}(1.5 \cdot K + 1), K_{\max})$
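A sketch of this rule; the threshold value is a hyperparameter assumed here, not specified in these notes:

```python
def maybe_grow_curriculum(K, avg_decoder_log_prob, K_max, threshold):
    """Increase the number of contexts once the decoder is confident enough (sketch).

    avg_decoder_log_prob estimates E[log P_D(c | tau)] over recent batches.
    """
    if avg_decoder_log_prob > threshold:
        K = min(int(1.5 * K + 1), K_max)
    return K
```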
Experimental Setup
Test Environments: 2D Point Agent, HalfCheetah, Swimmer, a modified Ant environment, Dexterous Hand, and a humanoid toddler
Implementation: implement VALOR, VIC, and DIAYN with vanilla policy gradient
Training Techniques: Curriculum generation (above equation) and context embeddings
Context Embeddings: Let agent learn own embedding vector for each context
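A sketch of learned context embeddings conditioning the policy; all names and sizes here are illustrative:

```python
import torch
import torch.nn as nn

class ContextConditionedPolicy(nn.Module):
    """Policy that learns its own embedding vector for each discrete context (sketch)."""

    def __init__(self, obs_dim, act_dim, num_contexts, embed_dim=16, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(num_contexts, embed_dim)  # learned per-context vectors
        self.net = nn.Sequential(
            nn.Linear(obs_dim + embed_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, context_id):
        z = self.embed(context_id)            # embedding instead of a fixed one-hot code
        return self.net(torch.cat([obs, z], dim=-1))
```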
Results
Using embeddings improves speed + stability of training
Training with a uniform distribution becomes more challenging as $K$ increases
Curriculum learning helps alleviate this difficulty
Variational option discovery methods find locomotion gaits at a variety of speeds + directions in the MuJoCo environments
DIAYN has a tendency to learn behaviors that attain a target state
Curriculum learning doesn’t increase diversity of behaviors found
Does make distribution of scores more consistent across seeds
Hand + Toddler Environment: Optimizing hand was easy, toddler was not
Learned very few behaviors on the toddler, and those it learned were unnatural
Due to a limitation of information-theoretic RL: it has no strong prior toward natural behavior
Was able to learn hundreds of behaviors on the point environment
Correlated with decoder capacity (higher capacity = more easily overfit to small differences that would otherwise be undetectable)
Mode Interpolation: Interpolate between context embeddings
Smooth interpolated behaviors achieved $\rightarrow$ suggests training learns universal policies
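Mode interpolation is simple with learned embeddings; a sketch assuming the `ContextConditionedPolicy` above:

```python
import torch

def interpolated_embedding(policy, c1, c2, alpha):
    """Linear interpolation between two learned context embeddings (sketch).

    The result is fed to the policy in place of policy.embed(c).
    """
    e1 = policy.embed(torch.tensor(c1))
    e2 = policy.embed(torch.tensor(c2))
    return (1 - alpha) * e1 + alpha * e2
```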
Downstream Tasks: Trained a VALOR policy on Ant and used it as the lower-level policy for an Ant maze task
Only the upper-level policy is trained on the maze task
Worked as well as a hierarchical policy trained from scratch (with a random network as the lower-level policy) and a non-hierarchical policy trained from scratch
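A sketch of the hierarchy, with all interfaces assumed: the upper-level policy outputs a context, and the frozen VALOR policy executes the corresponding skill:

```python
def hierarchical_step(env, high_policy, valor_policy, obs, steps_per_option=100):
    """Pick a context with the upper level, run the frozen skill for a while (sketch)."""
    c = high_policy.act(obs)                # upper level chooses a skill/context
    total_reward = 0.0
    for _ in range(steps_per_option):
        a = valor_policy.act(obs, c)        # frozen lower-level VALOR policy
        obs, r, done, _ = env.step(a)
        total_reward += r                   # task reward trains only the upper level
        if done:
            break
    return obs, total_reward
```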