Reward-free Option Discovery: Discover skills without rewards
Variational option discovery: Option discovery based on variational inference
Policy: Encoder that translates context from noise distribution into trajectories
Decoder: Recovers context from trajectories
Context: Random vectors that have no association with trajectories but become associated with trajectories through training
VIC and DIAYN are variations of this algorithm but use individual states instead of trajectories
Variational Autoencoding Learning of Options by Reinforcement (VALOR): Encourages learning dynamics modes instead of goal-attaining ones (e.g., move in a circle vs. go to a specific state)
Uses curriculum learning (number of contexts increases over time)
Variational Option Discovery Algorithms
Policy conditioned on state ($s_t$) and context ($c$)
Context specifies a specific skill
Context arbitrarily assigned/discovered during training
Context sampled from noise distribution ($G$)
Encoded into a trajectory ($\tau$) by a policy
Context then decoded from trajectory using a decoder ($D$)
If $\tau$ unique to $c$, decoder gives high probability to $c$ + policy should be reinforced
Update policy using any RL algorithm to maximize the training objective (Equation 2, below)
Update decoder to maximize $\mathbb{E} [\log P_D(c \vert \tau)]$
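For reference, the objective these notes call Equation 2 (following the paper's numbering) is, up to notation, the decoder log-likelihood plus entropy regularization on the policy:

$$\max_{\pi, D} \; \mathbb{E}_{c \sim G}\Big[\mathbb{E}_{\tau \sim \pi(\cdot \vert c)}\big[\log P_D(c \vert \tau)\big] + \beta \, \mathcal{H}(\pi \vert c)\Big]$$

A minimal sketch of one training iteration, assuming hypothetical `env`, `policy`, and `decoder` interfaces (none of these names are from the paper):

```python
import numpy as np

def train_iteration(env, policy, decoder, num_contexts, batch_size=32):
    """One iteration of variational option discovery (sketch).

    Assumed interfaces (hypothetical):
      policy.act(s, c)            -> action in state s under context c
      policy.reinforce(trajs, rs) -> any RL update (e.g., vanilla policy gradient)
      decoder.log_prob(c, traj)   -> log P_D(c | traj)
      decoder.fit(cs, trajs)      -> supervised step maximizing E[log P_D(c | tau)]
    """
    contexts, trajs, rewards = [], [], []
    for _ in range(batch_size):
        c = np.random.randint(num_contexts)   # c ~ G (noise distribution)
        s, traj, done = env.reset(), [], False
        while not done:
            a = policy.act(s, c)              # policy "encodes" c into a trajectory
            s, _, done, _ = env.step(a)       # environment reward is ignored
            traj.append(s)
        rewards.append(decoder.log_prob(c, traj))  # pseudo-reward from the decoder
        contexts.append(c)
        trajs.append(traj)
    policy.reinforce(trajs, rewards)          # RL step on the variational objective
    decoder.fit(contexts, trajs)              # maximize E[log P_D(c | tau)]
```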
Connections to Prior Work
Variational Intrinsic Control
VIC optimizes variational lower bound of mutual information between context and final state conditioned on initial state
Differs from VALOR because
$G$ can be optimized
$G$ depends on initial state
$G$ is entropy regularized
$\pi$ is not entropy regularized
Decoder only looks at first and last state of the trajectory
VIC is a form of VALOR
Keep $G$ fixed + let $\log P_D(c \vert \tau) = \log P_D(c \vert s_T)$ with no entropy regularization on policy
Diversity is All You Need (DIAYN)
Optimizes variational lower bound of mutual information between context and every state in a trajectory
Minimizes mutual information between actions and contexts conditioned on states
Maximizes entropy of mixture policy over contexts
DIAYN is a form of VALOR with $\log P_D(c \vert \tau) = \sum _{t = 0}^T \log P_D(c \vert s_t)$
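The three decoder factorizations side by side, as a sketch; `decoder.log_prob(c, x)` is a hypothetical interface scoring context $c$ given input $x$:

```python
def vic_log_prob(decoder, c, states):
    # VIC: decoder sees only the final state, log P_D(c | s_T)
    return decoder.log_prob(c, states[-1])

def diayn_log_prob(decoder, c, states):
    # DIAYN: per-timestep decomposition, sum_t log P_D(c | s_t)
    return sum(decoder.log_prob(c, s) for s in states)

def valor_log_prob(decoder, c, states):
    # VALOR: decoder conditions on the whole trajectory, log P_D(c | tau)
    return decoder.log_prob(c, states)
```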
VALOR
Optimizes Equation 2 (above) with the following caveats
Decoder never sees actions; if it could, the policy could communicate the context directly through its actions, ignoring the environment
Forces agent to manipulate environment to communicate with decoder
Decoder does not decompose as sum of per-timestep computations (unlike DIAYN)
Per-timestep decomposition would inhibit the decoder's ability to distinguish between behaviors which share states
Implemented with a recurrent architecture (bidirectional LSTM)
Recurrent layer unrolls over 11 evenly-spaced points in the trajectory
Efficient + encodes only low-frequency behaviors instead of high-frequency ones (e.g., jitters)
Computes differences between every $k$ states rather than using raw states
Encodes the prior that the agent should be moving (staying in the same state is indistinguishable to the decoder)
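A minimal PyTorch sketch of such a decoder; the class name, layer sizes, and subsampling details are illustrative assumptions, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    """Bidirectional-LSTM decoder over differences of subsampled states (sketch)."""

    def __init__(self, obs_dim, num_contexts, hidden=64, num_points=11):
        super().__init__()
        self.num_points = num_points
        self.lstm = nn.LSTM(obs_dim, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_contexts)

    def forward(self, states):
        # states: (batch, T, obs_dim), the full trajectory of observations
        T = states.shape[1]
        idx = torch.linspace(0, T - 1, self.num_points).long()  # 11 evenly-spaced points
        pts = states[:, idx]                      # subsample; ignores high-frequency jitter
        deltas = pts[:, 1:] - pts[:, :-1]         # state differences, not raw states
        out, _ = self.lstm(deltas)                # (batch, num_points - 1, 2 * hidden)
        logits = self.head(out[:, -1])            # final hidden state -> context logits
        return torch.log_softmax(logits, dim=-1)  # log P_D(c | tau)
```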
Curriculum Approach
VIC and DIAYN use discrete contexts $c \sim \mathrm{Uniform}(K)$
Worked poorly for large $K$
Instead, start with a small $K$ where learning was easy and gradually increase it over time
Increase $K$ once $\mathbb{E}[\log P_D(c \vert \tau)]$ passes a threshold
$K \leftarrow \min(\mathrm{int}(1.5 \cdot K + 1), K_{\max})$
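A sketch of this rule; the threshold value is a hyperparameter assumed here, not specified in these notes:

```python
def maybe_grow_curriculum(K, avg_decoder_log_prob, K_max, threshold):
    """Increase the number of contexts once the decoder is confident enough (sketch).

    avg_decoder_log_prob estimates E[log P_D(c | tau)] over recent batches.
    """
    if avg_decoder_log_prob > threshold:
        K = min(int(1.5 * K + 1), K_max)
    return K
```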
Experimental Setup
Test Environments: 2D Point Agent, HalfCheetah, Swimmer, a modified Ant environment, Dexterous Hand, and a humanoid toddler
Implementation: implement VALOR, VIC, and DIAYN with vanilla policy gradient
Training Techniques: Curriculum generation (above equation) and context embeddings
Context Embeddings: Let agent learn own embedding vector for each context
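A sketch of learned context embeddings conditioning the policy; all names and sizes here are illustrative:

```python
import torch
import torch.nn as nn

class ContextConditionedPolicy(nn.Module):
    """Policy that learns its own embedding vector for each discrete context (sketch)."""

    def __init__(self, obs_dim, act_dim, num_contexts, embed_dim=16, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(num_contexts, embed_dim)  # learned per-context vectors
        self.net = nn.Sequential(
            nn.Linear(obs_dim + embed_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, context_id):
        z = self.embed(context_id)            # embedding instead of a fixed one-hot code
        return self.net(torch.cat([obs, z], dim=-1))
```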
Results
Using embeddings improves speed + stability of training
Training with a uniform distribution becomes more challenging as $K$ increases
Curriculum learning helps alleviate this difficulty
Variational option discovery methods find locomotion gaits at a variety of speeds + directions in the MuJoCo environments
DIAYN has a tendency to learn behaviors that attain a target state
Curriculum learning doesn’t increase diversity of behaviors found
Does make distribution of scores more consistent across seeds
Hand + Toddler Environment: Optimizing hand was easy, toddler was not
Learned very few behaviors on the toddler, and those it learned were unnatural
Due to a limitation of information-theoretic RL: it has no strong prior toward natural behavior
Was able to learn hundreds of behaviors on the point environment
Correlated with decoder capacity (higher capacity = more easily overfit to small differences that would otherwise be undetectable)
Mode Interpolation: Interpolate between context embeddings
Smooth interpolated behaviors achieved $\rightarrow$ suggests training learns universal policies
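Mode interpolation is simple with learned embeddings; a sketch assuming the `ContextConditionedPolicy` above:

```python
import torch

def interpolated_embedding(policy, c1, c2, alpha):
    """Linear interpolation between two learned context embeddings (sketch).

    The result is fed to the policy in place of policy.embed(c).
    """
    e1 = policy.embed(torch.tensor(c1))
    e2 = policy.embed(torch.tensor(c2))
    return (1 - alpha) * e1 + alpha * e2
```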
Downstream Tasks: Trained a VALOR policy on Ant and used it as the lower-level policy for an Ant maze task
Only the upper-level policy is trained on the maze task
Worked as well as a hierarchical policy trained from scratch (with a random network as the lower-level policy) and a non-hierarchical policy trained from scratch
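A sketch of the hierarchy, with all interfaces assumed: the upper-level policy outputs a context, and the frozen VALOR policy executes the corresponding skill:

```python
def hierarchical_step(env, high_policy, valor_policy, obs, steps_per_option=100):
    """Pick a context with the upper level, run the frozen skill for a while (sketch)."""
    c = high_policy.act(obs)                # upper level chooses a skill/context
    total_reward = 0.0
    for _ in range(steps_per_option):
        a = valor_policy.act(obs, c)        # frozen lower-level VALOR policy
        obs, r, done, _ = env.step(a)
        total_reward += r                   # task reward trains only the upper level
        if done:
            break
    return obs, total_reward
```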