We want to find what intrinsic options are available to an agent at a given state
Options: policies with a termination condition
Independent of agent’s intentions
Set of all things that it is possible for an agent to achieve
Traditional approach to option learning: Find small set of options for a specific task
Makes credit assignment + planning easier over long horizons
Larger sets of options advantageous
Number of options still smaller than number of action sequences (since options distinguished by final state)
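For a rough sense of scale (illustrative numbers, not from the paper): with $|A|$ actions and horizon $T$ there are $|A|^T$ open-loop action sequences, but options distinguished only by their final state number at most $|S|$; e.g., $|A| = 4$ and $T = 10$ give $4^{10} \approx 10^6$ sequences versus a few hundred reachable states in a small gridworld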
We want to learn representational embedding of options (similar options = similar embedding)
In an embedded space, a planner only needs to choose a neighborhood of the space
Using function approximators for state + goal embeddings was useful for control + generalization over many goals
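A minimal sketch of the embedding idea (my own illustration, not the paper's architecture; the layer sizes and dot-product scoring are assumptions): encode states and options/goals into a shared space so that similar options get nearby embeddings, letting a planner pick a neighborhood rather than an exact option.

```python
import torch
import torch.nn as nn

class OptionEmbeddingScorer(nn.Module):
    """Embed states and options into a shared space; score compatibility by dot product."""
    def __init__(self, state_dim, option_dim, embed_dim=32):
        super().__init__()
        self.state_enc = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
        self.option_enc = nn.Sequential(
            nn.Linear(option_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))

    def forward(self, state, option):
        # Similar options map to nearby embeddings, so a planner only has to
        # choose a neighborhood of this space rather than one exact option.
        s = self.state_enc(state)      # (batch, embed_dim)
        g = self.option_enc(option)    # (batch, embed_dim)
        return (s * g).sum(dim=-1)     # scalar score per (state, option) pair

# Usage: score a batch of candidate options against the current states.
scorer = OptionEmbeddingScorer(state_dim=8, option_dim=4)
print(scorer(torch.randn(5, 8), torch.randn(5, 4)).shape)  # torch.Size([5])
```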
This paper gives a method to learn goals (options)
Two applications of learning intrinsic options
Classical RL: maximize expected reward
Empowerment: get to a state with the maximal set of options that the agent knows (has learned)
Agent should aim for states where it has most control after learning
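Empowerment is commonly formalized (standard definition from the empowerment literature, stated in the notation used later in these notes) as the maximal mutual information between the option $\Omega$ and the final state $s_f$, over choices of the option distribution:

$$\mathcal{E}(s_0) = \max_{p(\Omega \vert s_0)} I(\Omega; s_f \vert s_0)$$

High-empowerment states are exactly the "most control" states the agent should aim for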
Intrinsic Motivation vs Options
Motivation: goal is to predict observations
Understands the environment by building a dynamics model; this objective may distract or impair the agent
Options: goal is to control the environment
Learns the amount of influence the agent has on the environment (i.e., how many distinct states it can reliably bring about)
Similar to unsupervised learning but instead of finding representations, it finds policies
Also estimates amount of control in different states
Evaluation Metrics
Unsupervised learning uses data likelihood: amount of information needed to describe data
For unsupervised control, use mutual information between options and final states
Open-loop options: the agent decides a sequence of actions beforehand and follows it regardless of environment dynamics. Mutual information is between the sequence of actions and the final states
This results in poor performance in stochastic environments
Closed-loop options: actions are conditioned on the current state
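A rough sketch of how this metric could be estimated from rollouts (my own illustration for discrete options and final states; the plug-in estimator and the toy data are assumptions, not the paper's procedure):

```python
import numpy as np
from collections import Counter

def mutual_information(options, final_states):
    """Plug-in estimate of I(Omega; s_f) from paired (option, final state) samples."""
    n = len(options)
    joint = Counter(zip(options, final_states))
    p_opt = Counter(options)
    p_sf = Counter(final_states)
    mi = 0.0
    for (w, sf), c in joint.items():
        p_joint = c / n
        mi += p_joint * np.log(p_joint / ((p_opt[w] / n) * (p_sf[sf] / n)))
    return mi  # in nats

# Closed-loop options that reliably reach distinct final states score high;
# open-loop action sequences in a stochastic environment spread their final
# states around, so the same estimator returns a lower value.
options      = [0, 0, 1, 1, 2, 2]
final_states = ['a', 'a', 'b', 'b', 'c', 'c']
print(mutual_information(options, final_states))  # ~log(3), each option owns a state
```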
Intrinsic Control and the Mutual Information Principle
Option: an element $\Omega$ of a space together with a policy $\pi(a \vert s, \Omega)$
$\pi$ has a termination action that leads to the final state $s_f$
$\Omega$ can take finite number of values; each value has a distinct policy
$\Omega$ can be a binary vector of length $n$; $2^n$ options
$\Omega$ can be a real-valued vector; infinite options
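A small sketch of the three option spaces and an option-conditioned policy (illustrative only; the dimensions and the softmax head are my assumptions, not the paper's parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three ways to parameterize the option variable Omega:
omega_discrete = rng.integers(0, 10)          # one of a finite set of values
omega_binary   = rng.integers(0, 2, size=8)   # length-8 binary vector -> 2^8 options
omega_real     = rng.normal(size=8)           # real-valued vector -> infinitely many options

def policy(state, omega, W):
    """Toy option-conditioned policy pi(a | s, Omega): a softmax over actions
    computed from the concatenated state and option vectors."""
    x = np.concatenate([state, omega])
    logits = W @ x
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

state = rng.normal(size=4)
W = rng.normal(size=(3, 4 + 8))   # 3 actions; one of them can be the terminate action
print(policy(state, omega_real, W))                 # pi(a | s, Omega) for a real-valued Omega
print(policy(state, omega_binary.astype(float), W)) # same policy head, binary-vector Omega
```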
Start at $s_0$ and follow an option $\Omega$: since environments and policies are stochastic, the final state follows a probability distribution $p^J(s_f \vert s_0, \Omega)$
Options that lead to similar final states should be treated as the same
Group these options together into a probability distribution and sample from it when choosing an option
Called the controllability distribution: $p^C(\Omega \vert s_0)$
Ensures behavior diversity
To maximize intrinsic control, choose $\Omega$ that maximizes diversity of final states
Entropy of final states: $H(s_f) = -\sum _{s_f} p(s_f \vert s_0) \log p(s_f \vert s_0)$
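Here $p(s_f \vert s_0)$ is the marginal over options, and this entropy is one term of the mutual-information objective from the Evaluation Metrics section (standard identities, written in the notation of these notes):

$$p(s_f \vert s_0) = \sum_{\Omega} p^C(\Omega \vert s_0)\, p^J(s_f \vert s_0, \Omega)$$

$$I(\Omega; s_f \vert s_0) = H(s_f \vert s_0) - H(s_f \vert s_0, \Omega)$$

Intrinsic control is high when the final states are diverse overall (high $H(s_f \vert s_0)$) yet predictable given the chosen $\Omega$ (low $H(s_f \vert s_0, \Omega)$)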