We want to find what intrinsic options are available to an agent at a given state
Options: policies with a termination condition
Independent of agent’s intentions
Set of all things that it is possible for an agent to achieve
Traditional approach to option learning: Find small set of options for a specific task
Makes credit assignment + planning easier over long horizons
Larger sets of options advantageous
Number of options still smaller than number of action sequences (since options distinguished by final state)
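For a rough sense of scale (illustrative numbers, not from the paper): with $|A|$ actions and horizon $T$ there are $|A|^T$ open-loop action sequences, but options distinguished only by their final state number at most $|S|$; e.g., $|A| = 4$ and $T = 10$ give $4^{10} \approx 10^6$ sequences versus a few hundred reachable states in a small gridworld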
We want to learn representational embedding of options (similar options = similar embedding)
In an embedded space, a planner only needs to choose a neighborhood of the space
Using function approximators for state + goal embeddings was useful for control + generalization over many goals
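A minimal sketch of the embedding idea (my own illustration, not the paper's architecture; the layer sizes and dot-product scoring are assumptions): encode states and options/goals into a shared space so that similar options get nearby embeddings, letting a planner pick a neighborhood rather than an exact option.

```python
import torch
import torch.nn as nn

class OptionEmbeddingScorer(nn.Module):
    """Embed states and options into a shared space; score compatibility by dot product."""
    def __init__(self, state_dim, option_dim, embed_dim=32):
        super().__init__()
        self.state_enc = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
        self.option_enc = nn.Sequential(
            nn.Linear(option_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))

    def forward(self, state, option):
        # Similar options map to nearby embeddings, so a planner only has to
        # choose a neighborhood of this space rather than one exact option.
        s = self.state_enc(state)      # (batch, embed_dim)
        g = self.option_enc(option)    # (batch, embed_dim)
        return (s * g).sum(dim=-1)     # scalar score per (state, option) pair

# Usage: score a batch of candidate options against the current states.
scorer = OptionEmbeddingScorer(state_dim=8, option_dim=4)
print(scorer(torch.randn(5, 8), torch.randn(5, 4)).shape)  # torch.Size([5])
```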
This paper gives a method to learn goals (options)
Two applications of learning intrinsic options
Classical RL: maximize expected reward
Empowerment: get to a state with the maximal set of options that the agent knows (has learned)
Agent should aim for states where it has most control after learning
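Empowerment is commonly formalized (standard definition from the empowerment literature, stated in the notation used later in these notes) as the maximal mutual information between the option $\Omega$ and the final state $s_f$, over choices of the option distribution:

$$\mathcal{E}(s_0) = \max_{p(\Omega \vert s_0)} I(\Omega; s_f \vert s_0)$$

High-empowerment states are exactly the "most control" states the agent should aim for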
Intrinsic Motivation vs Options
Motivation: goal is to predict observations
Understands the environment by building a dynamics model; this objective may distract or impair the agent
Options: goal is to control the environment
Learns the amount of influence the agent has on the environment (i.e., how many distinct states it can reliably bring about)
Similar to unsupervised learning but instead of finding representations, it finds policies
Also estimates amount of control in different states
Evaluation Metrics
Unsupervised learning uses data likelihood: amount of information needed to describe data
For unsupervised control, use mutual information between options and final states
Open-loop options: the agent decides a sequence of actions beforehand and follows it regardless of environment dynamics. Mutual information is between the sequence of actions and the final states
This results in poor performance in stochastic environments
Closed-loop options: actions are conditioned on the current state
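A rough sketch of how this metric could be estimated from rollouts (my own illustration for discrete options and final states; the plug-in estimator and the toy data are assumptions, not the paper's procedure):

```python
import numpy as np
from collections import Counter

def mutual_information(options, final_states):
    """Plug-in estimate of I(Omega; s_f) from paired (option, final state) samples."""
    n = len(options)
    joint = Counter(zip(options, final_states))
    p_opt = Counter(options)
    p_sf = Counter(final_states)
    mi = 0.0
    for (w, sf), c in joint.items():
        p_joint = c / n
        mi += p_joint * np.log(p_joint / ((p_opt[w] / n) * (p_sf[sf] / n)))
    return mi  # in nats

# Closed-loop options that reliably reach distinct final states score high;
# open-loop action sequences in a stochastic environment spread their final
# states around, so the same estimator returns a lower value.
options      = [0, 0, 1, 1, 2, 2]
final_states = ['a', 'a', 'b', 'b', 'c', 'c']
print(mutual_information(options, final_states))  # ~log(3), each option owns a state
```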
Intrinsic Control and the Mutual Information Principle
Option: an element $\Omega$ of a space together with a policy $\pi(a \vert s, \Omega)$
$\pi$ has a termination action that leads to the final state $s_f$
$\Omega$ can take finite number of values; each value has a distinct policy
$\Omega$ can be a binary vector of length $n$; $2^n$ options
$\Omega$ can be a real-valued vector; infinite options
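A small sketch of the three option spaces and an option-conditioned policy (illustrative only; the dimensions and the softmax head are my assumptions, not the paper's parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three ways to parameterize the option variable Omega:
omega_discrete = rng.integers(0, 10)          # one of a finite set of values
omega_binary   = rng.integers(0, 2, size=8)   # length-8 binary vector -> 2^8 options
omega_real     = rng.normal(size=8)           # real-valued vector -> infinitely many options

def policy(state, omega, W):
    """Toy option-conditioned policy pi(a | s, Omega): a softmax over actions
    computed from the concatenated state and option vectors."""
    x = np.concatenate([state, omega])
    logits = W @ x
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

state = rng.normal(size=4)
W = rng.normal(size=(3, 4 + 8))   # 3 actions; one of them can be the terminate action
print(policy(state, omega_real, W))                 # pi(a | s, Omega) for a real-valued Omega
print(policy(state, omega_binary.astype(float), W)) # same policy head, binary-vector Omega
```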
Start at $s_0$ and follow an option $\Omega$: since environments and policies are stochastic, the final state follows a probability distribution $p^J(s_f \vert s_0, \Omega)$
Options that lead to similar final states should be treated as the same
Group these options together into a probability distribution and sample from it when choosing an option
Called the controllability distribution: $p^C(\Omega \vert s_0)$
Ensures behavior diversity
To maximize intrinsic control, choose $\Omega$ that maximizes diversity of final states
Entropy of final states: $H(s_f) = -\sum _{s_f} p(s_f \vert s_0) \log p(s_f \vert s_0)$
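Here $p(s_f \vert s_0)$ is the marginal over options, and this entropy is one term of the mutual-information objective from the Evaluation Metrics section (standard identities, written in the notation of these notes):

$$p(s_f \vert s_0) = \sum_{\Omega} p^C(\Omega \vert s_0)\, p^J(s_f \vert s_0, \Omega)$$

$$I(\Omega; s_f \vert s_0) = H(s_f \vert s_0) - H(s_f \vert s_0, \Omega)$$

Intrinsic control is high when the final states are diverse overall (high $H(s_f \vert s_0)$) yet predictable given the chosen $\Omega$ (low $H(s_f \vert s_0, \Omega)$)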