Using the variational approximation $q$ in place of the true posterior, the new reward function is: $r'(s_t, a_t, s _{t+1}) = r(s_t, a_t) + \eta D _{KL}(q(\theta; \phi _{t+1}) \vert \vert q(\theta; \phi_t))$
To parameterize the dynamics model, use Bayesian neural networks (BNNs)
Parameterized with a fully factorized Gaussian over the weights (see the sketch below)
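Since both the old and new approximate posteriors are fully factorized Gaussians, the KL term in the reward has a closed form. A minimal numpy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old):
    # Closed-form KL[q(theta; phi_new) || q(theta; phi_old)] between
    # fully factorized Gaussians over the BNN weights.
    return np.sum(
        np.log(sigma_old / sigma_new)
        + (sigma_new**2 + (mu_new - mu_old)**2) / (2.0 * sigma_old**2)
        - 0.5
    )

def augmented_reward(r, eta, mu_new, sigma_new, mu_old, sigma_old):
    # r' = r + eta * KL[q_new || q_old], the intrinsic-reward bonus
    return r + eta * kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old)
```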
Compression
Agent curiosity can be equated with compression improvement: $C(\xi_t; \phi _{t-1}) - C(\xi_t; \phi_t)$, where $C(\xi; \phi)$ is the description length of $\xi$ under model $\phi$
Description length can be expressed as the negative variational lower bound: $C(\xi; \phi) = -L[q(\theta; \phi), \xi]$
Using the identity $L[q(\theta; \phi), \xi] = \log p(\xi) - D _{KL}[q(\theta; \phi) \vert \vert p(\theta \vert \xi)]$, we can express compression improvement as: $(\log p(\xi_t) - D _{KL}[q(\theta; \phi_t) \vert \vert p(\theta \vert \xi_t)]) - (\log p(\xi_t) - D _{KL}[q(\theta; \phi _{t-1}) \vert \vert p(\theta \vert \xi_t)])$
The $\log p(\xi_t)$ terms cancel, and the first KL term becomes 0 when the approximation $q(\theta; \phi_t)$ exactly equals the posterior $p(\theta \vert \xi_t)$
Compression improvement thus comes down to the KL divergence from the posterior given $\xi _{t-1}$ to the posterior given $\xi_t$
Per transition, this is the reverse KL of the information gain: $D _{KL}(p(\theta \vert \xi_t) \vert \vert p(\theta \vert \xi_t, a_t, s _{t+1}))$
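Putting the steps together in one chain (the $\log p(\xi_t)$ terms cancel by algebra; the last step assumes the variational approximations are exact):

$C(\xi_t; \phi _{t-1}) - C(\xi_t; \phi_t) = D _{KL}[q(\theta; \phi _{t-1}) \vert \vert p(\theta \vert \xi_t)] - D _{KL}[q(\theta; \phi_t) \vert \vert p(\theta \vert \xi_t)] \approx D _{KL}[p(\theta \vert \xi _{t-1}) \vert \vert p(\theta \vert \xi_t)]$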
Implementation
BNN weight distribution is a fully factorized Gaussian: $q(\theta; \phi) = \prod _{i=1}^{\vert \Theta \vert} \mathcal{N}(\theta_i \vert \mu_i, \sigma_i^2)$
$\phi = \{\mu, \sigma\}$
Standard deviation is parameterized as the softplus $\sigma = \log(1 + e^\rho)$ to keep it positive
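A minimal sketch of this parameterization, drawing one weight sample via the reparameterization $\theta = \mu + \sigma \epsilon$ (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_weights = 100                        # |Theta|, illustrative
mu = rng.normal(0.0, 0.1, n_weights)   # per-weight means
rho = np.full(n_weights, -3.0)         # pre-softplus std parameters
sigma = np.log1p(np.exp(rho))          # sigma = log(1 + e^rho), always > 0
theta = mu + sigma * rng.standard_normal(n_weights)  # theta ~ q(.; phi)
```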
To train the BNN, the expected log-likelihood term of the variational lower bound is optimized through sampling, computing $\mathbb{E} _{\theta \sim q(\cdot;\phi)}[\log p(\mathcal{D} \vert \theta)]$
Use stochastic gradient variational Bayes (SGVB) or Bayes by Backprop to optimize the variational lower bound
Use the local reparameterization trick
Sample neuron pre-activations instead of weights, which reduces gradient variance (see the sketch below)
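A sketch of local reparameterization for one linear layer, assuming factorized Gaussian weights as above: the pre-activation for input $x$ is Gaussian with mean $x^\top \mu$ and variance $(x^2)^\top \sigma^2$, so it can be sampled directly (forward pass only; in training, gradients flow through the mean and variance):

```python
import numpy as np

def local_reparam_linear(x, mu_w, rho_w, rng):
    # Sample pre-activations z ~ N(x @ mu_w, x^2 @ sigma_w^2) directly,
    # instead of sampling a full weight matrix for every example.
    sigma2 = np.log1p(np.exp(rho_w)) ** 2   # per-weight variances
    mean = x @ mu_w                          # (batch, n_out)
    var = (x ** 2) @ sigma2                  # (batch, n_out)
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)
```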
Sample transitions from a replay pool when optimizing the variational lower bound: uniform sampling breaks temporal correlation (which destabilizes learning), yields approximately i.i.d. samples, and diminishes posterior approximation error
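A minimal FIFO replay pool sketch (names illustrative):

```python
import random
from collections import deque

pool = deque(maxlen=100_000)  # FIFO pool of (s, a, s_next) transitions

def sample_batch(batch_size=32):
    # Uniform sampling breaks the temporal correlation of consecutive
    # transitions, yielding approximately i.i.d. minibatches.
    return random.sample(pool, min(batch_size, len(pool)))
```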
The updated posterior of the dynamics model is computed as: $\phi' = \arg\min _{\phi}[\ell(q(\theta; \phi), s_t)]$
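A toy gradient-based sketch of this update for a single Gaussian weight, assuming $\ell$ combines the KL to the previous posterior with the negative expected log-likelihood of the new observation (the exact form of $\ell$, the toy scalar data, and the noise scale are assumptions here):

```python
import torch

mu_old, sigma_old = 0.0, 1.0               # previous posterior q(.; phi_{t-1})
mu = torch.tensor(0.0, requires_grad=True)
rho = torch.tensor(0.5, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=1e-2)
s_t = torch.tensor(0.8)                    # toy observed next state

for _ in range(200):
    sigma = torch.nn.functional.softplus(rho)   # sigma = log(1 + e^rho)
    # Closed-form KL[q(.; phi) || q(.; phi_old)] for univariate Gaussians
    kl = (torch.log(sigma_old / sigma)
          + (sigma**2 + (mu - mu_old)**2) / (2 * sigma_old**2) - 0.5)
    theta = mu + sigma * torch.randn(())        # reparameterized sample
    log_lik = torch.distributions.Normal(theta, 0.1).log_prob(s_t)
    opt.zero_grad()
    (kl - log_lik).backward()                   # l = KL - E_q[log p(s_t | theta)]
    opt.step()
```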
The Hessian of $\ell _{KL}$ is diagonal. With respect to $\mu_i$: $\frac{\partial^2 \ell _{KL}}{\partial \mu_i^2} = \frac{1}{\log^2(1 + e^{\rho_i})}$
With respect to $\rho_i$ (which parameterizes the standard deviation): $\frac{\partial^2 \ell _{KL}}{\partial \rho_i^2} = \frac{2e^{2\rho_i}}{(1 + e^{\rho_i})^2} \frac{1}{\log^2(1 + e^{\rho_i})}$
The KL can also be approximated via a second-order Taylor expansion: since the KL and its gradient vanish at the current $\phi$, $D _{KL}[q(\theta; \phi + \lambda\Delta\phi) \vert \vert q(\theta; \phi)] \approx \frac{1}{2}\lambda^2 \Delta\phi^\top H \Delta\phi$, with $H$ the diagonal Hessian above
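A numpy sketch of that approximation using the diagonal Hessian entries above, for an update step $\lambda \Delta\phi$ (names are illustrative):

```python
import numpy as np

def kl_taylor_second_order(delta_mu, delta_rho, rho, lam=1.0):
    # KL[q(theta; phi + lam*delta_phi) || q(theta; phi)]
    #   ~= 0.5 * lam^2 * delta_phi^T H delta_phi, with H diagonal.
    sp = np.log1p(np.exp(rho))                    # sigma = log(1 + e^rho)
    h_mu = 1.0 / sp**2                            # d^2 l_KL / d mu_i^2
    h_rho = 2.0 * np.exp(2 * rho) / ((1.0 + np.exp(rho))**2 * sp**2)
    return 0.5 * lam**2 * (np.sum(h_mu * delta_mu**2)
                           + np.sum(h_rho * delta_rho**2))
```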