EX2: Exploration with Exemplar Models for Deep Reinforcement Learning
-
Resources
-
Introduction
- Exploration is hard with sparse reward signals
- $\epsilon$-greedy and Gaussian noise are undirected (they don't explicitly seek out interesting states)
- Estimate novelty by predicting future states or state densities
- Count-based approaches have shown speedups
- Generating + predicting states in high dimensions is difficult
- EX2 trains a discriminator to distinguish a given state from states seen previously
- Easy to distinguish = likely to be novel
- Exemplar Models: model that distinguishes a state from all other observed states
-
Preliminaries
- Assume standard MDP + RL setting
- In discrete state settings, we can give exploration bonuses based on counts
- In continuous state settings, we give exploration bonuses by generatively learning a probability density and estimating pseudo-counts
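A minimal sketch of the discrete count-based bonus described above (the $\beta/\sqrt{N(s)}$ form and all names are illustrative assumptions, not this paper's method):

```python
from collections import defaultdict
import math

# Visit counts for discrete (hashable) states.
counts = defaultdict(int)

def count_bonus(state, beta=0.1):
    """Count-based exploration bonus that decays with visitation.

    Uses the common beta / sqrt(N(s)) form; in continuous spaces,
    pseudo-counts derived from a learned density play the same role.
    """
    counts[state] += 1
    return beta / math.sqrt(counts[state])
```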
-
Exemplar Models and Density Estimation
-
Exemplar Models
- Dataset: $X = (x_1 \dots x_n)$
- Exemplar model has $n$ classifiers / discriminators, $(D _{x_1} \dots D _{x_n})$
- Avoid $n$ classifiers by using a single exemplar-conditioned network
- $D _{x^\ast} : \chi \rightarrow [0,1]$: discriminator associated with exemplar $x^\ast$
- $P _{\chi}(x)$: Data distribution
- Each discriminator receives a balanced dataset
- Half of the data consists of the exemplar $x^\ast$; the other half is drawn from $P _{\chi}(x)$
- Discriminator models a Bernoulli distribution
- $D _{x^\ast}(x) = P(x = x^\ast \vert x)$ via maximum likelihood
- The label $x = x^\ast$ is noisy (data similar or identical to $x^\ast$ can also occur in $P _{\chi}(x)$)
- Cross-entropy objective: $D _{x^\ast} = \arg\max _{D \in \mathcal{D}}\left(\mathbb{E} _{\delta _{x^\ast}}[\log D(x)] + \mathbb{E} _{P _\chi}[\log (1 - D(x))]\right)$
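A minimal PyTorch sketch of fitting one such discriminator on balanced batches (architecture, optimizer settings, and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

def train_exemplar_discriminator(exemplar, background, steps=100, lr=1e-3):
    """Fit D_{x*} via the cross-entropy objective above.

    exemplar: tensor of shape (dim,); background: tensor of shape
    (n, dim) sampled from the data distribution P_X.
    """
    dim = exemplar.shape[0]
    D = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        # Balanced batch: half copies of the exemplar, half background samples.
        pos = exemplar.expand_as(background)
        logits = D(torch.cat([pos, background])).squeeze(-1)
        labels = torch.cat([torch.ones(len(background)),
                            torch.zeros(len(background))])
        loss = bce(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return D
```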
-
Exemplar Models as Implicit Density Estimation
- Optimal discriminator satisfies:
- $D _{x^\ast}(x) = \frac{\delta _{x^\ast}(x)}{\delta _{x^\ast}(x) + P _{\chi}(x)}$
- Evaluated at the exemplar (discrete case, where $\delta _{x^\ast}(x^\ast) = 1$): $D _{x^\ast}(x^\ast) = \frac{1}{1 + P _{\chi}(x^\ast)}$
- If a discriminator is optimal, we can find probability of datapoint by evaluating discriminator at exemplar
- $P _\chi(x^\ast) = \frac{1 - D _{x^\ast}(x^\ast)}{D _{x^\ast}(x^\ast)}$
- For continuous domains, we can't recover $P _{\chi}(x^\ast)$ this way because $\delta _{x^\ast}(x^\ast) \rightarrow \infty$ and $D _{x^\ast}(x^\ast) \rightarrow 1$
- Smooth the delta by adding noise ($q$)
- When noise is added: $P _\chi (x^\ast) \propto \frac{1 - D _{x^\ast}(x^\ast)}{D _{x^\ast}(x^\ast)}$
- Proportionality holds as long as the convolution $(\delta _{x^\ast} \ast q)(x^\ast)$ is the same for all $x^\ast$
- Reward bonus invariant to normalization so proportionality is sufficient
- In practice, we add noise to the background distribution as well
- $D _{x^\ast}(x) = \frac{(\delta _{x^\ast} \ast q)(x)}{(\delta _{x^\ast} \ast q)(x) + (P _{\chi} \ast q)(x)}$
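Continuing the sketch above, novelty then comes from a single evaluation of the trained discriminator at its exemplar (`implicit_density` is an illustrative helper name):

```python
import torch

def implicit_density(D, x_star):
    """Unnormalized implicit density: P_X(x*) ∝ (1 - D(x*)) / D(x*).

    High D(x*) (the exemplar is easy to distinguish) implies low
    density, i.e. a novel state. Assumes D outputs a logit, as in
    the training sketch above.
    """
    with torch.no_grad():
        d = torch.sigmoid(D(x_star)).item()
    return (1.0 - d) / max(d, 1e-8)  # clamp to avoid division by zero
```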
-
Latent Space Smoothing with Noisy Discriminators
- Adding noise directly to high-dimensional states does not produce meaningful new states
- Instead, inject noise into a latent space
- Train an encoder: $q(z \vert x)$
- Train a latent space classifier: $p(y \vert z) = D(z)^y(1 - D(z))^{1- y}$
- $y = 1$ if $x = x^\ast$, else $y = 0$
- Regularize the encoder against a prior $p(z)$; the outer expectation is over the balanced data distribution $\tilde{p}(x) = \frac{1}{2}\delta _{x^\ast}(x) + \frac{1}{2} P _{\chi}(x)$
- Maximize the objective: $\max _{p(y \vert z),\, q(z \vert x)} \mathbb{E} _{\tilde{p}}\left[\mathbb{E} _{q(z \vert x)}[\log p(y \vert z)] - D _{KL}(q(z \vert x) \vert\vert p(z))\right]$
- Maximizes classification accuracy while transmitting minimal information through the latent space
- Captures factors of variation in $x$ that are most informative to distinguish points
- Define the following:
- Marginal positive density over latent space: $q(z \vert y = 1) = \int_x \delta _{x^\ast}(x) q(z \vert x)dx$
- Marginal negative density over latent space: $q(z \vert y = 0) = \int_x p _{\chi}(x) q(z \vert x)dx$
- Optimal discriminator satisfies $p(y = 1 \vert z) = D(z) = \frac{q(z \vert y =1)}{q(z \vert y =1) + q(z \vert y = 0)}$
- Optimal encoder satisfies $q(z \vert x) \propto D(z)^{y _{soft}(x)}(1 - D(z))^{1 - y _{soft}(x)}p(z)$
- Average label of x: $y _{soft}(x) = p(y = 1 \vert x) = \frac{\delta _{x^\ast}(x)}{\delta _{x^\ast}(x) + p _{\chi}(x)}$
- Recover the smoothed discriminator (and hence the density) by averaging over the encoder: $D _{x^\ast}(x) = \mathbb{E} _{q(z \vert x)}[D(z)]$
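A sketch of the noisy latent-space discriminator under these definitions, assuming a unit-Gaussian prior $p(z)$ and a reparameterized Gaussian encoder (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class LatentDiscriminator(nn.Module):
    """Encoder q(z|x) with injected Gaussian noise + classifier p(y|z).

    Trained by maximizing classification log-likelihood minus
    KL(q(z|x) || p(z)) against a unit-Gaussian prior p(z).
    """
    def __init__(self, dim, latent=8):
        super().__init__()
        self.mu = nn.Linear(dim, latent)
        self.log_std = nn.Linear(dim, latent)
        self.classifier = nn.Sequential(
            nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        mu, log_std = self.mu(x), self.log_std(x)
        z = mu + log_std.exp() * torch.randn_like(mu)  # reparameterized noise
        logits = self.classifier(z).squeeze(-1)
        # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims.
        kl = 0.5 * (mu**2 + (2 * log_std).exp() - 2 * log_std - 1).sum(-1)
        return logits, kl

# Per-batch loss with labels y in {0, 1}:
#   nn.functional.binary_cross_entropy_with_logits(logits, y) + kl.mean()
```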
-
Smoothing from Suboptimal Discriminators
- With a suboptimal discriminator, there is a second source of density smoothing: the discriminator has difficulty distinguishing between two similar states
- So explicitly adding noise is not always necessary
-
$EX^2$: Exploration with Exemplar Models
- To generate samples from $P(s)$, sample the replay buffer
- Exemplars = states we want to score
- Offline setting: states in current batch of trajectories
- Online setting: train discriminator on states as we receive them
- Augment reward with novelty of the state
- $R’(s,a) = R(s,a) + \beta f(D _{s}(s))$
- $\beta$ is a hyperparameter to tune the magnitude of the bonus
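A sketch of the bonus computation; the choice $f(D) = \log D - \log(1 - D)$ (which equals $-\log$ of the implicit density estimate up to a constant) is one monotone option assumed here, and the exact $f$ and $\beta$ are tuning decisions:

```python
import math

def augmented_reward(r, d, beta=1e-3):
    """R'(s,a) = R(s,a) + beta * f(D_s(s)).

    d is the discriminator output D_s(s) in (0, 1); higher d means
    the state is easier to distinguish, hence more novel.
    """
    d = min(max(d, 1e-8), 1.0 - 1e-8)  # clamp for numerical stability
    return r + beta * (math.log(d) - math.log(1.0 - d))
```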
-
Model Architecture
-
Amortized Multi-Exemplar Model
- Instead of a separate classifier for each exemplar, use a single model conditioned on the exemplar
- Condition latent space discriminator on encoded exemplar from $q(z^\ast \vert x^\ast)$
- Don’t need to train discriminators from scratch every iteration + provides generalization
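A sketch of an amortized discriminator conditioned on the encoded exemplar; encoders are deterministic here and the noise/KL machinery from the latent-smoothing section is omitted for brevity (sizes are illustrative):

```python
import torch
import torch.nn as nn

class AmortizedDiscriminator(nn.Module):
    """Single discriminator scoring any (state, exemplar) pair.

    Conditioning on an encoded exemplar z* avoids training a new
    discriminator from scratch per exemplar and generalizes across
    exemplars.
    """
    def __init__(self, dim, latent=8):
        super().__init__()
        self.enc_x = nn.Linear(dim, latent)      # encodes the query state x
        self.enc_star = nn.Linear(dim, latent)   # encodes the exemplar x*
        self.head = nn.Sequential(
            nn.Linear(2 * latent, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x, x_star):
        z = torch.cat([self.enc_x(x), self.enc_star(x_star)], dim=-1)
        return self.head(z).squeeze(-1)  # logit for P(x = x*)
```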
-
K-Exemplar Model
- We can interpolate between one model per exemplar (K = 1) and a single model for all exemplars (K = number of states)
- Batch K temporally adjacent states into the same discriminator
- Form of temporal regularization (adjacent states are similar)
- Discriminators share the majority of their layers
- Only the final linear layer varies per discriminator (as sketched below)
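A sketch of that weight sharing: one trunk for all discriminators, with a single linear layer whose output units act as the per-discriminator heads (trunk depth and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SharedExemplarDiscriminators(nn.Module):
    """Discriminators sharing every layer except the final linear one.

    Each of the num_heads output units is the head of one
    discriminator, covering K temporally adjacent exemplar states.
    """
    def __init__(self, dim, num_heads, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.heads = nn.Linear(hidden, num_heads)  # only layer that varies

    def forward(self, x):
        return self.heads(self.trunk(x))  # (batch, num_heads) logits
```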
-
Relationship to Generative Adversarial Networks (GANs)
- Policy = generator of a GAN
- Exemplar model = discriminator of GAN
- In EX2, the generator is rewarded for helping the discriminator rather than fooling it
- Competition instead comes from the progression of time
- As a novel state is visited more frequently, the replay buffer becomes saturated with that state
- Forces policy to find new states
-
Experimental Evaluation
- Compare K-exemplar and amortized EX2 to standard random exploration, kernel density estimation (KDE) with RBF kernels, VIME, and hash-based methods
- Continuous Control:
- Even on medium-dimensional tasks, the implicit density estimation approach works well
- EX2, VIME, and hashing outperform regular TRPO and KDE on SwimmerGather
- Amortized EX2 outperforms all other methods on HalfCheetah
- Image-Based Control:
- EX2 generates coherent exploration behavior even in high-dimensional visual environments
- On the most challenging task, EX2 exceeds all other exploration techniques
- Indicates explicit density estimators handle simple/clean images well but struggle with complex, egocentric observations