EX2: Exploration with Exemplar Models for Deep Reinforcement Learning
-
Resources
-
Introduction
- Exploration is hard with sparse reward signals
- $\epsilon$-greedy and Gaussian noise are undirected (they don't explicitly seek out interesting states)
- Estimate novelty by predicting future states or state densities
- Count-based approaches have shown speedups
- Generating + predicting states in high dimensions is difficult
- EX2 trains a discriminator to distinguish a given state from states seen previously
- Easy to distinguish = likely to be novel
- Exemplar Models: model that distinguishes a state from all other observed states
-
Preliminaries
- Assume standard MDP + RL setting
- In discrete state settings, we can give exploration bonuses based on counts
- In continuous state settings, we give exploration bonuses by generatively learning a probability density and estimating pseudo-counts
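A minimal sketch of the discrete count-based bonus described above (the $\beta/\sqrt{N(s)}$ form and all names are illustrative assumptions, not this paper's method):

```python
from collections import defaultdict
import math

# Visit counts for discrete (hashable) states.
counts = defaultdict(int)

def count_bonus(state, beta=0.1):
    """Count-based exploration bonus that decays with visitation.

    Uses the common beta / sqrt(N(s)) form; in continuous spaces,
    pseudo-counts derived from a learned density play the same role.
    """
    counts[state] += 1
    return beta / math.sqrt(counts[state])
```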
-
Exemplar Models and Density Estimation
-
Exemplar Models
- Dataset: $X = (x_1 \dots x_n)$
- Exemplar model has $n$ classifiers / discriminators, $(D _{x_1} \dots D _{x_n})$
- Avoid $n$ classifiers by using a single exemplar-conditioned network
- $D _{x^\ast} : \chi \rightarrow [0,1]$: discriminator associated with exemplar $x^\ast$
- $P _{\chi}(x)$: Data distribution
- Each discriminator receives a balanced dataset
- Half of the data consists of the exemplar $x^\ast$; the other half is drawn from $P _{\chi}(x)$
- Discriminator models a Bernoulli distribution
- $D _{x^\ast}(x) = P(x = x^\ast \vert x)$ via maximum likelihood
- The label $x = x^\ast$ is noisy (data similar or identical to $x^\ast$ can also occur in $P _{\chi}(x)$)
- Cross-entropy objective: $D _{x^\ast} = \arg\max _{D \in \mathcal{D}}\left(\mathbb{E} _{\delta _{x^\ast}}[\log D(x)] + \mathbb{E} _{P _\chi}[\log (1 - D(x))]\right)$
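A minimal PyTorch sketch of fitting one such discriminator on balanced batches (architecture, optimizer settings, and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

def train_exemplar_discriminator(exemplar, background, steps=100, lr=1e-3):
    """Fit D_{x*} via the cross-entropy objective above.

    exemplar: tensor of shape (dim,); background: tensor of shape
    (n, dim) sampled from the data distribution P_X.
    """
    dim = exemplar.shape[0]
    D = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        # Balanced batch: half copies of the exemplar, half background samples.
        pos = exemplar.expand_as(background)
        logits = D(torch.cat([pos, background])).squeeze(-1)
        labels = torch.cat([torch.ones(len(background)),
                            torch.zeros(len(background))])
        loss = bce(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return D
```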
-
Exemplar Models as Implicit Density Estimation
- Optimal discriminator satisfies:
- $D _{x^\ast}(x) = \frac{\delta _{x^\ast}(x)}{\delta _{x^\ast}(x) + P _{\chi}(x)}$
- Evaluated at the exemplar (discrete case, where $\delta _{x^\ast}(x^\ast) = 1$): $D _{x^\ast}(x^\ast) = \frac{1}{1 + P _{\chi}(x^\ast)}$
- If a discriminator is optimal, we can find probability of datapoint by evaluating discriminator at exemplar
- $P _\chi(x^\ast) = \frac{1 - D _{x^\ast}(x^\ast)}{D _{x^\ast}(x^\ast)}$
- For continuous domains, we can't recover $P _{\chi}(x^\ast)$ this way because $\delta _{x^\ast}(x^\ast) \rightarrow \infty$ and $D _{x^\ast}(x^\ast) \rightarrow 1$
- Smooth the delta by adding noise ($q$)
- When noise is added: $P _\chi (x^\ast) \propto \frac{1 - D _{x^\ast}(x^\ast)}{D _{x^\ast}(x^\ast)}$
- Proportionality holds as long as the convolution $(\delta _{x^\ast} \ast q)(x^\ast)$ is the same for all $x^\ast$
- Reward bonus invariant to normalization so proportionality is sufficient
- In practice, we add noise to the background distribution as well
- $D _{x^\ast}(x) = \frac{(\delta _{x^\ast} \ast q)(x)}{(\delta _{x^\ast} \ast q)(x) + (P _{\chi} \ast q)(x)}$
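Continuing the sketch above, novelty then comes from a single evaluation of the trained discriminator at its exemplar (`implicit_density` is an illustrative helper name):

```python
import torch

def implicit_density(D, x_star):
    """Unnormalized implicit density: P_X(x*) ∝ (1 - D(x*)) / D(x*).

    High D(x*) (the exemplar is easy to distinguish) implies low
    density, i.e. a novel state. Assumes D outputs a logit, as in
    the training sketch above.
    """
    with torch.no_grad():
        d = torch.sigmoid(D(x_star)).item()
    return (1.0 - d) / max(d, 1e-8)  # clamp to avoid division by zero
```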
-
Latent Space Smoothing with Noisy Discriminators
- Adding noise directly to high-dimensional states does not produce meaningful new states
- Instead, inject noise into a latent space
- Train an encoder: $q(z \vert x)$
- Train a latent space classifier: $p(y \vert z) = D(z)^y(1 - D(z))^{1- y}$
- $y = 1$ if $x = x^\ast$, else $y = 0$
- Regularize the encoder against a prior $p(z)$; the outer expectation is over the balanced data distribution $\tilde{p}(x) = \frac{1}{2}\delta _{x^\ast}(x) + \frac{1}{2} P _{\chi}(x)$
- Maximize the objective: $\max _{p(y \vert z),\, q(z \vert x)} \mathbb{E} _{\tilde{p}}\left[\mathbb{E} _{q(z \vert x)}[\log p(y \vert z)] - D _{KL}(q(z \vert x) \vert\vert p(z))\right]$
- Maximizes classification accuracy while transmitting minimal information through the latent space
- Captures factors of variation in $x$ that are most informative to distinguish points
- Define the following:
- Marginal positive density over latent space: $q(z \vert y = 1) = \int_x \delta _{x^\ast}(x) q(z \vert x)dx$
- Marginal negative density over latent space: $q(z \vert y = 0) = \int_x p _{\chi}(x) q(z \vert x)dx$
- Optimal discriminator satisfies $p(y = 1 \vert z) = D(z) = \frac{q(z \vert y =1)}{q(z \vert y =1) + q(z \vert y = 0)}$
- Optimal encoder satisfies $q(z \vert x) \propto D(z)^{y _{soft}(x)}(1 - D(z))^{1 - y _{soft}(x)}p(z)$
- Average label of x: $y _{soft}(x) = p(y = 1 \vert x) = \frac{\delta _{x^\ast}(x)}{\delta _{x^\ast}(x) + p _{\chi}(x)}$
- Recover the smoothed discriminator (and hence the density) by averaging over the encoder: $D _{x^\ast}(x) = \mathbb{E} _{q(z \vert x)}[D(z)]$
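A sketch of the noisy latent-space discriminator under these definitions, assuming a unit-Gaussian prior $p(z)$ and a reparameterized Gaussian encoder (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class LatentDiscriminator(nn.Module):
    """Encoder q(z|x) with injected Gaussian noise + classifier p(y|z).

    Trained by maximizing classification log-likelihood minus
    KL(q(z|x) || p(z)) against a unit-Gaussian prior p(z).
    """
    def __init__(self, dim, latent=8):
        super().__init__()
        self.mu = nn.Linear(dim, latent)
        self.log_std = nn.Linear(dim, latent)
        self.classifier = nn.Sequential(
            nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        mu, log_std = self.mu(x), self.log_std(x)
        z = mu + log_std.exp() * torch.randn_like(mu)  # reparameterized noise
        logits = self.classifier(z).squeeze(-1)
        # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims.
        kl = 0.5 * (mu**2 + (2 * log_std).exp() - 2 * log_std - 1).sum(-1)
        return logits, kl

# Per-batch loss with labels y in {0, 1}:
#   nn.functional.binary_cross_entropy_with_logits(logits, y) + kl.mean()
```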
-
Smoothing from Suboptimal Discriminators
- With a suboptimal discriminator, there is a second source of density smoothing: the discriminator has difficulty distinguishing between two similar states
- So explicitly adding noise is not always necessary
-
$EX^2$: Exploration with Exemplar Models
- To generate samples from $P(s)$, sample the replay buffer
- Exemplars = states we want to score
- Offline setting: states in current batch of trajectories
- Online setting: train discriminator on states as we receive them
- Augment reward with novelty of the state
- $R’(s,a) = R(s,a) + \beta f(D _{s}(s))$
- $\beta$ is a hyperparameter to tune the magnitude of the bonus
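A sketch of the bonus computation; the choice $f(D) = \log D - \log(1 - D)$ (which equals $-\log$ of the implicit density estimate up to a constant) is one monotone option assumed here, and the exact $f$ and $\beta$ are tuning decisions:

```python
import math

def augmented_reward(r, d, beta=1e-3):
    """R'(s,a) = R(s,a) + beta * f(D_s(s)).

    d is the discriminator output D_s(s) in (0, 1); higher d means
    the state is easier to distinguish, hence more novel.
    """
    d = min(max(d, 1e-8), 1.0 - 1e-8)  # clamp for numerical stability
    return r + beta * (math.log(d) - math.log(1.0 - d))
```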
-
Model Architecture
-
Amortized Multi-Exemplar Model
- Instead of a separate classifier for each exemplar, use a single model conditioned on the exemplar
- Condition latent space discriminator on encoded exemplar from $q(z^\ast \vert x^\ast)$
- Don’t need to train discriminators from scratch every iteration + provides generalization
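A sketch of an amortized discriminator conditioned on the encoded exemplar; encoders are deterministic here and the noise/KL machinery from the latent-smoothing section is omitted for brevity (sizes are illustrative):

```python
import torch
import torch.nn as nn

class AmortizedDiscriminator(nn.Module):
    """Single discriminator scoring any (state, exemplar) pair.

    Conditioning on an encoded exemplar z* avoids training a new
    discriminator from scratch per exemplar and generalizes across
    exemplars.
    """
    def __init__(self, dim, latent=8):
        super().__init__()
        self.enc_x = nn.Linear(dim, latent)      # encodes the query state x
        self.enc_star = nn.Linear(dim, latent)   # encodes the exemplar x*
        self.head = nn.Sequential(
            nn.Linear(2 * latent, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x, x_star):
        z = torch.cat([self.enc_x(x), self.enc_star(x_star)], dim=-1)
        return self.head(z).squeeze(-1)  # logit for P(x = x*)
```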
-
K-Exemplar Model
- We can interpolate between one model per exemplar (K = 1) and a single model for all exemplars (K = number of states)
- Batch K temporally adjacent states into the same discriminator
- Form of temporal regularization (adjacent states are similar)
- Discriminators share the majority of their layers
- Only the final linear layer varies per discriminator (as sketched below)
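A sketch of that weight sharing: one trunk for all discriminators, with a single linear layer whose output units act as the per-discriminator heads (trunk depth and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SharedExemplarDiscriminators(nn.Module):
    """Discriminators sharing every layer except the final linear one.

    Each of the num_heads output units is the head of one
    discriminator, covering K temporally adjacent exemplar states.
    """
    def __init__(self, dim, num_heads, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.heads = nn.Linear(hidden, num_heads)  # only layer that varies

    def forward(self, x):
        return self.heads(self.trunk(x))  # (batch, num_heads) logits
```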
-
Relationship to Generative Adversarial Networks (GANs)
- Policy = generator of a GAN
- Exemplar model = discriminator of GAN
- In EX2, the generator is rewarded for helping the discriminator rather than fooling it
- Competition instead comes from the progression of time
- As a novel state is visited more frequently, the replay buffer becomes saturated with that state
- Forces policy to find new states
-
Experimental Evaluation
- Compare K-exemplar and amortized EX2 to standard random exploration, kernel density estimation (KDE) with RBF kernels, VIME, and hash-based methods
- Continuous Control:
- Even on medium-dimensional tasks, the implicit density estimation approach works well
- EX2, VIME, and hashing outperform regular TRPO and KDE on SwimmerGather
- Amortized EX2 outperforms all other methods on HalfCheetah
- Image-Based Control:
- EX2 generates coherent exploration behavior even in high-dimensional visual environments
- On the most challenging task, EX2 exceeds all other exploration techniques
- Indicates explicit density estimators handle simple/clean images well but struggle with complex, egocentric observations