Evolution Strategies as a Scalable Alternative to Reinforcement Learning
-
Resources
-
Introduction
- Alternative to RL is black-box optimization strategies
- Direct policy search
- Neuroevolution
- Key Findings
- Virtual batch normalization improves reliability of evolution strategies
- Evolution strategies are parallelizable
- Data efficiency is reasonable: ES needs roughly 3 - 10x more data, but this is partly offset by a roughly 3x reduction in required computation (no backpropagation, no value function)
- ES had better exploration than policy gradient methods
- ES are robust to hyperparameters
- Properties of black box optimization
- Indifference to the distribution of rewards (sparse or dense)
- No need for backpropagation
- Tolerance to long time horizons
- (Perceived to be) less effective for hard RL problems
-
Evolution Strategies
- Terminology
- Generation: iteration
- Genotypes: parameter vectors
- Mutation: Perturbation in parameter vectors
- Fitness: Objective function
- Highest scoring parameter vectors are recombined to form population for next generation
- Natural Evolution Strategies (NES):
- $F$: Objective function
- $\theta$: Parameters
- Represent the population with a distribution over parameters $p _{\psi}(\theta)$, parameterized by $\psi$
- Maximize the expected objective $\mathbb{E} _{\theta \sim p _{\psi}} F(\theta)$ over $\psi$
- Estimator: $\nabla _\psi \mathbb{E} _{\theta} F(\theta) = \mathbb{E} _{\theta \sim p _{\psi}}[F(\theta)\nabla _{\psi} \log p _{\psi}(\theta)]$
- For RL problems
- $F(\cdot)$ is stochastic return
- $\theta$ is policy parameters
- Instantiate the population distribution as an isotropic multivariate Gaussian, $p _{\psi} = \mathcal{N}(\psi, \sigma^2 I)$, and identify the mean $\psi$ with the policy parameters $\theta$
- Expected objective in terms of parameter vector: $\mathbb{E} _{\theta \sim p _{\psi}}F(\theta) = \mathbb{E} _{\epsilon \sim N(0,I)} F(\theta + \sigma \epsilon)$
- Use stochastic gradient ascent to optimize over $\theta$
- Gradient: $\nabla _\theta \mathbb{E} _{\epsilon \sim N(0,I)} F(\theta + \sigma \epsilon) = \frac{1}{\sigma} \mathbb{E} _{\epsilon \sim N(0,I)} (F(\theta + \sigma \epsilon)\epsilon)$
- Algorithm:
- Initialize learning rate, standard deviation, initial policy parameters
- For t = 0, 1, 2, $\dots$
- Sample $\epsilon_1 \dots \epsilon_n \sim N(0,I)$
- Compute returns $F_i = F(\theta_t + \sigma \epsilon_i)$ for $i = 1 \dots n$
- Set $\theta _{t+1} \leftarrow \theta_t + \alpha \frac{1}{n\sigma}\sum _{i = 1}^n F_i \epsilon_i$
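A minimal NumPy sketch of the loop above, assuming a user-supplied return function `F` that maps a parameter vector to a scalar return; centering the returns is a small variance-reduction tweak not in the algorithm as stated (the paper instead uses the rank transformation described in a later section), and all hyperparameter values are illustrative:

```python
import numpy as np

def evolution_strategies(F, theta0, alpha=0.05, sigma=0.1, n=50, iterations=300):
    """Basic ES: stochastic gradient ascent on the Gaussian-smoothed objective."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iterations):
        # Sample n Gaussian perturbations and evaluate each perturbed parameter vector
        eps = np.random.randn(n, theta.size)
        returns = np.array([F(theta + sigma * e) for e in eps])
        # Centering keeps the estimate unbiased (E[eps] = 0) but lowers its variance
        returns = returns - returns.mean()
        # theta_{t+1} = theta_t + alpha / (n * sigma) * sum_i F_i * eps_i
        theta = theta + alpha / (n * sigma) * eps.T @ returns
    return theta

# Toy check: maximize F(theta) = -||theta - 3||^2; theta should end up near 3
print(evolution_strategies(lambda th: -np.sum((th - 3.0) ** 2), np.zeros(5)))
```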
-
Scaling and parallelizing ES
- ES works on complete episodes $\rightarrow$ minimal communication between workers
- Each worker only obtains scalar return
- Synchronizing random seeds between workers before optimizing allows workers to know perturbations of other workers without communication
- ES doesn’t require value approximation
- No waiting for a value function estimate to catch up with the changing policy
- Algorithm (a combined runnable sketch of this section's update follows the list)
- Initialize n workers, n seeds, and initial parameters
- For $t = 0 \dots$
- For each worker
- Compute noise
- Compute returns from parameters and noise
- Send scalar returns from each worker to every other worker
- For each worker
- Reconstruct perturbations $\epsilon_j$ using known seeds
- Set $\theta _{t+1} = \theta_t + \alpha \frac{1}{n\sigma}\sum _{j = 1}^n F_j \epsilon_j$
- In practice, rather than regenerating perturbations from per-iteration seeds, each worker instantiates a large shared block of Gaussian noise at the start of training and perturbations are taken as randomly indexed slices of it
- With more workers, each worker can also perturb only a subset of the parameters to reduce computation
- The perturbation distribution $p _{\psi}$ is then a mixture of Gaussians (the update equations are unchanged)
- If each worker perturbs only a single coordinate, this reduces to pure finite differences
- Mirrored Sampling (Antithetic Sampling): evaluate pairs of perturbations $\epsilon, -\epsilon$ for each noise vector $\epsilon$
- Results in variance reduction
- Fitness Shaping: apply rank transformation to returns before computing parameter update
- Removes influence of outlier individuals
- Decreases probability of falling into local optima
- Apply weight decay to the parameters (keeps them from growing large relative to the perturbations)
- $\sigma$ is kept fixed rather than adapted during training
- Running ES on complete episodes can lead to low CPU utilization, since some episodes run much longer than others
- Mitigated by truncating episodes at a length $m$
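Putting this section's pieces together (seed-shared perturbations, mirrored sampling, rank-based fitness shaping, weight decay, fixed $\sigma$): the single-process sketch below simulates what every worker computes once the scalar returns have been exchanged; `centered_ranks`, `es_step`, and all hyperparameter values are illustrative rather than the paper's implementation:

```python
import numpy as np

def centered_ranks(returns):
    """Fitness shaping: replace raw returns by their centered ranks in [-0.5, 0.5]."""
    ranks = np.empty(len(returns))
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1) - 0.5

def es_step(F, theta, seeds, sigma=0.1, alpha=0.05, weight_decay=0.005):
    """One synchronized update. Each 'worker' corresponds to one seed; in the
    distributed version only seeds and scalar returns are communicated, and
    every worker reconstructs all perturbations locally from the seeds."""
    all_eps, all_returns = [], []
    for seed in seeds:
        eps = np.random.RandomState(seed).randn(theta.size)
        # Mirrored (antithetic) sampling: evaluate the pair (+eps, -eps)
        all_eps += [eps, -eps]
        all_returns += [F(theta + sigma * eps), F(theta - sigma * eps)]
    shaped = centered_ranks(np.array(all_returns))
    grad = np.stack(all_eps).T @ shaped / (len(all_eps) * sigma)
    # Gradient ascent plus weight decay on the parameters; sigma stays fixed
    return theta + alpha * grad - alpha * weight_decay * theta

# Toy usage with a cheap stand-in for an episode return
theta = np.zeros(5)
for _ in range(300):
    seeds = np.random.randint(0, 2**31 - 1, size=25)
    theta = es_step(lambda th: -np.sum((th - 3.0) ** 2), theta, seeds)
print(theta)  # should land near the optimum at 3
```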
-
The impact of network parameterization
- Learning signals in ES come from sampling policy parameters
- Exploration driven by perturbations
- To improve parameters, some members of population must get better return than others
- In Atari environments, randomly perturbed policies tended to always take one specific action, regardless of the input state
- Virtual batch normalization made ES match policy gradient performance (simplified sketch after this list)
- Same as regular batch norm, except the minibatch used to compute the normalizing statistics is chosen at the start of training and kept fixed
- Makes the policy more sensitive to small input changes early in training, causing perturbed policies to take a wider variety of actions
- Makes training more expensive
- In MuJoCo environments, discretizing the actions allowed for more exploration
- Forces actions to be non-smooth with respect to inputs and parameter perturbations
- So a wider variety of behaviors plays out under perturbation
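A much-simplified sketch of the fixed-reference-batch idea; note that the original virtual batch norm technique recomputes the reference batch's activations at every layer with the current weights, whereas here the statistics are simply frozen from a reference set of observations, and all names and shapes are made up for illustration:

```python
import numpy as np

class VirtualBatchNorm:
    """Normalize inputs with statistics from a reference batch chosen once at the
    start of training and kept fixed, instead of per-minibatch statistics."""
    def __init__(self, reference_batch, eps=1e-5):
        # reference_batch: (batch, features) array collected before training starts
        self.mean = reference_batch.mean(axis=0)
        self.std = reference_batch.std(axis=0) + eps

    def __call__(self, x):
        # The same normalization is applied for the rest of training
        # (see the notes above on why this aids early exploration)
        return (x - self.mean) / self.std

# Hypothetical usage: freeze statistics from observations gathered before training
reference_obs = np.random.randn(128, 8)   # stand-in for collected observations
normalize = VirtualBatchNorm(reference_obs)
print(normalize(np.random.randn(8)))
```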
-
Smoothing in parameter space versus smoothing in action space
- RL is hard because informative gradients of policy performance are often unavailable
- Such gradients may not exist due to non-smoothness of the environment or policy
- Or may only be available as high-variance estimates
- If we want to solve problems that give return $R(a)$ for sequence of actions, $a = (a_1 \dots a_T)$
- Objective: $F(\theta) = R(a(\theta))$
- $F(\theta)$ might be non-smooth
- Cannot solve for $\theta$ via gradient optimization because we don’t have access to underlying state transition function
- Need to add noise to make problem smooth
- Policy gradient does this by sampling from action distribution
- E.g., for discrete actions, the network outputs scores that are mapped through a softmax to action probabilities, from which actions are sampled
- Policy gradient objective $F _{PG}(\theta) = \mathbb{E} _{\epsilon}[R(a(\epsilon, \theta))]$ where $\epsilon$ is a noise source
- Gradient: $\nabla _\theta F _{PG}(\theta) = \mathbb{E} _{\epsilon}[R(a(\epsilon, \theta))\nabla _{\theta} \log p(a(\epsilon, \theta); \theta)]$
- ES add noise to parameter space via perturbation
- Perturbed parameters: $\tilde{\theta} = \theta + \zeta$ where $\zeta$ is Gaussian noise
- Objective: $F _{ES}(\theta) = \mathbb{E} _{\zeta}[R(a(\zeta, \theta))]$
- Gradient: $\nabla _\theta F _{ES}(\theta) = \mathbb{E} _{\zeta}[R(a(\zeta, \theta))\nabla _{\theta} \log p(\tilde{\theta}(\zeta, \theta); \theta)]$
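Consistency check connecting the two forms: for $\tilde{\theta} \sim \mathcal{N}(\theta, \sigma^2 I)$ the score function is $\nabla _\theta \log p(\tilde{\theta}; \theta) = (\tilde{\theta} - \theta)/\sigma^2 = \zeta / \sigma^2$, so the ES gradient above reduces to the estimator from the Evolution Strategies section:

$$\nabla _\theta F _{ES}(\theta) = \mathbb{E} _{\zeta}\left[R(a(\zeta, \theta))\, \frac{\zeta}{\sigma^2}\right] = \frac{1}{\sigma}\, \mathbb{E} _{\epsilon \sim N(0, I)}\left[F(\theta + \sigma \epsilon)\, \epsilon\right], \qquad \zeta = \sigma \epsilon$$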
-
When is ES better than policy gradients?
- $Var[\nabla _\theta F _{PG}(\theta)] \approx Var[R(a)]Var[\nabla _\theta \log p(a; \theta)]$
- $Var[\nabla _\theta F _{ES}(\theta)] \approx Var[R(a)]Var[\nabla _\theta \log p(\tilde{\theta}; \theta)]$
- With similar amounts of exploration, the $Var[R(a)]$ factor is similar for both, so the difference lies in the second factor
- Variance of the policy gradient estimator grows roughly linearly with episode length (see the note after this list)
- In practice the effective number of steps is reduced by discounting rewards, which biases the gradient when actions have long-lasting effects
- Value function approximation can also reduce the effective number of steps, again at the cost of added bias
- Variance of the ES gradient is independent of episode length
- ES is therefore better suited to problems with long episodes
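Sketch of the reasoning behind the episode-length difference: the policy gradient score factorizes over timesteps, while the ES score depends only on the sampled parameter perturbation,

$$\nabla _\theta \log p(a; \theta) = \sum _{t=1}^{T} \nabla _\theta \log p(a_t \mid s_t; \theta), \qquad \nabla _\theta \log p(\tilde{\theta}; \theta) = \frac{\zeta}{\sigma^2},$$

so the former is a sum of $T$ (roughly uncorrelated) terms whose variance grows about linearly in $T$, while the latter does not involve $T$ at all.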
-
Problem dimensionality
- The ES gradient can be viewed as a finite-difference estimate along random directions in a high-dimensional space (a small numeric check follows this list)
- $\nabla _\theta \eta(\theta) = \mathbb{E} _{\epsilon \sim N(0, I)}[F(\theta + \sigma \epsilon)\epsilon / \sigma] = \mathbb{E} _{\epsilon \sim N(0, I)}[(F(\theta + \sigma \epsilon) - F(\theta))\epsilon / \sigma]$
- Might be expected to scale poorly with the number of parameters in $\theta$
- In practice, larger networks gave slightly better results
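A small numeric illustration of the finite-difference view, using a made-up quadratic objective (all names and numbers are for illustration only): each sample probes one random direction, so in high dimensions many evaluations are needed before the estimate aligns well with the true gradient.

```python
import numpy as np

# Toy objective with a known gradient, purely for illustration
d, sigma, n = 500, 0.1, 5000
target = np.ones(d)
F = lambda th: -np.sum((th - target) ** 2)

theta = np.zeros(d)
true_grad = -2 * (theta - target)                 # analytic gradient of F at theta

eps = np.random.randn(n, d)
returns = np.array([F(theta + sigma * e) for e in eps])

# Baselined form: subtracting F(theta) leaves the expectation unchanged
# (since E[eps] = 0), and each term becomes a finite difference along eps
est = eps.T @ (returns - F(theta)) / (n * sigma)

cos = est @ true_grad / (np.linalg.norm(est) * np.linalg.norm(true_grad))
print(f"cosine similarity with the true gradient: {cos:.2f}")
```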
-
Advantages of not calculating gradients
- Easy parallelism
- Lower communication overhead in distributed settings
- Deals with sparse and delayed rewards
- Reduced computation + memory usage because we do not need backpropagation
- Do not need to deal with exploding gradients
- Bounded and smooth cost functions
- Can include non-differentiable elements in the architecture (e.g., hard attention)
- Well suited to low-precision hardware and arithmetic
- Invariant to the frequency at which the agent interacts with the environment
- Does not need hierarchy
-
Experiments
-
MuJoCo
- ES sometimes benefits from discretizing actions (continuous actions can hamper the exploration produced by parameter perturbations)
- ES matches TRPO performance
- Solved the same environments with less than a 10x penalty in sample complexity on the hard environments
- On the easy environments, ES had up to 3x better sample complexity
-
Atari
- A3C required about 1 day of training per Atari game
- With massive parallelization, ES required about 1 hour
- ES performed better in 23 games and worse in 28
-
Parallelization
- Solved the 3D humanoid walking task in 11 hours on a single machine (similar wall-clock time to a typical RL approach)
- Distributed across 80 nodes (1,440 cores), it took about 10 minutes (roughly linear speedup in the number of cores)
-
Invariance to temporal resolution
- Frame skip: the RL agent selects actions at a lower frequency than the simulator step rate
- Too high: the agent cannot make decisions on a fine enough timescale
- Too low: the number of decision steps per episode grows too much
- The ES gradient estimate is invariant to the length of the episode, so ES is robust to the choice of frame skip