Evolution Strategies as a Scalable Alternative to Reinforcement Learning
-
Resources
-
Introduction
- Alternative to RL is black-box optimization strategies
- Direct policy search
- Neuroevolution
- Key Findings
- Virtual batch normalization improves reliability of evolution strategies
- Evolution strategies are parallelizable
- Data efficiency is reasonable: ES needs roughly 3 - 10x more data, but this is partly offset by a roughly 3x reduction in required computation (no backpropagation, no value function)
- ES had better exploration than policy gradient methods
- ES are robust to hyperparameters
- Properties of black box optimization
- Indifference to the distribution of rewards (sparse or dense)
- No need for backpropagation
- Tolerance to long time horizons
- (Perceived to be) less effective for hard RL problems
-
Evolution Strategies
- Terminology
- Generation: iteration
- Genotypes: parameter vectors
- Mutation: Perturbation in parameter vectors
- Fitness: Objective function
- Highest scoring parameter vectors are recombined to form population for next generation
- Natural Evolution Strategies (NES):
- $F$: Objective function
- $\theta$: Parameters
- Represent the population with a distribution over parameters $p _{\psi}(\theta)$, parameterized by $\psi$
- Maximize the expected objective $\mathbb{E} _{\theta \sim p _{\psi}} F(\theta)$ over $\psi$
- Estimator: $\nabla _\psi \mathbb{E} _{\theta} F(\theta) = \mathbb{E} _{\theta \sim p _{\psi}}[F(\theta)\nabla _{\psi} \log p _{\psi}(\theta)]$
- For RL problems
- $F(\cdot)$ is stochastic return
- $\theta$ is policy parameters
- Instantiate the population distribution as an isotropic multivariate Gaussian, $p _{\psi} = \mathcal{N}(\psi, \sigma^2 I)$, and identify the mean $\psi$ with the policy parameters $\theta$
- Expected objective in terms of parameter vector: $\mathbb{E} _{\theta \sim p _{\psi}}F(\theta) = \mathbb{E} _{\epsilon \sim N(0,I)} F(\theta + \sigma \epsilon)$
- Use stochastic gradient ascent to optimize over $\theta$
- Gradient: $\nabla _\theta \mathbb{E} _{\epsilon \sim N(0,I)} F(\theta + \sigma \epsilon) = \frac{1}{\sigma} \mathbb{E} _{\epsilon \sim N(0,I)} (F(\theta + \sigma \epsilon)\epsilon)$
- Algorithm:
- Initialize learning rate, standard deviation, initial policy parameters
- For t = 0, 1, 2, $\dots$
- Sample $\epsilon_1 \dots \epsilon_n \sim N(0,I)$
- Compute returns $F_i = F(\theta_t + \sigma \epsilon_i)$ for $i = 1 \dots n$
- Set $\theta _{t+1} \leftarrow \theta_t + \alpha \frac{1}{n\sigma}\sum _{i = 1}^n F_i \epsilon_i$
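A minimal NumPy sketch of the loop above, assuming a user-supplied return function `F` that maps a parameter vector to a scalar return; centering the returns is a small variance-reduction tweak not in the algorithm as stated (the paper instead uses the rank transformation described in a later section), and all hyperparameter values are illustrative:

```python
import numpy as np

def evolution_strategies(F, theta0, alpha=0.05, sigma=0.1, n=50, iterations=300):
    """Basic ES: stochastic gradient ascent on the Gaussian-smoothed objective."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iterations):
        # Sample n Gaussian perturbations and evaluate each perturbed parameter vector
        eps = np.random.randn(n, theta.size)
        returns = np.array([F(theta + sigma * e) for e in eps])
        # Centering keeps the estimate unbiased (E[eps] = 0) but lowers its variance
        returns = returns - returns.mean()
        # theta_{t+1} = theta_t + alpha / (n * sigma) * sum_i F_i * eps_i
        theta = theta + alpha / (n * sigma) * eps.T @ returns
    return theta

# Toy check: maximize F(theta) = -||theta - 3||^2; theta should end up near 3
print(evolution_strategies(lambda th: -np.sum((th - 3.0) ** 2), np.zeros(5)))
```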
-
Scaling and parallelizing ES
- ES works on complete episodes $\rightarrow$ minimal communication between workers
- Each worker only obtains scalar return
- Synchronizing random seeds between workers before optimizing allows workers to know perturbations of other workers without communication
- ES doesn’t require value approximation
- No waiting for a value function estimate to catch up with the changing policy
- Algorithm (a combined runnable sketch of this section's update follows the list)
- Initialize n workers, n seeds, and initial parameters
- For $t = 0 \dots$
- For each worker
- Compute noise
- Compute returns from parameters and noise
- Send scalar returns from each worker to every other worker
- For each worker
- Reconstruct perturbations $\epsilon_j$ using known seeds
- Set $\theta _{t+1} = \theta_t + \alpha \frac{1}{n\sigma}\sum _{j = 1}^n F_j \epsilon_j$
- In practice, rather than regenerating perturbations from per-iteration seeds, each worker instantiates a large shared block of Gaussian noise at the start of training and perturbations are taken as randomly indexed slices of it
- With more workers, each worker can also perturb only a subset of the parameters to reduce computation
- The perturbation distribution $p _{\psi}$ is then a mixture of Gaussians (the update equations are unchanged)
- If each worker perturbs only a single coordinate, this reduces to pure finite differences
- Mirrored Sampling (Antithetic Sampling): evaluate pairs of perturbations $\epsilon, -\epsilon$ for each noise vector $\epsilon$
- Results in variance reduction
- Fitness Shaping: apply rank transformation to returns before computing parameter update
- Removes influence of outlier individuals
- Decreases probability of falling into local optima
- Apply weight decay to the parameters (keeps them from growing large relative to the perturbations)
- $\sigma$ is kept fixed rather than adapted during training
- Running ES on complete episodes can lead to low CPU utilization, since some episodes run much longer than others
- Mitigated by truncating episodes at a length $m$
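Putting this section's pieces together (seed-shared perturbations, mirrored sampling, rank-based fitness shaping, weight decay, fixed $\sigma$): the single-process sketch below simulates what every worker computes once the scalar returns have been exchanged; `centered_ranks`, `es_step`, and all hyperparameter values are illustrative rather than the paper's implementation:

```python
import numpy as np

def centered_ranks(returns):
    """Fitness shaping: replace raw returns by their centered ranks in [-0.5, 0.5]."""
    ranks = np.empty(len(returns))
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1) - 0.5

def es_step(F, theta, seeds, sigma=0.1, alpha=0.05, weight_decay=0.005):
    """One synchronized update. Each 'worker' corresponds to one seed; in the
    distributed version only seeds and scalar returns are communicated, and
    every worker reconstructs all perturbations locally from the seeds."""
    all_eps, all_returns = [], []
    for seed in seeds:
        eps = np.random.RandomState(seed).randn(theta.size)
        # Mirrored (antithetic) sampling: evaluate the pair (+eps, -eps)
        all_eps += [eps, -eps]
        all_returns += [F(theta + sigma * eps), F(theta - sigma * eps)]
    shaped = centered_ranks(np.array(all_returns))
    grad = np.stack(all_eps).T @ shaped / (len(all_eps) * sigma)
    # Gradient ascent plus weight decay on the parameters; sigma stays fixed
    return theta + alpha * grad - alpha * weight_decay * theta

# Toy usage with a cheap stand-in for an episode return
theta = np.zeros(5)
for _ in range(300):
    seeds = np.random.randint(0, 2**31 - 1, size=25)
    theta = es_step(lambda th: -np.sum((th - 3.0) ** 2), theta, seeds)
print(theta)  # should land near the optimum at 3
```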
-
The impact of network parameterization
- Learning signals in ES come from sampling policy parameters
- Exploration driven by perturbations
- To improve parameters, some members of population must get better return than others
- In Atari environments, randomly perturbed policies tended to always take one specific action, regardless of the input state
- Virtual batch normalization made ES match policy gradient performance (simplified sketch after this list)
- Same as regular batch norm, except the minibatch used to compute the normalizing statistics is chosen at the start of training and kept fixed
- Makes the policy more sensitive to small input changes early in training, causing perturbed policies to take a wider variety of actions
- Makes training more expensive
- In MuJoCo environments, discretizing the actions allowed for more exploration
- Forces actions to be non-smooth with respect to inputs and parameter perturbations
- So a wider variety of behaviors plays out under perturbation
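A much-simplified sketch of the fixed-reference-batch idea; note that the original virtual batch norm technique recomputes the reference batch's activations at every layer with the current weights, whereas here the statistics are simply frozen from a reference set of observations, and all names and shapes are made up for illustration:

```python
import numpy as np

class VirtualBatchNorm:
    """Normalize inputs with statistics from a reference batch chosen once at the
    start of training and kept fixed, instead of per-minibatch statistics."""
    def __init__(self, reference_batch, eps=1e-5):
        # reference_batch: (batch, features) array collected before training starts
        self.mean = reference_batch.mean(axis=0)
        self.std = reference_batch.std(axis=0) + eps

    def __call__(self, x):
        # The same normalization is applied for the rest of training
        # (see the notes above on why this aids early exploration)
        return (x - self.mean) / self.std

# Hypothetical usage: freeze statistics from observations gathered before training
reference_obs = np.random.randn(128, 8)   # stand-in for collected observations
normalize = VirtualBatchNorm(reference_obs)
print(normalize(np.random.randn(8)))
```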
-
Smoothing in parameter space versus smoothing in action space
- RL is hard because informative gradients of policy performance are often unavailable
- Such gradients may not exist due to non-smoothness of the environment or policy
- Or may only be available as high-variance estimates
- If we want to solve problems that give return $R(a)$ for sequence of actions, $a = (a_1 \dots a_T)$
- Objective: $F(\theta) = R(a(\theta))$
- $F(\theta)$ might be non-smooth
- Cannot solve for $\theta$ via gradient optimization because we don’t have access to underlying state transition function
- Need to add noise to make problem smooth
- Policy gradient does this by sampling from action distribution
- E.g., for discrete actions, the network outputs scores that are mapped through a softmax to action probabilities, from which actions are sampled
- Policy gradient objective $F _{PG}(\theta) = \mathbb{E} _{\epsilon}[R(a(\epsilon, \theta))]$ where $\epsilon$ is a noise source
- Gradient: $\nabla _\theta F _{PG}(\theta) = \mathbb{E} _{\epsilon}[R(a(\epsilon, \theta))\nabla _{\theta} \log p(a(\epsilon, \theta); \theta)]$
- ES add noise to parameter space via perturbation
- Perturbed parameters: $\tilde{\theta} = \theta + \zeta$ where $\zeta$ is Gaussian noise
- Objective: $F _{ES}(\theta) = \mathbb{E} _{\zeta}[R(a(\zeta, \theta))]$
- Gradient: $\nabla _\theta F _{ES}(\theta) = \mathbb{E} _{\zeta}[R(a(\zeta, \theta))\nabla _{\theta} \log p(\tilde{\theta}(\zeta, \theta); \theta)]$
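Consistency check connecting the two forms: for $\tilde{\theta} \sim \mathcal{N}(\theta, \sigma^2 I)$ the score function is $\nabla _\theta \log p(\tilde{\theta}; \theta) = (\tilde{\theta} - \theta)/\sigma^2 = \zeta / \sigma^2$, so the ES gradient above reduces to the estimator from the Evolution Strategies section:

$$\nabla _\theta F _{ES}(\theta) = \mathbb{E} _{\zeta}\left[R(a(\zeta, \theta))\, \frac{\zeta}{\sigma^2}\right] = \frac{1}{\sigma}\, \mathbb{E} _{\epsilon \sim N(0, I)}\left[F(\theta + \sigma \epsilon)\, \epsilon\right], \qquad \zeta = \sigma \epsilon$$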
-
When is ES better than policy gradients?
- $Var[\nabla _\theta F _{PG}(\theta)] \approx Var[R(a)]Var[\nabla _\theta \log p(a; \theta)]$
- $Var[\nabla _\theta F _{ES}(\theta)] \approx Var[R(a)]Var[\nabla _\theta \log p(\tilde{\theta}; \theta)]$
- With similar amounts of exploration, the $Var[R(a)]$ factor is similar for both, so the difference lies in the second factor
- Variance of the policy gradient estimator grows roughly linearly with episode length (see the note after this list)
- In practice the effective number of steps is reduced by discounting rewards, which biases the gradient when actions have long-lasting effects
- Value function approximation can also reduce the effective number of steps, again at the cost of added bias
- Variance of the ES gradient is independent of episode length
- ES is therefore better suited to problems with long episodes
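Sketch of the reasoning behind the episode-length difference: the policy gradient score factorizes over timesteps, while the ES score depends only on the sampled parameter perturbation,

$$\nabla _\theta \log p(a; \theta) = \sum _{t=1}^{T} \nabla _\theta \log p(a_t \mid s_t; \theta), \qquad \nabla _\theta \log p(\tilde{\theta}; \theta) = \frac{\zeta}{\sigma^2},$$

so the former is a sum of $T$ (roughly uncorrelated) terms whose variance grows about linearly in $T$, while the latter does not involve $T$ at all.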
-
Problem dimensionality
- The ES gradient can be viewed as a finite-difference estimate along random directions in a high-dimensional space (a small numeric check follows this list)
- $\nabla _\theta \eta(\theta) = \mathbb{E} _{\epsilon \sim N(0, I)}[F(\theta + \sigma \epsilon)\epsilon / \sigma] = \mathbb{E} _{\epsilon \sim N(0, I)}[(F(\theta + \sigma \epsilon) - F(\theta))\epsilon / \sigma]$
- Might be expected to scale poorly with the number of parameters in $\theta$
- In practice, larger networks gave slightly better results
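A small numeric illustration of the finite-difference view, using a made-up quadratic objective (all names and numbers are for illustration only): each sample probes one random direction, so in high dimensions many evaluations are needed before the estimate aligns well with the true gradient.

```python
import numpy as np

# Toy objective with a known gradient, purely for illustration
d, sigma, n = 500, 0.1, 5000
target = np.ones(d)
F = lambda th: -np.sum((th - target) ** 2)

theta = np.zeros(d)
true_grad = -2 * (theta - target)                 # analytic gradient of F at theta

eps = np.random.randn(n, d)
returns = np.array([F(theta + sigma * e) for e in eps])

# Baselined form: subtracting F(theta) leaves the expectation unchanged
# (since E[eps] = 0), and each term becomes a finite difference along eps
est = eps.T @ (returns - F(theta)) / (n * sigma)

cos = est @ true_grad / (np.linalg.norm(est) * np.linalg.norm(true_grad))
print(f"cosine similarity with the true gradient: {cos:.2f}")
```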
-
Advantages of not calculating gradients
- Easy parallelism
- Lower communication overhead in distributed settings
- Deals with sparse and delayed rewards
- Reduced computation + memory usage because we do not need backpropagation
- Do not need to deal with exploding gradients
- Bounded and smooth cost functions
- Can include non-differentiable elements in the architecture (e.g., hard attention)
- Well suited to low-precision hardware and arithmetic
- Invariant to the frequency at which the agent interacts with the environment
- Does not need hierarchy
-
Experiments
-
MuJoCo
- ES sometimes benefits from discretizing actions (continuous actions can hamper the exploration produced by parameter perturbations)
- ES matches TRPO performance
- Solved the same environments with less than a 10x penalty in sample complexity on the hard environments
- On the easy environments, ES had up to 3x better sample complexity
-
Atari
- A3C required about 1 day of training per Atari game
- With massive parallelization, ES required about 1 hour
- ES performed better in 23 games and worse in 28
-
Parallelization
- Solved the 3D humanoid walking task in 11 hours on a single machine (similar wall-clock time to a typical RL approach)
- Distributed across 80 nodes (1,440 cores), it took about 10 minutes (roughly linear speedup in the number of cores)
-
Invariance to temporal resolution
- Frame skip: the RL agent selects actions at a lower frequency than the simulator step rate
- Too high: the agent cannot make decisions on a fine enough timescale
- Too low: the number of decision steps per episode grows too much
- The ES gradient estimate is invariant to the length of the episode, so ES is robust to the choice of frame skip