You could use a penalty on the KL divergence instead of a hard constraint and solve the unconstrained optimization problem, but choosing a penalty coefficient that works across different problems (or even over the course of a single training run) is difficult
Instead, adapt $\beta$: after each policy update, compute $d = \hat{\mathbb{E}}_t[\mathrm{KL}[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)]]$; if $d < d_{targ} / 1.5$, set $\beta \leftarrow \beta / 2$; if $d > d_{targ} \cdot 1.5$, set $\beta \leftarrow \beta \cdot 2$
The constants 1.5 and 2 are chosen heuristically, but the algorithm is not very sensitive to them
We occasionally see updates where the KL divergence differs significantly from $d_{targ}$, but these are rare, and $\beta$ quickly adjusts
The initial value of $\beta$ is not important for the same reason
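A minimal sketch of this adaptive rule (the function and variable names are placeholders, and measuring $d$ is assumed to happen elsewhere):

```python
def update_kl_coef(beta, d, d_targ):
    """Adaptive KL penalty coefficient update (KL-penalized PPO variant).

    beta   : current penalty coefficient
    d      : measured mean KL divergence between the old and new policy
    d_targ : target KL divergence per policy update
    """
    if d < d_targ / 1.5:      # policy changed too little -> loosen the penalty
        beta = beta / 2
    elif d > d_targ * 1.5:    # policy changed too much -> tighten the penalty
        beta = beta * 2
    return beta

# Example: beta adapts so that d stays close to d_targ across updates
# (the measured KL values below are hypothetical).
beta = 1.0
for d in [0.02, 0.005, 0.0005]:
    beta = update_kl_coef(beta, d, d_targ=0.01)
    print(beta)
```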
Algorithm
When using automatic differentiation, we replace the policy gradient objective $L^{PG}$ with $L^{CLIP}$ or $L^{KLPEN}$ and apply stochastic gradient ascent
If the policy and value function share network parameters, we combine the policy surrogate with a value function error term, and can also add an entropy bonus to encourage exploration: $L^{CLIP+VF+S}_t(\theta) = \hat{\mathbb{E}}_t[L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_{\theta}](s_t)]$
$S$ is the entropy bonus, $L^{VF}_t(\theta) = (V_{\theta}(s_t) - V^{targ}_t)^2$ is a squared-error value loss, and $c_1, c_2$ are coefficients
To estimate advantages, we can use generalized advantage estimation (GAE) or truncated n-step returns (both the GAE estimator and the combined loss are sketched below)
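As an illustration of the pieces above, here is a minimal PyTorch sketch of a truncated GAE estimator and the combined objective $L^{CLIP+VF+S}$. It assumes a rollout has already been collected into flat tensors; the tensor names and the default values for $\epsilon$, $\gamma$, $\lambda$, $c_1$, $c_2$ are illustrative choices, not prescribed by these notes.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimation over a length-T rollout."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values      # value-function regression targets
    return advantages, returns

def ppo_loss(new_logp, old_logp, entropy, value_pred, returns, advantages,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Combined objective: clipped surrogate - c1 * value error + c2 * entropy.

    Returned with a minus sign so that minimizing it with a standard optimizer
    performs gradient ascent on the objective.
    """
    ratio = torch.exp(new_logp - old_logp)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    l_clip = torch.min(unclipped, clipped).mean()
    l_vf = ((value_pred - returns) ** 2).mean()                  # squared-error value loss
    l_s = entropy.mean()                                         # entropy bonus
    return -(l_clip - c1 * l_vf + c2 * l_s)
```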
PPO Actor-Critic Style
for iteration $= 1, 2, \dots$:
    for actor $= 1, 2, \dots, N$:
        Run policy $\pi_{\theta_{old}}$ in the environment for $T$ timesteps
        Compute advantage estimates $\hat{A}_1, \dots, \hat{A}_T$
    Optimize the surrogate $L$ with respect to $\theta$, with $K$ epochs and minibatch size $M \le NT$
    $\theta_{old} \leftarrow \theta$
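Below is a rough PyTorch sketch of this loop, reusing `compute_gae` and `ppo_loss` from the sketch above. The environment interaction is replaced by randomly generated rollout data, and the tiny shared-parameter network, tensor shapes, and hyperparameter values are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, N, T = 8, 4, 4, 128     # illustrative sizes: N actors, T steps each
K_epochs, minibatch_size = 4, 64

class ActorCritic(nn.Module):
    """Shared-parameter body with a categorical policy head and a value head."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.pi = nn.Linear(64, n_actions)
        self.v = nn.Linear(64, 1)
    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h).squeeze(-1)

net = ActorCritic()
optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)

for iteration in range(10):
    # --- Collect a rollout with pi_theta_old (random data stands in for env.step) ---
    obs = torch.randn(N * T, obs_dim)
    rewards = torch.randn(N * T)
    dones = torch.zeros(N * T)
    with torch.no_grad():
        dist, values = net(obs)
        actions = dist.sample()
        old_logp = dist.log_prob(actions)            # log pi_theta_old(a_t | s_t)
        advantages, returns = compute_gae(rewards, values, dones, last_value=0.0)
        # Normalization is a common implementation choice, not from the paper.
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # --- Optimize the surrogate for K epochs over shuffled minibatches ---
    for epoch in range(K_epochs):
        for idx in torch.randperm(N * T).split(minibatch_size):
            dist, value_pred = net(obs[idx])
            new_logp = dist.log_prob(actions[idx])
            loss = ppo_loss(new_logp, old_logp[idx], dist.entropy(),
                            value_pred, returns[idx], advantages[idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # theta_old <- theta is implicit here: old_logp is recomputed from the
    # current network at the start of the next iteration.
```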
Experiments
Comparison of Surrogate Objectives
Compared no clipping or penalty, clipping, and the KL penalty (both fixed and adaptive $\beta$)
Also tried clipping in log space, but it did not perform better
Clipping (with $\epsilon = 0.2$) produced the best results
No clipping or penalty produced the worst results
Comparison to Other Algorithms in the Continuous Domain
Compared with the cross-entropy method, vanilla policy gradient with an adaptive step size, A2C, and A2C with trust region
PPO matches or beats the other algorithms on all 7 MuJoCo tasks
Showcase in the Continuous Domain: Humanoid Running and Steering
PPO produces good policies here too (the paper doesn't say much about these experiments)
Comparison to Other Algorithms on the Atari Domain