You could use a penalty on the KL divergence instead of a hard constraint and solve the unconstrained optimization problem, but choosing a penalty coefficient that works across different problems (or even over the course of a single training run) is difficult
Instead, adapt $\beta$: after each policy update, compute $d = \hat{\mathbb{E}}_t[\mathrm{KL}[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)]]$; if $d < d_{targ} / 1.5$, set $\beta \leftarrow \beta / 2$; if $d > d_{targ} \cdot 1.5$, set $\beta \leftarrow \beta \cdot 2$
The constants 1.5 and 2 are chosen heuristically, but the algorithm is not very sensitive to them
We occasionally see updates where the KL divergence differs significantly from $d_{targ}$, but these are rare, and $\beta$ quickly adjusts
The initial value of $\beta$ is not important for the same reason
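A minimal sketch of this adaptive rule (the function and variable names are placeholders, and measuring $d$ is assumed to happen elsewhere):

```python
def update_kl_coef(beta, d, d_targ):
    """Adaptive KL penalty coefficient update (KL-penalized PPO variant).

    beta   : current penalty coefficient
    d      : measured mean KL divergence between the old and new policy
    d_targ : target KL divergence per policy update
    """
    if d < d_targ / 1.5:      # policy changed too little -> loosen the penalty
        beta = beta / 2
    elif d > d_targ * 1.5:    # policy changed too much -> tighten the penalty
        beta = beta * 2
    return beta

# Example: beta adapts so that d stays close to d_targ across updates
# (the measured KL values below are hypothetical).
beta = 1.0
for d in [0.02, 0.005, 0.0005]:
    beta = update_kl_coef(beta, d, d_targ=0.01)
    print(beta)
```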
Algorithm
When using automatic differentiation, we replace the policy gradient objective $L^{PG}$ with $L^{CLIP}$ or $L^{KLPEN}$ and apply stochastic gradient ascent
If the policy and value function share network parameters, we combine the policy surrogate with a value function error term, and can also add an entropy bonus to encourage exploration: $L^{CLIP+VF+S}_t(\theta) = \hat{\mathbb{E}}_t[L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_{\theta}](s_t)]$
$S$ is the entropy bonus, $L^{VF}_t(\theta) = (V_{\theta}(s_t) - V^{targ}_t)^2$ is a squared-error value loss, and $c_1, c_2$ are coefficients
To estimate advantages, we can use generalized advantage estimation (GAE) or truncated n-step returns (both the GAE estimator and the combined loss are sketched below)
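As an illustration of the pieces above, here is a minimal PyTorch sketch of a truncated GAE estimator and the combined objective $L^{CLIP+VF+S}$. It assumes a rollout has already been collected into flat tensors; the tensor names and the default values for $\epsilon$, $\gamma$, $\lambda$, $c_1$, $c_2$ are illustrative choices, not prescribed by these notes.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimation over a length-T rollout."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values      # value-function regression targets
    return advantages, returns

def ppo_loss(new_logp, old_logp, entropy, value_pred, returns, advantages,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Combined objective: clipped surrogate - c1 * value error + c2 * entropy.

    Returned with a minus sign so that minimizing it with a standard optimizer
    performs gradient ascent on the objective.
    """
    ratio = torch.exp(new_logp - old_logp)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    l_clip = torch.min(unclipped, clipped).mean()
    l_vf = ((value_pred - returns) ** 2).mean()                  # squared-error value loss
    l_s = entropy.mean()                                         # entropy bonus
    return -(l_clip - c1 * l_vf + c2 * l_s)
```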
PPO Actor-Critic Style
for iteration $= 1, 2, \dots$:
    for actor $= 1, 2, \dots, N$:
        Run policy $\pi_{\theta_{old}}$ in the environment for $T$ timesteps
        Compute advantage estimates $\hat{A}_1, \dots, \hat{A}_T$
    Optimize the surrogate $L$ with respect to $\theta$, with $K$ epochs and minibatch size $M \le NT$
    $\theta_{old} \leftarrow \theta$
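Below is a rough PyTorch sketch of this loop, reusing `compute_gae` and `ppo_loss` from the sketch above. The environment interaction is replaced by randomly generated rollout data, and the tiny shared-parameter network, tensor shapes, and hyperparameter values are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, N, T = 8, 4, 4, 128     # illustrative sizes: N actors, T steps each
K_epochs, minibatch_size = 4, 64

class ActorCritic(nn.Module):
    """Shared-parameter body with a categorical policy head and a value head."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.pi = nn.Linear(64, n_actions)
        self.v = nn.Linear(64, 1)
    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h).squeeze(-1)

net = ActorCritic()
optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)

for iteration in range(10):
    # --- Collect a rollout with pi_theta_old (random data stands in for env.step) ---
    obs = torch.randn(N * T, obs_dim)
    rewards = torch.randn(N * T)
    dones = torch.zeros(N * T)
    with torch.no_grad():
        dist, values = net(obs)
        actions = dist.sample()
        old_logp = dist.log_prob(actions)            # log pi_theta_old(a_t | s_t)
        advantages, returns = compute_gae(rewards, values, dones, last_value=0.0)
        # Normalization is a common implementation choice, not from the paper.
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # --- Optimize the surrogate for K epochs over shuffled minibatches ---
    for epoch in range(K_epochs):
        for idx in torch.randperm(N * T).split(minibatch_size):
            dist, value_pred = net(obs[idx])
            new_logp = dist.log_prob(actions[idx])
            loss = ppo_loss(new_logp, old_logp[idx], dist.entropy(),
                            value_pred, returns[idx], advantages[idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # theta_old <- theta is implicit here: old_logp is recomputed from the
    # current network at the start of the next iteration.
```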
Experiments
Comparison of Surrogate Objectives
Compared no clipping or penalty, clipping, and the KL penalty (both fixed and adaptive $\beta$)
Also tried clipping in log space, but it did not perform better
Clipping (with $\epsilon = 0.2$) produced the best results
No clipping or penalty produced the worst results
Comparison to Other Algorithms in the Continuous Domain
Compared with the cross-entropy method, vanilla policy gradient with an adaptive step size, A2C, and A2C with trust region
PPO matches or beats the other algorithms on all 7 MuJoCo tasks
Showcase in the Continuous Domain: Humanoid Running and Steering
PPO produces good policies here too (the paper doesn't say much about these experiments)
Comparison to Other Algorithms on the Atari Domain