Trust Region Policy Optimization
Resources
Paper
-
Introduction
- 3 categories of policy optimization
- Policy Iteration
- Policy Gradient
- Derivative Free
- Treat the return as a black-box function of the policy parameters and optimize it directly
- Usually preferred because simple to implement + good results
- Gradient-based optimization has been good for supervised learning but less so for RL
- Minimizing a certain surrogate objective guarantees policy improvement with non-trivial step sizes
- TRPO Variants
- Single Path: Applied in model-free setting
- Vine: Requires restoring system to specific state → only possible in simulation
- TRPO is scalable to systems with thousands or millions of parameters
-
Preliminaries
- Use standard definitions of policy, state and action value functions, and advantage functions.
- Expected return of another policy $\tilde{\pi}$ in terms of advantage over $\pi$, accumulated over timesteps
- $\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \dots \sim \tilde{\pi}}[\sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t)]$
- Equivalent to $\eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a\vert s)A_\pi(s,a)$
- Implies any policy update with non-negative expected advantage at every state is guaranteed to increase the policy performance $\eta$, or leave it constant
- Exact (tabular) policy iteration satisfies this because it updates with the deterministic policy $\tilde{\pi}(s) = \arg\max_a A_\pi(s,a)$
- With estimation and function approximation error, some states will have negative expected advantage; the dependency on $\rho_{\tilde{\pi}}(s)$ also makes the expression hard to optimize directly
- Instead use: $L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a\vert s)A_\pi(s, a)$
- This uses the visitation frequency of the current policy, $\rho_\pi$, instead of the new policy's $\rho_{\tilde{\pi}}$, which makes it easier to optimize (a small numeric sketch of this surrogate appears after this list)
- If $\pi_\theta(a\vert s)$ is a differentiable function of the parameter vector $\theta$, then $L$ matches $\eta$ to first order at $\theta_0$:
- $L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0})$
- $\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\vert_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\vert_{\theta=\theta_0}$
- So a sufficiently small step $\pi_{\theta_0} \rightarrow \tilde{\pi}$ that improves $L_{\pi_{\theta_0}}$ will also improve $\eta$
- We don’t know how small of a step to take, though
- Conservative Policy Iterations
- Provides lower bounds on improvement of $\eta$
- Let $\pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi')$
- New Policy: $\pi_{new}(a\vert s) = (1-\alpha)\pi_{old}(a\vert s) + \alpha \pi’(a\vert s)$
- Lower Bound Improvement: $\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2$
- $\epsilon = max_s \vert\mathbb{E}_{a \sim \pi’(a\vert s)}[A_\pi(s,a)]\vert$
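As a concrete reference for the surrogate $L_\pi$ defined earlier and the lower bound just above, here is a minimal NumPy sketch for a tabular problem; the helper names and array shapes are my own assumptions, not code from the paper.

```python
import numpy as np

def surrogate_objective(eta_pi, rho_pi, pi_new, adv_pi):
    """L_pi(pi_new) = eta(pi) + sum_s rho_pi(s) sum_a pi_new(a|s) A_pi(s, a).

    rho_pi: (S,) discounted state visitation of the current policy
    pi_new: (S, A) candidate policy; adv_pi: (S, A) advantages A_pi(s, a).
    """
    return eta_pi + np.sum(rho_pi[:, None] * pi_new * adv_pi)

def cpi_lower_bound(L_new, alpha, eps, gamma):
    """Conservative policy iteration bound:
    eta(pi_new) >= L_pi_old(pi_new) - 2*eps*gamma/(1-gamma)^2 * alpha^2."""
    return L_new - (2.0 * eps * gamma / (1.0 - gamma) ** 2) * alpha ** 2
```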
-
Monotonic Improvement Guarantee for General Stochastic Policies
- Conservative policy iteration uses a mixture policy (between old and argmax policy)
- We can extend this to general stochastic policies by replacing the mixture coefficient $\alpha$ with a distance measure between $\pi$ and $\tilde{\pi}$, and changing $\epsilon$ appropriately
- Uses total variation divergence as the distance measure: $D^{max}_{TV}(\pi, \tilde{\pi}) = max_s D_{TV}(\pi(\cdot \vert s) \vert\vert \tilde{\pi}(\cdot \vert s))$
- $D_{TV}(p \vert\vert q) = \frac{1}{2}\sum_i \vert p_i - q_i\vert$
- Using total variation divergence, we get the following bound:
- $\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2$ where $\alpha = D^{max}_{TV}(\pi_{old}, \pi_{new})$
- $\epsilon = max_{s,a}\vert A_\pi(s,a)\vert$
- We can also use KL divergence because we know $D_{TV}(p\vert\vert q)^2 \leq D_{KL}(p\vert\vert q)$
- $\eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi}) - C D^{max}_{KL}(\pi, \tilde{\pi})$ where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$
- Guaranteed to produce monotonically non-decreasing policies when we maximize $L_\pi(\tilde{\pi}) - C D^{max}_{KL}(\pi, \tilde{\pi})$
- This is a minorization-maximization (MM) algorithm (the argument is sketched after this list)
- TRPO uses a KL divergence constraint instead of a penalty to allow large updates
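Sketch of the MM argument behind the monotonic improvement claim: define $M_i(\pi) = L_{\pi_i}(\pi) - C D^{max}_{KL}(\pi_i, \pi)$. The bound above says $\eta(\pi) \geq M_i(\pi)$ for every $\pi$, and $\eta(\pi_i) = M_i(\pi_i)$ because the KL term vanishes at $\pi = \pi_i$. Hence $\eta(\pi_{i+1}) - \eta(\pi_i) \geq M_i(\pi_{i+1}) - M_i(\pi_i)$, which is non-negative whenever $\pi_{i+1}$ maximizes (or merely improves) the surrogate $M_i$, so the true return never decreases.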
-
Optimization of Parameterized Policies
- If we maximize the penalized objective: $maximize_\theta [L_{\theta_{old}}(\theta) - CD^{max}_{KL}(\theta_{old}, \theta)]$
- Guaranteed to improve true objective $\eta$
- However, with the penalty coefficient $C$ given by the theory, the step sizes would be very small
- Instead we can use a KL divergence (trust region) constraint between new and old policies
- $maximize L_{\theta_{old}}(\theta)$ subject to $D_{KL}^{max}(\theta_{old}, \theta) \leq \delta$
- In practice, we use the average KL over visited states as a heuristic approximation of the max: $\bar{D}^\rho_{KL}(\theta_1, \theta_2) = \mathbb{E}_{s \sim \rho}[D_{KL}(\pi_{\theta_1}(\cdot \vert s)\vert\vert \pi_{\theta_2}(\cdot \vert s))]$
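A minimal sketch of how this average KL could be estimated in practice, assuming diagonal Gaussian policies (as used for the locomotion tasks); the function name and array layout are my own assumptions, not the paper's code.

```python
import numpy as np

def mean_kl_diag_gaussian(mu_old, log_std_old, mu_new, log_std_new):
    """Estimate D_bar_KL(theta_old, theta_new) = E_{s~rho}[KL(pi_old(.|s) || pi_new(.|s))].

    All inputs have shape (N, action_dim), evaluated at the same N sampled states.
    """
    var_old = np.exp(2.0 * log_std_old)
    var_new = np.exp(2.0 * log_std_new)
    # Closed-form KL between two univariate Gaussians, per action dimension.
    kl_per_dim = (log_std_new - log_std_old
                  + (var_old + (mu_old - mu_new) ** 2) / (2.0 * var_new)
                  - 0.5)
    # Sum over action dimensions, then average over the sampled states.
    return np.mean(np.sum(kl_per_dim, axis=-1))
```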
-
Sample-Based Estimation of the Objective and Constraint
- We need to approximate the objective and constraint using Monte Carlo simulation
- Original optimization problem: $maximize_\theta \sum_s \rho_{\theta_{old}}(s)\sum_a \pi_\theta(a\vert s)A_{\theta_{old}}(s,a)$ subject to $\bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \leq \delta$
- Replace $\sum_s \rho_{\theta_{old}}(s)[\dots]$ with the expectation $\frac{1}{1-\gamma}\mathbb{E}_{s \sim \rho_{\theta_{old}}}[\dots]$
- Replace the advantages $A_{\theta_{old}}$ with Q-values $Q_{\theta_{old}}$ (changes the objective only by a constant)
- Replace the sum over actions with an importance sampling estimator using the ratio $\frac{\pi_\theta(a\vert s_n)}{q(a\vert s_n)}$, where $q$ is the sampling distribution
- Finally, replace the expectations with sample averages and the Q-values with empirical estimates (a minimal estimator sketch follows this list)
- Two sampling schemes: single path and vine
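Putting the substitutions together, a minimal sketch of the sampled surrogate; the helper name is hypothetical, `logp_theta` and `logp_q` are log-probabilities of the sampled actions under $\pi_\theta$ and the sampling distribution $q$, and `q_old` holds the $Q_{\theta_{old}}$ estimates.

```python
import numpy as np

def sampled_surrogate(logp_theta, logp_q, q_old):
    """Sample average of (pi_theta(a_n|s_n) / q(a_n|s_n)) * Q_old(s_n, a_n)."""
    ratios = np.exp(logp_theta - logp_q)  # importance weights pi_theta / q
    return np.mean(ratios * q_old)        # sample average replaces the expectations
```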
-
Single Path
- Collect a sequence of states by sampling $s_0 \sim \rho_0$ and then simulating the policy $\pi_{\theta_{old}}$ for some number of timesteps
- $q(a\vert s) = \pi_{\theta_{old}}(a\vert s)$
- $Q_{old}$ computed at each state-action pair by taking the discounted sum of future rewards
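A small sketch of that discounted-return computation for one trajectory (assumed helper name, not from the paper):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Q_old(s_t, a_t) estimated as sum_{l>=0} gamma^l * r_{t+l} along the sampled trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the end of the trajectory
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```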
-
Vine
- Sample $s_0 \sim \rho_0$ and then simulate the policy $\pi_{\theta_i}$
- Generate a number of trajectories
- Choose a subset of N states along these trajectories, known as the rollout set
- For each state $s_n$ in the rollout set, sample K actions from $q(\cdot \vert s_n)$ → $q(\cdot \vert s_n) = \pi_{\theta_i}(\cdot \vert s_n)$ works well in practice on continuous problems
- Estimate $\hat{Q}_{\theta_i}(s_n, a_{n,k})$ by doing a small rollout
- For finite action space
- $L_n(\theta) = \sum_{k=1}^K \pi_\theta(a_k \vert s_n)\hat{Q}(s_n, a_k)$
- For continuous action space, we can estimate using importance sampling and normalize it:
- $L_n(\theta) = \frac{\sum_{k=1}^K\frac{\pi_\theta(a_{n,k}\vert s_n)}{\pi_{\theta_{old}}(a_{n,k}\vert s_n)}\hat{Q}(s_n, a_{n,k})}{\sum_{k=1}^K\frac{\pi_\theta(a_{n,k}\vert s_n)}{\pi_{\theta_{old}}(a_{n,k}\vert s_n)}}$
- Self-normalization removes the need for a baseline on the Q-values (adding a constant to the Q-values leaves the gradient unchanged); see the sketch after this list
- Vine results in much lower variance than single path given same number of Q-value samples
- But requires more calls to simulator
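A sketch of the self-normalized estimator for one rollout state, under assumed names: `logp_theta` and `logp_theta_old` are length-K arrays of log-probabilities of the sampled actions, `q_hat` the corresponding rollout value estimates.

```python
import numpy as np

def vine_state_loss(logp_theta, logp_theta_old, q_hat):
    """Self-normalized importance sampling estimate of L_n(theta) at state s_n."""
    w = np.exp(logp_theta - logp_theta_old)  # per-action ratios pi_theta / pi_theta_old
    return np.sum(w * q_hat) / np.sum(w)     # normalization removes the need for a baseline
```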
-
Practical Algorithm
- Use single path or vine to collect set of state-action pairs + Q value estimates
- Construct estimated objective + constraint
- Approximately solve to update parameters
- Use the conjugate gradient algorithm followed by a line search (a CG sketch follows this list)
- Compute Fisher-vector products by analytically computing the Hessian of the KL divergence, rather than using the covariance matrix of the gradients
- This removes the need to store a dense Hessian or the policy gradients for an entire batch of trajectories
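A sketch of the conjugate gradient step, assuming a callable `fvp(v)` that returns the Fisher-vector product $Fv$ (obtained analytically from the KL Hessian) and `g`, the gradient of the sampled surrogate; this is a generic CG solver written for illustration, not the paper's implementation.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()   # residual g - F x (x starts at 0)
    p = g.copy()   # search direction
    rr = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / p.dot(Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r.dot(r)
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

The search direction $x$ is then scaled to the trust-region boundary, $\beta = \sqrt{2\delta / (x^\top F x)}$, and shrunk by a backtracking line search until the surrogate improves and the KL constraint is satisfied.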
-
Connections with Prior Work
- The natural policy gradient treats the step size as a fixed hyperparameter
- TRPO instead enforces the KL divergence constraint at every update (see the update comparison after this list)
- This improves performance significantly
- We can get the standard policy gradient update using an L2 constraint instead of KL
- Relative entropy policy search (REPS) constrains the state-action marginals $p(s,a)$, whereas TRPO constrains the conditionals $p(a\vert s)$
- KL Divergence has been used to ensure policy does not stray away from regions where dynamics model is valid
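For reference (as described in the paper), linearizing $L$ and taking a quadratic approximation to the KL constraint recovers the natural policy gradient update $\theta_{new} = \theta_{old} + \frac{1}{\lambda} A(\theta_{old})^{-1} \nabla_\theta L(\theta)\vert_{\theta=\theta_{old}}$, where $A$ is the Fisher information matrix and $\frac{1}{\lambda}$ is a fixed step size; TRPO instead re-solves for the step that satisfies the KL constraint at every update.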
-
Experiments
-
Simulated Robotic Locomotion
- Tested on swimmer, hopper, and walker
- TRPO had the best solutions on all problems relative to natural policy gradient
- The KL divergence constraint was a much better way to choose step sizes than a fixed penalty coefficient
- TRPO learned all policies with simple rewards + minimal prior knowledge
-
Playing Games from Images
- TRPO only outperformed prior methods on some of the games, but it achieved reasonable scores throughout
- TRPO was not made for these tasks, but its performance demonstrates generalization