Trust Region Policy Optimization
Resources
Paper
-
Introduction
- 3 categories of policy optimization
- Policy Iteration
- Policy Gradient
- Derivative Free
- Treat the return as a black-box function of the policy parameters and optimize it directly
- Usually preferred because simple to implement + good results
- Gradient-based optimization has been good for supervised learning but less so for RL
- Minimizing a certain surrogate objective guarantees policy improvement with non-trivial step sizes
- TRPO Variants
- Single Path: Applied in model-free setting
- Vine: Requires restoring system to specific state → only possible in simulation
- TRPO is scalable to systems with thousands or millions of parameters
-
Preliminaries
- Use standard definitions of policy, state and action value functions, and advantage functions.
- Expected return of another policy $\tilde{\pi}$ in terms of advantage over $\pi$, accumulated over timesteps
- $\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \dots \sim \tilde{\pi}}[\sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t)]$
- Equivalent to $\eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a\vert s)A_\pi(s,a)$
- Implies any policy update with non-negative expected advantage at every state is guaranteed to increase the policy performance $\eta$, or leave it constant
- Exact (tabular) policy iteration satisfies this because it updates with the deterministic policy $\tilde{\pi}(s) = \arg\max_a A_\pi(s,a)$
- With estimation and function approximation error, some states will have negative expected advantage; the dependency on $\rho_{\tilde{\pi}}(s)$ also makes the expression hard to optimize directly
- Instead use: $L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a\vert s)A_\pi(s, a)$
- This uses the visitation frequency of the current policy, $\rho_\pi$, instead of the new policy's $\rho_{\tilde{\pi}}$, which makes it easier to optimize (a small numeric sketch of this surrogate appears after this list)
- If $\pi_\theta(a\vert s)$ is a differentiable function of the parameter vector $\theta$, then $L$ matches $\eta$ to first order at $\theta_0$:
- $L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0})$
- $\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\vert_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\vert_{\theta=\theta_0}$
- So a sufficiently small step $\pi_{\theta_0} \rightarrow \tilde{\pi}$ that improves $L_{\pi_{\theta_0}}$ will also improve $\eta$
- We don’t know how small of a step to take, though
- Conservative Policy Iterations
- Provides lower bounds on improvement of $\eta$
- Let $\pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi')$
- New Policy: $\pi_{new}(a\vert s) = (1-\alpha)\pi_{old}(a\vert s) + \alpha \pi’(a\vert s)$
- Lower Bound Improvement: $\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2$
- $\epsilon = max_s \vert\mathbb{E}_{a \sim \pi’(a\vert s)}[A_\pi(s,a)]\vert$
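As a concrete reference for the surrogate $L_\pi$ defined earlier and the lower bound just above, here is a minimal NumPy sketch for a tabular problem; the helper names and array shapes are my own assumptions, not code from the paper.

```python
import numpy as np

def surrogate_objective(eta_pi, rho_pi, pi_new, adv_pi):
    """L_pi(pi_new) = eta(pi) + sum_s rho_pi(s) sum_a pi_new(a|s) A_pi(s, a).

    rho_pi: (S,) discounted state visitation of the current policy
    pi_new: (S, A) candidate policy; adv_pi: (S, A) advantages A_pi(s, a).
    """
    return eta_pi + np.sum(rho_pi[:, None] * pi_new * adv_pi)

def cpi_lower_bound(L_new, alpha, eps, gamma):
    """Conservative policy iteration bound:
    eta(pi_new) >= L_pi_old(pi_new) - 2*eps*gamma/(1-gamma)^2 * alpha^2."""
    return L_new - (2.0 * eps * gamma / (1.0 - gamma) ** 2) * alpha ** 2
```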
-
Monotonic Improvement Guarantee for General Stochastic Policies
- Conservative policy iteration uses a mixture policy (between old and argmax policy)
- We can extend this to general stochastic policies by replacing the mixture coefficient $\alpha$ with a distance measure between $\pi$ and $\tilde{\pi}$, and changing $\epsilon$ appropriately
- Uses total variation divergence as the distance measure: $D^{max}_{TV}(\pi, \tilde{\pi}) = max_s D_{TV}(\pi(\cdot \vert s) \vert\vert \tilde{\pi}(\cdot \vert s))$
- $D_{TV}(p \vert\vert q) = \frac{1}{2}\sum_i \vert p_i - q_i\vert$
- Using total variation divergence, we get the following bound:
- $\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2$ where $\alpha = D^{max}_{TV}(\pi_{old}, \pi_{new})$
- $\epsilon = max_{s,a}\vert A_\pi(s,a)\vert$
- We can also use KL divergence because we know $D_{TV}(p\vert\vert q)^2 \leq D_{KL}(p\vert\vert q)$
- $\eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi}) - C D^{max}_{KL}(\pi, \tilde{\pi})$ where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$
- Guaranteed to produce monotonically non-decreasing policies when we maximize $L_\pi(\tilde{\pi}) - C D^{max}_{KL}(\pi, \tilde{\pi})$
- This is a minorization-maximization (MM) algorithm (the argument is sketched after this list)
- TRPO uses a KL divergence constraint instead of a penalty to allow large updates
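Sketch of the MM argument behind the monotonic improvement claim: define $M_i(\pi) = L_{\pi_i}(\pi) - C D^{max}_{KL}(\pi_i, \pi)$. The bound above says $\eta(\pi) \geq M_i(\pi)$ for every $\pi$, and $\eta(\pi_i) = M_i(\pi_i)$ because the KL term vanishes at $\pi = \pi_i$. Hence $\eta(\pi_{i+1}) - \eta(\pi_i) \geq M_i(\pi_{i+1}) - M_i(\pi_i)$, which is non-negative whenever $\pi_{i+1}$ maximizes (or merely improves) the surrogate $M_i$, so the true return never decreases.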
-
Optimization of Parameterized Policies
- If we maximize the penalized objective: $maximize_\theta [L_{\theta_{old}}(\theta) - CD^{max}_{KL}(\theta_{old}, \theta)]$
- Guaranteed to improve true objective $\eta$
- However, with the penalty coefficient $C$ given by the theory, the step sizes would be very small
- Instead we can use a KL divergence (trust region) constraint between new and old policies
- $maximize L_{\theta_{old}}(\theta)$ subject to $D_{KL}^{max}(\theta_{old}, \theta) \leq \delta$
- In practice, we use the average KL over visited states as a heuristic approximation of the max: $\bar{D}^\rho_{KL}(\theta_1, \theta_2) = \mathbb{E}_{s \sim \rho}[D_{KL}(\pi_{\theta_1}(\cdot \vert s)\vert\vert \pi_{\theta_2}(\cdot \vert s))]$
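A minimal sketch of how this average KL could be estimated in practice, assuming diagonal Gaussian policies (as used for the locomotion tasks); the function name and array layout are my own assumptions, not the paper's code.

```python
import numpy as np

def mean_kl_diag_gaussian(mu_old, log_std_old, mu_new, log_std_new):
    """Estimate D_bar_KL(theta_old, theta_new) = E_{s~rho}[KL(pi_old(.|s) || pi_new(.|s))].

    All inputs have shape (N, action_dim), evaluated at the same N sampled states.
    """
    var_old = np.exp(2.0 * log_std_old)
    var_new = np.exp(2.0 * log_std_new)
    # Closed-form KL between two univariate Gaussians, per action dimension.
    kl_per_dim = (log_std_new - log_std_old
                  + (var_old + (mu_old - mu_new) ** 2) / (2.0 * var_new)
                  - 0.5)
    # Sum over action dimensions, then average over the sampled states.
    return np.mean(np.sum(kl_per_dim, axis=-1))
```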
-
Sample-Based Estimation of the Objective and Constraint
- We need to approximate the objective and constraint using Monte Carlo simulation
- Original optimization problem: $maximize_\theta \sum_s \rho_{\theta_{old}}(s)\sum_a \pi_\theta(a\vert s)A_{\theta_{old}}(s,a)$ subject to $\bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \leq \delta$
- Replace $\sum_s \rho_{\theta_{old}}(s)[\dots]$ with the expectation $\frac{1}{1-\gamma}\mathbb{E}_{s \sim \rho_{\theta_{old}}}[\dots]$
- Replace the advantages $A_{\theta_{old}}$ with Q-values $Q_{\theta_{old}}$ (changes the objective only by a constant)
- Replace the sum over actions with an importance sampling estimator using the ratio $\frac{\pi_\theta(a\vert s_n)}{q(a\vert s_n)}$, where $q$ is the sampling distribution
- Finally, replace the expectations with sample averages and the Q-values with empirical estimates (a minimal estimator sketch follows this list)
- Two sampling schemes: single path and vine
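Putting the substitutions together, a minimal sketch of the sampled surrogate; the helper name is hypothetical, `logp_theta` and `logp_q` are log-probabilities of the sampled actions under $\pi_\theta$ and the sampling distribution $q$, and `q_old` holds the $Q_{\theta_{old}}$ estimates.

```python
import numpy as np

def sampled_surrogate(logp_theta, logp_q, q_old):
    """Sample average of (pi_theta(a_n|s_n) / q(a_n|s_n)) * Q_old(s_n, a_n)."""
    ratios = np.exp(logp_theta - logp_q)  # importance weights pi_theta / q
    return np.mean(ratios * q_old)        # sample average replaces the expectations
```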
-
Single Path
- Collect a sequence of states by sampling $s_0 \sim \rho_0$ and then simulating the policy $\pi_{\theta_{old}}$ for some number of timesteps
- $q(a\vert s) = \pi_{\theta_{old}}(a\vert s)$
- $Q_{old}$ computed at each state-action pair by taking the discounted sum of future rewards
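A small sketch of that discounted-return computation for one trajectory (assumed helper name, not from the paper):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Q_old(s_t, a_t) estimated as sum_{l>=0} gamma^l * r_{t+l} along the sampled trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the end of the trajectory
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```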
-
Vine
- Sample $s_0 \sim \rho_0$ and then simulate the policy $\pi_{\theta_i}$
- Generate a number of trajectories
- Choose a subset of N states along these trajectories, known as the rollout set
- For each state $s_n$ in the rollout set, sample K actions from $q(\cdot \vert s_n)$ → $q(\cdot \vert s_n) = \pi_{\theta_i}(\cdot \vert s_n)$ works well in practice on continuous problems
- Estimate $\hat{Q}_{\theta_i}(s_n, a_{n,k})$ by doing a small rollout
- For finite action space
- $L_n(\theta) = \sum_{k=1}^K \pi_\theta(a_k \vert s_n)\hat{Q}(s_n, a_k)$
- For continuous action space, we can estimate using importance sampling and normalize it:
- $L_n(\theta) = \frac{\sum_{k=1}^K\frac{\pi_\theta(a_{n,k}\vert s_n)}{\pi_{\theta_{old}}(a_{n,k}\vert s_n)}\hat{Q}(s_n, a_{n,k})}{\sum_{k=1}^K\frac{\pi_\theta(a_{n,k}\vert s_n)}{\pi_{\theta_{old}}(a_{n,k}\vert s_n)}}$
- Self-normalization removes the need for a baseline on the Q-values (adding a constant to the Q-values leaves the gradient unchanged); see the sketch after this list
- Vine results in much lower variance than single path given same number of Q-value samples
- But requires more calls to simulator
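A sketch of the self-normalized estimator for one rollout state, under assumed names: `logp_theta` and `logp_theta_old` are length-K arrays of log-probabilities of the sampled actions, `q_hat` the corresponding rollout value estimates.

```python
import numpy as np

def vine_state_loss(logp_theta, logp_theta_old, q_hat):
    """Self-normalized importance sampling estimate of L_n(theta) at state s_n."""
    w = np.exp(logp_theta - logp_theta_old)  # per-action ratios pi_theta / pi_theta_old
    return np.sum(w * q_hat) / np.sum(w)     # normalization removes the need for a baseline
```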
-
Practical Algorithm
- Use single path or vine to collect set of state-action pairs + Q value estimates
- Construct estimated objective + constraint
- Approximately solve to update parameters
- Use the conjugate gradient algorithm followed by a line search (a CG sketch follows this list)
- Compute Fisher-vector products by analytically computing the Hessian of the KL divergence, rather than using the covariance matrix of the gradients
- This removes the need to store a dense Hessian or the policy gradients for an entire batch of trajectories
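A sketch of the conjugate gradient step, assuming a callable `fvp(v)` that returns the Fisher-vector product $Fv$ (obtained analytically from the KL Hessian) and `g`, the gradient of the sampled surrogate; this is a generic CG solver written for illustration, not the paper's implementation.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()   # residual g - F x (x starts at 0)
    p = g.copy()   # search direction
    rr = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / p.dot(Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r.dot(r)
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

The search direction $x$ is then scaled to the trust-region boundary, $\beta = \sqrt{2\delta / (x^\top F x)}$, and shrunk by a backtracking line search until the surrogate improves and the KL constraint is satisfied.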
-
Connections with Prior Work
- The natural policy gradient treats the step size as a fixed hyperparameter
- TRPO instead enforces the KL divergence constraint at every update (see the update comparison after this list)
- This improves performance significantly
- We can get the standard policy gradient update using an L2 constraint instead of KL
- Relative entropy policy search (REPS) constrains the state-action marginals $p(s,a)$, whereas TRPO constrains the conditionals $p(a\vert s)$
- KL Divergence has been used to ensure policy does not stray away from regions where dynamics model is valid
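For reference (as described in the paper), linearizing $L$ and taking a quadratic approximation to the KL constraint recovers the natural policy gradient update $\theta_{new} = \theta_{old} + \frac{1}{\lambda} A(\theta_{old})^{-1} \nabla_\theta L(\theta)\vert_{\theta=\theta_{old}}$, where $A$ is the Fisher information matrix and $\frac{1}{\lambda}$ is a fixed step size; TRPO instead re-solves for the step that satisfies the KL constraint at every update.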
-
Experiments
-
Simulated Robotic Locomotion
- Tested on swimmer, hopper, and walker
- TRPO had the best solutions on all problems relative to natural policy gradient
- The KL divergence constraint was a much better way to choose step sizes than a fixed penalty coefficient
- TRPO learned all policies with simple rewards + minimal prior knowledge
-
Playing Games from Images
- TRPO only outperformed prior methods on some of the games, but it achieved reasonable scores throughout
- TRPO was not made for these tasks, but its performance demonstrates generalization