In the policy gradient estimator $g = \mathbb{E}[\sum_{t=0}^\infty \psi_t \nabla_\theta \log \pi_\theta (a_t \vert s_t)]$, $\psi_t$ can be the total reward of the trajectory, the reward following the action, the baselined reward, the state-action value function, the advantage function, or a TD residual
The advantage function usually gives the lowest variance, but it is not known exactly and must be estimated
Intuitively this makes sense: the policy gradient should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions, which is exactly what the advantage measures (a few of the simpler choices of $\psi_t$ are sketched in code below)
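To make these concrete, here is a minimal NumPy sketch (not the paper's code; the toy trajectory and the mean-reward baseline are illustrative assumptions) computing the total-trajectory-reward, reward-to-go, and baselined reward-to-go choices of $\psi_t$ for a single finite trajectory.

```python
import numpy as np

def psi_variants(rewards, baseline=0.0):
    """rewards: array of r_0, ..., r_{T-1} from one trajectory."""
    T = len(rewards)
    total_return = np.full(T, rewards.sum())        # psi_t = total reward of trajectory
    reward_to_go = np.cumsum(rewards[::-1])[::-1]   # psi_t = reward following action a_t
    baselined = reward_to_go - baseline             # psi_t = baselined reward
    return total_return, reward_to_go, baselined

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(psi_variants(rewards, baseline=rewards.mean()))
```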
$\gamma$ parameter for downweighting rewards corresponding to delayed effects (less variance but more bias)
Takes the same form as the usual MDP discount factor, but here it is treated as a variance-reduction parameter rather than part of the problem definition
Discounted approximation to policy gradient using advantages: $g^\gamma = \mathbb{E}[\sum_{t=0}^\infty A^{\pi, \gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta (a_t\vert s_t)]$
$\gamma < 1$ means terms with $l \gg \frac{1}{1-\gamma}$ are effectively dropped (e.g., $\gamma = 0.99$ gives an effective horizon of roughly 100 steps)
If the shaped reward is obtained via reward shaping with $\Phi = V^{\pi, \gamma}$, then $\mathbb{E}[\tilde{r}_{t+l} \vert s_t, a_t] = \mathbb{E}[\tilde{r}_{t+l} \vert s_t]$ for $l > 0$: a temporally extended response is turned into an immediate response, because the value function absorbs the temporal spread of the reward
Helps gradient focus on near-term outcomes
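As a sketch of how the TD residual and reward shaping tie together (a minimal NumPy example, not the paper's code; episodic data, precomputed value estimates, and the toy numbers are assumptions): with $\Phi = V$ the shaped reward is exactly the TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, and the $(\gamma\lambda)$-discounted sum of residuals gives the generalized advantage estimate; the $\lambda$ here is the GAE parameter varied in the experiments below.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T), with a bootstrap value appended."""
    T = len(rewards)
    # Shaped rewards with Phi = V are exactly the TD residuals delta_t.
    deltas = rewards + gamma * values[1:] - values[:-1]
    # Discounted sum of residuals with factor gamma * lam, computed backwards in time.
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.6, 0.4, 0.0])  # terminal value assumed to be 0
print(gae_advantages(rewards, values))
```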
Value Function Estimation
To estimate the value function, regress against Monte Carlo returns: $\text{minimize}_\phi \sum_{n=1}^N \vert\vert V_\phi (s_n) - \hat{V}_n \vert\vert^2$
Use a trust region to avoid overfitting on the most recent batch of data: constrain $\frac{1}{N}\sum_{n=1}^N \frac{\vert\vert V_\phi (s_n) - V_{\phi_{old}}(s_n) \vert\vert^2}{2 \sigma^2} \leq \epsilon$, where $\sigma^2 = \frac{1}{N}\sum_{n=1}^N \vert\vert V_{\phi_{old}}(s_n) - \hat{V}_n \vert\vert^2$ is the old value function's mean squared error
Similar to TRPO KL Divergence Constraint
Use approximate solution with conjugate gradients
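A simplified sketch of the trust-region value fit (assuming a linear value function so the unconstrained fit has a closed form; the paper fits a neural network and solves the constrained problem approximately with conjugate gradient, whereas the backtracking step below is a crude stand-in):

```python
import numpy as np

def fit_value_trust_region(features, mc_returns, phi_old, eps=0.01):
    """features: (N, d) state features; mc_returns: (N,) Monte Carlo return targets."""
    # Unconstrained least-squares fit: minimize ||features @ phi - mc_returns||^2.
    phi_new, *_ = np.linalg.lstsq(features, mc_returns, rcond=None)

    # sigma^2: mean squared error of the old value function on this batch.
    old_pred = features @ phi_old
    sigma2 = np.mean((old_pred - mc_returns) ** 2) + 1e-8

    # Trust-region constraint: (1/N) sum ||V_phi(s_n) - V_phi_old(s_n)||^2 / (2 sigma^2) <= eps.
    def constraint(phi):
        return np.mean((features @ phi - old_pred) ** 2) / (2.0 * sigma2)

    # If the full step violates the constraint, shrink it until it fits.
    alpha = 1.0
    while constraint(phi_old + alpha * (phi_new - phi_old)) > eps and alpha > 1e-4:
        alpha *= 0.5
    return phi_old + alpha * (phi_new - phi_old)
```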
Experiments
Policy Optimization Algorithm
For experiments, they use GAE with TRPO
Vary the $\gamma, \lambda$ parameters to see effects
Use value function for advantage estimation
Experimental Setup
Architecture
3 Hidden Layers with tanh activations
Same policy + value function architecture
Final layer = linear
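A sketch of that architecture in PyTorch (the hidden width of 64 and the input/output dimensions below are assumptions, not values from these notes):

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # 3 hidden layers with tanh activations, linear final layer.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

policy_net = mlp(in_dim=17, out_dim=6)  # e.g., observation -> action mean (dims assumed)
value_net = mlp(in_dim=17, out_dim=1)   # same architecture used for the value function
```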
Task Details
Tasks: cart-pole, plus MuJoCo 3D bipedal locomotion, quadrupedal locomotion, and a biped dynamically standing up
Experimental Results
Cart-Pole
Fixed $\gamma$ while sweeping $\lambda \in [0, 1]$; $\lambda \in [0.92, 0.98]$ gives the fastest policy improvement
3D Bipedal Locomotion
Best $\gamma \in [0.99, 0.995]$ with $\lambda \in [0.96, 0.99]$
3D Robot Tasks
Quadrupedal locomotion: fixed $\gamma = 0.995$ with best $\lambda = 0.96$
3D standing: fixed $\gamma = 0.995$ with best $\lambda \in [0.96, 1]$
Discussion
Control problems are difficult to solve because of high sample complexity
Sample complexity can be reduced by obtaining good estimates of the advantage function
Future work should look to tune $\gamma, \lambda$ automatically
If we knew the relationship between policy gradient estimation error and value function estimation error, we could choose a value function error metric that is well matched to the policy gradient