Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
-
Resources
-
Introduction
- On-Policy: Stable but data inefficient
- Off-Policy: Data efficient but unstable / hard to use
- Combining on-policy and off-policy
- Mix a ratio of on and off-policy gradients / update steps (ACER, PGQL)
- No theoretical bounds on error of off-policy updates
- Off-policy Q critic trained and used as control variate to reduce on-policy gradient variance (Q-Prop)
- Policy updates don’t use off-policy data
- Interpolated Policy Gradient: interpolates between on-policy and off-policy learning
- Biased method but bias is bounded
-
Preliminaries
-
On-Policy Likelihood Ratio Policy Gradient
- Likelihood Policy Gradient: $\nabla _{\theta} J(\theta) = \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) \hat{A}(s_t, a_t)]$
- Unbiased
- High variance
- Sample inefficient (a minimal sketch of the estimator follows below)
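A minimal sketch of this estimator, assuming a PyTorch policy network that returns Gaussian parameters and precomputed advantage estimates (all names here are illustrative, not from the paper):

```python
import torch
from torch.distributions import Normal

def likelihood_ratio_surrogate(policy_net, states, actions, advantages):
    """Surrogate loss whose gradient equals the on-policy likelihood-ratio
    estimate E[ grad log pi(a|s) * A_hat(s, a) ] over the sampled batch."""
    mean, log_std = policy_net(states)         # assumed: Gaussian policy head
    dist = Normal(mean, log_std.exp())
    log_prob = dist.log_prob(actions).sum(-1)  # sum over action dimensions
    # Negative sign so that minimizing the loss ascends the RL objective.
    return -(log_prob * advantages.detach()).mean()
```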
-
Off-Policy Deterministic Policy Gradient
- Value and Policy Weights:
- Value: $w \leftarrow \arg\min_w \mathbb{E} _{\beta}[(Q_w(s_t, a_t) - y_t)^2]$
- Policy: $\theta \leftarrow \arg\max_\theta \mathbb{E} _{\beta}[Q_w(s_t, \mu _\theta(s_t))]$ (sketched below)
- $\beta$: Behavior policy (i.e., exploration buffer)
- Deterministic Policy Gradient: $\nabla _{\theta} J(\theta) \approx \mathbb{E} _{\rho^\beta}[\nabla _{\theta}Q_w(s_t, \mu _{\theta}(s_t))]$
- Biased, due both to an imperfect critic $Q_w$ and to off-policy sampling from $\beta$; the bias is unbounded
- Causes off-policy to be less stable
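A sketch of the two fitting steps above, assuming PyTorch networks `q_net` (the critic $Q_w$) and `mu_net` (the deterministic actor $\mu _\theta$) plus a replay-buffer batch; target networks for computing $y_t$ are omitted for brevity:

```python
import torch.nn.functional as F

def critic_loss(q_net, states, actions, targets):
    """Fit Q_w by regressing onto bootstrapped targets y_t from the buffer."""
    q = q_net(states, actions).squeeze(-1)
    return F.mse_loss(q, targets)

def actor_loss(q_net, mu_net, states):
    """Deterministic policy gradient: ascend Q_w(s, mu_theta(s)) on buffer states."""
    return -q_net(states, mu_net(states)).mean()
```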
-
Off-Policy Control Variate Fitting
- Baselines can be used to reduce the variance of a Monte Carlo estimator
- Q-Prop used a first-order Taylor expansion of $Q_w$ as the control variate
- Gradient: $\nabla _{\theta} J(\theta) = \mathbb{E} _{\rho^\pi, \pi}[\nabla _{\theta} \log \pi _{\theta}(a_t \vert s_t)(\hat{Q}(s_t, a_t) - \tilde{Q}_w (s_t, a_t))] + \mathbb{E} _{\rho^\pi}[\nabla _{\theta}Q_w(s_t, \mu _{\theta}(s_t))]$
- Combines likelihood ratio + deterministic policy gradient
- Lower variance + is stable
- Uses only on-policy samples for estimating policy gradient
-
Interpolated Policy Gradient
- Mixes likelihood gradient with deterministic gradient
- $\nabla _\theta J(\theta) \approx (1-\nu) \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) \hat{A}(s_t, a_t)] + \nu \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- Biased from off-policy sampling + inaccuracies in $Q_w$
- $\bar{Q}_w^\pi(s_t) = \mathbb{E} _{\pi}[Q_w(s_t, \cdot)]$: the critic averaged over actions from $\pi$, i.e., an estimate of the state-value function (sketched below)
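The off-policy term needs $\bar{Q}_w^\pi(s) = \mathbb{E} _{a \sim \pi}[Q_w(s, a)]$, which can be estimated with a few reparameterized action samples so the gradient flows through $\theta$ (a hedged sketch, same assumed network interfaces as above):

```python
import torch
from torch.distributions import Normal

def q_bar(q_net, policy_net, states, n_samples=4):
    """Monte Carlo estimate of Q_bar_w^pi(s) = E_{a~pi}[Q_w(s, a)];
    rsample() keeps the sampled actions differentiable w.r.t. the policy."""
    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    samples = [q_net(states, dist.rsample()).squeeze(-1) for _ in range(n_samples)]
    return torch.stack(samples, dim=0).mean(dim=0)
```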
-
Control Variates for Interpolated Policy Gradient
- $\nabla _{\theta}J(\theta) \approx (1-\nu) \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) \hat{A}(s_t, a_t)] + \nu \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- Biased approximation from IPG
- $= (1-\nu) \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) (\hat{A}(s_t, a_t) - A_w^\pi(s_t, a_t))] + (1 - \nu)\mathbb{E} _{\rho^\pi}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)] + \nu \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- $\approx (1-\nu) \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) (\hat{A}(s_t, a_t) - A_w^\pi(s_t, a_t))] + \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- Replaces $\rho^\pi$ in control variate correction term with $\rho^\beta$
- Adds bias if $\beta \neq \pi$
- Likelihood-ratio gradients are weighted by the residual signal $\hat{A}(s_t, a_t) - A_w^\pi(s_t, a_t)$
- The difference between the empirical on-policy advantage estimate and the critic-based advantage estimate
-
Relationship to Prior Policy Gradient and Actor-Critic Methods
- REINFORCE: $\nu = 0$, no control variate
- Q-Prop: $\beta = \pi, \nu = 0$, with control variate
- DDPG: $\nu = 1$, $\beta \neq \pi$
- PGQL: $\beta \neq \pi$, no control variate
- Algorithm (a code sketch follows after this list):
- Initialize $Q_w$, stochastic policy $\pi _{\theta}$, replay buffer
- Repeat until convergence:
- Rollout $\pi _{\theta}$ to collect batch of episodes
- Fit $Q_w$ using replay buffer and $\pi$
- Fit state-value function using the batch of episodes (E) with timesteps (T)
- Compute advantage estimate using batch data + state-value function
- If using control variate:
- Compute critic advantage estimate ($\bar{A} _{t,e}$) using batch data, $Q_w$, and $\pi _{\theta}$
- Compute and center learning signals: $l _{t,e} = \hat{A} _{t,e} - \bar{A} _{t,e}$; set $b=1$
- Else:
- Center learning signals: $l _{t,e} = \hat{A} _{t,e}$; set $b=\nu$
- Multiply $l _{t,e}$ by $(1 - \nu)$
- Sample data (M samples) from replay buffer or batch data (depending on behavior policy)
- Compute gradient $\nabla _{\theta} J(\theta) \approx \frac{1}{ET} \sum_e \sum_t \nabla _\theta \log \pi _{\theta}(a _{t,e} \vert s _{t,e})l _{t,e} + \frac{b}{M} \sum_m \nabla _{\theta}\bar{Q}_w^\pi(s_m)$
- Update policy using gradient
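A hedged sketch of one IPG update following the loop above, reusing the hypothetical `q_bar` helper from earlier; this is my reconstruction of the algorithm, not the authors' code:

```python
import torch
from torch.distributions import Normal

def ipg_surrogate(policy_net, q_net, batch, buffer_states, nu=0.2, use_cv=True):
    """Interpolated policy-gradient surrogate:
    (1 - nu) * likelihood-ratio term on the on-policy batch
      + b    * critic term E[ grad Q_bar_w^pi(s) ] on (possibly off-policy) states."""
    states, actions, adv_hat = batch["states"], batch["actions"], batch["adv"]

    if use_cv:
        # Critic-based advantage A_w^pi(s, a) = Q_w(s, a) - Q_bar_w^pi(s)
        adv_critic = q_net(states, actions).squeeze(-1) - q_bar(q_net, policy_net, states)
        learning_signal = adv_hat - adv_critic.detach()
        b = 1.0
    else:
        learning_signal = adv_hat
        b = nu
    learning_signal = learning_signal - learning_signal.mean()  # center

    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    log_prob = dist.log_prob(actions).sum(-1)
    lr_term = (log_prob * learning_signal).mean()

    # Critic term over states sampled from the replay buffer (or the batch).
    critic_term = q_bar(q_net, policy_net, buffer_states).mean()

    # Negated so that a standard optimizer step ascends the IPG objective.
    return -((1.0 - nu) * lr_term + b * critic_term)
```

In the paper this gradient is paired with a trust-region (TRPO-style) policy step; a plain optimizer step is implied here only to keep the sketch short.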
-
$\nu = 1$: Actor-Critic methods
- Policy can be deterministic
- Learning can be done completely off-policy
- Bias from off-policy sampling increases as total variation or KL divergence between $\beta$ and $\pi$ increases
- Actor-Critic with on-policy exploration could be more reliable
-
Theoretical Analysis
-
$\beta \neq \pi, \nu = 0$: Policy Gradient with Control Variate and Off-Policy Sampling
- Plugging $\beta \neq \pi$, $\nu = 0$ into the gradient approximation gives an equation similar to Q-Prop
- $\approx \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) (\hat{A}(s_t, a_t) - A_w^\pi(s_t, a_t))] + \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- Main difference: allows utilizing off-policy data for updating policy too
- Let $\tilde{J}(\pi, \tilde{\pi})$ be a local approximation to $J(\pi)$
- $J(\pi) = J(\tilde{\pi}) + \mathbb{E} _{\rho^\pi, \pi}[A^{\tilde{\pi}}(s_t, a_t)] \approx J(\tilde{\pi}) + \mathbb{E} _{\rho^{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] = \tilde{J}(\pi, \tilde{\pi})$
- $\tilde{\pi}$: Usually policy at iteration k
- $\pi$: Usually policy at iteration k+1
- Approximate objective: $\tilde{J}^{\beta, \nu = 0, CV} (\pi, \tilde{\pi}) = J(\tilde{\pi}) + \mathbb{E} _{\rho^{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t) - A_w^{\tilde{\pi}}(s_t, a_t)] + \mathbb{E} _{\rho^\beta}[\bar{A}_w^{\pi, \tilde{\pi}}(s_t)] \approx \tilde{J}(\pi, \tilde{\pi})$
- $\bar{A}_w^{\pi, \tilde{\pi}}(s_t) = \mathbb{E} _{\pi}[Q_w(s_t, \cdot)] - \mathbb{E} _{\tilde{\pi}}[Q_w(s_t, \cdot)]$
- The bias of this approximation can be bounded using KL divergences between the policies: $\vert \vert J(\pi) - \tilde{J}^{\beta, \nu = 0, CV}(\pi, \tilde{\pi}) \vert \vert_1 \leq 2 \frac{\gamma}{(1 - \gamma)^2}(\epsilon \sqrt{D _{KL}^{max}(\tilde{\pi}, \beta)} + \zeta \sqrt{D _{KL}^{max}(\tilde{\pi}, \pi)})$, where $\epsilon = \max_s \vert \bar{A}_w^{\pi, \tilde{\pi}}(s) \vert$ and $\zeta = \max_s \vert \bar{A}^{\pi, \tilde{\pi}}(s) \vert$
- First term bounds bias from off-policy sampling using KL between $\tilde{\pi}$ and $\beta$
- Second term shows $\tilde{J}^{\beta, \nu = 0, CV}$ is only a local approximation, valid when $\pi$ stays close to $\tilde{\pi}$
- Works well with policy gradient methods that constrain KL divergence (TRPO, NPG, etc.)
-
Monotonic Policy Improvement Guarantee
- IPG does in fact have theoretical guarantees on policy improvement
- Impractical to implement
- IPG with trust region updates approximates this monotonicity
-
General Bounds on the Interpolated Policy Gradient
- Let $\delta = \max _{s,a} \vert A^{\tilde{\pi}}(s,a) - A_w^{\tilde{\pi}}(s,a)\vert$, $\epsilon = \max_s \vert \bar{A}_w^{\pi, \tilde{\pi}}(s) \vert$, $\zeta = \max_s \vert \bar{A}^{\pi, \tilde{\pi}}(s) \vert$
- Without Control Variate:
- Local Approximation: $\tilde{J}^{\beta, \nu} (\pi, \tilde{\pi}) = J(\tilde{\pi}) + (1 - \nu)\mathbb{E} _{\rho^{\tilde{\pi}}, \pi}[\hat{A}^{\tilde{\pi}}] + \nu \mathbb{E} _{\rho^{\beta}}[\bar{A}_w^{\pi, \tilde{\pi}}]$
- Bias Bound: $\vert \vert J(\pi) - \tilde{J}^{\beta, \nu}(\pi, \tilde{\pi}) \vert \vert_1 \leq \frac{\nu \delta}{1-\gamma} + 2 \frac{\gamma}{(1 - \gamma)^2}(\nu\epsilon \sqrt{D _{KL}^{max}(\tilde{\pi}, \beta)} + \zeta \sqrt{D _{KL}^{max}(\tilde{\pi}, \pi)})$
- With Control Variate:
- Local Approximation: $\tilde{J}^{\beta, \nu, CV} (\pi, \tilde{\pi}) = J(\tilde{\pi}) + (1 - \nu)\mathbb{E} _{\rho^{\tilde{\pi}}, \pi}[\hat{A}^{\tilde{\pi}} - A_w^{\tilde{\pi}}] + \nu \mathbb{E} _{\rho^{\beta}}[\bar{A}_w^{\pi, \tilde{\pi}}]$
- Bias Bound: $\vert \vert J(\pi) - \tilde{J}^{\beta, \nu, CV}(\pi, \tilde{\pi}) \vert \vert_1 \leq \frac{\nu \delta}{1-\gamma} + 2 \frac{\gamma}{(1 - \gamma)^2}(\epsilon \sqrt{D _{KL}^{max}(\tilde{\pi}, \beta)} + \zeta \sqrt{D _{KL}^{max}(\tilde{\pi}, \pi)})$ (two special cases are sketched below)
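As a sanity check on the control-variate bound, here is my reading of how it specializes at the two extremes of $\nu$; this is inferred from the bound above, not an additional result quoted from the paper:

```latex
% nu = 0 with on-policy sampling (beta = pi-tilde): the nu*delta term and the
% off-policy KL term both vanish, leaving a TRPO-style local-approximation bound.
\left\lVert J(\pi) - \tilde{J}^{\tilde{\pi},\,\nu=0,\,CV}(\pi,\tilde{\pi}) \right\rVert_1
  \leq \frac{2\gamma\,\zeta}{(1-\gamma)^2}\,\sqrt{D_{KL}^{max}(\tilde{\pi}, \pi)}

% nu = 1: the full critic-error term delta/(1-gamma) is paid, plus the
% off-policy term; bias grows with critic inaccuracy and with how far the
% behavior policy beta drifts from pi-tilde.
\left\lVert J(\pi) - \tilde{J}^{\beta,\,\nu=1,\,CV}(\pi,\tilde{\pi}) \right\rVert_1
  \leq \frac{\delta}{1-\gamma}
     + \frac{2\gamma}{(1-\gamma)^2}\left(\epsilon\sqrt{D_{KL}^{max}(\tilde{\pi},\beta)}
     + \zeta\sqrt{D_{KL}^{max}(\tilde{\pi},\pi)}\right)
```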
-
Experiments
-
$\beta \neq \pi, \nu = 0$, with the control variate
- A variant of this method gives us monotonic convergence guarantees under certain conditions
- When using off-policy (random) sampling from the replay buffer, convergence is faster than with purely on-policy sampling
- Decorrelates the samples, allowing for more stable gradients
- Replay buffer random sampling provides improvement on top of Q-Prop
- These samples are not the same as DDPG / DQN samples
- They are samples from within a trust region update, allowing greater regularity but less exploration
-
$\beta = \pi, \nu = 1$
- On-policy sampling + on-policy version of deterministic actor-critic
- More similar to TRPO or Q-Prop than DDPG
- DDPG gets stuck while IPG monotonically improves
- Running IPG with Ornstein–Uhlenbeck exploration (what DDPG does) degrades performance
- Bias upper bounds become larger
- In the off-policy case, difference between $\pi, \beta$ was bounded by trust region which bounded bias
- in this case, the off-policy samples from exploration result in excessive bias
- More effective to use on-policy exploration with bounded policy updates than to design heuristic exploration rules
-
General Cases of Interpolated Policy Gradient
- In general, $\nu = 0.2$ consistently performed better than Q-Prop, TRPO, and the actor-critic variants
- $\nu = 0$ is Q-Prop or TRPO (depending on whether the control variate is used)
- $\nu = 1$ is a variant of actor-critic
- Best performing cases are ones that interpolate between actor-critic and policy-gradient