Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
-
Resources
-
Introduction
- On-Policy: Stable but data inefficient
- Off-Policy: Data efficient but unstable / hard to use
- Combining on-policy and off-policy
- Mix a ratio of on and off-policy gradients / update steps (ACER, PGQL)
- No theoretical bounds on error of off-policy updates
- Off-policy Q critic trained and used as control variate to reduce on-policy gradient variance (Q-Prop)
- Policy updates don’t use off-policy data
- Interpolated Policy Gradient: interpolates between on-policy and off-policy learning
- Biased method but bias is bounded
-
Preliminaries
-
On-Policy Likelihood Ratio Policy Gradient
- Likelihood Policy Gradient: $\nabla _{\theta} J(\theta) = \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) \hat{A}(s_t, a_t)]$
- Unbiased
- High variance
- Sample inefficient (a minimal sketch of the estimator follows below)
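A minimal sketch of this estimator, assuming a PyTorch policy network that returns Gaussian parameters and precomputed advantage estimates (all names here are illustrative, not from the paper):

```python
import torch
from torch.distributions import Normal

def likelihood_ratio_surrogate(policy_net, states, actions, advantages):
    """Surrogate loss whose gradient equals the on-policy likelihood-ratio
    estimate E[ grad log pi(a|s) * A_hat(s, a) ] over the sampled batch."""
    mean, log_std = policy_net(states)         # assumed: Gaussian policy head
    dist = Normal(mean, log_std.exp())
    log_prob = dist.log_prob(actions).sum(-1)  # sum over action dimensions
    # Negative sign so that minimizing the loss ascends the RL objective.
    return -(log_prob * advantages.detach()).mean()
```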
-
Off-Policy Deterministic Policy Gradient
- Value and Policy Weights:
- Value: $w \leftarrow \arg\min_w \mathbb{E} _{\beta}[(Q_w(s_t, a_t) - y_t)^2]$
- Policy: $\theta \leftarrow \arg\max_\theta \mathbb{E} _{\beta}[Q_w(s_t, \mu _\theta(s_t))]$ (sketched below)
- $\beta$: Behavior policy (i.e., exploration buffer)
- Deterministic Policy Gradient: $\nabla _{\theta} J(\theta) \approx \mathbb{E} _{\rho^\beta}[\nabla _{\theta}Q_w(s_t, \mu _{\theta}(s_t))]$
- Biased, due both to an imperfect critic $Q_w$ and to off-policy sampling from $\beta$; the bias is unbounded
- Causes off-policy to be less stable
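A sketch of the two fitting steps above, assuming PyTorch networks `q_net` (the critic $Q_w$) and `mu_net` (the deterministic actor $\mu _\theta$) plus a replay-buffer batch; target networks for computing $y_t$ are omitted for brevity:

```python
import torch.nn.functional as F

def critic_loss(q_net, states, actions, targets):
    """Fit Q_w by regressing onto bootstrapped targets y_t from the buffer."""
    q = q_net(states, actions).squeeze(-1)
    return F.mse_loss(q, targets)

def actor_loss(q_net, mu_net, states):
    """Deterministic policy gradient: ascend Q_w(s, mu_theta(s)) on buffer states."""
    return -q_net(states, mu_net(states)).mean()
```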
-
Off-Policy Control Variate Fitting
- Baselines can be used to reduce the variance of a Monte Carlo estimator
- Q-Prop used a first-order Taylor expansion of $Q_w$ as the control variate
- Gradient: $\nabla _{\theta} J(\theta) = \mathbb{E} _{\rho^\pi, \pi}[\nabla _{\theta} \log \pi _{\theta}(a_t \vert s_t)(\hat{Q}(s_t, a_t) - \tilde{Q}_w (s_t, a_t))] + \mathbb{E} _{\rho^\pi}[\nabla _{\theta}Q_w(s_t, \mu _{\theta}(s_t))]$
- Combines likelihood ratio + deterministic policy gradient
- Lower variance + is stable
- Uses only on-policy samples for estimating policy gradient
-
Interpolated Policy Gradient
- Mixes likelihood gradient with deterministic gradient
- $\nabla _\theta J(\theta) \approx (1-\nu) \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) \hat{A}(s_t, a_t)] + \nu \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- Biased from off-policy sampling + inaccuracies in $Q_w$
- $\bar{Q}_w^\pi(s_t) = \mathbb{E} _{\pi}[Q_w(s_t, \cdot)]$: the critic averaged over actions from $\pi$, i.e., an estimate of the state-value function (sketched below)
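The off-policy term needs $\bar{Q}_w^\pi(s) = \mathbb{E} _{a \sim \pi}[Q_w(s, a)]$, which can be estimated with a few reparameterized action samples so the gradient flows through $\theta$ (a hedged sketch, same assumed network interfaces as above):

```python
import torch
from torch.distributions import Normal

def q_bar(q_net, policy_net, states, n_samples=4):
    """Monte Carlo estimate of Q_bar_w^pi(s) = E_{a~pi}[Q_w(s, a)];
    rsample() keeps the sampled actions differentiable w.r.t. the policy."""
    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    samples = [q_net(states, dist.rsample()).squeeze(-1) for _ in range(n_samples)]
    return torch.stack(samples, dim=0).mean(dim=0)
```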
-
Control Variates for Interpolated Policy Gradient
- $\nabla _{\theta}J(\theta) \approx (1-\nu) \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) \hat{A}(s_t, a_t)] + \nu \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- Biased approximation from IPG
- $= (1-\nu) \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) (\hat{A}(s_t, a_t) - A_w^\pi(s_t, a_t))] + (1 - \nu)\mathbb{E} _{\rho^\pi}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)] + \nu \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- $\approx (1-\nu) \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) (\hat{A}(s_t, a_t) - A_w^\pi(s_t, a_t))] + \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- Replaces $\rho^\pi$ in control variate correction term with $\rho^\beta$
- Adds bias if $\beta \neq \pi$
- Likelihood-ratio gradients are weighted by the residual signal $\hat{A}(s_t, a_t) - A_w^\pi(s_t, a_t)$
- The difference between the empirical on-policy advantage estimate and the critic-based advantage estimate
-
Relationship to Prior Policy Gradient and Actor-Critic Methods
- REINFORCE: $\nu = 0$, no control variate
- Q-Prop: $\beta = \pi, \nu = 0$, with control variate
- DDPG: $\nu = 1$, $\beta \neq \pi$
- PGQL: $\beta \neq \pi$, no control variate
- Algorithm (a code sketch follows after this list):
- Initialize $Q_w$, stochastic policy $\pi _{\theta}$, replay buffer
- Repeat until convergence:
- Rollout $\pi _{\theta}$ to collect batch of episodes
- Fit $Q_w$ using replay buffer and $\pi$
- Fit state-value function using the batch of episodes (E) with timesteps (T)
- Compute advantage estimate using batch data + state-value function
- If using control variate:
- Compute critic advantage estimate ($\bar{A} _{t,e}$) using batch data, $Q_w$, and $\pi _{\theta}$
- Compute and center learning signals: $l _{t,e} = \hat{A} _{t,e} - \bar{A} _{t,e}$; set $b=1$
- Else:
- Center learning signals: $l _{t,e} = \hat{A} _{t,e}$; set $b=\nu$
- Multiply $l _{t,e}$ by $(1 - \nu)$
- Sample data (M samples) from replay buffer or batch data (depending on behavior policy)
- Compute gradient $\nabla _{\theta} J(\theta) \approx \frac{1}{ET} \sum_e \sum_t \nabla _\theta \log \pi _{\theta}(a _{t,e} \vert s _{t,e})l _{t,e} + \frac{b}{M} \sum_m \nabla _{\theta}\bar{Q}_w^\pi(s_m)$
- Update policy using gradient
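A hedged sketch of one IPG update following the loop above, reusing the hypothetical `q_bar` helper from earlier; this is my reconstruction of the algorithm, not the authors' code:

```python
import torch
from torch.distributions import Normal

def ipg_surrogate(policy_net, q_net, batch, buffer_states, nu=0.2, use_cv=True):
    """Interpolated policy-gradient surrogate:
    (1 - nu) * likelihood-ratio term on the on-policy batch
      + b    * critic term E[ grad Q_bar_w^pi(s) ] on (possibly off-policy) states."""
    states, actions, adv_hat = batch["states"], batch["actions"], batch["adv"]

    if use_cv:
        # Critic-based advantage A_w^pi(s, a) = Q_w(s, a) - Q_bar_w^pi(s)
        adv_critic = q_net(states, actions).squeeze(-1) - q_bar(q_net, policy_net, states)
        learning_signal = adv_hat - adv_critic.detach()
        b = 1.0
    else:
        learning_signal = adv_hat
        b = nu
    learning_signal = learning_signal - learning_signal.mean()  # center

    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    log_prob = dist.log_prob(actions).sum(-1)
    lr_term = (log_prob * learning_signal).mean()

    # Critic term over states sampled from the replay buffer (or the batch).
    critic_term = q_bar(q_net, policy_net, buffer_states).mean()

    # Negated so that a standard optimizer step ascends the IPG objective.
    return -((1.0 - nu) * lr_term + b * critic_term)
```

In the paper this gradient is paired with a trust-region (TRPO-style) policy step; a plain optimizer step is implied here only to keep the sketch short.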
-
$\nu = 1$: Actor-Critic methods
- Policy can be deterministic
- Learning can be done completely off-policy
- Bias from off-policy sampling increases as total variation or KL divergence between $\beta$ and $\pi$ increases
- Actor-Critic with on-policy exploration could be more reliable
-
Theoretical Analysis
-
$\beta \neq \pi, \nu = 0$: Policy Gradient with Control Variate and Off-Policy Sampling
- Plugging $\beta \neq \pi$, $\nu = 0$ into the gradient approximation gives an equation similar to Q-Prop
- $\approx \mathbb{E} _{\rho^\pi, \pi}[\nabla _\theta \log \pi _{\theta}(a_t \vert s_t) (\hat{A}(s_t, a_t) - A_w^\pi(s_t, a_t))] + \mathbb{E} _{\rho^\beta}[\nabla _{\theta}\bar{Q}_w^\pi(s_t)]$
- Main difference: allows utilizing off-policy data for updating policy too
- Let $\tilde{J}(\pi, \tilde{\pi})$ be a local approximation to $J(\pi)$
- $J(\pi) = J(\tilde{\pi}) + \mathbb{E} _{\rho^\pi, \pi}[A^{\tilde{\pi}}(s_t, a_t)] \approx J(\tilde{\pi}) + \mathbb{E} _{\rho^{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] = \tilde{J}(\pi, \tilde{\pi})$
- $\tilde{\pi}$: Usually policy at iteration k
- $\pi$: Usually policy at iteration k+1
- Approximate objective: $\tilde{J}^{\beta, \nu = 0, CV} (\pi, \tilde{\pi}) = J(\tilde{\pi}) + \mathbb{E} _{\rho^{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t) - A_w^{\tilde{\pi}}(s_t, a_t)] + \mathbb{E} _{\rho^\beta}[\bar{A}_w^{\pi, \tilde{\pi}}(s_t)] \approx \tilde{J}(\pi, \tilde{\pi})$
- $\bar{A}_w^{\pi, \tilde{\pi}}(s_t) = \mathbb{E} _{\pi}[Q_w(s_t, \cdot)] - \mathbb{E} _{\tilde{\pi}}[Q_w(s_t, \cdot)]$
- The bias of this approximation can be bounded using KL divergences between the policies: $\vert \vert J(\pi) - \tilde{J}^{\beta, \nu = 0, CV}(\pi, \tilde{\pi}) \vert \vert_1 \leq 2 \frac{\gamma}{(1 - \gamma)^2}(\epsilon \sqrt{D _{KL}^{max}(\tilde{\pi}, \beta)} + \zeta \sqrt{D _{KL}^{max}(\tilde{\pi}, \pi)})$, where $\epsilon = \max_s \vert \bar{A}_w^{\pi, \tilde{\pi}}(s) \vert$ and $\zeta = \max_s \vert \bar{A}^{\pi, \tilde{\pi}}(s) \vert$
- First term bounds bias from off-policy sampling using KL between $\tilde{\pi}$ and $\beta$
- Second term shows $\tilde{J}^{\beta, \nu = 0, CV}$ is only a local approximation, valid when $\pi$ stays close to $\tilde{\pi}$
- Works well with policy gradient methods that constrain KL divergence (TRPO, NPG, etc.)
-
Monotonic Policy Improvement Guarantee
- IPG does in fact have theoretical guarantees on policy improvement
- Impractical to implement
- IPG with trust region updates approximates this monotonicity
-
General Bounds on the Interpolated Policy Gradient
- Let $\delta = \max _{s,a} \vert A^{\tilde{\pi}}(s,a) - A_w^{\tilde{\pi}}(s,a)\vert$, $\epsilon = \max_s \vert \bar{A}_w^{\pi, \tilde{\pi}}(s) \vert$, $\zeta = \max_s \vert \bar{A}^{\pi, \tilde{\pi}}(s) \vert$
- Without Control Variate:
- Local Approximation: $\tilde{J}^{\beta, \nu} (\pi, \tilde{\pi}) = J(\tilde{\pi}) + (1 - \nu)\mathbb{E} _{\rho^{\tilde{\pi}}, \pi}[\hat{A}^{\tilde{\pi}}] + \nu \mathbb{E} _{\rho^{\beta}}[\bar{A}_w^{\pi, \tilde{\pi}}]$
- Bias Bound: $\vert \vert J(\pi) - \tilde{J}^{\beta, \nu}(\pi, \tilde{\pi}) \vert \vert_1 \leq \frac{\nu \delta}{1-\gamma} + 2 \frac{\gamma}{(1 - \gamma)^2}(\nu\epsilon \sqrt{D _{KL}^{max}(\tilde{\pi}, \beta)} + \zeta \sqrt{D _{KL}^{max}(\tilde{\pi}, \pi)})$
- With Control Variate:
- Local Approximation: $\tilde{J}^{\beta, \nu, CV} (\pi, \tilde{\pi}) = J(\tilde{\pi}) + (1 - \nu)\mathbb{E} _{\rho^{\tilde{\pi}}, \pi}[\hat{A}^{\tilde{\pi}} - A_w^{\tilde{\pi}}] + \nu \mathbb{E} _{\rho^{\beta}}[\bar{A}_w^{\pi, \tilde{\pi}}]$
- Bias Bound: $\vert \vert J(\pi) - \tilde{J}^{\beta, \nu, CV}(\pi, \tilde{\pi}) \vert \vert_1 \leq \frac{\nu \delta}{1-\gamma} + 2 \frac{\gamma}{(1 - \gamma)^2}(\epsilon \sqrt{D _{KL}^{max}(\tilde{\pi}, \beta)} + \zeta \sqrt{D _{KL}^{max}(\tilde{\pi}, \pi)})$ (two special cases are sketched below)
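As a sanity check on the control-variate bound, here is my reading of how it specializes at the two extremes of $\nu$; this is inferred from the bound above, not an additional result quoted from the paper:

```latex
% nu = 0 with on-policy sampling (beta = pi-tilde): the nu*delta term and the
% off-policy KL term both vanish, leaving a TRPO-style local-approximation bound.
\left\lVert J(\pi) - \tilde{J}^{\tilde{\pi},\,\nu=0,\,CV}(\pi,\tilde{\pi}) \right\rVert_1
  \leq \frac{2\gamma\,\zeta}{(1-\gamma)^2}\,\sqrt{D_{KL}^{max}(\tilde{\pi}, \pi)}

% nu = 1: the full critic-error term delta/(1-gamma) is paid, plus the
% off-policy term; bias grows with critic inaccuracy and with how far the
% behavior policy beta drifts from pi-tilde.
\left\lVert J(\pi) - \tilde{J}^{\beta,\,\nu=1,\,CV}(\pi,\tilde{\pi}) \right\rVert_1
  \leq \frac{\delta}{1-\gamma}
     + \frac{2\gamma}{(1-\gamma)^2}\left(\epsilon\sqrt{D_{KL}^{max}(\tilde{\pi},\beta)}
     + \zeta\sqrt{D_{KL}^{max}(\tilde{\pi},\pi)}\right)
```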
-
Experiments
-
$\beta \neq \pi, \nu = 0$, with the control variate
- A variant of this method gives us monotonic convergence guarantees under certain conditions
- When using off-policy (random) sampling from the replay buffer, convergence is faster than with purely on-policy sampling
- Decorrelates the samples, allowing for more stable gradients
- Replay buffer random sampling provides improvement on top of Q-Prop
- These samples are not the same as DDPG / DQN samples
- They are samples from within a trust region update, allowing greater regularity but less exploration
-
$\beta = \pi, \nu = 1$
- On-policy sampling + on-policy version of deterministic actor-critic
- More similar to TRPO or Q-Prop than DDPG
- DDPG gets stuck while IPG monotonically improves
- Running IPG with Ornstein–Uhlenbeck exploration (what DDPG does) degrades performance
- Bias upper bounds become larger
- In the off-policy case, difference between $\pi, \beta$ was bounded by trust region which bounded bias
- in this case, the off-policy samples from exploration result in excessive bias
- More effective to use on-policy exploration with bounded policy updates than to design heuristic exploration rules
-
General Cases of Interpolated Policy Gradient
- In general, $\nu = 0.2$ consistently performed better than Q-Prop, TRPO, and the actor-critic variants
- $\nu = 0$ is Q-Prop or TRPO (depending on whether the control variate is used)
- $\nu = 1$ is a variant of actor-critic
- Best performing cases are ones that interpolate between actor-critic and policy-gradient