Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Resources
Introduction
Problems with using deep neural nets:
Hyperparameter sensitivity causes unstable / non-convergent learning
High sample complexity
Monte carlo policy gradient gives unbiased but high variance estimates of gradient
Can constrain policy change
Can mix value-based back ups
Still require high number of samples
Problem with policy gradient methods is that they can only use on-policy samples
Need to collect new samples after each parameter update
Off-policy Q learning and actor critic can use all samples
more sample efficient
Convergence is not guaranteed with non-linear function approximators
Need extensive hyperparameter tuning
Q-Prop: combines advantages of on-policy policy gradient with efficiency of off-policy learning
Reduces variance of gradient estimates without adding bias
Learns action-value off-policy
First-order Taylor expansion of the critic as a control variate
Monte carlo policy gradient term with residuals in advantage approximation
Uses an off-policy critic to reduce variance and on-policy Monte Carlo returns to correct for the bias in the critic gradient
Background
Assume standard RL setting
Combines strengths of Monte Carlo policy gradient methods (e.g., REINFORCE, TRPO) and policy gradient with function approximation (i.e., actor-critic)
Monte Carlo Policy Gradient Methods
Use the vanilla policy gradient with a baseline (REINFORCE): $\nabla _{\theta}J(\theta) = \mathbb{E} _{s_t \sim \rho _\pi (\cdot), a_t \sim \pi(\cdot \vert s_t)}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(R_t - b(s_t))]$ (a minimal sketch follows this list)
Use value function as baseline: $V _\pi(s_t) = \mathbb{E}[R_t] = \mathbb{E} _{\pi _\theta(a_t \vert s_t)}[Q _\pi(s_t, a_t)]$, so $R_t - b(s_t)$ estimates the advantage $A _\pi(s_t, a_t)$
We can use off-policy data via importance sampling in the policy gradient to reduce sample complexity
Difficult to scale to high dimensions because of degenerating importance weights
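A minimal numpy sketch of the REINFORCE-with-baseline estimator above, assuming a 1-D Gaussian policy with linear mean $\mu _\theta(s) = \theta s$ and fixed standard deviation; the toy returns and the constant baseline are illustrative assumptions, not from the paper.

```python
import numpy as np

def reinforce_with_baseline_grad(states, actions, returns, baselines, theta, std=1.0):
    """Sample estimate of E[grad_theta log pi(a|s) * (R_t - b(s_t))]."""
    mu = theta * states                                   # policy mean mu_theta(s) = theta * s
    grad_log_pi = (actions - mu) / std**2 * states        # grad_theta log N(a | mu, std^2)
    return np.mean(grad_log_pi * (returns - baselines))   # average over the on-policy batch

# Usage on one toy on-policy batch, with the mean return as a crude baseline.
rng = np.random.default_rng(0)
theta = 0.5
s = rng.normal(size=128)
a = theta * s + rng.normal(size=128)                      # a ~ pi_theta(. | s)
R = -(a - 1.0) ** 2                                       # toy Monte Carlo returns
g = reinforce_with_baseline_grad(s, a, R, baselines=R.mean(), theta=theta)
```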
Policy Gradient With Function Approximation
Actor-critic methods use a policy evaluation step with TD learning and policy improvement step
More sample efficient because we use experience replay
Biased gradient (sketched below, after this list): $\nabla _\theta J(\theta) \approx \mathbb{E} _{s_t \sim \rho _{\beta} (\cdot)}[\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta(s_t)]$
Does not rely on high variance REINFORCE gradients
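A minimal sketch of this off-policy gradient term, assuming a toy quadratic critic $Q_w(s, a) = -(a - ws)^2$ and a linear deterministic actor $\mu _\theta(s) = \theta s$; both forms are assumptions for illustration, not the paper's networks.

```python
import numpy as np

def off_policy_actor_grad(states, theta, w):
    """Sample estimate of E_{s ~ rho_beta}[ dQ_w/da |_{a=mu_theta(s)} * dmu_theta/dtheta ]."""
    mu = theta * states                    # deterministic action mu_theta(s) = theta * s
    dq_da = -2.0 * (mu - w * states)       # dQ_w/da for Q_w(s, a) = -(a - w*s)^2, at a = mu
    dmu_dtheta = states                    # dmu_theta/dtheta
    return np.mean(dq_da * dmu_dtheta)     # states may come from a replay buffer (off-policy)

# Usage: states sampled from a replay buffer distribution rho_beta.
rng = np.random.default_rng(1)
replay_states = rng.normal(size=256)
g = off_policy_actor_grad(replay_states, theta=0.2, w=1.0)
```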
Q-Prop
Unbiased + high variance estimator = monte carlo policy gradient
Deterministic + biased estimator as control variate for monte carlo policy gradient = policy gradient with function approximation
Q-Prop Estimator
Start with the first-order Taylor expansion of an arbitrary function as a control variate for the policy gradient estimator: $\bar{f}(s_t, a_t) = f(s_t, \bar{a_t}) + \nabla_a f(s_t, a) \vert _{a = \bar{a_t}}(a_t - \bar{a_t})$
Denote monte carlo returns from state and action as $\hat{Q}(s_t, a_t)$
Using $f = Q_w$ and $\mu _\theta(s_t) = \mathbb{E} _{\pi _{\theta}(a_t \vert s_t)}[a_t]$ (the expected action of the stochastic policy), we get the Q-Prop gradient estimator:
$\nabla _\theta J(\theta) = \mathbb{E} _{\rho _\pi, \pi}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(\hat{Q}(s_t, a_t) - \bar{Q} _w(s_t, a_t))] + \mathbb{E} _{\rho _\pi}[\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta (s_t)]$
Using advantages instead of Q values, we can rewrite this estimator:
$\nabla _\theta J(\theta) = \mathbb{E} _{\rho _\pi, \pi}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(\hat{A}(s_t, a_t) - \bar{A} _w(s_t, a_t))] + \mathbb{E} _{\rho _\pi}[\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta (s_t)]$
Advantage Taylor approximation: $\bar{A} _w(s_t, a_t) = \nabla_a Q_w (s_t, a) \vert _{a = \mu _\theta(s_t)}(a_t - \mu _\theta(s_t))$
Two main components to the estimator (a per-sample sketch follows this list):
Analytic gradient from critic: $\mathbb{E} _{\rho _\pi}[\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta (s_t)]$
Residual gradient from REINFORCE: $ \mathbb{E} _{\rho _\pi, \pi}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(\hat{A}(s_t, a_t) - \bar{A} _w(s_t, a_t))]$
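A minimal per-sample sketch of these two components, reusing the toy quadratic critic $Q_w(s, a) = -(a - ws)^2$ and 1-D Gaussian policy assumed in the earlier sketches (illustrative, not the paper's function approximators).

```python
def q_prop_terms(s, a, A_hat, theta, w, std=1.0):
    """Residual REINFORCE term and analytic critic term for a single transition."""
    mu = theta * s                                  # expected action mu_theta(s)
    dq_da = -2.0 * (mu - w * s)                     # grad_a Q_w(s, a) at a = mu_theta(s)
    A_bar = dq_da * (a - mu)                        # Taylor-expansion advantage A_bar_w
    grad_log_pi = (a - mu) / std**2 * s             # grad_theta log pi_theta(a | s)
    residual_term = grad_log_pi * (A_hat - A_bar)   # Monte Carlo gradient on the residual
    analytic_term = dq_da * s                       # grad_a Q_w * grad_theta mu_theta(s)
    return residual_term, analytic_term

# Usage: combine the two terms into the Q-Prop gradient estimate for one sample.
res, ana = q_prop_terms(s=0.7, a=0.3, A_hat=0.5, theta=0.2, w=1.0)
grad_estimate = res + ana
```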
Q-Prop is effectively an actor-critic method where the critic is updated off-policy and the actor is updated on-policy
Includes a REINFORCE correction term so that it remains a Monte Carlo policy gradient
Allows you to combine on and off-policy methods
Control Variate Analysis and Adaptive Q-Prop
$\eta(s_t)$: weighting variable that modulates the strength of the control variate (doesn't introduce bias)
New estimator: $\nabla _\theta J(\theta) = \mathbb{E} _{\rho _\pi, \pi}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(\hat{A}(s_t, a_t) - \eta(s_t)\bar{A} _w(s_t, a_t))] + \mathbb{E} _{\rho _\pi}[\eta(s_t)\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta (s_t)]$
Variance: $Var^* = \mathbb{E} _{\rho _\pi}[\sum_m Var _{a_t}(\nabla _{\theta_m} \log \pi _\theta(a_t \vert s_t)(\hat{A}(s_t, a_t) - \eta(s_t)\bar{A} _w(s_t, a_t)))]$
$m$: indexes the dimensions of $\theta$
We want $Var^* < Var$, where $Var = \mathbb{E} _{\rho _\pi}[\sum_m Var _{a_t}(\nabla _{\theta_m} \log \pi _\theta(a_t \vert s_t)\hat{A}(s_t, a_t))]$
Usually impractical to get multiple action samples from the same state, so these variances are hard to estimate directly
Use surrogate measure for variance: $Var = \mathbb{E} _{\rho _\pi}[Var _{a_t}(\hat{A}(s_t, a_t))]$
Surrogate for state-dependent baselines: $Var^* = \mathbb{E} _{\rho _\pi}[Var _{a_t}(\hat{A}(s_t, a_t) - \eta(s_t)\bar{A}(s_t, a_t))]$
$= Var + \mathbb{E} _{\rho _\pi}[-2\eta(s_t)Cov _{a_t}(\hat{A}(s_t, a_t), \bar{A}(s_t, a_t)) + \eta(s_t)^2 Var _{a_t}(\bar{A}(s_t, a_t))]$ (derived by expanding the variance of the difference)
$\mathbb{E} _\pi [\hat{A}(s_t, a_t)]= \mathbb{E} _\pi [\bar{A}(s_t, a_t)] = 0$
$Cov _{a_t}(\hat{A}, \bar{A}) = \mathbb{E} _\pi [\hat{A}(s_t, a_t)\bar{A}(s_t, a_t)]$
$Var _{a_t}(\bar{A}) = \mathbb{E} _\pi [\bar{A}(s_t, a_t)^2] = \nabla_a Q_w(s_t, a) \vert^T _{a = \mu _\theta(s_t)} \Sigma _\theta(s_t) \nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)}$
$\Sigma _\theta(s_t)$ is the covariance matrix of the policy $\pi _\theta$
$Cov _{a_t}(\hat{A}, \bar{A})$ can be estimated with a single action sample
Adaptive Q-Prop: the maximum reduction in variance occurs at $\eta^*(s_t) = Cov _{a_t}(\hat{A}, \bar{A}) / Var _{a_t}(\bar{A})$, obtained by minimizing the quadratic in $\eta(s_t)$ above (a per-state sketch follows below)
Simplified variance: $Var^* = \mathbb{E} _{\rho _\pi}[(1 - \rho _{corr}(\hat{A}, \bar{A})^2)Var _{a_t}(\hat{A})]$
$\rho _{corr}$ is the correlation coefficient
Guarantees variance reduction if $\bar{A}$ is correlated with $\hat{A}$ for any state
$Q_w$ doesn't necessarily need to approximate $Q _\pi$ well to get good results
Its Taylor expansion just needs to be correlated with $\hat{A}$
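A minimal sketch of the per-state adaptive weight $\eta^*(s_t)$, using the single-sample covariance estimate $\hat{A}\bar{A}$ and the Gaussian-policy identity $Var _{a_t}(\bar{A}) = \nabla_a Q_w^T \Sigma _\theta(s_t) \nabla_a Q_w$; the concrete inputs are assumed to come from some critic and policy and are illustrative only.

```python
import numpy as np

def adaptive_eta(dq_da, sigma, a, mu, A_hat, eps=1e-8):
    """eta*(s_t) = Cov(A_hat, A_bar) / Var(A_bar) for one state.

    dq_da: grad_a Q_w(s_t, a) at a = mu_theta(s_t), shape (d,)
    sigma: policy covariance Sigma_theta(s_t), shape (d, d)
    a, mu: sampled action and mean action, shape (d,)
    A_hat: Monte Carlo advantage estimate (scalar)
    """
    A_bar = dq_da @ (a - mu)          # Taylor-expansion advantage
    cov_hat = A_hat * A_bar           # single-sample estimate of Cov(A_hat, A_bar)
    var_bar = dq_da @ sigma @ dq_da   # exact Var_a(A_bar) under a Gaussian policy
    return cov_hat / (var_bar + eps)

# Usage for a 2-D action space.
eta = adaptive_eta(dq_da=np.array([0.4, -0.1]),
                   sigma=0.5 * np.eye(2),
                   a=np.array([0.2, 0.0]),
                   mu=np.array([0.1, 0.1]),
                   A_hat=0.3)
```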
Conservative and Aggressive Q-Prop:
Single sample estimate of $Cov(\hat{A}, \bar{A})$ has high variance
Conservative Q-Prop:
$\eta (s_t) = 1 \text{ if } \hat{Cov}(\hat{A}, \bar{A}) > 0$ else $\eta (s_t) = 0$
Disables control variate for some samples of states
Makes sense if $\hat{A}$ and $\bar{A}$ have negative correlation (critic is poor)
Aggressive Q-Prop: $\eta (s_t) = sign(\hat{Cov}(\hat{A}, \bar{A}))$
Makes more liberal use of the control variate (both rules are sketched below)
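A minimal sketch of the two $\eta(s_t)$ rules, given a single-sample covariance estimate like the one computed in the previous sketch (helper names are illustrative).

```python
import numpy as np

def conservative_eta(cov_hat):
    """Conservative Q-Prop: enable the control variate only when the estimated covariance is positive."""
    return 1.0 if cov_hat > 0 else 0.0

def aggressive_eta(cov_hat):
    """Aggressive Q-Prop: always use the control variate, flipping its sign with the covariance."""
    return float(np.sign(cov_hat))
```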
Q-Prop Algorithm
On each iteration
Rolls out stochastic policy to collect on-policy samples
Adds batch to replay buffer
Takes few gradient steps on critic
Computes $\hat{A}, \bar{A}$
Applies gradient step on $\pi _\theta$
Critic is trained with the same off-policy TD learning used in DDPG (i.e., from the replay buffer)
GAE is used to estimate $\hat{A}$
Policy update can be done with any method that uses first-order gradients and/or on-policy batch data
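A toy end-to-end sketch of one Q-Prop-style iteration following the five steps above, under heavy simplifying assumptions: a contextual-bandit-style 1-D task with no state transitions, a linear-Gaussian policy, a quadratic critic, a one-step regression in place of DDPG-style TD learning, and batch-centered rewards in place of GAE. Everything here is illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, std, w = 0.0, 1.0, 0.0        # policy mean parameter, fixed std, critic parameter
lr_pi, lr_q = 0.05, 0.01
replay = []                          # replay buffer of (s, a, r) tuples

def reward(s, a):
    """Toy 1-D reward peaking at a = 1 - s (illustrative assumption)."""
    return -(a - (1.0 - s)) ** 2

for iteration in range(50):
    # 1-2. Roll out the stochastic policy for an on-policy batch and add it to the replay buffer.
    s = rng.uniform(-1.0, 1.0, size=32)
    a = theta * s + std * rng.normal(size=32)
    r = reward(s, a)
    replay.extend(zip(s, a, r))

    # 3. A few critic regression steps on replay data (stand-in for off-policy TD learning),
    #    with critic Q_w(s, a) = -(a - w*s)^2 fit to the observed reward.
    for _ in range(20):
        s_b, a_b, r_b = replay[rng.integers(len(replay))]
        q = -(a_b - w * s_b) ** 2
        dq_dw = 2.0 * (a_b - w * s_b) * s_b
        w += lr_q * (r_b - q) * dq_dw

    # 4. Monte Carlo advantages A_hat (stand-in for GAE) and Taylor advantages A_bar.
    A_hat = r - r.mean()
    mu = theta * s
    dq_da = -2.0 * (mu - w * s)
    A_bar = dq_da * (a - mu)

    # 5. Q-Prop gradient step on the policy mean parameter theta.
    grad_log_pi = (a - mu) / std**2 * s
    grad = np.mean(grad_log_pi * (A_hat - A_bar)) + np.mean(dq_da * s)
    theta += lr_pi * grad
```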
Limitations
If data collection is fast, compute time is bound by critic training
If slow, there is sufficient time between updates to fit $Q_w$ well (can be done asynchronously)
Compute time will be about the same as TRPO
Conservative Q-Prop more robust to bad critics than standard Q-Prop or off-policy actor-critic
Difficult to know when off-policy critic is reliable (can use stable off-policy algorithms like Retrace($\lambda$))
Experiments
Adaptive Q-Prop
Conservative Q-Prop achieves more stable performance than aggressive or standard Q-Prop
All Q-Prop variants outperform TRPO in terms of sample efficiency
Evaluation Across Algorithms
Conservative Q-Prop outperforms TRPO and VPG
Conservative Q-Prop with VPG is comparable to TRPO
DDPG is very hyperparameter-sensitive, while Q-Prop shows comparatively monotonic learning behavior
Q-Prop can outperform DDPG in more complex domains
Evaluation Across Domains
Q-Prop is more sample efficient than TRPO on the humanoid domain
DDPG can't find a good solution
More stable RL algorithms let us avoid the extensive hyperparameter searches that unstable algorithms require