Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Resources
Introduction
Problems with using deep neural nets:
Hyperparameter sensitivity causes unstable / non-convergent learning
High sample complexity
Monte carlo policy gradient gives unbiased but high variance estimates of gradient
Can constrain policy change
Can mix value-based back ups
Still require high number of samples
Problem with policy gradient methods is that they can only use on-policy samples
Need to collect new samples after each parameter update
Off-policy Q learning and actor critic can use all samples
more sample efficient
Convergence is not guaranteed with non-linear function approximators
Need extensive hyperparameter tuning
Q-Prop: combines advantages of on-policy policy gradient with efficiency of off-policy learning
Reduces variance of gradient estimates without adding bias
Learns action-value off-policy
First-order Taylor expansion of the critic as a control variate
Monte carlo policy gradient term with residuals in advantage approximation
Uses an off-policy critic to reduce variance and on-policy Monte Carlo returns to correct for the bias in the critic gradient
Background
Assume standard RL setting
Combines strengths of Monte Carlo policy gradient methods (e.g., REINFORCE, TRPO) and policy gradient with function approximation (i.e., actor-critic)
Monte Carlo Policy Gradient Methods
Use the vanilla policy gradient with a baseline (REINFORCE): $\nabla _{\theta}J(\theta) = \mathbb{E} _{s_t \sim \rho _\pi (\cdot), a_t \sim \pi(\cdot \vert s_t)}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(R_t - b(s_t))]$ (a minimal sketch follows this list)
Use value function as baseline: $V _\pi(s_t) = \mathbb{E}[R_t] = \mathbb{E} _{\pi _\theta(a_t \vert s_t)}[Q _\pi(s_t, a_t)]$, so $R_t - b(s_t)$ estimates the advantage $A _\pi(s_t, a_t)$
We can use off-policy data via importance sampling in the policy gradient to reduce sample complexity
Difficult to scale to high dimensions because of degenerating importance weights
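A minimal numpy sketch of the REINFORCE-with-baseline estimator above, assuming a 1-D Gaussian policy with linear mean $\mu _\theta(s) = \theta s$ and fixed standard deviation; the toy returns and the constant baseline are illustrative assumptions, not from the paper.

```python
import numpy as np

def reinforce_with_baseline_grad(states, actions, returns, baselines, theta, std=1.0):
    """Sample estimate of E[grad_theta log pi(a|s) * (R_t - b(s_t))]."""
    mu = theta * states                                   # policy mean mu_theta(s) = theta * s
    grad_log_pi = (actions - mu) / std**2 * states        # grad_theta log N(a | mu, std^2)
    return np.mean(grad_log_pi * (returns - baselines))   # average over the on-policy batch

# Usage on one toy on-policy batch, with the mean return as a crude baseline.
rng = np.random.default_rng(0)
theta = 0.5
s = rng.normal(size=128)
a = theta * s + rng.normal(size=128)                      # a ~ pi_theta(. | s)
R = -(a - 1.0) ** 2                                       # toy Monte Carlo returns
g = reinforce_with_baseline_grad(s, a, R, baselines=R.mean(), theta=theta)
```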
Policy Gradient With Function Approximation
Actor-critic methods use a policy evaluation step with TD learning and policy improvement step
More sample efficient because we use experience replay
Biased gradient (sketched below, after this list): $\nabla _\theta J(\theta) \approx \mathbb{E} _{s_t \sim \rho _{\beta} (\cdot)}[\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta(s_t)]$
Does not rely on high variance REINFORCE gradients
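A minimal sketch of this off-policy gradient term, assuming a toy quadratic critic $Q_w(s, a) = -(a - ws)^2$ and a linear deterministic actor $\mu _\theta(s) = \theta s$; both forms are assumptions for illustration, not the paper's networks.

```python
import numpy as np

def off_policy_actor_grad(states, theta, w):
    """Sample estimate of E_{s ~ rho_beta}[ dQ_w/da |_{a=mu_theta(s)} * dmu_theta/dtheta ]."""
    mu = theta * states                    # deterministic action mu_theta(s) = theta * s
    dq_da = -2.0 * (mu - w * states)       # dQ_w/da for Q_w(s, a) = -(a - w*s)^2, at a = mu
    dmu_dtheta = states                    # dmu_theta/dtheta
    return np.mean(dq_da * dmu_dtheta)     # states may come from a replay buffer (off-policy)

# Usage: states sampled from a replay buffer distribution rho_beta.
rng = np.random.default_rng(1)
replay_states = rng.normal(size=256)
g = off_policy_actor_grad(replay_states, theta=0.2, w=1.0)
```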
Q-Prop
Unbiased + high variance estimator = monte carlo policy gradient
Deterministic + biased estimator as control variate for monte carlo policy gradient = policy gradient with function approximation
Q-Prop Estimator
Start with the first-order Taylor expansion of an arbitrary function as a control variate for the policy gradient estimator: $\bar{f}(s_t, a_t) = f(s_t, \bar{a_t}) + \nabla_a f(s_t, a) \vert _{a = \bar{a_t}}(a_t - \bar{a_t})$
Denote monte carlo returns from state and action as $\hat{Q}(s_t, a_t)$
Using $f = Q_w$ and $\mu _\theta(s_t) = \mathbb{E} _{\pi _{\theta}(a_t \vert s_t)}[a_t]$ (the expected action of the stochastic policy), we get the Q-Prop gradient estimator:
$\nabla _\theta J(\theta) = \mathbb{E} _{\rho _\pi, \pi}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(\hat{Q}(s_t, a_t) - \bar{Q} _w(s_t, a_t))] + \mathbb{E} _{\rho _\pi}[\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta (s_t)]$
Using advantages instead of Q values, we can rewrite this estimator:
$\nabla _\theta J(\theta) = \mathbb{E} _{\rho _\pi, \pi}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(\hat{A}(s_t, a_t) - \bar{A} _w(s_t, a_t))] + \mathbb{E} _{\rho _\pi}[\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta (s_t)]$
Advantage Taylor approximation: $\bar{A} _w(s_t, a_t) = \nabla_a Q_w (s_t, a) \vert _{a = \mu _\theta(s_t)}(a_t - \mu _\theta(s_t))$
Two main components to the estimator (a per-sample sketch follows this list):
Analytic gradient from critic: $\mathbb{E} _{\rho _\pi}[\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta (s_t)]$
Residual gradient from REINFORCE: $ \mathbb{E} _{\rho _\pi, \pi}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(\hat{A}(s_t, a_t) - \bar{A} _w(s_t, a_t))]$
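A minimal per-sample sketch of these two components, reusing the toy quadratic critic $Q_w(s, a) = -(a - ws)^2$ and 1-D Gaussian policy assumed in the earlier sketches (illustrative, not the paper's function approximators).

```python
def q_prop_terms(s, a, A_hat, theta, w, std=1.0):
    """Residual REINFORCE term and analytic critic term for a single transition."""
    mu = theta * s                                  # expected action mu_theta(s)
    dq_da = -2.0 * (mu - w * s)                     # grad_a Q_w(s, a) at a = mu_theta(s)
    A_bar = dq_da * (a - mu)                        # Taylor-expansion advantage A_bar_w
    grad_log_pi = (a - mu) / std**2 * s             # grad_theta log pi_theta(a | s)
    residual_term = grad_log_pi * (A_hat - A_bar)   # Monte Carlo gradient on the residual
    analytic_term = dq_da * s                       # grad_a Q_w * grad_theta mu_theta(s)
    return residual_term, analytic_term

# Usage: combine the two terms into the Q-Prop gradient estimate for one sample.
res, ana = q_prop_terms(s=0.7, a=0.3, A_hat=0.5, theta=0.2, w=1.0)
grad_estimate = res + ana
```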
Q-Prop is effectively an actor-critic method where the critic is updated off-policy and the actor is updated on-policy
Includes a REINFORCE correction term so that it remains a Monte Carlo policy gradient
Allows you to combine on and off-policy methods
Control Variate Analysis and Adaptive Q-Prop
$\eta(s_t)$: weighting variable that modulates the strength of the control variate (doesn't introduce bias)
New estimator: $\nabla _\theta J(\theta) = \mathbb{E} _{\rho _\pi, \pi}[\nabla _\theta \log \pi _\theta (a_t \vert s_t)(\hat{A}(s_t, a_t) - \eta(s_t)\bar{A} _w(s_t, a_t))] + \mathbb{E} _{\rho _\pi}[\eta(s_t)\nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)} \nabla _\theta \mu _\theta (s_t)]$
Variance: $Var^* = \mathbb{E} _{\rho _\pi}[\sum_m Var _{a_t}(\nabla _{\theta_m} \log \pi _\theta(a_t \vert s_t)(\hat{A}(s_t, a_t) - \eta(s_t)\bar{A} _w(s_t, a_t)))]$
$m$: indexes the dimensions of $\theta$
We want $Var^* < Var$, where $Var = \mathbb{E} _{\rho _\pi}[\sum_m Var _{a_t}(\nabla _{\theta_m} \log \pi _\theta(a_t \vert s_t)\hat{A}(s_t, a_t))]$
Usually impractical to get multiple action samples from the same state, so these variances are hard to estimate directly
Use surrogate measure for variance: $Var = \mathbb{E} _{\rho _\pi}[Var _{a_t}(\hat{A}(s_t, a_t))]$
Surrogate for state-dependent baselines: $Var^* = \mathbb{E} _{\rho _\pi}[Var _{a_t}(\hat{A}(s_t, a_t) - \eta(s_t)\bar{A}(s_t, a_t))]$
$= Var + \mathbb{E} _{\rho _\pi}[-2\eta(s_t)Cov _{a_t}(\hat{A}(s_t, a_t), \bar{A}(s_t, a_t)) + \eta(s_t)^2 Var _{a_t}(\bar{A}(s_t, a_t))]$ (derived by expanding the variance of the difference)
$\mathbb{E} _\pi [\hat{A}(s_t, a_t)]= \mathbb{E} _\pi [\bar{A}(s_t, a_t)] = 0$
$Cov _{a_t}(\hat{A}, \bar{A}) = \mathbb{E} _\pi [\hat{A}(s_t, a_t)\bar{A}(s_t, a_t)]$
$Var _{a_t}(\bar{A}) = \mathbb{E} _\pi [\bar{A}(s_t, a_t)^2] = \nabla_a Q_w(s_t, a) \vert^T _{a = \mu _\theta(s_t)} \Sigma _\theta(s_t) \nabla_a Q_w(s_t, a) \vert _{a = \mu _\theta(s_t)}$
$\Sigma _\theta(s_t)$ is the covariance matrix of the policy $\pi _\theta$
$Cov _{a_t}(\hat{A}, \bar{A})$ can be estimated with a single action sample
Adaptive Q-Prop: the maximum reduction in variance occurs at $\eta^*(s_t) = Cov _{a_t}(\hat{A}, \bar{A}) / Var _{a_t}(\bar{A})$, obtained by minimizing the quadratic in $\eta(s_t)$ above (a per-state sketch follows below)
Simplified variance: $Var^* = \mathbb{E} _{\rho _\pi}[(1 - \rho _{corr}(\hat{A}, \bar{A})^2)Var _{a_t}(\hat{A})]$
$\rho _{corr}$ is the correlation coefficient
Guarantees variance reduction if $\bar{A}$ is correlated with $\hat{A}$ for any state
$Q_w$ doesn't necessarily need to approximate $Q _\pi$ well to get good results
Its Taylor expansion just needs to be correlated with $\hat{A}$
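A minimal sketch of the per-state adaptive weight $\eta^*(s_t)$, using the single-sample covariance estimate $\hat{A}\bar{A}$ and the Gaussian-policy identity $Var _{a_t}(\bar{A}) = \nabla_a Q_w^T \Sigma _\theta(s_t) \nabla_a Q_w$; the concrete inputs are assumed to come from some critic and policy and are illustrative only.

```python
import numpy as np

def adaptive_eta(dq_da, sigma, a, mu, A_hat, eps=1e-8):
    """eta*(s_t) = Cov(A_hat, A_bar) / Var(A_bar) for one state.

    dq_da: grad_a Q_w(s_t, a) at a = mu_theta(s_t), shape (d,)
    sigma: policy covariance Sigma_theta(s_t), shape (d, d)
    a, mu: sampled action and mean action, shape (d,)
    A_hat: Monte Carlo advantage estimate (scalar)
    """
    A_bar = dq_da @ (a - mu)          # Taylor-expansion advantage
    cov_hat = A_hat * A_bar           # single-sample estimate of Cov(A_hat, A_bar)
    var_bar = dq_da @ sigma @ dq_da   # exact Var_a(A_bar) under a Gaussian policy
    return cov_hat / (var_bar + eps)

# Usage for a 2-D action space.
eta = adaptive_eta(dq_da=np.array([0.4, -0.1]),
                   sigma=0.5 * np.eye(2),
                   a=np.array([0.2, 0.0]),
                   mu=np.array([0.1, 0.1]),
                   A_hat=0.3)
```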
Conservative and Aggressive Q-Prop:
Single sample estimate of $Cov(\hat{A}, \bar{A})$ has high variance
Conservative Q-Prop:
$\eta (s_t) = 1 \text{ if } \hat{Cov}(\hat{A}, \bar{A}) > 0$ else $\eta (s_t) = 0$
Disables control variate for some samples of states
Makes sense if $\hat{A}$ and $\bar{A}$ have negative correlation (critic is poor)
Aggressive Q-Prop: $\eta (s_t) = sign(\hat{Cov}(\hat{A}, \bar{A}))$
Makes more liberal use of the control variate (both rules are sketched below)
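A minimal sketch of the two $\eta(s_t)$ rules, given a single-sample covariance estimate like the one computed in the previous sketch (helper names are illustrative).

```python
import numpy as np

def conservative_eta(cov_hat):
    """Conservative Q-Prop: enable the control variate only when the estimated covariance is positive."""
    return 1.0 if cov_hat > 0 else 0.0

def aggressive_eta(cov_hat):
    """Aggressive Q-Prop: always use the control variate, flipping its sign with the covariance."""
    return float(np.sign(cov_hat))
```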
Q-Prop Algorithm
On each iteration
Rolls out stochastic policy to collect on-policy samples
Adds batch to replay buffer
Takes few gradient steps on critic
Computes $\hat{A}, \bar{A}$
Applies gradient step on $\pi _\theta$
Critic is trained with the same off-policy TD learning used in DDPG (i.e., from the replay buffer)
GAE is used to estimate $\hat{A}$
Policy update can be done with any method that uses first-order gradients and/or on-policy batch data
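A toy end-to-end sketch of one Q-Prop-style iteration following the five steps above, under heavy simplifying assumptions: a contextual-bandit-style 1-D task with no state transitions, a linear-Gaussian policy, a quadratic critic, a one-step regression in place of DDPG-style TD learning, and batch-centered rewards in place of GAE. Everything here is illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, std, w = 0.0, 1.0, 0.0        # policy mean parameter, fixed std, critic parameter
lr_pi, lr_q = 0.05, 0.01
replay = []                          # replay buffer of (s, a, r) tuples

def reward(s, a):
    """Toy 1-D reward peaking at a = 1 - s (illustrative assumption)."""
    return -(a - (1.0 - s)) ** 2

for iteration in range(50):
    # 1-2. Roll out the stochastic policy for an on-policy batch and add it to the replay buffer.
    s = rng.uniform(-1.0, 1.0, size=32)
    a = theta * s + std * rng.normal(size=32)
    r = reward(s, a)
    replay.extend(zip(s, a, r))

    # 3. A few critic regression steps on replay data (stand-in for off-policy TD learning),
    #    with critic Q_w(s, a) = -(a - w*s)^2 fit to the observed reward.
    for _ in range(20):
        s_b, a_b, r_b = replay[rng.integers(len(replay))]
        q = -(a_b - w * s_b) ** 2
        dq_dw = 2.0 * (a_b - w * s_b) * s_b
        w += lr_q * (r_b - q) * dq_dw

    # 4. Monte Carlo advantages A_hat (stand-in for GAE) and Taylor advantages A_bar.
    A_hat = r - r.mean()
    mu = theta * s
    dq_da = -2.0 * (mu - w * s)
    A_bar = dq_da * (a - mu)

    # 5. Q-Prop gradient step on the policy mean parameter theta.
    grad_log_pi = (a - mu) / std**2 * s
    grad = np.mean(grad_log_pi * (A_hat - A_bar)) + np.mean(dq_da * s)
    theta += lr_pi * grad
```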
Limitations
If data collection is fast, compute time is bound by critic training
If slow, there is sufficient time between updates to fit $Q_w$ well (can be done asynchronously)
Compute time will be about the same as TRPO
Conservative Q-Prop more robust to bad critics than standard Q-Prop or off-policy actor-critic
Difficult to know when off-policy critic is reliable (can use stable off-policy algorithms like Retrace($\lambda$))
Experiments
Adaptive Q-Prop
Conservative Q-Prop achieves more stable performance than aggressive or standard Q-Prop
All Q-Prop variants outperform TRPO in terms of sample efficiency
Evaluation Across Algorithms
Conservative Q-Prop outperforms TRPO and VPG
Conservative Q-Prop with VPG is comparable to TRPO
DDPG is very hyperparameter-sensitive, while Q-Prop shows comparatively monotonic learning behavior
Q-Prop can outperform DDPG in more complex domains
Evaluation Across Domains
Q-Prop is more sample efficient than TRPO on the humanoid domain
DDPG can't find a good solution
More stable RL algorithms let us avoid the extensive hyperparameter searches that unstable algorithms require