Combining Policy Gradient and Q-learning
Resources
Introduction
With action value techniques, we fit an action value function
SARSA is on-policy and fits the Q function of the current policy
Q learning is off-policy and directly finds optimal Q values
Policy gradients improve the policy by updating its parameters in the direction of better performance
Actor-critic methods are online and use an estimate of the action-value function as the critic (they can be off-policy too)
This paper derives a link between the Q values induced by a policy and the policy itself, which holds when the policy is a fixed point of the regularized policy gradient
This allows us to estimate Q values from the policy, which can then be refined with off-policy data
Reinforcement Learning
Assume standard reinforcement learning setting
For PGQL, policy maps state-action pair to probability of taking that action in that state
Action-Value Learning
Approximate Q values using function approximators
Update parameters so that the Q values are as close as possible to a fixed point of the Bellman equation
SARSA uses the standard Bellman operator
Q learning uses the optimal Bellman operator (maximum over actions)
The Bellman operator is approximated via sampling and bootstrapping in the online setting
Exploration happens via epsilon-greedy action selection
Alternatively, we can use Boltzmann exploration, which takes a softmax over Q values with a temperature: $\pi(s,a) = \frac{exp(Q(s,a)/T)}{\sum_b exp(Q(s,b)/T)}$
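A minimal NumPy sketch of Boltzmann exploration (the Q values and temperature below are made up):

```python
import numpy as np

def boltzmann_policy(q_values, temperature):
    """Softmax over Q values; lower temperature -> greedier action selection."""
    logits = q_values / temperature
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = np.array([1.0, 2.0, 0.5])               # hypothetical Q(s, .) for 3 actions
pi = boltzmann_policy(q, temperature=0.5)
action = np.random.choice(len(q), p=pi)
```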
Policy Gradient
Parameterizes policy directly and improves it via gradient ascent
In the online case of policy gradient, we usually add an entropy regularizer for exploration (otherwise the policy collapses to a deterministic one)
Regularized Policy Gradient Theorem
Tabular Case
Let:
$f(\theta) = \mathbb{E} _{s,a}[Q^\pi(s,a) \nabla _{\theta} \log \pi(s,a)] + \alpha \mathbb{E} _{s}[\nabla _{\theta}H(\pi_s)]$
$g_s(\pi) = \sum_a \pi(s,a)$
Fixed Point: When we can no longer update $\theta$ without violating a constraint
Constraint: $g_s(\pi) = 1$ for each state $s$; at a fixed point, $f(\theta)$ lies in the span of the vectors $\nabla _{\theta} g_s(\pi)$
Fixed point must satisfy $f(\theta) = \sum_s \lambda_s \nabla _{\theta}g_s(\pi)$ subject to $g_s(\pi) = 1$, where $\lambda_s$ is the Lagrange multiplier for state $s$
Equivalent to: $\mathbb{E} _{s,a}[(Q^\pi(s,a) - \alpha \log \pi(s,a) - c_s) \nabla _{\theta} \pi(s,a)] = 0$
$c_s$ absorbs all constants
In tabular case, $\nabla _{\theta} \pi(s,a) = 1$ (single number for each state-action pair)
So the fixed point must satisfy: $Q^\pi(s,a) - \alpha \log \pi(s,a) - c_s = 0$
Multiplying by $\pi(s,a)$ and summing over $a \in \mathcal{A}$ gives $\sum_a \pi(s,a) Q^\pi(s,a) + \alpha H^\pi(s) - c_s = 0$, so $c_s = \alpha H^\pi(s) + V^\pi(s)$
Final policy formulation: $\pi(s,a) = exp(A^\pi(s,a) / \alpha - H^\pi(s))$
Softmax over the advantage function
$\alpha$ can be used as a temperature
Estimate of Q values using the policy formulation: $\tilde{Q}^\pi(s,a) = \tilde{A}(s,a) + V^\pi(s) = \alpha(\log \pi(s,a) + H^\pi(s)) + V^\pi(s)$
Gradient: $\Delta \theta \propto \mathbb{E} _{s,a}[(Q^\pi(s,a) - \tilde{Q}^\pi(s,a)) \nabla _\theta \log \pi(s,a)]$
When policy parameterized by softmax, action preferences are Q values scaled by $1/\alpha$
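A quick numerical sanity check of this relationship for a single state (a sketch, with made-up Q values): setting $\pi = $ softmax$(Q/\alpha)$, $V = \sum_a \pi(a)Q(a)$, $A = Q - V$, and $H = -\sum_a \pi(a)\log\pi(a)$ recovers $\pi(a) = exp(A(a)/\alpha - H)$:

```python
import numpy as np

alpha = 0.1
Q = np.array([1.0, 1.5, 0.2])                       # hypothetical Q(s, .) for one state

pi = np.exp(Q / alpha) / np.exp(Q / alpha).sum()    # softmax over Q with temperature alpha
V = pi @ Q                                          # V(s) = sum_a pi(s,a) Q(s,a)
A = Q - V                                           # advantage
H = -(pi * np.log(pi)).sum()                        # entropy of pi(s, .)

assert np.allclose(pi, np.exp(A / alpha - H))       # pi(s,a) = exp(A(s,a)/alpha - H(s))
```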
General Case
Constrained optimization problem:
Minimize: $\mathbb{E} _{s,a}[(q(s,a) - \alpha \log \pi(s,a))^2]$
Subject to: $\sum_a \pi(s,a) = 1, s \in \mathcal{S}$
Find the $\theta$ that parameterizes $\pi$
Optimality Condition: $\mathbb{E} _{s,a}[(q(s,a) - \alpha \log \pi(s,a) + c_s)\nabla _\theta \log \pi(s,a)] = 0$
$c_s$ is the Lagrange multiplier for a state
If $q = Q^\pi$, we get same set of fixed points
Fixed points of regularized policy gradient = regression of log policy onto Q values
Softmax formulation of the policy only approximately holds
We go ahead and estimate Q values using this formulation anyway
Connection to Action-Value Methods
Actor-critic methods are just action-value fitting methods
In actor-critic:
Policy: $\pi^k(s,a) = exp(W^k(s,a) / \alpha) / \sum_b exp(W^k(s,b) / \alpha)$
$W^k$: action preferences at iteration k, parameterized by $\theta$
Gradient: $\nabla _\theta \log \pi^k (s,a) = (1 / \alpha)(\nabla _{\theta}W^k(s,a) - \sum_b \pi^k(s,b) \nabla _\theta W^k(s,b))$
Updates:
$\Delta \theta \propto \mathbb{E} _{s,a}[\delta _{ac}(\nabla _{\theta} W^k(s,a) - \sum_b \pi^k(s,b) \nabla _{\theta} W^k(s,b))]$
$\Delta w \propto \mathbb{E} _{s,a}[\delta _{ac} \nabla_w V^k(s)]$
$w$: the parameters of $V$
$\delta _{ac}$: critic minus baseline
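A quick finite-difference check of the log-policy gradient above, for a single state with tabular preferences (made-up numbers; in the tabular case $\nabla _{\theta} W^k(s,a)$ is the one-hot vector $e_a$, so the formula reduces to $(1/\alpha)(e_a - \pi^k(s,\cdot))$):

```python
import numpy as np

alpha = 0.5
W = np.array([0.4, -0.2, 0.1])          # tabular action preferences W(s, .) for one state

def softmax(w):
    return np.exp(w / alpha) / np.exp(w / alpha).sum()

a, eps = 1, 1e-6
pi = softmax(W)

# analytic gradient from the formula above: (1/alpha) * (e_a - pi)
analytic = (np.eye(len(W))[a] - pi) / alpha

# finite-difference gradient of log pi(s,a) with respect to each preference W(s,b)
numeric = np.array([
    (np.log(softmax(W + eps * np.eye(len(W))[b])[a]) - np.log(pi[a])) / eps
    for b in range(len(W))
])
assert np.allclose(analytic, numeric, atol=1e-4)
```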
In action-value methods
Q-Value Function: $Q(s,a) = Y^k(s,a) - \sum_b \mu(s,b) Y^k(s,b) + V^k(s)$
$\mu$: Probability distribution
$Y^k$: parameterized function (by $\theta^k$)
Exploration policy: Boltzmann distribution
$\pi^k(s,a) = exp(Y^k(s,a) / \alpha) / \sum_b exp(Y^k(s,b) / \alpha)$
Updates:
$\Delta \theta \propto \mathbb{E} _{s,a}[\delta _{av}(\nabla _{\theta} Y^k(s,a) - \sum_b \mu(s,b) \nabla _{\theta} Y^k(s,b))]$
$\Delta w \propto \mathbb{E} _{s,a}[\delta _{av} \nabla_w V^k(s)]$
$\delta _{av}$: action-value error term
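A minimal single-state sketch of this parameterization (made-up numbers; here $\mu$ is chosen to be the Boltzmann exploration policy, i.e. $\mu = \pi^k$):

```python
import numpy as np

alpha = 0.5
Y = np.array([0.4, -0.2, 0.1])                      # preferences Y(s, .) for one state
V = 2.0                                             # value estimate V(s)

mu = np.exp(Y / alpha) / np.exp(Y / alpha).sum()    # reference distribution, here mu = pi

Q = Y - mu @ Y + V                                  # Q(s,a) = Y(s,a) - sum_b mu(s,b) Y(s,b) + V(s)
```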
Equivalence between methods
Policies identical if $W^k = Y^k$
Updates identical if $\mu = \pi^k$
Slightly modified action-value method is equivalent to actor-critic policy gradient
Bellman Residual
We can show that $\vert \vert \mathcal{T}^\ast Q^{\pi _\alpha} - Q^{\pi _\alpha} \vert \vert \rightarrow 0$ as $\alpha \rightarrow 0$
$\mathcal{T}^\ast Q^{\pi _\alpha} \geq \mathcal{T}^{\pi _\alpha} Q^{\pi _\alpha} = Q^{\pi _\alpha}$
$\Rightarrow \mathcal{T}^\ast Q^{\pi _\alpha}(s,a) - Q^{\pi _\alpha}(s,a) \geq 0$
$= \mathcal{T}^\ast Q^{\pi _\alpha}(s,a) - \mathcal{T}^{\pi _{\alpha}} Q^{\pi _\alpha}(s,a) \geq 0$
$= \mathbb{E} _{s'}[\max_c Q^{\pi _\alpha}(s', c) - \sum_b \pi _{\alpha}(s', b)Q^{\pi _{\alpha}}(s',b)]$
$= \mathbb{E} _{s'}[\sum_b \pi _{\alpha}(s', b) (\max_c Q^{\pi _\alpha}(s', c) - Q^{\pi _{\alpha}}(s',b))]$
$\leq \mathbb{E} _{s'}[\sum_b exp((Q^{\pi _{\alpha}}(s', b) - Q^{\pi _\alpha}(s', b^\ast)) / \alpha) (\max_c Q^{\pi _\alpha}(s', c) - Q^{\pi _{\alpha}}(s',b))]$
$= \mathbb{E} _{s'}[\sum_b f _{\alpha} (\max_c Q^{\pi _\alpha}(s', c) - Q^{\pi _{\alpha}}(s',b))]$
$f _{\alpha}(x) = x exp(-x / \alpha)$
Since $f _\alpha (x) \leq \sup_x f _{\alpha}(x) = f _{\alpha}(\alpha) = \alpha e^{-1}$, we get:
$0 \leq \mathcal{T}^\ast Q^{\pi _\alpha}(s,a) - Q^{\pi _\alpha}(s, a) \leq \vert \mathcal{A}\vert \alpha e^{-1}$
This tells us the Bellman residual converges to 0 as $\alpha$ decreases
Also implies $\lim _{\alpha \rightarrow 0} Q^{\pi _\alpha} = Q^\ast$
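A quick numerical check of the final bound for a single next state (made-up Q values; $\pi_\alpha$ is the Boltzmann policy over $Q^{\pi_\alpha}$):

```python
import numpy as np

alpha = 0.05
Q = np.array([1.0, 0.8, 0.3, 0.1])                  # hypothetical Q^{pi_alpha}(s', .)
pi = np.exp(Q / alpha) / np.exp(Q / alpha).sum()    # Boltzmann policy at temperature alpha

gap = Q.max() - pi @ Q                              # max_c Q(s',c) - sum_b pi(s',b) Q(s',b)
bound = len(Q) * alpha * np.exp(-1)                 # |A| * alpha * e^{-1}
assert 0 <= gap <= bound
```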
PGQL
PGQL Update
Estimate of Q using the policy: $\tilde{Q}^\pi(s,a) = \alpha(\log \pi(s,a) + H^{\pi}(s)) + V(s)$
Update parameters to reduce the Bellman residual, similar to Q learning
$\Delta \theta \propto \mathbb{E} _{s,a}[(\mathcal{T}^\ast \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a))\nabla _{\theta} \log \pi(s,a)]$
$\Delta w \propto \mathbb{E} _{s,a}[(\mathcal{T}^\ast \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a))\nabla_w V(s)]$
Full PGQL update combines regularized policy gradient with Q learning update
$\Delta \theta \propto (1 - \eta) \mathbb{E} _{s,a}[(Q^\pi-\tilde{Q}^\pi)\nabla _{\theta}\log \pi ] +\eta \mathbb{E} _{s,a}[(\mathcal{T}^\ast \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a))\nabla _{\theta} \log \pi]$
$\Delta w \propto (1-\eta) \mathbb{E} _{s,a}[(Q^\pi-\tilde{Q}^\pi)\nabla_w V] + \eta\mathbb{E} _{s,a}[(\mathcal{T}^\ast \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a))\nabla_w V]$
$\eta$: Weighting parameter to control how much of each update to apply
$\eta = 0$: Entropy regularized policy gradient
$\eta = 1$: Q learning variant
First term in the updates encourages consistency with the critic
Second term in the updates encourages optimality over time (a rough sketch of the combined update follows the interpretations below)
Interpretations:
An actor-critic update where the critic is a weighted combination of a standard critic and an optimizing (max) critic
Update is a combination of expected SARSA and Q-learning
Q values parameterized as the sum of advantage and value functions
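A rough single-transition sketch of how the combined PGQL error could be formed (made-up numbers; `Q_pi_sa` stands in for a critic or return estimate of $Q^\pi(s,a)$, and the Bellman target uses one sampled next state):

```python
import numpy as np

alpha, eta, gamma = 0.1, 0.5, 0.99
r, a = 1.0, 0                                       # sampled reward and action taken

pi_s  = np.array([0.7, 0.2, 0.1])                   # pi(s, .)
pi_s2 = np.array([0.5, 0.3, 0.2])                   # pi(s', .)
V_s, V_s2 = 1.2, 0.9                                # value estimates V(s), V(s')

def q_tilde(pi, V):
    """Q estimate induced by the policy: alpha * (log pi + H(pi)) + V."""
    H = -(pi * np.log(pi)).sum()
    return alpha * (np.log(pi) + H) + V

Q_pi_sa = 1.4                                       # stand-in critic estimate of Q^pi(s,a)
q_sa    = q_tilde(pi_s, V_s)[a]
target  = r + gamma * q_tilde(pi_s2, V_s2).max()    # sampled optimal Bellman target

delta_pg = Q_pi_sa - q_sa                           # critic-consistency (policy gradient) error
delta_q  = target - q_sa                            # Bellman (Q-learning) error
delta    = (1 - eta) * delta_pg + eta * delta_q     # combined PGQL error

# Both terms share the same gradients, so the updates become
#   delta_theta  proportional to  delta * grad_theta log pi(s,a)
#   delta_w      proportional to  delta * grad_w V(s)
```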
Practical Implementation
We don’t have access to an exact critic, $Q^\pi$
Instead, agents interact with the environment and update the shared parameters of the actor-critic algorithm
Policy + critic update online
Maintains a replay buffer that is occasionally sampled to perform a step of Q learning on the policy
Allows critic to accumulate MC return over many time periods
Can prioritize important samples
Reduces temporal correlation
Acts as a regularizer that keeps the policy close to satisfying the Bellman equation
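A minimal sketch of the replay-buffer bookkeeping described here (transition contents are placeholders; the online actor-critic update and the Q-learning step are elided):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

replay = deque(maxlen=100_000)                      # bounded buffer; oldest transitions drop out
batch_size, q_learning_interval = 32, 4

for step in range(1_000):
    # placeholder transition; in practice this comes from acting in the environment
    replay.append(Transition(step, 0, 0.0, step + 1, False))

    # update_online(policy, value_fn, replay[-1])   # elided: online actor-critic update

    # occasionally sample a decorrelated batch and take a Q-learning step on it
    if step % q_learning_interval == 0 and len(replay) >= batch_size:
        batch = random.sample(replay, batch_size)
        # q_learning_step(policy, value_fn, batch)  # elided: Q-learning on the sampled batch
```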
Modified Fixed Point
The PGQL updates modify the fixed point of the algorithm
New Q value estimate: $\tilde{Q}^\pi = (1 - \eta)Q^\pi + \eta \mathcal{T}^\ast \tilde{Q}^\pi$
Numerical Experiments
Grid World
Tested against actor-critic (where an estimate of the value function is used as the critic) and against Q learning (where updates are performed on data from a replay buffer)
Deterministic environment
PGQL outperformed Actor-Critic and Q learning
Atari
Compared against A3C and an asynchronous version of deep Q learning
PGQL performed best in 34 games, A3C in 7, Q learning in 10, and two methods tied in 6 games
PGQL has the highest mean and median scores with human starts
PGQL was the worst performer in only 1 game
In the games where PGQL underperformed, it still had better data efficiency early on