Bridging the Gap Between Value and Policy Based Reinforcement Learning
Resources
Introduction
Challenge: How do we combine the advantages of value-based and policy-based RL while mitigating their shortcomings?
Policy Based RL
Stable under function approximation (given a small learning rate)
Sample inefficient
High variance gradients
Actor-Critic Methods
Reduce variance at the cost of some bias
On-policy learning still very inefficient
Either need to use on-policy data
Or need to update slowly enough to avoid bias
Importance sampling corrections are not sufficient
Off-policy learning
Can learn from any trajectory sampled from same environment
More sample efficient
Requires extensive hyperparameter tuning (not stable otherwise)
Ideal: Combine the unbiasedness + stability of on-policy learning with the data efficiency of off-policy learning
Previous approaches exist but don't resolve the theoretical difficulty of off-policy learning with function approximation
Notation & Background
Assume a stochastic policy over finite actions: $\pi _\theta(a \vert s)$
Assume deterministic state dynamics (for simplicity)
Assume standard reinforcement learning setting
Hard-max Bellman temporal consistency: $V^\circ(s) = O _{ER}(s, \pi^\circ) = \max_a (r(s,a) + \gamma V^\circ(s'))$, where $O _{ER}$ is the expected discounted reward objective and $s'$ is the state reached by taking action $a$ in $s$
In terms of optimal action values: $Q^\circ(s,a) = r(s,a) + \gamma \max _{a'}Q^\circ(s',a')$
The $\circ$ denotes optimality
Optimal policy, $\pi^\circ$, becomes a one-hot vector
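As a concrete illustration, here is a minimal sketch of the hard-max backup on an invented toy deterministic MDP (the transition table, rewards, and iteration count below are arbitrary assumptions, not from the paper); iterating the backup converges to $V^\circ$, and the greedy policy it induces is one-hot.

```python
import numpy as np

# Illustrative toy MDP (not from the paper): 3 states, 2 actions,
# deterministic dynamics next_state[s, a] and rewards reward[s, a].
next_state = np.array([[1, 2],
                       [2, 0],
                       [2, 2]])          # state 2 is absorbing
reward = np.array([[1.0, 0.0],
                   [0.5, 0.0],
                   [0.0, 0.0]])
gamma = 0.9

# Hard-max Bellman backup: V(s) <- max_a [ r(s,a) + gamma * V(s') ]
V = np.zeros(3)
for _ in range(200):
    V = np.max(reward + gamma * V[next_state], axis=1)

# Optimal Q-values and the induced one-hot (greedy) policy
Q = reward + gamma * V[next_state]
print(V, np.argmax(Q, axis=1))
```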
Softmax Temporal Consistency
Softmax temporal consistency comes from augmenting the expected reward objective with a discounted entropy regularizer, encouraging exploration
$O _{ENT}(s, \pi) = O _{ER} (s, \pi) + \tau \mathbb{H}(s, \pi)$
$\tau$: User specified temperature to control degree of regularization
$\mathbb{H}(s, \pi) = \sum_a \pi(a \vert s)[- \log \pi(a \vert s) + \gamma \mathbb{H}(s', \pi)]$
$O _{ENT}(s, \pi) = \sum_a \pi(a \vert s)[r(s,a) - \tau \log \pi(a \vert s) + \gamma O _{ENT}(s', \pi)]$
Soft value function: $V^* (s) = \max_\pi O _{ENT}(s, \pi)$
Let $\pi^*(a \vert s)$ be the optimal policy that maximizes $O _{ENT}$
This is no longer a one-hot vector because of the entropy term
Policy takes the form of a Boltzmann distribution: $\pi^*(a \vert s) \propto \exp((r(s,a) + \gamma V^{\ast}(s')) / \tau)$
Substitute the Boltzmann form of the policy to get the softmax backup: $V^*(s) = O _{ENT}(s, \pi^\ast) = \tau \log \sum_a \exp((r(s,a) + \gamma V^\ast(s')) / \tau)$
In terms of Q values: $Q^\ast(s,a) = r(s,a) + \gamma \tau \log \sum _{a'} \exp(Q^\ast(s',a') / \tau)$
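A small sketch of the softmax backup quantities for a single state, given a row of Q-values (the numbers and temperatures below are arbitrary); as $\tau \rightarrow 0$, the log-sum-exp collapses to the hard max and the policy approaches a one-hot vector.

```python
import numpy as np

def soft_value(q_row, tau):
    # V(s) = tau * log sum_a exp(Q(s,a) / tau), computed in a numerically stable way
    m = q_row.max()
    return m + tau * np.log(np.sum(np.exp((q_row - m) / tau)))

def boltzmann_policy(q_row, tau):
    # pi(a|s) = exp((Q(s,a) - V(s)) / tau), i.e. a softmax over Q(s,.) / tau
    return np.exp((q_row - soft_value(q_row, tau)) / tau)

q = np.array([1.0, 0.5, -0.2])            # illustrative Q-values for one state
for tau in (1.0, 0.1, 0.01):
    print(tau, soft_value(q, tau), boltzmann_policy(q, tau))
# As tau shrinks, soft_value(q, tau) -> max(q) and the policy becomes nearly one-hot.
```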
Consistency Between Optimal Value & Policy
$\exp(V^\ast(s) / \tau)$ is the normalization factor for $\pi^\ast(a \vert s)$: $\pi^\ast(a \vert s) = \frac{\exp((r(s,a) + \gamma V^\ast(s')) / \tau)}{\exp(V^\ast(s) / \tau)}$
Theorem 1 - (1-step) Temporal Consistency Property: $V^\ast (s) - \gamma V^\ast(s') = r(s,a) - \tau \log \pi^\ast(a \vert s)$ for every state $s$ and action $a$, where $s'$ is the next state
Can be extended to multiple steps
Can also express $\pi^\ast(a \vert s) = \exp((Q^\ast(s,a) - V^\ast(s)) / \tau)$
Corollary 2 - (Extended) Temporal Consistency Property: for any trajectory $s_1, a_1, \ldots, s_t$, $V^\ast (s_1) - \gamma^{t-1} V^\ast(s_t) = \sum _{i=1}^{t-1} \gamma^{i-1}[r(s_i,a_i) - \tau \log \pi^\ast(a_i \vert s_i)]$
Theorem 3: If a policy $\pi(a \vert s)$ and value function $V(s)$ satisfy the consistency property for all states and actions, then $\pi(a \vert s)$, $V(s)$ are optimal
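To make the consistency property concrete, the sketch below runs soft value iteration on an invented toy MDP (all numbers are illustrative), then checks Theorem 1 numerically: the residual $V^\ast(s) - \gamma V^\ast(s') - r(s,a) + \tau \log \pi^\ast(a \vert s)$ should be approximately zero for every state-action pair.

```python
import numpy as np

next_state = np.array([[1, 2], [2, 0], [2, 2]])   # toy deterministic dynamics
reward = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 0.0]])
gamma, tau = 0.9, 0.1

def soft_value(Q, tau):
    # V(s) = tau * logsumexp(Q(s,.) / tau), row-wise and numerically stable
    m = Q.max(axis=1, keepdims=True)
    return (m + tau * np.log(np.exp((Q - m) / tau).sum(axis=1, keepdims=True))).ravel()

# Soft value iteration: Q(s,a) <- r(s,a) + gamma * V(s'), with the softmax backup for V
Q = np.zeros_like(reward)
for _ in range(500):
    Q = reward + gamma * soft_value(Q, tau)[next_state]

V = soft_value(Q, tau)
log_pi = (Q - V[:, None]) / tau                   # log of the optimal Boltzmann policy

# Theorem 1: V(s) - gamma * V(s') = r(s,a) - tau * log pi(a|s) for all (s, a)
residual = V[:, None] - gamma * V[next_state] - reward + tau * log_pi
print(np.abs(residual).max())                     # ~0 up to convergence error
```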
Path Consistency Learning (PCL)
Soft consistency for a sub-trajectory of length $d$, $s _{i:i+d}$, policy $\pi _\theta$, and value function $V _{\phi}$
$C(s _{i:i+d}, \theta, \phi) = - V _{\phi}(s_i) + \gamma^d V _\phi(s _{i+d}) + \sum _{j=0}^{d-1} \gamma^j [r(s _{i+j}, a _{i+j}) - \tau \log \pi _\theta (a _{i+j} \vert s _{i+j})]$
Find $\phi, \theta$ so that $C(s _{i:i+d}, \theta, \phi)$ is as close to 0 as possible for all subtrajectories $s _{i:i+d}$
Path Consistency Learning (PCL): Minimize the squared soft consistency over a set of sub-trajectories $E$
Objective Function: $O _{PCL}(\theta, \phi) = \sum _{s _{i:i+d} \in E} \frac{1}{2}C(s _{i:i+d}, \theta, \phi)^2$
Update Rules:
$\Delta \theta = \eta _\pi C(s _{i:i+d}, \theta, \phi) \sum _{j=0}^{d-1} \gamma^j \nabla _\theta \log \pi _\theta (a _{i+j} \vert s _{i+j})$
$\Delta \phi = \eta _v C(s _{i:i+d}, \theta, \phi) [\nabla _{\phi} V _{\phi}(s_i) - \gamma^d \nabla _{\phi}V _{\phi}(s _{i +d})]$
$\eta _{\pi}, \eta_v$ are the learning rates
Can apply PCL updates both on-policy and off-policy
In stochastic settings, the inconsistency objective is a biased estimate of the true squared inconsistency
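A tabular sketch of the PCL updates above, assuming a softmax policy over a logits table $\theta$ and a value table $\phi$ (the state/action sizes, sub-trajectory, and learning rates are invented for illustration). In this tabular case, $\nabla _\theta \log \pi _\theta(a \vert s)$ is a one-hot on $a$ minus $\pi _\theta(\cdot \vert s)$ on row $s$, and $\nabla _\phi V _\phi(s)$ is a one-hot on $s$; the same update can be applied to on-policy rollouts or to sub-trajectories drawn from a replay buffer.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def soft_consistency(theta, phi, states, actions, rewards, gamma, tau):
    # C(s_{i:i+d}) = -V(s_i) + gamma^d V(s_{i+d}) + sum_j gamma^j [r_j - tau log pi(a_j|s_j)]
    d = len(actions)
    c = -phi[states[0]] + gamma**d * phi[states[d]]
    for j in range(d):
        log_pi = np.log(softmax(theta[states[j]])[actions[j]])
        c += gamma**j * (rewards[j] - tau * log_pi)
    return c

def pcl_update(theta, phi, states, actions, rewards,
               gamma=0.9, tau=0.1, eta_pi=0.01, eta_v=0.05):
    d = len(actions)
    c = soft_consistency(theta, phi, states, actions, rewards, gamma, tau)
    # Policy update: Delta theta = eta_pi * C * sum_j gamma^j grad_theta log pi(a_j|s_j)
    dtheta = np.zeros_like(theta)
    for j in range(d):
        g = -softmax(theta[states[j]])
        g[actions[j]] += 1.0                     # grad of log-softmax w.r.t. the logits row
        dtheta[states[j]] += gamma**j * g
    theta += eta_pi * c * dtheta
    # Value update: Delta phi = eta_v * C * (grad V(s_i) - gamma^d grad V(s_{i+d}))
    phi[states[0]] += eta_v * c
    phi[states[d]] -= eta_v * c * gamma**d
    return c

# Illustrative call: 4 states, 2 actions, one sub-trajectory of length d = 3.
theta, phi = np.zeros((4, 2)), np.zeros(4)
pcl_update(theta, phi, states=[0, 1, 2, 3], actions=[1, 0, 1], rewards=[0.0, 1.0, 0.5])
```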
Unified Path Consistency Learning (Unified PCL)
Normal PCL maintains separate models for policy and state value approximation
We can express soft consistency errors with only Q values, parameterized by $\rho$ ($Q _\rho$)
$V _\rho (s) = \tau \log \sum_a \exp(Q _\rho (s,a) / \tau)$
$\pi _\rho (a\vert s) = \exp((Q _\rho (s,a) - V _\rho(s))/\tau)$
Combines the actor (policy) and critic (value function) into a single model
In practice, it is better to apply updates to $\rho$ from $V _\rho$ and $\pi _\rho$ using different learning rates
Update rule: $\Delta \rho = \eta _\pi C(s _{i:i+d}, \rho) \sum _{j=0}^{d-1} \gamma^j \nabla _\rho \log \pi _\rho (a _{i+j} \vert s _{i+j}) + \eta _v C(s _{i:i+d}, \rho) [\nabla _{\rho} V _{\rho}(s_i) - \gamma^d \nabla _{\rho}V _{\rho}(s _{i +d})]$
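The same tabular sketch for Unified PCL, where a single Q-table $\rho$ induces both $V _\rho$ and $\pi _\rho$ (sizes, trajectory, and step sizes again invented for illustration). For a Q-table, $\nabla _\rho \log \pi _\rho(a \vert s)$ is (one-hot on $a$ minus $\pi _\rho(\cdot \vert s)$) divided by $\tau$ on row $s$, and $\nabla _\rho V _\rho(s)$ is $\pi _\rho(\cdot \vert s)$ on row $s$.

```python
import numpy as np

def soft_v(q_row, tau):
    # V_rho(s) = tau * logsumexp(Q_rho(s,.) / tau)
    m = q_row.max()
    return m + tau * np.log(np.exp((q_row - m) / tau).sum())

def pi_from_q(q_row, tau):
    # pi_rho(a|s) = exp((Q_rho(s,a) - V_rho(s)) / tau)
    return np.exp((q_row - soft_v(q_row, tau)) / tau)

def unified_pcl_update(Q, states, actions, rewards,
                       gamma=0.9, tau=0.1, eta_pi=0.01, eta_v=0.05):
    d = len(actions)
    # Soft consistency C computed entirely from the single Q-table
    c = -soft_v(Q[states[0]], tau) + gamma**d * soft_v(Q[states[d]], tau)
    for j in range(d):
        c += gamma**j * (rewards[j] - tau * np.log(pi_from_q(Q[states[j]], tau)[actions[j]]))
    dQ = np.zeros_like(Q)
    # Policy part: eta_pi * C * sum_j gamma^j grad_rho log pi(a_j|s_j)
    for j in range(d):
        g = -pi_from_q(Q[states[j]], tau)
        g[actions[j]] += 1.0
        dQ[states[j]] += eta_pi * c * gamma**j * g / tau
    # Value part: eta_v * C * (grad_rho V(s_i) - gamma^d grad_rho V(s_{i+d}))
    dQ[states[0]] += eta_v * c * pi_from_q(Q[states[0]], tau)
    dQ[states[d]] -= eta_v * c * gamma**d * pi_from_q(Q[states[d]], tau)
    Q += dQ
    return c

Q = np.zeros((4, 2))                                 # illustrative 4x2 Q-table
unified_pcl_update(Q, states=[0, 1, 2, 3], actions=[1, 0, 1], rewards=[0.0, 1.0, 0.5])
```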
Connections to Actor-Critic and Q-learning
Advantage actor-critic (A2C) exploits a learned value function to reduce variance
Updates in A2C
Policy: $\Delta \theta = \eta _\pi \mathbb{E} _{s _{i:i+d} \vert \theta}[A(s _{i:i+d}, \phi) \nabla _{\theta} \log \pi _{\theta}(a_i \vert s_i)]$
Critic: $\Delta \phi = \eta _v \mathbb{E} _{s _{i:i+d} \vert \theta}[A(s _{i:i+d}, \phi) \nabla _{\phi} V _{\phi}(s_i)]$
Very similar to PCL updates!
In PCL, if we take $\tau \rightarrow 0$, we get a variation of A2C
PCL can be thought of as a generalization of A2C
A2C is restricted to on-policy data; PCL can use both on-policy and off-policy data
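For comparison with the PCL sketch earlier, here is the A2C update above in the same tabular style, using the $d$-step advantage $A(s _{i:i+d}, \phi) = \sum _{j=0}^{d-1} \gamma^j r _{i+j} + \gamma^d V _\phi(s _{i+d}) - V _\phi(s_i)$ (the setup is again an illustrative assumption); structurally, the main differences from PCL are the missing $-\tau \log \pi$ terms and the requirement that the sub-trajectory be sampled on-policy.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def a2c_update(theta, phi, states, actions, rewards,
               gamma=0.9, eta_pi=0.01, eta_v=0.05):
    d = len(actions)
    # d-step advantage estimate for the on-policy sub-trajectory
    adv = sum(gamma**j * rewards[j] for j in range(d)) \
          + gamma**d * phi[states[d]] - phi[states[0]]
    # Actor: Delta theta = eta_pi * A * grad_theta log pi(a_i | s_i)
    g = -softmax(theta[states[0]])
    g[actions[0]] += 1.0
    theta[states[0]] += eta_pi * adv * g
    # Critic: Delta phi = eta_v * A * grad_phi V(s_i)
    phi[states[0]] += eta_v * adv
    return adv

theta, phi = np.zeros((4, 2)), np.zeros(4)
a2c_update(theta, phi, states=[0, 1, 2, 3], actions=[1, 0, 1], rewards=[0.0, 1.0, 0.5])
```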
Relation to hard-max temporal consistency algorithms
When $d = 1$ (single-step consistencies), Unified PCL becomes a form of soft Q-learning (degree of softness determined by $\tau$)
PCL generalizes Q learning
Q-learning is restricted to single-step consistencies because rewards following a non-optimal action are not related to the hard-max Q value
PCL can do multi-step backups
Experiments
Compare PCL and Unified PCL to A3C and double Q-learning with prioritized experience replay (PER)
PCL consistently beats the baselines
The unified model is competitive with PCL
PCL also trained with expert trajectories
Results
For simple tasks, PCL and A3C do roughly the same
More noticeable gaps appear in harder tasks
Prioritized DQN is worse than PCL in all tasks
Using a unified model is slightly detrimental on simpler tasks, but on difficult ones it is competitive with or better than PCL
Using a small number of expert trajectories with PCL significantly improves agent performance