Bridging the Gap Between Value and Policy Based Reinforcement Learning
Resources
Introduction
Challenge: How do we combine the advantages of value-based and policy-based RL while mitigating their shortcomings?
Policy Based RL
Stable under function approximation (given a small learning rate)
Sample inefficient
High variance gradients
Actor-Critic Methods
Reduce variance at the cost of some bias
On-policy learning still very inefficient
Either need to use on-policy data
Or need to update slowly enough to avoid bias
Importance sampling corrections are not sufficient
Off-policy learning
Can learn from any trajectory sampled from same environment
More sample efficient
Requires extensive hyperparameter tuning (not stable otherwise)
Ideal: Combine the unbiasedness + stability of on-policy learning with the data efficiency of off-policy learning
Previous approaches exist but don't resolve the theoretical difficulty of off-policy learning with function approximation
Notation & Background
Assume a stochastic policy over finite actions: $\pi _\theta(a \vert s)$
Assume deterministic state dynamics (for simplicity)
Assume standard reinforcement learning setting
Hard-max Bellman temporal consistency: $V^\circ(s) = O _{ER}(s, \pi^\circ) = \max_a (r(s,a) + \gamma V^\circ(s'))$, where $O _{ER}$ is the expected discounted reward objective and $s'$ is the state reached by taking action $a$ in $s$
In terms of optimal action values: $Q^\circ(s,a) = r(s,a) + \gamma \max _{a'}Q^\circ(s',a')$
The $\circ$ denotes optimality
Optimal policy, $\pi^\circ$, becomes a one-hot vector
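As a concrete illustration, here is a minimal sketch of the hard-max backup on an invented toy deterministic MDP (the transition table, rewards, and iteration count below are arbitrary assumptions, not from the paper); iterating the backup converges to $V^\circ$, and the greedy policy it induces is one-hot.

```python
import numpy as np

# Illustrative toy MDP (not from the paper): 3 states, 2 actions,
# deterministic dynamics next_state[s, a] and rewards reward[s, a].
next_state = np.array([[1, 2],
                       [2, 0],
                       [2, 2]])          # state 2 is absorbing
reward = np.array([[1.0, 0.0],
                   [0.5, 0.0],
                   [0.0, 0.0]])
gamma = 0.9

# Hard-max Bellman backup: V(s) <- max_a [ r(s,a) + gamma * V(s') ]
V = np.zeros(3)
for _ in range(200):
    V = np.max(reward + gamma * V[next_state], axis=1)

# Optimal Q-values and the induced one-hot (greedy) policy
Q = reward + gamma * V[next_state]
print(V, np.argmax(Q, axis=1))
```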
Softmax Temporal Consistency
Softmax temporal consistency comes from augmenting the expected reward objective with a discounted entropy regularizer, encouraging exploration
$O _{ENT}(s, \pi) = O _{ER} (s, \pi) + \tau \mathbb{H}(s, \pi)$
$\tau$: User specified temperature to control degree of regularization
$\mathbb{H}(s, \pi) = \sum_a \pi(a \vert s)[- \log \pi(a \vert s) + \gamma \mathbb{H}(s', \pi)]$
$O _{ENT}(s, \pi) = \sum_a \pi(a \vert s)[r(s,a) - \tau \log \pi(a \vert s) + \gamma O _{ENT}(s', \pi)]$
Soft value function: $V^* (s) = \max_\pi O _{ENT}(s, \pi)$
Let $\pi^*(a \vert s)$ be the optimal policy that maximizes $O _{ENT}$
This is no longer a one-hot vector because of the entropy term
Policy takes the form of a Boltzmann distribution: $\pi^*(a \vert s) \propto \exp((r(s,a) + \gamma V^{\ast}(s')) / \tau)$
Substitute the Boltzmann form of the policy to get the softmax backup: $V^*(s) = O _{ENT}(s, \pi^\ast) = \tau \log \sum_a \exp((r(s,a) + \gamma V^\ast(s')) / \tau)$
In terms of Q values: $Q^\ast(s,a) = r(s,a) + \gamma \tau \log \sum _{a'} \exp(Q^\ast(s',a') / \tau)$
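A small sketch of the softmax backup quantities for a single state, given a row of Q-values (the numbers and temperatures below are arbitrary); as $\tau \rightarrow 0$, the log-sum-exp collapses to the hard max and the policy approaches a one-hot vector.

```python
import numpy as np

def soft_value(q_row, tau):
    # V(s) = tau * log sum_a exp(Q(s,a) / tau), computed in a numerically stable way
    m = q_row.max()
    return m + tau * np.log(np.sum(np.exp((q_row - m) / tau)))

def boltzmann_policy(q_row, tau):
    # pi(a|s) = exp((Q(s,a) - V(s)) / tau), i.e. a softmax over Q(s,.) / tau
    return np.exp((q_row - soft_value(q_row, tau)) / tau)

q = np.array([1.0, 0.5, -0.2])            # illustrative Q-values for one state
for tau in (1.0, 0.1, 0.01):
    print(tau, soft_value(q, tau), boltzmann_policy(q, tau))
# As tau shrinks, soft_value(q, tau) -> max(q) and the policy becomes nearly one-hot.
```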
Consistency Between Optimal Value & Policy
$\exp(V^\ast(s) / \tau)$ is the normalization factor for $\pi^\ast(a \vert s)$: $\pi^\ast(a \vert s) = \frac{\exp((r(s,a) + \gamma V^\ast(s')) / \tau)}{\exp(V^\ast(s) / \tau)}$
Theorem 1 - (1-step) Temporal Consistency Property: $V^\ast (s) - \gamma V^\ast(s') = r(s,a) - \tau \log \pi^\ast(a \vert s)$ for every state $s$ and action $a$, where $s'$ is the next state
Can be extended to multiple steps
Can also express $\pi^\ast(a \vert s) = \exp((Q^\ast(s,a) - V^\ast(s)) / \tau)$
Corollary 2 - (Extended) Temporal Consistency Property: for any trajectory $s_1, a_1, \ldots, s_t$, $V^\ast (s_1) - \gamma^{t-1} V^\ast(s_t) = \sum _{i=1}^{t-1} \gamma^{i-1}[r(s_i,a_i) - \tau \log \pi^\ast(a_i \vert s_i)]$
Theorem 3: If a policy $\pi(a \vert s)$ and value function $V(s)$ satisfy the consistency property for all states and actions, then $\pi(a \vert s)$, $V(s)$ are optimal
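To make the consistency property concrete, the sketch below runs soft value iteration on an invented toy MDP (all numbers are illustrative), then checks Theorem 1 numerically: the residual $V^\ast(s) - \gamma V^\ast(s') - r(s,a) + \tau \log \pi^\ast(a \vert s)$ should be approximately zero for every state-action pair.

```python
import numpy as np

next_state = np.array([[1, 2], [2, 0], [2, 2]])   # toy deterministic dynamics
reward = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 0.0]])
gamma, tau = 0.9, 0.1

def soft_value(Q, tau):
    # V(s) = tau * logsumexp(Q(s,.) / tau), row-wise and numerically stable
    m = Q.max(axis=1, keepdims=True)
    return (m + tau * np.log(np.exp((Q - m) / tau).sum(axis=1, keepdims=True))).ravel()

# Soft value iteration: Q(s,a) <- r(s,a) + gamma * V(s'), with the softmax backup for V
Q = np.zeros_like(reward)
for _ in range(500):
    Q = reward + gamma * soft_value(Q, tau)[next_state]

V = soft_value(Q, tau)
log_pi = (Q - V[:, None]) / tau                   # log of the optimal Boltzmann policy

# Theorem 1: V(s) - gamma * V(s') = r(s,a) - tau * log pi(a|s) for all (s, a)
residual = V[:, None] - gamma * V[next_state] - reward + tau * log_pi
print(np.abs(residual).max())                     # ~0 up to convergence error
```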
Path Consistency Learning (PCL)
Soft consistency for a sub-trajectory of length $d$, $s _{i:i+d}$, policy $\pi _\theta$, and value function $V _{\phi}$
$C(s _{i:i+d}, \theta, \phi) = - V _{\phi}(s_i) + \gamma^d V _\phi(s _{i+d}) + \sum _{j=0}^{d-1} \gamma^j [r(s _{i+j}, a _{i+j}) - \tau \log \pi _\theta (a _{i+j} \vert s _{i+j})]$
Find $\phi, \theta$ so that $C(s _{i:i+d}, \theta, \phi)$ is as close to 0 as possible for all subtrajectories $s _{i:i+d}$
Path Consistency Learning (PCL): Minimize the squared soft consistency over a set of sub-trajectories $E$
Objective Function: $O _{PCL}(\theta, \phi) = \sum _{s _{i:i+d} \in E} \frac{1}{2}C(s _{i:i+d}, \theta, \phi)^2$
Update Rules:
$\Delta \theta = \eta _\pi C(s _{i:i+d}, \theta, \phi) \sum _{j=0}^{d-1} \gamma^j \nabla _\theta \log \pi _\theta (a _{i+j} \vert s _{i+j})$
$\Delta \phi = \eta _v C(s _{i:i+d}, \theta, \phi) [\nabla _{\phi} V _{\phi}(s_i) - \gamma^d \nabla _{\phi}V _{\phi}(s _{i +d})]$
$\eta _{\pi}, \eta_v$ are the learning rates
Can apply PCL updates both on-policy and off-policy
In stochastic settings, the inconsistency objective is a biased estimate of the true squared inconsistency
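A tabular sketch of the PCL updates above, assuming a softmax policy over a logits table $\theta$ and a value table $\phi$ (the state/action sizes, sub-trajectory, and learning rates are invented for illustration). In this tabular case, $\nabla _\theta \log \pi _\theta(a \vert s)$ is a one-hot on $a$ minus $\pi _\theta(\cdot \vert s)$ on row $s$, and $\nabla _\phi V _\phi(s)$ is a one-hot on $s$; the same update can be applied to on-policy rollouts or to sub-trajectories drawn from a replay buffer.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def soft_consistency(theta, phi, states, actions, rewards, gamma, tau):
    # C(s_{i:i+d}) = -V(s_i) + gamma^d V(s_{i+d}) + sum_j gamma^j [r_j - tau log pi(a_j|s_j)]
    d = len(actions)
    c = -phi[states[0]] + gamma**d * phi[states[d]]
    for j in range(d):
        log_pi = np.log(softmax(theta[states[j]])[actions[j]])
        c += gamma**j * (rewards[j] - tau * log_pi)
    return c

def pcl_update(theta, phi, states, actions, rewards,
               gamma=0.9, tau=0.1, eta_pi=0.01, eta_v=0.05):
    d = len(actions)
    c = soft_consistency(theta, phi, states, actions, rewards, gamma, tau)
    # Policy update: Delta theta = eta_pi * C * sum_j gamma^j grad_theta log pi(a_j|s_j)
    dtheta = np.zeros_like(theta)
    for j in range(d):
        g = -softmax(theta[states[j]])
        g[actions[j]] += 1.0                     # grad of log-softmax w.r.t. the logits row
        dtheta[states[j]] += gamma**j * g
    theta += eta_pi * c * dtheta
    # Value update: Delta phi = eta_v * C * (grad V(s_i) - gamma^d grad V(s_{i+d}))
    phi[states[0]] += eta_v * c
    phi[states[d]] -= eta_v * c * gamma**d
    return c

# Illustrative call: 4 states, 2 actions, one sub-trajectory of length d = 3.
theta, phi = np.zeros((4, 2)), np.zeros(4)
pcl_update(theta, phi, states=[0, 1, 2, 3], actions=[1, 0, 1], rewards=[0.0, 1.0, 0.5])
```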
Unified Path Consistency Learning (Unified PCL)
Normal PCL maintains separate models for policy and state value approximation
We can express soft consistency errors with only Q values, parameterized by $\rho$ ($Q _\rho$)
$V _\rho (s) = \tau \log \sum_a \exp(Q _\rho (s,a) / \tau)$
$\pi _\rho (a\vert s) = \exp((Q _\rho (s,a) - V _\rho(s))/\tau)$
Combines the actor (policy) and critic (value function) into a single model
In practice, it is better to apply updates to $\rho$ from $V _\rho$ and $\pi _\rho$ using different learning rates
Update rule: $\Delta \rho = \eta _\pi C(s _{i:i+d}, \rho) \sum _{j=0}^{d-1} \gamma^j \nabla _\rho \log \pi _\rho (a _{i+j} \vert s _{i+j}) + \eta _v C(s _{i:i+d}, \rho) [\nabla _{\rho} V _{\rho}(s_i) - \gamma^d \nabla _{\rho}V _{\rho}(s _{i +d})]$
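The same tabular sketch for Unified PCL, where a single Q-table $\rho$ induces both $V _\rho$ and $\pi _\rho$ (sizes, trajectory, and step sizes again invented for illustration). For a Q-table, $\nabla _\rho \log \pi _\rho(a \vert s)$ is (one-hot on $a$ minus $\pi _\rho(\cdot \vert s)$) divided by $\tau$ on row $s$, and $\nabla _\rho V _\rho(s)$ is $\pi _\rho(\cdot \vert s)$ on row $s$.

```python
import numpy as np

def soft_v(q_row, tau):
    # V_rho(s) = tau * logsumexp(Q_rho(s,.) / tau)
    m = q_row.max()
    return m + tau * np.log(np.exp((q_row - m) / tau).sum())

def pi_from_q(q_row, tau):
    # pi_rho(a|s) = exp((Q_rho(s,a) - V_rho(s)) / tau)
    return np.exp((q_row - soft_v(q_row, tau)) / tau)

def unified_pcl_update(Q, states, actions, rewards,
                       gamma=0.9, tau=0.1, eta_pi=0.01, eta_v=0.05):
    d = len(actions)
    # Soft consistency C computed entirely from the single Q-table
    c = -soft_v(Q[states[0]], tau) + gamma**d * soft_v(Q[states[d]], tau)
    for j in range(d):
        c += gamma**j * (rewards[j] - tau * np.log(pi_from_q(Q[states[j]], tau)[actions[j]]))
    dQ = np.zeros_like(Q)
    # Policy part: eta_pi * C * sum_j gamma^j grad_rho log pi(a_j|s_j)
    for j in range(d):
        g = -pi_from_q(Q[states[j]], tau)
        g[actions[j]] += 1.0
        dQ[states[j]] += eta_pi * c * gamma**j * g / tau
    # Value part: eta_v * C * (grad_rho V(s_i) - gamma^d grad_rho V(s_{i+d}))
    dQ[states[0]] += eta_v * c * pi_from_q(Q[states[0]], tau)
    dQ[states[d]] -= eta_v * c * gamma**d * pi_from_q(Q[states[d]], tau)
    Q += dQ
    return c

Q = np.zeros((4, 2))                                 # illustrative 4x2 Q-table
unified_pcl_update(Q, states=[0, 1, 2, 3], actions=[1, 0, 1], rewards=[0.0, 1.0, 0.5])
```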
Connections to Actor-Critic and Q-learning
Advantage actor-critic (A2C) exploits a learned value function to reduce variance
Updates in A2C
Policy: $\Delta \theta = \eta _\pi \mathbb{E} _{s _{i:i+d} \vert \theta}[A(s _{i:i+d}, \phi) \nabla _{\theta} \log \pi _{\theta}(a_i \vert s_i)]$
Critic: $\Delta \phi = \eta _v \mathbb{E} _{s _{i:i+d} \vert \theta}[A(s _{i:i+d}, \phi) \nabla _{\phi} V _{\phi}(s_i)]$
Very similar to PCL updates!
In PCL, if we take $\tau \rightarrow 0$, we get a variation of A2C
PCL can be thought of as a generalization of A2C
A2C is restricted to on-policy data; PCL can use both on-policy and off-policy data
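For comparison with the PCL sketch earlier, here is the A2C update above in the same tabular style, using the $d$-step advantage $A(s _{i:i+d}, \phi) = \sum _{j=0}^{d-1} \gamma^j r _{i+j} + \gamma^d V _\phi(s _{i+d}) - V _\phi(s_i)$ (the setup is again an illustrative assumption); structurally, the main differences from PCL are the missing $-\tau \log \pi$ terms and the requirement that the sub-trajectory be sampled on-policy.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def a2c_update(theta, phi, states, actions, rewards,
               gamma=0.9, eta_pi=0.01, eta_v=0.05):
    d = len(actions)
    # d-step advantage estimate for the on-policy sub-trajectory
    adv = sum(gamma**j * rewards[j] for j in range(d)) \
          + gamma**d * phi[states[d]] - phi[states[0]]
    # Actor: Delta theta = eta_pi * A * grad_theta log pi(a_i | s_i)
    g = -softmax(theta[states[0]])
    g[actions[0]] += 1.0
    theta[states[0]] += eta_pi * adv * g
    # Critic: Delta phi = eta_v * A * grad_phi V(s_i)
    phi[states[0]] += eta_v * adv
    return adv

theta, phi = np.zeros((4, 2)), np.zeros(4)
a2c_update(theta, phi, states=[0, 1, 2, 3], actions=[1, 0, 1], rewards=[0.0, 1.0, 0.5])
```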
Relation to hard-max temporal consistency algorithms
When $d = 1$ (single-step consistencies), Unified PCL becomes a form of soft Q-learning (degree of softness determined by $\tau$)
PCL generalizes Q learning
Q-learning is restricted to single-step consistencies because rewards following a non-optimal action are not related to the hard-max Q value
PCL can do multi-step backups
Experiments
Compare PCL and Unified PCL to A3C and double Q-learning with prioritized experience replay (PER)
PCL consistently beats the baselines
The unified model is competitive with PCL
PCL also trained with expert trajectories
Results
For simple tasks, PCL and A3C do roughly the same
More noticeable gaps appear in harder tasks
Prioritized DQN is worse than PCL in all tasks
Using a unified model is slightly detrimental on simpler tasks, but on difficult ones it is competitive with or better than PCL
Using a small number of expert trajectories with PCL significantly improves agent performance