Combining Policy Gradient and Q-learning
Resources
Introduction
With action value techniques, we fit an action value function
SARSA is on-policy and fits the Q function of the current policy
Q learning is off-policy and directly finds optimal Q values
Policy gradients improve the policy by updating its parameters in the direction of better performance
Actor-critic methods are online and use an estimate of the action-value function as the critic (they can be off-policy too)
This paper derives a link between the Q values induced by a policy and the policy itself, which holds when the policy is a fixed point of the regularized policy gradient
This allows us to estimate Q values from the policy, which can then be refined with off-policy data
Reinforcement Learning
Assume standard reinforcement learning setting
For PGQL, policy maps state-action pair to probability of taking that action in that state
Action-Value Learning
Approximate Q values using function approximators
Update parameters so that the Q values are as close as possible to a fixed point of the Bellman equation
SARSA uses the standard Bellman operator
Q learning uses the optimal Bellman operator (maximum over actions)
The Bellman operator is approximated via sampling and bootstrapping in the online setting
Exploration happens via epsilon-greedy action selection
Alternatively, we can use Boltzmann exploration, which takes a softmax over Q values with a temperature: $\pi(s,a) = \frac{exp(Q(s,a)/T)}{\sum_b exp(Q(s,b)/T)}$
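A minimal NumPy sketch of Boltzmann exploration (the Q values and temperature below are made up):

```python
import numpy as np

def boltzmann_policy(q_values, temperature):
    """Softmax over Q values; lower temperature -> greedier action selection."""
    logits = q_values / temperature
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = np.array([1.0, 2.0, 0.5])               # hypothetical Q(s, .) for 3 actions
pi = boltzmann_policy(q, temperature=0.5)
action = np.random.choice(len(q), p=pi)
```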
Policy Gradient
Parameterizes policy directly and improves it via gradient ascent
In the online case of policy gradient, we usually add an entropy regularizer for exploration (otherwise the policy collapses to a deterministic one)
Regularized Policy Gradient Theorem
Tabular Case
Let:
$f(\theta) = \mathbb{E} _{s,a}[Q^\pi(s,a) \nabla _{\theta} \log \pi(s,a)] + \alpha \mathbb{E} _{s}[\nabla _{\theta}H(\pi_s)]$
$g_s(\pi) = \sum_a \pi(s,a)$
Fixed Point: When we can no longer update $\theta$ without violating a constraint
Constraint: $g_s(\pi) = 1$ for each state $s$; at a fixed point, $f(\theta)$ lies in the span of the vectors $\nabla _{\theta} g_s(\pi)$
Fixed point must satisfy $f(\theta) = \sum_s \lambda_s \nabla _{\theta}g_s(\pi)$ subject to $g_s(\pi) = 1$, where $\lambda_s$ is the Lagrange multiplier for state $s$
Equivalent to: $\mathbb{E} _{s,a}[(Q^\pi(s,a) - \alpha \log \pi(s,a) - c_s) \nabla _{\theta} \pi(s,a)] = 0$
$c_s$ absorbs all constants
In tabular case, $\nabla _{\theta} \pi(s,a) = 1$ (single number for each state-action pair)
So the fixed point must satisfy: $Q^\pi(s,a) - \alpha \log \pi(s,a) - c_s = 0$
Multiplying by $\pi(s,a)$ and summing over $a \in \mathcal{A}$ gives $\sum_a \pi(s,a) Q^\pi(s,a) + \alpha H^\pi(s) - c_s = 0$, so $c_s = \alpha H^\pi(s) + V^\pi(s)$
Final policy formulation: $\pi(s,a) = exp(A^\pi(s,a) / \alpha - H^\pi(s))$
Softmax over the advantage function
$\alpha$ can be used as a temperature
Estimate of Q values using the policy formulation: $\tilde{Q}^\pi(s,a) = \tilde{A}(s,a) + V^\pi(s) = \alpha(\log \pi(s,a) + H^\pi(s)) + V^\pi(s)$
Gradient: $\Delta \theta \propto \mathbb{E} _{s,a}[(Q^\pi(s,a) - \tilde{Q}^\pi(s,a)) \nabla _\theta \log \pi(s,a)]$
When policy parameterized by softmax, action preferences are Q values scaled by $1/\alpha$
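A quick numerical sanity check of this relationship for a single state (a sketch, with made-up Q values): setting $\pi = $ softmax$(Q/\alpha)$, $V = \sum_a \pi(a)Q(a)$, $A = Q - V$, and $H = -\sum_a \pi(a)\log\pi(a)$ recovers $\pi(a) = exp(A(a)/\alpha - H)$:

```python
import numpy as np

alpha = 0.1
Q = np.array([1.0, 1.5, 0.2])                       # hypothetical Q(s, .) for one state

pi = np.exp(Q / alpha) / np.exp(Q / alpha).sum()    # softmax over Q with temperature alpha
V = pi @ Q                                          # V(s) = sum_a pi(s,a) Q(s,a)
A = Q - V                                           # advantage
H = -(pi * np.log(pi)).sum()                        # entropy of pi(s, .)

assert np.allclose(pi, np.exp(A / alpha - H))       # pi(s,a) = exp(A(s,a)/alpha - H(s))
```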
General Case
Constrained optimization problem:
Minimize: $\mathbb{E} _{s,a}[(q(s,a) - \alpha \log \pi(s,a))^2]$
Subject to: $\sum_a \pi(s,a) = 1, s \in \mathcal{S}$
Find the $\theta$ that parameterizes $\pi$
Optimality Condition: $\mathbb{E} _{s,a}[(q(s,a) - \alpha \log \pi(s,a) + c_s)\nabla _\theta \log \pi(s,a)] = 0$
$c_s$ is the Lagrange multiplier for a state
If $q = Q^\pi$, we get same set of fixed points
Fixed points of regularized policy gradient = regression of log policy onto Q values
Softmax formulation of the policy only approximately holds
We go ahead and estimate Q values using this formulation anyway
Connection to Action-Value Methods
Actor-critic methods are just action-value fitting methods
In actor-critic:
Policy: $\pi^k(s,a) = exp(W^k(s,a) / \alpha) / \sum_b exp(W^k(s,b) / \alpha)$
$W^k$: action preferences at iteration k, parameterized by $\theta$
Gradient: $\nabla _\theta \log \pi^k (s,a) = (1 / \alpha)(\nabla _{\theta}W^k(s,a) - \sum_b \pi^k(s,b) \nabla _\theta W^k(s,b))$
Updates:
$\Delta \theta \propto \mathbb{E} _{s,a}[\delta _{ac}(\nabla _{\theta} W^k(s,a) - \sum_b \pi^k(s,b) \nabla _{\theta} W^k(s,b))]$
$\Delta w \propto \mathbb{E} _{s,a}[\delta _{ac} \nabla_w V^k(s)]$
$w$: the parameters of $V$
$\delta _{ac}$: critic minus baseline
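A quick finite-difference check of the log-policy gradient above, for a single state with tabular preferences (made-up numbers; in the tabular case $\nabla _{\theta} W^k(s,a)$ is the one-hot vector $e_a$, so the formula reduces to $(1/\alpha)(e_a - \pi^k(s,\cdot))$):

```python
import numpy as np

alpha = 0.5
W = np.array([0.4, -0.2, 0.1])          # tabular action preferences W(s, .) for one state

def softmax(w):
    return np.exp(w / alpha) / np.exp(w / alpha).sum()

a, eps = 1, 1e-6
pi = softmax(W)

# analytic gradient from the formula above: (1/alpha) * (e_a - pi)
analytic = (np.eye(len(W))[a] - pi) / alpha

# finite-difference gradient of log pi(s,a) with respect to each preference W(s,b)
numeric = np.array([
    (np.log(softmax(W + eps * np.eye(len(W))[b])[a]) - np.log(pi[a])) / eps
    for b in range(len(W))
])
assert np.allclose(analytic, numeric, atol=1e-4)
```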
In action-value methods
Q-Value Function: $Q(s,a) = Y^k(s,a) - \sum_b \mu(s,b) Y^k(s,b) + V^k(s)$
$\mu$: Probability distribution
$Y^k$: parameterized function (by $\theta^k$)
Exploration policy: Boltzmann distribution
$\pi^k(s,a) = exp(Y^k(s,a) / \alpha) / \sum_b exp(Y^k(s,b) / \alpha)$
Updates:
$\Delta \theta \propto \mathbb{E} _{s,a}[\delta _{av}(\nabla _{\theta} Y^k(s,a) - \sum_b \mu(s,b) \nabla _{\theta} Y^k(s,b))]$
$\Delta w \propto \mathbb{E} _{s,a}[\delta _{av} \nabla_w V^k(s)]$
$\delta _{av}$: action-value error term
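A minimal single-state sketch of this parameterization (made-up numbers; here $\mu$ is chosen to be the Boltzmann exploration policy, i.e. $\mu = \pi^k$):

```python
import numpy as np

alpha = 0.5
Y = np.array([0.4, -0.2, 0.1])                      # preferences Y(s, .) for one state
V = 2.0                                             # value estimate V(s)

mu = np.exp(Y / alpha) / np.exp(Y / alpha).sum()    # reference distribution, here mu = pi

Q = Y - mu @ Y + V                                  # Q(s,a) = Y(s,a) - sum_b mu(s,b) Y(s,b) + V(s)
```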
Equivalence between methods
Policies identical if $W^k = Y^k$
Updates identical if $\mu = \pi^k$
Slightly modified action-value method is equivalent to actor-critic policy gradient
Bellman Residual
We can show that $\vert \vert \mathcal{T}^\ast Q^{\pi _\alpha} - Q^{\pi _\alpha} \vert \vert \rightarrow 0$ as $\alpha \rightarrow 0$
$\mathcal{T}^\ast Q^{\pi _\alpha} \geq \mathcal{T}^{\pi _\alpha} Q^{\pi _\alpha} = Q^{\pi _\alpha}$
$\Rightarrow \mathcal{T}^\ast Q^{\pi _\alpha}(s,a) - Q^{\pi _\alpha}(s,a) \geq 0$
$= \mathcal{T}^\ast Q^{\pi _\alpha}(s,a) - \mathcal{T}^{\pi _{\alpha}} Q^{\pi _\alpha}(s,a) \geq 0$
$= \mathbb{E} _{s'}[\max_c Q^{\pi _\alpha}(s', c) - \sum_b \pi _{\alpha}(s', b)Q^{\pi _{\alpha}}(s',b)]$
$= \mathbb{E} _{s'}[\sum_b \pi _{\alpha}(s', b) (\max_c Q^{\pi _\alpha}(s', c) - Q^{\pi _{\alpha}}(s',b))]$
$\leq \mathbb{E} _{s'}[\sum_b exp((Q^{\pi _{\alpha}}(s', b) - Q^{\pi _\alpha}(s', b^\ast)) / \alpha) (\max_c Q^{\pi _\alpha}(s', c) - Q^{\pi _{\alpha}}(s',b))]$
$= \mathbb{E} _{s'}[\sum_b f _{\alpha} (\max_c Q^{\pi _\alpha}(s', c) - Q^{\pi _{\alpha}}(s',b))]$
$f _{\alpha}(x) = x exp(-x / \alpha)$
Since $f _\alpha (x) \leq \sup_x f _{\alpha}(x) = f _{\alpha}(\alpha) = \alpha e^{-1}$, we get:
$0 \leq \mathcal{T}^\ast Q^{\pi _\alpha}(s,a) - Q^{\pi _\alpha}(s, a) \leq \vert \mathcal{A}\vert \alpha e^{-1}$
This tells us the Bellman residual converges to 0 as $\alpha$ decreases
Also implies $\lim _{\alpha \rightarrow 0} Q^{\pi _\alpha} = Q^\ast$
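A quick numerical check of the final bound for a single next state (made-up Q values; $\pi_\alpha$ is the Boltzmann policy over $Q^{\pi_\alpha}$):

```python
import numpy as np

alpha = 0.05
Q = np.array([1.0, 0.8, 0.3, 0.1])                  # hypothetical Q^{pi_alpha}(s', .)
pi = np.exp(Q / alpha) / np.exp(Q / alpha).sum()    # Boltzmann policy at temperature alpha

gap = Q.max() - pi @ Q                              # max_c Q(s',c) - sum_b pi(s',b) Q(s',b)
bound = len(Q) * alpha * np.exp(-1)                 # |A| * alpha * e^{-1}
assert 0 <= gap <= bound
```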
PGQL
PGQL Update
Estimate of Q using the policy: $\tilde{Q}^\pi(s,a) = \alpha(\log \pi(s,a) + H^{\pi}(s)) + V(s)$
Update parameters to reduce the Bellman residual, similar to Q learning
$\Delta \theta \propto \mathbb{E} _{s,a}[(\mathcal{T}^\ast \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a))\nabla _{\theta} \log \pi(s,a)]$
$\Delta w \propto \mathbb{E} _{s,a}[(\mathcal{T}^\ast \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a))\nabla_w V(s)]$
Full PGQL update combines regularized policy gradient with Q learning update
$\Delta \theta \propto (1 - \eta) \mathbb{E} _{s,a}[(Q^\pi-\tilde{Q}^\pi)\nabla _{\theta}\log \pi ] +\eta \mathbb{E} _{s,a}[(\mathcal{T}^\ast \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a))\nabla _{\theta} \log \pi]$
$\Delta w \propto (1-\eta) \mathbb{E} _{s,a}[(Q^\pi-\tilde{Q}^\pi)\nabla_w V] + \eta\mathbb{E} _{s,a}[(\mathcal{T}^\ast \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a))\nabla_w V]$
$\eta$: Weighting parameter to control how much of each update to apply
$\eta = 0$: Entropy regularized policy gradient
$\eta = 1$: Q learning variant
First term in the updates encourages consistency with the critic
Second term in the updates encourages optimality over time (a rough sketch of the combined update follows the interpretations below)
Interpretations:
An actor-critic update where the critic is a weighted combination of a standard critic and an optimizing (max) critic
Update is a combination of expected SARSA and Q-learning
Q values parameterized as the sum of advantage and value functions
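A rough single-transition sketch of how the combined PGQL error could be formed (made-up numbers; `Q_pi_sa` stands in for a critic or return estimate of $Q^\pi(s,a)$, and the Bellman target uses one sampled next state):

```python
import numpy as np

alpha, eta, gamma = 0.1, 0.5, 0.99
r, a = 1.0, 0                                       # sampled reward and action taken

pi_s  = np.array([0.7, 0.2, 0.1])                   # pi(s, .)
pi_s2 = np.array([0.5, 0.3, 0.2])                   # pi(s', .)
V_s, V_s2 = 1.2, 0.9                                # value estimates V(s), V(s')

def q_tilde(pi, V):
    """Q estimate induced by the policy: alpha * (log pi + H(pi)) + V."""
    H = -(pi * np.log(pi)).sum()
    return alpha * (np.log(pi) + H) + V

Q_pi_sa = 1.4                                       # stand-in critic estimate of Q^pi(s,a)
q_sa    = q_tilde(pi_s, V_s)[a]
target  = r + gamma * q_tilde(pi_s2, V_s2).max()    # sampled optimal Bellman target

delta_pg = Q_pi_sa - q_sa                           # critic-consistency (policy gradient) error
delta_q  = target - q_sa                            # Bellman (Q-learning) error
delta    = (1 - eta) * delta_pg + eta * delta_q     # combined PGQL error

# Both terms share the same gradients, so the updates become
#   delta_theta  proportional to  delta * grad_theta log pi(s,a)
#   delta_w      proportional to  delta * grad_w V(s)
```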
Practical Implementation
We don’t have access to an exact critic, $Q^\pi$
Instead, agents interact with the environment and update the shared parameters of the actor-critic algorithm
Policy + critic update online
Maintains a replay buffer that is occasionally sampled to perform a step of Q learning on the policy
Allows critic to accumulate MC return over many time periods
Can prioritize important samples
Reduces temporal correlation
Acts as a regularizer that keeps the policy close to satisfying the Bellman equation
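A minimal sketch of the replay-buffer bookkeeping described here (transition contents are placeholders; the online actor-critic update and the Q-learning step are elided):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

replay = deque(maxlen=100_000)                      # bounded buffer; oldest transitions drop out
batch_size, q_learning_interval = 32, 4

for step in range(1_000):
    # placeholder transition; in practice this comes from acting in the environment
    replay.append(Transition(step, 0, 0.0, step + 1, False))

    # update_online(policy, value_fn, replay[-1])   # elided: online actor-critic update

    # occasionally sample a decorrelated batch and take a Q-learning step on it
    if step % q_learning_interval == 0 and len(replay) >= batch_size:
        batch = random.sample(replay, batch_size)
        # q_learning_step(policy, value_fn, batch)  # elided: Q-learning on the sampled batch
```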
Modified Fixed Point
The PGQL updates modify the fixed point of the algorithm
New Q value estimate: $\tilde{Q}^\pi = (1 - \eta)Q^\pi + \eta \mathcal{T}^\ast \tilde{Q}^\pi$
Numerical Experiments
Grid World
Tested against actor-critic (where an estimate of the value function is used as the critic) and against Q learning (where updates are performed on data from a replay buffer)
Deterministic environment
PGQL outperformed Actor-Critic and Q learning
Atari
Compared against A3C and an asynchronous version of deep Q learning
PGQL performed best in 34 games, A3C in 7, Q learning in 10, and two methods tied in 6 games
PGQL has the highest mean and median scores with human starts
PGQL was the worst performer in only 1 game
In the games where PGQL underperformed, it still had better data efficiency early on