Deterministic Policy Gradient Algorithms
Introduction
Policy gradient algorithms adjust the policy parameters in the direction of greater expected cumulative reward, sampling actions stochastically from the policy
We want deterministic policies trained with the same approach as stochastic policy gradients; this turns out to be the stochastic case in the limit as the policy variance tends to 0!
Stochastic policy gradient = integrate over state and action space (requires more samples)
Deterministic policy gradient = integrate over state space
Stochasticity enables exploration in stochastic policy gradients; for deterministic policies, use off-policy learning with a stochastic behavior policy for exploration
Introduce compatible function approximation, which ensures the critic approximation does not bias the policy gradient
Background
Preliminaries
Standard MDP setting; the objective is to maximize the expected cumulative reward
Stochastic Policy Gradient Theorem
Policy gradient theorem: $\nabla _\theta J(\pi _\theta) = \int_S \rho^\pi (s) \int_A \nabla _\theta \pi _\theta(a \vert s) Q^\pi(s,a)\, da\, ds = \mathbb{E} _{s \sim \rho^\pi, a \sim \pi _{\theta}}[\nabla _\theta \log \pi _\theta(a \vert s) Q^\pi(s,a)]$
Sampled returns can be used to estimate the Q value function
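As a concrete illustration (not from the paper), a minimal sketch of this score-function estimator for a hypothetical linear-Gaussian policy, where sampled Monte Carlo returns stand in for $Q^\pi(s,a)$; the policy form and `sigma` are assumptions for illustration:

```python
import numpy as np

def stochastic_pg_estimate(theta, states, actions, returns, sigma=0.5):
    """Monte Carlo estimate of the stochastic policy gradient for a
    linear-Gaussian policy pi_theta(a|s) = N(theta^T s, sigma^2).

    grad_theta log pi(a|s) = (a - theta^T s) * s / sigma^2, and the
    sampled return G stands in for Q^pi(s, a)."""
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        score = (a - theta @ s) * s / sigma**2   # grad_theta log pi(a|s)
        grad += score * G                        # weight score by return
    return grad / len(states)
```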
Stochastic Actor-Critic Algorithms
Two components:
Actor adjusts the parameters $\theta$ of a stochastic policy, $\pi _\theta(s)$
Critic estimates the action-value function $Q^w(s,a)$ via temporal-difference learning
Using a function approximator for the critic (rather than empirical returns) can introduce bias
Compatibility requirements that ensure no bias:
$Q^w(s,a) = \nabla _\theta \log \pi _\theta(a\vert s)^Tw$: linear in features of stochastic policy
Parameters $w$ chosen to minimize the mean-squared error between $Q^w$ and $Q^\pi$, i.e. linear regression of Q onto these features
If both compatibility conditions hold, the overall algorithm is equivalent to using no critic at all
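A minimal sketch (same hypothetical linear-Gaussian policy as above) of fitting such a compatible critic by least squares; sampled returns are used as noisy, unbiased targets for $Q^\pi$, which is an assumption for illustration:

```python
import numpy as np

def score_features(theta, s, a, sigma=0.5):
    # Features of the compatible critic: grad_theta log pi_theta(a|s)
    return (a - theta @ s) * s / sigma**2

def fit_compatible_critic(theta, states, actions, returns, sigma=0.5):
    """Least-squares fit of Q^w(s,a) = score_features(s,a)^T w to sampled
    returns, approximately minimizing the MSE required by compatibility."""
    X = np.stack([score_features(theta, s, a, sigma)
                  for s, a in zip(states, actions)])
    y = np.asarray(returns)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w   # Q^w(s,a) = score_features(theta, s, a) @ w
```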
Off-Policy Actor-Critic
Objective modified to be the value function of the target policy averaged over the state distribution of the behavior policy:
$J _\beta (\pi _\theta) = \int_S \int_A \rho^\beta(s) \pi _\theta(a\vert s) Q^\pi(s,a)dads$
Off-policy actor-critic gradient (approximate): $\nabla _\theta J _\beta (\pi _\theta) \approx \int_S \int_A \rho^\beta(s) \nabla _\theta \pi _\theta(a\vert s) Q^\pi(s,a)\,da\,ds = \mathbb{E} _{s \sim \rho^\beta, a \sim \beta}[\frac{\pi _\theta(a \vert s)}{\beta(a \vert s)} \nabla _\theta \log \pi _\theta (a \vert s)Q^\pi(s,a)]$
In practice, the temporal-difference error can be used in place of the true Q value function
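A sketch of a single importance-weighted actor update under these definitions; the density and score-function helpers passed in are assumptions, and the TD error is used as the critic signal as noted above:

```python
def offpac_actor_step(theta, s, a, td_error, pi_density, beta_density,
                      grad_log_pi, alpha=1e-2):
    """One off-policy actor update: weight the score-function gradient by
    the importance ratio pi_theta(a|s) / beta(a|s), using the TD error in
    place of Q^pi(s, a)."""
    rho = pi_density(theta, s, a) / beta_density(s, a)   # importance weight
    return theta + alpha * rho * grad_log_pi(theta, s, a) * td_error
```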
Gradients of Deterministic Policies
Action-Value Gradients
Policy evaluation methods estimate the action-value function via TD learning or Monte Carlo evaluation
Policy improvement methods update the policy via greedy maximization with respect to the action-value function. In continuous action spaces greedy maximization is problematic; instead, move the policy in the direction of the gradient of Q
Each state suggests different direction; take the expectation over the state distribution: $\theta^{k+1} = \theta^{k} + \alpha \mathbb{E} _{s \sim \rho^{\mu^k}}[\nabla _\theta Q^{\mu^k}(s,\mu _\theta(s))]$
Can decompose the policy gradient (via the chain rule) into the gradient of the action value with respect to the action and the gradient of the policy with respect to its parameters: $\theta^{k+1} = \theta^{k} + \alpha \mathbb{E} _{s \sim \rho^{\mu^k}}[\nabla _\theta \mu _\theta(s) \nabla_a Q^{\mu^k}(s,a) \vert _{a = \mu _\theta(s)}]$
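A toy check of this decomposition, with a made-up linear policy and a quadratic Q whose action-gradient is known in closed form:

```python
import numpy as np

# Toy setup: linear policy mu_theta(s) = theta^T s (scalar action),
# quadratic critic Q(s, a) = -(a - s.sum())**2 with known action-gradient.
def mu(theta, s):
    return theta @ s

def dQ_da(s, a):
    return -2.0 * (a - s.sum())          # analytic grad_a Q(s, a)

def dpg_update_direction(theta, states):
    """Average over sampled states of grad_theta mu(s) * grad_a Q(s, a)|_{a=mu(s)}.
    For this linear policy, grad_theta mu(s) = s."""
    return np.mean([s * dQ_da(s, mu(theta, s)) for s in states], axis=0)

theta = np.zeros(3)
states = [np.random.randn(3) for _ in range(100)]
print(dpg_update_direction(theta, states))
```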
Deterministic Policy Gradient Theorem
Performance objective for a deterministic policy $\mu _\theta: S \rightarrow A$: $J(\mu _\theta) = \int_S \rho^\mu(s) r(s, \mu _\theta (s))ds = \mathbb{E} _{s \sim \rho^{\mu}}[r(s, \mu _\theta (s))]$
Deterministic policy gradient: $\nabla _\theta J(\mu _\theta) = \int_S \rho^\mu(s) \nabla _\theta \mu _\theta(s) \nabla_a Q^\mu(s,a) \vert _{a = \mu _\theta(s)}ds = \mathbb{E} _{s \sim \rho^{\mu}}[\nabla _\theta \mu _\theta(s) \nabla_a Q^\mu(s,a) \vert _{a = \mu _\theta(s)}]$
This is exactly the action-value gradient update rule from the previous section!
Limit of the Stochastic Policy Gradient
A stochastic policy can be parameterized as a deterministic policy plus a variance parameter; setting the variance to 0 recovers the deterministic policy. The paper proves that the deterministic policy gradient is the limit of the stochastic policy gradient as this variance tends to 0
Deterministic Actor-Critic Algorithms
On-Policy Deterministic Actor-Critic
Mainly useful if the environment's stochasticity provides sufficient exploration; otherwise use an off-policy actor-critic
Substitute a differentiable critic $Q^w$ in place of the true action-value function
Train the critic $Q^w$ using some form of TD error
New update rule: $\theta _{t+1} = \theta_t + \alpha _\theta \nabla _\theta \mu _\theta(s_t) \nabla_a Q^w (s_t, a_t) \vert _{a = \mu _\theta(s)}$
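A compact sketch of the resulting on-policy loop under simplifying assumptions (linear policy, a critic that is linear in state and action so its action-gradient is a single weight, Sarsa-style TD target); the `env` interface is hypothetical:

```python
import numpy as np

def on_policy_dac(env, dim_s, gamma=0.99, alpha_theta=1e-3, alpha_w=1e-2,
                  episodes=100):
    """Sketch of on-policy deterministic actor-critic.
    Policy: mu_theta(s) = theta^T s (scalar action).
    Critic: Q^w(s, a) = w_s^T s + w_a * a, so grad_a Q^w = w_a.
    `env` is an assumed interface: reset() -> s, step(a) -> (s', r, done)."""
    theta = np.zeros(dim_s)
    w_s, w_a = np.zeros(dim_s), 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = theta @ s                        # deterministic action
            s2, r, done = env.step(a)
            a2 = theta @ s2
            # Sarsa-style TD error on the critic
            q, q2 = w_s @ s + w_a * a, w_s @ s2 + w_a * a2
            delta = r + (0.0 if done else gamma * q2) - q
            # Critic update: grad_w Q^w(s, a) scaled by the TD error
            w_s += alpha_w * delta * s
            w_a += alpha_w * delta * a
            # Actor update: grad_theta mu(s) * grad_a Q^w(s, a) = s * w_a
            theta += alpha_theta * s * w_a
            s = s2
    return theta
```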
Off-Policy Deterministic Actor-Critic
Performance objective: the value function of the deterministic target policy, averaged over the state distribution of the stochastic behavior policy
$J _\beta (\mu _\theta) = \int _S \rho ^\beta (s) Q^\mu(s, \mu _\theta(s))ds$
Gradient (approximate): $\nabla _\theta J _\beta (\mu _\theta) \approx \int _S \rho^\beta (s) \nabla _\theta \mu _\theta(s) \nabla _a Q^\mu(s, a) \vert _{a = \mu _\theta(s)}ds = \mathbb{E} _{s \sim \rho^\beta} [\nabla _\theta \mu _\theta(s) \nabla _a Q^\mu(s, a) \vert _{a = \mu _\theta(s)}]$
A differentiable action-value function is used in place of the true action-value function; the actor update is the same as in the on-policy case, but actions come from the behavior policy
Compatible Function Approximation
Requirements for compatibility (deterministic case):
$\nabla_a Q^w(s,a) \vert _{a = \mu _\theta(s)} = \nabla _\theta \mu _\theta(s)^T w$: the critic's action-gradient is linear in features of the deterministic policy
Parameters $w$ chosen to minimize the MSE between $\nabla_a Q^w$ and $\nabla_a Q^\mu$ at $a = \mu _\theta(s)$
Substituting an arbitrary differentiable critic is not necessarily enough to follow the true action-value gradient
We want a compatible function approximator: the gradient of the true $Q^\mu$ can be replaced with the gradient of $Q^w$ without changing the policy gradient
For a deterministic policy, there always exists a compatible function approximator of the form
$Q^w(s,a) = (a - \mu _\theta(s))^T \nabla _\theta \mu _\theta(s)^T w + V^v(s)$
where $V^v$ is any baseline function independent of the action
The first term is interpreted as the advantage $A^w(s,a)$ (see the sketch at the end of this subsection)
Linear function approximators are good local critics, not good global ones
Represents the local advantage of deviating from the current deterministic policy by a small amount $\delta$
Local Advantage: $A^w(s, \mu _\theta(s) + \delta) = \delta^T \nabla _\theta \mu _\theta(s)^T w$
In principle $w$ can be found by linear regression with an MSE loss:
Features: $\phi(s,a)$, state-action features built from $\nabla _\theta \mu _\theta(s)$
Target: $\nabla _a Q^\mu (s,a) \vert _{a = \mu _\theta(s)}$
In practice this is difficult, because unbiased samples of the action-value gradient are hard to obtain
Instead, learn the Q value function by standard policy evaluation methods (e.g. Sarsa or Q-learning)
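A small sketch of the compatible critic's structure for a hypothetical linear policy; the choice of raw state features for the baseline $V^v$ is an assumption for illustration:

```python
import numpy as np

def mu(theta, s):
    return theta @ s                      # linear policy, scalar action

def compatible_q(theta, w, v, s, a):
    """Q^w(s,a) = (a - mu(s))^T grad_theta mu(s)^T w + V^v(s).
    For mu_theta(s) = theta^T s, grad_theta mu(s) = s, so the advantage
    term is (a - mu(s)) * (s @ w); V^v(s) = v^T s is an assumed baseline."""
    advantage = (a - mu(theta, s)) * (s @ w)
    baseline = v @ s
    return advantage + baseline

def local_advantage(theta, w, s, delta):
    # A^w(s, mu(s) + delta) = delta^T grad_theta mu(s)^T w = delta * (s @ w)
    return delta * (s @ w)
```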
Compatible Off-Policy Deterministic Actor Critic
Critic: a linear function approximator built from state-action features, $\phi(s,a) = a^T \nabla _\theta \mu _\theta(s)$; it can be learned using samples from an off-policy behavior policy
Actor: updates parameters in direction of critic’s action-value gradient
Update rules, with TD error $\delta_t = r_t + \gamma Q^w(s _{t+1}, \mu _\theta(s _{t+1})) - Q^w(s_t, a_t)$:
Actor: $\theta _{t+1} = \theta _{t} + \alpha _\theta \nabla _\theta \mu _\theta(s_t)(\nabla _\theta \mu _\theta (s_t)^T w)$
Critic: $w _{t+1} = w_t + \alpha_w \delta_t \phi(s_t,a_t)$
Value Function: $v _{t+1} = v_t + \alpha_v \delta_t \phi(s_t)$
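Putting the update rules together for a hypothetical linear policy, with the TD error defined above; the concrete feature choices ($\nabla _\theta \mu _\theta(s)(a - \mu _\theta(s))$ for the advantage weights and raw state features for $V^v$) are my reading of the compatible form rather than a verbatim transcription of the paper's pseudocode:

```python
import numpy as np

def copdac_q_step(theta, w, v, s, a, r, s2, gamma=0.99,
                  alpha_theta=1e-3, alpha_w=1e-2, alpha_v=1e-2):
    """One COPDAC-style update for a linear policy mu_theta(s) = theta^T s,
    where the behavior action `a` may come from any exploratory policy."""
    def q(state, action):
        # Compatible critic: advantage term + state-value baseline V^v(s) = v^T s
        return (action - theta @ state) * (state @ w) + v @ state

    # Q-learning-style TD error: bootstrap with the target policy's action
    delta = r + gamma * q(s2, theta @ s2) - q(s, a)
    phi_sa = s * (a - theta @ s)          # grad_theta mu(s) (a - mu(s)) = s (a - mu(s))
    # Actor: grad_theta mu(s) (grad_theta mu(s)^T w) = s * (s @ w)
    new_theta = theta + alpha_theta * s * (s @ w)
    w = w + alpha_w * delta * phi_sa      # advantage weights
    v = v + alpha_v * delta * s           # assumed state features phi(s) = s
    return new_theta, w, v
```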
Compatible Off-Policy Deterministic Actor Critic with Gradient Q Learning (COPDAC-GQ)
Builds on newer methods based on true gradient descent, i.e. gradient TD learning, which minimizes the mean squared projected Bellman error (MSPBE)
Uses two-time-scale step sizes so the critic is updated on a faster time scale than the actor (ensuring the critic converges to the MSPBE minimizer)
Natural policy gradients can be extended to deterministic policies
Fisher information matrix metric for deterministic policies: $M _\mu (\theta) = \mathbb{E} _{s \sim \rho^\mu}[\nabla _\theta \mu _\theta(s)\nabla _\theta \mu _\theta(s)^T]$
Policy gradient with the compatible function approximator: $\nabla _\theta J(\mu _\theta) = \mathbb{E} _{s \sim \rho^\mu}[\nabla _\theta \mu _\theta(s)\nabla _\theta \mu _\theta(s)^T w]$. The steepest-ascent (natural gradient) direction is then $M _\mu(\theta)^{-1}\nabla _\theta J _\beta(\mu _\theta) = w$
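A numerical sketch (again assuming a linear policy, so $\nabla _\theta \mu _\theta(s) = s$) checking that the natural-gradient direction reduces to $w$ under the compatible critic:

```python
import numpy as np

def natural_dpg_direction(states, w):
    """For mu_theta(s) = theta^T s, grad_theta mu(s) = s, so
    M_mu(theta) = E[s s^T].  With the compatible critic the vanilla
    gradient is E[s s^T] w, and the natural gradient M^{-1} E[s s^T] w = w."""
    S = np.stack(states)
    M = S.T @ S / len(states)              # metric E[grad mu grad mu^T]
    vanilla_grad = M @ w                   # E[grad mu grad mu^T w]
    return np.linalg.solve(M, vanilla_grad)

states = [np.random.randn(4) for _ in range(200)]
w = np.array([0.5, -1.0, 2.0, 0.1])
print(np.allclose(natural_dpg_direction(states, w), w))   # True
```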
Experiments
Continuous Bandit
Continuous bandit problem with high dimensional quadratic cost function
Compares the stochastic actor-critic (SAC) to COPDAC
SAC: isotropic Gaussian policy
COPDAC: fixed-width Gaussian behavior policy for exploration around the deterministic policy
Critic estimated by mapping features to costs
Critic recomputed each successive batch of 2 million steps
Actor updated once per batch
Evaluated via the average cost per step incurred by the policy mean
COPDAC outperforms SAC by a wide margin, and the margin grows as the action dimensionality increases
Continuous Reinforcement Learning
Mountain car, pendulum, and 2d puddle world tasks
COPDAC slightly outperforms SAC and OffPAC
Octopus Arm
COPDAC achieves good results on this environment
Discussion
With the stochastic policy gradient, the policy becomes more deterministic as it finds a good strategy, and the gradient becomes harder to estimate because it changes rapidly near the mean
The deterministic actor-critic is similar to Q-learning, which learns a deterministic greedy policy off-policy while executing a noisy version of that policy; COPDAC does the same thing