Deterministic Policy Gradient Algorithms
Introduction
Policy gradient algorithms adjust the policy parameters in the direction of greater expected cumulative reward, sampling actions stochastically from the policy
We want deterministic policies trained with the same approach as stochastic policy gradients; this turns out to be the stochastic case in the limit as the policy variance tends to 0!
Stochastic policy gradient = integrate over state and action space (requires more samples)
Deterministic policy gradient = integrate over state space
Stochasticity enables exploration in stochastic policy gradients; for deterministic policies, use off-policy learning with a stochastic behavior policy for exploration
Introduce compatible function approximation, which ensures the critic approximation does not bias the policy gradient
Background
Preliminaries
Standard MDP setting; the objective is to maximize the expected cumulative reward
Stochastic Policy Gradient Theorem
Policy gradient theorem: $\nabla _\theta J(\pi _\theta) = \int_S \rho^\pi (s) \int_A \nabla _\theta \pi _\theta(a \vert s) Q^\pi(s,a)\, da\, ds = \mathbb{E} _{s \sim \rho^\pi, a \sim \pi _{\theta}}[\nabla _\theta \log \pi _\theta(a \vert s) Q^\pi(s,a)]$
Sampled returns can be used to estimate the Q value function
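As a concrete illustration (not from the paper), a minimal sketch of this score-function estimator for a hypothetical linear-Gaussian policy, where sampled Monte Carlo returns stand in for $Q^\pi(s,a)$; the policy form and `sigma` are assumptions for illustration:

```python
import numpy as np

def stochastic_pg_estimate(theta, states, actions, returns, sigma=0.5):
    """Monte Carlo estimate of the stochastic policy gradient for a
    linear-Gaussian policy pi_theta(a|s) = N(theta^T s, sigma^2).

    grad_theta log pi(a|s) = (a - theta^T s) * s / sigma^2, and the
    sampled return G stands in for Q^pi(s, a)."""
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        score = (a - theta @ s) * s / sigma**2   # grad_theta log pi(a|s)
        grad += score * G                        # weight score by return
    return grad / len(states)
```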
Stochastic Actor-Critic Algorithms
Two components:
Actor adjusts the parameters $\theta$ of a stochastic policy, $\pi _\theta(s)$
Critic estimates the action-value function $Q^w(s,a)$ via temporal-difference learning
Using a function approximator for the critic (rather than empirical returns) can introduce bias
Compatibility requirements that ensure no bias:
$Q^w(s,a) = \nabla _\theta \log \pi _\theta(a\vert s)^Tw$: linear in features of stochastic policy
Parameters $w$ chosen to minimize the mean-squared error between $Q^w$ and $Q^\pi$, i.e. linear regression of Q onto these features
If both compatibility conditions hold, the overall algorithm is equivalent to using no critic at all
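A minimal sketch (same hypothetical linear-Gaussian policy as above) of fitting such a compatible critic by least squares; sampled returns are used as noisy, unbiased targets for $Q^\pi$, which is an assumption for illustration:

```python
import numpy as np

def score_features(theta, s, a, sigma=0.5):
    # Features of the compatible critic: grad_theta log pi_theta(a|s)
    return (a - theta @ s) * s / sigma**2

def fit_compatible_critic(theta, states, actions, returns, sigma=0.5):
    """Least-squares fit of Q^w(s,a) = score_features(s,a)^T w to sampled
    returns, approximately minimizing the MSE required by compatibility."""
    X = np.stack([score_features(theta, s, a, sigma)
                  for s, a in zip(states, actions)])
    y = np.asarray(returns)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w   # Q^w(s,a) = score_features(theta, s, a) @ w
```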
Off-Policy Actor-Critic
Objective modified to be the value function of the target policy averaged over the state distribution of the behavior policy:
$J _\beta (\pi _\theta) = \int_S \int_A \rho^\beta(s) \pi _\theta(a\vert s) Q^\pi(s,a)dads$
Off-policy actor-critic gradient (approximate): $\nabla _\theta J _\beta (\pi _\theta) \approx \int_S \int_A \rho^\beta(s) \nabla _\theta \pi _\theta(a\vert s) Q^\pi(s,a)\,da\,ds = \mathbb{E} _{s \sim \rho^\beta, a \sim \beta}[\frac{\pi _\theta(a \vert s)}{\beta(a \vert s)} \nabla _\theta \log \pi _\theta (a \vert s)Q^\pi(s,a)]$
In practice, the temporal-difference error can be used in place of the true Q value function
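A sketch of a single importance-weighted actor update under these definitions; the density and score-function helpers passed in are assumptions, and the TD error is used as the critic signal as noted above:

```python
def offpac_actor_step(theta, s, a, td_error, pi_density, beta_density,
                      grad_log_pi, alpha=1e-2):
    """One off-policy actor update: weight the score-function gradient by
    the importance ratio pi_theta(a|s) / beta(a|s), using the TD error in
    place of Q^pi(s, a)."""
    rho = pi_density(theta, s, a) / beta_density(s, a)   # importance weight
    return theta + alpha * rho * grad_log_pi(theta, s, a) * td_error
```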
Gradients of Deterministic Policies
Action-Value Gradients
Policy evaluation methods estimate the action-value function via TD learning or Monte Carlo evaluation
Policy improvement methods update the policy via greedy maximization with respect to the action-value function. In continuous action spaces greedy maximization is problematic; instead, move the policy in the direction of the gradient of Q
Each state suggests different direction; take the expectation over the state distribution: $\theta^{k+1} = \theta^{k} + \alpha \mathbb{E} _{s \sim \rho^{\mu^k}}[\nabla _\theta Q^{\mu^k}(s,\mu _\theta(s))]$
Can decompose the policy gradient (via the chain rule) into the gradient of the action value with respect to the action and the gradient of the policy with respect to its parameters: $\theta^{k+1} = \theta^{k} + \alpha \mathbb{E} _{s \sim \rho^{\mu^k}}[\nabla _\theta \mu _\theta(s) \nabla_a Q^{\mu^k}(s,a) \vert _{a = \mu _\theta(s)}]$
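A toy check of this decomposition, with a made-up linear policy and a quadratic Q whose action-gradient is known in closed form:

```python
import numpy as np

# Toy setup: linear policy mu_theta(s) = theta^T s (scalar action),
# quadratic critic Q(s, a) = -(a - s.sum())**2 with known action-gradient.
def mu(theta, s):
    return theta @ s

def dQ_da(s, a):
    return -2.0 * (a - s.sum())          # analytic grad_a Q(s, a)

def dpg_update_direction(theta, states):
    """Average over sampled states of grad_theta mu(s) * grad_a Q(s, a)|_{a=mu(s)}.
    For this linear policy, grad_theta mu(s) = s."""
    return np.mean([s * dQ_da(s, mu(theta, s)) for s in states], axis=0)

theta = np.zeros(3)
states = [np.random.randn(3) for _ in range(100)]
print(dpg_update_direction(theta, states))
```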
Deterministic Policy Gradient Theorem
Performance objective for a deterministic policy $\mu _\theta: S \rightarrow A$: $J(\mu _\theta) = \int_S \rho^\mu(s) r(s, \mu _\theta (s))ds = \mathbb{E} _{s \sim \rho^{\mu}}[r(s, \mu _\theta (s))]$
Deterministic policy gradient: $\nabla _\theta J(\mu _\theta) = \int_S \rho^\mu(s) \nabla _\theta \mu _\theta(s) \nabla_a Q^\mu(s,a) \vert _{a = \mu _\theta(s)}ds = \mathbb{E} _{s \sim \rho^{\mu}}[\nabla _\theta \mu _\theta(s) \nabla_a Q^\mu(s,a) \vert _{a = \mu _\theta(s)}]$
This is exactly the action-value gradient update rule from the previous section!
Limit of the Stochastic Policy Gradient
A stochastic policy can be parameterized as a deterministic policy plus a variance parameter; setting the variance to 0 recovers the deterministic policy. The paper proves that the deterministic policy gradient is the limit of the stochastic policy gradient as this variance tends to 0
Deterministic Actor-Critic Algorithms
On-Policy Deterministic Actor-Critic
Mainly useful if the environment's stochasticity provides sufficient exploration; otherwise use an off-policy actor-critic
Substitute a differentiable critic $Q^w$ in place of the true action-value function
Train the critic $Q^w$ using some form of TD error
New update rule: $\theta _{t+1} = \theta_t + \alpha _\theta \nabla _\theta \mu _\theta(s_t) \nabla_a Q^w (s_t, a_t) \vert _{a = \mu _\theta(s)}$
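A compact sketch of the resulting on-policy loop under simplifying assumptions (linear policy, a critic that is linear in state and action so its action-gradient is a single weight, Sarsa-style TD target); the `env` interface is hypothetical:

```python
import numpy as np

def on_policy_dac(env, dim_s, gamma=0.99, alpha_theta=1e-3, alpha_w=1e-2,
                  episodes=100):
    """Sketch of on-policy deterministic actor-critic.
    Policy: mu_theta(s) = theta^T s (scalar action).
    Critic: Q^w(s, a) = w_s^T s + w_a * a, so grad_a Q^w = w_a.
    `env` is an assumed interface: reset() -> s, step(a) -> (s', r, done)."""
    theta = np.zeros(dim_s)
    w_s, w_a = np.zeros(dim_s), 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = theta @ s                        # deterministic action
            s2, r, done = env.step(a)
            a2 = theta @ s2
            # Sarsa-style TD error on the critic
            q, q2 = w_s @ s + w_a * a, w_s @ s2 + w_a * a2
            delta = r + (0.0 if done else gamma * q2) - q
            # Critic update: grad_w Q^w(s, a) scaled by the TD error
            w_s += alpha_w * delta * s
            w_a += alpha_w * delta * a
            # Actor update: grad_theta mu(s) * grad_a Q^w(s, a) = s * w_a
            theta += alpha_theta * s * w_a
            s = s2
    return theta
```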
Off-Policy Deterministic Actor-Critic
Performance objective: the value function of the deterministic target policy, averaged over the state distribution of the stochastic behavior policy
$J _\beta (\mu _\theta) = \int _S \rho ^\beta (s) Q^\mu(s, \mu _\theta(s))ds$
Gradient (approximate): $\nabla _\theta J _\beta (\mu _\theta) \approx \int _S \rho^\beta (s) \nabla _\theta \mu _\theta(s) \nabla _a Q^\mu(s, a) \vert _{a = \mu _\theta(s)}ds = \mathbb{E} _{s \sim \rho^\beta} [\nabla _\theta \mu _\theta(s) \nabla _a Q^\mu(s, a) \vert _{a = \mu _\theta(s)}]$
A differentiable action-value function is used in place of the true action-value function; the actor update is the same as in the on-policy case, but actions come from the behavior policy
Compatible Function Approximation
Requirements for compatibility (deterministic case):
$\nabla_a Q^w(s,a) \vert _{a = \mu _\theta(s)} = \nabla _\theta \mu _\theta(s)^T w$: the critic's action-gradient is linear in features of the deterministic policy
Parameters $w$ chosen to minimize the MSE between $\nabla_a Q^w$ and $\nabla_a Q^\mu$ at $a = \mu _\theta(s)$
Substituting an arbitrary differentiable critic is not necessarily enough to follow the true action-value gradient
We want a compatible function approximator: the gradient of the true $Q^\mu$ can be replaced with the gradient of $Q^w$ without changing the policy gradient
For a deterministic policy, there always exists a compatible function approximator of the form
$Q^w(s,a) = (a - \mu _\theta(s))^T \nabla _\theta \mu _\theta(s)^T w + V^v(s)$
where $V^v$ is any baseline function independent of the action
The first term is interpreted as the advantage $A^w(s,a)$ (see the sketch at the end of this subsection)
Linear function approximators are good local critics, not good global ones
Represents the local advantage of deviating from the current deterministic policy by a small amount $\delta$
Local Advantage: $A^w(s, \mu _\theta(s) + \delta) = \delta^T \nabla _\theta \mu _\theta(s)^T w$
In principle $w$ can be found by linear regression with an MSE loss:
Features: $\phi(s,a)$, state-action features built from $\nabla _\theta \mu _\theta(s)$
Target: $\nabla _a Q^\mu (s,a) \vert _{a = \mu _\theta(s)}$
In practice this is difficult, because unbiased samples of the action-value gradient are hard to obtain
Instead, learn the Q value function by standard policy evaluation methods (e.g. Sarsa or Q-learning)
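A small sketch of the compatible critic's structure for a hypothetical linear policy; the choice of raw state features for the baseline $V^v$ is an assumption for illustration:

```python
import numpy as np

def mu(theta, s):
    return theta @ s                      # linear policy, scalar action

def compatible_q(theta, w, v, s, a):
    """Q^w(s,a) = (a - mu(s))^T grad_theta mu(s)^T w + V^v(s).
    For mu_theta(s) = theta^T s, grad_theta mu(s) = s, so the advantage
    term is (a - mu(s)) * (s @ w); V^v(s) = v^T s is an assumed baseline."""
    advantage = (a - mu(theta, s)) * (s @ w)
    baseline = v @ s
    return advantage + baseline

def local_advantage(theta, w, s, delta):
    # A^w(s, mu(s) + delta) = delta^T grad_theta mu(s)^T w = delta * (s @ w)
    return delta * (s @ w)
```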
Compatible Off-Policy Deterministic Actor Critic
Critic: a linear function approximator built from state-action features, $\phi(s,a) = a^T \nabla _\theta \mu _\theta(s)$; it can be learned using samples from an off-policy behavior policy
Actor: updates parameters in direction of critic’s action-value gradient
Update rules, with TD error $\delta_t = r_t + \gamma Q^w(s _{t+1}, \mu _\theta(s _{t+1})) - Q^w(s_t, a_t)$:
Actor: $\theta _{t+1} = \theta _{t} + \alpha _\theta \nabla _\theta \mu _\theta(s_t)(\nabla _\theta \mu _\theta (s_t)^T w)$
Critic: $w _{t+1} = w_t + \alpha_w \delta_t \phi(s_t,a_t)$
Value Function: $v _{t+1} = v_t + \alpha_v \delta_t \phi(s_t)$
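Putting the update rules together for a hypothetical linear policy, with the TD error defined above; the concrete feature choices ($\nabla _\theta \mu _\theta(s)(a - \mu _\theta(s))$ for the advantage weights and raw state features for $V^v$) are my reading of the compatible form rather than a verbatim transcription of the paper's pseudocode:

```python
import numpy as np

def copdac_q_step(theta, w, v, s, a, r, s2, gamma=0.99,
                  alpha_theta=1e-3, alpha_w=1e-2, alpha_v=1e-2):
    """One COPDAC-style update for a linear policy mu_theta(s) = theta^T s,
    where the behavior action `a` may come from any exploratory policy."""
    def q(state, action):
        # Compatible critic: advantage term + state-value baseline V^v(s) = v^T s
        return (action - theta @ state) * (state @ w) + v @ state

    # Q-learning-style TD error: bootstrap with the target policy's action
    delta = r + gamma * q(s2, theta @ s2) - q(s, a)
    phi_sa = s * (a - theta @ s)          # grad_theta mu(s) (a - mu(s)) = s (a - mu(s))
    # Actor: grad_theta mu(s) (grad_theta mu(s)^T w) = s * (s @ w)
    new_theta = theta + alpha_theta * s * (s @ w)
    w = w + alpha_w * delta * phi_sa      # advantage weights
    v = v + alpha_v * delta * s           # assumed state features phi(s) = s
    return new_theta, w, v
```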
Compatible Off-Policy Deterministic Actor Critic with Gradient Q Learning (COPDAC-GQ)
Builds on newer methods based on true gradient descent, i.e. gradient TD learning, which minimizes the mean squared projected Bellman error (MSPBE)
Uses two-time-scale step sizes so the critic is updated on a faster time scale than the actor (ensuring the critic converges to the MSPBE minimizer)
Natural policy gradients can be extended to deterministic policies
Fisher information matrix metric for deterministic policies: $M _\mu (\theta) = \mathbb{E} _{s \sim \rho^\mu}[\nabla _\theta \mu _\theta(s)\nabla _\theta \mu _\theta(s)^T]$
Policy gradient with the compatible function approximator: $\nabla _\theta J(\mu _\theta) = \mathbb{E} _{s \sim \rho^\mu}[\nabla _\theta \mu _\theta(s)\nabla _\theta \mu _\theta(s)^T w]$. The steepest-ascent (natural gradient) direction is then $M _\mu(\theta)^{-1}\nabla _\theta J _\beta(\mu _\theta) = w$
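A numerical sketch (again assuming a linear policy, so $\nabla _\theta \mu _\theta(s) = s$) checking that the natural-gradient direction reduces to $w$ under the compatible critic:

```python
import numpy as np

def natural_dpg_direction(states, w):
    """For mu_theta(s) = theta^T s, grad_theta mu(s) = s, so
    M_mu(theta) = E[s s^T].  With the compatible critic the vanilla
    gradient is E[s s^T] w, and the natural gradient M^{-1} E[s s^T] w = w."""
    S = np.stack(states)
    M = S.T @ S / len(states)              # metric E[grad mu grad mu^T]
    vanilla_grad = M @ w                   # E[grad mu grad mu^T w]
    return np.linalg.solve(M, vanilla_grad)

states = [np.random.randn(4) for _ in range(200)]
w = np.array([0.5, -1.0, 2.0, 0.1])
print(np.allclose(natural_dpg_direction(states, w), w))   # True
```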
Experiments
Continuous Bandit
Continuous bandit problem with high dimensional quadratic cost function
Compares the stochastic actor-critic (SAC) to COPDAC
SAC: isotropic Gaussian policy
COPDAC: fixed-width Gaussian behavior policy for exploration around the deterministic policy
Critic estimated by mapping features to costs
Critic recomputed each successive batch of 2 million steps
Actor updated once per batch
Evaluated via the average cost per step incurred by the policy mean
COPDAC outperforms SAC by a wide margin, and the margin grows as the action dimensionality increases
Continuous Reinforcement Learning
Mountain car, pendulum, and 2d puddle world tasks
COPDAC slightly outperforms SAC and OffPAC
Octopus Arm
COPDAC achieves good results on this environment
Discussion
With the stochastic policy gradient, the policy becomes more deterministic as it finds a good strategy, and the gradient becomes harder to estimate because it changes rapidly near the mean
The deterministic actor-critic is similar to Q-learning, which learns a deterministic greedy policy off-policy while executing a noisy version of that policy; COPDAC does the same thing