In discrete action spaces, Q-learning with function approximation is known to produce overestimated value estimates
Similar issues in actor-critic
Overestimation is caused by noisy value estimates from function approximation combined with bootstrapping in TD learning
This error accumulates over time through the bootstrapped targets
With Double DQN, we use a separate target value function for the value estimate
Slow-changing policies in actor-critic (the critic is updated far more frequently than the policy meaningfully changes) make the current and target value estimates too similar for an independent estimate
An older variant, Double Q-learning, trains two critics independently
Less bias but higher variance; unbiased estimates with high variance can still produce overestimation in future value targets
Instead, use Clipped Double Q-learning, which builds on the idea that a value estimate suffering from overestimation bias can serve as an approximate upper bound on the true value
This favors underestimation, which does not get propagated through learning, since the policy avoids actions with low value estimates
To address noise variance, use target networks
To address the coupling of policy and value updates, delay policy updates until the value estimate has converged
Include a SARSA-like regularization (target policy smoothing: bootstrap from similar actions by adding clipped noise to the target action) for variance reduction
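As a rough sketch of how these pieces combine when forming the critic's learning target (PyTorch-style; names such as `actor_target`, `critic1_target`, `noise_std` are illustrative assumptions, not from the notes):

```python
import torch

def td3_critic_target(reward, next_state, done,
                      actor_target, critic1_target, critic2_target,
                      gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Sketch of a critic target combining the ideas above."""
    with torch.no_grad():
        # Target policy smoothing (the SARSA-like regularization): add clipped,
        # zero-mean noise to the target action so the target is smoothed over
        # a small neighborhood of similar actions.
        a_next = actor_target(next_state)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-max_action, max_action)

        # Clipped Double Q-learning: elementwise minimum of the two target
        # critics, so neither critic's overestimation enters the target.
        q_next = torch.min(critic1_target(next_state, a_next),
                           critic2_target(next_state, a_next))

        # Bootstrapped TD target built from slow-moving target networks;
        # delayed (less frequent) policy updates happen outside this function.
        return reward + gamma * (1.0 - done) * q_next
```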
Twin Delayed Deep Deterministic Policy Gradient (TD3): an actor-critic algorithm that accounts for function approximation error in both policy and value updates
Background
Consider a standard reinforcement learning setting
In actor-critic, we can use the deterministic policy gradient theorem for the actor: $\nabla_\phi J(\phi) = \mathbb{E}_{s \sim p_\pi}[\nabla_a Q^\pi (s,a) \vert_{a = \pi(s)} \nabla_\phi \pi_{\phi}(s)]$
In Q-learning, update the value function with TD learning toward a bootstrapped target
Update the target network weights either with soft updates every step ($\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$) or a hard copy every $k$ steps
Make training off-policy by sampling mini-batches from an experience replay buffer
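A compact sketch of one actor-critic (DDPG-style) update under these conventions; assumes PyTorch and that `actor`, `critic`, their target copies, their optimizers, and a replay-buffer `batch` already exist (all names illustrative):

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    # Transitions sampled from an experience replay buffer (off-policy).
    state, action, reward, next_state, done = batch

    # Critic: TD learning toward a bootstrapped target built from target networks.
    with torch.no_grad():
        target_q = reward + gamma * (1.0 - done) * critic_target(next_state, actor_target(next_state))
    critic_loss = F.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend Q_theta(s, pi_phi(s)).
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) target updates every step; a hard copy every k steps is the alternative.
    for p, p_targ in zip(critic.parameters(), critic_target.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
    for p, p_targ in zip(actor.parameters(), actor_target.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
```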
Overestimation Bias
In Q-learning, the learning target takes a maximum over all actions: $y = r + \gamma \max_{a'} Q(s', a')$
If the Q-function estimates contain error $\epsilon$, the maximum over the noisy estimates is, in expectation, at least the true maximum: $\mathbb{E}_\epsilon[\max_{a'}(Q(s', a') + \epsilon)] \geq \max_{a'} Q(s', a')$
This means that even if the error is zero-mean, the targets are systematically overestimated
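A quick numerical illustration (hypothetical numbers, not from the notes): four actions with equal true value 1.0 and zero-mean estimation noise still give an inflated maximum on average.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0, 1.0, 1.0])          # true action values; true max = 1.0
noise = rng.normal(0.0, 0.5, size=(100_000, 4))  # zero-mean estimation error

noisy_max = np.max(true_q + noise, axis=1)       # max over noisy estimates, per trial
print(noisy_max.mean())                          # ~1.5 > 1.0: overestimation despite unbiased noise
```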
Overestimation Bias in Actor-Critic
Because the policy update follows the gradient of the approximate critic (a local maximizer of $Q_\theta$), for a sufficiently small step size the updated policy $\pi_{approx}$ scores at least as well under $Q_\theta$ as the policy $\pi_{true}$ obtained from updating with the true critic: $\mathbb{E}[Q_\theta(s, \pi_{approx}(s))] \geq \mathbb{E}[Q_\theta(s, \pi_{true}(s))]$
To the approximate critic, it looks as though the policy improved
However, the policy may not have truly improved; under the true value function the relationship reverses: $\mathbb{E}[Q^\pi(s, \pi_{approx}(s))] \leq \mathbb{E}[Q^\pi(s, \pi_{true}(s))]$
To the environment (in terms of true returns), the policy did not improve
If, in addition, the critic at least matches the true value at $\pi_{true}$, i.e. $\mathbb{E}[Q_\theta(s, \pi_{true}(s))] \geq \mathbb{E}[Q^\pi(s, \pi_{true}(s))]$, then the value estimate of the updated policy is overestimated: $\mathbb{E}[Q_\theta(s, \pi_{approx}(s))] \geq \mathbb{E}[Q^\pi(s, \pi_{approx}(s))]$
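Chaining the three inequalities (the middle one is the assumption above) makes the overestimation explicit: $\mathbb{E}[Q_\theta(s, \pi_{approx}(s))] \geq \mathbb{E}[Q_\theta(s, \pi_{true}(s))] \geq \mathbb{E}[Q^\pi(s, \pi_{true}(s))] \geq \mathbb{E}[Q^\pi(s, \pi_{approx}(s))]$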
Clipped Double Q-Learning for Actor-Critic
In Double DQN, the greedy action is selected by the current network while its value is estimated by the separate target network
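A minimal sketch of that target computation (PyTorch-style; `q_net`, `q_target`, and the batch shapes are illustrative assumptions):

```python
import torch

def double_dqn_target(reward, next_state, done, q_net, q_target, gamma=0.99):
    with torch.no_grad():
        # Action selection with the current (online) network...
        best_action = q_net(next_state).argmax(dim=1, keepdim=True)
        # ...value estimation with the separate target network.
        next_q = q_target(next_state).gather(1, best_action)
        return reward + gamma * (1.0 - done) * next_q
```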
Analogously, in actor-critic we could use the current policy rather than the target policy when computing the critics' learning target
However, in actor-critic the policy changes too slowly, so the current and target networks remain too similar to give an independent estimate
Instead, use the original Double Q-learning formulation with a pair of actors and a pair of critics, whose targets are:
$y_1 = r + \gamma Q_{\theta_2'}(s', \pi_{\phi_1}(s'))$
$y_2 = r + \gamma Q_{\theta_1'}(s', \pi_{\phi_2}(s'))$
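A sketch of these two targets (PyTorch-style, illustrative names): each actor's chosen action is evaluated by the opposite target critic.

```python
import torch

def double_q_actor_critic_targets(reward, next_state, done,
                                  actor1, actor2,
                                  critic1_target, critic2_target, gamma=0.99):
    with torch.no_grad():
        # y1: actor 1 picks the action, the opposite target critic (theta_2') scores it.
        y1 = reward + gamma * (1.0 - done) * critic2_target(next_state, actor1(next_state))
        # y2: actor 2 picks the action, target critic theta_1' scores it.
        y2 = reward + gamma * (1.0 - done) * critic1_target(next_state, actor2(next_state))
        return y1, y2
```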
This gives less overestimation than DDPG but does not eliminate it
The bias from the policy update is avoided because $\pi_{\phi_1}$ optimizes with respect to $Q_{\theta_1}$ while the target update of $Q_{\theta_1}$ uses an independent estimate ($Q_{\theta_2'}$)
The critics are not fully independent, however: each uses the opposite critic in its target values, and both are trained from the same replay buffer
As a result, for some states the opposite critic's estimate is larger, i.e. $Q_{\theta_2}(s, \pi_{\phi_1}(s)) > Q_{\theta_1}(s, \pi_{\phi_1}(s))$
This is problematic because $Q_{\theta_1}(s, \pi_{\phi_1}(s))$ generally overestimates the true value, and in certain regions of the state space the overestimation becomes further exaggerated
To avoid this overestimation, take the minimum of the two estimates when forming the target update
Clipped Double Q-learning target update: $y_1 = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \pi_{\phi_1}(s'))$
With this target, the value estimate introduces no additional overestimation beyond the standard Q-learning target
It may induce an underestimation bias, but this is preferable because underestimated actions are not propagated through the policy update
Minimum operator also provides higher value to states with lower variance estimation error
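A sketch of this clipped target (PyTorch-style, illustrative names): a single actor $\pi_{\phi_1}$ selects the action, and the minimum over both target critics scores it.

```python
import torch

def clipped_double_q_target(reward, next_state, done,
                            actor1, critic1_target, critic2_target, gamma=0.99):
    with torch.no_grad():
        next_action = actor1(next_state)
        # The larger of the two estimates is treated as an overestimate, so the
        # minimum is used as the (at worst underestimating) value for the target.
        min_q = torch.min(critic1_target(next_state, next_action),
                          critic2_target(next_state, next_action))
        return reward + gamma * (1.0 - done) * min_q
```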
Addressing Variance
Accumulating Error
When training the value function with function approximation, each update leaves a residual TD error $\delta(s,a)$: $Q_\theta(s,a) = r + \gamma \mathbb{E}[Q_\theta(s', a')] - \delta(s,a)$
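Expanding this Bellman backup recursively (a worked expansion consistent with the definitions above) shows the estimate equals the expected return minus an accumulating discounted sum of future TD errors: $Q_\theta(s_t, a_t) = r_t + \gamma \mathbb{E}[Q_\theta(s_{t+1}, a_{t+1})] - \delta_t = \mathbb{E}_{s_i \sim p_\pi}\left[\sum_{i=t}^{T} \gamma^{i-t}(r_i - \delta_i)\right]$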