In discrete action spaces, Q-learning with function approximation is known to produce overestimated value estimates
Similar issues in actor-critic
Overestimation is caused by noisy value estimates from function approximation combined with bootstrapping in TD learning
This error accumulates over time through the bootstrapped targets
With Double DQN, we use a separate target value function for the value estimate
Slow-changing policies in actor-critic (the critic is updated far more frequently than the policy meaningfully changes) make the current and target value estimates too similar for an independent estimate
An older variant, Double Q-learning, trains two critics independently
Less bias but higher variance; unbiased estimates with high variance can still produce overestimation in future value targets
Instead, use Clipped Double Q-learning, which builds on the idea that a value estimate suffering from overestimation bias can serve as an approximate upper bound on the true value
This favors underestimation, which does not get propagated through learning, since the policy avoids actions with low value estimates
To address noise variance, use target networks
To address the coupling of policy and value updates, delay policy updates until the value estimate has converged
Include a SARSA-like regularization (target policy smoothing: bootstrap from similar actions by adding clipped noise to the target action) for variance reduction
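As a rough sketch of how these pieces combine when forming the critic's learning target (PyTorch-style; names such as `actor_target`, `critic1_target`, `noise_std` are illustrative assumptions, not from the notes):

```python
import torch

def td3_critic_target(reward, next_state, done,
                      actor_target, critic1_target, critic2_target,
                      gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Sketch of a critic target combining the ideas above."""
    with torch.no_grad():
        # Target policy smoothing (the SARSA-like regularization): add clipped,
        # zero-mean noise to the target action so the target is smoothed over
        # a small neighborhood of similar actions.
        a_next = actor_target(next_state)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-max_action, max_action)

        # Clipped Double Q-learning: elementwise minimum of the two target
        # critics, so neither critic's overestimation enters the target.
        q_next = torch.min(critic1_target(next_state, a_next),
                           critic2_target(next_state, a_next))

        # Bootstrapped TD target built from slow-moving target networks;
        # delayed (less frequent) policy updates happen outside this function.
        return reward + gamma * (1.0 - done) * q_next
```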
Twin Delayed Deep Deterministic Policy Gradient (TD3): an actor-critic algorithm that accounts for function approximation error in both policy and value updates
Background
Consider a standard reinforcement learning setting
In actor-critic, we can use the deterministic policy gradient theorem for the actor: $\nabla_\phi J(\phi) = \mathbb{E}_{s \sim p_\pi}[\nabla_a Q^\pi (s,a) \vert_{a = \pi(s)} \nabla_\phi \pi_{\phi}(s)]$
In Q-learning, update the value function with TD learning toward a bootstrapped target
Update the target network weights either with soft updates every step ($\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$) or a hard copy every $k$ steps
Make training off-policy by sampling mini-batches from an experience replay buffer
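A compact sketch of one actor-critic (DDPG-style) update under these conventions; assumes PyTorch and that `actor`, `critic`, their target copies, their optimizers, and a replay-buffer `batch` already exist (all names illustrative):

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    # Transitions sampled from an experience replay buffer (off-policy).
    state, action, reward, next_state, done = batch

    # Critic: TD learning toward a bootstrapped target built from target networks.
    with torch.no_grad():
        target_q = reward + gamma * (1.0 - done) * critic_target(next_state, actor_target(next_state))
    critic_loss = F.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend Q_theta(s, pi_phi(s)).
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) target updates every step; a hard copy every k steps is the alternative.
    for p, p_targ in zip(critic.parameters(), critic_target.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
    for p, p_targ in zip(actor.parameters(), actor_target.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
```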
Overestimation Bias
In Q-learning, the learning target takes a maximum over all actions: $y = r + \gamma \max_{a'} Q(s', a')$
If the Q-function estimates contain error $\epsilon$, the maximum over the noisy estimates is, in expectation, at least the true maximum: $\mathbb{E}_\epsilon[\max_{a'}(Q(s', a') + \epsilon)] \geq \max_{a'} Q(s', a')$
This means that even if the error is zero-mean, the targets are systematically overestimated
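A quick numerical illustration (hypothetical numbers, not from the notes): four actions with equal true value 1.0 and zero-mean estimation noise still give an inflated maximum on average.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0, 1.0, 1.0])          # true action values; true max = 1.0
noise = rng.normal(0.0, 0.5, size=(100_000, 4))  # zero-mean estimation error

noisy_max = np.max(true_q + noise, axis=1)       # max over noisy estimates, per trial
print(noisy_max.mean())                          # ~1.5 > 1.0: overestimation despite unbiased noise
```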
Overestimation Bias in Actor-Critic
Because the policy update follows the gradient of the approximate critic (a local maximizer of $Q_\theta$), for a sufficiently small step size the updated policy $\pi_{approx}$ scores at least as well under $Q_\theta$ as the policy $\pi_{true}$ obtained from updating with the true critic: $\mathbb{E}[Q_\theta(s, \pi_{approx}(s))] \geq \mathbb{E}[Q_\theta(s, \pi_{true}(s))]$
To the approximate critic, it looks as though the policy improved
However, the policy may not have truly improved; under the true value function the relationship reverses: $\mathbb{E}[Q^\pi(s, \pi_{approx}(s))] \leq \mathbb{E}[Q^\pi(s, \pi_{true}(s))]$
To the environment (in terms of true returns), the policy did not improve
If, in addition, the critic at least matches the true value at $\pi_{true}$, i.e. $\mathbb{E}[Q_\theta(s, \pi_{true}(s))] \geq \mathbb{E}[Q^\pi(s, \pi_{true}(s))]$, then the value estimate of the updated policy is overestimated: $\mathbb{E}[Q_\theta(s, \pi_{approx}(s))] \geq \mathbb{E}[Q^\pi(s, \pi_{approx}(s))]$
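Chaining the three inequalities (the middle one is the assumption above) makes the overestimation explicit: $\mathbb{E}[Q_\theta(s, \pi_{approx}(s))] \geq \mathbb{E}[Q_\theta(s, \pi_{true}(s))] \geq \mathbb{E}[Q^\pi(s, \pi_{true}(s))] \geq \mathbb{E}[Q^\pi(s, \pi_{approx}(s))]$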
Clipped Double Q-Learning for Actor-Critic
In Double DQN, the greedy action is selected by the current network while its value is estimated by the separate target network
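A minimal sketch of that target computation (PyTorch-style; `q_net`, `q_target`, and the batch shapes are illustrative assumptions):

```python
import torch

def double_dqn_target(reward, next_state, done, q_net, q_target, gamma=0.99):
    with torch.no_grad():
        # Action selection with the current (online) network...
        best_action = q_net(next_state).argmax(dim=1, keepdim=True)
        # ...value estimation with the separate target network.
        next_q = q_target(next_state).gather(1, best_action)
        return reward + gamma * (1.0 - done) * next_q
```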
Analogously, in actor-critic we could use the current policy rather than the target policy when computing the critics' learning target
However, in actor-critic the policy changes too slowly, so the current and target networks remain too similar to give an independent estimate
Instead, use the original Double Q-learning formulation with a pair of actors and a pair of critics, whose targets are:
$y_1 = r + \gamma Q_{\theta_2'}(s', \pi_{\phi_1}(s'))$
$y_2 = r + \gamma Q_{\theta_1'}(s', \pi_{\phi_2}(s'))$
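A sketch of these two targets (PyTorch-style, illustrative names): each actor's chosen action is evaluated by the opposite target critic.

```python
import torch

def double_q_actor_critic_targets(reward, next_state, done,
                                  actor1, actor2,
                                  critic1_target, critic2_target, gamma=0.99):
    with torch.no_grad():
        # y1: actor 1 picks the action, the opposite target critic (theta_2') scores it.
        y1 = reward + gamma * (1.0 - done) * critic2_target(next_state, actor1(next_state))
        # y2: actor 2 picks the action, target critic theta_1' scores it.
        y2 = reward + gamma * (1.0 - done) * critic1_target(next_state, actor2(next_state))
        return y1, y2
```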
This gives less overestimation than DDPG but does not eliminate it
The bias from the policy update is avoided because $\pi_{\phi_1}$ optimizes with respect to $Q_{\theta_1}$ while the target update of $Q_{\theta_1}$ uses an independent estimate ($Q_{\theta_2'}$)
The critics are not fully independent, however: each uses the opposite critic in its target values, and both are trained from the same replay buffer
As a result, for some states the opposite critic's estimate is larger, i.e. $Q_{\theta_2}(s, \pi_{\phi_1}(s)) > Q_{\theta_1}(s, \pi_{\phi_1}(s))$
This is problematic because $Q_{\theta_1}(s, \pi_{\phi_1}(s))$ generally overestimates the true value, and in certain regions of the state space the overestimation becomes further exaggerated
To avoid this overestimation, take the minimum of the two estimates when forming the target update
Clipped Double Q-learning target update: $y_1 = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \pi_{\phi_1}(s'))$
With this target, the value estimate introduces no additional overestimation beyond the standard Q-learning target
It may induce an underestimation bias, but this is preferable because underestimated actions are not propagated through the policy update
Minimum operator also provides higher value to states with lower variance estimation error
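A sketch of this clipped target (PyTorch-style, illustrative names): a single actor $\pi_{\phi_1}$ selects the action, and the minimum over both target critics scores it.

```python
import torch

def clipped_double_q_target(reward, next_state, done,
                            actor1, critic1_target, critic2_target, gamma=0.99):
    with torch.no_grad():
        next_action = actor1(next_state)
        # The larger of the two estimates is treated as an overestimate, so the
        # minimum is used as the (at worst underestimating) value for the target.
        min_q = torch.min(critic1_target(next_state, next_action),
                          critic2_target(next_state, next_action))
        return reward + gamma * (1.0 - done) * min_q
```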
Addressing Variance
Accumulating Error
When training the value function with function approximation, each update leaves a residual TD error $\delta(s,a)$: $Q_\theta(s,a) = r + \gamma \mathbb{E}[Q_\theta(s', a')] - \delta(s,a)$
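Expanding this Bellman backup recursively (a worked expansion consistent with the definitions above) shows the estimate equals the expected return minus an accumulating discounted sum of future TD errors: $Q_\theta(s_t, a_t) = r_t + \gamma \mathbb{E}[Q_\theta(s_{t+1}, a_{t+1})] - \delta_t = \mathbb{E}_{s_i \sim p_\pi}\left[\sum_{i=t}^{T} \gamma^{i-t}(r_i - \delta_i)\right]$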