Use marginal value functions over the limiting distribution: $g^{marg} = \mathbb{E} _{x _t \sim \beta, a _t \sim \mu}[\rho _t \nabla _\theta \log \pi _\theta (a _t \vert x _t) Q^{\pi}(x _t, a _t)]$
$\beta(x) = \lim _{t \rightarrow \infty} P(x _t = x \vert x _0, \mu)$ is the limiting distribution
$\mu$ is the behavior policy
Avoids importance sampling over the whole trajectory: the estimator uses only the marginal importance weight $\rho _t = \frac{\pi(a _t \vert x _t)}{\mu(a _t \vert x _t)}$ rather than a product of importance weights (see the sketch below)
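A minimal sketch of this estimator for discrete actions, assuming PyTorch and a replayed batch with current-policy `logits`, stored behaviour probabilities `mu_probs`, and a critic `q_values` approximating $Q^\pi$ (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def marginal_pg_loss(logits, q_values, actions, mu_probs):
    """Off-policy policy gradient using only the marginal importance weight."""
    log_pi = F.log_softmax(logits, dim=-1)          # log pi_theta(. | x_t)
    pi = log_pi.exp()
    idx = actions.unsqueeze(-1)
    # marginal importance weight rho_t = pi(a_t | x_t) / mu(a_t | x_t)
    rho_t = (pi.detach() / mu_probs).gather(-1, idx).squeeze(-1)
    log_pi_a = log_pi.gather(-1, idx).squeeze(-1)   # log pi_theta(a_t | x_t)
    q_a = q_values.gather(-1, idx).squeeze(-1)      # Q(x_t, a_t), treated as fixed
    # minimising this loss yields a gradient estimate of g^marg
    return -(rho_t * log_pi_a * q_a.detach()).mean()
```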
ACER gradient with respect to the policy statistics $\phi _\theta(x_t)$: $\hat{g}^{acer} _t = \bar{\rho} _t \nabla _{\phi _\theta (x_t)} \log f(a_t \vert \phi _\theta (x_t)) [Q^{ret}(x_t, a_t) - V _{\theta _{v}} (x_t)] + \mathbb{E} _{a \sim \pi} \left( \left[ \frac{\rho _t(a) - c}{\rho _t(a)} \right] _{+} \nabla _{\phi _\theta (x_t)} \log f (a \vert \phi _\theta (x_t)) (Q _{\theta _{v}}(x_t, a) - V _{\theta _{v}} (x_t)) \right)$, where $\bar{\rho} _t = \min(c, \rho _t)$ is the truncated importance weight
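A hedged sketch of this gradient as a loss for discrete actions, assuming PyTorch; `q_ret` is the Retrace target for the taken action, `c` the truncation threshold, and the tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def acer_policy_loss(logits, q_values, v, actions, mu_probs, q_ret, c=10.0):
    """Truncated importance sampling with bias correction (discrete actions)."""
    log_pi = F.log_softmax(logits, dim=-1)
    pi = log_pi.exp()
    idx = actions.unsqueeze(-1)
    rho = pi.detach() / mu_probs                    # rho_t(a) for every action
    rho_a = rho.gather(-1, idx).squeeze(-1)         # rho_t = rho_t(a_t)
    log_pi_a = log_pi.gather(-1, idx).squeeze(-1)

    # truncated term: rho_bar_t * grad log f(a_t | phi) * (Q_ret - V)
    truncated = torch.clamp(rho_a, max=c) * log_pi_a * (q_ret - v.detach())

    # bias correction: E_{a~pi}[ [(rho(a)-c)/rho(a)]_+ grad log f(a | phi) * (Q - V) ]
    coef = torch.clamp((rho - c) / rho, min=0.0)
    advantage = q_values.detach() - v.detach().unsqueeze(-1)
    correction = (pi.detach() * coef * log_pi * advantage).sum(-1)

    return -(truncated + correction).mean()
```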
Trust Region Update:
Linearized KL divergence constraint: $\text{minimize} _z \ \frac{1}{2}\vert\vert \hat{g}^{acer} _t - z \vert\vert^2 _2$ subject to $\nabla _{\phi _\theta(x_t)} D _{KL}[f(\cdot \vert \phi _{\theta _a}(x_t)) \vert\vert f(\cdot \vert \phi _\theta(x_t))]^T z \leq \delta$, where $\phi _{\theta _a}$ is the average policy network (a running average of past policy parameters)
Linear constraint means we can solve it in closed form using the KKT conditions (see the sketch below): $z^* = \hat{g}^{acer} _t - \max\left(0, \frac{k^T\hat{g}^{acer} _t - \delta}{\vert\vert k \vert\vert^2 _2}\right)k$, where $k$ is the KL gradient from the constraint above
Trust region step done on statistics space of $f$ instead of policy parameters to avoid an extra backprop step
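The closed-form solution is a one-liner; a sketch assuming flat tensors for the ACER gradient and the KL gradient $k$:

```python
import torch

def trust_region_projection(g_acer, k, delta):
    """Project the ACER gradient onto the linearised KL trust region.
    g_acer: gradient w.r.t. the distribution statistics phi_theta(x_t).
    k:      gradient of D_KL[f(.|phi_{theta_a}(x_t)) || f(.|phi_theta(x_t))]
            w.r.t. the same statistics."""
    scale = torch.clamp((k @ g_acer - delta) / (k @ k).clamp(min=1e-8), min=0.0)
    return g_acer - scale * k
```

The adjusted $z^*$ is then backpropagated from the statistics $\phi _\theta(x_t)$ into the policy parameters $\theta$ with a single backward pass, which is why working in statistics space is cheap.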
ACER has off-policy and on-policy components
Can control number of on-policy vs off-policy updates via the replay ratio
On-policy ACER reduces to A3C with Q baselines and trust region optimization
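A sketch of how the replay ratio controls the on/off-policy mix; `on_policy_update` and `off_policy_update` are placeholder callables, and drawing the replay count from a Poisson distribution follows the paper's master algorithm:

```python
import numpy as np

def acer_iteration(on_policy_update, off_policy_update, replay_ratio=4):
    """One master iteration mixing on-policy and off-policy learning."""
    # act on-policy: roll out the current policy, update, store the trajectory in replay
    on_policy_update()
    # replay_ratio = 0 recovers the purely on-policy (A3C-like) algorithm
    n_replay = np.random.poisson(replay_ratio)
    for _ in range(n_replay):
        # sample a stored trajectory and apply the off-policy ACER update
        off_policy_update()
```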
Results on Atari
A single algorithm with a single set of hyperparameters is used across all Atari games
Using replay significantly increases data efficiency
Higher replay ratio = average reward accumulates faster (but with diminishing returns)
ACER matches performance of best DQNs
Off-policy ACER is more data efficient than on-policy A3C
Wall-clock training time is similar across the ACER variants
Continuous Actor Critic with Experience Replay
Policy Evaluation
To compute $V _{\theta_v}$ from $Q _{\theta_v}$, we would need to integrate over the (continuous) action space; this is intractable
We could estimate it with importance sampling, but that has high variance
Stochastic Dueling Networks: estimate $Q^\pi$ and $V^\pi$ off-policy while maintaining consistency between the two estimates
Outputs a stochastic estimate $\tilde{Q} _{\theta_v}$ of $Q^\pi$
Outputs a deterministic estimate $V _{\theta_v}$ of $V^\pi$
Follows equation: $\tilde{Q} _{\theta_v}(x_t, a_t) \sim V _{\theta_v}(x_t) + A _{\theta_v}(x_t, a_t) - \frac{1}{n}\sum _{i=1}^n A _{\theta_v}(x_t, u_i) \text{ where } u_i \sim \pi _\theta(\cdot \vert x_t)$
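A sketch of a stochastic dueling head in PyTorch; the layer sizes, names, and the use of a `torch.distributions` policy object `pi` are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class StochasticDuelingHead(nn.Module):
    """Deterministic V_{theta_v} plus a stochastic estimate of Q^pi."""
    def __init__(self, feat_dim, action_dim, n_samples=5):
        super().__init__()
        self.v = nn.Linear(feat_dim, 1)                 # V_{theta_v}(x_t)
        self.adv = nn.Linear(feat_dim + action_dim, 1)  # A_{theta_v}(x_t, a)
        self.n = n_samples

    def forward(self, feat, action, pi):
        """feat: (B, feat_dim), action: (B, action_dim), pi: action distribution."""
        v = self.v(feat).squeeze(-1)
        a_taken = self.adv(torch.cat([feat, action], dim=-1)).squeeze(-1)
        # subtract a Monte Carlo estimate of E_{u~pi}[A(x_t, u)] for consistency
        u = pi.sample((self.n,))                        # (n, B, action_dim)
        feat_rep = feat.unsqueeze(0).expand(self.n, *feat.shape)
        a_sampled = self.adv(torch.cat([feat_rep, u], dim=-1)).squeeze(-1)
        q_tilde = v + a_taken - a_sampled.mean(dim=0)   # stochastic estimate of Q^pi(x_t, a_t)
        return q_tilde, v
```

For example, `pi = torch.distributions.Normal(mean, std)` with `mean`, `std` of shape `(B, action_dim)` yields samples of shape `(n, B, action_dim)` as used above.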
Target for estimating $V^\pi$: $V^{target}(x_t) = \min\left(1, \frac{\pi(a_t \vert x_t)}{\mu(a_t \vert x_t)}\right)\left(Q^{ret}(x_t, a_t) - Q _{\theta_v}(x_t, a_t)\right) + V _{\theta_v}(x_t)$
Can raise the truncated importance sampling weight to the power of $\frac{1}{d}$, where $d$ is the dimensionality of the action space, for faster learning (sketch below)
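A minimal sketch of this target for a batch of transitions (PyTorch tensors); the optional `action_dim` argument applies the $\frac{1}{d}$ power from the previous note to the truncated weight, which is one reading of where that trick goes:

```python
import torch

def v_target(q_ret, q_est, v_est, pi_a, mu_a, action_dim=None):
    """Off-policy regression target for V^pi.
    q_ret: Retrace target Q^ret(x_t, a_t); q_est: Q_{theta_v}(x_t, a_t);
    v_est: V_{theta_v}(x_t); pi_a, mu_a: pi(a_t | x_t) and mu(a_t | x_t)."""
    w = pi_a / mu_a
    if action_dim is not None:
        # raise the weight to 1/d (d = action dimensionality) before truncation
        w = w ** (1.0 / action_dim)
    w = torch.clamp(w, max=1.0)     # min(1, .) truncation
    return w * (q_ret - q_est) + v_est
```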