We want to estimate the policy gradient: $G = \sum_a Q^\pi(a) \nabla \pi(a)$, where $\nabla = \nabla_\theta$ (dependence on state dropped for brevity)
If we sample $\hat{a}$ from a behavior distribution $\mu$ and have access to an unbiased estimate $R(\hat{a})$ of $Q^\pi(\hat{a})$, we can estimate $G$ with a likelihood-ratio plus importance-sampling estimator: $\hat{G} _{ISLR} = \frac{\pi(\hat{a})}{\mu(\hat{a})}(R(\hat{a}) - V) \nabla \log \pi(\hat{a})$, where $V$ is a baseline
Suffers from high variance
Alternatively, estimate $G$ by using the return $R$ for the chosen action and the critic's $Q$ estimate for all other actions (see the sketch below)
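In the notation above, a rough sketch of this leave-one-out (LOO) estimator and the $\beta$-LOO generalization referenced in the results section ($\beta$ is a coefficient trading bias against variance; $\beta = 1$ recovers plain LOO):

$\hat{G} _{LOO} = R(\hat{a}) \nabla \pi(\hat{a}) + \sum _{a \neq \hat{a}} Q(a) \nabla \pi(a)$

$\hat{G} _{\beta\text{-LOO}} = \beta (R(\hat{a}) - Q(\hat{a})) \nabla \pi(\hat{a}) + \sum _a Q(a) \nabla \pi(a)$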
PER gives the same (maximal) priority to all newly added transitions until they are sampled
Assume TD errors are temporally correlated, with the correlation decaying as the time between them increases
Lazy Initialization: instead, give new experience no priority when added, and only assign it a priority after it has been sampled and used for training
Assign priorities to all overlapping sequences of length $n$
When sampling, sequences with no assigned priority are sampled proportionally to the average priority of assigned sequences within a local neighborhood
Start with the priorities $p_t$ already assigned to some sequences
Define a partition into cells $I_i$, each containing exactly one sequence $s_i$ with an assigned priority
Estimated priority $\hat{p}_t$ for all other sequences: $\hat{p}_t = \sum _{i \in J(t)} \frac{w_i}{\sum _{i' \in J(t)} w _{i'}}p(s_i)$ (sketched in code below)
$J(t)$: the collection of contiguous partitions making up the local neighborhood of time $t$
$w_i = \vert I_i \vert$: the length of partition $I_i$
Cell sizes used as importance weights
When $I_i$ is not a function of the priorities, the algorithm is unbiased
With probability $\epsilon$, sample uniformly at random; with probability $1 - \epsilon$, sample proportionally to $\hat{p}_t$
Implemented using contextual priority tree
Prioritization used for variance reduction
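A minimal Python sketch of the two mechanics above (the weighted priority estimate and the $\epsilon$-mixed sampling); the function names and data layout are hypothetical, and the contextual priority tree is not reproduced:

```python
import numpy as np

# Hypothetical helpers illustrating the mechanics above; the paper's
# contextual priority tree implementation is not reproduced here.

def estimate_priority(neighborhood):
    """Length-weighted average of assigned priorities over the cells J(t) in
    the local neighborhood of a sequence that has no priority yet.
    `neighborhood` is a list of (cell length w_i, assigned priority p(s_i))."""
    lengths = np.array([w for w, _ in neighborhood], dtype=float)
    priorities = np.array([p for _, p in neighborhood], dtype=float)
    return float((lengths / lengths.sum()) @ priorities)

def sample_index(p_hat, eps, rng):
    """With probability eps sample uniformly at random, otherwise sample
    proportionally to the estimated priorities p_hat."""
    p_hat = np.asarray(p_hat, dtype=float)
    probs = eps / len(p_hat) + (1.0 - eps) * p_hat / p_hat.sum()
    return int(rng.choice(len(p_hat), p=probs))

rng = np.random.default_rng(0)
print(estimate_priority([(3, 0.9), (5, 0.2)]))         # pulled toward the longer cell's priority
print(sample_index([0.9, 0.5, 0.2], eps=0.1, rng=rng))
```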
Agent Architecture
Decouple acting from learning to improve CPU usage
Acting thread gets observations, submits actions, and stores experiences in replay memory
Learning thread samples experiences and trains on them
4-6 acting steps per learning step (see the sketch at the end of this section)
Agent can be distributed over machines
Current and target network stored on shared parameter server
Each machine has its own replay memory
Training is done by downloading the shared network, evaluating local gradients, and sending them back to the shared parameter server
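A minimal single-machine sketch of the acting/learning decoupling; `env_step`, `select_action`, and `train_step` are hypothetical placeholders, and this is not the paper's implementation:

```python
import random
import threading

replay, lock = [], threading.Lock()   # stands in for the prioritized replay memory
ACT_STEPS_PER_LEARN = 5               # assumed value inside the reported 4-6 range

def act(env_step, select_action, num_steps):
    """Acting thread: get observations, submit actions, store experiences."""
    obs, _ = env_step(None)            # initial observation
    for _ in range(num_steps):
        action = select_action(obs)
        next_obs, reward = env_step(action)
        with lock:
            replay.append((obs, action, reward, next_obs))
        obs = next_obs

def learn(train_step, num_steps, batch_size=32):
    """Learning thread: sample stored experiences and train on them."""
    for _ in range(num_steps):
        with lock:
            if len(replay) < batch_size:
                continue
            batch = random.sample(replay, batch_size)
        train_step(batch)

# Run both threads concurrently; sizing num_steps keeps roughly
# ACT_STEPS_PER_LEARN acting steps per learning step.
actor = threading.Thread(target=act, args=(lambda a: (None, 0.0), lambda o: 0, 5000))
learner = threading.Thread(target=learn, args=(lambda b: None, 5000 // ACT_STEPS_PER_LEARN))
for t in (actor, learner):
    t.start()
for t in (actor, learner):
    t.join()
```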
Network Architecture
Use a recurrent network architecture to evaluate action-values over sequences (lets the convolutional layers process each frame once, instead of $n$ times as with frame stacking)
The frame $x_t$ is used as input; the network outputs the action-value distribution $q_i(x_t, a)$ and the policy probabilities $\pi(a \vert x_t)$
Architecture inspired by dueling network
Splits action-value distribution logits into state-value logits and advantage logits
Final action-value logits produced by summing state + action-specific logits
Use softmax for probability distributions
Policy uses a softmax layer mixed with a fixed uniform distribution, controlled by a mixing hyperparameter (see the sketch below)
Separate LSTMs for the policy and Q networks
LSTM policy gradients are blocked from back-propagating into the CNN for stability (avoids positive feedback loops between $\pi$ and $q_i$ caused by shared representations)
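A rough numpy sketch of the two output heads described above; the shapes (51 atoms as in C51, 6 actions) and the exact uniform-mixing form are assumptions, and the CNN/LSTM trunk and gradient blocking are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dueling_distributional_head(value_logits, advantage_logits):
    """Sum state-value logits (num_atoms,) with action-specific advantage
    logits (num_actions, num_atoms); softmax over atoms then yields the
    per-action value distributions q_i(x, a)."""
    action_logits = value_logits[None, :] + advantage_logits
    return softmax(action_logits, axis=-1)

def policy_head(policy_logits, mix_eps):
    """Softmax over action preferences mixed with a fixed uniform
    distribution, weighted by the mixing hyperparameter mix_eps."""
    num_actions = policy_logits.shape[0]
    return (1.0 - mix_eps) * softmax(policy_logits) + mix_eps / num_actions

# Illustration only: 6 actions, 51 atoms (a C51-style choice, not from the notes).
q_dist = dueling_distributional_head(np.zeros(51), np.random.randn(6, 51))
pi = policy_head(np.random.randn(6), mix_eps=0.01)   # each row of q_dist, and pi, sums to 1
```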
Experimental Results
Trained on Atari games
Compared performance with different versions of Reactor
Prioritization was most important component
$\beta$-LOO outperforms the truncated importance-sampling likelihood-ratio (ISLR) estimator
Distributional and non-distributional versions had similar results
Distributional was better with human starts
Comparing to Prior Work
Reactor exceeds performance of all algorithms (Rainbow, A3C, Dueling, etc. - see the paper for list) across all metrics
Reactor doesn’t use the noisy networks found in Rainbow (which contributed to Rainbow's performance)
With no-op starts, Reactor outperforms all except Rainbow, though the two reach the same performance at evaluation
Rainbow more sample efficient during training
Rainbow performance drops with random human starts