We want to estimate the policy gradient: $G = \sum_a Q^\pi(a) \nabla \pi(a)$, where $\nabla = \nabla_\theta$ (dependence on state dropped for brevity)
If we sample $\hat{a}$ from a behavior distribution $\mu$ and have access to an unbiased estimate $R(\hat{a})$ of $Q^\pi(\hat{a})$, we can estimate $G$ with a likelihood-ratio plus importance-sampling estimator: $\hat{G} _{ISLR} = \frac{\pi(\hat{a})}{\mu(\hat{a})}(R(\hat{a}) - V) \nabla \log \pi(\hat{a})$, where $V$ is a baseline
Suffers from high variance
Alternatively, estimate $G$ by using the return $R$ for the chosen action and the critic's $Q$ estimate for all other actions (see the sketch below)
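In the notation above, a rough sketch of this leave-one-out (LOO) estimator and the $\beta$-LOO generalization referenced in the results section ($\beta$ is a coefficient trading bias against variance; $\beta = 1$ recovers plain LOO):

$\hat{G} _{LOO} = R(\hat{a}) \nabla \pi(\hat{a}) + \sum _{a \neq \hat{a}} Q(a) \nabla \pi(a)$

$\hat{G} _{\beta\text{-LOO}} = \beta (R(\hat{a}) - Q(\hat{a})) \nabla \pi(\hat{a}) + \sum _a Q(a) \nabla \pi(a)$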
PER gives the same (maximal) priority to all newly added transitions until they are sampled
Assume TD errors are temporally correlated, with the correlation decaying as the time between them increases
Lazy Initialization: instead, give new experience no priority when added, and only assign it a priority after it has been sampled and used for training
Assign priorities to all overlapping sequences of length $n$
When sampling, sequences with no assigned priority are sampled proportionally to the average priority of assigned sequences within a local neighborhood
Start with the priorities $p_t$ already assigned to some sequences
Define a partition into cells $I_i$, each containing exactly one sequence $s_i$ with an assigned priority
Estimated priority $\hat{p}_t$ for all other sequences: $\hat{p}_t = \sum _{i \in J(t)} \frac{w_i}{\sum _{i' \in J(t)} w _{i'}}p(s_i)$ (sketched in code below)
$J(t)$: the collection of contiguous partitions making up the local neighborhood of time $t$
$w_i = \vert I_i \vert$: the length of partition $I_i$
Cell sizes used as importance weights
When $I_i$ is not a function of the priorities, the algorithm is unbiased
With probability $\epsilon$, sample uniformly at random; with probability $1 - \epsilon$, sample proportionally to $\hat{p}_t$
Implemented using contextual priority tree
Prioritization used for variance reduction
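A minimal Python sketch of the two mechanics above (the weighted priority estimate and the $\epsilon$-mixed sampling); the function names and data layout are hypothetical, and the contextual priority tree is not reproduced:

```python
import numpy as np

# Hypothetical helpers illustrating the mechanics above; the paper's
# contextual priority tree implementation is not reproduced here.

def estimate_priority(neighborhood):
    """Length-weighted average of assigned priorities over the cells J(t) in
    the local neighborhood of a sequence that has no priority yet.
    `neighborhood` is a list of (cell length w_i, assigned priority p(s_i))."""
    lengths = np.array([w for w, _ in neighborhood], dtype=float)
    priorities = np.array([p for _, p in neighborhood], dtype=float)
    return float((lengths / lengths.sum()) @ priorities)

def sample_index(p_hat, eps, rng):
    """With probability eps sample uniformly at random, otherwise sample
    proportionally to the estimated priorities p_hat."""
    p_hat = np.asarray(p_hat, dtype=float)
    probs = eps / len(p_hat) + (1.0 - eps) * p_hat / p_hat.sum()
    return int(rng.choice(len(p_hat), p=probs))

rng = np.random.default_rng(0)
print(estimate_priority([(3, 0.9), (5, 0.2)]))         # pulled toward the longer cell's priority
print(sample_index([0.9, 0.5, 0.2], eps=0.1, rng=rng))
```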
Agent Architecture
Decouple acting from learning to improve CPU usage
Acting thread gets observations, submits actions, and stores experiences in replay memory
Learning thread samples experiences and trains on them
4-6 acting steps per learning step (see the sketch at the end of this section)
Agent can be distributed over machines
Current and target network stored on shared parameter server
Each machine has its own replay memory
Training is done by downloading the shared network, evaluating local gradients, and sending them back to the shared parameter server
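A minimal single-machine sketch of the acting/learning decoupling; `env_step`, `select_action`, and `train_step` are hypothetical placeholders, and this is not the paper's implementation:

```python
import random
import threading

replay, lock = [], threading.Lock()   # stands in for the prioritized replay memory
ACT_STEPS_PER_LEARN = 5               # assumed value inside the reported 4-6 range

def act(env_step, select_action, num_steps):
    """Acting thread: get observations, submit actions, store experiences."""
    obs, _ = env_step(None)            # initial observation
    for _ in range(num_steps):
        action = select_action(obs)
        next_obs, reward = env_step(action)
        with lock:
            replay.append((obs, action, reward, next_obs))
        obs = next_obs

def learn(train_step, num_steps, batch_size=32):
    """Learning thread: sample stored experiences and train on them."""
    for _ in range(num_steps):
        with lock:
            if len(replay) < batch_size:
                continue
            batch = random.sample(replay, batch_size)
        train_step(batch)

# Run both threads concurrently; sizing num_steps keeps roughly
# ACT_STEPS_PER_LEARN acting steps per learning step.
actor = threading.Thread(target=act, args=(lambda a: (None, 0.0), lambda o: 0, 5000))
learner = threading.Thread(target=learn, args=(lambda b: None, 5000 // ACT_STEPS_PER_LEARN))
for t in (actor, learner):
    t.start()
for t in (actor, learner):
    t.join()
```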
Network Architecture
Use a recurrent network architecture to evaluate action-values over sequences (lets the convolutional layers process each frame once, instead of $n$ times as with frame stacking)
The frame $x_t$ is used as input; the network outputs the action-value distribution $q_i(x_t, a)$ and the policy probabilities $\pi(a \vert x_t)$
Architecture inspired by dueling network
Splits action-value distribution logits into state-value logits and advantage logits
Final action-value logits produced by summing state + action-specific logits
Use softmax for probability distributions
Policy uses a softmax layer mixed with a fixed uniform distribution, controlled by a mixing hyperparameter (see the sketch below)
Separate LSTMs for the policy and Q networks
LSTM policy gradients are blocked from back-propagating into the CNN for stability (avoids positive feedback loops between $\pi$ and $q_i$ caused by shared representations)
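A rough numpy sketch of the two output heads described above; the shapes (51 atoms as in C51, 6 actions) and the exact uniform-mixing form are assumptions, and the CNN/LSTM trunk and gradient blocking are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dueling_distributional_head(value_logits, advantage_logits):
    """Sum state-value logits (num_atoms,) with action-specific advantage
    logits (num_actions, num_atoms); softmax over atoms then yields the
    per-action value distributions q_i(x, a)."""
    action_logits = value_logits[None, :] + advantage_logits
    return softmax(action_logits, axis=-1)

def policy_head(policy_logits, mix_eps):
    """Softmax over action preferences mixed with a fixed uniform
    distribution, weighted by the mixing hyperparameter mix_eps."""
    num_actions = policy_logits.shape[0]
    return (1.0 - mix_eps) * softmax(policy_logits) + mix_eps / num_actions

# Illustration only: 6 actions, 51 atoms (a C51-style choice, not from the notes).
q_dist = dueling_distributional_head(np.zeros(51), np.random.randn(6, 51))
pi = policy_head(np.random.randn(6), mix_eps=0.01)   # each row of q_dist, and pi, sums to 1
```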
Experimental Results
Trained on Atari games
Compared performance with different versions of Reactor
Prioritization was most important component
$\beta$-LOO outperforms the truncated importance-sampling likelihood-ratio (ISLR) estimator
Distributional and non-distributional versions had similar results
Distributional was better with human starts
Comparing to Prior Work
Reactor exceeds performance of all algorithms (Rainbow, A3C, Dueling, etc. - see the paper for list) across all metrics
Reactor doesn’t use the noisy networks found in Rainbow (which contributed to Rainbow's performance)
With no-op starts, Reactor outperforms all except Rainbow, though the two reach the same performance at evaluation
Rainbow more sample efficient during training
Rainbow performance drops with random human starts