Without experience replay, we get temporally correlated updates
Loses IID assumption
Potentially rare but useful experiences are rapidly forgotten
With experience replay, we break temporal correlations
Stabilizes value function training
Reduces amount of experience needed to learn
By prioritizing which experiences are replayed, we can make experience replay even more efficient
An RL agent can learn more effectively from some experiences than from others
Some transitions are more or less surprising, or more or less redundant
Some transitions not immediately relevant but relevant later on
Prioritized Experience Replay replays transitions with high expected learning progress → measured by magnitude of TD error
Prioritization can lead to loss of diversity
Use stochastic prioritization to preserve diversity, and correct the bias it introduces with importance sampling
Background
In neuroscience, experience replay is found in hippocampus of rodents
Sequences with high TD error or rewards replayed more
In value iteration, prioritizing updates makes planning more efficient
Prioritized sweeping selects which state to update next based on the expected change in its value
TD error also can be used
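A minimal sketch of tabular prioritized sweeping under a deterministic learned model, assuming `model[(s, a)] -> (reward, next_state)` and a predecessor map; all names, the threshold `theta`, and the deterministic-model assumption are illustrative, not from the paper.

```python
import heapq
import itertools

def prioritized_sweeping(Q, model, predecessors, gamma=0.99, lr=0.5,
                         theta=1e-3, n_updates=1000):
    """Q[s][a]: tabular action values; model[(s, a)] -> (reward, next_state);
    predecessors[s] -> set of (s_prev, a_prev) known to lead into s."""
    tie = itertools.count()  # tie-breaker so the heap never compares states
    pq = []                  # max-heap via negated priorities; duplicates tolerated
    # Seed the queue with the magnitude of each state-action's pending update.
    for (s, a), (r, s2) in model.items():
        p = abs(r + gamma * max(Q[s2].values()) - Q[s][a])
        if p > theta:
            heapq.heappush(pq, (-p, next(tie), (s, a)))

    for _ in range(n_updates):
        if not pq:
            break
        _, _, (s, a) = heapq.heappop(pq)
        r, s2 = model[(s, a)]
        Q[s][a] += lr * (r + gamma * max(Q[s2].values()) - Q[s][a])
        # The value of s changed, so its predecessors may now have large updates too.
        for (sp, ap) in predecessors[s]:
            rp, _ = model[(sp, ap)]
            p = abs(rp + gamma * max(Q[s].values()) - Q[sp][ap])
            if p > theta:
                heapq.heappush(pq, (-p, next(tie), (sp, ap)))
    return Q
```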
PER uses similar prioritization (TD error) but for model-free RL instead of model-based
Stochastic prioritization more robust when using function approximators
TD errors tell us where we should focus exploration
When dealing with imbalanced datasets, we can resample
Separate experience into two buckets (positive and negative rewards)
Choose fixed fraction from each bucket to replay
Only applicable for domains with positive and negative experiences
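A hedged sketch of this bucket-based resampling, assuming transitions have already been split by reward sign; the function name and the 50/50 default split are illustrative choices.

```python
import random

def sample_two_buckets(positive, negative, batch_size, pos_fraction=0.5):
    """Draw a fixed fraction of the minibatch from each reward bucket."""
    n_pos = int(batch_size * pos_fraction)
    n_neg = batch_size - n_pos
    batch = [random.choice(positive) for _ in range(n_pos)]
    batch += [random.choice(negative) for _ in range(n_neg)]
    random.shuffle(batch)
    return batch
```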
DQN and Double DQN (DDQN) are state of the art on Atari
Prioritized Replay
PER can be implemented in two ways
Which experiences to store
Which experiences to replay (the focus of the paper)
Motivating Example
If we could do experience replay with an oracle that tells us the order in which to replay transitions, we could reach an optimal policy in far fewer updates
Prioritizing with TD-Error
TD error is a reasonable proxy for the amount the RL agent can learn from a transition
Tells us how surprising a transition is by measuring how far its value estimate is from the next-step bootstrap estimate
TD error prioritization algorithm
Store TD error along with transition in memory
Greedily replay the transition with the highest absolute TD error
New transitions without a known TD error are given maximum priority
Implemented using a binary heap
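A minimal sketch of the greedy variant above using Python's `heapq` as the binary heap; the class and method names are illustrative, not from the paper. New transitions enter with "infinite" priority so they are replayed at least once.

```python
import heapq
import itertools

class GreedyReplay:
    """Greedy TD-error prioritization: always replay the largest |TD error|."""
    def __init__(self):
        self.heap = []
        self.counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, transition, td_error=float("inf")):
        # heapq is a min-heap, so priorities are stored negated.
        heapq.heappush(self.heap, (-abs(td_error), next(self.counter), transition))

    def pop_highest(self):
        # Replay the transition with the largest absolute TD error.
        _, _, transition = heapq.heappop(self.heap)
        return transition

    def reinsert(self, transition, new_td_error):
        # After replaying, push back with the freshly computed TD error.
        self.add(transition, new_td_error)
```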
Stochastic Prioritization
Issues with greedy prioritization
TD errors only updated for replayed transitions
Transitions with a low TD error on their first visit may not be replayed again for a very long time (effectively never)
Sensitive to noise spikes (e.g., from stochastic rewards), which is exacerbated by bootstrapping
Focuses on small subset of experience
Errors shrink slowly → initially high-error transitions get replayed frequently → this lack of diversity makes the system prone to overfitting
Stochastic Sampling: interpolates between greedy and uniform sampling
Probability of sampling a transition: $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$
$\alpha$ is a parameter which determines the degree of prioritization ($\alpha = 0$ corresponds to uniform sampling)
Proportional Prioritization: $p_i = \vert\delta_i\vert + \epsilon$ where $\epsilon$ is a small positive constant to prevent 0 probability transitions
Implemented with a sum-tree data structure
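A minimal sum-tree sketch, assuming the caller stores $p_i^\alpha$ as the leaf value; the class name, capacity handling, and the 1-indexed array layout are illustrative choices, not the paper's reference implementation. Both priority updates and sampling proportional to priority are $O(\log N)$.

```python
import random

class SumTree:
    """Leaves hold per-transition priorities; internal nodes hold subtree sums."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # 1-indexed binary tree
        self.data = [None] * capacity
        self.next_idx = 0

    def update(self, idx, priority):
        # Set the leaf, then propagate the new sums up to the root.
        pos = idx + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def add(self, transition, priority):
        # New transitions would typically be given the current max priority
        # so they are replayed at least once; oldest entries are overwritten.
        idx = self.next_idx
        self.data[idx] = transition
        self.update(idx, priority)
        self.next_idx = (self.next_idx + 1) % self.capacity

    def sample(self):
        # Walk down from the root, choosing children proportional to their mass.
        s = random.uniform(0.0, self.tree[1])
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if s <= self.tree[left]:
                pos = left
            else:
                s -= self.tree[left]
                pos = left + 1
        idx = pos - self.capacity
        return idx, self.data[idx], self.tree[pos]
```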
Rank-based prioritization: $p_i = \frac{1}{\mathrm{rank}(i)}$ where $\mathrm{rank}(i)$ is the rank of transition $i$ when sorted by $\vert\delta_i\vert$
$P$ becomes a power-law distribution with exponent $\alpha$
Insensitive to outliers
Implemented with an approximate CDF: a piecewise-linear function with $k$ segments of equal probability
Sample a segment uniformly, then sample a transition uniformly within it
Choose $k$ to be the minibatch size (one transition per segment)
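A hedged sketch of the rank-based scheme: it uses exact CDF quantiles rather than the paper's piecewise-linear approximation, and the uniform draw within a segment plus the function names are illustrative simplifications. Segment boundaries only need recomputing when the buffer size or $\alpha$ changes.

```python
import bisect
import random

def build_segments(n, alpha=0.7, k=32):
    """Rank index where each of k equal-probability segments ends, for p_i = (1/rank)^alpha."""
    priorities = [(1.0 / rank) ** alpha for rank in range(1, n + 1)]
    total = sum(priorities)
    cdf, acc = [], 0.0
    for p in priorities:
        acc += p
        cdf.append(acc / total)
    return [bisect.bisect_left(cdf, (j + 1) / k) for j in range(k)]

def sample_ranks(boundaries):
    # One rank per segment (uniform within the segment, as an approximation);
    # ranks index into the |TD error|-sorted order. Assumes non-empty segments.
    ranks, start = [], 0
    for end in boundaries:
        ranks.append(random.randint(start, max(start, end)))
        start = end + 1
    return ranks
```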
Annealing the Bias
Estimating an expected value with stochastic updates requires that the updates are drawn from the same distribution as the expectation
PER introduces bias because it changes the sampling distribution in an uncontrolled fashion
This changes the solution the estimates converge to
Since we sample transitions by priority rather than in the proportion in which they were generated, the Q-network updates follow a different distribution from the one that produced the experience
Use importance sampling to adjust how much we change weights by
Use weighted importance sampling:
$w_i = (\frac{1}{N} \cdot \frac{1}{P(i)})^\beta$
Fully compensates for the non-uniform probabilities $P(i)$ when $\beta = 1$
Weights can be folded into the Q-learning update by scaling the TD error: use $w_i \delta_i$ in place of $\delta_i$
Normalize weights by $1/\max_i w_i$ for stability
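A small sketch of the importance-sampling weights together with a linear $\beta$ schedule (annealing is discussed just below); the function names and the starting value of $\beta$ are illustrative assumptions.

```python
import numpy as np

def is_weights(sample_probs, buffer_size, beta):
    """w_i = (1/N * 1/P(i))^beta, normalized by the max weight for stability."""
    w = (1.0 / (buffer_size * np.asarray(sample_probs))) ** beta
    return w / w.max()

def anneal_beta(step, total_steps, beta_start=0.4):
    # Linear schedule so beta reaches 1 by the end of training.
    return beta_start + (1.0 - beta_start) * min(1.0, step / total_steps)

# Folding into the update: scale each sample's TD error before the gradient step.
# scaled_td_errors = is_weights(P_batch, N, anneal_beta(t, T)) * td_errors
```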
In RL, unbiased updates matter most near convergence at the end of training; the process is highly non-stationary anyway (changing policy, state distribution, and bootstrap targets)
So the bias can largely be ignored early in training
Anneal the amount of importance-sampling correction over time so that $\beta$ reaches 1 by the end of training → e.g., linear annealing
Large steps are disruptive because the first-order (Taylor) approximation of the gradient is only accurate locally → prevented with a smaller step size
With PER, high-error transitions are seen many times; the importance-sampling correction shrinks their gradient magnitudes, so the algorithm takes smaller steps and the Taylor expansion is constantly re-approximated
Atari Experiments
PER leads to substantial score improvements on the majority of games
Aggregate learning is about twice as fast
PER is complementary to DDQN → increases performance another notch
Discussion
Rank-based is more robust because it is not affected by outliers or by the error magnitudes themselves
Its heavy-tailed distribution guarantees diverse samples
Stratified sampling keeps the total minibatch gradient at a stable magnitude
But being blind to error scales can cost performance when there is exploitable structure in the error distribution
With DQN's clipped rewards (and hence clipped TD errors), both variants perform similarly in practice
With regular experience replay, some experiences slide out of the replay window before ever being replayed
With PER, the bonus given to unseen transitions ensures this doesn’t happen
When PER interacts with a learned representation: transitions the current representation handles well have small error and are replayed much less, so learning focuses on where the representation is poor → more function-approximation resources go into distinguishing aliased states
Extensions
Prioritized Supervised Learning: the same prioritization can be used to resample training examples from which the model can still learn a lot
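A hedged sketch of loss-proportional resampling for supervised learning; the priority exponent and $\epsilon$ mirror the proportional-prioritization recipe above, and all names are illustrative choices.

```python
import numpy as np

def sample_batch_indices(last_losses, batch_size, alpha=0.6, eps=1e-6):
    """Sample training examples in proportion to their most recently observed loss."""
    p = (np.asarray(last_losses) + eps) ** alpha
    p /= p.sum()
    return np.random.choice(len(last_losses), size=batch_size, p=p)
```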
Off-policy Replay: PER is analogous to using off-policy RL with rejection sampling or importance sampling
Feedback for Exploration: The number of times a transition is replayed is feedback for how useful it was to an agent → signal can be fed back for exploration
Sample exploration hyperparameters from parameterized distribution
Monitor usefulness of experience
Update distribution towards generating more useful experience for exploration
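A speculative sketch of this feedback loop: sample an exploration rate from a parameterized distribution, credit it with how often the transitions it generated are later replayed, and shift the distribution toward rates that produce useful experience. Everything here (the class, the ε-greedy candidates, the update rule) is an illustrative assumption.

```python
import random
from collections import defaultdict

class ExplorationTuner:
    def __init__(self, candidates=(0.01, 0.1, 0.3), lr=0.1):
        self.candidates = candidates
        self.scores = defaultdict(float)  # running usefulness per candidate epsilon
        self.lr = lr

    def sample_epsilon(self):
        # Sample exploration hyperparameters from the (learned) distribution.
        weights = [1.0 + self.scores[c] for c in self.candidates]
        return random.choices(self.candidates, weights=weights)[0]

    def record_replays(self, epsilon, replay_count):
        # Replay count of transitions generated under this epsilon is the feedback.
        self.scores[epsilon] += self.lr * (replay_count - self.scores[epsilon])
```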
Prioritized Memories
Controlling which memories to store and erase reduces redundancy
When erasing, we need stronger diversity considerations (e.g., the age of a memory) to preserve old memories and prevent cycles
Also flexible enough to incorporate experience from external sources