Asynchronous Methods for Deep Reinforcement Learning
Resources
Paper
-
Introduction
- Online RL with deep networks was thought to be unstable because the observed data is non-stationary and online updates are highly correlated → prior work uses experience replay to address this
- Experience replay uses more memory and compute + requires off-policy learning
- Instead of experience replay, use parallel agents in parallel environments
- Decorrelates agent data into stationary process
- At any given timestep, the parallel agents' experiences will be very different
- Enables on-policy SARSA, n-step methods, or actor-critic AND off-policy Q learning
- Can also be used on multi-core CPU instead of GPU → takes less time than GPU methods
-
Related Work
- GORILA: Asynchronous training of RL agents in distributed setting
- Each process has agent that acts on own environment with separate replay memory
- Gradients asynchronously sent to central server which updates central model
- Map Reduce for RL: Used parallelism to speed up matrix operations
- Parallel SARSA: Separate actors learn using SARSA and use p2p communication to share experience with other actors
- Q Learning Convergence: Q learning is guaranteed to converge even with outdated information as long as it eventually gets discarded
- Evolutionary Methods: These can be parallelized
-
Reinforcement Learning Background
- Value based Methods: Minimize MSE between the estimated Q value and a bootstrapped target, e.g. the one-step target $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ → make policy greedy or epsilon greedy with respect to the Q value
- Policy based Methods: Directly parameterize the policy
- REINFORCE: Update policy parameters in direction of $\nabla_\theta \log \pi(a_t \vert s_t; \theta)R_t$
- Reduce variance by subtracting baselines from the return; converts gradient into $\nabla_\theta \log \pi(a_t \vert s_t; \theta)(R_t - b_t(s_t))$
- Common baseline is value function $V^\pi(s_t)$
- Advantage of an action: $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$
- Similar to actor-critic, where the policy is the actor and the baseline is the critic (sketched below)
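A minimal numpy sketch of the REINFORCE-with-baseline update above, using a hypothetical linear-softmax policy and linear value baseline; all names, shapes, and values here are illustrative, not from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_with_baseline_grads(theta, w, states, actions, returns):
    """Accumulate grad log pi(a_t|s_t) * (R_t - b_t(s_t)) for the policy
    parameters theta, plus a squared-error gradient for the baseline w."""
    g_theta, g_w = np.zeros_like(theta), np.zeros_like(w)
    for s, a, R in zip(states, actions, returns):
        probs = softmax(theta @ s)           # pi(.|s) under a linear-softmax policy
        V = w @ s                            # baseline b_t(s_t) = V(s_t)
        advantage = R - V                    # estimate of A(s_t, a_t)
        grad_log_pi = np.outer(-probs, s)    # d log pi(a|s) / d theta ...
        grad_log_pi[a] += s                  # ... = (one_hot(a) - probs) outer s
        g_theta += grad_log_pi * advantage
        g_w += (R - V) * s                   # move the baseline toward the return
    return g_theta, g_w

# toy usage: 3 actions, 4 state features, a 2-step episode
theta, w = np.zeros((3, 4)), np.zeros(4)
states = [np.array([1.0, 0.0, 0.5, 0.2]), np.array([0.0, 1.0, 0.3, 0.1])]
actions, returns = [0, 2], [1.5, 1.0]        # R_t are discounted returns
g_theta, g_w = reinforce_with_baseline_grads(theta, w, states, actions, returns)
theta += 0.1 * g_theta                       # ascend the policy gradient
w += 0.1 * g_w
```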
-
Asynchronous RL Framework
- Use multiple asynchronous actor-learners on a single machine’s CPU threads → removes communication costs (a minimal sketch follows this list)
- Each actor gets a different exploration policy → makes online updates less correlated
- Reduction in training time, roughly linear with number of actor-learners
- On-policy training is now stable
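A minimal sketch of the single-machine setup described above: several actor-learner threads sharing one parameter vector and applying lock-free (Hogwild-style) updates. The toy "environment" and gradient are placeholders, not the paper's algorithm:

```python
import threading
import numpy as np

NUM_THREADS, STEPS = 4, 1000
shared_theta = np.zeros(8)                   # parameters shared by every thread

def actor_learner(thread_id, theta):
    # each thread has its own environment copy and its own exploration seed,
    # which decorrelates the updates the threads send to the shared parameters
    rng = np.random.default_rng(thread_id)
    for _ in range(STEPS):
        state = rng.normal(size=8)           # placeholder for real interaction
        grad = state - theta                 # toy gradient of a placeholder loss
        theta += 0.001 * grad                # lock-free in-place update of shared params

threads = [threading.Thread(target=actor_learner, args=(i, shared_theta))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("shared parameters after training:", shared_theta)
```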
-
Async one-step Q Learning
- Each thread interacts with own copy of environment and computes Q learning loss
- Accumulates multiple steps of gradients before applying (similar to using minibatches)
- Reduces chance of actors overwriting other actor updates
- Trades off computational efficiency for data efficiency
- Use epsilon greedy exploration with an epsilon value sampled separately for each thread (sketched below)
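A sketch of one actor-learner thread for asynchronous one-step Q-learning, following the bullets above: its own environment copy, one-step targets from a periodically refreshed target network, gradients accumulated for several steps before being applied, and a per-thread sampled epsilon. The tabular toy environment, constants, and update frequencies are illustrative assumptions:

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.99
I_ASYNC_UPDATE, I_TARGET = 5, 100            # assumed update frequencies

shared_Q = np.zeros((N_STATES, N_ACTIONS))   # shared parameters (tabular stand-in)
target_Q = shared_Q.copy()                   # target network copy

def toy_env_step(state, action, rng):
    """Placeholder dynamics: random next state, reward 1 for reaching state 0."""
    next_state = int(rng.integers(N_STATES))
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward, next_state == N_STATES - 1

def one_step_q_worker(steps, seed):
    global shared_Q, target_Q
    rng = np.random.default_rng(seed)
    epsilon = rng.choice([0.5, 0.1, 0.01])   # epsilon sampled once per thread
    accum_grad = np.zeros_like(shared_Q)     # accumulated gradients (minibatch-like)
    state = int(rng.integers(N_STATES))
    for step in range(1, steps + 1):
        # epsilon-greedy action with respect to the shared Q
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(shared_Q[state]))
        next_state, reward, done = toy_env_step(state, action, rng)
        # one-step Q-learning target computed from the target network
        target = reward if done else reward + GAMMA * target_Q[next_state].max()
        td_error = target - shared_Q[state, action]
        accum_grad[state, action] += td_error          # accumulate, do not apply yet
        if step % I_ASYNC_UPDATE == 0 or done:
            shared_Q += 0.1 * accum_grad               # apply accumulated update
            accum_grad[:] = 0.0
        if step % I_TARGET == 0:
            target_Q = shared_Q.copy()                 # refresh the target network
        state = int(rng.integers(N_STATES)) if done else next_state

one_step_q_worker(steps=1000, seed=0)
print(shared_Q)
```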
-
Async one-step SARSA
- Same algorithm as Q learning, except the target bootstraps from $Q(s', a')$ for the action $a'$ actually taken (on-policy) instead of $\max_{a'} Q(s', a')$ (see below)
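A tiny illustration of that single change, with made-up values: the SARSA target bootstraps from Q(s', a') for the action the epsilon-greedy policy actually took, instead of the max over actions:

```python
import numpy as np

GAMMA = 0.99
Q_next = np.array([0.2, 0.7, 0.4])   # Q(s', .) from the target network
reward, next_action = 1.0, 0         # a' = 0 is the action actually taken in s'

q_learning_target = reward + GAMMA * Q_next.max()          # off-policy (max)
sarsa_target      = reward + GAMMA * Q_next[next_action]   # on-policy (SARSA)
print(q_learning_target, sarsa_target)
```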
-
Async n-step Q Learning
- Computes n-step returns in the forward view (instead of the backward view with eligibility traces, as is usually done)
- Computes gradients for n-step Q learning updates for each state-action pair encountered since last update
- Uses longest possible n-step return
- One-step update for last state, two step for second to last, etc.
- Accumulated updates applied in a single gradient step (see the sketch below)
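A small sketch of the forward-view targets described above: after collecting up to $t_{max}$ transitions, each state gets the longest return available, so the last state gets a one-step return, the second-to-last a two-step return, and so on. The bootstrap value would be something like $\max_a Q(s_{last}, a; \theta^-)$, or 0 after a terminal state; the function name is illustrative:

```python
def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """Longest-possible n-step return for each state in a collected rollout."""
    targets = []
    R = bootstrap_value
    for r in reversed(rewards):        # walk backwards, building R_t = r_t + gamma * R_{t+1}
        R = r + gamma * R
        targets.append(R)
    return list(reversed(targets))     # targets[i] is the (len-i)-step return for state s_i

# e.g. 3 collected rewards: s_2 gets a 1-step return, s_1 a 2-step, s_0 a 3-step
print(n_step_targets([0.0, 0.0, 1.0], bootstrap_value=0.5))
```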
-
Asynchronous Advantage Actor-Critic (A3C)
- Maintains a policy and value function
- Operates in the forward view (as with forward-view eligibility traces) and uses n-step returns to update the policy + value function
- Updated every $t_{max}$ steps or when a terminal state is reached
- Updates based on REINFORCE update step
- $\nabla_{\theta'}\log \pi(a_t \vert s_t; \theta')A(s_t, a_t; \theta, \theta_v)$
- Parallel actors improve policy and value
- Policy and value likely share some parameters
- Added the policy's entropy to the objective to improve exploration and discourage premature convergence to suboptimal deterministic policies
- Gradient with entropy regularization: $\nabla_{\theta'}\log \pi(a_t \vert s_t; \theta')(R_t - V(s_t; \theta_v)) + \beta\nabla_{\theta'} H(\pi(s_t;\theta'))$ (sketched below)
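A numpy sketch of the gradient accumulation for one $t_{max}$-step rollout, matching the update above: the policy gradient is weighted by the advantage $R_t - V(s_t; \theta_v)$, an entropy term scaled by $\beta$ is added, and the value parameters get a squared-error gradient. The linear policy/value parameterization and all names are illustrative assumptions, not the paper's network:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def a3c_rollout_grads(theta, theta_v, states, actions, rewards,
                      bootstrap_value, gamma=0.99, beta=0.01):
    """Accumulate policy (d_theta) and value (d_theta_v) gradients for one rollout."""
    d_theta, d_theta_v = np.zeros_like(theta), np.zeros_like(theta_v)
    R = bootstrap_value                        # V(s_{t_max}; theta_v), or 0 if terminal
    for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
        R = r + gamma * R                      # n-step return in the forward view
        probs = softmax(theta @ s)             # pi(.|s) for a linear-softmax policy
        V = theta_v @ s
        advantage = R - V
        grad_log_pi = np.outer(-probs, s)      # d log pi(a|s) / d theta
        grad_log_pi[a] += s
        d_theta += grad_log_pi * advantage     # policy-gradient term
        H = -np.sum(probs * np.log(probs))     # entropy H(pi(s))
        dH_dlogits = -probs * (np.log(probs) + H)
        d_theta += beta * np.outer(dH_dlogits, s)   # entropy-regularization term
        d_theta_v += (R - V) * s               # move V(s) toward the n-step return
    return d_theta, d_theta_v

# toy usage: 3 actions, 4 features, a 2-step rollout
theta, theta_v = np.zeros((3, 4)), np.zeros(4)
states = [np.array([1.0, 0.0, 0.5, 0.2]), np.array([0.0, 1.0, 0.3, 0.1])]
actions, rewards = [1, 2], [0.0, 1.0]
d_theta, d_theta_v = a3c_rollout_grads(theta, theta_v, states, actions, rewards,
                                       bootstrap_value=0.0)
# each thread would then apply d_theta, d_theta_v asynchronously to the shared model
```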
-
Optimization
- Used a non-centered variant of RMSProp; sharing the moving-average statistics across threads was more robust than keeping per-thread statistics (sketched below)
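A minimal sketch of that non-centered RMSProp variant with the second-moment statistics shared across threads; the class name, hyperparameter values, and lock-free application are assumptions for illustration:

```python
import numpy as np

class SharedRMSProp:
    """Non-centered RMSProp whose moving average g is shared by all threads."""
    def __init__(self, shape, lr=7e-4, alpha=0.99, eps=0.1):
        self.g = np.zeros(shape)             # shared second-moment estimate
        self.lr, self.alpha, self.eps = lr, alpha, eps

    def apply(self, params, grad):
        # g <- alpha * g + (1 - alpha) * grad^2   (no centering / mean subtraction)
        self.g[:] = self.alpha * self.g + (1.0 - self.alpha) * grad ** 2
        # theta <- theta - lr * grad / sqrt(g + eps), applied in place
        params -= self.lr * grad / np.sqrt(self.g + self.eps)

# usage: every actor-learner thread calls opt.apply(shared_params, accumulated_grad)
shared_params = np.zeros(10)
opt = SharedRMSProp(shared_params.shape)
opt.apply(shared_params, np.random.default_rng(0).normal(size=10))
```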
-
Experiments
-
Atari 2600
- Asynchronous methods trained on CPU learn faster than the GPU-trained DQN baseline
- N-step methods faster than one-step ones
- Tuned hyperparameters using a search
- Across 57 games, A3C significantly improves on the SOTA mean score in half the training time and with no GPU
- Matches the median human-normalized score of dueling double DQN and almost matches the median human-normalized score of Gorila
-
TORCS Car Racing Simulator
- A3C was the best performing agent, reaching roughly 75% to 90% of the score obtained by a human tester
-
Continuous Action Control Using the MuJoCo Physics Simulator
- Found good policies in under 24 hours
-
Labyrinth
- Each episode is a new maze → much more challenging
- Finds a reasonable strategy for exploring mazes
-
Scalability and Data Efficiency
- Parallel workers lead to substantial speed ups
- Order of magnitude faster with 16 threads
- Async Q learning and SARSA show superlinear speedups that cannot be explained by computational gains alone
- One-step methods require less data
- Likely because multiple threads reduce the bias in one-step methods
-
Robustness and Stability
- A wide range of learning rates leads to good scores → async methods are robust to the choice of learning rate
- Almost no points with scores of 0
- Methods are stable and do not collapse or diverge
-
Conclusions and Discussions
- Parallel actor-learners have a stabilizing effect on the learning process
- Stable online Q learning is possible without experience replay
- Although experience replay with async environments could substantially improve data efficiency
- Combining other RL methods with async framework could show even more improvements
- Can improve A3C using generalized advantage estimation
- Can try combining true online temporal-difference methods with non-linear function approximation
- Can use dueling architecture or spatial softmax for more improvements