Playing Atari With Deep Reinforcement Learning
Resources
Paper
-
Introduction
- Controlling agents with high dimensional sensory inputs is difficult
- Without DL: Requires hand-crafted features
- Need to learn from sparse reward signals
- RL data distribution changes as algorithm learns new behaviors (many DL methods assume fixed data distributions)
- A CNN trained with a variant of Q-learning can overcome these challenges, learning directly from raw video data
- Uses experience replay to smooth the training distribution
-
Background
- Emulator internal state not available – just the video data
- Feedback about an action may only be received after many thousands of timesteps have elapsed
- Partially observable task: agent only observes images of the current screen
- Perceptual aliasing: impossible to fully determine the current situation from the current screen alone
- Assumes discount factor for rewards
- Maximize the Q value function
- Value Iteration: iteratively apply the Bellman update, using a bootstrapped estimate of the optimal action-value
- Impractical in this basic form because the action-value function is estimated separately for each state sequence, with no generalization
- Use function approximation instead
- Q-Network: a function approximator $Q(s, a; \theta) \approx Q^*(s, a)$
- Loss function is the MSE between the bootstrapped target and the estimated Q-value (see the sketch after this list)
- $L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}[(y_i - Q(s, a; \theta_i))^2]$
- Target: $y_i = \mathbb{E}_{s' \sim \mathcal{E}}[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a]$
- $\rho(\cdot)$ is the behavior distribution over states and actions
- Algorithm Is:
- Model-free: solves the task directly without estimating the dynamics of the emulator
- Off-policy: learns about the greedy policy while following a behavior distribution that ensures exploration
- Behavior distribution: Epsilon Greedy
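A minimal sketch of this loss, assuming a PyTorch `q_net`, a `target_net` holding the previous-iteration parameters $\theta_{i-1}$, and a batch of transition tensors (all names are hypothetical, not from the paper):

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, target_net, batch, gamma=0.99):
    """MSE loss L_i = E[(y_i - Q(s, a; theta_i))^2] with a bootstrapped target y_i.

    `batch` is assumed to be a tuple of tensors:
    states (N, ...), actions (N,) long, rewards (N,), next_states (N, ...), dones (N,).
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta_i): the predicted value of the action that was actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); the previous-iteration
    # parameters are held fixed, so no gradient flows through the target.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * q_next

    return F.mse_loss(q_pred, y)
```

Differentiating this loss with respect to $\theta_i$ gives the gradient used in the stochastic gradient updates.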
-
Deep Reinforcement Learning
- Goal: Connect a deep RL algorithm to image data
- An on-policy value function could also be estimated from SARSA-style experience
- Utilizes experience replay: store the agent's experiences in a replay memory and apply Q-learning updates to minibatches sampled from it (see the sketch after this list)
- Selects action based on epsilon greedy policy
- Advantages of Deep Q Learning over standard online Q learning
- Data efficiency: each stored experience can be reused in many weight updates
- Avoids the strong temporal correlations that come from learning on consecutive samples; random sampling breaks these correlations and reduces the variance of updates
- With on-policy learning (no experience replay), the current parameters determine the next data samples the parameters are trained on
- The training distribution shifts with the current behavior → can cause unwanted feedback loops, poor local minima, or divergence
- With experience replay, the behavior distribution is averaged over many of its previous states, smoothing out learning
- Learning by experience replay requires learning off-policy (the current parameters differ from those used to generate the samples) → motivates the choice of Q-learning
- Experience replay only stores the last N experiences and samples uniformly with no priority → a more sophisticated sampling strategy could emphasize the most important transitions
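A minimal sketch of the replay-based training loop, assuming an old-style Gym environment (`reset()` returns an observation, `step()` returns a 4-tuple), the `q_learning_loss` sketch above, and hypothetical helpers `preprocess()` and `to_tensors()`; all hyperparameters are illustrative, not the paper's:

```python
import random
from collections import deque

import torch

class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s', done) transitions, sampled uniformly at random."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # only the last `capacity` experiences are kept

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy_action(q_net, state, epsilon, num_actions):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())


def train(env, q_net, optimizer, num_steps=100_000, batch_size=32, gamma=0.99):
    memory = ReplayMemory(capacity=50_000)        # illustrative capacity
    state = preprocess(env.reset())               # hypothetical: raw frame(s) -> stacked input tensor
    for step in range(num_steps):
        epsilon = max(0.1, 1.0 - step / 100_000)  # illustrative annealed epsilon for exploration
        action = epsilon_greedy_action(q_net, state, epsilon, env.action_space.n)
        obs, reward, done, _ = env.step(action)
        next_state = preprocess(obs)
        memory.push((state, action, reward, next_state, float(done)))
        state = preprocess(env.reset()) if done else next_state

        if len(memory) >= batch_size:
            batch = to_tensors(memory.sample(batch_size))   # hypothetical collation helper
            # Target params: the paper uses the previous iteration's theta; here the
            # current network is reused with gradients blocked inside the loss.
            loss = q_learning_loss(q_net, q_net, batch, gamma)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```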
-
Preprocessing and Model Architecture
- Preprocessing: grayscale + downsample + crop to 84×84, then stack the last 4 preprocessed frames of the history as the network input (see the sketches after this list)
- Q-Network parameterization
- Input: state/history
- Predicts the Q-values of all actions (the action set is discrete) in a single forward pass
- More efficient than architectures that take the action as an input and need a separate forward pass per action
- Architecture: convolutional layers with nonlinearities, followed by fully connected layers, ending in one output unit per action
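A minimal sketch of the frame preprocessing, assuming raw 210×160 RGB Atari frames as numpy arrays and using OpenCV for resizing; the crop offset is illustrative, not the paper's exact region:

```python
from collections import deque

import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    """RGB frame -> grayscale -> downsample to 110x84 -> crop an 84x84 region."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)                  # (210, 160)
    small = cv2.resize(gray, (84, 110), interpolation=cv2.INTER_AREA)   # (110, 84); dsize is (width, height)
    cropped = small[18:102, :]                                          # (84, 84); illustrative crop of the playing area
    return cropped.astype(np.float32) / 255.0

class FrameStack:
    """Keeps the last 4 preprocessed frames and stacks them as the Q-network input."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame_rgb):
        first = preprocess_frame(frame_rgb)
        for _ in range(self.frames.maxlen):
            self.frames.append(first)            # pad the history with the first frame
        return np.stack(self.frames, axis=0)     # (4, 84, 84)

    def step(self, frame_rgb):
        self.frames.append(preprocess_frame(frame_rgb))
        return np.stack(self.frames, axis=0)
```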
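A sketch of a Q-network in this style: convolutions with nonlinearities followed by fully connected layers, with one output Q-value per action so a single forward pass scores every action. The layer sizes (16 8×8 filters stride 4, 32 4×4 filters stride 2, 256 hidden units) follow my reading of the paper; treat them as a best-effort reconstruction:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per discrete action."""
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),    # (4, 84, 84) -> (16, 20, 20)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),   # -> (32, 9, 9)
            nn.ReLU(),
            nn.Flatten(),                                 # -> 32 * 9 * 9 = 2592
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),                  # one Q-value per action in a single forward pass
        )

    def forward(self, x):
        return self.net(x)

# e.g. QNetwork(num_actions=4)(torch.zeros(1, 4, 84, 84)) -> Q-values of shape (1, 4)
```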
-
Experiments
-
Training and Stability
- Evaluation metric: the total reward the agent collects per game/episode, averaged over a number of games, computed periodically during training
- Tends to be very noisy because small changes to the weights can cause large changes in the distribution of states the policy visits
- Can also track the policy's estimated action-value: the average maximum predicted Q over a fixed set of held-out states – much less noisy (see the sketch after this list)
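A minimal sketch of the smoother metric, assuming a fixed set of held-out states collected before training (e.g. with a random policy) and stored as a tensor; `q_net` is the Q-network sketched above:

```python
import torch

@torch.no_grad()
def average_max_q(q_net, held_out_states):
    """Average, over a fixed held-out state set, of the maximum predicted Q-value; tracked across training."""
    q_values = q_net(held_out_states)               # (N, num_actions)
    return q_values.max(dim=1).values.mean().item()
```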
-
Main Evaluation
- Compared against SARSA with hand-engineered feature sets on the Atari tasks
- Their agent only receives the raw screen as input and must learn to detect objects on its own (the SARSA baselines build in prior knowledge of the visual problem, e.g. using color channels)
- Also reports human performance
- Evaluation runs an epsilon-greedy strategy with epsilon = 0.05 (see the sketch after this list)
- TLDR of results: DQN outperforms the other ML/RL baselines on most games and, depending on the game, performs better than, comparably to, or worse than a human player
- On more challenging games that require strategies extending over longer time scales, it is further from human performance
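A minimal evaluation sketch, reusing the FrameStack and epsilon_greedy_action sketches from above and the same old-style Gym environment assumption; epsilon stays fixed at 0.05 instead of being annealed:

```python
import torch

def evaluate(env, q_net, num_episodes=10, epsilon=0.05):
    """Average total reward per episode under a fixed epsilon-greedy policy."""
    totals = []
    for _ in range(num_episodes):
        stack = FrameStack()
        state = torch.from_numpy(stack.reset(env.reset()))
        done, total = False, 0.0
        while not done:
            action = epsilon_greedy_action(q_net, state, epsilon, env.action_space.n)
            obs, reward, done, _ = env.step(action)
            state = torch.from_numpy(stack.step(obs))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)
```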
· research