Continuous Control with Deep Reinforcement Learning
Resources
Paper
-
Introduction
- DQNs can’t handle continuous action spaces
- Discretizing the action space runs into the curse of dimensionality: the number of actions grows exponentially with the number of degrees of freedom
- Even worse for tasks that require fine-grained control
- Discretization can throw away information on structure of the action domain
- Deep DPG (DDPG) combines actor-critic and DQN ideas
- Leverages replay buffer for off-policy training
- Uses target Q networks for consistent targets for TD backups
-
Background
- Assume standard reinforcement learning setup
- Q Learning: Learn the Q function by minimizing the MSE loss $L(\theta^Q) = \mathbb{E} _{s_t \sim \rho^\beta, a_t \sim \beta, r_t \sim E}[(Q(s_t, a_t \vert \theta^Q) - y_t)^2]$
- Where $y_t = r(s_t, a_t) + \gamma Q(s _{t+1}, \mu(s _{t+1}) \vert \theta^Q)$
- The policy is greedy with respect to the learned Q function
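A minimal PyTorch sketch of this loss, assuming hypothetical `q_net(s, a)` and `policy(s)` modules and a pre-sampled batch of transition tensors; the target here is computed from the same Q network, matching the equation above:

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, policy, batch, gamma=0.99):
    # batch holds tensors (s, a, r, s_next, done) collected under behavior policy beta
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1}) | theta^Q)
        y = r + gamma * (1.0 - done) * q_net(s_next, policy(s_next))
    # L(theta^Q) = E[(Q(s_t, a_t | theta^Q) - y_t)^2]
    return F.mse_loss(q_net(s, a), y)
```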
-
Algorithm
- Can’t apply Q learning directly
- In continuous spaces, finding the greedy policy requires optimization of $a_t$ at every timestep
- Too slow to be practical with large function approximators
- DDPG uses actor critic approach
- Parameterized deterministic actor ($S \rightarrow A$): $\mu(s \vert \theta^\mu)$
- Critic: learned using the Bellman equation, as in Q-learning
- Actor updated by applying the chain rule to the expected return $J$ with respect to the actor parameters
- $\nabla _{\theta^\mu} J \approx \mathbb{E} _{s_t \sim \rho^\beta}[\nabla _{\theta^\mu} Q(s,a \vert \theta^Q) \vert _{s = s_t, a = \mu(s_t \vert \theta^\mu)}]$
- $= \mathbb{E} _{s_t \sim \rho^\beta}[\nabla _a Q(s,a \vert \theta^Q) \vert _{s = s_t, a = \mu(s_t)} \nabla _{\theta^\mu} \mu(s \vert \theta^\mu) \vert _{s = s_t}]$
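As a sketch of how this gradient becomes code (the `actor`, `critic`, and `actor_opt` names are illustrative), autograd applies the chain rule through $a = \mu(s)$ automatically:

```python
def actor_step(actor, critic, actor_opt, states):
    actor_opt.zero_grad()
    # maximize E[Q(s, mu(s))] by minimizing its negation; backprop computes
    # grad_a Q(s, a) * grad_theta mu(s) through the action
    loss = -critic(states, actor(states)).mean()
    loss.backward()
    actor_opt.step()  # only the actor's parameters are stepped here
```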
- Introducing function approximators removes convergence guarantees but needed for state space generalization
- Use a replay buffer so minibatches of training data are approximately IID
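A minimal replay buffer sketch (names illustrative); uniform sampling over stored transitions is what decorrelates the minibatches:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        # uniform sampling breaks the temporal correlation of consecutive steps
        return random.sample(self.buffer, batch_size)
```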
- Used target networks, updated softly as $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$ with $\tau \ll 1$, because critic updates were unstable without them
- Prevented divergence at the cost of slower learning
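The paper's target networks track the learned networks with these soft updates rather than DQN-style periodic copies; a sketch over PyTorch parameters:

```python
def soft_update(target_net, net, tau=1e-3):
    # theta' <- tau * theta + (1 - tau) * theta': targets change slowly,
    # trading learning speed for stability
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p.data)
```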
- Used batch normalization to normalize observation units across environments, making it easier for the network to learn
- Minimizes covariate shift during training
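Concretely, this means normalization layers ahead of each linear layer of the actor; a sketch using the paper's 400/300 hidden sizes and tanh output layer, with the other details illustrative:

```python
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(obs_dim),              # normalize raw observation units
            nn.Linear(obs_dim, 400), nn.BatchNorm1d(400), nn.ReLU(),
            nn.Linear(400, 300), nn.BatchNorm1d(300), nn.ReLU(),
            nn.Linear(300, act_dim), nn.Tanh(),   # tanh bounds the actions
        )

    def forward(self, obs):
        return self.net(obs)
```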
- For exploration, added noise sampled from a noise process to the actor policy
- Exploration policy: $\mu'(s_t) = \mu(s_t \vert \theta_t^\mu) + \mathcal{N}$
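The paper's noise process is an Ornstein-Uhlenbeck process ($\theta = 0.15$, $\sigma = 0.2$), giving temporally correlated exploration suited to physical control; a simple discrete-time sketch:

```python
import numpy as np

class OUNoise:
    def __init__(self, act_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(act_dim, mu)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        # mean-reverting random walk: successive samples are correlated
        self.x += self.theta * (self.mu - self.x) + self.sigma * np.random.randn(*self.x.shape)
        return self.x
```

At acting time, the behavior action is `actor(s) + noise.sample()`, as in the exploration policy above.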
-
Results
- Tested on classical RL tasks and high dimensional tasks
- Tested on both low dimensional observations (i.e., joint angles) and high dimensional observations (i.e., images)
- For high dimensional observations, stacked consecutive frames as input so velocities could be inferred (a single frame alone makes the task a POMDP); see the sketch after this list
- Found that learning from both high and low dimensional states worked well and led to equally fast convergence
- Q estimates are accurate on easier tasks and less accurate on harder ones, but DDPG still learns a good policy
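A sketch of the frame stacking mentioned above, assuming channel-first image frames and the paper's use of the three most recent frames:

```python
from collections import deque
import numpy as np

class FrameStack:
    def __init__(self, k=3):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # fill the stack with copies of the first frame
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.concatenate(list(self.frames), axis=0)

    def step(self, frame):
        self.frames.append(frame)
        # concatenate along the channel axis so the network can infer motion
        return np.concatenate(list(self.frames), axis=0)
```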