Continuous Control with Deep Reinforcement Learning
Resources
Paper
-
Introduction
- DQNs can’t handle continuous action spaces
- Discretizing the action space runs into the curse of dimensionality: the number of actions grows exponentially with the number of degrees of freedom
- Even worse for tasks that require fine-grained control
- Discretization can throw away information on structure of the action domain
- Deep DPG (DDPG) combines actor-critic and DQN ideas
- Leverages replay buffer for off-policy training
- Uses target Q networks for consistent targets for TD backups
-
Background
- Assume standard reinforcement learning setup
- Q Learning: Learn the Q function by minimizing the MSE loss $L(\theta^Q) = \mathbb{E} _{s_t \sim \rho^\beta, a_t \sim \beta, r_t \sim E}[(Q(s_t, a_t \vert \theta^Q) - y_t)^2]$
- Where $y_t = r(s_t, a_t) + \gamma Q(s _{t+1}, \mu(s _{t+1}) \vert \theta^Q)$
- The policy is greedy with respect to the learned Q function
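A minimal PyTorch sketch of this loss, assuming hypothetical `q_net(s, a)` and `policy(s)` modules and a pre-sampled batch of transition tensors; the target here is computed from the same Q network, matching the equation above:

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, policy, batch, gamma=0.99):
    # batch holds tensors (s, a, r, s_next, done) collected under behavior policy beta
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1}) | theta^Q)
        y = r + gamma * (1.0 - done) * q_net(s_next, policy(s_next))
    # L(theta^Q) = E[(Q(s_t, a_t | theta^Q) - y_t)^2]
    return F.mse_loss(q_net(s, a), y)
```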
-
Algorithm
- Can’t apply Q learning directly
- In continuous spaces, finding the greedy policy requires optimization of $a_t$ at every timestep
- Too slow to be practical with large function approximators
- DDPG uses actor critic approach
- Parameterized deterministic actor ($S \rightarrow A$): $\mu(s \vert \theta^\mu)$
- Critic: learned using the Bellman equation, as in Q-learning
- Actor updated by applying the chain rule to the expected return $J$ with respect to the actor parameters
- $\nabla _{\theta^\mu} J \approx \mathbb{E} _{s_t \sim \rho^\beta}[\nabla _{\theta^\mu} Q(s,a \vert \theta^Q) \vert _{s = s_t, a = \mu(s_t \vert \theta^\mu)}]$
- $= \mathbb{E} _{s_t \sim \rho^\beta}[\nabla _a Q(s,a \vert \theta^Q) \vert _{s = s_t, a = \mu(s_t)} \nabla _{\theta^\mu} \mu(s \vert \theta^\mu) \vert _{s = s_t}]$
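As a sketch of how this gradient becomes code (the `actor`, `critic`, and `actor_opt` names are illustrative), autograd applies the chain rule through $a = \mu(s)$ automatically:

```python
def actor_step(actor, critic, actor_opt, states):
    actor_opt.zero_grad()
    # maximize E[Q(s, mu(s))] by minimizing its negation; backprop computes
    # grad_a Q(s, a) * grad_theta mu(s) through the action
    loss = -critic(states, actor(states)).mean()
    loss.backward()
    actor_opt.step()  # only the actor's parameters are stepped here
```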
- Introducing function approximators removes convergence guarantees but needed for state space generalization
- Use a replay buffer so minibatches of training data are approximately IID
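A minimal replay buffer sketch (names illustrative); uniform sampling over stored transitions is what decorrelates the minibatches:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        # uniform sampling breaks the temporal correlation of consecutive steps
        return random.sample(self.buffer, batch_size)
```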
- Used target networks, updated softly as $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$ with $\tau \ll 1$, because critic updates were unstable without them
- Prevented divergence at the cost of slower learning
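The paper's target networks track the learned networks with these soft updates rather than DQN-style periodic copies; a sketch over PyTorch parameters:

```python
def soft_update(target_net, net, tau=1e-3):
    # theta' <- tau * theta + (1 - tau) * theta': targets change slowly,
    # trading learning speed for stability
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p.data)
```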
- Used batch normalization to normalize observation units across environments, making it easier for the network to learn
- Minimizes covariate shift during training
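Concretely, this means normalization layers ahead of each linear layer of the actor; a sketch using the paper's 400/300 hidden sizes and tanh output layer, with the other details illustrative:

```python
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(obs_dim),              # normalize raw observation units
            nn.Linear(obs_dim, 400), nn.BatchNorm1d(400), nn.ReLU(),
            nn.Linear(400, 300), nn.BatchNorm1d(300), nn.ReLU(),
            nn.Linear(300, act_dim), nn.Tanh(),   # tanh bounds the actions
        )

    def forward(self, obs):
        return self.net(obs)
```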
- For exploration, added noise sampled from a noise process to the actor policy
- Exploration policy: $\mu'(s_t) = \mu(s_t \vert \theta_t^\mu) + \mathcal{N}$
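The paper's noise process is an Ornstein-Uhlenbeck process ($\theta = 0.15$, $\sigma = 0.2$), giving temporally correlated exploration suited to physical control; a simple discrete-time sketch:

```python
import numpy as np

class OUNoise:
    def __init__(self, act_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(act_dim, mu)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        # mean-reverting random walk: successive samples are correlated
        self.x += self.theta * (self.mu - self.x) + self.sigma * np.random.randn(*self.x.shape)
        return self.x
```

At acting time, the behavior action is `actor(s) + noise.sample()`, as in the exploration policy above.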
-
Results
- Tested on classical RL tasks and high dimensional tasks
- Tested on both low dimensional observations (i.e., joint angles) and high dimensional observations (i.e., images)
- For high dimensional observations, stacked consecutive frames as input so velocities could be inferred (a single frame alone makes the task a POMDP); see the sketch after this list
- Found that learning from both high and low dimensional states worked well and led to equally fast convergence
- Q estimates are accurate on easier tasks and less accurate on harder ones, but DDPG still learns a good policy
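A sketch of the frame stacking mentioned above, assuming channel-first image frames and the paper's use of the three most recent frames:

```python
from collections import deque
import numpy as np

class FrameStack:
    def __init__(self, k=3):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # fill the stack with copies of the first frame
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.concatenate(list(self.frames), axis=0)

    def step(self, frame):
        self.frames.append(frame)
        # concatenate along the channel axis so the network can infer motion
        return np.concatenate(list(self.frames), axis=0)
```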