DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills
-
Resources
-
Introduction
- Modeling humans + animals is challenging
- Rely on manually designed controllers that don’t generalize well
- Difficult for humans to articulate internal strategies for skills
- Reinforcement learning is promising but lags behind kinematic methods
- Produces extraneous motion or peculiar gaits
- Can use motion capture or animation data to improve controller quality
- Prior work = layering physics based tracking controller on kinematic animation system
- Challenging because the animation system must produce reference motions that are feasible to track
- Limits recovery and deviations
- Ideal learning system
- Supply reference motions and generate goal-directed + realistic behavior
- DeepMimic directly rewards policies that resemble reference animation data
- DeepMimic Methods
- Multi-clip reward based on max operator
- Policy training for skills triggered by user
- Sequencing single clip policies by using value functions to determine feasibility of transitions
-
Overview
- Input: character model, kinematic reference motions, and task defined by reward function
- Output: controller that imitates reference motions while satisfying task objective
- Reference motion is a sequence of target poses ($\{ \hat{q}_t \}$)
- Control policy maps state ($s_t$) and goal ($g_t$) to an action ($a_t$)
- Action specifies target angles for PD controllers
- Reference motions used for imitation reward: $r^I(s_t, a_t)$
- Goal used for task specific reward $r^G(s_t, a_t, g_t)$
-
Background
- Standard RL setting
- Policies trained with proximal policy optimization
- Value function trained with $TD(\lambda)$
- Advantage function computed via generalized advantage estimation ($GAE(\lambda)$)
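A minimal numpy sketch of how GAE($\lambda$) advantages and TD($\lambda$)-style value targets could be computed for one rollout; the function name, variable names, and default $\gamma$/$\lambda$ values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gae_and_targets(rewards, values, last_value, gamma=0.95, lam=0.95):
    """Compute GAE(lambda) advantages and lambda-return value targets for one rollout.
    `values` holds V(s_t) estimates for each step; `last_value` bootstraps the value
    after the final step (use 0 if the episode terminated)."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        gae = delta + gamma * lam * gae                        # recursive GAE accumulation
        advantages[t] = gae
    # Lambda-return targets for the value function = advantages + value baselines
    targets = advantages + np.asarray(values)
    return advantages, targets
```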
-
Policy Representation
- Reference motion only provides kinematic information; policy must figure out which actions should be applied
-
States and Actions
- $s$: configuration of character body
- Link positions relative to the root
- Rotations in quaternions
- Linear + angular velocities
- Computed in character’s local coordinate frame
- Phase variable included because reference motions vary with time
- Goal $g$ also included if the task defines one
- $a$: Specifies target orientations for PD controllers for each joint
- Spherical joints in angle-axis form
- Revolute joints in scalar rotation angles
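A small sketch of how the state vector and the PD control driven by the action might look in code; the helper names, frame conventions, and gains are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def build_state(link_pos_world, link_rot_quat, link_lin_vel, link_ang_vel,
                root_pos, root_rot_inv, phase, goal=None):
    """Assemble the observation: per-link features expressed in the character's
    local (root) frame, plus the motion phase and an optional task goal."""
    features = [phase]
    for p, q, v, w in zip(link_pos_world, link_rot_quat, link_lin_vel, link_ang_vel):
        features.extend(root_rot_inv @ (p - root_pos))  # position relative to root
        features.extend(q)                              # orientation as a quaternion
        features.extend(root_rot_inv @ v)               # linear velocity, local frame
        features.extend(root_rot_inv @ w)               # angular velocity, local frame
    if goal is not None:
        features.extend(goal)
    return np.asarray(features, dtype=np.float32)

def pd_torque(q_target, q, q_dot, kp, kd):
    """PD control toward the policy's target joint angles (gains kp, kd are placeholders)."""
    return kp * (q_target - q) - kd * q_dot
```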
-
Network
- Policy is a neural net that maps state and goal to an action distribution, modeled as a Gaussian
- $\pi(a \vert s, g) = \mathcal{N}(\mu(s), \Sigma)$
- $\Sigma$: Diagonal covariance matrix treated as hyperparameter
- Vision based tasks augment input with heightmap $H$ of the terrain
- Use convolutional layers to process the heightmap
- Features then concatenated with the state and goal
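A minimal sketch of the Gaussian policy head with a fixed diagonal covariance; the layer sizes, weight initialization, and action noise scale here are illustrative assumptions, and the convolutional heightmap branch is omitted.

```python
import numpy as np

class GaussianPolicy:
    """Maps (state, goal) features to a Gaussian over actions with a fixed
    diagonal covariance (treated as a hyperparameter)."""

    def __init__(self, obs_dim, act_dim, hidden=(1024, 512), action_std=0.1, seed=0):
        rng = np.random.default_rng(seed)
        sizes = (obs_dim,) + tuple(hidden) + (act_dim,)
        self.weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]
        self.action_std = action_std  # sqrt of the diagonal of Sigma

    def mean(self, obs):
        h = obs
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            h = np.maximum(h @ W + b, 0.0)            # ReLU hidden layers
        return h @ self.weights[-1] + self.biases[-1]  # linear output = action mean

    def sample(self, obs, rng):
        mu = self.mean(obs)
        return mu + self.action_std * rng.standard_normal(mu.shape)
```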
-
Reward
- Two termed reward: $r_t = \omega^I r^I_t + \omega^G r^G_t$
- $r_t^I$: imitation reward
- $r_t^G$: task reward
- Imitation reward: $r^I_t = w^pr^p_t + w^vr^v_t + w^er^e_t + w^cr^c_t$
- $r^p$: Pose reward; encourages the character to match the joint orientations of the reference motion
- $r^p_t = exp[-2(\Sigma_j \vert \vert \hat{q}_t^j \ominus q_t^j \vert \vert^2)]$
- $\ominus$: indicates quaternion difference
- $r^v_t$: Velocity reward; penalizes differences in local joint (angular) velocities
- $r^v_t = exp[-0.1(\Sigma_j \vert \vert \hat{\dot q}_t^j - \dot q_t^j \vert \vert^2)]$
- Target velocities $\hat{\dot q}_t^j$ computed from the reference motion via finite differences
- $r_t^e$: End-effector reward; encourages hands and feet to match positions from the reference motion
- $r^e_t = exp[-40(\Sigma_e \vert \vert \hat{p_t^e} - p_t^e \vert \vert^2)]$
- $r_t^c$: Penalizes deviations of the character's center of mass from the reference
- $r_t^c = exp[-10 \vert \vert \hat{p_t^c} - p_t^c \vert \vert^2]$
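A small sketch of the imitation reward assembled from the four terms above; the quaternion-difference helper and the default term weights $w^p, w^v, w^e, w^c$ are illustrative assumptions.

```python
import numpy as np

def quat_angle_diff(q_ref, q):
    """Rotation angle (radians) between two unit quaternions, used for the pose term."""
    dot = abs(np.clip(np.sum(q_ref * q), -1.0, 1.0))
    return 2.0 * np.arccos(dot)

def imitation_reward(ref, sim, w_p=0.65, w_v=0.1, w_e=0.15, w_c=0.1):
    """ref/sim: dicts of arrays with joint quaternions, joint velocities,
    end-effector positions, and center-of-mass position for the reference
    motion and the simulated character at the current step."""
    r_p = np.exp(-2.0 * sum(quat_angle_diff(qr, qs) ** 2
                            for qr, qs in zip(ref["joint_quats"], sim["joint_quats"])))
    r_v = np.exp(-0.1 * sum(np.sum((vr - vs) ** 2)
                            for vr, vs in zip(ref["joint_vels"], sim["joint_vels"])))
    r_e = np.exp(-40.0 * sum(np.sum((pr - ps) ** 2)
                             for pr, ps in zip(ref["end_eff"], sim["end_eff"])))
    r_c = np.exp(-10.0 * np.sum((ref["com"] - sim["com"]) ** 2))
    return w_p * r_p + w_v * r_v + w_e * r_e + w_c * r_c
```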
-
Training
- Policies trained with PPO-Clip
- Policy network parameterized by $\theta$, Value network by $\psi$
- Initial state sampled from reference motions
- Rollouts generated by sampling actions
- Episodes run to a fixed horizon or until a termination condition is triggered
- Target values computed with $TD(\lambda)$
- Advantages computed with $GAE(\lambda)$
- Use initial state distribution + early termination for exploration
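A minimal sketch of the PPO clipped surrogate objective evaluated on one batch; the clip range and the assumption that log-probabilities are available from the Gaussian policy are illustrative.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over a batch.
    logp_new/logp_old are log pi(a|s) under the current and rollout policies."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```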
-
Initial State Distribution
- Simple strategy = initialize character to starting state of motion
- Forces policy to learn motion in a sequential manner
- Problematic for motions like backflips; learning the landing is a prerequisite for the flip to ever be rewarded
- Not good for exploration
- Reference state initialization: state sampled from reference motion and used to initialize agent
- Encounters desirable states from reference motion
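A small sketch of reference state initialization; `ref_motion.sample` and `sim.set_state` are hypothetical helpers standing in for the reference-motion lookup and the physics-state reset.

```python
import numpy as np

def reference_state_init(ref_motion, sim, rng):
    """Start the episode at a random point of the reference clip, so the policy
    sees later phases (e.g. the landing of a backflip) early in training."""
    phase = rng.uniform(0.0, 1.0)
    pose, vel = ref_motion.sample(phase)   # hypothetical: reference pose + velocities at this phase
    sim.set_state(pose, vel)               # hypothetical: reset the simulated character
    return phase
```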
-
Early Termination
- Early termination triggered when certain links hit the ground
- Character gets 0 reward for remainder of episode
- Advantages
- Can be used as a means of reward shaping
- Biases data distribution in favor of samples relevant to the task
- E.g., without early termination, early training samples are dominated by the character floundering on the ground
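A small sketch of an episode rollout combining reference state initialization (from the sketch above) with early termination; the `env` methods and attributes are hypothetical stand-ins for the simulator interface.

```python
def rollout(env, policy, ref_motion, rng, max_steps=600):
    """Collect one episode, cutting it short when a fall is detected."""
    phase = reference_state_init(ref_motion, env.sim, rng)   # hypothetical env.sim handle
    trajectory = []
    for _ in range(max_steps):
        state = env.observe(phase)
        action = policy.sample(state, rng)
        state_next, reward, fallen = env.step(action)        # hypothetical env API
        trajectory.append((state, action, reward))
        if fallen:          # e.g. disallowed links (torso/head) touch the ground
            break           # terminate early; the rest of the episode contributes zero reward
        phase = env.phase()
    return trajectory
```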
-
Multi-Skill Integration
- We shouldn’t be limited to a single reference clip - can use multi-clip reward
- User can also control character via a skill selector policy
- We can also train a composite policy
- Multiple policies learned independently but value functions used to determine which policy to activate
- Multi-clip reward: Takes the max imitation reward across all clips
- $r^I_t = max _{j = 1, \dots, k} r_t^j$
- Skill Selector: A single policy imitates diverse skills and can then execute arbitrary skills on demand
- Policy provided with a goal
- During training, this goal is sampled randomly
- Composite Policy:
- Training becomes more difficult as the number of skills a single policy must learn increases
- Can train separate policies for each skill
- Then use value functions to determine which skill to execute
- Composite policy constructed using a Boltzmann distribution (sketch after this list)
- $\Pi(a \vert s) = \Sigma _{i=1}^k p^i(s)\pi^i(a \vert s)$
- $p^i(s) = \frac{exp[V^i(s) / T]}{\Sigma _{j=1}^k exp[V^j(s) / T]}$
- Where $T$ is the temperature
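A small sketch of the multi-clip max reward and of sampling a skill from the composite policy's Boltzmann weighting over per-skill value functions; the per-clip reward function and per-skill policy/value objects are stand-ins.

```python
import numpy as np

def multi_clip_reward(state, clips, imitation_reward_fn):
    """Multi-clip imitation reward: max over the k reference clips."""
    return max(imitation_reward_fn(state, clip) for clip in clips)

def composite_policy_sample(state, policies, value_fns, rng, temperature=1.0):
    """Pick a skill via a Boltzmann distribution over the skills' value estimates,
    then sample an action from the chosen skill's policy."""
    values = np.array([V(state) for V in value_fns])
    logits = values / temperature
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    i = rng.choice(len(policies), p=probs)
    return policies[i].sample(state, rng), i
```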
-
Characters
- 3D Humanoid, Atlas Robot, T-Rex, Dragon
- PD controllers at each joint
- Atlas and humanoid have similar structure, but Atlas is heavier
- Dragon and T-Rex used for examples where there is no motion capture data (keyframed animations used instead)
-
Tasks
- Target Heading: Encourage the character to travel in a target direction (reward sketch after this list)
- $r_t^G = exp[-2.5\ max(0, v^\ast - v_t^T d_t^\ast)^2]$
- $v^\ast$: desired speed
- $d_t^\ast$: target direction; $v_t$: character's center-of-mass velocity, so $v_t^T d_t^\ast$ is the speed along the target direction
- Penalizes traveling slower than the desired speed, but not faster
- During training target direction chosen randomly
- Strike: Character must strike a random target
- $r^G_t = 1$ if target has been hit
- $r^G_t = exp[-4\vert\vert p_t^{tar} - p_t^e \vert \vert^2]$ otherwise
- $p_t^{tar}$: location of the target
- $p_t^{e}$: position of link used to hit target
- Goal is $g_t = (p_t^{tar}, h)$
- $h$ indicates whether the target has already been hit
- Throw: Need to throw ball to target
- Same reward as the strike task, but with $p_t^e$ replaced by the position of the ball
- Terrain Traversal: Character traverses obstacle filled environments
- Obstacles:
- Winding balance beam
- Stairs
- Mixed obstacles
- Gaps
- Use progressive learning approach
- Use fully connected networks to imitate motions on flat terrain
- Next augment with height map and train on irregular environments
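A small sketch of the target-heading and strike task rewards written above; the center-of-mass velocity, target direction, and link/target positions are passed in directly, and the function names are illustrative.

```python
import numpy as np

def target_heading_reward(com_vel, target_dir, target_speed=1.0):
    """Penalize moving slower than the desired speed along the target direction,
    but not moving faster."""
    speed_along = np.dot(com_vel, target_dir)
    shortfall = max(0.0, target_speed - speed_along)
    return np.exp(-2.5 * shortfall ** 2)

def strike_reward(target_pos, link_pos, target_hit):
    """Full reward once the target has been hit; otherwise shaped by the
    distance between the striking link and the target."""
    if target_hit:
        return 1.0
    return np.exp(-4.0 * np.sum((target_pos - link_pos) ** 2))
```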
-
Results
- For locomotion skills, policies produce natural gaits
- Able to learn variety of skills, even those with long flight phases (i.e., backflip)
- Can reproduce contact rich motions like crawling or rolling
- Can also reproduce motions that require coordination with environment
- Policies robust to perturbations and produce recovery behaviors
-
Tasks
- Policies are able to satisfy additional task objectives
- Throwing success rate is 75% for policy with dual objective (vs 5% for imitation only policy)
- Strike success rate is 99% for policy with dual objective (vs 19% for imitation only policy)
- Can deviate from initial reference motion and use additional strategies to satisfy goals
- Without reference motion, policies produce unnatural behaviors
-
Multi-Skill Integration
- Multi-Clip Reward
- Resulting policy learns many agile stepping behaviours to follow heading
- When heading changes, character’s motion becomes more closely aligned with turning motions
- Once re-aligned, it goes back to forward walking motion
- Shows multi-clip does allow policy to learn from many clips
- Mixing very different clips results in policy imitating only a subset of the clips
- Skill Selector
- Goal is a one-hot vector selecting the skill; a single policy is trained on many types of skills
- Once trained, policy was able to execute arbitrary sequences of skills
- Composite Policy
- To integrate diverse policies, use the output of the value function and sample from the composite policy to determine which skill to execute
- Policy restricted never to sample same skill consecutively
- Not trained to transition between skills; value functions enable this transition
-
Retargeting
- Character Retargeting
- Copy local joint rotations from humanoid to atlas
- New policies trained for atlas to imitate retargeted clips
- Despite different character morphologies, system can train policies to reproduce various skills with Atlas model
- Environment Retargeting
- The jump reference clip was recorded on flat terrain, but policies could be trained to perform it in new environments that differ from the original clip
- For vision-based locomotion, the network inputs are augmented with a heightmap
- During training, the policy learned various strategies to traverse each class of obstacles
- Able to adapt the original reference motion to irregular terrains
- Physics Retargeting
- Example: change gravity while training the spin kick
- Despite the differences, policies were able to adapt the motions
-
Ablations
- Reference state initialization + early termination are important!
- Early termination eliminates the local optimum of accumulating reward while lying on the ground
- RSI important for skills with long flight times (without it, the policy can’t reproduce behaviors)
-
Robustness
- Able to withstand external perturbations
- No perturbations applied during training; this robustness likely arises from the action noise of the stochastic policy
-
Discussions and Limitations
- Requires a phase state variable for synchronization with reference motion
- Limits adjusting timing of motion
- Multi-clip integration only works well for small number of clips
- PD controller gains require manual tuning for each character morphology
- Learning takes a while per skill
- Similarity metric between states is defined manually