DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills
-
Resources
-
Introduction
- Modeling humans + animals is challenging
- Rely on manually designed controllers that don’t generalize well
- Difficult for humans to articulate internal strategies for skills
- Reinforcement learning is promising but lags behind kinematic methods
- Produces extraneous motion or peculiar gaits
- Can use motion capture or animation data to improve controller quality
- Prior work = layering physics based tracking controller on kinematic animation system
- Challenging because the animation system must produce reference motions that are feasible to track
- Limits recovery and deviations
- Ideal learning system
- Supply reference motions and generate goal-directed + realistic behavior
- DeepMimic directly rewards policies that resemble reference animation data
- DeepMimic Methods
- Multi-clip reward based on max operator
- Policy training for skills triggered by user
- Sequencing single clip policies by using value functions to determine feasibility of transitions
-
Overview
- Input: character model, kinematic reference motions, and task defined by reward function
- Output: controller that imitates reference motions while satisfying task objective
- Reference motion is a sequence of target poses ($\{ \hat{q}_t \}$)
- Control policy maps state ($s_t$) and goal ($g_t$) to an action ($a_t$)
- Action specifies target angles for PD controllers
- Reference motions used for imitation reward: $r^I(s_t, a_t)$
- Goal used for task specific reward $r^G(s_t, a_t, g_t)$
-
Background
- Standard RL setting
- Policies trained with proximal policy optimization
- Value function trained with $TD(\lambda)$
- Advantage function computed via generalized advantage estimation ($GAE(\lambda)$)
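A minimal numpy sketch of how GAE($\lambda$) advantages and TD($\lambda$)-style value targets could be computed for one rollout; the function name, variable names, and default $\gamma$/$\lambda$ values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gae_and_targets(rewards, values, last_value, gamma=0.95, lam=0.95):
    """Compute GAE(lambda) advantages and lambda-return value targets for one rollout.
    `values` holds V(s_t) estimates for each step; `last_value` bootstraps the value
    after the final step (use 0 if the episode terminated)."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        gae = delta + gamma * lam * gae                        # recursive GAE accumulation
        advantages[t] = gae
    # Lambda-return targets for the value function = advantages + value baselines
    targets = advantages + np.asarray(values)
    return advantages, targets
```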
-
Policy Representation
- Reference motion only provides kinematic information; policy must figure out which actions should be applied
-
States and Actions
- $s$: configuration of character body
- Link positions relative to the root
- Rotations in quaternions
- Linear + angular velocities
- Computed in character’s local coordinate frame
- Phase variable included because reference motions vary with time
- Goal $g$ also included if the task defines one
- $a$: Specifies target orientations for PD controllers for each joint
- Spherical joints in angle-axis form
- Revolute joints in scalar rotation angles
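A small sketch of how the state vector and the PD control driven by the action might look in code; the helper names, frame conventions, and gains are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def build_state(link_pos_world, link_rot_quat, link_lin_vel, link_ang_vel,
                root_pos, root_rot_inv, phase, goal=None):
    """Assemble the observation: per-link features expressed in the character's
    local (root) frame, plus the motion phase and an optional task goal."""
    features = [phase]
    for p, q, v, w in zip(link_pos_world, link_rot_quat, link_lin_vel, link_ang_vel):
        features.extend(root_rot_inv @ (p - root_pos))  # position relative to root
        features.extend(q)                              # orientation as a quaternion
        features.extend(root_rot_inv @ v)               # linear velocity, local frame
        features.extend(root_rot_inv @ w)               # angular velocity, local frame
    if goal is not None:
        features.extend(goal)
    return np.asarray(features, dtype=np.float32)

def pd_torque(q_target, q, q_dot, kp, kd):
    """PD control toward the policy's target joint angles (gains kp, kd are placeholders)."""
    return kp * (q_target - q) - kd * q_dot
```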
-
Network
- Policy is a neural net that maps state and goal to an action distribution, modeled as a Gaussian
- $\pi(a \vert s, g) = \mathcal{N}(\mu(s), \Sigma)$
- $\Sigma$: Diagonal covariance matrix treated as hyperparameter
- Vision based tasks augment input with heightmap $H$ of the terrain
- Use convolutional layers to process the heightmap
- Features then concatenated with the state and goal
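A minimal sketch of the Gaussian policy head with a fixed diagonal covariance; the layer sizes, weight initialization, and action noise scale here are illustrative assumptions, and the convolutional heightmap branch is omitted.

```python
import numpy as np

class GaussianPolicy:
    """Maps (state, goal) features to a Gaussian over actions with a fixed
    diagonal covariance (treated as a hyperparameter)."""

    def __init__(self, obs_dim, act_dim, hidden=(1024, 512), action_std=0.1, seed=0):
        rng = np.random.default_rng(seed)
        sizes = (obs_dim,) + tuple(hidden) + (act_dim,)
        self.weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]
        self.action_std = action_std  # sqrt of the diagonal of Sigma

    def mean(self, obs):
        h = obs
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            h = np.maximum(h @ W + b, 0.0)            # ReLU hidden layers
        return h @ self.weights[-1] + self.biases[-1]  # linear output = action mean

    def sample(self, obs, rng):
        mu = self.mean(obs)
        return mu + self.action_std * rng.standard_normal(mu.shape)
```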
-
Reward
- Two termed reward: $r_t = \omega^I r^I_t + \omega^G r^G_t$
- $r_t^I$: imitation reward
- $r_t^G$: task reward
- Imitation reward: $r^I_t = w^pr^p_t + w^vr^v_t + w^er^e_t + w^cr^c_t$
- $r^p$: Pose reward; encourages the character to match the joint orientations of the reference motion
- $r^p_t = exp[-2(\Sigma_j \vert \vert \hat{q}_t^j \ominus q_t^j \vert \vert^2)]$
- $\ominus$: indicates quaternion difference
- $r^v_t$: Velocity reward; penalizes differences in local joint (angular) velocities
- $r^v_t = exp[-0.1(\Sigma_j \vert \vert \hat{\dot q}_t^j - \dot q_t^j \vert \vert^2)]$
- Target velocities $\hat{\dot q}_t^j$ computed from the reference motion via finite differences
- $r_t^e$: End-effector reward; encourages hands and feet to match positions from the reference motion
- $r^e_t = exp[-40(\Sigma_e \vert \vert \hat{p_t^e} - p_t^e \vert \vert^2)]$
- $r_t^c$: Penalizes deviations of the character's center of mass from the reference
- $r_t^c = exp[-10 \vert \vert \hat{p_t^c} - p_t^c \vert \vert^2]$
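A small sketch of the imitation reward assembled from the four terms above; the quaternion-difference helper and the default term weights $w^p, w^v, w^e, w^c$ are illustrative assumptions.

```python
import numpy as np

def quat_angle_diff(q_ref, q):
    """Rotation angle (radians) between two unit quaternions, used for the pose term."""
    dot = abs(np.clip(np.sum(q_ref * q), -1.0, 1.0))
    return 2.0 * np.arccos(dot)

def imitation_reward(ref, sim, w_p=0.65, w_v=0.1, w_e=0.15, w_c=0.1):
    """ref/sim: dicts of arrays with joint quaternions, joint velocities,
    end-effector positions, and center-of-mass position for the reference
    motion and the simulated character at the current step."""
    r_p = np.exp(-2.0 * sum(quat_angle_diff(qr, qs) ** 2
                            for qr, qs in zip(ref["joint_quats"], sim["joint_quats"])))
    r_v = np.exp(-0.1 * sum(np.sum((vr - vs) ** 2)
                            for vr, vs in zip(ref["joint_vels"], sim["joint_vels"])))
    r_e = np.exp(-40.0 * sum(np.sum((pr - ps) ** 2)
                             for pr, ps in zip(ref["end_eff"], sim["end_eff"])))
    r_c = np.exp(-10.0 * np.sum((ref["com"] - sim["com"]) ** 2))
    return w_p * r_p + w_v * r_v + w_e * r_e + w_c * r_c
```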
-
Training
- Policies trained with PPO-Clip
- Policy network parameterized by $\theta$, Value network by $\psi$
- Initial state sampled from reference motions
- Rollouts generated by sampling actions
- Episodes run to a fixed horizon or until a termination condition is triggered
- Target values computed with $TD(\lambda)$
- Advantages computed with $GAE(\lambda)$
- Use initial state distribution + early termination for exploration
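A minimal sketch of the PPO clipped surrogate objective evaluated on one batch; the clip range and the assumption that log-probabilities are available from the Gaussian policy are illustrative.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over a batch.
    logp_new/logp_old are log pi(a|s) under the current and rollout policies."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```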
-
Initial State Distribution
- Simple strategy = initialize character to starting state of motion
- Forces policy to learn motion in a sequential manner
- Problematic for motions like backflips; learning the landing is a prerequisite for the flip to ever be rewarded
- Not good for exploration
- Reference state initialization: state sampled from reference motion and used to initialize agent
- Encounters desirable states from reference motion
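A small sketch of reference state initialization; `ref_motion.sample` and `sim.set_state` are hypothetical helpers standing in for the reference-motion lookup and the physics-state reset.

```python
import numpy as np

def reference_state_init(ref_motion, sim, rng):
    """Start the episode at a random point of the reference clip, so the policy
    sees later phases (e.g. the landing of a backflip) early in training."""
    phase = rng.uniform(0.0, 1.0)
    pose, vel = ref_motion.sample(phase)   # hypothetical: reference pose + velocities at this phase
    sim.set_state(pose, vel)               # hypothetical: reset the simulated character
    return phase
```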
-
Early Termination
- Early termination triggered when certain links hit the ground
- Character gets 0 reward for remainder of episode
- Advantages
- Can be used as a means of reward shaping
- Biases data distribution in favor of samples relevant to the task
- E.g., without early termination, early training samples are dominated by the character floundering on the ground
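A small sketch of an episode rollout combining reference state initialization (from the sketch above) with early termination; the `env` methods and attributes are hypothetical stand-ins for the simulator interface.

```python
def rollout(env, policy, ref_motion, rng, max_steps=600):
    """Collect one episode, cutting it short when a fall is detected."""
    phase = reference_state_init(ref_motion, env.sim, rng)   # hypothetical env.sim handle
    trajectory = []
    for _ in range(max_steps):
        state = env.observe(phase)
        action = policy.sample(state, rng)
        state_next, reward, fallen = env.step(action)        # hypothetical env API
        trajectory.append((state, action, reward))
        if fallen:          # e.g. disallowed links (torso/head) touch the ground
            break           # terminate early; the rest of the episode contributes zero reward
        phase = env.phase()
    return trajectory
```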
-
Multi-Skill Integration
- We shouldn’t be limited to a single reference clip - can use multi-clip reward
- User can also control character via a skill selector policy
- We can also train a composite policy
- Multiple policies learned independently but value functions used to determine which policy to activate
- Multi-clip reward: Takes the max imitation reward across all clips
- $r^I_t = max _{j = 1, \dots, k} r_t^j$
- Skill Selector: A single policy imitates diverse skills and can then execute arbitrary skills on demand
- Policy provided with a goal
- During training, this goal is sampled randomly
- Composite Policy:
- Training becomes more difficult as the number of skills a single policy must learn increases
- Can train separate policies for each skill
- Then use value functions to determine which skill to execute
- Composite policy constructed using a Boltzmann distribution (sketch after this list)
- $\Pi(a \vert s) = \Sigma _{i=1}^k p^i(s)\pi^i(a \vert s)$
- $p^i(s) = \frac{exp[V^i(s) / T]}{\Sigma _{j=1}^k exp[V^j(s) / T]}$
- Where $T$ is the temperature
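A small sketch of the multi-clip max reward and of sampling a skill from the composite policy's Boltzmann weighting over per-skill value functions; the per-clip reward function and per-skill policy/value objects are stand-ins.

```python
import numpy as np

def multi_clip_reward(state, clips, imitation_reward_fn):
    """Multi-clip imitation reward: max over the k reference clips."""
    return max(imitation_reward_fn(state, clip) for clip in clips)

def composite_policy_sample(state, policies, value_fns, rng, temperature=1.0):
    """Pick a skill via a Boltzmann distribution over the skills' value estimates,
    then sample an action from the chosen skill's policy."""
    values = np.array([V(state) for V in value_fns])
    logits = values / temperature
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    i = rng.choice(len(policies), p=probs)
    return policies[i].sample(state, rng), i
```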
-
Characters
- 3D Humanoid, Atlas Robot, T-Rex, Dragon
- PD controllers at each joint
- Atlas and humanoid have similar structure, but Atlas is heavier
- Dragon and T-Rex used for examples where there is no motion capture data (keyframed animations used instead)
-
Tasks
- Target Heading: Encourage the character to travel in a target direction (reward sketch after this list)
- $r_t^G = exp[-2.5\ max(0, v^\ast - v_t^T d_t^\ast)^2]$
- $v^\ast$: desired speed
- $d_t^\ast$: target direction; $v_t$: character's center-of-mass velocity, so $v_t^T d_t^\ast$ is the speed along the target direction
- Penalizes traveling slower than the desired speed, but not faster
- During training target direction chosen randomly
- Strike: Character must strike a random target
- $r^G_t = 1$ if target has been hit
- $r^G_t = exp[-4\vert\vert p_t^{tar} - p_t^e \vert \vert^2]$ otherwise
- $p_t^{tar}$: location of the target
- $p_t^{e}$: position of link used to hit target
- Goal is $g_t = (p_t^{tar}, h)$
- $h$ indicates whether the target has already been hit
- Throw: Need to throw ball to target
- Same reward as the strike task, but with $p_t^e$ replaced by the position of the ball
- Terrain Traversal: Character traverses obstacle filled environments
- Obstacles:
- Winding balance beam
- Stairs
- Mixed obstacles
- Gaps
- Use progressive learning approach
- Use fully connected networks to imitate motions on flat terrain
- Next augment with height map and train on irregular environments
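A small sketch of the target-heading and strike task rewards written above; the center-of-mass velocity, target direction, and link/target positions are passed in directly, and the function names are illustrative.

```python
import numpy as np

def target_heading_reward(com_vel, target_dir, target_speed=1.0):
    """Penalize moving slower than the desired speed along the target direction,
    but not moving faster."""
    speed_along = np.dot(com_vel, target_dir)
    shortfall = max(0.0, target_speed - speed_along)
    return np.exp(-2.5 * shortfall ** 2)

def strike_reward(target_pos, link_pos, target_hit):
    """Full reward once the target has been hit; otherwise shaped by the
    distance between the striking link and the target."""
    if target_hit:
        return 1.0
    return np.exp(-4.0 * np.sum((target_pos - link_pos) ** 2))
```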
-
Results
- For locomotion skills, policies produce natural gaits
- Able to learn variety of skills, even those with long flight phases (i.e., backflip)
- Can reproduce contact rich motions like crawling or rolling
- Can also reproduce motions that require coordination with environment
- Policies robust to perturbations and produce recovery behaviors
-
Tasks
- Policies are able to satisfy additional task objectives
- Throwing success rate is 75% for policy with dual objective (vs 5% for imitation only policy)
- Strike success rate is 99% for policy with dual objective (vs 19% for imitation only policy)
- Can deviate from initial reference motion and use additional strategies to satisfy goals
- Without reference motion, policies produce unnatural behaviors
-
Multi-Skill Integration
- Multi-Clip Reward
- Resulting policy learns many agile stepping behaviours to follow heading
- When heading changes, character’s motion becomes more closely aligned with turning motions
- Once re-aligned, it goes back to forward walking motion
- Shows multi-clip does allow policy to learn from many clips
- Mixing very different clips results in policy imitating only a subset of the clips
- Skill Selector
- Goal is a one-hot vector selecting the skill; a single policy is trained on many types of skills
- Once trained, policy was able to execute arbitrary sequences of skills
- Composite Policy
- To integrate diverse policies, use the output of the value function and sample from the composite policy to determine which skill to execute
- Policy restricted never to sample same skill consecutively
- Not trained to transition between skills; value functions enable this transition
-
Retargeting
- Character Retargeting
- Copy local joint rotations from humanoid to atlas
- New policies trained for atlas to imitate retargeted clips
- Despite different character morphologies, system can train policies to reproduce various skills with Atlas model
- Environment Retargeting
- The jump reference clip was recorded on flat terrain, but policies could be trained to perform it in new environments that differ from the original clip
- For vision-based locomotion, the network inputs are augmented with a heightmap
- During training, the policy learned various strategies to traverse each class of obstacles
- Able to adapt the original reference motion to irregular terrains
- Physics Retargeting
- Example: change gravity while training the spin kick
- Despite the differences, policies were able to adapt the motions
-
Ablations
- Reference state initialization + early termination are important!
- Early termination eliminates the local optimum of accumulating reward while lying on the ground
- RSI important for skills with long flight times (without it, the policy can’t reproduce behaviors)
-
Robustness
- Able to withstand external perturbations
- No perturbations applied during training; this robustness likely arises from the action noise of the stochastic policy
-
Discussions and Limitations
- Requires a phase state variable for synchronization with reference motion
- Limits adjusting timing of motion
- Multi-clip integration only works well for small number of clips
- PD controller gains require manual tuning for each character morphology
- Learning takes a while per skill
- Similarity metric between states is defined manually