PDP: Physics-Based Character Animation via Diffusion Policy
Resources
Introduction
The problem of generating agents that can traverse and interact with their environment can be approached via RL or behavioral cloning
Conditional VAEs and GANs can capture humanoid skills
VAEs suffer from a trade-off between diversity and robustness
GANs can suffer from mode collapse
Diffusion models unexplored in high frequency control domains
Behavioral cloning with diffusion is ineffective due to compounding errors in high frequency or under-actuated tasks
PDP: Uses diffusion policies with large scale motion datasets to learn diverse multimodal motor skills
Uses expert RL policies to gather valid sequences of observations and actions to overcome domain shift sensitivity
Key Insight: RL policies provide optimal trajectories + corrective actions from suboptimal states
We can train with noisy states + clean actions for a more robust policy
Methods
3 Stages
Train a set of expert policies, each specialized on a small subset of tasks, that together cover a wide variety of motion tracking tasks
Roll out the policies stochastically to build noisy-state, clean-action trajectories
Train a diffusion policy on this data via behavioral cloning
Expert Policy Training
Train an RL policy for a set of tasks
If the set of tasks is large, a single policy can be difficult to learn
Separate the tasks into subsets and train a separate policy for each one
Stochastic Data Collection
For each task, generate a dataset by rolling out the corresponding policy
Use a noisy version of the optimal action from the expert policy
Combine the datasets together
Note: the clean action is stored in the dataset, but the noisy action is used for the rollout
Clean action acts as a corrective action
Creates a noise band around the clean trajectories (similar to DASS)
The noise band is further expanded by generating short recovery episodes that initialize the character with random root positions and orientations (see the sketch below)
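To make the rollout-vs-storage distinction concrete, here is a minimal sketch of noisy-state, clean-action collection. The toy environment, hand-written expert, and constants are placeholder assumptions, not the paper's MuJoCo setup.

```python
import numpy as np

class ToyEnv:
    """Placeholder drift environment standing in for the physics simulator."""
    def reset(self):
        self.state = np.zeros(4)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + 0.1 * action
        return self.state.copy()

def expert_action(state):
    """Placeholder expert policy: push the state back toward the origin."""
    return -state

def collect_rollout(env, horizon=100, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    data = []
    obs = env.reset()
    for _ in range(horizon):
        clean = expert_action(obs)                            # corrective / optimal action
        noisy = clean + rng.normal(0.0, noise_std, clean.shape)
        data.append((obs, clean))                             # store the CLEAN action
        obs = env.step(noisy)                                 # execute the NOISY action
    return data

dataset = collect_rollout(ToyEnv())                           # noisy-state, clean-action pairs
```

Because the executed action is perturbed but the stored label is the expert's clean action, the dataset pairs slightly off-distribution states with actions that pull the character back toward the expert trajectory.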
Behavioral Cloning with Diffusion Policy
Diffusion Model
Action distribution conditioned on observations
Uses denoising diffusion probabilistic models
Denoising process learned by noise-prediction network: $\epsilon _{\theta}(A_t^k, O_t, \tau_t,k)$
$A_t^k$: action sequence at diffusion step $k$ (with $A_t^0$ the clean action sequence sampled from the dataset)
$k$: Diffusion step
Conditioned on $O_t$
$\tau$: task / goal
$\theta$: Model parameters
Sampling occurs through stochastic Langevin dynamics, starting from pure noise:
$A^{k-1}_t = \alpha \left(A^k_t - \gamma\, \epsilon_{\theta}(A_t^k, O_t, \tau_t, k) + \mathcal{N}(0, \sigma^2 I)\right)$
$\gamma, \alpha, \sigma$: tunable hyperparameters
The noise-prediction model is learned in a self-supervised manner:
$\mathcal{L} = \mathrm{MSE}\big(\epsilon^k,\ \epsilon_{\theta}(A_t^0 + \epsilon^k, O_t, \tau_t, k)\big)$
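A compact PyTorch-style rendering of the loss and sampler exactly as written above (the simplified forward process $A_t^0 + \epsilon^k$ follows these notes rather than the full DDPM noise schedule); the small MLP, dimensions, and hyperparameter values are illustrative assumptions.

```python
import torch

K = 50                                   # number of diffusion steps
act_dim, obs_dim, task_dim, horizon = 8, 16, 4, 1

eps_model = torch.nn.Sequential(         # stand-in for the transformer used in the paper
    torch.nn.Linear(act_dim * horizon + obs_dim + task_dim + 1, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, act_dim * horizon),
)

def predict_noise(a_noisy, obs, task, k):
    """eps_theta(A^k, O, tau, k): predict the noise added to the action sequence."""
    k_feat = torch.full((a_noisy.shape[0], 1), float(k) / K)
    x = torch.cat([a_noisy.flatten(1), obs, task, k_feat], dim=-1)
    return eps_model(x).view_as(a_noisy)

def training_loss(a0, obs, task):
    """L = MSE(eps, eps_theta(A^0 + eps, O, tau, k)) for a random diffusion step k."""
    k = torch.randint(1, K + 1, ()).item()
    eps = torch.randn_like(a0)
    return torch.nn.functional.mse_loss(eps, predict_noise(a0 + eps, obs, task, k))

@torch.no_grad()
def sample_actions(obs, task, alpha=1.0, gamma=0.1, sigma=0.02):
    """A^{k-1} = alpha * (A^k - gamma * eps_theta(...) + N(0, sigma^2 I)), from pure noise."""
    a = torch.randn(obs.shape[0], horizon, act_dim)
    for k in range(K, 0, -1):
        noise = sigma * torch.randn_like(a) if k > 1 else 0.0
        a = alpha * (a - gamma * predict_noise(a, obs, task, k) + noise)
    return a

obs, task, a0 = torch.randn(2, obs_dim), torch.randn(2, task_dim), torch.randn(2, horizon, act_dim)
loss = training_loss(a0, obs, task)
actions = sample_actions(obs, task)
```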
Model Architecture
Uses an architecture similar to time-series diffusion transformers
For locomotion control and motion tracking, the task information is contained in the observation
For text to motion, text is encoded with CLIP and then passed through an MLP
Observation also passed through an MLP
Diffusion step embedded into same space and added to text embedding
Result fed through a Feature-wise Linear Modulation (FiLM) layer (learned scale + shift)
Diffusion embedding concatenated with FiLM result; produces conditioning (input for transformer encoder)
Transformer decoder takes embedding of noisy action sequence + encoder result and predicts noise applied to action
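The conditioning path can be sketched roughly as follows; the layer sizes, the token layout for the encoder input, and the use of `nn.Transformer` are assumptions about one plausible reading of the description, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: learned per-feature scale and shift."""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feat_dim)
    def forward(self, feat, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return feat * (1 + scale) + shift

class PDPStyleDenoiser(nn.Module):
    def __init__(self, obs_dim, text_dim, act_dim, d_model=128):
        super().__init__()
        self.obs_mlp = nn.Sequential(nn.Linear(obs_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.text_mlp = nn.Sequential(nn.Linear(text_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.step_emb = nn.Embedding(1000, d_model)        # diffusion-step embedding
        self.film = FiLM(d_model, d_model)
        self.act_in = nn.Linear(act_dim, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.act_out = nn.Linear(d_model, act_dim)

    def forward(self, noisy_actions, obs, text_emb, k):
        # Diffusion step embedded into the same space and added to the text embedding,
        # then used to modulate the observation features via FiLM.
        step = self.step_emb(k)                                    # (B, d_model)
        cond = self.text_mlp(text_emb) + step
        obs_feat = self.film(self.obs_mlp(obs), cond)
        # Concatenate the step embedding with the FiLM result to form the encoder input.
        enc_tokens = torch.stack([step, obs_feat], dim=1)          # (B, 2, d_model)
        dec_tokens = self.act_in(noisy_actions)                    # (B, T, d_model)
        h = self.transformer(src=enc_tokens, tgt=dec_tokens)
        return self.act_out(h)                                     # predicted noise on actions

model = PDPStyleDenoiser(obs_dim=16, text_dim=512, act_dim=8)
eps_hat = model(torch.randn(2, 4, 8), torch.randn(2, 16), torch.randn(2, 512),
                torch.randint(0, 1000, (2,)))
```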
Experiments
3 Applications:
Locomotion control under large perturbations
Universal motion tracking
Physics-based text-to-motion synthesis
Perturbation Recovery
Train a single diffusion policy that is capable of capturing a wide range of human responses to perturbations
Dataset: recovery motions of humans being physically pushed while walking on a treadmill
Experimental Details: 25 joint skeletal model
Environment simulated in MuJoCo
Observations = center of mass positions, linear velocities, and body rotations
During training agent receives same perturbation as human did in dataset
After training the RL policies, collect new observations
Include a binary signal for whether the human is being perturbed
This helps the diffusion policy differentiate between recovery and walking
Universal Motion Tracking
Train a single diffusion policy capable of controlling a character to track a reference motion under physics simulation
Dataset: Subset of AMASS Dataset (exclude infeasible motions)
Experimental Details: reference motion includes the linear position, 6D rotation, and linear + angular velocities of each joint
Prediction horizon is 4 observations and 1 action
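As a small illustration of the quoted prediction horizon, the sketch below slices a recorded trajectory into (4-observation, 1-action) training windows; the array shapes and random contents are placeholders.

```python
import numpy as np

obs_traj = np.random.randn(100, 16)   # (T, obs_dim): e.g. positions, rotations, velocities
act_traj = np.random.randn(100, 8)    # (T, act_dim)

def make_windows(obs_traj, act_traj, obs_horizon=4, act_horizon=1):
    """Slice a trajectory into (observation window, action window) training samples."""
    samples = []
    for t in range(obs_horizon - 1, len(obs_traj) - act_horizon + 1):
        obs_window = obs_traj[t - obs_horizon + 1 : t + 1]   # the last 4 observations
        act_window = act_traj[t : t + act_horizon]           # the next action(s) to predict
        samples.append((obs_window, act_window))
    return samples

samples = make_windows(obs_traj, act_traj)
print(samples[0][0].shape, samples[0][1].shape)              # (4, 16) (1, 8)
```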
Text-to-Motion
Generate motions conditioned on natural language prompt
Dataset: KIT dataset + annotations from HumanML3D
Task vector generated by passing the text through CLIP
Experimental Details: Use joint position, joint velocity, joint rotation, and joint rotational velocities
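A minimal sketch of producing the task vector from a prompt, assuming OpenAI's `clip` package with a ViT-B/32 text encoder (512-dim features); the projection MLP's sizes are arbitrary placeholders.

```python
import torch
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Learned projection of the frozen CLIP text feature into the policy's conditioning space
# (the 512 -> 128 sizes are placeholder choices, not the paper's).
text_mlp = torch.nn.Sequential(
    torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128)
).to(device)

tokens = clip.tokenize(["a person walks forward and waves"]).to(device)
with torch.no_grad():
    text_feat = clip_model.encode_text(tokens).float()   # (1, 512) for ViT-B/32
task_vector = text_mlp(text_feat)                         # conditioning for the diffusion policy
```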
Results
Sampling Strategy
Clean-state, clean-action = lowest performance; success rates: 3.36% for perturbation and 68.8% for tracking
Noisy-state, noisy-action = success rates: 66.9% for perturbation and 64.5% for tracking
Noisy-state, clean-action = success rates: 100% for perturbation and 93.5% for tracking
Perturbation Recovery
Compare PDP to C-VAE (generative) and MLP (deterministic)
Robustness
Perturbations are either in-distribution (ID) or out-of-distribution (OOD)
All 3 models handle ID perturbations
With OOD, PDP (96.3%) and C-VAE (91.3%) have good performance
2 Hyperparameters:
Noise level: zero noise performs badly; increasing it improves performance up to a certain noise level
Action Prediction Horizon: Lower horizons = better performance
Foot Placement Correctness
Measures how far the policy's foot placements are from the closest ground-truth placements
$\mathrm{FPC} = \frac{1}{N}\sum_{i=1}^N \min_{j \in \{1, 2, \dots, M\}} \sqrt{(x_i - \bar{x}_j)^2 + (y_i - \bar{y}_j)^2}$
FPC is much lower for PDP than for C-VAE
C-VAE also fails to capture multimodality; it favors one mode while PDP covers both modes
C-VAE exhibits a trade-off between capturing multimodality and having more variance in foot placement
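The metric is straightforward to compute directly from the formula above; here is a short NumPy version with made-up coordinates.

```python
import numpy as np

def foot_placement_correctness(policy_xy, gt_xy):
    """policy_xy: (N, 2) policy foot placements; gt_xy: (M, 2) ground-truth placements."""
    diffs = policy_xy[:, None, :] - gt_xy[None, :, :]        # (N, M, 2) pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)                   # (N, M) pairwise distances
    return dists.min(axis=1).mean()                          # closest ground truth, averaged

policy_xy = np.array([[0.0, 0.1], [0.5, -0.1]])
gt_xy = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])
print(foot_placement_correctness(policy_xy, gt_xy))          # ~0.1
```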
Motion Tracking
PDP achieves 96.4% success rate on AMASS
An MLP outperforms PDP here; diffusion does not show clear benefits for motion tracking
Text-to-Motion
PDP can follow diverse text commands
Cannot handle composite text prompts because it lacks the necessary memory of the initial action
PDP (57.1%) outperforms MLP (11.9%)
Discussion
MLPs, which cannot capture multimodality, lack robustness to OOD perturbations
C-VAE Posterior Collapse: tuning $\beta$ in the C-VAE is hard
Increasing it forces the latent distribution to align with a normal distribution and causes the model to disregard the latent vector, making it function like an MLP
PDP and MLP Tracking Task: PDP can train a model through supervised learning and exceed the performance of hierarchical RL policies
The local experts can be used to fine-tune on a separate dataset
With RL, you would need to control many low-level controllers or even train a whole new policy
Text2Motion Challenges
Does not perform at the same level as kinematic motion generation
Must balance performing the motion with maintaining equilibrium
Losing balance disrupts the motion with corrective steps
Two distinct motions can be close in kinematic space, yet achieving them can be significantly different in skill space
Limitations
Speed of the denoising process (K denoising steps make inference roughly K times slower than an MLP)
Predicting multiple actions dilutes importance of immediate action