PDP: Physics-Based Character Animation via Diffusion Policy
Resources
Introduction
The problem of generating agents that can traverse and interact with their environment can be approached via RL or behavioral cloning
Conditional VAEs and GANs can capture humanoid skills
VAEs suffer from a trade-off between diversity and robustness
GANs can suffer from mode collapse
Diffusion models unexplored in high frequency control domains
Behavioral cloning with diffusion is ineffective due to compounding errors in high frequency or under-actuated tasks
PDP: Uses diffusion policies with large scale motion datasets to learn diverse multimodal motor skills
Uses expert RL policies to gather valid sequences of observations and actions to overcome domain shift sensitivity
Key Insight: RL policies provide optimal trajectories + corrective actions from suboptimal states
We can train with noisy states + clean actions for a more robust policy
Methods
3 Stages
Train a set of expert policies, each specialized on a small subset of tasks, that together cover a wide variety of motion tracking tasks
Roll out the policies stochastically to build noisy-state, clean-action trajectories
Train a diffusion policy on this data via behavioral cloning
Expert Policy Training
Train an RL policy for a set of tasks
If the set of tasks is large, a single policy can be difficult to learn
Separate the tasks into subsets and train a separate policy for each one
Stochastic Data Collection
For each task, generate a dataset by rolling out the corresponding policy
Use a noisy version of the optimal action from the expert policy
Combine the datasets together
Note: the clean action is stored in the dataset, but the noisy action is used for the rollout
Clean action acts as a corrective action
Creates a noise band around the clean trajectories (similar to DASS)
The noise band is further expanded by generating short recovery episodes that initialize the character with random root positions and orientations (see the sketch below)
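To make the rollout-vs-storage distinction concrete, here is a minimal sketch of noisy-state, clean-action collection. The toy environment, hand-written expert, and constants are placeholder assumptions, not the paper's MuJoCo setup.

```python
import numpy as np

class ToyEnv:
    """Placeholder drift environment standing in for the physics simulator."""
    def reset(self):
        self.state = np.zeros(4)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + 0.1 * action
        return self.state.copy()

def expert_action(state):
    """Placeholder expert policy: push the state back toward the origin."""
    return -state

def collect_rollout(env, horizon=100, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    data = []
    obs = env.reset()
    for _ in range(horizon):
        clean = expert_action(obs)                            # corrective / optimal action
        noisy = clean + rng.normal(0.0, noise_std, clean.shape)
        data.append((obs, clean))                             # store the CLEAN action
        obs = env.step(noisy)                                 # execute the NOISY action
    return data

dataset = collect_rollout(ToyEnv())                           # noisy-state, clean-action pairs
```

Because the executed action is perturbed but the stored label is the expert's clean action, the dataset pairs slightly off-distribution states with actions that pull the character back toward the expert trajectory.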
Behavioral Cloning with Diffusion Policy
Diffusion Model
Action distribution conditioned on observations
Uses denoising diffusion probabilistic models
Denoising process learned by noise-prediction network: $\epsilon _{\theta}(A_t^k, O_t, \tau_t,k)$
$A_t^k$: action sequence at diffusion step $k$ (with $A_t^0$ the clean action sequence sampled from the dataset)
$k$: Diffusion step
Conditioned on $O_t$
$\tau$: task / goal
$\theta$: Model parameters
Sampling occurs through stochastic Langevin dynamics, starting from pure noise:
$A^{k-1}_t = \alpha \left(A^k_t - \gamma\, \epsilon_{\theta}(A_t^k, O_t, \tau_t, k) + \mathcal{N}(0, \sigma^2 I)\right)$
$\gamma, \alpha, \sigma$: tunable hyperparameters
The noise-prediction model is learned in a self-supervised manner:
$\mathcal{L} = \mathrm{MSE}\big(\epsilon^k,\ \epsilon_{\theta}(A_t^0 + \epsilon^k, O_t, \tau_t, k)\big)$
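A compact PyTorch-style rendering of the loss and sampler exactly as written above (the simplified forward process $A_t^0 + \epsilon^k$ follows these notes rather than the full DDPM noise schedule); the small MLP, dimensions, and hyperparameter values are illustrative assumptions.

```python
import torch

K = 50                                   # number of diffusion steps
act_dim, obs_dim, task_dim, horizon = 8, 16, 4, 1

eps_model = torch.nn.Sequential(         # stand-in for the transformer used in the paper
    torch.nn.Linear(act_dim * horizon + obs_dim + task_dim + 1, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, act_dim * horizon),
)

def predict_noise(a_noisy, obs, task, k):
    """eps_theta(A^k, O, tau, k): predict the noise added to the action sequence."""
    k_feat = torch.full((a_noisy.shape[0], 1), float(k) / K)
    x = torch.cat([a_noisy.flatten(1), obs, task, k_feat], dim=-1)
    return eps_model(x).view_as(a_noisy)

def training_loss(a0, obs, task):
    """L = MSE(eps, eps_theta(A^0 + eps, O, tau, k)) for a random diffusion step k."""
    k = torch.randint(1, K + 1, ()).item()
    eps = torch.randn_like(a0)
    return torch.nn.functional.mse_loss(eps, predict_noise(a0 + eps, obs, task, k))

@torch.no_grad()
def sample_actions(obs, task, alpha=1.0, gamma=0.1, sigma=0.02):
    """A^{k-1} = alpha * (A^k - gamma * eps_theta(...) + N(0, sigma^2 I)), from pure noise."""
    a = torch.randn(obs.shape[0], horizon, act_dim)
    for k in range(K, 0, -1):
        noise = sigma * torch.randn_like(a) if k > 1 else 0.0
        a = alpha * (a - gamma * predict_noise(a, obs, task, k) + noise)
    return a

obs, task, a0 = torch.randn(2, obs_dim), torch.randn(2, task_dim), torch.randn(2, horizon, act_dim)
loss = training_loss(a0, obs, task)
actions = sample_actions(obs, task)
```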
Model Architecture
Uses an architecture similar to time-series diffusion transformers
For locomotion control and motion tracking, the task information is contained in the observation
For text to motion, text is encoded with CLIP and then passed through an MLP
Observation also passed through an MLP
Diffusion step embedded into same space and added to text embedding
Result fed through a Feature-wise Linear Modulation (FiLM) layer (learned scale + shift)
Diffusion embedding concatenated with FiLM result; produces conditioning (input for transformer encoder)
Transformer decoder takes embedding of noisy action sequence + encoder result and predicts noise applied to action
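The conditioning path can be sketched roughly as follows; the layer sizes, the token layout for the encoder input, and the use of `nn.Transformer` are assumptions about one plausible reading of the description, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: learned per-feature scale and shift."""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feat_dim)
    def forward(self, feat, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return feat * (1 + scale) + shift

class PDPStyleDenoiser(nn.Module):
    def __init__(self, obs_dim, text_dim, act_dim, d_model=128):
        super().__init__()
        self.obs_mlp = nn.Sequential(nn.Linear(obs_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.text_mlp = nn.Sequential(nn.Linear(text_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.step_emb = nn.Embedding(1000, d_model)        # diffusion-step embedding
        self.film = FiLM(d_model, d_model)
        self.act_in = nn.Linear(act_dim, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.act_out = nn.Linear(d_model, act_dim)

    def forward(self, noisy_actions, obs, text_emb, k):
        # Diffusion step embedded into the same space and added to the text embedding,
        # then used to modulate the observation features via FiLM.
        step = self.step_emb(k)                                    # (B, d_model)
        cond = self.text_mlp(text_emb) + step
        obs_feat = self.film(self.obs_mlp(obs), cond)
        # Concatenate the step embedding with the FiLM result to form the encoder input.
        enc_tokens = torch.stack([step, obs_feat], dim=1)          # (B, 2, d_model)
        dec_tokens = self.act_in(noisy_actions)                    # (B, T, d_model)
        h = self.transformer(src=enc_tokens, tgt=dec_tokens)
        return self.act_out(h)                                     # predicted noise on actions

model = PDPStyleDenoiser(obs_dim=16, text_dim=512, act_dim=8)
eps_hat = model(torch.randn(2, 4, 8), torch.randn(2, 16), torch.randn(2, 512),
                torch.randint(0, 1000, (2,)))
```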
Experiments
3 Applications:
Locomotion control under large perturbations
Universal motion tracking
Physics-based text-to-motion synthesis
Perturbation Recovery
Train a single diffusion policy that is capable of capturing a wide range of human responses to perturbations
Dataset: recovery motions of humans being physically pushed while walking on a treadmill
Experimental Details: 25 joint skeletal model
Environment simulated in MuJoCo
Observations = center of mass positions, linear velocities, and body rotations
During training agent receives same perturbation as human did in dataset
After training the RL policies, collect new observations
Include a binary signal for whether the human is being perturbed
This helps the diffusion policy differentiate between recovery and walking
Universal Motion Tracking
Train a single diffusion policy capable of controlling a character to track a reference motion under physics simulation
Dataset: Subset of AMASS Dataset (exclude infeasible motions)
Experimental Details: reference motion includes the linear position, 6D rotation, and linear + angular velocities of each joint
Prediction horizon is 4 observations and 1 action
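As a small illustration of the quoted prediction horizon, the sketch below slices a recorded trajectory into (4-observation, 1-action) training windows; the array shapes and random contents are placeholders.

```python
import numpy as np

obs_traj = np.random.randn(100, 16)   # (T, obs_dim): e.g. positions, rotations, velocities
act_traj = np.random.randn(100, 8)    # (T, act_dim)

def make_windows(obs_traj, act_traj, obs_horizon=4, act_horizon=1):
    """Slice a trajectory into (observation window, action window) training samples."""
    samples = []
    for t in range(obs_horizon - 1, len(obs_traj) - act_horizon + 1):
        obs_window = obs_traj[t - obs_horizon + 1 : t + 1]   # the last 4 observations
        act_window = act_traj[t : t + act_horizon]           # the next action(s) to predict
        samples.append((obs_window, act_window))
    return samples

samples = make_windows(obs_traj, act_traj)
print(samples[0][0].shape, samples[0][1].shape)              # (4, 16) (1, 8)
```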
Text-to-Motion
Generate motions conditioned on natural language prompt
Dataset: KIT dataset + annotations from HumanML3D
Task vector generated by passing the text through CLIP
Experimental Details: Use joint position, joint velocity, joint rotation, and joint rotational velocities
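A minimal sketch of producing the task vector from a prompt, assuming OpenAI's `clip` package with a ViT-B/32 text encoder (512-dim features); the projection MLP's sizes are arbitrary placeholders.

```python
import torch
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Learned projection of the frozen CLIP text feature into the policy's conditioning space
# (the 512 -> 128 sizes are placeholder choices, not the paper's).
text_mlp = torch.nn.Sequential(
    torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128)
).to(device)

tokens = clip.tokenize(["a person walks forward and waves"]).to(device)
with torch.no_grad():
    text_feat = clip_model.encode_text(tokens).float()   # (1, 512) for ViT-B/32
task_vector = text_mlp(text_feat)                         # conditioning for the diffusion policy
```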
Results
Sampling Strategy
Clean-state, clean-action = lowest performance; success rates: 3.36% for perturbation and 68.8% for tracking
Noisy-state, noisy-action = success rates: 66.9% for perturbation and 64.5% for tracking
Noisy-state, clean-action = success rates: 100% for perturbation and 93.5% for tracking
Perturbation Recovery
Compare PDP to C-VAE (generative) and MLP (deterministic)
Robustness
Perturbations are either in-distribution (ID) or out-of-distribution (OOD)
All 3 models handle ID perturbations
With OOD, PDP (96.3%) and C-VAE (91.3%) have good performance
2 Hyperparameters:
Noise level: zero noise performs badly; increasing it improves performance up to a certain noise level
Action Prediction Horizon: Lower horizons = better performance
Foot Placement Correctness
Measures how far the policy's foot placements are from the closest ground-truth placements
$\mathrm{FPC} = \frac{1}{N}\sum_{i=1}^N \min_{j \in \{1, 2, \dots, M\}} \sqrt{(x_i - \bar{x}_j)^2 + (y_i - \bar{y}_j)^2}$
FPC is much lower for PDP than for C-VAE
C-VAE also fails to capture multimodality; it favors one mode while PDP covers both modes
C-VAE exhibits a trade-off between capturing multimodality and having more variance in foot placement
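The metric is straightforward to compute directly from the formula above; here is a short NumPy version with made-up coordinates.

```python
import numpy as np

def foot_placement_correctness(policy_xy, gt_xy):
    """policy_xy: (N, 2) policy foot placements; gt_xy: (M, 2) ground-truth placements."""
    diffs = policy_xy[:, None, :] - gt_xy[None, :, :]        # (N, M, 2) pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)                   # (N, M) pairwise distances
    return dists.min(axis=1).mean()                          # closest ground truth, averaged

policy_xy = np.array([[0.0, 0.1], [0.5, -0.1]])
gt_xy = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])
print(foot_placement_correctness(policy_xy, gt_xy))          # ~0.1
```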
Motion Tracking
PDP achieves 96.4% success rate on AMASS
An MLP outperforms PDP here; diffusion does not show clear benefits for motion tracking
Text-to-Motion
PDP can follow diverse text commands
Cannot handle composite text prompts because it lacks the necessary memory of the initial action
PDP (57.1%) outperforms MLP (11.9%)
Discussion
MLPs, which cannot capture multimodality, lack robustness to OOD perturbations
C-VAE Posterior Collapse: tuning $\beta$ in the C-VAE is hard
Increasing it forces the latent distribution to align with a normal distribution and causes the model to disregard the latent vector, making it function like an MLP
PDP and MLP Tracking Task: PDP can train a model through supervised learning and exceed the performance of hierarchical RL policies
The local experts can be used to fine-tune on a separate dataset
With RL, you would need to control many low-level controllers or even train a whole new policy
Text2Motion Challenges
Does not perform at the same level as kinematic motion generation
Must balance performing the motion with maintaining equilibrium
Losing balance disrupts the motion with corrective steps
Two distinct motions can be close in kinematic space, yet achieving them can be significantly different in skill space
Limitations
Speed of the denoising process (K denoising steps make inference roughly K times slower than an MLP)
Predicting multiple actions dilutes importance of immediate action