ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning
-
Resources
-
Introduction
- Precise + expressive humanoid whole-body loco-manipulation is hard
- Requires rich whole body contact data that isn’t available at scale
- Direct imitation of humans is attractive
- Contact locations + relative object poses in human demonstrations fail to translate directly to the humanoid (embodiment gap)
- General motion tracking (GMT) policies trained on human datasets are unaware of objects
- Humanoid loco-manipulation approaches rely on task-specific designs, which limits scalability and generalization
- Robotic foundation models are powerful but pretrain-finetune for humanoids has been largely unexplored
- Key Insight
- Diverse human motions can be pretrained via GMT
- Object centric loco-manipulation requires task-specific corrections
- Whole body motions have shared attributes
- Fine grained object interaction requires adaptation
- New Approach: Stable motion prior augmented with lightweight task-specific adjustments
- ResMimic
- First stage: Train GMT policy on motion capture data to serve as a prior for human motions
- Second stage: Train a task-specific residual policy conditioned on the object reference trajectory
- Outputs corrective actions that refine GMT + enable precise object manipulation
- Decoupling the two stages removes the need for task-specific reward tuning + improves data efficiency
-
Method
- Framed as a goal-conditioned reinforcement learning problem with an MDP structure
- State ($s \in \mathcal{S}$)
- Robot proprioception ($s_t^r$)
- Object state ($s_t^o$)
- Motion goal state ($\hat{s}_t^r$)
- Object goal state ($\hat{s}_t^o$)
- Action
- Target joint angles executed through PD controller
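A minimal sketch of the PD execution step, assuming the standard position-control form; the gain values are illustrative, not from the paper.

```python
import numpy as np

def pd_torques(q_target, q, q_dot, kp=100.0, kd=2.0):
    """Convert policy actions (target joint angles) into joint torques.

    kp pulls each joint toward its commanded target; kd damps joint velocity.
    Gains are placeholders, not values from the paper.
    """
    return kp * (np.asarray(q_target) - np.asarray(q)) - kd * np.asarray(q_dot)
```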
-
Two-Stage Residual Learning
- General motion tracking policy ($\pi _{GMT}$)
- Uses robot proprioception + reference motion to get coarse action
- $a_t^{gmt} = \pi _{GMT}(s_t^r, \hat{s}_t^r)$
- Maximizes a motion tracking reward ($r_t^m$)
- Residual Refinement
- Train efficient and precise residual policy per task
- $\Delta a_t^{res} = \pi _{Res}(s_t^r, s_t^o, \hat{s}_t^r, \hat{s}_t^o)$
- Final action is the sum of both outputs: $a_t = a_t^{gmt} + \Delta a_t^{res}$
- Maximizes combined motion and object reward ($r_t^m$ and $r_t^o$)
- Trained with PPO
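A rough sketch of how the two stages compose at execution time, following the additive residual structure above; the policy call signatures and variable names are placeholders.

```python
def residual_step(pi_gmt, pi_res, s_r, s_o, ref_r, ref_o):
    """One control step of the two-stage policy (sketch).

    pi_gmt: pretrained general motion tracking policy (assumed frozen)
    pi_res: task-specific residual policy trained with PPO
    """
    a_gmt = pi_gmt(s_r, ref_r)                 # coarse whole-body tracking action
    delta_a = pi_res(s_r, s_o, ref_r, ref_o)   # corrective residual for object interaction
    return a_gmt + delta_a                     # target joint angles sent to the PD controller
```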
-
General Motion Tracking Policy
- Dataset: Use popular Mocap datasets like AMASS, OMOMO
- Apply kinematics based retargeting to transfer human motions to humanoid reference motion
- Training
- Proprioceptive observation: $s_t^r = [\theta_t, \omega_t, q_t, \dot q_t, a_t^{hist}] _{t-10:t}$
- $\theta_t$: Root orientation
- $\omega_t$: Root angular velocity
- $q_t$: Joint positions
- $\dot q_t$: Joint velocities
- $a_t^{hist}$: recent action history
- Reference motion: $\hat{s}_t^r = [\hat{p}_t, \hat{\theta}_t, \hat{q}_t] _{t-10:t+10}$
- $\hat{p}_t$: reference root translation
- $\hat{\theta}_t$: reference root orientation
- $\hat{q}_t$: reference joint positions
- Use future reference motion to plan for upcoming targets
- Reward and Domain Randomization
- Motion tracking reward ($r_t^m$) is sum of task rewards, penalty terms, and regularization term
- Use domain randomization for better sim2real
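A small sketch of how the 10-step proprioceptive history and the 10-past/10-future reference window above could be stacked into a single GMT observation vector; array layouts and shapes are assumptions.

```python
import numpy as np

def build_gmt_obs(prop_hist, ref_motion, t):
    """Flatten robot history and reference window into one observation (sketch).

    prop_hist:  (T, d_prop) rows of [root orient, root ang vel, q, q_dot, prev action]
    ref_motion: (T, d_ref)  rows of [ref root pos, ref root orient, ref q]
    Assumes 10 <= t <= T - 10 for brevity (no padding at episode boundaries).
    """
    prop_window = prop_hist[t - 10:t]        # last 10 steps of proprioception
    ref_window = ref_motion[t - 10:t + 10]   # 10 past + 10 future reference frames
    return np.concatenate([prop_window.ravel(), ref_window.ravel()])
```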
-
Residual Refinement Policy
- Use retargeted reference motions of humanoid and object to train residual policy ($\{ (\hat{s}_t^r, \hat{s}_t^o)\} _{t=1}^T$)
- Training
- Use PPO
- Takes in $(s_t^r, s_t^o, \hat{s}_t^r, \hat{s}_t^o)$ and outputs residual action, $\Delta a_t^{res}$
- Network Initialization
- Initialize the final layer of the PPO actor with Xavier uniform initialization and a small gain
- Ensures initial outputs are close to 0
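A sketch of that initialization in PyTorch; the exact gain value is an assumption, the point is only that the residual head starts near zero so early behavior matches the GMT policy.

```python
import torch.nn as nn

def init_residual_head(final_layer: nn.Linear, gain: float = 0.01):
    """Make the actor's last layer output near-zero residual actions at the start."""
    nn.init.xavier_uniform_(final_layer.weight, gain=gain)  # small-gain Xavier uniform
    nn.init.zeros_(final_layer.bias)
```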
- Virtual Object Force Curriculum
- Residual framework alone fails when reference motions are noisy or objects are heavy
- Noise typically comes from penetration artifacts introduced by kinematic retargeting
- Large object masses cause instability during early training
- Use a curriculum that stabilizes training by driving object toward reference trajectory
- PD Controllers apply virtual force + torque
- $\mathcal{F}_t = k_p(\hat{p}_t^o - p_t^o) - k_d v_t^o$
- $\mathcal{T}_t = k_p(\hat{\theta}_t^o \ominus \theta_t^o) - k_d\omega_t^o$
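A sketch of the virtual object PD controller above, with a curriculum scale that would be annealed toward zero as training progresses; the gains, annealing schedule, and orientation-error representation are assumptions.

```python
import numpy as np

def virtual_object_wrench(p_ref, p, v, rot_err, omega,
                          kp=50.0, kd=5.0, curriculum_scale=1.0):
    """Virtual force/torque that drives the object toward its reference pose.

    rot_err:          axis-angle vector of the relative rotation (theta_ref ominus theta)
    curriculum_scale: annealed 1 -> 0 so the assistance disappears by the end of training
    """
    force = kp * (np.asarray(p_ref) - np.asarray(p)) - kd * np.asarray(v)
    torque = kp * np.asarray(rot_err) - kd * np.asarray(omega)
    return curriculum_scale * force, curriculum_scale * torque
```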
- Reward and Early Termination
- Decoupling motion tracking and object interaction avoids per-task reward weight tuning
- Instead, use motion reward + domain randomization from GMT, and introduce two additional terms
- $r^o_t$: Object tracking reward; encourages task completion
- $r^c_t$: Contact tracking reward; gives explicit guidance on body-object contact
- Object tracking reward:
- Sample N points from the object mesh surface and compute the point-wise distance between current and reference states
- $r^o_t = exp(-\lambda_o \sum _{i=1}^N \vert \vert P[i]_t - \hat{P}[i]_t \vert \vert_2)$
- Contact reward:
- Discretize contact locations into links (excluding feet since they are usually on the ground)
- Oracle contact information: $\hat{c}_t[i] = 1(\vert \vert \hat{d}_t[i] \vert \vert < \sigma_c)$, where $\hat{d}_t[i]$ is the reference distance between link $i$ and the object
- $1(\cdot)$ is indicator function
- $r^c_t = \sum_i \hat{c}_t[i] \cdot exp(- \frac{\lambda}{f_t[i]})$
- $f_t[i]$: contact force at link $i$
- Early termination:
- Object tracking error $\vert \vert P_t - \hat{P}_t \vert \vert_2$ exceeds a predefined threshold
- Required object-body contact lost for more than 10 frames
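A sketch of the object-tracking reward, contact reward, and early-termination checks above; the coefficients, thresholds, and array conventions are placeholders.

```python
import numpy as np

def object_tracking_reward(P, P_ref, lam_o=1.0):
    """exp(-lambda_o * sum of distances) between sampled object points and their references."""
    dists = np.linalg.norm(P - P_ref, axis=-1)   # (N,) per-point position errors
    return float(np.exp(-lam_o * dists.sum()))

def contact_reward(contact_forces, ref_contact_mask, lam=1.0):
    """Reward links that should be in contact (per the reference) for carrying contact force."""
    f = np.maximum(contact_forces, 1e-6)         # avoid division by zero for force-free links
    return float(np.sum(ref_contact_mask * np.exp(-lam / f)))

def should_terminate(P, P_ref, frames_without_contact,
                     max_drift=0.25, max_lost_frames=10):
    """Early termination: object drifts too far from the reference, or required contact is lost too long."""
    drift = float(np.linalg.norm(P - P_ref, axis=-1).mean())
    return drift > max_drift or frames_without_contact > max_lost_frames
```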
-
Experiments
- Questions
- Can GMT without task-specific training accomplish diverse loco-manipulation tasks?
- Does initializing from a pretrained GMT improve training efficiency + final performance relative to training from scratch?
- Is residual learning more effective than fine-tuning when adapting GMT to loco-manipulation?
- Can ResMimic achieve robust control in the real world?
-
Experiment Setup
- Tasks
- Kneel on 1 knee and lift a box
- Carry a box onto the back
- Squat + lift box with arms and torso
- Lift up a chair
- Evaluation
- Training iterations until convergence
- Object tracking error
- $E_o = \frac{1}{T} \sum _{t=1}^T \sum _{i=1}^N \vert \vert P[i]_t - \hat{P}[i]_t \vert \vert_2$
- Motion tracking error
- $E_m = \frac{1}{T} \sum _{t=1}^T \sum _{i=1}^N \vert \vert p[i]_t - \hat{p}[i]_t \vert \vert_2$
- Joint tracking error
- $E_j = \frac{1}{T} \sum _{t=1}^T \vert \vert q_t - \hat{q}_t \vert \vert_2$
- Task success rate
- Success if $E_o$ is below predefined threshold + robot is balanced
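A sketch of computing the three tracking errors above over an episode; input shapes and names are assumptions.

```python
import numpy as np

def tracking_errors(P, P_ref, p_body, p_body_ref, q, q_ref):
    """Episode-averaged tracking errors (sketch).

    P, P_ref:           (T, N, 3) sampled object surface points vs. reference
    p_body, p_body_ref: (T, K, 3) body keypoint positions vs. reference
    q, q_ref:           (T, J)    joint angles vs. reference
    """
    E_o = np.linalg.norm(P - P_ref, axis=-1).sum(axis=-1).mean()             # object tracking error
    E_m = np.linalg.norm(p_body - p_body_ref, axis=-1).sum(axis=-1).mean()   # motion tracking error
    E_j = np.linalg.norm(q - q_ref, axis=-1).mean()                          # joint tracking error
    return E_o, E_m, E_j
```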
- Baselines
- Base Policy: GMT policy follows human reference motion without object information
- Train from scratch: RL policy trained to track human motion + object trajectories without GMT
- Base policy + fine-tune: Base policy fine-tuned to track human motion + object trajectories
-
Sim-to-Sim Evaluation
- GMT alone cannot complete loco-manipulation tasks, but it provides a strong initialization (10% success rate vs. 92.5% for ResMimic)
- Using GMT as a base policy improves training efficiency + effectiveness
- Residual learning outperforms direct fine-tuning
- Fine-tuning cannot easily incorporate additional object inputs, since the pretrained GMT policy observes only human motion data
- Fine-tuning overwrites the generalization the GMT policy gained from pretraining
- Lack of explicit object state prevents learning robust interaction behaviors
-
Real-world Evaluation
- In the real world, ResMimic results in:
- Expressive carrying motions
- Humanoid interaction beyond manipulation
- Heavy payload carrying with whole-body contact
- Generalization to irregular heavy objects
- Can manipulate objects from random poses, autonomously perform consecutive loco-manipulation tasks, and react to external perturbations
-
Ablation Studies
- Effect of virtual object controller
- Stabilizes early-stage training by applying curriculum-scaled forces that guide the object toward its reference trajectory
- Effect of contact reward
- Provides explicit guidance for leveraging whole-body contact strategies