ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning
-
Resources
-
Introduction
- Precise + expressive humanoid whole-body loco-manipulation is hard
- Requires rich whole body contact data that isn’t available at scale
- Direct imitation of humans is attractive
- Contact locations + relative object poses in human demonstrations fail to translate directly to the humanoid (embodiment gap)
- General motion tracking (GMT) policies trained on human datasets are unaware of objects
- Humanoid loco-manipulation approaches rely on task-specific designs, which limits scalability and generalization
- Robotic foundation models are powerful but pretrain-finetune for humanoids has been largely unexplored
- Key Insight
- Diverse human motions can be pretrained via GMT
- Object centric loco-manipulation requires task-specific corrections
- Whole body motions have shared attributes
- Fine grained object interaction requires adaptation
- New Approach: Stable motion prior augmented with lightweight task-specific adjustments
- ResMimic
- First stage: Train GMT policy on motion capture data to serve as a prior for human motions
- Second stage: Train a task-specific residual policy conditioned on the object reference trajectory
- Outputs corrective actions that refine GMT + enable precise object manipulation
- Decoupling the two stages removes the need for task-specific reward tuning + improves data efficiency
-
Method
- Framed as a goal-conditioned reinforcement learning problem with an MDP structure
- State ($s \in \mathcal{S}$)
- Robot proprioception ($s_t^r$)
- Object state ($s_t^o$)
- Motion goal state ($\hat{s}_t^r$)
- Object goal state ($\hat{s}_t^o$)
- Action
- Target joint angles executed through PD controller
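A minimal sketch of the PD execution step, assuming the standard position-control form; the gain values are illustrative, not from the paper.

```python
import numpy as np

def pd_torques(q_target, q, q_dot, kp=100.0, kd=2.0):
    """Convert policy actions (target joint angles) into joint torques.

    kp pulls each joint toward its commanded target; kd damps joint velocity.
    Gains are placeholders, not values from the paper.
    """
    return kp * (np.asarray(q_target) - np.asarray(q)) - kd * np.asarray(q_dot)
```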
-
Two-Stage Residual Learning
- General motion tracking policy ($\pi _{GMT}$)
- Uses robot proprioception + reference motion to get coarse action
- $a_t^{gmt} = \pi _{GMT}(s_t^r, \hat{s}_t^r)$
- Maximizes a motion tracking reward ($r_t^m$)
- Residual Refinement
- Train efficient and precise residual policy per task
- $\Delta a_t^{res} = \pi _{Res}(s_t^r, s_t^o, \hat{s}_t^r, \hat{s}_t^o)$
- Final action is the sum of both outputs: $a_t = a_t^{gmt} + \Delta a_t^{res}$
- Maximizes combined motion and object reward ($r_t^m$ and $r_t^o$)
- Trained with PPO
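A rough sketch of how the two stages compose at execution time, following the additive residual structure above; the policy call signatures and variable names are placeholders.

```python
def residual_step(pi_gmt, pi_res, s_r, s_o, ref_r, ref_o):
    """One control step of the two-stage policy (sketch).

    pi_gmt: pretrained general motion tracking policy (assumed frozen)
    pi_res: task-specific residual policy trained with PPO
    """
    a_gmt = pi_gmt(s_r, ref_r)                 # coarse whole-body tracking action
    delta_a = pi_res(s_r, s_o, ref_r, ref_o)   # corrective residual for object interaction
    return a_gmt + delta_a                     # target joint angles sent to the PD controller
```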
-
General Motion Tracking Policy
- Dataset: Use popular Mocap datasets like AMASS, OMOMO
- Apply kinematics based retargeting to transfer human motions to humanoid reference motion
- Training
- Proprioceptive observation: $s_t^r = [\theta_t, \omega_t, q_t, \dot q_t, a_t^{hist}] _{t-10:t}$
- $\theta_t$: Root orientation
- $\omega_t$: Root angular velocity
- $q_t$: Joint positions
- $\dot q_t$: Joint velocities
- $a_t^{hist}$: recent action history
- Reference motion: $\hat{s}_t^r = [\hat{p}_t, \hat{\theta}_t, \hat{q}_t] _{t-10:t+10}$
- $\hat{p}_t$: reference root translation
- $\hat{\theta}_t$: reference root orientation
- $\hat{q}_t$: reference joint positions
- Use future reference motion to plan for upcoming targets
- Reward and Domain Randomization
- Motion tracking reward ($r_t^m$) is sum of task rewards, penalty terms, and regularization term
- Use domain randomization for better sim2real
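A small sketch of how the 10-step proprioceptive history and the 10-past/10-future reference window above could be stacked into a single GMT observation vector; array layouts and shapes are assumptions.

```python
import numpy as np

def build_gmt_obs(prop_hist, ref_motion, t):
    """Flatten robot history and reference window into one observation (sketch).

    prop_hist:  (T, d_prop) rows of [root orient, root ang vel, q, q_dot, prev action]
    ref_motion: (T, d_ref)  rows of [ref root pos, ref root orient, ref q]
    Assumes 10 <= t <= T - 10 for brevity (no padding at episode boundaries).
    """
    prop_window = prop_hist[t - 10:t]        # last 10 steps of proprioception
    ref_window = ref_motion[t - 10:t + 10]   # 10 past + 10 future reference frames
    return np.concatenate([prop_window.ravel(), ref_window.ravel()])
```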
-
Residual Refinement Policy
- Use retargeted reference motions of humanoid and object to train residual policy ($\{ (\hat{s}_t^r, \hat{s}_t^o)\} _{t=1}^T$)
- Training
- Use PPO
- Takes in $(s_t^r, s_t^o, \hat{s}_t^r, \hat{s}_t^o)$ and outputs residual action, $\Delta a_t^{res}$
- Network Initialization
- Initialize the final layer of the PPO actor with Xavier uniform initialization and a small gain
- Ensures initial outputs are close to 0
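A sketch of that initialization in PyTorch; the exact gain value is an assumption, the point is only that the residual head starts near zero so early behavior matches the GMT policy.

```python
import torch.nn as nn

def init_residual_head(final_layer: nn.Linear, gain: float = 0.01):
    """Make the actor's last layer output near-zero residual actions at the start."""
    nn.init.xavier_uniform_(final_layer.weight, gain=gain)  # small-gain Xavier uniform
    nn.init.zeros_(final_layer.bias)
```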
- Virtual Object Force Curriculum
- Residual framework alone fails when reference motions are noisy or objects are heavy
- Noise typically comes from penetration artifacts introduced by kinematic retargeting
- Large object masses cause instability during early training
- Use a curriculum that stabilizes training by driving object toward reference trajectory
- PD Controllers apply virtual force + torque
- $\mathcal{F}_t = k_p(\hat{p}_t^o - p_t^o) - k_d v_t^o$
- $\mathcal{T}_t = k_p(\hat{\theta}_t^o \ominus \theta_t^o) - k_d\omega_t^o$
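A sketch of the virtual object PD controller above, with a curriculum scale that would be annealed toward zero as training progresses; the gains, annealing schedule, and orientation-error representation are assumptions.

```python
import numpy as np

def virtual_object_wrench(p_ref, p, v, rot_err, omega,
                          kp=50.0, kd=5.0, curriculum_scale=1.0):
    """Virtual force/torque that drives the object toward its reference pose.

    rot_err:          axis-angle vector of the relative rotation (theta_ref ominus theta)
    curriculum_scale: annealed 1 -> 0 so the assistance disappears by the end of training
    """
    force = kp * (np.asarray(p_ref) - np.asarray(p)) - kd * np.asarray(v)
    torque = kp * np.asarray(rot_err) - kd * np.asarray(omega)
    return curriculum_scale * force, curriculum_scale * torque
```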
- Reward and Early Termination
- Decoupling motion tracking and object interaction avoids per-task reward weight tuning
- Instead, use motion reward + domain randomization from GMT, and introduce two additional terms
- $r^o_t$: Object tracking reward; encourages task completion
- $r^c_t$: Contact tracking reward; gives explicit guidance on body-object contact
- Object tracking reward:
- Sample N points from the object mesh surface and compute the point-wise distance between current and reference states
- $r^o_t = exp(-\lambda_o \sum _{i=1}^N \vert \vert P[i]_t - \hat{P}[i]_t \vert \vert_2)$
- Contact reward:
- Discretize contact locations into links (excluding feet since they are usually on the ground)
- Oracle contact information: $\hat{c}_t[i] = 1(\vert \vert \hat{d}_t[i] \vert \vert < \sigma_c)$, where $\hat{d}_t[i]$ is the reference distance between link $i$ and the object
- $1(\cdot)$ is indicator function
- $r^c_t = \sum_i \hat{c}_t[i] \cdot exp(- \frac{\lambda}{f_t[i]})$
- $f_t[i]$: contact force at link $i$
- Early termination:
- Object tracking error $\vert \vert P_t - \hat{P}_t \vert \vert_2$ exceeds a predefined threshold
- Required object-body contact lost for more than 10 frames
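A sketch of the object-tracking reward, contact reward, and early-termination checks above; the coefficients, thresholds, and array conventions are placeholders.

```python
import numpy as np

def object_tracking_reward(P, P_ref, lam_o=1.0):
    """exp(-lambda_o * sum of distances) between sampled object points and their references."""
    dists = np.linalg.norm(P - P_ref, axis=-1)   # (N,) per-point position errors
    return float(np.exp(-lam_o * dists.sum()))

def contact_reward(contact_forces, ref_contact_mask, lam=1.0):
    """Reward links that should be in contact (per the reference) for carrying contact force."""
    f = np.maximum(contact_forces, 1e-6)         # avoid division by zero for force-free links
    return float(np.sum(ref_contact_mask * np.exp(-lam / f)))

def should_terminate(P, P_ref, frames_without_contact,
                     max_drift=0.25, max_lost_frames=10):
    """Early termination: object drifts too far from the reference, or required contact is lost too long."""
    drift = float(np.linalg.norm(P - P_ref, axis=-1).mean())
    return drift > max_drift or frames_without_contact > max_lost_frames
```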
-
Experiments
- Questions
- Can GMT without task-specific training accomplish diverse loco-manipulation tasks?
- Does initializing from a pretrained GMT improve training efficiency + final performance relative to training from scratch?
- Is residual learning more effective than fine-tuning when adapting GMT to loco-manipulation?
- Can ResMimic achieve robust control in the real world?
-
Experiment Setup
- Tasks
- Kneel on 1 knee and lift a box
- Carry a box onto the back
- Squat + lift box with arms and torso
- Lift up a chair
- Evaluation
- Training iterations until convergence
- Object tracking error
- $E_o = \frac{1}{T} \sum _{t=1}^T \sum _{i=1}^N \vert \vert P[i]_t - \hat{P}[i]_t \vert \vert_2$
- Motion tracking error
- $E_m = \frac{1}{T} \sum _{t=1}^T \sum _{i=1}^N \vert \vert p[i]_t - \hat{p}[i]_t \vert \vert_2$
- Joint tracking error
- $E_j = \frac{1}{T} \sum _{t=1}^T \vert \vert q_t - \hat{q}_t \vert \vert_2$
- Task success rate
- Success if $E_o$ is below predefined threshold + robot is balanced
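A sketch of computing the three tracking errors above over an episode; input shapes and names are assumptions.

```python
import numpy as np

def tracking_errors(P, P_ref, p_body, p_body_ref, q, q_ref):
    """Episode-averaged tracking errors (sketch).

    P, P_ref:           (T, N, 3) sampled object surface points vs. reference
    p_body, p_body_ref: (T, K, 3) body keypoint positions vs. reference
    q, q_ref:           (T, J)    joint angles vs. reference
    """
    E_o = np.linalg.norm(P - P_ref, axis=-1).sum(axis=-1).mean()             # object tracking error
    E_m = np.linalg.norm(p_body - p_body_ref, axis=-1).sum(axis=-1).mean()   # motion tracking error
    E_j = np.linalg.norm(q - q_ref, axis=-1).mean()                          # joint tracking error
    return E_o, E_m, E_j
```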
- Baselines
- Base Policy: GMT policy follows human reference motion without object information
- Train from scratch: RL policy trained to track human motion + object trajectories without GMT
- Base policy + fine-tune: Base policy fine-tuned to track human motion + object trajectories
-
Sim-to-Sim Evaluation
- GMT alone cannot complete loco-manipulation tasks, but it provides a strong initialization (10% success rate vs. 92.5% for ResMimic)
- Using GMT as a base policy improves training efficiency + effectiveness
- Residual learning outperforms direct fine-tuning
- Fine-tuning cannot easily incorporate additional object inputs, since the pretrained GMT policy observes only human motion data
- Fine-tuning overwrites the generalization the GMT policy gained from pretraining
- Lack of explicit object state prevents learning robust interaction behaviors
-
Real-world Evaluation
- In the real world, ResMimic results in:
- Expressive carrying motions
- Humanoid interaction beyond manipulation
- Heavy payload carrying with whole-body contact
- Generalization to irregular heavy objects
- Can manipulate objects from random poses, autonomously perform consecutive loco-manipulation tasks, and react to external perturbations
-
Ablation Studies
- Effect of virtual object controller
- Stabilizes early-stage training by applying curriculum-scaled forces that guide the object toward its reference trajectory
- Effect of contact reward
- Provides explicit guidance for leveraging whole-body contact strategies