HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation
Resources
Introduction
Humans continuously create bimanual manipulation data
Data targets robots with grippers, failing to generalize to dexterous hands
Interaction between robotic hands + manipulation objects usually omitted
Recent approaches use RL to learn motion strategies under guidance of reference trajectories
Usually draw on limited human motion data
Oftentimes has not been transferred to real world
Current sim2real methods require full knowledge of object and robot state
Fail to achieve end-to-end visual learning
HERMES: Embodied learning framework for bimanual dexterous hand manipulation
Diverse sources of human motion
End2end vision-based sim2real transfer
Uses DAgger distillation to convert state-based expert policies into vision-based student policies
Introduce generalized object-centric depth augmentation + hybrid control
Mobile manipulation
Gives robots mobile manipulation skills
Uses RGB-D for localization
Task modeled as a Perspective-n-Point (PnP) problem addressed through iterative process
System Design
Hardware Design
X1 mobile base, two 6-DoF Galaxea A1 arms, and two OYMotion 6-DoF dexterous hands
RealSense L515 to capture RGBD observations
RERVISION Fisheye camera for navigation
Simulation Design
Use MuJoCo + MJX
Actuation range of joints matches real robot
Use MuJoCo's closed-chain mechanisms to model DoFs without motors (i.e., fingers)
Use equality constraint feature in MuJoCo
Approximate geometry using primitive shapes for collisions
Reinforcement Learning Method
Task Formulation
Standard RL MDP Formulation
Use reference trajectory as the goal, $\mathcal{G}$
State includes proprioception info ($s^p$) and goal state ($s^g$)
Reward is a function of both states
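A minimal sketch of how the goal-conditioned state could be assembled; the array layout and names are illustrative, not from the paper:

```python
import numpy as np

def build_state(qpos, qvel, ref_traj, t):
    """Hypothetical layout of the goal-conditioned state.

    s^p: proprioception (joint positions + velocities).
    s^g: the reference-trajectory frame at step t, serving as the goal G.
    """
    s_p = np.concatenate([qpos, qvel])   # proprioceptive state s^p
    s_g = ref_traj[t]                    # goal state s^g from the reference trajectory
    return np.concatenate([s_p, s_g])    # reward is computed from both parts
```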
Collect One-shot Human Motion
3 sources of human motion
Teleop in sim
MoCap
Hand object poses from raw video
Can use single human reference to get generalizable policy without need for extensive demonstrations
Teleop
Use Apple Vision Pro to get hand poses + arm movements
MoCap
Retargeting data to robot hands is hard; use RL to learn behavior
Use the OakInk2 dataset
Arm + Hand poses from video
Use WiLoR to detect hands in video + extract 2D keypoints + corresponding 3D counterpart
Select stable keypoints for estimation: wrist + metacarpophalangeal joints
Spatial translation of the wrist is estimated by solving a perspective-n-point problem
Palm orientation derived by fitting a 3D plane to the selected 3D keypoints
Use FoundationPose to estimate object pose from video
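A sketch of the wrist-translation PnP solve and the palm plane fit, assuming WiLoR provides matched 2D/3D keypoints; input shapes and solver flags are assumptions:

```python
import cv2
import numpy as np

def estimate_wrist_and_palm(kps_2d, kps_3d, K):
    """Sketch of the pose-from-video step (names/inputs are assumptions).

    kps_2d: (N, 2) stable 2D keypoints (wrist + MCP joints) from WiLoR.
    kps_3d: (N, 3) corresponding 3D keypoints in the hand model frame.
    K:      (3, 3) camera intrinsics.
    """
    # Wrist translation: solve a Perspective-n-Point problem on the stable keypoints.
    ok, rvec, tvec = cv2.solvePnP(
        kps_3d.astype(np.float64), kps_2d.astype(np.float64),
        K, distCoeffs=None, flags=cv2.SOLVEPNP_ITERATIVE)

    # Palm orientation: fit a 3D plane to the selected keypoints via SVD;
    # the smallest singular vector is the palm normal.
    centered = kps_3d - kps_3d.mean(axis=0)
    _, _, vt = np.linalg.svd(centered)
    palm_normal = vt[-1]
    return tvec, palm_normal
```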
Synthesize Multiple Trajectories
Trajectory augmentation by randomizing positions + orientations in predefined range
$\hat{A}^{pose}[\tau_k] = T^{trans} \cdot A^{pose}[\tau_k]$
Apply transformation matrix to alter pose
Enables spatial generalization (avoids need for more demonstrations)
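A minimal augmentation sketch, assuming poses are stored as 4×4 homogeneous matrices and using placeholder randomization ranges:

```python
import numpy as np

def augment_trajectory(A_pose, pos_range=0.05, yaw_range=np.pi / 6, rng=None):
    """Synthesize a new reference by applying one random rigid transform T^trans
    to every pose A^pose[tau_k] in the trajectory (ranges are assumptions).

    A_pose: (T, 4, 4) sequence of homogeneous poses.
    """
    rng = rng or np.random.default_rng()
    # Random planar translation and yaw within the predefined range.
    dx, dy = rng.uniform(-pos_range, pos_range, size=2)
    yaw = rng.uniform(-yaw_range, yaw_range)
    c, s = np.cos(yaw), np.sin(yaw)
    T_trans = np.array([[c, -s, 0, dx],
                        [s,  c, 0, dy],
                        [0,  0, 1, 0],
                        [0,  0, 0, 1]])
    # \hat{A}^{pose}[tau_k] = T^{trans} · A^{pose}[tau_k], applied to the whole trajectory.
    return np.einsum("ij,tjk->tik", T_trans, A_pose)
```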
Use DexPilot for retargeting
Apply RL to refine + adapt robot behaviors
Generalizable Reward Design for Manipulation
Use one reward function that can be reused across tasks
Note: see paper for formulas for each reward
Object-centric distance chain ($r_{chain}$): Use fingers and the center of the object's collision mesh as keypoints to model spatial relationships between hand and object
Compute the number of contact points between mesh and fingers; if it exceeds a threshold, the reward is activated
Object trajectory tracking ($r_{obj}$): Aligns the policy's behavior with the object's trajectory
Power Penalty ($r_{penalty}$): Used to alleviate jittering actions
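An illustrative combination of the three terms; the exact formulas are in the paper, so the functional forms, weights, and threshold below are assumptions:

```python
import numpy as np

def manipulation_reward(finger_pos, obj_center, obj_pose, ref_obj_pose,
                        joint_torque, joint_vel, n_contacts,
                        contact_threshold=2, w_chain=1.0, w_obj=1.0, w_pen=1e-3):
    """Illustrative sum of r_chain, r_obj, and r_penalty (forms are assumptions)."""
    # r_chain: object-centric distance chain between fingertips and the object's
    # collision-mesh center, activated only when enough contact points exist.
    dists = np.linalg.norm(finger_pos - obj_center, axis=-1)
    r_chain = np.exp(-dists.mean()) if n_contacts >= contact_threshold else 0.0

    # r_obj: object trajectory tracking, rewarding alignment with the reference pose.
    r_obj = np.exp(-np.linalg.norm(obj_pose - ref_obj_pose))

    # r_penalty: power penalty to discourage jittery, high-energy actions.
    r_penalty = -np.sum(np.abs(joint_torque * joint_vel))

    return w_chain * r_chain + w_obj * r_obj + w_pen * r_penalty
```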
Residual Action Learning
Arm actions:
Coarse action derived from human trajectory
Fine (residual) component, $\Delta _{a _{f}}$ from learned network
Hand actions:
Network models the entire action due to inaccuracies in retargeting (see the action-composition sketch below)
Use early termination to avoid inefficient exploration
Disable object collisions in early stages of training
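A sketch of the arm/hand action split; the action dimensions and residual scaling are assumptions:

```python
import numpy as np

def compose_action(ref_arm_action, policy_out, arm_dim=12, scale=0.1):
    """Sketch of residual action learning (dimensions and scale are assumptions).

    Arm: coarse action from the retargeted human trajectory plus a learned
    residual Δa_f. Hand: the network outputs the full action, since the
    retargeted finger motion is too inaccurate to serve as a base.
    """
    delta_a_f = scale * policy_out[:arm_dim]      # fine (residual) component
    arm_action = ref_arm_action + delta_a_f       # coarse + residual
    hand_action = policy_out[arm_dim:]            # full hand action from the network
    return np.concatenate([arm_action, hand_action])
```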
Reinforcement Learning Algorithm
Implement the DrM algorithm
Off-policy method that uses a dormant-ratio mechanism to enhance exploration
Concurrently, also use PPO
Sim-to-Real Transfer
Need to distill state policy into visual policy
Leveraging Depth Image as Visual Input
Clip depth values beyond a threshold distance, $d$
Missing depth values filled with max depth
To emulate real-world edge noise + blur, add Gaussian noise + blur to simulated depth images
Mimic missing depth by setting 0.5% of pixel values to max depth
Linearly blend simulation-rendered depth with a dataset depth map: $\hat{o} = \alpha o_{sim} + (1 - \alpha) o_{dataset}$
Enriches diversity of the depth noise distribution
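A sketch of the depth-augmentation pipeline with assumed parameter values:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def augment_depth(depth_sim, depth_dataset, d_max=1.5, noise_std=0.01,
                  blur_sigma=1.0, dropout=0.005, rng=None):
    """Depth augmentation sketch (threshold, noise, and blend values are assumptions)."""
    rng = rng or np.random.default_rng()
    # Clip beyond the threshold distance d and fill missing values with max depth.
    depth = np.where((depth_sim <= 0) | (depth_sim > d_max), d_max, depth_sim)
    # Emulate real-world edge noise + blur.
    depth = gaussian_filter(depth + rng.normal(0.0, noise_std, depth.shape), blur_sigma)
    # Mimic missing depth by setting ~0.5% of pixels to max depth.
    depth[rng.random(depth.shape) < dropout] = d_max
    # Linearly blend simulated depth with a real dataset depth map:
    # o_hat = alpha * o_sim + (1 - alpha) * o_dataset
    alpha = rng.uniform(0.5, 1.0)
    return alpha * depth + (1 - alpha) * depth_dataset
```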
DAgger Distillation Training
State-based expert policy acts as the teacher for the vision-based student policy
HERMES distills the state-based policy into one operating on raw visual observations of the entire scene
Avoids the need for camera calibration
Model Architecture
Input = 140×140 pixels, 3 stacked frames
Passed into an image encoder
First 2 layers used to capture fine-grained visual details
Use GroupNorm instead of BatchNorm for distributional consistency
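A minimal encoder sketch; layer sizes are assumptions, with GroupNorm in place of BatchNorm as described:

```python
import torch
import torch.nn as nn

class DepthEncoder(nn.Module):
    """Sketch: 3 stacked 140x140 depth frames in, feature vector out.
    Channel counts and strides are assumptions."""

    def __init__(self, in_frames=3, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            # First two layers use stride 1 to preserve fine-grained visual details.
            nn.Conv2d(in_frames, 32, kernel_size=3, stride=1), nn.GroupNorm(4, 32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.GroupNorm(4, 32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.GroupNorm(8, 64), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.GroupNorm(8, 64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):          # x: (B, 3, 140, 140)
        return self.net(x)
```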
Trajectory Rollout Scheduler
Expert policy rolls out trajectories at beginning of DAgger
These rollouts gradually decrease throughout training by annealing the probability
Also increases the student's participation in rollouts
Exponentially decay the probability
Optimize student policy using L1 and L2 action loss terms
Add uniform noise into proprioception states for regularization
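A sketch of the rollout scheduler, the L1 + L2 distillation loss, and the proprioception noise; the decay rate and noise scale are assumptions:

```python
import math
import torch
import torch.nn.functional as F

def expert_rollout_prob(step, p0=1.0, decay=5e-5):
    """Exponentially decayed probability that the expert drives the rollout
    (initial value and decay rate are assumptions)."""
    return p0 * math.exp(-decay * step)

def dagger_loss(student_action, expert_action):
    # Student actions are supervised with both L1 and L2 terms on the expert labels.
    return F.l1_loss(student_action, expert_action) + F.mse_loss(student_action, expert_action)

def perturb_proprioception(s_p, eps=0.01):
    # Uniform noise on proprioceptive inputs as a simple regularizer.
    return s_p + (2 * torch.rand_like(s_p) - 1) * eps
```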
Hybrid Sim2real Control
Real world visuals used to infer action which is then applied to sim environment
Updated joint positions from sim are transferred to the real robot
Camera then captures real world image + incorporates proprioception states for next cycle
Sim2real discrepancy is mitigated because of shared inverse kinematics + dynamic parameters
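A sketch of the hybrid control cycle; the camera/robot/policy interfaces are hypothetical, and only the loop structure follows the description above:

```python
# Hybrid sim2real control cycle sketch (all interfaces below are hypothetical).
def hybrid_control_loop(policy, sim, real_robot, camera, steps=500):
    for _ in range(steps):
        depth = camera.get_depth()                    # real-world visual observation
        proprio = real_robot.get_joint_positions()    # real proprioception
        action = policy(depth, proprio)               # infer action from real inputs
        sim.apply_action(action)                      # step the action in simulation first
        q_target = sim.get_joint_positions()          # updated joint positions from sim
        real_robot.command_joint_positions(q_target)  # transfer them to the real robot
        # Shared inverse kinematics + dynamics parameters keep sim and robot in sync,
        # mitigating the sim2real discrepancy.
```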
Navigation Methodology
ViNT Navigation Foundation Model
HERMES uses image goal navigation foundation model
ViNT searches for goal observations in a topological map
Computes a sequence of relative waypoints based on current and goal observations
Closed-loop PnP Localization
Discrepancy between final and target pose can lead manipulation policy to fail
ViNT doesn’t guarantee termination within a tight enough bound
Local refinement step after ViNT
PnP algorithm adjusts the robot pose
Use a neural feature matching module (Efficient LoFTR) to detect correspondences between the captured and goal images
Features lifted to 3D space using intrinsics + depth map
Apply RANSAC PnP to compute relative rotation + translation and minimize reprojection error
Using realtime feedback from PnP, incrementally converge toward target pose + refine pose estimation
Use PID controller to adjust pose of robot
Outputs planar velocity commands
Includes a sequential adjustment strategy that prioritizes orientation correction, since reorientation displaces the omnidirectional chassis
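A sketch of one refinement cycle, assuming a hypothetical wrapper around Efficient LoFTR for matching and OpenCV's RANSAC PnP; the controller shows only the proportional term of the PID:

```python
import cv2
import numpy as np

def pnp_refinement_step(matcher, img_cur, img_goal, depth_goal, K):
    """One closed-loop localization step. `matcher` is a hypothetical wrapper around
    a feature-matching module such as Efficient LoFTR returning matched pixel pairs."""
    # 2D-2D correspondences between the current and goal images.
    pts_cur, pts_goal = matcher.match(img_cur, img_goal)
    # Lift goal-image features to 3D using the intrinsics and the goal depth map.
    u, v = pts_goal[:, 0].astype(int), pts_goal[:, 1].astype(int)
    z = depth_goal[v, u]
    pts_3d = np.stack([(u - K[0, 2]) * z / K[0, 0],
                       (v - K[1, 2]) * z / K[1, 1],
                       z], axis=-1)
    # RANSAC PnP for relative rotation + translation, minimizing reprojection error.
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts_3d.astype(np.float64),
                                           pts_cur.astype(np.float64), K, None)
    return rvec, tvec

def planar_velocity_command(rvec, tvec, kp_rot=1.0, kp_trans=0.8, yaw_tol=0.05):
    """Proportional term of the PID-style controller; gains are assumptions.
    Orientation is corrected before translation (sequential adjustment), since
    reorienting the omnidirectional chassis displaces it."""
    yaw_err = float(rvec[2])
    if abs(yaw_err) > yaw_tol:
        return 0.0, 0.0, -kp_rot * yaw_err            # rotate-only command first
    return -kp_trans * float(tvec[0]), -kp_trans * float(tvec[1]), 0.0
```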
Experiments
Goals
Verify efficacy of HERMES
Exhibit effectiveness for sim2real
Quantify accuracy + reliability in navigation localization
Demonstrate effectiveness of HERMES in mobile manipulation
Sample Efficiency of HERMES
Regardless of the human motion data source, HERMES successfully converts actions into robot-executable behaviors
HERMES outperforms ObjDex across all tasks when the ObjDex formulation is implemented within the same framework
Also achieves higher sample efficiency
Comparison with Non-learning Approach
Kinematic retargeting fails to capture object interactions + contact information
RL shapes policies to be human-like while establishing context-appropriate object interactions
Learning residual actions adaptively adjusts the movements + enhances execution success rates
Training Wall Time
HERMES benefits from reduced wall clock training time
HERMES also has better sample efficiency + stronger asymptotic performance under PPO
Real-world Manipulation Evaluation
Substantial noise in trajectory or transparent objects leads to jittering
Fine tune the policy with real-world trajectories
HERMES achieves zero-shot transfer on diverse long-horizon + contact-rich bimanual dexterous manipulation tasks
The Effectiveness of Closed-loop PnP
ViNT by itself suffers from instability in localization accuracy
Proposed approach helps mitigate this
HERMES aligns both RGB images + point clouds with target position
The Localization Ability of Closed-loop PnP in the Texture-less Scenario
HERMES's PnP refinement is robust in texture-less scenarios too
Mobile Manipulation Evaluation
Without closed-loop PnP, policy cannot generalize or complete the tasks when there are positional/rotational shifts
HERMES achieves notable improvement in manipulation success rate
Limitations and Future Work
For highly dynamic velocity-dependent tasks, system identification still required for sim2real
Physics collision parameters manually tuned
Objects approximated with primitive shapes
Assembly + calibration discrepancies between sim and hardware persist
October 17, 2025 · research