HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation
Resources
Introduction
Humans continuously create bimanual manipulation data
Data targets robots with grippers, failing to generalize to dexterous hands
Interaction between robotic hands + manipulation objects usually omitted
Recent approaches use RL to learn motion strategies under guidance of reference trajectories
Usually draw on limited human motion data
Oftentimes has not been transferred to real world
Current sim2real methods require full knowledge of object and robot state
Fail to achieve end-to-end visual learning
HERMES: Embodied learning framework for bimanual dexterous hand manipulation
Diverse sources of human motion
End2end vision-based sim2real transfer
Uses DAgger distillation to convert state-based expert policies into vision-based student policies
Introduce generalized object-centric depth augmentation + hybrid control
Mobile manipulation
Gives robots mobile manipulation skills
Uses RGB-D for localization
Task modeled as a Perspective-n-Point (PnP) problem addressed through iterative process
System Design
Hardware Design
X1 mobile base, two 6-DoF Galaxea A1 arms, and two OYMotion 6-DoF dexterous hands
RealSense L515 to capture RGBD observations
RERVISION Fisheye camera for navigation
Simulation Design
Use MuJoCo + MJX
Actuation range of joints matches real robot
Use MuJoCo's closed-chain mechanisms to model DoFs without motors (i.e., fingers)
Use equality constraint feature in MuJoCo
Approximate geometry using primitive shapes for collisions
Reinforcement Learning Method
Task Formulation
Standard RL MDP Formulation
Use reference trajectory as the goal, $\mathcal{G}$
State includes proprioception info ($s^p$) and goal state ($s^g$)
Reward is a function of both states
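A minimal sketch of how the goal-conditioned state could be assembled; the array layout and names are illustrative, not from the paper:

```python
import numpy as np

def build_state(qpos, qvel, ref_traj, t):
    """Hypothetical layout of the goal-conditioned state.

    s^p: proprioception (joint positions + velocities).
    s^g: the reference-trajectory frame at step t, serving as the goal G.
    """
    s_p = np.concatenate([qpos, qvel])   # proprioceptive state s^p
    s_g = ref_traj[t]                    # goal state s^g from the reference trajectory
    return np.concatenate([s_p, s_g])    # reward is computed from both parts
```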
Collect One-shot Human Motion
3 sources of human motion
Teleop in sim
MoCap
Hand object poses from raw video
Can use single human reference to get generalizable policy without need for extensive demonstrations
Teleop
Use Apple Vision Pro to get hand poses + arm movements
MoCap
Retargeting data to robot hands is hard; use RL to learn behavior
Use the OakInk2 dataset
Arm + Hand poses from video
Use WiLoR to detect hands in video + extract 2D keypoints + corresponding 3D counterpart
Select stable keypoints for estimation: wrist + metacarpophalangeal joints
Spatial translation of the wrist is estimated by solving a perspective-n-point problem
Palm orientation derived by fitting a 3D plane to the selected 3D keypoints
Use FoundationPose to estimate object pose from video
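A sketch of the wrist-translation PnP solve and the palm plane fit, assuming WiLoR provides matched 2D/3D keypoints; input shapes and solver flags are assumptions:

```python
import cv2
import numpy as np

def estimate_wrist_and_palm(kps_2d, kps_3d, K):
    """Sketch of the pose-from-video step (names/inputs are assumptions).

    kps_2d: (N, 2) stable 2D keypoints (wrist + MCP joints) from WiLoR.
    kps_3d: (N, 3) corresponding 3D keypoints in the hand model frame.
    K:      (3, 3) camera intrinsics.
    """
    # Wrist translation: solve a Perspective-n-Point problem on the stable keypoints.
    ok, rvec, tvec = cv2.solvePnP(
        kps_3d.astype(np.float64), kps_2d.astype(np.float64),
        K, distCoeffs=None, flags=cv2.SOLVEPNP_ITERATIVE)

    # Palm orientation: fit a 3D plane to the selected keypoints via SVD;
    # the smallest singular vector is the palm normal.
    centered = kps_3d - kps_3d.mean(axis=0)
    _, _, vt = np.linalg.svd(centered)
    palm_normal = vt[-1]
    return tvec, palm_normal
```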
Synthesize Multiple Trajectories
Trajectory augmentation by randomizing positions + orientations in predefined range
$\hat{A}^{pose}[\tau_k] = T^{trans} \cdot A^{pose}[\tau_k]$
Apply transformation matrix to alter pose
Enables spatial generalization (avoids need for more demonstrations)
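A minimal augmentation sketch, assuming poses are stored as 4×4 homogeneous matrices and using placeholder randomization ranges:

```python
import numpy as np

def augment_trajectory(A_pose, pos_range=0.05, yaw_range=np.pi / 6, rng=None):
    """Synthesize a new reference by applying one random rigid transform T^trans
    to every pose A^pose[tau_k] in the trajectory (ranges are assumptions).

    A_pose: (T, 4, 4) sequence of homogeneous poses.
    """
    rng = rng or np.random.default_rng()
    # Random planar translation and yaw within the predefined range.
    dx, dy = rng.uniform(-pos_range, pos_range, size=2)
    yaw = rng.uniform(-yaw_range, yaw_range)
    c, s = np.cos(yaw), np.sin(yaw)
    T_trans = np.array([[c, -s, 0, dx],
                        [s,  c, 0, dy],
                        [0,  0, 1, 0],
                        [0,  0, 0, 1]])
    # \hat{A}^{pose}[tau_k] = T^{trans} · A^{pose}[tau_k], applied to the whole trajectory.
    return np.einsum("ij,tjk->tik", T_trans, A_pose)
```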
Use DexPilot for retargeting
Apply RL to refine + adapt robot behaviors
Generalizable Reward Design for Manipulation
Use one reward function that can be reused across tasks
Note: see paper for formulas for each reward
Object-centric distance chain ($r_{chain}$): Use fingers and the center of the object's collision mesh as keypoints to model spatial relationships between hand and object
Compute the number of contact points between mesh and fingers; if it exceeds a threshold, the reward is activated
Object trajectory tracking ($r_{obj}$): Aligns the policy's behavior with the object's trajectory
Power Penalty ($r_{penalty}$): Used to alleviate jittering actions
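An illustrative combination of the three terms; the exact formulas are in the paper, so the functional forms, weights, and threshold below are assumptions:

```python
import numpy as np

def manipulation_reward(finger_pos, obj_center, obj_pose, ref_obj_pose,
                        joint_torque, joint_vel, n_contacts,
                        contact_threshold=2, w_chain=1.0, w_obj=1.0, w_pen=1e-3):
    """Illustrative sum of r_chain, r_obj, and r_penalty (forms are assumptions)."""
    # r_chain: object-centric distance chain between fingertips and the object's
    # collision-mesh center, activated only when enough contact points exist.
    dists = np.linalg.norm(finger_pos - obj_center, axis=-1)
    r_chain = np.exp(-dists.mean()) if n_contacts >= contact_threshold else 0.0

    # r_obj: object trajectory tracking, rewarding alignment with the reference pose.
    r_obj = np.exp(-np.linalg.norm(obj_pose - ref_obj_pose))

    # r_penalty: power penalty to discourage jittery, high-energy actions.
    r_penalty = -np.sum(np.abs(joint_torque * joint_vel))

    return w_chain * r_chain + w_obj * r_obj + w_pen * r_penalty
```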
Residual Action Learning
Arm actions:
Coarse action derived from human trajectory
Fine (residual) component, $\Delta _{a _{f}}$ from learned network
Hand actions:
Network models the entire action due to inaccuracies in retargeting (see the action-composition sketch below)
Use early termination to avoid inefficient exploration
Disable object collisions in early stages of training
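A sketch of the arm/hand action split; the action dimensions and residual scaling are assumptions:

```python
import numpy as np

def compose_action(ref_arm_action, policy_out, arm_dim=12, scale=0.1):
    """Sketch of residual action learning (dimensions and scale are assumptions).

    Arm: coarse action from the retargeted human trajectory plus a learned
    residual Δa_f. Hand: the network outputs the full action, since the
    retargeted finger motion is too inaccurate to serve as a base.
    """
    delta_a_f = scale * policy_out[:arm_dim]      # fine (residual) component
    arm_action = ref_arm_action + delta_a_f       # coarse + residual
    hand_action = policy_out[arm_dim:]            # full hand action from the network
    return np.concatenate([arm_action, hand_action])
```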
Reinforcement Learning Algorithm
Implement the DrM algorithm
Off-policy method that uses a dormant-ratio mechanism to enhance exploration
Concurrently, also use PPO
Sim-to-Real Transfer
Need to distill state policy into visual policy
Leveraging Depth Image as Visual Input
Clip depth values beyond a threshold distance, $d$
Missing depth values filled with max depth
To emulate real-world edge noise + blur, add Gaussian noise + blur to simulated depth images
Mimic missing depth by setting 0.5% of pixel values to max depth
Linearly blend simulation-rendered depth with a dataset depth map: $\hat{o} = \alpha o_{sim} + (1 - \alpha) o_{dataset}$
Enriches diversity of the depth noise distribution
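A sketch of the depth-augmentation pipeline with assumed parameter values:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def augment_depth(depth_sim, depth_dataset, d_max=1.5, noise_std=0.01,
                  blur_sigma=1.0, dropout=0.005, rng=None):
    """Depth augmentation sketch (threshold, noise, and blend values are assumptions)."""
    rng = rng or np.random.default_rng()
    # Clip beyond the threshold distance d and fill missing values with max depth.
    depth = np.where((depth_sim <= 0) | (depth_sim > d_max), d_max, depth_sim)
    # Emulate real-world edge noise + blur.
    depth = gaussian_filter(depth + rng.normal(0.0, noise_std, depth.shape), blur_sigma)
    # Mimic missing depth by setting ~0.5% of pixels to max depth.
    depth[rng.random(depth.shape) < dropout] = d_max
    # Linearly blend simulated depth with a real dataset depth map:
    # o_hat = alpha * o_sim + (1 - alpha) * o_dataset
    alpha = rng.uniform(0.5, 1.0)
    return alpha * depth + (1 - alpha) * depth_dataset
```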
DAgger Distillation Training
State-based expert policy acts as the teacher for the vision-based student policy
HERMES distills the state-based policy into one operating on raw visual observations of the entire scene
Avoids the need for camera calibration
Model Architecture
Input = 140×140 pixels, 3 stacked frames
Passed into an image encoder
First 2 layers used to capture fine-grained visual details
Use GroupNorm instead of BatchNorm for distributional consistency
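A minimal encoder sketch; layer sizes are assumptions, with GroupNorm in place of BatchNorm as described:

```python
import torch
import torch.nn as nn

class DepthEncoder(nn.Module):
    """Sketch: 3 stacked 140x140 depth frames in, feature vector out.
    Channel counts and strides are assumptions."""

    def __init__(self, in_frames=3, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            # First two layers use stride 1 to preserve fine-grained visual details.
            nn.Conv2d(in_frames, 32, kernel_size=3, stride=1), nn.GroupNorm(4, 32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.GroupNorm(4, 32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.GroupNorm(8, 64), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.GroupNorm(8, 64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):          # x: (B, 3, 140, 140)
        return self.net(x)
```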
Trajectory Rollout Scheduler
Expert policy rolls out trajectories at beginning of DAgger
These rollouts gradually decrease throughout training by annealing the probability
Also increases the student's participation in rollouts
Exponentially decay the probability
Optimize student policy using L1 and L2 action loss terms
Add uniform noise into proprioception states for regularization
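A sketch of the rollout scheduler, the L1 + L2 distillation loss, and the proprioception noise; the decay rate and noise scale are assumptions:

```python
import math
import torch
import torch.nn.functional as F

def expert_rollout_prob(step, p0=1.0, decay=5e-5):
    """Exponentially decayed probability that the expert drives the rollout
    (initial value and decay rate are assumptions)."""
    return p0 * math.exp(-decay * step)

def dagger_loss(student_action, expert_action):
    # Student actions are supervised with both L1 and L2 terms on the expert labels.
    return F.l1_loss(student_action, expert_action) + F.mse_loss(student_action, expert_action)

def perturb_proprioception(s_p, eps=0.01):
    # Uniform noise on proprioceptive inputs as a simple regularizer.
    return s_p + (2 * torch.rand_like(s_p) - 1) * eps
```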
Hybrid Sim2real Control
Real world visuals used to infer action which is then applied to sim environment
Updated joint positions from sim are transferred to the real robot
Camera then captures real world image + incorporates proprioception states for next cycle
Sim2real discrepancy is mitigated because of shared inverse kinematics + dynamic parameters
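A sketch of the hybrid control cycle; the camera/robot/policy interfaces are hypothetical, and only the loop structure follows the description above:

```python
# Hybrid sim2real control cycle sketch (all interfaces below are hypothetical).
def hybrid_control_loop(policy, sim, real_robot, camera, steps=500):
    for _ in range(steps):
        depth = camera.get_depth()                    # real-world visual observation
        proprio = real_robot.get_joint_positions()    # real proprioception
        action = policy(depth, proprio)               # infer action from real inputs
        sim.apply_action(action)                      # step the action in simulation first
        q_target = sim.get_joint_positions()          # updated joint positions from sim
        real_robot.command_joint_positions(q_target)  # transfer them to the real robot
        # Shared inverse kinematics + dynamics parameters keep sim and robot in sync,
        # mitigating the sim2real discrepancy.
```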
Navigation Methodology
ViNT Navigation Foundation Model
HERMES uses image goal navigation foundation model
ViNT searches for goal observations in a topological map
Computes a sequence of relative waypoints based on current and goal observations
Closed-loop PnP Localization
Discrepancy between final and target pose can lead manipulation policy to fail
ViNT doesn’t guarantee termination within a tight enough bound
Local refinement step after ViNT
PnP algorithm adjusts the robot pose
Use a neural feature matching module (Efficient LoFTR) to detect correspondences between the captured and goal images
Features lifted to 3D space using intrinsics + depth map
Apply RANSAC PnP to compute relative rotation + translation and minimize reprojection error
Using realtime feedback from PnP, incrementally converge toward target pose + refine pose estimation
Use PID controller to adjust pose of robot
Outputs planar velocity commands
Includes a sequential adjustment strategy that prioritizes orientation correction, since reorientation displaces the omnidirectional chassis
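A sketch of one refinement cycle, assuming a hypothetical wrapper around Efficient LoFTR for matching and OpenCV's RANSAC PnP; the controller shows only the proportional term of the PID:

```python
import cv2
import numpy as np

def pnp_refinement_step(matcher, img_cur, img_goal, depth_goal, K):
    """One closed-loop localization step. `matcher` is a hypothetical wrapper around
    a feature-matching module such as Efficient LoFTR returning matched pixel pairs."""
    # 2D-2D correspondences between the current and goal images.
    pts_cur, pts_goal = matcher.match(img_cur, img_goal)
    # Lift goal-image features to 3D using the intrinsics and the goal depth map.
    u, v = pts_goal[:, 0].astype(int), pts_goal[:, 1].astype(int)
    z = depth_goal[v, u]
    pts_3d = np.stack([(u - K[0, 2]) * z / K[0, 0],
                       (v - K[1, 2]) * z / K[1, 1],
                       z], axis=-1)
    # RANSAC PnP for relative rotation + translation, minimizing reprojection error.
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts_3d.astype(np.float64),
                                           pts_cur.astype(np.float64), K, None)
    return rvec, tvec

def planar_velocity_command(rvec, tvec, kp_rot=1.0, kp_trans=0.8, yaw_tol=0.05):
    """Proportional term of the PID-style controller; gains are assumptions.
    Orientation is corrected before translation (sequential adjustment), since
    reorienting the omnidirectional chassis displaces it."""
    yaw_err = float(rvec[2])
    if abs(yaw_err) > yaw_tol:
        return 0.0, 0.0, -kp_rot * yaw_err            # rotate-only command first
    return -kp_trans * float(tvec[0]), -kp_trans * float(tvec[1]), 0.0
```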
Experiments
Goals
Verify efficacy of HERMES
Exhibit effectiveness for sim2real
Quantify accuracy + reliability in navigation localization
Demonstrate effectiveness of HERMES in mobile manipulation
Sample Efficiency of HERMES
Regardless of the human motion data source, HERMES successfully converts actions into robot-executable behaviors
HERMES outperforms ObjDex across all tasks when the ObjDex formulation is implemented within the same framework
Also achieves higher sample efficiency
Comparison with Non-learning Approach
Kinematic retargeting fails to capture object interactions + contact information
RL shapes policies to be human-like while establishing context-appropriate object interactions
Learning residual actions adaptively adjusts the movements + enhances execution success rates
Training Wall Time
HERMES benefits from reduced wall clock training time
HERMES also has better sample efficiency + stronger asymptotic performance under PPO
Real-world Manipulation Evaluation
Substantial noise in trajectory or transparent objects leads to jittering
Fine tune the policy with real-world trajectories
HERMES achieves zero-shot transfer on diverse long-horizon + contact-rich bimanual dexterous manipulation tasks
The Effectiveness of Closed-loop PnP
ViNT by itself suffers from instability in localization accuracy
Proposed approach helps mitigate this
HERMES aligns both RGB images + point clouds with target position
The Localization Ability of Closed-loop PnP in the Texture-less Scenario
HERMES's PnP refinement is robust in texture-less scenarios too
Mobile Manipulation Evaluation
Without closed-loop PnP, policy cannot generalize or complete the tasks when there are positional/rotational shifts
HERMES achieves notable improvement in manipulation success rate
Limitations and Future Work
For highly dynamic velocity-dependent tasks, system identification still required for sim2real
Physics collision parameters manually tuned
Objects approximated with primitive shapes
Assembly + calibration discrepancies between sim and hardware persist
October 17, 2025 · research