$v_{\theta}(A^\tau_t, o_t)$ is trained to match the denoising vector field $u(A^\tau_t \vert A_t) = \epsilon - A_t$ (the negative of the derivative of the noisy action $A^\tau_t = \tau A_t + (1-\tau)\epsilon$ with respect to $\tau$)
Action expert uses a bidirectional attention mask over the action chunk (the action tokens all attend to each other, rather than autoregressively)
During inference, actions are generated by integrating the learned vector field from $\tau = 0$ to $\tau = 1$, starting from Gaussian noise (see the sketch after this list)
Use PaliGemma (a 3B VLM) as the backbone and add ~300M parameters for the action expert
They also trained a smaller version ($\pi_0$-small) that didn't use a pre-trained VLM
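Below is a minimal sketch of the flow-matching recipe above, assuming a hypothetical `policy_fn(noisy_actions, obs, tau)` standing in for $v_\theta$; the $\tau$ schedule, the conditioning interface, and the number of Euler steps are my assumptions, not the paper's code. To keep the sketch self-consistent, the regression target is written as $A_t - \epsilon$ (the derivative of the noisy action, i.e. the negative of the $u$ above), so that forward Euler from noise at $\tau = 0$ lands near the data at $\tau = 1$.

```python
import numpy as np

# Hedged sketch only: `policy_fn` is a hypothetical stand-in for v_theta.

def make_training_example(actions, rng):
    """Build the noisy action chunk A^tau and the regression target for v_theta."""
    eps = rng.standard_normal(actions.shape)        # epsilon ~ N(0, I)
    tau = rng.uniform()                             # flow-matching time in [0, 1] (uniform here for simplicity)
    noisy = tau * actions + (1.0 - tau) * eps       # A^tau = tau * A_t + (1 - tau) * eps
    target = actions - eps                          # dA^tau/dtau (the u above is its negative)
    return noisy, tau, target                       # train with || policy_fn(noisy, obs, tau) - target ||^2

def sample_action_chunk(policy_fn, obs, horizon, action_dim, n_steps=10, seed=0):
    """Generate an action chunk by forward-Euler integration of the learned field, tau = 0 -> 1."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((horizon, action_dim))  # A^0 ~ N(0, I): start from pure noise
    delta = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * delta
        a = a + delta * policy_fn(a, obs, tau)      # A^{tau + delta} = A^tau + delta * v_theta(A^tau, o_t)
    return a                                        # approximately the denoised action chunk at tau = 1
```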
Data Collection and Training Recipe
Pre-Training and Post-Training
Open source datasets (OXE, Bridge V2, and DROID)
Robots + tasks have 1 - 2 cameras
Custom datasets
Consist of 68 tasks, but each task comprises many distinct behaviors
Datasets are imbalanced, so each dataset is weighted by $n^{0.43}$, where $n$ is the number of samples (see the sketch after this list)
Configuration and action vectors have dimensionality of largest robot in dataset
Lower DoF robots have these zero-padded
For robots with fewer than 3 images, missing image slots masked
Post-training consists of fine-tuning on task-specific dataset
5 hours of data for simple tasks and 100 hours of data for complex tasks
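As a small illustration of the $n^{0.43}$ weighting and the padding/masking above, here is a sketch with placeholder dataset sizes, a 3-slot camera layout, and 224×224 images; the helper names are mine, not the paper's.

```python
import numpy as np

# Illustrative helpers only; names, sizes, and image resolution are placeholders.

def dataset_weights(sizes, alpha=0.43):
    """Weight each dataset by n^alpha (n = number of samples), then normalize to probabilities."""
    w = np.asarray(sizes, dtype=np.float64) ** alpha
    return w / w.sum()

def pad_actions(actions, max_dim):
    """Zero-pad a lower-DoF robot's configuration/action vectors to the largest robot's dimensionality."""
    pad = max_dim - actions.shape[-1]
    return np.pad(actions, [(0, 0)] * (actions.ndim - 1) + [(0, pad)])

def pack_images(images, n_slots=3, hw=(224, 224)):
    """Fill up to n_slots camera views; missing slots get a zero image and a mask of 0."""
    packed, mask = [], []
    for i in range(n_slots):
        if i < len(images):
            packed.append(images[i]); mask.append(1)
        else:
            packed.append(np.zeros((*hw, 3), dtype=np.uint8)); mask.append(0)
    return np.stack(packed), np.asarray(mask)

# Example: sample a dataset in proportion to its n^0.43 weight.
sizes = [120_000, 8_000, 500]   # hypothetical per-dataset sample counts
dataset_idx = np.random.default_rng(0).choice(len(sizes), p=dataset_weights(sizes))
```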
Language and High Level Policies
Use a high-level VLM policy to decompose the task (e.g., bussing a table) into immediate subtasks, which are passed to the low-level policy as language commands
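A rough sketch of this high-level/low-level split, assuming hypothetical `propose_subtask` (high-level VLM) and `pi0_policy` (low-level policy) callables, a toy `env` interface, and a fixed re-planning interval; none of these names or choices come from the paper.

```python
def run_task(task_prompt, env, propose_subtask, pi0_policy, chunks_per_subtask=5, max_subtasks=20):
    """High-level VLM decomposes the task into short language subtasks; the low-level policy executes each one."""
    obs = env.reset()
    for _ in range(max_subtasks):
        subtask = propose_subtask(task_prompt, obs)   # e.g. "pick up the napkin"
        if subtask is None:                           # high-level policy decides the task is finished
            break
        for _ in range(chunks_per_subtask):           # execute a few action chunks, then re-plan
            action_chunk = pi0_policy(obs, subtask)   # low-level policy conditioned on the subtask text
            obs = env.step(action_chunk)
    return obs
```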
Robot System Details
7 Robot Configurations (See paper for details)
UR5e
Bimanual UR5e
Franka
Bimanual Trossen
Bimanual ARX & bimanual AgileX
Mobile Trossen & mobile ARX
Mobile Fibocom
Experimental Evaluation
How well does $\pi_0$ perform on the tasks in the pre-training data after pre-training alone?
How well does $\pi_0$ follow language commands?
How does $\pi_0$ compare to prior methods for dexterous manipulation?
Can $\pi_0$ be adapted to complex, multi-stage tasks?
Evaluating the Base Model
Model evaluated on shirt folding, table bussing, grocery bagging, and taking toast out of a toaster
Comparisons: OpenVLA (7B) trained on the full data mixture, Octo (93M), OpenVLA without cross-embodiment training, and the smaller $\pi_0$-small
$\pi_0$ obtains the best results on these out-of-the-box tasks
$\pi_0$-small also outperforms Octo and OpenVLA
OpenVLA struggles because its autoregressive discretization architecture doesn't support action chunking
Octo supports action chunking but has limited representational capacity
Following Language Commands
Compare $\pi_0$ to $\pi_0$-small
Measures how much VLM pretraining boosts ability to follow language instructions
Two versions tested:
Flat: Directly command model with task description without intermediate language commands
Human: a human expert provides intermediate language commands
$\pi_0$ is significantly better than $\pi_0$-small for both human and flat, indicating improvement from pretrained VLM
Learning New Dexterous Tasks
Tasks:
UR5e Stack Bowls
Towel Folding
Tupperware in microwave
Paper towel replacement
Franka items in drawer
Compare to OpenVLA and Octo
$\pi_0$ generally outperforms other methods
Strongest prior models are trained completely from scratch on target tasks; leveraging pretraining presents a challenge for prior approaches
Fine-tuning the pre-trained $\pi_0$ yields better results than training it from scratch on the task data
Mastering Complex Multi-Stage Tasks
Tasks:
Laundry Folding
Mobile Laundry
Dryer Unloading
Table Bussing
Box Building
To-go Box
Packing Eggs
Because these tasks are very difficult, only $\pi_0$ makes meaningful progress on them, so the comparison is between $\pi_0$ variants
Compare pre-training + fine-tuning, pre-training only, and training on fine-tuning data only
Pre-training + fine-tuning usually yields best performance
Pre-training causes jump in performance
Discussions, Limitations, and Future Work
Limitations:
No comprehensive understanding of how pre-training datasets should be composed
Not all tasks in evaluation work reliably
Unsure how much positive transfer there is from combining diverse data from different tasks and robots