$v_{\theta}(A^\tau_t, o_t)$ is trained to match the denoising vector field $u(A^\tau_t \vert A_t) = \epsilon - A_t$ (the negative of the derivative of the noisy action $A^\tau_t = \tau A_t + (1-\tau)\epsilon$ with respect to $\tau$)
Action expert uses a bidirectional attention mask over the action chunk (the action tokens all attend to each other, rather than autoregressively)
During inference, actions are generated by integrating the learned vector field from $\tau = 0$ to $\tau = 1$, starting from Gaussian noise (see the sketch after this list)
Use PaliGemma (a 3B VLM) as the backbone and add ~300M parameters for the action expert
They also trained a smaller version ($\pi_0$-small) that didn't use a pre-trained VLM
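Below is a minimal sketch of the flow-matching recipe above, assuming a hypothetical `policy_fn(noisy_actions, obs, tau)` standing in for $v_\theta$; the $\tau$ schedule, the conditioning interface, and the number of Euler steps are my assumptions, not the paper's code. To keep the sketch self-consistent, the regression target is written as $A_t - \epsilon$ (the derivative of the noisy action, i.e. the negative of the $u$ above), so that forward Euler from noise at $\tau = 0$ lands near the data at $\tau = 1$.

```python
import numpy as np

# Hedged sketch only: `policy_fn` is a hypothetical stand-in for v_theta.

def make_training_example(actions, rng):
    """Build the noisy action chunk A^tau and the regression target for v_theta."""
    eps = rng.standard_normal(actions.shape)        # epsilon ~ N(0, I)
    tau = rng.uniform()                             # flow-matching time in [0, 1] (uniform here for simplicity)
    noisy = tau * actions + (1.0 - tau) * eps       # A^tau = tau * A_t + (1 - tau) * eps
    target = actions - eps                          # dA^tau/dtau (the u above is its negative)
    return noisy, tau, target                       # train with || policy_fn(noisy, obs, tau) - target ||^2

def sample_action_chunk(policy_fn, obs, horizon, action_dim, n_steps=10, seed=0):
    """Generate an action chunk by forward-Euler integration of the learned field, tau = 0 -> 1."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((horizon, action_dim))  # A^0 ~ N(0, I): start from pure noise
    delta = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * delta
        a = a + delta * policy_fn(a, obs, tau)      # A^{tau + delta} = A^tau + delta * v_theta(A^tau, o_t)
    return a                                        # approximately the denoised action chunk at tau = 1
```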
Data Collection and Training Recipe
Pre-Training and Post-Training
Open source datasets (OXE, Bridge V2, and DROID)
Robots + tasks have 1 - 2 cameras
Custom datasets
Consist of 68 tasks, but each task comprises many distinct behaviors
Datasets are imbalanced, so each dataset is weighted by $n^{0.43}$, where $n$ is the number of samples (see the sketch after this list)
Configuration and action vectors have dimensionality of largest robot in dataset
Lower DoF robots have these zero-padded
For robots with fewer than 3 images, missing image slots masked
Post-training consists of fine-tuning on task-specific dataset
5 hours of data for simple tasks and 100 hours of data for complex tasks
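As a small illustration of the $n^{0.43}$ weighting and the padding/masking above, here is a sketch with placeholder dataset sizes, a 3-slot camera layout, and 224×224 images; the helper names are mine, not the paper's.

```python
import numpy as np

# Illustrative helpers only; names, sizes, and image resolution are placeholders.

def dataset_weights(sizes, alpha=0.43):
    """Weight each dataset by n^alpha (n = number of samples), then normalize to probabilities."""
    w = np.asarray(sizes, dtype=np.float64) ** alpha
    return w / w.sum()

def pad_actions(actions, max_dim):
    """Zero-pad a lower-DoF robot's configuration/action vectors to the largest robot's dimensionality."""
    pad = max_dim - actions.shape[-1]
    return np.pad(actions, [(0, 0)] * (actions.ndim - 1) + [(0, pad)])

def pack_images(images, n_slots=3, hw=(224, 224)):
    """Fill up to n_slots camera views; missing slots get a zero image and a mask of 0."""
    packed, mask = [], []
    for i in range(n_slots):
        if i < len(images):
            packed.append(images[i]); mask.append(1)
        else:
            packed.append(np.zeros((*hw, 3), dtype=np.uint8)); mask.append(0)
    return np.stack(packed), np.asarray(mask)

# Example: sample a dataset in proportion to its n^0.43 weight.
sizes = [120_000, 8_000, 500]   # hypothetical per-dataset sample counts
dataset_idx = np.random.default_rng(0).choice(len(sizes), p=dataset_weights(sizes))
```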
Language and High Level Policies
Use a high-level VLM policy to decompose the task (e.g., bussing a table) into immediate subtasks, which are passed to the low-level policy as language commands
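A rough sketch of this high-level/low-level split, assuming hypothetical `propose_subtask` (high-level VLM) and `pi0_policy` (low-level policy) callables, a toy `env` interface, and a fixed re-planning interval; none of these names or choices come from the paper.

```python
def run_task(task_prompt, env, propose_subtask, pi0_policy, chunks_per_subtask=5, max_subtasks=20):
    """High-level VLM decomposes the task into short language subtasks; the low-level policy executes each one."""
    obs = env.reset()
    for _ in range(max_subtasks):
        subtask = propose_subtask(task_prompt, obs)   # e.g. "pick up the napkin"
        if subtask is None:                           # high-level policy decides the task is finished
            break
        for _ in range(chunks_per_subtask):           # execute a few action chunks, then re-plan
            action_chunk = pi0_policy(obs, subtask)   # low-level policy conditioned on the subtask text
            obs = env.step(action_chunk)
    return obs
```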
Robot System Details
7 Robot Configurations (See paper for details)
UR5e
Bimanual UR5e
Franka
Bimanual Trossen
Bimanual ARX & bimanual AgileX
Mobile Trossen & mobile ARX
Mobile Fibocom
Experimental Evaluation
How well does $\pi_0$ perform on the tasks in the pre-training data after pre-training alone?
How well does $\pi_0$ follow language commands?
How does $\pi_0$ compare to prior methods for dexterous manipulation?
Can $\pi_0$ be adapted to complex, multi-stage tasks?
Evaluating the Base Model
Model evaluated on shirt folding, table bussing, grocery bagging, and taking toast out of a toaster
Comparisons: OpenVLA (7B) trained on the full data mixture, Octo (93M), OpenVLA without cross-embodiment training, and the smaller $\pi_0$-small
$\pi_0$ obtains the best results on these out-of-the-box tasks
$\pi_0$-small also outperforms Octo and OpenVLA
OpenVLA struggles because its autoregressive discretization architecture doesn't support action chunking
Octo supports action chunking but has limited representational capacity
Following Language Commands
Compare $\pi_0$ to $\pi_0$-small
Measures how much VLM pretraining boosts ability to follow language instructions
Two versions tested:
Flat: Directly command model with task description without intermediate language commands
Human: a human expert provides intermediate language commands
$\pi_0$ is significantly better than $\pi_0$-small for both human and flat, indicating improvement from pretrained VLM
Learning New Dexterous Tasks
Tasks:
UR5e Stack Bowls
Towel Folding
Tupperware in microwave
Paper towel replacement
Franka items in drawer
Compare to OpenVLA and Octo
$\pi_0$ generally outperforms other methods
Strongest prior models are trained completely from scratch on target tasks; leveraging pretraining presents a challenge for prior approaches
Fine-tuning the pre-trained $\pi_0$ yields better results than training it from scratch on the task data
Mastering Complex Multi-Stage Tasks
Tasks:
Laundry Folding
Mobile Laundry
Dryer Unloading
Table Bussing
Box Building
To-go Box
Packing Eggs
Because these tasks are very difficult, only $\pi_0$ makes meaningful progress on them, so the comparison is between $\pi_0$ variants
Compare pre-training + fine-tuning, pre-training only, and training on fine-tuning data only
Pre-training + fine-tuning usually yields best performance
Pre-training causes jump in performance
Discussions, Limitations, and Future Work
Limitations:
No comprehensive understanding of how pre-training datasets should be composed
Not all tasks in evaluation work reliably
Unsure how much positive transfer there is from combining diverse data from different tasks and robots