Emergence of Locomotion Behaviours in Rich Environments
Resources
Paper
-
Introduction
- In continuous control tasks like locomotion, the reward function typically has to be carefully handcrafted
- Reward engineering is brittle + the agent should ideally learn rich behaviour on its own from simple rewards
- Sensitivity to reward functions = type of overfitting (not generalizable)
- Paper uses procedurally generated obstacle courses of varying difficulty for the agent
- Acts as implicit curriculum
- Increasing difficulty over time improves learning speed
- Distributed PPO (DPPO): Builds on top of TRPO and PPO + distributes computation like A3C
-
Large scale reinforcement learning with Distributed PPO
- Robust policy gradients with PPO
- Use policy gradients with baselines (advantage function)
- Use trust region constraint that restricts policy update size (via an adaptive KL penalty, as in PPO; see the objective sketch below)
- Scalable reinforcement learning with Distributed PPO
- Data collection + gradient calculation distributed across workers
- Averaging gradients + applying them synchronously leads to better results
- Use K-step advantage estimation (see the sketch below): $\hat{A}_t = \sum _{i=1}^K \gamma^{i-1} r _{t+i} + \gamma^K V _\phi(s _{t+K}) - V _\phi(s_t)$
- Uses centralized parameter server
- Workers synchronize their parameters after every gradient step
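A minimal sketch of the penalty-based PPO objective described above, in Python. The function names, the sample-based KL estimate, and the adaptation thresholds and factors are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def ppo_kl_penalty_objective(logp_new, logp_old, advantages, beta):
    """Penalty-based PPO objective: importance-weighted advantages minus an
    adaptive KL penalty. logp_new / logp_old are log-probabilities of the
    sampled actions under the current and data-collecting policies."""
    ratio = np.exp(logp_new - logp_old)
    # Sample-based estimate of KL(pi_old || pi_new); with Gaussian policies
    # the analytic KL would normally be used instead.
    approx_kl = np.mean(logp_old - logp_new)
    objective = np.mean(ratio * advantages) - beta * approx_kl
    return objective, approx_kl

def adapt_beta(beta, observed_kl, kl_target=0.01, scale=2.0):
    """Grow the penalty when an update overshoots the KL target, shrink it
    when the update is overly conservative (thresholds are illustrative)."""
    if observed_kl > 1.5 * kl_target:
        return beta * scale
    if observed_kl < kl_target / 1.5:
        return beta / scale
    return beta
```

Typically the objective is maximized for several epochs on each batch of trajectories, and `beta` is re-adapted after each policy update.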
-
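A minimal sketch of the distributed pieces above, assuming each worker returns per-parameter gradient arrays: the K-step advantage estimator from the formula above and a synchronous, parameter-server style averaging step. The plain gradient-ascent update and the constants are placeholders, not the paper's optimizer or settings.

```python
import numpy as np

def k_step_advantages(rewards, values, bootstrap_value, gamma=0.99, K=50):
    """A_t = sum_{i=1..K} gamma^(i-1) r_{t+i} + gamma^K V(s_{t+K}) - V(s_t).

    rewards[t] is the reward received after acting in state s_t and
    values[t] = V(s_t); bootstrap_value is V of the state after the last
    transition. Near the end of the segment the sum truncates to the
    remaining steps."""
    T = len(rewards)
    values_ext = np.append(values, bootstrap_value)
    advantages = np.zeros(T)
    for t in range(T):
        horizon = min(K, T - t)
        ret = sum(gamma ** i * rewards[t + i] for i in range(horizon))
        advantages[t] = ret + gamma ** horizon * values_ext[t + horizon] - values_ext[t]
    return advantages

def synchronous_update(worker_gradients, params, lr=1e-4):
    """Average the gradients contributed by all workers and apply them in one
    synchronous step; the updated parameters are then shared with every
    worker (the role of the centralized parameter server)."""
    avg_grads = [np.mean(np.stack(per_param), axis=0)
                 for per_param in zip(*worker_gradients)]
    return [p + lr * g for p, g in zip(params, avg_grads)]  # ascent on the objective
```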
Evaluation of Distributed PPO
- Benchmarked on Planar Walker, Humanoid, and Memory Reacher
- DPPO achieves similar performance to TRPO but with lower wall clock time.
- DPPO is faster than A3C in wall clock time
-
Methods: environments and models
-
Training environments
- Rewards
- Rewards are simple and only signal the agent to make progress (i.e., move forward and stay on the track); see the reward sketch after this list
- Not carefully fine tuned to achieve certain behavior
- Terrain / Obstacles
- Hurdles: Agent needs to jump or climb over
- Gaps: Agent needs to jump
- Variable Terrain: Different ground features such as hills and ramps
- Slalom Walls: Agent needs to walk around
- Platform: Agent can jump over or crouch under
- Trained on different types of courses: single-obstacle-type courses, mixtures of single-type courses, and mixed-terrain courses, each either stationary (fixed difficulty) or curriculum-based (gradually increasing difficulty); see the course-generation sketch below
- Observations
- Agents have sensors for proprioceptive features (their own body state) and exteroceptive features
- Exteroceptive input includes terrain information (e.g., the terrain profile ahead of the agent)
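A hedged sketch of what such a progress-only reward could look like; the terms and coefficients below are illustrative assumptions, not the paper's actual reward.

```python
def progress_reward(forward_velocity, lateral_deviation, torques,
                    deviation_cost=0.1, torque_cost=1e-4):
    """Reward forward progress and mildly discourage straying from the track
    and wasting energy; nothing here prescribes a particular gait, so any
    behaviour that makes progress is acceptable."""
    return (forward_velocity
            - deviation_cost * abs(lateral_deviation)
            - torque_cost * sum(u * u for u in torques))
```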
-
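A sketch of procedural course generation with a difficulty knob, illustrating how the implicit curriculum can be obtained by ramping difficulty up over training; the obstacle parameters and ranges are invented for illustration.

```python
import random

OBSTACLE_TYPES = ("hurdles", "gaps", "variable_terrain", "slalom_walls", "platforms")

def sample_course(course_type, difficulty, length=100.0):
    """Sample obstacle placements for one course. `difficulty` in [0, 1]
    scales obstacle size and density; `course_type` is either a single
    obstacle type or "mixed", in which case types alternate randomly."""
    obstacles, x = [], 5.0
    while x < length:
        kind = random.choice(OBSTACLE_TYPES) if course_type == "mixed" else course_type
        obstacles.append({
            "kind": kind,
            "position": x,
            # e.g. hurdle height, gap width, or wall offset, scaled by difficulty
            "size": 0.1 + 0.5 * difficulty * random.random(),
        })
        # harder courses place obstacles closer together
        x += random.uniform(3.0, 6.0) * (1.0 - 0.3 * difficulty)
    return obstacles
```

A stationary course keeps `difficulty` fixed throughout training, while a curriculum course increases it as training progresses.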
Policy parameterization
- Two subnetworks
- One for proprioceptive information
- One for exteroceptive information
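A minimal sketch of such a two-stream policy in PyTorch, assuming a Gaussian action distribution; the layer sizes, activations, and action head are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LocomotionPolicy(nn.Module):
    """One subnetwork for proprioceptive inputs (the agent's own body state)
    and one for exteroceptive inputs (terrain information), merged before a
    Gaussian action head."""

    def __init__(self, proprio_dim, extero_dim, action_dim, hidden=128):
        super().__init__()
        self.proprio_net = nn.Sequential(nn.Linear(proprio_dim, hidden), nn.Tanh())
        self.extero_net = nn.Sequential(nn.Linear(extero_dim, hidden), nn.Tanh())
        self.torso = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, proprio_obs, extero_obs):
        h = torch.cat([self.proprio_net(proprio_obs),
                       self.extero_net(extero_obs)], dim=-1)
        h = self.torso(h)
        return torch.distributions.Normal(self.mean_head(h), self.log_std.exp())
```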
-
Results
- Planar Walker
- Trained on hurdles, gaps, platforms, and variable terrain, both on mixtures of single-type courses and on mixed courses
- Acquired emergent behaviours spontaneously, without a fine-tuned reward function
- Quadruped
- Trained on a variation of the hurdles course: some obstacles can be avoided, others require climbing or jumping over
- Traverses the courses reasonably well despite the limitations of its body
- Agents trained with gradually increasing difficulty improve faster than agents trained on stationary terrain
- Also compared training on flat vs. challenging courses to see whether policies are robust to unobserved variation in the terrain
- Humanoid
- Learning is more sensitive to the algorithm settings because of the humanoid's many degrees of freedom
- More variation in results across runs
- Despite this, they achieve a good policy on diverse terrains
-
Discussion
- Training on richer environments + broader spectrum of tasks = likely improves quality + robustness of learned behaviors.