The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously
-
Resources
-
Introduction
- We can use a single stream of experience to learn + perfect many policies
- Use actor-critic architecture with deterministic policy gradients
- Solve 1 task on-policy while solving other tasks off-policy
- Policies learned unintentionally (off-policy) can later be used as intentional (behavior) policies
- Uses two neural networks (an actor and a critic)
- Actor has multiple heads representing different policies, with a shared lower-level representation
- Critic has several state-action value function heads with a shared representation (both networks are sketched after this list)
- Introduces an automatic procedure to generate semantically meaningful goals for the agent
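A minimal PyTorch sketch of the two-network, multi-head architecture described above; the class names (`IUActor`, `IUCritic`), layer sizes, and `tanh` action squashing are illustrative assumptions rather than details taken from the paper.

```python
# Sketch (not the paper's code): actor and critic with a shared torso and one head per task.
import torch
import torch.nn as nn

class IUActor(nn.Module):
    def __init__(self, state_dim, action_dim, n_tasks, hidden=128):
        super().__init__()
        # Shared lower-level representation
        self.torso = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One deterministic policy head per task
        self.heads = nn.ModuleList([nn.Linear(hidden, action_dim) for _ in range(n_tasks)])

    def forward(self, s):
        h = self.torso(s)
        # (n_tasks, batch, action_dim): one action per policy head
        return torch.stack([torch.tanh(head(h)) for head in self.heads])

class IUCritic(nn.Module):
    def __init__(self, state_dim, action_dim, n_tasks, hidden=128):
        super().__init__()
        # Shared representation of the state-action pair
        self.torso = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        # One state-action value head per task
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_tasks)])

    def forward(self, s, a):
        h = self.torso(torch.cat([s, a], dim=-1))
        # (n_tasks, batch, 1): one Q-value per task head
        return torch.stack([head(h) for head in self.heads])
```

Because every head backpropagates through the same torso, experience gathered while pursuing any one task also shapes the representation used by all the others.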
-
The Intentional Unintentional Agent
- Vanilla policy gradient theorem: gradient of expected return for a stochastic policy
- Deterministic policy gradient (DPG): the analogous gradient for a deterministic policy, computed via the critic's action-value gradients
- On its own, actor-critic learning with deep function approximators is unstable, especially off-policy
- Deep deterministic policy gradient (DDPG): adds experience replay and target networks, enabling stable off-policy learning with deep networks
- Given stream of rewards, $r_t^i$, we maximize: $J(\theta) = \mathbb{E} _{\rho^\beta}[\sum_i Q _{\mu}^i(s, \mu _{\theta}^i(s))]$
- $\mu _{\theta}^i$: Policy for task $i$
- $\rho^\beta$: stationary distribution of behavior policy ($\beta(a \vert s)$)
- Gradient: $\nabla _{\theta} J(\theta) = \mathbb{E} _{\rho^\beta}[\sum_i \nabla _{\theta} \mu _{\theta}^i(s) \nabla _{a^i}Q _{\mu}^i(s, a^i) \vert _{a^i = \mu^i _{\theta}(s)}]$
- Behavior Policy
- Given by intentional policies
- Input: $s_t$, Output: $a_t$
- $\beta (a \vert s) = \mu^i _{\theta}(s) + Z$
- $Z$: random variable for exploration
- At the beginning of each episode, one intentional task $i$ is selected to drive behavior
- Alternatively, switch to a new task mid-episode once task $i$ is achieved
- Actor and Critic Updates (see the code sketch after this list):
- $\delta_j^i = r_j^i + \gamma Q _{w'}^i(s _{j+1}, \mu _{\theta'}^i(s _{j+1})) - Q _{w}^i(s _{j}, a_j)$
- $w \leftarrow w + \alpha _{critic}\sum_j \sum_i \delta_j^i \nabla_w Q^i_w(s_j, a_j)$
- $\theta \leftarrow \theta + \alpha _{actor}\sum_j \sum_i \nabla _{\theta}\mu _{\theta}^i(s_j) \nabla _{a^i}Q_w^i(s_j, a^i) \vert _{a^i = \mu _{\theta}^i(s_j)}$
- $j$: minibatch indices
- $\theta', w'$: target networks for stability
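A compact, self-contained sketch of how acting and learning might fit together, assuming the same multi-head networks as above (rebuilt here as small `ModuleDict`s so the snippet runs on its own). The noise scale, learning rates, soft target-update rate `tau`, and helper names (`mu`, `q`, `act`, `update`) are assumptions, not the paper's hyperparameters or API.

```python
# Sketch (illustrative, not the paper's code) of one IU iteration: act with the chosen
# intentional policy plus exploration noise, then update every critic head and every
# actor head off-policy from a single replay minibatch.
import copy
import random
import torch
import torch.nn as nn

n_tasks, state_dim, action_dim, hidden = 4, 8, 2, 64   # toy sizes (assumptions)
gamma, tau = 0.99, 0.005                               # discount and soft target-update rate (assumptions)

def make_net(in_dim, out_dim):
    """Shared torso with one linear head per task."""
    return nn.ModuleDict({
        "torso": nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()),
        "heads": nn.ModuleList([nn.Linear(hidden, out_dim) for _ in range(n_tasks)]),
    })

actor, critic = make_net(state_dim, action_dim), make_net(state_dim + action_dim, 1)
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)   # target networks (theta', w')

def mu(net, s):
    """Actions of all policy heads: (n_tasks, batch, action_dim)."""
    h = net["torso"](s)
    return torch.stack([torch.tanh(head(h)) for head in net["heads"]])

def q(net, s, a):
    """Q-values of all critic heads: (n_tasks, batch). `a` is one shared action per state
    (batch, action_dim) or one action per head (n_tasks, batch, action_dim)."""
    if a.dim() == 2:
        h = net["torso"](torch.cat([s, a], dim=-1))
        return torch.stack([head(h).squeeze(-1) for head in net["heads"]])
    h = net["torso"](torch.cat([s.expand(n_tasks, -1, -1), a], dim=-1))
    return torch.stack([net["heads"][i](h[i]).squeeze(-1) for i in range(n_tasks)])

# Behavior policy: one intentional task per episode, with additive exploration noise Z.
task = random.randrange(n_tasks)          # alternatively, switch once the task is achieved
def act(s):
    with torch.no_grad():
        a = mu(actor, s)[task]
    return (a + 0.1 * torch.randn_like(a)).clamp(-1.0, 1.0)

opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)

def update(s, a, r, s2):
    """One update from a replay minibatch; r holds one reward stream per task: (n_tasks, batch)."""
    with torch.no_grad():
        # Per-task TD targets (terminal masking omitted for brevity)
        y = r + gamma * q(critic_targ, s2, mu(actor_targ, s2))
    td = y - q(critic, s, a)                       # delta_j^i for every task head
    critic_loss = td.pow(2).sum(0).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor ascends sum_i Q^i_w(s, mu^i_theta(s)); autograd yields the summed DPG gradient.
    actor_loss = -q(critic, s, mu(actor, s)).sum(0).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # Soft update of the target networks toward the online networks.
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

A driver loop would call `act` to collect transitions into a replay buffer and `update` on sampled minibatches; the loop, the buffer, and terminal handling are omitted here.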
-
Experimental Setup
-
The physical playroom domain
- N objects in a playroom
- Agent uses a fist to interact with objects and can only move objects via contact
- Includes a goal position in the room (an illustrative state encoding is sketched below)
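A purely illustrative sketch of how a playroom state could be encoded for the reward examples later; the dictionary layout and field names are assumptions, not the paper's observation format.

```python
# Illustrative playroom state (not the paper's representation): N movable objects
# with colors and positions, the agent's fist, and a goal position in the room.
playroom_state = {
    "objects": [
        {"color": "red",  "pos": (0.3, 1.2)},
        {"color": "blue", "pos": (2.0, 0.4)},
    ],
    "fist": (1.1, 0.9),   # objects move only when contacted by the fist
    "goal": (0.0, 0.0),   # goal position
}
```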
-
Automatic reward generation with formal language
- Generate rewards based on properties of the objects
- Property functions: $p: \mathcal{O} \times \mathcal{S} \rightarrow [0,1]$
- Binary function indicating whether an object ($o \in \mathcal{O}$) satisfies the property
- Usually independent of state $\rightarrow$ written as a function of the object $o \in \mathcal{O}$ alone: $p(o)$
- Binary relation functions: $b: \mathcal{O} \times \mathcal{O} \times \mathcal{S} \rightarrow [0,1]$ (e.g., $b _{near}$)
- Use binary relation functions and property functions to write rewards
- E.g., using the nearness relation: $r _{\text{red near blue}}(s) = \sum _{o_1, o_2} p _{red}(o_1) p _{blue}(o_2) b _{near}(o_1, o_2, s)$
- “Bring red objects near blue objects”
- Can automatically generate many rewards by logically combining operations
- Based on color of object
- Properties identifying fist and goal
- Near and far relations
- Directional relations
- Can create conjunctions of these (see the sketch below)
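A minimal Python sketch of the reward-generation idea: property functions over objects, relation functions over object pairs and the state, and reward functions composed from them, with one reward generated automatically per ordered color pair. The function names, the distance threshold, and the state layout are illustrative assumptions.

```python
# Illustrative sketch of automatic reward generation (names and thresholds are assumptions).
import math
from itertools import product

def p_color(color):
    """Property function p(o): 1 if the object has the given color."""
    return lambda o: 1.0 if o["color"] == color else 0.0

def b_near(o1, o2, state, thresh=0.5):
    """Relation function b_near(o1, o2, s): 1 if the two objects are within `thresh`."""
    return 1.0 if math.dist(o1["pos"], o2["pos"]) <= thresh else 0.0

def make_reward(p1, p2, relation):
    """Compose r(s) = sum_{o1,o2} p1(o1) p2(o2) b(o1, o2, s), as in the nearness example."""
    def reward(state):
        objs = state["objects"]
        return sum(p1(o1) * p2(o2) * relation(o1, o2, state)
                   for o1, o2 in product(objs, objs) if o1 is not o2)
    return reward

# Automatically generate one "bring X near Y" reward per ordered color pair.
colors = ["red", "blue", "green"]
rewards = {f"{c1}_near_{c2}": make_reward(p_color(c1), p_color(c2), b_near)
           for c1, c2 in product(colors, colors) if c1 != c2}

# Conjunctions: combine generated rewards, e.g. "red near blue AND green near blue".
r_conj = lambda s: min(rewards["red_near_blue"](s), rewards["green_near_blue"](s))

state = {"objects": [{"color": "red", "pos": (0.3, 1.2)},
                     {"color": "blue", "pos": (0.4, 1.1)}],
         "fist": (1.1, 0.9), "goal": (0.0, 0.0)}
print(rewards["red_near_blue"](state))   # -> 1.0: a red object is near a blue object
```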
-
Results
- Test ability to maximize $b _{near}(red, blue, s)$ in 3 scenarios
- No additional tasks (standard DDPG)
- Near and far tasks
- All additional tasks
- All 18 policies were learned simultaneously
- Training on more tasks simultaneously sped up learning of the target task
- Also test with an additional green cube added to the playroom
- Slows learning, but the task is still completed
- Also test on task spaces of 1, 7, and 43 tasks
- 1 task is insufficient for DDPG
- With 7 tasks, DDPG succeeds
- With large task spaces, the IU agent struggles
- Still gains some reward with 43 tasks (it does not fail completely, as in the 1-task scenario)
-
Discussion
- Behavior policy is based on the hardest tasks
- Is it worth following a different learning curriculum for the behavior policy?
- No: choosing the hardest task is best, since the behavior policy's data is what ends up in the replay buffer
- With simpler tasks, the replay buffer fails to support exploration because the trajectories lack rich behavior
- IU agent may still fail if the hardest task is too hard
- This would need to be addressed with hierarchical RL or a better understanding of objects and relations
- Future work should also look at policy reuse
- How can we construct controllers from the various policies that have been learned?