The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously
-
Resources
-
Introduction
- We can use a single stream of experience to learn + perfect many policies
- Use actor-critic architecture with deterministic policy gradients
- Solve 1 task on-policy while solving other tasks off-policy
- Policies learned unintentionally (off-policy) can later be used as intentional (behavior) policies
- Uses two neural networks (an actor and a critic)
- Actor has multiple heads representing different policies, with a shared lower-level representation
- Critic has several state-action value function heads with a shared representation (both networks are sketched after this list)
- Introduces an automatic procedure to generate semantically meaningful goals for the agent
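A minimal PyTorch sketch of the two-network, multi-head architecture described above; the class names (`IUActor`, `IUCritic`), layer sizes, and `tanh` action squashing are illustrative assumptions rather than details taken from the paper.

```python
# Sketch (not the paper's code): actor and critic with a shared torso and one head per task.
import torch
import torch.nn as nn

class IUActor(nn.Module):
    def __init__(self, state_dim, action_dim, n_tasks, hidden=128):
        super().__init__()
        # Shared lower-level representation
        self.torso = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One deterministic policy head per task
        self.heads = nn.ModuleList([nn.Linear(hidden, action_dim) for _ in range(n_tasks)])

    def forward(self, s):
        h = self.torso(s)
        # (n_tasks, batch, action_dim): one action per policy head
        return torch.stack([torch.tanh(head(h)) for head in self.heads])

class IUCritic(nn.Module):
    def __init__(self, state_dim, action_dim, n_tasks, hidden=128):
        super().__init__()
        # Shared representation of the state-action pair
        self.torso = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        # One state-action value head per task
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_tasks)])

    def forward(self, s, a):
        h = self.torso(torch.cat([s, a], dim=-1))
        # (n_tasks, batch, 1): one Q-value per task head
        return torch.stack([head(h) for head in self.heads])
```

Because every head backpropagates through the same torso, experience gathered while pursuing any one task also shapes the representation used by all the others.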
-
The Intentional Unintentional Agent
- Vanilla policy gradient theorem: gradient of expected return for a stochastic policy
- Deterministic policy gradient (DPG): the analogous gradient for a deterministic policy, computed via the critic's action-value gradients
- On its own, actor-critic learning with deep function approximators is unstable, especially off-policy
- Deep deterministic policy gradient (DDPG): adds experience replay and target networks, enabling stable off-policy learning with deep networks
- Given stream of rewards, $r_t^i$, we maximize: $J(\theta) = \mathbb{E} _{\rho^\beta}[\sum_i Q _{\mu}^i(s, \mu _{\theta}^i(s))]$
- $\mu _{\theta}^i$: Policy for task $i$
- $\rho^\beta$: stationary distribution of behavior policy ($\beta(a \vert s)$)
- Gradient: $\nabla _{\theta} J(\theta) = \mathbb{E} _{\rho^\beta}[\sum_i \nabla _{\theta} \mu _{\theta}^i(s) \nabla _{a^i}Q _{\mu}^i(s, a^i) \vert _{a^i = \mu^i _{\theta}(s)}]$
- Behavior Policy
- Given by intentional policies
- Input: $s_t$, Output: $a_t$
- $\beta (a \vert s) = \mu^i _{\theta}(s) + Z$
- $Z$: random variable for exploration
- At the beginning of each episode, one intentional task $i$ is selected to drive behavior
- Alternatively, switch to a new task mid-episode once task $i$ is achieved
- Actor and Critic Updates (see the code sketch after this list):
- $\delta_j^i = r_j^i + \gamma Q _{w'}^i(s _{j+1}, \mu _{\theta'}^i(s _{j+1})) - Q _{w}^i(s _{j}, a_j)$
- $w \leftarrow w + \alpha _{critic}\sum_j \sum_i \delta_j^i \nabla_w Q^i_w(s_j, a_j)$
- $\theta \leftarrow \theta + \alpha _{actor}\sum_j \sum_i \nabla _{\theta}\mu _{\theta}^i(s_j) \nabla _{a^i}Q_w^i(s_j, a^i) \vert _{a^i = \mu _{\theta}^i(s_j)}$
- $j$: minibatch indices
- $\theta', w'$: target networks for stability
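A compact, self-contained sketch of how acting and learning might fit together, assuming the same multi-head networks as above (rebuilt here as small `ModuleDict`s so the snippet runs on its own). The noise scale, learning rates, soft target-update rate `tau`, and helper names (`mu`, `q`, `act`, `update`) are assumptions, not the paper's hyperparameters or API.

```python
# Sketch (illustrative, not the paper's code) of one IU iteration: act with the chosen
# intentional policy plus exploration noise, then update every critic head and every
# actor head off-policy from a single replay minibatch.
import copy
import random
import torch
import torch.nn as nn

n_tasks, state_dim, action_dim, hidden = 4, 8, 2, 64   # toy sizes (assumptions)
gamma, tau = 0.99, 0.005                               # discount and soft target-update rate (assumptions)

def make_net(in_dim, out_dim):
    """Shared torso with one linear head per task."""
    return nn.ModuleDict({
        "torso": nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()),
        "heads": nn.ModuleList([nn.Linear(hidden, out_dim) for _ in range(n_tasks)]),
    })

actor, critic = make_net(state_dim, action_dim), make_net(state_dim + action_dim, 1)
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)   # target networks (theta', w')

def mu(net, s):
    """Actions of all policy heads: (n_tasks, batch, action_dim)."""
    h = net["torso"](s)
    return torch.stack([torch.tanh(head(h)) for head in net["heads"]])

def q(net, s, a):
    """Q-values of all critic heads: (n_tasks, batch). `a` is one shared action per state
    (batch, action_dim) or one action per head (n_tasks, batch, action_dim)."""
    if a.dim() == 2:
        h = net["torso"](torch.cat([s, a], dim=-1))
        return torch.stack([head(h).squeeze(-1) for head in net["heads"]])
    h = net["torso"](torch.cat([s.expand(n_tasks, -1, -1), a], dim=-1))
    return torch.stack([net["heads"][i](h[i]).squeeze(-1) for i in range(n_tasks)])

# Behavior policy: one intentional task per episode, with additive exploration noise Z.
task = random.randrange(n_tasks)          # alternatively, switch once the task is achieved
def act(s):
    with torch.no_grad():
        a = mu(actor, s)[task]
    return (a + 0.1 * torch.randn_like(a)).clamp(-1.0, 1.0)

opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)

def update(s, a, r, s2):
    """One update from a replay minibatch; r holds one reward stream per task: (n_tasks, batch)."""
    with torch.no_grad():
        # Per-task TD targets (terminal masking omitted for brevity)
        y = r + gamma * q(critic_targ, s2, mu(actor_targ, s2))
    td = y - q(critic, s, a)                       # delta_j^i for every task head
    critic_loss = td.pow(2).sum(0).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor ascends sum_i Q^i_w(s, mu^i_theta(s)); autograd yields the summed DPG gradient.
    actor_loss = -q(critic, s, mu(actor, s)).sum(0).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # Soft update of the target networks toward the online networks.
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

A driver loop would call `act` to collect transitions into a replay buffer and `update` on sampled minibatches; the loop, the buffer, and terminal handling are omitted here.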
-
Experimental Setup
-
The physical playroom domain
- N objects in a playroom
- Agent uses a fist to interact with objects and can only move objects via contact
- Includes a goal position in the room (an illustrative state encoding is sketched below)
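A purely illustrative sketch of how a playroom state could be encoded for the reward examples later; the dictionary layout and field names are assumptions, not the paper's observation format.

```python
# Illustrative playroom state (not the paper's representation): N movable objects
# with colors and positions, the agent's fist, and a goal position in the room.
playroom_state = {
    "objects": [
        {"color": "red",  "pos": (0.3, 1.2)},
        {"color": "blue", "pos": (2.0, 0.4)},
    ],
    "fist": (1.1, 0.9),   # objects move only when contacted by the fist
    "goal": (0.0, 0.0),   # goal position
}
```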
-
Automatic reward generation with formal language
- Generate rewards based on properties of the objects
- Property functions: $p: \mathcal{O} \times \mathcal{S} \rightarrow [0,1]$
- Binary function indicating whether an object ($o \in \mathcal{O}$) satisfies the property
- Usually independent of state $\rightarrow$ written as a function of the object $o \in \mathcal{O}$ alone: $p(o)$
- Binary relation functions: $b: \mathcal{O} \times \mathcal{O} \times \mathcal{S} \rightarrow [0,1]$ (e.g., $b _{near}$)
- Use binary relation functions and property functions to write rewards
- E.g., using the nearness relation: $r _{\text{red near blue}}(s) = \sum _{o_1, o_2} p _{red}(o_1) p _{blue}(o_2) b _{near}(o_1, o_2, s)$
- “Bring red objects near blue objects”
- Can automatically generate many rewards by logically combining operations
- Based on color of object
- Properties identifying fist and goal
- Near and far relations
- Directional relations
- Can create conjunctions of these (see the sketch below)
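A minimal Python sketch of the reward-generation idea: property functions over objects, relation functions over object pairs and the state, and reward functions composed from them, with one reward generated automatically per ordered color pair. The function names, the distance threshold, and the state layout are illustrative assumptions.

```python
# Illustrative sketch of automatic reward generation (names and thresholds are assumptions).
import math
from itertools import product

def p_color(color):
    """Property function p(o): 1 if the object has the given color."""
    return lambda o: 1.0 if o["color"] == color else 0.0

def b_near(o1, o2, state, thresh=0.5):
    """Relation function b_near(o1, o2, s): 1 if the two objects are within `thresh`."""
    return 1.0 if math.dist(o1["pos"], o2["pos"]) <= thresh else 0.0

def make_reward(p1, p2, relation):
    """Compose r(s) = sum_{o1,o2} p1(o1) p2(o2) b(o1, o2, s), as in the nearness example."""
    def reward(state):
        objs = state["objects"]
        return sum(p1(o1) * p2(o2) * relation(o1, o2, state)
                   for o1, o2 in product(objs, objs) if o1 is not o2)
    return reward

# Automatically generate one "bring X near Y" reward per ordered color pair.
colors = ["red", "blue", "green"]
rewards = {f"{c1}_near_{c2}": make_reward(p_color(c1), p_color(c2), b_near)
           for c1, c2 in product(colors, colors) if c1 != c2}

# Conjunctions: combine generated rewards, e.g. "red near blue AND green near blue".
r_conj = lambda s: min(rewards["red_near_blue"](s), rewards["green_near_blue"](s))

state = {"objects": [{"color": "red", "pos": (0.3, 1.2)},
                     {"color": "blue", "pos": (0.4, 1.1)}],
         "fist": (1.1, 0.9), "goal": (0.0, 0.0)}
print(rewards["red_near_blue"](state))   # -> 1.0: a red object is near a blue object
```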
-
Results
- Test ability to maximize $b _{near}(red, blue, s)$ in 3 scenarios
- No additional tasks (standard DDPG)
- Near and far tasks
- All additional tasks
- All 18 policies were learned simultaneously
- Training on more tasks simultaneously sped up learning of the target task
- Also test with an additional green cube added to the playroom
- Slows learning, but the task is still completed
- Also test on task spaces of 1, 7, and 43 tasks
- 1 task is insufficient for DDPG
- With 7 tasks, DDPG succeeds
- With large task spaces, the IU agent struggles
- Still gains some reward with 43 tasks (it does not fail completely, as in the 1-task scenario)
-
Discussion
- Behavior policy is based on the hardest tasks
- Is it worth following a different learning curriculum for the behavior policy?
- No: choosing the hardest task is best, since the behavior policy's data is what ends up in the replay buffer
- With simpler tasks, the replay buffer fails to support exploration because the trajectories lack rich behavior
- IU agent may still fail if the hardest task is too hard
- This would need to be addressed with hierarchical RL or a better understanding of objects and relations
- Future work should also look at policy reuse
- How can we construct controllers from the various policies that have been learned?