Use an LSTM to jointly approximate the policy and the value function
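A minimal PyTorch sketch of such a shared trunk (the layer sizes and the linear stand-in for the CNN encoder are illustrative assumptions, not from these notes):

```python
import torch
import torch.nn as nn

class LSTMActorCritic(nn.Module):
    """Shared LSTM trunk with a policy (actor) head and a value (critic) head."""
    def __init__(self, obs_dim, n_actions, hidden_size=256):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_size)         # stand-in for a CNN encoder
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)
        self.policy_head = nn.Linear(hidden_size, n_actions)   # action logits
        self.value_head = nn.Linear(hidden_size, 1)            # state-value estimate

    def forward(self, obs, hx, cx):
        z = torch.relu(self.encoder(obs))
        hx, cx = self.lstm(z, (hx, cx))
        return self.policy_head(hx), self.value_head(hx), (hx, cx)
```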
Auxiliary Tasks for Reinforcement Learning
Auxiliary Control Tasks
Auxiliary Control Tasks: Additional pseudo-reward functions defined in the environment the agent interacts with
$r^{(c)}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$
Set of auxiliary control tasks $\mathcal{C}$
Summing over all auxiliary tasks $c \in \mathcal{C}$, we want to find: $\arg\max _{\theta} \mathbb{E} _{\pi}[R _{1:\infty}] + \lambda_c \sum _{c \in \mathcal{C}} \mathbb{E} _{\pi^{(c)}}[R^{(c)} _{1:\infty}]$
$R^{(c)} _{1:\infty}$: discounted return for $r^{(c)}$
$\theta$: Shared parameters across $\pi, \pi^{(c)}$
Sharing parameters forces the agent to balance performance on the global reward with performance on the auxiliary tasks
To efficiently learn many pseudo-rewards in parallel, use off-policy Q-learning
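A hedged sketch of an off-policy n-step Q-learning loss for one auxiliary control task on a replayed sequence (the tensor shapes, the target network, and `gamma` are assumptions; the notes do not specify the exact update):

```python
import torch
import torch.nn.functional as F

def aux_q_loss(q_net, target_q_net, obs_seq, act_seq, pseudo_r_seq, gamma=0.99):
    """n-step off-policy Q-learning loss for one auxiliary task c with
    pseudo-rewards r^(c). obs_seq holds T+1 observations, act_seq and
    pseudo_r_seq hold T actions/rewards."""
    T = pseudo_r_seq.shape[0]
    with torch.no_grad():
        # Bootstrap from the target network at the final state of the sequence.
        returns = target_q_net(obs_seq[-1]).max(dim=-1).values
    loss = 0.0
    for t in reversed(range(T)):
        returns = pseudo_r_seq[t] + gamma * returns                      # R^(c)_t
        q_taken = q_net(obs_seq[t]).gather(-1, act_seq[t].unsqueeze(-1)).squeeze(-1)
        loss = loss + F.mse_loss(q_taken, returns)
    return loss / T
```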
Types of auxiliary reward functions
Pixel Control: Changes in the perceptual stream often correspond to important events. Train a policy that maximizes pixel change (pseudo-reward sketched after this list)
Feature Control: Networks extract high-level features. Use the activation of hidden units as an auxiliary reward. Train a separate policy to maximize the activation of hidden units in a given layer
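A rough illustration of the pixel-control pseudo-reward: average absolute pixel change within each cell of a grid over the observation. The cell size and frame layout are assumptions for the sketch:

```python
import torch
import torch.nn.functional as F

def pixel_change_reward(frame_prev, frame_next, cell=4):
    """Pseudo-reward grid for pixel control: mean absolute intensity change
    per spatial cell. frames: [C, H, W] tensors, H and W divisible by `cell`."""
    diff = (frame_next - frame_prev).abs().mean(dim=0, keepdim=True)   # [1, H, W]
    # Average-pool per-pixel change into an (H/cell) x (W/cell) grid of pseudo-rewards.
    return F.avg_pool2d(diff.unsqueeze(0), kernel_size=cell).squeeze()
```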
Auxiliary Reward Tasks
Agent also needs to maximize global reward stream
Needs to recognize states that lead to high reward and high value
However sparse reward environments make this difficult
Want to remove sparsity while keeping policy unbiased
Reward Prediction: Require agent to predict reward attained in a subsequent unseen frame
Helps shape the agent's features $\rightarrow$ any bias introduced affects only the reward predictor and the shared features, not the policy or value function
Train reward prediction on $S _{\tau} = (s _{\tau - k}, s _{\tau - k + 1}, \dots, s _{\tau -1})$ to predict $r _{\tau}$
Sample $S _{\tau}$ in skewed manner to over-represent rewarding events
Zero rewards and non-zero rewards equally represented ($P(r _{\tau} \neq 0) = 0.5$)
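A minimal sketch of this skewed sampling (the replay-buffer layout, with per-step `obs`/`reward` entries, is an assumption):

```python
import random

def sample_reward_prediction_batch(replay, k=3):
    """Sample a history of k frames whose following frame has a non-zero
    reward with probability ~0.5, over-representing rewarding events."""
    rewarding = [i for i in range(k, len(replay)) if replay[i]["reward"] != 0]
    non_rewarding = [i for i in range(k, len(replay)) if replay[i]["reward"] == 0]
    pool = rewarding if (rewarding and random.random() < 0.5) else (non_rewarding or rewarding)
    tau = random.choice(pool)
    frames = [replay[t]["obs"] for t in range(tau - k, tau)]   # s_{tau-k} .. s_{tau-1}
    return frames, replay[tau]["reward"]                       # predict r_tau
```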
Use different architecture from policy network
Concatenate the stacked states after they are encoded by the CNN
Focuses on immediate reward prediction rather than long-term returns by looking only at the immediately preceding states instead of the entire history
These features are shared with the LSTM
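A sketch of such a reward-prediction head, assuming a shared CNN encoder and a three-way classification of the reward sign (negative / zero / positive); `feat_dim` and `k` are assumptions:

```python
import torch
import torch.nn as nn

class RewardPredictor(nn.Module):
    """Predicts the sign of r_tau from the concatenated CNN features
    of the k preceding frames (no LSTM, unlike the policy network)."""
    def __init__(self, cnn_encoder, feat_dim, k=3):
        super().__init__()
        self.encoder = cnn_encoder                    # shared with the main agent
        self.classifier = nn.Linear(k * feat_dim, 3)  # logits over {-, 0, +}

    def forward(self, frames):                        # frames: list of k [B, ...] tensors
        feats = [self.encoder(f) for f in frames]     # each [B, feat_dim]
        return self.classifier(torch.cat(feats, dim=-1))
```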
Experience Replay
Uses prioritized replay that oversamples rewarding states
Also does value function replay: off-policy regression of the value function on data from the replay buffer
Randomly varies truncation window for returns (i.e., uses random n for n-step returns)
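A hedged sketch of value function replay with a randomly truncated n-step return (the sequence layout, `gamma`, and `max_n` are assumptions):

```python
import random
import torch
import torch.nn.functional as F

def value_replay_loss(value_fn, obs_seq, reward_seq, gamma=0.99, max_n=5):
    """Off-policy value regression on a replayed sequence: pick a random n,
    form the n-step return, and bootstrap from V(s_n)."""
    n = random.randint(1, min(max_n, reward_seq.shape[0]))
    with torch.no_grad():
        target = value_fn(obs_seq[n]).squeeze(-1)      # bootstrap value V(s_n)
        for t in reversed(range(n)):
            target = reward_seq[t] + gamma * target    # accumulate n-step return
    return F.mse_loss(value_fn(obs_seq[0]).squeeze(-1), target)
```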