Expand single-step consistency to multi-step consistency: $V^\ast(s_t) = \mathbb{E} _{\tilde{r} _{t+i}, s _{t+i}}[\gamma^d V^\ast (s _{t+d}) + \sum _{i=0}^{d-1} \gamma^i(r _{t+i} - (\tau + \lambda) \log \pi^\ast (a _{t+i} \vert s _{t+i}) + \lambda \log \tilde{\pi}(a _{t+i} \vert s _{t+i}))]$
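Sanity check: setting $d = 1$ collapses the sum to the $i = 0$ term and recovers the single-step consistency with the extra relative-entropy term, $V^\ast(s_t) = \mathbb{E}[r_t - (\tau + \lambda) \log \pi^\ast (a_t \vert s_t) + \lambda \log \tilde{\pi}(a_t \vert s_t) + \gamma V^\ast (s _{t+1})]$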
Trust-PCL
Multi-step consistency error: $C(s _{t:t+d}, \theta, \phi) = - V _{\phi}(s_t) + \gamma^d V _\phi(s _{t+d}) + \sum _{i=0}^{d-1} \gamma^i (r _{t+i} - (\tau + \lambda) \log \pi _\theta (a _{t+i} \vert s _{t+i}) + \lambda \log \pi _{\tilde{\theta}}(a _{t+i} \vert s _{t + i}))$
Minimize the squared consistency error over a batch of episodes (see the sketch below)
Batch can be from on-policy or off-policy data
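A minimal NumPy sketch of the loss (my own variable names; per-step rewards, log-probabilities, and value estimates for each sub-trajectory are assumed to be precomputed):

```python
import numpy as np

def consistency_error(rewards, log_pi, log_pi_lag, v_start, v_end,
                      gamma, tau, lam):
    """Multi-step consistency error C(s_{t:t+d}) for one sub-trajectory.

    rewards, log_pi, log_pi_lag: length-d arrays holding r_{t+i},
    log pi_theta(a_{t+i}|s_{t+i}) and log pi_theta~(a_{t+i}|s_{t+i}).
    v_start, v_end: V_phi(s_t) and V_phi(s_{t+d}).
    """
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    inner = rewards - (tau + lam) * log_pi + lam * log_pi_lag
    return -v_start + gamma ** d * v_end + np.sum(discounts * inner)

def batch_loss(sub_trajectories, gamma, tau, lam):
    """Mean squared consistency error over a batch of sub-trajectories
    (the batch can mix on-policy rollouts and replayed off-policy episodes)."""
    errors = [consistency_error(*traj, gamma=gamma, tau=tau, lam=lam)
              for traj in sub_trajectories]
    return np.mean(np.square(errors))
```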
Automatic Tuning of the Lagrange Multiplier $\lambda$
$\lambda$ needs to adapt to distribution of rewards
Instead make $\lambda$ a function of $\epsilon$ where $\epsilon$ is a hard constraint on relative entropy
In Trust-PCL, you can perform a line search to find a $\lambda(\epsilon)$ that yields a $KL(\pi^\ast \vert \vert \pi _{\tilde{\theta}})$ as close as possible to $\epsilon$ (see the sketch below)
See the paper for the analysis of the maximum KL divergence
$\epsilon$ can change during training; as episode length increases, KL generally increases too
For a set of $N$ episodes with lengths $T_k$, approximate the $\lambda$ that yields a maximum divergence of $\frac{\epsilon}{N}\sum _{k=1}^N T_k$
$\epsilon$ becomes constraint on length averaged KL
To avoid requiring many extra environment interactions, use the last 100 episodes in practice
Not exactly the same as sampling from old policy
But close enough since old policy is lagged version of online policy
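A rough sketch of the $\lambda(\epsilon)$ search (binary search is just one way to do the line search; `estimate_max_kl` stands in for the paper's estimate of the maximum $KL(\pi^\ast \vert \vert \pi _{\tilde{\theta}})$ implied by a given $\lambda$ on the recent episodes):

```python
import numpy as np

def target_divergence(episode_lengths, epsilon):
    """Target max divergence epsilon/N * sum_k T_k, i.e. the constraint on
    length-averaged KL, computed from the last 100 episodes."""
    recent = episode_lengths[-100:]
    return epsilon / len(recent) * np.sum(recent)

def tune_lambda(estimate_max_kl, episode_lengths, epsilon,
                lo=1e-6, hi=1e3, iters=30):
    """Search for lambda(epsilon).

    estimate_max_kl(lam) -> approximate max KL(pi* || pi_lagged) implied by
    lambda on the recent episodes (assumed given, and assumed to decrease
    monotonically as lambda grows, which is what makes bisection valid).
    """
    target = target_divergence(episode_lengths, epsilon)
    for _ in range(iters):
        mid = np.sqrt(lo * hi)        # geometric midpoint, since lambda > 0
        if estimate_max_kl(mid) > target:
            lo = mid                  # KL still too large -> need a bigger lambda
        else:
            hi = mid                  # KL under target -> lambda can shrink
    return np.sqrt(lo * hi)
```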
Experiments
Setup
Tested on discrete + continuous control tasks
Compared with TRPO
Results
Trust-PCL is able to match or exceed TRPO in reward and sample efficiency
Hyperparameter Analysis
As $\epsilon$ increases, instability also increases
Standard PCL would fail in many of these scenarios, since standard PCL corresponds to the limit $\epsilon \rightarrow \infty$ (no trust region)
Trust-PCL is better than TRPO because of its ability to learn in an off-policy manner (better sample efficiency)