Expand single-step consistency to multi-step consistency: $V^\ast(s_t) = \mathbb{E} _{\tilde{r} _{t+i}, s _{t+i}}[\gamma^d V^\ast (s _{t+d}) + \sum _{i=0}^{d-1} \gamma^i(r _{t+i} - (\tau + \lambda) \log \pi^\ast (a _{t+i} \vert s _{t+i}) + \lambda \log \tilde{\pi}(a _{t+i} \vert s _{t+i}))]$
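Sanity check: setting $d = 1$ collapses the sum to the $i = 0$ term and recovers the single-step consistency with the extra relative-entropy term, $V^\ast(s_t) = \mathbb{E}[r_t - (\tau + \lambda) \log \pi^\ast (a_t \vert s_t) + \lambda \log \tilde{\pi}(a_t \vert s_t) + \gamma V^\ast (s _{t+1})]$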
Trust-PCL
Multi-step consistency error: $C(s _{t:t+d}, \theta, \phi) = - V _{\phi}(s_t) + \gamma^d V _\phi(s _{t+d}) + \sum _{i=0}^{d-1} \gamma^i (r _{t+i} - (\tau + \lambda) \log \pi _\theta (a _{t+i} \vert s _{t+i}) + \lambda \log \pi _{\tilde{\theta}}(a _{t+i} \vert s _{t + i}))$
Minimize the squared consistency error over a batch of episodes (see the sketch below)
Batch can be from on-policy or off-policy data
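A minimal NumPy sketch of the loss (my own variable names; per-step rewards, log-probabilities, and value estimates for each sub-trajectory are assumed to be precomputed):

```python
import numpy as np

def consistency_error(rewards, log_pi, log_pi_lag, v_start, v_end,
                      gamma, tau, lam):
    """Multi-step consistency error C(s_{t:t+d}) for one sub-trajectory.

    rewards, log_pi, log_pi_lag: length-d arrays holding r_{t+i},
    log pi_theta(a_{t+i}|s_{t+i}) and log pi_theta~(a_{t+i}|s_{t+i}).
    v_start, v_end: V_phi(s_t) and V_phi(s_{t+d}).
    """
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    inner = rewards - (tau + lam) * log_pi + lam * log_pi_lag
    return -v_start + gamma ** d * v_end + np.sum(discounts * inner)

def batch_loss(sub_trajectories, gamma, tau, lam):
    """Mean squared consistency error over a batch of sub-trajectories
    (the batch can mix on-policy rollouts and replayed off-policy episodes)."""
    errors = [consistency_error(*traj, gamma=gamma, tau=tau, lam=lam)
              for traj in sub_trajectories]
    return np.mean(np.square(errors))
```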
Automatic Tuning of the Lagrange Multiplier $\lambda$
$\lambda$ needs to adapt to distribution of rewards
Instead make $\lambda$ a function of $\epsilon$ where $\epsilon$ is a hard constraint on relative entropy
In Trust-PCL, you can perform a line search to find a $\lambda(\epsilon)$ that yields a $KL(\pi^\ast \vert \vert \pi _{\tilde{\theta}})$ as close as possible to $\epsilon$ (see the sketch below)
See the paper for the analysis of the maximum KL divergence
$\epsilon$ can change during training; as episode length increases, KL generally increases too
For a set of $N$ episodes with lengths $T_k$, approximate the $\lambda$ that yields a maximum divergence of $\frac{\epsilon}{N}\sum _{k=1}^N T_k$
$\epsilon$ becomes constraint on length averaged KL
To avoid requiring many extra environment interactions, use the last 100 episodes in practice
Not exactly the same as sampling from old policy
But close enough since old policy is lagged version of online policy
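A rough sketch of the $\lambda(\epsilon)$ search (binary search is just one way to do the line search; `estimate_max_kl` stands in for the paper's estimate of the maximum $KL(\pi^\ast \vert \vert \pi _{\tilde{\theta}})$ implied by a given $\lambda$ on the recent episodes):

```python
import numpy as np

def target_divergence(episode_lengths, epsilon):
    """Target max divergence epsilon/N * sum_k T_k, i.e. the constraint on
    length-averaged KL, computed from the last 100 episodes."""
    recent = episode_lengths[-100:]
    return epsilon / len(recent) * np.sum(recent)

def tune_lambda(estimate_max_kl, episode_lengths, epsilon,
                lo=1e-6, hi=1e3, iters=30):
    """Search for lambda(epsilon).

    estimate_max_kl(lam) -> approximate max KL(pi* || pi_lagged) implied by
    lambda on the recent episodes (assumed given, and assumed to decrease
    monotonically as lambda grows, which is what makes bisection valid).
    """
    target = target_divergence(episode_lengths, epsilon)
    for _ in range(iters):
        mid = np.sqrt(lo * hi)        # geometric midpoint, since lambda > 0
        if estimate_max_kl(mid) > target:
            lo = mid                  # KL still too large -> need a bigger lambda
        else:
            hi = mid                  # KL under target -> lambda can shrink
    return np.sqrt(lo * hi)
```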
Experiments
Setup
Tested on discrete + continuous control tasks
Compared with TRPO
Results
Trust-PCL is able to match or exceed TRPO in reward and sample efficiency
Hyperparameter Analysis
As $\epsilon$ increases, instability also increases
Standard PCL would fail in many of these scenarios, since standard PCL corresponds to the limit $\epsilon \rightarrow \infty$ (no trust region)
Trust-PCL is better than TRPO because of its ability to learn in an off-policy manner (better sample efficiency)