In the policy gradient estimator $g = \mathbb{E}[\sum_{t=0}^\infty \psi_t \nabla_\theta \log \pi_\theta (a_t \vert s_t)]$, $\psi_t$ can be the total reward of the trajectory, the reward following the action, the baselined reward, the state-action value function, the advantage function, or a TD residual
The advantage function usually gives the lowest variance, but it is not known exactly and must be estimated
Intuitively this makes sense: the policy gradient should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions, which is exactly what the advantage measures (a few of the simpler choices of $\psi_t$ are sketched in code below)
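To make these concrete, here is a minimal NumPy sketch (not the paper's code; the toy trajectory and the mean-reward baseline are illustrative assumptions) computing the total-trajectory-reward, reward-to-go, and baselined reward-to-go choices of $\psi_t$ for a single finite trajectory.

```python
import numpy as np

def psi_variants(rewards, baseline=0.0):
    """rewards: array of r_0, ..., r_{T-1} from one trajectory."""
    T = len(rewards)
    total_return = np.full(T, rewards.sum())        # psi_t = total reward of trajectory
    reward_to_go = np.cumsum(rewards[::-1])[::-1]   # psi_t = reward following action a_t
    baselined = reward_to_go - baseline             # psi_t = baselined reward
    return total_return, reward_to_go, baselined

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(psi_variants(rewards, baseline=rewards.mean()))
```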
$\gamma$ parameter for downweighting rewards corresponding to delayed effects (less variance but more bias)
Takes the same form as the usual MDP discount factor, but here it is treated as a variance-reduction parameter rather than part of the problem definition
Discounted approximation to policy gradient using advantages: $g^\gamma = \mathbb{E}[\sum_{t=0}^\infty A^{\pi, \gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta (a_t\vert s_t)]$
$\gamma < 1$ means terms with $l \gg \frac{1}{1-\gamma}$ are effectively dropped (e.g., $\gamma = 0.99$ gives an effective horizon of roughly 100 steps)
If the shaped reward is obtained via reward shaping with $\Phi = V^{\pi, \gamma}$, then $\mathbb{E}[\tilde{r}_{t+l} \vert s_t, a_t] = \mathbb{E}[\tilde{r}_{t+l} \vert s_t]$ for $l > 0$: a temporally extended response is turned into an immediate response, because the value function absorbs the temporal spread of the reward
Helps gradient focus on near-term outcomes
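As a sketch of how the TD residual and reward shaping tie together (a minimal NumPy example, not the paper's code; episodic data, precomputed value estimates, and the toy numbers are assumptions): with $\Phi = V$ the shaped reward is exactly the TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, and the $(\gamma\lambda)$-discounted sum of residuals gives the generalized advantage estimate; the $\lambda$ here is the GAE parameter varied in the experiments below.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T), with a bootstrap value appended."""
    T = len(rewards)
    # Shaped rewards with Phi = V are exactly the TD residuals delta_t.
    deltas = rewards + gamma * values[1:] - values[:-1]
    # Discounted sum of residuals with factor gamma * lam, computed backwards in time.
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.6, 0.4, 0.0])  # terminal value assumed to be 0
print(gae_advantages(rewards, values))
```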
Value Function Estimation
To estimate the value function, regress against Monte Carlo returns: $\text{minimize}_\phi \sum_{n=1}^N \vert\vert V_\phi (s_n) - \hat{V}_n \vert\vert^2$
Use a trust region to avoid overfitting on the most recent batch of data: constrain $\frac{1}{N}\sum_{n=1}^N \frac{\vert\vert V_\phi (s_n) - V_{\phi_{old}}(s_n) \vert\vert^2}{2 \sigma^2} \leq \epsilon$, where $\sigma^2 = \frac{1}{N}\sum_{n=1}^N \vert\vert V_{\phi_{old}}(s_n) - \hat{V}_n \vert\vert^2$ is the old value function's mean squared error
Similar to TRPO KL Divergence Constraint
Use approximate solution with conjugate gradients
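A simplified sketch of the trust-region value fit (assuming a linear value function so the unconstrained fit has a closed form; the paper fits a neural network and solves the constrained problem approximately with conjugate gradient, whereas the backtracking step below is a crude stand-in):

```python
import numpy as np

def fit_value_trust_region(features, mc_returns, phi_old, eps=0.01):
    """features: (N, d) state features; mc_returns: (N,) Monte Carlo return targets."""
    # Unconstrained least-squares fit: minimize ||features @ phi - mc_returns||^2.
    phi_new, *_ = np.linalg.lstsq(features, mc_returns, rcond=None)

    # sigma^2: mean squared error of the old value function on this batch.
    old_pred = features @ phi_old
    sigma2 = np.mean((old_pred - mc_returns) ** 2) + 1e-8

    # Trust-region constraint: (1/N) sum ||V_phi(s_n) - V_phi_old(s_n)||^2 / (2 sigma^2) <= eps.
    def constraint(phi):
        return np.mean((features @ phi - old_pred) ** 2) / (2.0 * sigma2)

    # If the full step violates the constraint, shrink it until it fits.
    alpha = 1.0
    while constraint(phi_old + alpha * (phi_new - phi_old)) > eps and alpha > 1e-4:
        alpha *= 0.5
    return phi_old + alpha * (phi_new - phi_old)
```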
Experiments
Policy Optimization Algorithm
For experiments, they use GAE with TRPO
Vary the $\gamma, \lambda$ parameters to see effects
Use value function for advantage estimation
Experimental Setup
Architecture
3 Hidden Layers with tanh activations
Same policy + value function architecture
Final layer = linear
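A sketch of that architecture in PyTorch (the hidden width of 64 and the input/output dimensions below are assumptions, not values from these notes):

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # 3 hidden layers with tanh activations, linear final layer.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

policy_net = mlp(in_dim=17, out_dim=6)  # e.g., observation -> action mean (dims assumed)
value_net = mlp(in_dim=17, out_dim=1)   # same architecture used for the value function
```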
Task Details
Tasks: cart-pole, plus MuJoCo 3D bipedal locomotion, quadrupedal locomotion, and a biped dynamically standing up
Experimental Results
Cart-Pole
Fixed $\gamma$ while sweeping $\lambda \in [0, 1]$; $\lambda \in [0.92, 0.98]$ gives the fastest policy improvement
3D Bipedal Locomotion
Best $\gamma \in [0.99, 0.995]$ with $\lambda \in [0.96, 0.99]$
3D Robot Tasks
Quadrupedal locomotion: fixed $\gamma = 0.995$ with best $\lambda = 0.96$
3D standing: fixed $\gamma = 0.995$ with best $\lambda \in [0.96, 1]$
Discussion
Control problems are difficult to solve because of high sample complexity
Sample complexity can be reduced by obtaining good estimates of the advantage function
Future work should look to tune $\gamma, \lambda$ automatically
If we knew the relationship between policy gradient estimation error and value function estimation error, we could choose a value function error metric that is well matched to the policy gradient