Distributional Reinforcement Learning with Quantile Regression
Introduction
In classical distributional RL (the C51 algorithm), we could not use the Wasserstein metric between probability distributions as a loss
It cannot be minimized with SGD from sample transitions
C51 instead used a projection onto its fixed support followed by a KL divergence minimization
However, we can do distributional RL with Wasserstein metric
C51 uses $N$ fixed locations for the distribution support and adjusts their probabilities
Instead, assign fixed uniform probabilities to adjustable locations
Quantile regression for adjusting locations
They use a smoothed version of quantile regression (Huber quantile regression) to beat C51
Distributional RL
Assume the standard distributional RL setting (see C51 paper notes)
Value function = expectation of the value distribution over all intrinsic randomness (randomness in the MDP)
Bellman operator: $\mathcal{T}^\pi Q(x,a) = \mathbb{E}[R(x,a)] + \gamma \mathbb{E}_{P, \pi}[Q(x', a')]$
Distributional Bellman operator: $\mathcal{T}^\pi Z(x,a) \stackrel{D}{=} R(x,a) + \gamma P^\pi Z(x,a)$
C51 models $Z^\pi (x,a)$ using discrete distribution on fixed locations ($z_1 \leq \dots \leq z_N$)
Parameters: probabilities of each location
Projects $\mathcal{T}^\pi Z$ onto the finite fixed support, then minimizes the KL divergence
The Wasserstein Metric
p-Wasserstein metric between $U$ and $Y$: $W_p(U, Y) = \left(\int_0^1 \vert F^{-1}(\omega) - G^{-1}(\omega) \vert^p \, d\omega\right)^{\frac{1}{p}}$
Where $F^{-1}$ and $G^{-1}$ are the inverse CDFs (quantile functions) of $U$ and $Y$
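As a sanity check on the definition, here is a minimal numpy sketch (function name is mine) computing $W_p$ between two 1-D empirical distributions, using the fact that the inverse CDF of an empirical distribution evaluated on a uniform grid is just its sorted sample vector:

```python
import numpy as np

def wasserstein_p(u_samples, y_samples, p=1):
    # For equal-size samples, F^{-1} and G^{-1} on a uniform grid over omega
    # are the sorted samples, so the integral reduces to a mean over sorted pairs.
    u = np.sort(np.asarray(u_samples, dtype=float))
    y = np.sort(np.asarray(y_samples, dtype=float))
    assert len(u) == len(y), "this sketch assumes equal sample sizes"
    return np.mean(np.abs(u - y) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
# Distance between two Gaussians differing by a unit shift: roughly 1.0 for p=1
print(wasserstein_p(rng.normal(0, 1, 1000), rng.normal(1, 1, 1000), p=1))
```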
Convergence of Distributional Bellman Operator
Maximal form of the Wasserstein metric: $\bar{d}_p(Z_1, Z_2) = \sup_{x,a} W_p(Z_1(x,a), Z_2(x,a))$
$Z_1, Z_2$ are two action-value distributions
The distributional Bellman operator is a contraction in $\bar{d}_p$:
$\bar{d_p}(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \leq \gamma \bar{d_p}(Z_1, Z_2)$
Tells us that minimizing the Wasserstein distance is useful for learning distributions, but this cannot be done directly
Because we train with sampled (SARSA-style) transitions, we only have sample estimates of the distribution
The minimizer of the expected sample loss differs from the minimizer of the true Wasserstein loss, so stochastic gradients are biased
Approximately Minimizing Wasserstein
Instead of fixed locations with parameterized probabilities, we can use fixed probabilities with variable locations
New probabilities: $q_i = 1/N$
We want to estimate quantiles of target distribution
$\mathcal{Z}_Q$: space of quantile distributions for fixed $N$
Cumulative probabilities denoted by $\tau_i = \frac{i}{N}$ for $i = 0, \dots, N$
Create a parametric model that maps state-action pairs to a uniform probability distribution: $Z_\theta(x,a) = \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i(x,a)}$
$\delta_z$ is a Dirac at $z$
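To make the parameterization concrete, a small sketch (the values are made up): $Z_\theta(x,a)$ is just $N$ learned locations with weight $1/N$ each, and the usual Q-value is their mean:

```python
import numpy as np

N = 4
theta = np.array([-1.0, 0.5, 2.0, 3.5])  # theta_i(x,a): learned support locations
q = np.full(N, 1.0 / N)                  # fixed uniform probabilities q_i = 1/N

# Q(x,a) = E[Z_theta(x,a)], the mean of the uniform mixture of Diracs
q_value = q @ theta                      # equivalently theta.mean()
print(q_value)                           # 1.25
```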
Benefits of parameterizing quantile locations:
No prespecified bounds on the support (more accurate predictions)
No projection step needed
We can minimize the Wasserstein loss without biased gradients (via quantile regression)
The Quantile Approximation
The Bellman update projected onto the parameterized quantile distribution is a contraction
Quantile Projection
We want to project a value distribution $Z \in \mathcal{Z}$ onto $\mathcal{Z}_Q$
$\Pi_{W_1} Z = \arg\min_{Z_\theta \in \mathcal{Z}_Q} W_1(Z, Z_\theta)$
1-Wasserstein distance to a quantile distribution: $W_1(Y, U) = \sum_{i=1}^N \int_{\tau_{i-1}}^{\tau_i} \vert F_Y^{-1}(\omega) - \theta_i \vert \, d\omega$
For a given interval $[\tau, \tau']$, we want to find the $\theta$ that minimizes: $\int_{\tau}^{\tau'} \vert F_Y^{-1}(\omega) - \theta \vert \, d\omega$
The minimizer is the median of the interval: $F_Y(\theta) = \frac{\tau + \tau'}{2} \Rightarrow \theta = F_Y^{-1}\left(\frac{\tau + \tau'}{2}\right)$
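A short numpy sketch of the resulting projection (function name is mine): evaluate the inverse CDF of $Y$ at the interval midpoints $\hat{\tau}_i = \frac{2i-1}{2N}$:

```python
import numpy as np

def quantile_projection(y_samples, n):
    # W1-optimal projection onto n uniform quantiles:
    # theta_i = F_Y^{-1}((tau_{i-1} + tau_i) / 2) with tau_i = i / n
    midpoints = (2 * np.arange(1, n + 1) - 1) / (2 * n)
    return np.quantile(y_samples, midpoints)

rng = np.random.default_rng(0)
print(quantile_projection(rng.normal(0, 1, 10_000), n=5))
# approximately the standard normal quantiles at tau = 0.1, 0.3, 0.5, 0.7, 0.9
```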
Quantile Regression
Naively minimizing the Wasserstein loss from samples still yields biased gradients
We can get unbiased stochastic approximations of the quantile function using quantile regression
Quantile Regression Loss: $\mathcal{L}^\tau_{QR}(\theta) = \mathbb{E}_{\hat{Z} \sim Z}[\rho_\tau(\hat{Z} - \theta)]$, where $\rho_\tau(u) = u(\tau - \delta_{\{u < 0\}})$ for all $u \in \mathbb{R}$
Asymmetric loss: errors $u = \hat{Z} - \theta > 0$ (sample above the estimate) are weighted by $\tau$, errors $u < 0$ by $1 - \tau$
We want to find $\theta_1, \dots, \theta_N$ minimizing: $\sum_{i=1}^N \mathbb{E}_{\hat{Z} \sim Z}[\rho_{\hat{\tau}_i}(\hat{Z} - \theta_i)]$, where $\hat{\tau}_i = \frac{\tau_{i-1} + \tau_i}{2}$ are the quantile midpoints
Quantile Huber Loss:
The quantile regression loss is not smooth at 0; as $u \rightarrow 0^+$, the gradient stays constant
The quantile Huber loss acts as an asymmetric squared loss in the interval $[-\kappa, \kappa]$ and reverts to the standard quantile loss outside of it
Huber Loss: $\mathcal{L}_\kappa(u) = \begin{cases} \frac{1}{2}u^2 & \text{if } \vert u \vert \leq \kappa \\ \kappa(\vert u \vert - \frac{1}{2}\kappa) & \text{otherwise} \end{cases}$
Quantile Huber Loss: $\rho^\kappa_\tau(u) = \vert \tau - \delta_{\{u < 0\}} \vert \, \mathcal{L}_\kappa(u)$
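A PyTorch sketch of the quantile Huber loss (my naming, with $\kappa = 1$ as in the paper), plus a tiny SGD loop that recovers the quantiles of a target distribution; this is the minimization above with $\mathcal{L}_\kappa$ substituted for the absolute value:

```python
import torch

def quantile_huber_loss(u, tau, kappa=1.0):
    # rho^kappa_tau(u) = |tau - 1{u < 0}| * L_kappa(u), elementwise
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    return (tau - (u < 0).float()).abs() * huber

N = 5
tau_hat = (2 * torch.arange(1, N + 1) - 1) / (2 * N)  # quantile midpoints
theta = torch.zeros(N, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)

for _ in range(2000):
    z = torch.randn(256)                     # hat{Z} ~ Z (standard normal here)
    u = z.unsqueeze(1) - theta.unsqueeze(0)  # errors against every theta_i
    loss = quantile_huber_loss(u, tau_hat).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(theta.detach())  # approx normal quantiles at tau = .1, .3, .5, .7, .9
```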
Combining Projection and Bellman Update
The quantile projection $\Pi_{W_1}$ is a non-expansion, so the combined operator $\Pi_{W_1}\mathcal{T}^\pi$ is a contraction in $\infty$-Wasserstein distance:
$\bar{d}_\infty(\Pi_{W_1}\mathcal{T}^\pi Z_1, \Pi_{W_1}\mathcal{T}^\pi Z_2) \leq \gamma \, \bar{d}_\infty(Z_1, Z_2)$
Repeated application of $\Pi_{W_1}\mathcal{T}^\pi$ converges to a unique fixed point
Because $\bar{d}_p \leq \bar{d} _\infty$, convergence occurs for all $1 \leq p \leq \infty$
Distributional RL using Quantile Regression
Approximate value distribution using quantile midpoints
Quantile Regression Temporal Difference Learning
Standard TD update: $V(x) \leftarrow V(x) + \alpha(r + \gamma V(x') - V(x))$
Quantile Regression TD Update: $\theta_i(x) \leftarrow \theta_i(x) + \alpha(\hat{\tau}_i - \delta_{\{r + \gamma z' < \theta_i(x)\}})$
$z' \sim Z_\theta(x')$
$\hat{\tau}_i$: midpoint of the $i$-th quantile interval
$Z_\theta$: quantile distribution
$\theta_i(x)$: estimated value of the quantile function at $\hat{\tau}_i$ in state $x$ ($F^{-1}_{Z^\pi(x)}(\hat{\tau}_i)$)
We should draw many samples $z' \sim Z_\theta(x')$ so the stochastic update approximates the expected update (see the sketch below)
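A sketch of QRTD on a toy problem (everything here is illustrative: a two-state chain with deterministic transitions and Gaussian rewards, names mine):

```python
import numpy as np

N = 5
tau_hat = (2 * np.arange(1, N + 1) - 1) / (2 * N)  # quantile midpoints
theta = np.zeros((2, N))                           # theta_i(x) per state
alpha, gamma = 0.05, 0.9
rng = np.random.default_rng(0)

for _ in range(50_000):
    x = rng.integers(2)
    x_next = 1 - x                                 # toy deterministic transition
    r = rng.normal(loc=float(x))                   # stochastic reward
    z_next = theta[x_next, rng.integers(N)]        # z' ~ Z_theta(x'): one atom
    target = r + gamma * z_next
    # QRTD: theta_i(x) += alpha * (tau_hat_i - 1{target < theta_i(x)})
    theta[x] += alpha * (tau_hat - (target < theta[x]).astype(float))

print(theta)  # each row approximates quantiles of that state's return
```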
Quantile Regression DQN
Bellman Optimality Operator: $\mathcal{T} Q(x,a) = \mathbb{E}[R(x,a)] + \gamma \mathbb{E}_{x' \sim P}[\max_{a'} Q(x',a')]$
Distributional Bellman Optimality Operator: $\mathcal{T} Z(x,a) = R(x,a) + \gamma Z(x', a^*)$
$a^* = \arg\max_{a'} \mathbb{E}_{z \sim Z(x',a')}[z]$
Three modifications to the DQN algorithm (a combined sketch follows the list):
Change the DQN output layer to size $\vert A \vert \times N$
$N$ is the number of quantile targets (atoms)
Replace the L1 loss with the quantile Huber loss
Replace RMSProp with Adam
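A condensed PyTorch sketch of the three modifications together (architecture, sizes, and names are my simplifications, not the paper's exact code):

```python
import torch
import torch.nn as nn

class QRDQN(nn.Module):
    def __init__(self, obs_dim, n_actions, n_quantiles=200):
        super().__init__()
        self.n_actions, self.N = n_actions, n_quantiles
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions * n_quantiles),  # output size |A| x N
        )

    def forward(self, x):
        # (batch, |A|, N) quantile locations theta_i(x, a)
        return self.net(x).view(-1, self.n_actions, self.N)

def qr_dqn_loss(model, target_model, batch, gamma=0.99, kappa=1.0):
    x, a, r, x_next, done = batch                     # tensors; a is int64
    N, B = model.N, len(a)
    tau_hat = (2 * torch.arange(1, N + 1).float() - 1) / (2 * N)

    theta = model(x)[torch.arange(B), a]              # (B, N) predicted quantiles
    with torch.no_grad():
        next_theta = target_model(x_next)             # (B, |A|, N)
        a_star = next_theta.mean(dim=2).argmax(dim=1) # greedy wrt mean (Q-value)
        target = r.unsqueeze(1) + gamma * (1 - done).unsqueeze(1) * \
                 next_theta[torch.arange(B), a_star]  # (B, N) target quantiles

    u = target.unsqueeze(1) - theta.unsqueeze(2)      # (B, N, N) pairwise errors
    huber = torch.where(u.abs() <= kappa, 0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # quantile Huber loss, averaged over target samples and quantiles
    loss = ((tau_hat.view(1, N, 1) - (u < 0).float()).abs() * huber).mean()
    return loss  # minimized with torch.optim.Adam, per modification 3
```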
Because the quantile locations (not the support bounds) are learned, QR-DQN's support can expand or contract as needed
Higher $N$ = better estimation of the lower and upper quantiles of the distribution and the ability to distinguish low-probability events
Experimental Results
Value Distribution Approximation Error: quantile regression TD minimizes the 1-Wasserstein distance and converges correctly despite the approximation errors
Evaluation on Atari 2600
Best Agent Performance: QR-DQN outperforms all previous agents in mean and median human-normalized score
Online performance:
Early in learning, most algorithms perform worse than a random policy
QR-DQN has sample complexity similar to prioritized experience replay (but better final performance)
On a small subset of games, it reaches less than 10% of human performance