Distributional Reinforcement Learning with Quantile Regression
Introduction
In classical distributional RL (the C51 algorithm), we could not use the Wasserstein metric between probability distributions as a loss
It cannot be minimized with SGD from sample transitions
C51 instead used a projection onto its fixed support followed by a KL divergence minimization
However, we can do distributional RL with Wasserstein metric
C51 uses $N$ fixed locations for the distribution support and adjusts their probabilities
Instead, assign fixed uniform probabilities to adjustable locations
Quantile regression for adjusting locations
They use a smoothed version of quantile regression (Huber quantile regression) to beat C51
Distributional RL
Assume the standard distributional RL setting (see C51 paper notes)
Value function = expectation of the value distribution over all intrinsic randomness (randomness in the MDP)
Bellman operator: $\mathcal{T}^\pi Q(x,a) = \mathbb{E}[R(x,a)] + \gamma \mathbb{E}_{P, \pi}[Q(x', a')]$
Distributional Bellman operator: $\mathcal{T}^\pi Z(x,a) \stackrel{D}{=} R(x,a) + \gamma P^\pi Z(x,a)$
C51 models $Z^\pi (x,a)$ using discrete distribution on fixed locations ($z_1 \leq \dots \leq z_N$)
Parameters: probabilities of each location
Projects $\mathcal{T}^\pi Z$ onto the finite fixed support, then minimizes the KL divergence
The Wasserstein Metric
p-Wasserstein metric between $U$ and $Y$: $W_p(U, Y) = \left(\int_0^1 \vert F^{-1}(\omega) - G^{-1}(\omega) \vert^p \, d\omega\right)^{\frac{1}{p}}$
Where $F^{-1}$ and $G^{-1}$ are the inverse CDFs (quantile functions) of $U$ and $Y$
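As a sanity check on the definition, here is a minimal numpy sketch (function name is mine) computing $W_p$ between two 1-D empirical distributions, using the fact that the inverse CDF of an empirical distribution evaluated on a uniform grid is just its sorted sample vector:

```python
import numpy as np

def wasserstein_p(u_samples, y_samples, p=1):
    # For equal-size samples, F^{-1} and G^{-1} on a uniform grid over omega
    # are the sorted samples, so the integral reduces to a mean over sorted pairs.
    u = np.sort(np.asarray(u_samples, dtype=float))
    y = np.sort(np.asarray(y_samples, dtype=float))
    assert len(u) == len(y), "this sketch assumes equal sample sizes"
    return np.mean(np.abs(u - y) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
# Distance between two Gaussians differing by a unit shift: roughly 1.0 for p=1
print(wasserstein_p(rng.normal(0, 1, 1000), rng.normal(1, 1, 1000), p=1))
```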
Convergence of Distributional Bellman Operator
Maximal form of the Wasserstein metric: $\bar{d}_p(Z_1, Z_2) = \sup_{x,a} W_p(Z_1(x,a), Z_2(x,a))$
$Z_1, Z_2$ are two action-value distributions
The distributional Bellman operator is a contraction in $\bar{d}_p$:
$\bar{d_p}(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \leq \gamma \bar{d_p}(Z_1, Z_2)$
Tells us that minimizing the Wasserstein distance is useful for learning distributions, but this cannot be done directly
Because we train with sampled (SARSA-style) transitions, we only have sample estimates of the distribution
The minimizer of the expected sample loss differs from the minimizer of the true Wasserstein loss, so stochastic gradients are biased
Approximately Minimizing Wasserstein
Instead of fixed locations with parameterized probabilities, we can use fixed probabilities with variable locations
New probabilities: $q_i = 1/N$
We want to estimate quantiles of target distribution
$\mathcal{Z}_Q$: space of quantile distributions for fixed $N$
Cumulative probabilities denoted by $\tau_i = \frac{i}{N}$ for $i = 0, \dots, N$
Create a parametric model that maps state-action pairs to a uniform probability distribution: $Z_\theta(x,a) = \frac{1}{N} \sum_{i=1}^N \delta_{\theta_i(x,a)}$
$\delta_z$ is a Dirac at $z$
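To make the parameterization concrete, a small sketch (the values are made up): $Z_\theta(x,a)$ is just $N$ learned locations with weight $1/N$ each, and the usual Q-value is their mean:

```python
import numpy as np

N = 4
theta = np.array([-1.0, 0.5, 2.0, 3.5])  # theta_i(x,a): learned support locations
q = np.full(N, 1.0 / N)                  # fixed uniform probabilities q_i = 1/N

# Q(x,a) = E[Z_theta(x,a)], the mean of the uniform mixture of Diracs
q_value = q @ theta                      # equivalently theta.mean()
print(q_value)                           # 1.25
```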
Benefits of parameterizing quantile locations:
No prespecified bounds on the support (more accurate predictions)
No projection step needed
We can minimize the Wasserstein loss without biased gradients (via quantile regression)
The Quantile Approximation
The Bellman update projected onto the parameterized quantile distribution is a contraction
Quantile Projection
We want to project a value distribution $Z \in \mathcal{Z}$ onto $\mathcal{Z}_Q$
$\Pi_{W_1} Z = \arg\min_{Z_\theta \in \mathcal{Z}_Q} W_1(Z, Z_\theta)$
1-Wasserstein distance to a quantile distribution: $W_1(Y, U) = \sum_{i=1}^N \int_{\tau_{i-1}}^{\tau_i} \vert F_Y^{-1}(\omega) - \theta_i \vert \, d\omega$
For a given interval $[\tau, \tau']$, we want to find the $\theta$ that minimizes: $\int_{\tau}^{\tau'} \vert F_Y^{-1}(\omega) - \theta \vert \, d\omega$
The minimizer is the median of the interval: $F_Y(\theta) = \frac{\tau + \tau'}{2} \Rightarrow \theta = F_Y^{-1}\left(\frac{\tau + \tau'}{2}\right)$
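A short numpy sketch of the resulting projection (function name is mine): evaluate the inverse CDF of $Y$ at the interval midpoints $\hat{\tau}_i = \frac{2i-1}{2N}$:

```python
import numpy as np

def quantile_projection(y_samples, n):
    # W1-optimal projection onto n uniform quantiles:
    # theta_i = F_Y^{-1}((tau_{i-1} + tau_i) / 2) with tau_i = i / n
    midpoints = (2 * np.arange(1, n + 1) - 1) / (2 * n)
    return np.quantile(y_samples, midpoints)

rng = np.random.default_rng(0)
print(quantile_projection(rng.normal(0, 1, 10_000), n=5))
# approximately the standard normal quantiles at tau = 0.1, 0.3, 0.5, 0.7, 0.9
```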
Quantile Regression
Naively minimizing the Wasserstein loss from samples still yields biased gradients
We can get unbiased stochastic approximations of the quantile function using quantile regression
Quantile Regression Loss: $\mathcal{L}^\tau_{QR}(\theta) = \mathbb{E}_{\hat{Z} \sim Z}[\rho_\tau(\hat{Z} - \theta)]$, where $\rho_\tau(u) = u(\tau - \delta_{\{u < 0\}})$ for all $u \in \mathbb{R}$
Asymmetric loss: errors $u = \hat{Z} - \theta > 0$ (sample above the estimate) are weighted by $\tau$, errors $u < 0$ by $1 - \tau$
We want to find $\theta_1, \dots, \theta_N$ minimizing: $\sum_{i=1}^N \mathbb{E}_{\hat{Z} \sim Z}[\rho_{\hat{\tau}_i}(\hat{Z} - \theta_i)]$, where $\hat{\tau}_i = \frac{\tau_{i-1} + \tau_i}{2}$ are the quantile midpoints
Quantile Huber Loss:
The quantile regression loss is not smooth at 0; as $u \rightarrow 0^+$, the gradient stays constant
The quantile Huber loss acts as an asymmetric squared loss in the interval $[-\kappa, \kappa]$ and reverts to the standard quantile loss outside of it
Huber Loss: $\mathcal{L}_\kappa(u) = \begin{cases} \frac{1}{2}u^2 & \text{if } \vert u \vert \leq \kappa \\ \kappa(\vert u \vert - \frac{1}{2}\kappa) & \text{otherwise} \end{cases}$
Quantile Huber Loss: $\rho^\kappa_\tau(u) = \vert \tau - \delta_{\{u < 0\}} \vert \, \mathcal{L}_\kappa(u)$
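A PyTorch sketch of the quantile Huber loss (my naming, with $\kappa = 1$ as in the paper), plus a tiny SGD loop that recovers the quantiles of a target distribution; this is the minimization above with $\mathcal{L}_\kappa$ substituted for the absolute value:

```python
import torch

def quantile_huber_loss(u, tau, kappa=1.0):
    # rho^kappa_tau(u) = |tau - 1{u < 0}| * L_kappa(u), elementwise
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    return (tau - (u < 0).float()).abs() * huber

N = 5
tau_hat = (2 * torch.arange(1, N + 1) - 1) / (2 * N)  # quantile midpoints
theta = torch.zeros(N, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)

for _ in range(2000):
    z = torch.randn(256)                     # hat{Z} ~ Z (standard normal here)
    u = z.unsqueeze(1) - theta.unsqueeze(0)  # errors against every theta_i
    loss = quantile_huber_loss(u, tau_hat).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(theta.detach())  # approx normal quantiles at tau = .1, .3, .5, .7, .9
```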
Combining Projection and Bellman Update
The quantile projection $\Pi_{W_1}$ is a non-expansion, so the combined operator $\Pi_{W_1}\mathcal{T}^\pi$ is a contraction in $\infty$-Wasserstein distance:
$\bar{d}_\infty(\Pi_{W_1}\mathcal{T}^\pi Z_1, \Pi_{W_1}\mathcal{T}^\pi Z_2) \leq \gamma \, \bar{d}_\infty(Z_1, Z_2)$
Repeated application of $\Pi_{W_1}\mathcal{T}^\pi$ converges to a unique fixed point
Because $\bar{d}_p \leq \bar{d} _\infty$, convergence occurs for all $1 \leq p \leq \infty$
Distributional RL using Quantile Regression
Approximate value distribution using quantile midpoints
Quantile Regression Temporal Difference Learning
Standard TD update: $V(x) \leftarrow V(x) + \alpha(r + \gamma V(x') - V(x))$
Quantile Regression TD Update: $\theta_i(x) \leftarrow \theta_i(x) + \alpha(\hat{\tau}_i - \delta_{\{r + \gamma z' < \theta_i(x)\}})$
$z' \sim Z_\theta(x')$
$\hat{\tau}_i$: midpoint of the $i$-th quantile interval
$Z_\theta$: quantile distribution
$\theta_i(x)$: estimated value of the quantile function at $\hat{\tau}_i$ in state $x$ ($F^{-1}_{Z^\pi(x)}(\hat{\tau}_i)$)
We should draw many samples $z' \sim Z_\theta(x')$ so the stochastic update approximates the expected update (see the sketch below)
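A sketch of QRTD on a toy problem (everything here is illustrative: a two-state chain with deterministic transitions and Gaussian rewards, names mine):

```python
import numpy as np

N = 5
tau_hat = (2 * np.arange(1, N + 1) - 1) / (2 * N)  # quantile midpoints
theta = np.zeros((2, N))                           # theta_i(x) per state
alpha, gamma = 0.05, 0.9
rng = np.random.default_rng(0)

for _ in range(50_000):
    x = rng.integers(2)
    x_next = 1 - x                                 # toy deterministic transition
    r = rng.normal(loc=float(x))                   # stochastic reward
    z_next = theta[x_next, rng.integers(N)]        # z' ~ Z_theta(x'): one atom
    target = r + gamma * z_next
    # QRTD: theta_i(x) += alpha * (tau_hat_i - 1{target < theta_i(x)})
    theta[x] += alpha * (tau_hat - (target < theta[x]).astype(float))

print(theta)  # each row approximates quantiles of that state's return
```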
Quantile Regression DQN
Bellman Optimality Operator: $\mathcal{T} Q(x,a) = \mathbb{E}[R(x,a)] + \gamma \mathbb{E}_{x' \sim P}[\max_{a'} Q(x',a')]$
Distributional Bellman Optimality Operator: $\mathcal{T} Z(x,a) = R(x,a) + \gamma Z(x', a^*)$
$a^* = \arg\max_{a'} \mathbb{E}_{z \sim Z(x',a')}[z]$
Three modifications to the DQN algorithm (a combined sketch follows the list):
Change the DQN output layer to size $\vert A \vert \times N$
$N$ is the number of quantile targets (atoms)
Replace the L1 loss with the quantile Huber loss
Replace RMSProp with Adam
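A condensed PyTorch sketch of the three modifications together (architecture, sizes, and names are my simplifications, not the paper's exact code):

```python
import torch
import torch.nn as nn

class QRDQN(nn.Module):
    def __init__(self, obs_dim, n_actions, n_quantiles=200):
        super().__init__()
        self.n_actions, self.N = n_actions, n_quantiles
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions * n_quantiles),  # output size |A| x N
        )

    def forward(self, x):
        # (batch, |A|, N) quantile locations theta_i(x, a)
        return self.net(x).view(-1, self.n_actions, self.N)

def qr_dqn_loss(model, target_model, batch, gamma=0.99, kappa=1.0):
    x, a, r, x_next, done = batch                     # tensors; a is int64
    N, B = model.N, len(a)
    tau_hat = (2 * torch.arange(1, N + 1).float() - 1) / (2 * N)

    theta = model(x)[torch.arange(B), a]              # (B, N) predicted quantiles
    with torch.no_grad():
        next_theta = target_model(x_next)             # (B, |A|, N)
        a_star = next_theta.mean(dim=2).argmax(dim=1) # greedy wrt mean (Q-value)
        target = r.unsqueeze(1) + gamma * (1 - done).unsqueeze(1) * \
                 next_theta[torch.arange(B), a_star]  # (B, N) target quantiles

    u = target.unsqueeze(1) - theta.unsqueeze(2)      # (B, N, N) pairwise errors
    huber = torch.where(u.abs() <= kappa, 0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # quantile Huber loss, averaged over target samples and quantiles
    loss = ((tau_hat.view(1, N, 1) - (u < 0).float()).abs() * huber).mean()
    return loss  # minimized with torch.optim.Adam, per modification 3
```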
Because the quantile locations (not the support bounds) are learned, QR-DQN's support can expand or contract as needed
Higher $N$ = better estimation of the lower and upper quantiles of the distribution and the ability to distinguish low-probability events
Experimental Results
Value Distribution Approximation Error: quantile regression TD minimizes the 1-Wasserstein distance and converges correctly despite the approximation errors
Evaluation on Atari 2600
Best Agent Performance: QR-DQN outperforms all previous agents in mean and median human-normalized score
Online performance:
Early in learning, most algorithms perform worse than a random policy
QR-DQN has sample complexity similar to prioritized experience replay (but better final performance)
On a small subset of games, it reaches less than 10% of human performance