Implicit Quantile Networks for Distributional Reinforcement Learning
Introduction
Distributional RL parts: Parameterization of the distribution + a distance / loss function
Categorical DQN / C51: Cross-entropy loss + categorical distribution with a Cramér-minimizing projection
Assigns probabilities to a fixed, discrete set of returns
QR-DQN: Uniform mixture of adjustable Diracs at quantile locations + minimizes the Wasserstein distance via quantile regression
Implicit Quantile Networks: extend QR-DQN from fixed quantiles to a continuous map from probabilities to returns
A distributional generalization of DQN
Benefits of IQNs
Approximation error no longer dictated by number of quantiles $\rightarrow$ dictated by network size + amount of training
Allows as few or as many samples per update as desired
Can expand the class of policies to take advantage of the learned distribution
Background
Distributional RL
p-Wasserstein Metric: Expresses the distance between two distributions as the minimal cost of transporting mass to make them identical (closed form below)
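For distributions over $\mathbb{R}$ it has a closed form in terms of quantile functions (the form used in the QR-DQN analysis):

$W_p(U, V) = \left( \int_0^1 \left\vert F_U^{-1}(\omega) - F_V^{-1}(\omega) \right\vert^p d\omega \right)^{1/p}$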
Quantile Regression for Distributional RL
Estimates quantiles of the return distribution, thereby minimizing the Wasserstein distance
Uses quantile regression for the minimization (loss recalled below)
See QR-DQN paper notes
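For reference (from the QR-DQN paper): the quantile regression loss and its Huber-smoothed variant $\rho_\tau^\kappa$, which reappears in the IQN loss below:

$\rho_\tau(\delta) = \delta \left( \tau - \mathbb{1}\{\delta < 0\} \right), \qquad \rho_\tau^\kappa(\delta) = \left\vert \tau - \mathbb{1}\{\delta < 0\} \right\vert \frac{L_\kappa(\delta)}{\kappa}$

where $L_\kappa$ is the Huber loss: $L_\kappa(\delta) = \frac{1}{2}\delta^2$ if $\vert \delta \vert \leq \kappa$, else $\kappa \left( \vert \delta \vert - \frac{1}{2}\kappa \right)$.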
Risk in Reinforcement Learning
Distributional RL policies were still based on the mean of the return distribution
Can we expand our class of policies using other information from the distribution of returns?
Risk: Uncertainty over outcomes
Risk-Sensitive: Policies that depend on more than the mean
Intrinsic Uncertainty: uncertainty captured by the distribution
Parametric Uncertainty: uncertainty over the value estimate
When considering risk, we want our policy to maximize utility, not necessarily return (see the example after this list): $\pi(x) = \arg\max_a \mathbb{E}_{Z(x,a)}[U(z)]$
Risk Neutral: Utility function is linear
Risk Averse: Utility function is concave
Risk Seeking: Utility function is convex
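A standard worked example (mine, not from the paper): the exponential utility $U(z) = -e^{-\lambda z}$ with $\lambda > 0$ is concave, and for Gaussian returns its certainty equivalent is exactly

$-\frac{1}{\lambda} \log \mathbb{E}\left[ e^{-\lambda Z} \right] = \mathbb{E}[Z] - \frac{\lambda}{2} \mathrm{Var}(Z)$

so maximizing expected utility trades mean return against variance, i.e., risk-averse behavior.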
Distorted Risk Measure:
Given three random variables $X, Y, Z$ where $X$ is preferred over $Y$:
Independence (expected utility theory): $\alpha F_X + (1 - \alpha)F_Z$ is preferred over $\alpha F_Y + (1 - \alpha)F_Z$
Dual independence (distortion / Yaari's dual theory): $\alpha F_X^{-1} + (1 - \alpha)F_Z^{-1}$ is preferred over $\alpha F_Y^{-1} + (1 - \alpha)F_Z^{-1}$
By the dual independence axiom, the policy now maximizes a distorted expectation for a distortion risk measure $h$ (sanity check below):
$\pi(x) = \arg\max_a \int_{-\infty}^{\infty} z \frac{\partial}{\partial z} (h \circ F_{Z(x,a)})(z)\,dz$
$h$: distortion risk measure $\rightarrow$ distorts CDFs of a random variable
Utility and distortion functions are inverses of each other
Utility: Behavior invariant to randomness being mixed in
Distortion: Behavior invariant to convex combinations of outcomes
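Sanity check (mine, not the paper's): with the identity distortion $h(\tau) = \tau$, the distorted expectation reduces to the ordinary expectation, recovering the risk-neutral policy:

$\int_{-\infty}^{\infty} z \frac{\partial}{\partial z} F_{Z(x,a)}(z)\,dz = \int_{-\infty}^{\infty} z \, dF_{Z(x,a)}(z) = \mathbb{E}[Z(x,a)]$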
Implicit Quantile Networks
Implicit Quantile Network: Deterministic parametric function trained to reparameterize samples from a base distribution ($\tau \sim U([0,1])$) into samples from the return distribution
Quantile function for $Z$: $Z_\tau(x,a) := F_Z^{-1}(\tau)$; for $\tau \sim U([0,1])$, $Z_\tau(x,a) \sim Z(x,a)$
Model the state-action quantile function as a mapping from state-action pairs + samples from the base distribution to $Z_\tau(x,a)$
Distortion risk measure: $\beta: [0,1]\rightarrow[0,1]$
Distorted expectation of $Z(x,a)$ under $\beta$: $Q_\beta(x,a) = \mathbb{E}_{\tau \sim U([0,1])}[Z_{\beta(\tau)}(x,a)] = \int_0^1 F_Z^{-1}(\tau)\,d\beta(\tau)$
Any distorted expectation can be represented as a weighted sum over quantiles (Monte Carlo sketch below)
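A minimal NumPy sketch of that sampling view (my illustration, not the paper's code): draw $\tau \sim U(0,1)$, warp it through $\beta$, and average the corresponding quantiles; a standard normal stands in for the return distribution.

```python
import numpy as np
from scipy.stats import norm

def distorted_expectation(quantile_fn, beta, n_samples=100_000, rng=None):
    """Monte Carlo estimate of Q_beta = E_{tau ~ U(0,1)}[ F^{-1}(beta(tau)) ]."""
    if rng is None:
        rng = np.random.default_rng(0)
    tau = rng.uniform(0.0, 1.0, size=n_samples)
    return quantile_fn(beta(tau)).mean()

# Example: Z ~ N(0, 1); CVaR(0.25) averages only the worst 25% of outcomes.
identity = lambda tau: tau
cvar_25 = lambda tau: 0.25 * tau          # beta(tau) = eta * tau

print(distorted_expectation(norm.ppf, identity))  # ~0.0   (risk-neutral mean)
print(distorted_expectation(norm.ppf, cvar_25))   # ~-1.27 (mean of worst quartile)
```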
Risk-Sensitive Greedy Policy: $\pi_\beta(x) = \arg\max_{a \in \mathcal{A}} Q_\beta(x,a)$
TD error for two samples $\tau, \tau' \sim U([0,1])$:
$\delta_t^{\tau, \tau'} = r_t + \gamma Z_{\tau'}(x_{t+1}, \pi_\beta(x_{t+1})) - Z_\tau(x_t, a_t)$
IQN Loss (sketched in code below): $\mathcal{L}(x_t, a_t, r_t, x_{t+1}) = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho_{\tau_i}^{\kappa}(\delta_t^{\tau_i, \tau_j'})$
$N, N'$ are the numbers of IID samples of $\tau, \tau'$
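A PyTorch sketch of this loss (my reconstruction from the equations above, not the authors' code; the tensor layout and $\kappa = 1$ default are assumptions):

```python
import torch

def iqn_loss(z_pred, tau, z_target, kappa=1.0):
    """Quantile-Huber IQN loss.

    z_pred:   (batch, N)  samples Z_{tau_i}(x_t, a_t) from the online network
    tau:      (batch, N)  the tau_i used to produce z_pred
    z_target: (batch, N') samples r_t + gamma * Z_{tau'_j}(x_{t+1}, pi(x_{t+1}))
    """
    # Pairwise TD errors delta_{ij}: shape (batch, N', N)
    delta = z_target.unsqueeze(-1) - z_pred.unsqueeze(1)

    # Huber loss L_kappa(delta)
    abs_delta = delta.abs()
    huber = torch.where(abs_delta <= kappa,
                        0.5 * delta.pow(2),
                        kappa * (abs_delta - 0.5 * kappa))

    # rho_tau^kappa(delta) = |tau - 1{delta < 0}| * L_kappa(delta) / kappa
    tau = tau.unsqueeze(1)  # broadcast over the N' target samples
    rho = (tau - (delta < 0).float()).abs() * huber / kappa

    # (1/N') sum_i sum_j rho -> mean over j, sum over i, mean over batch
    return rho.mean(dim=1).sum(dim=1).mean()
```

Here `z_target` would be built from a detached target network plus the reward, matching the TD error above.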
Sample-Based Risk-Sensitive Policy: $\tilde{\pi}_\beta(x) = \arg\max_{a \in \mathcal{A}} \frac{1}{K} \sum_{k=1}^K Z_{\beta(\tilde{\tau}_k)}(x,a)$, with $\tilde{\tau}_k \sim U([0,1])$
Instead of approximating the quantile function at $n$ fixed values, it is trained over many sampled $\tau$; this makes IQNs universal function approximators for quantile functions
We approximate $Z _\tau(x,a) \approx f(\psi(x), \phi(\tau))_a$ where $f, \psi, \phi$ are function approximators
Also leads to better generalization + sample complexity
Sample TD errors are decorrelated because action values are a sample mean of the implicit distribution
Implementation
$\psi$: $\mathcal{X} \rightarrow \mathbb{R}^d$, computed by convolutional layers
$f$: $\mathbb{R}^d \rightarrow \mathbb{R}^{\vert A \vert}$ maps $\psi(x)$ to action-values
$\phi$: $[0,1] \rightarrow \mathbb{R}^d$, an embedding for $\tau$:
$\phi_j(\tau) = \mathrm{ReLU}\left( \sum_{i=0}^{n-1} \cos(\pi i \tau) w_{ij} + b_j \right)$
$Z_\tau(x,a) \approx f(\psi(x) \odot \phi(\tau))_a$ (PyTorch sketch below)
Can be parameterized in other ways too (e.g., concatenation)
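A minimal PyTorch sketch of this architecture (my illustration; the fully-connected `psi` is a stand-in for the paper's convolutional stack, and the layer sizes are placeholders):

```python
import math
import torch
import torch.nn as nn

class IQNHead(nn.Module):
    """Z_tau(x, .) ~= f(psi(x) * phi(tau)): cosine tau-embedding merged by Hadamard product."""

    def __init__(self, state_dim, n_actions, d=64, n_cos=64):
        super().__init__()
        self.psi = nn.Sequential(nn.Linear(state_dim, d), nn.ReLU())  # stand-in for conv psi
        self.phi = nn.Sequential(nn.Linear(n_cos, d), nn.ReLU())      # phi_j(tau)
        self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, n_actions))
        self.register_buffer("i", torch.arange(n_cos).float())        # i = 0 .. n-1

    def forward(self, x, tau):
        # x: (batch, state_dim), tau: (batch, N) -> returns (batch, N, n_actions)
        cos = torch.cos(math.pi * self.i * tau.unsqueeze(-1))  # cos(pi * i * tau)
        embed = self.phi(cos)                                  # (batch, N, d)
        state = self.psi(x).unsqueeze(1)                       # (batch, 1, d)
        return self.f(state * embed)                           # Hadamard merge, then f
```

Risk-neutral action values follow by averaging over the $\tau$ dimension (`z.mean(dim=1)`); a risk-sensitive policy instead feeds distorted samples $\beta(\tau)$ in place of $\tau$.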
Varying $N$ had a dramatic effect on early performance; varying $N'$ had a strong effect on early performance but minimal impact on long-term performance
$N = N' = 1$ is comparable to DQN early on, but its long-term performance is much better
IQN was not sensitive to the number of policy samples ($K$)
Risk-Sensitive Reinforcement Learning
Tested the effects of moving $\beta$ away from the identity (the distortions are implemented in a sketch after this list)
Cumulative Probability Weighting Parameterization (CPW): $CPW(\eta, \tau) = \frac{\tau^\eta}{(\tau^\eta + (1- \tau)^\eta)^{\frac{1}{\eta}}}$
For $\eta = 0.71$, it most closely matches human subjects
Locally concave for small $\tau$ and convex for larger $\tau$
$CPW(0.71)$: Reduces impact of tails of distribution
Standard Normal CDF and its inverse (Wang): $Wang(\eta, \tau) = \Phi(\Phi^{-1}(\tau) +\eta)$
Produces risk averse policies for $\eta < 0$
Unlike CVaR (below), gives large $\tau$ vanishingly small but non-zero weight
Power Formula: $Pow(\eta, \tau) = \begin{cases} \tau^{\frac{1}{1 + \vert \eta \vert}} & \text{if } \eta \geq 0 \\ 1 - (1 - \tau)^{\frac{1}{1 + \vert \eta \vert}} & \text{otherwise} \end{cases}$
$\eta < 0$: Risk averse
$\eta > 0$: Risk seeking
Conditional value-at-risk (CVaR): $CVaR(\eta, \tau) = \eta \tau$
Changes $\tau \sim U([0 ,\eta])$
Ignores all values $\tau > \eta$
Norm: $Norm(\eta)$
Takes $\eta$ samples from $U([0,1])$ and averages them
$Norm(3)$: reduces impact of tails of distribution
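The distortions above as NumPy functions (a sketch; the names are mine, and `norm_distortion` returns distorted $\tau$ samples rather than a pointwise map, since $Norm(\eta)$ averages $\eta$ fresh draws):

```python
import numpy as np
from scipy.stats import norm

def cpw(eta, tau):
    return tau**eta / (tau**eta + (1 - tau)**eta) ** (1 / eta)

def wang(eta, tau):
    return norm.cdf(norm.ppf(tau) + eta)  # Phi(Phi^{-1}(tau) + eta)

def power(eta, tau):
    if eta >= 0:
        return tau ** (1 / (1 + abs(eta)))
    return 1 - (1 - tau) ** (1 / (1 + abs(eta)))

def cvar(eta, tau):
    return eta * tau  # equivalently, tau ~ U([0, eta])

def norm_distortion(eta, tau, rng=None):
    # Norm(eta): replace each tau with the mean of eta independent U(0,1) draws.
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(size=(int(eta), np.size(tau))).mean(axis=0)
```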
Evaluation under risk-neutral criterion
For some games, there are advantages to risk-averse policies
CPW performs about as well as risk-neutral
$Wang(1.5)$ (risk-seeking) performs as well as or worse than risk-neutral
The risk-averse variants perform better than standard IQN on many games
$CVaR(0.1)$, being strongly risk-averse, suffers reduced performance on some games
Full Atari-57 Results
Compared to results from the Rainbow paper, QR-DQN, and PER DQN
Risk-neutral variant of IQN: $\epsilon$-greedy policy with respect to the expectation of the state-action return distribution
IQN achieves QR-DQN performance in half the training frames
IQN achieves 162% human-normalized score (better than Rainbow's 153%)
IQN outperforms QR-DQN on games where there is still a gap between humans and RL
Discussion
Sample-based RL convergence results from categorical distributional RL need to be extended to QR-based algorithms
Contraction results from QR distributional RL need to be extended to IQNs
Convergence to a fixed point under the Bellman operator needs to be shown for distorted expectations
Creating a Rainbow IQN could lead to massive improvements in distributional RL
May 7, 2025 · research