Implicit Quantile Networks for Distributional Reinforcement Learning
Introduction
Distributional RL parts: Parameterization of the distribution + a distance / loss function
Categorical DQN / C51: Cross-entropy loss + categorical distribution with a Cramér-minimizing projection
Assigns probabilities to a fixed, discrete set of returns
QR-DQN: Uniform mixture of adjustable Diracs at quantile locations + minimizes the Wasserstein distance via quantile regression
Implicit Quantile Networks: extend QR-DQN from fixed quantiles to a continuous map from probabilities to returns
A distributional generalization of DQN
Benefits of IQNs
Approximation error no longer dictated by number of quantiles $\rightarrow$ dictated by network size + amount of training
Allows as few or as many samples per update as desired
Can expand the class of policies to take advantage of the learned distribution
Background
Distributional RL
p-Wasserstein Metric: Expresses the distance between two distributions as the minimal cost of transporting mass to make them identical (closed form below)
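For distributions over $\mathbb{R}$ it has a closed form in terms of quantile functions (the form used in the QR-DQN analysis):

$W_p(U, V) = \left( \int_0^1 \left\vert F_U^{-1}(\omega) - F_V^{-1}(\omega) \right\vert^p d\omega \right)^{1/p}$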
Quantile Regression for Distributional RL
Estimates quantiles of the return distribution, thereby minimizing the Wasserstein distance
Uses quantile regression for the minimization (loss recalled below)
See QR-DQN paper notes
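For reference (from the QR-DQN paper): the quantile regression loss and its Huber-smoothed variant $\rho_\tau^\kappa$, which reappears in the IQN loss below:

$\rho_\tau(\delta) = \delta \left( \tau - \mathbb{1}\{\delta < 0\} \right), \qquad \rho_\tau^\kappa(\delta) = \left\vert \tau - \mathbb{1}\{\delta < 0\} \right\vert \frac{L_\kappa(\delta)}{\kappa}$

where $L_\kappa$ is the Huber loss: $L_\kappa(\delta) = \frac{1}{2}\delta^2$ if $\vert \delta \vert \leq \kappa$, else $\kappa \left( \vert \delta \vert - \frac{1}{2}\kappa \right)$.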
Risk in Reinforcement Learning
Distributional RL policies were still based on the mean of the return distribution
Can we expand our class of policies using other information from the distribution of returns?
Risk: Uncertainty over outcomes
Risk-Sensitive: Policies that depend on more than the mean
Intrinsic Uncertainty: uncertainty captured by the distribution
Parametric Uncertainty: uncertainty over the value estimate
When considering risk, we want our policy to maximize utility, not necessarily return (see the example after this list): $\pi(x) = \arg\max_a \mathbb{E}_{Z(x,a)}[U(z)]$
Risk Neutral: Utility function is linear
Risk Averse: Utility function is concave
Risk Seeking: Utility function is convex
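A standard worked example (mine, not from the paper): the exponential utility $U(z) = -e^{-\lambda z}$ with $\lambda > 0$ is concave, and for Gaussian returns its certainty equivalent is exactly

$-\frac{1}{\lambda} \log \mathbb{E}\left[ e^{-\lambda Z} \right] = \mathbb{E}[Z] - \frac{\lambda}{2} \mathrm{Var}(Z)$

so maximizing expected utility trades mean return against variance, i.e., risk-averse behavior.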
Distorted Risk Measure:
Given three random variables $X, Y, Z$ where $X$ is preferred over $Y$:
Independence (expected utility theory): $\alpha F_X + (1 - \alpha)F_Z$ is preferred over $\alpha F_Y + (1 - \alpha)F_Z$
Dual independence (distortion / Yaari's dual theory): $\alpha F_X^{-1} + (1 - \alpha)F_Z^{-1}$ is preferred over $\alpha F_Y^{-1} + (1 - \alpha)F_Z^{-1}$
By the dual independence axiom, the policy now maximizes a distorted expectation for a distortion risk measure $h$ (sanity check below):
$\pi(x) = \arg\max_a \int_{-\infty}^{\infty} z \frac{\partial}{\partial z} (h \circ F_{Z(x,a)})(z)\,dz$
$h$: distortion risk measure $\rightarrow$ distorts CDFs of a random variable
Utility and distortion functions are inverses of each other
Utility: Behavior invariant to randomness being mixed in
Distortion: Behavior invariant to convex combinations of outcomes
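Sanity check (mine, not the paper's): with the identity distortion $h(\tau) = \tau$, the distorted expectation reduces to the ordinary expectation, recovering the risk-neutral policy:

$\int_{-\infty}^{\infty} z \frac{\partial}{\partial z} F_{Z(x,a)}(z)\,dz = \int_{-\infty}^{\infty} z \, dF_{Z(x,a)}(z) = \mathbb{E}[Z(x,a)]$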
Implicit Quantile Networks
Implicit Quantile Network: Deterministic parametric function trained to reparameterize samples from a base distribution ($\tau \sim U([0,1])$) into samples from the return distribution
Quantile function for $Z$: $Z_\tau(x,a) := F_Z^{-1}(\tau)$; for $\tau \sim U([0,1])$, $Z_\tau(x,a) \sim Z(x,a)$
Model the state-action quantile function as a mapping from state-action pairs + samples from the base distribution to $Z_\tau(x,a)$
Distortion risk measure: $\beta: [0,1]\rightarrow[0,1]$
Distorted expectation of $Z(x,a)$ under $\beta$: $Q_\beta(x,a) = \mathbb{E}_{\tau \sim U([0,1])}[Z_{\beta(\tau)}(x,a)] = \int_0^1 F_Z^{-1}(\tau)\,d\beta(\tau)$
Any distorted expectation can be represented as a weighted sum over quantiles (Monte Carlo sketch below)
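A minimal NumPy sketch of that sampling view (my illustration, not the paper's code): draw $\tau \sim U(0,1)$, warp it through $\beta$, and average the corresponding quantiles; a standard normal stands in for the return distribution.

```python
import numpy as np
from scipy.stats import norm

def distorted_expectation(quantile_fn, beta, n_samples=100_000, rng=None):
    """Monte Carlo estimate of Q_beta = E_{tau ~ U(0,1)}[ F^{-1}(beta(tau)) ]."""
    if rng is None:
        rng = np.random.default_rng(0)
    tau = rng.uniform(0.0, 1.0, size=n_samples)
    return quantile_fn(beta(tau)).mean()

# Example: Z ~ N(0, 1); CVaR(0.25) averages only the worst 25% of outcomes.
identity = lambda tau: tau
cvar_25 = lambda tau: 0.25 * tau          # beta(tau) = eta * tau

print(distorted_expectation(norm.ppf, identity))  # ~0.0   (risk-neutral mean)
print(distorted_expectation(norm.ppf, cvar_25))   # ~-1.27 (mean of worst quartile)
```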
Risk-Sensitive Greedy Policy: $\pi_\beta(x) = \arg\max_{a \in \mathcal{A}} Q_\beta(x,a)$
TD error for two samples $\tau, \tau' \sim U([0,1])$:
$\delta_t^{\tau, \tau'} = r_t + \gamma Z_{\tau'}(x_{t+1}, \pi_\beta(x_{t+1})) - Z_\tau(x_t, a_t)$
IQN Loss (sketched in code below): $\mathcal{L}(x_t, a_t, r_t, x_{t+1}) = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho_{\tau_i}^{\kappa}(\delta_t^{\tau_i, \tau_j'})$
$N, N'$ are the numbers of IID samples of $\tau, \tau'$
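A PyTorch sketch of this loss (my reconstruction from the equations above, not the authors' code; the tensor layout and $\kappa = 1$ default are assumptions):

```python
import torch

def iqn_loss(z_pred, tau, z_target, kappa=1.0):
    """Quantile-Huber IQN loss.

    z_pred:   (batch, N)  samples Z_{tau_i}(x_t, a_t) from the online network
    tau:      (batch, N)  the tau_i used to produce z_pred
    z_target: (batch, N') samples r_t + gamma * Z_{tau'_j}(x_{t+1}, pi(x_{t+1}))
    """
    # Pairwise TD errors delta_{ij}: shape (batch, N', N)
    delta = z_target.unsqueeze(-1) - z_pred.unsqueeze(1)

    # Huber loss L_kappa(delta)
    abs_delta = delta.abs()
    huber = torch.where(abs_delta <= kappa,
                        0.5 * delta.pow(2),
                        kappa * (abs_delta - 0.5 * kappa))

    # rho_tau^kappa(delta) = |tau - 1{delta < 0}| * L_kappa(delta) / kappa
    tau = tau.unsqueeze(1)  # broadcast over the N' target samples
    rho = (tau - (delta < 0).float()).abs() * huber / kappa

    # (1/N') sum_i sum_j rho -> mean over j, sum over i, mean over batch
    return rho.mean(dim=1).sum(dim=1).mean()
```

Here `z_target` would be built from a detached target network plus the reward, matching the TD error above.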
Sample-Based Risk-Sensitive Policy: $\tilde{\pi}_\beta(x) = \arg\max_{a \in \mathcal{A}} \frac{1}{K} \sum_{k=1}^K Z_{\beta(\tilde{\tau}_k)}(x,a)$, with $\tilde{\tau}_k \sim U([0,1])$
Instead of approximating the quantile function at $n$ fixed values, it is trained over many sampled $\tau$; this makes IQNs universal function approximators for quantile functions
We approximate $Z _\tau(x,a) \approx f(\psi(x), \phi(\tau))_a$ where $f, \psi, \phi$ are function approximators
Also leads to better generalization + sample complexity
Sample TD errors are decorrelated because action values are a sample mean of the implicit distribution
Implementation
$\psi$: $\mathcal{X} \rightarrow \mathbb{R}^d$, computed by convolutional layers
$f$: $\mathbb{R}^d \rightarrow \mathbb{R}^{\vert A \vert}$ maps $\psi(x)$ to action-values
$\phi$: $[0,1] \rightarrow \mathbb{R}^d$, an embedding for $\tau$:
$\phi_j(\tau) = \mathrm{ReLU}\left( \sum_{i=0}^{n-1} \cos(\pi i \tau) w_{ij} + b_j \right)$
$Z_\tau(x,a) \approx f(\psi(x) \odot \phi(\tau))_a$ (PyTorch sketch below)
Can be parameterized in other ways too (e.g., concatenation)
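A minimal PyTorch sketch of this architecture (my illustration; the fully-connected `psi` is a stand-in for the paper's convolutional stack, and the layer sizes are placeholders):

```python
import math
import torch
import torch.nn as nn

class IQNHead(nn.Module):
    """Z_tau(x, .) ~= f(psi(x) * phi(tau)): cosine tau-embedding merged by Hadamard product."""

    def __init__(self, state_dim, n_actions, d=64, n_cos=64):
        super().__init__()
        self.psi = nn.Sequential(nn.Linear(state_dim, d), nn.ReLU())  # stand-in for conv psi
        self.phi = nn.Sequential(nn.Linear(n_cos, d), nn.ReLU())      # phi_j(tau)
        self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, n_actions))
        self.register_buffer("i", torch.arange(n_cos).float())        # i = 0 .. n-1

    def forward(self, x, tau):
        # x: (batch, state_dim), tau: (batch, N) -> returns (batch, N, n_actions)
        cos = torch.cos(math.pi * self.i * tau.unsqueeze(-1))  # cos(pi * i * tau)
        embed = self.phi(cos)                                  # (batch, N, d)
        state = self.psi(x).unsqueeze(1)                       # (batch, 1, d)
        return self.f(state * embed)                           # Hadamard merge, then f
```

Risk-neutral action values follow by averaging over the $\tau$ dimension (`z.mean(dim=1)`); a risk-sensitive policy instead feeds distorted samples $\beta(\tau)$ in place of $\tau$.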
Varying $N$ had a dramatic effect on early performance; varying $N'$ had a strong effect on early performance but minimal impact on long-term performance
$N = N' = 1$ is comparable to DQN early on, but its long-term performance is much better
IQN was not sensitive to the number of policy samples ($K$)
Risk-Sensitive Reinforcement Learning
Tested the effects of moving $\beta$ away from the identity (the distortions are implemented in a sketch after this list)
Cumulative Probability Weighting Parameterization (CPW): $CPW(\eta, \tau) = \frac{\tau^\eta}{(\tau^\eta + (1- \tau)^\eta)^{\frac{1}{\eta}}}$
For $\eta = 0.71$, it most closely matches human subjects
Locally concave for small $\tau$ and convex for larger $\tau$
$CPW(0.71)$: Reduces impact of tails of distribution
Standard Normal CDF and its inverse (Wang): $Wang(\eta, \tau) = \Phi(\Phi^{-1}(\tau) +\eta)$
Produces risk averse policies for $\eta < 0$
Unlike CVaR (below), gives large $\tau$ vanishingly small but non-zero weight
Power Formula: $Pow(\eta, \tau) = \begin{cases} \tau^{\frac{1}{1 + \vert \eta \vert}} & \text{if } \eta \geq 0 \\ 1 - (1 - \tau)^{\frac{1}{1 + \vert \eta \vert}} & \text{otherwise} \end{cases}$
$\eta < 0$: Risk averse
$\eta > 0$: Risk seeking
Conditional value-at-risk (CVaR): $CVaR(\eta, \tau) = \eta \tau$
Changes $\tau \sim U([0 ,\eta])$
Ignores all values $\tau > \eta$
Norm: $Norm(\eta)$
Takes $\eta$ samples from $U([0,1])$ and averages them
$Norm(3)$: reduces impact of tails of distribution
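The distortions above as NumPy functions (a sketch; the names are mine, and `norm_distortion` returns distorted $\tau$ samples rather than a pointwise map, since $Norm(\eta)$ averages $\eta$ fresh draws):

```python
import numpy as np
from scipy.stats import norm

def cpw(eta, tau):
    return tau**eta / (tau**eta + (1 - tau)**eta) ** (1 / eta)

def wang(eta, tau):
    return norm.cdf(norm.ppf(tau) + eta)  # Phi(Phi^{-1}(tau) + eta)

def power(eta, tau):
    if eta >= 0:
        return tau ** (1 / (1 + abs(eta)))
    return 1 - (1 - tau) ** (1 / (1 + abs(eta)))

def cvar(eta, tau):
    return eta * tau  # equivalently, tau ~ U([0, eta])

def norm_distortion(eta, tau, rng=None):
    # Norm(eta): replace each tau with the mean of eta independent U(0,1) draws.
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(size=(int(eta), np.size(tau))).mean(axis=0)
```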
Evaluation under risk-neutral criterion
For some games, there are advantages to risk-averse policies
CPW performs about as well as risk-neutral
$Wang(1.5)$ (risk-seeking) performs as well as or worse than risk-neutral
The risk-averse variants perform better than standard IQN on many games
$CVaR(0.1)$, being strongly risk-averse, suffers reduced performance on some games
Full Atari-57 Results
Compared to results from the Rainbow paper, QR-DQN, and PER DQN
Risk-neutral variant of IQN: $\epsilon$-greedy policy with respect to the expectation of the state-action return distribution
IQN achieves QR-DQN performance in half the training frames
IQN achieves 162% human-normalized score (better than Rainbow's 153%)
IQN outperforms QR-DQN on games where there is still a gap between humans and RL
Discussion
Sample-based RL convergence results from categorical distributional RL need to be extended to QR-based algorithms
Contraction results from QR distributional RL need to be extended to IQNs
Convergence to a fixed point under the Bellman operator needs to be shown for distorted expectations
Creating a Rainbow IQN could lead to massive improvements in distributional RL
May 7, 2025 · research