A Distributional Perspective on Reinforcement Learning
Resources
Introduction
Bellman’s equation describes value in terms of the expected reward and the expected outcome of a random transition
In distributional RL, we want the random return $Z$ whose expectation is $Q$
Distributional Bellman Equation: $Z(x,a) \stackrel{D}{=} R(x, a) + \gamma Z(X', A')$
Three random variables:
$R$: Reward
$(X', A')$: Next State Action
$Z(X', A')$: Random return (value distribution)
Bellman operator is a contraction in regular policy evaluation (guarantees convergence)
This does not automatically carry over to distributional RL
The metric we use to measure distance between distributions matters (e.g., KL divergence, total variation, Kolmogorov distance)
There is instability in distributional RL's control setting
While the Bellman optimality operator is a contraction in expectation, it is not a contraction in any metric over distributions
We want learning algorithms that model the effects of nonstationary policies
Learning distributions leads to more stable training
Setting
Assume standard MDP setting
Policy maps from a state to a probability distribution over actions
Bellman’s Equations
$Z^\pi$: Sum of discounted rewards along agent’s trajectory of interactions with environment
$Q^\pi$: expectation of $Z^\pi$
$Q^\pi(x,a) = \mathbb{E}[Z^\pi(x,a)] = \mathbb{E}[\sum _{t=0}^\infty \gamma^t R(x_t, a_t)]$
Bellman Operator: $\tau^\pi Q(x,a) = \mathbb{E}[R(x,a)] + \gamma \mathbb{E}_{P, \pi}[Q(x', a')]$
Bellman Optimality Operator: $\tau Q(x,a) = \mathbb{E}[R(x,a)] + \gamma \mathbb{E}_P [\max_{a' \in A} Q(x', a')]$
Both are contraction operators
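As a concrete illustration (my own sketch, not from the paper), here is one application of the expected Bellman operator $\tau^\pi$ on a tabular MDP; the arrays `Q`, `P`, `R`, and `pi` are hypothetical placeholders for the value estimates, transition kernel, expected rewards, and policy.

```python
import numpy as np

def bellman_operator(Q, P, R, pi, gamma):
    """One application of the expected Bellman operator tau^pi.

    Q  : (S, A) current action-value estimates
    P  : (S, A, S) transition probabilities P(x' | x, a)
    R  : (S, A) expected immediate rewards
    pi : (S, A) policy probabilities pi(a | x)
    """
    # Expected next-state value under pi: V(x') = sum_a' pi(a'|x') Q(x', a')
    V = (pi * Q).sum(axis=1)        # shape (S,)
    # tau^pi Q(x, a) = E[R(x, a)] + gamma * E_{x'~P}[V(x')]
    return R + gamma * P @ V        # shape (S, A)
```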
The Distributional Bellman Operators
Distributional Equations
$\Omega$: Space of all possible outcomes of an experiment
Let $F_U(y) = \Pr(U \leq y)$ be the CDF of $U$ and $F^{-1}_U(q)$ its inverse CDF
The Wasserstein Metric
Wasserstein Distance: $d_p(F,G) = \inf_{U,V} \Vert U - V \Vert_p = \Vert F^{-1}(\mathcal{U}) - G^{-1}(\mathcal{U}) \Vert_p$
inf: Infimum (greatest lower bound)
Where $F$ and $G$ are the CDFs of $U$ and $V$
$\mathcal{U}$ is the uniform distribution over $[0,1]$
If $p < \infty$, $d_p(F,G) = \left(\int_0^1 \vert F^{-1}(u) - G^{-1}(u) \vert^p \, du\right)^{\frac{1}{p}}$
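A quick sanity check of this inverse-CDF form (my own sketch, not from the paper): for two equal-size empirical samples, the quantile functions are just the sorted samples, so $d_p$ reduces to a $p$-norm of sorted differences.

```python
import numpy as np

def wasserstein_p(samples_u, samples_v, p=1):
    """Empirical p-Wasserstein distance between two equal-size samples.

    Sorting gives the empirical inverse CDFs on a uniform grid, so
    d_p(F, G) = (mean |F^{-1}(u) - G^{-1}(u)|^p)^(1/p).
    """
    u = np.sort(np.asarray(samples_u))
    v = np.sort(np.asarray(samples_v))
    return (np.mean(np.abs(u - v) ** p)) ** (1.0 / p)

# Example: distance between samples from two Gaussians with shifted means
rng = np.random.default_rng(0)
print(wasserstein_p(rng.normal(0, 1, 10_000), rng.normal(1, 1, 10_000), p=2))
```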
Properties of $d_p$: Given a scalar $a$ and random variable $A$ independent of $U,V$
$d_p(aU, aV) \leq \vert a \vert d_p(U, V)$
$d_p(A + U, A + V) \leq d_p(U, V)$
$d_p(AU, AV) \leq \vert\vert A \vert\vert _p d_p(U, V)$
Partition Lemma: Let $A_1, A_2, \dots$ be indicator random variables describing a partition of $\Omega$; then $d_p(U, V) \leq \sum_i d_p(A_i U, A_i V)$
Lemma 2: $\bar{d}_p$ is a metric over value distributions
If $\mathcal{Z}$ is the space of value distributions with $Z_1, Z_2 \in \mathcal{Z}$, the maximal form of the Wasserstein metric is $\bar{d}_p (Z_1, Z_2) = \sup_{x,a} d_p(Z_1(x,a), Z_2(x,a))$
sup: Supremum (least upper bound)
Policy Evaluation
Characterize $Z^\pi$ as intrinsic randomness of agent’s interactions with environment
Transition Operator: $P^\pi Z(x,a) \stackrel{D}{=} Z(X', A')$ where $X' \sim P(\cdot \vert x, a)$ and $A' \sim \pi(\cdot \vert X')$
Distributional Bellman Operator: $\tau^\pi Z(x,a) \stackrel{D}{=} R(x,a) + \gamma P^\pi Z(x,a)$ (sketched in code after the list below)
Randomness from reward
Randomness from transition
Randomness from next-state value distribution
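A minimal sketch of one exact application of $\tau^\pi$ at a single $(x,a)$ in a small tabular MDP (my own illustration, not the paper's algorithm); every distribution is a finite list of (value, probability) atoms, and all container names are hypothetical.

```python
from collections import defaultdict

def distributional_backup(Z, x, a, P, R_dist, pi, gamma):
    """One exact application of the distributional Bellman operator at (x, a).

    Z      : dict mapping (state, action) -> list of (value, prob) atoms
    P      : dict mapping (state, action) -> list of (next_state, prob)
    R_dist : dict mapping (state, action) -> list of (reward, prob)
    pi     : dict mapping state -> list of (action, prob)
    """
    out = defaultdict(float)
    for r, p_r in R_dist[(x, a)]:              # randomness from the reward
        for x2, p_x2 in P[(x, a)]:             # randomness from the transition
            for a2, p_a2 in pi[x2]:            # randomness from the policy
                for z, p_z in Z[(x2, a2)]:     # randomness from the next-state return
                    out[r + gamma * z] += p_r * p_x2 * p_a2 * p_z
    return sorted(out.items())                 # atoms of (tau^pi Z)(x, a)
```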
Contraction in $\bar{d}_p$
We expect $\tau^\pi$ to be a contraction operator under $\bar{d}_p$
We expect the expectation of $Z_k$ to converge exponentially quickly to $Q^\pi$ (along with all of its moments)
Lemma 3: $\tau^\pi$ is a $\gamma$-contraction in $\bar{d}_p$
Implies $\tau^\pi$ has a unique fixed point
$Z_k$ converges to $Z^\pi$ in $\bar{d}_p$ for $1 \leq p \leq \infty$
Not all distributional metrics are equal; KL divergence, total variation distance, and Kolmogorov distance do not contract
Contraction in Centered Moments
Coupling: $C(\omega) = U(\omega) - V(\omega)$
We see that $d_2^2(U, V) \leq \mathbb{E}[(U-V)^2] = \mathbb{V}(C) + (\mathbb{E} C)^2$
We cannot use $d_2$ to bound variance differences like $\vert \mathbb{V}(\tau^\pi Z_1(x,a)) - \mathbb{V}(\tau^\pi Z_2(x,a)) \vert$
However, $\tau^\pi$ is a contraction in variance (but not for higher centered moments, $p > 2$)
Control
Control Setting: Seek a policy $\pi$ that maximizes value, and an optimal value distribution
There are many optimal value distributions
The distributional Bellman optimality operator does converge to the set of optimal value distributions
But it is not a contraction between distributions
And it is more unstable due to greedy updates
Let $\Pi^*$ be the set of optimal policies
Optimal Value Distribution: the value distribution of an optimal policy
Set of all optimal value distributions: $\mathcal{Z}^* = \{ Z^{\pi^*} : \pi^* \in \Pi^* \}$
Greedy Policy: a $\pi$ for $Z \in \mathcal{Z}$ that maximizes the expectation of $Z$
Set of all greedy policies: $\mathcal{G}_Z = \{ \pi : \sum_a \pi (a \vert x) \mathbb{E} Z(x,a) = \max_{a' \in A} \mathbb{E} Z(x,a') \}$
Distributional Bellman Optimality Operator: any operator $\tau$ which implements greedy selection, i.e., $\tau Z = \tau^\pi Z$ for some $\pi \in \mathcal{G}_Z$
Behavior of $\mathbb{E} Z_k$: $\vert \vert \mathbb{E} \tau Z_1 - \mathbb{E} \tau Z_2 \vert \vert _\infty \leq \vert \vert \mathbb{E} Z_1 - \mathbb{E} Z_2 \vert \vert _\infty$
In particular, $\mathbb{E} Z_k \rightarrow Q^*$ exponentially quickly
Convergence to the set $\mathcal{Z}^*$ is not assured; the best we can hope for is convergence to the set of nonstationary optimal value distributions
Nonstationary Optimal Value Distributions ($\mathcal{Z}^{**}$): value distributions corresponding to the set of optimal policies
Convergence in the control setting: if the state and action spaces are finite and the greedy policies are totally ordered, $\tau$ has a unique fixed point $Z^* \in \mathcal{Z}^*$
The expectation of $Z_k$ converges to $Q^*$, but the distribution itself is not necessarily well behaved
"Not well behaved" means:
$\tau$ is not a contraction (distributions can move further apart before coming closer together)
Not all optimality operators have a fixed point (e.g., an operator can alternate between actions that are tied)
Existence of a fixed point for an operator $\tau$ is insufficient to guarantee convergence
Approximate Distributional Learning
Parametric Distribution
Model the value distribution with a discrete distribution parameterized by $N, V_{MIN}, V_{MAX}$, with support $\{ z_i = V_{MIN} + i \Delta z : 0 \leq i < N \}$ where $\Delta z = \frac{V_{MAX} - V_{MIN}}{N-1}$
Parametric model for the probabilities: $Z_{\theta}(x,a) = z_i$ with probability $p_i(x,a) = \frac{e^{\theta_i(x,a)}}{\sum_j e^{\theta_j(x,a)}}$
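A small sketch of this parametrization (variable names and the $V_{MIN}, V_{MAX}$ values are my own assumptions): the support is fixed, and per-atom logits are mapped to probabilities with a softmax.

```python
import numpy as np

N, V_MIN, V_MAX = 51, -10.0, 10.0                 # assumed example values
delta_z = (V_MAX - V_MIN) / (N - 1)
z = V_MIN + np.arange(N) * delta_z                # fixed support {z_0, ..., z_{N-1}}

def atom_probabilities(theta):
    """Softmax over per-atom logits theta_i(x, a); theta has shape (..., N)."""
    shifted = theta - theta.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)
```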
Projected Bellman Update
Problem: the Bellman update ($\hat{\tau} Z_{\theta}$) and the parameterization ($Z_{\theta}$) have disjoint supports
We cannot minimize the Wasserstein loss directly because we are restricted to learning from sample transitions
Solution: Categorical Algorithm
Instead, project the Bellman update $\hat{\tau} Z_{\theta}$ onto the support of $Z_\theta$
Compute the Bellman update $\hat{\tau} z_j = r + \gamma z_j$ for each atom $z_j$
Distribute its probability $p_j(x', \pi(x'))$ to the neighbors of $\hat{\tau} z_j$ on the support
$i$-th component of the projected update: $(\Phi \hat{\tau} Z_\theta (x,a))_i = \sum_{j=0}^{N-1} \left[1 - \frac{\vert [\hat{\tau} z_j]_{V_{MIN}}^{V_{MAX}} - z_i\vert}{\Delta z}\right]_0^1 p_j(x', \pi(x'))$
The sample loss is the cross-entropy term of the KL divergence between $\Phi \hat{\tau} Z_{\tilde{\theta}}(x,a)$ and $Z_\theta(x,a)$
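A NumPy sketch of the categorical projection described above (one common way to implement it; the function and variable names are mine, and the terminal-state handling via `dones` is an added assumption). It assumes a fixed support array `z` like the one built in the earlier snippet.

```python
import numpy as np

def categorical_projection(rewards, next_probs, dones, z, gamma):
    """Project the sample Bellman update r + gamma * z_j onto the fixed support z.

    rewards    : (B,) sampled rewards
    next_probs : (B, N) probabilities p_j(x', pi(x')) of the next-state distribution
    dones      : (B,) 1.0 if the transition was terminal, else 0.0 (added assumption)
    z          : (N,) fixed support atoms z_0, ..., z_{N-1}
    Returns    : (B, N) projected target distribution Phi tau_hat Z
    """
    B, N = next_probs.shape
    v_min, v_max = z[0], z[-1]
    delta_z = (v_max - v_min) / (N - 1)

    # Bellman update for every atom, clipped to [V_MIN, V_MAX]
    tz = np.clip(rewards[:, None] + gamma * (1.0 - dones[:, None]) * z[None, :],
                 v_min, v_max)                                   # (B, N)
    b = (tz - v_min) / delta_z                                   # fractional atom index
    l, u = np.floor(b).astype(int), np.ceil(b).astype(int)

    m = np.zeros((B, N))
    rows = np.arange(B)
    for j in range(N):
        p = next_probs[:, j]
        # Split p_j between the lower and upper neighbouring atoms of tz_j
        np.add.at(m, (rows, l[:, j]), p * (u[:, j] - b[:, j]))
        np.add.at(m, (rows, u[:, j]), p * (b[:, j] - l[:, j]))
        exact = l[:, j] == u[:, j]                               # tz_j landed exactly on an atom
        np.add.at(m, (rows[exact], l[exact, j]), p[exact])
    return m
```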
Evaluation on Atari 2600 Games
Use a DQN architecture that outputs atom probabilities instead of action values
Replace MSE Loss with KL Loss for distributional RL
Simple $\epsilon$-greedy policy over expected action values
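A sketch of that action selection (names are mine): collapse each predicted distribution to its expectation and act $\epsilon$-greedily on the resulting action values.

```python
import numpy as np

def epsilon_greedy_action(probs, z, epsilon, rng):
    """probs: (A, N) atom probabilities for one state; z: (N,) fixed support."""
    q_values = probs @ z                          # Q(x, a) = sum_i z_i p_i(x, a)
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: random action
    return int(np.argmax(q_values))               # exploit: best expected return
```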
Safe actions have similar probabilities whereas unfavorable actions have 0 probability
Separates low-value losing events from high-value survival events
Doesn’t combine these events into a singular expectation
Distributions are generally close to Gaussian
Varying Number of Atoms
Too few atoms leads to poor performance
At 51 atoms, categorical DQN outperforms DQN on all games
Agent is able to pick up on intrinsic stochasticity in its predictions
State-of-the-art Results
C51: The 51 atom agent
Compared to SOTA algorithms, C51 surpasses them by a large margin on a number of games
Particularly good in sparse-reward scenarios
C51 also outperforms DQN in far fewer training steps on 45/57 games
Discussion
Why does learning a distribution matter? Learning distributions matters because of approximations:
Reduced Chattering:
The Bellman optimality operator is unstable; combined with function approximation, there are no convergence guarantees
The categorical algorithm mitigates this by effectively averaging over different distributions
State Aliasing:
State aliasing occurs when two different states are indistinguishable to the agent
Modeling a distribution provides a more stable learning target
Richer Predictions:
The distribution tells us the probability that the return takes on a certain value
Inductive Bias:
We can impose assumptions about the domain
E.g., we bounded the support of the distribution to $[V_{MIN}, V_{MAX}]$
Surprisingly, clipping values in DQN degrades performance
E.g., we can interpret the discount factor as a probability
Optimization:
KL divergence between categorical distributions is easy to minimize
KL between continuous distributions gave worse results (KL is insensitive to the values of the outcomes)
A metric closer to the Wasserstein distance should yield even better results
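For concreteness, a sketch (mine) of the per-sample objective: the cross-entropy between the projected target and the predicted atom probabilities, written against the logits so it is straightforward to differentiate.

```python
import numpy as np

def categorical_loss(m, logits):
    """Cross-entropy between projected target m (B, N) and predicted logits (B, N)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)            # numerical stability
    log_p = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(m * log_p).sum(axis=-1).mean()
```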