A Distributional Perspective on Reinforcement Learning
Resources
Introduction
Bellman’s equation describes value in terms of the expected reward and the expected outcome of a random transition
In distributional RL, we want the random return $Z$ whose expectation is $Q$
Distributional Bellman Equation: $Z(x,a) \stackrel{D}{=} R(x, a) + \gamma Z(X', A')$
Three random variables:
$R$: Reward
$(X', A')$: Next State Action
$Z(X', A')$: Random return (value distribution)
Bellman operator is a contraction in regular policy evaluation (guarantees convergence)
This does not automatically carry over to distributional RL
The metric we use to measure distance between distributions matters (e.g., KL divergence, total variation, Kolmogorov distance)
There is instability in distributional RL's control setting
While the Bellman optimality operator is a contraction in expectation, it is not a contraction in any metric over distributions
We want learning algorithms that model the effects of nonstationary policies
Learning distributions leads to more stable training
Setting
Assume standard MDP setting
Policy maps from a state to a probability distribution over actions
Bellman’s Equations
$Z^\pi$: Sum of discounted rewards along agent’s trajectory of interactions with environment
$Q^\pi$: expectation of $Z^\pi$
$Q^\pi(x,a) = \mathbb{E}[Z^\pi(x,a)] = \mathbb{E}[\sum _{t=0}^\infty \gamma^t R(x_t, a_t)]$
Bellman Operator: $\tau^\pi Q(x,a) = \mathbb{E}[R(x,a)] + \gamma \mathbb{E}_{P, \pi}[Q(x', a')]$
Bellman Optimality Operator: $\tau Q(x,a) = \mathbb{E}[R(x,a)] + \gamma \mathbb{E}_P [\max_{a' \in A} Q(x', a')]$
Both are contraction operators
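As a concrete illustration (my own sketch, not from the paper), here is one application of the expected Bellman operator $\tau^\pi$ on a tabular MDP; the arrays `Q`, `P`, `R`, and `pi` are hypothetical placeholders for the value estimates, transition kernel, expected rewards, and policy.

```python
import numpy as np

def bellman_operator(Q, P, R, pi, gamma):
    """One application of the expected Bellman operator tau^pi.

    Q  : (S, A) current action-value estimates
    P  : (S, A, S) transition probabilities P(x' | x, a)
    R  : (S, A) expected immediate rewards
    pi : (S, A) policy probabilities pi(a | x)
    """
    # Expected next-state value under pi: V(x') = sum_a' pi(a'|x') Q(x', a')
    V = (pi * Q).sum(axis=1)        # shape (S,)
    # tau^pi Q(x, a) = E[R(x, a)] + gamma * E_{x'~P}[V(x')]
    return R + gamma * P @ V        # shape (S, A)
```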
The Distributional Bellman Operators
Distributional Equations
$\Omega$: Space of all possible outcomes of an experiment
Let $F_U(y) = \Pr(U \leq y)$ be the CDF of $U$ and $F^{-1}_U(q)$ its inverse CDF
The Wasserstein Metric
Wasserstein Distance: $d_p(F,G) = \inf_{U,V} \Vert U - V \Vert_p = \Vert F^{-1}(\mathcal{U}) - G^{-1}(\mathcal{U}) \Vert_p$
inf: Infimum (greatest lower bound)
Where $F$ and $G$ are the CDFs of $U$ and $V$
$\mathcal{U}$ is the uniform distribution over $[0,1]$
If $p < \infty$, $d_p(F,G) = \left(\int_0^1 \vert F^{-1}(u) - G^{-1}(u) \vert^p \, du\right)^{\frac{1}{p}}$
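A quick sanity check of this inverse-CDF form (my own sketch, not from the paper): for two equal-size empirical samples, the quantile functions are just the sorted samples, so $d_p$ reduces to a $p$-norm of sorted differences.

```python
import numpy as np

def wasserstein_p(samples_u, samples_v, p=1):
    """Empirical p-Wasserstein distance between two equal-size samples.

    Sorting gives the empirical inverse CDFs on a uniform grid, so
    d_p(F, G) = (mean |F^{-1}(u) - G^{-1}(u)|^p)^(1/p).
    """
    u = np.sort(np.asarray(samples_u))
    v = np.sort(np.asarray(samples_v))
    return (np.mean(np.abs(u - v) ** p)) ** (1.0 / p)

# Example: distance between samples from two Gaussians with shifted means
rng = np.random.default_rng(0)
print(wasserstein_p(rng.normal(0, 1, 10_000), rng.normal(1, 1, 10_000), p=2))
```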
Properties of $d_p$: Given a scalar $a$ and random variable $A$ independent of $U,V$
$d_p(aU, aV) \leq \vert a \vert d_p(U, V)$
$d_p(A + U, A + V) \leq d_p(U, V)$
$d_p(AU, AV) \leq \vert\vert A \vert\vert _p d_p(U, V)$
Partition Lemma: Let $A_1, A_2, \dots$ be indicator random variables describing a partition of $\Omega$; then $d_p(U, V) \leq \sum_i d_p(A_i U, A_i V)$
Lemma 2: $\bar{d}_p$ is a metric over value distributions
If $\mathcal{Z}$ is the space of value distributions with $Z_1, Z_2 \in \mathcal{Z}$, the maximal form of the Wasserstein metric is $\bar{d}_p (Z_1, Z_2) = \sup_{x,a} d_p(Z_1(x,a), Z_2(x,a))$
sup: Supremum (least upper bound)
Policy Evaluation
Characterize $Z^\pi$ as intrinsic randomness of agent’s interactions with environment
Transition Operator: $P^\pi Z(x,a) \stackrel{D}{=} Z(X', A')$ where $X' \sim P(\cdot \vert x, a)$ and $A' \sim \pi(\cdot \vert X')$
Distributional Bellman Operator: $\tau^\pi Z(x,a) \stackrel{D}{=} R(x,a) + \gamma P^\pi Z(x,a)$ (sketched in code after the list below)
Randomness from reward
Randomness from transition
Randomness from next-state value distribution
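A minimal sketch of one exact application of $\tau^\pi$ at a single $(x,a)$ in a small tabular MDP (my own illustration, not the paper's algorithm); every distribution is a finite list of (value, probability) atoms, and all container names are hypothetical.

```python
from collections import defaultdict

def distributional_backup(Z, x, a, P, R_dist, pi, gamma):
    """One exact application of the distributional Bellman operator at (x, a).

    Z      : dict mapping (state, action) -> list of (value, prob) atoms
    P      : dict mapping (state, action) -> list of (next_state, prob)
    R_dist : dict mapping (state, action) -> list of (reward, prob)
    pi     : dict mapping state -> list of (action, prob)
    """
    out = defaultdict(float)
    for r, p_r in R_dist[(x, a)]:              # randomness from the reward
        for x2, p_x2 in P[(x, a)]:             # randomness from the transition
            for a2, p_a2 in pi[x2]:            # randomness from the policy
                for z, p_z in Z[(x2, a2)]:     # randomness from the next-state return
                    out[r + gamma * z] += p_r * p_x2 * p_a2 * p_z
    return sorted(out.items())                 # atoms of (tau^pi Z)(x, a)
```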
Contraction in $\bar{d}_p$
We expect $\tau^\pi$ to be a contraction operator under $\bar{d}_p$
We expect the expectation of $Z_k$ to converge exponentially quickly to $Q^\pi$ (along with all of its moments)
Lemma 3: $\tau^\pi$ is a $\gamma$-contraction in $\bar{d}_p$
Implies $\tau^\pi$ has a unique fixed point
$Z_k$ converges to $Z^\pi$ in $\bar{d}_p$ for $1 \leq p \leq \infty$
Not all distributional metrics are equal; KL divergence, total variation distance, and Kolmogorov distance do not contract
Contraction in Centered Moments
Coupling: $C(\omega) = U(\omega) - V(\omega)$
We see that $d_2^2(U, V) \leq \mathbb{E}[(U-V)^2] = \mathbb{V}(C) + (\mathbb{E} C)^2$
We cannot use $d_2$ to bound variance differences like $\vert \mathbb{V}(\tau^\pi Z_1(x,a)) - \mathbb{V}(\tau^\pi Z_2(x,a)) \vert$
However, $\tau^\pi$ is a contraction in variance (but not for higher centered moments, $p > 2$)
Control
Control Setting: Seek a policy $\pi$ that maximizes value, and an optimal value distribution
There are many optimal value distributions
The distributional Bellman optimality operator does converge to the set of optimal value distributions
But it is not a contraction between distributions
And it is more unstable due to greedy updates
Let $\Pi^*$ be the set of optimal policies
Optimal Value Distribution: the value distribution of an optimal policy
Set of all optimal value distributions: $\mathcal{Z}^* = \{ Z^{\pi^*} : \pi^* \in \Pi^* \}$
Greedy Policy: a $\pi$ for $Z \in \mathcal{Z}$ that maximizes the expectation of $Z$
Set of all greedy policies: $\mathcal{G}_Z = \{ \pi : \sum_a \pi (a \vert x) \mathbb{E} Z(x,a) = \max_{a' \in A} \mathbb{E} Z(x,a') \}$
Distributional Bellman Optimality Operator: any operator $\tau$ which implements greedy selection, i.e., $\tau Z = \tau^\pi Z$ for some $\pi \in \mathcal{G}_Z$
Behavior of $\mathbb{E} Z_k$: $\vert \vert \mathbb{E} \tau Z_1 - \mathbb{E} \tau Z_2 \vert \vert _\infty \leq \vert \vert \mathbb{E} Z_1 - \mathbb{E} Z_2 \vert \vert _\infty$
In particular, $\mathbb{E} Z_k \rightarrow Q^*$ exponentially quickly
Convergence to the set $\mathcal{Z}^*$ is not assured; the best we can hope for is convergence to the set of nonstationary optimal value distributions
Nonstationary Optimal Value Distributions ($\mathcal{Z}^{**}$): value distributions corresponding to the set of optimal policies
Convergence in the control setting: if the state and action spaces are finite and the greedy policies are totally ordered, $\tau$ has a unique fixed point $Z^* \in \mathcal{Z}^*$
The expectation of $Z_k$ converges to $Q^*$, but the distribution itself is not necessarily well behaved
"Not well behaved" means:
$\tau$ is not a contraction (distributions can move further apart before coming closer together)
Not all optimality operators have a fixed point (e.g., an operator can alternate between actions that are tied)
Existence of a fixed point for an operator $\tau$ is insufficient to guarantee convergence
Approximate Distributional Learning
Parametric Distribution
Model the value distribution with a discrete distribution parameterized by $N, V_{MIN}, V_{MAX}$, with support $\{ z_i = V_{MIN} + i \Delta z : 0 \leq i < N \}$ where $\Delta z = \frac{V_{MAX} - V_{MIN}}{N-1}$
Parametric model for the probabilities: $Z_{\theta}(x,a) = z_i$ with probability $p_i(x,a) = \frac{e^{\theta_i(x,a)}}{\sum_j e^{\theta_j(x,a)}}$
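A small sketch of this parametrization (variable names and the $V_{MIN}, V_{MAX}$ values are my own assumptions): the support is fixed, and per-atom logits are mapped to probabilities with a softmax.

```python
import numpy as np

N, V_MIN, V_MAX = 51, -10.0, 10.0                 # assumed example values
delta_z = (V_MAX - V_MIN) / (N - 1)
z = V_MIN + np.arange(N) * delta_z                # fixed support {z_0, ..., z_{N-1}}

def atom_probabilities(theta):
    """Softmax over per-atom logits theta_i(x, a); theta has shape (..., N)."""
    shifted = theta - theta.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)
```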
Projected Bellman Update
Problem: the Bellman update ($\hat{\tau} Z_{\theta}$) and the parameterization ($Z_{\theta}$) have disjoint supports
We cannot minimize the Wasserstein loss directly because we are restricted to learning from sample transitions
Solution: Categorical Algorithm
Instead, project the Bellman update $\hat{\tau} Z_{\theta}$ onto the support of $Z_\theta$
Compute the Bellman update $\hat{\tau} z_j = r + \gamma z_j$ for each atom $z_j$
Distribute its probability $p_j(x', \pi(x'))$ to the neighbors of $\hat{\tau} z_j$ on the support
$i$-th component of the projected update: $(\Phi \hat{\tau} Z_\theta (x,a))_i = \sum_{j=0}^{N-1} \left[1 - \frac{\vert [\hat{\tau} z_j]_{V_{MIN}}^{V_{MAX}} - z_i\vert}{\Delta z}\right]_0^1 p_j(x', \pi(x'))$
The sample loss is the cross-entropy term of the KL divergence between $\Phi \hat{\tau} Z_{\tilde{\theta}}(x,a)$ and $Z_\theta(x,a)$
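A NumPy sketch of the categorical projection described above (one common way to implement it; the function and variable names are mine, and the terminal-state handling via `dones` is an added assumption). It assumes a fixed support array `z` like the one built in the earlier snippet.

```python
import numpy as np

def categorical_projection(rewards, next_probs, dones, z, gamma):
    """Project the sample Bellman update r + gamma * z_j onto the fixed support z.

    rewards    : (B,) sampled rewards
    next_probs : (B, N) probabilities p_j(x', pi(x')) of the next-state distribution
    dones      : (B,) 1.0 if the transition was terminal, else 0.0 (added assumption)
    z          : (N,) fixed support atoms z_0, ..., z_{N-1}
    Returns    : (B, N) projected target distribution Phi tau_hat Z
    """
    B, N = next_probs.shape
    v_min, v_max = z[0], z[-1]
    delta_z = (v_max - v_min) / (N - 1)

    # Bellman update for every atom, clipped to [V_MIN, V_MAX]
    tz = np.clip(rewards[:, None] + gamma * (1.0 - dones[:, None]) * z[None, :],
                 v_min, v_max)                                   # (B, N)
    b = (tz - v_min) / delta_z                                   # fractional atom index
    l, u = np.floor(b).astype(int), np.ceil(b).astype(int)

    m = np.zeros((B, N))
    rows = np.arange(B)
    for j in range(N):
        p = next_probs[:, j]
        # Split p_j between the lower and upper neighbouring atoms of tz_j
        np.add.at(m, (rows, l[:, j]), p * (u[:, j] - b[:, j]))
        np.add.at(m, (rows, u[:, j]), p * (b[:, j] - l[:, j]))
        exact = l[:, j] == u[:, j]                               # tz_j landed exactly on an atom
        np.add.at(m, (rows[exact], l[exact, j]), p[exact])
    return m
```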
Evaluation on Atari 2600 Games
Use a DQN architecture that outputs atom probabilities instead of action values
Replace MSE Loss with KL Loss for distributional RL
Simple $\epsilon$-greedy policy over expected action values
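A sketch of that action selection (names are mine): collapse each predicted distribution to its expectation and act $\epsilon$-greedily on the resulting action values.

```python
import numpy as np

def epsilon_greedy_action(probs, z, epsilon, rng):
    """probs: (A, N) atom probabilities for one state; z: (N,) fixed support."""
    q_values = probs @ z                          # Q(x, a) = sum_i z_i p_i(x, a)
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: random action
    return int(np.argmax(q_values))               # exploit: best expected return
```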
Safe actions have similar probabilities whereas unfavorable actions have 0 probability
Separates low-value losing events from high-value survival events
Doesn’t combine these events into a singular expectation
Distributions are generally close to Gaussian
Varying Number of Atoms
Too few atoms leads to poor performance
At 51 atoms, categorical DQN outperforms DQN on all games
Agent is able to pick up on intrinsic stochasticity in its predictions
State-of-the-art Results
C51: The 51 atom agent
Compared to SOTA algorithms, C51 surpasses them by a large margin on a number of games
Particularly good in sparse-reward scenarios
C51 also outperforms DQN in far fewer training steps on 45/57 games
Discussion
Why does learning a distribution matter? Learning distributions matters because of approximations:
Reduced Chattering:
The Bellman optimality operator is unstable; combined with function approximation, there are no convergence guarantees
The categorical algorithm mitigates this by effectively averaging over different distributions
State Aliasing:
State aliasing occurs when two different states are indistinguishable to the agent
Modeling a distribution provides a more stable learning target
Richer Predictions:
The distribution tells us the probability that the return takes on a certain value
Inductive Bias:
We can impose assumptions about the domain
E.g., we bounded the support of the distribution to $[V_{MIN}, V_{MAX}]$
Surprisingly, clipping values in DQN degrades performance
E.g., we can interpret the discount factor as a probability
Optimization:
KL divergence between categorical distributions is easy to minimize
KL between continuous distributions gave worse results (KL is insensitive to the values of the outcomes)
A metric closer to the Wasserstein distance should yield even better results
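For concreteness, a sketch (mine) of the per-sample objective: the cross-entropy between the projected target and the predicted atom probabilities, written against the logits so it is straightforward to differentiate.

```python
import numpy as np

def categorical_loss(m, logits):
    """Cross-entropy between projected target m (B, N) and predicted logits (B, N)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)            # numerical stability
    log_p = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(m * log_p).sum(axis=-1).mean()
```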