Universal Value Function Approximators
Resources
Schaul, Horgan, Gregor, Silver. Universal Value Function Approximators. ICML 2015.
Introduction
General value function ($V_g(s)$): represents the utility of state $s$ for achieving goal $g$
A collection of these can be learned from a single stream of experience
Each one can induce a policy (e.g., the greedy policy)
Can be used as a predictive representation of state
Usually represented as a neural net or a linear function approximator
Usually exploits the structure of the state space for generalization
The goal space usually has a similar amount of structure
Universal Value Function Approximator: $V(s, g, \theta)$
Extends value function approximation to states and goals
Exploits structure across states and goals
Generalizes to the set of all goals (even infinite sets!)
Exploits two kinds of structure across goals:
Structure in the induced value functions
Similarity-encoded prior knowledge in the goal representations
Learning a UVFA is hard because we only observe a small subset of possible $(s, g)$ pairs
Challenging regression problem in supervised setting
Decompose the regression:
View the data as a sparse table of values, with one row per state and one column per goal $\rightarrow$ find a low-rank factorization into state and goal embeddings $\phi(s), \psi(g)$
Learn non-linear mappings from states to state embeddings and from goals to goal embeddings
2 Approaches to learn UVFA:
Maintain a finite horde of value functions and use them to seed a data table from which $V(s, g; \theta)$ is learned
Bootstrap from the UVFA's value at successor states
Background
Assume standard MDP RL setting
$\gamma_g$: Pseudo-discount function
State-dependent discounting
Soft termination (equals 0 when a state is terminal with respect to the goal)
Pseudo-discounted expected pseudo-return: $V_{g, \pi}(s) = \mathbb{E}\left[\sum_{t=0}^\infty R_g(s_{t+1}, a_t, s_t) \prod_{k=0}^t \gamma_g(s_k) \,\middle\vert\, s_0 = s\right]$
Action-value function: $Q_{g, \pi}(s, a) = \mathbb{E}_{s'}\left[R_g(s, a, s') + \gamma_g(s') \cdot V_{g, \pi}(s')\right]$
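A minimal sketch of this pseudo-return, estimated by Monte Carlo from a single finite trajectory (`R_g` and `gamma_g` are assumed callables standing in for the goal's pseudo-reward and pseudo-discount; not from the paper):

```python
def pseudo_return(trajectory, R_g, gamma_g):
    """Monte Carlo estimate of V_{g,pi}(s_0) from one trajectory.

    trajectory: list of (s, a, s_next) transitions starting at s_0.
    R_g: pseudo-reward, called as R_g(s_next, a, s).
    gamma_g: state-dependent pseudo-discount, gamma_g(s) in [0, 1].
    """
    total, discount = 0.0, 1.0
    for s, a, s_next in trajectory:
        discount *= gamma_g(s)        # prod_{k=0}^{t} gamma_g(s_k)
        if discount == 0.0:           # soft termination at a terminal state
            break
        total += discount * R_g(s_next, a, s)
    return total
```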
Universal Value Function Approximators
Possible value function approximator architectures:
$\mathcal{F}: \mathcal{S} \times \mathcal{G} \rightarrow \mathbb{R}$: Concatenate goal and state
$\phi: \mathcal{S} \rightarrow \mathbb{R}^n, \psi: \mathcal{G} \rightarrow \mathbb{R}^n, h: \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$: Two-stream architecture (sketched below)
$\phi, \psi$ are general function approximators
Exploits common structures between states and goals
If $\mathcal{G} \subseteq \mathcal{S}$, can use shared representation for $\phi, \psi$
UVFA can be symmetric: $V^\ast _s(g) = V^\ast _g(s)$
Partially symmetric: $\phi$ and $\psi$ share some parameters but are not identical
Symmetric: $\phi = \psi$
Small distances between embeddings indicate similar states and goals
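A minimal sketch of the two-stream architecture (PyTorch; `state_dim`, `goal_dim`, and the MLP sizes are assumptions, not from the paper):

```python
import torch
import torch.nn as nn

class TwoStreamUVFA(nn.Module):
    """V(s, g) = h(phi(s), psi(g)) with MLP embedding streams."""

    def __init__(self, state_dim, goal_dim, n=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n))
        self.psi = nn.Sequential(nn.Linear(goal_dim, 64), nn.ReLU(), nn.Linear(64, n))

    def forward(self, s, g):
        # h as a dot product of the two n-dimensional embeddings;
        # for the fully symmetric variant one would tie psi to phi.
        return (self.phi(s) * self.psi(g)).sum(dim=-1)
```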
Supervised Learning of UVFAs
Approach 1: End-to-end training
Backprop on the MSE $\mathbb{E}[(V^\ast_g(s) - V(s, g; \theta))^2]$ and apply SGD
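A minimal end-to-end training step, reusing the `TwoStreamUVFA` sketch above (`s`, `g`, `v_star` are assumed batched tensors of states, goals, and ground-truth values):

```python
import torch

uvfa = TwoStreamUVFA(state_dim=4, goal_dim=4)
opt = torch.optim.SGD(uvfa.parameters(), lr=1e-2)

def train_step(s, g, v_star):
    """One SGD step on the MSE between V(s, g; theta) and targets."""
    loss = ((uvfa(s, g) - v_star) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```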
Approach 2: Two stage training procedure based on matrix factorization
Lay out all values of $V^\ast_g(s)$ in a table, one row for each state, one column for each goal
Factorize the matrix and find low rank approximation
$\hat{\phi}_s$: Target embedding vector for row of $s$
$\hat{\psi}_g$: Target embedding vector for column of $g$
Learn parameters for $\phi,\psi$ via regression toward target embeddings
(Optional) fine-tune with end-to-end training
Factorization finds the idealized embeddings
Learning trains $\phi, \psi$ to produce those idealized embeddings from raw states and goals
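A sketch of stage one on a fully observed value table, using a truncated SVD for the low-rank step (numpy; the paper does not prescribe this particular solver):

```python
import numpy as np

def target_embeddings(V_table, rank):
    """Factorize the |S| x |G| table so V_table ~= phi_hat @ psi_hat.T.

    Returns phi_hat (one target embedding per state / row) and
    psi_hat (one target embedding per goal / column).
    """
    U, sing, Vt = np.linalg.svd(V_table, full_matrices=False)
    phi_hat = U[:, :rank] * sing[:rank]   # fold singular values into states
    psi_hat = Vt[:rank].T
    return phi_hat, psi_hat

# Stage two: regress phi(s) toward phi_hat[s] and psi(g) toward psi_hat[g]
# (e.g., with the MSE setup above), then optionally fine-tune end to end.
```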
Supervised Learning Experiments
Train UVFA on ground truth data
Evaluate using MSE on unseen state-goal pairs
Measure the policy quality of a value function approximator as the true expected discounted reward, averaged over all start states
Follow a softmax policy over the values (with a temperature) and compare it to the optimal policy
Normalize policy quality so that the optimal policy scores 1 and the uniform random policy scores 0
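A plausible form of this normalization (my reconstruction; the notes only fix the two anchor points): $\text{quality}(\pi) = \frac{J(\pi) - J(\pi_{\text{rand}})}{J(\pi^\ast) - J(\pi_{\text{rand}})}$, where $J(\pi)$ is the expected discounted reward of $\pi$ averaged over start states.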
Test on LavaWorld
4 rooms for states + 4 directions for actions
Contains lava blocks that are deadly when touched
Tabular Completion
States and goals represented as one-hot vectors
$\phi, \psi$ are identity functions
Tests how well unseen state-goal pairs can be reconstructed with a low-rank approximation
Policy quality saturates at optimal even while the value error is still improving
Low rank embeddings can recover topological structures in LavaWorld
Test reliability with respect to missing/unreliable data
Reconstruct missing values as $V(s, g; \theta) = \hat{\phi}_s \cdot \hat{\psi}_g$ (sketch below)
Policy quality degrades gracefully as less and less value information is provided
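A sketch of low-rank reconstruction when only some table entries are observed (plain gradient descent on the observed cells; treat this as illustrative, not as the solver actually used in the experiments):

```python
import numpy as np

def complete_value_table(V_obs, mask, rank, lr=0.01, steps=5000, seed=0):
    """Fill in missing entries of a sparse |S| x |G| value table.

    V_obs: table with observed values (entries where mask == 0 are ignored).
    mask:  1.0 where V*_g(s) was observed, 0.0 where it is missing.
    """
    rng = np.random.default_rng(seed)
    n_s, n_g = V_obs.shape
    phi = rng.normal(scale=0.1, size=(n_s, rank))
    psi = rng.normal(scale=0.1, size=(n_g, rank))
    for _ in range(steps):
        err = mask * (phi @ psi.T - V_obs)   # error on observed cells only
        phi, psi = phi - lr * err @ psi, psi - lr * err.T @ phi
    return phi @ psi.T                       # V(s, g; theta) = phi_s . psi_g
```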
Interpolation
We want to know whether training on a subset of goals gives reasonable estimates for never-seen goals
Interpolation does in fact occur and we get good estimates
Extrapolation
We can interpolate between similar goals, but can we extrapolate to dissimilar goals?
Partial symmetry allows knowledge transfer from $\phi$ to $\psi$
Doing this enables extrapolation
Reinforcement Learning Experiments
In RL we have no ground-truth values
Option 1: use a horde of value functions to generate targets
Option 2: use bootstrapping to generate targets
Generalizing from Horde
Seed the data matrix from the horde
Use the two-stream factorization to build a UVFA
Each demon learns a $Q_g(s,a)$ for its goal off-policy
Build the data matrix from these estimates (see the sketch after this list)
Column: Corresponds to goal
Row: Corresponds to time index of one transition
Produce target embeddings and learn the UVFA
Performance is determined by the amount of experience and the amount of computation used to build the UVFA
Challenge: the data depends on how the behavior policy explores the environment
I.e., we might not see much data relevant to the goals of interest
After a certain amount of data, there is a tipping point where the UVFA gives reasonable estimates even for goals it wasn't trained on
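A sketch of seeding the data matrix from the horde (`demons` with a `.value(s, a)` method and `transitions` are assumed interfaces, not from the paper):

```python
import numpy as np

def seed_data_matrix(demons, transitions):
    """Build a (transitions x goals) matrix of horde value estimates.

    demons: one off-policy estimator of Q_g(s, a) per goal.
    transitions: (s, a) pairs visited by the behavior policy.
    """
    M = np.empty((len(transitions), len(demons)))
    for i, (s, a) in enumerate(transitions):   # row = one transition
        for j, demon in enumerate(demons):     # column = one goal
            M[i, j] = demon.value(s, a)
    return M   # factorize M as in the supervised two-stage procedure
```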
Ms Pacman
Trained 150 demons
Used 29 demons to seed the data matrix
Tested on 5 goal locations from the remaining 121 demons
Showed that small horde of demons can approximate larger horde of demons
Direct Bootstrapping
Bootstrapping update (sketched below): $Q(s_t, a_t, g) \leftarrow \alpha \left(r_g + \gamma_g(s_{t+1}) \max_{a'} Q(s_{t+1}, a', g)\right) + (1 - \alpha) \, Q(s_t, a_t, g)$
Learning process can be unstable
Use smaller learning rates
Use a better behaved $h$
Use a distance-based $h(a, b) = \gamma^{\Vert a - b \Vert_2}$
Does not recover 100% policy quality, but the UVFA still generalizes well when trained on 25% of the possible state-goal pairs
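A sketch of the bootstrapped target and the distance-based combiner (PyTorch; here `uvfa(s, g)` is assumed to return one value per discrete action, unlike the scalar version above):

```python
import torch

def bootstrap_target(uvfa, r_g, gamma_g_next, s_next, g):
    """One-step target r_g + gamma_g(s') * max_a' Q(s', a', g)."""
    with torch.no_grad():   # do not differentiate through the target
        return r_g + gamma_g_next * uvfa(s_next, g).max(dim=-1).values

def distance_h(phi_s, psi_g, gamma=0.99):
    """Distance-based h(a, b) = gamma ** ||a - b||_2, which the notes
    report as better behaved than a dot product under bootstrapping."""
    return gamma ** torch.linalg.norm(phi_s - psi_g, dim=-1)
```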
Discussion
UVFAs can be used for transfer learning to new tasks with the same dynamics but different goals
Generalized value functions can be used as features to represent state
UVFAs can be used to generate options
An option can act greedily with respect to $V(s, g; \theta)$
A UVFA can act as a universal option model: $V(s, g; \theta)$ can approximate the discounted probability of reaching $g$ from $s$ under a given policy
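A hedged sanity check of the option-model claim: with pseudo-reward $R_g(s, a, s') = 1$ iff $s' = g$ and pseudo-discount $\gamma_g(s) = \gamma$ for $s \neq g$ and $0$ at $g$, the pseudo-return of a trajectory that first reaches $g$ at time $\tau$ is exactly $\gamma^\tau$, so $V_{g, \pi}(s) = \mathbb{E}[\gamma^\tau]$, a discounted probability of reaching $g$.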