Universal Value Function Approximators
Resources
Schaul, Horgan, Gregor, Silver. Universal Value Function Approximators. ICML 2015.
Introduction
General value function ($V_g(s)$): represents the utility of state $s$ for achieving goal $g$
A collection of these can be learned from a single stream of experience
Each one can induce a policy (e.g., the greedy policy)
Can be used as a predictive representation of state
Usually represented as a neural net or a linear function approximator
Usually exploits the structure of the state space for generalization
The goal space usually has a similar amount of structure
Universal Value Function Approximator: $V(s, g, \theta)$
Extends value function approximation to states and goals
Exploits structure across states and goals
Generalizes to the set of all goals (even infinite sets!)
Exploits two kinds of structure across goals:
Structure in the induced value functions
Similarity-encoded prior knowledge in the goal representations
Learning a UVFA is hard because we only observe a small subset of possible $(s, g)$ pairs
Challenging regression problem in supervised setting
Decompose the regression:
View the data as a sparse table of values, with one row per state and one column per goal $\rightarrow$ find a low-rank factorization into state and goal embeddings $\phi(s), \psi(g)$
Learn non-linear mappings from states to state embeddings and from goals to goal embeddings
2 Approaches to learn UVFA:
Maintain a finite horde of value functions and use them to seed a data table from which $V(s, g; \theta)$ is learned
Bootstrap from the UVFA's value at successor states
Background
Assume standard MDP RL setting
$\gamma_g$: Pseudo-discount function
State-dependent discounting
Soft termination (equals 0 when a state is terminal with respect to the goal)
Pseudo-discounted expected pseudo-return: $V_{g, \pi}(s) = \mathbb{E}\left[\sum_{t=0}^\infty R_g(s_{t+1}, a_t, s_t) \prod_{k=0}^t \gamma_g(s_k) \,\middle\vert\, s_0 = s\right]$
Action-value function: $Q_{g, \pi}(s, a) = \mathbb{E}_{s'}\left[R_g(s, a, s') + \gamma_g(s') \cdot V_{g, \pi}(s')\right]$
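A minimal sketch of this pseudo-return, estimated by Monte Carlo from a single finite trajectory (`R_g` and `gamma_g` are assumed callables standing in for the goal's pseudo-reward and pseudo-discount; not from the paper):

```python
def pseudo_return(trajectory, R_g, gamma_g):
    """Monte Carlo estimate of V_{g,pi}(s_0) from one trajectory.

    trajectory: list of (s, a, s_next) transitions starting at s_0.
    R_g: pseudo-reward, called as R_g(s_next, a, s).
    gamma_g: state-dependent pseudo-discount, gamma_g(s) in [0, 1].
    """
    total, discount = 0.0, 1.0
    for s, a, s_next in trajectory:
        discount *= gamma_g(s)        # prod_{k=0}^{t} gamma_g(s_k)
        if discount == 0.0:           # soft termination at a terminal state
            break
        total += discount * R_g(s_next, a, s)
    return total
```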
Universal Value Function Approximators
Possible value function approximator architectures:
$\mathcal{F}: \mathcal{S} \times \mathcal{G} \rightarrow \mathbb{R}$: Concatenate goal and state
$\phi: \mathcal{S} \rightarrow \mathbb{R}^n, \psi: \mathcal{G} \rightarrow \mathbb{R}^n, h: \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$: Two-stream architecture (sketched below)
$\phi, \psi$ are general function approximators
Exploits common structures between states and goals
If $\mathcal{G} \subseteq \mathcal{S}$, can use shared representation for $\phi, \psi$
UVFA can be symmetric: $V^\ast _s(g) = V^\ast _g(s)$
Partially symmetric: $\phi$ and $\psi$ share some parameters but are not identical
Symmetric: $\phi = \psi$
Small distances between embeddings indicate similar states and goals
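A minimal sketch of the two-stream architecture (PyTorch; `state_dim`, `goal_dim`, and the MLP sizes are assumptions, not from the paper):

```python
import torch
import torch.nn as nn

class TwoStreamUVFA(nn.Module):
    """V(s, g) = h(phi(s), psi(g)) with MLP embedding streams."""

    def __init__(self, state_dim, goal_dim, n=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n))
        self.psi = nn.Sequential(nn.Linear(goal_dim, 64), nn.ReLU(), nn.Linear(64, n))

    def forward(self, s, g):
        # h as a dot product of the two n-dimensional embeddings;
        # for the fully symmetric variant one would tie psi to phi.
        return (self.phi(s) * self.psi(g)).sum(dim=-1)
```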
Supervised Learning of UVFAs
Approach 1: End-to-end training
Backprop on the MSE $\mathbb{E}[(V^\ast_g(s) - V(s, g; \theta))^2]$ and apply SGD
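A minimal end-to-end training step, reusing the `TwoStreamUVFA` sketch above (`s`, `g`, `v_star` are assumed batched tensors of states, goals, and ground-truth values):

```python
import torch

uvfa = TwoStreamUVFA(state_dim=4, goal_dim=4)
opt = torch.optim.SGD(uvfa.parameters(), lr=1e-2)

def train_step(s, g, v_star):
    """One SGD step on the MSE between V(s, g; theta) and targets."""
    loss = ((uvfa(s, g) - v_star) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```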
Approach 2: Two stage training procedure based on matrix factorization
Lay out all values of $V^\ast_g(s)$ in a table, one row for each state, one column for each goal
Factorize the matrix and find low rank approximation
$\hat{\phi}_s$: Target embedding vector for row of $s$
$\hat{\psi}_g$: Target embedding vector for column of $g$
Learn parameters for $\phi,\psi$ via regression toward target embeddings
(Optional) fine-tune with end-to-end training
Factorization finds the idealized embeddings
Learning trains $\phi, \psi$ to produce those idealized embeddings from raw states and goals
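A sketch of stage one on a fully observed value table, using a truncated SVD for the low-rank step (numpy; the paper does not prescribe this particular solver):

```python
import numpy as np

def target_embeddings(V_table, rank):
    """Factorize the |S| x |G| table so V_table ~= phi_hat @ psi_hat.T.

    Returns phi_hat (one target embedding per state / row) and
    psi_hat (one target embedding per goal / column).
    """
    U, sing, Vt = np.linalg.svd(V_table, full_matrices=False)
    phi_hat = U[:, :rank] * sing[:rank]   # fold singular values into states
    psi_hat = Vt[:rank].T
    return phi_hat, psi_hat

# Stage two: regress phi(s) toward phi_hat[s] and psi(g) toward psi_hat[g]
# (e.g., with the MSE setup above), then optionally fine-tune end to end.
```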
Supervised Learning Experiments
Train UVFA on ground truth data
Evaluate using MSE on unseen state-goal pairs
Measure the policy quality of a value function approximator as the true expected discounted reward, averaged over all start states
Follow a softmax policy over the values (with a temperature) and compare it to the optimal policy
Normalize policy quality so that the optimal policy scores 1 and the uniform random policy scores 0
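A plausible form of this normalization (my reconstruction; the notes only fix the two anchor points): $\text{quality}(\pi) = \frac{J(\pi) - J(\pi_{\text{rand}})}{J(\pi^\ast) - J(\pi_{\text{rand}})}$, where $J(\pi)$ is the expected discounted reward of $\pi$ averaged over start states.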
Test on LavaWorld
4 rooms for states + 4 directions for actions
Contains lava blocks that are deadly when touched
Tabular Completion
States and goals represented as one-hot vectors
$\phi, \psi$ are identity functions
Tests how well unseen state-goal pairs can be reconstructed with a low-rank approximation
Policy quality saturates at optimal even while the value error is still improving
Low rank embeddings can recover topological structures in LavaWorld
Test reliability with respect to missing/unreliable data
Reconstruct missing values as $V(s, g; \theta) = \hat{\phi}_s \cdot \hat{\psi}_g$ (sketch below)
Policy quality degrades gracefully as less and less value information is provided
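A sketch of low-rank reconstruction when only some table entries are observed (plain gradient descent on the observed cells; treat this as illustrative, not as the solver actually used in the experiments):

```python
import numpy as np

def complete_value_table(V_obs, mask, rank, lr=0.01, steps=5000, seed=0):
    """Fill in missing entries of a sparse |S| x |G| value table.

    V_obs: table with observed values (entries where mask == 0 are ignored).
    mask:  1.0 where V*_g(s) was observed, 0.0 where it is missing.
    """
    rng = np.random.default_rng(seed)
    n_s, n_g = V_obs.shape
    phi = rng.normal(scale=0.1, size=(n_s, rank))
    psi = rng.normal(scale=0.1, size=(n_g, rank))
    for _ in range(steps):
        err = mask * (phi @ psi.T - V_obs)   # error on observed cells only
        phi, psi = phi - lr * err @ psi, psi - lr * err.T @ phi
    return phi @ psi.T                       # V(s, g; theta) = phi_s . psi_g
```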
Interpolation
We want to know whether training on a subset of goals gives reasonable estimates for never-seen goals
Interpolation does in fact occur and we get good estimates
Extrapolation
We can interpolate between similar goals, but can we extrapolate to dissimilar goals?
Partial symmetry allows knowledge transfer from $\phi$ to $\psi$
Doing this enables extrapolation
Reinforcement Learning Experiments
In RL we have no ground-truth values
Option 1: use a horde of value functions to generate targets
Option 2: use bootstrapping to generate targets
Generalizing from Horde
Seed the data matrix from the horde
Use the two-stream factorization to build a UVFA
Each demon learns a $Q_g(s,a)$ for its goal off-policy
Build the data matrix from these estimates (see the sketch after this list)
Column: Corresponds to goal
Row: Corresponds to time index of one transition
Produce target embeddings and learn the UVFA
Performance is determined by the amount of experience and the amount of computation used to build the UVFA
Challenge: the data depends on how the behavior policy explores the environment
I.e., we might not see much data relevant to the goals of interest
After a certain amount of data, there is a tipping point where the UVFA gives reasonable estimates even for goals it wasn't trained on
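A sketch of seeding the data matrix from the horde (`demons` with a `.value(s, a)` method and `transitions` are assumed interfaces, not from the paper):

```python
import numpy as np

def seed_data_matrix(demons, transitions):
    """Build a (transitions x goals) matrix of horde value estimates.

    demons: one off-policy estimator of Q_g(s, a) per goal.
    transitions: (s, a) pairs visited by the behavior policy.
    """
    M = np.empty((len(transitions), len(demons)))
    for i, (s, a) in enumerate(transitions):   # row = one transition
        for j, demon in enumerate(demons):     # column = one goal
            M[i, j] = demon.value(s, a)
    return M   # factorize M as in the supervised two-stage procedure
```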
Ms Pacman
Trained 150 demons
Used 29 demons to seed the data matrix
Tested on 5 goal locations from the remaining 121 demons
Showed that small horde of demons can approximate larger horde of demons
Direct Bootstrapping
Bootstrapping update (sketched below): $Q(s_t, a_t, g) \leftarrow \alpha \left(r_g + \gamma_g(s_{t+1}) \max_{a'} Q(s_{t+1}, a', g)\right) + (1 - \alpha) \, Q(s_t, a_t, g)$
Learning process can be unstable
Use smaller learning rates
Use a better behaved $h$
Use a distance-based $h(a, b) = \gamma^{\Vert a - b \Vert_2}$
Does not recover 100% policy quality, but the UVFA still generalizes well when trained on 25% of the possible state-goal pairs
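A sketch of the bootstrapped target and the distance-based combiner (PyTorch; here `uvfa(s, g)` is assumed to return one value per discrete action, unlike the scalar version above):

```python
import torch

def bootstrap_target(uvfa, r_g, gamma_g_next, s_next, g):
    """One-step target r_g + gamma_g(s') * max_a' Q(s', a', g)."""
    with torch.no_grad():   # do not differentiate through the target
        return r_g + gamma_g_next * uvfa(s_next, g).max(dim=-1).values

def distance_h(phi_s, psi_g, gamma=0.99):
    """Distance-based h(a, b) = gamma ** ||a - b||_2, which the notes
    report as better behaved than a dot product under bootstrapping."""
    return gamma ** torch.linalg.norm(phi_s - psi_g, dim=-1)
```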
Discussion
UVFAs can be used for transfer learning to new tasks with the same dynamics but different goals
Generalized value functions can be used as features to represent state
UVFAs can be used to generate options
An option can act greedily with respect to $V(s, g; \theta)$
A UVFA can act as a universal option model: $V(s, g; \theta)$ can approximate the discounted probability of reaching $g$ from $s$ under a given policy
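A hedged sanity check of the option-model claim: with pseudo-reward $R_g(s, a, s') = 1$ iff $s' = g$ and pseudo-discount $\gamma_g(s) = \gamma$ for $s \neq g$ and $0$ at $g$, the pseudo-return of a trajectory that first reaches $g$ at time $\tau$ is exactly $\gamma^\tau$, so $V_{g, \pi}(s) = \mathbb{E}[\gamma^\tau]$, a discounted probability of reaching $g$.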