cs234 / lecture 2 - given a model of the world
Markov Process and Markov Chains
- Memoryless, random sequence of events that satisfies the Markov property (the future is independent of the past given the present state)
- No rewards, no actions
- Dynamics model specifies the probability of the next state given the current state. With row $i$ corresponding to the current state $s_i$ and column $j$ to the next state $s_j$, this is expressible as a matrix: $$ \begin{pmatrix} P(s_1 \vert s_1) & P(s_2 \vert s_1) & \dots & P(s_N \vert s_1) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1 \vert s_N) & P(s_2 \vert s_N) & \dots & P(s_N \vert s_N) \end{pmatrix} $$
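As a concrete illustration, here is a minimal NumPy sketch (using a made-up 3-state chain, not one from the lecture) of storing the dynamics as a matrix and sampling a trajectory from it:

```python
import numpy as np

# Hypothetical 3-state Markov chain; entry P[i, j] = P(s_j | s_i),
# so row i is the distribution over next states when the current state is s_i.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row must sum to 1

rng = np.random.default_rng(0)

def sample_chain(P, s0, steps):
    """Sample a state trajectory of length steps + 1 starting from state index s0."""
    states = [s0]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_chain(P, s0=0, steps=10))
```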
Markov Reward Process
- Markov Reward Process: Markov Chains + rewards (no actions)
- $S:$ Finite number of states ($s \in S$)
- $P:$ Dynamics / transition model where $P(s _{t+1} = s' \vert s_t = s)$
- $R:$ Reward function $R(s_t = s) = E[r_t \vert s_t = s]$
- The reward can be tied to just the immediate state, or to the current state and next state
- $\gamma:$ Discount factor, $\gamma \in [0, 1]$
Return and Value Function
- Horizon: The number of time steps in an episode
- Can be infinite or finite (if finite, it is called a finite Markov reward process)
- Return: Discounted sum of rewards from time step t to the horizon
- $G_t = r_t + \gamma r _{t+1} + \gamma^2 r _{t+2} + \dots$
- Value Function: Expected return starting from a state
- $V(s) = E[G_t \vert s_t = s] = E[r_t + \gamma r _{t+1} + \gamma^2 r _{t+2} + \dots \vert s_t = s]$
- Same as the return if the process is deterministic (a single possible next state from each state)
- Different from the return if the process is stochastic
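As a tiny worked example (with a hand-picked reward sequence and $\gamma = 0.9$, purely illustrative), the return can be computed directly from its definition:

```python
# Discounted return G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
# for a short, made-up reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1 + 0.9*0 + 0.81*2 + 0.729*5 = 6.265
```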
Discount Factor
- Used to avoid infinite returns and values
- $\gamma = 0:$ Only care about immediate rewards
- $\gamma = 1:$ Future reward and immediate reward equally beneficial
- Can use $\gamma = 1$ when episode lengths are always finite; $\gamma < 1$ is mathematically convenient for infinite horizons
Computing Value of a Markov Reward Process
- Estimate by simulation, generating a large number of episodes
- Average the returns
- Requires no assumption of Markov structure
- Using the Markov structure
- MRP Value Function: $V(s) = R(s) + \gamma \sum _{s' \in S}P(s' \vert s)V(s')$
- $R(s):$ Immediate reward
- $\gamma \sum _{s' \in S}P(s' \vert s)V(s'):$ Discounted sum of future rewards
- Matrix form of Bellman Equation for MRPs: $V = R + \gamma PV$
- Simplify the equation: $V - \gamma PV = R \Rightarrow (I - \gamma P)V = R \Rightarrow V = (I - \gamma P)^{-1}R$
$$ \begin{pmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{pmatrix} = \begin{pmatrix} R(s_1) \\ \vdots \\ R(s_N) \end{pmatrix} + \gamma \begin{pmatrix} P(s_1 \vert s_1) & P(s_2 \vert s_1) & \dots & P(s_N \vert s_1) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1 \vert s_N) & P(s_2 \vert s_N) & \dots & P(s_N \vert s_N) \end{pmatrix} \begin{pmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{pmatrix} $$
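A minimal sketch of the analytic solution in NumPy, assuming the same made-up 3-state chain as earlier plus illustrative rewards:

```python
import numpy as np

gamma = 0.9
# Illustrative values: P[i, j] = P(s_j | s_i), R[i] = R(s_i).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])
R = np.array([1.0, 0.0, 10.0])

# V = (I - γP)^{-1} R; np.linalg.solve avoids forming the inverse explicitly.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```

Solving the linear system costs roughly $O(N^3)$ in the number of states, which motivates the iterative algorithm below.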
- Iterative Algorithm for Computing Value of an MRP (Dynamic Programming)
- Initialize $V_0(s) = 0$ for all s
- For k = 1 until convergence
- For all s in S
- $V_k(s) = R(s) + \gamma \sum _{s' \in S}P(s'\vert s)V _{k-1}(s')$
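A sketch of the iterative computation, assuming the same array conventions as the analytic example (rows of P indexed by the current state, R indexed by state):

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Iterate V_k(s) = R(s) + γ Σ_{s'} P(s'|s) V_{k-1}(s') until the values stop changing."""
    V = np.zeros(len(R))           # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V  # one MRP Bellman backup over all states at once
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

For $\gamma < 1$ this converges to the same value function as the closed-form solution, at $O(\vert S \vert^2)$ cost per iteration.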
Markov Decision Processes
- Markov reward processes with actions ($a \in A$)
- We have a dynamics model that is specified differently for each action $\rightarrow P(s _{t+1} = s' \vert s_t = s, a_t = a)$
- Multiple matrices (one for each action)
- Our reward function can be a function of the current state; the current state and action; or the current state, action, and next state
- $R(s_t = s, a_t = a) = E[r_t \vert s_t = s, a_t = a]$
- Can express MDPs as a tuple of $(S, A, R, P, \gamma)$
Policies
- Expressed as $\pi(a_t = a \vert s_t = s) = P(a_t = a \vert s_t = s)$
- Specifies which action to take in each state $\rightarrow$ can be deterministic or stochastic
- Markov Reward Process = MDP + Policy $\rightarrow$ specifying a policy induces an MRP because it defines the expected rewards and transition model
- $R^{\pi}(s) = \sum _{a \in A} \pi(a \vert s)R(s,a)$
- $P^{\pi}(s' \vert s) = \sum _{a \in A} \pi(a \vert s)P(s'\vert s,a)$
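A sketch of how a policy collapses an MDP into an MRP. The array shapes here (dynamics as an action-indexed stack of $\vert S \vert \times \vert S \vert$ matrices, rewards as an $\vert S \vert \times \vert A \vert$ table, a stochastic policy as an $\vert S \vert \times \vert A \vert$ table) are an assumed layout for illustration, not something specified in the lecture:

```python
import numpy as np

def induced_mrp(P_a, R_sa, pi):
    """Collapse an MDP into the MRP induced by a (possibly stochastic) policy.

    Assumed layout (illustrative, not from the lecture):
      P_a:  shape (A, S, S), P_a[a, s, s'] = P(s'|s, a)
      R_sa: shape (S, A),    R_sa[s, a]    = R(s, a)
      pi:   shape (S, A),    pi[s, a]      = π(a|s)
    """
    R_pi = np.einsum('sa,sa->s', pi, R_sa)    # R^π(s)    = Σ_a π(a|s) R(s, a)
    P_pi = np.einsum('sa,ast->st', pi, P_a)   # P^π(s'|s) = Σ_a π(a|s) P(s'|s, a)
    return R_pi, P_pi
```

The resulting $(R^{\pi}, P^{\pi})$ pair can be fed directly into either MRP value computation from the previous section.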
Policy Evaluation
- Iterative Algorithm (Bellman Backup)
- Initialize $V_0(s) = 0$ for all s
- For k = 1 until convergence
- For all s in S
- $V^{\pi}_k(s) = R(s, \pi(s)) + \gamma \sum _{s' \in S}P(s'\vert s, \pi(s))V^{\pi} _{k-1}(s')$ (for a deterministic policy $\pi$)
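A sketch of iterative policy evaluation for a deterministic policy, under the same assumed array layout as above (here pi is an array mapping each state index to an action index):

```python
import numpy as np

def policy_evaluation(P_a, R_sa, pi, gamma, tol=1e-8):
    """Iteratively evaluate a deterministic policy (pi[s] = action index).

    P_a has assumed shape (A, S, S) and R_sa shape (S, A), as in the sketch above.
    """
    S = R_sa.shape[0]
    V = np.zeros(S)  # V_0(s) = 0 for all s
    while True:
        # V_k(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V_{k-1}(s')
        V_new = np.array([R_sa[s, pi[s]] + gamma * P_a[pi[s], s] @ V for s in range(S)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```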
MDP Control
- We want to find the optimal policy
- $\pi^*(s) = argmax _{\pi} V^{\pi}(s)$
- There exists a unique optimal value function
- Optimal Policy Characteristics (for infinite horizon MDPs):
- Deterministic
- Stationary (doesn’t depend on time step)
- Not necessarily unique (there can be ties between policies that achieve the optimal value function)
Policy Search
- We know that the number of deterministic policies is $\vert A \vert ^{\vert S \vert}$
- Using enumeration is inefficient (evaluating every policy exhaustively)
- We prefer policy iteration
MDP Policy Iteration
- We take a “guess” of the optimal policy, we evaluate it, and then we try to improve it until we cannot improve it anymore
- Algorithm:
- Set i = 0
- Initialize $\pi_{0}(s)$ randomly for all states $s$
- While i == 0 or $\vert\vert \pi_i - \pi _{i-1}\vert\vert_1 > 0$ (L1 norm checks whether the policy changed for any state)
- $V^{\pi_i} \leftarrow$ MDP value function of $\pi_i$ (policy evaluation)
- $\pi _{i+1} \leftarrow$ Policy improvement (see below)
- $i = i + 1$
- State Value: $V^{\pi}(s)$
- “If you start in state s and follow the policy, what is the expected discounted sum of rewards?”
- State-Action Value: $Q^{\pi}(s, a) = R(s, a) + \gamma \sum _{s' \in S} P(s' \vert s, a)V^{\pi}(s')$
- “If you start in state s, take action a, and then follow the policy, what is the expected discounted sum of rewards?”
- Policy Improvement:
- Compute the state-action value of the current policy
- For s in S and a in A
- Compute $Q^{\pi_i}(s,a) = R(s, a) + \gamma \sum _{s' \in S} P(s' \vert s, a)V^{\pi_i}(s')$
- Compute the new policy $\pi _{i+1}$ for all $s \in S$
- $\pi _{i+1}(s) = argmax _{a} Q^{\pi_i}(s, a) \quad \forall s \in S$
- Monotonic Improvement in Policy: The value of the new policy is greater than or equal to the value of the old policy in every state
- Once the policy stops changing, it can never change again, and you have converged to an optimal policy
- There is a maximum of $\vert A \vert ^{\vert S \vert}$ iterations for policy iteration
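Putting evaluation and improvement together, a self-contained sketch of policy iteration under the same assumed array layout (shapes and names are illustrative, not from the lecture):

```python
import numpy as np

def policy_iteration(P_a, R_sa, gamma, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable.

    Assumed layout: P_a has shape (A, S, S) with P_a[a, s, s'] = P(s'|s, a),
    and R_sa has shape (S, A) with R_sa[s, a] = R(s, a).
    """
    A, S, _ = P_a.shape
    pi = np.zeros(S, dtype=int)  # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: iterate the Bellman backup for the current policy.
        V = np.zeros(S)
        while True:
            V_new = np.array([R_sa[s, pi[s]] + gamma * P_a[pi[s], s] @ V for s in range(S)])
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Policy improvement: Q^{π}(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) V^{π}(s')
        Q = R_sa + gamma * np.einsum('ast,t->sa', P_a, V)
        pi_new = Q.argmax(axis=1)          # greedy with respect to Q^{π}
        if np.array_equal(pi_new, pi):     # policy stopped changing → optimal
            return pi, V
        pi = pi_new
```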
Value Iteration
- Maintain the optimal value of starting in a state if we have a finite number of steps left in the episode
- Value function must satisfy the Bellman equation
- Bellman Backup Operator: Allows us to transform an old value function into a new, improved one
- $BV(s) = max_a R(s, a) + \gamma \sum _{s' \in S}P(s' \vert s, a)V(s')$
- For a particular policy: $B^{\pi}V(s) = R^{\pi}(s) + \gamma \sum _{s' \in S}P^{\pi}(s' \vert s)V(s')$
- We can do policy evaluation using this operator by applying it repeatedly, $B^{\pi}B^{\pi}B^{\pi}\cdots V$, until it converges
- Value Iteration Algorithm
- Set k = 1
- Initialize $V_0(s) = 0$ for all s
- Loop until finite horizon ends or convergence
- For each state, s:
- Compute $V _{k+1}(s) = max_a R(s, a) + \gamma \sum _{s' \in S} P(s' \vert s, a)V_k(s')$
- $V _{k+1} = BV_k$ (vector form)
- Greedy policy: $\pi _{k+1}(s) = argmax_a R(s,a) + \gamma \sum _{s' \in S} P(s' \vert s, a)V_k(s')$
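A sketch of value iteration under the same assumed array layout; the max over actions inside the loop is exactly the Bellman backup $BV_k$:

```python
import numpy as np

def value_iteration(P_a, R_sa, gamma, tol=1e-8):
    """Repeatedly apply the Bellman backup B until the value function stops changing.

    Assumed layout: P_a has shape (A, S, S), R_sa has shape (S, A), as above.
    """
    S = R_sa.shape[0]
    V = np.zeros(S)  # V_0(s) = 0 for all s
    while True:
        # Q_k(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) V_k(s')
        Q = R_sa + gamma * np.einsum('ast,t->sa', P_a, V)
        V_new = Q.max(axis=1)               # V_{k+1} = B V_k
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # value estimate and greedy policy
        V = V_new
```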
- Contraction Operator
- Let $O$ be an operator and let $\vert x \vert$ denote any norm of $x$
- If $\vert OV - OV' \vert \leq \vert V - V'\vert$ for all $V, V'$, then $O$ is a contraction operator
- Value iteration converges because the Bellman backup is a contraction operator (with contraction factor $\gamma$) when $\gamma < 1$
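A quick numerical illustration of the contraction property, using randomly generated (hence purely illustrative) dynamics and rewards: for any two value functions $V, V'$, the Bellman backup shrinks their $\infty$-norm distance by at least a factor of $\gamma$.

```python
import numpy as np

def bellman_backup(P_a, R_sa, gamma, V):
    """B V(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') ]."""
    return (R_sa + gamma * np.einsum('ast,t->sa', P_a, V)).max(axis=1)

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P_a = rng.dirichlet(np.ones(S), size=(A, S))  # random but valid transition matrices
R_sa = rng.normal(size=(S, A))                # arbitrary rewards
V1, V2 = rng.normal(size=S), rng.normal(size=S)

lhs = np.max(np.abs(bellman_backup(P_a, R_sa, gamma, V1)
                    - bellman_backup(P_a, R_sa, gamma, V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(lhs <= rhs)  # True: the backup is a γ-contraction in the ∞-norm
```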