cs234 / background lecture 0 - markov decision processes & value iteration

Note: This is from stanford cs221 and is not part of cs234. I used it to gain some prerequisite knowledge for cs234.

Resources:

Introduction

MDPs model the uncertainty / randomness of the real world, where the same state + action combination can lead to different successor states.

Applications:

Markov Decision Processes

Transition State Diagram:

Blocks represent state nodes / state-action nodes. Labeled edges represent the probability of each transition occurring from a state-action node.

Transition Function: Gives you the probability of going from a given state to another state given an action - denoted by T(s,a,s′)

Reward Function: Gives you the reward of the transition from one state to another given an action - denoted by R(s,a,s′)
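
In symbols (the time-indexed notation below is mine, but it matches how T and R are used in the recurrences later on):

```latex
% Transition function: probability of landing in s' after taking action a in state s
T(s, a, s') = \mathbb{P}\bigl(s_{t+1} = s' \mid s_t = s,\; a_t = a\bigr)

% Reward function: immediate reward received on that same transition
R(s, a, s') \in \mathbb{R}
```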

Markov: Future states only depend on the current state and action taken.

Policy (π): A mapping from each state s ∈ States to an action a ∈ Actions(s)

Discount Factor (γ): The value of future rewards relative to immediate rewards
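
As a concrete sketch, the pieces above (states, actions, transition/reward function, discount factor, policy) can be written down directly in Python. The toy two-state game and every name below are my own illustration, not something from the lecture:

```python
# Toy MDP written as plain Python data structures (illustrative only).
# States: "in" (still playing) and "end" (terminal).

states = ["in", "end"]

def actions(s):
    """Actions available in state s."""
    return ["stay", "quit"] if s == "in" else []

def is_end(s):
    """The process terminates once we reach "end"."""
    return s == "end"

# Transition function T(s, a, s') and reward function R(s, a, s'),
# stored together as {(s, a): [(s_next, probability, reward), ...]}.
transitions = {
    ("in", "stay"): [("in", 2 / 3, 4), ("end", 1 / 3, 4)],
    ("in", "quit"): [("end", 1.0, 10)],
}

gamma = 1.0  # discount factor

# A policy maps each non-terminal state to an action.
policy = {"in": "stay"}
```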

Policy Evaluation

Utility: The discounted sum of rewards over the random path yielded by following a specific policy.
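
Written out, a path whose successive transitions give rewards r1, r2, r3, … has utility:

```latex
u = r_1 + \gamma r_2 + \gamma^2 r_3 + \gamma^3 r_4 + \cdots
```

For instance (numbers of my own choosing), with γ = 0.5 and rewards 4, 4, 10 the utility is 4 + 0.5·4 + 0.25·10 = 8.5.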

Value: The expected utility of a specific policy

Value of a Policy: Expected utility received by following policy π from state s - denoted by Vπ(s)

Q-Value of a Policy: Expected utility from a state-action node - denoted by Qπ(s,a)

Recurrence Using Vπ(s) and Qπ(s,a):

$$V_\pi(s) = \begin{cases} 0 & \text{if IsEnd}(s) \\ Q_\pi(s, \pi(s)) & \text{otherwise} \end{cases}$$

$$Q_\pi(s,a) = \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma V_\pi(s')\bigr]$$

Iterative Algorithm: Start with arbitrary policy values and apply the recurrence until convergence

Implementation Details: We want to wait until convergence, but that might take a while, so we use this stopping heuristic:

$$\max_{s \in \text{States}} \bigl|V_\pi^{(t)}(s) - V_\pi^{(t-1)}(s)\bigr| \le \epsilon$$
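
A minimal sketch of this iterative algorithm with the ε stopping rule, reusing the assumed data layout from the toy MDP sketch above:

```python
def policy_evaluation(states, transitions, policy, gamma, is_end, epsilon=1e-6):
    """Iteratively apply the V_pi / Q_pi recurrence until the max change is <= epsilon."""
    V = {s: 0.0 for s in states}  # arbitrary starting values (here: zeros)
    while True:
        new_V = {}
        for s in states:
            if is_end(s):
                new_V[s] = 0.0
                continue
            a = policy[s]
            # Q_pi(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V_pi(s')]
            new_V[s] = sum(prob * (reward + gamma * V[s_next])
                           for s_next, prob, reward in transitions[(s, a)])
        # Heuristic: stop once max_s |V_t(s) - V_{t-1}(s)| <= epsilon.
        if max(abs(new_V[s] - V[s]) for s in states) <= epsilon:
            return new_V
        V = new_V
```

On the toy MDP above with the "stay" policy and γ = 1, this converges to Vπ(in) ≈ 12.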

Value Iteration

Optimal Value: The maximum value obtained by any policy - denoted by Vopt(s)

$$Q_\text{opt}(s,a) = \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma V_\text{opt}(s')\bigr]$$

$$V_\text{opt}(s) = \begin{cases} 0 & \text{if IsEnd}(s) \\ \max_{a \in \text{Actions}(s)} Q_\text{opt}(s,a) & \text{otherwise} \end{cases}$$

$$\pi_\text{opt}(s) = \arg\max_{a \in \text{Actions}(s)} Q_\text{opt}(s,a)$$

Iterative Algorithm: Start with Vopt(s) = 0 for all states and apply the recurrence until convergence, using the same ε heuristic as in policy evaluation.
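
A minimal value iteration sketch under the same assumed data layout, which also reads off πopt at the end:

```python
def value_iteration(states, transitions, actions, gamma, is_end, epsilon=1e-6):
    """Iteratively apply the V_opt / Q_opt recurrence, then extract pi_opt."""
    V = {s: 0.0 for s in states}  # initialize V_opt(s) = 0 for every state

    def Q(s, a, V):
        # Q_opt(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V_opt(s')]
        return sum(prob * (reward + gamma * V[s_next])
                   for s_next, prob, reward in transitions[(s, a)])

    while True:
        new_V = {s: 0.0 if is_end(s) else max(Q(s, a, V) for a in actions(s))
                 for s in states}
        # Same stopping heuristic: max change across states is at most epsilon.
        if max(abs(new_V[s] - V[s]) for s in states) <= epsilon:
            break
        V = new_V

    # pi_opt(s) = argmax over a of Q_opt(s, a)
    pi = {s: max(actions(s), key=lambda a: Q(s, a, V))
          for s in states if not is_end(s)}
    return V, pi
```

On the toy MDP with γ = 1, this returns Vopt(in) ≈ 12 and πopt(in) = "stay" (staying is eventually worth more than the 10 from quitting).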