cs234 / background lecture 0 - markov decision processes & value iteration
Note: This was from Stanford CS221 and is not part of CS234. I used this to gain some prerequisite knowledge for CS234.
Resources:
Introduction
Based on the uncertainty / randomness of the real world, where the same state + action combination can lead to more than one possible next state.
Applications:
- Robotics (actuators can fail, unseen obstacles)
- Resource Allocation (do not know customer demand for various products)
- Agriculture (uncertain weather impacts crop yield)
Markov Decision Processes
Transition State Diagram:
Blocks represent the state nodes / state-action nodes. Edge labels represent the probability of that transition occurring from the state-action node.
Transition Function: $T(s, a, s')$ gives you the probability of going from a given state to another state given an action.
- where $s$ is a given state, $a$ is an action, and $s'$ is the next state
- Adding up all possible $s'$ you can end up at for the same $(s, a)$ should result in 1: $\sum_{s'} T(s, a, s') = 1$
Reward Function: $\text{Reward}(s, a, s')$ gives you the reward of the transition from one state to another given an action.
- where $s$ is a given state, $a$ is an action, and $s'$ is the next state
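As a quick illustration, here is a minimal sketch in Python of a made-up two-state MDP (the states, action, probabilities, and rewards are invented for illustration, not taken from the lecture); note how the transition probabilities out of the state-action node sum to 1.

```python
# Hypothetical toy MDP: from state "in", the single action "stay" keeps you
# "in" with probability 2/3 and moves you to the terminal state "end" with
# probability 1/3; either transition pays a reward of 4.

def T(s, a, s_next):
    """Transition probability T(s, a, s')."""
    table = {
        ("in", "stay", "in"): 2 / 3,
        ("in", "stay", "end"): 1 / 3,
    }
    return table.get((s, a, s_next), 0.0)

def reward(s, a, s_next):
    """Reward(s, a, s') for a single transition."""
    return 4 if (s, a) == ("in", "stay") else 0

# Probabilities out of the ("in", "stay") state-action node sum to 1.
assert abs(T("in", "stay", "in") + T("in", "stay", "end") - 1.0) < 1e-9
```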
Markov: Future states only depend on the current state and action taken.
Policy ($\pi$): A mapping that gives an action $\pi(s)$ to take in each state $s$.
- This is the solution to an MDP
Discount Factor ($\gamma$): The value of future rewards relative to your current-day rewards.
- Range: $0 \le \gamma \le 1$
- $\gamma = 1$ means save for the future
- $\gamma = 0$ means live in the moment
Policy Evaluation
Utility: The discounted sum of rewards over a random path that is yielded by a specific policy.
- Utility is a random variable
- Formula: $u = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$, where $r_t$ is the reward received on the $t$-th step of the path
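A small sketch of that formula in Python (the reward sequence and $\gamma$ values below are made up for illustration):

```python
def discounted_utility(rewards, gamma):
    """u = r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_utility([4, 4, 4], gamma=0.9))  # 4 + 3.6 + 3.24 = 10.84
print(discounted_utility([4, 4, 4], gamma=0.0))  # 4.0  ("live in the moment")
print(discounted_utility([4, 4, 4], gamma=1.0))  # 12.0 ("save for the future")
```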
Value: The expected utility of a specific policy
- Average of utilities on all random paths
- Not a random variable
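To make the random-variable vs. expectation distinction concrete, here is a minimal simulation sketch using the same made-up toy MDP as above (states, probabilities, rewards, and sample count are all assumptions for illustration): each sampled path has a random utility, and averaging many of them approximates the value.

```python
import random

def sample_utility(gamma=1.0):
    """Utility of one random path through the toy MDP:
    from "in", stay "in" with prob 2/3 or stop at "end" with prob 1/3,
    collecting a reward of 4 on every transition."""
    utility, discount, state = 0.0, 1.0, "in"
    while state != "end":
        state = "in" if random.random() < 2 / 3 else "end"
        utility += discount * 4
        discount *= gamma
    return utility  # one draw of the random variable "utility"

# The value is the *expected* utility: approximate it by averaging many paths.
samples = [sample_utility() for _ in range(100_000)]
print(sum(samples) / len(samples))  # hovers around 12, the true value here
```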
Value of a Policy: Expected utility received by following policy $\pi$ from state $s$ - denoted by $V_\pi(s)$
Q-Value of a Policy: Expected utility from a state-action node (taking action $a$ from state $s$, then following $\pi$) - denoted by $Q_\pi(s, a)$
Recurrence Using $Q_\pi$: $V_\pi(s) = Q_\pi(s, \pi(s))$, where $Q_\pi(s, a) = \sum_{s'} T(s, a, s')\,[\text{Reward}(s, a, s') + \gamma V_\pi(s')]$
Iterative Algorithm: Start with arbitrary policy values and apply the recurrence until convergence
- Step 1. Initialize $V_\pi^{(0)}(s) \leftarrow 0$ for all states $s$.
- Step 2. Iterate until convergence (keep track of a time, $t$).
- Step 3. For each state $s$, update $V_\pi^{(t)}(s) \leftarrow \sum_{s'} T(s, \pi(s), s')\,[\text{Reward}(s, \pi(s), s') + \gamma V_\pi^{(t-1)}(s')]$
- Note:
Implementation Details: We want to wait until convergence but that might take a while, so we use this heuristic: stop once $\max_s |V_\pi^{(t)}(s) - V_\pi^{(t-1)}(s)| \le \epsilon$ for some small tolerance $\epsilon$.
- Only need to store the last two iterations of $V_\pi^{(t)}$, not all of them (see the sketch after this list).
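A minimal sketch of this iterative policy evaluation in Python. The interface is an assumption of this sketch: `transitions(s, a)` returns a list of `(s_next, prob)` pairs (a convenience wrapper around $T$), `reward(s, a, s_next)` returns the transition reward, and `gamma` / `eps` are illustrative defaults.

```python
def policy_evaluation(states, policy, transitions, reward, gamma=0.9, eps=1e-6):
    """Estimate V_pi(s) for a fixed policy by iterating the recurrence."""
    V = {s: 0.0 for s in states}              # Step 1: V_pi(s) <- 0 for all s
    while True:                               # Step 2: iterate until convergence
        V_new = {}
        for s in states:                      # Step 3: apply the recurrence
            a = policy[s]
            V_new[s] = sum(p * (reward(s, a, s2) + gamma * V[s2])
                           for s2, p in transitions(s, a))
        # Heuristic stopping rule: the values barely changed this iteration.
        if max(abs(V_new[s] - V[s]) for s in states) <= eps:
            return V_new
        V = V_new                             # only the last two iterations are kept

# Example usage on the toy MDP from earlier ("end" is terminal).
def transitions(s, a):
    return [("in", 2 / 3), ("end", 1 / 3)] if s == "in" else []

def reward(s, a, s2):
    return 4 if s == "in" else 0

states = ["in", "end"]
policy = {"in": "stay", "end": None}
print(policy_evaluation(states, policy, transitions, reward, gamma=1.0))  # {'in': ~12.0, 'end': 0.0}
```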
Value Iteration
Optimal Value: The maximum value obtained by any policy - denoted by $V_{\text{opt}}(s)$
Iterative Algorithm:
- Step 1. Initialize $V_{\text{opt}}^{(0)}(s) \leftarrow 0$ for all states $s$.
- Step 2. Iterate until convergence (keep track of a time, $t$).
- Step 3. For each state $s$, update $V_{\text{opt}}^{(t)}(s) \leftarrow \max_a \sum_{s'} T(s, a, s')\,[\text{Reward}(s, a, s') + \gamma V_{\text{opt}}^{(t-1)}(s')]$ (see the sketch after this list).
- Note:
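A minimal sketch of value iteration in Python, under the same assumed interface as the policy-evaluation sketch above, plus an `actions(s)` function (also an assumption of this sketch) listing the legal actions in each state; the only change from policy evaluation is the max over actions.

```python
def value_iteration(states, actions, transitions, reward, gamma=0.9, eps=1e-6):
    """Compute V_opt(s) by iterating the optimal-value recurrence."""
    V = {s: 0.0 for s in states}                  # Step 1: V_opt(s) <- 0 for all s
    while True:                                   # Step 2: iterate until convergence
        V_new = {}
        for s in states:                          # Step 3: take the max over actions
            q_values = [sum(p * (reward(s, a, s2) + gamma * V[s2])
                            for s2, p in transitions(s, a))
                        for a in actions(s)]
            V_new[s] = max(q_values) if q_values else 0.0  # terminal states stay at 0
        if max(abs(V_new[s] - V[s]) for s in states) <= eps:
            return V_new
        V = V_new

# Example usage on the toy MDP from earlier (only one action, so V_opt = V_pi).
def actions(s):
    return ["stay"] if s == "in" else []

def transitions(s, a):
    return [("in", 2 / 3), ("end", 1 / 3)] if s == "in" else []

def reward(s, a, s2):
    return 4 if s == "in" else 0

print(value_iteration(["in", "end"], actions, transitions, reward, gamma=1.0))
```

The optimal policy can then be read off by taking, in each state, an action whose Q-value attains the max in Step 3.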