## CS234 / Background Lecture 1 - Markov Decision Processes & Reinforcement Learning

*Note: This lecture is from Stanford CS221 and is not part of CS234. I used it to gain some prerequisite knowledge for CS234.*

### Vocabulary

- Episode: A sequence of states and actions until you hit the end state. You can look at the utility (discounted sum of rewards) of an episode.

- On-Policy: Estimate the value of the data-generating policy
- Off-Policy: Estimate the value of a policy other than the one generating the data

### Markov Decision Processes and Reinforcement Learning

*At its core: reinforcement learning is an MDP whose transition probabilities and rewards are unknown*

**Markov Decision Processes:**

- Have a mental model of how the world works
- You find the policy to maximize rewards

**Reinforcement Learning:**

- Don’t know how the world works (we have no MDP for the situation)
- You perform actions in the world to figure out the reward function

**Reinforcement Learning Base Algorithm:**

- Choose an action $a_t = \pi _{act}(s _{t-1})$
- Receive a reward $r_t$ and observe new state $s_t$
- Update parameters
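
The three steps above can be sketched as a generic interaction loop. The `env` and `agent` interfaces here (`reset`/`step`, `act`/`update`) are illustrative assumptions, not part of the lecture:

```python
# A minimal sketch of the RL interaction loop. `env` and `agent` are
# hypothetical objects: env.reset() returns an initial state, env.step(a)
# returns (next_state, reward, done), agent.act(s) picks an action, and
# agent.update(...) adjusts the agent's parameters.
def run_episode(env, agent):
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                        # a_t = pi_act(s_{t-1})
        next_state, reward, done = env.step(action)      # receive r_t, observe s_t
        agent.update(state, action, reward, next_state)  # update parameters
        total_reward += reward
        state = next_state
    return total_reward
```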

### Model-Based Monte Carlo Methods

**Core Idea:** estimate the MDP: $T(s, a, s')$ and reward $R(s, a, s')$

**Transition Formula:**

$$\hat{T}(s, a, s') = \frac{\text{\# of times } (s,a,s') \text{ occurs}}{\text{\# of times } (s,a) \text{ occurs}}$$

**Reward Formula:** $$\hat{R}(s, a, s') = r \text{ observed in } (s, a, r, s')$$

Once we have an estimated MDP, we calculate the policy using value iteration.
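The count-based estimates above can be sketched directly from a log of observed transitions. This is a sketch under the lecture's assumption that rewards are deterministic per $(s, a, s')$; the function and variable names are my own:

```python
from collections import defaultdict

# Model-based Monte Carlo: estimate T(s, a, s') and R(s, a, s') from
# observed (s, a, r, s') transitions by counting.
def estimate_mdp(transitions):
    count_sas = defaultdict(int)   # times (s, a, s') occurs
    count_sa = defaultdict(int)    # times (s, a) occurs
    reward = {}                    # observed r for (s, a, s')
    for s, a, r, s2 in transitions:
        count_sas[(s, a, s2)] += 1
        count_sa[(s, a)] += 1
        reward[(s, a, s2)] = r     # assumes reward is deterministic
    T_hat = {k: n / count_sa[(k[0], k[1])] for k, n in count_sas.items()}
    return T_hat, reward
```

The returned `T_hat` and `reward` define the estimated MDP that value iteration would then be run on.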

**Exploration:** One issue that can arise is that if your policy is not exhaustive, it might miss a state entirely. That state may carry a large reward, and failing to explore it results in a suboptimal policy. To do reinforcement learning, we need to explore the state space. **Solution:** Have your policy $\pi$ explore explicitly.

### Model-Free Monte Carlo

**Core Idea:** All that matters for prediction is an estimate of $Q _{opt}(s,a)$ so try to estimate $Q _{opt}(s,a)$ directly.

We know that we can calculate utility as follows: $u = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \dots$. We can estimate $Q _{\pi}(s,a)$ as follows:

$$\hat{Q} _{\pi}(s,a) = \text{average of } u_t \text{ over episodes, taken at the first time } s _{t-1} = s, a_t = a \text{ occurs}$$ Essentially, average the utility across many episodes in which $(s, a)$ occurs, counting only its first occurrence in each episode.

Model-free Monte Carlo is an on-policy approach because we are estimating the value of the specific policy that generated the data.

Convex Combination Equivalent Formulation for $\hat{Q} _{\pi}(s, a)$:

For each $s, a, u$

$$\eta = \frac{1}{1+\text{ updates to (s, a)}}$$ $$\hat{Q} _{\pi}(s, a) = (1 - \eta)\hat{Q} _{\pi}(s, a) + \eta u$$
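The convex-combination update can be sketched with a dictionary of Q-values and a per-pair update count (names are illustrative). With $\hat{Q}_{\pi}$ initialized to 0 and $\eta = 1/(1 + \text{prior updates})$, repeated updates reproduce the running average:

```python
# Incremental (convex combination) Monte Carlo update: mix each new
# utility u into the running estimate with a shrinking step size eta.
def mc_update(Q, counts, s, a, u):
    eta = 1.0 / (1 + counts.get((s, a), 0))  # eta = 1/(1 + # prior updates to (s, a))
    counts[(s, a)] = counts.get((s, a), 0) + 1
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * u
```

After $n$ updates this leaves `Q[(s, a)]` equal to the average of the $n$ utilities seen so far.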

Stochastic Gradient Descent Equivalent Formulation for $\hat{Q} _{\pi}(s, a)$:

$$\hat{Q} _{\pi}(s, a) = \hat{Q} _{\pi}(s, a) - \eta(\hat{Q} _{\pi}(s, a) - u)$$

$\hat{Q} _{\pi}(s, a) - u$ is the prediction minus the target $\rightarrow$ the implied objective is least-squares regression, which would be $(\hat{Q} _{\pi}(s, a) - u)^2$

### SARSA

Algorithm: On each $(s, a, r, s', a')$: $$\hat{Q} _{\pi}(s, a) = (1 - \eta)\hat{Q} _{\pi}(s, a) + \eta(r + \gamma\hat{Q} _{\pi}(s', a'))$$

where $r$ is the data and $\hat{Q} _{\pi}(s', a')$ is the estimate
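A single SARSA update can be sketched as follows (dictionary-backed Q-values; names and default hyperparameters are my own):

```python
# One SARSA update on (s, a, r, s', a'): the target mixes the observed
# reward r with the bootstrapped estimate Q(s', a') for the action the
# policy actually took next.
def sarsa_update(Q, s, a, r, s2, a2, eta=0.1, gamma=0.9):
    target = r + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
```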

**Bootstrapping:** You use $\hat{Q} _{\pi}$ to estimate $\hat{Q} _{\pi}$ instead of $u$

Estimating using $u$:

- Based on one path
- Unbiased
- Large Variance
- Need to wait until end to update

Estimating using $r + \hat{Q} _{\pi}$:

- Based on estimate
- Biased
- Small Variance
- Can update immediately

*Model-free Monte Carlo and SARSA only estimate $Q _{\pi}$, not $Q _{opt}$. Use Q-learning to get $Q _{opt}$.*

### Q Learning

**Q-Learning:** Calculates $\hat{Q} _{opt}$ using $\hat{Q} _{opt}$ and $r$. This is an off-policy technique.

For each $(s, a, r, s')$:

$$\hat{Q} _{opt}(s, a) = (1 - \eta)\hat{Q} _{opt}(s, a) + \eta(r + \gamma\hat{V} _{opt}(s'))$$ $$\hat{V} _{opt}(s') = \max _{a' \in \text{Actions}(s')} \hat{Q} _{opt}(s', a')$$

where $\hat{Q} _{opt}(s, a)$ is the prediction and $r + \gamma\hat{V} _{opt}(s')$ is the target.
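The update can be sketched like the SARSA one, except the target maximizes over next actions, which is what makes it off-policy. `actions` is an assumed function mapping a state to its available actions:

```python
# One Q-learning update on (s, a, r, s'): the target uses the max over
# next actions (V_hat), so it estimates Q_opt regardless of which policy
# generated the data (off-policy).
def q_learning_update(Q, actions, s, a, r, s2, eta=0.1, gamma=0.9):
    v_hat = max((Q.get((s2, a2), 0.0) for a2 in actions(s2)), default=0.0)
    target = r + gamma * v_hat
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
```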

### Exploration vs Exploitation

Extreme 1: No exploration, all exploitation - choose the action that leads to the highest $\hat{Q} _{opt}$ (too greedy)

$$\pi _{act}(s) = \arg\max _{a \in \text{Actions}(s)} \hat{Q} _{opt}(s, a)$$

Extreme 2: No exploitation, all exploration - choose an action randomly. Leads to bad average utility because exploration is not guided

$$ \pi _{act}(s) = \text{random from Actions(s)}$$

### Epsilon Greedy Policy

Balances exploitation and exploration using a parameter, $\epsilon$:

- With probability $1 - \epsilon$, we choose exploitation
- With probability $\epsilon$, we choose exploration
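
The two cases above can be sketched as a small selection function (dictionary-backed Q-values; names are illustrative):

```python
import random

# Epsilon-greedy action selection: exploit (greedy in Q) with probability
# 1 - epsilon, explore (uniform random action) with probability epsilon.
def epsilon_greedy(Q, actions, s, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))      # exploit
```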

Problem: Large state spaces are hard to explore

Stochastic Gradient Descent Update:

$$\hat{Q} _{\pi}(s, a) = \hat{Q} _{\pi}(s, a) - \eta(\hat{Q} _{\pi}(s, a) - u)$$

**Rote Learning:** Every $\hat{Q} _{\pi}(s, a)$ has a different value (not really learning) $\rightarrow$ doesn’t generalize to unseen states/actions

### Function Approximation

**Core Idea:** Instead of a lookup table, use a linear regression model with features $\phi(s, a)$ and weights $w$ to generalize to unseen states: $$ \hat{Q} _{opt}(s, a; w) = w \cdot \phi(s, a)$$

**Q-Learning with Function Approximation:**

$$ w = w - \eta(\hat{Q} _{opt}(s, a; w) - (r + \gamma\hat{V} _{opt}(s')))\phi(s, a)$$

Where $\hat{Q} _{opt}(s, a; w) - (r + \gamma\hat{V} _{opt}(s'))$ is the prediction minus the target.
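The weight update can be sketched with sparse feature dictionaries (a common representation; the `phi` interface and all names here are assumptions). Note that with indicator features per $(s, a)$ pair this reduces to tabular Q-learning:

```python
from collections import defaultdict

# Q-learning with linear function approximation: Q(s, a; w) = w . phi(s, a).
# `w` maps feature names to weights; `phi(s, a)` returns a sparse feature
# dict; `actions(s)` lists the actions available in s.
def q_fa_update(w, phi, actions, s, a, r, s2, eta=0.1, gamma=0.9):
    def q(state, action):
        return sum(w[f] * v for f, v in phi(state, action).items())
    v_hat = max((q(s2, a2) for a2 in actions(s2)), default=0.0)
    residual = q(s, a) - (r + gamma * v_hat)   # prediction minus target
    for f, v in phi(s, a).items():             # gradient step on each active feature
        w[f] -= eta * residual * v
```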

### Miscellaneous

- Partial Feedback: You can only learn about actions you take
- State: Rewards depend on previous state
- Reinforcement Learning: Partial Feedback + State
- Deep RL: Use neural networks to estimate $\hat{Q} _{opt}(s, a)$
- Policy Gradient: Use neural networks to train a policy directly