## CS234 / Background Lecture 1 - Markov Decision Processes & Reinforcement Learning

*Note: This lecture is from Stanford CS221 and is not part of CS234. I used it to gain some prerequisite knowledge for CS234.*

### Vocabulary

- Episode: A sequence of states and actions until you hit the end state. You can look at the utility (discounted sum of rewards) of an episode.

- On-Policy: Estimate the value of the data-generating policy
- Off-Policy: Estimate the value of a policy other than the one generating the data

### Markov Decision Processes and Reinforcement Learning

*At its core: reinforcement learning is an MDP whose transition probabilities and rewards are unknown*

**Markov Decision Processes:**

- Have a mental model of how the world works
- You find the policy to maximize rewards

**Reinforcement Learning:**

- Don’t know how the world works (we have no MDP for the situation)
- You perform actions in the world to figure out the reward function

**Reinforcement Learning Base Algorithm:**

- Choose an action $a_t = \pi _{act}(s _{t-1})$
- Receive a reward $r_t$ and observe new state $s_t$
- Update parameters
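
The three steps above can be sketched as a generic interaction loop. The `env` and `agent` interfaces here (`reset`/`step`, `act`/`update`) are illustrative assumptions, not part of the lecture:

```python
# A minimal sketch of the RL interaction loop. `env` and `agent` are
# hypothetical objects: env.reset() returns an initial state, env.step(a)
# returns (next_state, reward, done), agent.act(s) picks an action, and
# agent.update(...) adjusts the agent's parameters.
def run_episode(env, agent):
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                        # a_t = pi_act(s_{t-1})
        next_state, reward, done = env.step(action)      # receive r_t, observe s_t
        agent.update(state, action, reward, next_state)  # update parameters
        total_reward += reward
        state = next_state
    return total_reward
```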

### Model-Based Monte Carlo Methods

**Core Idea:** estimate the MDP: $T(s, a, s')$ and reward $R(s, a, s')$

**Transition Formula:**

$$\hat{T}(s, a, s') = \frac{\text{\# of times } (s,a,s') \text{ occurs}}{\text{\# of times } (s,a) \text{ occurs}}$$

**Reward Formula:** $$\hat{R}(s, a, s') = r \text{ observed in } (s, a, r, s')$$

Once we have an estimated MDP, we calculate the policy using value iteration.
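The count-based estimates above can be sketched directly from a log of observed transitions. This is a sketch under the lecture's assumption that rewards are deterministic per $(s, a, s')$; the function and variable names are my own:

```python
from collections import defaultdict

# Model-based Monte Carlo: estimate T(s, a, s') and R(s, a, s') from
# observed (s, a, r, s') transitions by counting.
def estimate_mdp(transitions):
    count_sas = defaultdict(int)   # times (s, a, s') occurs
    count_sa = defaultdict(int)    # times (s, a) occurs
    reward = {}                    # observed r for (s, a, s')
    for s, a, r, s2 in transitions:
        count_sas[(s, a, s2)] += 1
        count_sa[(s, a)] += 1
        reward[(s, a, s2)] = r     # assumes reward is deterministic
    T_hat = {k: n / count_sa[(k[0], k[1])] for k, n in count_sas.items()}
    return T_hat, reward
```

The returned `T_hat` and `reward` define the estimated MDP that value iteration would then be run on.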

**Exploration:** One issue that can arise is that if your policy is not exhaustive, it might miss a state entirely. That state may carry a large reward, and failing to explore it results in a suboptimal policy. To do reinforcement learning, we need to explore the state space. **Solution:** Have your policy $\pi$ explore explicitly.

### Model-Free Monte Carlo

**Core Idea:** All that matters for prediction is an estimate of $Q _{opt}(s,a)$ so try to estimate $Q _{opt}(s,a)$ directly.

We know that we can calculate utility as follows: $u = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \dots$. We can estimate $Q _{\pi}(s,a)$ as follows:

$$\hat{Q} _{\pi}(s,a) = \text{average of } u_t \text{ over episodes, taken at the first time } s _{t-1} = s, a_t = a \text{ occurs}$$ Essentially, average the utility across many episodes in which $(s, a)$ occurs, counting only its first occurrence in each episode.

Model-free Monte Carlo is an on-policy approach because we are estimating the value of the specific policy that generated the data.

Convex Combination Equivalent Formulation for $\hat{Q} _{\pi}(s, a)$:

For each $s, a, u$

$$\eta = \frac{1}{1+\text{ updates to (s, a)}}$$ $$\hat{Q} _{\pi}(s, a) = (1 - \eta)\hat{Q} _{\pi}(s, a) + \eta u$$
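The convex-combination update can be sketched with a dictionary of Q-values and a per-pair update count (names are illustrative). With $\hat{Q}_{\pi}$ initialized to 0 and $\eta = 1/(1 + \text{prior updates})$, repeated updates reproduce the running average:

```python
# Incremental (convex combination) Monte Carlo update: mix each new
# utility u into the running estimate with a shrinking step size eta.
def mc_update(Q, counts, s, a, u):
    eta = 1.0 / (1 + counts.get((s, a), 0))  # eta = 1/(1 + # prior updates to (s, a))
    counts[(s, a)] = counts.get((s, a), 0) + 1
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * u
```

After $n$ updates this leaves `Q[(s, a)]` equal to the average of the $n$ utilities seen so far.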

Stochastic Gradient Descent Equivalent Formulation for $\hat{Q} _{\pi}(s, a)$:

$$\hat{Q} _{\pi}(s, a) = \hat{Q} _{\pi}(s, a) - \eta(\hat{Q} _{\pi}(s, a) - u)$$

$\hat{Q} _{\pi}(s, a) - u$ is the prediction minus the target $\rightarrow$ the implied objective is least-squares regression, which would be $(\hat{Q} _{\pi}(s, a) - u)^2$

### SARSA

Algorithm: On each $(s, a, r, s', a')$: $$\hat{Q} _{\pi}(s, a) = (1 - \eta)\hat{Q} _{\pi}(s, a) + \eta(r + \gamma\hat{Q} _{\pi}(s', a'))$$

where $r$ is the data and $\hat{Q} _{\pi}(s', a')$ is the estimate
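A single SARSA update can be sketched as follows (dictionary-backed Q-values; names and default hyperparameters are my own):

```python
# One SARSA update on (s, a, r, s', a'): the target mixes the observed
# reward r with the bootstrapped estimate Q(s', a') for the action the
# policy actually took next.
def sarsa_update(Q, s, a, r, s2, a2, eta=0.1, gamma=0.9):
    target = r + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
```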

**Bootstrapping:** You use $\hat{Q} _{\pi}$ to estimate $\hat{Q} _{\pi}$ instead of $u$

Estimating using $u$:

- Based on one path
- Unbiased
- Large Variance
- Need to wait until end to update

Estimating using $r + \hat{Q} _{\pi}$:

- Based on estimate
- Biased
- Small Variance
- Can update immediately

*Model-free Monte Carlo and SARSA only estimate $Q _{\pi}$, not $Q _{opt}$. Use Q-learning to get $Q _{opt}$.*

### Q Learning

**Q-Learning:** Calculates $\hat{Q} _{opt}$ using $\hat{Q} _{opt}$ and $r$. This is an off-policy technique.

For each $(s, a, r, s')$:

$$\hat{Q} _{opt}(s, a) = (1 - \eta)\hat{Q} _{opt}(s, a) + \eta(r + \gamma\hat{V} _{opt}(s'))$$ $$\hat{V} _{opt}(s') = \max _{a' \in \text{Actions}(s')} \hat{Q} _{opt}(s', a')$$

where $\hat{Q} _{opt}(s, a)$ is the prediction and $r + \gamma\hat{V} _{opt}(s')$ is the target.
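The update can be sketched like the SARSA one, except the target maximizes over next actions, which is what makes it off-policy. `actions` is an assumed function mapping a state to its available actions:

```python
# One Q-learning update on (s, a, r, s'): the target uses the max over
# next actions (V_hat), so it estimates Q_opt regardless of which policy
# generated the data (off-policy).
def q_learning_update(Q, actions, s, a, r, s2, eta=0.1, gamma=0.9):
    v_hat = max((Q.get((s2, a2), 0.0) for a2 in actions(s2)), default=0.0)
    target = r + gamma * v_hat
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
```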

### Exploration vs Exploitation

Extreme 1: No exploration, all exploitation - choose the action that leads to the highest $\hat{Q} _{opt}$ (too greedy)

$$\pi _{act}(s) = \arg\max _{a \in \text{Actions}(s)} \hat{Q} _{opt}(s, a)$$

Extreme 2: No exploitation, all exploration - choose an action randomly. Leads to bad average utility because exploration is not guided

$$ \pi _{act}(s) = \text{random from Actions(s)}$$

### Epsilon Greedy Policy

Balances exploitation and exploration using a parameter, $\epsilon$:

- With probability $1 - \epsilon$, we choose exploitation
- With probability $\epsilon$, we choose exploration
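
The two cases above can be sketched as a small selection function (dictionary-backed Q-values; names are illustrative):

```python
import random

# Epsilon-greedy action selection: exploit (greedy in Q) with probability
# 1 - epsilon, explore (uniform random action) with probability epsilon.
def epsilon_greedy(Q, actions, s, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))      # exploit
```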

Problem: Large state spaces are hard to explore

Stochastic Gradient Descent Update:

$$\hat{Q} _{\pi}(s, a) = \hat{Q} _{\pi}(s, a) - \eta(\hat{Q} _{\pi}(s, a) - u)$$

**Rote Learning:** Every $\hat{Q} _{\pi}(s, a)$ has a different value (not really learning) $\rightarrow$ doesn’t generalize to unseen states/actions

### Function Approximation

**Core Idea:** Instead of a lookup table, use a linear regression model with features $\phi(s, a)$ and weights $w$ to generalize to unseen states: $$ \hat{Q} _{opt}(s, a; w) = w \cdot \phi(s, a)$$

**Q-Learning with Function Approximation:**

$$ w = w - \eta(\hat{Q} _{opt}(s, a; w) - (r + \gamma\hat{V} _{opt}(s')))\phi(s, a)$$

Where $\hat{Q} _{opt}(s, a; w) - (r + \gamma\hat{V} _{opt}(s'))$ is the prediction minus the target.
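The weight update can be sketched with sparse feature dictionaries (a common representation; the `phi` interface and all names here are assumptions). Note that with indicator features per $(s, a)$ pair this reduces to tabular Q-learning:

```python
from collections import defaultdict

# Q-learning with linear function approximation: Q(s, a; w) = w . phi(s, a).
# `w` maps feature names to weights; `phi(s, a)` returns a sparse feature
# dict; `actions(s)` lists the actions available in s.
def q_fa_update(w, phi, actions, s, a, r, s2, eta=0.1, gamma=0.9):
    def q(state, action):
        return sum(w[f] * v for f, v in phi(state, action).items())
    v_hat = max((q(s2, a2) for a2 in actions(s2)), default=0.0)
    residual = q(s, a) - (r + gamma * v_hat)   # prediction minus target
    for f, v in phi(s, a).items():             # gradient step on each active feature
        w[f] -= eta * residual * v
```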

### Miscellaneous

- Partial Feedback: You can only learn about actions you take
- State: Rewards depend on previous state
- Reinforcement Learning: Partial Feedback + State
- Deep RL: Use neural networks to estimate $\hat{Q} _{opt}(s, a)$
- Policy Gradient: Use neural networks to train a policy directly