cs234 / lecture 6 - cnns and deep q learning
Resources:
Limitations to Linear Value Function Approximation
- Assumes that the value function is a weighted combination of a set of features where each feature is a function of the state
- Requires carefully hand-designing a feature set
- Would be much better if you could go directly from states to values without needing a hand-designed feature set
- Local representations (like kernel approaches) don't scale well to enormous state spaces and datasets
Deep Neural Networks
- Composition of multiple functions
- Backpropagates the gradient of the loss function using the chain rule
- The $h$ functions must be differentiable
- Linear: $z = Wx + b$
- Nonlinear: $h = \sigma(z)$ (activation functions like sigmoid or ReLU)
- Benefits:
- Uses distributed representations instead of local ones
- Universal function approximator
- Potentially need exponentially fewer nodes/parameters (compared to a shallow net) to represent the same function
- Can learn parameters using stochastic gradient descent
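As a rough sketch of these ideas (the two-layer network, sizes, and learning rate below are made up for illustration, not from the lecture), here is a tiny numpy network built as a composition of differentiable functions, with one SGD step computed via the chain rule:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Composition of functions: f(s) = W2 @ sigmoid(W1 @ s + b1) + b2
W1, b1 = rng.normal(size=(16, 4)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)) * 0.1, np.zeros(1)

def forward(s):
    h = sigmoid(W1 @ s + b1)          # nonlinear hidden layer
    return (W2 @ h + b2), h           # linear output layer

# One SGD step on a squared-error loss L = 0.5 * (f(s) - y)^2
s, y, alpha = rng.normal(size=4), 1.0, 0.01
pred, h = forward(s)
err = pred - y                        # dL/df
# Chain rule (backpropagation) through each composed function
grad_W2 = np.outer(err, h)
grad_b2 = err
dh = (W2.T @ err) * h * (1 - h)       # sigmoid'(z) = h * (1 - h)
grad_W1 = np.outer(dh, s)
grad_b1 = dh

W2 -= alpha * grad_W2; b2 -= alpha * grad_b2
W1 -= alpha * grad_W1; b1 -= alpha * grad_b1
```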
Convolutional Neural Networks
- Fully Connected Networks: require an enormous amount of training data for visual inputs
- High space-time complexity
- Lack of structure + locality of information
- Convolutional Neural Networks
- Considers the local structure + extraction of features
- Not fully connected
- Locality of processing
- Weight sharing for parameter reduction
- Local parameters that are identical for groups of pixels in the image
- Learns the parameters of multiple convolutional filter banks
- Compression extracts salient features and favors generalization
- Locality of Information
- Receptive Field: The input patch that a hidden unit is connected to
- Stride: How much you move the patch
- Zero Padding: How many 0s to add to either side of an input layer
- Activation value of a hidden-layer neuron: $a_{i,j} = \sigma\left(b + \sum_{k}\sum_{l} w_{k,l}\, x_{i+k,\, j+l}\right)$, where $w$ and $b$ are the shared weights and bias applied to the receptive field
- Feature Maps:
- All neurons in the first hidden layer capture the same feature, just at different locations in the feature map
- Feature: A pattern that makes a neuron produce a certain response level
- Pooling Layers:
- Used immediately after convolutional layers
- Simplifies / compresses information in the output from a convolutional layer
- Takes each feature map output from convolutional layer and prepares a condensed feature map
- Final Layer Fully Connected:
- Prior to the final layer, the network is building a feature representation
- Final layer is used to make a prediction
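A rough numpy illustration of the pieces above (image size, filter values, stride, and padding are arbitrary choices, not from the lecture): one shared 3x3 filter slides over a zero-padded image to produce a feature map, each output unit looking only at its own receptive field, and a 2x2 max pool then condenses that map.

```python
import numpy as np

def conv2d_single(image, weights, bias, stride=1, pad=1):
    """One feature map: a[i,j] = relu(b + sum_kl w[k,l] * x[i*stride+k, j*stride+l])."""
    x = np.pad(image, pad)                     # zero padding on all sides
    kh, kw = weights.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]  # receptive field
            fmap[i, j] = np.sum(weights * patch) + bias            # shared weights + bias
    return np.maximum(fmap, 0.0)               # ReLU activation

def max_pool(fmap, size=2):
    """Condense the feature map by taking the max over non-overlapping patches."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)
weights = np.random.randn(3, 3) * 0.1          # one shared 3x3 filter
feature_map = conv2d_single(image, weights, bias=0.0, stride=1, pad=1)
pooled = max_pool(feature_map)                 # e.g. 8x8 feature map -> 4x4
```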
Deep Q-Networks (DQN)
- Represent value function, policy, and model using DNNs
DQNs in Atari
- End-to-end learning of values $Q(s, a)$ from pixels $s$
- Input state $s$ is a stack of raw pixels from the last 4 frames
- Allows you to capture the velocity and position of the ball
- Output is $Q(s, a)$ for the 18 joystick/button presses
- Reward is the change in score for that step
- Same network architecture and hyperparameters across all games
- Minimize MSE loss between the Q-network output and the Q-learning target using stochastic gradient descent
- Divergence with Q-learning
- Correlations between samples
- Non-stationary targets
- Address divergence with
- Experience Replay
- Fixed Q Targets
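A hedged PyTorch sketch of the Atari setup described above (the layer sizes are placeholders rather than the published DeepMind architecture): a stack of 4 frames goes in, one Q-value per joystick/button action comes out, and an MSE loss compares the chosen Q-values against TD targets. The two fixes are sketched in the sections that follow.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # input: stack of 4 frames
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one Q-value per action
        )

    def forward(self, frames):                           # frames: (batch, 4, 84, 84)
        return self.net(frames)

q_net = QNetwork()
frames = torch.zeros(32, 4, 84, 84)                      # batch of stacked frames
q_values = q_net(frames)                                 # (32, 18)
actions = torch.randint(0, 18, (32,))
targets = torch.zeros(32)                                # placeholder TD targets
q_sa = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_sa, targets)             # MSE loss, minimized with SGD
```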
DQN: Experience Replay
- To help remove correlations, store a dataset (a replay buffer) $\mathcal{D}$ of prior experience
- To perform experience replay, repeat the following:
- Sample an experience tuple from the dataset: $(s, a, r, s') \sim \mathcal{D}$
- Compute the target value for the sampled tuple: $r + \gamma \max_{a'} \hat{Q}(s', a'; w)$
- Use stochastic gradient descent to update the network weights: $\Delta w = \alpha \left( r + \gamma \max_{a'} \hat{Q}(s', a'; w) - \hat{Q}(s, a; w) \right) \nabla_w \hat{Q}(s, a; w)$
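A minimal replay-buffer sketch in plain Python (the capacity and batch size are arbitrary choices): transitions from prior experience are stored and later sampled uniformly, which breaks up the correlations between consecutive samples.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is dropped when full

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)

# Usage inside a training loop (env / q_net names are hypothetical):
#   buffer.add(s, a, r, s_next, done)
#   for s, a, r, s_next, done in buffer.sample(32):
#       target = r if done else r + gamma * max over a' of Q_hat(s_next, a'; w)
#       ... SGD step on (Q_hat(s, a; w) - target)^2
```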
Fixed Q Targets
- To improve stability, fix the target weights used in the target calculation for multiple updates
- Fix the $w$ used in the target $r + \gamma \max_{a'} \hat{Q}(s', a'; w)$ for several rounds
- The fixed target acts as an approximation of the oracle (the true target value)
- Use a different set of weights to compute the target than the set being updated
- $w^-$: set of weights used for target computation; gets updated periodically (i.e., every fixed number of steps)
- $w$: set of weights being updated
- Computation of the target changes to: $r + \gamma \max_{a'} \hat{Q}(s', a'; w^-)$
- SGD update: $\Delta w = \alpha \left( r + \gamma \max_{a'} \hat{Q}(s', a'; w^-) - \hat{Q}(s, a; w) \right) \nabla_w \hat{Q}(s, a; w)$
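A hedged PyTorch sketch of fixed Q-targets (the tiny network, its sizes, and the sync interval are invented for illustration): a separate copy of the weights, $w^-$, produces the targets and is only synced to the online weights every fixed number of steps.

```python
import copy
import torch
import torch.nn as nn

# Tiny stand-in Q-network (4-dim state, 2 actions); sizes are illustrative only.
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = copy.deepcopy(q_net)          # w^-: separate weights used only for targets
TARGET_UPDATE_EVERY = 1000                 # sync w^- <- w every fixed number of steps

def td_targets(rewards, next_states, dones, gamma=0.99):
    # Target r + gamma * max_a' Q_hat(s', a'; w^-), held fixed w.r.t. the weights being updated
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * (1 - dones) * next_q   # dones: float 0./1. flags

def sgd_step(optimizer, states, actions, rewards, next_states, dones):
    targets = td_targets(rewards, next_states, dones)
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q_hat(s, a; w)
    loss = nn.functional.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()                        # gradient flows only through w, not w^-
    optimizer.step()

def maybe_sync_target(step):
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())   # w^- <- w
```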
Double DQN
- Similar idea to Double Q-Learning, but we have one network to select actions and one network to evaluate actions
- $w$: weights for the network used to select actions
- $w'$: weights for the network used to evaluate actions
- Action Selection: $a^* = \arg\max_{a} \hat{Q}(s', a; w)$
- Action Evaluation: $r + \gamma \hat{Q}(s', a^*; w')$
- Swap $w$ and $w'$ on each timestep, which ensures both sets of weights get updated frequently
- Avoids maximization bias
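A sketch of the Double DQN target under the same toy sizes as above (the two networks and their names are placeholders, not the lecture's code): one set of weights picks the argmax action, the other set evaluates it.

```python
import torch
import torch.nn as nn

online_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # weights w (selects)
eval_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))    # weights w' (evaluates)

def double_dqn_targets(rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Action selection with one set of weights: a* = argmax_a Q(s', a; w)
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the other set: Q(s', a*; w')
        next_q = eval_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1 - dones) * next_q
```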
Prioritized Replay
- Prioritizing which replays you sample leads to exponential improvements in convergence (if we had a perfect oracle that could tell us the next replay to choose)
- Heuristic: Prioritize Tuples Based on DQN Error:
- Let $i$ be the index of the $i$-th experience tuple $(s_i, a_i, r_i, s_{i+1})$
- Sample tuples using a priority function
- Priority of a tuple is proportional to the DQN error: $p_i = \left| r + \gamma \max_{a'} \hat{Q}(s_{i+1}, a'; w^-) - \hat{Q}(s_i, a_i; w) \right|$
- Update $p_i$ after every update (initially set to 0)
- Probability of selecting that tuple: $P(i) = \frac{p_i}{\sum_k p_k}$
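A minimal prioritized-replay sketch in numpy (the capacity, the epsilon added to priorities, and the choice to start new tuples at the current max priority are assumptions for illustration, not from the lecture): priorities track the absolute DQN error, and the sampling probability of a tuple is its priority divided by the sum of all priorities.

```python
import numpy as np

class PrioritizedReplay:
    """Sample tuples with probability proportional to their last absolute DQN (TD) error."""
    def __init__(self, capacity=10_000, eps=1e-6):
        self.data, self.priorities = [], []
        self.capacity, self.eps = capacity, eps

    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        # Assumption: new tuples start at the current max priority so they get sampled at least once
        self.priorities.append(max(self.priorities, default=1.0))

    def sample(self, batch_size=32):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                       # P(i) = p_i / sum_k p_k
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):        # p_i proportional to the DQN error
            self.priorities[i] = abs(err) + self.eps
```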
Dueling DQN
- Intuition: Features needed for value are not necessarily the features you need to determine the benefit of an action
- Advantage Function: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
- Dueling DQNs separate the value function and the advantage function, estimate them separately, and then recombine them to form the Q function
- Not identifiable: given a $Q$, we cannot decompose it into a unique $V$ and $A$
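A hedged PyTorch sketch of the dueling architecture (layer sizes are illustrative): shared features feed a value stream and an advantage stream, and subtracting the mean advantage when recombining is one common way of addressing the identifiability issue.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Separate value and advantage streams, recombined into Q (sizes are illustrative)."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)               # V(s; w)
        self.advantage = nn.Linear(64, n_actions)   # A(s, a; w)

    def forward(self, states):
        h = self.features(states)
        v, a = self.value(h), self.advantage(h)
        # Subtracting the mean advantage pins down the decomposition: otherwise adding a
        # constant to A and subtracting it from V would leave Q unchanged (not identifiable).
        return v + a - a.mean(dim=1, keepdim=True)

q_net = DuelingQNetwork()
q_values = q_net(torch.zeros(32, 4))                # (32, 2) Q-values
```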