Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Resources
Paper
Introduction
Two major challenges:
Model-free RL methods suffer from high sample complexity
New samples needed at each gradient step
Off-policy methods reuse past experience
Not practical with conventional policy gradient formulations
Practical with Q-learning, but Q-learning is hard to combine with continuous action/state spaces and neural networks
Methods are brittle with respect to their hyperparameters (e.g., learning rates, exploration rates)
DDPG does combine Q-learning with experience replay, but it is sensitive to hyperparameters
Maximum entropy framework
Augments maximum reward RL with entropy maximization
Alters the RL objective, but the original objective can be recovered via the temperature parameter
Finds diverse experiences; helps exploration
Prior maximum entropy algorithms
On-policy: suffer from poor sample complexity
Off-policy: require approximate inference procedures in continuous action spaces
Soft Actor-Critic (SAC)
Sample-efficient and stable learning
Avoids approximate inference (which soft Q-learning, for example, requires)
Related Work
3 Ingredients to SAC Algorithm:
Actor-critic architecture with separate policy + critic networks
Off-policy formulation for data reuse and efficiency
Entropy maximization for stability and exploration
Actor critic
Policy Iteration: alternates between policy evaluation and policy improvement
Built on policy gradient for actor
Some use entropy regularization
DDPG is hard to stabilize and brittle to hyperparameters
Instability stems from the interplay between the Q function and the deterministic actor
SAC uses stochastic actor
Prior Maximum Entropy RL
Soft Q-learning uses the actor as a sampler for learning the Q-value function
Convergence depends on how well the actor approximates the posterior
Maximum entropy RL methods usually do not exceed state-of-the-art off-policy RL methods in performance
Preliminaries
Notation
Follow standard MDP notation
State marginal (how frequently a state is visited under $\pi$): $\rho _\pi(s_t)$
State-action marginal (how frequently a state-action pair occurs under $\pi$): $\rho _\pi(s_t, a_t)$
Maximum Entropy Reinforcement Learning
Augment the expected sum of rewards with expected entropy
$J(\pi) = \sum^T _{t=0} \mathbb{E} _{(s_t, a_t) \sim \rho _\pi}[r(s_t, a_t) + \alpha H(\pi (\cdot \vert s_t))]$
$\alpha$: Temperature - controls the relative importance of entropy vs. reward (i.e., the stochasticity of the policy)
Setting $\alpha = 0$ recovers the original maximum-reward objective
The policy is incentivized to explore more widely while still giving up on clearly unpromising avenues
Captures multiple modes of near-optimal behavior
Equally attractive actions are assigned roughly equal probability
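To make the objective concrete, here is a minimal Python sketch (not from the paper) of a Monte Carlo estimate of the entropy-augmented return for one sampled trajectory; `rewards`, `log_probs`, and `alpha` are illustrative names, and $-\log\pi(a_t \vert s_t)$ serves as a single-sample estimate of the entropy bonus.

```python
import numpy as np

def max_entropy_return(rewards, log_probs, alpha=0.2):
    """Entropy-augmented return for one sampled trajectory.

    rewards[t]   : r(s_t, a_t)
    log_probs[t] : log pi(a_t | s_t) for the action actually taken, so
                   -log_probs[t] is a single-sample estimate of H(pi(.|s_t)).
    alpha        : temperature; alpha = 0 recovers the standard RL objective.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    log_probs = np.asarray(log_probs, dtype=np.float64)
    return np.sum(rewards + alpha * (-log_probs))
```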
From Soft Policy Iteration to Soft Actor-Critic
Derivation of Soft Policy Iteration
We want to compute a policy for the maximum entropy objective
Policy Evaluation:
We can repeatedly apply the soft Bellman backup operator: $\mathcal{T}^{\pi} Q(s _t, a _t) = r(s _t, a _t) + \gamma \mathbb{E} _{s _{t+1} \sim p}[V(s _{t+1})]$
Soft state value function: $V(s_t) = \mathbb{E} _{a_t \sim \pi}[Q(s_t, a_t) - \log \pi (a_t \vert s_t)]$
Lemma 1 (Soft Policy Evaluation): Applying the Bellman backup repeatedly to $Q^k$ causes $Q^k$ to converge to the soft Q-value function of $\pi$
Policy Improvement:
Update policy towards exponential of new Q value function
For tractable policies, we restrict $\pi \in \Pi$ (i.e., where $\Pi$ is a family of parameterized policies)
The update is an information projection: the new policy is the member of $\Pi$ that minimizes the KL divergence to the (normalized) exponential of the Q-function
$\pi _{new} = \arg\min _{\pi' \in \Pi} D _{KL}\left(\pi' (\cdot \vert s_t) \,\Big\Vert\, \frac{\exp(Q^{\pi _{old}}(s_t, \cdot))}{Z^{\pi _{old}}(s_t)}\right)$
We can ignore the normalization term $Z^{\pi _{old}}(s_t)$: it is intractable in general, but it does not contribute to the gradient with respect to the new policy
Lemma 2 (Soft Policy Improvement): The Q-value of the new policy is greater than or equal to the Q-value of the old policy for all state-action pairs
In SAC, we use neural networks to approximate the Q-value function in continuous domains
Soft Policy Iteration: repeated application of soft policy evaluation and soft policy improvement converges to an optimal policy within $\Pi$
Running evaluation and improvement to full convergence is too computationally expensive in the continuous case, which motivates the approximations below
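As a concrete illustration, here is a minimal tabular sketch of soft policy iteration for a finite MDP, alternating one soft Bellman backup with one exponentiated-Q improvement step (rather than running each to convergence as in Lemmas 1 and 2). The temperature is absorbed into the reward scale, and all names (`P`, `R`, `soft_policy_iteration`) are assumptions for illustration.

```python
import numpy as np

def soft_policy_iteration(P, R, gamma=0.99, n_iters=200):
    """Illustrative tabular soft policy iteration (alpha absorbed into R).

    P : (S, A, S) transition probabilities, R : (S, A) rewards.
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(n_iters):
        # Soft policy evaluation (one backup per outer iteration here):
        # V(s) = E_{a~pi}[Q(s,a) - log pi(a|s)],  T^pi Q = r + gamma E_{s'}[V(s')]
        V = np.sum(pi * (Q - np.log(pi + 1e-12)), axis=1)
        Q = R + gamma * P @ V
        # Soft policy improvement: pi_new(.|s) proportional to exp(Q(s,.)),
        # i.e. the KL projection onto softmax policies; the normalizer Z(s) cancels.
        pi = np.exp(Q - Q.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
    return Q, pi
```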
Soft Actor-Critic
Use a state value function, $V _{\psi}(s_t)$, soft Q function, $Q _\theta(s_t, a_t)$, and tractable policy, $\pi _\phi (a_t \vert s_t)$
Policy outputs a mean and covariance
State-value function can be estimated from sampling a single example from the policy
However, having a separate estimator stabilizes training and is convenient to train in parallel with the other networks
Soft value function objective is the squared residual error: $J_V(\psi) = \mathbb{E} _{s_t \sim D}[\frac{1}{2} (V _{\psi}(s_t) - \mathbb{E} _{a_t \sim \pi _\phi}[Q _\theta(s_t, a_t) - \log \pi _\phi(a_t \vert s_t)])^2]$
The gradient can be estimated with an unbiased estimator (sketch below): $\hat{\nabla} _{\psi} J_V(\psi) = \nabla _{\psi} V _{\psi}(s_t) (V _{\psi}(s_t) - Q _\theta(s_t, a_t) + \log \pi _\phi(a_t \vert s_t))$
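A hedged PyTorch-style sketch of this value objective, assuming `value_net`, `q_net`, and a `policy` object with a `sample(states) -> (actions, log_probs)` method (these interfaces are assumptions, not the authors' code):

```python
import torch

def value_loss(value_net, q_net, policy, states):
    # Sample a_t ~ pi_phi(.|s_t) and its log-probability (assumed interface).
    actions, log_probs = policy.sample(states)
    with torch.no_grad():
        # Single-sample target: Q(s_t, a_t) - log pi(a_t | s_t)
        target = q_net(states, actions) - log_probs
    v = value_net(states)
    return 0.5 * ((v - target) ** 2).mean()
```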
The soft Q-function minimizes the soft Bellman residual: $J_Q(\theta) = \mathbb{E} _{(s_t, a_t) \sim D}[\frac{1}{2}(Q _{\theta} (s_t, a_t) - \hat{Q}(s_t, a_t))^2]$
$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E} _{s _{t+1} \sim p}[V _{\psi}(s _{t+1})]$
Gradient: $\hat{\nabla} _{\theta} J_Q(\theta) = \nabla _{\theta} Q _{\theta}(s_t, a_t) ( Q _{\theta}(s_t, a_t) - r(s_t, a_t) - \gamma V _{\psi}(s _{t+1}))$
The target uses a value network whose weights are an exponential moving average of $\psi$, which stabilizes training; alternatively, the weights can be hard-copied periodically (sketch below)
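A corresponding sketch of the soft Q loss, assuming a `target_value_net` holding the exponentially averaged value weights and a batch of replayed transitions; the terminal mask is standard practice, added here for completeness rather than taken from the notes above.

```python
import torch

def q_loss(q_net, target_value_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # Q_hat(s_t, a_t) = r + gamma * V_target(s_{t+1}), masked at terminals.
        q_hat = r + gamma * (1.0 - done) * target_value_net(s_next)
    q = q_net(s, a)
    return 0.5 * ((q - q_hat) ** 2).mean()
```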
Policy network minimizes the expected KL divergence: $J _{\pi}(\phi) = \mathbb{E} _{s_t \sim D}[D _{KL}(\pi _{\phi} (\cdot \vert s_t)\vert\vert \frac{\exp(Q _{\theta}(s_t, \cdot))}{Z _{\theta}(s_t)})]$
Because the target density is the Q-function, which is differentiable, we can use the reparameterization trick to obtain a lower-variance gradient estimator
$a_t = f _{\phi}(\epsilon_t ; s_t)$, where $\epsilon_t$ is random noise sampled from a fixed distribution (e.g., a spherical Gaussian)
New policy objective: $J _{\pi}(\phi) = \mathbb{E} _{s_t \sim D, \epsilon_t \sim \mathcal{N}}[\log \pi _{\phi}(f _{\phi}(\epsilon_t ; s_t) \vert s_t) - Q _{\theta}(s_t, f _\phi (\epsilon_t ; s_t))]$
Gradient: $\hat{\nabla} _\phi J _\pi (\phi) = \nabla _\phi \log \pi _\phi (a_t \vert s_t) + (\nabla _{a_t} \log \pi _\phi (a_t \vert s_t) - \nabla _{a_t} Q(s_t, a_t))\nabla _\phi f _\phi(\epsilon_t ; s_t)$, evaluated at $a_t = f _\phi(\epsilon_t ; s_t)$
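A hedged sketch of the reparameterized policy: a Gaussian head outputs a mean and diagonal log-std, `rsample` implements $a_t = f _\phi(\epsilon_t ; s_t) = \mu + \sigma \epsilon_t$, and the loss averages $\log \pi _\phi(a_t \vert s_t) - Q _\theta(s_t, a_t)$ over a batch. The paper additionally squashes actions with $\tanh$ and corrects the log-probability; that detail is omitted here, and all class and function names are illustrative.

```python
import torch
from torch.distributions import Normal

class GaussianPolicy(torch.nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2 * act_dim))
        self.act_dim = act_dim

    def rsample(self, states):
        mu, log_std = self.net(states).split(self.act_dim, dim=-1)
        dist = Normal(mu, log_std.clamp(-20, 2).exp())
        actions = dist.rsample()                 # a_t = mu + sigma * eps (reparameterized)
        log_probs = dist.log_prob(actions).sum(-1, keepdim=True)
        return actions, log_probs

def policy_loss(policy, q_net, states):
    actions, log_probs = policy.rsample(states)  # gradient flows through actions
    return (log_probs - q_net(states, actions)).mean()
```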
Uses two Q-functions to mitigate positive (maximization) bias; the minimum of the two is used in the value and policy updates
Collect experience and update function approximators off-policy with replay buffer
Usually take a single environment step followed by multiple gradient steps (loop sketch below)
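A structural sketch of the off-policy loop described above: one environment step per iteration, then one or more gradient steps on replayed data. A Gym-style `env` (classic 4-tuple `step` interface), a replay `buffer` with `add`/`sample`, and update callables wrapping the losses sketched earlier are all assumed, illustrative interfaces rather than the paper's code.

```python
import torch

def train(env, policy, buffer, update_fns, total_steps=100_000,
          grad_steps=1, batch_size=256):
    state = env.reset()
    for _ in range(total_steps):
        with torch.no_grad():
            action, _ = policy.rsample(torch.as_tensor(state, dtype=torch.float32))
        next_state, reward, done, _ = env.step(action.numpy())
        buffer.add(state, action.numpy(), reward, next_state, done)
        state = env.reset() if done else next_state
        for _ in range(grad_steps):
            batch = buffer.sample(batch_size)
            for update in update_fns:
                # e.g. value, both Q networks, policy, then the target-network
                # update (which can simply ignore the batch argument).
                update(batch)
```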
Experiments
Comparative Evaluation
SAC performs comparably to baseline methods (PPO, TD3, DDPG, SQL) on easier tasks and outperforms them significantly on harder ones
SAC learns faster than PPO because PPO needs large batch sizes for complex tasks
SQL can learn all tasks but has worse asymptotic performance compared to SAC
SAC is SOTA for sample efficiency and final performance
Ablation Study
Stochastic vs. Deterministic Policy: SAC has more consistent performance than its deterministic variant (stochasticity stabilizes learning)
Policy Evaluation: for evaluation, it is better to make the final policy deterministic (e.g., act with the mean of the policy)
Reward Scale: SAC is sensitive to reward scale; it plays a role in the temperature
Small Rewards: Cannot exploit reward signal; degrades performance
Large Rewards: Learns quickly at first but then policy becomes near deterministic
Only hyperparameter that requires fine-tuning
Target Network Update: a large update rate causes instability, while a small one slows training
Weights can also be copied over periodically (hard updates); it is then better to take more gradient steps between environment steps, but this increases computational cost
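The two update schemes mentioned above can be sketched as follows (the Polyak rate `tau=0.005` is an illustrative value):

```python
import torch

def soft_update(target_net, source_net, tau=0.005):
    # Exponential moving average (Polyak) update of the target weights.
    with torch.no_grad():
        for t, s in zip(target_net.parameters(), source_net.parameters()):
            t.mul_(1.0 - tau).add_(tau * s)

def hard_update(target_net, source_net):
    # Periodic hard copy of the weights.
    target_net.load_state_dict(source_net.state_dict())
```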
April 23, 2025 · research