Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Resources
Paper
Introduction
Two major challenges:
Model-free RL methods suffer from high sample complexity
New samples needed at each gradient step
Off-policy methods reuse past experience
Not practical with conventional policy gradient formulations
Practical with Q-learning, but Q-learning is hard to combine with continuous action/state spaces and neural networks
Methods are brittle with respect to their hyperparameters (e.g., learning rates, exploration rates)
DDPG does combine Q-learning with experience replay, but it is sensitive to hyperparameters
Maximum entropy framework
Augments maximum reward RL with entropy maximization
Alters the RL objective, but the original objective can be recovered via the temperature parameter
Finds diverse experiences; helps exploration
Prior maximum entropy algorithms
On-policy: suffer from poor sample complexity
Off-policy: require approximate inference procedures in continuous action spaces
Soft Actor-Critic (SAC)
Sample-efficient and stable learning
Avoids approximate inference (which soft Q-learning, for example, requires)
Related Work
3 Ingredients to SAC Algorithm:
Actor-critic architecture with separate policy + critic networks
Off-policy formulation for data reuse and efficiency
Entropy maximization for stability and exploration
Actor critic
Policy Iteration: alternates between policy evaluation and policy improvement
Built on policy gradient for actor
Some use entropy regularization
DDPG is hard to stabilize and brittle to hyperparameters
Instability stems from the interplay between the Q function and the deterministic actor
SAC uses stochastic actor
Prior Maximum Entropy RL
Soft Q-learning uses the actor as a sampler for learning the Q-value function
Convergence depends on how well the actor approximates the posterior
Maximum entropy RL methods usually do not exceed state-of-the-art off-policy RL methods in performance
Preliminaries
Notation
Follow standard MDP notation
State marginal (how frequently a state is visited under $\pi$): $\rho _\pi(s_t)$
State-action marginal (how frequently a state-action pair occurs under $\pi$): $\rho _\pi(s_t, a_t)$
Maximum Entropy Reinforcement Learning
Augment the expected sum of rewards with expected entropy
$J(\pi) = \sum^T _{t=0} \mathbb{E} _{(s_t, a_t) \sim \rho _\pi}[r(s_t, a_t) + \alpha H(\pi (\cdot \vert s_t))]$
$\alpha$: Temperature - controls the relative importance of entropy vs. reward (i.e., the stochasticity of the policy)
Setting $\alpha = 0$ recovers the original maximum-reward objective
The policy is incentivized to explore more widely while still giving up on clearly unpromising avenues
Captures multiple modes of near-optimal behavior
Equally attractive actions are assigned roughly equal probability
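To make the objective concrete, here is a minimal Python sketch (not from the paper) of a Monte Carlo estimate of the entropy-augmented return for one sampled trajectory; `rewards`, `log_probs`, and `alpha` are illustrative names, and $-\log\pi(a_t \vert s_t)$ serves as a single-sample estimate of the entropy bonus.

```python
import numpy as np

def max_entropy_return(rewards, log_probs, alpha=0.2):
    """Entropy-augmented return for one sampled trajectory.

    rewards[t]   : r(s_t, a_t)
    log_probs[t] : log pi(a_t | s_t) for the action actually taken, so
                   -log_probs[t] is a single-sample estimate of H(pi(.|s_t)).
    alpha        : temperature; alpha = 0 recovers the standard RL objective.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    log_probs = np.asarray(log_probs, dtype=np.float64)
    return np.sum(rewards + alpha * (-log_probs))
```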
From Soft Policy Iteration to Soft Actor-Critic
Derivation of Soft Policy Iteration
We want to compute a policy for the maximum entropy objective
Policy Evaluation:
We can repeatedly apply the soft Bellman backup operator: $\mathcal{T}^{\pi} Q(s _t, a _t) = r(s _t, a _t) + \gamma \mathbb{E} _{s _{t+1} \sim p}[V(s _{t+1})]$
Soft state value function: $V(s_t) = \mathbb{E} _{a_t \sim \pi}[Q(s_t, a_t) - \log \pi (a_t \vert s_t)]$
Lemma 1 (Soft Policy Evaluation): Applying the Bellman backup repeatedly to $Q^k$ causes $Q^k$ to converge to the soft Q-value function of $\pi$
Policy Improvement:
Update policy towards exponential of new Q value function
For tractable policies, we restrict $\pi \in \Pi$ (i.e., where $\Pi$ is a family of parameterized policies)
The update is an information projection: the new policy is the member of $\Pi$ that minimizes the KL divergence to the (normalized) exponential of the Q-function
$\pi _{new} = \arg\min _{\pi' \in \Pi} D _{KL}\left(\pi' (\cdot \vert s_t) \,\Big\Vert\, \frac{\exp(Q^{\pi _{old}}(s_t, \cdot))}{Z^{\pi _{old}}(s_t)}\right)$
We can ignore the normalization term $Z^{\pi _{old}}(s_t)$: it is intractable in general, but it does not contribute to the gradient with respect to the new policy
Lemma 2 (Soft Policy Improvement): The Q-value of the new policy is greater than or equal to the Q-value of the old policy for all state-action pairs
In SAC, we use neural networks to approximate the Q-value function in continuous domains
Soft Policy Iteration: repeated application of soft policy evaluation and soft policy improvement converges to an optimal policy within $\Pi$
Running evaluation and improvement to full convergence is too computationally expensive in the continuous case, which motivates the approximations below
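As a concrete illustration, here is a minimal tabular sketch of soft policy iteration for a finite MDP, alternating one soft Bellman backup with one exponentiated-Q improvement step (rather than running each to convergence as in Lemmas 1 and 2). The temperature is absorbed into the reward scale, and all names (`P`, `R`, `soft_policy_iteration`) are assumptions for illustration.

```python
import numpy as np

def soft_policy_iteration(P, R, gamma=0.99, n_iters=200):
    """Illustrative tabular soft policy iteration (alpha absorbed into R).

    P : (S, A, S) transition probabilities, R : (S, A) rewards.
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(n_iters):
        # Soft policy evaluation (one backup per outer iteration here):
        # V(s) = E_{a~pi}[Q(s,a) - log pi(a|s)],  T^pi Q = r + gamma E_{s'}[V(s')]
        V = np.sum(pi * (Q - np.log(pi + 1e-12)), axis=1)
        Q = R + gamma * P @ V
        # Soft policy improvement: pi_new(.|s) proportional to exp(Q(s,.)),
        # i.e. the KL projection onto softmax policies; the normalizer Z(s) cancels.
        pi = np.exp(Q - Q.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
    return Q, pi
```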
Soft Actor-Critic
Use a state value function, $V _{\psi}(s_t)$, soft Q function, $Q _\theta(s_t, a_t)$, and tractable policy, $\pi _\phi (a_t \vert s_t)$
Policy outputs a mean and covariance
State-value function can be estimated from sampling a single example from the policy
However, having a separate estimator stabilizes training and is convenient to train in parallel with the other networks
Soft value function objective is the squared residual error: $J_V(\psi) = \mathbb{E} _{s_t \sim D}[\frac{1}{2} (V _{\psi}(s_t) - \mathbb{E} _{a_t \sim \pi _\phi}[Q _\theta(s_t, a_t) - \log \pi _\phi(a_t \vert s_t)])^2]$
The gradient can be estimated with an unbiased estimator (sketch below): $\hat{\nabla} _{\psi} J_V(\psi) = \nabla _{\psi} V _{\psi}(s_t) (V _{\psi}(s_t) - Q _\theta(s_t, a_t) + \log \pi _\phi(a_t \vert s_t))$
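A hedged PyTorch-style sketch of this value objective, assuming `value_net`, `q_net`, and a `policy` object with a `sample(states) -> (actions, log_probs)` method (these interfaces are assumptions, not the authors' code):

```python
import torch

def value_loss(value_net, q_net, policy, states):
    # Sample a_t ~ pi_phi(.|s_t) and its log-probability (assumed interface).
    actions, log_probs = policy.sample(states)
    with torch.no_grad():
        # Single-sample target: Q(s_t, a_t) - log pi(a_t | s_t)
        target = q_net(states, actions) - log_probs
    v = value_net(states)
    return 0.5 * ((v - target) ** 2).mean()
```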
The soft Q-function minimizes the soft Bellman residual: $J_Q(\theta) = \mathbb{E} _{(s_t, a_t) \sim D}[\frac{1}{2}(Q _{\theta} (s_t, a_t) - \hat{Q}(s_t, a_t))^2]$
$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E} _{s _{t+1} \sim p}[V _{\psi}(s _{t+1})]$
Gradient: $\hat{\nabla} _{\theta} J_Q(\theta) = \nabla _{\theta} Q _{\theta}(s_t, a_t) ( Q _{\theta}(s_t, a_t) - r(s_t, a_t) - \gamma V _{\psi}(s _{t+1}))$
The target uses a value network whose weights are an exponential moving average of $\psi$, which stabilizes training; alternatively, the weights can be hard-copied periodically (sketch below)
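A corresponding sketch of the soft Q loss, assuming a `target_value_net` holding the exponentially averaged value weights and a batch of replayed transitions; the terminal mask is standard practice, added here for completeness rather than taken from the notes above.

```python
import torch

def q_loss(q_net, target_value_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # Q_hat(s_t, a_t) = r + gamma * V_target(s_{t+1}), masked at terminals.
        q_hat = r + gamma * (1.0 - done) * target_value_net(s_next)
    q = q_net(s, a)
    return 0.5 * ((q - q_hat) ** 2).mean()
```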
Policy network minimizes the expected KL divergence: $J _{\pi}(\phi) = \mathbb{E} _{s_t \sim D}[D _{KL}(\pi _{\phi} (\cdot \vert s_t)\vert\vert \frac{\exp(Q _{\theta}(s_t, \cdot))}{Z _{\theta}(s_t)})]$
Because the target density is the Q-function, which is differentiable, we can use the reparameterization trick to obtain a lower-variance gradient estimator
$a_t = f _{\phi}(\epsilon_t ; s_t)$, where $\epsilon_t$ is random noise sampled from a fixed distribution (e.g., a spherical Gaussian)
New policy objective: $J _{\pi}(\phi) = \mathbb{E} _{s_t \sim D, \epsilon_t \sim \mathcal{N}}[\log \pi _{\phi}(f _{\phi}(\epsilon_t ; s_t) \vert s_t) - Q _{\theta}(s_t, f _\phi (\epsilon_t ; s_t))]$
Gradient: $\hat{\nabla} _\phi J _\pi (\phi) = \nabla _\phi \log \pi _\phi (a_t \vert s_t) + (\nabla _{a_t} \log \pi _\phi (a_t \vert s_t) - \nabla _{a_t} Q(s_t, a_t))\nabla _\phi f _\phi(\epsilon_t ; s_t)$, evaluated at $a_t = f _\phi(\epsilon_t ; s_t)$
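A hedged sketch of the reparameterized policy: a Gaussian head outputs a mean and diagonal log-std, `rsample` implements $a_t = f _\phi(\epsilon_t ; s_t) = \mu + \sigma \epsilon_t$, and the loss averages $\log \pi _\phi(a_t \vert s_t) - Q _\theta(s_t, a_t)$ over a batch. The paper additionally squashes actions with $\tanh$ and corrects the log-probability; that detail is omitted here, and all class and function names are illustrative.

```python
import torch
from torch.distributions import Normal

class GaussianPolicy(torch.nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2 * act_dim))
        self.act_dim = act_dim

    def rsample(self, states):
        mu, log_std = self.net(states).split(self.act_dim, dim=-1)
        dist = Normal(mu, log_std.clamp(-20, 2).exp())
        actions = dist.rsample()                 # a_t = mu + sigma * eps (reparameterized)
        log_probs = dist.log_prob(actions).sum(-1, keepdim=True)
        return actions, log_probs

def policy_loss(policy, q_net, states):
    actions, log_probs = policy.rsample(states)  # gradient flows through actions
    return (log_probs - q_net(states, actions)).mean()
```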
Uses two Q-functions to mitigate positive (maximization) bias; the minimum of the two is used in the value and policy updates
Collect experience and update function approximators off-policy with replay buffer
Usually take a single environment step followed by multiple gradient steps (loop sketch below)
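A structural sketch of the off-policy loop described above: one environment step per iteration, then one or more gradient steps on replayed data. A Gym-style `env` (classic 4-tuple `step` interface), a replay `buffer` with `add`/`sample`, and update callables wrapping the losses sketched earlier are all assumed, illustrative interfaces rather than the paper's code.

```python
import torch

def train(env, policy, buffer, update_fns, total_steps=100_000,
          grad_steps=1, batch_size=256):
    state = env.reset()
    for _ in range(total_steps):
        with torch.no_grad():
            action, _ = policy.rsample(torch.as_tensor(state, dtype=torch.float32))
        next_state, reward, done, _ = env.step(action.numpy())
        buffer.add(state, action.numpy(), reward, next_state, done)
        state = env.reset() if done else next_state
        for _ in range(grad_steps):
            batch = buffer.sample(batch_size)
            for update in update_fns:
                # e.g. value, both Q networks, policy, then the target-network
                # update (which can simply ignore the batch argument).
                update(batch)
```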
Experiments
Comparative Evaluation
SAC performs comparably to baseline methods (PPO, TD3, DDPG, SQL) on easier tasks and outperforms them significantly on harder ones
SAC learns faster than PPO because PPO needs large batch sizes for complex tasks
SQL can learn all tasks but has worse asymptotic performance compared to SAC
SAC is SOTA for sample efficiency and final performance
Ablation Study
Stochastic vs. Deterministic Policy: SAC has more consistent performance than its deterministic variant (stochasticity stabilizes learning)
Policy Evaluation: for evaluation, it is better to make the final policy deterministic (e.g., act with the mean of the policy)
Reward Scale: SAC is sensitive to reward scale; it plays a role in the temperature
Small Rewards: Cannot exploit reward signal; degrades performance
Large Rewards: Learns quickly at first but then policy becomes near deterministic
Only hyperparameter that requires fine-tuning
Target Network Update: a large update rate causes instability, while a small one slows training
Weights can also be copied over periodically (hard updates); it is then better to take more gradient steps between environment steps, but this increases computational cost
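The two update schemes mentioned above can be sketched as follows (the Polyak rate `tau=0.005` is an illustrative value):

```python
import torch

def soft_update(target_net, source_net, tau=0.005):
    # Exponential moving average (Polyak) update of the target weights.
    with torch.no_grad():
        for t, s in zip(target_net.parameters(), source_net.parameters()):
            t.mul_(1.0 - tau).add_(tau * s)

def hard_update(target_net, source_net):
    # Periodic hard copy of the weights.
    target_net.load_state_dict(source_net.state_dict())
```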
April 23, 2025 · research