Using the variational approximation $q$ in place of the true posterior, the new reward function is: $r'(s_t, a_t, s _{t+1}) = r(s_t, a_t) + \eta D _{KL}(q(\theta; \phi _{t+1}) \vert \vert q(\theta; \phi_t))$
To parameterize the dynamics model, use Bayesian neural networks (BNNs)
Parameterized with a fully factorized Gaussian over the weights (see the sketch below)
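Since both the old and new approximate posteriors are fully factorized Gaussians, the KL term in the reward has a closed form. A minimal numpy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old):
    # Closed-form KL[q(theta; phi_new) || q(theta; phi_old)] between
    # fully factorized Gaussians over the BNN weights.
    return np.sum(
        np.log(sigma_old / sigma_new)
        + (sigma_new**2 + (mu_new - mu_old)**2) / (2.0 * sigma_old**2)
        - 0.5
    )

def augmented_reward(r, eta, mu_new, sigma_new, mu_old, sigma_old):
    # r' = r + eta * KL[q_new || q_old], the intrinsic-reward bonus
    return r + eta * kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old)
```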
Compression
Agent curiosity can be equated with compression improvement: $C(\xi_t; \phi _{t-1}) - C(\xi_t; \phi_t)$, where $C(\xi; \phi)$ is the description length of $\xi$ under model $\phi$
Description length can be expressed as the negative variational lower bound: $C(\xi; \phi) = -L[q(\theta; \phi), \xi]$
Using the identity $L[q(\theta; \phi), \xi] = \log p(\xi) - D _{KL}[q(\theta; \phi) \vert \vert p(\theta \vert \xi)]$, we can express compression improvement as: $(\log p(\xi_t) - D _{KL}[q(\theta; \phi_t) \vert \vert p(\theta \vert \xi_t)]) - (\log p(\xi_t) - D _{KL}[q(\theta; \phi _{t-1}) \vert \vert p(\theta \vert \xi_t)])$
The $\log p(\xi_t)$ terms cancel, and the first KL term becomes 0 when the approximation $q(\theta; \phi_t)$ exactly equals the posterior $p(\theta \vert \xi_t)$
Compression improvement thus comes down to the KL divergence from the posterior given $\xi _{t-1}$ to the posterior given $\xi_t$
Per transition, this is the reverse KL of the information gain: $D _{KL}(p(\theta \vert \xi_t) \vert \vert p(\theta \vert \xi_t, a_t, s _{t+1}))$
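Putting the steps together in one chain (the $\log p(\xi_t)$ terms cancel by algebra; the last step assumes the variational approximations are exact):

$C(\xi_t; \phi _{t-1}) - C(\xi_t; \phi_t) = D _{KL}[q(\theta; \phi _{t-1}) \vert \vert p(\theta \vert \xi_t)] - D _{KL}[q(\theta; \phi_t) \vert \vert p(\theta \vert \xi_t)] \approx D _{KL}[p(\theta \vert \xi _{t-1}) \vert \vert p(\theta \vert \xi_t)]$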
Implementation
BNN weight distribution is a fully factorized Gaussian: $q(\theta; \phi) = \prod _{i=1}^{\vert \Theta \vert} \mathcal{N}(\theta_i \vert \mu_i, \sigma_i^2)$
$\phi = \{\mu, \sigma\}$
Standard deviation is parameterized as the softplus $\sigma = \log(1 + e^\rho)$ to keep it positive
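A minimal sketch of this parameterization, drawing one weight sample via the reparameterization $\theta = \mu + \sigma \epsilon$ (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_weights = 100                        # |Theta|, illustrative
mu = rng.normal(0.0, 0.1, n_weights)   # per-weight means
rho = np.full(n_weights, -3.0)         # pre-softplus std parameters
sigma = np.log1p(np.exp(rho))          # sigma = log(1 + e^rho), always > 0
theta = mu + sigma * rng.standard_normal(n_weights)  # theta ~ q(.; phi)
```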
To train the BNN, the expected log-likelihood term of the variational lower bound is optimized through sampling, computing $\mathbb{E} _{\theta \sim q(\cdot;\phi)}[\log p(\mathcal{D} \vert \theta)]$
Use stochastic gradient variational Bayes (SGVB) or Bayes by Backprop to optimize the variational lower bound
Use the local reparameterization trick
Sample neuron pre-activations instead of weights, which reduces gradient variance (see the sketch below)
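A sketch of local reparameterization for one linear layer, assuming factorized Gaussian weights as above: the pre-activation for input $x$ is Gaussian with mean $x^\top \mu$ and variance $(x^2)^\top \sigma^2$, so it can be sampled directly (forward pass only; in training, gradients flow through the mean and variance):

```python
import numpy as np

def local_reparam_linear(x, mu_w, rho_w, rng):
    # Sample pre-activations z ~ N(x @ mu_w, x^2 @ sigma_w^2) directly,
    # instead of sampling a full weight matrix for every example.
    sigma2 = np.log1p(np.exp(rho_w)) ** 2   # per-weight variances
    mean = x @ mu_w                          # (batch, n_out)
    var = (x ** 2) @ sigma2                  # (batch, n_out)
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)
```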
Sample transitions from a replay pool when optimizing the variational lower bound: uniform sampling breaks temporal correlation (which destabilizes learning), yields approximately i.i.d. samples, and diminishes posterior approximation error
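A minimal FIFO replay pool sketch (names illustrative):

```python
import random
from collections import deque

pool = deque(maxlen=100_000)  # FIFO pool of (s, a, s_next) transitions

def sample_batch(batch_size=32):
    # Uniform sampling breaks the temporal correlation of consecutive
    # transitions, yielding approximately i.i.d. minibatches.
    return random.sample(pool, min(batch_size, len(pool)))
```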
The updated posterior of the dynamics model is computed as: $\phi' = \arg\min _{\phi}[\ell(q(\theta; \phi), s_t)]$
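A toy gradient-based sketch of this update for a single Gaussian weight, assuming $\ell$ combines the KL to the previous posterior with the negative expected log-likelihood of the new observation (the exact form of $\ell$, the toy scalar data, and the noise scale are assumptions here):

```python
import torch

mu_old, sigma_old = 0.0, 1.0               # previous posterior q(.; phi_{t-1})
mu = torch.tensor(0.0, requires_grad=True)
rho = torch.tensor(0.5, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=1e-2)
s_t = torch.tensor(0.8)                    # toy observed next state

for _ in range(200):
    sigma = torch.nn.functional.softplus(rho)   # sigma = log(1 + e^rho)
    # Closed-form KL[q(.; phi) || q(.; phi_old)] for univariate Gaussians
    kl = (torch.log(sigma_old / sigma)
          + (sigma**2 + (mu - mu_old)**2) / (2 * sigma_old**2) - 0.5)
    theta = mu + sigma * torch.randn(())        # reparameterized sample
    log_lik = torch.distributions.Normal(theta, 0.1).log_prob(s_t)
    opt.zero_grad()
    (kl - log_lik).backward()                   # l = KL - E_q[log p(s_t | theta)]
    opt.step()
```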
The Hessian of $\ell _{KL}$ is diagonal. With respect to $\mu_i$: $\frac{\partial^2 \ell _{KL}}{\partial \mu_i^2} = \frac{1}{\log^2(1 + e^{\rho_i})}$
With respect to $\rho_i$ (which parameterizes the standard deviation): $\frac{\partial^2 \ell _{KL}}{\partial \rho_i^2} = \frac{2e^{2\rho_i}}{(1 + e^{\rho_i})^2} \frac{1}{\log^2(1 + e^{\rho_i})}$
The KL can also be approximated via a second-order Taylor expansion: since the KL and its gradient vanish at the current $\phi$, $D _{KL}[q(\theta; \phi + \lambda\Delta\phi) \vert \vert q(\theta; \phi)] \approx \frac{1}{2}\lambda^2 \Delta\phi^\top H \Delta\phi$, with $H$ the diagonal Hessian above
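A numpy sketch of that approximation using the diagonal Hessian entries above, for an update step $\lambda \Delta\phi$ (names are illustrative):

```python
import numpy as np

def kl_taylor_second_order(delta_mu, delta_rho, rho, lam=1.0):
    # KL[q(theta; phi + lam*delta_phi) || q(theta; phi)]
    #   ~= 0.5 * lam^2 * delta_phi^T H delta_phi, with H diagonal.
    sp = np.log1p(np.exp(rho))                    # sigma = log(1 + e^rho)
    h_mu = 1.0 / sp**2                            # d^2 l_KL / d mu_i^2
    h_rho = 2.0 * np.exp(2 * rho) / ((1.0 + np.exp(rho))**2 * sp**2)
    return 0.5 * lam**2 * (np.sum(h_mu * delta_mu**2)
                           + np.sum(h_rho * delta_rho**2))
```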