Add post on "How to train your differentiable filter"

2022-05-20 22:14:32 +02:00
commit 3c48e9317d
parent a5d75b49cb
2 changed files with 206 additions and 0 deletions
--- a/bib/bibliography.bib
+++ b/bib/bibliography.bib
@@ -668,3 +668,35 @@
  isbn =	 9781119061953,
 }

+@article{kloss2021_how,
+  author =	 {Kloss, Alina and Martius, Georg and Bohg, Jeannette},
+  title =	 {How to train your differentiable filter},
+  journal =	 {Autonomous Robots},
+  volume =	 45,
+  number =	 4,
+  pages =	 {561--578},
+  year =	 2021,
+  month =	 may,
+  issn =	 {1573-7527},
+  publisher =	 {Springer US},
+  doi =		 {10.1007/s10514-021-09990-9}
+}
+
+@book{anderson2005_optim_filter,
+  author =	 {Anderson, Brian D. O. and Moore, John B.},
+  title =	 {Optimal Filtering},
+  year =	 2005,
+  publisher =	 {Dover Publications},
+  isbn =	 9780486439389,
+  series =	 {Dover Books on Electrical Engineering},
+}
+
+@book{thrun2006_probab_robot,
+  author =	 {Thrun, Sebastian and Burgard, Wolfram and Fox, Dieter},
+  title =	 {Probabilistic Robotics},
+  year =	 2006,
+  publisher =	 {The MIT Press},
+  url =		 {https://mitpress.mit.edu/books/probabilistic-robotics},
+  address =	 {Cambridge, Massachusetts},
+  isbn =	 9780262201629,
+}
--- a/posts/how-to-train-your-differentiable-filter.org
+++ b/posts/how-to-train-your-differentiable-filter.org
@@ -0,0 +1,174 @@
+---
+title: "How to train your differentiable filter"
+date: 2022-05-20
+tags: maths, dynamical systems, machine learning, autodiff
+toc: false
+---
+
+This is a short overview of the following paper [cite:@kloss2021_how]:
+
+#+begin_quote
+Kloss, Alina, Georg Martius, and Jeannette Bohg. 2021. “How to Train
+Your Differentiable Filter.” /Autonomous Robots/ 45 (4):
+561–78. https://doi.org/10.1007/s10514-021-09990-9.
+#+end_quote
+
+* Bayesian filtering for state estimation
+
+Bayesian filters[fn:bayesian-filters] are the standard method for
+probabilistic state estimation. Common examples are (extended,
+unscented) [[https://en.wikipedia.org/wiki/Kalman_filter][Kalman filters]] and [[https://en.wikipedia.org/wiki/Particle_filter][particle filters]]. These filters require
+a /process model/ predicting how the state evolves over time, and an
+/observation model/ relating a sensor value to the underlying state.
+
+[fn:bayesian-filters] {-} [cite:@thrun2006_probab_robot] contains a
+great explanation of Bayesian filters (including Kalman and particle
+filters), in the context of robotics, which is relevant for this
+paper. For a more complete overview of Kalman filters, see
+[cite:@anderson2005_optim_filter].
+
+
+The objective of a filter for state estimation is to estimate a latent
+state $\mathbf{x}$ of a dynamical system at any time step $t$ given an
+initial belief $\mathrm{bel}(\mathbf{x}_0) = p(\mathbf{x}_0)$, a
+sequence of observations $\mathbf{z}_{1\ldots t}$, and controls
+$\mathbf{u}_{0\ldots t}$.
+
+We make the Markov assumption (i.e. given the current state, future
+states and observations are conditionally independent of the history
+of past states and observations), which gives the recursive update:
+
+\[
+\begin{align*}
+\mathrm{bel}(\mathbf{x}_t) &= \eta p(\mathbf{z}_t | \mathbf{x}_t) \int p(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{u}_{t-1}) \mathrm{bel}(\mathbf{x}_{t-1}) d\mathbf{x}_{t-1}\\
+&= \eta p(\mathbf{z}_t | \mathbf{x}_t) \overline{\mathrm{bel}}(\mathbf{x}_t),
+\end{align*}
+\]
+
+where $\eta$ is a normalization factor. Computing
+$\overline{\mathrm{bel}}(\mathbf{x}_t)$ is the /prediction step/, and
+applying $p(\mathbf{z}_t | \mathbf{x}_t)$ is the /update step/ (or the
+/observation step/).
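
To make the recursion concrete, here is a minimal sketch for a state space with finitely many values, where the integral becomes a sum. The three-cell world, the transition table and the likelihood values are all made up for illustration; they do not come from the paper.

```python
def bayes_filter_step(belief, transition, likelihood):
    """One predict + update step of a discrete Bayes filter.

    belief[i]        = bel(x_{t-1} = i)
    transition[i][j] = p(x_t = j | x_{t-1} = i, u_{t-1}) for the applied control
    likelihood[j]    = p(z_t | x_t = j) for the observation actually received
    """
    n = len(belief)
    # Prediction step: the integral becomes a sum over previous states.
    predicted = [
        sum(transition[i][j] * belief[i] for i in range(n)) for j in range(n)
    ]
    # Update step: weight by the observation likelihood, then normalize (eta).
    unnormalized = [likelihood[j] * predicted[j] for j in range(n)]
    eta = 1.0 / sum(unnormalized)
    return [eta * p for p in unnormalized]


# Hypothetical 3-cell world: the robot tends to move one cell to the right.
belief = [1.0, 0.0, 0.0]
transition = [
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]
likelihood = [0.1, 0.7, 0.2]  # the sensor reading favours cell 1
posterior = bayes_filter_step(belief, transition, likelihood)
```

Kalman and particle filters implement the same two steps, but represent the belief as a Gaussian or as a set of weighted samples instead of a table.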
+
+We model the dynamics of the system through a process model $f$ and an
+observation model $h$:
+
+\[
+\begin{align*}
+\mathbf{x}_t &= f(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}, \mathbf{q}_{t-1})\\
+\mathbf{z}_t &= h(\mathbf{x}_t, \mathbf{r}_t),
+\end{align*}
+\]
+where $\mathbf{q}$ and $\mathbf{r}$ are random variables representing
+process and observation noise, respectively.
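
As an illustration, the sketch below picks about the simplest possible $f$ and $h$ (scalar, linear, with additive Gaussian noise) and runs the corresponding Kalman filter on simulated data. The specific models, noise levels and function names are assumptions chosen for the example, not the systems studied in the paper.

```python
import random


def f(x, u, q):
    """Hypothetical process model: move by the control, plus additive noise."""
    return x + u + q


def h(x, r):
    """Hypothetical observation model: noisy direct measurement of the state."""
    return x + r


def simulate(x0, controls, q_std, r_std, rng):
    """Roll out states x_1..x_T and observations z_1..z_T."""
    states, observations = [], []
    x = x0
    for u in controls:
        x = f(x, u, rng.gauss(0.0, q_std))
        states.append(x)
        observations.append(h(x, rng.gauss(0.0, r_std)))
    return states, observations


def kalman_step(mu, var, u, z, q_var, r_var):
    """Scalar Kalman filter step for the linear models above."""
    # Prediction: push the belief through f; process noise inflates the variance.
    mu_pred = mu + u
    var_pred = var + q_var
    # Update: blend prediction and observation according to their variances.
    gain = var_pred / (var_pred + r_var)
    mu_new = mu_pred + gain * (z - mu_pred)
    var_new = (1.0 - gain) * var_pred
    return mu_new, var_new


rng = random.Random(0)
controls = [1.0] * 50
states, observations = simulate(0.0, controls, q_std=0.1, r_std=1.0, rng=rng)

mu, var = 0.0, 1.0  # initial belief: mean and variance
estimates = []
for u, z in zip(controls, observations):
    mu, var = kalman_step(mu, var, u, z, q_var=0.1 ** 2, r_var=1.0 ** 2)
    estimates.append(mu)
```

Here the noise variances are passed in by hand; they are exactly the kind of quantity a differentiable filter would learn instead.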
+
+* Differentiable Bayesian filters
+
+These models are often difficult to formulate and specify, especially
+when the application has complex dynamics, with complicated noises,
+nonlinearities, high-dimensional state or observations, etc.
+
+To improve this situation, the key idea is to /learn/ these complex
+dynamics and noise models from data. Instead of spending hours in
+front of a blackboard deriving the equations, we could give a simple
+model a lot of data and learn the equations from them!
+
+In the case of Bayesian filters, we have to define the process,
+observation, and noise models as parameterized functions
+(e.g. neural networks), and learn their parameters end-to-end, through
+the entire apparatus of the filter. To learn these parameters, we will
+use the simplest method: gradient descent. Our filter has to become
+/differentiable/.
+
+The paper shows that such /differentiable filters/ (trained
+end-to-end) outperform unstructured [[https://en.wikipedia.org/wiki/Long_short-term_memory][LSTMs]], and outperform standard
+filters where the process and observation models are fixed in advance
+(i.e. analytically derived or even trained separately in isolation).
+
+In most applications, the process and observation noises are
+assumed to be uncorrelated Gaussians, with zero mean and constant
+covariance (which is a hyperparameter of the filter). With end-to-end
+training, we can learn these parameters (mean and covariance of the
+noise), but we can even go further, and use [[https://en.wikipedia.org/wiki/Heteroscedasticity][heteroscedastic]] noise
+models. In such models, the noise can depend on the state of the system
+and the applied control.
+
+* Learnable process and observation models
+
+The process model $f$ can be implemented as a simple feed-forward
+neural network. Importantly, this NN is trained to output the
+/difference/ between the next and the current state ($\mathbf{x}_{t+1} - \mathbf{x}_t$).
+This ensures stable gradients and an easier initialization near the
+identity.
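
The residual formulation can be sketched in a few lines. The one-layer linear "network" and its parameter names below are placeholders standing in for the real neural network; only the structure (current state plus a learned correction) reflects the idea described above.

```python
def residual_network(x, u, params):
    """Hypothetical stand-in for the learned network: a single linear layer."""
    w_x, w_u, b = params
    return w_x * x + w_u * u + b


def process_model(x, u, params):
    """Predict the next state as the current state plus a learned correction."""
    return x + residual_network(x, u, params)


# With all parameters at zero the correction vanishes, so the model starts
# out as the identity map x_{t+1} = x_t.
zero_params = (0.0, 0.0, 0.0)
x_next = process_model(2.5, 1.0, zero_params)
```

Starting from the identity means an untrained model predicts "nothing changes", which is usually a far better first guess than a random next state.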
+
+For the observation model, we could do the same and model $h$ as a
+generative neural network predicting the output of the
+sensors. However, the observation space is often high-dimensional, and
+the network is thus difficult to train. Consequently, the authors use
+a /discriminative/ neural network to reduce the dimensionality of the
+raw sensory output.
+
+* Learnable noise models
+
+In the Gaussian case, we use neural networks to predict the covariance
+matrix of the noise processes. To ensure positive-definiteness, the
+network predicts an upper-triangular matrix $\mathbf{L}_t$ and the
+noise covariance matrix is set to $\mathbf{Q}_t = \mathbf{L}_t \mathbf{L}_t^T$.
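
A minimal sketch of this parameterization for a 2x2 covariance follows. The three raw numbers play the role of the network outputs. Passing the diagonal through an exponential and adding a small `eps` are assumptions made here to guarantee strict positive-definiteness; the post does not say how the paper handles this.

```python
import math


def covariance_from_raw(raw, eps=1e-6):
    """Build a 2x2 covariance Q = L L^T from three unconstrained numbers.

    raw = (a, b, c) fills the upper-triangular factor
    L = [[exp(a), b], [0, exp(c)]].
    """
    a, b, c = raw
    l00 = math.exp(a)  # assumption: exp keeps the diagonal of L positive
    l01 = b
    l11 = math.exp(c)
    # Q = L L^T written out for the 2x2 upper-triangular case,
    # with eps added on the diagonal (also an assumption of this sketch).
    q00 = l00 * l00 + l01 * l01 + eps
    q01 = l01 * l11
    q11 = l11 * l11 + eps
    return [[q00, q01], [q01, q11]]


Q = covariance_from_raw((0.3, -1.2, -0.5))
# A symmetric 2x2 matrix is positive definite iff Q[0][0] > 0 and det(Q) > 0.
determinant = Q[0][0] * Q[1][1] - Q[0][1] ** 2
```

Whatever values the network produces, the resulting matrix is a valid covariance, so gradient descent can update the raw outputs freely.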
+
+In the heteroscedastic case, the noise covariance is predicted from
+the state and the control input.
+
+* Loss function
+
+We assume that we have access to the ground-truth trajectory $\mathbf{x}_{1\ldots T}$.
+
+We can then use the mean squared error (MSE) between the ground truth
+and the mean of the belief:
+
+\[ L_\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^T (\mathbf{x}_t - \mathbf{\mu}_t)^T (\mathbf{x}_t - \mathbf{\mu}_t). \]
+
+Alternatively, we can compute the negative log-likelihood of the true
+state under the belief distribution (represented by a Gaussian of mean
+$\mathbf{\mu}_t$ and covariance $\mathbf{\Sigma}_t$):
+
+\[ L_\mathrm{NLL} = \frac{1}{2T} \sum_{t=1}^T \left[ \log(|\mathbf{\Sigma}_t|) + (\mathbf{x}_t - \mathbf{\mu}_t)^T \mathbf{\Sigma}_t^{-1} (\mathbf{x}_t - \mathbf{\mu}_t) \right]. \]
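
The sketch below shows both losses in the scalar case and uses the NLL to fit a single filter parameter, the observation noise variance, by gradient descent through a whole filter run. Everything about the setup is an assumption for illustration: the scalar linear models, the simulated data, the learning rate, and above all the gradient, which is approximated with central finite differences as a stand-in for the automatic differentiation a real implementation would use.

```python
import math
import random


def run_filter(observations, controls, q_var, r_var, mu=0.0, var=1.0):
    """Scalar Kalman filter for x_t = x_{t-1} + u_{t-1} + q, z_t = x_t + r."""
    beliefs = []
    for u, z in zip(controls, observations):
        mu, var = mu + u, var + q_var   # prediction step
        gain = var / (var + r_var)      # update step
        mu, var = mu + gain * (z - mu), (1.0 - gain) * var
        beliefs.append((mu, var))
    return beliefs


def mse_loss(states, beliefs):
    """Mean squared error between true states and belief means."""
    return sum((x - mu) ** 2 for x, (mu, _) in zip(states, beliefs)) / len(states)


def nll_loss(states, beliefs):
    """Gaussian negative log-likelihood of the true states (constants dropped)."""
    total = sum(
        math.log(var) + (x - mu) ** 2 / var for x, (mu, var) in zip(states, beliefs)
    )
    return total / (2 * len(states))


# Simulated data with true noise variances q = 0.01 and r = 1.
rng = random.Random(0)
controls = [1.0] * 200
states, observations, x = [], [], 0.0
for u in controls:
    x = x + u + rng.gauss(0.0, 0.1)
    states.append(x)
    observations.append(x + rng.gauss(0.0, 1.0))


def loss_for(log_r):
    """NLL of the ground truth as a function of the log observation variance."""
    beliefs = run_filter(observations, controls, q_var=0.01, r_var=math.exp(log_r))
    return nll_loss(states, beliefs)


log_r = math.log(25.0)  # deliberately poor initial guess
initial_loss = loss_for(log_r)
for _ in range(300):
    # Central finite difference, standing in for the autodiff gradient.
    grad = (loss_for(log_r + 1e-4) - loss_for(log_r - 1e-4)) / 2e-4
    log_r -= 0.2 * grad
final_loss = loss_for(log_r)
final_mse = mse_loss(
    states, run_filter(observations, controls, q_var=0.01, r_var=math.exp(log_r))
)
```

Optimizing the logarithm of the variance keeps it positive without any constraint. Unlike the MSE, the NLL also penalizes a belief whose variance does not match its actual error, so it can be used to train the noise models.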
+
+* Implementation issues
+
+We need to implement the filters ([[https://en.wikipedia.org/wiki/Extended_Kalman_filter][EKF]], [[https://en.wikipedia.org/wiki/Kalman_filter#Unscented_Kalman_filter][UKF]], [[https://en.wikipedia.org/wiki/Particle_filter][PF]]) in a [[https://en.wikipedia.org/wiki/Differentiable_programming][differentiable
+programming]] framework. The authors use [[https://en.wikipedia.org/wiki/TensorFlow][TensorFlow]]. Their code is
+available [[https://github.com/akloss/differentiable_filters][on GitHub]].
+
+Some filters are easy to implement because they use only
+differentiable operations (mostly simple linear algebra). For the
+EKF, we also need to compute
+Jacobians. This can be done automatically via automatic
+differentiation, but the authors have encountered technical
+difficulties with this (memory consumption or slow computations), so
+they recommend computing Jacobians manually.[fn::It is not clear
+whether this is a limitation of automatic differentiation, or of their
+specific implementation with TensorFlow. Some other projects have
+successfully computed Jacobians for EKFs with autodiff libraries, like
+[[https://github.com/sisl/GaussianFilters.jl][GaussianFilters.jl]] in Julia.]
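
To show what computing a Jacobian manually involves, here is a sketch for a hypothetical planar motion model (position, heading, and velocity controls). The model and the time step are made up for the example. The hand-derived Jacobian is checked against central finite differences, which is a cheap way to catch derivation mistakes.

```python
import math

DT = 0.1  # hypothetical time step


def f(state, control):
    """Hypothetical noise-free process model for a planar vehicle."""
    px, py, theta = state
    v, omega = control
    return [
        px + v * DT * math.cos(theta),
        py + v * DT * math.sin(theta),
        theta + omega * DT,
    ]


def jacobian_manual(state, control):
    """Hand-derived Jacobian of f with respect to the state."""
    _, _, theta = state
    v, _ = control
    return [
        [1.0, 0.0, -v * DT * math.sin(theta)],
        [0.0, 1.0, v * DT * math.cos(theta)],
        [0.0, 0.0, 1.0],
    ]


def jacobian_numeric(state, control, eps=1e-6):
    """Central finite differences, used here only to check the manual result."""
    n = len(state)
    columns = []
    for j in range(n):
        plus = list(state)
        minus = list(state)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = f(plus, control), f(minus, control)
        columns.append([(a - b) / (2 * eps) for a, b in zip(f_plus, f_minus)])
    # columns[j][i] = d f_i / d x_j, so transpose to get the rows of the Jacobian.
    return [[columns[j][i] for j in range(n)] for i in range(n)]


state, control = [1.0, 2.0, 0.3], [1.5, 0.2]
F_manual = jacobian_manual(state, control)
F_numeric = jacobian_numeric(state, control)
max_error = max(
    abs(a - b)
    for row_m, row_n in zip(F_manual, F_numeric)
    for a, b in zip(row_m, row_n)
)
```

For a learned process model the manual route means differentiating the network by hand, which is why automatic differentiation is attractive when its cost is acceptable.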
+
+The particle filter has a resampling step that is not differentiable:
+the gradient cannot be propagated to particles that are not selected
+by the sampling step. There are apparently specific resampling
+algorithms that help mitigate this issue in practice when training.
+
+* Conclusions
+
+Differentiable filters achieve better results with fewer parameters
+than unstructured models like LSTMs, especially on complex tasks. The
+paper runs extensive experiments on toy models of varying
+complexity, although unfortunately no real-world application is shown.
+
+Noise models with full covariance improve the tracking
+accuracy. Heteroscedastic noise models improve it even more.
+
+The main issue is to keep the training stable. The authors recommend the
+differentiable extended Kalman filter for getting started, as it is
+the simplest filter, and is less sensitive to hyperparameter
+choices. If the task is strongly non-linear, one should use a
+differentiable unscented Kalman filter or a differentiable particle
+filter.
+
+* References