Add post on "How to train your differentiable filter"
This commit is contained in: parent a5d75b49cb, commit 3c48e9317d
2 changed files with 206 additions and 0 deletions
@@ -668,3 +668,35 @@
  isbn = 9781119061953,
}

@article{kloss2021_how,
  author    = {Kloss, Alina and Martius, Georg and Bohg, Jeannette},
  title     = {How to train your differentiable filter},
  journal   = {Autonomous Robots},
  volume    = 45,
  number    = 4,
  pages     = {561--578},
  year      = 2021,
  month     = may,
  issn      = {1573-7527},
  publisher = {Springer US},
  doi       = {10.1007/s10514-021-09990-9}
}

@book{anderson2005_optim_filter,
  author    = {Anderson, Brian D. O. and Moore, John B.},
  title     = {Optimal Filtering},
  year      = 2005,
  publisher = {Dover Publications},
  series    = {Dover Books on Electrical Engineering},
  isbn      = 9780486439389,
}

@book{thrun2006_probab_robot,
  author    = {Thrun, Sebastian and Burgard, Wolfram and Fox, Dieter},
  title     = {Probabilistic Robotics},
  year      = 2006,
  publisher = {The MIT Press},
  address   = {Cambridge, Massachusetts},
  url       = {https://mitpress.mit.edu/books/probabilistic-robotics},
  isbn      = 9780262201629,
}
174 posts/how-to-train-your-differentiable-filter.org Normal file
@@ -0,0 +1,174 @@
---
title: "How to train your differentiable filter"
date: 2022-05-20
tags: maths, dynamical systems, machine learning, autodiff
toc: false
---

This is a short overview of the following paper [cite:@kloss2021_how]:

#+begin_quote
Kloss, Alina, Georg Martius, and Jeannette Bohg. 2021. “How to Train
Your Differentiable Filter.” /Autonomous Robots/ 45 (4):
561–78. https://doi.org/10.1007/s10514-021-09990-9.
#+end_quote

* Bayesian filtering for state estimation

Bayesian filters[fn:bayesian-filters] are the standard method for
probabilistic state estimation. Common examples are (extended,
unscented) [[https://en.wikipedia.org/wiki/Kalman_filter][Kalman filters]] and [[https://en.wikipedia.org/wiki/Particle_filter][particle filters]]. These filters require
a /process model/ predicting how the state evolves over time, and an
/observation model/ relating a sensor value to the underlying state.

[fn:bayesian-filters] {-} [cite:@thrun2006_probab_robot] contains a
great explanation of Bayesian filters (including Kalman and particle
filters) in the context of robotics, which is the setting of this
paper. For a more complete overview of Kalman filters, see
[cite:@anderson2005_optim_filter].

The objective of a filter for state estimation is to estimate a latent
state $\mathbf{x}$ of a dynamical system at any time step $t$, given an
initial belief $\mathrm{bel}(\mathbf{x}_0) = p(\mathbf{x}_0)$, a
sequence of observations $\mathbf{z}_{1\ldots t}$, and controls
$\mathbf{u}_{0\ldots t}$.

We make the Markov assumption: states and observations are
conditionally independent from the history of past states. The belief
is then updated recursively:

\[
\begin{align*}
\mathrm{bel}(\mathbf{x}_t) &= \eta p(\mathbf{z}_t | \mathbf{x}_t) \int p(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{u}_{t-1}) \mathrm{bel}(\mathbf{x}_{t-1}) d\mathbf{x}_{t-1}\\
&= \eta p(\mathbf{z}_t | \mathbf{x}_t) \overline{\mathrm{bel}}(\mathbf{x}_t),
\end{align*}
\]

where $\eta$ is a normalization factor. Computing
$\overline{\mathrm{bel}}(\mathbf{x}_t)$ is the /prediction step/, and
applying $p(\mathbf{z}_t | \mathbf{x}_t)$ is the /update step/ (or the
/observation step/).
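To make the recursion concrete, here is a minimal sketch of the
prediction and update steps on a discretized 1-D state space (the grid
size, motion kernel, and measurement noise are made-up illustrations,
not from the paper):

```python
import numpy as np

# Minimal histogram (grid) Bayes filter on a 1-D discrete state space.
n = 50
bel = np.full(n, 1.0 / n)                # initial belief bel(x_0): uniform

# Process model p(x_t | x_{t-1}): move one cell to the right, with a
# little diffusion, encoded as a transition kernel.
kernel = np.array([0.1, 0.8, 0.1])

def predict(bel):
    """Prediction step: bel_bar(x_t) = sum_x' p(x_t | x') bel(x')."""
    return np.convolve(np.roll(bel, 1), kernel, mode="same")

def update(bel_bar, z, sigma=2.0):
    """Update step: bel(x_t) = eta * p(z_t | x_t) * bel_bar(x_t)."""
    likelihood = np.exp(-0.5 * ((np.arange(n) - z) / sigma) ** 2)
    posterior = likelihood * bel_bar
    return posterior / posterior.sum()   # eta normalizes the posterior

for z in [12, 13, 14]:                   # noisy position measurements
    bel = update(predict(bel), z)

print(np.argmax(bel))                    # MAP estimate near the measurements
```

The =predict= function implements the integral as a convolution over
the grid, and =update= multiplies by the likelihood and renormalizes,
playing the role of $\eta$.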

We model the dynamics of the system through a process model $f$ and an
observation model $h$:

\[
\begin{align*}
\mathbf{x}_t &= f(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}, \mathbf{q}_{t-1})\\
\mathbf{z}_t &= h(\mathbf{x}_t, \mathbf{r}_t),
\end{align*}
\]

where $\mathbf{q}$ and $\mathbf{r}$ are random variables representing
process and observation noise, respectively.
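As a concrete instance, when $f$ and $h$ are linear and the noise is
Gaussian, the recursion has a closed form: the Kalman filter. A
minimal sketch (the constant-velocity model and all matrix values are
illustrative choices, not from the paper):

```python
import numpy as np

# Linear-Gaussian instance of the process/observation models:
#   x_t = A x_{t-1} + B u_{t-1} + q,   q ~ N(0, Q)
#   z_t = H x_t + r,                   r ~ N(0, R)
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])    # state: [position, velocity]
B = np.array([[0.0], [dt]])              # control: acceleration
H = np.array([[1.0, 0.0]])               # we only observe position
Q = 1e-3 * np.eye(2)                     # process noise covariance
R = np.array([[0.05]])                   # observation noise covariance

def kf_step(mu, Sigma, u, z):
    """One Kalman filter iteration: prediction step, then update step."""
    # Prediction: propagate the belief through the process model.
    mu_bar = A @ mu + B @ u
    Sigma_bar = A @ Sigma @ A.T + Q
    # Update: fold in the observation via the Kalman gain K.
    S = H @ Sigma_bar @ H.T + R          # innovation covariance
    K = Sigma_bar @ H.T @ np.linalg.inv(S)
    mu_new = mu_bar + K @ (z - H @ mu_bar)
    Sigma_new = (np.eye(2) - K @ H) @ Sigma_bar
    return mu_new, Sigma_new

mu, Sigma = np.zeros((2, 1)), np.eye(2)  # initial belief
mu, Sigma = kf_step(mu, Sigma, u=np.array([[1.0]]), z=np.array([[0.2]]))
print(mu.ravel())                        # mean pulled toward the observation
```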

* Differentiable Bayesian filters

These models are often difficult to formulate and specify, especially
when the system has complex dynamics, complicated noise
characteristics, nonlinearities, or high-dimensional states or
observations.

To improve this situation, the key idea is to /learn/ these complex
dynamics and noise models from data. Instead of spending hours in
front of a blackboard deriving the equations, we could feed a simple
model a lot of data and learn the equations from it!

In the case of Bayesian filters, we have to define the process,
observation, and noise models as parameterized functions (e.g. neural
networks), and learn their parameters end-to-end, through the entire
apparatus of the filter. To learn these parameters, we will use the
simplest method: gradient descent. Our filter has to become
/differentiable/.
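A toy version of this end-to-end training: tune a single filter
parameter by gradient descent on the filtering error. The paper uses
automatic differentiation; to keep this sketch dependency-free, the
gradient is approximated by finite differences (the system, data, and
hyperparameters are all made up):

```python
import numpy as np

# Learn the process noise variance q of a scalar Kalman filter by
# gradient descent on the filtering MSE, end-to-end through the filter.

rng = np.random.default_rng(0)
T = 200
x_true = np.cumsum(rng.normal(0.0, 0.1, T))   # random-walk ground truth
z = x_true + rng.normal(0.0, 0.5, T)          # noisy observations
R = 0.25                                      # known observation variance

def filter_mse(q):
    """Run a scalar Kalman filter with process variance q; return MSE."""
    mu, P, err = 0.0, 1.0, 0.0
    for t in range(T):
        P = P + q                             # prediction step
        K = P / (P + R)                       # Kalman gain
        mu = mu + K * (z[t] - mu)             # update step
        P = (1 - K) * P
        err += (mu - x_true[t]) ** 2
    return err / T

q, lr, eps = 1.0, 0.05, 1e-4
before = filter_mse(q)
for _ in range(100):
    g = (filter_mse(q + eps) - filter_mse(q - eps)) / (2 * eps)
    q = max(q - lr * g, 1e-6)                 # keep the variance positive
print(before, filter_mse(q))                  # loss before vs. after training
```

With autodiff, the finite-difference loop would be replaced by an
exact gradient through the whole filter, which is the point of making
the filter differentiable.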

The paper shows that such /differentiable filters/ (trained
end-to-end) outperform unstructured [[https://en.wikipedia.org/wiki/Long_short-term_memory][LSTMs]], and also outperform
standard filters whose process and observation models are fixed in
advance (i.e. analytically derived, or even trained separately in
isolation).

In most applications, the process and observation noises are assumed
to be uncorrelated Gaussians with zero mean and constant covariance
(which is a hyperparameter of the filter). With end-to-end training,
we can learn these noise parameters, but we can go even further and
use [[https://en.wikipedia.org/wiki/Heteroscedasticity][heteroscedastic]] noise models, where the noise depends on the
state of the system and the applied control.

* Learnable process and observation models

The process model $f$ can be implemented as a simple feed-forward
neural network. Importantly, this network is trained to output the
/difference/ between the next and the current state ($\mathbf{x}_{t+1} - \mathbf{x}_t$).
This ensures stable gradients and an easier initialization near the
identity.
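A sketch of such a residual process model, assuming a toy two-layer
network (all sizes and weight scales are illustrative):

```python
import numpy as np

# Learned process model that predicts the state *difference* rather
# than the next state, so an untrained network starts close to the
# identity map.

rng = np.random.default_rng(1)
state_dim, ctrl_dim, hidden = 4, 2, 32

W1 = rng.normal(0, 0.01, (hidden, state_dim + ctrl_dim))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.01, (state_dim, hidden))
b2 = np.zeros(state_dim)

def process_model(x, u):
    """x_{t+1} = x_t + NN(x_t, u_t): the network outputs the residual."""
    h = np.tanh(W1 @ np.concatenate([x, u]) + b1)
    return x + (W2 @ h + b2)

x = rng.normal(size=state_dim)
x_next = process_model(x, np.zeros(ctrl_dim))
print(np.abs(x_next - x).max())   # tiny: near-identity at initialization
```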

For the observation model, we could do the same and model $h$ as a
generative neural network predicting the output of the
sensors. However, the observation space is often high-dimensional, and
such a network is thus difficult to train. Consequently, the authors
use a /discriminative/ neural network to reduce the dimensionality of
the raw sensory output.

* Learnable noise models

In the Gaussian case, we use neural networks to predict the covariance
matrices of the noise processes. To ensure positive-definiteness, the
network predicts an upper-triangular matrix $\mathbf{L}_t$, and the
noise covariance matrix is set to $\mathbf{Q}_t = \mathbf{L}_t \mathbf{L}_t^T$.
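A sketch of this parameterization (the network itself is elided; =raw=
stands in for its output, and in the heteroscedastic case it would be
a function of the state and control):

```python
import numpy as np

# The L L^T parameterization: the network emits d(d+1)/2 raw numbers,
# reshaped into an upper-triangular factor L whose diagonal is
# exponentiated (hence strictly positive), so Q = L L^T is symmetric
# positive-definite by construction.

d = 3
rng = np.random.default_rng(2)
raw = rng.normal(size=d * (d + 1) // 2)       # pretend network output

L = np.zeros((d, d))
L[np.triu_indices(d)] = raw                   # fill the triangular part
L[np.diag_indices(d)] = np.exp(np.diag(L))    # make the diagonal positive

Q = L @ L.T
print(np.allclose(Q, Q.T), np.all(np.linalg.eigvalsh(Q) > 0))
```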

In the heteroscedastic case, the noise covariance is predicted from
the current state and the control input.

* Loss function

We assume that we have access to the ground-truth trajectory $\mathbf{x}_{1\ldots T}$.

We can then use the mean squared error (MSE) between the ground truth
and the mean of the belief:

\[ L_\mathrm{MSE} = \frac{1}{T} \sum_{t=0}^T (\mathbf{x}_t - \mathbf{\mu}_t)^T (\mathbf{x}_t - \mathbf{\mu}_t). \]

Alternatively, we can compute the negative log-likelihood of the true
state under the belief distribution (represented by a Gaussian with
mean $\mathbf{\mu}_t$ and covariance $\mathbf{\Sigma}_t$):

\[ L_\mathrm{NLL} = \frac{1}{2T} \sum_{t=0}^T \log(|\mathbf{\Sigma}_t|) + (\mathbf{x}_t - \mathbf{\mu}_t)^T \mathbf{\Sigma}_t^{-1} (\mathbf{x}_t - \mathbf{\mu}_t). \]
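Both losses for a single time step, written out in code (the numbers
are illustrative; averaging over $t$, and dropping the constant
$\frac{d}{2}\log 2\pi$ term in the NLL, recovers the formulas above):

```python
import numpy as np

# One time step of each loss, for a Gaussian belief N(mu, Sigma).
x = np.array([1.0, -0.5])                       # ground-truth state
mu = np.array([0.8, -0.3])                      # belief mean
Sigma = np.array([[0.5, 0.1], [0.1, 0.4]])      # belief covariance

err = x - mu
mse = err @ err                                  # squared-error term

# NLL term: (log |Sigma| + err^T Sigma^{-1} err) / 2, constant dropped.
sign, logdet = np.linalg.slogdet(Sigma)
nll = 0.5 * (logdet + err @ np.linalg.solve(Sigma, err))

print(mse, nll)
```

Note that the NLL also penalizes overconfident covariances through the
$\log|\mathbf{\Sigma}_t|$ term, whereas the MSE only looks at the mean.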

* Implementation issues

We need to implement the filters ([[https://en.wikipedia.org/wiki/Extended_Kalman_filter][EKF]], [[https://en.wikipedia.org/wiki/Kalman_filter#Unscented_Kalman_filter][UKF]], [[https://en.wikipedia.org/wiki/Particle_filter][PF]]) in a [[https://en.wikipedia.org/wiki/Differentiable_programming][differentiable
programming]] framework. The authors use [[https://en.wikipedia.org/wiki/TensorFlow][TensorFlow]]. Their code is
available [[https://github.com/akloss/differentiable_filters][on GitHub]].

Some filters are easy to implement because they use only
differentiable operations (mostly simple linear algebra). For the EKF,
we also need to compute Jacobians. This can be done via automatic
differentiation, but the authors encountered technical difficulties
with this (memory consumption and slow computations), so they
recommend computing Jacobians manually.[fn::It is not clear whether
this is a limitation of automatic differentiation in general, or of
their specific implementation with TensorFlow. Some other projects
have successfully computed Jacobians for EKFs with autodiff libraries,
like [[https://github.com/sisl/GaussianFilters.jl][GaussianFilters.jl]] in Julia.]

The particle filter has a resampling step that is not differentiable:
the gradient cannot be propagated to particles that are not selected
by the sampling step. There are specific resampling algorithms that
help mitigate this issue in practice during training.
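One such trick from the differentiable particle filter literature
(soft resampling, not introduced in this paper): sample indices from a
mixture of the particle weights and a uniform distribution, then
correct the new weights by importance weighting. The index draw is
still discrete, but the corrected weights depend smoothly on the
original ones, so some gradient signal survives. A minimal sketch:

```python
import numpy as np

# Soft resampling: sample from q = alpha * w + (1 - alpha) / N instead
# of from the weights w, then correct by w / q. The discrete index draw
# remains non-differentiable, but the corrected weights are a smooth
# function of w.

def soft_resample(particles, w, alpha=0.5, seed=3):
    rng = np.random.default_rng(seed)
    n = len(w)
    q = alpha * w + (1 - alpha) / n           # mixture proposal, never zero
    idx = rng.choice(n, size=n, p=q)          # the non-differentiable part
    new_w = w[idx] / q[idx]                   # importance-weight correction
    return particles[idx], new_w / new_w.sum()

particles = np.linspace(-1.0, 1.0, 8)
w = np.array([0.01, 0.02, 0.10, 0.37, 0.37, 0.10, 0.02, 0.01])
p2, w2 = soft_resample(particles, w)
print(w2)   # non-uniform: the weights still carry information about w
```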

* Conclusions

Differentiable filters achieve better results with fewer parameters
than unstructured models like LSTMs, especially on complex tasks. The
paper runs extensive experiments on toy models of varying complexity,
although unfortunately no real-world application is shown.

Noise models with a full covariance matrix improve tracking accuracy,
and heteroscedastic noise models improve it even more.

The main difficulty is keeping the training stable. The authors
recommend the differentiable extended Kalman filter for getting
started, as it is the simplest filter and the least sensitive to
hyperparameter choices. If the task is strongly nonlinear, one should
use a differentiable unscented Kalman filter or a differentiable
particle filter.

* References