Add post on "How to train your differentiable filter"
This commit is contained in: parent a5d75b49cb, commit 3c48e9317d
2 changed files with 206 additions and 0 deletions
@@ -668,3 +668,35 @@
  isbn = 9781119061953,
}

@article{kloss2021_how,
  author    = {Kloss, Alina and Martius, Georg and Bohg, Jeannette},
  title     = {How to train your differentiable filter},
  journal   = {Autonomous Robots},
  volume    = 45,
  number    = 4,
  pages     = {561--578},
  year      = 2021,
  month     = may,
  issn      = {1573-7527},
  publisher = {Springer US},
  doi       = {10.1007/s10514-021-09990-9}
}

@book{anderson2005_optim_filter,
  author    = {Anderson, Brian D. O. and Moore, John B.},
  title     = {Optimal Filtering},
  year      = 2005,
  publisher = {Dover Publications},
  series    = {Dover Books on Electrical Engineering},
  isbn      = 9780486439389,
}

@book{thrun2006_probab_robot,
  author    = {Thrun, Sebastian and Burgard, Wolfram and Fox, Dieter},
  title     = {Probabilistic Robotics},
  year      = 2006,
  publisher = {The MIT Press},
  address   = {Cambridge, Massachusetts},
  url       = {https://mitpress.mit.edu/books/probabilistic-robotics},
  isbn      = 9780262201629,
}
174 posts/how-to-train-your-differentiable-filter.org Normal file
@@ -0,0 +1,174 @@
---
title: "How to train your differentiable filter"
date: 2022-05-20
tags: maths, dynamical systems, machine learning, autodiff
toc: false
---

This is a short overview of the following paper [cite:@kloss2021_how]:

#+begin_quote
Kloss, Alina, Georg Martius, and Jeannette Bohg. 2021. “How to Train
Your Differentiable Filter.” /Autonomous Robots/ 45 (4):
561–78. https://doi.org/10.1007/s10514-021-09990-9.
#+end_quote

* Bayesian filtering for state estimation

Bayesian filters[fn:bayesian-filters] are the standard method for
probabilistic state estimation. Common examples are (extended,
unscented) [[https://en.wikipedia.org/wiki/Kalman_filter][Kalman filters]] and [[https://en.wikipedia.org/wiki/Particle_filter][particle filters]]. These filters require
a /process model/ predicting how the state evolves over time, and an
/observation model/ relating a sensor value to the underlying state.

[fn:bayesian-filters] {-} [cite:@thrun2006_probab_robot] contains a
great explanation of Bayesian filters (including Kalman and particle
filters) in the context of robotics, which is the setting of this
paper. For a more complete overview of Kalman filters, see
[cite:@anderson2005_optim_filter].

The objective of a filter for state estimation is to estimate a latent
state $\mathbf{x}$ of a dynamical system at any time step $t$, given an
initial belief $\mathrm{bel}(\mathbf{x}_0) = p(\mathbf{x}_0)$, a
sequence of observations $\mathbf{z}_{1\ldots t}$, and controls
$\mathbf{u}_{0\ldots t}$.

We make the Markov assumption: states and observations are
conditionally independent from the history of past states. The belief
is then updated recursively:

\[
\begin{align*}
\mathrm{bel}(\mathbf{x}_t) &= \eta p(\mathbf{z}_t | \mathbf{x}_t) \int p(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{u}_{t-1}) \mathrm{bel}(\mathbf{x}_{t-1}) d\mathbf{x}_{t-1}\\
&= \eta p(\mathbf{z}_t | \mathbf{x}_t) \overline{\mathrm{bel}}(\mathbf{x}_t),
\end{align*}
\]

where $\eta$ is a normalization factor. Computing
$\overline{\mathrm{bel}}(\mathbf{x}_t)$ is the /prediction step/, and
applying $p(\mathbf{z}_t | \mathbf{x}_t)$ is the /update step/ (or the
/observation step/).
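To make the recursion concrete, here is a minimal sketch of the
prediction and update steps on a discretized 1-D state space (the grid
size, motion kernel, and measurement noise are made-up illustrations,
not from the paper):

```python
import numpy as np

# Minimal histogram (grid) Bayes filter on a 1-D discrete state space.
n = 50
bel = np.full(n, 1.0 / n)                # initial belief bel(x_0): uniform

# Process model p(x_t | x_{t-1}): move one cell to the right, with a
# little diffusion, encoded as a transition kernel.
kernel = np.array([0.1, 0.8, 0.1])

def predict(bel):
    """Prediction step: bel_bar(x_t) = sum_x' p(x_t | x') bel(x')."""
    return np.convolve(np.roll(bel, 1), kernel, mode="same")

def update(bel_bar, z, sigma=2.0):
    """Update step: bel(x_t) = eta * p(z_t | x_t) * bel_bar(x_t)."""
    likelihood = np.exp(-0.5 * ((np.arange(n) - z) / sigma) ** 2)
    posterior = likelihood * bel_bar
    return posterior / posterior.sum()   # eta normalizes the posterior

for z in [12, 13, 14]:                   # noisy position measurements
    bel = update(predict(bel), z)

print(np.argmax(bel))                    # MAP estimate near the measurements
```

The =predict= function implements the integral as a convolution over
the grid, and =update= multiplies by the likelihood and renormalizes,
playing the role of $\eta$.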

We model the dynamics of the system through a process model $f$ and an
observation model $h$:

\[
\begin{align*}
\mathbf{x}_t &= f(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}, \mathbf{q}_{t-1})\\
\mathbf{z}_t &= h(\mathbf{x}_t, \mathbf{r}_t),
\end{align*}
\]

where $\mathbf{q}$ and $\mathbf{r}$ are random variables representing
process and observation noise, respectively.
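As a concrete instance, when $f$ and $h$ are linear and the noise is
Gaussian, the recursion has a closed form: the Kalman filter. A
minimal sketch (the constant-velocity model and all matrix values are
illustrative choices, not from the paper):

```python
import numpy as np

# Linear-Gaussian instance of the process/observation models:
#   x_t = A x_{t-1} + B u_{t-1} + q,   q ~ N(0, Q)
#   z_t = H x_t + r,                   r ~ N(0, R)
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])    # state: [position, velocity]
B = np.array([[0.0], [dt]])              # control: acceleration
H = np.array([[1.0, 0.0]])               # we only observe position
Q = 1e-3 * np.eye(2)                     # process noise covariance
R = np.array([[0.05]])                   # observation noise covariance

def kf_step(mu, Sigma, u, z):
    """One Kalman filter iteration: prediction step, then update step."""
    # Prediction: propagate the belief through the process model.
    mu_bar = A @ mu + B @ u
    Sigma_bar = A @ Sigma @ A.T + Q
    # Update: fold in the observation via the Kalman gain K.
    S = H @ Sigma_bar @ H.T + R          # innovation covariance
    K = Sigma_bar @ H.T @ np.linalg.inv(S)
    mu_new = mu_bar + K @ (z - H @ mu_bar)
    Sigma_new = (np.eye(2) - K @ H) @ Sigma_bar
    return mu_new, Sigma_new

mu, Sigma = np.zeros((2, 1)), np.eye(2)  # initial belief
mu, Sigma = kf_step(mu, Sigma, u=np.array([[1.0]]), z=np.array([[0.2]]))
print(mu.ravel())                        # mean pulled toward the observation
```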

* Differentiable Bayesian filters

These models are often difficult to formulate and specify, especially
when the system has complex dynamics, complicated noise
characteristics, nonlinearities, or high-dimensional states or
observations.

To improve this situation, the key idea is to /learn/ these complex
dynamics and noise models from data. Instead of spending hours in
front of a blackboard deriving the equations, we could feed a simple
model a lot of data and learn the equations from it!

In the case of Bayesian filters, we have to define the process,
observation, and noise models as parameterized functions (e.g. neural
networks), and learn their parameters end-to-end, through the entire
apparatus of the filter. To learn these parameters, we will use the
simplest method: gradient descent. Our filter has to become
/differentiable/.
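A toy version of this end-to-end training: tune a single filter
parameter by gradient descent on the filtering error. The paper uses
automatic differentiation; to keep this sketch dependency-free, the
gradient is approximated by finite differences (the system, data, and
hyperparameters are all made up):

```python
import numpy as np

# Learn the process noise variance q of a scalar Kalman filter by
# gradient descent on the filtering MSE, end-to-end through the filter.

rng = np.random.default_rng(0)
T = 200
x_true = np.cumsum(rng.normal(0.0, 0.1, T))   # random-walk ground truth
z = x_true + rng.normal(0.0, 0.5, T)          # noisy observations
R = 0.25                                      # known observation variance

def filter_mse(q):
    """Run a scalar Kalman filter with process variance q; return MSE."""
    mu, P, err = 0.0, 1.0, 0.0
    for t in range(T):
        P = P + q                             # prediction step
        K = P / (P + R)                       # Kalman gain
        mu = mu + K * (z[t] - mu)             # update step
        P = (1 - K) * P
        err += (mu - x_true[t]) ** 2
    return err / T

q, lr, eps = 1.0, 0.05, 1e-4
before = filter_mse(q)
for _ in range(100):
    g = (filter_mse(q + eps) - filter_mse(q - eps)) / (2 * eps)
    q = max(q - lr * g, 1e-6)                 # keep the variance positive
print(before, filter_mse(q))                  # loss before vs. after training
```

With autodiff, the finite-difference loop would be replaced by an
exact gradient through the whole filter, which is the point of making
the filter differentiable.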

The paper shows that such /differentiable filters/ (trained
end-to-end) outperform unstructured [[https://en.wikipedia.org/wiki/Long_short-term_memory][LSTMs]], and also outperform
standard filters whose process and observation models are fixed in
advance (i.e. analytically derived, or even trained separately in
isolation).

In most applications, the process and observation noises are assumed
to be uncorrelated Gaussians with zero mean and constant covariance
(which is a hyperparameter of the filter). With end-to-end training,
we can learn these noise parameters, but we can go even further and
use [[https://en.wikipedia.org/wiki/Heteroscedasticity][heteroscedastic]] noise models, where the noise depends on the
state of the system and the applied control.

* Learnable process and observation models

The process model $f$ can be implemented as a simple feed-forward
neural network. Importantly, this network is trained to output the
/difference/ between the next and the current state ($\mathbf{x}_{t+1} - \mathbf{x}_t$).
This ensures stable gradients and an easier initialization near the
identity.
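A sketch of such a residual process model, assuming a toy two-layer
network (all sizes and weight scales are illustrative):

```python
import numpy as np

# Learned process model that predicts the state *difference* rather
# than the next state, so an untrained network starts close to the
# identity map.

rng = np.random.default_rng(1)
state_dim, ctrl_dim, hidden = 4, 2, 32

W1 = rng.normal(0, 0.01, (hidden, state_dim + ctrl_dim))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.01, (state_dim, hidden))
b2 = np.zeros(state_dim)

def process_model(x, u):
    """x_{t+1} = x_t + NN(x_t, u_t): the network outputs the residual."""
    h = np.tanh(W1 @ np.concatenate([x, u]) + b1)
    return x + (W2 @ h + b2)

x = rng.normal(size=state_dim)
x_next = process_model(x, np.zeros(ctrl_dim))
print(np.abs(x_next - x).max())   # tiny: near-identity at initialization
```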

For the observation model, we could do the same and model $h$ as a
generative neural network predicting the output of the
sensors. However, the observation space is often high-dimensional, and
such a network is thus difficult to train. Consequently, the authors
use a /discriminative/ neural network to reduce the dimensionality of
the raw sensory output.

* Learnable noise models

In the Gaussian case, we use neural networks to predict the covariance
matrices of the noise processes. To ensure positive-definiteness, the
network predicts an upper-triangular matrix $\mathbf{L}_t$, and the
noise covariance matrix is set to $\mathbf{Q}_t = \mathbf{L}_t \mathbf{L}_t^T$.
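A sketch of this parameterization (the network itself is elided; =raw=
stands in for its output, and in the heteroscedastic case it would be
a function of the state and control):

```python
import numpy as np

# The L L^T parameterization: the network emits d(d+1)/2 raw numbers,
# reshaped into an upper-triangular factor L whose diagonal is
# exponentiated (hence strictly positive), so Q = L L^T is symmetric
# positive-definite by construction.

d = 3
rng = np.random.default_rng(2)
raw = rng.normal(size=d * (d + 1) // 2)       # pretend network output

L = np.zeros((d, d))
L[np.triu_indices(d)] = raw                   # fill the triangular part
L[np.diag_indices(d)] = np.exp(np.diag(L))    # make the diagonal positive

Q = L @ L.T
print(np.allclose(Q, Q.T), np.all(np.linalg.eigvalsh(Q) > 0))
```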

In the heteroscedastic case, the noise covariance is predicted from
the current state and the control input.

* Loss function

We assume that we have access to the ground-truth trajectory $\mathbf{x}_{1\ldots T}$.

We can then use the mean squared error (MSE) between the ground truth
and the mean of the belief:

\[ L_\mathrm{MSE} = \frac{1}{T} \sum_{t=0}^T (\mathbf{x}_t - \mathbf{\mu}_t)^T (\mathbf{x}_t - \mathbf{\mu}_t). \]

Alternatively, we can compute the negative log-likelihood of the true
state under the belief distribution (represented by a Gaussian with
mean $\mathbf{\mu}_t$ and covariance $\mathbf{\Sigma}_t$):

\[ L_\mathrm{NLL} = \frac{1}{2T} \sum_{t=0}^T \log(|\mathbf{\Sigma}_t|) + (\mathbf{x}_t - \mathbf{\mu}_t)^T \mathbf{\Sigma}_t^{-1} (\mathbf{x}_t - \mathbf{\mu}_t). \]
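Both losses for a single time step, written out in code (the numbers
are illustrative; averaging over $t$, and dropping the constant
$\frac{d}{2}\log 2\pi$ term in the NLL, recovers the formulas above):

```python
import numpy as np

# One time step of each loss, for a Gaussian belief N(mu, Sigma).
x = np.array([1.0, -0.5])                       # ground-truth state
mu = np.array([0.8, -0.3])                      # belief mean
Sigma = np.array([[0.5, 0.1], [0.1, 0.4]])      # belief covariance

err = x - mu
mse = err @ err                                  # squared-error term

# NLL term: (log |Sigma| + err^T Sigma^{-1} err) / 2, constant dropped.
sign, logdet = np.linalg.slogdet(Sigma)
nll = 0.5 * (logdet + err @ np.linalg.solve(Sigma, err))

print(mse, nll)
```

Note that the NLL also penalizes overconfident covariances through the
$\log|\mathbf{\Sigma}_t|$ term, whereas the MSE only looks at the mean.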

* Implementation issues

We need to implement the filters ([[https://en.wikipedia.org/wiki/Extended_Kalman_filter][EKF]], [[https://en.wikipedia.org/wiki/Kalman_filter#Unscented_Kalman_filter][UKF]], [[https://en.wikipedia.org/wiki/Particle_filter][PF]]) in a [[https://en.wikipedia.org/wiki/Differentiable_programming][differentiable
programming]] framework. The authors use [[https://en.wikipedia.org/wiki/TensorFlow][TensorFlow]]. Their code is
available [[https://github.com/akloss/differentiable_filters][on GitHub]].

Some filters are easy to implement because they use only
differentiable operations (mostly simple linear algebra). For the EKF,
we also need to compute Jacobians. This can be done via automatic
differentiation, but the authors encountered technical difficulties
with this (memory consumption and slow computations), so they
recommend computing Jacobians manually.[fn::It is not clear whether
this is a limitation of automatic differentiation in general, or of
their specific implementation with TensorFlow. Some other projects
have successfully computed Jacobians for EKFs with autodiff libraries,
like [[https://github.com/sisl/GaussianFilters.jl][GaussianFilters.jl]] in Julia.]

The particle filter has a resampling step that is not differentiable:
the gradient cannot be propagated to particles that are not selected
by the sampling step. There are specific resampling algorithms that
help mitigate this issue in practice during training.
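One such trick from the differentiable particle filter literature
(soft resampling, not introduced in this paper): sample indices from a
mixture of the particle weights and a uniform distribution, then
correct the new weights by importance weighting. The index draw is
still discrete, but the corrected weights depend smoothly on the
original ones, so some gradient signal survives. A minimal sketch:

```python
import numpy as np

# Soft resampling: sample from q = alpha * w + (1 - alpha) / N instead
# of from the weights w, then correct by w / q. The discrete index draw
# remains non-differentiable, but the corrected weights are a smooth
# function of w.

def soft_resample(particles, w, alpha=0.5, seed=3):
    rng = np.random.default_rng(seed)
    n = len(w)
    q = alpha * w + (1 - alpha) / n           # mixture proposal, never zero
    idx = rng.choice(n, size=n, p=q)          # the non-differentiable part
    new_w = w[idx] / q[idx]                   # importance-weight correction
    return particles[idx], new_w / new_w.sum()

particles = np.linspace(-1.0, 1.0, 8)
w = np.array([0.01, 0.02, 0.10, 0.37, 0.37, 0.10, 0.02, 0.01])
p2, w2 = soft_resample(particles, w)
print(w2)   # non-uniform: the weights still carry information about w
```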

* Conclusions

Differentiable filters achieve better results with fewer parameters
than unstructured models like LSTMs, especially on complex tasks. The
paper runs extensive experiments on toy models of varying complexity,
although unfortunately no real-world application is shown.

Noise models with a full covariance matrix improve tracking accuracy,
and heteroscedastic noise models improve it even more.

The main difficulty is keeping the training stable. The authors
recommend the differentiable extended Kalman filter for getting
started, as it is the simplest filter and the least sensitive to
hyperparameter choices. If the task is strongly nonlinear, one should
use a differentiable unscented Kalman filter or a differentiable
particle filter.

* References