Add post on Kolmogorov-Arnold Networks
parent fe066d13ed
commit 681974170e
2 changed files with 127 additions and 0 deletions
@@ -740,3 +740,49 @@
volume = 21,
year = 1999,
}

@misc{liu2024_kan,
  author = {Liu, Ziming and Wang, Yixuan and Vaidya, Sachin and Ruehle, Fabian and Halverson, James and Soljačić, Marin and Hou, Thomas Y. and Tegmark, Max},
  title = {{KAN}: {Kolmogorov}-{Arnold} {Networks}},
  year = 2024,
  month = may,
  publisher = {arXiv},
  doi = {10.48550/arXiv.2404.19756},
  url = {http://arxiv.org/abs/2404.19756},
  note = {arXiv:2404.19756 [cond-mat, stat]},
  keywords = {Computer Science - Machine Learning, Condensed Matter - Disordered Systems and Neural Networks, Computer Science - Artificial Intelligence, Statistics - Machine Learning},
}

@article{chenNeuralOrdinaryDifferential2018,
  archivePrefix = {arXiv},
  eprinttype = {arxiv},
  eprint = {1806.07366},
  primaryClass = {cs, stat},
  title = {Neural {{Ordinary Differential Equations}}},
  url = {http://arxiv.org/abs/1806.07366},
  abstract = {We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.},
  urldate = {2019-01-05},
  date = {2018-06-19},
  keywords = {Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Machine Learning},
  author = {Chen, Ricky T. Q. and Rubanova, Yulia and Bettencourt, Jesse and Duvenaud, David},
  file = {/home/dimitri/Nextcloud/Zotero/storage/26D4Y3GG/Chen et al. - 2018 - Neural Ordinary Differential Equations.pdf;/home/dimitri/Nextcloud/Zotero/storage/RNXT4EQV/1806.html}
}

@article{ruthotto2024_differ_equat,
  author = {Ruthotto, Lars},
  title = {Differential {Equations} for {Continuous}-{Time} {Deep} {Learning}},
  journal = {Notices of the American Mathematical Society},
  year = 2024,
  month = may,
  volume = 71,
  number = 5,
  issn = {0002-9920, 1088-9477},
  doi = {10.1090/noti2930},
  url = {https://www.ams.org/notices/202405/rnoti-p613.pdf},
}
81 posts/kolmogorov-arnold-networks.org Normal file
@@ -0,0 +1,81 @@
---
title: "Reading notes: Kolmogorov-Arnold Networks"
date: 2024-06-08
tags: machine learning, paper
toc: false
---

This paper [cite:@liu2024_kan] proposes an alternative to multi-layer
perceptrons (MLPs) in machine learning.

The basic idea is that MLPs have parameters on the nodes of the
computation graph (the weights and biases on each cell), whereas KANs
put the parameters on the edges. Each edge has a learnable activation
function parameterized as a spline.

The network is learned at two levels, which allows for "adjusting
locally":
- the overall shape of the computation graph and its connections
  (external degrees of freedom, to learn the compositional structure),
- the parameters of each activation function (internal degrees of
  freedom).

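To make the "parameters on edges" idea concrete, here is a minimal
sketch of a KAN edge and a one-layer forward pass. It simplifies the
paper's construction: each edge is a piecewise-linear spline (the
paper uses cubic B-splines plus a SiLU residual term), and the names
=EdgeSpline= and =kan_layer= are mine, not the paper's.

```python
import numpy as np

class EdgeSpline:
    """One KAN edge: a learnable univariate activation function.

    Simplified sketch: a piecewise-linear spline on fixed grid
    points; the values at the grid points are the learnable
    parameters (the paper uses cubic B-spline coefficients)."""

    def __init__(self, n_grid=5, x_min=-1.0, x_max=1.0, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        self.grid = np.linspace(x_min, x_max, n_grid)
        self.coef = 0.1 * rng.standard_normal(n_grid)  # learnable

    def __call__(self, x):
        # Piecewise-linear interpolation of the learned values.
        return np.interp(x, self.grid, self.coef)

def kan_layer(x, edges):
    """Forward pass of one KAN layer.

    edges is a d_out x d_in grid of EdgeSpline; each output is the
    plain SUM of univariate activations of the inputs, so all the
    parameters live on the edges, and the nodes only add."""
    return np.array([sum(phi(xi) for phi, xi in zip(row, x))
                     for row in edges])
```

Stacking two such layers mirrors the two-level structure of the
representation theorem the paper builds on.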
It is based on the [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_representation_theorem][Kolmogorov-Arnold representation theorem]], which
says that any continuous multivariate function can be represented
using only sums and compositions of continuous univariate
functions. We recover the distinction between the compositional
structure of the sum and the structure of each internal univariate
function.

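Written out, the representation guaranteed by the theorem (for a
continuous \(f\) on \([0,1]^n\)) is:

\[ f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right) \]

where the \(\phi_{q,p}\) and \(\Phi_q\) are continuous univariate
functions: the inner sum is the first "layer", the outer sum the
second.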
The theorem can be interpreted as two layers, and the paper then
generalizes it to multiple layers of arbitrary width. In the theorem,
the univariate functions are arbitrary and can be complex (even
fractal), so the hope is that allowing arbitrary depth and width
makes it possible to use only splines. They derive an approximation
theorem: when replacing the arbitrary continuous functions of the
Kolmogorov-Arnold representation with splines, the error can be
bounded independently of the dimension. (However, there is a constant
which depends on the function and its representation, and therefore
on the dimension...) The theoretical scaling laws in the number of
parameters are much better than for MLPs, and moreover, experiments
show that KANs come much closer to their theoretical bounds than
MLPs do.

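If I read the approximation theorem correctly, the bound is of the
form

\[ \lVert f - \mathrm{KAN}_G \rVert_{C^m} \le C\, G^{-k-1+m} \]

where \(G\) is the spline grid size, \(k\) the spline order, and
\(\mathrm{KAN}_G\) (my notation) the network with that grid: the rate
is driven by \(G\) rather than by the input dimension \(n\), while
the dimension-dependence hides in the constant \(C\), which is the
caveat above.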
KANs have interesting properties:
- The splines are interpolated on grid points which can be iteratively
  refined. The fact that there is a notion of "fine-grainedness" is
  very interesting: it makes it possible to add parameters without
  having to retrain everything.
- Larger is not always better: the quality of the reconstruction
  depends on finding the optimal shape of the network, which should
  match the structure of the function we want to approximate. This
  optimal shape is found via sparsification, pruning, and
  regularization (non-trivial).
- We can have a "human in the loop" during training, guiding pruning
  and "symbolifying" some activations (e.g. recognizing that an
  activation function is actually a cosine and replacing it
  directly). This symbolic discovery can be guided by a symbolic
  system recognizing some functions. It is therefore a mix of symbolic
  regression and numerical regression.

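The grid-refinement idea from the first bullet can be sketched in a
few lines (again for a piecewise-linear spline rather than the
paper's B-splines; the function =refine= is my illustration, not the
paper's API): re-interpolate the current spline on a finer grid, so
the parameter count grows while the learned function is preserved.

```python
import numpy as np

def refine(grid, coef, new_n):
    """Grid refinement for a piecewise-linear spline edge (sketch).

    Evaluates the current spline (values `coef` at knots `grid`) on
    a finer grid of `new_n` points, returning the new knots and the
    new, larger set of learnable values. Training can then continue
    from this finer parameterization instead of restarting."""
    new_grid = np.linspace(grid[0], grid[-1], new_n)
    new_coef = np.interp(new_grid, grid, coef)  # old spline, new knots
    return new_grid, new_coef
```

When the old knots are a subset of the new ones (e.g. 5 points
refined to 9), the refined spline agrees exactly with the old one at
the old knots, which is the "add parameters without retraining
everything" property.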
They test mostly with scientific applications in mind: reconstructing
equations from physics and pure maths. Conceptually, it has a lot of
overlap with Neural Differential Equations
[cite:@chenNeuralOrdinaryDifferential2018;@ruthotto2024_differ_equat]
and "scientific ML" in general.

There is an interesting discussion at the end about KANs as the model
of choice for the "language of science". The idea is that LLMs are
important because they are useful for natural language, and KANs
could fill the same role for the language of functions. Their
interpretability and adaptability (the ability to be manipulated and
guided during training by a domain expert) are thus core features
that traditional deep learning models lack.

There are still challenges. Mostly, it is unclear how KANs perform on
other types of data and other modalities, but the results are very
encouraging. There is also a computational challenge: KANs are
obviously much slower to train, but almost no engineering work has
gone into optimizing them yet, so this is expected. The fact that the
operations are not easily batchable (compared to matrix
multiplication) is however worrying for scalability to large
networks.

* References