@@ -32,7 +31,9 @@
Background: optimal transport
The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won’t enter into the details here. For an introduction to the theory and its applications, check out the excellent book by Peyré and Cuturi (2019), also available on arXiv. There are also very nice posts (in French) by Gabriel Peyré on the CNRS maths blog. Many more resources (including slides for presentations) are available at https://optimaltransport.github.io. For a more complete theoretical treatment of the subject, check out Santambrogio (2015), or, if you’re feeling particularly adventurous, Villani (2009).
-For this paper, only a superficial understanding of how the Wasserstein distance works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space, to a distance between probability distributions over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around. Now, if you want to move these piles to another configuration (fewer piles, say, or a different repartition of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the earth mover’s distance, which is just an instance of the general Wasserstein metric.
+For this paper, only a superficial understanding of how the Wasserstein distance works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space to a distance between probability distributions over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around. (Optimal transport originated with Monge, and then Kantorovich, both of whom had very clear military applications in mind, either in Revolutionary France or during WWII. A lot of historical examples move cannon balls, or other military equipment, along a front line.)
+
+Now, if you want to move these piles to another configuration (fewer piles, say, or a different distribution of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the earth mover’s distance, which is just an instance of the general Wasserstein metric.
More formally, we start with two sets of points \(x = (x_1, x_2, \ldots,
x_n)\), and \(y = (y_1, y_2, \ldots, y_m)\), along with probability distributions \(p \in \Delta^n\), \(q \in \Delta^m\) over \(x\) and \(y\) (\(\Delta^n\) is the probability simplex of dimension \(n\), i.e. the set of non-negative vectors of size \(n\) summing to 1). We can then define the Wasserstein distance as \[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
@@ -65,9 +66,9 @@ W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
The first one can be precomputed once for all subsequent distances, so its cost does not depend on the number of documents we have to process. The second one only operates on \(\lvert T \rvert\) topics instead of the full vocabulary: the resulting optimisation problem is much smaller! This is great for performance, as it should now be easy to compute all pairwise distances in a large set of documents.
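To make the two levels concrete, here is a minimal sketch on made-up toy data. It is not the authors' code: the paper uses Gurobi, whereas I assume SciPy's `linprog` as a stand-in solver, and the names `transport_cost`, `topic_cost` and `hott`, as well as the word vectors, topics and documents, are hypothetical. It also skips any truncation of topics to their top words.

```python
import numpy as np
from scipy.optimize import linprog


def transport_cost(p, q, C):
    """Optimal transport cost between histograms p and q for ground cost C."""
    n, m = C.shape
    # P is flattened row by row, so entry (i, j) sits at index i * m + j.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))  # sum_j P_ij = p_i
    col_sums = np.kron(np.ones((1, n)), np.eye(m))  # sum_i P_ij = q_j
    result = linprog(
        C.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([p, q]),
        bounds=(0, None),
        method="highs",
    )
    return result.fun


# Toy data (made up): 4 words embedded in the plane, 2 topics, 2 documents.
word_vectors = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
word_cost = np.linalg.norm(word_vectors[:, None] - word_vectors[None, :], axis=-1)
topics = np.array([[0.5, 0.5, 0.0, 0.0],
                   [0.0, 0.0, 0.5, 0.5]])

# Step 1, done once: distances between topics, using word distances as cost.
n_topics = len(topics)
topic_cost = np.array([[transport_cost(topics[a], topics[b], word_cost)
                        for b in range(n_topics)] for a in range(n_topics)])


def hott(doc_a, doc_b):
    # Step 2, per pair of documents: a small problem over topics only.
    return transport_cost(doc_a, doc_b, topic_cost)


# Documents are weight vectors over the two topics.
d1 = np.array([0.8, 0.2])
d2 = np.array([0.3, 0.7])
print(topic_cost)
print(hott(d1, d2))
```

With these toy numbers, the two topics end up at distance 3 from each other, and the two documents at distance 1.5, since half of the mass has to move from one topic to the other. Note that `topic_cost` is computed once, while `hott` only ever solves a problem of size \(\lvert T \rvert \times \lvert T \rvert\).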
Another interesting insight is that topics are represented as collections of words (we can keep the top 20 as a visual representation), and documents as collections of topics with weights. Both of these representations are highly interpretable for a human being who wants to understand what’s going on. I think this is one of the strongest aspects of these approaches: both the various representations and the algorithms are fully interpretable. Compared to a deep learning approach, we can make sense of every intermediate step, from the representations of topics to the weights in the optimisation algorithm to compute higher-level distances.
-
-
Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019).
-
+
Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019).
+
+
Experiments
The paper is very thorough in its experiments, providing a full evaluation of the method on one particular application: document clustering. The authors use Latent Dirichlet Allocation to compute topics, GloVe for pretrained word embeddings (Pennington, Socher, and Manning 2014), and Gurobi to solve the optimisation problems. Their code is available on GitHub.
If you want the details, I encourage you to read the full paper. They tested the method on a wide variety of datasets, from very short documents (like Twitter) to long documents with a large vocabulary (books). With a simple \(k\)-NN classification, they establish that HOTT performs best on average, especially on large vocabularies (books, the “gutenberg” dataset). It also has much better computational performance than alternative methods based on regularisation of the optimal transport problem directly on words. So the hierarchical nature of the approach yields considerable gains in performance, along with improvements in interpretability.
@@ -97,12 +98,6 @@ W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
Yurochkin, Mikhail, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin M Solomon. 2019. “Hierarchical Optimal Transport for Document Representation.” In Advances in Neural Information Processing Systems 32, 1599–1609. http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf.
-
]]>