---
title: "Reading notes: Hierarchical Optimal Transport for Document Representation"
date: 2020-04-05
toc: true
---
Two weeks ago, I gave a presentation to my colleagues on the paper by
cite:yurochkin2019_hierar_optim_trans_docum_repres, from [[https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019][NeurIPS
2019]]. It presents an interesting approach to document classification
with strong performance and, most importantly, excellent
interpretability.
This paper interests me because it combines two methods with strong
theoretical guarantees: optimal transport and topic
modelling. Optimal transport looks very promising for NLP, and has
seen a lot of interest in recent years thanks to advances in
approximation algorithms, such as entropy regularisation. It is also
quite refreshing to see approaches built on solid results in
optimisation, compared to purely experimental deep learning methods.
* Introduction and motivation
The paper addresses the problem of measuring similarity (i.e. a
distance) between pairs of documents, incorporating /semantic/
similarities (and not only syntactic artefacts), without running into
scalability issues.
They propose a "meta-distance" between documents, called the
hierarchical optimal topic transport (HOTT), providing a scalable
metric incorporating topic information between documents. As such,
they try to combine two different levels of analysis:
- word embeddings, to capture language knowledge (via pre-trained
embeddings, for instance),
- topic modelling methods (e.g. [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]]), to
represent semantically-meaningful groups of words.
* Background: optimal transport
The essential backbone of the method is the Wasserstein distance,
derived from optimal transport theory. Optimal transport is a
fascinating and deep subject, so I won't go into the details
here. For an introduction to the theory and its applications, check
out the excellent book by cite:peyreComputationalOptimalTransport2019
(also [[https://arxiv.org/abs/1803.00567][available on ArXiv]]). There are also [[https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr][very nice posts]]
(in French) by Gabriel Peyré on
the [[https://images.math.cnrs.fr/][CNRS maths blog]]. Many more resources (including slides for
presentations) are available at
[[https://optimaltransport.github.io]]. For a more complete theoretical
treatment of the subject, check out
cite:santambrogioOptimalTransportApplied2015, or, if you're feeling
particularly adventurous, cite:villaniOptimalTransportOld2009.
For this paper, only a superficial understanding of how the
[[https://en.wikipedia.org/wiki/Wasserstein_metric][Wasserstein distance]] works is necessary. Optimal transport is an
optimisation technique that lifts a distance between points of a
metric space to a distance between probability /distributions/ over
that space. The historical example is moving piles of dirt around: you
know the distance between any two points, and you have piles of dirt
lying around[fn:historical_ot]. Now, if you want to move these piles to
another configuration (fewer piles, say, or a different distribution of
dirt a few metres away), you need to find the most efficient way to
move them. The total cost you obtain will define a distance between
the two configurations of dirt, and is usually called the [[https://en.wikipedia.org/wiki/Earth_mover%27s_distance][earth
mover's distance]], which is just an instance of the general Wasserstein
metric.
[fn:historical_ot] {-} Optimal transport originated with Monge, and then
Kantorovich, both of whom had very clear military applications in mind
(either in Revolutionary France, or during WWII). A lot of historical
examples involve moving cannonballs, or other military equipment,
along a front line.
More formally, we start with two sets of points $x = (x_1, x_2, \ldots,
x_n)$ and $y = (y_1, y_2, \ldots, y_m)$, along with probability distributions $p \in \Delta^n$, $q \in \Delta^m$ over $x$ and $y$ ($\Delta^n$ is the probability simplex of dimension $n$, i.e. the set of vectors of size $n$ summing to 1). We can then define the Wasserstein distance as
\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
\]
\[
\text{subject to } \sum_j P_{i,j} = p_i \text{ and } \sum_i P_{i,j} = q_j,
\]
where $C_{i,j} = d(x_i, y_j)$ are the costs computed from the original distance between points, and $P_{i,j}$ represents the amount we are moving from pile $i$ to pile $j$.
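To make this concrete, here is a minimal sketch of this linear program
with =numpy= and =scipy= (the paper itself solves its transport
problems with Gurobi); the piles of dirt at the end are a made-up toy
example.
#+begin_src python
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(p, q, C):
    """Solve the discrete optimal transport problem as a linear program.

    p: histogram over the n source points, q: histogram over the m target
    points, C: (n, m) cost matrix with C[i, j] = d(x_i, y_j).
    """
    n, m = C.shape
    # Unknowns: the transport plan P, flattened into a vector of n*m entries.
    c = C.ravel()
    # Row constraints: sum_j P[i, j] = p[i].
    A_rows = np.zeros((n, n * m))
    for i in range(n):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    # Column constraints: sum_i P[i, j] = q[j].
    A_cols = np.zeros((m, n * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0
    res = linprog(c, A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([p, q]), bounds=(0, None))
    return res.fun  # the optimal transport cost W_1(p, q)

# Toy example: two configurations of piles of dirt on the real line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 2.5])
C = np.abs(x[:, None] - y[None, :])  # ground distances between pile locations
p = np.array([0.4, 0.4, 0.2])        # source configuration
q = np.array([0.6, 0.4])             # target configuration
print(wasserstein_lp(p, q, C))
#+end_src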
Now, how can this be applied to a natural language setting? Once we
have word embeddings, we can consider that the vocabulary forms a
metric space (we can compute a distance, for instance the Euclidean or
the [[https://en.wikipedia.org/wiki/Cosine_similarity][cosine distance]], between two word embeddings). The key is to
define documents as /distributions/ over words.
Given a vocabulary $V \subset \mathbb{R}^n$ and a corpus $D = (d^1, d^2, \ldots, d^{\lvert D \rvert})$, we represent a document as $d^i \in \Delta^{l_i}$ where $l_i$ is the number of unique words in $d^i$, and $d^i_j$ is the proportion of word $v_j$ in the document $d^i$.
The word mover's distance (WMD) is then defined simply as
\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]
If you didn't follow all of this, don't worry! The gist is: if you
have a distance between points, you can solve an optimisation problem
to obtain a distance between /distributions/ over these points! This
is especially useful when you consider that each word embedding is a
point, and a document is just a set of words, along with the number of
times they appear.
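As a rough sketch of what this looks like in code, assuming a
dictionary of pre-trained embeddings and the [[https://pythonot.github.io/][POT]] library for the
transport problem (an assumption on my part; the paper's code uses
Gurobi):
#+begin_src python
import numpy as np
from collections import Counter
import ot  # Python Optimal Transport (an assumption; the paper solves its LPs with Gurobi)

def doc_to_distribution(tokens, embeddings):
    """Turn a tokenised document into (word vectors, normalised word frequencies)."""
    counts = Counter(t for t in tokens if t in embeddings)
    words = sorted(counts)
    vectors = np.stack([embeddings[w] for w in words])
    weights = np.array([counts[w] for w in words], dtype=float)
    return vectors, weights / weights.sum()

def wmd(tokens1, tokens2, embeddings):
    """Word mover's distance: W_1 between the two documents' word distributions."""
    X1, d1 = doc_to_distribution(tokens1, embeddings)
    X2, d2 = doc_to_distribution(tokens2, embeddings)
    C = ot.dist(X1, X2, metric="euclidean")  # pairwise distances between embeddings
    return ot.emd2(d1, d2, C)                # exact optimal transport cost

# Toy 2-dimensional "embeddings"; in practice these would be GloVe or word2vec vectors.
emb = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.9, 0.1]),
       "car": np.array([0.0, 1.0]), "bus": np.array([0.1, 0.9])}
print(wmd(["cat", "dog", "dog"], ["car", "bus"], emb))
#+end_src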
* Hierarchical optimal transport
Using optimal transport, we can use the word mover's distance to
define a metric between documents. However, this suffers from two
drawbacks:
- Documents represented as distributions over words are not easily
interpretable. For long documents, the vocabulary is huge and word
frequencies are not easily understandable for humans.
- Large vocabularies mean that the space on which we have to find an
optimal matching is huge. The [[https://en.wikipedia.org/wiki/Hungarian_algorithm][Hungarian algorithm]] used to compute
the optimal transport distance runs in $O(l^3 \log l)$, where $l$ is
the maximum number of unique words in each document. This quickly
becomes intractable as documents grow larger, or if
you have to compute all pairwise distances between a large number of
documents (e.g. for clustering purposes).
To escape these issues, we will add an intermediate step using [[https://en.wikipedia.org/wiki/Topic_model][topic
modelling]]. Once we have topics $T = (t_1, t_2, \ldots, t_{\lvert T
\rvert}) \subset \Delta^{\lvert V \rvert}$, we get two kinds of
representations:
- representations of topics as distributions over words,
- representations of documents as distributions over topics $\bar{d^i} \in \Delta^{\lvert T \rvert}$.
Since topics are themselves distributions over words, the word
mover's distance defines a metric between topics. The set of topics,
equipped with the WMD, thus becomes a metric space.
We can now define the hierarchical optimal topic transport (HOTT), as the optimal transport distance between documents, represented as distributions over topics. For two documents $d^1$, $d^2$,
\[
\operatorname{HOTT}(d^1, d^2) = W_1\left( \sum_{k=1}^{\lvert T \rvert} \bar{d^1_k} \delta_{t_k}, \sum_{k=1}^{\lvert T \rvert} \bar{d^2_k} \delta_{t_k} \right),
\]
where $\delta_{t_k}$ is a Dirac distribution centred on topic $t_k$.
Note that in this case, we used optimal transport /twice/:
- once to find distances between topics (WMD),
- once to find distances between documents, where the distance between
topics became the costs in the new optimal transport
problem.
The first one can be precomputed once for all subsequent distances, so
its cost does not grow with the number of documents we process. The
second one only operates on $\lvert T \rvert$ topics instead of the
full vocabulary: the resulting optimisation problem is much smaller!
This is great for performance, as it should be easy now to compute all
pairwise distances in a large set of documents.
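Here is how the two levels could fit together, again as a sketch under
the same assumptions (POT for the transport problems, topics given as
rows of a normalised topic-word matrix, and word embeddings aligned
with the topic model's vocabulary); the paper additionally truncates
each topic to its leading words before computing the topic-level WMD:
#+begin_src python
import numpy as np
import ot  # Python Optimal Transport, as in the previous sketch (an assumption)

def topic_cost_matrix(topics, word_vectors):
    """Precompute the WMD between every pair of topics.

    topics: (|T|, |V|) matrix, each row a distribution over the vocabulary
    (in the paper, truncated to each topic's top words).
    word_vectors: (|V|, dim) embeddings for the same vocabulary.
    """
    ground = ot.dist(word_vectors, word_vectors, metric="euclidean")
    n_topics = len(topics)
    cost = np.zeros((n_topics, n_topics))
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            cost[i, j] = cost[j, i] = ot.emd2(topics[i], topics[j], ground)
    return cost  # computed once, reused for every pair of documents

def hott(doc_topics_1, doc_topics_2, topic_costs):
    """HOTT distance: optimal transport between two documents' topic
    distributions, using the precomputed topic-to-topic WMD as ground cost."""
    return ot.emd2(doc_topics_1, doc_topics_2, topic_costs)
#+end_src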
Another interesting insight is that topics are represented as
collections of words (we can keep the top 20 words as a visual
representation), and documents as collections of topics with
weights. Both of these representations are highly interpretable for a
human being who wants to understand what's going on. I think this is
one of the strongest aspects of these approaches: both the various
representations and the algorithms are fully interpretable. Compared
to a deep learning approach, we can make sense of every intermediate
step, from the representations of topics to the weights in the
optimisation algorithm to compute higher-level distances.
[[file:/images/hott_fig1.jpg]]
[fn::{-} Representation of two documents in topic space, along with
how the distance was computed between them. Everything is
interpretable: from the documents as collections of topics, to the
matchings between topics determining the overall distance between the
books citep:yurochkin2019_hierar_optim_trans_docum_repres.]
* Experiments
The paper is very thorough on the experimental side, providing a full
evaluation of the method on one particular application: document
classification. They use [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]] to compute topics and
GloVe for pretrained word embeddings citep:pennington2014_glove, and
[[https://www.gurobi.com/][Gurobi]] to solve the optimisation problems. Their code is available [[https://github.com/IBM/HOTT][on
GitHub]].
If you want the details, I encourage you to read the full paper: they
tested the method on a wide variety of datasets, from very short
documents (like Twitter) to long documents with a large vocabulary
(books). With a simple $k$-NN classification, they establish that HOTT
performs best on average, especially on large vocabularies (books, the
"gutenberg" dataset). It also has much better computational
performance than alternative methods based on regularisation of the
optimal transport problem directly on words. So the hierarchical
nature of the approach yields considerable gains in performance, along
with improvements in interpretability.
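For reference, the $k$-NN evaluation is easy to reproduce once the
pairwise HOTT distances are precomputed; a sketch with scikit-learn,
where the distance matrices are assumed to come from the previous
sketch and the value of =n_neighbors= is an arbitrary choice of mine:
#+begin_src python
from sklearn.neighbors import KNeighborsClassifier

# D_train: (n_train, n_train) pairwise HOTT distances between training documents,
# D_test:  (n_test, n_train) distances from test documents to training documents,
# both computed with the hott() sketch above (assumed to exist here).
knn = KNeighborsClassifier(n_neighbors=7, metric="precomputed")
knn.fit(D_train, y_train)
print(knn.score(D_test, y_test))
#+end_src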
What's really interesting in the paper is the sensitivity analysis:
they ran experiments with different word embeddings methods (word2vec,
citep:mikolovDistributedRepresentationsWords2013), and with different
parameters for the topic modelling (topic truncation, number of
topics, etc). All of these reveal that changes in hyperparameters do
not impact the performance of HOTT significantly. This is extremely
important in a field like NLP, where small variations in approach
often lead to drastically different results.
* Conclusion
All in all, this paper presents a very interesting approach to
computing distances between natural-language documents. It is no
secret that I like methods with a strong theoretical background (in
this case optimisation and optimal transport), guaranteeing stability
and benefiting from decades of research in a well-established domain.
Most importantly, this paper allows for future exploration in document
representation with /interpretability/ in mind. This is often added as
an afterthought in academic research but is one of the most important
topics for the industry, as a system must be understood by end users,
often not trained in ML, before being deployed. The notion of topic,
and distances as weights, can be understood easily by anyone without
significant background in ML or in maths.
Finally, I feel like they did not stop at a simple theoretical
argument, but carefully validated their method on real-world datasets,
measuring sensitivity to all the arbitrary choices they had to make.
Again, from an industry perspective, this makes it possible to
implement the new approach quickly and easily, confident that it won't
break unexpectedly, without requiring extensive testing.
* References