---
title: "Reading notes: Hierarchical Optimal Transport for Document Representation"
date: 2020-04-05
---

Two weeks ago, I gave a presentation to my colleagues on the paper
from cite:yurochkin2019_hierar_optim_trans_docum_repres, published at
NeurIPS 2019. It contains an interesting approach to document
classification leading to strong performance, and, most importantly,
excellent interpretability.

This paper seems interesting to me because it uses two methods with
strong theoretical guarantees: optimal transport and topic
modelling. Optimal transport looks very promising to me in NLP, and
has seen a lot of interest in recent years due to advances in
approximation algorithms, such as entropy regularisation. It is also
quite refreshing to see approaches using solid results in
optimisation, compared to purely experimental deep learning methods.

* Introduction and motivation

The problem tackled by the paper is to measure the similarity
(i.e. a distance) between pairs of documents, by incorporating
/semantic/ similarities (and not only syntactic artefacts), without
encountering scalability issues.

They propose a "meta-distance" between documents, called the
hierarchical optimal topic transport (HOTT), providing a scalable
metric incorporating topic information between documents. As such,
they try to combine two different levels of analysis:
- word embeddings, to capture language knowledge (via pre-trained
  embeddings, for instance),
- topic modelling methods (e.g. [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]]), to
  represent semantically-meaningful groups of words (see the short
  sketch just after this list).
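As an aside (not from the paper), here is a minimal sketch of the
topic-modelling ingredient, using the scikit-learn implementation of
LDA linked above; the toy corpus and the number of topics are
arbitrary choices for illustration.

#+begin_src python
# Fit a small LDA topic model on a toy corpus with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "the stock market fell sharply today",
    "investors worry about market volatility",
]

# Bag-of-words counts, then a topic model with 2 topics.
counts = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row: a distribution over topics
print(doc_topics.round(2))
#+end_src

Each document comes out as a distribution over topics, and each topic
is itself a distribution over words; this two-level structure is what
HOTT builds on.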
* Background: optimal transport

The essential backbone of the method is the Wasserstein distance,
derived from optimal transport theory. Optimal transport is a
fascinating and deep subject, so I won't go into the details
here. For an introduction to the theory and its applications, check
out the excellent book from
cite:peyreComputationalOptimalTransport2019 ([[https://arxiv.org/abs/1803.00567][available on ArXiv]] as
well). There are also [[https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr][very nice posts]] by Gabriel Peyré on the CNRS
maths blog (in French). Many more resources (including slides for
presentations) are available at
[[https://optimaltransport.github.io]]. For a more complete theoretical
treatment of the subject, check out
cite:santambrogioOptimalTransportApplied2015, or, if you're feeling
particularly adventurous, cite:villaniOptimalTransportOld2009.

For this paper, only a superficial understanding of how the
[[https://en.wikipedia.org/wiki/Wasserstein_metric][Wasserstein distance]] works is necessary. Optimal transport is an
optimisation technique that lifts a distance between points in a
given metric space to a distance between probability /distributions/
over this metric space. The historical example is moving piles of
dirt around: you know the distance between any two points, and you
have piles of dirt lying around[fn:historical_ot]. Now, if you want
to move these piles into another configuration (fewer piles, say, or
a different distribution of dirt a few metres away), you need to find
the most efficient way to move them. The total cost you obtain
defines a distance between the two configurations of dirt, and is
usually called the [[https://en.wikipedia.org/wiki/Earth_mover%27s_distance][earth mover's distance]], which is just an instance
of the general Wasserstein metric.

[fn:historical_ot] Optimal transport originated with Monge, and later
Kantorovich, both of whom had very clear military applications in
mind (either in Revolutionary France, or during WWII). A lot of
historical examples involve moving cannon balls, or other military
equipment, along a front line.

More formally, if we have two sets of points
$x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_m)$,
along with probability distributions $p \in \Delta^n$ and
$q \in \Delta^m$ over $x$ and $y$ ($\Delta^n$ is the probability
simplex of dimension $n$, i.e. the set of vectors of size $n$ summing
to 1), we can define the Wasserstein distance as
\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}\\
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
\]
where $C_{i,j} = d(x_i, y_j)$ are the costs computed from the
original distance between points, and $P_{i,j}$ represents the amount
of mass moved from pile $i$ to pile $j$.
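Since $W_1$ is defined by a linear program, it can be computed with
an off-the-shelf LP solver. Below is a small illustrative sketch
(mine, not from the paper) using =scipy.optimize.linprog=; the
points, distributions, and costs are toy values.

#+begin_src python
import numpy as np
from scipy.optimize import linprog

def wasserstein_1(p, q, C):
    """Solve min <C, P> subject to P 1 = p, P^T 1 = q, P >= 0."""
    n, m = C.shape
    # The transport plan P is flattened (row-major) into a vector of size n*m.
    # Row-sum constraints: sum_j P[i, j] = p[i].
    A_rows = np.zeros((n, n * m))
    for i in range(n):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    # Column-sum constraints: sum_i P[i, j] = q[j].
    A_cols = np.zeros((m, n * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0
    res = linprog(C.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

# Toy example: piles of dirt on the real line.
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.5], [2.5]])
C = np.abs(x - y.T)            # pairwise distances d(x_i, y_j)
p = np.array([0.5, 0.3, 0.2])  # distribution over the piles at x
q = np.array([0.6, 0.4])       # target distribution over y
print(wasserstein_1(p, q, C))
#+end_src

Solving this LP exactly becomes expensive for large supports, which
is where entropy-regularised approximations such as the Sinkhorn
algorithm come in.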
Now, how can this be applied to a natural language setting? Once we
have word embeddings, we can consider that the vocabulary forms a
metric space (we can compute a distance, for instance the euclidean or
the [[https://en.wikipedia.org/wiki/Cosine_similarity][cosine distance]], between two word embeddings). The key is to
define documents as /distributions/ over words.

Given a vocabulary $V \subset \mathbb{R}^n$ and a corpus
$D = (d^1, d^2, \ldots, d^{\lvert D \rvert})$, we represent a
document as $d^i \in \Delta^{l_i}$, where $l_i$ is the number of
unique words in $d^i$, and $d^i_j$ is the proportion of word $v_j$ in
the document $d^i$.
The word mover's distance (WMD) is then defined simply as
\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]
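To make this concrete, here is a short sketch of the WMD (mine, not
from the paper), reusing the =wasserstein_1= function from the
previous block; the two-dimensional "embeddings" are toy values,
whereas in practice one would use pre-trained vectors.

#+begin_src python
from collections import Counter
import numpy as np

def nbow(tokens, embeddings):
    """Normalised bag of words over the document's unique known words."""
    counts = Counter(t for t in tokens if t in embeddings)
    words = sorted(counts)
    weights = np.array([counts[w] for w in words], dtype=float)
    return words, weights / weights.sum()

def wmd(doc1, doc2, embeddings):
    """WMD(d1, d2): W_1 between the two nBOW distributions, with
    euclidean distances between word embeddings as ground costs."""
    words1, p = nbow(doc1, embeddings)
    words2, q = nbow(doc2, embeddings)
    X = np.array([embeddings[w] for w in words1])
    Y = np.array([embeddings[w] for w in words2])
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return wasserstein_1(p, q, C)  # LP solver from the previous block

# Toy embeddings; real applications would use word2vec, GloVe, etc.
embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.2]),
    "car": np.array([0.0, 1.0]),
}
print(wmd("the cat chased the cat".split(),
          "a dog and a car".split(), embeddings))
#+end_src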
If you didn't follow all of this, don't worry! The gist is: if you
have a distance between points, you can solve an optimisation problem
to obtain a distance between /distributions/ over these points! This
is especially useful when you consider that each word embedding is a
point, and a document is just a set of words, along with the number of
times they appear.
* References