Merge branch 'hott'
commit cb33249e0c
9 changed files with 659 additions and 2 deletions
|
@@ -37,6 +37,10 @@
|
|||
Here you can find all my previous posts:
|
||||
<ul>
|
||||
|
||||
<li>
|
||||
<a href="./posts/hierarchical-optimal-transport-for-document-classification.html">Reading notes: Hierarchical Optimal Transport for Document Representation</a> - April 5, 2020
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="./posts/self-learning-chatbots-destygo.html">Mindsay: Towards Self-Learning Chatbots</a> - April 6, 2019
|
||||
</li>
|
||||
|
|
_site/atom.xml
|
@@ -8,8 +8,106 @@
|
|||
<name>Dimitri Lozeve</name>
|
||||
<email>dimitri+web@lozeve.com</email>
|
||||
</author>
|
||||
<updated>2019-04-06T00:00:00Z</updated>
|
||||
<updated>2020-04-05T00:00:00Z</updated>
|
||||
<entry>
|
||||
<title>Reading notes: Hierarchical Optimal Transport for Document Representation</title>
|
||||
<link href="https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html" />
|
||||
<id>https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html</id>
|
||||
<published>2020-04-05T00:00:00Z</published>
|
||||
<updated>2020-04-05T00:00:00Z</updated>
|
||||
</entry>
|
||||
<entry>
|
||||
<title>Mindsay: Towards Self-Learning Chatbots</title>
|
||||
<link href="https://www.lozeve.com/posts/self-learning-chatbots-destygo.html" />
|
||||
<id>https://www.lozeve.com/posts/self-learning-chatbots-destygo.html</id>
|
||||
|
|
_site/images/hott_fig1.png (new binary file, 667 KiB; not shown)
|
@@ -90,6 +90,10 @@ public key: RWQ6uexORp8f7USHA7nX9lFfltaCA9x6aBV06MvgiGjUt6BVf6McyD26
|
|||
|
||||
<ul>
|
||||
|
||||
<li>
|
||||
<a href="./posts/hierarchical-optimal-transport-for-document-classification.html">Reading notes: Hierarchical Optimal Transport for Document Representation</a> - April 5, 2020
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="./posts/self-learning-chatbots-destygo.html">Mindsay: Towards Self-Learning Chatbots</a> - April 6, 2019
|
||||
</li>
|
||||
|
|
|
@@ -0,0 +1,135 @@
|
|||
<!doctype html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<meta http-equiv="x-ua-compatible" content="ie=edge">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1">
|
||||
<meta name="description" content="Dimitri Lozeve's blog: Reading notes: Hierarchical Optimal Transport for Document Representation">
|
||||
<title>Dimitri Lozeve - Reading notes: Hierarchical Optimal Transport for Document Representation</title>
|
||||
<link rel="stylesheet" href="../css/default.css" />
|
||||
<link rel="stylesheet" href="../css/syntax.css" />
|
||||
|
||||
<!-- KaTeX CSS styles -->
|
||||
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.css" integrity="sha384-BdGj8xC2eZkQaxoQ8nSLefg4AV4/AwB3Fj+8SUSo7pnKP6Eoy18liIKTPn9oBYNG" crossorigin="anonymous">
|
||||
|
||||
<!-- The loading of KaTeX is deferred to speed up page rendering -->
|
||||
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.js" integrity="sha384-JiKN5O8x9Hhs/UE5cT5AAJqieYlOZbGT3CHws/y97o3ty4R7/O5poG9F3JoiOYw1" crossorigin="anonymous"></script>
|
||||
|
||||
<!-- To automatically render math in text elements, include the auto-render extension: -->
|
||||
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>
|
||||
|
||||
</head>
|
||||
<body>
|
||||
<header>
|
||||
<div class="logo">
|
||||
<a href="../">Dimitri Lozeve</a>
|
||||
</div>
|
||||
<nav>
|
||||
<a href="../">Home</a>
|
||||
<a href="../projects.html">Projects</a>
|
||||
<a href="../archive.html">Archive</a>
|
||||
<a href="../contact.html">Contact</a>
|
||||
</nav>
|
||||
</header>
|
||||
|
||||
<main role="main">
|
||||
<h1>Reading notes: Hierarchical Optimal Transport for Document Representation</h1>
|
||||
<article>
|
||||
<section class="header">
|
||||
Posted on April 5, 2020
|
||||
|
||||
</section>
|
||||
<section>
|
||||
<p>Two weeks ago, I gave a presentation to my colleagues on the paper by <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, presented at <a href="https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019">NeurIPS 2019</a>. It describes an interesting approach to document classification that leads to strong performance and, most importantly, excellent interpretability.</p>
|
||||
<p>This paper seems interesting to me because it uses two methods with strong theoretical guarantees: optimal transport and topic modelling. Optimal transport looks very promising to me for NLP, and has seen a lot of interest in recent years thanks to advances in approximation algorithms, such as entropy regularisation. It is also quite refreshing to see approaches built on solid results in optimisation, compared to purely experimental deep learning methods.</p>
|
||||
<h1 id="introduction-and-motivation">Introduction and motivation</h1>
|
||||
<p>The paper tackles the problem of measuring similarity (i.e. a distance) between pairs of documents, incorporating <em>semantic</em> similarities (and not only syntactic artefacts), without running into scalability issues.</p>
|
||||
<p>They propose a “meta-distance” between documents, called the hierarchical optimal topic transport (HOTT), providing a scalable metric incorporating topic information between documents. As such, they try to combine two different levels of analysis:</p>
|
||||
<ul>
|
||||
<li>word embedding data, to capture language knowledge (via pre-trained embeddings, for instance),</li>
|
||||
<li>topic modelling methods (e.g. <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a>), to represent semantically-meaningful groups of words.</li>
|
||||
</ul>
|
||||
<h1 id="background-optimal-transport">Background: optimal transport</h1>
|
||||
<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won’t enter into the details here. For an introduction to the theory and its applications, check out the excellent book from <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span>, (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> (in French) by Gabriel Peyré on the <a href="https://images.math.cnrs.fr/">CNRS maths blog</a>. Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you’re feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
|
||||
<p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space, to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different repartition of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover’s distance</a>, which is just an instance of the general Wasserstein metric.</p>
|
||||
<p>More formally, we start with two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots, x_n)\)</span> and <span class="math inline">\(y = (y_1, y_2, \ldots, y_m)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> with non-negative entries summing to 1). We can then define the Wasserstein distance as <span class="math display">\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
\]</span> <span class="math display">\[
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
\]</span> where <span class="math inline">\(C_{i,j} = d(x_i, y_j)\)</span> are the costs computed from the original distance between points, and <span class="math inline">\(P_{i,j}\)</span> represents the amount we move from pile <span class="math inline">\(i\)</span> to pile <span class="math inline">\(j\)</span>.</p>
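<p>To make the linear program above concrete, here is a minimal sketch (mine, not the authors’ code) that solves it directly with <code>scipy.optimize.linprog</code> on a toy example. In practice the authors use <a href="https://www.gurobi.com/">Gurobi</a>, but the formulation is exactly the one written above; the function name <code>wasserstein_1</code> and the toy data are purely illustrative.</p>
<pre><code>import numpy as np
from scipy.optimize import linprog

def wasserstein_1(p, q, C):
    # Minimise sum_ij C_ij P_ij subject to row sums = p, column sums = q, P nonnegative.
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j P[i, j] = p[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i P[i, j] = q[j]
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Toy example: piles of dirt at a few positions on the real line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 3.0])
C = np.abs(x[:, None] - y[None, :])   # ground costs d(x_i, y_j)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.6])
print(wasserstein_1(p, q, C))         # earth mover's distance between p and q
</code></pre>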
|
||||
<p>Now, how can this be applied to a natural language setting? Once we have word embeddings, we can consider that the vocabulary forms a metric space (we can compute a distance, for instance the euclidean or the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a>, between two word embeddings). The key is to define documents as <em>distributions</em> over words.</p>
|
||||
<p>Given a vocabulary <span class="math inline">\(V \subset \mathbb{R}^n\)</span> and a corpus <span class="math inline">\(D = (d^1, d^2, \ldots, d^{\lvert D \rvert})\)</span>, we represent a document as <span class="math inline">\(d^i \in \Delta^{l_i}\)</span> where <span class="math inline">\(l_i\)</span> is the number of unique words in <span class="math inline">\(d^i\)</span>, and <span class="math inline">\(d^i_j\)</span> is the proportion of word <span class="math inline">\(v_j\)</span> in the document <span class="math inline">\(d^i\)</span>. The word mover’s distance (WMD) is then defined simply as <span class="math display">\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]</span></p>
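<p>Continuing the same sketch, the word mover’s distance between two documents can be computed by building their bag-of-words distributions and plugging embedding distances into the <code>wasserstein_1</code> function above. The <code>embeddings</code> dictionary (mapping each word to its vector) is a placeholder for pre-trained embeddings such as GloVe.</p>
<pre><code>import numpy as np
from collections import Counter

def wmd(doc1, doc2, embeddings):
    # doc1, doc2: lists of tokens; embeddings: dict mapping each word to a vector.
    words1, words2 = sorted(set(doc1)), sorted(set(doc2))
    c1, c2 = Counter(doc1), Counter(doc2)
    p = np.array([c1[w] for w in words1], dtype=float)
    q = np.array([c2[w] for w in words2], dtype=float)
    p, q = p / p.sum(), q / q.sum()    # normalised word frequencies
    C = np.array([[np.linalg.norm(embeddings[w1] - embeddings[w2])
                   for w2 in words2] for w1 in words1])
    return wasserstein_1(p, q, C)      # defined in the previous snippet
</code></pre>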
|
||||
<p>If you didn’t follow all of this, don’t worry! The gist is: if you have a distance between points, you can solve an optimisation problem to obtain a distance between <em>distributions</em> over these points! This is especially useful when you consider that each word embedding is a point, and a document is just a set of words, along with the number of times they appear.</p>
|
||||
<h1 id="hierarchical-optimal-transport">Hierarchical optimal transport</h1>
|
||||
<p>With optimal transport, we can thus use the word mover’s distance to define a metric between documents. However, this approach suffers from two drawbacks:</p>
|
||||
<ul>
|
||||
<li>Documents represented as distributions over words are not easily interpretable. For long documents, the vocabulary is huge and word frequencies are not easily understandable for humans.</li>
|
||||
<li>Large vocabularies mean that the space on which we have to find an optimal matching is huge. The <a href="https://en.wikipedia.org/wiki/Hungarian_algorithm">Hungarian algorithm</a> used to compute the optimal transport distance runs in <span class="math inline">\(O(l^3 \log l)\)</span>, where <span class="math inline">\(l\)</span> is the maximum number of unique words in each document. This quickly becomes intractable as documents get longer, or if you have to compute all pairwise distances between a large number of documents (e.g. for clustering purposes).</li>
|
||||
</ul>
|
||||
<p>To escape these issues, we will add an intermediary step using <a href="https://en.wikipedia.org/wiki/Topic_model">topic modelling</a>. Once we have topics <span class="math inline">\(T = (t_1, t_2, \ldots, t_{\lvert T \rvert}) \subset \Delta^{\lvert V \rvert}\)</span>, we get two kinds of representations:</p>
|
||||
<ul>
|
||||
<li>representations of topics as distributions over words,</li>
|
||||
<li>representations of documents as distributions over topics <span class="math inline">\(\bar{d^i} \in \Delta^{\lvert T \rvert}\)</span>.</li>
|
||||
</ul>
|
||||
<p>Since topics are distributions over words, the word mover’s distance defines a metric between topics. As such, the set of topics equipped with the WMD becomes a metric space.</p>
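<p>Concretely, this means we can precompute a <span class="math inline">\(\lvert T \rvert \times \lvert T \rvert\)</span> matrix of WMD distances between topics. Here is a naive sketch of mine (in practice, the paper truncates each topic to its most probable words so that each transport problem stays small):</p>
<pre><code>def topic_distance_matrix(topics, vocab, embeddings):
    # topics: array of shape (num_topics, vocab_size), each row a distribution over words.
    # vocab: list of words; embeddings: dict mapping each word to a vector.
    C_words = np.array([[np.linalg.norm(embeddings[u] - embeddings[v])
                         for v in vocab] for u in vocab])
    T = len(topics)
    D = np.zeros((T, T))
    for a in range(T):
        for b in range(a + 1, T):
            D[a, b] = D[b, a] = wasserstein_1(topics[a], topics[b], C_words)
    return D
</code></pre>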
|
||||
<p>We can now define the hierarchical optimal topic transport (HOTT) as the optimal transport distance between documents represented as distributions over topics. For two documents <span class="math inline">\(d^1\)</span>, <span class="math inline">\(d^2\)</span>, <span class="math display">\[
\operatorname{HOTT}(d^1, d^2) = W_1\left( \sum_{k=1}^{\lvert T \rvert} \bar{d^1_k} \delta_{t_k}, \sum_{k=1}^{\lvert T \rvert} \bar{d^2_k} \delta_{t_k} \right),
\]</span> where <span class="math inline">\(\delta_{t_k}\)</span> is the Dirac distribution supported on topic <span class="math inline">\(t_k\)</span>.</p>
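<p>In code, and still as an illustrative sketch rather than the reference implementation, HOTT is then just one more call to the same solver, with the precomputed topic distances playing the role of the cost matrix:</p>
<pre><code>def hott(doc_topics1, doc_topics2, topic_distances):
    # doc_topics1, doc_topics2: distributions over the |T| topics for each document.
    # topic_distances: the precomputed |T| x |T| WMD matrix between topics.
    return wasserstein_1(doc_topics1, doc_topics2, topic_distances)
</code></pre>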
|
||||
<p>Note that in this case, we used optimal transport <em>twice</em>:</p>
|
||||
<ul>
|
||||
<li>once to find distances between topics (WMD),</li>
|
||||
<li>once to find distances between documents, where the distance between topics became the costs in the new optimal transport problem.</li>
|
||||
</ul>
|
||||
<p>The first one can be precomputed once and reused for all subsequent distances, so its cost does not grow with the number of documents we have to process. The second one only operates on <span class="math inline">\(\lvert T \rvert\)</span> topics instead of the full vocabulary: the resulting optimisation problem is much smaller! This is great for performance, as it now becomes easy to compute all pairwise distances in a large set of documents.</p>
|
||||
<p>Another interesting insight is that topics are represented as collections of words (we can keep the top 20 as a visual representation), and documents as collections of topics with weights. Both of these representations are highly interpretable for a human being who wants to understand what’s going on. I think this is one of the strongest aspects of these approaches: both the various representations and the algorithms are fully interpretable. Compared to a deep learning approach, we can make sense of every intermediate step, from the representations of topics to the weights in the optimisation algorithm used to compute higher-level distances.</p>
|
||||
<figure>
|
||||
<img src="../images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
|
||||
</figure>
|
||||
<h1 id="experiments">Experiments</h1>
|
||||
<p>The paper is very complete regarding experiments, providing a full evaluation of the method on one particular application: document clustering. They use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics and GloVe for pretrained word embeddings <span class="citation" data-cites="pennington2014_glove">(Pennington, Socher, and Manning <a href="#ref-pennington2014_glove">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on GitHub</a>.</p>
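<p>For reference, here is roughly how one could obtain the two ingredients (document-topic and topic-word distributions) with scikit-learn’s LDA implementation; the <code>corpus</code> below is a placeholder for an actual dataset:</p>
<pre><code>from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["the cat sat on the mat", "dogs and cats are animals",
          "stock markets fell sharply today"]   # placeholder documents

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)               # documents as distributions over topics
topic_words = lda.components_
topic_words = topic_words / topic_words.sum(axis=1, keepdims=True)  # topics as distributions over words
vocab = vectorizer.get_feature_names_out()
</code></pre>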
|
||||
<p>If you want the details, I encourage you to read the full paper: they tested the method on a wide variety of datasets, ranging from very short documents (like tweets) to long documents with a large vocabulary (books). With a simple <span class="math inline">\(k\)</span>-NN classification, they establish that HOTT performs best on average, especially on large vocabularies (books, the “gutenberg” dataset). It also has much better computational performance than alternative methods based on regularising the optimal transport problem directly on words. The hierarchical nature of the approach thus brings considerable gains in performance, along with improvements in interpretability.</p>
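<p>Note that once all pairwise HOTT distances are available, the <span class="math inline">\(k\)</span>-NN classification step itself is straightforward; with scikit-learn, for instance, a precomputed distance matrix can be fed in directly (the variable names here are placeholders):</p>
<pre><code>from sklearn.neighbors import KNeighborsClassifier

# D_train: (n_train, n_train) HOTT distances between training documents
# D_test:  (n_test, n_train) HOTT distances from test documents to training documents
knn = KNeighborsClassifier(n_neighbors=7, metric="precomputed")
knn.fit(D_train, y_train)
predictions = knn.predict(D_test)
</code></pre>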
|
||||
<p>What’s really interesting in the paper is the sensitivity analysis: they ran experiments with different word embedding methods (word2vec, <span class="citation" data-cites="mikolovDistributedRepresentationsWords2013">Mikolov et al. <a href="#ref-mikolovDistributedRepresentationsWords2013">2013</a></span>) and with different parameters for the topic modelling (topic truncation, number of topics, etc.). All of these reveal that changes in hyperparameters do not significantly impact the performance of HOTT. This is extremely important in a field like NLP, where small variations in approach often lead to drastically different results.</p>
|
||||
<h1 id="conclusion">Conclusion</h1>
|
||||
<p>All in all, this paper presents a very interesting approach to computing distances between natural-language documents. It is no secret that I like methods with a strong theoretical background (in this case optimisation and optimal transport), guaranteeing stability and benefiting from decades of research in a well-established domain.</p>
|
||||
<p>Most importantly, this paper allows for future exploration in document representation with <em>interpretability</em> in mind. This is often added as an afterthought in academic research but is one of the most important topics for the industry, as a system must be understood by end users, often not trained in ML, before being deployed. The notion of topic, and distances as weights, can be understood easily by anyone without significant background in ML or in maths.</p>
|
||||
<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully validated the method on real-world datasets, measuring its sensitivity to all the arbitrary choices they had to make. Again, from an industry perspective, this makes it possible to implement the new approach quickly and easily, while being confident that it won’t break unexpectedly, without the need for extensive testing.</p>
|
||||
<h1 id="references" class="unnumbered">References</h1>
|
||||
<div id="refs" class="references">
|
||||
<div id="ref-mikolovDistributedRepresentationsWords2013">
|
||||
<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, 3111–9. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
|
||||
</div>
|
||||
<div id="ref-pennington2014_glove">
|
||||
<p>Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “Glove: Global Vectors for Word Representation.” In <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</em>, 1532–43. Doha, Qatar: Association for Computational Linguistics. <a href="https://doi.org/10.3115/v1/D14-1162" class="uri">https://doi.org/10.3115/v1/D14-1162</a>.</p>
|
||||
</div>
|
||||
<div id="ref-peyreComputationalOptimalTransport2019">
|
||||
<p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–607. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
|
||||
</div>
|
||||
<div id="ref-santambrogioOptimalTransportApplied2015">
|
||||
<p>Santambrogio, Filippo. 2015. <em>Optimal Transport for Applied Mathematicians</em>. Vol. 87. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing. <a href="https://doi.org/10.1007/978-3-319-20828-2" class="uri">https://doi.org/10.1007/978-3-319-20828-2</a>.</p>
|
||||
</div>
|
||||
<div id="ref-villaniOptimalTransportOld2009">
|
||||
<p>Villani, Cédric. 2009. <em>Optimal Transport: Old and New</em>. Grundlehren Der Mathematischen Wissenschaften 338. Berlin: Springer.</p>
|
||||
</div>
|
||||
<div id="ref-yurochkin2019_hierar_optim_trans_docum_repres">
|
||||
<p>Yurochkin, Mikhail, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin M Solomon. 2019. “Hierarchical Optimal Transport for Document Representation.” In <em>Advances in Neural Information Processing Systems 32</em>, 1599–1609. <a href="http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf" class="uri">http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf</a>.</p>
|
||||
</div>
|
||||
</div>
|
||||
<section class="footnotes">
|
||||
<hr />
|
||||
<ol>
|
||||
<li id="fn1"><p>Optimal transport originated with Monge, and then Kantorovich, both of whom had very clear military applications in mind (either in Revolutionary France, or during WWII). A lot of historical examples move cannon balls, or other military equipment, along a front line.<a href="#fnref1" class="footnote-back">↩</a></p></li>
|
||||
</ol>
|
||||
</section>
|
||||
</section>
|
||||
</article>
|
||||
|
||||
</main>
|
||||
|
||||
<footer>
|
||||
Site proudly generated by
|
||||
<a href="http://jaspervdj.be/hakyll">Hakyll</a>
|
||||
</footer>
|
||||
</body>
|
||||
</html>
|
_site/rss.xml
|
@@ -7,8 +7,106 @@
|
|||
<description><![CDATA[Recent posts]]></description>
|
||||
<atom:link href="https://www.lozeve.com/rss.xml" rel="self"
|
||||
type="application/rss+xml" />
|
||||
<lastBuildDate>Sat, 06 Apr 2019 00:00:00 UT</lastBuildDate>
|
||||
<lastBuildDate>Sun, 05 Apr 2020 00:00:00 UT</lastBuildDate>
|
||||
<item>
|
||||
<title>Reading notes: Hierarchical Optimal Transport for Document Representation</title>
|
||||
<link>https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html</link>
|
||||
<description><![CDATA[<article>
|
||||
<section class="header">
|
||||
Posted on April 5, 2020
|
||||
|
||||
</section>
|
||||
<section>
|
||||
<p>Two weeks ago, I did a presentation for my colleagues of the paper from <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from <a href="https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019">NeurIPS 2019</a>. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
|
||||
<p>This paper seems interesting to me because of it uses two methods with strong theoretical guarantees: optimal transport and topic modelling. Optimal transport looks very promising to me in NLP, and has seen a lot of interest in recent years due to advances in approximation algorithms, such as entropy regularisation. It is also quite refreshing to see approaches using solid results in optimisation, compared to purely experimental deep learning methods.</p>
|
||||
<h1 id="introduction-and-motivation">Introduction and motivation</h1>
|
||||
<p>The problem of the paper is to measure similarity (i.e. a distance) between pairs of documents, by incorporating <em>semantic</em> similarities (and not only syntactic artefacts), without encountering scalability issues.</p>
|
||||
<p>They propose a “meta-distance” between documents, called the hierarchical optimal topic transport (HOTT), providing a scalable metric incorporating topic information between documents. As such, they try to combine two different levels of analysis:</p>
|
||||
<ul>
|
||||
<li>word embeddings data, to embed language knowledge (via pre-trained embeddings for instance),</li>
|
||||
<li>topic modelling methods (e.g. <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a>), to represent semantically-meaningful groups of words.</li>
|
||||
</ul>
|
||||
<h1 id="background-optimal-transport">Background: optimal transport</h1>
|
||||
<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won’t enter into the details here. For an introduction to the theory and its applications, check out the excellent book from <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span>, (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> (in French) by Gabriel Peyré on the <a href="https://images.math.cnrs.fr/">CNRS maths blog</a>. Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you’re feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
|
||||
<p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space, to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different repartition of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover’s distance</a>, which is just an instance of the general Wasserstein metric.</p>
|
||||
<p>More formally, we start with two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
|
||||
x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_n)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1). We can then define the Wasserstein distance as <span class="math display">\[
|
||||
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
|
||||
\]</span> <span class="math display">\[
|
||||
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
|
||||
\]</span> where <span class="math inline">\(C_{i,j} = d(x_i, x_j)\)</span> are the costs computed from the original distance between points, and <span class="math inline">\(P_{i,j}\)</span> represent the amount we are moving from pile <span class="math inline">\(i\)</span> to pile <span class="math inline">\(j\)</span>.</p>
|
||||
<p>Now, how can this be applied to a natural language setting? Once we have word embeddings, we can consider that the vocabulary forms a metric space (we can compute a distance, for instance the euclidean or the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a>, between two word embeddings). The key is to define documents as <em>distributions</em> over words.</p>
|
||||
<p>Given a vocabulary <span class="math inline">\(V \subset \mathbb{R}^n\)</span> and a corpus <span class="math inline">\(D = (d^1, d^2, \ldots, d^{\lvert D \rvert})\)</span>, we represent a document as <span class="math inline">\(d^i \in \Delta^{l_i}\)</span> where <span class="math inline">\(l_i\)</span> is the number of unique words in <span class="math inline">\(d^i\)</span>, and <span class="math inline">\(d^i_j\)</span> is the proportion of word <span class="math inline">\(v_j\)</span> in the document <span class="math inline">\(d^i\)</span>. The word mover’s distance (WMD) is then defined simply as <span class="math display">\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]</span></p>
|
||||
<p>If you didn’t follow all of this, don’t worry! The gist is: if you have a distance between points, you can solve an optimisation problem to obtain a distance between <em>distributions</em> over these points! This is especially useful when you consider that each word embedding is a point, and a document is just a set of words, along with the number of times they appear.</p>
|
||||
<h1 id="hierarchical-optimal-transport">Hierarchical optimal transport</h1>
|
||||
<p>Using optimal transport, we can use the word mover’s distance to define a metric between documents. However, this suffers from two drawbacks:</p>
|
||||
<ul>
|
||||
<li>Documents represented as distributions over words are not easily interpretable. For long documents, the vocabulary is huge and word frequencies are not easily understandable for humans.</li>
|
||||
<li>Large vocabularies mean that the space on which we have to find an optimal matching is huge. The <a href="https://en.wikipedia.org/wiki/Hungarian_algorithm">Hungarian algorithm</a> used to compute the optimal transport distance runs in <span class="math inline">\(O(l^3 \log l)\)</span>, where <span class="math inline">\(l\)</span> is the maximum number of unique words in each documents. This quickly becomes intractable as the size of documents become larger, or if you have to compute all pairwise distances between a large number of documents (e.g. for clustering purposes).</li>
|
||||
</ul>
|
||||
<p>To escape these issues, we will add an intermediary step using <a href="https://en.wikipedia.org/wiki/Topic_model">topic modelling</a>. Once we have topics <span class="math inline">\(T = (t_1, t_2, \ldots, t_{\lvert T
|
||||
\rvert}) \subset \Delta^{\lvert V \rvert}\)</span>, we get two kinds of representations:</p>
|
||||
<ul>
|
||||
<li>representations of topics as distributions over words,</li>
|
||||
<li>representations of documents as distributions over topics <span class="math inline">\(\bar{d^i} \in \Delta^{\lvert T \rvert}\)</span>.</li>
|
||||
</ul>
|
||||
<p>Since they are distributions over words, the word mover’s distance defines a metric over topics. As such, the topics with the WMD become a metric space.</p>
|
||||
<p>We can now define the hierarchical optimal topic transport (HOTT), as the optimal transport distance between documents, represented as distributions over topics. For two documents <span class="math inline">\(d^1\)</span>, <span class="math inline">\(d^2\)</span>, <span class="math display">\[
|
||||
\operatorname{HOTT}(d^1, d^2) = W_1\left( \sum_{k=1}^{\lvert T \rvert} \bar{d^1_k} \delta_{t_k}, \sum_{k=1}^{\lvert T \rvert} \bar{d^2_k} \delta_{t_k} \right).
|
||||
\]</span> where <span class="math inline">\(\delta_{t_k}\)</span> is a distribution supported on topic <span class="math inline">\(t_k\)</span>.</p>
|
||||
<p>Note that in this case, we used optimal transport <em>twice</em>:</p>
|
||||
<ul>
|
||||
<li>once to find distances between topics (WMD),</li>
|
||||
<li>once to find distances between documents, where the distance between topics became the costs in the new optimal transport problem.</li>
|
||||
</ul>
|
||||
<p>The first one can be precomputed once for all subsequent distances, so it is invariable in the number of documents we have to process. The second one only operates on <span class="math inline">\(\lvert T \rvert\)</span> topics instead of the full vocabulary: the resulting optimisation problem is much smaller! This is great for performance, as it should be easy now to compute all pairwise distances in a large set of documents.</p>
|
||||
<p>Another interesting insight is that topics are represented as collections of words (we can keep the top 20 as a visual representations), and documents as collections of topics with weights. Both of these representations are highly interpretable for a human being who wants to understand what’s going on. I think this is one of the strongest aspects of these approaches: both the various representations and the algorithms are fully interpretable. Compared to a deep learning approach, we can make sense of every intermediate step, from the representations of topics to the weights in the optimisation algorithm to compute higher-level distances.</p>
|
||||
<figure>
|
||||
<img src="/images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
|
||||
</figure>
|
||||
<h1 id="experiments">Experiments</h1>
|
||||
<p>The paper is very complete regarding experiments, providing a full evaluation of the method on one particular application: document clustering. They use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics and GloVe for pretrained word embeddings <span class="citation" data-cites="pennington2014_glove">(Pennington, Socher, and Manning <a href="#ref-pennington2014_glove">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on GitHub</a>.</p>
|
||||
<p>If you want the details, I encourage you to read the full paper, they tested the methods on a wide variety of datasets, with datasets containing very short documents (like Twitter), and long documents with a large vocabulary (books). With a simple <span class="math inline">\(k\)</span>-NN classification, they establish that HOTT performs best on average, especially on large vocabularies (books, the “gutenberg” dataset). It also has a much better computational performance than alternative methods based on regularisation of the optimal transport problem directly on words. So the hierarchical nature of the approach allows to gain considerably in performance, along with improvements in interpretability.</p>
|
||||
<p>What’s really interesting in the paper is the sensitivity analysis: they ran experiments with different word embeddings methods (word2vec, <span class="citation" data-cites="mikolovDistributedRepresentationsWords2013">(Mikolov et al. <a href="#ref-mikolovDistributedRepresentationsWords2013">2013</a>)</span>), and with different parameters for the topic modelling (topic truncation, number of topics, etc). All of these reveal that changes in hyperparameters do not impact the performance of HOTT significantly. This is extremely important in a field like NLP where most of the times small variations in approach lead to drastically different results.</p>
|
||||
<h1 id="conclusion">Conclusion</h1>
|
||||
<p>All in all, this paper present a very interesting approach to compute distance between natural-language documents. It is no secret that I like methods with strong theoretical background (in this case optimisation and optimal transport), guaranteeing a stability and benefiting from decades of research in a well-established domain.</p>
|
||||
<p>Most importantly, this paper allows for future exploration in document representation with <em>interpretability</em> in mind. This is often added as an afterthought in academic research but is one of the most important topics for the industry, as a system must be understood by end users, often not trained in ML, before being deployed. The notion of topic, and distances as weights, can be understood easily by anyone without significant background in ML or in maths.</p>
|
||||
<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked on real-world datasets, measuring sensitivity to all the arbitrary choices they had to make. Again, from an industry perspective, this means the new approach can be implemented quickly and easily, with confidence that it won’t break unexpectedly even without extensive testing.</p>
|
||||
<h1 id="references" class="unnumbered">References</h1>
|
||||
<div id="refs" class="references">
|
||||
<div id="ref-mikolovDistributedRepresentationsWords2013">
|
||||
<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, 3111–9. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
|
||||
</div>
|
||||
<div id="ref-pennington2014_glove">
|
||||
<p>Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “Glove: Global Vectors for Word Representation.” In <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</em>, 1532–43. Doha, Qatar: Association for Computational Linguistics. <a href="https://doi.org/10.3115/v1/D14-1162" class="uri">https://doi.org/10.3115/v1/D14-1162</a>.</p>
|
||||
</div>
|
||||
<div id="ref-peyreComputationalOptimalTransport2019">
|
||||
<p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–607. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
|
||||
</div>
|
||||
<div id="ref-santambrogioOptimalTransportApplied2015">
|
||||
<p>Santambrogio, Filippo. 2015. <em>Optimal Transport for Applied Mathematicians</em>. Vol. 87. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing. <a href="https://doi.org/10.1007/978-3-319-20828-2" class="uri">https://doi.org/10.1007/978-3-319-20828-2</a>.</p>
|
||||
</div>
|
||||
<div id="ref-villaniOptimalTransportOld2009">
|
||||
<p>Villani, Cédric. 2009. <em>Optimal Transport: Old and New</em>. Grundlehren Der Mathematischen Wissenschaften 338. Berlin: Springer.</p>
|
||||
</div>
|
||||
<div id="ref-yurochkin2019_hierar_optim_trans_docum_repres">
|
||||
<p>Yurochkin, Mikhail, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin M Solomon. 2019. “Hierarchical Optimal Transport for Document Representation.” In <em>Advances in Neural Information Processing Systems 32</em>, 1599–1609. <a href="http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf" class="uri">http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf</a>.</p>
|
||||
</div>
|
||||
</div>
|
||||
<section class="footnotes">
|
||||
<hr />
|
||||
<ol>
|
||||
<li id="fn1"><p>Optimal transport originated with Monge, and then Kantorovich, both of whom had very clear military applications in mind (either in Revolutionary France, or during WWII). A lot of historical examples move cannon balls, or other military equipment, along a front line.<a href="#fnref1" class="footnote-back">↩</a></p></li>
|
||||
</ol>
|
||||
</section>
|
||||
</section>
|
||||
</article>
|
||||
]]></description>
|
||||
<pubDate>Sun, 05 Apr 2020 00:00:00 UT</pubDate>
|
||||
<guid>https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html</guid>
|
||||
<dc:creator>Dimitri Lozeve</dc:creator>
|
||||
</item>
|
||||
<item>
|
||||
<title>Mindsay: Towards Self-Learning Chatbots</title>
|
||||
<link>https://www.lozeve.com/posts/self-learning-chatbots-destygo.html</link>
|
||||
<description><![CDATA[<article>
|
||||
|
|
|
@ -79,3 +79,109 @@
|
|||
isbn = 9780374275631,
|
||||
lccn = 2012533187,
|
||||
}
|
||||
|
||||
@incollection{yurochkin2019_hierar_optim_trans_docum_repres,
|
||||
author = {Yurochkin, Mikhail and Claici, Sebastian and Chien,
|
||||
Edward and Mirzazadeh, Farzaneh and Solomon, Justin
|
||||
M},
|
||||
booktitle = {Advances in Neural Information Processing Systems
|
||||
32},
|
||||
pages = {1599--1609},
|
||||
title = {Hierarchical Optimal Transport for Document
|
||||
Representation},
|
||||
url =
|
||||
{http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf},
|
||||
year = 2019,
|
||||
}
|
||||
|
||||
@article{peyreComputationalOptimalTransport2019,
|
||||
langid = {english},
|
||||
title = {Computational {{Optimal Transport}}},
|
||||
volume = {11},
|
||||
issn = {1935-8237, 1935-8245},
|
||||
url = {http://www.nowpublishers.com/article/Details/MAL-073},
|
||||
doi = {10.1561/2200000073},
|
||||
number = {5-6},
|
||||
journaltitle = {Foundations and Trends in Machine Learning},
|
||||
urldate = {2019-02-20},
|
||||
date = {2019},
|
||||
pages = {355-607},
|
||||
author = {Peyré, Gabriel and Cuturi, Marco},
|
||||
file = {/home/dimitri/Nextcloud/Zotero/storage/GLNYIRM9/Peyré and Cuturi - 2019 - Computational Optimal Transport.pdf}
|
||||
}
|
||||
|
||||
@book{santambrogioOptimalTransportApplied2015,
|
||||
location = {{Cham}},
|
||||
title = {Optimal {{Transport}} for {{Applied Mathematicians}}},
|
||||
volume = {87},
|
||||
isbn = {978-3-319-20827-5 978-3-319-20828-2},
|
||||
url = {http://link.springer.com/10.1007/978-3-319-20828-2},
|
||||
series = {Progress in {{Nonlinear Differential Equations}} and {{Their Applications}}},
|
||||
publisher = {{Springer International Publishing}},
|
||||
urldate = {2019-02-01},
|
||||
date = {2015},
|
||||
author = {Santambrogio, Filippo},
|
||||
file = {/home/dimitri/Nextcloud/Zotero/storage/8NHLGF5U/Santambrogio - 2015 - Optimal Transport for Applied Mathematicians.pdf},
|
||||
doi = {10.1007/978-3-319-20828-2}
|
||||
}
|
||||
|
||||
@book{villaniOptimalTransportOld2009,
|
||||
location = {{Berlin}},
|
||||
title = {Optimal Transport: Old and New},
|
||||
isbn = {978-3-540-71049-3},
|
||||
shorttitle = {Optimal Transport},
|
||||
pagetotal = {973},
|
||||
number = {338},
|
||||
series = {Grundlehren Der Mathematischen {{Wissenschaften}}},
|
||||
publisher = {{Springer}},
|
||||
date = {2009},
|
||||
keywords = {Probabilities,Dynamics,Dynamique,Géométrie différentielle,Geometry; Differential,Mathematical optimization,Optimisation mathématique,Probabilités,Problèmes de transport (Programmation),Transportation problems (Programming)},
|
||||
author = {Villani, Cédric},
|
||||
file = {/home/dimitri/Nextcloud/Zotero/storage/XMWCC335/Villani - 2009 - Optimal transport old and new.pdf},
|
||||
note = {OCLC: ocn244421231}
|
||||
}
|
||||
|
||||
@InProceedings{DBLP:conf/emnlp/PenningtonSM14,
|
||||
author = {Jeffrey Pennington and Richard Socher and
|
||||
Christopher D. Manning},
|
||||
title = {Glove: Global Vectors for Word Representation},
|
||||
year = 2014,
|
||||
booktitle = {Proceedings of the 2014 Conference on Empirical
|
||||
Methods in Natural Language Processing, {EMNLP}
|
||||
2014, October 25-29, 2014, Doha, Qatar, {A} meeting
|
||||
of SIGDAT, a Special Interest Group of the {ACL}},
|
||||
pages = {1532-1543},
|
||||
doi = {10.3115/v1/d14-1162},
|
||||
url = {https://doi.org/10.3115/v1/d14-1162},
|
||||
crossref = {DBLP:conf/emnlp/2014},
|
||||
timestamp = {Tue, 28 Jan 2020 10:28:11 +0100},
|
||||
biburl = {https://dblp.org/rec/conf/emnlp/PenningtonSM14.bib},
|
||||
bibsource = {dblp computer science bibliography,
|
||||
https://dblp.org}
|
||||
}
|
||||
|
||||
@inproceedings{pennington2014_glove,
|
||||
author = "Pennington, Jeffrey and Socher, Richard and Manning,
|
||||
Christopher",
|
||||
title = "{G}love: Global Vectors for Word Representation",
|
||||
booktitle = "Proceedings of the 2014 Conference on Empirical
|
||||
Methods in Natural Language Processing ({EMNLP})",
|
||||
year = 2014,
|
||||
pages = "1532--1543",
|
||||
doi = "10.3115/v1/D14-1162",
|
||||
url = {https://doi.org/10.3115/v1/D14-1162},
|
||||
address = "Doha, Qatar",
|
||||
month = oct,
|
||||
publisher = "Association for Computational Linguistics",
|
||||
}
|
||||
|
||||
@incollection{mikolovDistributedRepresentationsWords2013,
|
||||
title = {Distributed {{Representations}} of {{Words}} and {{Phrases}} and Their {{Compositionality}}},
|
||||
url = {http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf},
|
||||
booktitle = {Advances in {{Neural Information Processing Systems}} 26},
|
||||
urldate = {2019-08-13},
|
||||
date = {2013},
|
||||
pages = {3111--3119},
|
||||
author = {Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff},
|
||||
}
|
||||
|
||||
|
|
BIN
images/hott_fig1.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 667 KiB |
|
@ -0,0 +1,212 @@
|
|||
---
|
||||
title: "Reading notes: Hierarchical Optimal Transport for Document Representation"
|
||||
date: 2020-04-05
|
||||
---
|
||||
|
||||
Two weeks ago, I did a presentation for my colleagues of the paper
|
||||
from cite:yurochkin2019_hierar_optim_trans_docum_repres, from [[https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019][NeurIPS
|
||||
2019]]. It contains an interesting approach to document classification
|
||||
leading to strong performance, and, most importantly, excellent
|
||||
interpretability.
|
||||
|
||||
This paper seems interesting to me because it uses two methods with
|
||||
strong theoretical guarantees: optimal transport and topic
|
||||
modelling. Optimal transport looks very promising to me in NLP, and
|
||||
has seen a lot of interest in recent years due to advances in
|
||||
approximation algorithms, such as entropy regularisation. It is also
|
||||
quite refreshing to see approaches using solid results in
|
||||
optimisation, compared to purely experimental deep learning methods.
|
||||
|
||||
* Introduction and motivation
|
||||
|
||||
The problem of the paper is to measure similarity (i.e. a distance)
|
||||
between pairs of documents, by incorporating /semantic/ similarities
|
||||
(and not only syntactic artefacts), without encountering scalability
|
||||
issues.
|
||||
|
||||
They propose a "meta-distance" between documents, called the
|
||||
hierarchical optimal topic transport (HOTT), providing a scalable
|
||||
metric incorporating topic information between documents. As such,
|
||||
they try to combine two different levels of analysis:
|
||||
- word embeddings data, to embed language knowledge (via pre-trained
|
||||
embeddings for instance),
|
||||
- topic modelling methods (e.g. [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]]), to
|
||||
represent semantically-meaningful groups of words.
|
||||
|
||||
* Background: optimal transport
|
||||
|
||||
The essential backbone of the method is the Wasserstein distance,
|
||||
derived from optimal transport theory. Optimal transport is a
|
||||
fascinating and deep subject, so I won't enter into the details
|
||||
here. For an introduction to the theory and its applications, check
|
||||
out the excellent book from
|
||||
cite:peyreComputationalOptimalTransport2019, ([[https://arxiv.org/abs/1803.00567][available on ArXiv]] as
|
||||
well). There are also [[https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr][very nice posts]] (in French) by Gabriel Peyré on
|
||||
the [[https://images.math.cnrs.fr/][CNRS maths blog]]. Many more resources (including slides for
|
||||
presentations) are available at
|
||||
[[https://optimaltransport.github.io]]. For a more complete theoretical
|
||||
treatment of the subject, check out
|
||||
cite:santambrogioOptimalTransportApplied2015, or, if you're feeling
|
||||
particularly adventurous, cite:villaniOptimalTransportOld2009.
|
||||
|
||||
For this paper, only a superficial understanding of how the
|
||||
[[https://en.wikipedia.org/wiki/Wasserstein_metric][Wasserstein distance]] works is necessary. Optimal transport is an
|
||||
optimisation technique to lift a distance between points in a given
|
||||
metric space, to a distance between probability /distributions/ over
|
||||
this metric space. The historical example is to move piles of dirt
|
||||
around: you know the distance between any two points, and you have
|
||||
piles of dirt lying around[fn:historical_ot]. Now, if you want to move these piles to
|
||||
another configuration (fewer piles, say, or a different arrangement of
|
||||
dirt a few metres away), you need to find the most efficient way to
|
||||
move them. The total cost you obtain will define a distance between
|
||||
the two configurations of dirt, and is usually called the [[https://en.wikipedia.org/wiki/Earth_mover%27s_distance][earth
|
||||
mover's distance]], which is just an instance of the general Wasserstein
|
||||
metric.
|
||||
|
||||
[fn:historical_ot] Optimal transport originated with Monge, and then
|
||||
Kantorovich, both of whom had very clear military applications in mind
|
||||
(either in Revolutionary France, or during WWII). A lot of historical
|
||||
examples move cannon balls, or other military equipment, along a front
|
||||
line.
|
||||
|
||||
|
||||
More formally, we start with two sets of points $x = (x_1, x_2, \ldots,
|
||||
x_n)$, and $y = (y_1, y_2, \ldots, y_m)$, along with probability distributions $p \in \Delta^n$, $q \in \Delta^m$ over $x$ and $y$ ($\Delta^n$ is the probability simplex of dimension $n$, i.e. the set of vectors of size $n$ summing to 1). We can then define the Wasserstein distance as
|
||||
\[
|
||||
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
|
||||
\]
|
||||
\[
|
||||
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
|
||||
\]
|
||||
where $C_{i,j} = d(x_i, y_j)$ are the costs computed from the original distance between points, and $P_{i,j}$ represents the amount we are moving from pile $i$ to pile $j$.
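
Below is a small sketch (my own, not from the paper) of this discrete
optimal transport problem, solved as a plain linear program with
SciPy. The toy data and function names are illustrative assumptions.

#+begin_src python
import numpy as np
from scipy.optimize import linprog

def wasserstein_1(p, q, C):
    """W_1(p, q) for discrete distributions p (size n) and q (size m),
    given a cost matrix C of shape (n, m)."""
    n, m = C.shape
    # The transport plan P (n x m) is flattened row-major into a vector.
    # Row sums must equal p, column sums must equal q.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j P[i, j] = p[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i P[i, j] = q[j]
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Toy example: two half-piles of dirt, moved onto a single pile.
p = np.array([0.5, 0.5])
q = np.array([1.0])
C = np.array([[0.0], [2.0]])  # distances between pile locations
print(wasserstein_1(p, q, C))  # 1.0: half the mass travels a distance of 2
#+end_src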
|
||||
|
||||
Now, how can this be applied to a natural language setting? Once we
|
||||
have word embeddings, we can consider that the vocabulary forms a
|
||||
metric space (we can compute a distance, for instance the euclidean or
|
||||
the [[https://en.wikipedia.org/wiki/Cosine_similarity][cosine distance]], between two word embeddings). The key is to
|
||||
define documents as /distributions/ over words.
|
||||
|
||||
Given a vocabulary $V \subset \mathbb{R}^n$ and a corpus $D = (d^1, d^2, \ldots, d^{\lvert D \rvert})$, we represent a document as $d^i \in \Delta^{l_i}$ where $l_i$ is the number of unique words in $d^i$, and $d^i_j$ is the proportion of word $v_j$ in the document $d^i$.
|
||||
The word mover's distance (WMD) is then defined simply as
|
||||
\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]
|
||||
|
||||
If you didn't follow all of this, don't worry! The gist is: if you
|
||||
have a distance between points, you can solve an optimisation problem
|
||||
to obtain a distance between /distributions/ over these points! This
|
||||
is especially useful when you consider that each word embedding is a
|
||||
point, and a document is just a set of words, along with the number of
|
||||
times they appear.
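
As a rough illustration (my own sketch again, assuming an ~embeddings~
dictionary mapping each word to its vector), the WMD can be written
directly on top of the ~wasserstein_1~ helper sketched above:

#+begin_src python
import numpy as np
from collections import Counter

def word_distribution(tokens, embeddings):
    """Normalised bag of words: a document as {word: proportion}."""
    counts = Counter(t for t in tokens if t in embeddings)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def wmd(dist1, dist2, embeddings):
    """Word mover's distance between two distributions over words."""
    words1, words2 = list(dist1), list(dist2)
    p = np.array([dist1[w] for w in words1])
    q = np.array([dist2[w] for w in words2])
    # Ground costs: euclidean distance between word embeddings.
    C = np.array([[np.linalg.norm(embeddings[w1] - embeddings[w2])
                   for w2 in words2] for w1 in words1])
    return wasserstein_1(p, q, C)
#+end_src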
|
||||
|
||||
* Hierarchical optimal transport
|
||||
|
||||
Using optimal transport, we can use the word mover's distance to
|
||||
define a metric between documents. However, this suffers from two
|
||||
drawbacks:
|
||||
- Documents represented as distributions over words are not easily
|
||||
interpretable. For long documents, the vocabulary is huge and word
|
||||
frequencies are not easily understandable for humans.
|
||||
- Large vocabularies mean that the space on which we have to find an
|
||||
optimal matching is huge. The [[https://en.wikipedia.org/wiki/Hungarian_algorithm][Hungarian algorithm]] used to compute
|
||||
the optimal transport distance runs in $O(l^3 \log l)$, where $l$ is
|
||||
the maximum number of unique words in each document. This quickly
|
||||
becomes intractable as documents become larger, or if
|
||||
you have to compute all pairwise distances between a large number of
|
||||
documents (e.g. for clustering purposes).
|
||||
|
||||
To escape these issues, we will add an intermediary step using [[https://en.wikipedia.org/wiki/Topic_model][topic
|
||||
modelling]]. Once we have topics $T = (t_1, t_2, \ldots, t_{\lvert T
|
||||
\rvert}) \subset \Delta^{\lvert V \rvert}$, we get two kinds of
|
||||
representations:
|
||||
- representations of topics as distributions over words,
|
||||
- representations of documents as distributions over topics $\bar{d^i} \in \Delta^{\lvert T \rvert}$.
|
||||
|
||||
Since they are distributions over words, the word mover's distance
|
||||
defines a metric over topics. As such, the topics equipped with the WMD form
|
||||
a metric space.
|
||||
|
||||
We can now define the hierarchical optimal topic transport (HOTT), as the optimal transport distance between documents, represented as distributions over topics. For two documents $d^1$, $d^2$,
|
||||
\[
|
||||
\operatorname{HOTT}(d^1, d^2) = W_1\left( \sum_{k=1}^{\lvert T \rvert} \bar{d^1_k} \delta_{t_k}, \sum_{k=1}^{\lvert T \rvert} \bar{d^2_k} \delta_{t_k} \right).
|
||||
\]
|
||||
where $\delta_{t_k}$ is a distribution supported on topic $t_k$.
|
||||
|
||||
Note that in this case, we used optimal transport /twice/:
|
||||
- once to find distances between topics (WMD),
|
||||
- once to find distances between documents, where the distance between
|
||||
topics became the costs in the new optimal transport
|
||||
problem.
|
||||
|
||||
The first one can be precomputed once for all subsequent distances, so
|
||||
its cost does not grow with the number of documents we have to process. The
|
||||
second one only operates on $\lvert T \rvert$ topics instead of the
|
||||
full vocabulary: the resulting optimisation problem is much smaller!
|
||||
This is great for performance, as it should be easy now to compute all
|
||||
pairwise distances in a large set of documents.
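
A minimal sketch of HOTT along these lines, reusing the
~wasserstein_1~ and ~wmd~ helpers above (the topic and document
representations are assumed to be given, e.g. by LDA; this is not the
authors' implementation):

#+begin_src python
import numpy as np

def topic_costs(topic_word_dists, embeddings):
    """Pairwise WMD between topics, each given as a (truncated)
    distribution over words. Computed once, then reused for every
    pair of documents."""
    T = len(topic_word_dists)
    C = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            C[i, j] = C[j, i] = wmd(topic_word_dists[i],
                                    topic_word_dists[j], embeddings)
    return C

def hott(doc_topics1, doc_topics2, topic_cost_matrix):
    """HOTT between two documents given as distributions over topics."""
    return wasserstein_1(np.asarray(doc_topics1),
                         np.asarray(doc_topics2),
                         topic_cost_matrix)
#+end_src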
|
||||
|
||||
Another interesting insight is that topics are represented as
|
||||
collections of words (we can keep the top 20 as a visual
|
||||
representation), and documents as collections of topics with
|
||||
weights. Both of these representations are highly interpretable for a
|
||||
human being who wants to understand what's going on. I think this is
|
||||
one of the strongest aspects of these approaches: both the various
|
||||
representations and the algorithms are fully interpretable. Compared
|
||||
to a deep learning approach, we can make sense of every intermediate
|
||||
step, from the representations of topics to the weights in the
|
||||
optimisation algorithm to compute higher-level distances.
|
||||
|
||||
#+caption: Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books citep:yurochkin2019_hierar_optim_trans_docum_repres.
|
||||
[[file:/images/hott_fig1.png]]
|
||||
|
||||
* Experiments
|
||||
|
||||
The paper is very complete regarding experiments, providing a full
|
||||
evaluation of the method on one particular application: document
|
||||
clustering. They use [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]] to compute topics and
|
||||
GloVe for pretrained word embeddings citep:pennington2014_glove, and
|
||||
[[https://www.gurobi.com/][Gurobi]] to solve the optimisation problems. Their code is available [[https://github.com/IBM/HOTT][on
|
||||
GitHub]].
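
For illustration, here is roughly how the inputs of the ~hott~ sketch
above could be obtained with scikit-learn's LDA. This is a simplified
assumption about the pipeline, not the authors' code; the ~embeddings~
would be loaded from pre-trained GloVe vectors.

#+begin_src python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats are common pets",
        "stocks fell sharply on monday"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # documents as distributions over topics
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Truncate each topic to its top words, as the paper does.
topic_word_dists = []
for row in topic_word:
    top = row.argsort()[::-1][:20]
    weights = row[top] / row[top].sum()
    topic_word_dists.append({vocab[i]: w for i, w in zip(top, weights)})

# With GloVe embeddings loaded as a dict, the distance between the
# first two documents would then be:
# hott(doc_topics[0], doc_topics[1], topic_costs(topic_word_dists, embeddings))
#+end_src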
|
||||
|
||||
If you want the details, I encourage you to read the full paper: they
|
||||
tested the method on a wide variety of datasets, from those
|
||||
containing very short documents (like Twitter) to long documents
|
||||
with a large vocabulary (books). With a simple $k$-NN classification,
|
||||
they establish that HOTT performs best on average, especially on large
|
||||
vocabularies (books, the "gutenberg" dataset). It also has much
|
||||
better computational performance than alternative methods based on
|
||||
regularisation of the optimal transport problem directly on words. So
|
||||
the hierarchical nature of the approach brings considerable gains in
|
||||
performance, along with improvements in interpretability.
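
The $k$-NN evaluation can be reproduced with a precomputed distance
matrix; the snippet below is a sketch of that idea (my assumption
about the setup, not their exact protocol):

#+begin_src python
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(D_train, y_train, D_test, y_test, k=7):
    """D_train: (n_train, n_train) pairwise HOTT distances between
    training documents; D_test: (n_test, n_train) distances from test
    documents to the training documents."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="precomputed")
    clf.fit(D_train, y_train)
    return clf.score(D_test, y_test)
#+end_src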
|
||||
|
||||
What's really interesting in the paper is the sensitivity analysis:
|
||||
they ran experiments with different word embedding methods (word2vec,
|
||||
citep:mikolovDistributedRepresentationsWords2013), and with different
|
||||
parameters for the topic modelling (topic truncation, number of
|
||||
topics, etc). All of these reveal that changes in hyperparameters do
|
||||
not impact the performance of HOTT significantly. This is extremely
|
||||
important in a field like NLP where, most of the time, small variations
|
||||
in approach lead to drastically different results.
|
||||
|
||||
* Conclusion
|
||||
|
||||
All in all, this paper presents a very interesting approach to computing
|
||||
distances between natural-language documents. It is no secret that I
|
||||
like methods with a strong theoretical background (in this case
|
||||
optimisation and optimal transport), guaranteeing stability and
|
||||
benefiting from decades of research in a well-established domain.
|
||||
|
||||
Most importantly, this paper allows for future exploration in document
|
||||
representation with /interpretability/ in mind. This is often added as
|
||||
an afterthought in academic research but is one of the most important
|
||||
topics for the industry, as a system must be understood by end users,
|
||||
often not trained in ML, before being deployed. The notion of topic,
|
||||
and distances as weights, can be understood easily by anyone without
|
||||
significant background in ML or in maths.
|
||||
|
||||
Finally, I feel like they did not stop at a simple theoretical
|
||||
argument, but carefully checked on real-world datasets, measuring
|
||||
sensitivity to all the arbitrary choices they had to make. Again, from
|
||||
an industry perspective, this means the new approach can be implemented
|
||||
quickly and easily, with confidence that it won't break unexpectedly
|
||||
even without extensive testing.
|
||||
|
||||
* References
|