Add post on HOTT

This commit is contained in:
Dimitri Lozeve 2020-04-05 13:32:45 +02:00
parent f439654137
commit b033a5c26b
7 changed files with 378 additions and 2 deletions

View file

@@ -37,6 +37,10 @@
Here you can find all my previous posts:
<ul>
<li>
<a href="./posts/hierarchical-optimal-transport-for-document-classification.html">Reading notes: Hierarchical Optimal Transport for Document Representation</a> - April 5, 2020
</li>
<li>
<a href="./posts/self-learning-chatbots-destygo.html">Mindsay: Towards Self-Learning Chatbots</a> - April 6, 2019
</li>

View file

@@ -8,8 +8,65 @@
<name>Dimitri Lozeve</name>
<email>dimitri+web@lozeve.com</email>
</author>
<updated>2019-04-06T00:00:00Z</updated>
<updated>2020-04-05T00:00:00Z</updated>
<entry>
<title>Reading notes: Hierarchical Optimal Transport for Document Representation</title>
<link href="https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html" />
<id>https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html</id>
<published>2020-04-05T00:00:00Z</published>
<updated>2020-04-05T00:00:00Z</updated>
<summary type="html"><![CDATA[<article>
<section class="header">
Posted on April 5, 2020
</section>
<section>
<p>Two weeks ago, I gave a presentation to my colleagues on the paper by <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from NeurIPS 2019. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
<p>This paper seems interesting to me because it uses two methods with strong theoretical guarantees: optimal transport and topic modelling. Optimal transport looks very promising to me in NLP, and has seen a lot of interest in recent years due to advances in approximation algorithms, such as entropy regularisation. It is also quite refreshing to see approaches using solid results in optimisation, compared to purely experimental deep learning methods.</p>
<h1 id="introduction-and-motivation">Introduction and motivation</h1>
<p>The problem addressed by the paper is to measure similarity (i.e. a distance) between pairs of documents, by incorporating <em>semantic</em> similarities (and not only syntactic artefacts), without encountering scalability issues.</p>
<p>They propose a “meta-distance” between documents, called the hierarchical optimal topic transport (HOTT), providing a scalable metric incorporating topic information between documents. As such, they try to combine two different levels of analysis:</p>
<ul>
<li>word embeddings data, to embed language knowledge (via pre-trained embeddings for instance),</li>
<li>topic modelling methods (e.g. <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a>), to represent semantically-meaningful groups of words.</li>
</ul>
<h1 id="background-optimal-transport">Background: optimal transport</h1>
<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won't go into the details here. For an introduction to the theory and its applications, check out the excellent book by <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span> (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> by Gabriel Peyré on the CNRS maths blog (in French). Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you're feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
<p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different arrangement of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover's distance</a>, which is just an instance of the general Wasserstein metric.</p>
<p>More formally, if we have two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_m)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1), we can define the Wasserstein distance as <span class="math display">\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}\\
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
\]</span> where <span class="math inline">\(C_{i,j} = d(x_i, y_j)\)</span> are the costs computed from the original distance between points, and <span class="math inline">\(P_{i,j}\)</span> represents the amount we are moving from pile <span class="math inline">\(i\)</span> to pile <span class="math inline">\(j\)</span>.</p>
<p>Now, how can this be applied to a natural language setting? Once we have word embeddings, we can consider that the vocabulary forms a metric space (we can compute a distance, for instance the Euclidean or the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a>, between two word embeddings). The key is to define documents as <em>distributions</em> over words.</p>
<p>Given a vocabulary <span class="math inline">\(V \subset \mathbb{R}^n\)</span> and a corpus <span class="math inline">\(D = (d^1, d^2, \ldots, d^{\lvert D \rvert})\)</span>, we represent a document as <span class="math inline">\(d^i \in \Delta^{l_i}\)</span> where <span class="math inline">\(l_i\)</span> is the number of unique words in <span class="math inline">\(d^i\)</span>, and <span class="math inline">\(d^i_j\)</span> is the proportion of word <span class="math inline">\(v_j\)</span> in the document <span class="math inline">\(d^i\)</span>. The word mover's distance (WMD) is then defined simply as <span class="math display">\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]</span></p>
<p>If you didn't follow all of this, don't worry! The gist is: if you have a distance between points, you can solve an optimisation problem to obtain a distance between <em>distributions</em> over these points! This is especially useful when you consider that each word embedding is a point, and a document is just a set of words, along with the number of times they appear.</p>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-peyreComputationalOptimalTransport2019">
<p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–607. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
</div>
<div id="ref-santambrogioOptimalTransportApplied2015">
<p>Santambrogio, Filippo. 2015. <em>Optimal Transport for Applied Mathematicians</em>. Vol. 87. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing. <a href="https://doi.org/10.1007/978-3-319-20828-2" class="uri">https://doi.org/10.1007/978-3-319-20828-2</a>.</p>
</div>
<div id="ref-villaniOptimalTransportOld2009">
<p>Villani, Cédric. 2009. <em>Optimal Transport: Old and New</em>. Grundlehren Der Mathematischen Wissenschaften 338. Berlin: Springer.</p>
</div>
<div id="ref-yurochkin2019_hierar_optim_trans_docum_repres">
<p>Yurochkin, Mikhail, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin M Solomon. 2019. “Hierarchical Optimal Transport for Document Representation.” In <em>Advances in Neural Information Processing Systems 32</em>, 1599–1609. <a href="http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf" class="uri">http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf</a>.</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Optimal transport originated with Monge, and then Kantorovich, both of whom had very clear military applications in mind (either in Revolutionary France, or during WWII). A lot of historical examples move cannon balls, or other military equipment, along a front line.<a href="#fnref1" class="footnote-back"></a></p></li>
</ol>
</section>
</section>
</article>
]]></summary>
</entry>
<entry>
<title>Mindsay: Towards Self-Learning Chatbots</title>
<link href="https://www.lozeve.com/posts/self-learning-chatbots-destygo.html" />
<id>https://www.lozeve.com/posts/self-learning-chatbots-destygo.html</id>

View file

@@ -90,6 +90,10 @@ public key: RWQ6uexORp8f7USHA7nX9lFfltaCA9x6aBV06MvgiGjUt6BVf6McyD26
<ul>
<li>
<a href="./posts/hierarchical-optimal-transport-for-document-classification.html">Reading notes: Hierarchical Optimal Transport for Document Representation</a> - April 5, 2020
</li>
<li>
<a href="./posts/self-learning-chatbots-destygo.html">Mindsay: Towards Self-Learning Chatbots</a> - April 6, 2019
</li>

View file

@@ -0,0 +1,94 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Dimitri Lozeve's blog: Reading notes: Hierarchical Optimal Transport for Document Representation">
<title>Dimitri Lozeve - Reading notes: Hierarchical Optimal Transport for Document Representation</title>
<link rel="stylesheet" href="../css/default.css" />
<link rel="stylesheet" href="../css/syntax.css" />
<!-- KaTeX CSS styles -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.css" integrity="sha384-BdGj8xC2eZkQaxoQ8nSLefg4AV4/AwB3Fj+8SUSo7pnKP6Eoy18liIKTPn9oBYNG" crossorigin="anonymous">
<!-- The loading of KaTeX is deferred to speed up page rendering -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.js" integrity="sha384-JiKN5O8x9Hhs/UE5cT5AAJqieYlOZbGT3CHws/y97o3ty4R7/O5poG9F3JoiOYw1" crossorigin="anonymous"></script>
<!-- To automatically render math in text elements, include the auto-render extension: -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>
</head>
<body>
<header>
<div class="logo">
<a href="../">Dimitri Lozeve</a>
</div>
<nav>
<a href="../">Home</a>
<a href="../projects.html">Projects</a>
<a href="../archive.html">Archive</a>
<a href="../contact.html">Contact</a>
</nav>
</header>
<main role="main">
<h1>Reading notes: Hierarchical Optimal Transport for Document Representation</h1>
<article>
<section class="header">
Posted on April 5, 2020
</section>
<section>
<p>Two weeks ago, I gave a presentation to my colleagues on the paper by <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from NeurIPS 2019. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
<p>This paper seems interesting to me because it uses two methods with strong theoretical guarantees: optimal transport and topic modelling. Optimal transport looks very promising to me in NLP, and has seen a lot of interest in recent years due to advances in approximation algorithms, such as entropy regularisation. It is also quite refreshing to see approaches using solid results in optimisation, compared to purely experimental deep learning methods.</p>
<h1 id="introduction-and-motivation">Introduction and motivation</h1>
<p>The problem addressed by the paper is to measure similarity (i.e. a distance) between pairs of documents, by incorporating <em>semantic</em> similarities (and not only syntactic artefacts), without encountering scalability issues.</p>
<p>They propose a “meta-distance” between documents, called the hierarchical optimal topic transport (HOTT), providing a scalable metric incorporating topic information between documents. As such, they try to combine two different levels of analysis:</p>
<ul>
<li>word embeddings data, to embed language knowledge (via pre-trained embeddings for instance),</li>
<li>topic modelling methods (e.g. <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a>), to represent semantically-meaningful groups of words.</li>
</ul>
<h1 id="background-optimal-transport">Background: optimal transport</h1>
<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won't go into the details here. For an introduction to the theory and its applications, check out the excellent book by <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span> (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> by Gabriel Peyré on the CNRS maths blog (in French). Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you're feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
<p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different arrangement of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover's distance</a>, which is just an instance of the general Wasserstein metric.</p>
<p>More formally, if we have two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_m)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1), we can define the Wasserstein distance as <span class="math display">\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}\\
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
\]</span> where <span class="math inline">\(C_{i,j} = d(x_i, y_j)\)</span> are the costs computed from the original distance between points, and <span class="math inline">\(P_{i,j}\)</span> represents the amount we are moving from pile <span class="math inline">\(i\)</span> to pile <span class="math inline">\(j\)</span>.</p>
<p>Now, how can this be applied to a natural language setting? Once we have word embeddings, we can consider that the vocabulary forms a metric space (we can compute a distance, for instance the Euclidean or the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a>, between two word embeddings). The key is to define documents as <em>distributions</em> over words.</p>
<p>Given a vocabulary <span class="math inline">\(V \subset \mathbb{R}^n\)</span> and a corpus <span class="math inline">\(D = (d^1, d^2, \ldots, d^{\lvert D \rvert})\)</span>, we represent a document as <span class="math inline">\(d^i \in \Delta^{l_i}\)</span> where <span class="math inline">\(l_i\)</span> is the number of unique words in <span class="math inline">\(d^i\)</span>, and <span class="math inline">\(d^i_j\)</span> is the proportion of word <span class="math inline">\(v_j\)</span> in the document <span class="math inline">\(d^i\)</span>. The word mover's distance (WMD) is then defined simply as <span class="math display">\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]</span></p>
<p>If you didn't follow all of this, don't worry! The gist is: if you have a distance between points, you can solve an optimisation problem to obtain a distance between <em>distributions</em> over these points! This is especially useful when you consider that each word embedding is a point, and a document is just a set of words, along with the number of times they appear.</p>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-peyreComputationalOptimalTransport2019">
<p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–607. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
</div>
<div id="ref-santambrogioOptimalTransportApplied2015">
<p>Santambrogio, Filippo. 2015. <em>Optimal Transport for Applied Mathematicians</em>. Vol. 87. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing. <a href="https://doi.org/10.1007/978-3-319-20828-2" class="uri">https://doi.org/10.1007/978-3-319-20828-2</a>.</p>
</div>
<div id="ref-villaniOptimalTransportOld2009">
<p>Villani, Cédric. 2009. <em>Optimal Transport: Old and New</em>. Grundlehren Der Mathematischen Wissenschaften 338. Berlin: Springer.</p>
</div>
<div id="ref-yurochkin2019_hierar_optim_trans_docum_repres">
<p>Yurochkin, Mikhail, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin M Solomon. 2019. “Hierarchical Optimal Transport for Document Representation.” In <em>Advances in Neural Information Processing Systems 32</em>, 1599–1609. <a href="http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf" class="uri">http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf</a>.</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Optimal transport originated with Monge, and then Kantorovich, both of whom had very clear military applications in mind (either in Revolutionary France, or during WWII). A lot of historical examples move cannon balls, or other military equipment, along a front line.<a href="#fnref1" class="footnote-back"></a></p></li>
</ol>
</section>
</section>
</article>
</main>
<footer>
Site proudly generated by
<a href="http://jaspervdj.be/hakyll">Hakyll</a>
</footer>
</body>
</html>

View file

@@ -7,8 +7,65 @@
<description><![CDATA[Recent posts]]></description>
<atom:link href="https://www.lozeve.com/rss.xml" rel="self"
type="application/rss+xml" />
<lastBuildDate>Sat, 06 Apr 2019 00:00:00 UT</lastBuildDate>
<lastBuildDate>Sun, 05 Apr 2020 00:00:00 UT</lastBuildDate>
<item>
<title>Reading notes: Hierarchical Optimal Transport for Document Representation</title>
<link>https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html</link>
<description><![CDATA[<article>
<section class="header">
Posted on April 5, 2020
</section>
<section>
<p>Two weeks ago, I gave a presentation to my colleagues on the paper by <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from NeurIPS 2019. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
<p>This paper seems interesting to me because it uses two methods with strong theoretical guarantees: optimal transport and topic modelling. Optimal transport looks very promising to me in NLP, and has seen a lot of interest in recent years due to advances in approximation algorithms, such as entropy regularisation. It is also quite refreshing to see approaches using solid results in optimisation, compared to purely experimental deep learning methods.</p>
<h1 id="introduction-and-motivation">Introduction and motivation</h1>
<p>The problem addressed by the paper is to measure similarity (i.e. a distance) between pairs of documents, by incorporating <em>semantic</em> similarities (and not only syntactic artefacts), without encountering scalability issues.</p>
<p>They propose a “meta-distance” between documents, called the hierarchical optimal topic transport (HOTT), providing a scalable metric incorporating topic information between documents. As such, they try to combine two different levels of analysis:</p>
<ul>
<li>word embeddings data, to embed language knowledge (via pre-trained embeddings for instance),</li>
<li>topic modelling methods (e.g. <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a>), to represent semantically-meaningful groups of words.</li>
</ul>
<h1 id="background-optimal-transport">Background: optimal transport</h1>
<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won't go into the details here. For an introduction to the theory and its applications, check out the excellent book by <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span> (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> by Gabriel Peyré on the CNRS maths blog (in French). Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you're feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
<p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different arrangement of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover's distance</a>, which is just an instance of the general Wasserstein metric.</p>
<p>More formally, if we have two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_m)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1), we can define the Wasserstein distance as <span class="math display">\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}\\
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
\]</span> where <span class="math inline">\(C_{i,j} = d(x_i, y_j)\)</span> are the costs computed from the original distance between points, and <span class="math inline">\(P_{i,j}\)</span> represents the amount we are moving from pile <span class="math inline">\(i\)</span> to pile <span class="math inline">\(j\)</span>.</p>
<p>Now, how can this be applied to a natural language setting? Once we have word embeddings, we can consider that the vocabulary forms a metric space (we can compute a distance, for instance the Euclidean or the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a>, between two word embeddings). The key is to define documents as <em>distributions</em> over words.</p>
<p>Given a vocabulary <span class="math inline">\(V \subset \mathbb{R}^n\)</span> and a corpus <span class="math inline">\(D = (d^1, d^2, \ldots, d^{\lvert D \rvert})\)</span>, we represent a document as <span class="math inline">\(d^i \in \Delta^{l_i}\)</span> where <span class="math inline">\(l_i\)</span> is the number of unique words in <span class="math inline">\(d^i\)</span>, and <span class="math inline">\(d^i_j\)</span> is the proportion of word <span class="math inline">\(v_j\)</span> in the document <span class="math inline">\(d^i\)</span>. The word mover's distance (WMD) is then defined simply as <span class="math display">\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]</span></p>
<p>If you didn't follow all of this, don't worry! The gist is: if you have a distance between points, you can solve an optimisation problem to obtain a distance between <em>distributions</em> over these points! This is especially useful when you consider that each word embedding is a point, and a document is just a set of words, along with the number of times they appear.</p>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-peyreComputationalOptimalTransport2019">
<p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–607. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
</div>
<div id="ref-santambrogioOptimalTransportApplied2015">
<p>Santambrogio, Filippo. 2015. <em>Optimal Transport for Applied Mathematicians</em>. Vol. 87. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing. <a href="https://doi.org/10.1007/978-3-319-20828-2" class="uri">https://doi.org/10.1007/978-3-319-20828-2</a>.</p>
</div>
<div id="ref-villaniOptimalTransportOld2009">
<p>Villani, Cédric. 2009. <em>Optimal Transport: Old and New</em>. Grundlehren Der Mathematischen Wissenschaften 338. Berlin: Springer.</p>
</div>
<div id="ref-yurochkin2019_hierar_optim_trans_docum_repres">
<p>Yurochkin, Mikhail, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin M Solomon. 2019. “Hierarchical Optimal Transport for Document Representation.” In <em>Advances in Neural Information Processing Systems 32</em>, 1599–1609. <a href="http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf" class="uri">http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf</a>.</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Optimal transport originated with Monge, and then Kantorovich, both of whom had very clear military applications in mind (either in Revolutionary France, or during WWII). A lot of historical examples move cannon balls, or other military equipment, along a front line.<a href="#fnref1" class="footnote-back"></a></p></li>
</ol>
</section>
</section>
</article>
]]></description>
<pubDate>Sun, 05 Apr 2020 00:00:00 UT</pubDate>
<guid>https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html</guid>
<dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
<title>Mindsay: Towards Self-Learning Chatbots</title>
<link>https://www.lozeve.com/posts/self-learning-chatbots-destygo.html</link>
<description><![CDATA[<article>

View file

@@ -79,3 +79,65 @@
isbn = 9780374275631,
lccn = 2012533187,
}
@incollection{yurochkin2019_hierar_optim_trans_docum_repres,
author = {Yurochkin, Mikhail and Claici, Sebastian and Chien,
Edward and Mirzazadeh, Farzaneh and Solomon, Justin
M},
booktitle = {Advances in Neural Information Processing Systems
32},
pages = {1599--1609},
title = {Hierarchical Optimal Transport for Document
Representation},
url =
{http://papers.nips.cc/paper/8438-hierarchical-optimal-transport-for-document-representation.pdf},
year = 2019,
}
@article{peyreComputationalOptimalTransport2019,
langid = {english},
title = {Computational {{Optimal Transport}}},
volume = {11},
issn = {1935-8237, 1935-8245},
url = {http://www.nowpublishers.com/article/Details/MAL-073},
doi = {10.1561/2200000073},
number = {5-6},
journaltitle = {Foundations and Trends in Machine Learning},
urldate = {2019-02-20},
date = {2019},
pages = {355-607},
author = {Peyré, Gabriel and Cuturi, Marco},
file = {/home/dimitri/Nextcloud/Zotero/storage/GLNYIRM9/Peyré and Cuturi - 2019 - Computational Optimal Transport.pdf}
}
@book{santambrogioOptimalTransportApplied2015,
location = {{Cham}},
title = {Optimal {{Transport}} for {{Applied Mathematicians}}},
volume = {87},
isbn = {978-3-319-20827-5 978-3-319-20828-2},
url = {http://link.springer.com/10.1007/978-3-319-20828-2},
series = {Progress in {{Nonlinear Differential Equations}} and {{Their Applications}}},
publisher = {{Springer International Publishing}},
urldate = {2019-02-01},
date = {2015},
author = {Santambrogio, Filippo},
file = {/home/dimitri/Nextcloud/Zotero/storage/8NHLGF5U/Santambrogio - 2015 - Optimal Transport for Applied Mathematicians.pdf},
doi = {10.1007/978-3-319-20828-2}
}
@book{villaniOptimalTransportOld2009,
location = {{Berlin}},
title = {Optimal Transport: Old and New},
isbn = {978-3-540-71049-3},
shorttitle = {Optimal Transport},
pagetotal = {973},
number = {338},
series = {Grundlehren Der Mathematischen {{Wissenschaften}}},
publisher = {{Springer}},
date = {2009},
keywords = {Probabilities,Dynamics,Dynamique,Géométrie différentielle,Geometry; Differential,Mathematical optimization,Optimisation mathématique,Probabilités,Problèmes de transport (Programmation),Transportation problems (Programming)},
author = {Villani, Cédric},
file = {/home/dimitri/Nextcloud/Zotero/storage/XMWCC335/Villani - 2009 - Optimal transport old and new.pdf},
note = {OCLC: ocn244421231}
}

View file

@@ -0,0 +1,98 @@
---
title: "Reading notes: Hierarchical Optimal Transport for Document Representation"
date: 2020-04-05
---
Two weeks ago, I gave a presentation to my colleagues on the paper
by cite:yurochkin2019_hierar_optim_trans_docum_repres, from
NeurIPS 2019. It contains an interesting approach to document
classification leading to strong performance, and, most importantly,
excellent interpretability.
This paper seems interesting to me because it uses two methods with
strong theoretical guarantees: optimal transport and topic
modelling. Optimal transport looks very promising to me in NLP, and
has seen a lot of interest in recent years due to advances in
approximation algorithms, such as entropy regularisation. It is also
quite refreshing to see approaches using solid results in
optimisation, compared to purely experimental deep learning methods.
* Introduction and motivation
The problem addressed by the paper is to measure similarity (i.e. a distance)
between pairs of documents, by incorporating /semantic/ similarities
(and not only syntactic artefacts), without encountering scalability
issues.
They propose a "meta-distance" between documents, called the
hierarchical optimal topic transport (HOTT), providing a scalable
metric incorporating topic information between documents. As such,
they try to combine two different levels of analysis:
- word embeddings data, to embed language knowledge (via pre-trained
embeddings for instance),
- topic modelling methods (e.g. [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]]), to
represent semantically-meaningful groups of words.
* Background: optimal transport
The essential backbone of the method is the Wasserstein distance,
derived from optimal transport theory. Optimal transport is a
fascinating and deep subject, so I won't go into the details
here. For an introduction to the theory and its applications, check
out the excellent book by
cite:peyreComputationalOptimalTransport2019 ([[https://arxiv.org/abs/1803.00567][available on ArXiv]] as
well). There are also [[https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr][very nice posts]] by Gabriel Peyré on the CNRS
maths blog (in French). Many more resources (including slides for
presentations) are available at
[[https://optimaltransport.github.io]]. For a more complete theoretical
treatment of the subject, check out
cite:santambrogioOptimalTransportApplied2015, or, if you're feeling
particularly adventurous, cite:villaniOptimalTransportOld2009.
For this paper, only a superficial understanding of how the
[[https://en.wikipedia.org/wiki/Wasserstein_metric][Wasserstein distance]] works is necessary. Optimal transport is an
optimisation technique to lift a distance between points in a given
metric space, to a distance between probability /distributions/ over
this metric space. The historical example is to move piles of dirt
around: you know the distance between any two points, and you have
piles of dirt lying around[fn:historical_ot]. Now, if you want to move these piles to
another configuration (fewer piles, say, or a different arrangement of
dirt a few metres away), you need to find the most efficient way to
move them. The total cost you obtain will define a distance between
the two configurations of dirt, and is usually called the [[https://en.wikipedia.org/wiki/Earth_mover%27s_distance][earth
mover's distance]], which is just an instance of the general Wasserstein
metric.
[fn:historical_ot] Optimal transport originated with Monge, and then
Kantorovich, both of whom had very clear military applications in mind
(either in Revolutionary France, or during WWII). A lot of historical
examples move cannon balls, or other military equipment, along a front
line.
More formally, if we have two sets of points $x = (x_1, x_2, \ldots,
x_n)$, and $y = (y_1, y_2, \ldots, y_m)$, along with probability distributions $p \in \Delta^n$, $q \in \Delta^m$ over $x$ and $y$ ($\Delta^n$ is the probability simplex of dimension $n$, i.e. the set of vectors of size $n$ summing to 1), we can define the Wasserstein distance as
\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}\\
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
\]
where $C_{i,j} = d(x_i, y_j)$ are the costs computed from the original distance between points, and $P_{i,j}$ represents the amount we are moving from pile $i$ to pile $j$.
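To make this linear program concrete, here is a minimal sketch (my own illustration, not code from the paper) of how $W_1$ could be computed for two small discrete distributions by solving the LP above with SciPy; the point sets and weights are toy values chosen for the example.
#+begin_src python
# Minimal W_1 computation by solving the linear program above directly.
# Toy data: three piles of dirt on a line, moved onto two target piles.
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

x = np.array([[0.0], [1.0], [3.0]])   # n = 3 source points
y = np.array([[0.5], [2.5]])          # m = 2 target points
p = np.array([0.4, 0.4, 0.2])         # distribution over x (sums to 1)
q = np.array([0.6, 0.4])              # distribution over y (sums to 1)

n, m = len(p), len(q)
C = cdist(x, y)   # cost matrix, C[i, j] = d(x_i, y_j)
c = C.ravel()     # objective: sum_{i,j} C_{i,j} P_{i,j}, with P flattened row-major

# Marginal constraints: rows of P sum to p, columns of P sum to q.
A_rows = np.zeros((n, n * m))
for i in range(n):
    A_rows[i, i * m:(i + 1) * m] = 1.0
A_cols = np.zeros((m, n * m))
for j in range(m):
    A_cols[j, j::m] = 1.0

res = linprog(c,
              A_eq=np.vstack([A_rows, A_cols]),
              b_eq=np.concatenate([p, q]),
              bounds=(0, None),        # P_{i,j} >= 0
              method="highs")
print("W_1(p, q) =", res.fun)
#+end_src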
Now, how can this be applied to a natural language setting? Once we
have word embeddings, we can consider that the vocabulary forms a
metric space (we can compute a distance, for instance the Euclidean or
the [[https://en.wikipedia.org/wiki/Cosine_similarity][cosine distance]], between two word embeddings). The key is to
define documents as /distributions/ over words.
Given a vocabulary $V \subset \mathbb{R}^n$ and a corpus $D = (d^1, d^2, \ldots, d^{\lvert D \rvert})$, we represent a document as $d^i \in \Delta^{l_i}$ where $l_i$ is the number of unique words in $d^i$, and $d^i_j$ is the proportion of word $v_j$ in the document $d^i$.
The word mover's distance (WMD) is then defined simply as
\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]
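To illustrate the definition (again a sketch of mine, not the paper's implementation), here is how WMD could be computed on toy data, assuming the POT library (~ot.emd2~) is available. The vocabulary and the embedding vectors are invented for the example; in practice the embeddings would come from a pre-trained model such as GloVe or word2vec.
#+begin_src python
# Toy word mover's distance: documents become nBOW distributions over
# their unique words, costs come from distances between word embeddings.
from collections import Counter
import numpy as np
import ot  # POT: Python Optimal Transport
from scipy.spatial.distance import cdist

# Made-up 2-dimensional "embeddings" for a tiny vocabulary.
embeddings = {
    "president": np.array([1.0, 0.2]),
    "speaks":    np.array([0.1, 1.0]),
    "media":     np.array([0.9, 0.4]),
    "press":     np.array([0.85, 0.45]),
    "greets":    np.array([0.15, 0.95]),
}

def nbow(tokens):
    """Normalised bag of words over the document's unique tokens."""
    counts = Counter(tokens)
    words = sorted(counts)
    weights = np.array([counts[w] for w in words], dtype=float)
    return words, weights / weights.sum()

def wmd(doc1, doc2):
    words1, p = nbow(doc1)
    words2, q = nbow(doc2)
    X1 = np.stack([embeddings[w] for w in words1])
    X2 = np.stack([embeddings[w] for w in words2])
    C = cdist(X1, X2)        # Euclidean distances between embeddings
    return ot.emd2(p, q, C)  # W_1 between the two nBOW distributions

print(wmd(["president", "speaks", "media"], ["press", "greets"]))
#+end_src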
If you didn't follow all of this, don't worry! The gist is: if you
have a distance between points, you can solve an optimisation problem
to obtain a distance between /distributions/ over these points! This
is especially useful when you consider that each word embedding is a
point, and a document is just a set of words, along with the number of
times they appear.
* References