Add description of the method

This commit is contained in:
Dimitri Lozeve 2020-04-05 15:36:42 +02:00
parent b033a5c26b
commit 3524466d4c
6 changed files with 151 additions and 8 deletions


@ -35,12 +35,39 @@
<p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space, to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different arrangement of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover's distance</a>, which is just an instance of the general Wasserstein metric.</p>
<p>More formally, if we have two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_m)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1), we can define the Wasserstein distance as <span class="math display">\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
\]</span> <span class="math display">\[
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
\]</span> where <span class="math inline">\(C_{i,j} = d(x_i, y_j)\)</span> are the costs computed from the original distance between points, and <span class="math inline">\(P_{i,j}\)</span> represents the amount we are moving from pile <span class="math inline">\(i\)</span> to pile <span class="math inline">\(j\)</span>.</p>
<p>Now, how can this be applied to a natural language setting? Once we have word embeddings, we can consider that the vocabulary forms a metric space (we can compute a distance, for instance the Euclidean or the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a>, between two word embeddings). The key is to define documents as <em>distributions</em> over words.</p>
<p>Given a vocabulary <span class="math inline">\(V \subset \mathbb{R}^n\)</span> and a corpus <span class="math inline">\(D = (d^1, d^2, \ldots, d^{\lvert D \rvert})\)</span>, we represent a document as <span class="math inline">\(d^i \in \Delta^{l_i}\)</span> where <span class="math inline">\(l_i\)</span> is the number of unique words in <span class="math inline">\(d^i\)</span>, and <span class="math inline">\(d^i_j\)</span> is the proportion of word <span class="math inline">\(v_j\)</span> in the document <span class="math inline">\(d^i\)</span>. The word mover's distance (WMD) is then defined simply as <span class="math display">\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]</span></p>
<p>If you didn't follow all of this, don't worry! The gist is: if you have a distance between points, you can solve an optimisation problem to obtain a distance between <em>distributions</em> over these points! This is especially useful when you consider that each word embedding is a point, and a document is just a set of words, along with the number of times they appear.</p>
<h1 id="hierarchical-optimal-transport">Hierarchical optimal transport</h1>
<p>Using optimal transport, we can use the word mover's distance to define a metric between documents. However, this suffers from two drawbacks:</p>
<ul>
<li>Documents represented as distributions over words are not easily interpretable. For long documents, the vocabulary is huge and word frequencies are not easily understandable for humans.</li>
<li>Large vocabularies mean that the space on which we have to find an optimal matching is huge. The <a href="https://en.wikipedia.org/wiki/Hungarian_algorithm">Hungarian algorithm</a> used to compute the optimal transport distance runs in <span class="math inline">\(O(l^3 \log l)\)</span>, where <span class="math inline">\(l\)</span> is the maximum number of unique words in each document. This quickly becomes intractable as documents become larger, or if you have to compute all pairwise distances between a large number of documents (e.g. for clustering purposes).</li>
</ul>
<p>To sidestep these issues, we add an intermediate step using <a href="https://en.wikipedia.org/wiki/Topic_model">topic modelling</a>. Once we have topics <span class="math inline">\(T = (t_1, t_2, \ldots, t_{\lvert T
\rvert}) \subset \Delta^{\lvert V \rvert}\)</span>, we get two kinds of representations:</p>
<ul>
<li>representations of topics as distributions over words,</li>
<li>representations of documents as distributions over topics <span class="math inline">\(\bar{d^i} \in \Delta^{\lvert T \rvert}\)</span>.</li>
</ul>
<p>Since they are distributions over words, the word mover's distance defines a metric over topics. As such, the set of topics equipped with the WMD becomes a metric space.</p>
<p>We can now define the hierarchical optimal topic transport (HOTT) as the optimal transport distance between documents, represented as distributions over topics. For two documents <span class="math inline">\(d^1\)</span>, <span class="math inline">\(d^2\)</span>, <span class="math display">\[
\operatorname{HOTT}(d^1, d^2) = W_1\left( \sum_{k=1}^{\lvert T \rvert} \bar{d^1_k} \delta_{t_k}, \sum_{k=1}^{\lvert T \rvert} \bar{d^2_k} \delta_{t_k} \right),
\]</span> where <span class="math inline">\(\delta_{t_k}\)</span> is a distribution supported on topic <span class="math inline">\(t_k\)</span>.</p>
<p>Note that in this case, we used optimal transport <em>twice</em>:</p>
<ul>
<li>once to find distances between topics (WMD),</li>
<li>once to find distances between documents, where the distances between topics become the costs in the new optimal transport problem.</li>
</ul>
<p>The first one can be precomputed once for all subsequent distances, so its cost does not grow with the number of documents we have to process. The second one only operates on <span class="math inline">\(\lvert T \rvert\)</span> topics instead of the full vocabulary: the resulting optimisation problem is much smaller! This is great for performance, as it becomes feasible to compute all pairwise distances in a large set of documents.</p>
<p>Another interesting insight is that topics are represented as collections of words (we can keep the top 20 as a visual representation), and documents as collections of topics with weights. Both of these representations are highly interpretable for a human being who wants to understand what's going on. I think this is one of the strongest aspects of these approaches: both the various representations and the algorithms are fully interpretable. Compared to a deep learning approach, we can make sense of every intermediate step, from the representations of topics to the weights in the optimisation algorithm to compute higher-level distances.</p>
<figure>
<img src="/images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
</figure>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-peyreComputationalOptimalTransport2019">

BIN  _site/images/hott_fig1.png (new file, 667 KiB)


BIN  images/hott_fig1.png (new file, 667 KiB)


@ -73,7 +73,9 @@ line.
More formally, if we have two sets of points $x = (x_1, x_2, \ldots,
x_n)$, and $y = (y_1, y_2, \ldots, y_m)$, along with probability distributions $p \in \Delta^n$, $q \in \Delta^m$ over $x$ and $y$ ($\Delta^n$ is the probability simplex of dimension $n$, i.e. the set of vectors of size $n$ summing to 1), we can define the Wasserstein distance as
\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
\]
\[
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
\]
where $C_{i,j} = d(x_i, y_j)$ are the costs computed from the original distance between points, and $P_{i,j}$ represents the amount we are moving from pile $i$ to pile $j$.
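
To make this linear programme concrete, here is a minimal illustrative sketch (not from the paper) that solves it directly with scipy's generic LP solver; a dedicated optimal transport library would be much faster, but this stays close to the formula above.

#+begin_src python
# Minimal sketch: W_1(p, q) as the linear programme above, solved with a
# generic LP solver. For real workloads, use a dedicated OT library.
import numpy as np
from scipy.optimize import linprog


def wasserstein_1(p, q, C):
    """W_1 between histograms p (size n) and q (size m), with cost matrix C (n x m)."""
    n, m = C.shape
    A_eq = []
    for i in range(n):            # sum_j P_{i,j} = p_i
        row = np.zeros((n, m))
        row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):            # sum_i P_{i,j} = q_j
        col = np.zeros((n, m))
        col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun


# Toy example: three points on a line, all mass shifted one step to the right.
x = np.array([0.0, 1.0, 2.0])
C = np.abs(x[:, None] - x[None, :])   # pairwise ground distances (source and target share the same points)
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
print(wasserstein_1(p, q, C))         # ~ 1.0
#+end_src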
@ -95,4 +97,64 @@ is especially useful when you consider that each word embedding is a
point, and a document is just a set of words, along with the number of
times they appear.
* Hierarchical optimal transport
Using optimal transport, we can use the word mover's distance to
define a metric between documents. However, this suffers from two
drawbacks:
- Documents represented as distributions over words are not easily
interpretable. For long documents, the vocabulary is huge and word
frequencies are not easily understandable for humans.
- Large vocabularies mean that the space on which we have to find an
optimal matching is huge. The [[https://en.wikipedia.org/wiki/Hungarian_algorithm][Hungarian algorithm]] used to compute
the optimal transport distance runs in $O(l^3 \log l)$, where $l$ is
the maximum number of unique words in each document. This quickly
becomes intractable as documents become larger, or if
you have to compute all pairwise distances between a large number of
documents (e.g. for clustering purposes).
To sidestep these issues, we add an intermediate step using [[https://en.wikipedia.org/wiki/Topic_model][topic
modelling]]. Once we have topics $T = (t_1, t_2, \ldots, t_{\lvert T
\rvert}) \subset \Delta^{\lvert V \rvert}$, we get two kinds of
representations (sketched in code after this list):
- representations of topics as distributions over words,
- representations of documents as distributions over topics $\bar{d^i} \in \Delta^{\lvert T \rvert}$.
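
As a rough sketch of this step (the tiny corpus, the choice of LDA and the parameters below are placeholders for illustration, not the setup of the paper), one way to obtain both representations is scikit-learn's LDA:

#+begin_src python
# Hypothetical sketch: topic--word and document--topic distributions via
# LDA (scikit-learn). Corpus and parameters are placeholders only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stocks fell as markets reacted",
        "investors bought shares and bonds"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)         # each row: a document as a distribution over topics
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
# topic_words[k] is topic t_k as a distribution over the vocabulary V;
# doc_topics[i] is the representation of document i in the topic simplex.
#+end_src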
Since they are distributions over words, the word mover's distance
defines a metric over topics. As such, the set of topics equipped with
the WMD becomes a metric space.
We can now define the hierarchical optimal topic transport (HOTT) as the optimal transport distance between documents, represented as distributions over topics. For two documents $d^1$, $d^2$,
\[
\operatorname{HOTT}(d^1, d^2) = W_1\left( \sum_{k=1}^{\lvert T \rvert} \bar{d^1_k} \delta_{t_k}, \sum_{k=1}^{\lvert T \rvert} \bar{d^2_k} \delta_{t_k} \right),
\]
where $\delta_{t_k}$ is a distribution supported on topic $t_k$.
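
Continuing the hypothetical sketches above (reusing =wasserstein_1=, =topic_words= and =doc_topics=, with random vectors standing in for real word embeddings such as word2vec or GloVe), HOTT amounts to two nested transport problems:

#+begin_src python
# Hypothetical sketch of HOTT, reusing wasserstein_1, topic_words and
# doc_topics from the sketches above. Random vectors replace real word
# embeddings only to keep the example runnable.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(X.shape[1], 50))   # placeholder word vectors, one per vocabulary entry

def topic_distances(topic_words, embeddings):
    # First level: WMD between every pair of topics. It depends only on
    # the topics, so it is precomputed once for the whole corpus.
    # (In practice one would likely truncate each topic to its most
    # probable words to keep these transport problems small.)
    word_cost = np.linalg.norm(
        embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    n_topics = topic_words.shape[0]
    topic_cost = np.zeros((n_topics, n_topics))
    for a in range(n_topics):
        for b in range(a + 1, n_topics):
            d = wasserstein_1(topic_words[a], topic_words[b], word_cost)
            topic_cost[a, b] = topic_cost[b, a] = d
    return topic_cost

def hott(bar_d1, bar_d2, topic_cost):
    # Second level: optimal transport between two documents seen as
    # distributions over topics, with the topic distances as costs.
    return wasserstein_1(bar_d1, bar_d2, topic_cost)

# Usage: precompute the topic distances once, then compare documents cheaply.
topic_cost = topic_distances(topic_words, embeddings)
print(hott(doc_topics[0], doc_topics[1], topic_cost))
#+end_src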
Note that in this case, we used optimal transport /twice/:
- once to find distances between topics (WMD),
- once to find distances between documents, where the distances between
topics become the costs in the new optimal transport
problem.
The first one can be precomputed once for all subsequent distances, so
its cost does not grow with the number of documents we have to
process. The second one only operates on $\lvert T \rvert$ topics
instead of the full vocabulary: the resulting optimisation problem is
much smaller! This is great for performance, as it becomes feasible to
compute all pairwise distances in a large set of documents.
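
As a rough, illustrative order of magnitude (the sizes below are made up and constant factors are ignored):

#+begin_src python
# Back-of-envelope comparison of the l^3 log l cost per document pair:
# a hypothetical 5000 unique words per document versus 100 topics.
import math

def cost(n):
    return n**3 * math.log(n)

print(cost(5000) / cost(100))   # roughly 2e5: about five orders of magnitude smaller
#+end_src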
Another interesting insight is that topics are represented as
collections of words (we can keep the top 20 as a visual
representation), and documents as collections of topics with
weights. Both of these representations are highly interpretable for a
human being who wants to understand what's going on. I think this is
one of the strongest aspects of these approaches: both the various
representations and the algorithms are fully interpretable. Compared
to a deep learning approach, we can make sense of every intermediate
step, from the representations of topics to the weights in the
optimisation algorithm to compute higher-level distances.
#+caption: Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books citep:yurochkin2019_hierar_optim_trans_docum_repres.
[[file:/images/hott_fig1.png]]
* References