Add description of the method
This commit is contained in:
parent b033a5c26b
commit 3524466d4c
6 changed files with 151 additions and 8 deletions
<p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different arrangement of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain defines a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover’s distance</a>, which is just an instance of the general Wasserstein metric.</p>
<p>More formally, if we have two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots, x_n)\)</span> and <span class="math inline">\(y = (y_1, y_2, \ldots, y_m)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of non-negative vectors of size <span class="math inline">\(n\)</span> summing to 1), we can define the Wasserstein distance as <span class="math display">\[
W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
\]</span> <span class="math display">\[
\text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
\]</span> where <span class="math inline">\(C_{i,j} = d(x_i, y_j)\)</span> are the costs computed from the original distance between points, and <span class="math inline">\(P_{i,j}\)</span> represents the amount we are moving from pile <span class="math inline">\(i\)</span> to pile <span class="math inline">\(j\)</span>.</p>
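<p>To make the optimisation problem concrete, here is a minimal sketch (not from the paper) that solves the linear program above with <code>scipy.optimize.linprog</code>; the function name <code>wasserstein1</code> and the toy histograms are made up for the example.</p>
<pre><code>
import numpy as np
from scipy.optimize import linprog

def wasserstein1(p, q, C):
    """Solve min_P sum_ij C_ij P_ij subject to row sums p and column sums q."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j P_ij = p_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i P_ij = q_j
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Two configurations of dirt over two locations, at distance 1 from each other.
p = np.array([0.5, 0.5])
q = np.array([0.25, 0.75])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
print(wasserstein1(p, q, C))  # 0.25: we move a quarter of the dirt
</code></pre>
<p>Dedicated optimal transport solvers are much faster in practice, but this is exactly the problem they solve.</p>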
<p>Now, how can this be applied to a natural language setting? Once we have word embeddings, we can consider that the vocabulary forms a metric space (we can compute a distance between two word embeddings, for instance the Euclidean or the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a>). The key is to define documents as <em>distributions</em> over words.</p>
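<p>As a small illustration of such a ground distance, here is the cosine distance between two made-up embedding vectors:</p>
<pre><code>
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine similarity of two embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([0.2, 0.7, 0.1])   # made-up embedding for one word
v = np.array([0.25, 0.6, 0.2])  # made-up embedding for a similar word
print(cosine_distance(u, v))    # small value: the words are close
</code></pre>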
<p>Given a vocabulary <span class="math inline">\(V \subset \mathbb{R}^n\)</span> and a corpus <span class="math inline">\(D = (d^1, d^2, \ldots, d^{\lvert D \rvert})\)</span>, we represent a document as <span class="math inline">\(d^i \in \Delta^{l_i}\)</span> where <span class="math inline">\(l_i\)</span> is the number of unique words in <span class="math inline">\(d^i\)</span>, and <span class="math inline">\(d^i_j\)</span> is the proportion of word <span class="math inline">\(v_j\)</span> in the document <span class="math inline">\(d^i\)</span>. The word mover’s distance (WMD) is then defined simply as <span class="math display">\[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]</span></p>
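<p>Under these definitions, a hypothetical implementation of the WMD could look like the following sketch, which reuses the <code>wasserstein1</code> function from above and assumes <code>embeddings</code> is a mapping from words to vectors:</p>
<pre><code>
import numpy as np
from collections import Counter

def doc_to_histogram(tokens):
    """Unique words of a document and their relative frequencies (the d^i above)."""
    counts = Counter(tokens)
    words = sorted(counts)
    freqs = np.array([counts[w] for w in words], dtype=float)
    return words, freqs / freqs.sum()

def wmd(tokens1, tokens2, embeddings):
    words1, p = doc_to_histogram(tokens1)
    words2, q = doc_to_histogram(tokens2)
    # Cost matrix: ground distance between every pair of word embeddings.
    C = np.array([[np.linalg.norm(embeddings[w1] - embeddings[w2])
                   for w2 in words2] for w1 in words1])
    return wasserstein1(p, q, C)
</code></pre>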
<p>If you didn’t follow all of this, don’t worry! The gist is: if you have a distance between points, you can solve an optimisation problem to obtain a distance between <em>distributions</em> over these points! This is especially useful when you consider that each word embedding is a point, and a document is just a set of words, along with the number of times they appear.</p>
<h1 id="hierarchical-optimal-transport">Hierarchical optimal transport</h1>
<p>Using optimal transport, the word mover’s distance defines a metric between documents. However, this suffers from two drawbacks:</p>
<ul>
<li>Documents represented as distributions over words are not easily interpretable. For long documents, the vocabulary is huge and word frequencies are not easily understandable for humans.</li>
<li>Large vocabularies mean that the space on which we have to find an optimal matching is huge. The <a href="https://en.wikipedia.org/wiki/Hungarian_algorithm">Hungarian algorithm</a> used to compute the optimal transport distance runs in <span class="math inline">\(O(l^3 \log l)\)</span>, where <span class="math inline">\(l\)</span> is the maximum number of unique words in each document. This quickly becomes intractable as documents become larger, or if you have to compute all pairwise distances between a large number of documents (e.g. for clustering purposes).</li>
</ul>
<p>To escape these issues, we will add an intermediary step using <a href="https://en.wikipedia.org/wiki/Topic_model">topic modelling</a>. Once we have topics <span class="math inline">\(T = (t_1, t_2, \ldots, t_{\lvert T
\rvert}) \subset \Delta^{\lvert V \rvert}\)</span>, we get two kinds of representations:</p>
<ul>
<li>representations of topics as distributions over words,</li>
<li>representations of documents as distributions over topics <span class="math inline">\(\bar{d^i} \in \Delta^{\lvert T \rvert}\)</span>.</li>
</ul>
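<p>Any topic model providing these two matrices will do; as an illustration (with a made-up corpus and arbitrary parameters), here is one way to obtain both representations with scikit-learn’s LDA:</p>
<pre><code>
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["the cat sat on the mat",
          "dogs and cats are common pets",
          "stocks fell sharply on monday"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)          # document-term counts
vocabulary = vectorizer.get_feature_names_out()    # the words, in column order

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)             # documents as distributions over topics
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
# topic_words[k] is topic t_k as a distribution over the vocabulary
</code></pre>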
<p>Since topics are distributions over words, the word mover’s distance defines a metric over them. As such, the set of topics equipped with the WMD becomes a metric space.</p>
<p>We can now define the hierarchical optimal topic transport (HOTT) as the optimal transport distance between documents, represented as distributions over topics. For two documents <span class="math inline">\(d^1\)</span>, <span class="math inline">\(d^2\)</span>, <span class="math display">\[
\operatorname{HOTT}(d^1, d^2) = W_1\left( \sum_{k=1}^{\lvert T \rvert} \bar{d^1_k} \delta_{t_k}, \sum_{k=1}^{\lvert T \rvert} \bar{d^2_k} \delta_{t_k} \right),
\]</span> where <span class="math inline">\(\delta_{t_k}\)</span> is a Dirac distribution supported on topic <span class="math inline">\(t_k\)</span>.</p>
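<p>In code, this is just the earlier <code>wasserstein1</code> sketch applied one level up. The sketch below is hypothetical: it assumes <code>word_cost</code> is a word-to-word distance matrix over the vocabulary, and that <code>topic_words</code> and <code>doc_topics</code> come from the topic-model sketch above.</p>
<pre><code>
import numpy as np

def topic_distances(topic_words, word_cost):
    """WMD between every pair of topics (each topic is a distribution over words)."""
    K = topic_words.shape[0]
    D = np.zeros((K, K))
    for a in range(K):
        for b in range(a + 1, K):
            D[a, b] = D[b, a] = wasserstein1(topic_words[a], topic_words[b], word_cost)
    return D

def hott(bar_d1, bar_d2, topic_cost):
    # Documents are now histograms over topics, and the topic-to-topic WMDs
    # play the role of the cost matrix C in the outer transport problem.
    return wasserstein1(bar_d1, bar_d2, topic_cost)
</code></pre>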
<p>Note that in this case, we used optimal transport <em>twice</em>:</p>
<ul>
<li>once to find distances between topics (WMD),</li>
<li>once to find distances between documents, where the distances between topics become the costs in the new optimal transport problem.</li>
</ul>
<p>The first one can be precomputed once and reused for all subsequent distances, so its cost does not grow with the number of documents we have to process. The second one only operates on <span class="math inline">\(\lvert T \rvert\)</span> topics instead of the full vocabulary: the resulting optimisation problem is much smaller! This is great for performance, as it now becomes feasible to compute all pairwise distances in a large set of documents.</p>
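<p>A sketch of how this plays out for a whole corpus, reusing the hypothetical helpers above: the topic-to-topic distances are computed once, and each document pair then only requires a small <span class="math inline">\(\lvert T \rvert \times \lvert T \rvert\)</span> problem.</p>
<pre><code>
# topic_words and doc_topics come from the LDA sketch above; word_cost is an
# assumed word-to-word distance matrix (e.g. built with cosine_distance).
topic_cost = topic_distances(topic_words, word_cost)   # computed once for the corpus

n_docs = doc_topics.shape[0]
pairwise = np.zeros((n_docs, n_docs))
for a in range(n_docs):
    for b in range(a + 1, n_docs):
        # each call solves a small problem over the topics, not the vocabulary
        pairwise[a, b] = pairwise[b, a] = hott(doc_topics[a], doc_topics[b], topic_cost)
</code></pre>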
<p>Another interesting insight is that topics are represented as collections of words (we can keep the top 20 as a visual representation), and documents as collections of topics with weights. Both of these representations are highly interpretable for a human being who wants to understand what’s going on. I think this is one of the strongest aspects of these approaches: both the various representations and the algorithms are fully interpretable. Compared to a deep learning approach, we can make sense of every intermediate step, from the representations of topics to the weights in the optimisation algorithm used to compute higher-level distances.</p>
<figure>
<img src="/images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
</figure>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-peyreComputationalOptimalTransport2019">
BIN _site/images/hott_fig1.png (new file, binary file not shown; 667 KiB)
BIN images/hott_fig1.png (new file, binary file not shown; 667 KiB)