
For this paper, only a superficial understanding of how the Wasserstein distance works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space to a distance between probability distributions over this metric space. The historical example is moving piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around. Now, if you want to move these piles to another configuration (fewer piles, say, or a different arrangement of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain defines a distance between the two configurations of dirt, and is usually called the earth mover’s distance, which is just an instance of the general Wasserstein metric.

More formally, if we have two sets of points \(x = (x_1, x_2, \ldots, x_n)\) and \(y = (y_1, y_2, \ldots, y_m)\), along with probability distributions \(p \in \Delta^n\), \(q \in \Delta^m\) over \(x\) and \(y\) (\(\Delta^n\) is the probability simplex of dimension \(n\), i.e. the set of vectors of size \(n\) summing to 1), we can define the Wasserstein distance as \[ W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j} \] \[ \text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j, \] where \(C_{i,j} = d(x_i, y_j)\) are the costs computed from the original distance between points, and \(P_{i,j}\) represents the amount we are moving from pile \(i\) to pile \(j\).
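To make this concrete, here is a minimal sketch of \(W_1\) as a linear program, solved with SciPy. The function name wasserstein_1 and the dense constraint matrix are illustrative choices of mine, not part of the paper; a real implementation would rather call a dedicated optimal transport solver.

```python
import numpy as np
from scipy.optimize import linprog


def wasserstein_1(p, q, C):
    """Optimal transport LP: minimise <C, P> subject to the two marginal constraints.

    p: histogram of size n, q: histogram of size m, C: (n, m) cost matrix.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    C = np.asarray(C, dtype=float)
    n, m = C.shape
    # The transport plan P is flattened row-major into n*m variables.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # row marginal: sum_j P[i, j] = p[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # column marginal: sum_i P[i, j] = q[j]
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun  # the optimal transport cost W_1(p, q)
```

For instance, wasserstein_1([0.5, 0.5], [1.0], [[0.0], [2.0]]) moves half of the mass over a distance of 2 and returns 1.0.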

Now, how can this be applied to a natural language setting? Once we have word embeddings, we can consider that the vocabulary forms a metric space (we can compute a distance, for instance the Euclidean or the cosine distance, between two word embeddings). The key is to define documents as distributions over words.
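For instance, a cosine ground distance between two embedding vectors is a one-liner (a NumPy sketch; using the Euclidean norm between the vectors works just as well):

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two word embeddings (1 minus their cosine similarity)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```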

Given a vocabulary \(V \subset \mathbb{R}^n\) and a corpus \(D = (d^1, d^2, \ldots, d^{\lvert D \rvert})\), we represent a document as \(d^i \in \Delta^{l_i}\) where \(l_i\) is the number of unique words in \(d^i\), and \(d^i_j\) is the proportion of word \(v_j\) in the document \(d^i\). The word mover’s distance (WMD) is then defined simply as \[ \operatorname{WMD}(d^1, d^2) = W_1(d^1, d^2). \]
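As a sketch only, reusing the wasserstein_1 helper from above: the tokenisation, the embeddings dictionary and the Euclidean ground metric below are my assumptions for illustration, not details prescribed by the paper.

```python
from collections import Counter

import numpy as np


def wmd(tokens1, tokens2, embeddings):
    """Word mover's distance between two tokenised documents.

    embeddings: dict mapping a word to its vector (a NumPy array);
    out-of-vocabulary words are simply dropped.
    """
    words1 = [w for w in tokens1 if w in embeddings]
    words2 = [w for w in tokens2 if w in embeddings]
    vocab1, vocab2 = sorted(set(words1)), sorted(set(words2))
    # Documents as points on the simplex: normalised frequencies of unique words.
    c1, c2 = Counter(words1), Counter(words2)
    p = np.array([c1[w] / len(words1) for w in vocab1])
    q = np.array([c2[w] / len(words2) for w in vocab2])
    # Ground costs: Euclidean distances between the embeddings of every pair of unique words.
    C = np.array([[np.linalg.norm(embeddings[u] - embeddings[v]) for v in vocab2] for u in vocab1])
    return wasserstein_1(p, q, C)
```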

If you didn’t follow all of this, don’t worry! The gist is: if you have a distance between points, you can solve an optimisation problem to obtain a distance between distributions over these points! This is especially useful when you consider that each word embedding is a point, and a document is just a set of words, along with the number of times they appear.


Hierarchical optimal transport


With optimal transport, the word mover’s distance thus gives us a metric between documents. However, this approach suffers from two drawbacks:

• Documents represented as distributions over words are not easily interpretable. For long documents, the vocabulary is huge and raw word frequencies are hard for humans to make sense of.
• Large vocabularies mean that the space on which we have to find an optimal matching is huge. The Hungarian algorithm used to compute the optimal transport distance runs in \(O(l^3 \log l)\), where \(l\) is the maximum number of unique words in each document. This quickly becomes intractable as documents grow larger, or if you have to compute all pairwise distances between a large number of documents (e.g. for clustering purposes).

To escape these issues, we will add an intermediary step using topic modelling. Once we have topics \(T = (t_1, t_2, \ldots, t_{\lvert T \rvert}) \subset \Delta^{\lvert V \rvert}\), we get two kinds of representations:

• representations of topics as distributions over words,
• representations of documents as distributions over topics \(\bar{d^i} \in \Delta^{\lvert T \rvert}\) (one way to obtain both is sketched below).
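For illustration only, both representations can be obtained from a standard topic model. The choice of LDA through scikit-learn, the toy corpus and the number of topics below are my assumptions, not requirements of the HOTT approach.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply on monday",
    "investors worry about rising interest rates",
]  # toy documents, purely illustrative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Topics as distributions over words: normalise the topic-word pseudo-counts.
topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # shape |T| x |V|

# Documents as distributions over topics (each row sums to 1).
doc_topics = lda.transform(X)  # shape |D| x |T|
```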

Since topics are themselves distributions over words, the word mover’s distance defines a metric over them. As such, the set of topics equipped with the WMD becomes a metric space.


We can now define the hierarchical optimal topic transport (HOTT) as the optimal transport distance between documents represented as distributions over topics. For two documents \(d^1\), \(d^2\), \[ \operatorname{HOTT}(d^1, d^2) = W_1\left( \sum_{k=1}^{\lvert T \rvert} \bar{d^1_k} \delta_{t_k}, \sum_{k=1}^{\lvert T \rvert} \bar{d^2_k} \delta_{t_k} \right), \] where \(\delta_{t_k}\) is a distribution supported on topic \(t_k\).
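As a small sketch, and assuming a precomputed matrix of topic-to-topic WMDs (see the next code block), the document-level distance is just one more call to the Wasserstein solver from earlier; the function names here are hypothetical.

```python
import numpy as np


def hott(doc_topics_1, doc_topics_2, topic_costs):
    """HOTT between two documents given as distributions over topics.

    topic_costs[a, b] is the precomputed WMD between topics a and b; it plays
    the role of the cost matrix C in the W_1 problem above.
    """
    return wasserstein_1(np.asarray(doc_topics_1), np.asarray(doc_topics_2), np.asarray(topic_costs))
```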


Note that in this case, we used optimal transport twice:

• once to find distances between topics (the WMD),
• once to find distances between documents, where the distances between topics become the costs in the new optimal transport problem.

The first one can be precomputed once and reused for all subsequent distances, so its cost does not grow with the number of documents we have to process. The second one only operates on \(\lvert T \rvert\) topics instead of the full vocabulary: the resulting optimisation problem is much smaller! This is great for performance, as it now becomes tractable to compute all pairwise distances in a large set of documents.
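Here is an illustrative sketch of that precomputation, reusing wasserstein_1 from above. Truncating each topic to its top_k most probable words and using a Euclidean ground metric are simplifying assumptions on my part, made to keep the topic-level problems small, not a faithful reproduction of the paper’s implementation.

```python
import numpy as np


def topic_cost_matrix(topics, word_vectors, top_k=20):
    """Pairwise WMD between topics, each truncated to its top_k most probable words.

    topics: array of shape (|T|, |V|), rows are topic-word distributions.
    word_vectors: array of shape (|V|, d), row i is the embedding of word i.
    """
    n_topics = len(topics)
    costs = np.zeros((n_topics, n_topics))
    for a in range(n_topics):
        for b in range(a + 1, n_topics):
            ia = np.argsort(topics[a])[-top_k:]      # indices of the top words of topic a
            ib = np.argsort(topics[b])[-top_k:]      # indices of the top words of topic b
            p = topics[a][ia] / topics[a][ia].sum()  # renormalised truncated topics
            q = topics[b][ib] / topics[b][ib].sum()
            # Euclidean costs between the embeddings of the selected words.
            C = np.linalg.norm(word_vectors[ia][:, None, :] - word_vectors[ib][None, :, :], axis=-1)
            costs[a, b] = costs[b, a] = wasserstein_1(p, q, C)
    return costs


# The matrix is computed once; every document pair then only needs a |T| x |T| problem:
# distance = hott(doc_topics[i], doc_topics[j], costs)
```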


Another interesting insight is that topics are represented as collections of words (we can keep the top 20 as a visual representation), and documents as collections of topics with weights. Both of these representations are highly interpretable for a human being who wants to understand what’s going on. I think this is one of the strongest aspects of these approaches: both the various representations and the algorithms are fully interpretable. Compared to a deep learning approach, we can make sense of every intermediate step, from the representation of topics to the weights in the optimisation algorithm used to compute higher-level distances.

Figure: Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019).

References
