Small updates

2020-04-05 16:07:24 +02:00 · 2020-04-05 16:07:24 +02:00 · 102d4bb689
commit 102d4bb689
parent 044a011a4e
5 changed files with 52 additions and 57 deletions
--- a/_site/atom.xml
+++ b/_site/atom.xml
@ -21,7 +21,7 @@
    </section>
    <section>
-        <p>Two weeks ago, I did a presentation for my colleagues of the paper from <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from NeurIPS 2019. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
+        <p>Two weeks ago, I did a presentation for my colleagues of the paper from <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from <a href="https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019">NeurIPS 2019</a>. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
 <p>This paper seems interesting to me because of it uses two methods with strong theoretical guarantees: optimal transport and topic modelling. Optimal transport looks very promising to me in NLP, and has seen a lot of interest in recent years due to advances in approximation algorithms, such as entropy regularisation. It is also quite refreshing to see approaches using solid results in optimisation, compared to purely experimental deep learning methods.</p>
 <h1 id="introduction-and-motivation">Introduction and motivation</h1>
 <p>The problem of the paper is to measure similarity (i.e. a distance) between pairs of documents, by incorporating <em>semantic</em> similarities (and not only syntactic artefacts), without encountering scalability issues.</p>
@ -31,10 +31,10 @@
 <li>topic modelling methods (e.g. <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a>), to represent semantically-meaningful groups of words.</li>
 </ul>
 <h1 id="background-optimal-transport">Background: optimal transport</h1>
-<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won’t enter into the details here. For an introduction to the theory and its applications, check out the excellent book from <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span>, (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> by Gabriel Peyré on the CNRS maths blog (in French). Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you’re feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
+<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won’t enter into the details here. For an introduction to the theory and its applications, check out the excellent book from <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span>, (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> (in French) by Gabriel Peyré on the <a href="https://images.math.cnrs.fr/">CNRS maths blog</a>. Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you’re feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
 <p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space, to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different repartition of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover’s distance</a>, which is just an instance of the general Wasserstein metric.</p>
-<p>More formally, if we have to sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
+<p>More formally, we start with two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
-      x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_n)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1), we can define the Wasserstein distance as <span class="math display">\[
+      x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_n)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1). We can then define the Wasserstein distance as <span class="math display">\[
 W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 \]</span> <span class="math display">\[
 \text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
@ -69,20 +69,20 @@ W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 <img src="/images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
 </figure>
 <h1 id="experiments">Experiments</h1>
-<p>The paper is very complete regarding experiments, providing a full evaluation of the method on one particular application: document clustering. They use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics and GloVe for pretrained word embeddings <span class="citation" data-cites="moschitti2014_proceed_confer_empir_method_natur">(Moschitti, Pang, and Daelemans <a href="#ref-moschitti2014_proceed_confer_empir_method_natur">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on Github</a>.</p>
+<p>The paper is very complete regarding experiments, providing a full evaluation of the method on one particular application: document clustering. They use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics and GloVe for pretrained word embeddings <span class="citation" data-cites="pennington2014_glove">(Pennington, Socher, and Manning <a href="#ref-pennington2014_glove">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on GitHub</a>.</p>
 <p>If you want the details, I encourage you to read the full paper, they tested the methods on a wide variety of datasets, with datasets containing very short documents (like Twitter), and long documents with a large vocabulary (books). With a simple <span class="math inline">\(k\)</span>-NN classification, they establish that HOTT performs best on average, especially on large vocabularies (books, the “gutenberg” dataset). It also has a much better computational performance than alternative methods based on regularisation of the optimal transport problem directly on words. So the hierarchical nature of the approach allows to gain considerably in performance, along with improvements in interpretability.</p>
 <p>What’s really interesting in the paper is the sensitivity analysis: they ran experiments with different word embeddings methods (word2vec, <span class="citation" data-cites="mikolovDistributedRepresentationsWords2013">(Mikolov et al. <a href="#ref-mikolovDistributedRepresentationsWords2013">2013</a>)</span>), and with different parameters for the topic modelling (topic truncation, number of topics, etc). All of these reveal that changes in hyperparameters do not impact the performance of HOTT significantly. This is extremely important in a field like NLP where most of the times small variations in approach lead to drastically different results.</p>
 <h1 id="conclusion">Conclusion</h1>
 <p>All in all, this paper present a very interesting approach to compute distance between natural-language documents. It is no secret that I like methods with strong theoretical background (in this case optimisation and optimal transport), guaranteeing a stability and benefiting from decades of research in a well-established domain.</p>
 <p>Most importantly, this paper allows for future exploration in document representation with <em>interpretability</em> in mind. This is often added as an afterthought in academic research but is one of the most important topics for the industry, as a system must be understood by end users, often not trained in ML, before being deployed. The notion of topic, and distances as weights, can be understood easily by anyone without significant background in ML or in maths.</p>
-<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked on real-world datasets, measuring sensitivity to all the arbitrary choices they had to take. Again, from an industry perspective, this allows to implement the new approach quickly and easily, confident that it won’t break unexpectedly without extensive testing.</p>
+<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked on real-world datasets, measuring sensitivity to all the arbitrary choices they had to take. Again, from an industry perspective, this allows to implement the new approach quickly and easily, being confident that it won’t break unexpectedly without extensive testing.</p>
 <h1 id="references" class="unnumbered">References</h1>
 <div id="refs" class="references">
 <div id="ref-mikolovDistributedRepresentationsWords2013">
-<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–9. Curran Associates, Inc. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
+<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, 3111–9. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
 </div>
-<div id="ref-moschitti2014_proceed_confer_empir_method_natur">
+<div id="ref-pennington2014_glove">
-<p>Moschitti, Alessandro, Bo Pang, and Walter Daelemans, eds. 2014. <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A Meeting of Sigdat, a Special Interest Group of the ACL</em>. ACL. <a href="https://www.aclweb.org/anthology/volumes/D14-1/" class="uri">https://www.aclweb.org/anthology/volumes/D14-1/</a>.</p>
+<p>Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “Glove: Global Vectors for Word Representation.” In <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</em>, 1532–43. Doha, Qatar: Association for Computational Linguistics. <a href="https://doi.org/10.3115/v1/D14-1162" class="uri">https://doi.org/10.3115/v1/D14-1162</a>.</p>
 </div>
 <div id="ref-peyreComputationalOptimalTransport2019">
 <p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–206. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
--- a/_site/posts/hierarchical-optimal-transport-for-document-classification.html
+++ b/_site/posts/hierarchical-optimal-transport-for-document-classification.html
@ -40,7 +40,7 @@
    </section>
    <section>
-        <p>Two weeks ago, I did a presentation for my colleagues of the paper from <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from NeurIPS 2019. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
+        <p>Two weeks ago, I did a presentation for my colleagues of the paper from <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from <a href="https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019">NeurIPS 2019</a>. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
 <p>This paper seems interesting to me because of it uses two methods with strong theoretical guarantees: optimal transport and topic modelling. Optimal transport looks very promising to me in NLP, and has seen a lot of interest in recent years due to advances in approximation algorithms, such as entropy regularisation. It is also quite refreshing to see approaches using solid results in optimisation, compared to purely experimental deep learning methods.</p>
 <h1 id="introduction-and-motivation">Introduction and motivation</h1>
 <p>The problem of the paper is to measure similarity (i.e. a distance) between pairs of documents, by incorporating <em>semantic</em> similarities (and not only syntactic artefacts), without encountering scalability issues.</p>
@ -50,10 +50,10 @@
 <li>topic modelling methods (e.g. <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a>), to represent semantically-meaningful groups of words.</li>
 </ul>
 <h1 id="background-optimal-transport">Background: optimal transport</h1>
-<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won’t enter into the details here. For an introduction to the theory and its applications, check out the excellent book from <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span>, (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> by Gabriel Peyré on the CNRS maths blog (in French). Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you’re feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
+<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won’t enter into the details here. For an introduction to the theory and its applications, check out the excellent book from <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span>, (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> (in French) by Gabriel Peyré on the <a href="https://images.math.cnrs.fr/">CNRS maths blog</a>. Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you’re feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
 <p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space, to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different repartition of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover’s distance</a>, which is just an instance of the general Wasserstein metric.</p>
-<p>More formally, if we have to sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
+<p>More formally, we start with two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
-      x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_n)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1), we can define the Wasserstein distance as <span class="math display">\[
+      x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_n)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1). We can then define the Wasserstein distance as <span class="math display">\[
 W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 \]</span> <span class="math display">\[
 \text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
@ -88,20 +88,20 @@ W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 <img src="../images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
 </figure>
 <h1 id="experiments">Experiments</h1>
-<p>The paper is very complete regarding experiments, providing a full evaluation of the method on one particular application: document clustering. They use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics and GloVe for pretrained word embeddings <span class="citation" data-cites="moschitti2014_proceed_confer_empir_method_natur">(Moschitti, Pang, and Daelemans <a href="#ref-moschitti2014_proceed_confer_empir_method_natur">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on Github</a>.</p>
+<p>The paper is very complete regarding experiments, providing a full evaluation of the method on one particular application: document clustering. They use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics and GloVe for pretrained word embeddings <span class="citation" data-cites="pennington2014_glove">(Pennington, Socher, and Manning <a href="#ref-pennington2014_glove">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on GitHub</a>.</p>
 <p>If you want the details, I encourage you to read the full paper, they tested the methods on a wide variety of datasets, with datasets containing very short documents (like Twitter), and long documents with a large vocabulary (books). With a simple <span class="math inline">\(k\)</span>-NN classification, they establish that HOTT performs best on average, especially on large vocabularies (books, the “gutenberg” dataset). It also has a much better computational performance than alternative methods based on regularisation of the optimal transport problem directly on words. So the hierarchical nature of the approach allows to gain considerably in performance, along with improvements in interpretability.</p>
 <p>What’s really interesting in the paper is the sensitivity analysis: they ran experiments with different word embeddings methods (word2vec, <span class="citation" data-cites="mikolovDistributedRepresentationsWords2013">(Mikolov et al. <a href="#ref-mikolovDistributedRepresentationsWords2013">2013</a>)</span>), and with different parameters for the topic modelling (topic truncation, number of topics, etc). All of these reveal that changes in hyperparameters do not impact the performance of HOTT significantly. This is extremely important in a field like NLP where most of the times small variations in approach lead to drastically different results.</p>
 <h1 id="conclusion">Conclusion</h1>
 <p>All in all, this paper present a very interesting approach to compute distance between natural-language documents. It is no secret that I like methods with strong theoretical background (in this case optimisation and optimal transport), guaranteeing a stability and benefiting from decades of research in a well-established domain.</p>
 <p>Most importantly, this paper allows for future exploration in document representation with <em>interpretability</em> in mind. This is often added as an afterthought in academic research but is one of the most important topics for the industry, as a system must be understood by end users, often not trained in ML, before being deployed. The notion of topic, and distances as weights, can be understood easily by anyone without significant background in ML or in maths.</p>
-<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked on real-world datasets, measuring sensitivity to all the arbitrary choices they had to take. Again, from an industry perspective, this allows to implement the new approach quickly and easily, confident that it won’t break unexpectedly without extensive testing.</p>
+<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked on real-world datasets, measuring sensitivity to all the arbitrary choices they had to take. Again, from an industry perspective, this allows to implement the new approach quickly and easily, being confident that it won’t break unexpectedly without extensive testing.</p>
 <h1 id="references" class="unnumbered">References</h1>
 <div id="refs" class="references">
 <div id="ref-mikolovDistributedRepresentationsWords2013">
-<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–9. Curran Associates, Inc. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
+<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, 3111–9. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
 </div>
-<div id="ref-moschitti2014_proceed_confer_empir_method_natur">
+<div id="ref-pennington2014_glove">
-<p>Moschitti, Alessandro, Bo Pang, and Walter Daelemans, eds. 2014. <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A Meeting of Sigdat, a Special Interest Group of the ACL</em>. ACL. <a href="https://www.aclweb.org/anthology/volumes/D14-1/" class="uri">https://www.aclweb.org/anthology/volumes/D14-1/</a>.</p>
+<p>Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “Glove: Global Vectors for Word Representation.” In <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</em>, 1532–43. Doha, Qatar: Association for Computational Linguistics. <a href="https://doi.org/10.3115/v1/D14-1162" class="uri">https://doi.org/10.3115/v1/D14-1162</a>.</p>
 </div>
 <div id="ref-peyreComputationalOptimalTransport2019">
 <p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–206. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
--- a/_site/rss.xml
+++ b/_site/rss.xml
@ -17,7 +17,7 @@
    </section>
    <section>
-        <p>Two weeks ago, I did a presentation for my colleagues of the paper from <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from NeurIPS 2019. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
+        <p>Two weeks ago, I did a presentation for my colleagues of the paper from <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">Yurochkin et al. (<a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>, from <a href="https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019">NeurIPS 2019</a>. It contains an interesting approach to document classification leading to strong performance, and, most importantly, excellent interpretability.</p>
 <p>This paper seems interesting to me because of it uses two methods with strong theoretical guarantees: optimal transport and topic modelling. Optimal transport looks very promising to me in NLP, and has seen a lot of interest in recent years due to advances in approximation algorithms, such as entropy regularisation. It is also quite refreshing to see approaches using solid results in optimisation, compared to purely experimental deep learning methods.</p>
 <h1 id="introduction-and-motivation">Introduction and motivation</h1>
 <p>The problem of the paper is to measure similarity (i.e. a distance) between pairs of documents, by incorporating <em>semantic</em> similarities (and not only syntactic artefacts), without encountering scalability issues.</p>
@ -27,10 +27,10 @@
 <li>topic modelling methods (e.g. <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a>), to represent semantically-meaningful groups of words.</li>
 </ul>
 <h1 id="background-optimal-transport">Background: optimal transport</h1>
-<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won’t enter into the details here. For an introduction to the theory and its applications, check out the excellent book from <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span>, (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> by Gabriel Peyré on the CNRS maths blog (in French). Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you’re feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
+<p>The essential backbone of the method is the Wasserstein distance, derived from optimal transport theory. Optimal transport is a fascinating and deep subject, so I won’t enter into the details here. For an introduction to the theory and its applications, check out the excellent book from <span class="citation" data-cites="peyreComputationalOptimalTransport2019">Peyré and Cuturi (<a href="#ref-peyreComputationalOptimalTransport2019">2019</a>)</span>, (<a href="https://arxiv.org/abs/1803.00567">available on ArXiv</a> as well). There are also <a href="https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr">very nice posts</a> (in French) by Gabriel Peyré on the <a href="https://images.math.cnrs.fr/">CNRS maths blog</a>. Many more resources (including slides for presentations) are available at <a href="https://optimaltransport.github.io" class="uri">https://optimaltransport.github.io</a>. For a more complete theoretical treatment of the subject, check out <span class="citation" data-cites="santambrogioOptimalTransportApplied2015">Santambrogio (<a href="#ref-santambrogioOptimalTransportApplied2015">2015</a>)</span>, or, if you’re feeling particularly adventurous, <span class="citation" data-cites="villaniOptimalTransportOld2009">Villani (<a href="#ref-villaniOptimalTransportOld2009">2009</a>)</span>.</p>
 <p>For this paper, only a superficial understanding of how the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> works is necessary. Optimal transport is an optimisation technique to lift a distance between points in a given metric space, to a distance between probability <em>distributions</em> over this metric space. The historical example is to move piles of dirt around: you know the distance between any two points, and you have piles of dirt lying around<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Now, if you want to move these piles to another configuration (fewer piles, say, or a different repartition of dirt a few metres away), you need to find the most efficient way to move them. The total cost you obtain will define a distance between the two configurations of dirt, and is usually called the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover’s distance</a>, which is just an instance of the general Wasserstein metric.</p>
-<p>More formally, if we have to sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
+<p>More formally, we start with two sets of points <span class="math inline">\(x = (x_1, x_2, \ldots,
-      x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_n)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1), we can define the Wasserstein distance as <span class="math display">\[
+      x_n)\)</span>, and <span class="math inline">\(y = (y_1, y_2, \ldots, y_n)\)</span>, along with probability distributions <span class="math inline">\(p \in \Delta^n\)</span>, <span class="math inline">\(q \in \Delta^m\)</span> over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> (<span class="math inline">\(\Delta^n\)</span> is the probability simplex of dimension <span class="math inline">\(n\)</span>, i.e. the set of vectors of size <span class="math inline">\(n\)</span> summing to 1). We can then define the Wasserstein distance as <span class="math display">\[
 W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 \]</span> <span class="math display">\[
 \text{\small subject to } \sum_j P_{i,j} = p_i \text{ \small and } \sum_i P_{i,j} = q_j,
@ -65,20 +65,20 @@ W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 <img src="/images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
 </figure>
 <h1 id="experiments">Experiments</h1>
-<p>The paper is very complete regarding experiments, providing a full evaluation of the method on one particular application: document clustering. They use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics and GloVe for pretrained word embeddings <span class="citation" data-cites="moschitti2014_proceed_confer_empir_method_natur">(Moschitti, Pang, and Daelemans <a href="#ref-moschitti2014_proceed_confer_empir_method_natur">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on Github</a>.</p>
+<p>The paper is very complete regarding experiments, providing a full evaluation of the method on one particular application: document clustering. They use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics and GloVe for pretrained word embeddings <span class="citation" data-cites="pennington2014_glove">(Pennington, Socher, and Manning <a href="#ref-pennington2014_glove">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on GitHub</a>.</p>
 <p>If you want the details, I encourage you to read the full paper, they tested the methods on a wide variety of datasets, with datasets containing very short documents (like Twitter), and long documents with a large vocabulary (books). With a simple <span class="math inline">\(k\)</span>-NN classification, they establish that HOTT performs best on average, especially on large vocabularies (books, the “gutenberg” dataset). It also has a much better computational performance than alternative methods based on regularisation of the optimal transport problem directly on words. So the hierarchical nature of the approach allows to gain considerably in performance, along with improvements in interpretability.</p>
 <p>What’s really interesting in the paper is the sensitivity analysis: they ran experiments with different word embeddings methods (word2vec, <span class="citation" data-cites="mikolovDistributedRepresentationsWords2013">(Mikolov et al. <a href="#ref-mikolovDistributedRepresentationsWords2013">2013</a>)</span>), and with different parameters for the topic modelling (topic truncation, number of topics, etc). All of these reveal that changes in hyperparameters do not impact the performance of HOTT significantly. This is extremely important in a field like NLP where most of the times small variations in approach lead to drastically different results.</p>
 <h1 id="conclusion">Conclusion</h1>
 <p>All in all, this paper present a very interesting approach to compute distance between natural-language documents. It is no secret that I like methods with strong theoretical background (in this case optimisation and optimal transport), guaranteeing a stability and benefiting from decades of research in a well-established domain.</p>
 <p>Most importantly, this paper allows for future exploration in document representation with <em>interpretability</em> in mind. This is often added as an afterthought in academic research but is one of the most important topics for the industry, as a system must be understood by end users, often not trained in ML, before being deployed. The notion of topic, and distances as weights, can be understood easily by anyone without significant background in ML or in maths.</p>
-<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked on real-world datasets, measuring sensitivity to all the arbitrary choices they had to take. Again, from an industry perspective, this allows to implement the new approach quickly and easily, confident that it won’t break unexpectedly without extensive testing.</p>
+<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked on real-world datasets, measuring sensitivity to all the arbitrary choices they had to take. Again, from an industry perspective, this allows to implement the new approach quickly and easily, being confident that it won’t break unexpectedly without extensive testing.</p>
 <h1 id="references" class="unnumbered">References</h1>
 <div id="refs" class="references">
 <div id="ref-mikolovDistributedRepresentationsWords2013">
-<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–9. Curran Associates, Inc. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
+<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, 3111–9. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
 </div>
-<div id="ref-moschitti2014_proceed_confer_empir_method_natur">
+<div id="ref-pennington2014_glove">
-<p>Moschitti, Alessandro, Bo Pang, and Walter Daelemans, eds. 2014. <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A Meeting of Sigdat, a Special Interest Group of the ACL</em>. ACL. <a href="https://www.aclweb.org/anthology/volumes/D14-1/" class="uri">https://www.aclweb.org/anthology/volumes/D14-1/</a>.</p>
+<p>Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “Glove: Global Vectors for Word Representation.” In <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</em>, 1532–43. Doha, Qatar: Association for Computational Linguistics. <a href="https://doi.org/10.3115/v1/D14-1162" class="uri">https://doi.org/10.3115/v1/D14-1162</a>.</p>
 </div>
 <div id="ref-peyreComputationalOptimalTransport2019">
 <p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–206. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
--- a/bib/bibliography.bib
+++ b/bib/bibliography.bib
@ -160,33 +160,28 @@
                  https://dblp.org}
 }
-@proceedings{moschitti2014_proceed_confer_empir_method_natur,
+@inproceedings{pennington2014_glove,
-  bibsource =	 {dblp computer science bibliography,
+  author =	 "Pennington, Jeffrey and Socher, Richard and Manning,
-                  https://dblp.org},
+                  Christopher",
-  biburl =	 {https://dblp.org/rec/conf/emnlp/2014.bib},
+  title =	 "{G}love: Global Vectors for Word Representation",
-  editor =	 {Alessandro Moschitti and Bo Pang and Walter
+  booktitle =	 "Proceedings of the 2014 Conference on Empirical
-                  Daelemans},
+                  Methods in Natural Language Processing ({EMNLP})",
  isbn =	 {978-1-937284-96-1},
  publisher =	 {{ACL}},
  timestamp =	 {Fri, 13 Sep 2019 13:08:45 +0200},
  title =	 {Proceedings of the 2014 Conference on Empirical
                  Methods in Natural Language Processing, {EMNLP}
                  2014, October 25-29, 2014, Doha, Qatar, {A} meeting
                  of SIGDAT, a Special Interest Group of the {ACL}},
  url =		 {https://www.aclweb.org/anthology/volumes/D14-1/},
  year =	 2014,
  pages =	 "1532--1543",
  doi =		 "10.3115/v1/D14-1162",
  url =		 {https://doi.org/10.3115/v1/D14-1162},
  address =	 "Doha, Qatar",
  month =	 oct,
  publisher =	 "Association for Computational Linguistics",
 }
@incollection{mikolovDistributedRepresentationsWords2013,
  title = {Distributed {{Representations}} of {{Words}} and {{Phrases}} and Their {{Compositionality}}},
  url = {http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf},
  booktitle = {Advances in {{Neural Information Processing Systems}} 26},
  publisher = {{Curran Associates, Inc.}},
  urldate = {2019-08-13},
  date = {2013},
  pages = {3111--3119},
  author = {Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff},
  editor = {Burges, C. J. C. and Bottou, L. and Welling, M. and Ghahramani, Z. and Weinberger, K. Q.},
  file = {/home/dimitri/Nextcloud/Zotero/storage/Q4GDL59G/5021-distributed-representations-of-words-andphrases.html}
 }
--- a/posts/hierarchical-optimal-transport-for-document-classification.org
+++ b/posts/hierarchical-optimal-transport-for-document-classification.org
@ -4,10 +4,10 @@ date: 2020-04-05
 ---
 Two weeks ago, I did a presentation for my colleagues of the paper
-from cite:yurochkin2019_hierar_optim_trans_docum_repres, from
+from cite:yurochkin2019_hierar_optim_trans_docum_repres, from [[https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019][NeurIPS
-NeurIPS 2019. It contains an interesting approach to document
+2019]]. It contains an interesting approach to document classification
-classification leading to strong performance, and, most importantly,
+leading to strong performance, and, most importantly, excellent
-excellent interpretability.
+interpretability.
 This paper seems interesting to me because of it uses two methods with
 strong theoretical guarantees: optimal transport and topic
@ -41,8 +41,8 @@ fascinating and deep subject, so I won't enter into the details
 here. For an introduction to the theory and its applications, check
 out the excellent book from
 cite:peyreComputationalOptimalTransport2019, ([[https://arxiv.org/abs/1803.00567][available on ArXiv]] as
-well). There are also [[https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr][very nice posts]] by Gabriel Peyré on the CNRS
+well). There are also [[https://images.math.cnrs.fr/Le-transport-optimal-numerique-et-ses-applications-Partie-1.html?lang=fr][very nice posts]] (in French) by Gabriel Peyré on
-maths blog (in French). Many more resources (including slides for
+the [[https://images.math.cnrs.fr/][CNRS maths blog]]. Many more resources (including slides for
 presentations) are available at
 [[https://optimaltransport.github.io]]. For a more complete theoretical
 treatment of the subject, check out
@ -70,8 +70,8 @@ examples move cannon balls, or other military equipment, along a front
 line.
-More formally, if we have to sets of points $x = (x_1, x_2, \ldots,
+More formally, we start with two sets of points $x = (x_1, x_2, \ldots,
-      x_n)$, and $y = (y_1, y_2, \ldots, y_n)$, along with probability distributions $p \in \Delta^n$, $q \in \Delta^m$ over $x$ and $y$ ($\Delta^n$ is the probability simplex of dimension $n$, i.e. the set of vectors of size $n$ summing to 1), we can define the Wasserstein distance as
+      x_n)$, and $y = (y_1, y_2, \ldots, y_n)$, along with probability distributions $p \in \Delta^n$, $q \in \Delta^m$ over $x$ and $y$ ($\Delta^n$ is the probability simplex of dimension $n$, i.e. the set of vectors of size $n$ summing to 1). We can then define the Wasserstein distance as
 \[
 W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 \]
@ -162,9 +162,9 @@ optimisation algorithm to compute higher-level distances.
 The paper is very complete regarding experiments, providing a full
 evaluation of the method on one particular application: document
 clustering. They use [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]] to compute topics and
-GloVe for pretrained word embeddings
+GloVe for pretrained word embeddings citep:pennington2014_glove, and
-citep:moschitti2014_proceed_confer_empir_method_natur, and [[https://www.gurobi.com/][Gurobi]] to
+[[https://www.gurobi.com/][Gurobi]] to solve the optimisation problems. Their code is available [[https://github.com/IBM/HOTT][on
-solve the optimisation problems. Their code is available [[https://github.com/IBM/HOTT][on Github]].
+GitHub]].
 If you want the details, I encourage you to read the full paper, they
 tested the methods on a wide variety of datasets, with datasets
@ -206,7 +206,7 @@ Finally, I feel like they did not stop at a simple theoretical
 argument, but carefully checked on real-world datasets, measuring
 sensitivity to all the arbitrary choices they had to take. Again, from
 an industry perspective, this allows to implement the new approach
-quickly and easily, confident that it won't break unexpectedly without
+quickly and easily, being confident that it won't break unexpectedly
-extensive testing.
+without extensive testing.
 * References