Add experiments and conclusion

2020-04-05 15:55:21 +02:00
commit 044a011a4e
parent 3524466d4c
5 changed files with 143 additions and 0 deletions
--- a/_site/atom.xml
+++ b/_site/atom.xml
@@ -68,8 +68,22 @@ W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 <figure>
 <img src="/images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
 </figure>
+<h1 id="experiments">Experiments</h1>
+<p>The paper is thorough in its experiments, providing a full evaluation of the method on one particular application: document classification. The authors use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics, GloVe for pretrained word embeddings <span class="citation" data-cites="moschitti2014_proceed_confer_empir_method_natur">(Moschitti, Pang, and Daelemans <a href="#ref-moschitti2014_proceed_confer_empir_method_natur">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on GitHub</a>.</p>
+<p>For the details, I encourage you to read the full paper. The authors tested the methods on a wide variety of datasets, ranging from very short documents (like Twitter) to long documents with a large vocabulary (books). With a simple <span class="math inline">\(k\)</span>-NN classification, they establish that HOTT performs best on average, especially on large vocabularies (books, the “gutenberg” dataset). It also has much better computational performance than alternative methods based on regularisation of the optimal transport problem directly on words. The hierarchical nature of the approach thus yields considerable gains in performance, along with improvements in interpretability.</p>
+<p>What’s really interesting in the paper is the sensitivity analysis: they ran experiments with different word embedding methods (word2vec, <span class="citation" data-cites="mikolovDistributedRepresentationsWords2013">(Mikolov et al. <a href="#ref-mikolovDistributedRepresentationsWords2013">2013</a>)</span>), and with different parameters for the topic modelling (topic truncation, number of topics, etc.). All of these reveal that changes in hyperparameters do not impact the performance of HOTT significantly. This is extremely important in a field like NLP, where small variations in approach often lead to drastically different results.</p>
+<h1 id="conclusion">Conclusion</h1>
+<p>All in all, this paper presents a very interesting approach to computing distances between natural-language documents. It is no secret that I like methods with a strong theoretical background (in this case optimisation and optimal transport), guaranteeing stability and benefiting from decades of research in a well-established domain.</p>
+<p>Most importantly, this paper allows for future exploration in document representation with <em>interpretability</em> in mind. Interpretability is often added as an afterthought in academic research, but it is one of the most important topics for the industry, as a system must be understood by end users, often not trained in ML, before being deployed. The notions of topics, and of distances as weights, can be understood easily by anyone without a significant background in ML or in maths.</p>
+<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked it on real-world datasets, measuring sensitivity to all the arbitrary choices they had to make. Again, from an industry perspective, this makes it possible to implement the new approach quickly and easily, confident that it won’t break unexpectedly, even without extensive testing.</p>
 <h1 id="references" class="unnumbered">References</h1>
 <div id="refs" class="references">
+<div id="ref-mikolovDistributedRepresentationsWords2013">
+<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–9. Curran Associates, Inc. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
+</div>
+<div id="ref-moschitti2014_proceed_confer_empir_method_natur">
+<p>Moschitti, Alessandro, Bo Pang, and Walter Daelemans, eds. 2014. <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A Meeting of Sigdat, a Special Interest Group of the ACL</em>. ACL. <a href="https://www.aclweb.org/anthology/volumes/D14-1/" class="uri">https://www.aclweb.org/anthology/volumes/D14-1/</a>.</p>
+</div>
 <div id="ref-peyreComputationalOptimalTransport2019">
 <p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–206. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
 </div>
--- a/_site/posts/hierarchical-optimal-transport-for-document-classification.html
+++ b/_site/posts/hierarchical-optimal-transport-for-document-classification.html
@@ -87,8 +87,22 @@ W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 <figure>
 <img src="../images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
 </figure>
+<h1 id="experiments">Experiments</h1>
+<p>The paper is thorough in its experiments, providing a full evaluation of the method on one particular application: document classification. The authors use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics, GloVe for pretrained word embeddings <span class="citation" data-cites="moschitti2014_proceed_confer_empir_method_natur">(Moschitti, Pang, and Daelemans <a href="#ref-moschitti2014_proceed_confer_empir_method_natur">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on GitHub</a>.</p>
+<p>For the details, I encourage you to read the full paper. The authors tested the methods on a wide variety of datasets, ranging from very short documents (like Twitter) to long documents with a large vocabulary (books). With a simple <span class="math inline">\(k\)</span>-NN classification, they establish that HOTT performs best on average, especially on large vocabularies (books, the “gutenberg” dataset). It also has much better computational performance than alternative methods based on regularisation of the optimal transport problem directly on words. The hierarchical nature of the approach thus yields considerable gains in performance, along with improvements in interpretability.</p>
+<p>What’s really interesting in the paper is the sensitivity analysis: they ran experiments with different word embedding methods (word2vec, <span class="citation" data-cites="mikolovDistributedRepresentationsWords2013">(Mikolov et al. <a href="#ref-mikolovDistributedRepresentationsWords2013">2013</a>)</span>), and with different parameters for the topic modelling (topic truncation, number of topics, etc.). All of these reveal that changes in hyperparameters do not impact the performance of HOTT significantly. This is extremely important in a field like NLP, where small variations in approach often lead to drastically different results.</p>
+<h1 id="conclusion">Conclusion</h1>
+<p>All in all, this paper presents a very interesting approach to computing distances between natural-language documents. It is no secret that I like methods with a strong theoretical background (in this case optimisation and optimal transport), guaranteeing stability and benefiting from decades of research in a well-established domain.</p>
+<p>Most importantly, this paper allows for future exploration in document representation with <em>interpretability</em> in mind. Interpretability is often added as an afterthought in academic research, but it is one of the most important topics for the industry, as a system must be understood by end users, often not trained in ML, before being deployed. The notions of topics, and of distances as weights, can be understood easily by anyone without a significant background in ML or in maths.</p>
+<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked it on real-world datasets, measuring sensitivity to all the arbitrary choices they had to make. Again, from an industry perspective, this makes it possible to implement the new approach quickly and easily, confident that it won’t break unexpectedly, even without extensive testing.</p>
 <h1 id="references" class="unnumbered">References</h1>
 <div id="refs" class="references">
+<div id="ref-mikolovDistributedRepresentationsWords2013">
+<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–9. Curran Associates, Inc. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
+</div>
+<div id="ref-moschitti2014_proceed_confer_empir_method_natur">
+<p>Moschitti, Alessandro, Bo Pang, and Walter Daelemans, eds. 2014. <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A Meeting of Sigdat, a Special Interest Group of the ACL</em>. ACL. <a href="https://www.aclweb.org/anthology/volumes/D14-1/" class="uri">https://www.aclweb.org/anthology/volumes/D14-1/</a>.</p>
+</div>
 <div id="ref-peyreComputationalOptimalTransport2019">
 <p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–206. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
 </div>
--- a/_site/rss.xml
+++ b/_site/rss.xml
@@ -64,8 +64,22 @@ W_1(p, q) = \min_{P \in \mathbb{R}_+^{n\times m}} \sum_{i,j} C_{i,j} P_{i,j}
 <figure>
 <img src="/images/hott_fig1.png" alt="Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books (Yurochkin et al. 2019)." /><figcaption>Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books <span class="citation" data-cites="yurochkin2019_hierar_optim_trans_docum_repres">(Yurochkin et al. <a href="#ref-yurochkin2019_hierar_optim_trans_docum_repres">2019</a>)</span>.</figcaption>
 </figure>
+<h1 id="experiments">Experiments</h1>
+<p>The paper is thorough in its experiments, providing a full evaluation of the method on one particular application: document classification. The authors use <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">Latent Dirichlet Allocation</a> to compute topics, GloVe for pretrained word embeddings <span class="citation" data-cites="moschitti2014_proceed_confer_empir_method_natur">(Moschitti, Pang, and Daelemans <a href="#ref-moschitti2014_proceed_confer_empir_method_natur">2014</a>)</span>, and <a href="https://www.gurobi.com/">Gurobi</a> to solve the optimisation problems. Their code is available <a href="https://github.com/IBM/HOTT">on GitHub</a>.</p>
+<p>For the details, I encourage you to read the full paper. The authors tested the methods on a wide variety of datasets, ranging from very short documents (like Twitter) to long documents with a large vocabulary (books). With a simple <span class="math inline">\(k\)</span>-NN classification, they establish that HOTT performs best on average, especially on large vocabularies (books, the “gutenberg” dataset). It also has much better computational performance than alternative methods based on regularisation of the optimal transport problem directly on words. The hierarchical nature of the approach thus yields considerable gains in performance, along with improvements in interpretability.</p>
+<p>What’s really interesting in the paper is the sensitivity analysis: they ran experiments with different word embedding methods (word2vec, <span class="citation" data-cites="mikolovDistributedRepresentationsWords2013">(Mikolov et al. <a href="#ref-mikolovDistributedRepresentationsWords2013">2013</a>)</span>), and with different parameters for the topic modelling (topic truncation, number of topics, etc.). All of these reveal that changes in hyperparameters do not impact the performance of HOTT significantly. This is extremely important in a field like NLP, where small variations in approach often lead to drastically different results.</p>
+<h1 id="conclusion">Conclusion</h1>
+<p>All in all, this paper presents a very interesting approach to computing distances between natural-language documents. It is no secret that I like methods with a strong theoretical background (in this case optimisation and optimal transport), guaranteeing stability and benefiting from decades of research in a well-established domain.</p>
+<p>Most importantly, this paper allows for future exploration in document representation with <em>interpretability</em> in mind. Interpretability is often added as an afterthought in academic research, but it is one of the most important topics for the industry, as a system must be understood by end users, often not trained in ML, before being deployed. The notions of topics, and of distances as weights, can be understood easily by anyone without a significant background in ML or in maths.</p>
+<p>Finally, I feel like they did not stop at a simple theoretical argument, but carefully checked it on real-world datasets, measuring sensitivity to all the arbitrary choices they had to make. Again, from an industry perspective, this makes it possible to implement the new approach quickly and easily, confident that it won’t break unexpectedly, even without extensive testing.</p>
 <h1 id="references" class="unnumbered">References</h1>
 <div id="refs" class="references">
+<div id="ref-mikolovDistributedRepresentationsWords2013">
+<p>Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In <em>Advances in Neural Information Processing Systems 26</em>, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–9. Curran Associates, Inc. <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf" class="uri">http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf</a>.</p>
+</div>
+<div id="ref-moschitti2014_proceed_confer_empir_method_natur">
+<p>Moschitti, Alessandro, Bo Pang, and Walter Daelemans, eds. 2014. <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A Meeting of Sigdat, a Special Interest Group of the ACL</em>. ACL. <a href="https://www.aclweb.org/anthology/volumes/D14-1/" class="uri">https://www.aclweb.org/anthology/volumes/D14-1/</a>.</p>
+</div>
 <div id="ref-peyreComputationalOptimalTransport2019">
 <p>Peyré, Gabriel, and Marco Cuturi. 2019. “Computational Optimal Transport.” <em>Foundations and Trends in Machine Learning</em> 11 (5-6): 355–206. <a href="https://doi.org/10.1561/2200000073" class="uri">https://doi.org/10.1561/2200000073</a>.</p>
 </div>
--- a/bib/bibliography.bib
+++ b/bib/bibliography.bib
@@ -141,3 +141,52 @@
  note = {OCLC: ocn244421231}
 }

+@InProceedings{DBLP:conf/emnlp/PenningtonSM14,
+  author       = {Jeffrey Pennington and Richard Socher and
+                  Christopher D. Manning},
+  title	       = {{GloVe}: Global Vectors for Word Representation},
+  year	       = 2014,
+  booktitle    = {Proceedings of the 2014 Conference on Empirical
+                  Methods in Natural Language Processing, {EMNLP}
+                  2014, October 25-29, 2014, Doha, Qatar, {A} meeting
+                  of SIGDAT, a Special Interest Group of the {ACL}},
+  pages	       = {1532-1543},
+  doi	       = {10.3115/v1/d14-1162},
+  url	       = {https://doi.org/10.3115/v1/d14-1162},
+  crossref     = {DBLP:conf/emnlp/2014},
+  timestamp    = {Tue, 28 Jan 2020 10:28:11 +0100},
+  biburl       = {https://dblp.org/rec/conf/emnlp/PenningtonSM14.bib},
+  bibsource    = {dblp computer science bibliography,
+                  https://dblp.org}
+}
+
+@proceedings{moschitti2014_proceed_confer_empir_method_natur,
+  bibsource =	 {dblp computer science bibliography,
+                  https://dblp.org},
+  biburl =	 {https://dblp.org/rec/conf/emnlp/2014.bib},
+  editor =	 {Alessandro Moschitti and Bo Pang and Walter
+                  Daelemans},
+  isbn =	 {978-1-937284-96-1},
+  publisher =	 {{ACL}},
+  timestamp =	 {Fri, 13 Sep 2019 13:08:45 +0200},
+  title =	 {Proceedings of the 2014 Conference on Empirical
+                  Methods in Natural Language Processing, {EMNLP}
+                  2014, October 25-29, 2014, Doha, Qatar, {A} meeting
+                  of SIGDAT, a Special Interest Group of the {ACL}},
+  url =		 {https://www.aclweb.org/anthology/volumes/D14-1/},
+  year =	 2014,
+}
+
+@incollection{mikolovDistributedRepresentationsWords2013,
+  title = {Distributed {{Representations}} of {{Words}} and {{Phrases}} and Their {{Compositionality}}},
+  url = {http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf},
+  booktitle = {Advances in {{Neural Information Processing Systems}} 26},
+  publisher = {{Curran Associates, Inc.}},
+  urldate = {2019-08-13},
+  date = {2013},
+  pages = {3111--3119},
+  author = {Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff},
+  editor = {Burges, C. J. C. and Bottou, L. and Welling, M. and Ghahramani, Z. and Weinberger, K. Q.},
+  file = {/home/dimitri/Nextcloud/Zotero/storage/Q4GDL59G/5021-distributed-representations-of-words-andphrases.html}
+}
+
--- a/posts/hierarchical-optimal-transport-for-document-classification.org
+++ b/posts/hierarchical-optimal-transport-for-document-classification.org
@@ -157,4 +157,56 @@ optimisation algorithm to compute higher-level distances.
 #+caption: Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books citep:yurochkin2019_hierar_optim_trans_docum_repres.
 [[file:/images/hott_fig1.png]]

+* Experiments
+
+The paper is thorough in its experiments, providing a full
+evaluation of the method on one particular application: document
+classification. The authors use [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]] to compute topics,
+GloVe for pretrained word embeddings
+citep:moschitti2014_proceed_confer_empir_method_natur, and [[https://www.gurobi.com/][Gurobi]] to
+solve the optimisation problems. Their code is available [[https://github.com/IBM/HOTT][on GitHub]].
+
+For the details, I encourage you to read the full paper. The authors
+tested the methods on a wide variety of datasets, ranging from very
+short documents (like Twitter) to long documents with a large
+vocabulary (books). With a simple $k$-NN classification, they
+establish that HOTT performs best on average, especially on large
+vocabularies (books, the "gutenberg" dataset). It also has much
+better computational performance than alternative methods based on
+regularisation of the optimal transport problem directly on words.
+The hierarchical nature of the approach thus yields considerable
+gains in performance, along with improvements in interpretability.
+
+What's really interesting in the paper is the sensitivity analysis:
+they ran experiments with different word embedding methods (word2vec,
+citep:mikolovDistributedRepresentationsWords2013), and with different
+parameters for the topic modelling (topic truncation, number of
+topics, etc.). All of these reveal that changes in hyperparameters do
+not impact the performance of HOTT significantly. This is extremely
+important in a field like NLP, where small variations in approach
+often lead to drastically different results.
+
+* Conclusion
+
+All in all, this paper presents a very interesting approach to
+computing distances between natural-language documents. It is no
+secret that I like methods with a strong theoretical background (in
+this case optimisation and optimal transport), guaranteeing stability
+and benefiting from decades of research in a well-established domain.
+
+Most importantly, this paper allows for future exploration in document
+representation with /interpretability/ in mind. Interpretability is often
+added as an afterthought in academic research, but it is one of the most
+important topics for the industry, as a system must be understood by end
+users, often not trained in ML, before being deployed. The notions of
+topics, and of distances as weights, can be understood easily by anyone
+without a significant background in ML or in maths.
+
+Finally, I feel like they did not stop at a simple theoretical
+argument, but carefully checked it on real-world datasets, measuring
+sensitivity to all the arbitrary choices they had to make. Again, from
+an industry perspective, this makes it possible to implement the new
+approach quickly and easily, confident that it won't break
+unexpectedly, even without extensive testing.
+
 * References