Add experiments and conclusion

This commit is contained in:
Dimitri Lozeve 2020-04-05 15:55:21 +02:00
parent 3524466d4c
commit 044a011a4e
5 changed files with 143 additions and 0 deletions

@ -157,4 +157,56 @@ optimisation algorithm to compute higher-level distances.
#+caption: Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books citep:yurochkin2019_hierar_optim_trans_docum_repres.
[[file:/images/hott_fig1.png]]
* Experiments
The paper is very thorough on the experimental side, providing a full
evaluation of the method on one particular application: document
classification. They use [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]] to compute topics,
GloVe for pretrained word embeddings
citep:moschitti2014_proceed_confer_empir_method_natur, and [[https://www.gurobi.com/][Gurobi]] to
solve the optimisation problems. Their code is available [[https://github.com/IBM/HOTT][on GitHub]].
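
To make this hierarchical structure concrete, here is a minimal sketch
of how the two-level computation could look in Python. It is /not/ the
authors' implementation: POT's exact solver stands in for Gurobi, the
topic-word weights and document topic proportions are assumed to come
from an LDA model, and the function names (~topic_cost_matrix~,
~hott~) are mine.

#+begin_src python
# Sketch of the hierarchical optimal transport distance (assumptions: POT's
# exact solver replaces Gurobi, topic-word weights come from LDA, word
# embeddings are pretrained GloVe vectors).
import numpy as np
import ot  # Python Optimal Transport


def topic_cost_matrix(topic_word, embeddings, top_n=20):
    """Pairwise W1 distances between topics, restricted to their top_n words.

    topic_word: (n_topics, n_vocab) topic-word weights (e.g. lda.components_).
    embeddings: (n_vocab, dim) pretrained word vectors (e.g. GloVe).
    """
    n_topics = topic_word.shape[0]
    tops = np.argsort(topic_word, axis=1)[:, -top_n:]  # top_n words per topic
    costs = np.zeros((n_topics, n_topics))
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            wi, wj = tops[i], tops[j]
            a = topic_word[i, wi] / topic_word[i, wi].sum()
            b = topic_word[j, wj] / topic_word[j, wj].sum()
            # ground cost: distances between word embeddings
            M = np.linalg.norm(embeddings[wi][:, None] - embeddings[wj][None, :],
                               axis=-1)
            costs[i, j] = costs[j, i] = ot.emd2(a, b, M)
    return costs


def hott(topics_a, topics_b, topic_costs):
    """HOTT distance: optimal transport between two documents' topic
    distributions, with topic-to-topic W1 distances as ground costs."""
    return ot.emd2(topics_a, topics_b, topic_costs)
#+end_src

The document-level inputs would be the topic proportions returned by
~lda.transform~ and the topic-word weights in ~lda.components_~; the
paper additionally truncates negligible topic proportions, which this
sketch omits.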

If you want the details, I encourage you to read the full paper: they
tested the method on a wide variety of datasets, from very short
documents (tweets) to long documents with a large vocabulary
(books). With a simple $k$-NN classification, they establish that HOTT
performs best on average, especially on large vocabularies (books, the
"gutenberg" dataset). It also has much better computational
performance than alternative methods based on regularising the optimal
transport problem directly on words. The hierarchical nature of the
approach thus brings considerable gains in performance, along with
improvements in interpretability.
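
As a rough illustration of this evaluation protocol, HOTT distances
can be plugged directly into a standard $k$-NN classifier through a
precomputed distance matrix; the matrices below are random stand-ins
for distances computed with the ~hott~ sketch above.

#+begin_src python
# k-NN classification on precomputed document distances. The random matrices
# are placeholders for HOTT distances between training/test documents.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
D_train = rng.random((100, 100))          # (n_train, n_train) distances
D_train = (D_train + D_train.T) / 2       # symmetrise
np.fill_diagonal(D_train, 0.0)
D_test = rng.random((20, 100))            # (n_test, n_train) distances
y_train = rng.integers(0, 2, size=100)
y_test = rng.integers(0, 2, size=20)

knn = KNeighborsClassifier(n_neighbors=7, metric="precomputed")
knn.fit(D_train, y_train)
print("accuracy:", knn.score(D_test, y_test))
#+end_src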

What's really interesting in this paper is the sensitivity analysis:
they ran experiments with different word embedding methods (word2vec,
citep:mikolovDistributedRepresentationsWords2013) and with different
parameters for the topic modelling (topic truncation, number of
topics, etc.). All of these show that changes in hyperparameters do
not significantly impact the performance of HOTT. This is extremely
important in a field like NLP, where small variations in approach
often lead to drastically different results.
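
This kind of sensitivity check boils down to a simple sweep over
hyperparameters, re-running the same evaluation while varying, for
instance, the number of topics; ~evaluate_hott_knn~ is a hypothetical
wrapper around the two sketches above.

#+begin_src python
# Hypothetical sensitivity sweep: evaluate_hott_knn would fit LDA with the
# given number of topics, compute HOTT distance matrices, and return the
# k-NN test accuracy.
for n_topics in (20, 50, 70, 100):
    acc = evaluate_hott_knn(corpus, labels, n_topics=n_topics)
    print(f"{n_topics:>3} topics -> accuracy {acc:.3f}")
#+end_src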
* Conclusion
All in all, this paper presents a very interesting approach to
computing distances between natural-language documents. It is no
secret that I like methods with a strong theoretical background (in
this case optimisation and optimal transport), which guarantees
stability and benefits from decades of research in a well-established
domain.

Most importantly, this paper opens the door to further exploration of
document representation with /interpretability/ in mind. This is often
an afterthought in academic research, but it is one of the most
important topics for industry: a system must be understood by its end
users, who are often not trained in ML, before it can be deployed. The
notion of topics, and of distances as weights, can be understood
easily by anyone without a significant background in ML or maths.

Finally, I feel they did not stop at a simple theoretical argument,
but carefully validated their approach on real-world datasets,
measuring its sensitivity to all the arbitrary choices they had to
make. Again, from an industry perspective, this makes it possible to
adopt the new approach quickly and easily, confident that it won't
break unexpectedly, without first having to run extensive tests.
* References