Add experiments and conclusion

This commit is contained in:
Dimitri Lozeve 2020-04-05 15:55:21 +02:00
parent 3524466d4c
commit 044a011a4e
5 changed files with 143 additions and 0 deletions

@ -157,4 +157,56 @@ optimisation algorithm to compute higher-level distances.
#+caption: Representation of two documents in topic space, along with how the distance was computed between them. Everything is interpretable: from the documents as collections of topics, to the matchings between topics determining the overall distance between the books citep:yurochkin2019_hierar_optim_trans_docum_repres.
[[file:/images/hott_fig1.png]]
* Experiments
The paper is very thorough on the experimental side, providing a full
evaluation of the method on one particular application: document
classification. They use [[https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation][Latent Dirichlet Allocation]] to compute topics,
GloVe for pretrained word embeddings
citep:moschitti2014_proceed_confer_empir_method_natur, and [[https://www.gurobi.com/][Gurobi]] to
solve the optimisation problems. Their code is available [[https://github.com/IBM/HOTT][on GitHub]].
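
To make this hierarchical structure concrete, here is a minimal sketch
of how the two-level computation could look in Python. It is /not/ the
authors' implementation: POT's exact solver stands in for Gurobi, the
topic-word weights and document topic proportions are assumed to come
from an LDA model, and the function names (~topic_cost_matrix~,
~hott~) are mine.

#+begin_src python
# Sketch of the hierarchical optimal transport distance (assumptions: POT's
# exact solver replaces Gurobi, topic-word weights come from LDA, word
# embeddings are pretrained GloVe vectors).
import numpy as np
import ot  # Python Optimal Transport


def topic_cost_matrix(topic_word, embeddings, top_n=20):
    """Pairwise W1 distances between topics, restricted to their top_n words.

    topic_word: (n_topics, n_vocab) topic-word weights (e.g. lda.components_).
    embeddings: (n_vocab, dim) pretrained word vectors (e.g. GloVe).
    """
    n_topics = topic_word.shape[0]
    tops = np.argsort(topic_word, axis=1)[:, -top_n:]  # top_n words per topic
    costs = np.zeros((n_topics, n_topics))
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            wi, wj = tops[i], tops[j]
            a = topic_word[i, wi] / topic_word[i, wi].sum()
            b = topic_word[j, wj] / topic_word[j, wj].sum()
            # ground cost: distances between word embeddings
            M = np.linalg.norm(embeddings[wi][:, None] - embeddings[wj][None, :],
                               axis=-1)
            costs[i, j] = costs[j, i] = ot.emd2(a, b, M)
    return costs


def hott(topics_a, topics_b, topic_costs):
    """HOTT distance: optimal transport between two documents' topic
    distributions, with topic-to-topic W1 distances as ground costs."""
    return ot.emd2(topics_a, topics_b, topic_costs)
#+end_src

The document-level inputs would be the topic proportions returned by
~lda.transform~ and the topic-word weights in ~lda.components_~; the
paper additionally truncates negligible topic proportions, which this
sketch omits.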

If you want the details, I encourage you to read the full paper: they
tested the method on a wide variety of datasets, from very short
documents (tweets) to long documents with a large vocabulary
(books). With a simple $k$-NN classification, they establish that HOTT
performs best on average, especially on large vocabularies (books, the
"gutenberg" dataset). It also has much better computational
performance than alternative methods based on regularising the optimal
transport problem directly on words. The hierarchical nature of the
approach thus brings considerable gains in performance, along with
improvements in interpretability.
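
As a rough illustration of this evaluation protocol, HOTT distances
can be plugged directly into a standard $k$-NN classifier through a
precomputed distance matrix; the matrices below are random stand-ins
for distances computed with the ~hott~ sketch above.

#+begin_src python
# k-NN classification on precomputed document distances. The random matrices
# are placeholders for HOTT distances between training/test documents.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
D_train = rng.random((100, 100))          # (n_train, n_train) distances
D_train = (D_train + D_train.T) / 2       # symmetrise
np.fill_diagonal(D_train, 0.0)
D_test = rng.random((20, 100))            # (n_test, n_train) distances
y_train = rng.integers(0, 2, size=100)
y_test = rng.integers(0, 2, size=20)

knn = KNeighborsClassifier(n_neighbors=7, metric="precomputed")
knn.fit(D_train, y_train)
print("accuracy:", knn.score(D_test, y_test))
#+end_src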

What's really interesting in this paper is the sensitivity analysis:
they ran experiments with different word embedding methods (word2vec,
citep:mikolovDistributedRepresentationsWords2013) and with different
parameters for the topic modelling (topic truncation, number of
topics, etc.). All of these show that changes in hyperparameters do
not significantly impact the performance of HOTT. This is extremely
important in a field like NLP, where small variations in approach
often lead to drastically different results.
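
This kind of sensitivity check boils down to a simple sweep over
hyperparameters, re-running the same evaluation while varying, for
instance, the number of topics; ~evaluate_hott_knn~ is a hypothetical
wrapper around the two sketches above.

#+begin_src python
# Hypothetical sensitivity sweep: evaluate_hott_knn would fit LDA with the
# given number of topics, compute HOTT distance matrices, and return the
# k-NN test accuracy.
for n_topics in (20, 50, 70, 100):
    acc = evaluate_hott_knn(corpus, labels, n_topics=n_topics)
    print(f"{n_topics:>3} topics -> accuracy {acc:.3f}")
#+end_src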
* Conclusion
All in all, this paper presents a very interesting approach to
computing distances between natural-language documents. It is no
secret that I like methods with a strong theoretical background (in
this case optimisation and optimal transport), which guarantees
stability and benefits from decades of research in a well-established
domain.

Most importantly, this paper opens the door to further exploration of
document representation with /interpretability/ in mind. This is often
an afterthought in academic research, but it is one of the most
important topics for industry: a system must be understood by its end
users, who are often not trained in ML, before it can be deployed. The
notion of topics, and of distances as weights, can be understood
easily by anyone without a significant background in ML or maths.

Finally, I feel they did not stop at a simple theoretical argument,
but carefully validated their approach on real-world datasets,
measuring its sensitivity to all the arbitrary choices they had to
make. Again, from an industry perspective, this makes it possible to
adopt the new approach quickly and easily, confident that it won't
break unexpectedly, without first having to run extensive tests.
* References