---
title: "ICLR 2020 Notes"
date: 2020-05-05
---

ICLR is one of the most important conferences in machine learning, and
as such, I was very excited to have the opportunity to volunteer and
attend the first fully-virtual edition of the event. The whole content
of the conference has been made [[https://iclr.cc/virtual_2020/index.html][publicly available]], only a few days
after the end of the event!

I would like to thank the [[https://iclr.cc/Conferences/2020/Committees][organizing committee]] for this incredible
event, and for the chance to volunteer and help other
participants[fn:volunteer].

The many volunteers, the online-only nature of the event, and the low
registration fees also allowed for what felt like a very diverse,
inclusive event. Many graduate students and researchers from industry
(like me), who do not generally have the time or the resources to
travel to conferences like this, were able to attend, making the
exchanges richer.

In this post, I will try to give my impressions of the event, and
share the most interesting sessions and papers I saw.

[fn:volunteer] To better organize the event, and help people navigate
the various online tools, they brought in 500(!) volunteers, waived
our registration fees, and asked us to do simple load-testing and tech
support. This was a very generous offer, and it felt very rewarding:
we could attend the conference and give a little back to the
organization.

* The Format of the Virtual Conference

As a result of global travel restrictions, the conference was made
fully virtual. It was supposed to take place in Addis Ababa, Ethiopia,
which would have been great for people who are often the target of
restrictive visa policies in North American countries.

The thing I appreciated most about the conference format was its
emphasis on /asynchronous/ communication. Given how little time they
had to plan the conference, they could have made all poster
presentations via video-conference and called it a day. Instead, the
authors of each poster had to record a 5-minute video summarising
their research. Alongside each presentation, there was a dedicated
Rocket.Chat channel[fn:rocketchat] where anyone could ask a question
to the authors, or just show their appreciation for the work. This was
a fantastic idea, as it allowed any participant to interact with
papers and authors at any time they pleased, which is especially
important in a setting where people were spread all over the globe.

There were also Zoom sessions where authors were available for direct,
face-to-face discussions, allowing for more traditional
conversations. But asking questions on the channel also had the
advantage of keeping track of all the questions asked by other
people. As such, I quickly acquired the habit of watching the video,
looking at the chat to see the previous discussions (even if they
happened in the middle of the night in my timezone!), and then
skimming the paper or asking questions myself.

All of these excellent ideas were implemented in an [[https://iclr.cc/virtual_2020/papers.html?filter=keywords][amazing website]],
collecting all papers in a searchable, easy-to-use interface, and even
featuring a nice [[https://iclr.cc/virtual_2020/paper_vis.html][visualisation]] of papers as a point cloud!

[fn:rocketchat] [[https://rocket.chat/][Rocket.Chat]] seems to be an [[https://github.com/RocketChat/Rocket.Chat][open-source]] alternative to
Slack. Overall, the experience was great, and I appreciate the efforts
of the organizers to use open-source software instead of proprietary
applications. I hope other conferences will do the same, and perhaps
even avoid Zoom, because of recent privacy concerns (maybe try
[[https://jitsi.org/][Jitsi]]?).

* Speakers

Overall, there were 8 speakers (two for each day of the main
conference). Each gave a 40-minute presentation, followed by a Q&A
both via the chat and via Zoom. I only saw 4 of them, but I expect I
will be watching the others in the near future.

** Prof. Leslie Kaelbling, [[https://iclr.cc/virtual_2020/speaker_2.html][Doing for Our Robots What Nature Did For Us]]

This talk was fascinating. It is about robotics, and especially how to
design the "software" of our robots. We want to program a robot in a
way that it can work as well as possible over all the possible domains
it may encounter.

* Workshops

* Some Interesting Papers

** Natural Language Processing

*** Harwath et al., [[https://openreview.net/forum?id=B1elCp4KwH][Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech]]

Humans can easily deconstruct all the information available in speech
(meaning, language, emotion, speaker, etc.). However, this is very
hard for machines. This paper explores the capacity of algorithms to
reason about all these aspects of the signal, including visual cues.

Their goal is to use spoken captions of images to train a predictive
model, without relying on any text transcription.

- speech signal: contains a lot of information (meaning, language,
  emotion, speaker, environment)
- difficult to identify these different parts for an algorithm
  (although easy for a human)
- dominated by supervised ML
  - automated speech recognition (ASR) = P(text | waveform)
  - text-to-speech (TTS) = P(waveform | text, speaker)
  - high sample complexity
  - bad out-of-domain performance
  - limited by annotation capability
- human-like learning
  - ability to jointly reason about all the aspects of the signal
  - rapidly adapt to new speaker or noise conditions
  - learn new words from a single example
  - utilize unlabelled multimodal data
- using visual grounding for self-supervision
  - humans can leverage cross-modal correspondences to learn what
    spoken words represent without requiring any text or symbolic
    input whatsoever
  - hypothesis: similar for computer algorithms?
- goal: use spoken captions of images to train a predictive model
  - learn a hierarchical structure of units
  - learn the corresponding text, but also the transcription of the
    spoken sounds, at a sub-word level
- prefer models that learn a discrete tokenisation of the speech
  - language has an intrinsically symbolic structure
    - convey meaning with discrete words
    - words are in turn composed of a finite set of speech sounds
      (phones)
  - model that can discover discrete representations for word and
    phone-like units
    - more interpretable
    - able to do few-shot learning (learn a new word-like unit in
      terms of known phone-like units)
    - path towards learning compositional structure from continuous
      signals
- model for audio-visual grounding (see the sketch after this list):
  - NN for image
  - NN for raw speech
  - shared embedding space
  - semantic supervision
- preliminary studies
  - lower-layer features are correlated with phones
  - higher-layer features are correlated with words
- add a vector quantizing layer in the speech NN
  - hierarchy of quantization layers
  - capture phones and words
- do an ABX test to compare performance to speech-only models
- conclusion
  - novel linguistic unit learning paradigm using multimodal data
    without text
  - SOTA performance on learning phonetic and word-level units
  - discovery of discreteness as a good inductive bias for semantic
    tasks from speech

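To make these notes a little more concrete, here is a minimal sketch
of what an audio-visual grounding model with a vector-quantization
layer could look like. This is my own illustration, not the authors'
code: the layer sizes, the mel-filterbank input, the straight-through
quantizer, and the margin ranking loss are all assumptions on my part.

#+begin_src python
# Illustrative sketch only (assumed architecture, not the paper's code):
# two encoders map images and spoken captions into a shared embedding
# space, a VQ layer discretises intermediate speech features into
# phone-like units, and a margin ranking loss provides the semantic
# supervision from matched image/caption pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """VQ-VAE-style quantizer with a straight-through estimator
    (codebook/commitment losses omitted for brevity)."""

    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x):                      # x: (batch, time, dim)
        flat = x.reshape(-1, x.size(-1))
        # Nearest codebook entry for each frame.
        codes = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        quantized = self.codebook(codes).view_as(x)
        # Straight-through: gradients flow back to the encoder features.
        quantized = x + (quantized - x).detach()
        return quantized, codes.view(x.shape[:-1])


class SpeechEncoder(nn.Module):
    """Convolutional encoder over mel filterbanks, with one VQ layer."""

    def __init__(self, n_mels=40, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        self.vq = VectorQuantizer(dim=dim)     # discrete phone-like units
        self.proj = nn.Linear(dim, dim)

    def forward(self, spec):                   # spec: (batch, n_mels, time)
        h = self.conv(spec).transpose(1, 2)    # (batch, time', dim)
        q, codes = self.vq(h)
        emb = self.proj(q.mean(dim=1))         # pool over time
        return F.normalize(emb, dim=-1), codes


class ImageEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, img):                    # img: (batch, 3, H, W)
        return F.normalize(self.net(img), dim=-1)


def ranking_loss(img_emb, speech_emb, margin=0.2):
    """Matched image/caption pairs should score higher than mismatched
    pairs sampled from the same batch."""
    sims = img_emb @ speech_emb.t()            # (batch, batch) similarities
    pos = sims.diag().unsqueeze(1)             # similarity of matched pairs
    cost = (margin + sims - pos).clamp(min=0)
    off_diag = 1.0 - torch.eye(sims.size(0), device=sims.device)
    return (cost * off_diag).mean()


if __name__ == "__main__":
    imgs = torch.randn(4, 3, 64, 64)           # toy image batch
    specs = torch.randn(4, 40, 200)            # toy mel-spectrogram captions
    img_emb = ImageEncoder()(imgs)
    speech_emb, codes = SpeechEncoder()(specs)
    print(ranking_loss(img_emb, speech_emb).item(), codes.shape)
#+end_src

In the paper itself the encoders are much deeper and the speech
network stacks several quantization layers to capture both phone-like
and word-like units; the sketch only shows how the pieces (two
encoders, a shared embedding space, discretisation, and semantic
supervision) fit together.
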
** Reinforcement Learning

** ML and Neural Network Theory