ICLR 2020 Notes

ICLR is one of the most important conferences in machine learning, and as such, I was very excited to have the opportunity to volunteer at and attend the first fully-virtual edition of the event. All of the conference content was made publicly available only a few days after the event ended!

I would like to thank the organizing committee for this incredible event, and for the possibility to volunteer to help other participants. To better organize the event and help people navigate the various online tools, they brought in 500(!) volunteers, waived our registration fees, and asked us to do simple load-testing and tech support. This was a very generous offer, and it felt very rewarding for us, as we could attend the conference and give back to the organization a little bit.

The many volunteers, the online-only nature of the event, and the low registration fees also allowed for what felt like a very diverse, inclusive event. Many graduate students and researchers from industry (like me), who do not generally have the time or the resources to travel to conferences like this, were able to attend, making the exchanges richer.

In this post, I will try to give my impressions of the event, and share the most interesting talks and papers I saw.

The Format of the Virtual Conference

As a result of global travel restrictions, the conference was made fully-virtual. It was originally supposed to take place in Addis Ababa, Ethiopia, which would have been great for people who are often the target of restrictive visa policies in North American countries.

The thing I appreciated most about the conference format was its emphasis on asynchronous communication. Given how little time they had to plan the conference, the organizers could have made all poster presentations happen via video-conference and called it a day. Instead, the authors of each poster had to record a 5-minute video summarising their research. Alongside each presentation, there was a dedicated Rocket.Chat channel where anyone could ask the authors a question, or just show their appreciation for the work. This was a fantastic idea, as it allowed any participant to interact with papers and authors at any time they pleased, which is especially important in a setting where people were spread all over the globe.

(Rocket.Chat seems to be an open-source alternative to Slack. Overall, the experience was great, and I appreciate the efforts of the organizers to use open-source software instead of proprietary applications. I hope other conferences will do the same, and perhaps even avoid Zoom, because of recent privacy concerns. Maybe try Jitsi?)

There were also Zoom sessions where authors were available for direct, face-to-face discussions, allowing for more traditional conversations. But asking questions on the channel also had the advantage of keeping a record of all the questions asked by other people. As such, I quickly acquired the habit of watching the video, looking at the chat to see the previous discussions (even if they happened in the middle of the night in my timezone!), and then skimming the paper or asking questions myself.

All of these excellent ideas were brought together in an amazing website, which collected all the papers in a searchable, easy-to-use interface, and even offered a nice visualisation of the papers as a point cloud!

Speakers

Overall, there were 8 invited speakers (two for each day of the main conference). Each gave a 40-minute presentation, followed by a Q&A both via the chat and via Zoom. I only saw 4 of them, but I expect I will be watching the others in the near future.

Prof. Leslie Kaelbling, Doing for Our Robots What Nature Did For Us

This talk was fascinating. It was about robotics, and especially about how to design the "software" of our robots. We want to program a robot so that it works as well as it can across all the possible domains it may encounter.

Workshops

Some Interesting Papers

Natural Language Processing

Harwath et al., Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

Humans can easily deconstruct all the information available in speech (meaning, language, emotion, speaker identity, etc.). However, this is very hard for machines. This paper explores the capacity of algorithms to reason about all these aspects of the signal, including visual cues.

Their goal is to use spoken captions of images to train a predictive model.

Reinforcement Learning

ML and Neural Network Theory