Add ICLR 2020 Notes

Dimitri Lozeve 2020-05-05 09:55:33 +02:00
parent d084ba876d
commit dddc5f1c39
6 changed files with 545 additions and 2 deletions


@@ -47,6 +47,10 @@
Here you can find all my previous posts:
<ul>
<li>
<a href="./posts/iclr-2020-notes.html">ICLR 2020 Notes</a> - May 5, 2020
</li>
<li>
<a href="./posts/hierarchical-optimal-transport-for-document-classification.html">Reading notes: Hierarchical Optimal Transport for Document Representation</a> - April 5, 2020
</li>


@@ -8,7 +8,116 @@
<name>Dimitri Lozeve</name>
<email>dimitri+web@lozeve.com</email>
</author>
<updated>2020-05-05T00:00:00Z</updated>
<entry>
<title>ICLR 2020 Notes</title>
<link href="https://www.lozeve.com/posts/iclr-2020-notes.html" />
<id>https://www.lozeve.com/posts/iclr-2020-notes.html</id>
<published>2020-05-05T00:00:00Z</published>
<updated>2020-05-05T00:00:00Z</updated>
<summary type="html"><![CDATA[<article>
<section class="header">
</section>
<section>
<p>ICLR is one of the most important conferences in machine learning, and as such, I was very excited to have the opportunity to volunteer and attend the first fully-virtual edition of the event. The whole content of the conference has been made <a href="https://iclr.cc/virtual_2020/index.html">publicly available</a>, only a few days after the end of the event!</p>
<p>I would like to thank the <a href="https://iclr.cc/Conferences/2020/Committees">organizing committee</a> for this incredible event, and for the chance to volunteer and help other participants<span><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle"/><span class="sidenote">To better organize the event, and help people navigate the various online tools, they brought in 500(!) volunteers, waived our registration fees, and asked us to do simple load-testing and tech support. This was a very generous offer, and felt very rewarding for us, as we could attend the conference, and give back to the organization a little bit.<br />
<br />
</span></span>.</p>
<p>The many volunteers, the online-only nature of the event, and the low registration fees also allowed for what felt like a very diverse, inclusive event. Many graduate students and researchers from industry (like me), who do not generally have the time or the resources to travel to conferences like this, were able to attend, making the exchanges richer.</p>
<p>In this post, I will try to give my impressions on the event, and share the most interesting events and papers I saw.</p>
<h1 id="the-format-of-the-virtual-conference">The Format of the Virtual Conference</h1>
<p>As a result of global travel restrictions, the conference was made fully-virtual. It was supposed to take place in Addis Ababa, Ethiopia, a great choice for people who are often the target of restrictive visa policies in North American countries.</p>
<p>The thing I appreciated most about the conference format was its emphasis on <em>asynchronous</em> communication. Given how little time they had to plan the conference, they could have made all poster presentations via video-conference and called it a day. Instead, the authors of each poster had to record a 5-minute video summarising their research. Alongside each presentation, there was a dedicated Rocket.Chat channel<span><label for="sn-2" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-2" class="margin-toggle"/><span class="sidenote"><a href="https://rocket.chat/">Rocket.Chat</a> seems to be an <a href="https://github.com/RocketChat/Rocket.Chat">open-source</a> alternative to Slack. Overall, the experience was great, and I appreciate the efforts of the organizers to use open source software instead of proprietary applications. I hope other conferences will do the same, and perhaps even avoid Zoom, because of recent privacy concerns (maybe try <a href="https://jitsi.org/">Jitsi</a>?).<br />
<br />
</span></span> where anyone could ask a question to the authors, or just show their appreciation for the work. This was a fantastic idea as it allowed any participant to interact with papers and authors at any time they please, which is especially important in a setting where people were spread all over the globe.</p>
<p>There were also Zoom sessions where authors were available for direct, face-to-face discussions, allowing for more traditional conversations. But asking questions on the channel also had the advantage of keeping track of all the questions asked by other people. As such, I quickly acquired the habit of watching the video, looking at the chat to see the previous discussions (even if they happened in the middle of the night in my timezone!), and then skimming the paper or asking questions myself.</p>
<p>All of these excellent ideas were implemented by an <a href="https://iclr.cc/virtual_2020/papers.html?filter=keywords">amazing website</a>, collecting all papers in a searchable, easy-to-use interface, and even a nice <a href="https://iclr.cc/virtual_2020/paper_vis.html">visualisation</a> of papers as a point cloud!</p>
<h1 id="speakers">Speakers</h1>
<p>Overall, there were 8 speakers (two for each day of the main conference). Each gave a 40-minute presentation, followed by a Q&amp;A both via the chat and via Zoom. I only saw 4 of them, but I expect I will be watching the others in the near future.</p>
<h2 id="prof.-leslie-kaelbling-doing-for-our-robots-what-nature-did-for-us">Prof. Leslie Kaelbling, <a href="https://iclr.cc/virtual_2020/speaker_2.html">Doing for Our Robots What Nature Did For Us</a></h2>
<p>This talk was fascinating. It is about robotics, and especially how to design the “software” of our robots. We want to program a robot so that it performs as well as possible across all the domains it may encounter.</p>
<h1 id="workshops">Workshops</h1>
<h1 id="some-interesting-papers">Some Interesting Papers</h1>
<h2 id="natural-language-processing">Natural Language Processing</h2>
<h3 id="harwath-et-al.-learning-hierarchical-discrete-linguistic-units-from-visually-grounded-speech">Harwath et al., <a href="https://openreview.net/forum?id=B1elCp4KwH">Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech</a></h3>
<p>Humans can easily deconstruct all the information available in speech (meaning, language, emotion, speaker, etc.). However, this is very hard for machines. This paper explores the capacity of algorithms to reason about all the aspects of the signal, including visual cues.</p>
<p>Their goal is to use spoken captions of images to train a predictive model.</p>
<ul>
<li>speech signal: contains a lot of information (meaning, language, emotion, speaker, environment)
<ul>
<li>difficult to identify these different parts for an algo (although easy for a human)</li>
</ul></li>
<li>dominated by supervised ML
<ul>
<li>automated speech recognition (ASR) = P(text | waveform)</li>
<li>text-to-speech (TTS) = P(waveform | text, speaker)</li>
<li>high sample complexity</li>
<li>bad out-of-domain performance</li>
<li>limited by annotation capability</li>
</ul></li>
<li>human-like learning
<ul>
<li>ability to jointly reason about all the aspects of the signal</li>
<li>rapidly adapt to new speaker or noise conditions</li>
<li>learn new words from a single example</li>
<li>utilize unlabelled multimodal data</li>
</ul></li>
<li>using visual grounding for self-supervision
<ul>
<li>humans can leverage cross-modal correspondences to learn what spoken words represent without requiring any text or symbolic input whatsoever</li>
<li>hypothesis: similar for computer algorithms?</li>
</ul></li>
<li>goal: use spoken captions of images to train a predictive model
<ul>
<li>learn a hierarchical structure of units</li>
<li>learn the corresponding text, but also the transcription of the spoken sounds, at a sub-word level</li>
</ul></li>
<li>prefer models that learn a discrete tokenisation of the speech
<ul>
<li>language has an intrinsically symbolic structure
<ul>
<li>convey meaning with discrete words</li>
<li>words are in turn composed of a finite set of speech sounds (phones)</li>
</ul></li>
<li>model that can discover discrete representations for word and phone-like units
<ul>
<li>more interpretable</li>
<li>able to do few-shot learning (learn a new word-like unit in terms of known phone-like units)</li>
<li>path towards learning compositional structure from continuous signals</li>
</ul></li>
</ul></li>
<li>model for audio-visual grounding:
<ul>
<li>NN for image</li>
<li>NN for raw speech</li>
<li>shared embedding space</li>
<li>semantic supervision</li>
</ul></li>
<li>preliminary studies
<ul>
<li>lower-layer features are correlated with phones</li>
<li>higher-layer features are correlated with words</li>
</ul></li>
<li>add a vector quantizing layer in the speech NN</li>
<li>hierarchy of quantization layers
<ul>
<li>capture phones and words</li>
</ul></li>
<li>do an ABX test to compare performance to speech-only models</li>
<li>conclusion
<ul>
<li>novel linguistic unit learning paradigm using multimodal data without text</li>
<li>SOTA performance on learning phonetic and word-level units</li>
<li>discovery of discreteness as a good inductive bias for semantic tasks from speech</li>
</ul></li>
</ul>
<h2 id="reinforcement-learning">Reinforcement Learning</h2>
<h2 id="ml-and-neural-network-theory">ML and Neural Network Theory</h2>
</section>
</article>
]]></summary>
</entry>
<entry>
<title>Reading notes: Hierarchical Optimal Transport for Document Representation</title>
<link href="https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html" />


@@ -73,6 +73,10 @@ public key: RWQ6uexORp8f7USHA7nX9lFfltaCA9x6aBV06MvgiGjUt6BVf6McyD26
<ul>
<li>
<a href="./posts/iclr-2020-notes.html">ICLR 2020 Notes</a> - May 5, 2020
</li>
<li>
<a href="./posts/hierarchical-optimal-transport-for-document-classification.html">Reading notes: Hierarchical Optimal Transport for Document Representation</a> - April 5, 2020
</li>


@@ -0,0 +1,155 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=yes">
<meta name="description" content="Dimitri Lozeve's blog: ICLR 2020 Notes">
<title>Dimitri Lozeve - ICLR 2020 Notes</title>
<link rel="stylesheet" href="../css/tufte.css" />
<link rel="stylesheet" href="../css/pandoc.css" />
<link rel="stylesheet" href="../css/default.css" />
<link rel="stylesheet" href="../css/syntax.css" />
<!-- KaTeX CSS styles -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.css" integrity="sha384-BdGj8xC2eZkQaxoQ8nSLefg4AV4/AwB3Fj+8SUSo7pnKP6Eoy18liIKTPn9oBYNG" crossorigin="anonymous">
<!-- The loading of KaTeX is deferred to speed up page rendering -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.js" integrity="sha384-JiKN5O8x9Hhs/UE5cT5AAJqieYlOZbGT3CHws/y97o3ty4R7/O5poG9F3JoiOYw1" crossorigin="anonymous"></script>
<!-- To automatically render math in text elements, include the auto-render extension: -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>
</head>
<body>
<article>
<header>
<nav>
<a href="../">Home</a>
<a href="../projects.html">Projects</a>
<a href="../archive.html">Archive</a>
<a href="../contact.html">Contact</a>
</nav>
<h1 class="title">ICLR 2020 Notes</h1>
<p class="byline">May 5, 2020</p>
</header>
</article>
<article>
<section class="header">
</section>
<section>
<p>ICLR is one of the most important conferences in machine learning, and as such, I was very excited to have the opportunity to volunteer and attend the first fully-virtual edition of the event. The whole content of the conference has been made <a href="https://iclr.cc/virtual_2020/index.html">publicly available</a>, only a few days after the end of the event!</p>
<p>I would like to thank the <a href="https://iclr.cc/Conferences/2020/Committees">organizing committee</a> for this incredible event, and for the chance to volunteer and help other participants<span><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle" /><span class="sidenote">To better organize the event, and help people navigate the various online tools, they brought in 500(!) volunteers, waived our registration fees, and asked us to do simple load-testing and tech support. This was a very generous offer, and felt very rewarding for us, as we could attend the conference, and give back to the organization a little bit.<br />
<br />
</span></span>.</p>
<p>The many volunteers, the online-only nature of the event, and the low registration fees also allowed for what felt like a very diverse, inclusive event. Many graduate students and researchers from industry (like me), who do not generally have the time or the resources to travel to conferences like this, were able to attend, making the exchanges richer.</p>
<p>In this post, I will try to give my impressions on the event, and share the most interesting events and papers I saw.</p>
<h1 id="the-format-of-the-virtual-conference">The Format of the Virtual Conference</h1>
<p>As a result of global travel restrictions, the conference was made fully-virtual. It was supposed to take place in Addis Ababa, Ethiopia, a great choice for people who are often the target of restrictive visa policies in North American countries.</p>
<p>The thing I appreciated most about the conference format was its emphasis on <em>asynchronous</em> communication. Given how little time they had to plan the conference, they could have made all poster presentations via video-conference and called it a day. Instead, the authors of each poster had to record a 5-minute video summarising their research. Alongside each presentation, there was a dedicated Rocket.Chat channel<span><label for="sn-2" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-2" class="margin-toggle" /><span class="sidenote"><a href="https://rocket.chat/">Rocket.Chat</a> seems to be an <a href="https://github.com/RocketChat/Rocket.Chat">open-source</a> alternative to Slack. Overall, the experience was great, and I appreciate the efforts of the organizers to use open source software instead of proprietary applications. I hope other conferences will do the same, and perhaps even avoid Zoom, because of recent privacy concerns (maybe try <a href="https://jitsi.org/">Jitsi</a>?).<br />
<br />
</span></span> where anyone could ask a question to the authors, or just show their appreciation for the work. This was a fantastic idea as it allowed any participant to interact with papers and authors at any time they please, which is especially important in a setting where people were spread all over the globe.</p>
<p>There were also Zoom sessions where authors were available for direct, face-to-face discussions, allowing for more traditional conversations. But asking questions on the channel also had the advantage of keeping track of all the questions asked by other people. As such, I quickly acquired the habit of watching the video, looking at the chat to see the previous discussions (even if they happened in the middle of the night in my timezone!), and then skimming the paper or asking questions myself.</p>
<p>All of these excellent ideas were implemented by an <a href="https://iclr.cc/virtual_2020/papers.html?filter=keywords">amazing website</a>, collecting all papers in a searchable, easy-to-use interface, and even a nice <a href="https://iclr.cc/virtual_2020/paper_vis.html">visualisation</a> of papers as a point cloud!</p>
<h1 id="speakers">Speakers</h1>
<p>Overall, there were 8 speakers (two for each day of the main conference). Each gave a 40-minute presentation, followed by a Q&amp;A both via the chat and via Zoom. I only saw 4 of them, but I expect I will be watching the others in the near future.</p>
<h2 id="prof.-leslie-kaelbling-doing-for-our-robots-what-nature-did-for-us">Prof. Leslie Kaelbling, <a href="https://iclr.cc/virtual_2020/speaker_2.html">Doing for Our Robots What Nature Did For Us</a></h2>
<p>This talk was fascinating. It is about robotics, and especially how to design the “software” of our robots. We want to program a robot so that it performs as well as possible across all the domains it may encounter.</p>
<h1 id="workshops">Workshops</h1>
<h1 id="some-interesting-papers">Some Interesting Papers</h1>
<h2 id="natural-language-processing">Natural Language Processing</h2>
<h3 id="harwath-et-al.-learning-hierarchical-discrete-linguistic-units-from-visually-grounded-speech">Harwath et al., <a href="https://openreview.net/forum?id=B1elCp4KwH">Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech</a></h3>
<p>Humans can easily deconstruct all the information available in speech (meaning, language, emotion, speaker, etc.). However, this is very hard for machines. This paper explores the capacity of algorithms to reason about all the aspects of the signal, including visual cues.</p>
<p>Their goal is to use spoken captions of images to train a predictive model.</p>
<ul>
<li>speech signal: contains a lot of information (meaning, language, emotion, speaker, environment)
<ul>
<li>difficult to identify these different parts for an algo (although easy for a human)</li>
</ul></li>
<li>dominated by supervised ML
<ul>
<li>automated speech recognition (ASR) = P(text | waveform)</li>
<li>text-to-speech (TTS) = P(waveform | text, speaker)</li>
<li>high sample complexity</li>
<li>bad out-of-domain performance</li>
<li>limited by annotation capability</li>
</ul></li>
<li>human-like learning
<ul>
<li>ability to jointly reason about all the aspects of the signal</li>
<li>rapidly adapt to new speaker or noise conditions</li>
<li>learn new words from a single example</li>
<li>utilize unlabelled multimodal data</li>
</ul></li>
<li>using visual grounding for self-supervision
<ul>
<li>humans can leverage cross-modal correspondences to learn what spoken words represent without requiring any text or symbolic input whatsoever</li>
<li>hypothesis: similar for computer algorithms?</li>
</ul></li>
<li>goal: use spoken captions of images to train a predictive model
<ul>
<li>learn a hierarchical structure of units</li>
<li>learn the corresponding text, but also the transcription of the spoken sounds, at a sub-word level</li>
</ul></li>
<li>prefer models that learn a discrete tokenisation of the speech
<ul>
<li>language has an intrinsically symbolic structure
<ul>
<li>convey meaning with discrete words</li>
<li>words are in turn composed of a finite set of speech sounds (phones)</li>
</ul></li>
<li>model that can discover discrete representations for word and phone-like units
<ul>
<li>more interpretable</li>
<li>able to do few-shot learning (learn a new word-like unit in terms of known phone-like units)</li>
<li>path towards learning compositional structure from continuous signals</li>
</ul></li>
</ul></li>
<li>model for audio-visual grounding:
<ul>
<li>NN for image</li>
<li>NN for raw speech</li>
<li>shared embedding space</li>
<li>semantic supervision</li>
</ul></li>
<li>preliminary studies
<ul>
<li>lower-layer features are correlated with phones</li>
<li>higher-layer features are correlated with words</li>
</ul></li>
<li>add a vector quantizing layer in the speech NN</li>
<li>hierarchy of quantization layers
<ul>
<li>capture phones and words</li>
</ul></li>
<li>do an ABX test to compare performance to speech-only models</li>
<li>conclusion
<ul>
<li>novel linguistic unit learning paradigm using multimodal data without text</li>
<li>SOTA performance on learning phonetic and word-level units</li>
<li>discovery of discreteness as a good inductive bias for semantic tasks from speech</li>
</ul></li>
</ul>
<h2 id="reinforcement-learning">Reinforcement Learning</h2>
<h2 id="ml-and-neural-network-theory">ML and Neural Network Theory</h2>
</section>
</article>
<footer>
Site proudly generated by
<a href="http://jaspervdj.be/hakyll">Hakyll</a>
</footer>
</body>
</html>


@@ -7,7 +7,116 @@
<description><![CDATA[Recent posts]]></description>
<atom:link href="https://www.lozeve.com/rss.xml" rel="self"
type="application/rss+xml" />
<lastBuildDate>Tue, 05 May 2020 00:00:00 UT</lastBuildDate>
<item>
<title>ICLR 2020 Notes</title>
<link>https://www.lozeve.com/posts/iclr-2020-notes.html</link>
<description><![CDATA[<article>
<section class="header">
</section>
<section>
<p>ICLR is one of the most important conferences in machine learning, and as such, I was very excited to have the opportunity to volunteer and attend the first fully-virtual edition of the event. The whole content of the conference has been made <a href="https://iclr.cc/virtual_2020/index.html">publicly available</a>, only a few days after the end of the event!</p>
<p>I would like to thank the <a href="https://iclr.cc/Conferences/2020/Committees">organizing committee</a> for this incredible event, and for the chance to volunteer and help other participants<span><label for="sn-1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-1" class="margin-toggle"/><span class="sidenote">To better organize the event, and help people navigate the various online tools, they brought in 500(!) volunteers, waived our registration fees, and asked us to do simple load-testing and tech support. This was a very generous offer, and felt very rewarding for us, as we could attend the conference, and give back to the organization a little bit.<br />
<br />
</span></span>.</p>
<p>The many volunteers, the online-only nature of the event, and the low registration fees also allowed for what felt like a very diverse, inclusive event. Many graduate students and researchers from industry (like me), who do not generally have the time or the resources to travel to conferences like this, were able to attend, making the exchanges richer.</p>
<p>In this post, I will try to give my impressions on the event, and share the most interesting events and papers I saw.</p>
<h1 id="the-format-of-the-virtual-conference">The Format of the Virtual Conference</h1>
<p>As a result of global travel restrictions, the conference was made fully-virtual. It was supposed to take place in Addis Ababa, Ethiopia, a great choice for people who are often the target of restrictive visa policies in North American countries.</p>
<p>The thing I appreciated most about the conference format was its emphasis on <em>asynchronous</em> communication. Given how little time they had to plan the conference, they could have made all poster presentations via video-conference and called it a day. Instead, the authors of each poster had to record a 5-minute video summarising their research. Alongside each presentation, there was a dedicated Rocket.Chat channel<span><label for="sn-2" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sn-2" class="margin-toggle"/><span class="sidenote"><a href="https://rocket.chat/">Rocket.Chat</a> seems to be an <a href="https://github.com/RocketChat/Rocket.Chat">open-source</a> alternative to Slack. Overall, the experience was great, and I appreciate the efforts of the organizers to use open source software instead of proprietary applications. I hope other conferences will do the same, and perhaps even avoid Zoom, because of recent privacy concerns (maybe try <a href="https://jitsi.org/">Jitsi</a>?).<br />
<br />
</span></span> where anyone could ask a question to the authors, or just show their appreciation for the work. This was a fantastic idea as it allowed any participant to interact with papers and authors at any time they please, which is especially important in a setting where people were spread all over the globe.</p>
<p>There were also Zoom sessions where authors were available for direct, face-to-face discussions, allowing for more traditional conversations. But asking questions on the channel also had the advantage of keeping track of all the questions asked by other people. As such, I quickly acquired the habit of watching the video, looking at the chat to see the previous discussions (even if they happened in the middle of the night in my timezone!), and then skimming the paper or asking questions myself.</p>
<p>All of these excellent ideas were implemented by an <a href="https://iclr.cc/virtual_2020/papers.html?filter=keywords">amazing website</a>, collecting all papers in a searchable, easy-to-use interface, and even a nice <a href="https://iclr.cc/virtual_2020/paper_vis.html">visualisation</a> of papers as a point cloud!</p>
<h1 id="speakers">Speakers</h1>
<p>Overall, there were 8 speakers (two for each day of the main conference). Each gave a 40-minute presentation, followed by a Q&amp;A both via the chat and via Zoom. I only saw 4 of them, but I expect I will be watching the others in the near future.</p>
<h2 id="prof.-leslie-kaelbling-doing-for-our-robots-what-nature-did-for-us">Prof. Leslie Kaelbling, <a href="https://iclr.cc/virtual_2020/speaker_2.html">Doing for Our Robots What Nature Did For Us</a></h2>
<p>This talk was fascinating. It is about robotics, and especially how to design the “software” of our robots. We want to program a robot so that it performs as well as possible across all the domains it may encounter.</p>
<h1 id="workshops">Workshops</h1>
<h1 id="some-interesting-papers">Some Interesting Papers</h1>
<h2 id="natural-language-processing">Natural Language Processing</h2>
<h3 id="harwath-et-al.-learning-hierarchical-discrete-linguistic-units-from-visually-grounded-speech">Harwath et al., <a href="https://openreview.net/forum?id=B1elCp4KwH">Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech</a></h3>
<p>Humans can easily deconstruct all the information available in speech (meaning, language, emotion, speaker, etc.). However, this is very hard for machines. This paper explores the capacity of algorithms to reason about all the aspects of the signal, including visual cues.</p>
<p>Their goal is to use spoken captions of images to train a predictive model.</p>
<ul>
<li>speech signal: contains a lot of information (meaning, language, emotion, speaker, environment)
<ul>
<li>difficult to identify these different parts for an algo (although easy for a human)</li>
</ul></li>
<li>dominated by supervised ML
<ul>
<li>automated speech recognition (ASR) = P(text | waveform)</li>
<li>text-to-speech (TTS) = P(waveform | text, speaker)</li>
<li>high sample complexity</li>
<li>bad out-of-domain performance</li>
<li>limited by annotation capability</li>
</ul></li>
<li>human-like learning
<ul>
<li>ability to jointly reason about all the aspects of the signal</li>
<li>rapidly adapt to new speaker or noise conditions</li>
<li>learn new words from a single example</li>
<li>utilize unlabelled multimodal data</li>
</ul></li>
<li>using visual grounding for self-supervision
<ul>
<li>humans can leverage cross-modal correspondences to learn what spoken words represent without requiring any text or symbolic input whatsoever</li>
<li>hypothesis: similar for computer algorithms?</li>
</ul></li>
<li>goal: use spoken captions of images to train a predictive model
<ul>
<li>learn a hierarchical structure of units</li>
<li>learn the corresponding text, but also the transcription of the spoken sounds, at a sub-word level</li>
</ul></li>
<li>prefer models that learn a discrete tokenisation of the speech
<ul>
<li>language has an intrinsically symbolic structure
<ul>
<li>convey meaning with discrete words</li>
<li>words are in turn composed of a finite set of speech sounds (phones)</li>
</ul></li>
<li>model that can discover discrete representations for word and phone-like units
<ul>
<li>more interpretable</li>
<li>able to do few-shot learning (learn a new word-like unit in terms of known phone-like units)</li>
<li>path towards learning compositional structure from continuous signals</li>
</ul></li>
</ul></li>
<li>model for audio-visual grounding:
<ul>
<li>NN for image</li>
<li>NN for raw speech</li>
<li>shared embedding space</li>
<li>semantic supervision</li>
</ul></li>
<li>preliminary studies
<ul>
<li>lower-layer features are correlated with phones</li>
<li>higher-layer features are correlated with words</li>
</ul></li>
<li>add a vector quantizing layer in the speech NN</li>
<li>hierarchy of quantization layers
<ul>
<li>capture phones and words</li>
</ul></li>
<li>do an ABX test to compare performance to speech-only models</li>
<li>conclusion
<ul>
<li>novel linguistic unit learning paradigm using multimodal data without text</li>
<li>SOTA performance on learning phonetic and word-level units</li>
<li>discovery of discreteness as a good inductive bias for semantic tasks from speech</li>
</ul></li>
</ul>
<h2 id="reinforcement-learning">Reinforcement Learning</h2>
<h2 id="ml-and-neural-network-theory">ML and Neural Network Theory</h2>
</section>
</article>
]]></description>
<pubDate>Tue, 05 May 2020 00:00:00 UT</pubDate>
<guid>https://www.lozeve.com/posts/iclr-2020-notes.html</guid>
<dc:creator>Dimitri Lozeve</dc:creator>
</item>
<item>
<title>Reading notes: Hierarchical Optimal Transport for Document Representation</title>
<link>https://www.lozeve.com/posts/hierarchical-optimal-transport-for-document-classification.html</link>

posts/iclr-2020-notes.org (new file)

@@ -0,0 +1,162 @@
---
title: "ICLR 2020 Notes"
date: 2020-05-05
---
ICLR is one of the most important conferences in machine learning, and
as such, I was very excited to have the opportunity to volunteer and
attend the first fully-virtual edition of the event. The whole content
of the conference has been made [[https://iclr.cc/virtual_2020/index.html][publicly available]], only a few days
after the end of the event!
I would like to thank the [[https://iclr.cc/Conferences/2020/Committees][organizing committee]] for this incredible
event, and for the chance to volunteer and help other
participants[fn:volunteer].
The many volunteers, the online-only nature of the event, and the low
registration fees also allowed for what felt like a very diverse,
inclusive event. Many graduate students and researchers from industry
(like me), who do not generally have the time or the resources to
travel to conferences like this, were able to attend, making the
exchanges richer.
In this post, I will try to give my impressions on the event, and
share the most interesting events and papers I saw.
[fn:volunteer] To better organize the event, and help people navigate
the various online tools, they brought in 500(!) volunteers, waived our
registration fees, and asked us to do simple load-testing and tech
support. This was a very generous offer, and felt very rewarding for
us, as we could attend the conference, and give back to the
organization a little bit.
* The Format of the Virtual Conference
As a result of global travel restrictions, the conference was made
fully-virtual. It was supposed to take place in Addis Ababa, Ethiopia,
a great choice for people who are often the target of restrictive visa
policies in North American countries.
The thing I appreciated most about the conference format was its
emphasis on /asynchronous/ communication. Given how little time they
had to plan the conference, they could have made all poster
presentations via video-conference and called it a day. Instead, the
authors of each poster had to record a 5-minute video summarising their
research. Alongside each presentation, there was a dedicated
Rocket.Chat channel[fn:rocketchat] where anyone could ask a question
to the authors, or just show their appreciation for the work. This was
a fantastic idea as it allowed any participant to interact with papers
and authors at any time they please, which is especially important in
a setting where people were spread all over the globe.
There were also Zoom sessions where authors were available for direct,
face-to-face discussions, allowing for more traditional
conversations. But asking questions on the channel also had the
advantage of keeping track of all the questions asked by other
people. As such, I quickly acquired the habit of watching the video,
looking at the chat to see the previous discussions (even if they
happened in the middle of the night in my timezone!), and then
skimming the paper or asking questions myself.
All of these excellent ideas were implemented by an [[https://iclr.cc/virtual_2020/papers.html?filter=keywords][amazing website]],
collecting all papers in a searchable, easy-to-use interface, and even
a nice [[https://iclr.cc/virtual_2020/paper_vis.html][visualisation]] of papers as a point cloud!
[fn:rocketchat] [[https://rocket.chat/][Rocket.Chat]] seems to be an [[https://github.com/RocketChat/Rocket.Chat][open-source]] alternative to
Slack. Overall, the experience was great, and I appreciate the efforts
of the organizers to use open source software instead of proprietary
applications. I hope other conferences will do the same, and perhaps
even avoid Zoom, because of recent privacy concerns (maybe try
[[https://jitsi.org/][Jitsi]]?).
* Speakers
Overall, there were 8 speakers (two for each day of the main
conference). Each gave a 40-minute presentation, followed by a
Q&A both via the chat and via Zoom. I only saw 4 of them, but I expect
I will be watching the others in the near future.
** Prof. Leslie Kaelbling, [[https://iclr.cc/virtual_2020/speaker_2.html][Doing for Our Robots What Nature Did For Us]]
This talk was fascinating. It is about robotics, and especially how to
design the "software" of our robots. We want to program a robot so
that it performs as well as possible across all the domains it may
encounter.
* Workshops
* Some Interesting Papers
** Natural Language Processing
*** Harwath et al., [[https://openreview.net/forum?id=B1elCp4KwH][Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech]]
Humans can easily deconstruct all the information available in speech
(meaning, language, emotion, speaker, etc.). However, this is very
hard for machines. This paper explores the capacity of algorithms to
reason about all the aspects of the signal, including visual cues.
Their goal is to use spoken captions of images to train a predictive
model.
- speech signal: contains a lot of information (meaning, language,
emotion, speaker, environment)
- difficult to identify these different parts for an algo (although
easy for a human)
- dominated by supervised ML
- automated speech recognition (ASR) = P(text | waveform)
- text-to-speech (TTS) = P(waveform | text, speaker)
- high sample complexity
- bad out-of-domain performance
- limited by annotation capability
- human-like learning
- ability to jointly reason about all the aspects of the signal
- rapidly adapt to new speaker or noise conditions
- learn new words from a single example
- utilize unlabelled multimodal data
- using visual grounding for self-supervision
- humans can leverage cross-modal correspondences to learn what
spoken words represent without requiring any text or symbolic
input whatsoever
- hypothesis: similar for computer algorithms?
- goal: use spoken captions of images to train a predictive model
- learn a hierarchical structure of units
- learn the corresponding text, but also the transcription of the
spoken sounds, at a sub-word level
- prefer models that learn a discrete tokenisation of the speech
- language has an intrinsically symbolic structure
- convey meaning with discrete words
- words are in turn composed of a finite set of speech sounds
(phones)
- model that can discover discrete representations for word and
phone-like units
- more interpretable
- able to do few-shot learning (learn a new word-like unit in
terms of known phone-like units)
- path towards learning compositional structure from continuous
signals
- model for audio-visual grounding (a toy sketch follows this list):
- NN for image
- NN for raw speech
- shared embedding space
- semantic supervision
- preliminary studies
- lower-layer features are correlated with phones
- higher-layer features are correlated with words
- add a vector quantizing layer in the speech NN
- hierarchy of quantization layers
- capture phones and words
- do an ABX test to compare performance to speech-only models
- conclusion
- novel linguistic unit learning paradigm using multimodal data
without text
- SOTA performance on learning phonetic and word-level units
- discovery of discreteness as a good inductive bias for semantic
tasks from speech
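
To make the audio-visual grounding recipe above a bit more concrete,
here is a minimal PyTorch sketch of the general idea: an image encoder
and a speech encoder mapped into a shared embedding space, a
VQ-VAE-style quantization layer on the speech branch to obtain
discrete units, and a contrastive loss as the semantic supervision
over matched image/caption pairs. The layer sizes, codebook size and
loss below are my own assumptions for illustration, not the authors'
actual architecture or training setup.

#+begin_src python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """VQ-VAE-style layer: snaps each frame-level feature to its nearest
    codebook entry, yielding a discrete (phone- or word-like) unit."""

    def __init__(self, num_codes=1024, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):  # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)  # (batch*time, num_codes)
        codes = dists.argmin(dim=-1)
        q = self.codebook(codes).view_as(z)
        # codebook + commitment losses, then straight-through estimator
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()
        return q, codes.view(z.shape[:-1]), vq_loss


class AudioVisualGrounding(nn.Module):
    """Two encoders sharing an embedding space; the speech branch is
    quantized so that discrete units can emerge without any text."""

    def __init__(self, dim=256, n_mels=40):
        super().__init__()
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.speech_enc = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.ReLU())
        self.vq = VectorQuantizer(dim=dim)

    def forward(self, images, spectrograms):
        # images: (batch, 3, H, W); spectrograms: (batch, n_mels, time)
        img_emb = F.normalize(self.image_enc(images), dim=-1)
        z = self.speech_enc(spectrograms).transpose(1, 2)  # (batch, time', dim)
        q, codes, vq_loss = self.vq(z)
        speech_emb = F.normalize(q.mean(dim=1), dim=-1)  # pool over time
        # semantic supervision: matched image/caption pairs should be close
        sim = img_emb @ speech_emb.t()  # (batch, batch) similarity matrix
        targets = torch.arange(sim.size(0), device=sim.device)
        loss = F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)
        return loss + vq_loss, codes


model = AudioVisualGrounding()
loss, codes = model(torch.randn(8, 3, 64, 64), torch.randn(8, 40, 200))
print(loss.item(), codes.shape)  # codes: one discrete unit index per frame
#+end_src

In the paper's setting, it is a hierarchy of such quantization layers
at different depths that ends up capturing phone-like units in lower
layers and word-like units in higher layers.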
** Reinforcement Learning
** ML and Neural Network Theory