Upgrade toolchain

Dimitri Lozeve 2020-08-27 15:01:49 +02:00
parent 0b8247cf0d
commit 5719104fd1
33 changed files with 1326 additions and 1061 deletions

@@ -16,14 +16,18 @@
<link rel="alternate" type="application/rss+xml" title="Dimitri Lozeve's blog" href="../rss.xml" />
<!-- KaTeX CSS styles -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.css" integrity="sha384-BdGj8xC2eZkQaxoQ8nSLefg4AV4/AwB3Fj+8SUSo7pnKP6Eoy18liIKTPn9oBYNG" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.12.0/dist/katex.min.css" integrity="sha384-AfEj0r4/OFrOo5t7NnNe46zW/tFgW6x/bCJG8FqQCEo3+Aro6EYUG4+cU+KJWu/X" crossorigin="anonymous">
<!-- The loading of KaTeX is deferred to speed up page rendering -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.js" integrity="sha384-JiKN5O8x9Hhs/UE5cT5AAJqieYlOZbGT3CHws/y97o3ty4R7/O5poG9F3JoiOYw1" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.12.0/dist/katex.min.js" integrity="sha384-g7c+Jr9ZivxKLnZTDUhnkOnsh30B4H0rpLUpJ4jAIKs4fnJI+sEnkvrMWph2EDg4" crossorigin="anonymous"></script>
<!-- To automatically render math in text elements, include the auto-render extension: -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.12.0/dist/contrib/auto-render.min.js" integrity="sha384-mll67QQFJfxn0IYznZYonOWZ644AWYC+Pt2cHqMaRhXVrursRwvLnLaebdGIlYNa" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>
<!-- <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script> -->
<!-- <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script> -->
</head>
<body>
<article>
@@ -44,7 +48,6 @@
</header>
</article>
<article>
@@ -68,20 +71,36 @@
<p>A <em>Markov Decision Process</em> is a tuple <span class="math inline">\((\mathcal{S}, \mathcal{A},
\mathcal{R}, p)\)</span> where:</p>
<ul>
<li><span class="math inline">\(\mathcal{S}\)</span> is a set of <em>states</em>,</li>
<li><span class="math inline">\(\mathcal{A}\)</span> is an application mapping each state <span class="math inline">\(s \in
\mathcal{S}\)</span> to a set <span class="math inline">\(\mathcal{A}(s)\)</span> of possible <em>actions</em> for this state. In this post, we will often simplify by using <span class="math inline">\(\mathcal{A}\)</span> as a set, assuming that all actions are possible for each state,</li>
<li><span class="math inline">\(\mathcal{R} \subset \mathbb{R}\)</span> is a set of <em>rewards</em>,</li>
<li><p><span class="math inline">\(\mathcal{S}\)</span> is a set of <em>states</em>,</p></li>
<li><p><span class="math inline">\(\mathcal{A}\)</span> is an application mapping each state <span class="math inline">\(s \in
\mathcal{S}\)</span> to a set <span class="math inline">\(\mathcal{A}(s)\)</span> of possible <em>actions</em> for this state. In this post, we will often simplify by using <span class="math inline">\(\mathcal{A}\)</span> as a set, assuming that all actions are possible for each state,</p></li>
<li><p><span class="math inline">\(\mathcal{R} \subset \mathbb{R}\)</span> is a set of <em>rewards</em>,</p></li>
<li><p>and <span class="math inline">\(p\)</span> is a function representing the <em>dynamics</em> of the MDP:</p>
<span class="math display">\[\begin{align}
p &amp;: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
p(s', r \;|\; s, a) &amp;:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
\end{align}
\]</span>
<p>such that <span class="math display">\[ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. \]</span></p></li>
</ul>
</div>
<p>The function <span class="math inline">\(p\)</span> represents the probability of transitioning to state <span class="math inline">\(s'\)</span> and receiving reward <span class="math inline">\(r\)</span> when the agent is in state <span class="math inline">\(s\)</span> and chooses action <span class="math inline">\(a\)</span>.</p>
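<p>To make this definition more concrete, here is a minimal sketch (my own illustration of a hypothetical two-state MDP with a single action, not taken from any real environment) of how the dynamics <span class="math inline">\(p\)</span> can be written down as a table and checked against the normalization condition above:</p>
<pre><code>import math

# A minimal sketch of the dynamics p(s', r | s, a) for a hypothetical
# two-state MDP ("s1", "s2") with a single action "a".
p = {
    ("s1", "a"): {("s2", 1.0): 0.8,   # move to s2 and receive reward 1
                  ("s1", 0.0): 0.2},  # stay in s1 and receive reward 0
    ("s2", "a"): {("s1", 0.0): 1.0},  # always go back to s1, with reward 0
}

# Normalization: for every (s, a), the probabilities over (s', r) sum to 1.
for (s, a), outcomes in p.items():
    assert math.isclose(sum(outcomes.values()), 1.0)</code></pre>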
<p>We will also occasionally use the <em>state-transition probabilities</em>:</p>
<span class="math display">\[\begin{align}
p &amp;: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
p(s' \;|\; s, a) &amp;:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
&amp;= \sum_r p(s', r \;|\; s, a).
\end{align}
\]</span>
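<p>For instance, with the hypothetical two-state dynamics sketched above, marginalizing over the two possible rewards gives <span class="math display">\[ p(s_2 \;|\; s_1, a) = \sum_r p(s_2, r \;|\; s_1, a) = 0.8 + 0 = 0.8. \]</span></p>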
<h3 id="rewarding-the-agent">Rewarding the agent</h3>
<div class="definition">
<p>The <em>expected reward</em> of a state-action pair is the function</p>
<span class="math display">\[\begin{align}
r &amp;: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
r(s,a) &amp;:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
&amp;= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
\end{align}
\]</span>
</div>
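<p>Continuing the same hypothetical two-state example, the expected reward of the pair <span class="math inline">\((s_1, a)\)</span> would be <span class="math display">\[ r(s_1, a) = 1 \times 0.8 + 0 \times 0.2 = 0.8. \]</span></p>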
<div class="definition">
<p>The <em>discounted return</em> is the sum of all future rewards, with a multiplicative factor giving more weight to more immediate rewards: <span class="math display">\[ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, \]</span> where <span class="math inline">\(T\)</span> can be infinite or <span class="math inline">\(\gamma\)</span> can be 1, but not both.</p>
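<p>As a quick illustration with made-up numbers: if <span class="math inline">\(\gamma = 0.9\)</span> and the agent collects the rewards <span class="math inline">\(R_{t+1} = 1\)</span>, <span class="math inline">\(R_{t+2} = 1\)</span> and <span class="math inline">\(R_{t+3} = 2\)</span> before the episode ends at <span class="math inline">\(T = t+3\)</span>, then <span class="math display">\[ G_t = 1 + 0.9 \times 1 + 0.9^2 \times 2 = 3.52. \]</span></p>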
@@ -91,14 +110,33 @@
<p>A <em>policy</em> is a way for the agent to choose the next action to perform.</p>
<div class="definition">
<p>A <em>policy</em> is a function <span class="math inline">\(\pi\)</span> defined as</p>
<span class="math display">\[\begin{align}
\pi &amp;: \mathcal{A} \times \mathcal{S} \mapsto [0,1] \\
\pi(a \;|\; s) &amp;:= \mathbb{P}(A_t=a \;|\; S_t=s).
\end{align}
\]</span>
</div>
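<p>A simple example of such a function (not specific to any particular problem) is the uniform random policy, which picks every available action with equal probability: <span class="math display">\[ \pi(a \;|\; s) = \frac{1}{|\mathcal{A}(s)|} \quad \text{for all } a \in \mathcal{A}(s). \]</span></p>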
<p>In order to compare policies, we need to associate values with them.</p>
<div class="definition">
<p>The <em>state-value function</em> of a policy <span class="math inline">\(\pi\)</span> is</p>
<span class="math display">\[\begin{align}
v_{\pi} &amp;: \mathcal{S} \mapsto \mathbb{R} \\
v_{\pi}(s) &amp;:= \text{expected return when starting in $s$ and following $\pi$} \\
v_{\pi}(s) &amp;:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s\right] \\
v_{\pi}(s) &amp;= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s\right]
\end{align}
\]</span>
</div>
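<p>As a sanity check on this definition, with made-up numbers: if following <span class="math inline">\(\pi\)</span> yields a constant reward of 1 at every step and <span class="math inline">\(\gamma = 0.9\)</span>, the value of every state is the geometric series <span class="math display">\[ v_{\pi}(s) = \sum_{k=0}^{\infty} 0.9^k = \frac{1}{1 - 0.9} = 10. \]</span></p>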
<p>We can also compute the value starting from a state <span class="math inline">\(s\)</span> by additionally taking into account the action <span class="math inline">\(a\)</span> taken first.</p>
<div class="definition">
<p>The <em>action-value function</em> of a policy <span class="math inline">\(\pi\)</span> is</p>
<span class="math display">\[\begin{align}
q_{\pi} &amp;: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
q_{\pi}(s,a) &amp;:= \text{expected return when starting from $s$, taking action $a$, and following $\pi$} \\
q_{\pi}(s,a) &amp;:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s, A_t=a \right] \\
q_{\pi}(s,a) &amp;= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s, A_t=a\right]
\end{align}
\]</span>
</div>
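<p>Although it is not stated explicitly here, these two functions are directly related: averaging the action-value over the actions selected by the policy recovers the state-value, <span class="math display">\[ v_{\pi}(s) = \sum_{a} \pi(a \;|\; s)\, q_{\pi}(s, a). \]</span></p>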
<h3 id="the-quest-for-the-optimal-policy">The quest for the optimal policy</h3>
<h2 id="references">References</h2>