Upgrade toolchain

Dimitri Lozeve 2020-08-27 15:01:49 +02:00
parent 0b8247cf0d
commit 5719104fd1
33 changed files with 1326 additions and 1061 deletions

@@ -16,14 +16,18 @@
<link rel="alternate" type="application/rss+xml" title="Dimitri Lozeve's blog" href="../rss.xml" />
<!-- KaTeX CSS styles -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.css" integrity="sha384-BdGj8xC2eZkQaxoQ8nSLefg4AV4/AwB3Fj+8SUSo7pnKP6Eoy18liIKTPn9oBYNG" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.12.0/dist/katex.min.css" integrity="sha384-AfEj0r4/OFrOo5t7NnNe46zW/tFgW6x/bCJG8FqQCEo3+Aro6EYUG4+cU+KJWu/X" crossorigin="anonymous">
<!-- The loading of KaTeX is deferred to speed up page rendering -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/katex.min.js" integrity="sha384-JiKN5O8x9Hhs/UE5cT5AAJqieYlOZbGT3CHws/y97o3ty4R7/O5poG9F3JoiOYw1" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.12.0/dist/katex.min.js" integrity="sha384-g7c+Jr9ZivxKLnZTDUhnkOnsh30B4H0rpLUpJ4jAIKs4fnJI+sEnkvrMWph2EDg4" crossorigin="anonymous"></script>
<!-- To automatically render math in text elements, include the auto-render extension: -->
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.0/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.12.0/dist/contrib/auto-render.min.js" integrity="sha384-mll67QQFJfxn0IYznZYonOWZ644AWYC+Pt2cHqMaRhXVrursRwvLnLaebdGIlYNa" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>
<!-- <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script> -->
<!-- <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script> -->
</head>
<body>
<article>
@@ -44,7 +48,6 @@
</header>
</article>
<article>
@@ -68,20 +71,36 @@
<p>A <em>Markov Decision Process</em> is a tuple <span class="math inline">\((\mathcal{S}, \mathcal{A},
\mathcal{R}, p)\)</span> where:</p>
<ul>
<li><span class="math inline">\(\mathcal{S}\)</span> is a set of <em>states</em>,</li>
<li><span class="math inline">\(\mathcal{A}\)</span> is an application mapping each state <span class="math inline">\(s \in
\mathcal{S}\)</span> to a set <span class="math inline">\(\mathcal{A}(s)\)</span> of possible <em>actions</em> for this state. In this post, we will often simplify by using <span class="math inline">\(\mathcal{A}\)</span> as a set, assuming that all actions are possible for each state,</li>
<li><span class="math inline">\(\mathcal{R} \subset \mathbb{R}\)</span> is a set of <em>rewards</em>,</li>
<li><p><span class="math inline">\(\mathcal{S}\)</span> is a set of <em>states</em>,</p></li>
<li><p><span class="math inline">\(\mathcal{A}\)</span> is an application mapping each state <span class="math inline">\(s \in
\mathcal{S}\)</span> to a set <span class="math inline">\(\mathcal{A}(s)\)</span> of possible <em>actions</em> for this state. In this post, we will often simplify by using <span class="math inline">\(\mathcal{A}\)</span> as a set, assuming that all actions are possible for each state,</p></li>
<li><p><span class="math inline">\(\mathcal{R} \subset \mathbb{R}\)</span> is a set of <em>rewards</em>,</p></li>
<li><p>and <span class="math inline">\(p\)</span> is a function representing the <em>dynamics</em> of the MDP:</p>
<span class="math display">\[\begin{align}
p &amp;: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
p(s', r \;|\; s, a) &amp;:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
\end{align}
\]</span>
<p>such that <span class="math display">\[ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. \]</span></p></li>
</ul>
</div>
<p>The function <span class="math inline">\(p\)</span> represents the probability of transitioning to state <span class="math inline">\(s'\)</span> and receiving reward <span class="math inline">\(r\)</span> when the agent is in state <span class="math inline">\(s\)</span> and chooses action <span class="math inline">\(a\)</span>.</p>
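<p>To make this definition more concrete, here is a minimal sketch (my own illustration of a hypothetical two-state MDP with a single action, not taken from any real environment) of how the dynamics <span class="math inline">\(p\)</span> can be written down as a table and checked against the normalization condition above:</p>
<pre><code>import math

# A minimal sketch of the dynamics p(s', r | s, a) for a hypothetical
# two-state MDP ("s1", "s2") with a single action "a".
p = {
    ("s1", "a"): {("s2", 1.0): 0.8,   # move to s2 and receive reward 1
                  ("s1", 0.0): 0.2},  # stay in s1 and receive reward 0
    ("s2", "a"): {("s1", 0.0): 1.0},  # always go back to s1, with reward 0
}

# Normalization: for every (s, a), the probabilities over (s', r) sum to 1.
for (s, a), outcomes in p.items():
    assert math.isclose(sum(outcomes.values()), 1.0)</code></pre>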
<p>We will also occasionally use the <em>state-transition probabilities</em>:</p>
<span class="math display">\[\begin{align}
p &amp;: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
p(s' \;|\; s, a) &amp;:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
&amp;= \sum_r p(s', r \;|\; s, a).
\end{align}
\]</span>
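<p>For instance, with the hypothetical two-state dynamics sketched above, marginalizing over the two possible rewards gives <span class="math display">\[ p(s_2 \;|\; s_1, a) = \sum_r p(s_2, r \;|\; s_1, a) = 0.8 + 0 = 0.8. \]</span></p>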
<h3 id="rewarding-the-agent">Rewarding the agent</h3>
<div class="definition">
<p>The <em>expected reward</em> of a state-action pair is the function</p>
<span class="math display">\[\begin{align}
r &amp;: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
r(s,a) &amp;:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
&amp;= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
\end{align}
\]</span>
</div>
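<p>Continuing the same hypothetical two-state example, the expected reward of the pair <span class="math inline">\((s_1, a)\)</span> would be <span class="math display">\[ r(s_1, a) = 1 \times 0.8 + 0 \times 0.2 = 0.8. \]</span></p>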
<div class="definition">
<p>The <em>discounted return</em> is the sum of all future rewards, with a multiplicative factor giving more weight to more immediate rewards: <span class="math display">\[ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, \]</span> where <span class="math inline">\(T\)</span> can be infinite or <span class="math inline">\(\gamma\)</span> can be 1, but not both.</p>
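<p>As a quick illustration with made-up numbers: if <span class="math inline">\(\gamma = 0.9\)</span> and the agent collects the rewards <span class="math inline">\(R_{t+1} = 1\)</span>, <span class="math inline">\(R_{t+2} = 1\)</span> and <span class="math inline">\(R_{t+3} = 2\)</span> before the episode ends at <span class="math inline">\(T = t+3\)</span>, then <span class="math display">\[ G_t = 1 + 0.9 \times 1 + 0.9^2 \times 2 = 3.52. \]</span></p>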
@@ -91,14 +110,33 @@
<p>A <em>policy</em> is a way for the agent to choose the next action to perform.</p>
<div class="definition">
<p>A <em>policy</em> is a function <span class="math inline">\(\pi\)</span> defined as</p>
<span class="math display">\[\begin{align}
\pi &amp;: \mathcal{A} \times \mathcal{S} \mapsto [0,1] \\
\pi(a \;|\; s) &amp;:= \mathbb{P}(A_t=a \;|\; S_t=s).
\end{align}
\]</span>
</div>
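<p>A simple example of such a function (not specific to any particular problem) is the uniform random policy, which picks every available action with equal probability: <span class="math display">\[ \pi(a \;|\; s) = \frac{1}{|\mathcal{A}(s)|} \quad \text{for all } a \in \mathcal{A}(s). \]</span></p>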
<p>In order to compare policies, we need to associate values with them.</p>
<div class="definition">
<p>The <em>state-value function</em> of a policy <span class="math inline">\(\pi\)</span> is</p>
<span class="math display">\[\begin{align}
v_{\pi} &amp;: \mathcal{S} \mapsto \mathbb{R} \\
v_{\pi}(s) &amp;:= \text{expected return when starting in $s$ and following $\pi$} \\
v_{\pi}(s) &amp;:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s\right] \\
v_{\pi}(s) &amp;= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s\right]
\end{align}
\]</span>
</div>
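<p>As a sanity check on this definition, with made-up numbers: if following <span class="math inline">\(\pi\)</span> yields a constant reward of 1 at every step and <span class="math inline">\(\gamma = 0.9\)</span>, the value of every state is the geometric series <span class="math display">\[ v_{\pi}(s) = \sum_{k=0}^{\infty} 0.9^k = \frac{1}{1 - 0.9} = 10. \]</span></p>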
<p>We can also compute the value starting from a state <span class="math inline">\(s\)</span> by additionally taking into account the action <span class="math inline">\(a\)</span> taken first.</p>
<div class="definition">
<p>The <em>action-value function</em> of a policy <span class="math inline">\(\pi\)</span> is</p>
<span class="math display">\[\begin{align}
q_{\pi} &amp;: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
q_{\pi}(s,a) &amp;:= \text{expected return when starting from $s$, taking action $a$, and following $\pi$} \\
q_{\pi}(s,a) &amp;:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s, A_t=a \right] \\
q_{\pi}(s,a) &amp;= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s, A_t=a\right]
\end{align}
\]</span>
</div>
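<p>Although it is not stated explicitly here, these two functions are directly related: averaging the action-value over the actions selected by the policy recovers the state-value, <span class="math display">\[ v_{\pi}(s) = \sum_{a} \pi(a \;|\; s)\, q_{\pi}(s, a). \]</span></p>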
<h3 id="the-quest-for-the-optimal-policy">The quest for the optimal policy</h3>
<h2 id="references">References</h2>