Demote headers to avoid rendering first-level headings as <h1>
parent aa841f4ba2
commit 02f4a537bd
13 changed files with 222 additions and 220 deletions
@@ -49,11 +49,11 @@
 </section>
 <section>
-<h1 id="introduction">Introduction</h1>
+<h2 id="introduction">Introduction</h2>
 <p>In this series of blog posts, I intend to write my notes as I go through Richard S. Sutton’s excellent <em>Reinforcement Learning: An Introduction</em> <a href="#ref-1">(1)</a>.</p>
 <p>I will try to formalise the maths behind it a little bit, mainly because I would like to use it as a useful personal reference to the main concepts in RL. I will probably add a few remarks about a possible implementation as I go on.</p>
-<h1 id="relationship-between-agent-and-environment">Relationship between agent and environment</h1>
-<h2 id="context-and-assumptions">Context and assumptions</h2>
+<h2 id="relationship-between-agent-and-environment">Relationship between agent and environment</h2>
+<h3 id="context-and-assumptions">Context and assumptions</h3>
 <p>The goal of reinforcement learning is to select the best actions available to an agent as it goes through a series of states in an environment. In this post, we will only consider <em>discrete</em> time steps.</p>
 <p>The most important hypothesis we make is the <em>Markov property:</em></p>
 <blockquote>
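The blockquote at the end of the hunk above introduces the Markov property, whose statement falls outside the lines shown. For reference, the standard form (as in Sutton and Barto, not quoted from the changed file) is that the next state and reward depend only on the current state and action:

    % Standard statement of the Markov property, given here only as context:
    \Pr\{S_{t+1} = s',\, R_{t+1} = r \mid S_t, A_t, R_t, \ldots, R_1, S_0, A_0\}
      \;=\; \Pr\{S_{t+1} = s',\, R_{t+1} = r \mid S_t, A_t\}.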
@@ -76,15 +76,15 @@
 <p>The function <span class="math inline">\(p\)</span> represents the probability of transitioning to the state <span class="math inline">\(s'\)</span> and getting a reward <span class="math inline">\(r\)</span> when the agent is at state <span class="math inline">\(s\)</span> and chooses action <span class="math inline">\(a\)</span>.</p>
 <p>We will also occasionally use the <em>state-transition probabilities</em>:</p>
-<h2 id="rewarding-the-agent">Rewarding the agent</h2>
+<h3 id="rewarding-the-agent">Rewarding the agent</h3>
 <div class="definition">
 <p>The <em>expected reward</em> of a state-action pair is the function</p>
 </div>
 <div class="definition">
 <p>The <em>discounted return</em> is the sum of all future rewards, with a multiplicative factor to give more weight to more immediate rewards: <span class="math display">\[ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, \]</span> where <span class="math inline">\(T\)</span> can be infinite or <span class="math inline">\(\gamma\)</span> can be 1, but not both.</p>
 </div>
-<h1 id="deciding-what-to-do-policies">Deciding what to do: policies</h1>
-<h2 id="defining-our-policy-and-its-value">Defining our policy and its value</h2>
+<h2 id="deciding-what-to-do-policies">Deciding what to do: policies</h2>
+<h3 id="defining-our-policy-and-its-value">Defining our policy and its value</h3>
 <p>A <em>policy</em> is a way for the agent to choose the next action to perform.</p>
 <div class="definition">
 <p>A <em>policy</em> is a function <span class="math inline">\(\pi\)</span> defined as</p>
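The hunk above also covers the definitions of the state-transition probabilities and the discounted return. In the book's notation the state-transition probabilities follow from p by summing over rewards, p(s' | s, a) = sum_r p(s', r | s, a). As a minimal sketch (assuming a finite episode and plain Python lists; this is not code from the post), the discounted return can be accumulated backwards:

    # Minimal sketch: computing G_t = sum_{k=t+1}^{T} gamma^{k-t-1} R_k
    # for a finite episode, where `rewards` holds R_{t+1}, ..., R_T and
    # `gamma` is the discount factor in [0, 1].
    def discounted_return(rewards, gamma):
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g  # G_{k-1} = R_k + gamma * G_k
        return g

    # Example: 1.0 + 0.9 * 0.0 + 0.9**2 * 2.0 = 2.62
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))

Working backwards keeps the computation linear in the episode length and mirrors the recursion G_t = R_{t+1} + gamma * G_{t+1} used throughout the book.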
@@ -97,8 +97,8 @@
 <div class="definition">
 <p>The <em>action-value function</em> of a policy <span class="math inline">\(\pi\)</span> is</p>
 </div>
-<h2 id="the-quest-for-the-optimal-policy">The quest for the optimal policy</h2>
-<h1 id="references">References</h1>
+<h3 id="the-quest-for-the-optimal-policy">The quest for the optimal policy</h3>
+<h2 id="references">References</h2>
 <ol>
 <li><span id="ref-1"></span>R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, Second edition. Cambridge, MA: The MIT Press, 2018.</li>
 </ol>
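The last hunk touches the definition of the action-value function, whose formula is not visible in this view. The standard definition (again from Sutton and Barto, not quoted from the diff), written with the same notation as the discounted return above, is:

    % Standard definition of the action-value function of a policy pi:
    q_\pi(s, a) := \mathbb{E}_\pi\!\left[\, G_t \,\middle|\, S_t = s,\ A_t = a \,\right]
                 = \mathbb{E}_\pi\!\left[\, \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k \,\middle|\, S_t = s,\ A_t = a \,\right].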