Reinforcement Learning 1
parent 863de69b3e
commit ad4cea5710
4 changed files with 211 additions and 0 deletions
@@ -27,6 +27,10 @@
Here you can find all my previous posts:
<ul>

<li>
<a href="./posts/reinforcement-learning-1.html">Quick Notes on Reinforcement Learning (Part 1)</a> - November 21, 2018
</li>

<li>
<a href="./posts/ising-apl.html">Ising model simulation in APL</a> - March 5, 2018
</li>
@@ -51,6 +51,10 @@
<h2>Recent Posts</h2>
<ul>

<li>
<a href="./posts/reinforcement-learning-1.html">Quick Notes on Reinforcement Learning (Part 1)</a> - November 21, 2018
</li>

<li>
<a href="./posts/ising-apl.html">Ising model simulation in APL</a> - March 5, 2018
</li>
_site/posts/reinforcement-learning-1.html (new file, 101 lines)
@@ -0,0 +1,101 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Dimitri Lozeve - Quick Notes on Reinforcement Learning (Part 1)</title>
<link rel="stylesheet" href="../css/default.css" />
<link rel="stylesheet" href="../css/syntax.css" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML" async></script>
</head>
<body>
<header>
<div class="logo">
<a href="../">Dimitri Lozeve</a>
</div>
<nav>
<a href="../">Home</a>
<a href="../projects.html">Projects</a>
<a href="../archive.html">Archive</a>
<a href="../contact.html">Contact</a>
</nav>
</header>

<main role="main">
<h1>Quick Notes on Reinforcement Learning (Part 1)</h1>
<article>
<section class="header">
Posted on November 21, 2018
</section>
<section>
<h1 id="introduction">Introduction</h1>
<p>In this series of blog posts, I intend to write my notes as I go through Sutton and Barto’s excellent <em>Reinforcement Learning: An Introduction</em> <a href="#ref-1">(1)</a>.</p>
<p>I will try to formalise the maths behind it a little bit, mainly because I would like to use it as a personal reference to the main concepts in RL. I will probably add a few remarks about a possible implementation as I go on.</p>
<h1 id="relationship-between-agent-and-environment">Relationship between agent and environment</h1>
<h2 id="context-and-assumptions">Context and assumptions</h2>
<p>The goal of reinforcement learning is to select the best actions available to an agent as it goes through a series of states in an environment. In this post, we will only consider <em>discrete</em> time steps.</p>
<p>The most important hypothesis we make is the <em>Markov property:</em></p>
<blockquote>
<p>At each time step, the next state of the agent depends only on the current state and the current action taken. It cannot depend on the history of the states visited by the agent.</p>
</blockquote>
<p>This property is essential to make our problems tractable, and often holds true in practice (to a reasonable approximation).</p>
<p>With this assumption, we can define the relationship between agent and environment as a <em>Markov Decision Process</em> (MDP).</p>
<div class="definition">
<p>A <em>Markov Decision Process</em> is a tuple <span class="math inline">\((\mathcal{S}, \mathcal{A}, \mathcal{R}, p)\)</span> where:</p>
<ul>
<li><span class="math inline">\(\mathcal{S}\)</span> is a set of <em>states</em>,</li>
<li><span class="math inline">\(\mathcal{A}\)</span> is a function mapping each state <span class="math inline">\(s \in \mathcal{S}\)</span> to a set <span class="math inline">\(\mathcal{A}(s)\)</span> of possible <em>actions</em> for this state. In this post, we will often simplify by using <span class="math inline">\(\mathcal{A}\)</span> as a set, assuming that all actions are possible for each state,</li>
<li><span class="math inline">\(\mathcal{R} \subset \mathbb{R}\)</span> is a set of <em>rewards</em>,</li>
<li><p>and <span class="math inline">\(p\)</span> is a function representing the <em>dynamics</em> of the MDP:</p>
<span class="math display">\[\begin{align}
p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0,1] \\
p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
\end{align}
\]</span>
<p>such that <span class="math display">\[ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. \]</span></p></li>
</ul>
</div>
<p>The function <span class="math inline">\(p\)</span> represents the probability of transitioning to the state <span class="math inline">\(s'\)</span> and getting a reward <span class="math inline">\(r\)</span> when the agent is at state <span class="math inline">\(s\)</span> and chooses action <span class="math inline">\(a\)</span>.</p>
<p>We will also occasionally use the <em>state-transition probabilities</em>:</p>
<span class="math display">\[\begin{align}
p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0,1] \\
p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
&= \sum_r p(s', r \;|\; s, a).
\end{align}
\]</span>
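<p>To make the dynamics function more concrete, here is a minimal sketch (my own toy example, not taken from the book) of how a finite MDP could be stored as a table in Python, with the state-transition probabilities obtained by summing over rewards; the dictionary layout and function names are purely illustrative:</p>
<pre><code class="python"># A toy tabular MDP: dynamics[(s, a)] maps each (next_state, reward)
# pair to its probability p(s', r | s, a).
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 0.9, ("s1", 1.0): 0.1},
    ("s0", "move"): {("s1", 1.0): 1.0},
    ("s1", "stay"): {("s1", 2.0): 0.5, ("s0", 0.0): 0.5},
    ("s1", "move"): {("s0", 0.0): 1.0},
}

# Each conditional distribution must sum to 1 over (s', r).
for (s, a), dist in dynamics.items():
    assert abs(sum(dist.values()) - 1.0) &lt; 1e-12

def state_transition_prob(dynamics, s_next, s, a):
    """p(s' | s, a), obtained by summing p(s', r | s, a) over rewards."""
    return sum(prob for (sp, r), prob in dynamics[(s, a)].items() if sp == s_next)

print(state_transition_prob(dynamics, "s1", "s0", "stay"))  # 0.1
</code></pre>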
<h2 id="rewarding-the-agent">Rewarding the agent</h2>
<div class="definition">
<p>The <em>expected reward</em> of a state-action pair is the function</p>
<span class="math display">\[\begin{align}
r &: \mathcal{S} \times \mathcal{A} \to \mathbb{R} \\
r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
&= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
\end{align}
\]</span>
</div>
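<p>With the same toy tabular representation as above (again just a sketch, not the book's code), the expected reward is a direct transcription of the double sum:</p>
<pre><code class="python"># Continuing the toy `dynamics` table from the previous snippet.
def expected_reward(dynamics, s, a):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for (s_next, r), prob in dynamics[(s, a)].items())

print(expected_reward(dynamics, "s0", "stay"))  # 0.9 * 0.0 + 0.1 * 1.0 = 0.1
</code></pre>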
<div class="definition">
<p>The <em>discounted return</em> is the sum of all future rewards, with a multiplicative factor to give more weight to more immediate rewards: <span class="math display">\[ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, \]</span> where <span class="math inline">\(T\)</span> can be infinite or <span class="math inline">\(\gamma\)</span> can be 1, but not both.</p>
</div>
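<p>For an episode that has already been played out, the discounted return can be computed directly from the observed rewards; a small sketch, assuming the rewards <span class="math inline">\(R_{t+1}, \dots, R_T\)</span> are given as a Python list:</p>
<pre><code class="python">def discounted_return(rewards, gamma):
    """G_t = sum_{k=t+1}^T gamma^(k-t-1) R_k, for rewards = [R_{t+1}, ..., R_T]."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
</code></pre>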
<h1 id="deciding-what-to-do-policies">Deciding what to do: policies</h1>
<p>Coming soon…</p>
<h2 id="defining-our-policy-and-its-value">Defining our policy and its value</h2>
<h2 id="the-quest-for-the-optimal-policy">The quest for the optimal policy</h2>
<h1 id="references">References</h1>
<ol>
<li><span id="ref-1"></span>R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, Second edition. Cambridge, MA: The MIT Press, 2018.</li>
</ol>
</section>
</article>
</main>

<footer>
Site proudly generated by
<a href="http://jaspervdj.be/hakyll">Hakyll</a>
</footer>
</body>
</html>
posts/reinforcement-learning-1.org (new file, 102 lines)
@@ -0,0 +1,102 @@
---
title: "Quick Notes on Reinforcement Learning (Part 1)"
date: 2018-11-21
---

* Introduction

In this series of blog posts, I intend to write my notes as I go
through Sutton and Barto's excellent /Reinforcement Learning: An
Introduction/ [[ref-1][(1)]].

I will try to formalise the maths behind it a little bit, mainly
because I would like to use it as a personal reference to the
main concepts in RL. I will probably add a few remarks about a
possible implementation as I go on.

* Relationship between agent and environment

** Context and assumptions

The goal of reinforcement learning is to select the best actions
available to an agent as it goes through a series of states in an
environment. In this post, we will only consider /discrete/ time
steps.

The most important hypothesis we make is the /Markov property:/

#+BEGIN_QUOTE
At each time step, the next state of the agent depends only on the
current state and the current action taken. It cannot depend on the
history of the states visited by the agent.
#+END_QUOTE

This property is essential to make our problems tractable, and often
holds true in practice (to a reasonable approximation).

With this assumption, we can define the relationship between agent and
environment as a /Markov Decision Process/ (MDP).

#+begin_definition
A /Markov Decision Process/ is a tuple $(\mathcal{S}, \mathcal{A},
\mathcal{R}, p)$ where:
- $\mathcal{S}$ is a set of /states/,
- $\mathcal{A}$ is a function mapping each state $s \in
  \mathcal{S}$ to a set $\mathcal{A}(s)$ of possible /actions/ for
  this state. In this post, we will often simplify by using
  $\mathcal{A}$ as a set, assuming that all actions are possible for
  each state,
- $\mathcal{R} \subset \mathbb{R}$ is a set of /rewards/,
- and $p$ is a function representing the /dynamics/ of the MDP:
  \begin{align}
  p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0,1] \\
  p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
  \end{align}
  such that
  $$ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. $$
#+end_definition

The function $p$ represents the probability of transitioning to the
state $s'$ and getting a reward $r$ when the agent is at state $s$ and
chooses action $a$.

We will also occasionally use the /state-transition probabilities/:
\begin{align}
p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0,1] \\
p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
&= \sum_r p(s', r \;|\; s, a).
\end{align}
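
To make the dynamics function more concrete, here is a minimal sketch
(my own toy example, not taken from the book) of how a finite MDP
could be stored as a table in Python, with the state-transition
probabilities obtained by summing over rewards; the dictionary layout
and function names are purely illustrative:

#+begin_src python
# A toy tabular MDP: dynamics[(s, a)] maps each (next_state, reward)
# pair to its probability p(s', r | s, a).
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 0.9, ("s1", 1.0): 0.1},
    ("s0", "move"): {("s1", 1.0): 1.0},
    ("s1", "stay"): {("s1", 2.0): 0.5, ("s0", 0.0): 0.5},
    ("s1", "move"): {("s0", 0.0): 1.0},
}

# Each conditional distribution must sum to 1 over (s', r).
for (s, a), dist in dynamics.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

def state_transition_prob(dynamics, s_next, s, a):
    """p(s' | s, a), obtained by summing p(s', r | s, a) over rewards."""
    return sum(prob for (sp, r), prob in dynamics[(s, a)].items() if sp == s_next)

print(state_transition_prob(dynamics, "s1", "s0", "stay"))  # 0.1
#+end_src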

** Rewarding the agent

#+begin_definition
The /expected reward/ of a state-action pair is the function
\begin{align}
r &: \mathcal{S} \times \mathcal{A} \to \mathbb{R} \\
r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
&= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
\end{align}
#+end_definition
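
With the same toy tabular representation as above (again just a
sketch, not the book's code), the expected reward is a direct
transcription of the double sum:

#+begin_src python
# Continuing the toy `dynamics` table from the previous snippet.
def expected_reward(dynamics, s, a):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for (s_next, r), prob in dynamics[(s, a)].items())

print(expected_reward(dynamics, "s0", "stay"))  # 0.9 * 0.0 + 0.1 * 1.0 = 0.1
#+end_src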

#+begin_definition
The /discounted return/ is the sum of all future rewards, with a
multiplicative factor to give more weight to more immediate rewards:
$$ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, $$
where $T$ can be infinite or $\gamma$ can be 1, but not both.
#+end_definition
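
For an episode that has already been played out, the discounted
return can be computed directly from the observed rewards; a small
sketch, assuming the rewards $R_{t+1}, \dots, R_T$ are given as a
Python list:

#+begin_src python
def discounted_return(rewards, gamma):
    """G_t = sum_{k=t+1}^T gamma^(k-t-1) R_k, for rewards = [R_{t+1}, ..., R_T]."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
#+end_src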

* Deciding what to do: policies

# TODO

Coming soon...

** Defining our policy and its value

** The quest for the optimal policy

* References

1. <<ref-1>>R. S. Sutton and A. G. Barto, Reinforcement learning: an
   introduction, Second edition. Cambridge, MA: The MIT Press, 2018.