Reinforcement Learning 1
---
title: "Quick Notes on Reinforcement Learning (Part 1)"
date: 2018-11-21
---

* Introduction

In this series of blog posts, I intend to write my notes as I go
through Sutton and Barto's excellent /Reinforcement Learning: An
Introduction/ [[ref-1][(1)]].

I will try to formalise the maths behind it a little bit, mainly
because I would like to use it as a personal reference for the main
concepts in RL. I will probably add a few remarks about possible
implementations as I go on.

* Relationship between agent and environment

** Context and assumptions

The goal of reinforcement learning is to select the best actions
available to an agent as it goes through a series of states in an
environment. In this post, we will only consider /discrete/ time
steps.

The most important hypothesis we make is the /Markov property/:

#+BEGIN_QUOTE
At each time step, the next state of the agent depends only on the
current state and the current action taken. It cannot depend on the
history of the states visited by the agent.
#+END_QUOTE

This property is essential to make our problems tractable, and often
holds true in practice (to a reasonable approximation).

With this assumption, we can define the relationship between agent
and environment as a /Markov Decision Process/ (MDP).

#+begin_definition
A /Markov Decision Process/ is a tuple $(\mathcal{S}, \mathcal{A},
\mathcal{R}, p)$ where:
- $\mathcal{S}$ is a set of /states/,
- $\mathcal{A}$ is a function mapping each state $s \in \mathcal{S}$
  to a set $\mathcal{A}(s)$ of possible /actions/ for this state. In
  this post, we will often simplify by treating $\mathcal{A}$ as a
  single set, assuming that all actions are possible in every state,
- $\mathcal{R} \subset \mathbb{R}$ is a set of /rewards/,
- and $p$ is a function representing the /dynamics/ of the MDP:
  \begin{align}
  p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0,1] \\
  p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
  \end{align}
  such that
  $$ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. $$
#+end_definition

The function $p$ represents the probability of transitioning to the
state $s'$ and receiving the reward $r$ when the agent is in state
$s$ and chooses action $a$.

We will also occasionally use the /state-transition probabilities/:
\begin{align}
p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0,1] \\
p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
&= \sum_r p(s', r \;|\; s, a).
\end{align}
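
To make the dynamics concrete, here is a minimal sketch in Python of
how $p$ could be stored for a toy MDP. The states, actions, rewards
and probabilities below are invented purely for illustration; the
sketch also recovers the state-transition probabilities by summing
over rewards, as in the formula above.

#+begin_src python
# Dynamics of a hypothetical two-state MDP, stored as
# (s, a) -> {(s', r): probability}. All values are made up.
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 0.9, ("s1", 1.0): 0.1},
    ("s0", "move"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.0): 1.0},
    ("s1", "move"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
}

# Each conditional distribution p(. | s, a) must sum to 1.
for (s, a), dist in dynamics.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)

def transition_prob(s_next, s, a):
    """State-transition probability p(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(prob for (sp, _r), prob in dynamics[(s, a)].items() if sp == s_next)

print(transition_prob("s1", "s0", "move"))  # 0.8
#+end_src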

** Rewarding the agent

#+begin_definition
The /expected reward/ of a state-action pair is the function
\begin{align}
r &: \mathcal{S} \times \mathcal{A} \to \mathbb{R} \\
r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
&= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
\end{align}
#+end_definition
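
Continuing the toy dynamics sketched above (still purely
illustrative), the expected reward falls out directly from the stored
distribution:

#+begin_src python
def expected_reward(s, a):
    """Expected reward r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for (_s_next, r), prob in dynamics[(s, a)].items())

print(expected_reward("s0", "move"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
#+end_src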

#+begin_definition
The /discounted return/ is the sum of all future rewards, with a
multiplicative factor giving more weight to more immediate rewards:
$$ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, $$
where $\gamma \in [0,1]$ is the /discount rate/ and $T$ is the final
time step; $T$ can be infinite or $\gamma$ can be 1, but not both.
#+end_definition
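
As a quick sanity check on this formula, here is a small sketch that
computes the discounted return of a finite reward sequence; the
rewards and the value of $\gamma$ are arbitrary.

#+begin_src python
def discounted_return(rewards, gamma):
    """G_t = sum_{k=t+1}^T gamma^(k-t-1) R_k for a finite episode.

    `rewards` holds R_{t+1}, ..., R_T, i.e. the rewards observed after
    time step t.
    """
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
#+end_src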

* Deciding what to do: policies

# TODO

Coming soon...

** Defining our policy and its value

** The quest for the optimal policy

* References

1. <<ref-1>> R. S. Sutton and A. G. Barto, /Reinforcement Learning:
   An Introduction/, Second edition. Cambridge, MA: The MIT Press,
   2018.