---
title: "Quick Notes on Reinforcement Learning"
date: 2018-11-21
---

* Introduction

In this series of blog posts, I intend to write my notes as I go
through Richard S. Sutton and Andrew G. Barto's excellent
/Reinforcement Learning: An Introduction/ [[ref-1][(1)]].

I will try to formalise the maths behind it a little bit, mainly
because I would like to use it as a personal reference for the main
concepts in RL. I will probably add a few remarks about a possible
implementation as I go.

* Relationship between agent and environment
** Context and assumptions

The goal of reinforcement learning is to select the best actions
available to an agent as it goes through a series of states in an
environment. In this post, we will only consider /discrete/ time
steps.

The most important hypothesis we make is the /Markov property:/

#+BEGIN_QUOTE
At each time step, the next state of the agent depends only on the
current state and the current action taken. It cannot depend on the
history of the states visited by the agent.
#+END_QUOTE

This property is essential to make our problems tractable, and often
holds true in practice (to a reasonable approximation).

With this assumption, we can define the relationship between agent and
environment as a /Markov Decision Process/ (MDP).

#+begin_definition
A /Markov Decision Process/ is a tuple $(\mathcal{S}, \mathcal{A},
\mathcal{R}, p)$ where:
- $\mathcal{S}$ is a set of /states/,
- $\mathcal{A}$ is a function mapping each state $s \in
  \mathcal{S}$ to a set $\mathcal{A}(s)$ of possible /actions/ for
  this state. In this post, we will often simplify by using
  $\mathcal{A}$ as a set, assuming that all actions are possible for
  each state,
- $\mathcal{R} \subset \mathbb{R}$ is a set of /rewards/,
- and $p$ is a function representing the /dynamics/ of the MDP:
  \begin{align}
  p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
  p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
  \end{align}
  such that
  $$ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. $$
#+end_definition

The function $p$ represents the probability of transitioning to the
state $s'$ and getting a reward $r$ when the agent is at state $s$ and
chooses action $a$.

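As a first remark on implementation, for a small finite MDP the
dynamics $p$ can simply be stored as a lookup table. Below is a
minimal Python sketch of this idea; the states, actions and rewards
are made up purely for illustration, and the only real constraint is
the normalisation condition from the definition above.

#+begin_src python
# Tabular dynamics: (s, a) -> {(s', r): probability}.
# The states ("low", "high") and the actions are illustrative only.
dynamics = {
    ("low", "wait"):     {("low", 0.0): 1.0},
    ("low", "recharge"): {("high", 0.0): 1.0},
    ("high", "search"):  {("high", 1.0): 0.7, ("low", 1.0): 0.3},
    ("high", "wait"):    {("high", 0.0): 1.0},
}

def check_dynamics(p):
    """Check that p(., . | s, a) sums to 1 for every state-action pair."""
    for (s, a), outcomes in p.items():
        total = sum(outcomes.values())
        assert abs(total - 1.0) < 1e-9, f"p(., . | {s}, {a}) sums to {total}"

check_dynamics(dynamics)
#+end_src
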
We will also occasionally use the /state-transition probabilities/:
\begin{align}
p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
&= \sum_r p(s', r \;|\; s, a).
\end{align}

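In the tabular sketch above, these are obtained by marginalising the
dynamics table over rewards:

#+begin_src python
def transition_prob(p, s, a):
    """p(s' | s, a) = sum_r p(s', r | s, a), from the tabular dynamics."""
    probs = {}
    for (s_next, _r), prob in p[(s, a)].items():
        probs[s_next] = probs.get(s_next, 0.0) + prob
    return probs

# With the made-up dynamics above:
# transition_prob(dynamics, "high", "search")  # -> {"high": 0.7, "low": 0.3}
#+end_src
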
** Rewarding the agent

#+begin_definition
The /expected reward/ of a state-action pair is the function
\begin{align}
r &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
&= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
\end{align}
#+end_definition

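With the tabular dynamics sketched earlier, this is a plain weighted
sum over the table:

#+begin_src python
def expected_reward(p, s, a):
    """r(s, a): sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for (_s_next, r), prob in p[(s, a)].items())

# expected_reward(dynamics, "high", "search")  # = 1.0*0.7 + 1.0*0.3 = 1.0
#+end_src
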
#+begin_definition
The /discounted return/ is the sum of all future rewards, with a
multiplicative factor to give more weight to more immediate rewards:
$$ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, $$
where $T$ can be infinite or $\gamma$ can be 1, but not both.
#+end_definition

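For an episode of finite length, the return is easy to compute
directly from the observed rewards; here is a small sketch, assuming
the rewards $R_{t+1}, \dots, R_T$ are given as a plain Python list:

#+begin_src python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+1+k}, where `rewards` is [R_{t+1}, ..., R_T]."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# discounted_return([1.0, 0.0, 2.0], gamma=0.9)  # = 1.0 + 0.9*0.0 + 0.9**2 * 2.0
#+end_src
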
* Deciding what to do: policies
** Defining our policy and its value

A /policy/ is a way for the agent to choose the next action to
perform.

#+begin_definition
A /policy/ is a function $\pi$ defined as
\begin{align}
\pi &: \mathcal{A} \times \mathcal{S} \mapsto [0,1] \\
\pi(a \;|\; s) &:= \mathbb{P}(A_t=a \;|\; S_t=s).
\end{align}
#+end_definition

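In the tabular setting, a policy is just another lookup table from
which the agent samples its next action. A possible sketch, reusing
the made-up states from the dynamics table above (the probabilities
are arbitrary):

#+begin_src python
import random

# pi(a | s) as a table.
policy = {
    "low":  {"wait": 0.5, "recharge": 0.5},
    "high": {"search": 0.8, "wait": 0.2},
}

def sample_action(pi, s):
    """Draw an action a with probability pi(a | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]
#+end_src
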
In order to compare policies, we need to associate a value with each
of them.

#+begin_definition
The /state-value function/ of a policy $\pi$ is
\begin{align}
v_{\pi} &: \mathcal{S} \mapsto \mathbb{R} \\
v_{\pi}(s) &:= \text{expected return when starting in $s$ and following $\pi$} \\
v_{\pi}(s) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s\right] \\
v_{\pi}(s) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s\right]
\end{align}
#+end_definition

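Going back to the tabular sketch, the state value can be estimated
directly from its definition by Monte Carlo simulation: run many
episodes that start in $s$ and follow $\pi$, and average their
discounted returns. The sketch below truncates the infinite sum at a
fixed horizon, which is a reasonable approximation when $\gamma < 1$;
it reuses =dynamics=, =policy= and =sample_action= from the earlier
snippets.

#+begin_src python
def sample_step(p, s, a):
    """Sample (s', r) with probability p(s', r | s, a)."""
    outcomes, probs = zip(*p[(s, a)].items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def mc_state_value(p, pi, s, gamma=0.9, episodes=1000, horizon=100):
    """Monte Carlo estimate of v_pi(s): average truncated discounted returns."""
    total = 0.0
    for _ in range(episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):
            a = sample_action(pi, state)
            state, r = sample_step(p, state, a)
            g += discount * r
            discount *= gamma
        total += g
    return total / episodes
#+end_src
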
We can also compute the value of starting in a state $s$ and taking a
specific action $a$.

#+begin_definition
The /action-value function/ of a policy $\pi$ is
\begin{align}
q_{\pi} &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
q_{\pi}(s,a) &:= \text{expected return when starting from $s$, taking action $a$, and following $\pi$} \\
q_{\pi}(s,a) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s, A_t=a \right] \\
q_{\pi}(s,a) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s, A_t=a\right]
\end{align}
#+end_definition

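The same Monte Carlo idea applies: force the first action to be $a$,
then follow $\pi$. Again a rough sketch built on the previous
snippets, with the same truncation of the infinite sum:

#+begin_src python
def mc_action_value(p, pi, s, a, gamma=0.9, episodes=1000, horizon=100):
    """Monte Carlo estimate of q_pi(s, a)."""
    total = 0.0
    for _ in range(episodes):
        state, r = sample_step(p, s, a)  # the first action is forced to be a
        g, discount = r, gamma
        for _ in range(horizon - 1):
            act = sample_action(pi, state)
            state, rew = sample_step(p, state, act)
            g += discount * rew
            discount *= gamma
        total += g
    return total / episodes
#+end_src
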
** The quest for the optimal policy
* References

1. <<ref-1>> R. S. Sutton and A. G. Barto, /Reinforcement Learning: An
   Introduction/, Second edition. Cambridge, MA: The MIT Press, 2018.