---
title: "Quick Notes on Reinforcement Learning (Part 1)"
date: 2018-11-21
---
* Introduction
In this series of blog posts, I intend to write my notes as I go
through Richard S. Sutton and Andrew G. Barto's excellent
/Reinforcement Learning: An Introduction/ [[ref-1][(1)]].
I will try to formalise the maths behind it a little bit, mainly so
that these notes can serve as a personal reference to the main
concepts in RL. I will probably add a few remarks about a possible
implementation as I go on.
* Relationship between agent and environment
** Context and assumptions
The goal of reinforcement learning is to select the best actions
available to an agent as it goes through a series of states in an
environment. In this post, we will only consider /discrete/ time
steps.
The most important hypothesis we make is the /Markov property:/
#+BEGIN_QUOTE
At each time step, the next state of the agent depends only on the
current state and the current action taken. It cannot depend on the
history of the states visited by the agent.
#+END_QUOTE
This property is essential to make our problems tractable, and often
holds true in practice (to a reasonable approximation).
With this assumption, we can define the relationship between agent and
environment as a /Markov Decision Process/ (MDP).
#+begin_definition
A /Markov Decision Process/ is a tuple $(\mathcal{S}, \mathcal{A},
\mathcal{R}, p)$ where:
- $\mathcal{S}$ is a set of /states/,
- $\mathcal{A}$ is a function mapping each state $s \in
  \mathcal{S}$ to the set $\mathcal{A}(s)$ of possible /actions/ in
  that state. In this post, we will often simplify by treating
  $\mathcal{A}$ as a single set, assuming that all actions are
  possible in every state,
- $\mathcal{R} \subset \mathbb{R}$ is a set of /rewards/,
- and $p$ is a function representing the /dynamics/ of the MDP:
\begin{align}
p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0,1] \\
p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
\end{align}
such that
$$ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. $$
#+end_definition
The function $p$ represents the probability of transitioning to
state $s'$ and receiving reward $r$ when the agent is in state $s$
and chooses action $a$.
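To make this concrete, here is a minimal sketch of how the dynamics
could be represented in Python. The =dynamics= dictionary and the
tiny two-state MDP below are purely hypothetical illustrations, not
taken from the book.
#+begin_src python
# A possible representation of the dynamics p(s', r | s, a): a
# dictionary mapping each (state, action) pair to a distribution over
# (next_state, reward) pairs.
# Toy two-state MDP with states "A", "B" and actions "stay", "move".
dynamics = {
    ("A", "stay"): {("A", 0.0): 1.0},
    ("A", "move"): {("B", 1.0): 0.9, ("A", 0.0): 0.1},
    ("B", "stay"): {("B", 0.0): 1.0},
    ("B", "move"): {("A", 1.0): 0.8, ("B", 0.0): 0.2},
}

# Each conditional distribution must sum to 1, as required by the
# definition of p.
for (s, a), dist in dynamics.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
#+end_src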
We will also occasionally use the /state-transition probabilities/:
\begin{align}
p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0,1] \\
p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
&= \sum_r p(s', r \;|\; s, a).
\end{align}
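Reusing the hypothetical =dynamics= dictionary from the sketch above,
the state-transition probabilities can be obtained by marginalising
the reward out:
#+begin_src python
def transition_probability(dynamics, s_next, s, a):
    """Compute p(s' | s, a) by summing p(s', r | s, a) over all rewards r."""
    return sum(prob
               for (next_state, reward), prob in dynamics[(s, a)].items()
               if next_state == s_next)

# For the toy MDP above: transition_probability(dynamics, "B", "A", "move") == 0.9
#+end_src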
** Rewarding the agent
#+begin_definition
The /expected reward/ of a state-action pair is the function
\begin{align}
r &: \mathcal{S} \times \mathcal{A} \to \mathbb{R} \\
r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
&= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
\end{align}
#+end_definition
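With the same hypothetical representation, the expected reward of a
state-action pair is a probability-weighted sum over all (next state,
reward) outcomes:
#+begin_src python
def expected_reward(dynamics, s, a):
    """Compute r(s, a) = sum_r r * sum_{s'} p(s', r | s, a)."""
    return sum(reward * prob
               for (next_state, reward), prob in dynamics[(s, a)].items())

# For the toy MDP above: expected_reward(dynamics, "A", "move") == 0.9
#+end_src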
#+begin_definition
The /discounted return/ is the sum of all future rewards, with a
multiplicative factor $\gamma \in [0,1]$, the /discount rate/, giving
more weight to more immediate rewards:
$$ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, $$
where $T$ (the final time step) can be infinite or $\gamma$ can be 1,
but not both.
#+end_definition
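For a finite episode, the discounted return can be computed directly
from the definition. This is a small sketch assuming the rewards are
given as a list $[R_{t+1}, \dots, R_T]$:
#+begin_src python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_{k=t+1}^T gamma^(k-t-1) R_k, where `rewards`
    is the list [R_{t+1}, ..., R_T]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: a reward of 1 at each of three steps with gamma = 0.5
# gives 1 + 0.5 + 0.25 = 1.75.
assert discounted_return([1.0, 1.0, 1.0], 0.5) == 1.75
#+end_src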
* Deciding what to do: policies
# TODO
Coming soon...
** Defining our policy and its value
** The quest for the optimal policy
* References
1. <<ref-1>> R. S. Sutton and A. G. Barto, /Reinforcement Learning:
   An Introduction/, second edition, Cambridge, MA: The MIT Press,
   2018.