---
title: "Quick Notes on Reinforcement Learning"
date: 2018-11-21
tags: machine learning, reinforcement learning
---
* Introduction
In this series of blog posts, I intend to write up my notes as I work
through Sutton and Barto's excellent /Reinforcement Learning: An
Introduction/ [[ref-1][(1)]].
I will try to formalise the maths a little, mainly so that these
posts can serve as a personal reference for the main concepts in RL.
I will probably add a few remarks about possible implementations as I
go.
* Relationship between agent and environment
** Context and assumptions
The goal of reinforcement learning is to select the best actions
available to an agent as it goes through a series of states in an
environment. In this post, we will only consider /discrete/ time
steps.
The most important hypothesis we make is the /Markov property:/
#+BEGIN_QUOTE
At each time step, the next state of the agent depends only on the
current state and the current action taken. It cannot depend on the
history of the states visited by the agent.
#+END_QUOTE
This property is essential to make our problems tractable, and often
holds true in practice (to a reasonable approximation).
With this assumption, we can define the relationship between agent and
environment as a /Markov Decision Process/ (MDP).
#+begin_definition
A /Markov Decision Process/ is a tuple $(\mathcal{S}, \mathcal{A},
\mathcal{R}, p)$ where:
- $\mathcal{S}$ is a set of /states/,
- $\mathcal{A}$ is a mapping assigning to each state $s \in
\mathcal{S}$ a set $\mathcal{A}(s)$ of possible /actions/ in that
state. In this post, we will often simplify by treating
$\mathcal{A}$ as a single set, assuming that all actions are
possible in every state,
- $\mathcal{R} \subset \mathbb{R}$ is a set of /rewards/,
- and $p$ is a function representing the /dynamics/ of the MDP:
\begin{align}
p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
\end{align}
such that
$$ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. $$
#+end_definition
The function $p$ represents the probability of transitioning to the
state $s'$ and getting a reward $r$ when the agent is at state $s$ and
chooses action $a$.
We will also occasionally use the /state-transition probabilities/:
\begin{align}
p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
&= \sum_r p(s', r \;|\; s, a).
\end{align}
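To make this concrete, here is a minimal sketch (in Python, which is
not part of the original notes) of how the dynamics of a small,
made-up MDP could be stored as an explicit table, and how the
state-transition probabilities are recovered by marginalising over
rewards. The state and action names, and all the numbers, are
invented purely for illustration.
#+begin_src python
# A tiny, made-up MDP. dynamics[(s, a)] is a list of
# (s_next, reward, probability) triples, i.e. an explicit table of
# the values p(s', r | s, a).
dynamics = {
    ("low", "wait"):      [("low", 0.0, 0.9), ("high", 1.0, 0.1)],
    ("low", "recharge"):  [("high", 0.0, 1.0)],
    ("high", "wait"):     [("high", 1.0, 0.8), ("low", 0.0, 0.2)],
    ("high", "recharge"): [("high", 0.0, 1.0)],
}

# Sanity check: for every pair (s, a), the probabilities sum to 1.
for (s, a), outcomes in dynamics.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9

def state_transition_prob(s_next, s, a):
    """p(s' | s, a): marginalise the dynamics over the rewards."""
    return sum(prob for nxt, _, prob in dynamics[(s, a)] if nxt == s_next)

print(state_transition_prob("high", "low", "wait"))  # 0.1
#+end_src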
** Rewarding the agent
#+begin_definition
The /expected reward/ of a state-action pair is the function
\begin{align}
r &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
&= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
\end{align}
#+end_definition
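With the same representation, the expected reward is just a weighted
sum over the possible outcomes. A minimal sketch, reusing the toy
~dynamics~ table from the previous snippet:
#+begin_src python
def expected_reward(s, a):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(reward * prob for _, reward, prob in dynamics[(s, a)])

print(expected_reward("high", "wait"))  # 1.0 * 0.8 + 0.0 * 0.2 = 0.8
#+end_src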
#+begin_definition
The /discounted return/ is the sum of all future rewards, weighted by
a discount factor $\gamma \in [0,1]$ that gives more weight to more
immediate rewards:
$$ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, $$
where $T$ can be infinite or $\gamma$ can be 1, but not both.
#+end_definition
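As a quick sanity check on the formula, here is a sketch of the
discounted return of a finite list of rewards; the infinite-horizon
case with $\gamma < 1$ is the limit of such truncated sums.
#+begin_src python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for rewards R_{t+1}, ..., R_T."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 1.0, 1.0], gamma=0.9))
# 1 + 0.9^2 + 0.9^3 = 2.539
#+end_src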
* Deciding what to do: policies
** Defining our policy and its value
A /policy/ is a way for the agent to choose the next action to
perform.
#+begin_definition
A /policy/ is a function $\pi$ defined as
\begin{align}
\pi &: \mathcal{A} \times \mathcal{S} \mapsto [0,1] \\
\pi(a \;|\; s) &:= \mathbb{P}(A_t=a \;|\; S_t=s).
\end{align}
#+end_definition
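Concretely, a stochastic policy over a small finite MDP can be stored
as a table of probabilities $\pi(a \;|\; s)$ and sampled from
directly. The numbers below are made up for illustration and reuse
the states and actions of the toy MDP above.
#+begin_src python
import random

# A made-up stochastic policy: policy[s][a] = pi(a | s).
policy = {
    "low":  {"wait": 0.2, "recharge": 0.8},
    "high": {"wait": 0.9, "recharge": 0.1},
}

def sample_action(s):
    """Draw an action A_t ~ pi(. | S_t = s)."""
    actions, probs = zip(*policy[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("low"))  # "recharge" about 80% of the time
#+end_src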
In order to compare policies, we need to associate values with them.
#+begin_definition
The /state-value function/ of a policy $\pi$ is
\begin{align}
v_{\pi} &: \mathcal{S} \mapsto \mathbb{R} \\
v_{\pi}(s) &:= \text{expected return when starting in $s$ and following $\pi$} \\
v_{\pi}(s) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s\right] \\
v_{\pi}(s) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s\right]
\end{align}
#+end_definition
We can also compute the value of starting from a state $s$ while
taking into account the first action $a$ taken.
#+begin_definition
The /action-value function/ of a policy $\pi$ is
\begin{align}
q_{\pi} &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
q_{\pi}(s,a) &:= \text{expected return when starting from $s$, taking action $a$, and following $\pi$} \\
q_{\pi}(s,a) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s, A_t=a \right] \\
q_{\pi}(s,a) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s, A_t=a\right]
\end{align}
#+end_definition
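Both definitions are expectations over the return $G_t$, so they can
be estimated naively by Monte Carlo rollouts. The sketch below reuses
the toy ~dynamics~, ~policy~, ~sample_action~ and ~discounted_return~
snippets above, and truncates each episode at a fixed horizon, which
is only an approximation of the infinite sum (the neglected tail is
bounded by $\gamma^H / (1 - \gamma)$ times the largest reward).
#+begin_src python
import random

def sample_step(s, a):
    """Draw (S_t, R_t) ~ p(., . | S_{t-1} = s, A_{t-1} = a)."""
    outcomes = dynamics[(s, a)]
    probs = [prob for _, _, prob in outcomes]
    s_next, reward, _ = random.choices(outcomes, weights=probs, k=1)[0]
    return s_next, reward

def rollout(s, a, gamma, horizon):
    """Truncated discounted return of one episode starting with (s, a)."""
    rewards = []
    for _ in range(horizon):
        s, r = sample_step(s, a)
        rewards.append(r)
        a = sample_action(s)
    return discounted_return(rewards, gamma)

def q_estimate(s, a, gamma=0.9, episodes=5_000, horizon=100):
    """Monte Carlo estimate of q_pi(s, a)."""
    return sum(rollout(s, a, gamma, horizon) for _ in range(episodes)) / episodes

def v_estimate(s, gamma=0.9, episodes=5_000, horizon=100):
    """Monte Carlo estimate of v_pi(s): the first action is drawn from pi."""
    return sum(rollout(s, sample_action(s), gamma, horizon)
               for _ in range(episodes)) / episodes

print(v_estimate("high"), q_estimate("low", "recharge"))
#+end_src
This brute-force estimate is only meant to illustrate the
definitions; it does not exploit the structure of the MDP at all.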
** The quest for the optimal policy
* References
1. <<ref-1>> R. S. Sutton and A. G. Barto, /Reinforcement Learning:
An Introduction/, Second edition. Cambridge, MA: The MIT Press, 2018.