diff --git a/_site/archive.html b/_site/archive.html
index 3401c4c..00f5cb6 100644
--- a/_site/archive.html
+++ b/_site/archive.html
@@ -7,7 +7,16 @@
First, we prove that every natural number commutes with \(0\).
For every natural number \(a\) such that \(0+a = a+0\), we have:
-\[\begin{align}
-  0 + s(a) &= s(0+a)\\
-  &= s(a+0)\\
-  &= s(a)\\
-  &= s(a) + 0.
-\end{align}
-\]
For every natural number \(a\) such that \(0+a = a+0\), we have:
By Axiom 5, every natural number commutes with \(0\).
We can now prove the main proposition:
For all \(a\) and \(b\) such that \(a+b=b+a\),
-\[\begin{align}
-  a + s(b) &= s(a+b)\\
-  &= s(b+a)\\
-  &= s(b) + a.
-\end{align}
-\]
For all \(a\) and \(b\) such that \(a+b=b+a\),
We used the mirrored version of the second rule for \(+\), namely \(\forall a, \forall b,\quad s(a) + b = s(a+b)\). This can easily be proved by another induction.
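For readers who want to check these two inductions mechanically, here is a minimal sketch in Lean 4 built on the same two defining rules for \(+\); the type `N` and the lemma names are illustrative and not part of the original post.

```lean
namespace Peano

-- A hand-rolled natural number type mirroring the Peano axioms (illustrative).
inductive N where
  | zero : N
  | succ : N → N

-- The two defining rules for +: a + 0 = a and a + s(b) = s(a + b).
def add : N → N → N
  | a, N.zero   => a
  | a, N.succ b => N.succ (add a b)

-- First induction: 0 commutes with every natural number.
theorem zero_add (a : N) : add N.zero a = a := by
  induction a with
  | zero => rfl
  | succ a ih => simp [add, ih]

-- The mirrored second rule: s(a) + b = s(a + b), proved by another induction.
theorem succ_add (a b : N) : add (N.succ a) b = N.succ (add a b) := by
  induction b with
  | zero => rfl
  | succ b ih => simp [add, ih]

-- Main proposition: addition is commutative.
theorem add_comm (a b : N) : add a b = add b a := by
  induction b with
  | zero => simp [add, zero_add]
  | succ b ih => simp [add, succ_add, ih]

end Peano
```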
@@ -230,31 +217,15 @@ then \(\varphi(n)\) is true for every natural n
\mathcal{S}\) to a set \(\mathcal{A}(s)\) of possible actions for this state. In this post, we will often simplify by using \(\mathcal{A}\) as a set, assuming that all actions are possible for each state, and \(p\) is a function representing the dynamics of the MDP:
-\[\begin{align}
-  p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
-  p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
-\end{align}
-\]
such that \[ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. \]
The function \(p\) represents the probability of transitioning to the state \(s'\) and getting a reward \(r\) when the agent is at state \(s\) and chooses action \(a\).
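As a concrete illustration (not taken from the post), the dynamics of a small finite MDP can be stored as a table indexed by \((s, a)\); the normalization condition above then becomes a simple per-pair sum check. A minimal Python sketch with a made-up two-state MDP:

```python
# A hypothetical two-state, two-action MDP, used only to illustrate the dynamics function p.
# p[(s, a)] maps each outcome (s', r) to its probability p(s', r | s, a).
p = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.0): 1.0},
    ("s1", "go"):   {("s0", -1.0): 0.9, ("s1", 0.0): 0.1},
}

# The defining condition of the dynamics: for every (s, a), probabilities over (s', r) sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9, f"p(., . | {s}, {a}) is not normalized"
```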
We will also occasionally use the state-transition probabilities:
-\[\begin{align}
-  p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
-  p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
-  &= \sum_r p(s', r \;|\; s, a).
-\end{align}
-\]
+The expected reward of a state-action pair is the function
-\[\begin{align}
-  r &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
-  r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
-  &= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
-\end{align}
-\]
The discounted return is the sum of all future rewards, with a multiplicative factor to give more weight to more immediate rewards: \[ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, \] where \(T\) can be infinite or \(\gamma\) can be 1, but not both.
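Continuing that hypothetical sketch, the state-transition probabilities, the expected reward \(r(s, a)\), and the discounted return can all be computed directly from the table `p`; the helper names below are illustrative:

```python
# Quantities derived from the dynamics table p sketched above (illustrative names).

def transition_prob(p, s, a, s_next):
    """p(s' | s, a): marginalize the rewards out of p(s', r | s, a)."""
    return sum(prob for (sp, _r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(p, s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all outcomes (s', r)."""
    return sum(r * prob for (_sp, r), prob in p[(s, a)].items())

def discounted_return(rewards, gamma):
    """G_t for a finite list of future rewards [R_{t+1}, R_{t+2}, ...]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```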
@@ -264,33 +235,14 @@ r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
A policy is a way for the agent to choose the next action to perform.
A policy is a function \(\pi\) defined as
-\[\begin{align}
-  \pi &: \mathcal{A} \times \mathcal{S} \mapsto [0,1] \\
-  \pi(a \;|\; s) &:= \mathbb{P}(A_t=a \;|\; S_t=s).
-\end{align}
-\]
In order to compare policies, we need to associate values with them.
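As a quick illustration before comparing policies (again a sketch with made-up values, not the post's code), a stochastic policy can be stored as a table of probabilities \(\pi(a \mid s)\) and sampled from:

```python
import random

# pi[s] maps each action a to pi(a | s); each row sums to 1 (hypothetical values).
pi = {
    "s0": {"stay": 0.1, "go": 0.9},
    "s1": {"stay": 0.5, "go": 0.5},
}

def sample_action(pi, s):
    """Draw A_t ~ pi(. | s)."""
    actions = list(pi[s])
    weights = [pi[s][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```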
The state-value function of a policy \(\pi\) is
-\[\begin{align}
-  v_{\pi} &: \mathcal{S} \mapsto \mathbb{R} \\
-  v_{\pi}(s) &:= \text{expected return when starting in $s$ and following $\pi$} \\
-  v_{\pi}(s) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s\right] \\
-  v_{\pi}(s) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s\right]
-\end{align}
-\]
We can also compute the value starting from a state \(s\) by taking the chosen action \(a\) into account.
The action-value function of a policy \(\pi\) is
-\[\begin{align}
-  q_{\pi} &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
-  q_{\pi}(s,a) &:= \text{expected return when starting from $s$, taking action $a$, and following $\pi$} \\
-  q_{\pi}(s,a) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s, A_t=a \right] \\
-  q_{\pi}(s,a) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s, A_t=a\right]
-\end{align}
-\]
First, we prove that every natural number commutes with \(0\).
For every natural number \(a\) such that \(0+a = a+0\), we have:
-\[\begin{align}
-  0 + s(a) &= s(0+a)\\
-  &= s(a+0)\\
-  &= s(a)\\
-  &= s(a) + 0.
-\end{align}
-\]
For every natural number \(a\) such that \(0+a = a+0\), we have:
By Axiom 5, every natural number commutes with \(0\).
We can now prove the main proposition:
For all \(a\) and \(b\) such that \(a+b=b+a\),
-\[\begin{align}
-  a + s(b) &= s(a+b)\\
-  &= s(b+a)\\
-  &= s(b) + a.
-\end{align}
-\]
For all \(a\) and \(b\) such that \(a+b=b+a\),
We used the mirrored version of the second rule for \(+\), namely \(\forall a, \forall b,\quad s(a) + b = s(a+b)\). This can easily be proved by another induction.
diff --git a/_site/posts/reinforcement-learning-1.html b/_site/posts/reinforcement-learning-1.html
index de4414e..1472269 100644
--- a/_site/posts/reinforcement-learning-1.html
+++ b/_site/posts/reinforcement-learning-1.html
@@ -7,7 +7,16 @@
and \(p\) is a function representing the dynamics of the MDP:
-\[\begin{align}
-  p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
-  p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
-\end{align}
-\]
such that \[ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. \]
The function \(p\) represents the probability of transitioning to the state \(s'\) and getting a reward \(r\) when the agent is at state \(s\) and chooses action \(a\).
We will also occasionally use the state-transition probabilities:
-\[\begin{align}
-  p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
-  p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
-  &= \sum_r p(s', r \;|\; s, a).
-\end{align}
-\]
+The expected reward of a state-action pair is the function
-\[\begin{align}
-  r &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
-  r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
-  &= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
-\end{align}
-\]
The discounted return is the sum of all future rewards, with a multiplicative factor to give more weight to more immediate rewards: \[ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, \] where \(T\) can be infinite or \(\gamma\) can be 1, but not both.
@@ -85,33 +78,14 @@ r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
A policy is a way for the agent to choose the next action to perform.
A policy is a function \(\pi\) defined as
-\[\begin{align}
-  \pi &: \mathcal{A} \times \mathcal{S} \mapsto [0,1] \\
-  \pi(a \;|\; s) &:= \mathbb{P}(A_t=a \;|\; S_t=s).
-\end{align}
-\]
In order to compare policies, we need to associate values with them.
The state-value function of a policy \(\pi\) is
-\[\begin{align}
-  v_{\pi} &: \mathcal{S} \mapsto \mathbb{R} \\
-  v_{\pi}(s) &:= \text{expected return when starting in $s$ and following $\pi$} \\
-  v_{\pi}(s) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s\right] \\
-  v_{\pi}(s) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s\right]
-\end{align}
-\]
We can also compute the value starting from a state \(s\) by taking the chosen action \(a\) into account.
The action-value function of a policy \(\pi\) is
-\[\begin{align}
-  q_{\pi} &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
-  q_{\pi}(s,a) &:= \text{expected return when starting from $s$, taking action $a$, and following $\pi$} \\
-  q_{\pi}(s,a) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s, A_t=a \right] \\
-  q_{\pi}(s,a) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s, A_t=a\right]
-\end{align}
-\]
First, we prove that every natural number commutes with \(0\).
For every natural number \(a\) such that \(0+a = a+0\), we have:
-\[\begin{align}
-  0 + s(a) &= s(0+a)\\
-  &= s(a+0)\\
-  &= s(a)\\
-  &= s(a) + 0.
-\end{align}
-\]
For every natural number \(a\) such that \(0+a = a+0\), we have:
By Axiom 5, every natural number commutes with \(0\).
We can now prove the main proposition:
For all \(a\) and \(b\) such that \(a+b=b+a\),
-\[\begin{align}
-  a + s(b) &= s(a+b)\\
-  &= s(b+a)\\
-  &= s(b) + a.
-\end{align}
-\]
For all \(a\) and \(b\) such that \(a+b=b+a\),
We used the mirrored version of the second rule for \(+\), namely \(\forall a, \forall b,\quad s(a) + b = s(a+b)\). This can easily be proved by another induction.
@@ -226,31 +213,15 @@ then \(\varphi(n)\) is true for every natural n
\mathcal{S}\) to a set \(\mathcal{A}(s)\) of possible actions for this state. In this post, we will often simplify by using \(\mathcal{A}\) as a set, assuming that all actions are possible for each state, and \(p\) is a function representing the dynamics of the MDP:
-\[\begin{align}
-  p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
-  p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a),
-\end{align}
-\]
such that \[ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. \]
The function \(p\) represents the probability of transitioning to the state \(s'\) and getting a reward \(r\) when the agent is at state \(s\) and chooses action \(a\).
We will also occasionally use the state-transition probabilities:
-\[\begin{align}
-  p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\
-  p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\
-  &= \sum_r p(s', r \;|\; s, a).
-\end{align}
-\]
+The expected reward of a state-action pair is the function
-\[\begin{align}
-  r &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
-  r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
-  &= \sum_r r \sum_{s'} p(s', r \;|\; s, a).
-\end{align}
-\]
The discounted return is the sum of all future rewards, with a multiplicative factor to give more weight to more immediate rewards: \[ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, \] where \(T\) can be infinite or \(\gamma\) can be 1, but not both.
@@ -260,33 +231,14 @@ r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\
A policy is a way for the agent to choose the next action to perform.
A policy is a function \(\pi\) defined as
-\[\begin{align}
-  \pi &: \mathcal{A} \times \mathcal{S} \mapsto [0,1] \\
-  \pi(a \;|\; s) &:= \mathbb{P}(A_t=a \;|\; S_t=s).
-\end{align}
-\]
In order to compare policies, we need to associate values with them.
The state-value function of a policy \(\pi\) is
-\[\begin{align}
-  v_{\pi} &: \mathcal{S} \mapsto \mathbb{R} \\
-  v_{\pi}(s) &:= \text{expected return when starting in $s$ and following $\pi$} \\
-  v_{\pi}(s) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s\right] \\
-  v_{\pi}(s) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s\right]
-\end{align}
-\]
We can also compute the value starting from a state \(s\) by taking the chosen action \(a\) into account.
The action-value function of a policy \(\pi\) is
-\[\begin{align}
-  q_{\pi} &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\
-  q_{\pi}(s,a) &:= \text{expected return when starting from $s$, taking action $a$, and following $\pi$} \\
-  q_{\pi}(s,a) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s, A_t=a \right] \\
-  q_{\pi}(s,a) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s, A_t=a\right]
-\end{align}
-\]
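To make the two value functions concrete, here is a rough Monte Carlo sketch that estimates \(q_{\pi}(s,a)\) and \(v_{\pi}(s)\) by averaging sampled discounted returns; it reuses the hypothetical `p`, `pi`, `sample_action`, and `discounted_return` helpers sketched earlier and is not the post's code.

```python
import random

def sample_step(p, s, a):
    """Draw one outcome (S_t, R_t) ~ p(., . | s, a) from the dynamics table."""
    outcomes = list(p[(s, a)])
    probs = [p[(s, a)][o] for o in outcomes]
    return random.choices(outcomes, weights=probs, k=1)[0]

def estimate_q(p, pi, s, a, gamma=0.9, horizon=50, episodes=1000):
    """Monte Carlo estimate of q_pi(s, a): average discounted return when
    starting from s, taking action a, and then following pi."""
    total = 0.0
    for _ in range(episodes):
        rewards, state, action = [], s, a
        for _ in range(horizon):
            state, reward = sample_step(p, state, action)
            rewards.append(reward)
            action = sample_action(pi, state)
        total += discounted_return(rewards, gamma)
    return total / episodes

def estimate_v(p, pi, s, **kwargs):
    """Monte Carlo estimate of v_pi(s), using v_pi(s) = sum_a pi(a | s) q_pi(s, a)."""
    return sum(pi[s][a] * estimate_q(p, pi, s, a, **kwargs) for a in pi[s])
```

Averaging over more episodes lowers the variance of the estimate at the cost of compute, and with \(\gamma < 1\) truncating each rollout at a finite horizon introduces only a small bias.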