{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Temporal Difference Learning\n", "\n", "In this notebook, we study Temporal Difference (TD) methods in practice. We follow closely the discussion in Sutton & Barto, Chapter 6.\n", "\n", "Like Monte Carlo methods, TD methods do not require knowledge of the transition probabilities $p(s',r|s,a)$, i.e. they are model-free; TD methods learn directly from experience. Like Dynamic Programming methods, TD methods update the value function incrementally, and can also be used for infinite-horizon tasks.\n", "\n", "Below, you will implement:\n", "\n", "1. Policy Evaluation using TD(0),\n", "2. On-Policy Control: SARSA,\n", "3. Off-Policy Control: Q-Learning.\n", "\n", "As an example, we use the WindyGridWorld environment defined in Notebook 2." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import import_ipynb\n", "from Notebook_2_RL_environments import WindyGridWorldEnv # import environment; notebooks must be in the same directory" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# set seed of rng (for reproducibility of the results)\n", "seed=0\n", "np.random.seed(seed)\n", "\n", "# create environment class\n", "env=WindyGridWorldEnv(seed=seed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Policy Evaluation\n", "\n", "First, we implement Policy Evaluation using TD(0)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [
"def policy_evaluation(pi, N_episodes, alpha, gamma, verbose=True):\n",
"    \"\"\"\n",
"    pi: np.ndarray\n",
"        policy to be evaluated.\n",
"    N_episodes: int\n",
"        number of training episodes.\n",
"    alpha: double\n",
"        learning rate or step-size parameter. Should be in the interval [0,1].\n",
"    gamma: double\n",
"        discount factor. Should be in the interval [0,1].\n",
"    verbose: bool\n",
"        whether or not to print progress.\n",
"\n",
"    \"\"\"\n",
"\n",
"    # initialize value function V\n",
"    V = np.zeros((10,7),)\n",
"\n",
"    # policy evaluation\n",
"    for episode in range(N_episodes):\n",
"\n",
"        # reset environment\n",
"        env.reset()\n",
"\n",
"        # loop over the steps in an episode\n",
"        done=False\n",
"        while not done:\n",
"\n",
"            # sample an action from the policy pi\n",
"            S=env.state.copy()\n",
"            A=np.random.choice([0,1,2,3], p=pi[S[0],S[1],:])\n",
"\n",
"            # take an environment step\n",
"            S_p, R, done = env.step(A)\n",
"\n",
"            # update value function\n",
"            V[S[0],S[1]] += alpha*(R + gamma*V[S_p[0],S_p[1]] - V[S[0],S[1]])\n",
"\n",
"        if episode%10==0 and verbose:\n",
"            print('finished episode {0:d}.'.format(episode))\n",
"\n",
"    return V" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define an equiprobable policy and evaluate it using TD learning. Use $\\alpha=0.5$, $\\gamma=0.9$, and `N_episodes=100`. \n", "\n", "1. What happens if you change the value of the discount factor $\\gamma$? Can you explain your observations?\n", "2. Does the result depend on the number of episodes used? Why?\n", "3. Try to evaluate the policy which deterministically takes the action `right`. What can go wrong here? Can you come up with a way to fix the issue you observe? (See the sketch in the next cell.)" ] },
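{ "cell_type": "markdown", "metadata": {}, "source": [ "*Hint for question 3:* a deterministic policy places all probability mass on a single action. The cell below is a minimal sketch of how such a policy could be set up; it assumes that the integer action index `1` corresponds to `right` (check the action encoding of the WindyGridWorld environment in Notebook 2 and adjust if necessary)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# sketch: a deterministic policy that always moves right\n",
"# (assumption: action index 1 corresponds to 'right' in WindyGridWorldEnv -- verify in Notebook 2)\n",
"A_RIGHT = 1\n",
"\n",
"# put all probability mass on 'right' in every state of the (10,7) grid\n",
"pi_right = np.zeros((10,7,4))\n",
"pi_right[:,:,A_RIGHT] = 1.0\n",
"\n",
"# evaluating this policy naively may produce episodes that never terminate;\n",
"# one possible fix is to cap the number of steps per episode inside policy_evaluation.\n",
"# V_right = policy_evaluation(pi_right, 100, 0.5, 0.9)" ] },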
{ "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "finished episode 0.\n", "finished episode 10.\n", "finished episode 20.\n", "finished episode 30.\n", "finished episode 40.\n", "finished episode 50.\n", "finished episode 60.\n", "finished episode 70.\n", "finished episode 80.\n", "finished episode 90.\n" ] }, { "data": { "text/plain": [ "array([[-10., -10., -10., -10., -10., -10., -10., -10., -10., -10.],\n", " [-10., -10., -10., -10., -10., -10., -10., -10., -10., -10.],\n", " [-10., -10., -10., -10., -10., -10., -10., -10., -10., -9.],\n", " [-10., -10., -10., -10., -10., -10., -10., 0., -10., -9.],\n", " [-10., -10., -10., -10., -10., -10., 0., -8., -5., -9.],\n", " [-10., -10., -10., -10., -10., 0., 0., -10., -10., -9.],\n", " [-10., -10., -10., -10., 0., 0., 0., 0., -7., -9.]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# learning rate\n", "alpha = 0.5\n", "\n", "# discount factor\n", "gamma = 0.9\n", "\n", "# number of episodes to collect data from\n", "N_episodes=100\n", "\n", "# define equiprobable policy\n", "pi_equiprob=0.25*np.ones((10,7,4))\n", "\n", "# evaluate the value function for the equiprobable policy\n", "V_equiprob = policy_evaluation(pi_equiprob, N_episodes, alpha, gamma)\n", "\n", "# print value function\n", "np.round(np.flipud(V_equiprob.T),0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SARSA\n", "\n", "We now turn to policy improvement and implement the SARSA algorithm. SARSA is an on-policy algorithm, i.e. the policy being improved is the same policy which generates the data. The algorithm makes use of generalized policy iteration to find an approximation to the optimal $Q$-function using experience (i.e. from data).\n", "\n", "SARSA requires us to be able to take actions according to an $\\varepsilon$-greedy policy. Therefore, your first task is to implement a function `take_eps_greedy_action(Qs,eps,avail_actions)` which takes an action according to the $\\varepsilon$-greedy policy w.r.t. the $Q$-function `Q(s,:)` for some fixed state $s$. For simplicity, we assume that all actions are available from every state (see the definition of the WindyGridWorld environment).\n", "\n", "Once we have `take_eps_greedy_action`, we can move on to implement the `SARSA(N_episodes, alpha, gamma, eps)` routine." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [
"def take_eps_greedy_action(Qs,eps,avail_actions=np.array([0,1,2,3])):\n",
"    \"\"\"\n",
"    Qs: np.ndarray\n",
"        the Q(s,:) values for a fixed state, over all available actions from that state.\n",
"    eps: double\n",
"        small number to define the eps-greedy policy.\n",
"    avail_actions: np.ndarray\n",
"        an array containing all allowed actions from the state s.\n",
"        For the WindyGridWorld environment, these are always all four actions.\n",
"\n",
"    \"\"\"\n",
"\n",
"    # compute greedy action\n",
"    a = np.argmax(Qs)\n",
"\n",
"    # draw a random number in [0,1]\n",
"    delta=np.random.uniform()\n",
"\n",
"    # with (small) probability eps/|A|, take a random non-greedy action\n",
"    if delta < eps/avail_actions.shape[0]:\n",
"        a = np.random.choice( avail_actions[np.arange(len(avail_actions)) != a] )\n",
"\n",
"    return a" ] },
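{ "cell_type": "markdown", "metadata": {}, "source": [ "Before plugging `take_eps_greedy_action` into SARSA, it can be useful to sanity-check it empirically. The cell below is an optional sketch with an arbitrary toy `Qs` vector: it draws many actions and prints their empirical frequencies; the greedy action should be selected the vast majority of the time, and the remaining actions only rarely, at a rate controlled by `eps`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# optional sanity check of the eps-greedy action selection (toy Q-values, chosen arbitrarily)\n",
"# note: this draws from the global rng, so running it affects the reproducibility of later cells\n",
"Qs_test = np.array([1.0, 0.0, 0.0, 0.0]) # action 0 is the greedy action\n",
"eps_test = 0.1\n",
"\n",
"# draw many actions and count how often each one is picked\n",
"counts = np.zeros(4)\n",
"for _ in range(10000):\n",
"    counts[take_eps_greedy_action(Qs_test, eps_test)] += 1\n",
"\n",
"# empirical selection frequencies; the greedy action should dominate\n",
"print(counts/counts.sum())" ] },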
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [
"def SARSA(N_episodes, alpha, gamma, eps):\n",
"    \"\"\"\n",
"    N_episodes: int\n",
"        number of training episodes.\n",
"    alpha: double\n",
"        learning rate or step-size parameter. Should be in the interval [0,1].\n",
"    gamma: double\n",
"        discount factor. Should be in the interval [0,1].\n",
"    eps: double\n",
"        the eps-greedy policy parameter. Controls exploration. Should be in the interval [0,1].\n",
"    \"\"\"\n",
"\n",
"    # initialize Q function\n",
"    Q = np.zeros((10,7,4),) # (10,7)-grid of 4 actions\n",
"\n",
"    # generalized policy iteration\n",
"    for episode in range(N_episodes):\n",
"\n",
"        # reset environment and compute initial state\n",
"        S=env.reset().copy()\n",
"        # take first action using eps-greedy policy\n",
"        A=take_eps_greedy_action(Q[S[0],S[1],:],eps)\n",
"\n",
"        # loop over the timesteps in the episode\n",
"        done=False\n",
"        while not done:\n",
"\n",
"            # take an environment step\n",
"            S_p, R, done = env.step(A)\n",
"\n",
"            # choose A_p using the eps-greedy policy\n",
"            A_p=take_eps_greedy_action(Q[S_p[0],S_p[1],:],eps)\n",
"\n",
"            # update Q function\n",
"            Q[S[0],S[1],A] += alpha*(R + gamma*Q[S_p[0],S_p[1],A_p] - Q[S[0],S[1],A])\n",
"\n",
"            # update state and action\n",
"            S=S_p.copy()\n",
"            A=A_p\n",
"\n",
"    return Q" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have implemented SARSA, we can estimate the optimal $q_\\ast(s,a)$ and $v_\\ast(s)$ functions for a fixed value of $\\varepsilon=0.1$.\n", "\n", "* What is the role of the parameter `N_episodes` when it comes to the convergence speed?\n", "* Print the value function corresponding to `Q_SARSA`.\n", "* Extract the optimal policy from `Q_SARSA` (a helper sketch is given after these questions).\n", "\n", "Use your code now to gain intuition about how the algorithm behaves:\n", "\n", "* How do the results change if you decrease the value of `eps`?\n", "* How do the results change if you increase/decrease the step-size parameter `alpha`?" ] },
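{ "cell_type": "markdown", "metadata": {}, "source": [ "For the policy-extraction task, the cell below sketches a small helper, `greedy_policy_from_Q` (a name introduced here for convenience), which reads off the greedy deterministic policy from a $Q$-table by taking the arg-max over the actions; the same pattern appears again in the Q-Learning cell further below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# helper sketch: extract the greedy (deterministic) policy from a Q function\n",
"def greedy_policy_from_Q(Q):\n",
"    # one-hot policy: probability 1 on the arg-max action in every state\n",
"    pi = np.zeros_like(Q)\n",
"    amax = np.argmax(Q, axis=2)\n",
"    for m in range(Q.shape[0]):\n",
"        for n in range(Q.shape[1]):\n",
"            pi[m,n,amax[m,n]] = 1.0\n",
"    return pi\n",
"\n",
"# possible usage, once Q_SARSA has been computed in the next cell:\n",
"# pi_SARSA = greedy_policy_from_Q(Q_SARSA)\n",
"# np.flipud(np.argmax(pi_SARSA, axis=2).T)  # greedy action index for each grid cell" ] },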
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-16., -16., -14., -13., -11., -10., -9., -8., -7., -6.],\n", " [-16., -16., -14., -12., -13., -11., -10., -10., -8., -5.],\n", " [-17., -16., -15., -14., -12., -12., -11., -7., -8., -4.],\n", " [-15., -15., -14., -13., -13., -11., -10., 0., -7., -3.],\n", " [-17., -16., -15., -14., -12., -11., 0., -1., -1., -2.],\n", " [-16., -15., -14., -13., -12., 0., 0., -2., -2., -4.],\n", " [-15., -14., -13., -13., 0., 0., 0., 0., -2., -4.]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# learning rate\n", "alpha = 0.5\n", "\n", "# discount factor\n", "gamma = 1.0\n", "\n", "# epsilon: small positive number used in the definition of the epsilon-greedy policy\n", "eps = 0.1\n", "\n", "# number of episodes to collect data from\n", "N_episodes=1000\n", "\n", "# use SARSA to compute the optimal Q function\n", "Q_SARSA=SARSA(N_episodes, alpha, gamma, eps)\n", "\n", "# extract and plot the corresponding value function\n", "V_SARSA=np.max(Q_SARSA,axis=2)\n", "np.round(np.flipud(V_SARSA.T),0)\n", "\n", "# compute and print the optimal policy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q-Learning\n", "\n", "Q-Learning is an off-policy algorithm, i.e. the policy being improved can be different from the behavior policy which generates the data. Like SARSA, Q-Learning makes use of generalized policy iteration to find an approximation to the optimal $Q$-function using experience (i.e. from data).\n", "\n", "The Q-Learning algorithm forms the basis of many modern Deep RL methods; its off-policy character allows learning from old data, which makes it particularly suitable for data-driven deep learning approaches.\n", "\n", "As with SARSA, we will make use of the routine `take_eps_greedy_action` defined above." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [
"def Q_Learning(N_episodes, alpha, gamma, eps):\n",
"    \"\"\"\n",
"    N_episodes: int\n",
"        number of training episodes.\n",
"    alpha: double\n",
"        learning rate or step-size parameter. Should be in the interval [0,1].\n",
"    gamma: double\n",
"        discount factor. Should be in the interval [0,1].\n",
"    eps: double\n",
"        the eps-greedy policy parameter. Controls exploration. Should be in the interval [0,1].\n",
"    \"\"\"\n",
"\n",
"    # initialize Q function\n",
"    Q = np.zeros((10,7,4),) # (10,7)-grid of 4 actions\n",
"\n",
"    # generalized policy iteration\n",
"    for episode in range(N_episodes):\n",
"\n",
"        # reset environment and compute initial state\n",
"        S=env.reset().copy()\n",
"\n",
"        # loop over the timesteps in the episode\n",
"        done=False\n",
"        while not done:\n",
"\n",
"            # choose action using the eps-greedy behavior policy\n",
"            A=take_eps_greedy_action(Q[S[0],S[1],:],eps)\n",
"\n",
"            # take an environment step\n",
"            S_p, R, done = env.step(A)\n",
"\n",
"            # update Q function\n",
"            Q[S[0],S[1],A] += alpha*(R + gamma*np.max(Q[S_p[0],S_p[1],:]) - Q[S[0],S[1],A])\n",
"\n",
"            # update state\n",
"            S=S_p.copy()\n",
"\n",
"    return Q" ] },
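{ "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the two implementations above, the only difference is the bootstrap target used in the one-step update:\n", "\n", "$$Q(S,A) \\leftarrow Q(S,A) + \\alpha\\left[R + \\gamma\\, Q(S',A') - Q(S,A)\\right] \\qquad \\text{(SARSA)},$$\n", "\n", "$$Q(S,A) \\leftarrow Q(S,A) + \\alpha\\left[R + \\gamma\\, \\max_a Q(S',a) - Q(S,A)\\right] \\qquad \\text{(Q-Learning)}.$$\n", "\n", "SARSA bootstraps with the action $A'$ actually selected by the $\\varepsilon$-greedy behavior policy (on-policy), whereas Q-Learning bootstraps with the greedy action in the next state $S'$ (off-policy)." ] },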
{ "cell_type": "markdown", "metadata": {}, "source": [ "As with SARSA, we first estimate the optimal $q_\\ast(s,a)$ and $v_\\ast(s)$ functions for a fixed value of $\\varepsilon=0.1$.\n", "\n", "* What is the role of the parameter `N_episodes` when it comes to the convergence speed?\n", "* Print the value function `V_QL` corresponding to `Q_QL`.\n", "* Extract the optimal policy from `Q_QL`.\n", "* Now use the `policy_evaluation` routine to evaluate the optimal policy and visualize its value function `V`. What is the relation between the value function `V` and the value function `V_QL` obtained from `Q_QL`?\n", "\n", "Use your code now to gain intuition about how the algorithm behaves:\n", "\n", "* How do the results change if you decrease the value of `eps`?\n", "* How do the results change if you increase/decrease the step-size parameter `alpha`?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 0., 0., 0., 0., -9., -8., -7., -6.],\n", " [ 0., 0., 0., 0., 0., -10., 0., 0., 0., -5.],\n", " [ 0., 0., 0., 0., -11., 0., 0., 0., 0., -4.],\n", " [-15., -14., -13., -12., 0., 0., 0., 0., 0., -3.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., -1., -2.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# learning rate\n", "alpha = 0.5\n", "\n", "# discount factor\n", "gamma = 1.0\n", "\n", "# epsilon: small positive number used in the definition of the epsilon-greedy policy\n", "eps = 0.1\n", "\n", "# number of episodes to collect data from\n", "N_episodes=1000\n", "\n", "# use Q-Learning to compute the optimal Q function\n", "Q_QL=Q_Learning(N_episodes, alpha, gamma, eps)\n", "\n", "# extract and plot the corresponding value function\n", "V_QL=np.max(Q_QL,axis=2)\n", "np.round(np.flipud(V_QL.T),0)\n", "\n", "# compute and print the optimal policy\n", "pi_QL=np.zeros((10,7,4))\n", "\n", "# greedy policy associated with Q function\n", "amax=np.argmax(Q_QL,axis=2)\n", "for m in range(10):\n", "    for n in range(7):\n", "        pi_QL[m,n,amax[m,n]]=1\n", "\n", "# now use the policy_evaluation routine to evaluate the optimal policy and visualize its value function\n", "V=policy_evaluation(pi_QL, N_episodes, alpha, gamma, verbose=False)\n", "np.round(np.flipud(V.T),0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "RL_class", "language": "python", "name": "rl_class" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "latex_metadata": { "affiliation": "Faculty of Physics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria", "author": "Marin Bukov", "title": "Reinforcement Learning Course: WiSe 2020/21" } }, "nbformat": 4, "nbformat_minor": 4 }