{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dynamic Programming (DP)\n", "\n", "In this notebook, we will code the following DP algorithms:\n", "\n", "1. Policy Evaluation \n", "2. Policy Iteration\n", "3. Value Iteration\n", "\n", "As an example, we shall use the GridWorld environment defined in Notebook 2.\n", "\n", "We follow closely the discussion in Sutton & Barto, Chapter 4. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "# load Notebook_2\n", "import import_ipynb\n", "from Notebook_2_RL_environments import GridWorldEnv # import environment, both notebooks must be in same directory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we fix the hyperparameters, and create the environment object:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "### define rng seed\n", "seed=0\n", "\n", "### construct the Gridworld environment\n", "env=GridWorldEnv(seed=seed)\n", "\n", "### set algorithm parameters\n", "# define convergence threshold theta\n", "theta = 1E-9\n", "# discount factor\n", "gamma = 0.9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Policy Evaluation\n", "\n", "We now consider the GridWorld environment from Notebook_2. Suppose an RL agent acts with the equiprobable (random) policy\n", "\n", "$$\\pi(a|s) = \\frac{1}{|\\mathcal{A}|}$$\n", "\n", "Our first task is to use Iterative Policy Evaluation to determine the value function $V(s)\\approx v_\\pi(s)$. We shall do this in two steps:\n", "1. write a routine `policy_evaluation()` which accepts a policy `pi`, the convergence threshold parameter `theta`, and the discount factor `gamma`.\n", "2. define the equiprobable policy, and use it as input for `policy_evaluation()` to compute the corresponding value function.\n", "3. plot the value function as a table to compare it with the result from Sutton & Barto. \n", "4. define a new policy, which is non-uniform policy over the actions but still uniform over the state space: $\\pi'(a|s) = \\pi'(a) = [0.1,0.3,0.5,0.1]$, and compute $v_{\\pi'}$.\n", "5. determine which of the following is true: $\\pi'>\\pi$, $\\pi'\\geq\\pi$, $\\pi'=\\pi$, $\\pi'\\leq\\pi$, $\\pi'<\\pi$. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def policy_evaluation(pi, theta, gamma):\n", "\n", " # initialize approximant value function V\n", " V = np.zeros((5,5),) # 5x5 grid => 25 states\n", "\n", " # define differece parameter Delta\n", " Delta=1.0\n", "\n", " # policy evaluation\n", " while Delta>=theta:\n", " Delta=0.0\n", "\n", " # loop over all states in the state space\n", " for m in range(5):\n", " for n in range(5):\n", " # back up old value for state\n", " V_old = V[m,n]\n", "\n", " # evaluate update rule\n", " V_aux=0.0\n", " for action in range(len(env.action_space)):\n", "\n", " # set environment state\n", " env.state=np.array([m,n])\n", " # take step\n", " s_prime, reward, _ = env.step(action)\n", "\n", " # increment V(s)\n", " V_aux += pi[m,n,action]*(reward + gamma*V[s_prime[0],s_prime[1]])\n", "\n", " V[m,n]=V_aux\n", " # compute Delta\n", " Delta = max(Delta, np.abs(V[m,n]-V_old))\n", " \n", " return V\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 3.3, 8.8, 4.4, 5.3, 1.5],\n", " [ 1.5, 3. , 2.3, 1.9, 0.5],\n", " [ 0.1, 0.7, 0.7, 0.4, -0.4],\n", " [-1. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Policy Iteration\n", "\n", "Let us now consider the Policy Iteration algorithm. Recall that Policy Iteration consists of two alternating sub-routines:\n", "1. Policy Evaluation (E)\n", "2. Policy Improvement (I)\n", "\n", "We already wrote the `policy_evaluation()` routine above. \n", "\n", "First, we code up the `policy_improvement(pi,V,gamma)` routine: it accepts the current policy `pi` which is to be improved, the corresponding value function approximant `V`$\\approx v_\\pi$, and the discount factor `gamma`." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def policy_improvement(pi,V,gamma):\n", "\n", "    stable = True # aux variable for policy iteration\n", "\n", "    # loop over all states in the state space\n", "    for m in range(5):\n", "        for n in range(5):\n", "\n", "            # back up the old greedy action\n", "            action_old = np.argmax(pi[m,n,:])\n", "\n", "            # compute the action values q(s,a) for all actions\n", "            q_aux = np.zeros(len(env.action_space))\n", "            for action in range(len(env.action_space)):\n", "\n", "                # set environment state\n", "                env.state=np.array([m,n])\n", "                # take step\n", "                s_prime, reward, _ = env.step(action)\n", "\n", "                # compute the action value q(s,a)\n", "                q_aux[action] = (reward + gamma*V[s_prime[0],s_prime[1]])\n", "\n", "            # policy improvement step\n", "            pi[m,n,:] = 0.0\n", "            action_max=np.argmax(q_aux)\n", "            pi[m,n,action_max] = 1.0\n", "\n", "            # aux variable for policy iteration\n", "            if action_old != action_max:\n", "                stable = False\n", "\n", "    return pi, stable" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the two building-block routines `policy_evaluation()` and `policy_improvement()`, use them to write the `policy_iteration()` routine. Follow the pseudocode in Sutton & Barto. \n", "\n", "We use as an initial policy the equiprobable policy $\\pi(a|s) = \\frac{1}{|\\mathcal{A}|}$. " ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def policy_iteration(theta,gamma):\n", "\n", "    # define the initial policy\n", "    pi = 1.0/4.0*np.ones((5,5,4),) # equiprobable/random policy\n", "\n", "    j=0\n", "    stable = False\n", "    while not stable:\n", "\n", "        # run policy evaluation\n", "        V = policy_evaluation(pi,theta,gamma)\n", "\n", "        # policy improvement\n", "        pi, stable = policy_improvement(pi,V,gamma)\n", "\n", "        # print iteration info and the current value function\n", "        print(\"iteration {}: stable = {}\".format(j,stable))\n", "        print( np.round(np.flipud(V.T),1) )\n", "        print()\n", "\n", "        j+=1\n", "\n", "    return V, pi" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let us now compute the optimal policy for GridWorld using Policy Iteration. In doing so, we print the intermediate value function to monitor its convergence." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "iteration 0: stable = False\n", "[[ 3.3 8.8 4.4 5.3 1.5]\n",
" [ 1.5 3. 2.3 1.9 0.5]\n", " [ 0.1 0.7 0.7 0.4 -0.4]\n", " [-1. -0.4 -0.4 -0.6 -1.2]\n", " [-1.9 -1.3 -1.2 -1.4 -2. ]]\n", "\n", "iteration 1: stable = False\n", "[[22. 24.4 22. 18.5 16.6]\n", " [19.8 22. 19.8 16.6 14.9]\n", " [17.8 19.8 17.8 14.9 13.5]\n", " [16. 17.8 16. 13.5 12.1]\n", " [14.4 16. 14.4 12.1 10.9]]\n", "\n", "iteration 2: stable = False\n", "[[22. 24.4 22. 19.4 17.5]\n", " [19.8 22. 19.8 17.8 15.7]\n", " [17.8 19.8 17.8 16. 14.2]\n", " [16. 17.8 16. 14.4 12.7]\n", " [14.4 16. 14.4 13. 11.5]]\n", "\n", "iteration 3: stable = False\n", "[[22. 24.4 22. 19.4 17.5]\n", " [19.8 22. 19.8 17.8 16. ]\n", " [17.8 19.8 17.8 16. 14.4]\n", " [16. 17.8 16. 14.4 13. ]\n", " [14.4 16. 14.4 13. 11.7]]\n", "\n", "iteration 4: stable = True\n", "[[22. 24.4 22. 19.4 17.5]\n", " [19.8 22. 19.8 17.8 16. ]\n", " [17.8 19.8 17.8 16. 14.4]\n", " [16. 17.8 16. 14.4 13. ]\n", " [14.4 16. 14.4 13. 11.7]]\n", "\n" ] } ], "source": [ "V, pi = policy_iteration(theta,gamma)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, visualize the optimal policy $\\pi_\\ast$. To do this, you may want to plot slices over the action space. " ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pi(s,a=0):\n", "[[0. 1. 0. 1. 0.]\n", " [1. 1. 1. 0. 0.]\n", " [1. 1. 1. 1. 1.]\n", " [1. 1. 1. 1. 1.]\n", " [1. 1. 1. 1. 1.]]\n", "\n", "pi(s,a=1):\n", "[[0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]]\n", "\n", "pi(s,a=2):\n", "[[1. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]]\n", "\n", "pi(s,a=3):\n", "[[0. 0. 1. 0. 1.]\n", " [0. 0. 0. 1. 1.]\n", " [0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0.]]\n", "\n", "\n", "optimal policy pi_*:\n", "[[2 0 3 0 3]\n", " [0 0 0 3 3]\n", " [0 0 0 0 0]\n", " [0 0 0 0 0]\n", " [0 0 0 0 0]]\n" ] } ], "source": [ "print(\"pi(s,a=0):\")\n", "print(np.flipud(pi[...,0].T))\n", "\n", "print(\"\\npi(s,a=1):\")\n", "print(np.flipud(pi[...,1].T))\n", "\n", "print(\"\\npi(s,a=2):\")\n", "print(np.flipud(pi[...,2].T))\n", "\n", "print(\"\\npi(s,a=3):\")\n", "print(np.flipud(pi[...,3].T))\n", "\n", "\n", "print(\"\\n\\noptimal policy pi_*:\")\n", "print(np.flipud(np.argmax(pi,axis=2).T))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Value Iteration\n", "\n", "Last, we also code up the Value Iteration algorithm:\n", "\n", "$$ \n", "v_{k+1}(s) = \\max_{a\\in\\mathcal{A}(s)} \\sum_{s',r} p(s',r|s,a)\\left[ r + \\gamma v_k(s')\\right],\\qquad \\lim_{k\\to\\infty}v_k = v_\\ast\n", "$$\n", "\n", "1. Determine the transition probabilities $p(s',r|s,a)$ for the GridWorld environment (a sketch is given right below this list). \n", "2. Write the routine `value_iteration(theta, gamma)`, starting from the initial value function `V(s)=1` for all states `s`. " ] },
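{ "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is a minimal sketch (not part of the original notebook) for step 1 above. It assumes, as the backup rules used throughout this notebook implicitly do, that the GridWorld dynamics are deterministic, so that $p(s',r|s,a)$ equals one for exactly one pair $(s',r)$ and zero otherwise; the array names `P` and `R` are illustrative. We tabulate the transitions by sweeping `env.step()` over all state-action pairs." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: tabulate the (deterministic) transition probabilities of GridWorld\n", "\n", "n_actions = len(env.action_space)\n", "P = np.zeros((5,5,n_actions,5,5)) # P[m,n,a,mp,np] = p( s'=(mp,np) | s=(m,n), a )\n", "R = np.zeros((5,5,n_actions)) # R[m,n,a] = reward for taking action a in state s=(m,n)\n", "\n", "for m in range(5):\n", "    for n in range(5):\n", "        for action in range(n_actions):\n", "            # set environment state and take the step\n", "            env.state = np.array([m,n])\n", "            s_prime, reward, _ = env.step(action)\n", "            # deterministic transition: probability 1 for the observed next state\n", "            P[m,n,action,s_prime[0],s_prime[1]] = 1.0\n", "            R[m,n,action] = reward\n", "\n", "# sanity check: the probabilities over s' sum to one for every (s,a) pair\n", "print(np.allclose(P.sum(axis=(3,4)), 1.0))" ] },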
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def value_iteration(theta, gamma):\n", "\n", " # initialize value function V\n", " V = np.ones((5,5),) # 5x5 grid => 25 states\n", "\n", " # define differece Delta\n", " Delta=1.0\n", "\n", " # value iteration\n", " while Delta>=theta:\n", " Delta=0.0\n", "\n", " # loop over all states in the state space\n", " for m in range(5):\n", " for n in range(5):\n", " # back up old value for state\n", " v_old = V[m,n]\n", "\n", " # evaluate update rule\n", " v_aux = np.zeros(len(env.action_space))\n", " for action in range(len(env.action_space)):\n", "\n", " # set environment state\n", " env.state=np.array([m,n])\n", " # take step\n", " s_prime, reward, _ = env.step(action)\n", " \n", " # increment V(s)\n", " v_aux[action] = reward + gamma*V[s_prime[0],s_prime[1]]\n", "\n", " # compute updated value function\n", " V[m,n]=np.max(v_aux)\n", " # compute Delta\n", " Delta = max(Delta, np.abs(V[m,n]-v_old))\n", " \n", " return V" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use Value Iteration to compute the value function for GridWorld, and visualize it. Compare it to your previous results from Policy Iteration. Do they agree? " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[22. 24.4 22. 19.4 17.5]\n", " [19.8 22. 19.8 17.8 16. ]\n", " [17.8 19.8 17.8 16. 14.4]\n", " [16. 17.8 16. 14.4 13. ]\n", " [14.4 16. 14.4 13. 11.7]]\n" ] } ], "source": [ "### run policy evaluation\n", "V=value_iteration(theta,gamma)\n", "\n", "# print V\n", "print( np.round(np.flipud(V.T),1) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, extract the optimal greedy policy $\\pi_\\ast(a|s) = \\pi_\\ast(s)$ which corresponds to the optimal value function $v_\\ast(s)$, that you computed using Value Iteration. \n", "\n", "$$\n", "\\pi_\\ast(s) = \\mathrm{argmax}_{a\\in\\mathcal{A}(s)} \\sum_{s',r} p(s',r|s,a)\\left[ r + \\gamma v_\\ast(s)\\right] \n", "$$\n", "\n", "Does the policy make sense?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "RL_class", "language": "python", "name": "rl_class" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "latex_metadata": { "affiliation": "Faculty of Physics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria", "author": "Marin Bukov", "title": "Reinforcement Learning Course: WiSe 2020/21" } }, "nbformat": 4, "nbformat_minor": 4 }