{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Monte Carlo (MC) Methods in Reinforcement Learning\n",
    "\n",
    "In this notebook, we study tabular model-free RL using MC methods. Recall that we introduced MC sampling in Notebook 1, and here we shall make use of some of the sampling techniques we discussed there,\n",
    "\n",
    "In particular, in this Notebook, we shall code up simple routines for:\n",
    "1. first-visit MC prediction for value function estimation;\n",
    "2. MC with exploring starts (ES) for policy optimization.\n",
    "\n",
    "We follow closely the discussion and pseudocodes from Sutton & Barto, Chapter 5. \n",
    "\n",
    "Because the tabular MC methods in RL we discussed in class are defined only for episodic tasks, below we shall use the `Episodic_GridWorldEnv` we created in Notebook 2. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "import import_ipynb\n",
    "from Notebook_2_RL_environments import Episodic_GridWorldEnv # import environment, notebooks must be in same directory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us begin by initializing the environment. \n",
    "\n",
    "Apart from fixing the seed for reproducibility, episodic environments also require the argument `n_time_steps` which sets an upper bound on the number of time steps per episode. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_time_steps=100 # number of steps in a single episode\n",
    "seed=0 # set seed of rng (for reproducibility of the results)\n",
    "\n",
    "# discount factor\n",
    "gamma = 1.0\n",
    "\n",
    "# create environment class\n",
    "env=Episodic_GridWorldEnv(n_time_steps=n_time_steps,seed=seed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our first taks is to write a routine called `policy_rollout` which uses a policy `pi` to generate RL tranjectories \n",
    "\n",
    "$$\\tau = (S_0, A_0, R_1, S_1 A_1, R_2, \\dots, S_\\mathrm{terminal})$$\n",
    "\n",
    "Generating trajectories is often called a rollout in RL.\n",
    "\n",
    "Note that if we encounter a terminal state $S_\\mathrm{terminal}$, then we should stop the rollout. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy_rollout(pi, States, Actions, Rewards):\n",
    "    \"\"\"\n",
    "    pi: np.ndarray of dimension (m,n,a). \n",
    "        policy to generate trajectory from\n",
    "    States: list \n",
    "        append here the encountered states in\n",
    "    Actions: list \n",
    "        append here the encountered actions\n",
    "    Rewards: list \n",
    "        append here the encountered rewards \n",
    "    \n",
    "    \"\"\"\n",
    "    \n",
    "    \n",
    "    while env.current_step < env.n_time_steps:\n",
    "    \n",
    "        # read off state\n",
    "        state=env.state.copy()\n",
    "        \n",
    "        # pick a random action\n",
    "        action=np.random.choice([0,1,2,3], p=pi[state[0],state[1],:] )        \n",
    "        \n",
    "        # take an environment step\n",
    "        state_prime, reward, done = env.step(action)\n",
    "\n",
    "        # store trajectory\n",
    "        States.append(state)\n",
    "        Actions.append(action)\n",
    "        Rewards.append(reward)\n",
    "\n",
    "        # check if state is terminal\n",
    "        if done:\n",
    "            break\n",
    "                \n",
    "    return States, Actions, Rewards\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## First-visit MC prediction for estimating $V\\approx V_\\pi$\n",
    "\n",
    "We are now ready to code up the first-visit MC prediction algorithm to approximating the vaue function $V\\approx V_\\pi$.\n",
    "\n",
    "\n",
    "1. initialize the table for the approximate value function `V`, and the empty list `Returns` (see pseudicode in textbook)\n",
    "\n",
    "2. For each episode: \n",
    "    * re-set the environment to a random initial state, cf. `env.reset(random=True)`. \n",
    "    * roll out the policy to sample a trajectory\n",
    "    * loop backwards over the trajectory to evaluate the return G (you need to do extra work to guarantee the first-visit constraint, see pseudocode)\n",
    "    \n",
    "As usual, it is considered a better practice to write the routine itself in a separate function `first_visit_MC`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def first_visit_MC(N_trajectories, pi):\n",
    "    \"\"\"\n",
    "    N_trajectories: int\n",
    "        Number of trajectories to be sampled.\n",
    "    pi: np.ndarray of dimension (m,n,a). \n",
    "        policy to generate trajectories from.\n",
    "    \n",
    "    \"\"\"\n",
    "\n",
    "    # initialize variables\n",
    "    V = np.zeros((4,4),) # (m,n)\n",
    "    Returns = [[[] for j in range(4)] for i in range(4)]\n",
    "\n",
    "    # loop over N_trajectories to sample trajectories\n",
    "    for episode in range(N_trajectories):\n",
    "\n",
    "        # set env to a random initial state\n",
    "        env.reset(random=True)\n",
    "\n",
    "        # generate episode using pi\n",
    "        States  = []\n",
    "        Actions = []\n",
    "        Rewards = []\n",
    "\n",
    "        States,Actions,Rewards = policy_rollout(pi, States,Actions,Rewards)\n",
    "\n",
    "        # compute integer representation\n",
    "        States_ints = [state[1]*4+state[0] for state in States]\n",
    "\n",
    "        #print('\\nfinished sampling trajectory {} with {} steps.'.format(n,env.current_step))\n",
    "\n",
    "        # evaluate trajectory\n",
    "        G = 0.0\n",
    "        for t_rev in range(env.current_step-1,-1,-1):\n",
    "            # read off state\n",
    "            state     = States[t_rev]\n",
    "            m,n = state\n",
    "\n",
    "            # update G\n",
    "            G*=gamma\n",
    "            G+=Rewards[t_rev]\n",
    "\n",
    "            if np.sum([np.all(np.linalg.norm(state-S)<1E-13) for S in States[:t_rev]]) < 1:\n",
    "                # append returns\n",
    "                Returns[m][n].append(G)\n",
    "                # compute MC average\n",
    "                V[m,n]=np.mean(Returns[m][n])\n",
    "\n",
    "    return V\n",
    "                "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us test the routine \n",
    "\n",
    "1. using the equiprobable policy\n",
    "\n",
    "$$ \\pi(a|s) = \\frac{1}{|\\mathcal{A}(s)|} $$\n",
    "\n",
    "Note that $\\mathcal{A}(s)$ depends on the states $s$; in parciular, at the boundary the action space is smaller than in the bulk. \n",
    "\n",
    "2. define your own policy -- be creative!\n",
    "\n",
    "\n",
    "To do this, we call  we print the resulting approximate value function `V`. \n",
    "\n",
    "3. In both cases, we can set the number of sampled trajectories to `N_trajectories=10000`. How does the result change if we use fewer / more samples? \n",
    "\n",
    "4. The result we got for the equiprobable random policy differs from the $k\\to\\infty$ iteration shown in Fig 4.1 (Sutton & Barto). What is this difference due to? *Hint:* did we implement the same boundary conditions as in the book? Why does the definition of the action space play a role here? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  0., -11., -15., -16.],\n",
       "       [-11., -14., -16., -15.],\n",
       "       [-15., -16., -14., -11.],\n",
       "       [-16., -15., -11.,   0.]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# re-set the seed (reproducibility)\n",
    "np.random.seed(seed) \n",
    "\n",
    "# define equiprobable policy to be evaluated\n",
    "# loop over the states to define equiprobable policy: \\\n",
    "# the loop is needed because the action space is smaller at the boundary than in the bulk\n",
    "pi = np.zeros((4,4,4),) # (m,n,a)\n",
    "for m in range(4):\n",
    "    for n in range(4):\n",
    "        available_actions=env.actions[m,n]\n",
    "        pi[m,n,available_actions]=1.0/len(available_actions) # equiprobable; try out something else!\n",
    "\n",
    "# enable this policy to implement the uniform boundary condition and reproduce result from boo  \n",
    "# pi[...,:] = 0.25 \n",
    "    \n",
    "# number of trajectories to sample for the value function estimate\n",
    "N_trajectories = 10000 \n",
    "\n",
    "\n",
    "# use first-visit MC to learn an estimate for the value function $V\\approx v_\\pi$\n",
    "V = first_visit_MC(N_trajectories, pi)\n",
    "\n",
    "# print value function estimates\n",
    "np.round(V.T, 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Monte Carlo  Exploring Starts (ES), for estimating $\\pi\\approx\\pi_\\ast$ \n",
    "   \n",
    "Our second task is to code an algorithm which allows us to improve the policy using MC. In particular, here we focus in Monte Carlo with Exploring Starts (ES) to learn an approximation to the optimal policy $\\pi\\approx\\pi_\\ast$ .\n",
    "\n",
    "Use the pseudocode from Chapter 5.3 in Sutton & Barto to write the `MC_Exploring_Starts()` routine. This routine is similar (but somewhat different) to the `first_visit_MC()` function we wrote above. If you encouner difficulties, we recommend that you first extend `first_visit_MC()` to action-value, or `Q`-functions, and then return to the problem below. \n",
    "\n",
    "In the function below, we also allow to input an initial policy `pi` externally, so we can test the behavior starting from different policy afterwards. Of course, the routing should produce an optimal policy irrespective of the initial policy `pi`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def MC_Exploring_Starts(N_trajectories, pi):\n",
    "\n",
    "    # initialize variables\n",
    "    Q = np.zeros((4,4,4),) # (m,n,a)\n",
    "    Returns = [[[ [] for k in range(4)] for j in range(4)] for i in range(4)] # (m,n,a)\n",
    "\n",
    "    # loop over N_trajectories to sample trajectories\n",
    "    for episode in range(N_trajectories):\n",
    "\n",
    "        ### implement Exploring Starts\n",
    "        # choose random (state,action)-pair with equal probability\n",
    "        env.reset(random=True)\n",
    "        state=env.state.copy()\n",
    "\n",
    "        action = np.random.choice(env.actions[state[0],state[1]])\n",
    "\n",
    "\n",
    "        ### take initial state-action pair\n",
    "        _, reward, done = env.step(action)\n",
    "\n",
    "        ### generate episode using pi\n",
    "\n",
    "        # store initial data\n",
    "        States  = [state]\n",
    "        Actions = [action]\n",
    "        Rewards = [reward]\n",
    "\n",
    "        # roll out policy\n",
    "        if not done:\n",
    "            States,Actions,Rewards = policy_rollout(pi, States,Actions,Rewards)\n",
    "\n",
    "\n",
    "        # evaluate trajectory\n",
    "        G = 0.0\n",
    "        for t_rev in range(env.current_step-1,-1,-1):\n",
    "            # read off state\n",
    "            state = States[t_rev]\n",
    "            m,n = state\n",
    "\n",
    "            action = Actions[t_rev]\n",
    "\n",
    "            # update G\n",
    "            G*=gamma\n",
    "            G+=Rewards[t_rev]\n",
    "\n",
    "            # check if state action pair is not present in te trajectory (there likely is a better way to do this)\n",
    "            pair_absent_bool = np.sum([np.linalg.norm(state-S)<1E-13 * (action==A) for S,A in zip(States[:t_rev],Actions[:t_rev]) ]  ) < 1\n",
    "\n",
    "            if pair_absent_bool: # (S,A) pair not present in trajectory\n",
    "                # append returns\n",
    "                Returns[m][n][action].append(G)\n",
    "                # compute MC average\n",
    "                Q[m,n,action]=np.mean(Returns[m][n][action])\n",
    "                # improve policy\n",
    "                pi[m,n,:] = 0.0\n",
    "                pi[m,n,np.argmax(Q[m,n,:])] = 1.0\n",
    "\n",
    "\n",
    "    return Q, pi\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us test the `MC_Exploring_Starts` routine. \n",
    "\n",
    "1. You may start from any initial policy, including the equiprobable policy, or a random policy. Do you observa any difference? How does this change if we use fewer/more trajectories?\n",
    "\n",
    "2. Visualize the optimal $q_\\ast$ function `Q_star`. Wxtract the correspondig greedy policy from it; does it agree with the optimal policy `pi_star`?  \n",
    "\n",
    "3. Does the optimal policy depend on the boundary condition? *Hint:* to answer this, you'd have to modify the available action space in the environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0. -1. -2. -3.]\n",
      " [-1. -2. -3. -2.]\n",
      " [-2. -3. -2. -1.]\n",
      " [-4. -2. -1.  0.]]\n"
     ]
    }
   ],
   "source": [
    "# re-set the seed (reproducibility)\n",
    "np.random.seed(seed)\n",
    "\n",
    "\n",
    "# define equiprobable policy to be evaluated\n",
    "# loop over the states to define equiprobable policy: \\\n",
    "# the loop is needed because the action space is smaller at the boundary than in the bulk\n",
    "pi = np.zeros((4,4,4),) # (m,n,a)\n",
    "for m in range(4):\n",
    "    for n in range(4):\n",
    "        available_actions=env.actions[m,n]\n",
    "        pi[m,n,available_actions]=1.0/len(available_actions) # equiprobable; try out something else!\n",
    "\n",
    "        \n",
    "## define a random initial policy (has to be normalized to a probability distr)\n",
    "# pi=np.random.uniform(low=0.0,high=1.0, size=(4,4,4)) # random initial policy\n",
    "# norm=np.sum(pi,axis=2)\n",
    "# pi=np.einsum('ijk,ij->ijk', pi,1.0/norm)\n",
    "\n",
    "\n",
    "# MC step\n",
    "N_trajectories = 10000\n",
    "\n",
    "Q_star, pi_star = MC_Exploring_Starts(N_trajectories, pi)\n",
    "    \n",
    "# greedy policy from Q\n",
    "print( np.round( np.max(Q_star,axis=2).T,0) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## On-policy first-visit MC control (for $\\varepsilon$-soft policies), for estimating $\\pi\\approx\\pi_\\ast$\n",
    "\n",
    "We can remove the exploring starts requirement, by using an $\\varepsilon$-soft policy. Modify the function `MC_Exploring_Starts()` to a function `MC_control()` which implements this. The pseudocode is available in Sutton & Barto, Chapter 5.4. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "RL_class",
   "language": "python",
   "name": "rl_class"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "latex_metadata": {
   "affiliation": "Faculty of Physics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria",
   "author": "Marin Bukov",
   "title": "Reinforcement Learning Course: WiSe 2020/21"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}