{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Temporal Difference Learning\n", "\n", "In this notebook, we study Temporal Difference (TD) methods in practice. We follow closely the discussion in Sutton & Barto, Chapter 6.\n", "\n", "Like Monte Carlo methods, TD methods do not require knowledge of the transition probabilities $p(s',r|s,a)$, i.e. they are model-free; TD methods learn directly from experience. Like Dynamic Programming methods, TD methods update the value function incrementally, and can also be used for infinite-horizon tasks.\n", "\n", "Below, you will implement:\n", "\n", "1. Policy Evaluation using TD(0),\n", "2. On-Policy Control: SARSA,\n", "3. Off-Policy Control: Q-Learning.\n", "\n", "As an example, we use the WindyGridWorld environment defined in Notebook 2." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import import_ipynb\n", "from Notebook_2_RL_environments import WindyGridWorldEnv # import environment; notebooks must be in the same directory" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# set seed of rng (for reproducibility of the results)\n", "seed=0\n", "np.random.seed(seed)\n", "\n", "# create environment class\n", "env=WindyGridWorldEnv(seed=seed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Policy Evaluation\n", "\n", "First, we implement Policy Evaluation using TD(0)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [
"def policy_evaluation(pi, N_episodes, alpha, gamma, verbose=True):\n",
"    \"\"\"\n",
"    pi: np.ndarray\n",
"        policy to be evaluated.\n",
"    N_episodes: int\n",
"        number of training episodes.\n",
"    alpha: double\n",
"        learning rate or step-size parameter. Should be in the interval [0,1].\n",
"    gamma: double\n",
"        discount factor. Should be in the interval [0,1].\n",
"    verbose: bool\n",
"        whether or not to print progress.\n",
"\n",
"    \"\"\"\n",
"\n",
"    # initialize value function V\n",
"    V = np.zeros((10,7),)\n",
"\n",
"    # policy evaluation\n",
"    for episode in range(N_episodes):\n",
"\n",
"        # reset environment\n",
"        env.reset()\n",
"\n",
"        # loop over the steps in an episode\n",
"        done=False\n",
"        while not done:\n",
"\n",
"            # sample an action from the policy pi\n",
"            S=env.state.copy()\n",
"            A=np.random.choice([0,1,2,3], p=pi[S[0],S[1],:])\n",
"\n",
"            # take an environment step\n",
"            S_p, R, done = env.step(A)\n",
"\n",
"            # update value function\n",
"            V[S[0],S[1]] += alpha*(R + gamma*V[S_p[0],S_p[1]] - V[S[0],S[1]])\n",
"\n",
"        if episode%10==0 and verbose:\n",
"            print('finished episode {0:d}.'.format(episode))\n",
"\n",
"    return V" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define an equiprobable policy and evaluate it using TD learning. Use $\\alpha=0.5$, $\\gamma=0.9$, and `N_episodes=100`. \n", "\n", "1. What happens if you change the value of the discount factor $\\gamma$? Can you explain your observations?\n", "2. Does the result depend on the number of episodes used? Why?\n", "3. Try to evaluate the policy which deterministically takes the action `right`. What can go wrong here? Can you come up with a way to fix the issue you observe? (See the sketch in the next cell.)" ] },
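{ "cell_type": "markdown", "metadata": {}, "source": [ "*Hint for question 3:* a deterministic policy places all probability mass on a single action. The cell below is a minimal sketch of how such a policy could be set up; it assumes that the integer action index `1` corresponds to `right` (check the action encoding of the WindyGridWorld environment in Notebook 2 and adjust if necessary)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# sketch: a deterministic policy that always moves right\n",
"# (assumption: action index 1 corresponds to 'right' in WindyGridWorldEnv -- verify in Notebook 2)\n",
"A_RIGHT = 1\n",
"\n",
"# put all probability mass on 'right' in every state of the (10,7) grid\n",
"pi_right = np.zeros((10,7,4))\n",
"pi_right[:,:,A_RIGHT] = 1.0\n",
"\n",
"# evaluating this policy naively may produce episodes that never terminate;\n",
"# one possible fix is to cap the number of steps per episode inside policy_evaluation.\n",
"# V_right = policy_evaluation(pi_right, 100, 0.5, 0.9)" ] },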
{ "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "finished episode 0.\n", "finished episode 10.\n", "finished episode 20.\n", "finished episode 30.\n", "finished episode 40.\n", "finished episode 50.\n", "finished episode 60.\n", "finished episode 70.\n", "finished episode 80.\n", "finished episode 90.\n" ] }, { "data": { "text/plain": [ "array([[-10., -10., -10., -10., -10., -10., -10., -10., -10., -10.],\n", " [-10., -10., -10., -10., -10., -10., -10., -10., -10., -10.],\n", " [-10., -10., -10., -10., -10., -10., -10., -10., -10., -9.],\n", " [-10., -10., -10., -10., -10., -10., -10., 0., -10., -9.],\n", " [-10., -10., -10., -10., -10., -10., 0., -8., -5., -9.],\n", " [-10., -10., -10., -10., -10., 0., 0., -10., -10., -9.],\n", " [-10., -10., -10., -10., 0., 0., 0., 0., -7., -9.]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# learning rate\n", "alpha = 0.5\n", "\n", "# discount factor\n", "gamma = 0.9\n", "\n", "# number of episodes to collect data from\n", "N_episodes=100\n", "\n", "# define equiprobable policy\n", "pi_equiprob=0.25*np.ones((10,7,4))\n", "\n", "# evaluate the value function for the equiprobable policy\n", "V_equiprob = policy_evaluation(pi_equiprob, N_episodes, alpha, gamma)\n", "\n", "# print value function\n", "np.round(np.flipud(V_equiprob.T),0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SARSA\n", "\n", "We now turn to policy improvement and implement the SARSA algorithm. SARSA is an on-policy algorithm, i.e. the policy being improved is the same policy which generates the data. The algorithm makes use of generalized policy iteration to find an approximation to the optimal $Q$-function using experience (i.e. from data).\n", "\n", "SARSA requires us to be able to take actions according to an $\\varepsilon$-greedy policy. Therefore, your first task is to implement a function `take_eps_greedy_action(Qs,eps,avail_actions)` which takes an action according to the $\\varepsilon$-greedy policy w.r.t. the $Q$-function `Q(s,:)` for some fixed state $s$. For simplicity, we assume that all actions are available from every state (see the definition of the WindyGridWorld environment).\n", "\n", "Once we have `take_eps_greedy_action`, we can move on to implement the `SARSA(N_episodes, alpha, gamma, eps)` routine." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [
"def take_eps_greedy_action(Qs,eps,avail_actions=np.array([0,1,2,3])):\n",
"    \"\"\"\n",
"    Qs: np.ndarray\n",
"        the Q(s,:) values for a fixed state, over all available actions from that state.\n",
"    eps: double\n",
"        small number to define the eps-greedy policy.\n",
"    avail_actions: np.ndarray\n",
"        an array containing all allowed actions from the state s.\n",
"        For the WindyGridWorld environment, these are always all four actions.\n",
"\n",
"    \"\"\"\n",
"\n",
"    # compute greedy action\n",
"    a = np.argmax(Qs)\n",
"\n",
"    # draw a random number in [0,1]\n",
"    delta=np.random.uniform()\n",
"\n",
"    # with (small) probability eps/|A|, take a random non-greedy action\n",
"    if delta < eps/avail_actions.shape[0]:\n",
"        a = np.random.choice( avail_actions[np.arange(len(avail_actions)) != a] )\n",
"\n",
"    return a" ] },
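{ "cell_type": "markdown", "metadata": {}, "source": [ "Before plugging `take_eps_greedy_action` into SARSA, it can be useful to sanity-check it empirically. The cell below is an optional sketch with an arbitrary toy `Qs` vector: it draws many actions and prints their empirical frequencies; the greedy action should be selected the vast majority of the time, and the remaining actions only rarely, at a rate controlled by `eps`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# optional sanity check of the eps-greedy action selection (toy Q-values, chosen arbitrarily)\n",
"# note: this draws from the global rng, so running it affects the reproducibility of later cells\n",
"Qs_test = np.array([1.0, 0.0, 0.0, 0.0]) # action 0 is the greedy action\n",
"eps_test = 0.1\n",
"\n",
"# draw many actions and count how often each one is picked\n",
"counts = np.zeros(4)\n",
"for _ in range(10000):\n",
"    counts[take_eps_greedy_action(Qs_test, eps_test)] += 1\n",
"\n",
"# empirical selection frequencies; the greedy action should dominate\n",
"print(counts/counts.sum())" ] },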
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [
"def SARSA(N_episodes, alpha, gamma, eps):\n",
"    \"\"\"\n",
"    N_episodes: int\n",
"        number of training episodes.\n",
"    alpha: double\n",
"        learning rate or step-size parameter. Should be in the interval [0,1].\n",
"    gamma: double\n",
"        discount factor. Should be in the interval [0,1].\n",
"    eps: double\n",
"        the eps-greedy policy parameter. Controls exploration. Should be in the interval [0,1].\n",
"    \"\"\"\n",
"\n",
"    # initialize Q function\n",
"    Q = np.zeros((10,7,4),) # (10,7)-grid of 4 actions\n",
"\n",
"    # generalized policy iteration\n",
"    for episode in range(N_episodes):\n",
"\n",
"        # reset environment and compute initial state\n",
"        S=env.reset().copy()\n",
"        # take first action using eps-greedy policy\n",
"        A=take_eps_greedy_action(Q[S[0],S[1],:],eps)\n",
"\n",
"        # loop over the timesteps in the episode\n",
"        done=False\n",
"        while not done:\n",
"\n",
"            # take an environment step\n",
"            S_p, R, done = env.step(A)\n",
"\n",
"            # choose A_p using the eps-greedy policy\n",
"            A_p=take_eps_greedy_action(Q[S_p[0],S_p[1],:],eps)\n",
"\n",
"            # update Q function\n",
"            Q[S[0],S[1],A] += alpha*(R + gamma*Q[S_p[0],S_p[1],A_p] - Q[S[0],S[1],A])\n",
"\n",
"            # update state and action\n",
"            S=S_p.copy()\n",
"            A=A_p\n",
"\n",
"    return Q" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have implemented SARSA, we can estimate the optimal $q_\\ast(s,a)$ and $v_\\ast(s)$ functions for a fixed value of $\\varepsilon=0.1$.\n", "\n", "* What is the role of the parameter `N_episodes` when it comes to the convergence speed?\n", "* Print the value function corresponding to `Q_SARSA`.\n", "* Extract the optimal policy from `Q_SARSA` (a helper sketch is given after these questions).\n", "\n", "Use your code now to gain intuition about how the algorithm behaves:\n", "\n", "* How do the results change if you decrease the value of `eps`?\n", "* How do the results change if you increase/decrease the step-size parameter `alpha`?" ] },
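{ "cell_type": "markdown", "metadata": {}, "source": [ "For the policy-extraction task, the cell below sketches a small helper, `greedy_policy_from_Q` (a name introduced here for convenience), which reads off the greedy deterministic policy from a $Q$-table by taking the arg-max over the actions; the same pattern appears again in the Q-Learning cell further below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# helper sketch: extract the greedy (deterministic) policy from a Q function\n",
"def greedy_policy_from_Q(Q):\n",
"    # one-hot policy: probability 1 on the arg-max action in every state\n",
"    pi = np.zeros_like(Q)\n",
"    amax = np.argmax(Q, axis=2)\n",
"    for m in range(Q.shape[0]):\n",
"        for n in range(Q.shape[1]):\n",
"            pi[m,n,amax[m,n]] = 1.0\n",
"    return pi\n",
"\n",
"# possible usage, once Q_SARSA has been computed in the next cell:\n",
"# pi_SARSA = greedy_policy_from_Q(Q_SARSA)\n",
"# np.flipud(np.argmax(pi_SARSA, axis=2).T)  # greedy action index for each grid cell" ] },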
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-16., -16., -14., -13., -11., -10., -9., -8., -7., -6.],\n", " [-16., -16., -14., -12., -13., -11., -10., -10., -8., -5.],\n", " [-17., -16., -15., -14., -12., -12., -11., -7., -8., -4.],\n", " [-15., -15., -14., -13., -13., -11., -10., 0., -7., -3.],\n", " [-17., -16., -15., -14., -12., -11., 0., -1., -1., -2.],\n", " [-16., -15., -14., -13., -12., 0., 0., -2., -2., -4.],\n", " [-15., -14., -13., -13., 0., 0., 0., 0., -2., -4.]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# learning rate\n", "alpha = 0.5\n", "\n", "# discount factor\n", "gamma = 1.0\n", "\n", "# epsilon: small positive number used in the definition of the epsilon-greedy policy\n", "eps = 0.1\n", "\n", "# number of episodes to collect data from\n", "N_episodes=1000\n", "\n", "# use SARSA to compute the optimal Q function\n", "Q_SARSA=SARSA(N_episodes, alpha, gamma, eps)\n", "\n", "# extract and plot the corresponding value function\n", "V_SARSA=np.max(Q_SARSA,axis=2)\n", "np.round(np.flipud(V_SARSA.T),0)\n", "\n", "# compute and print the optimal policy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q-Learning\n", "\n", "Q-Learning is an off-policy algorithm, i.e. the policy being improved can be different from the behavior policy which generates the data. Like SARSA, Q-Learning makes use of generalized policy iteration to find an approximation to the optimal $Q$-function using experience (i.e. from data).\n", "\n", "The Q-Learning algorithm forms the basis of many modern Deep RL methods; its off-policy character allows learning from old data, which makes it particularly suitable for data-driven deep learning approaches.\n", "\n", "As with SARSA, we will make use of the routine `take_eps_greedy_action` defined above." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [
"def Q_Learning(N_episodes, alpha, gamma, eps):\n",
"    \"\"\"\n",
"    N_episodes: int\n",
"        number of training episodes.\n",
"    alpha: double\n",
"        learning rate or step-size parameter. Should be in the interval [0,1].\n",
"    gamma: double\n",
"        discount factor. Should be in the interval [0,1].\n",
"    eps: double\n",
"        the eps-greedy policy parameter. Controls exploration. Should be in the interval [0,1].\n",
"    \"\"\"\n",
"\n",
"    # initialize Q function\n",
"    Q = np.zeros((10,7,4),) # (10,7)-grid of 4 actions\n",
"\n",
"    # generalized policy iteration\n",
"    for episode in range(N_episodes):\n",
"\n",
"        # reset environment and compute initial state\n",
"        S=env.reset().copy()\n",
"\n",
"        # loop over the timesteps in the episode\n",
"        done=False\n",
"        while not done:\n",
"\n",
"            # choose action using the eps-greedy behavior policy\n",
"            A=take_eps_greedy_action(Q[S[0],S[1],:],eps)\n",
"\n",
"            # take an environment step\n",
"            S_p, R, done = env.step(A)\n",
"\n",
"            # update Q function\n",
"            Q[S[0],S[1],A] += alpha*(R + gamma*np.max(Q[S_p[0],S_p[1],:]) - Q[S[0],S[1],A])\n",
"\n",
"            # update state\n",
"            S=S_p.copy()\n",
"\n",
"    return Q" ] },
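{ "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the two implementations above, the only difference is the bootstrap target used in the one-step update:\n", "\n", "$$Q(S,A) \\leftarrow Q(S,A) + \\alpha\\left[R + \\gamma\\, Q(S',A') - Q(S,A)\\right] \\qquad \\text{(SARSA)},$$\n", "\n", "$$Q(S,A) \\leftarrow Q(S,A) + \\alpha\\left[R + \\gamma\\, \\max_a Q(S',a) - Q(S,A)\\right] \\qquad \\text{(Q-Learning)}.$$\n", "\n", "SARSA bootstraps with the action $A'$ actually selected by the $\\varepsilon$-greedy behavior policy (on-policy), whereas Q-Learning bootstraps with the greedy action in the next state $S'$ (off-policy)." ] },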
{ "cell_type": "markdown", "metadata": {}, "source": [ "As with SARSA, we first estimate the optimal $q_\\ast(s,a)$ and $v_\\ast(s)$ functions for a fixed value of $\\varepsilon=0.1$.\n", "\n", "* What is the role of the parameter `N_episodes` when it comes to the convergence speed?\n", "* Print the value function `V_QL` corresponding to `Q_QL`.\n", "* Extract the optimal policy from `Q_QL`.\n", "* Now use the `policy_evaluation` routine to evaluate the optimal policy and visualize its value function `V`. What is the relation between the value function `V` and the value function `V_QL` obtained from `Q_QL`?\n", "\n", "Use your code now to gain intuition about how the algorithm behaves:\n", "\n", "* How do the results change if you decrease the value of `eps`?\n", "* How do the results change if you increase/decrease the step-size parameter `alpha`?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 0., 0., 0., 0., -9., -8., -7., -6.],\n", " [ 0., 0., 0., 0., 0., -10., 0., 0., 0., -5.],\n", " [ 0., 0., 0., 0., -11., 0., 0., 0., 0., -4.],\n", " [-15., -14., -13., -12., 0., 0., 0., 0., 0., -3.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., -1., -2.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# learning rate\n", "alpha = 0.5\n", "\n", "# discount factor\n", "gamma = 1.0\n", "\n", "# epsilon: small positive number used in the definition of the epsilon-greedy policy\n", "eps = 0.1\n", "\n", "# number of episodes to collect data from\n", "N_episodes=1000\n", "\n", "# use Q-Learning to compute the optimal Q function\n", "Q_QL=Q_Learning(N_episodes, alpha, gamma, eps)\n", "\n", "# extract and plot the corresponding value function\n", "V_QL=np.max(Q_QL,axis=2)\n", "np.round(np.flipud(V_QL.T),0)\n", "\n", "# compute and print the optimal policy\n", "pi_QL=np.zeros((10,7,4))\n", "\n", "# greedy policy associated with Q function\n", "amax=np.argmax(Q_QL,axis=2)\n", "for m in range(10):\n", "    for n in range(7):\n", "        pi_QL[m,n,amax[m,n]]=1\n", "\n", "# now use the policy_evaluation routine to evaluate the optimal policy and visualize its value function\n", "V=policy_evaluation(pi_QL, N_episodes, alpha, gamma, verbose=False)\n", "np.round(np.flipud(V.T),0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "RL_class", "language": "python", "name": "rl_class" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "latex_metadata": { "affiliation": "Faculty of Physics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria", "author": "Marin Bukov", "title": "Reinforcement Learning Course: WiSe 2020/21" } }, "nbformat": 4, "nbformat_minor": 4 }