{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reinforcement Learning (RL) Environments\n", "\n", "\n", "In this notebook, we define the backbone code for RL environments, following [OpenAI Gym](https://gym.openai.com/). \n", "\n", "Then, we create some example environments that we shall use in subsequent coding sessions throught the course: we will create three gridworld environments: GridWorld, GridWorld2, and Windy GridWorld. We also create a Qubit environment, and discuss some OpenAI Gym environments. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy.linalg import expm" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "class MyEnv():\n", " \"\"\"\n", " Gym style environment for RL. You may also inherit the class structure from OpenAI Gym. \n", " Parameters:\n", " n_time_steps: int\n", " Total number of time steps within each episode\n", " seed: int\n", " seed of the RNG (for reproducibility)\n", " \"\"\"\n", " \n", " def __init__(self, n_time_steps, seed):\n", " \"\"\"\n", " Initialize the environment.\n", " \n", " \"\"\"\n", " \n", " ### define action space variables\n", " \n", " ### define state space variables\n", " \n", " \n", " \n", " pass\n", "\n", " \n", " \n", " def step(self, action):\n", " \"\"\"\n", " Interface between environment and agent. Performs one step in the environemnt.\n", " Parameters:\n", " action: int\n", " the index of the respective action in the action array\n", " Returns:\n", " output: ( object, float, bool)\n", " information provided by the environment about its current state:\n", " (state, reward, done)\n", " \"\"\"\n", "\n", " pass\n", "\n", " return self.state, reward, done\n", "\n", " \n", " \n", " def set_seed(self,seed=0):\n", " \"\"\"\n", " Sets the seed of the RNG.\n", " \n", " \"\"\"\n", " pass\n", " \n", " \n", " \n", " def reset(self):\n", " \"\"\"\n", " Resets the environment to its initial values.\n", " Returns:\n", " state: object\n", " the initial state of the environment\n", " \"\"\"\n", " pass\n", "\n", " return self.state\n", "\n", " \n", " \n", " def render(self):\n", " \"\"\"\n", " Plots the state of the environment. For visulization purposes only. \n", "\n", " \"\"\"\n", " pass\n", " \n", " \n", " # ... add extra private and public functions as necessary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GridWorld\n", "\n", "Consider the GridWorld problem Example 3.5 from Sutton & Barto's \"Reinforcement Learning: an Introduction\", (MIT Press, 2018):\n", "\n", "A $5\\times 5$ grid with open boundary conditions has two pairs of special states: $(A,A')$ and $(B,B')$, such that from state $A$ ($B$) the environment always goes into stte $A'$ ($B'$). The state transitions receive the rewards $r(s,s')$: \n", "\n", "1. $r(A\\to A')=+10$\n", "2. $r(B\\to B')=+5$\n", "3. $r(s',s)=0$ for all other states (except when a move from a boundary state $s$ tries to leave the grid, in which case $r=-1$).\n", "\n", "From each state $s$, the RL agent can take four possible actions $a$: $north$, $south$, $east$, and $west$.\n", "\n", "The **action space** is discrete four-element set $\\mathcal{A}=(north, south, east, west)$\n", "\n", "The **state space** is the two-dimensional grid $\\mathcal{S}=\\mathbb{Z}_5^2$: each state $s=(m,n)$ is labeled by two integers $m,n\\in\\{0,1,2,3,4\\}$. 
The special states have the coordinates $A=(1,4)$, $A'=(1,0)$, $B=(3,4)$, and $B'=(3,2)$.\n", "\n", "Finally, the **reward space** is given by the discrete set $\mathcal{R}=\{-1,0,5,10\}$.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "class GridWorldEnv():\n", " \"\"\"\n", " Gym style environment for GridWorld\n", " Parameters:\n", " n_time_steps: int\n", " Total number of time steps within each episode\n", " seed: int\n", " seed of the RNG (for reproducibility)\n", " \"\"\"\n", " \n", " def __init__(self, n_time_steps=10, seed=0):\n", " \"\"\"\n", " Initialize the environment.\n", " \n", " \"\"\"\n", " \n", " self.n_time_steps = n_time_steps\n", " \n", " ### define action space variables\n", " self.actions=np.array([0,1,2,3])\n", " #['north', 'south', 'east', 'west'] in coordinate form\n", " self.action_space = [np.array([0,1]), np.array([0,-1]), np.array([1,0]), np.array([-1,0])] \n", " \n", " ### define state space variables\n", " self.state_A = np.array([1,4])\n", " self.state_Ap = np.array([1,0])\n", " self.state_B = np.array([3,4])\n", " self.state_Bp = np.array([3,2])\n", " \n", " # set seed\n", " self.set_seed(seed)\n", " self.reset()\n", "\n", " \n", " \n", " def step(self, action):\n", " \"\"\"\n", " Interface between environment and agent. Performs one step in the environment.\n", " Parameters:\n", " action: int\n", " the index of the respective action in the action array\n", " Returns:\n", " output: ( np.array, float, bool)\n", " information provided by the environment about its current state:\n", " (state, reward, done)\n", " \"\"\"\n", " \n", " # check if action tries to take state across the grid boundary \n", " bdry_bool= (self.state[0]==0 and action==3) or (self.state[0]==4 and action==2) \\\n", " or (self.state[1]==0 and action==1) or (self.state[1]==4 and action==0)\n", " \n", " \n", " # environment dynamics (deterministic)\n", " if np.linalg.norm(self.state - self.state_A) < 1E-14:\n", " self.state=self.state_Ap.copy()\n", " reward=10\n", " elif np.linalg.norm(self.state - self.state_B) < 1E-14:\n", " self.state=self.state_Bp.copy()\n", " reward=5\n", " elif bdry_bool:\n", " reward=-1\n", " else:\n", " self.state+=self.action_space[action]\n", " reward=0\n", " \n", " done=False # infinite-horizon task\n", " \n", " self.current_step += 1\n", " \n", " return self.state, reward, done\n", "\n", " \n", " \n", " def set_seed(self,seed=0):\n", " \"\"\"\n", " Sets the seed of the RNG.\n", " \n", " \"\"\"\n", " np.random.seed(seed)\n", " \n", " \n", " \n", " def reset(self):\n", " \"\"\"\n", " Resets the environment to its initial values.\n", " Returns:\n", " state: np.array\n", " the initial state of the environment\n", " \"\"\"\n", " self.current_step = 0\n", " \n", " self.state = np.array([2,2]) #initialize to some state on the grid\n", " return self.state\n", " \n", " \n", " \n", " def sample(self):\n", " \"\"\"\n", " Returns a randomly sampled action.\n", " \"\"\"\n", " return np.random.choice(self.actions) # equiprobable policy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now test the GridWorld environment. We do so by fixing the number of time steps, `n_time_steps`, and the `seed`. We then create the environment and reset it. Finally, we loop over the time steps: at every step we sample a random action (equiprobable policy), let the environment take a step, and print the resulting transition $(s, a, r, s')$. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0. s=[2 2], a=[0 1], r=0, s'=[2 3]\n", "1. s=[2 3], a=[-1 0], r=0, s'=[1 3]\n", "2. 
s=[1 3], a=[ 0 -1], r=0, s'=[1 2]\n", "3. s=[1 2], a=[0 1], r=0, s'=[1 3]\n", "4. s=[1 3], a=[-1 0], r=0, s'=[0 3]\n", "5. s=[0 3], a=[-1 0], r=-1, s'=[0 3]\n", "6. s=[0 3], a=[-1 0], r=-1, s'=[0 3]\n", "7. s=[0 3], a=[-1 0], r=-1, s'=[0 3]\n", "8. s=[0 3], a=[ 0 -1], r=0, s'=[0 2]\n", "9. s=[0 2], a=[-1 0], r=-1, s'=[0 2]\n", "10. s=[0 2], a=[ 0 -1], r=0, s'=[0 1]\n", "11. s=[0 1], a=[1 0], r=0, s'=[1 1]\n", "12. s=[1 1], a=[0 1], r=0, s'=[1 2]\n", "13. s=[1 2], a=[-1 0], r=0, s'=[0 2]\n", "14. s=[0 2], a=[1 0], r=0, s'=[1 2]\n", "15. s=[1 2], a=[0 1], r=0, s'=[1 3]\n", "16. s=[1 3], a=[0 1], r=0, s'=[1 4]\n", "17. s=[1 4], a=[0 1], r=10, s'=[1 0]\n", "18. s=[1 0], a=[1 0], r=0, s'=[2 0]\n", "19. s=[2 0], a=[ 0 -1], r=-1, s'=[2 0]\n" ] } ], "source": [ "n_time_steps=20\n", "seed=0\n", "\n", "env=GridWorldEnv(n_time_steps=n_time_steps,seed=seed)\n", "env.reset()\n", "\n", "for _ in range(n_time_steps):\n", " \n", " # pick a random action\n", " action=env.sample() # equiprobable policy\n", " \n", " # take an environment step\n", " state=env.state.copy()\n", " state_p, reward, done = env.step(action)\n", " \n", " print(\"{}. s={}, a={}, r={}, s'={}\".format(_, state, env.action_space[action], reward, state_p))\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GridWorld 2\n", "\n", "This is a finite-horizon, i.e. episodic, GridWorld environment. We consider the $4\times 4$ grid from Example 4.1 in Sutton & Barto. \n", "\n", "**state space:** $\mathcal{S} = \{0,1,2,\dots,15\}$, where $s=0$ and $s=15$ both label the (single) terminal state. \n", "\n", "**action space:** $\mathcal{A} = \{north, south, east, west\}$. Actions trying to take the agent off the grid leave the state unchanged: to implement this behavior, we will define smaller action spaces $\mathcal{A}(s_\mathrm{boundary})$ for all states $s_\mathrm{boundary}$ at the boundary of the grid (see the short sketch at the end of this cell).\n", "\n", "**reward space:** $\mathcal{R}=\{-1\}$; $r(s,s',a)=-1$ for all states $s,s'\in\mathcal{S}$ and all allowed actions $a\in\mathcal{A}(s)$. 
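\n", "\n", "One way to build these restricted action sets programmatically is to keep, for every cell, only the moves that stay on the grid. Below is a minimal sketch (the dictionary name `allowed_actions` is ours for illustration; it uses the same $(m,n)$ grid labelling and the same action ordering as the class that follows, which instead enumerates the sets explicitly):\n", "\n", "```python\n", "import numpy as np\n", "\n", "# 'north', 'south', 'east', 'west' as coordinate displacements (dm, dn)\n", "action_space = [np.array([0,1]), np.array([0,-1]), np.array([1,0]), np.array([-1,0])]\n", "\n", "allowed_actions = {}\n", "for m in range(4):\n", "    for n in range(4):\n", "        allowed_actions[m, n] = np.array(\n", "            [a for a, move in enumerate(action_space)\n", "             if 0 <= m + move[0] <= 3 and 0 <= n + move[1] <= 3]\n", "        )\n", "\n", "allowed_actions[0, 0]  # array([0, 2]): only north and east are allowed in this corner\n", "```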
\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "class Episodic_GridWorldEnv():\n", " \"\"\"\n", " Gym style environment for GridWorld\n", " Parameters:\n", " n_time_steps: int\n", " Total number of time steps within each episode\n", " seed: int\n", " seed of the RNG (for reproducibility)\n", " \"\"\"\n", " \n", " def __init__(self, n_time_steps=10, seed=0):\n", " \"\"\"\n", " Initialize the environment.\n", " \n", " \"\"\"\n", " \n", " self.n_time_steps = n_time_steps\n", " \n", " ### define action space variables\n", " #['north', 'south', 'east', 'west']\n", " self.action_space = [np.array([0,1]), np.array([0,-1]), np.array([1,0]), np.array([-1,0])]\n", " # define the allowed actions from every state s, taking into account the boundary\n", " self.actions={}\n", " for m in range(4):\n", " for n in range(4):\n", " \n", " if m==0: \n", " if n==0:\n", " self.actions[m,n]=np.array([0,2])\n", " elif n==3:\n", " self.actions[m,n]=np.array([1,2])\n", " else:\n", " self.actions[m,n]=np.array([0,1,2])\n", " \n", " elif m==3:\n", " if n==0:\n", " self.actions[m,n]=np.array([0,3])\n", " elif n==3:\n", " self.actions[m,n]=np.array([1,3])\n", " else:\n", " self.actions[m,n]=np.array([0,1,3])\n", " \n", " elif 0\n", " \"\"\"\n", " theta, phi = s\n", " psi = np.array([np.cos(0.5*theta), np.exp(1j*phi)*np.sin(0.5*theta)] )\n", " return psi\n", " \n", " \n", " def qubit_to_RL_state(self,psi):\n", " \"\"\"\n", " Take as input the RL state s, and return the quantum state |psi>\n", " \"\"\"\n", " # take away unphysical global phase\n", " alpha = np.angle(psi[0])\n", " psi_new = np.exp(-1j*alpha) * psi \n", " \n", " # find Bloch sphere angles\n", " theta = 2.0*np.arccos(psi_new[0]).real\n", " phi = np.angle(psi_new[1])\n", " \n", " return np.array([theta, phi])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class QubitEnv2():\n", " \"\"\"\n", " Gym style environment for RL. You may also inherit the class structure from OpenAI Gym. \n", " Parameters:\n", " n_time_steps: int\n", " Total number of time steps within each episode\n", " seed: int\n", " seed of the RNG (for reproducibility)\n", " \"\"\"\n", " \n", " def __init__(self, n_time_steps, seed):\n", " \"\"\"\n", " Initialize the environment.\n", " \n", " \"\"\"\n", " \n", " self.n_time_steps = n_time_steps\n", " \n", " \n", " ### define action space variables\n", " self.n_actions = 4 # action space size\n", " delta_t = 2*np.pi/n_time_steps # set a value for the time step\n", " # define Pauli matrices\n", " Id =np.array([[1.0,0.0 ], [0.0 ,+1.0]])\n", " sigma_x=np.array([[0.0,1.0 ], [1.0 , 0.0]])\n", " sigma_y=np.array([[0.0,-1.0j], [1.0j, 0.0]])\n", " sigma_z=np.array([[1.0,0.0 ], [0.0 ,-1.0]])\n", " \n", " self.action_space=[]\n", " for generator in [Id, sigma_x, sigma_y, sigma_z]:\n", " self.action_space.append( expm(-1j*delta_t*generator) )\n", " \n", " self.actions = np.array([0,1,2,3])\n", " \n", " \n", " ### define state space variables\n", " self.S_target = np.array([0.0,0.0])\n", " self.psi_target = self.RL_to_qubit_state(self.S_target)\n", " \n", " \n", " # set seed\n", " self.set_seed(seed)\n", " self.reset()\n", " \n", " \n", " def step(self, action):\n", " \"\"\"\n", " Interface between environment and agent. 
Performs one step in the environment.\n", " Parameters:\n", " action: int\n", " the index of the respective action in the action array\n", " Returns:\n", " output: ( object, float, bool)\n", " information provided by the environment about its current state:\n", " (state, reward, done)\n", " \"\"\"\n", "\n", " # apply gate to quantum state\n", " self.psi = self.action_space[action].dot(self.psi)\n", " \n", " # compute RL state\n", " self.state = self.qubit_to_RL_state(self.psi)\n", " \n", " # compute reward: fidelity w.r.t. the target state\n", " reward = np.abs( self.psi_target.conj().dot(self.psi) )**2\n", " \n", " \n", " # check if state is terminal\n", " done=False\n", " \n", "\n", " return self.state, reward, done\n", "\n", " \n", " \n", " def set_seed(self,seed=0):\n", " \"\"\"\n", " Sets the seed of the RNG.\n", " \n", " \"\"\"\n", " np.random.seed(seed)\n", " \n", " \n", " \n", " def reset(self, random=True):\n", " \"\"\"\n", " Resets the environment to its initial values.\n", " Returns:\n", " state: object\n", " the initial state of the environment\n", " random: bool\n", " controls whether the initial state is a random state on the sphere or a fixed initial state.\n", " \"\"\"\n", " \n", " if random:\n", " theta = np.pi*np.random.uniform(0.0,1.0)\n", " phi = 2*np.pi*np.random.uniform(0.0,1.0)\n", " else:\n", " # start from south pole of Bloch sphere\n", " theta=np.pi\n", " phi=0.0\n", " \n", " self.state=np.array([theta,phi])\n", " self.psi=self.RL_to_qubit_state(self.state)\n", "\n", " return self.state\n", "\n", " \n", " \n", " def render(self):\n", " \"\"\"\n", " Plots the state of the environment. For visualization purposes only. \n", "\n", " \"\"\"\n", " pass\n", " \n", " \n", " def RL_to_qubit_state(self,s):\n", " \"\"\"\n", " Take as input the RL state s, and return the quantum state |psi>\n", " \"\"\"\n", " theta, phi = s\n", " psi = np.array([np.cos(0.5*theta), np.exp(1j*phi)*np.sin(0.5*theta)] )\n", " return psi\n", " \n", " \n", " def qubit_to_RL_state(self,psi):\n", " \"\"\"\n", " Take as input the quantum state |psi>, and return the RL state s=(theta, phi)\n", " \"\"\"\n", " # take away unphysical global phase\n", " alpha = np.angle(psi[0])\n", " psi_new = np.exp(-1j*alpha) * psi \n", " \n", " # find Bloch sphere angles\n", " theta = 2.0*np.arccos(psi_new[0]).real\n", " phi = np.angle(psi_new[1])\n", " \n", " return np.array([theta, phi])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0. s=[2.58 0.26], a=0, r=0.076805, s'=[2.58 0.26]\n", "\n", "1. s=[2.58 0.26], a=2, r=0.047907, s'=[2.7 0.33]\n", "\n", "2. s=[2.7 0.33], a=1, r=0.060115, s'=[2.65 0.55]\n", "\n", "3. s=[2.65 0.55], a=1, r=0.079261, s'=[2.57 0.73]\n", "\n", "4. s=[2.57 0.73], a=3, r=0.079261, s'=[2.57 0.85]\n", "\n", "5. s=[2.57 0.85], a=1, r=0.108041, s'=[2.47 0.96]\n", "\n", "6. s=[2.47 0.96], a=1, r=0.143001, s'=[2.37 1.04]\n", "\n", "7. s=[2.37 1.04], a=2, r=0.123498, s'=[2.42 1.16]\n", "\n", "8. s=[2.42 1.16], a=3, r=0.123498, s'=[2.42 1.28]\n", "\n", "9. s=[2.42 1.28], a=1, r=0.166017, s'=[2.3 1.32]\n", "\n", "10. s=[2.3 1.32], a=0, r=0.166017, s'=[2.3 1.32]\n", "\n", "11. s=[2.3 1.32], a=2, r=0.156982, s'=[2.33 1.43]\n", "\n", "12. s=[2.33 1.43], a=0, r=0.156982, s'=[2.33 1.43]\n", "\n", "13. s=[2.33 1.43], a=1, r=0.204839, s'=[2.2 1.45]\n", "\n", "14. s=[2.2 1.45], a=0, r=0.204839, s'=[2.2 1.45]\n", "\n", "15. s=[2.2 1.45], a=2, r=0.200836, s'=[2.21 1.54]\n", "\n", "16. s=[2.21 1.54], a=0, r=0.200836, s'=[2.21 1.54]\n", "\n", "17. 
s=[2.21 1.54], a=0, r=0.200836, s'=[2.21 1.54]\n", "\n", "18. s=[2.21 1.54], a=3, r=0.200836, s'=[2.21 1.66]\n", "\n", "19. s=[2.21 1.66], a=3, r=0.200836, s'=[2.21 1.79]\n", "\n", "20. s=[2.21 1.79], a=2, r=0.214082, s'=[2.18 1.88]\n", "\n", "21. s=[2.18 1.88], a=1, r=0.265354, s'=[2.06 1.85]\n", "\n", "22. s=[2.06 1.85], a=1, r=0.320326, s'=[1.94 1.84]\n", "\n", "23. s=[1.94 1.84], a=2, r=0.337244, s'=[1.9 1.88]\n", "\n", "24. s=[1.9 1.88], a=3, r=0.337244, s'=[1.9 2.01]\n", "\n", "25. s=[1.9 2.01], a=1, r=0.39219, s'=[1.79 1.99]\n", "\n", "26. s=[1.79 1.99], a=2, r=0.418166, s'=[1.74 2.02]\n", "\n", "27. s=[1.74 2.02], a=2, r=0.445432, s'=[1.68 2.03]\n", "\n", "28. s=[1.68 2.03], a=3, r=0.445432, s'=[1.68 2.16]\n", "\n", "29. s=[1.68 2.16], a=1, r=0.497747, s'=[1.58 2.15]\n", "\n", "30. s=[1.58 2.15], a=1, r=0.550098, s'=[1.47 2.16]\n", "\n", "31. s=[1.47 2.16], a=1, r=0.601659, s'=[1.37 2.17]\n", "\n", "32. s=[1.37 2.17], a=0, r=0.601659, s'=[1.37 2.17]\n", "\n", "33. s=[1.37 2.17], a=1, r=0.651617, s'=[1.26 2.19]\n", "\n", "34. s=[1.26 2.19], a=2, r=0.684893, s'=[1.19 2.15]\n", "\n", "35. s=[1.19 2.15], a=0, r=0.684893, s'=[1.19 2.15]\n", "\n", "36. s=[1.19 2.15], a=2, r=0.715252, s'=[1.13 2.1 ]\n", "\n", "37. s=[1.13 2.1 ], a=3, r=0.715252, s'=[1.13 2.23]\n", "\n", "38. s=[1.13 2.23], a=2, r=0.748103, s'=[1.05 2.17]\n", "\n", "39. s=[1.05 2.17], a=3, r=0.748103, s'=[1.05 2.3 ]\n", "\n", "40. s=[1.05 2.3 ], a=3, r=0.748103, s'=[1.05 2.43]\n", "\n", "41. s=[1.05 2.43], a=1, r=0.781842, s'=[0.97 2.49]\n", "\n", "42. s=[0.97 2.49], a=0, r=0.781842, s'=[0.97 2.49]\n", "\n", "43. s=[0.97 2.49], a=0, r=0.781842, s'=[0.97 2.49]\n", "\n", "44. s=[0.97 2.49], a=3, r=0.781842, s'=[0.97 2.61]\n", "\n", "45. s=[0.97 2.61], a=1, r=0.805741, s'=[0.91 2.69]\n", "\n", "46. s=[0.91 2.69], a=1, r=0.824818, s'=[0.86 2.79]\n", "\n", "47. s=[0.86 2.79], a=2, r=0.866945, s'=[0.75 2.74]\n", "\n", "48. s=[0.75 2.74], a=2, r=0.903284, s'=[0.63 2.68]\n", "\n", "49. s=[0.63 2.68], a=2, r=0.933263, s'=[0.52 2.59]\n", "\n", "50. s=[0.52 2.59], a=3, r=0.933263, s'=[0.52 2.71]\n", "\n", "51. s=[0.52 2.71], a=0, r=0.933263, s'=[0.52 2.71]\n", "\n", "52. s=[0.52 2.71], a=1, r=0.942903, s'=[0.48 2.93]\n", "\n", "53. s=[0.48 2.93], a=2, r=0.967834, s'=[0.36 2.86]\n", "\n", "54. s=[0.36 2.86], a=2, r=0.985387, s'=[0.24 2.72]\n", "\n", "55. s=[0.24 2.72], a=2, r=0.995285, s'=[0.14 2.34]\n", "\n", "\n", "reached terminal state!\n" ] } ], "source": [ "np.set_printoptions(suppress=True,precision=2)\n", "\n", "n_time_steps = 100\n", "seed=6\n", "\n", "env=QubitEnv(n_time_steps,seed)\n", "env.reset(random=True)\n", "\n", "done=False\n", "j=0\n", "while j < n_time_steps:\n", " \n", " # pick a random action\n", " action=np.random.choice([0,1,2,3]) # equiprobable policy\n", " \n", " # take an environment step\n", " state=env.state.copy()\n", " state_p, reward, done = env.step(action)\n", " \n", " print(\"{}. s={}, a={}, r={}, s'={}\\n\".format(j, state, action, np.round(reward,6), state_p))\n", " \n", " j+=1\n", " \n", " if done:\n", " print('\\nreached terminal state!')\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OpenAI Gym Environments\n", "\n", "Next, we shall look at some OpenAI Gym environments: Atari video games, the Cart Pole problem, and the Mountain Car problem. 
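\n", "\n", "All of these environments expose the same `reset`/`step` interface that we mimicked above. As a minimal sketch (assuming the pre-0.26 Gym API used in this notebook, where `step` returns a 4-tuple and `render` accepts a `mode` argument), a bare-bones interaction loop looks like this:\n", "\n", "```python\n", "import gym\n", "\n", "env = gym.make('CartPole-v1')\n", "print(env.observation_space)  # e.g. Box(4,)\n", "print(env.action_space)       # e.g. Discrete(2)\n", "\n", "state = env.reset()\n", "for _ in range(10):\n", "    action = env.action_space.sample()            # random (equiprobable) policy\n", "    state, reward, done, info = env.step(action)  # Gym's analogue of our step()\n", "    if done:\n", "        state = env.reset()\n", "env.close()\n", "```\n", "\n", "The cell below runs the same kind of loop on an Atari game, additionally rendering each frame.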
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAM4AAAD8CAYAAAA/rZtiAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/d3fzzAAAACXBIWXMAAAsTAAALEwEAmpwYAAARYklEQVR4nO3de4xc9XnG8e/jtRc7xsTrGAgxTvAFUEyTOokLldKgtBBjUBWHSqRGEXULshMploiSokJIi1UVqaQxUZUqREaguE3KpRASpCYtFkIgCImxEwMG42AbB3zJOnHSeH2R7d19+8c5a2aXHe/M78zsnBmejzSaOb9zmfdo99k5c/bMO4oIzKw+E1pdgFk7cnDMEjg4ZgkcHLMEDo5ZAgfHLEHTgiNpiaRtkrZLurlZz2PWCmrG/3EkdQG/AD4B7AaeA66NiJcb/mRmLdCsV5yLge0RsTMijgP3A0ub9Fxm425ik7Y7C3ijYno3cEm1hSX58gUro99ExJmjzWhWcDTK2LBwSFoJrGzS85s1wi+rzWhWcHYDsyumzwX2Vi4QEWuBteBXHGs/zXqP8xxwvqQ5krqBZcCjTXous3HXlFeciOiXtAr4X6ALuDciXmrGc5m1QlNOR9ddRAkP1a677jrmzZtX8/IHDx7kzjvvPDktidtuu62u53zooYfYsmXLyelLLrmEK6+8sq5trF69uq7lxzJz5kxWrVpV1zpr1qyhr6+voXWM9JWvfIWJE9/8u/+Nb3yDAwcONPppNkXEotFmNOs9TtubMmUKZ5xxRs3LDw4OvmWsnvWBYb8IAN3d3XVtoxl/BCdMmFD3fkijnRtqrGnTpjFp0qST0xMmjO9FMA5OjZ5++mmeeeaZk9Nz587lmmuuqWsba9asob+//+T0ihUrmDFjRs3r79mzh+985zsnpydPnsyNN95YVw31OnDgAHfccccpl7npppvG/Re31RycGh06dIje3t6T0z09PXVvo7e3d1hwKh/X4sSJE8NqmDJlSt011GtgYGDYc1rGwbFTmj59OsuWLTvlMuNxaFY2Do6dUnd3NxdccEGryygdB8fqMjAwwAMPPHDKZY4ePTpO1bSOg2N1GRwcZOPGja0uo+UcnBrNnz9/2JmjmTNn1r2NxYsXDzttPXXq1LrWnz59OkuWLDk5XXk6drxMmDBhWA2jeeKJJzh27Ng4VdQaDk6N5s+fz/z58wtt4/LLLy+0/vTp01m8eHGhbRTV1dU1Zg0//vGPHZy3q1deeYXf/e53NS8/2nH9s88+W9dzjvzP969+9au6t9FoR48erbuG48ePN6maN23YsGHYEcB4v6/yJTdm1ZX7kpvJkyczZ86cVpdhNszWrVurzitFcGbOnMmKFStaXYbZMF/84herznt7XWBk1iAOjlkCB8csgYNjliA5OJJmS3pC0lZJL0m6MR9fLWmPpM357arGlWtWDkXOqvUDX4qIn0maBmyStD6f9/WI+Frx8szKKTk4EbEP2Jc/7pO0lawRoVnHa8h7HEnnAR8CfpoPrZL0gqR7JdX/UUmzkiscHEmnAw8DX4iIg8BdwDxgIdkr0poq662UtFHSxsOHDxctw2xcFQqOpElkofluRHwPICJ6I2IgIgaBu8kasL9FRKyNiEURsajey+vNWq3IWTUB9wBbI+LOivFzKha7Gtgycl2zdlfkrNpHgeuAFyVtzse+DFwraSFZk/VdwGcLPIdZKRU5q/Y0o38rwQ/TyzFrD75ywCxBKT5WMJZ77rmHvXv3jr2gWY1mzZrF9ddfn7x+WwSnr6+vro8xm42l3n7YI/lQzSyBg2OWwMExS+DgmCVwcMwSODhmCRwcswQOjlkCB8csgYNjlsDBMUvg4JglcHDMEjg4ZgkKfaxA0i6gDxgA+iNikaQZwAPAeWQfnf50RPgzAdZRGvGK86cRsbDim6tuBh6PiPOBx/Nps47SjEO1pcC6/PE64FNNeA6zlioanAAek7RJ0sp87Oy8Pe5Qm9yzCj6HWekU/ej0RyNir6SzgPWSXql1xTxoKwF6etwl19pLoVeciNib3+8HHiHr2tk71JQwv99fZV138rS2VaST59T86z2QNBVYTNa181Fgeb7YcuAHRYs0K5sih2pnA49knXCZCPxnRPyPpOeAByXdALwOXFO8TLNyKdLJcyfwh6OMHwAuK1KUWdn5ygGzBG3RkPBfFy1iyvz5rS7DOsjRnh5eK7B+WwTn9IkTmdbd3eoyrIN0TSz2q+9DNbMEDo5ZAgfHLIGDY5agLU4OxLuOMTjlSKvLsA4S75hcaP22CA7v6Ieu/lZXYR0kTiv2++RDNbMEDo5ZAgfHLIGDY5agLU4OnOga5PhEnxywxunvGiy0flsE58jk48TE460uwzrI0YK/Tz5UM0vg4JglSD5Uk3QhWcfOIXOBfwCmAyuAX+fjX46IH6Y+j1kZFfno9DZgIYCkLmAPWaebvwG+HhFfa0SBZmXUqJMDlwE7IuKXefOOxpoAgxOi8du1t60o+CalUcFZBtxXMb1K0l8BG4EvFW26fnB2P5MmnSiyCbNhTpzoh9+nr1/45ICkbuCTwH/lQ3cB88gO4/YBa6qst1LSRkkbDx8+XLQMs3HViLNqVwI/i4hegIjojYiBiBgE7ibr7vkW7uRp7awRwbmWisO0ofa3uavJunuadZSiXyz1DuATwGcrhr8qaSHZNxnsGjHPrCMUCk5EHAHeNWLsukIVmbWBtrhWbX2czcHBYh91Nav0zpjOHxVYvy2CMwgM0oT/D9nb1mDBfwv6WjWzBA6OWQIHxyyBg2OWoC1ODgxs+CQnjvjbCqxx+qcehwtH/XramrRFcOL/ziYOTmt1GdZB4kQfVb7XuSY+VDNL4OCYJXBwzBI4OGYJ2uLkQO++9ez/tfuqWeMcP6sbeHfy+m0RnDd+eT+vv/56q8uwDnL86PuAG5PX96GaWQIHxyyBg2OWYMzgSLpX0n5JWyrGZkhaL+nV/L6nYt4tkrZL2ibpimYVbtZKtbzifBtYMmLsZuDxiDgfeDyfRtICsh5rF+XrfDPv8mnWUcYMTkQ8Bfx2xPBSYF3+eB3wqYrx+yPiWES8BmynSnsos3aW+h7n7IjYB5Dfn5WPzwLeqFhudz72Fm5IaO2s0ScHRmsMMOqnu92Q0NpZanB6hxoP5vdD12fvBmZXLHcusDe9PLNySg3Oo8Dy/PFy4AcV48sknSZpDnA+sKFYiWblM+YlN5LuAz4OzJS0G7gN+GfgQUk3AK8D1wBExEuSHgReBvqBz0fEQJNqN2uZMYMTEddWmXVZleVvB24vUpRZ2fnKAbMEDo5ZAgfHLIGDY5bAwTFL4OCYJXBwzBI4OGYJHByzBA6OWQIHxyyBg2OWwMExS+DgmCVwcMwSODhmCRwcswSpnTz/RdIrkl6Q9Iik6fn4eZKOStqc377VxNrNWia1k+d
64A8i4oPAL4BbKubtiIiF+e1zjSnTrFySOnlGxGMR0Z9P/oSsDZTZ20Yj3uNcD/yoYnqOpJ9LelLSx6qt5E6e1s4KfSObpFvJ2kB9Nx/aB7w3Ig5I+gjwfUkXRcTBketGxFpgLcDs2bNH7fZpVlbJrziSlgN/DnwmIgIgb7Z+IH+8CdgBXNCIQs3KJCk4kpYAfwd8MiKOVIyfOfS1HpLmknXy3NmIQs3KJLWT5y3AacB6SQA/yc+gXQr8o6R+YAD4XESM/IoQs7aX2snznirLPgw8XLQos7LzlQNmCRwcswQOjlkCB8csgYNjlsDBMUvg4JglcHDMEjg4ZgkcHLMEDo5ZAgfHLIGDY5bAwTFL4OCYJXBwzBI4OGYJUjt5rpa0p6Jj51UV826RtF3SNklXNKtws1ZK7eQJ8PWKjp0/BJC0AFgGXJSv882h5h1mnSSpk+cpLAXuz9tEvQZsBy4uUJ9ZKRV5j7Mqb7p+r6SefGwW8EbFMrvzsbdwJ09rZ6nBuQuYBywk6965Jh/XKMuO2qUzItZGxKKIWDR16tTEMsxaIyk4EdEbEQMRMQjczZuHY7uB2RWLngvsLVaiWfmkdvI8p2LyamDojNujwDJJp0maQ9bJc0OxEs3KJ7WT58clLSQ7DNsFfBYgIl6S9CDwMlkz9s9HxEBTKjdroYZ28syXvx24vUhRZmXnKwfMEjg4ZgkcHLMEDo5ZAgfHLIGDY5bAwTFL4OCYJXBwzBI4OGYJHByzBA6OWQIHxyyBg2OWwMExS+DgmCVIbUj4QEUzwl2SNufj50k6WjHvW02s3axlxvwEKFlDwn8D/n1oICL+cuixpDXA7yuW3xERCxtUX1Nc9Z73APCjvXtHb8FjNoZaPjr9lKTzRpsnScCngT9rcF1N9fcf+ACSeGzfPvrD0bH6FX2P8zGgNyJerRibI+nnkp6U9LGC2zcrpVoO1U7lWuC+iul9wHsj4oCkjwDfl3RRRBwcuaKklcBKgJ6enpGzzUotOTiSJgJ/AXxkaCwijgHH8sebJO0ALgA2jlw/ItYCawFmz549rsdLOw8dQlRpMWpWgyKvOJcDr0TE7qEBSWcCv42IAUlzyRoS7ixYY8N95plnWl2CtblaTkffBzwLXChpt6Qb8lnLGH6YBnAp8IKk54GHgM9FRK3fdGDWNlIbEhIRfz3K2MPAw8XLMis3XzlglsDBMUvg4JglcHDMEjg4ZgkcHLMEDo5ZAgfHLIGDY5ag6NXRDXGwa5D1ZxyuOv/3Xf4aUavPTQsWcOlZZ1Wd39XVxelPPpm8/VIEJ4BjE6pfqzw4fqVYhzhj0iTOnDz51AsdO5a8fR+qmSVwcMwSlOJQzazR1u3cyX/v2VN1/nlTp/KF978/efsOjnWk7X19bO/rqzr/UH9/oe07OPa2tOfIEf7pxReT11eUoD1S9ztPj3f/8Qerzu/9yYscP3hoHCsyA2BTRCwadU5EnPIGzAaeALYCLwE35uMzgPXAq/l9T8U6twDbgW3AFTU8R/jmWwlvG6v+ztbwS30O8OH88TTgF8AC4KvAzfn4zcAd+eMFwPPAacAcYAfQ5eD41oa3qsEZ83R0ROyLiJ/lj/vIXnlmAUuBdfli64BP5Y+XAvdHxLGIeI3slefisZ7HrJ3U9X+cvBXuh4CfAmdHxD7IwgUMXd8wC3ijYrXd+ZhZx6j5rJqk08k62HwhIg5mbaNHX3SUsRhleyc7eZq1m5pecSRNIgvNdyPie/lwr6Rz8vnnAPvz8d1kJxSGnAvsHbnNiFgbEYuqnrUwK7FaGhIKuAfYGhF3Vsx6FFieP14O/KBifJmk0yTNIevmuaFxJZuVQA1n1f6E7FDrBWBzfrsKeBfwONnp6MeBGRXr3Ep2Nm0bcKVPR/vWpreqZ9VK8Q9QSa0vwuytqv4D1FdHmyVwcMwSODhmCRwcswQOjlmCsnwe5zfA4fy+U8ykc/ank/YFat+f91WbUYrT0QCSNnbSVQSdtD+dtC/QmP3xoZpZAgfHLEGZgrO21QU0WCftTyftCzRgf0rzHsesnZTpFcesbbQ8OJKWSNomabukm1tdTwpJuyS9KGmzpI352AxJ6yW9mt/3tLrOaiTdK2m/pC0VY1Xrl3RL/vPaJumK1lRdXZX9WS1pT/4z2izpqop59e/PWJf8N/MGdJF9/GAu0E3W5GNBK2tK3I9dwMwRY6M2MynjDbgU+DCwZaz6SWjGUpL9WQ387SjLJu1Pq19xLga2R8TOiDgO3E/W7KMTLGX0ZialExFPAb8dMVyt/qWUvBlLlf2pJml/Wh2cTmnsEcBjkjblvRSgejOTdtGJzVhWSXohP5QbOvRM2p9WB6emxh5t4KMR8WHgSuDzki5tdUFN1K4/s7uAecBCYB+wJh9P2p9WB6emxh5lFxF78/v9wCNkL/XVmpm0i0LNWMomInojYiAiBoG7efNwLGl/Wh2c54DzJc2R1A0sI2v20TYkTZU0begxsBjYQvVmJu2io5qxDP0RyF1N9jOC1P0pwRmQq8ja6u4Abm11PQn1zyU7K/M8WW/tW/Pxqs1MynYD7iM7fDlB9hf4hlPVT53NWEqyP/8BvEjWdOZR4Jwi++MrB8wStPpQzawtOThmCRwcswQOjlkCB8csgYNjlsDBMUvg4Jgl+H9+MPTG7IEpyQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "from IPython import display\n", "\n", "import gym\n", "from IPython import display\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "\n", "\n", "env = gym.make('BreakoutDeterministic-v4')\n", "#env = gym.make('SpaceInvaders-v0')\n", "\n", "#env = gym.make('CartPole-v1')\n", "#env = gym.make('MountainCar-v0')\n", "\n", "\n", "env.reset()\n", "img = plt.imshow(env.render(mode='rgb_array')) # only call this once\n", "\n", "\n", "n_time_steps=100\n", "for _ in range(n_time_steps):\n", " # plot frame\n", " img.set_data(env.render(mode='rgb_array')) # just update the data\n", " display.display(plt.gcf())\n", " display.clear_output(wait=True)\n", " # choose action\n", " action = env.action_space.sample()\n", " # take action\n", " frame, reward, is_done, _ = env.step(action) " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# print(frame.shape, reward, is_done, _)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# print(env.__dir__() )" ] } ], "metadata": { "kernelspec": { "display_name": "RL_class", "language": "python", "name": "rl_class" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "latex_metadata": { "affiliation": "Faculty of Physics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria", "author": "Marin Bukov", "title": "Reinforcement Learning Course: WiSe 2020/21" } }, "nbformat": 4, "nbformat_minor": 4 }