Joy

Posted on • Originally published at gocoder.one

An Introduction to Reinforcement Learning With OpenAI Gym’s ‘Taxi’

In this introductory tutorial, we'll apply reinforcement learning (RL) to train an agent to solve the 'Taxi' environment from OpenAI Gym. We'll cover:

  • A basic introduction to RL
  • Setting up OpenAI Gym & Taxi
  • Step-by-step tutorial on how to train a Taxi agent in Python3 using RL

Before we start, what's 'Taxi'?

Taxi is one of many environments available on OpenAI Gym. These environments are used to develop and train reinforcement learning agents.

The goal of Taxi is to pick up passengers and drop them off at their destination in as few moves as possible. In this tutorial, you'll start with an agent that plays randomly:

Random Taxi Agent

…and successfully apply reinforcement learning to train an agent to solve the game:

Trained Taxi Agent

💡 An introduction to Reinforcement Learning

Think about how you might teach a dog a new trick, like telling it to sit:

  • If it performs the trick correctly (it sits), you'll reward it with a treat (positive feedback) ✔️
  • If it doesn't sit correctly, it doesn't get a treat (negative feedback) ❌

By repeating the behaviours that lead to positive outcomes, the dog will learn to sit when it hears the command in order to get its treat. Reinforcement learning is a subdomain of machine learning in which an 'agent' (the dog) learns which actions to take in its environment (sitting when it hears the command 'sit') in order to maximise its reward (getting a treat). This can be illustrated more formally as:

The agent-environment loop in reinforcement learning. Source: Sutton & Barto
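In code, this loop is short. Here's a minimal sketch of it using the Taxi environment we'll set up in the next section - the 'agent' here just acts randomly, which is exactly the baseline we'll build below:

import gym

# the agent-environment loop from the diagram above:
# observe a state, take an action, receive a reward and the next state
env = gym.make('Taxi-v3')
state = env.reset()   # the agent observes the initial state
done = False

while not done:
    action = env.action_space.sample()            # the agent picks an action (randomly, for now)
    state, reward, done, info = env.step(action)  # the environment responds

env.close()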

🏋️ Installing OpenAI Gym and Taxi

We'll be using the 'Taxi-v3' environment for this tutorial.

You'll need to install:

  • OpenAI Gym: pip install gym
  • NumPy: pip install numpy

Note: the snippets in this tutorial use Gym's classic API, where env.reset() returns the state and env.step() returns four values. If you have Gym 0.26+ (or Gymnasium) installed, those return values differ slightly, so you may need an older Gym release for the code below to run as-is.

The following snippet will import the necessary packages, and create the Taxi environment:

import numpy as np
import gym
import random

# create Taxi environment
env = gym.make('Taxi-v3')
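If you want to sanity-check the environment, you can inspect its spaces - Taxi-v3 has 500 discrete states and 6 discrete actions:

# optional: inspect the environment's state and action spaces
print(env.observation_space)  # Discrete(500)
print(env.action_space)       # Discrete(6)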

🎲 Random Agent

We'll start by implementing an agent that doesn't learn at all. Instead, it will select actions at random. This will be our baseline.

The first step is to give our agent the initial state. A state tells our agent what the current environment looks like. In Taxi, a state defines the current positions of the taxi, passenger, and pick-up and drop-off locations. Below are examples of three different states for Taxi:

Taxi States
Note: Yellow = taxi, Blue letter = pickup location, Purple letter = drop-off destination

To get the initial state:

# create a new instance of taxi, and get the initial state
state = env.reset()
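Under the hood, each state is just an integer from 0 to 499 that encodes the taxi's row and column, the passenger's location, and the destination. If you're curious, the underlying Taxi environment exposes a decode helper you can use to unpack it (an optional peek at the environment's internals):

# optional: unpack the integer state into its components
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(taxi_row, taxi_col, passenger_loc, destination)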

Next, we'll run a for-loop to cycle through the game. At each iteration, our agent will:

  1. Make a random action from the action space (0 - south, 1 - north, 2 - east, 3 - west, 4 - pick-up, 5 - drop-off)
  2. Receive the new state

Here's our random agent script:

import gym
import numpy as np
import random

# create Taxi environment
env = gym.make('Taxi-v3')

# create a new instance of taxi, and get the initial state
state = env.reset()

num_steps = 99
for s in range(num_steps+1):
    print(f"step: {s} out of {num_steps}")

    # sample a random action from the list of available actions
    action = env.action_space.sample()

    # perform this action on the environment and receive the new state
    state, reward, done, info = env.step(action)

    # render the current state of the environment
    env.render()

# end this instance of the taxi environment
env.close()

You can run this and watch your agent make random moves. Not super exciting, but hopefully this helped you get familiar with the OpenAI Gym toolkit.

Next, we'll implement the key algorithm that will enable our agent to learn from the environment in order to solve Taxi.

📖 Q-Learning Agent

Q-learning is a reinforcement learning algorithm that seeks to find the best possible next action given the current state, in order to maximise the reward the agent receives (the 'Q' in Q-learning stands for quality - i.e. how valuable an action is).

Let's take the following starting state:

Taxi state

Which action (south, north, east, west, pick-up or drop-off) should it take in order to maximise its reward? (Note: blue = pick-up location and purple = drop-off destination)

First, let's take a look at how our agent is 'rewarded' for its actions. Remember in reinforcement learning, we want our agent to take actions that will maximise the possible rewards it receives from its environment.

'Taxi' reward system

According to the Taxi documentation:

"…you receive +20 points for a successful drop-off, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions."

Looking back at our original state, the possible actions it can take and the corresponding rewards it will receive are shown below:

Taxi rewards

In the image above, the agent loses 1 point per timestep it takes. It will also lose 10 points if it uses the pick-up or drop-off action here.
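You can check this reward scheme yourself by stepping the environment once and looking at the reward it returns (using the classic Gym API, where the reward is the second value returned by env.step):

# take one movement action and inspect the reward
state = env.reset()
new_state, reward, done, info = env.step(0)  # 0 = move south
print(reward)  # -1: one point lost for the timestep

# an illegal pick-up or drop-off here would return -10 instead,
# and a successful drop-off returns +20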

We want our agent to head north towards the pick-up location (denoted by a blue R) - but how will it know which action to take if they are all equally punishing?

Exploration

Our agent currently has no way of knowing which action will lead it closest to the blue R. This is where trial-and-error comes in - we'll have our agent take random actions, and observe what rewards it gets, i.e. our agent will do some exploration of the environment. This is what our random agent was doing earlier.

Over many iterations, our agent will have observed that certain sequences of actions will be more rewarding than others. Along the way, our agent will need to keep track of which actions led to what rewards.

Introducing… Q-tables

A Q-table is simply a look-up table storing values representing the maximum expected future rewards our agent can expect for a certain action in a certain state (known as Q-values). It will tell our agent that when it encounters a certain state, some actions are more likely than others to lead to higher rewards. It becomes a 'cheatsheet' telling our agent what the best action to take is.

The image below illustrates what our 'Q-table' will look like:

  • Each row corresponds to a unique state in the 'Taxi' environment
  • Each column corresponds to an action our agent can take
  • Each cell corresponds to the Q-value for that state-action pair - a higher Q-value means a higher maximum reward our agent can expect to get if it takes that action in that state.

Q-table

Before we begin training our agent, we'll need to initialize our Q-table like so:

state_size = env.observation_space.n  # total number of states (S)
action_size = env.action_space.n      # total number of actions (A)

# initialize a qtable with 0's for all Q-values
qtable = np.zeros((state_size, action_size))
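For Taxi-v3 this gives a 500 x 6 table - one row per state, one column per action, all starting at zero:

print(qtable.shape)  # (500, 6)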

As our agent explores, it will update the Q-table with the Q-values it finds. To calculate our Q-values, we'll introduce the Q-learning algorithm.

Q-Learning Algorithm

The Q-learning algorithm is given below. We won't go into details, but you can read more about it in Ch 6 of Sutton & Barto (2018).

Q(S_t, A_t) ← Q(S_t, A_t) + α * [ R_(t+1) + γ * max_a Q(S_(t+1), a) − Q(S_t, A_t) ]

The Q-learning algorithm will help our agent update the current Q-value, Q(S_t, A_t), with its observations after taking an action - i.e. increase Q if it encountered a positive reward, or decrease Q if it encountered a negative one.

Note that in Taxi, our agent doesn't receive a positive reward until it successfully drops off a passenger (+20 points). Hence even if our agent is heading in the correct direction, there will be a delay in the positive reward it should receive. The following term in the Q-learning equation addresses this:

γ * max_a Q(S_(t+1), a)

This term adjusts our current Q-value to include a portion of the maximum reward achievable from the next state, S_(t+1). The 'a' under the max refers to all the possible actions available in that next state. The equation also contains two hyperparameters which we can specify:

  1. Learning rate (α): how easily the agent should accept new information over previously learnt information
  2. Discount factor (γ): how much the agent should take into consideration the rewards it could receive in the future versus its immediate reward

Here's our code for implementing the Q-learning algorithm:

# hyperparameters to tune
learning_rate = 0.9
discount_rate = 0.8

# dummy variables
reward = 10 # R_(t+1)
state = env.observation_space.sample()      # S_t
action = env.action_space.sample()          # A_t
new_state = env.observation_space.sample()  # S_(t+1)

# Qlearning algorithm: Q(s,a) := Q(s,a) + learning_rate * (reward + discount_rate * max Q(s',a') - Q(s,a))
qtable[state, action] += learning_rate * (reward + discount_rate * np.max(qtable[new_state,:]) - qtable[state,action])

Exploration vs Exploitation Trade-off

Earlier, we let our agent explore the environment to update our Q-table. As our agent learns more about the environment, we can let it use this knowledge to take more optimal actions - known as exploitation.

During exploitation, our agent will look at its Q-table and select the action with the highest Q-value (instead of a random action). Over time, our agent will need to explore less, and start exploiting what it knows instead.

There are many ways to implement an exploration-exploitation strategy. Here's just one example:

# dummy variables
episode = random.randint(0,500)
state = env.observation_space.sample()
qtable = np.random.randn(env.observation_space.n, env.action_space.n)

# exploration-exploitation tradeoff
epsilon = 1.0     # probability that our agent will explore
decay_rate = 0.01 # decay rate of epsilon

if random.uniform(0,1) < epsilon:
    # explore
    action = env.action_space.sample()
else:
    # exploit
    action = np.argmax(qtable[state,:])

# epsilon decreases exponentially --> our agent will explore less and less
epsilon = np.exp(-decay_rate*episode)

In the example above, we set some value epsilon between 0 and 1. If epsilon is 0.7, there is a 70% chance that on this step our agent will explore instead of exploit. We've set epsilon to decay exponentially with each episode, so that our agent explores less and less as training progresses.
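To get a feel for how quickly epsilon shrinks, here's the decay schedule evaluated at a few episode numbers, using the decay_rate of 0.005 from the full training script below:

import numpy as np

decay_rate = 0.005
for episode in [0, 100, 500, 1000]:
    epsilon = np.exp(-decay_rate * episode)
    print(episode, round(epsilon, 3))

# 0 -> 1.0, 100 -> 0.607, 500 -> 0.082, 1000 -> 0.007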

Bringing it all together

We're done with all the building blocks needed for our reinforcement learning agent. The process for training our agent will look like:

  1. Initialise our Q-table with 0's for all Q-values
  2. Let our agent play Taxi over a large number of episodes
  3. Continuously update the Q-table using the Q-learning algorithm and an exploration-exploitation strategy

Here's the full implementation:

import numpy as np
import gym
import random

def main():

    # create Taxi environment
    env = gym.make('Taxi-v3')

    # initialize q-table
    state_size = env.observation_space.n
    action_size = env.action_space.n
    qtable = np.zeros((state_size, action_size))

    # hyperparameters
    learning_rate = 0.9
    discount_rate = 0.8
    epsilon = 1.0
    decay_rate = 0.005

    # training variables
    num_episodes = 1000
    max_steps = 99 # per episode

    # training
    for episode in range(num_episodes):

        # reset the environment
        state = env.reset()
        done = False

        for s in range(max_steps):

            # exploration-exploitation tradeoff
            if random.uniform(0,1) < epsilon:
                # explore
                action = env.action_space.sample()
            else:
                # exploit
                action = np.argmax(qtable[state,:])

            # take action and observe reward
            new_state, reward, done, info = env.step(action)

            # Q-learning algorithm
            qtable[state,action] = qtable[state,action] + learning_rate * (reward + discount_rate * np.max(qtable[new_state,:])-qtable[state,action])

            # Update to our new state
            state = new_state

            # if done, finish episode
            if done:
                break

        # Decrease epsilon
        epsilon = np.exp(-decay_rate*episode)

    print(f"Training completed over {num_episodes} episodes")
    input("Press Enter to watch trained agent...")

    # watch trained agent
    state = env.reset()
    done = False
    rewards = 0

    for s in range(max_steps):

        print(f"TRAINED AGENT")
        print("Step {}".format(s+1))

        action = np.argmax(qtable[state,:])
        new_state, reward, done, info = env.step(action)
        rewards += reward
        env.render()
        print(f"score: {rewards}")
        state = new_state

        if done:
            break

    env.close()

if __name__ == "__main__":
    main()

👏 What's next?

There are many other environments available on OpenAI Gym for you to try (e.g. Frozen Lake). You can also try optimising the implementation above to solve Taxi in fewer episodes.
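Since the Q-table approach only relies on discrete state and action spaces, you can reuse the same training loop on other toy-text environments just by swapping the environment name - for example Frozen Lake (registered as FrozenLake-v1 in recent Gym releases, FrozenLake-v0 in older ones):

import numpy as np
import gym

# the same Q-table setup works for any environment with discrete spaces
env = gym.make('FrozenLake-v1')
qtable = np.zeros((env.observation_space.n, env.action_space.n))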

Some other useful resources include Sutton & Barto's Reinforcement Learning: An Introduction (2018) and the official OpenAI Gym documentation.
