
Dakota Day

Reinforcing Reinforcement Learning

Welcome to the third part of my machine learning series! In this blog, we'll cover some of the basics of the final learning pattern, reinforcement learning (RL).

Necessary Parts

We'll start by talking about the key pieces inside of reinforcement learning. RL focuses on an agent being able to interact with an environment and make decisions. These decisions are based on how the agent is able to gain reward from its actions. The goal? Get as much reward as the agent can!
The key parts are:

  • The Agent: Defined as the thing learning, or moving through the environment. For example, in video games like Pong or Super Mario, the agent would be the paddle or Mario himself!

  • The Environment: This is everything that the agent interacts with. In video games, the level you're currently in would be the environment.

  • Actions: These are the choices our agent can take, and they change for every agent, since each has a different goal. In Pong, the goal is to score by hitting the ball, so the paddle's actions are up and down. Mario, on the other hand, has a more complicated environment: he can run, jump, go down pipes, etc. So his actions would include up, down, left, and right!

  • State: This is a snapshot of the environment at any given time. This is how the agent can "see" the environment as we would. State is important because it gives context for the agent's decision making. The agent's goal is to map states to the actions that earn it the most reward over time.

  • Rewards: This is a function that provides feedback from the environment. It's a crucial part of RL because it's what shapes how the agent acts. You don't want to only think about giving the agent rewards for doing something good; you should also make sure to take away rewards for doing something undesirable.
    Let's take Mario for example again. Since the goal is to move right (to the flag, to beat the level), we may give him rewards for moving right. In the first level, there is a goomba and a pipe that would impede his progress immediately. To make sure that he learns to jump over the goomba, we want to take away from his reward when he dies. For the pipe, we may make it so that Mario loses a "point" for every second that passes. This encourages him to move to the flag in the most efficient way possible.
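
To make that a bit more concrete, here's a rough Python sketch of what a reward function like the one described above might look like. The event flags and reward values are made-up assumptions for illustration, not numbers from any real Mario environment.

```python
# Illustrative reward function for a made-up Mario-like level.
# The events and reward values here are arbitrary choices for this sketch.
def compute_reward(moved_right: bool, died: bool, reached_flag: bool) -> float:
    reward = 0.0
    if moved_right:
        reward += 1.0    # encourage progress toward the flag
    if died:
        reward -= 100.0  # heavy penalty for dying (e.g., running into the goomba)
    if reached_flag:
        reward += 500.0  # big bonus for beating the level
    reward -= 0.1        # small per-step penalty so he doesn't stall at the pipe
    return reward

# Example: Mario moved right this step and survived.
print(compute_reward(moved_right=True, died=False, reached_flag=False))  # 0.9
```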

Piece it Together

So how do all of those pieces come together?

Basic Diagram of RL

This diagram shows the basic process our agent takes while learning.
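
In code, that loop usually looks like: read the current state, pick an action, apply it to the environment, and collect the reward and next state. Below is a minimal, self-contained sketch of that loop using a tiny made-up "corridor" environment and a random placeholder agent; the environment, its actions, and its reward values are all assumptions for illustration.

```python
import random

# A tiny, made-up "corridor" environment: the agent starts at position 0
# and tries to reach position 5 (the "flag").
class CorridorEnv:
    def __init__(self, length: int = 5):
        self.length = length
        self.position = 0

    def reset(self) -> int:
        self.position = 0
        return self.position                  # the state is just the position

    def step(self, action: str):
        if action == "right":
            self.position += 1
        elif action == "left":
            self.position = max(0, self.position - 1)
        done = self.position >= self.length
        reward = 10.0 if done else -0.1       # small step penalty, big goal reward
        return self.position, reward, done

# The basic loop from the diagram: state -> action -> reward -> next state.
env = CorridorEnv()
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice(["left", "right"])  # placeholder "agent" acting randomly
    state, reward, done = env.step(action)
    total_reward += reward

print(f"Episode finished with total reward {total_reward:.1f}")
```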

Starting

When learning a new video game, how do we, as humans, learn how to play? Well, if the game doesn't have any tutorials, we typically start by pressing all the buttons to see what they do. Our agent starts by doing something similar. Since the agent doesn't know how to earn rewards yet, it takes random actions until it finds actions that reliably give it rewards.

Feedback

While the agent takes those random actions, it gets feedback from the environment. Over time, the agent adjusts its actions to earn more rewards while trying to minimize penalties. It records which actions earned rewards or penalties, along with the states where it took them, and passes that experience on to the next "generation". Each generation bases its decisions on past generations' experience, making it more likely to take the "good" actions that earn the rewards it craves.
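
The "generations" wording above is informal. One simple way to implement the idea (just one option among many, and an assumption on my part rather than the post's method) is to keep a table of estimated reward for each state-action pair and nudge each estimate toward the feedback actually observed. The learning rate and example numbers below are made up for illustration.

```python
from collections import defaultdict

LEARNING_RATE = 0.1                  # arbitrary choice for this sketch
value_table = defaultdict(float)     # (state, action) -> estimated reward

def update_estimate(state, action, observed_reward):
    key = (state, action)
    # Nudge the old estimate a small step toward the newly observed reward.
    value_table[key] += LEARNING_RATE * (observed_reward - value_table[key])

# Feedback from a couple of steps: "right" was rewarded in state 0,
# "left" was penalized, so "right" becomes the more attractive choice there.
update_estimate(0, "right", 1.0)
update_estimate(0, "left", -1.0)
print(dict(value_table))  # {(0, 'right'): 0.1, (0, 'left'): -0.1}
```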

Policy

This is simply the term for the agent's strategy: the mapping function between states and actions. It takes a given state as input and outputs the best action for that state. The policy function changes drastically between algorithms and, depending on the problem, can be difficult to solve for.
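
As a minimal sketch, a "greedy" policy over a table of estimated rewards just picks the highest-valued action for the current state. The action names and estimates below are made-up assumptions, continuing the toy example above.

```python
ACTIONS = ["left", "right"]

# Made-up reward estimates per (state, action) pair, for illustration.
value_table = {(0, "left"): -0.1, (0, "right"): 0.1}

def greedy_policy(state):
    # The policy maps a state to the action with the highest estimated reward.
    return max(ACTIONS, key=lambda action: value_table.get((state, action), 0.0))

print(greedy_policy(0))  # "right"
```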

Exploration vs. Exploitation

Exploration or exploitation? That is the choice the agent has to make: should it take new actions (exploration), or take actions it already knows will give it rewards (exploitation)?

  • Exploration: The agent may take actions it doesn't normally take in a given state to gather new information. The benefit is that it can potentially find a better policy than its current one. The downside is that exploring can mean giving up immediate rewards, leading to reward "dry spells" in exchange for possible long-term gains.

  • Exploitation: The agent may instead take actions that it knows will give immediate rewards. This is how the agent cashes in on its past experience and racks up reward quickly. The downside is that by sticking to actions it already knows, it won't explore new, potentially better, options.

One major challenge is striking a balance between these two. A popular strategy for doing so is the Epsilon (ε)-Greedy strategy.

Epsilon (ε)-Greedy Strategy

This strategy gives the agent a chance to explore, ε (something small, say 0.1), and a chance to exploit, (1 - ε). With ε = 0.1, 90% of the time the agent takes the action it knows will give it immediate reward; the other 10% of the time it tries a new, unexplored action. You can also make ε shrink as time progresses, so the agent explores less and less once there is little left to discover.
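
Here's a minimal sketch of ε-greedy action selection with a decaying ε. The actions, reward estimates, and decay schedule are placeholder assumptions, not values from any particular algorithm or library.

```python
import random

ACTIONS = ["left", "right"]
# Made-up reward estimates per (state, action) pair, for illustration.
value_table = {(0, "left"): -0.1, (0, "right"): 0.1}

def epsilon_greedy(state, epsilon):
    # With probability epsilon, explore: try a random action.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    # Otherwise, exploit: take the action with the best known estimate.
    return max(ACTIONS, key=lambda a: value_table.get((state, a), 0.0))

print(epsilon_greedy(state=0, epsilon=0.1))  # usually "right", occasionally random

# Optional decay: explore a lot at first, then shift toward exploitation.
epsilon = 1.0
EPSILON_MIN, EPSILON_DECAY = 0.1, 0.995
for episode in range(1000):
    # ... run one episode, choosing actions with epsilon_greedy(state, epsilon) ...
    epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```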

Applications of RL

Reinforcement learning can be used in many parts of life, from self-driving cars to natural language processing. We already know it has applications in gaming from my previous examples. In healthcare, it can be used to choose treatments for patients based on policies learned by RL systems. And that's just to name a few; check out this blog by Derrick Mwiti to explore some other applications in more depth.

Conclusions

So, what is the takeaway? RL is a powerful tool that enables machines to learn from their own experiences. Agents learn to make decisions in an environment by exploring to find the best actions to take. It is already used in many parts of our lives and is always getting better!

Sources:
What is Reinforcement Learning

An Introduction to Reinforcement Learning

Epsilon-Greedy Q-learning

10 Real-Life Applications of Reinforcement Learning
