Reinforcement learning is a machine learning method that teaches an agent how to make a sequence of decisions. The agent learns by interacting with its environment through trial and error and gathering feedback, which comes in the form of rewards or penalties that result directly from its actions.
The agent's primary objective is to maximize the cumulative reward of its actions.
A bot player in a computer game and a robot moving inventory in a physical warehouse are both examples of an agent acting in an environment. The field of reinforcement learning aims to teach the agent the best strategy to follow in its environment.
Environments are frequently large and complex in terms of their possible states and transitions. In a game of Go, for instance, the range of viable moves expands dramatically as the game goes on: after the first two moves, there are roughly 400 possible next moves in chess, but about 130,000 in Go. An environment can also be challenging because of incomplete information; in card games, for instance, a player cannot see their opponents' cards.
Reinforcement learning problems are frequently modeled and formalized as Markov Decision Processes (MDPs). An MDP consists of a set of environment states, the actions an agent can take in each state, a reward function, and a transition model that describes how actions move the agent between states.
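To make this concrete, here is a minimal sketch of how the components of an MDP might be written down in Python. The states, actions, transition probabilities, and rewards below are invented purely for illustration; they do not come from any particular problem.

```python
# A tiny Markov Decision Process, written out explicitly.
# transitions[state][action] is a list of (probability, next_state, reward) triples.

states = ["cold", "warm", "overheated"]
actions = ["slow", "fast"]

transitions = {
    "cold": {
        "slow": [(1.0, "cold", 1)],
        "fast": [(0.5, "cold", 2), (0.5, "warm", 2)],
    },
    "warm": {
        "slow": [(0.5, "cold", 1), (0.5, "warm", 1)],
        "fast": [(1.0, "overheated", -10)],
    },
    "overheated": {},  # terminal state: no actions available
}

gamma = 0.9  # discount factor: future rewards count less than immediate ones

def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in transitions[state][action])

print(expected_reward("warm", "fast"))  # -10.0
```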
One significant difference between supervised learning and reinforcement learning is the timing of feedback: in supervised learning, the model is immediately told whether its prediction for an input is correct, whereas a reinforcement learning agent operates in a setting where the total reward of its actions is delayed and spread out over a series of decisions.
Reinforcement learning algorithms have had considerable success defeating human experts at games such as Go, Dota 2, StarCraft II, and Atari video games. Later in this article, we'll take a closer look at AlphaGo, the computer program that outperformed human players in the game of Go.
Reinforcement learning fundamentals
To better understand reinforcement learning, let's introduce its fundamental concepts:
- **agent** – an entity that takes actions in its environment; Pac-Man moving through a maze is an example.
- **action** – any interaction between the agent and its environment. For instance, Pac-Man can move left, right, up, or down.
- **environment** – the setting in which the agent exists and with which it interacts. After each action, the environment records the agent's state and action, then returns the reward and the agent's next state.
- **state** – the situation the agent is currently in, returned by the environment after each action. In the Pac-Man game, the state includes Pac-Man's position in the maze, the locations of the four colored ghosts, and the rewards earned so far.
- **reward** – the feedback the agent receives after taking a certain action, for example 100 points for each cherry eaten in Pac-Man. Rewards may arrive immediately or later.
- **discount factor** – determines how valuable rewards are in relation to time, making future rewards less valuable than present ones.
- **policy** – a mapping from the agent's states to actions. Learning a policy that yields the maximum overall reward is one of reinforcement learning's primary objectives. Policies can be deterministic or stochastic (containing elements of chance). Under a given policy, each state has an expected value of the future rewards the agent can anticipate from that state.
- **value** – the long-term reward expected from the current state under a given policy.
- **model** – a description of how the environment works: given the current state and an action, it predicts the next state and the reward.
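These concepts fit together in a simple interaction loop: the agent observes its state, picks an action according to its policy, and the environment returns a reward and the next state. Below is a self-contained toy sketch of that loop; the one-dimensional corridor environment and its reward values are invented for illustration.

```python
import random

GOAL = 4  # rightmost cell of a five-cell corridor; reaching it ends the episode

def step(state, action):
    """Environment dynamics: apply an action, return (next_state, reward, done)."""
    next_state = max(0, min(GOAL, state + (1 if action == "right" else -1)))
    if next_state == GOAL:
        return next_state, 10, True   # positive reward for reaching the goal
    return next_state, -1, False      # small penalty for each step taken

def policy(state):
    """A stochastic policy: usually move right, occasionally explore left."""
    return "right" if random.random() < 0.8 else "left"

state, total_reward, done = 0, 0, False
while not done:
    action = policy(state)                      # the agent acts...
    state, reward, done = step(state, action)   # ...the environment responds
    total_reward += reward
print("episode return:", total_reward)
```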
The reinforcement learning problem can be tackled with a model-based approach or without one (a model-free approach). Model-based techniques first seek to accurately model the environment and then derive the best course of action from that model. In a model-free approach, the agent discovers the best course of action directly, by trial and error.
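As a sketch of the model-free idea, here is minimal tabular Q-learning on the same kind of toy corridor as above. The agent never builds a model of the environment's dynamics; it simply updates its value estimates from the rewards it observes. The environment and hyperparameters are arbitrary illustrative choices.

```python
import random

GOAL, ACTIONS = 4, ["left", "right"]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

def step(state, action):
    next_state = max(0, min(GOAL, state + (1 if action == "right" else -1)))
    return next_state, (10 if next_state == GOAL else -1), next_state == GOAL

# Q[(state, action)] estimates the long-term value of taking `action` in `state`
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}

for _ in range(500):  # episodes of trial and error
    state, done = 0, False
    while not done:
        # epsilon-greedy: explore sometimes, otherwise act greedily
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted best next value
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# after training, the greedy policy should be "right" in every state
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)})
```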
Supervised learning, unsupervised learning and reinforcement learning
In supervised learning, the model is developed using labeled data, where each instance of the data has a predetermined result. Supervised learning techniques are further divided into classification models (prediction of discrete classes, such as whether an email is spam or not) and regression models (prediction of continuous outputs, e.g. sales in the next quarter).
Unsupervised learning refers to algorithms that analyze data without knowing the results for individual data instances. A typical unsupervised learning task is clustering: finding groups of data instances that are similar to one another but different from instances in other groups.
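To make the contrast concrete, here is a toy classification model (supervised: labels are given) next to a toy clustering (unsupervised: no labels). This sketch assumes scikit-learn is installed, and the data points are made up for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: each training instance comes with a known label (0 or 1)
X = [[1, 1], [2, 1], [8, 8], [9, 9]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5, 1.0], [8.5, 9.0]]))  # expected: [0 1]

# Unsupervised: same data, no labels; the algorithm finds the groups itself
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # two clusters, e.g. [0 0 1 1] (cluster ids are arbitrary)
```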
As we've seen, reinforcement learning describes decision-and-reward systems that learn in a setting where "good" actions are rewarded and "poor" actions are penalized. Reinforcement learning is similar to unsupervised learning in that it does not require prior knowledge of the correct results for a set of input data.
Types of reinforcement learning
We distinguish between two types of reinforcement learning:
positive reinforcement – strengthening a behavior or making it more frequent; giving your pet their favorite food in exchange for good behavior is an everyday example. Positive reinforcement can encourage models to make long-lasting, sustainable changes, helping to maximize performance on a specific task. It is the most prevalent form of reinforcement in reinforcement learning problems.
negative reinforcement – when an outcome weakens a behavior, for example by lowering its frequency or intensity; in a reward function this usually takes the form of a penalty, as sketched below.
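In practice, both kinds of feedback typically appear together in the design of a reward function: desirable events earn positive rewards and undesirable ones earn penalties. Here is a small Pac-Man-flavored sketch; the event names and point values are invented for illustration.

```python
# Positive rewards strengthen behaviors we want; penalties discourage the rest.
REWARDS = {
    "ate_cherry": 100,   # positive reinforcement: seek cherries
    "ate_dot": 10,
    "hit_ghost": -500,   # penalty: avoid ghosts
    "step": -1,          # small cost per move encourages finishing quickly
}

def reward_for(events):
    """Total reward for the events that occurred during one time step."""
    return sum(REWARDS[e] for e in events)

print(reward_for(["step", "ate_cherry"]))  # 99
```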
Application of reinforcement learning – AlphaGo
The computer program AlphaGo is one of the most well-known examples of reinforcement learning in action. AlphaGo's ability to defeat the top human players at Go, a game long thought too challenging for a computer, is significant not only for practical applications but also for how artificial intelligence is perceived by broader audiences.
The significant development in this regard came in March 2016, when AlphaGo defeated Lee Sedol in a five-game match. This was also the first time a computer program beat a 9-dan professional player without a handicap. AlphaGo is no longer the most powerful computer Go program, as DeepMind, a division of Google and AlphaGo's creator, has since developed AlphaGo Master, AlphaGo Zero, and AlphaZero.
Go is a traditional game created in China almost 2,500 years ago. It is played by two players who place black and white stones on a board. One of the reasons it was thought that computers couldn't handle Go is that there are more possible stone placements in the game than there are atoms in the universe.
Go is a perfect information game, meaning that each player is aware of every previous move. In such games, if we assume that each player always makes the best possible move, the outcome of the game can in principle be determined from any given position. Finding the ideal game requires simulating the consequences of each candidate move, i.e., traversing the entire search tree (also called the game tree).
Each node of the search tree represents a game state, and a player's move transitions from a node to one of its child nodes. The goal is to find the best path through the tree. Go is so complex, however, that exhaustively determining the best move in a given position would take far longer than is practical, even on modern computers.
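For a game small enough, the tree really can be searched exhaustively. The sketch below runs plain minimax over a hand-built toy tree whose leaf values are invented; Go's tree is far too large for this, which motivates the sampling approach described next.

```python
def minimax(node, maximizing):
    """Exhaustive game-tree search: leaves are numeric payoffs for the maximizer."""
    if isinstance(node, (int, float)):  # leaf node: the game is over
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# A toy tree: each inner list is a position, each number a final score.
tree = [[3, 5], [2, [9, 1]], [4, 7]]
print(minimax(tree, maximizing=True))  # 4
```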
AlphaGo's approach is to shrink the search space to a size where the number of candidate continuations is small enough to be evaluated in a matter of seconds. For this, it employs the Monte Carlo tree search (MCTS) algorithm, which randomly samples candidate moves.
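The Monte Carlo part of the idea can be shown in a few lines: estimate how good a move is by playing many random games from the resulting position and averaging the outcomes. The sketch below applies this "flat" version to the simple game of Nim as a stand-in for Go; full MCTS additionally grows a search tree and balances exploration against exploitation with a selection rule such as UCB.

```python
import random

def playout(stones, our_move):
    """Finish a game of Nim with uniformly random moves (take 1-3 stones;
    whoever takes the last stone wins). Returns True if 'we' win."""
    while True:
        stones -= random.randint(1, min(3, stones))
        if stones == 0:
            return our_move        # the player who just moved took the last stone
        our_move = not our_move

def evaluate_move(stones, take, n_playouts=2000):
    """Estimate our winning probability after taking `take` stones."""
    remaining = stones - take
    if remaining == 0:
        return 1.0                 # taking the last stone wins immediately
    wins = sum(playout(remaining, our_move=False) for _ in range(n_playouts))
    return wins / n_playouts

stones = 10
for take in (1, 2, 3):
    print(f"take {take}: estimated win rate {evaluate_move(stones, take):.2f}")
# leaving the opponent a multiple of 4 stones is the theoretically winning move,
# so taking 2 should tend to score best
```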
After MCTS, the other important component of AlphaGo is its supervised learning (SL) policy network, which was trained on 30 million positions from the KGS Go Server. While the SL policy network helps predict the most probable next moves, it is the reinforcement learning policy network that is trained to predict the winning moves. According to the original article, the reinforcement learning policy network defeated the SL policy network in more than 80% of head-to-head games: https://www.nature.com/articles/nature16961.
Additional uses for reinforcement learning
Reinforcement learning is employed across many different industries for a variety of uses. Some of its applications include:
- individualized recommendations
- trading strategies used by financial institutions
- manufacturing
- management (strategic planning)
- inventory control
- robotics
- business automation
- converting images to text
- distribution management, such as determining the best delivery routes
- self-driving vehicles
- advertising
- finding categories of websites
- chemistry (optimizing chemical reactions)
- gaming personalization
- power sources
- analysis of top Shopify stores
- live auctioning
- building interesting math tasks for students
- news recommendation
Conclusion
In this article, we introduced an important field of machine learning – reinforcement learning, which is increasingly used across many different fields.
Reinforcement learning has already found considerable success in areas such as game playing and autonomous driving, and we expect it to play an important role in the future of AI.