Reinforcement learning is a machine learning approach in which an agent learns to make a sequence of decisions by interacting with its environment through trial and error, receiving feedback in the form of rewards or penalties for the actions it takes.
The agent's main goal is to maximize the cumulative reward of the actions it takes through interaction with the environment.
An example of an agent acting in an environment is a programmed player in a computer game, or a physical robot moving crates in a warehouse.
Reinforcement learning concerns itself with finding an optimal strategy for the agent in a given environment.
The environments considered are usually very large and complex when measured by the number of possible states and the transitions between them.
As an example, in the game of Go the number of possible states grows rapidly as the game progresses.
In the other key game, chess, there are around 400 possible board positions after each player's first move, whereas in Go there are around 130,000.
An environment can also be complex because of imperfect information.
Card games are an example of this: the opponents' cards are not fully known to the player.
Reinforcement learning is often approached and modeled as a Markov Decision Process (MDP). An MDP consists of a set of environment states, the actions available to the agent in each state, a reward function, and a transition model that maps states and actions to next states.
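To make this concrete, here is a minimal sketch of how a toy MDP could be encoded in Python. The states, actions, probabilities and rewards below are invented for illustration and do not come from any real task.

```python
# A toy MDP with two states and two actions, encoded with plain dictionaries.
# All names and numbers here are illustrative placeholders.

states = ["A", "B"]
actions = ["stay", "move"]

# transition[(state, action)] -> list of (next_state, probability)
transition = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "move"): [("B", 0.9), ("A", 0.1)],  # moving can fail
    ("B", "stay"): [("B", 1.0)],
    ("B", "move"): [("A", 0.9), ("B", 0.1)],
}

# reward[(state, action)] -> immediate reward for taking that action
reward = {
    ("A", "stay"): 0.0,
    ("A", "move"): 1.0,
    ("B", "stay"): 2.0,
    ("B", "move"): 0.0,
}
```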
To better understand reinforcement learning, it is useful to highlight one important distinction from the more commonly used supervised learning: in the latter, the model is immediately "informed" whether its prediction is correct or not. Reinforcement learning, in contrast, operates in an environment where the reward for the agent's actions is delayed and only accumulates over a sequence of decisions.
Reinforcement learning algorithms have in recent years achieved great success in beating humans at complex games such as Go, Dota 2, StarCraft II and even Atari video games.
In the second part of our post, we will take a closer look at AlphaGo, which achieved great success in the game of Go against human players.
Concepts of reinforcement learning
To better understand what reinforcement learning is, let us introduce several concepts:
- agent is any entity that takes actions in its environment, e.g. the Pac-Man figure in the maze,
- action is an interaction of the agent with its environment that is valid under the rules that govern the environment. In the context of the Pac-Man game, the agent can e.g. move left, right, up or down,
- environment is where the agent acts and with which it interacts. After each action, the environment takes the agent's state and action and returns the agent's reward and next state,
- state is the current situation the agent finds itself in; it is returned by the environment after each action of the agent. In the Pac-Man game, this is the position of Pac-Man in the maze, the positions of the four coloured ghosts and the accumulated rewards of the agent,
- reward is what the agent receives after it takes a specific action. Rewards can be either immediate or delayed. In the Pac-Man game, a reward of 100 points is e.g. received for each cherry,
- discount factor affects the value of a reward with respect to time: future rewards are discounted by a factor and are worth less than immediate rewards (similar to how discounted cash flow valuation works in finance); the rollout sketch after this list shows the discount factor in action,
- policy is the method that maps the agent's states to the actions that offer the highest total reward. Each state has an expected value of future rewards that the agent receives if it acts according to the policy from that state. One of the main goals of reinforcement learning is to learn the optimal policy. Policies can be deterministic or involve an element of chance by being stochastic,
- value is the expected long-term reward from the current state under a given policy,
- model simulates the functioning of the environment and can return the next state and reward for a given state and action.
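The interplay of these concepts can be summarised in a short rollout loop. The sketch below assumes a hypothetical Gym-style environment with reset() and step() methods; the interface and all names are assumptions made for illustration.

```python
def run_episode(env, policy, gamma=0.99, max_steps=100):
    """Roll out one episode and compute the discounted return.

    Assumes env exposes reset() -> state and
    step(action) -> (next_state, reward, done);
    policy maps a state to an action.
    """
    state = env.reset()
    total_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent acts...
        state, reward, done = env.step(action)   # ...the environment responds
        total_return += discount * reward        # future rewards count less
        discount *= gamma                        # apply the discount factor
        if done:
            break
    return total_return
```

With gamma close to 1 the agent values distant rewards almost as much as immediate ones; with a small gamma it becomes short-sighted.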
A reinforcement learning task can be solved either by using a model (model-based approach) or without one (model-free approach). Model-based approaches try to model the environment accurately and then learn the optimal policy based on this model.
In the model-free approach, the agent learns the optimal policy through trial and error.
Supervised learning, unsupervised learning and reinforcement learning
There are several important differences between reinforcement learning and supervised learning:
- Reinforcement learning trains the agent by letting it interact with its environment
- Supervised learning trains the model by applying it to training data and letting it learn from the deviations of the model's predictions from the labels
- There is no labelled data set available in reinforcement learning
- Supervised learning models learn on a labelled data set
- Reinforcement learning trains the agent to make a sequence of decisions, not only a single decision
- A supervised learning model gives a single decision or prediction for a given input data instance
Types of reinforcement learning
We distinguish between two types of reinforcement learning:
Positive reinforcement – reinforcement is considered positive when an event has a positive effect on the behaviour, e.g. by increasing the frequency or strength of the behaviour; an example would be giving your pet its favourite food for favourable behaviour. Positive reinforcement helps to maximize performance on a given task and can cause models to make sustainable changes that last for longer periods of time. It is the most common type of reinforcement used in reinforcement learning problems.
Negative reinforcement – reinforcement is considered negative when an event has a negative effect on the behaviour, e.g. by decreasing the frequency or strength of the behaviour.
Reinforcement Learning Algorithms
As noted previously, reinforcement learning algorithms can be divided into two main groups:
- model-based
- model-free
Model-free methods include policy optimization and Q-learning.
Policy optimization involves learning the policy that maps states to actions. Policies are subdivided into deterministic policies, where the mapping from state to action is fixed, and stochastic policies, which involve an element of chance in the mapping from state to action; a minimal sketch of both follows.
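The sketch below contrasts the two kinds of policy; the states, actions and probabilities are hypothetical placeholders, not taken from any real problem.

```python
import random

# Deterministic policy: a fixed mapping from state to action.
deterministic_policy = {"low_battery": "recharge", "charged": "work"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "work": 0.1},
    "charged":     {"recharge": 0.1, "work": 0.9},
}

def act_stochastic(state):
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]
```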
Another important model-free reinforcement learning algorithm is Q-learning. It aims to find the optimal action for a given current state. This is done with a Q-table, which maps state-action pairs to values. The values in the Q-table are calculated during the exploration phase, in which the agent selects random actions, receives rewards and updates the Q-table accordingly.
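The update just described can be sketched in a few lines of Python. This is a minimal tabular Q-learning loop with epsilon-greedy exploration, assuming a hypothetical Gym-style environment with reset(), step() and a small discrete action list env.actions; it is an illustration, not a production implementation.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly exploit, sometimes explore randomly.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Classic Q-learning update toward the one-step target.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```

After training, the greedy policy simply picks the action with the highest Q-value in each state.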
For problems with a huge space of possible states, maintaining a Q-table can become computationally infeasible. To improve performance on these problems, Deep Q-Learning was introduced, in which the Q-table is replaced by a deep neural network. The network receives the current state as input and produces a value for each possible action.
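A minimal sketch of such a network is shown below, here using PyTorch; the original post does not name a framework, so this choice and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one value per action, replacing the Q-table."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),  # one Q-value per possible action
        )

    def forward(self, state):
        return self.net(state)

# Usage: the greedy action is the argmax over the predicted values.
# state_dim=4 and n_actions=2 are arbitrary example sizes.
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.zeros(1, 4)            # placeholder state vector
action = q_net(state).argmax(dim=1)  # index of the best action
```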
Application of reinforcement learning – AlphaGo
One of the most well-known applications of reinforcement learning is the computer program AlphaGo. It is important not only for the RL field but also for the perception of artificial intelligence by wider audiences, as AlphaGo was able to beat the best human players at the game of Go, which had long been considered too difficult for a computer to master at a human level.
The major event in this respect occurred in March 2016, when AlphaGo beat Lee Sedol in a five-game match; this was also the first time a computer program had beaten a 9-dan professional player without handicap. As of today, AlphaGo is no longer the most powerful computer Go program, as DeepMind (the developer of AlphaGo and part of Google) has developed three more powerful successors: AlphaGo Master, AlphaGo Zero and AlphaZero.
Go is an ancient game, invented in China almost 2,500 years ago. It is played by two players who place black and white stones on a board. One of the reasons Go was seen as too difficult for computers is the huge number of possible placements of stones – greater than the number of atoms in the Universe.
Go is a game of perfect information, which means that each player can see all the previous moves. In such games one can in principle determine the outcome from any current state, assuming each player makes the optimal move on every turn. Finding the optimal game requires calculating the value of each move with the help of simulations, by traversing the search or game tree of all possible moves.
Each node of the search tree represents a state of the game. When a player performs a move, a transition occurs from the node to one of its child nodes. The aim is to find the optimal path through the search tree. Due to the complexity of Go, calculating the optimal action in a given state on present-day computers would take many orders of magnitude more time than is practical.
AlphaGo's aim is thus to reduce the search space to a size where the number of possible games (played to the end) is small enough to evaluate in a matter of seconds. It uses the Monte Carlo tree search (MCTS) algorithm for this purpose, randomly sampling potential moves.
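To give a feel for how MCTS works, here is a heavily simplified UCT sketch. The game interface (legal_moves, play, is_terminal, result) is hypothetical, and details such as the two-player sign flip and AlphaGo's policy and value networks are deliberately omitted.

```python
import math
import random

class Node:
    """One game state in the search tree."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}         # move -> Node
        self.visits, self.value = 0, 0.0

def uct_score(child, parent_visits, c=1.4):
    # Balance exploitation (average value) and exploration (rarely tried moves).
    if child.visits == 0:
        return float("inf")
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent_visits) / child.visits))

def mcts(game, root_state, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend along the best UCT scores.
        while node.children and not game.is_terminal(node.state):
            pv = node.visits
            node = max(node.children.values(),
                       key=lambda ch: uct_score(ch, pv))
        # 2. Expansion: add one child per legal move.
        if not game.is_terminal(node.state):
            for move in game.legal_moves(node.state):
                node.children[move] = Node(game.play(node.state, move), node)
            node = random.choice(list(node.children.values()))
        # 3. Simulation: random playout to the end of the game.
        state = node.state
        while not game.is_terminal(state):
            state = game.play(state, random.choice(game.legal_moves(state)))
        payoff = game.result(state)  # assumed payoff in [0, 1]
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += payoff
            node = node.parent
    # Recommend the most-visited move at the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```

AlphaGo's actual MCTS is guided by learned networks rather than random playouts, which is what makes the reduced search tractable.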
MCTS is only one of the key components of AlphaGo; the other is a supervised learning (SL) policy network, trained on millions of positions from the KGS Go Server. While the SL policy helps predict the most likely next moves, reinforcement learning is the component that predicts the best winning moves.
Conclusion
In this article, we introduced an important field of machine learning – reinforcement learning. This approach is increasingly used in many different fields, from advertising, finance and autonomous vehicles to industrial automation and other sectors.
Reinforcement learning became more widely known as part of the framework that enabled a program called AlphaGo to beat the best human players at the ancient game of Go. This occurred at a time when many considered such a feat to be at least a decade away for a computer program.