**Have you watched this?**

**Or this?**

Watch these videos as per your convenience and go through this blog to understand (without much mathematical sophistication) how we arrived at this stage.

If you have already watched these videos I totally understand your excitement. This breakthrough came in 2013 after the AlexNet in a paper titled Playing Atari with Deep Reinforcement Learning.

# The Preliminaries

**Machine Learning provides automated methods that can detect patterns in data and use them to achieve some tasks**. Generally, ML tasks are categorized into:

Supervised Learning - the task of learning from labeled datasets.

Unsupervised Learning - the task of drawing inferences from

unlabeled datasetsReinforcement Learning - the task of maximizing cumulative rewards from sequences of action taken by agents while interacting with its environment

We will be discussing on RL here, so let's understand an RL problem by imagining a robot mouse maze world.

The robot mouse is the **agent**, Maze is the **environment** with food at few locations as positive **rewards** and electricity at other locations as negative rewards. At every moment it can **observe** the full state of the maze to decide about the **actions** (such as turn left/right and move forward) to consider next. The **goal** is to allow our robot to learn on its own to find as much food as possible while avoiding an electric shock whenever possible.

To solve these tasks, ML exploits the idea of the *function
approximators*. Since the 1950s we know of several types of function approximators such as:

- Linear Models by Anderson et al. (1958)
- SVMs by Cortes and Vapnik (1995)
- Decisions Tree by Liaw, Wiener, et al. (2002) and Geurts et al. (2006)
- Gaussian Processes by Rasmussen (2004) and
- Deep Learning by Yann LeCun, Yoshua Bengio & Geoffrey Hinton (formally introduced in 2015 after networks actually started going deep) Read it here

DeepMind pioneered by applying deep learning methods into RL and since then, DRLs have achieved beyond human-level performance across many challenging domains (two of which you saw above).

# Fundamentals of RL

The two major RL entities are:

**Agent** - is a piece of software that implements some *policy*. Basically, this policy must decide what action is needed at every time step, given our observations, thus solving a given problem in a more-or-less efficient way.

**Environment** - a model of the world which is external to an agent and communicating in limited scope by rewards (obtained from the environment), actions (executed by the agent and given to the environment), and observations (some information besides the rewards that the agent receives from the environment). The state of the environment can change based on the agent's actions.

**Actions**, **Observations** and **Rewards** are the communication channels here.

Actions are things that an agent can do in the environment. It can be either discrete or continuous.

Observations are pieces of information that the environment provides the agent with, like what's going on around them. It may be relevant to the upcoming reward or may not, but it eventually doesn't drive the agent's learning process (as a reward does).

The purpose of a reward is to give agent feedback about its success, and it's an important central concept in RL. Basically, the term *reinforcement* comes from the fact that a reward obtained by an agent should reinforce its behavior in a positive or negative way.

That's basically the fundamentals of RL. The environment could be an extremely complicated physics model, and an agent could easily be a large neural network implementing the latest RL algorithm, but the basic pattern stays the same: *on every step, an agent takes some observations from the environment, does its calculations, and selects the action to issue. The result of this action is a reward and a new observation. The final goal is to achieve the largest accumulated reward over its sequence of actions.*

# RL Methods

RL methods are mostly categorized as

### Model-free or Model-based

Model-free methods: means that the method doesn't build a model of the environment or reward; it just directly connects observations to actions (or values that are related to actions). In other words, the agent takes current observations and does some computations on them, and the result is the action that it should take. These methods are usually easier to train.

In contrast,

Model-based methods: try to predict what the next observation and/or reward will be. Based on this prediction, the agent is trying to choose the best possible action to take, very often making such predictions multiple times to look more and more steps into the future.

### Policy-based or Value-based

Policy-based methods: are directly approximating the policy of the agent, that is, what actions the agent should carry out at every step. The policy is usually represented by a probability distribution over the available actions.

In contrast,

Value-based methods: here instead of the probability of actions, the agent calculates the value of every possible action and chooses the action with the best value.

### On-policy or Off-policy

On-policy methods: heavily depend on the training data to be sampled according to the current policy we're updating.

In contrast,

Off-policy methods: the ability of the method to learn on old historical data obtained by a previous version of the agent.

# Cross-Entropy

One of the basic RL methods is **cross-entropy** - a model-free, policy-based, and on-policy method which simply means:

- It doesn't build any model of the environment; it just says to the agent what to do at every step
- It approximates the policy of the agent
- It requires fresh data obtained from the environment

In practice, the agent needs to pass an observation from the environment to the network, get probability distribution over actions (i.e. policy), and perform random sampling using probability distribution to get an action to carry out (similar to ML classification task). At the beginning of the training when our weights are random, the agent behaves randomly. After the agent gets an action to issue, it fires the action to the environment and obtains the next observation and reward for the last action. This experience of our agent is called **episodes**. Thus, the loop of such episodes continues.

But there are several limitations of the cross-entropy method:

- For training, our episodes have to be short
- The total reward for the episodes should have enough variability to separate good episodes from bad ones
- There is no intermediate indication about whether the agent has succeeded or failed

Also, for interesting environments, the optimal policy is much harder to formulate and it's even harder to prove their optimality.

# The Bellman Equation of optimality

Richard Bellman proved that *the optimal value of the state is equal to the action, which gives us the maximum possible expected immediate reward, plus discounted long-term reward for the next state*. The Bellman optimality equation of value `V`

of state `S0`

is given as

You may also notice that this definition is recursive: the value of the state is defined via the values of immediately reachable states. These values not only give us the best reward that we can obtain, but they basically give us the optimal policy to obtain that reward: if our agent knows the value for every state, then it automatically knows how to gather all this reward.

Now the value of action `Q(s, a)`

which equals the total reward we can get by executing action `a`

in state `s`

is defined as

We can also define `V(s)`

via `Q(s, a)`

And finally, we can express `Q(s, a)`

via itself

A general procedure to calculate `Q(s, a)`

is the **value iteration algorithm** which allows us to numerically calculate the values of actions with known transition probabilities and rewards includes the following steps:

- Initialize all
`Q(s, a)`

to zero - For every state
`s`

and every action`a`

in this state, perform update: - Repeat step 2

Q-values are much more convenient, as for the agent it's much simpler to make decisions about actions based on Q than based on V. For Q, to choose the action based on the state, the agent just needs to calculate Q for all available actions, using the current state and choose the action with the largest value of Q. To do the same using values of states, the agent needs to know not only values but also probabilities for transitions. In practice, we rarely know them in advance, so the agent needs to estimate transition probabilities for every action and state pair.

# Deep Q-Learning

In practice, the **value-iteration algorithm** has several obvious limitations.

The first obvious problem is the count of environment states and our ability to iterate over them. In the Value iteration, we assume that we know all states in our environment in advance, can iterate over them, and can store value approximation associated with the state.

To give you an example of an environment with a much larger number of potential states, let's consider the Atari 2600 game platform (a very popular benchmark among RL researches). Let's calculate the state space for the Atari platform. The resolution of the screen is 210 x 160 pixels, and every pixel has one of 128 colors. So, every frame of the screen has 210 Γ 160 = 33600 pixels and the total amount of different screens possible is`128^33600`

, which is slightly more than`10^70802`

. If we decide to just enumerate all possible states of Atari once, it will take billions of years even for the fastest supercomputer. Also, 99.9% of this job will be a waste of time, as most of the combinations will never be shown during even long gameplay, so we'll never have samples of those states. So, this causes a major limitation for the value iteration method.The second problem with the value iteration approach is that it limits us to discrete action spaces. Indeed, both

`Q(s, a)`

and`V(s)`

approximations assume that our actions are a mutually exclusive discrete set, which is not true for continuous control problems where actions can represent continuous variables, such as the angle of a steering wheel. This issue is much more challenging than the first (and I would like to cover it in another blog), so for now, let's assume that we have a discrete count of actions and this count is not large.

To solve the first limitation, we can use states obtained from the environment to update the values of states, which can save us lots of resources. This modification of the Value iteration method is known as **Q-learning**. But what if those count of the observable set of states is huge (not infinite). For example, Atari games can have a large variety of different screens, so if we decide to use raw pixels as individual states, we'll quickly realize that we have too many states to track and approximate values for. For example, consider the two different situations in a game of Pong where the agent has to act on them differently due to change in single-pixel representing the ball.

As a solution to this problem, we can use a nonlinear representation that maps both state and action onto a value, which can be trained using a **deep neural network**.

Cut short, There are many tips and tricks that researchers have discovered to make Deep Q-Networks training more stable and efficient. However, tricks like

π-greedy: switching between random and Q policy using the probability hyperparameter π. By varying, we can select the ratio of random actions. The usual practice is to start with = 1.0 (100% random actions and that's better as it gives us more uniformly distributed information about the environment states) and slowly decrease it to some small value such as 5% or 2% of random actions. It helps both to explore the environment in the beginning and to stick to a good policy at the end of the training.

replay buffer: use a large buffer of our past experience and sample training data from it, instead of using our latest experience. It allows us to train on more-or-less independent data, but data will still be fresh enough to train on samples generated by our recent policy.

and

- target network: To make training more stable we keep a copy of our network and use it for the
`Q(sβ², aβ²)`

value in the Bellman equation. This network is synchronized with our main network only periodically, for example, once in N steps (where N is usually quite a large hyperparameter, such as 1k or 10k training iterations).

formed the basis that allowed DeepMind to successfully train a DQN on a set of 7 Atari games and demonstrate the efficiency of this approach applied to complicated environments. Later, at the beginning of 2015, they published a revised version of the 2013 paper, on 49 different games.

The final form of the DQN algorithm used in the paper contains the following steps:

Initialize parameters for

`Q(s, a)`

and`β(s, a)`

with random weights, π -> 1.0, and empty replay buffer.With probability π, select a random action

`a`

, otherwise`a = argmax_a Q(s, a)`

Execute action

`a`

in an emulator and observe reward`r`

and the next state`s'`

Store transition

`(s, a, r, s')`

in the replay bufferSample a random mini-batch of transitions from the replay buffer

For every transition in the buffer, calculate target

`y = r`

if the episode has ended at this step or`y = r + πΎ max β(s', a')`

, otherwiseCalculate loss, β = (Q(s, a) - y)^2

Update Q(s, a) using the SGD optimizer algorithm by minimizing the loss in respect to model parameters

Every N steps copy weights from Q to β

Repeat from step 2 until converged.

# Remarks

This paper had a significant effect on the Reinforcement Learning field by demonstrating that, despite common belief, it's possible to use nonlinear approximators in RL. This proof of concept stimulated large interest in the deep Q-learning field in particular and in deep RL in general. Since 2015, many improvements have been proposed, along with tweaks to the basic architecture, which significantly improves convergence, stability, and sample efficiency of the basic DQN invented by DeepMind. In January 2016, DeepMind published a paper titled Mastering the game of Go with deep neural networks and tree search which presented the AlphaGo version which defeated European Go Champion Fan Hui. Very conveniently, again in October 2017, they published a paper titled Rainbow: Combining Improvements in Deep Reinforcement Learning which presented the seven most important improvements to DQN reaching SOTA results on Atari Games Arcade. 3 years since, this technology has progressed a lot, has produced some amazing results and shall surely be surprising us in the future.

## Top comments (2)

Awesome explanation man,

My first time diving into this and I feel the way youve explained it was very easy to pick up on!

Thanks a lot Greg for the appreciation. :D