RoksanaTanov
Real-World Applications of Q-Learning: Gentle Introduction


Reinforcement learning, as covered in our recent post, deals with a unique problem setup in which an agent tries to learn the optimal way of interacting with an environment. In return for its actions, it receives delayed feedback signals known as rewards; the agent’s ultimate goal is to find an optimal policy that maximizes the cumulative numerical return.

RL-based technologies have already been adopted by inventive companies to optimally configure multi-tier web networks, build robust recommendation algorithms, engineer sophisticated intrusion detection schemes for IoT networks, and more.

In this post, we’ll delve a bit deeper into sequential decision making, discuss, in general terms, the promise and shortcomings of Reinforcement Learning, and explain the principles of one of the most popular RL strategies – Q-learning.

Let’s get to it!

Mathematically, Reinforcement Learning problems are formulated as Markov Decision Processes (MDPs); they’re defined by:

a set of states (S);
a set of actions (A);
a distribution of rewards (R) – a function that maps a state-action pair to a reward;
a transition probability distribution (P) over the next state, given a state-action pair;
a discount factor (γ) that defines how important rewards received now are compared to rewards received later in the episode.
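
To make these components concrete, here is a minimal sketch of how they might be written down in Python for a toy problem; the state and action names, the reward values, and the transition probabilities below are all made up for illustration and are not part of any real environment:

```python
# Hypothetical two-state, two-action MDP, written out explicitly.
states = ["s1", "s2"]                      # S: set of states
actions = ["left", "right"]                # A: set of actions

# R: maps a (state, action) pair to an immediate reward.
rewards = {("s1", "left"): 0.0, ("s1", "right"): 1.0,
           ("s2", "left"): 0.0, ("s2", "right"): 5.0}

# P: maps a (state, action) pair to a distribution over next states.
transitions = {("s1", "left"):  {"s1": 1.0},
               ("s1", "right"): {"s2": 0.9, "s1": 0.1},
               ("s2", "left"):  {"s1": 1.0},
               ("s2", "right"): {"s2": 1.0}}

gamma = 0.9                                # γ: discount factor
```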
In an MDP, everything begins with the environment sampling an initial state from the initial state distribution. Then a so-called reinforcement learning loop begins:

1) the agent plays an action (a1) in a state (s1);

2) the environment, having received the state-action pair, sends back a reward R1(s1, a1);

3) the agent transitions into the next state (s2).

This continues iteratively until we reach the terminal state and the episode ends.
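
This interaction loop can be sketched in a few lines of Python. The `env.reset()` / `env.step()` interface below is only an assumed, Gym-style convention (not something defined in this post), and the random fallback policy is a placeholder:

```python
import random

def run_episode(env, policy=None):
    """Play one episode: act, receive a reward, move to the next state."""
    state = env.reset()                      # environment samples the initial state
    total_reward, done = 0.0, False
    while not done:                          # loop until a terminal state is reached
        # 1) the agent plays an action in the current state
        action = policy(state) if policy else random.choice(env.actions)
        # 2) the environment returns a reward for the (state, action) pair,
        # 3) and the next state the agent transitions into
        next_state, reward, done = env.step(action)
        total_reward += reward
        state = next_state
    return total_reward
```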

All of the agent’s actions are dictated by a policy π (a function that maps the state s at time step t to the action a at the same time step). The objective is to find an optimal policy π* that tells the agent, given the current state, which action brings the maximum cumulative discounted reward.

So, how is π* calculated?

First, let’s establish how the goodness (optimality) of states and actions is measured in RL.

Each policy produces a certain trajectory of states, actions, and rewards.

We use a Value function (V) to measure how good a certain state is, in terms of expected cumulative reward, for an agent following a certain policy. The optimal value function (V*), therefore, is the one that gives us the maximum achievable value (return) for each state in the given state space (the set of all possible states).

A Q-value function (Q) shows us how good a certain action is, given a state, for an agent following a policy. The optimal Q-value function (Q*) gives us the maximum return achievable from a given state-action pair by any policy.
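
For a finite problem, both functions can be stored as simple lookup tables, and the optimal value of a state is just the best Q-value available in it: V*(s) = max over a of Q*(s, a). Here is a minimal Python sketch; the states, actions, and Q-values are purely illustrative:

```python
# Hypothetical Q-table: (state, action) -> estimated return.
Q = {("s1", "left"): 0.4, ("s1", "right"): 1.7,
     ("s2", "left"): 0.9, ("s2", "right"): 5.0}

actions = ["left", "right"]

def value(state):
    """V(s): the best return achievable from s under the current Q estimates."""
    return max(Q[(state, a)] for a in actions)

print(value("s1"))   # 1.7 -> acting greedily, 'right' is the better action in s1
```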

One of the key properties of Q* is that it must satisfy the Bellman Optimality Equation, according to which the optimal Q-value for a given state-action pair equals the immediate reward the agent gets for that action in the current state plus the maximum discounted reward it can obtain from any possible state-action pair that follows. The equation looks like this:

Q*(s, a) = E[ R_{t+1} + γ · max_{a′} Q*(s′, a′) ]

Since the Bellman Equation lets us calculate Q* at each time step, it gives us a way to determine the optimal policy: because we know the value of the following state-action pairs, the optimal strategy is to play actions that maximize the expected value of

R + γ Q*(s’,a’)

R is the immediate reward;
Q*(s′, a′) is the optimal Q-value for the state we’ll end up in;
γ is the discount factor.

The optimal policy π*, as we can infer from this, is to take the best action – as defined by Q* – at each time step.
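
This is exactly what tabular Q-learning does in practice: it keeps a table of Q-value estimates, improves them with the Bellman update after every step, and acts (mostly) greedily with respect to that table. Below is a minimal sketch, assuming the same hypothetical Gym-style environment interface as above and arbitrarily chosen hyperparameters:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                   # Q[(state, action)] starts at 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # ε-greedy: explore occasionally, otherwise take the best known action
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bellman update: move Q(s, a) towards R + γ · max_a' Q(s', a')
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```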

View blog and read more about Q-learning and Deep Q-Learning. The post appeared first on Custom Software Development Company Perfectial.

Top comments (1)

Akin C.

Hello RoksanaTanov,

thank you for the article.
I couldn't find much about Q-Learning on DEV.to, so I'm glad you took the time to post about it.

Here are some well-intentioned critiques to make your article even better:

  • The link to the wiki regarding "Markov Decision Processes (MDPs)" doesn't work for me as the result is as follows:
    "_Wikipedia does not have an article with this exact name. Please search for Markov decision process rel= in Wikipedia to check for alternative titles or spellings. _"

  • The article on the Avenga website is much better structured and formatted than the one on Dev.to, but since I've read the one on Dev.to, I'd like to see it improved at least a little from how it is at the moment.

Here is an example:
"set of states (S), set of actions (A);
distribution of rewards (R) – a function that maps a state-action pair to a reward;"

improved suggestion:
"
(S): Set of states
(A): Set of actions
"

I would have liked to have shared your article on Twitter with improved content.