Markov Decision Processes (MDP): Basic Concepts

A Markov decision process (MDP) is a mathematical framework used to model decision-making problems in which outcomes are partly random and partly under the control of a decision-maker. It consists of a set of states, a set of actions, a transition model, and a reward function. The decision-maker takes an action in each state, the environment transitions to a new state according to a probability distribution that depends on the current state and the action taken, and the decision-maker receives a reward based on that state and action.

Here’s an example of an MDP:

Consider a robot that is trying to navigate through a maze. The robot can move in four directions: up, down, left, or right, and its goal is to reach the exit of the maze. The robot's current state is its current location in the maze. The robot can only see a limited area around it, so it doesn't have access to the entire maze. The robot's actions are to move in one of the four directions, and the environment transitions to a new state based on the action taken and a probability distribution over the next state. For example, if the robot tries to move up, there might be a 70% chance that it actually moves up, a 10% chance that it slips to the left, a 10% chance that it slips to the right, and a 10% chance that it stays where it is. The robot receives a reward based on the state and action taken: for example, a reward of +10 if it reaches the exit of the maze, a reward of -1 if it bumps into a wall, and a reward of -0.1 for each step it takes.
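To make the example concrete, here is a minimal sketch in Python of how such a maze could be encoded. The 70/10/10/10 slip probabilities and the +10 / -1 / -0.1 rewards come from the example above; the grid size, exit position, and all other details are hypothetical choices made only for illustration.

```python
import random

# A minimal sketch of the maze example (hypothetical 4x4 grid, exit at (3, 3)).
# States are (row, col) cells and actions are the four moves.
ROWS, COLS = 4, 4
EXIT = (3, 3)
ACTIONS = {
    "up":    (-1, 0),
    "down":  (1, 0),
    "left":  (0, -1),
    "right": (0, 1),
}

def step(state, action):
    """Sample the next state and reward for taking `action` in `state`."""
    # 70% the intended move, 10% slip to each side, 10% stay in place.
    dr, dc = ACTIONS[action]
    moves = [(dr, dc), (-dc, dr), (dc, -dr), (0, 0)]
    move = random.choices(moves, weights=[0.7, 0.1, 0.1, 0.1])[0]

    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < ROWS and 0 <= col < COLS:
        next_state = (row, col)
        reward = 10.0 if next_state == EXIT else -0.1  # exit bonus or step cost
    else:
        next_state, reward = state, -1.0               # bumped into a wall
    return next_state, reward

# Roll out a few random steps from the top-left corner.
state = (0, 0)
for _ in range(5):
    action = random.choice(list(ACTIONS))
    state, reward = step(state, action)
    print(action, state, reward)
```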

The environment is typically formulated as a Markov Decision Process with the following components:

1.) Set of states: S
In chess, for example, a state is a given configuration of the board.

2.) Set of actions: A (for example: up, down, left, right)
Every possible move in chess or tic-tac-toe.

3.) Conditional distribution of the next state
The next state depends only on the current state and action (the "Markov property"). P(s'|s,a) is the transition probability matrix.

4.) R(s,s'): the reward function, giving the reward for ending up in state s' after the agent takes action a in state s.

5.) γ (gamma): the discount factor, which controls how much future rewards are worth relative to immediate ones.
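Once these five components are written down, the MDP can be solved, for example with value iteration, which repeatedly applies the Bellman backup V(s) ← max over a of Σ P(s'|s,a)·(R(s,s') + γ·V(s')). Below is a minimal sketch on a tiny made-up two-state MDP; every number in it is hypothetical and chosen purely for illustration.

```python
# A tiny, made-up MDP with two states and two actions, just to show how
# S, A, P(s'|s,a), R(s,s'), and gamma fit together. All numbers are hypothetical.
S = ["A", "B"]
A = ["stay", "go"]

# P[s][a][s'] = P(s'|s,a), the transition probability matrix.
P = {
    "A": {"stay": {"A": 0.9, "B": 0.1}, "go": {"A": 0.2, "B": 0.8}},
    "B": {"stay": {"A": 0.1, "B": 0.9}, "go": {"A": 0.8, "B": 0.2}},
}
# R[s][s'] = R(s,s'), the reward for moving from s to s'.
R = {
    "A": {"A": 0.0, "B": 1.0},
    "B": {"A": 0.0, "B": 2.0},
}
gamma = 0.9  # discount factor

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max over a of sum over s' of P(s'|s,a) * (R(s,s') + gamma * V(s'))
V = {s: 0.0 for s in S}
for _ in range(100):
    V = {
        s: max(
            sum(p * (R[s][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
            for a in A
        )
        for s in S
    }

print(V)  # approximate optimal value of each state
```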

The Markov property is a fundamental assumption in reinforcement learning. It states that the current state of an agent contains all the information that is relevant for decision-making. In other words, the future state of the agent depends only on its current state and not on its past states. This assumption enables theoretical proofs for certain algorithms and is widely used in current research.
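In symbols, the Markov property says that, given the current state and action, the earlier history provides no additional information about the next state:

P(s' | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s' | s_t, a_t)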

For example, in a Markov decision process (MDP), we assume that the next state is conditionally independent of the whole history given the current state and action. However, in real-life scenarios this assumption may not always hold, and when it doesn't, an RL algorithm can fail in a specific environment.

Reference:
https://web.stanford.edu/class/cme241/lecture_slides/david_silver_slides/MDP.pdf
