
Hayden Rear

Temporal Differencing vs Markov Chain Monte Carlo to Understand Model-Free, Model-Based, and Inverse Reinforcement Learning.

Reinforcement Learning

I remember the first lecture I watched on reinforcement learning. It was an MIT lecture about using a robot to bounce a ping-pong ball with a paddle. The lecturer discussed how, for the machine to learn, it needed to "remember" its experiences in order to update its ability.

I had an AHA! moment that let me see how a machine could "learn". At the most extreme, it could be a table remembering each exact experience: it could just try again and again until it succeeded, recording the successes and using the states leading up to those successes to work out how to make the next moment a "success". It could record a state and the action to take from that state in order to maximize "success".
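To make that concrete, here is a minimal Python sketch of the kind of lookup table I was picturing. Everything in it (the state and action names, the helper functions) is made up for illustration; the point is just that remembering which actions led to successes in which states is already enough to pick the next action.

```python
from collections import defaultdict
import random

# A minimal sketch of the "table of experiences" idea: for each state,
# remember how often each action eventually led to a success, then pick
# the action with the best record. State and action names are illustrative.
success_counts = defaultdict(lambda: defaultdict(int))
attempt_counts = defaultdict(lambda: defaultdict(int))

def record_attempt(state, action, succeeded):
    # Remember one experience: we tried `action` in `state` and it
    # either contributed to a success or it didn't.
    attempt_counts[state][action] += 1
    if succeeded:
        success_counts[state][action] += 1

def best_action(state, actions):
    # Pick the action with the highest observed success rate,
    # falling back to a random choice for states we have never seen.
    if state not in attempt_counts:
        return random.choice(actions)
    return max(
        actions,
        key=lambda a: success_counts[state][a] / max(attempt_counts[state][a], 1),
    )
```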

However much I thought I knew in that moment, I failed to understand the complexity. It was one of those moments where I understood some very basic thing and generalized it to every possibility. In fact, to use reinforcement learning, you don't even need to be able to create "successes" up front; the agent can learn as it goes along, as long as there is a way of extracting reward from the environment. Then there is one step further: dynamically learning the reward function itself. Taking this step is almost "creepy". Not only does the agent learn as it goes along, it also learns how to learn by learning "what" to learn! This is called "Inverse Reinforcement Learning".

Markov Chain Monte Carlo

Monte Carlo methods are the foundation of a lot of important machine learning algorithms, including temporal differencing. I think most machine learning algorithms can be understood in terms of Markov Chain Monte Carlo, including Transformers and Temporal Differencing. Markov Chain Monte Carlo, as I'm using it here, is a model-based approach: it extracts the "model" from the environment in a first step, and then samples to find the most desirable state, based on the probability of reaching each state from an action and the reward of those states.
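Here is a rough Python sketch of that model-based idea, with names I made up for illustration: count transitions and rewards from sampled experience to build the "model", then score each action by the expected reward of the states it has led to.

```python
from collections import defaultdict

# Estimated model of the environment, built from sampled experience.
# transition_counts[state][action][next_state] counts observed transitions.
transition_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
state_rewards = {}  # last observed reward for each state

def observe(state, action, next_state, reward):
    # First step: extract the model from the environment by counting
    # where each action took us and what reward we found there.
    transition_counts[state][action][next_state] += 1
    state_rewards[next_state] = reward

def expected_reward(state, action):
    # Probability-weighted reward over the states this action has led to.
    counts = transition_counts[state][action]
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * state_rewards.get(s2, 0.0) for s2, n in counts.items())

def choose_action(state, actions):
    # Second step: use the estimated model to pick the most desirable action.
    return max(actions, key=lambda a: expected_reward(state, a))
```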

Temporal Differencing

Temporal differencing takes this a step further. In Monte Carlo, we start by modeling the environment. Temporal differencing sits further toward the "online" learning side. We start by initializing our estimate of the effect of each decision arbitrarily. As we go along, we update the function that predicts the reward of the next state, and by "next state" I really mean every state I might reach from that state, usually discounted. At this point I am a chess master looking ahead at all of the possible moves and discounting them, or I am a really bad chess player who has no idea how to evaluate the moves. Hopefully you move from bad to good as time goes on.
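A minimal TD(0)-style sketch, assuming a small discrete state space, looks something like this. The value table starts out arbitrary and is nudged toward the observed reward plus the discounted value of the state we actually landed in:

```python
from collections import defaultdict

V = defaultdict(float)   # arbitrary initial value estimates for each state
alpha = 0.1              # learning rate: how far to nudge each estimate
gamma = 0.9              # discount factor for the value of future states

def td_update(state, reward, next_state):
    # Move V(state) toward the one-step target: reward + gamma * V(next_state).
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
```

The estimate is updated after every single step, bootstrapping off our own (initially arbitrary) estimate of the next state, which is what makes it feel so "online".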

Really, model-based and model-free approaches both have "models", but the definition of "model" is blurry in this context:

What is the "Model" in the context of model-based and model-free

We talk about "model-based" and "model-free". But what is this "model"? The model can be thought of as learning to estimate the states, actions, and rewards, and how they relate to each other: if I am in a particular state, then these are my actions and this is the reward. The distinction between model-based and model-free lies in the environment dynamics. Does the agent maintain some explicit idea of how the environment transitions between states based on its actions? Does it then sample from that idea of the environment in order to decide which action to take? In model-free algorithms, we learn a policy based on the state. In model-based algorithms, we model the environment directly. It's a bit confusing because of course we have models either way; in model-free learning, the "model" is a model of value or behavior rather than a model of the environment itself.
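One way I keep the distinction straight is to look at what each approach actually has to store. This is only an illustrative sketch, not any particular library's API:

```python
from collections import defaultdict

# Model-based: an explicit estimate of the environment's dynamics,
# i.e. P(next_state | state, action), plus the rewards of states.
transition_model = defaultdict(lambda: defaultdict(dict))  # [state][action] -> {next_state: prob}
reward_model = {}                                          # next_state -> reward

# Model-free: just a value for each state-action pair, updated from
# experience. Nothing here represents "what state comes next".
q_values = defaultdict(float)                              # (state, action) -> value
```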

Inverse Reinforcement Learning as a Step from Model-Free

If I think of model-based as one step toward "online" learning, and model-free as another step in that direction, then learning the reward function itself online feels like one step too far!
