How does a memoryless chain map a probability distribution? My rabbit hole into Markov Chain Monte Carlo

#programming #machinelearning #learning #computerscience

As my exams were approaching, I was neck-deep in revision. Literally. My neck ached after long library hours so I let my mind wander during my breaks. I thought about how at a very macroscopic level, all these AI tools I use to help me code, organise and plan use many vast matrices internally to make computations. They execute a plethora of transformations and operations to produce a valid response to my input (the prompt). A strange thought then came to my head, matrices talking to each other through transforming elements in sync. After a brief search I found that matrix operations are also used in many social networks and other machine learning applications. So I chose to plan my next project on it! After more research into matrices, I stumbled upon Markov chain uses in matrices.

What is a Markov chain matrix (Transition Matrix)?

A Markov chain is a system where the next state is only dependent on the current state but not the whole past trajectory; it's memoryless! Just like the geometric distributions I came across in my course earlier this year. This is really useful in modeling sequential variables where the next one is dependent on the current one. In the frame of the bigger picture, Markov Chains can be stored and used in matrix operations and is normally called a Transition Matrix. It's a special matrix that is used to predict systems that rely on Markov chain dynamics when transitioning between states. All the matrix entries must be nonnegative and the rows must add up to one.

With this newfound knowledge, I went ahead and looked for computer science applications of Markov chains, and I began reading about Markov Chain Monte Carlo (MCMC). I've already come across Monte Carlo sampling while looking through backtesters for the IMC prosperity 4 challenge.

What is Monte Carlo sampling?

Monte Carlo sampling is helpful for estimating the mean of a complex distribution. It uses a large sample and it estimates the distribution mean by computing the sample average. The most frequent analogy I've come across is throwing darts at a dartboard, where your target region is the dartboard and the darts are the random samples. The distribution is the probability function that dictates where the darts land. Due to the law of large numbers and central limit theorem these estimators can give you a reliable estimate of the mean. Monte Carlo sampling is very computationally intense by nature and with more sophisticated statistical models working in higher dimensions, the computational cost can grow very quickly!

The key insight that's used in MCMC is using Markov Chain logic to suggest the next point, ensuring the next value only uses the current one to compare the relative probability density of the two points. This sounds incomplete right? Until you realise that MCMC doesn't require the normalizing constant, while conventional methods do.

Because MCMC only needs the unnormalised posterior, it can characterise the full shape of distributions that conventional methods couldn't touch. It can model peaks, tails and correlations between parameters while being as precise as your number of samples allows, entirely independent of the normalising constant.

Instead of trying to sample randomly across a massive space to propose samples in the high probability regions (which wastes computational power), an MCMC algorithm acts like a "smart walker". It proposes a new step, evaluates the target probability density, and decides whether to move or stay put. Over thousands of iterations, the chain approaches a stationary equilibrium where it spends time in regions proportional to their actual probability.

Great, but what does it actually do?

MCMC is applied in real world scenarios to characterise distributions where simple methods fail due to the intractability of the posterior. This is perfect for drug discovery and genomics where parameter interactions are complex. It can also be an effective way of estimating uncertainty. Which is great for doctors who need to explain how risky or effective certain drugs are with a more specific degree of uncertainty.

After reviewing the mathematical concepts, I wanted to investigate the current implementations of MCMC. I found the Metropolis-Hastings (MH) algorithm, the most fundamental algorithm for running an MCMC simulation. It's crazy that it was invented in the 1950's! MH algorithms are great for low dimensionality probability distributions but recent advancements in the field have moved away from MH and towards approaches that use different proposal methods like Hamiltonian Monte Carlo (HMC). Interesting insights I found in articles related to MCMC such as a recent benchmark by Towards Data Science and Blackjax-NS: GPU Bayesian Inference is that MH is primarily built to run on the CPU but samples faster on the GPU sometimes depending on the dimensionality and posterior of your target distribution. The only issue with that is, its performance varies significantly depending on the dimensionality and structure of your problem, and some ports can actually be slower than CPU computation.

While digesting these mathematical concepts and algorithmic implementations, I decided that the most effective way for me to learn and understand these would be through my own implementation of Metropolis-Hastings on the CPU. After improving my understanding I'll improve my implementation and compare it to open-source algorithms. I want to be able to understand why their implementation performs differently in some areas. Perhaps after the CPU implementation I could explore GPU/TPU methods and how they differ. I will keep updated posts as my progress continues.

Thank you for taking the time to read my thoughts!