DEV Community

Angela F.

Stop Your UMAP From Moving: The Pipeline Randomness You're Missing

During the Spring and Summer of 2025, I was asked to create a web application to show clusters of semantically similar tweets. This was an exciting endeavor aimed at creating a dashboard to visualize how ideas spread in society.

Community Archive Cluster Map

This article is about a specific problem I solved in that project: how to get cluster plots that stay consistent from run to run. Solving it required a deep dive into how randomness is used within UMAP's algorithm: where it is applied, how it helps with efficiency, and its trade-offs. That understanding showed me how to control it, and ultimately how to produce the consistent render I was asked for.


The Paradox of Randomness in Optimization

It's humbling to discover that concepts and strategies you think you understand, you in fact do not. Such was the case with reproducibility. To create stability from run to run, I first had to understand how randomness was being used as an optimization tool.

Machinelearningmastery.com offered a ton of clarity on this:

Using randomness in an optimization algorithm allows the search procedure to perform well on challenging optimization problems that may have a nonlinear response surface. This is achieved by the algorithm taking locally suboptimal steps or moves in the search space that allow it to escape local optima.


How UMAP Uses Randomness

Graph Showing Local/Global Optima

This image gives an excellent visual explanation of the quote above. If the algorithm never made random choices, it would settle into the local minimum and conclude it had found the optimal solution: a move along the X-axis in either direction would look like a worse choice, since the curve begins to climb back up the Y-axis. Randomness, however, gives the algorithm permission to make a seemingly 'bad' move, and in doing so creates the potential for finding the global minimum – the truly optimal solution in this case.

How this helps us in creating semantic clusters: UMAP needs to make random choices when it's learning from your data. It picks random starting spots for your data points, then makes random moves to refine the clusters. This is the first reason we get different clusters every time we rerun the code.


Controlling the Chaos with random_state

To achieve stability in our renders, we need to control for this. This is where random_state comes in. Behind the scenes, the randomness isn't really random at all; it is pseudorandom, a calculated value. With random_state=42, for example, 42 becomes the seed that is plugged into a mathematical formula. That formula is rerun many times, because random number generators need to produce a long sequence of numbers for UMAP to use throughout its process.

Here is how it works:

  1. First call: plug 42 into the formula, get number A
  2. Second call: plug number A back into the formula, get number B
  3. Third call: plug number B back into the formula, get number C
  4. And so on…

Since we are controlling for the very first input, and each subsequent calculation requires that the input be the output of the previous calculation, the sequence of values generated becomes predictable.
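The steps above can be sketched with a toy seeded generator (a minimal linear congruential generator, purely illustrative – not the generator NumPy or UMAP actually uses):

```python
def lcg(seed):
    """Toy linear congruential generator: each output feeds the next call."""
    state = seed
    while True:
        # Fixed constants (from Numerical Recipes); any fixed constants
        # would demonstrate the same determinism.
        state = (1664525 * state + 1013904223) % 2**32
        yield state

gen_a = lcg(42)
gen_b = lcg(42)
# Same seed -> identical "random" sequence, on this run and every future run
print([next(gen_a) for _ in range(3)])
print([next(gen_b) for _ in range(3)])
```

Two generators seeded with 42 produce the exact same stream, which is the whole trick behind random_state.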

My UMAP reducer looked like this:

reducer = umap.UMAP(
    n_components=2,            # project down to 2D for plotting
    random_state=42,           # seed UMAP's random number generator
    spread=spread,
    min_dist=min_dist,
    n_neighbors=n_neighbors,
    metric='cosine'            # cosine distance suits text embeddings
)
base_coordinates = reducer.fit_transform(base_embeddings)
return reducer, base_coordinates, base_indices

Multiple Sources of Randomness

After implementing this solution, I ran the project. Everything clustered as it should have. Then I reran the project, and much to my surprise – the data points moved. The render was not stable. There had to be more sources of randomness I wasn't considering.

I spent a great deal of time scouring the internet for any resource describing the different points in the UMAP pipeline where randomness occurs. I came across an article entitled "Uniform Manifold Approximation and Projection in R". I performed a quick 'CTRL + F' search for 'randomness' to assess whether there was anything useful. There certainly was. The article states:

As shown in the section on tuning, embedding of a raw dataset can be stabilized by setting a seed for random number generation. However, the results are stable only when the raw data are re-processed in exactly the same way. Results are bound to change if data are presented in a different order.


The Root(?) of the Problem

So, what this meant for my code: my dataset was subject to both change and randomness in two places:

  1. The first was the sample of tweets used at each run. The sample size defaults to 21K tweets, but each run was drawing a different random set of 21K.
  2. The second was a feature I had implemented that lets the user enter 'test tweets' – this also changed the data at each run, every time a tweet was added or deleted.

But if you recall, the article specifically stated: "...the results are stable only when the raw data are re-processed in exactly the same way."

I then added a random seed here:

base_tweets_df = base_tweets_df.sample(n=max_tweets, random_state=42)

This would hopefully ensure that the sample tweets would remain consistent from run to run. After implementing the second random seed, I reran the project. And to my astonishment, the clusters no longer moved from run to run. I could now change the sample data size and rerun the project, and I would get a consistent and static render of clusters from run to run.
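The same principle applies to any sampling step, not just pandas. A minimal sketch with Python's standard library, standing in for the .sample call above (tweet_ids and sample_tweets are illustrative names, not from the project):

```python
import random

tweet_ids = list(range(100_000))  # stand-in for the full tweet dataset

def sample_tweets(ids, n, seed=42):
    # Seeding a dedicated Random instance makes the draw repeatable
    # without touching the global random state.
    return random.Random(seed).sample(ids, n)

run_1 = sample_tweets(tweet_ids, 5)
run_2 = sample_tweets(tweet_ids, 5)
print(run_1 == run_2)  # True: the same "random" 5 tweets every run
```

With the seed fixed, every rerun hands UMAP the identical base dataset, which is exactly what the quoted article demands.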


The Final Challenge: Dynamic User Input

The next test was to see whether my success held when adding and removing test tweets via the UI. Unfortunately, it was short-lived. Once again, data points moved each time a tweet was added or removed.

"...the results are stable only when the raw data are re-processed in exactly the same way."

That phrase again. It was clear that every time a user inputted a test tweet or deleted a test tweet, the raw data was not being re-processed in the exact same way. UMAP was recalculating everything from scratch as it thought it had a new dataset to work with.


The Storage Room Analogy: A Mental Model for the Solution

The solution went like this. Say you store all of your holiday decorations in a storage room, and you have decided that Christmas decorations always go in the upper right corner of the shelves, Halloween decorations always go in the lower left corner, Thanksgiving in the middle, and so on. If this layout never changes, then when new boxes of Christmas decorations are added, they instinctively go in the upper right corner, and that new glow-in-the-dark skeleton you got for this Halloween naturally goes in the lower left. Because the rules never change, new decorations always join the same cluster of related boxes in the same location, and when you remove them, you can always predictably find them.

Image Depicting a Storage Room with Boxes

In this image, you can see that each holiday has a designated spot for its decorations. With these spots enforced, any new boxes of Christmas decorations will naturally go in the upper right-hand corner, and Thanksgiving decorations added will naturally go in the middle. This is possible because the rules have been cached and are applied to each new box of decorations as they are added.

This was the basis for the fix. In the context of the code, the idea is identical. I would train UMAP on the exact same dataset from the embeddings file. The random seed ensured that the same dataset would be used from run to run, and the reducer would be cached so that UMAP's existing trained model could use .transform() for new tweets. So, the base dataset would always cluster the same, and as new test tweets came in and had the same reducer applied, they would always cluster in the same location as well – exactly in the same way that Thanksgiving decorations would always end up in the middle of the shelves.
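The shelf analogy translates into a small caching pattern: fit once, transform forever after. Here is a minimal sketch using a toy stand-in reducer (FrozenReducer and embed are illustrative names I made up; the real code caches the fitted umap.UMAP object and calls its .transform() the same way):

```python
import numpy as np

class FrozenReducer:
    """Toy stand-in for a fitted UMAP model: fit_transform learns a
    projection once; transform reuses it unchanged for later points."""

    def fit_transform(self, X, seed=42):
        rng = np.random.default_rng(seed)     # seeded: same projection every run
        self.projection_ = rng.normal(size=(X.shape[1], 2))
        return X @ self.projection_

    def transform(self, X_new):
        # Reuse the cached projection: the 'shelf layout' never changes,
        # so new points always land in the same neighborhood.
        return X_new @ self.projection_

_cached_reducer = None  # cache survives user edits to the test tweets

def embed(base_embeddings, new_embeddings):
    """Fit once on the fixed base set; project new tweets with the frozen model."""
    global _cached_reducer
    if _cached_reducer is None:
        _cached_reducer = FrozenReducer()
        base_coords = _cached_reducer.fit_transform(base_embeddings)
    else:
        base_coords = _cached_reducer.transform(base_embeddings)
    return base_coords, _cached_reducer.transform(new_embeddings)
```

Because the fitted model is cached, adding or removing a test tweet never retrains the layout: the base clusters stay put, and new points simply land on the existing shelves.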


Conclusion

This project taught me quite a bit. For starters, there is nuance in machine learning that is important to consider. The primary example: without the feature that lets the user add and remove test tweets, the two random seeds alone would likely have solved my entire reproducibility issue.

I think it's also easy at times to default to 'Does this solution fix my problem?' If yes, great – and we move on. But without diving into how stochastic algorithms work, and more importantly why they have this randomness characteristic, solving this would have been far more painful and time-consuming.

In the end, what started as a frustrating bug became an important learning moment. Sometimes, the most annoying of bugs end up teaching us the most.
