Combining popular things and random things

#javascript #architecture #devops

Author's note: As always, if I glossed over something or the flow is confusing, let me know in the comments.

Weighted randomization is a way of selecting a random item with bias.

Let’s say you have a list of 1,000 movies ranked by popularity. You want to watch something different, but you also want something popular. You can’t simply randomize all 1,000 movies the standard way because then your selection has an equal chance of being good as it does of being bad. Nevertheless, we still want something “random” that is likely “good.”

The most basic (and incorrect) way to do this is by creating additional items in a list and then shuffle that list.

Let’s say you are going through numbers 1–5 and you want "1" to show up more often than the others.

[1, 2, 3, 4, 5] gives every option an equal chance.
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 5] gives the 1s a far better chance.

Simple, right?

What do you do when you have thousands/millions of rows in a database and significantly varying weights?

If we use the list method above, the amount of items in our set will grow substantially and dramatically affect performance.

Light Example:

Let's say we want to rank 100 baby names starting at a popularity of 100 equally distributed down to 1.

Meaning James = 100, John = 99, Robert = 98, etc

We can use the formula for 100+99+98+97…+2+1. That would mean our basic set above would be 5,050 items in length just to do a simple weighted comparison of 100 items.

// This set would be 5,050 items long
['James', 'James', 'James', [...], 'John', 'John', ...]

In a Javascript sense, that'd be names.length == 100, but when sorting it namesForSorting.length == 5050. That's unacceptable.

How we came up with that number: (100 = 100 names)

(X + 1) * (X / 2)
(100 + 1) * (100 / 2)
101 * 50 = 5,050

What if we want to compare... 65,250 items?

Let’s try 65,250 items in the formula to get a random weighted item using the list method above.

Same formula, new number:

(X + 1) * (X / 2)
(65,250 + 1) * (65,250 / 2)
65,251 * 32,625 = 2,128,813,875

There’s no way we should create a list with two billion one hundred twenty eight million eight hundred thirteen thousand eight hundred seventy-five items. That is gratuitous waste of resources that will only get worse. What about a million records? (500 billion, ouch)

Note: In this example, we’re only using equal distribution of popularity (1, 2, 3, 4, 5+). In reality, the ranks could be anything. (100, 100, 100, 95.6, 91, 85, 85, 85, 84, 84, 84,…] That means you should expect your sets to likely have much higher quantities than our basic example.

Fortunately, there’s a much faster way to get the same exact results.

Practical

Let’s use the Top 16 Football Teams in the AFC from 2017.

These are the four steps:

Sum up all of your rank/popularity into a variable: X.
Generate a random number between 0 and X. We’ll call it Y.
Iterate through your list of data. Subtract each row’s rank/popularity from your random number (Y).
When Y ≤ 0, that is your weighted randomized object index.

Here’s a JS Fiddle that ranks the top 16 teams, distributing a score of 0–100 points equally to each team. We run the test 5,000 times and you’ll see how often each item is selected and how consistent it happens.

Out of 5,000 randomizations, these are the selections:

603 — New England Patriots
520 — Pittsburgh Steelers
512 — Jacksonville Jaguars
472 — Kansas City Chiefs
447 — Tennessee Titans
405 — Buffalo Bills
384 — Baltimore Ravens
336 — Los Angeles Chargers
279 — Cincinnati Bengals
264 — Oakland Raiders
219 — Miami Dolphins
197 — Denver Broncos
150 — New York Jets
105 — Indianapolis Colts
70 — Houston Texans
37 — Cleveland Browns

What the above results prove is that every team had a chance to be chosen at “random.” The Patriots were chosen 603 times whereas the Browns were chosen 37 times. It didn’t exclude the Browns for being unpopular, but they were certainly picked less often.

The benefit of this method is that instead of shuffling through 136 items (16+15+14…), we only run a subtraction operation anywhere between 1 and 16 times. Less operations is less computing power.

As per our first extremely simple example at the introduction of this article: Instead of an expensive shuffling operation of 2,128,813,875 items in a set, we only run a simple subtraction operation anywhere between 1 and 65,250 times.

Question: How much processing power does it take to subtract an integer ~50 times… ~4,000 times…~10,000 times?
Answer: Not much.