Author's note: As always, if I glossed over something or the flow is confusing, let me know in the comments.
Weighted randomization is a way of selecting a random item with bias.
Let’s say you have a list of 1,000 movies ranked by popularity. You want to watch something different, but you also want something popular. You can’t simply shuffle all 1,000 movies the standard way, because then your selection has an equal chance of being bad as of being good. We still want something “random” that is likely “good.”
The most basic (and incorrect) way to do this is by creating additional copies of an item in a list and then shuffling that list.
Let’s say you are going through numbers 1–5 and you want "1" to show up more often than the others.
[1, 2, 3, 4, 5] gives every option an equal chance.
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 5] gives the 1s a far better chance.
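In code, the duplicate-and-shuffle idea looks something like this (a minimal sketch; `naiveWeightedPick` and the separate weights array are my own names for illustration):

```javascript
// Naive weighted randomization: duplicate each item by its weight,
// then pick one element of the expanded list at random.
function naiveWeightedPick(items, weights) {
  const expanded = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = 0; j < weights[i]; j++) {
      expanded.push(items[i]);
    }
  }
  return expanded[Math.floor(Math.random() * expanded.length)];
}

// "1" appears 9 times out of 13 entries, so it wins roughly 69% of the time.
naiveWeightedPick([1, 2, 3, 4, 5], [9, 1, 1, 1, 1]);
```

The result is correctly biased, but notice that the expanded list's length is the *sum* of all the weights, which is exactly the problem the next section runs into.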
What do you do when you have thousands/millions of rows in a database and significantly varying weights?
If we use the list method above, the number of items in our set will grow substantially and dramatically affect performance.
Let's say we want to rank 100 baby names starting at a popularity of 100 equally distributed down to 1.
Meaning James = 100, John = 99, Robert = 98, etc.
We can use the formula for 100 + 99 + 98 + 97 + … + 2 + 1. That means our basic set above would be 5,050 items long just to do a simple weighted comparison of 100 items.
```javascript
// This set would be 5,050 items long
['James', 'James', 'James', [...], 'John', 'John', ...]
```

`names.length == 100`, but when sorting we end up with `namesForSorting.length == 5050`. That's unacceptable.
How we came up with that number (X = 100 names):

(X + 1) * (X / 2)
(100 + 1) * (100 / 2)
101 * 50 = 5,050
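That arithmetic is easy to sanity-check in code (a quick sketch; `expandedSize` is a hypothetical helper name, not from the article):

```javascript
// Size of the expanded list when weights run X, X-1, ..., 2, 1
// (the triangular-number formula).
const expandedSize = (x) => (x + 1) * (x / 2);

console.log(expandedSize(100));   // 5050
console.log(expandedSize(65250)); // 2128813875
```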
What if we want to compare... 65,250 items?
Let’s try 65,250 items in the formula to get a random weighted item using the list method above.
Same formula, new number:

(X + 1) * (X / 2)
(65,250 + 1) * (65,250 / 2)
65,251 * 32,625 = 2,128,813,875
There’s no way we should create a list with two billion, one hundred twenty-eight million, eight hundred thirteen thousand, eight hundred seventy-five items. That is a gratuitous waste of resources, and it only gets worse. What about a million records? (500 billion items, ouch.)
Note: In this example, we’re only using an equal distribution of popularity (1, 2, 3, 4, 5, …). In reality, the ranks could be anything: [100, 100, 100, 95.6, 91, 85, 85, 85, 84, 84, 84, …]. That means you should expect your sets to be much larger than in our basic example.
Fortunately, there’s a much faster way to get the same exact results.
Let’s use the Top 16 Football Teams in the AFC from 2017.
These are the four steps:
- Sum up all of your rank/popularity into a variable: X.
- Generate a random number between 0 and X. We’ll call it Y.
- Iterate through your list of data. Subtract each row’s rank/popularity from your random number (Y).
- When Y ≤ 0, that is your weighted randomized object index.
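The four steps above can be sketched like this (a minimal sketch, assuming each item carries its weight in a `score` property; the names here are my own and may differ from the fiddle's):

```javascript
// Weighted random selection without expanding the list.
function weightedRandom(items) {
  // Step 1: sum all ranks/popularities into X.
  const x = items.reduce((sum, item) => sum + item.score, 0);

  // Step 2: generate a random number Y between 0 and X.
  let y = Math.random() * x;

  // Steps 3 and 4: subtract each row's score from Y; when Y <= 0, that row wins.
  for (const item of items) {
    y -= item.score;
    if (y <= 0) return item;
  }

  // Guard against floating-point drift: fall back to the last item.
  return items[items.length - 1];
}

const teams = [
  { name: 'New England Patriots', score: 100 },
  { name: 'Pittsburgh Steelers', score: 93 },
  { name: 'Cleveland Browns', score: 6 },
];
// With these scores the Patriots win roughly half the time (100 / 199).
weightedRandom(teams);
```

Because the loop exits as soon as Y drops to zero or below, the work done is proportional to the list length, not to the sum of the weights.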
Here’s a JS Fiddle that ranks the top 16 teams, distributing scores of 0–100 points equally across the teams. We run the test 5,000 times so you can see how often each item is selected and how consistent the results are.
Out of 5,000 randomizations, these are the selections:
603 — New England Patriots
520 — Pittsburgh Steelers
512 — Jacksonville Jaguars
472 — Kansas City Chiefs
447 — Tennessee Titans
405 — Buffalo Bills
384 — Baltimore Ravens
336 — Los Angeles Chargers
279 — Cincinnati Bengals
264 — Oakland Raiders
219 — Miami Dolphins
197 — Denver Broncos
150 — New York Jets
105 — Indianapolis Colts
70 — Houston Texans
37 — Cleveland Browns
What the above results show is that every team had a chance to be chosen “at random.” The Patriots were chosen 603 times, whereas the Browns were chosen 37 times. The method didn’t exclude the Browns for being unpopular, but they were certainly picked less often.
The benefit of this method is that instead of shuffling through 136 items (16 + 15 + 14 + … + 1), we only run a subtraction operation anywhere between 1 and 16 times. Fewer operations means less computing power.
As per our earlier example: instead of an expensive shuffling operation over a set of 2,128,813,875 items, we only run a simple subtraction operation anywhere between 1 and 65,250 times.
Question: How much processing power does it take to subtract an integer ~50 times… ~4,000 times…~10,000 times?
Answer: Not much.