While writing a random-number function in C, I noticed after a large number of iterations that certain outputs occurred more frequently than others, even though I was seeding rand() with the current time. Later, I realized the problem was in my own calculation: the way I had written the code suffered from "Modulo Bias," which gives some outcomes a higher probability!
The rand() function generates a non-negative integer in the range [0, RAND_MAX], so there are RAND_MAX + 1 possible values. The first version, which contains the Modulo Bias error, looks like this:
int rng_int(int min, int max) {
    int range = max - min + 1;
    int random = rand() % range;
    return random + min;
}
Assume RAND_MAX is a very small number like 15. If I want random numbers from the range [0, 5], my range length is 6, so the larger range of 0 to 15 must be mapped onto the numbers 0 to 5 to get my desired result. (For example, if rand() gives me 13, I don't want it because it's not in the 0-5 range, so I map it to 13 % 6 = 1.) Now, here is what happens to every possible output of rand() from 0 to 15:
 0 % 6 = 0     1 % 6 = 1
 2 % 6 = 2     3 % 6 = 3
 4 % 6 = 4     5 % 6 = 5
 6 % 6 = 0     7 % 6 = 1
 8 % 6 = 2     9 % 6 = 3
10 % 6 = 4    11 % 6 = 5
12 % 6 = 0    13 % 6 = 1
14 % 6 = 2    15 % 6 = 3
Did you notice? The numbers {0, 1, 2, 3} are each mapped 3 times, while {4, 5} are mapped only 2 times! We fix this error using the rejection sampling method with a do-while loop:
int rng_int(int min, int max) {
    unsigned range = (unsigned)max - min + 1;
    /* largest multiple of range that fits in [0, RAND_MAX + 1) */
    unsigned limit = RAND_MAX - (RAND_MAX % range);
    int random;
    do {
        random = rand();
    } while ((unsigned)random >= limit); /* re-roll anything past the limit */
    return min + (random % (int)range);
}
What happened? The limit variable equalizes everyone's chances. Its value is the largest number in [0, RAND_MAX] that is perfectly divisible by our desired range length; in our example, limit equals 12. Inside the do-while loop, any output that falls outside [0, limit - 1] gets re-drawn, so every value keeps an equal chance.
Ultimately, what exactly is Modulo Bias? Well, Modulo (with that cute pronunciation) is just the % operator, and in the world of probability, when outcomes that should be equally likely end up weighted toward one side, that's called Bias: like when one side of a coin is heavier than the other.
Digging Deeper:
The value of RAND_MAX in common implementations (such as glibc) is 2,147,483,647; note that it is implementation-defined, and the C standard only guarantees it is at least 32,767.
You can find this by running the following commands in a bash terminal:
echo '#include <stdio.h>
#include <stdlib.h>
int main(){ printf("%d\n", RAND_MAX); }' | gcc -x c - && ./a.out
You might think to yourself that this probabilistic inequality occurred only because we assumed a small RAND_MAX in our example. (We call it "probabilistic" because if the larger range to be mapped happens to be an exact multiple of our target range length, all probabilities are distributed equally.) Well, let's assume its value really is 2,147,483,647 and we want to map it onto the range [0, 5]:
First, let's find the lengths of the ranges:
Mapped Range = 2,147,483,647 + 1 = 2,147,483,648 (which is $2^{31}$)
Mapping Range = 6
The result of integer division tells us the minimum number of times each number inside the range is mapped:
Mapped Range / Mapping Range = 357,913,941
The result of the modulo operator tells us that the first two numbers of our target range are mapped one extra time, so they get one more chance! (In fact, it's always these initial numbers of the range that might get the extra chance).
Mapped Range % Mapping Range = 2
Thus, we conclude:
{2,3,4,5} are mapped 357,913,941 times
{0,1} are mapped 357,913,942 times
TRUTH CHECK: $4 \times 357,913,941 + 2 \times 357,913,942 = 2,147,483,648$ (The full Mapped Range!)
You might say, "See? In such a large range the probabilities are so close that one extra chance doesn't even get noticed; our point was right." Even if it's not noticeable, the chances are still not equal. But fair enough: at that scale it's barely worth writing a few extra lines of code. However, let's widen the range we want for random number generation and see if it's still "not worth it." This time, let our target range be [0, 999999]. Repeating the steps above:
Mapped Range = 2,147,483,648
Mapping Range = 1,000,000
Mapped Range / Mapping Range = 2,147
Mapped Range % Mapping Range = 483,648
In simple terms, what kind of justice is this, where in a range with only a million numbers in total, the first 483,648 of them get an extra chance?
🖤 - PiBytes