
Doogal Simpson

Posted on • Originally published at doogal.dev

Fixing Biased Entropy: The Von Neumann Unbiasing Trick

TL;DR: I've found that hardware entropy sources are rarely uniform. To solve this, I use Von Neumann Unbiasing, which pairs bits and discards identical results (00, 11). By mapping 01 to 0 and 10 to 1, I can extract a perfectly fair 50/50 distribution from any biased source, provided the bias is constant and bits are independent.

I’ve found that hardware is always noisier than you’d expect—and rarely in the way you want. When I pull entropy from thermal jitter or diode noise, I'm dealing with the messy physical world, which doesn't care about my requirement for a perfect distribution. A sensor might lean toward a logic high or low due to temperature fluctuations or voltage drops, and in practice, achieving a perfect 0.5 probability out of a physical component is almost impossible.

If I see someone using biased entropy for key generation, I know they're shrinking their effective keyspace and making their system vulnerable to brute-force attacks. A cryptographic key is only as strong as the randomness used to create it. If your bits are weighted 60/40, you’ve introduced a pattern that an attacker can exploit. I need to process that raw, physical noise into a mathematically balanced stream before it is used in a production environment.

Why are hardware random number generators often biased?

Physical sensors are influenced by environmental conditions and internal circuit resistance that favor one electrical state over another. Unlike a mathematical pseudo-random number generator (PRNG), a hardware source is a physical device subject to manufacturing defects and external interference.

I want you to imagine you’re building a microservice that relies on an internal hardware RNG. If that hardware is even slightly more sensitive to a certain voltage threshold, it will produce more ones than zeros. This bias can be subtle—perhaps only a 1% shift—but in the world of security, that shift is enough to weaken the randomness of every session key your service generates.
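Even a 1% shift is measurable in information-theoretic terms. As a quick sketch (the function names here are mine, not from any library), you can compute how much entropy per bit a biased source actually delivers; min-entropy is the conservative measure cryptographers use, and it drops off faster than you might expect:

```python
import math

def shannon_entropy(p):
    """Shannon entropy (bits per output bit) of a source emitting 1 with probability p."""
    q = 1 - p
    return -(p * math.log2(p) + q * math.log2(q))

def min_entropy(p):
    """Min-entropy: worst-case unpredictability, the figure that bounds keyspace."""
    return -math.log2(max(p, 1 - p))

print(f"Shannon entropy at p=0.51: {shannon_entropy(0.51):.6f} bits")
print(f"Min-entropy at p=0.51:     {min_entropy(0.51):.6f} bits")
print(f"Min-entropy at p=0.60:     {min_entropy(0.60):.6f} bits")
```

At a 60/40 split, each raw bit carries only about 0.74 bits of min-entropy, so a "256-bit" key built from that stream is effectively much weaker.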

How do you turn a weighted coin flip into a 50/50 result?

I group incoming bits into pairs and discard any pair where the bits match. I only output a single bit when I see a transition—either a zero followed by a one, or a one followed by a zero.

This logic ensures that the output remains fair even if the source is heavily biased. Here is the mapping logic I use to clean the stream:

| First Bit | Second Bit | Action  | Result |
|-----------|------------|---------|--------|
| 0         | 0          | Discard | (None) |
| 1         | 1          | Discard | (None) |
| 0         | 1          | Accept  | 0      |
| 1         | 0          | Accept  | 1      |
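The table above takes only a few lines of Python to express (a minimal sketch; `map_pair` is a name I've made up for illustration):

```python
def map_pair(first, second):
    """Von Neumann mapping for one pair of raw bits."""
    if first == second:
        return None   # 00 or 11: discard the pair
    return first      # 01 -> 0, 10 -> 1

# Walk a small sample stream two bits at a time: (0,0), (1,0), (1,1), (0,1).
raw = [0, 0, 1, 0, 1, 1, 0, 1]
out = [bit for pair in zip(raw[::2], raw[1::2])
       if (bit := map_pair(*pair)) is not None]
print(out)  # [1, 0]
```

Eight raw bits in, two clean bits out: the discard rate is the price of fairness.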

Why does this trick guarantee a perfect probability split?

The reason I love this trick is that the math is elegantly simple: p * q always equals q * p. Even if your source favors one side, the probability of seeing a specific pair of two different bits is mathematically identical to the probability of seeing that pair reversed.

Let’s say I am looking at a broken hardware sensor that lands on heads (1) 75% of the time and tails (0) 25% of the time.

  • The probability of (1,1) is 0.75 * 0.75 (0.5625) -> Discarded.
  • The probability of (0,0) is 0.25 * 0.25 (0.0625) -> Discarded.
  • The probability of (0,1) is 0.25 * 0.75 (0.1875) -> Result: 0.
  • The probability of (1,0) is 0.75 * 0.25 (0.1875) -> Result: 1.

Since both accepted outcomes occur with the same probability (0.1875), each bit that survives the filter is a 0 or a 1 with exactly a 50% chance. The original bias doesn't change the fact that the two mixed outcomes are equally likely.
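You can verify this empirically. Here's a quick simulation sketch (the generator names are mine) that feeds a 75/25-biased source through the pairing rule and checks the output distribution:

```python
import random

def biased_bits(p_one, rng):
    """Simulated broken sensor: emits 1 with probability p_one."""
    while True:
        yield 1 if rng.random() < p_one else 0

def von_neumann(bits):
    """Pair bits, discard matches, output the first bit of each transition."""
    while True:
        x, y = next(bits), next(bits)
        if x != y:
            yield x

rng = random.Random(42)
vn = von_neumann(biased_bits(0.75, rng))
sample = [next(vn) for _ in range(100_000)]
ones = sum(sample) / len(sample)
print(f"Fraction of ones after unbiasing: {ones:.3f}")
```

Despite the source emitting three ones for every zero, the unbiased output hovers right around 0.5.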

What are the performance trade-offs of unbiasing?

The primary trade-off I see is throughput; I am forced to throw away a large fraction of the raw data, which can lead to entropy starvation in systems like Linux. When the entropy pool behind /dev/random runs dry, older Linux kernels block reads, which can halt a deployment or stall a cryptographic handshake.

In that 75/25 bias scenario, I am discarding 62.5% of the raw bits. If I have a system generating long-term SSH host keys during a cloud instance boot-up, this discarding logic can cause a visible hang. I've seen setup scripts stop and deployment pipelines stall because the system was waiting on a hardware-accelerated source that was too biased to keep up with the demand. When I implement this in firmware, I keep the logic as lean as possible:

def unbias(bit_stream):
    """Extract one unbiased bit from a biased stream of independent bits."""
    while True:
        x, y = next(bit_stream), next(bit_stream)
        if x != y:    # 00 and 11 pairs are silently discarded
            return x  # 01 -> 0, 10 -> 1
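To put a number on the throughput cost, here's a quick measurement sketch (names are mine) of how many raw bits the 75/25 source burns per clean output bit. The expected figure is 1 / (p * q) = 1 / 0.1875, roughly 5.33 raw bits each:

```python
import random

def biased_bits(p_one, rng):
    """Simulated sensor emitting 1 with probability p_one."""
    while True:
        yield 1 if rng.random() < p_one else 0

rng = random.Random(1)
src = biased_bits(0.75, rng)
raw_used, out = 0, 0
while out < 10_000:
    x, y = next(src), next(src)
    raw_used += 2
    if x != y:        # only transitions yield an output bit
        out += 1
print(f"Raw bits consumed per output bit: {raw_used / out:.2f}")
```

That 5x-plus overhead is exactly the kind of drag that makes a boot-time key generator hang while it waits for the pool to fill.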

FAQ

Does this work if the hardware bias changes over time?

No. This algorithm relies on the probability (p) remaining constant across both bits in the pair. If the bias is drifting rapidly—for instance, if I am looking at a sensor that is overheating and its 1/0 ratio is swinging wildly every millisecond—the unbiasing effect breaks down and I may still end up with skewed output.
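To illustrate the failure mode, here's a contrived sketch (my own construction, not a real sensor model) where the bias flips between the two bits of every pair; the "unbiased" output comes out heavily skewed:

```python
import random

def drifting_bits(rng):
    """Pathological sensor: bias alternates per bit between p=0.9 and p=0.5."""
    probs = [0.9, 0.5]
    i = 0
    while True:
        yield 1 if rng.random() < probs[i % 2] else 0
        i += 1

def von_neumann(bits):
    """Standard Von Neumann pairing: keep only transitions."""
    while True:
        x, y = next(bits), next(bits)
        if x != y:
            yield x

rng = random.Random(7)
vn = von_neumann(drifting_bits(rng))
sample = [next(vn) for _ in range(20_000)]
print(f"Fraction of ones with drifting bias: {sum(sample)/len(sample):.3f}")
```

With p=0.9 on the first bit and p=0.5 on the second, P(1,0) = 0.45 while P(0,1) = 0.05, so about 90% of the output bits are ones. The symmetry argument only holds when both bits in a pair share the same p.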

What happens if the bits are correlated?

If the bits are not independent—meaning a 1 is more likely to be followed by another 1 (autocorrelation)—this trick fails. In those cases, I would typically use a cryptographic hash function like SHA-256 as an entropy extractor to flatten the distribution and remove the patterns.
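As a sketch of that fallback (assuming you have gathered substantially more estimated entropy than the bytes you output; `hash_extract` is a name I've made up), hashing the raw samples through SHA-256 conditions away both bias and correlation:

```python
import hashlib

def hash_extract(raw_samples: bytes, out_len: int = 32) -> bytes:
    """Condition biased or correlated raw noise through SHA-256.

    Feed in well more estimated entropy than out_len bytes, or the
    output is uniform-looking but not actually unpredictable.
    """
    return hashlib.sha256(raw_samples).digest()[:out_len]

# Stand-in for 128 bytes of correlated sensor readings.
raw = bytes(i % 4 for i in range(128))
key_material = hash_extract(raw)
print(len(key_material))  # 32
```

Unlike Von Neumann, this never discards input, but it trades the clean mathematical guarantee for a computational assumption about the hash.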

Is there a more efficient way to extract bits?

Yes, algorithms like the Peres or Elias strategies extract more entropy by looking at longer bit sequences. However, I rarely use them because they require more complex state management and larger memory buffers. Von Neumann is my go-to for low-level work because it needs almost no state and can be implemented with a simple loop or a few logic gates.
