Engineering Post: An owned PRNG, because "reproducible" has to mean reproducible

#csharp #dotnet #nuget #showdev

I’m reviving Munchausen, a C# NuGet package I started 9 years ago. This is part 2 of an 8-part series documenting both the development process and the engineering decisions behind bringing the project back to life.

This is the Engineering Post: the reasoning, trade-offs, API decisions, and technical choices behind this part of the project.

Munchausen promises that the same model, definition, seed, and library version produce the same data across all runtimes, OSes, and architectures. That promise is load-bearing: it's what lets you commit a seed and trust your test fixtures forever. It also rules out System.Random immediately: its implementation has changed across .NET versions (the parameterless constructor switched algorithms in .NET 6), and its seeded sequence was never contractual. The documentation explicitly warns against assuming the same seed yields the same sequence across versions.

So M1 builds an owned pseudo-random generator from public-domain algorithms and wraps it in the helpers that the rest of the library will draw from.

This is the first substantial test of the design: can "same seed, same data" be turned into behavior that tests can actually prove?

SplitMix64 → xoshiro256**

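The core pairs two well-known public-domain generators: xoshiro256** by David Blackman and Sebastiano Vigna, seeded through SplitMix64 (Steele, Lea, and Flood's generator, in the fixed-increment form Vigna published):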

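// Fixed-increment SplitMix64, used only to expand a seed into generator state.
internal struct SplitMix64
{
    private ulong _state;
    public SplitMix64(ulong seed) => _state = seed;

    public ulong Next()
    {
        // Advance by the 64-bit golden-ratio constant; wraparound is intended.
        ulong z = unchecked(_state += 0x9E3779B97F4A7C15UL);
        // Two xor-shift-multiply rounds, then a final xor-shift, mix the bits.
        z = unchecked((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL);
        z = unchecked((z ^ (z >> 27)) * 0x94D049BB133111EBUL);
        return z ^ (z >> 31);
    }
}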

SplitMix64 expands a single seed word into the four state words that xoshiro256** needs. xoshiro is fast, high quality, and, critically, portable: it uses only 64-bit integer shifts, rotates, XORs, and wrapping multiplies, so the seeded output is identical on any machine, which is the entire point.
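A minimal sketch of how the two pieces fit together, assuming a hypothetical Xoshiro256StarStar type (the type name and both constructors are illustrative, not necessarily the package's real API):

```csharp
using System;

// Fixed-increment SplitMix64, repeated so the sketch is self-contained.
internal struct SplitMix64
{
    private ulong _state;
    public SplitMix64(ulong seed) => _state = seed;

    public ulong Next()
    {
        ulong z = unchecked(_state += 0x9E3779B97F4A7C15UL);
        z = unchecked((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL);
        z = unchecked((z ^ (z >> 27)) * 0x94D049BB133111EBUL);
        return z ^ (z >> 31);
    }
}

// Hypothetical engine type: illustrative names, not the package's real API.
internal struct Xoshiro256StarStar
{
    private ulong _s0, _s1, _s2, _s3;

    // Expand one seed word into four state words via SplitMix64.
    public Xoshiro256StarStar(ulong seed)
    {
        var sm = new SplitMix64(seed);
        _s0 = sm.Next();
        _s1 = sm.Next();
        _s2 = sm.Next();
        _s3 = sm.Next();
    }

    // Set the state directly, e.g. to replay a published reference vector.
    public Xoshiro256StarStar(ulong s0, ulong s1, ulong s2, ulong s3)
    {
        _s0 = s0; _s1 = s1; _s2 = s2; _s3 = s3;
    }

    private static ulong Rotl(ulong x, int k) => (x << k) | (x >> (64 - k));

    public ulong Next()
    {
        unchecked
        {
            // Output scrambler: multiply, rotate, multiply on the second state word.
            ulong result = Rotl(_s1 * 5, 7) * 9;
            ulong t = _s1 << 17;

            // Linear state update: xors, one shift, one rotate.
            _s2 ^= _s0;
            _s3 ^= _s1;
            _s1 ^= _s2;
            _s0 ^= _s3;
            _s2 ^= t;
            _s3 = Rotl(_s3, 45);

            return result;
        }
    }
}
```

Every step is a wrapping 64-bit integer operation, so nothing depends on the host's floating-point behavior. The second constructor exists only so a test can load a known state: with {1, 2, 3, 4} the first call to Next() returns 11520.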

DeterministicRandom wraps the engine and exposes the distribution helpers: inclusive Int/Long via Lemire's nearly-divisionless rejection (uniform integers in [min, max] with no modulo bias), Decimal from a 96-bit draw, Double from a 53-bit mantissa, Bool(p), Pick, Weighted, Sample (Fisher–Yates without replacement), Enum, String, AlphaNumeric, Bytes, and Guid. Everything downstream, datasets and user rules alike, draws from this one stream, so the order in which values are consumed is itself part of the determinism contract: adding, removing, or reordering a draw changes every value after it.
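To make the Int/Long and Double helpers concrete, here is a minimal sketch of Lemire's method and the 53-bit conversion. It assumes .NET 5 or later for Math.BigMul. The class name BoundedDraws, the method names, and the Func<ulong> source parameter are hypothetical stand-ins, not the real DeterministicRandom API.

```csharp
using System;

// Hypothetical helper class; `next` stands in for the engine's raw 64-bit draw.
internal static class BoundedDraws
{
    // Uniform long in [min, max] (inclusive) using Lemire's multiply-and-reject method.
    public static long NextInclusive(Func<ulong> next, long min, long max)
    {
        if (min > max) throw new ArgumentOutOfRangeException(nameof(min));

        // Number of possible values; wraps to 0 when the range covers all 2^64 longs.
        ulong range = unchecked((ulong)(max - min) + 1UL);
        if (range == 0) return unchecked((long)next());

        // x * range as a 128-bit product: the high word is the candidate result.
        ulong hi = Math.BigMul(next(), range, out ulong lo);
        if (lo < range)
        {
            // 2^64 mod range; the only division, and only on this rare path.
            ulong threshold = unchecked(0UL - range) % range;
            while (lo < threshold)
            {
                // Rejected: this draw would over-represent some values, so redraw.
                hi = Math.BigMul(next(), range, out lo);
            }
        }
        return unchecked(min + (long)hi);
    }

    // Double in [0, 1): keep the top 53 bits and scale by 2^-53.
    public static double NextDouble(Func<ulong> next) =>
        (next() >> 11) * (1.0 / (1UL << 53));
}
```

Note that a rejected draw still consumes a value from the stream, which is one more reason consumption order has to be treated as part of the contract.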

The test that can't lie

Here's why M1 is one of my favorite milestones to verify: the correctness test
is unambiguous. xoshiro256** and SplitMix64 have published reference vectors.
Either my output matches them or my code is wrong; there's no judgment call.

The xoshiro vector is even hand-checkable. Set the state directly to
{1, 2, 3, 4}, so s[1] = 2, and the first output is:

rotl(s[1] * 5, 7) * 9  =  rotl(10, 7) * 9  =  1280 * 9  =  11520

So the first test asserts 11520, which I can verify with a calculator. The
remaining 15 values (and the SplitMix64-from-seed-0 sequence) come from a
cross-validated reference suite (skeeto/rng-go), hardcoded as the expected
arrays. These are external truth, not values I generated.

Goldens vs. references

There are two kinds of pinned outputs here, and the distinction matters:

Reference vectors prove the algorithm is correct against the outside world.
Goldens prove the implementation is stable across runs and versions.

For the golden, I generated the first 16 raw draws of DeterministicRandom(42)
out-of-band, wrote them to a committed fixture, and the test asserts a fresh
instance reproduces them. Because the reference-vector tests already prove the
algorithm is right, the golden isn't circular; it's a regression lock on a
validated implementation. I treat a failing golden as a finding, never as a
file to regenerate.
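A golden check along these lines can be sketched as follows. The fixture format (one bare hexadecimal value per line), the GoldenCheck class, and the Func<ulong> draw source are all assumptions for illustration; the real fixture and test may look different.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical golden-comparison helper; names and fixture format are illustrative.
internal static class GoldenCheck
{
    // Parse a fixture with one hexadecimal 64-bit value per line, skipping blank lines.
    public static List<ulong> ParseFixture(IEnumerable<string> lines)
    {
        var values = new List<ulong>();
        foreach (string line in lines)
        {
            string trimmed = line.Trim();
            if (trimmed.Length == 0) continue;
            // Invariant culture keeps parsing independent of the machine's locale.
            values.Add(ulong.Parse(trimmed, NumberStyles.HexNumber, CultureInfo.InvariantCulture));
        }
        return values;
    }

    // Returns the index of the first draw that differs from the golden, or -1 if all match.
    public static int FirstMismatch(IReadOnlyList<ulong> golden, Func<ulong> draw)
    {
        for (int i = 0; i < golden.Count; i++)
        {
            if (draw() != golden[i]) return i;
        }
        return -1;
    }
}
```

In a test, the lines would come from the committed fixture file and draw would be a fresh generator's raw output; the assertion is that FirstMismatch returns -1. Reporting the index of the first divergence, rather than a bare failure, makes it easier to treat a broken golden as something to investigate.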

One decision worth flagging

The public seed is an int, but SplitMix64 wants a ulong. This was one detail I
had not settled while designing the API. I chose sign-extension
(unchecked((ulong)seed)). For non-negative seeds, including every current golden,
this is identical to zero-extension, so no pinned output depends on the choice yet; the two only diverge for negative seeds. It still touches reproducibility, so I documented it explicitly.
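The difference between the two widenings is small enough to show in full. The helper names below are illustrative; only the cast expressions matter.

```csharp
using System;

// Illustrative helper names; only the cast expressions matter.
internal static class SeedWidening
{
    // Sign-extension: int -> ulong copies the sign bit into the upper 32 bits.
    public static ulong SignExtend(int seed) => unchecked((ulong)seed);

    // Zero-extension: go through uint first so the upper 32 bits are always zero.
    public static ulong ZeroExtend(int seed) => unchecked((ulong)(uint)seed);
}
```

For a seed of 42 both return 42. For a seed of -1, SignExtend returns 0xFFFFFFFFFFFFFFFF while ZeroExtend returns 0x00000000FFFFFFFF, so the two choices would feed SplitMix64 different starting states.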

Verification

PRNG outputs match the published vectors; helper sanity tests confirm inclusive
bounds, weighted-normalization proportions, and sample-without-replacement
uniqueness; the seed-42 golden reproduces across process runs. All green, zero
warnings.
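As an example of what a helper sanity test exercises, here is a sketch of sampling without replacement and the uniqueness property it should satisfy. The partial Fisher–Yates implementation, the plain rejection-based index draw, and the Func<ulong> source are stand-ins assumed for illustration, not the library's real helpers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the library's Sample helper.
internal static class SampleSketch
{
    // Unbiased index in [0, n) by plain rejection (simpler than Lemire's method).
    private static int NextIndex(Func<ulong> next, int n)
    {
        ulong bound = (ulong)n;
        ulong excess = unchecked(0UL - bound) % bound;   // 2^64 mod n
        ulong x;
        do { x = next(); } while (x > ulong.MaxValue - excess);
        return (int)(x % bound);
    }

    // Partial Fisher–Yates: shuffle only the first `count` slots of a copy.
    public static T[] Sample<T>(Func<ulong> next, IReadOnlyList<T> source, int count)
    {
        if (count < 0 || count > source.Count)
            throw new ArgumentOutOfRangeException(nameof(count));

        var pool = new T[source.Count];
        for (int i = 0; i < pool.Length; i++) pool[i] = source[i];

        for (int i = 0; i < count; i++)
        {
            // Pick from the not-yet-chosen tail and swap it into position i.
            int j = i + NextIndex(next, pool.Length - i);
            (pool[i], pool[j]) = (pool[j], pool[i]);
        }

        var result = new T[count];
        Array.Copy(pool, result, count);
        return result;
    }
}
```

Because each chosen element is swapped out of the remaining pool, a sample drawn from distinct inputs can never contain a duplicate. The test only needs to assert the result has the requested length and that putting it in a HashSet leaves the count unchanged.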

What's next: M2, the Metadata Layer

Determinism is the how of generation. M2 is the what: a reflection layer that
discovers a model's members in a stable order (by MetadataToken, because order
is part of determinism), reads nullability and required, and builds fast
get/set accessors via compiled expressions, all behind interfaces so a
source-generated, AOT-safe implementation can slot in later. Reflection is
expensive and we only want to pay for it once, at build time, never per object.
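As a rough preview of that idea, the sketch below orders properties by MetadataToken and compiles a getter once. It is only an illustration under my own assumptions: the MemberDiscovery class and its method names are hypothetical, it handles readable public instance properties only, and the eventual M2 interfaces may look quite different.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical sketch of the M2 idea; type and method names are illustrative.
internal static class MemberDiscovery
{
    // Public instance properties in a stable order, sorted by metadata token.
    public static PropertyInfo[] OrderedProperties(Type type) =>
        type.GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .OrderBy(p => p.MetadataToken)
            .ToArray();

    // Compile a getter once; later calls avoid per-object reflection.
    // Assumes the property is readable and has no index parameters.
    public static Func<object, object> CompileGetter(PropertyInfo property)
    {
        ParameterExpression target = Expression.Parameter(typeof(object), "target");
        // ((DeclaringType)target).Property, boxed back to object.
        Expression body = Expression.Convert(
            Expression.Property(Expression.Convert(target, property.DeclaringType), property),
            typeof(object));
        return Expression.Lambda<Func<object, object>>(body, target).Compile();
    }
}
```

The compiled delegate would be built once per member and cached, so generating each object costs a delegate call rather than a reflection lookup.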