DEV Community: Loïc Baumann

Deadlock-Free by Construction: How Typhon Eliminates Deadlocks Instead of Detecting Them

Loïc Baumann — Mon, 27 Apr 2026 17:48:24 +0000

💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.

It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.

Series: A Database That Thinks Like a Game Engine

Why I'm Building a Database Engine in C#

What Game Engines Know About Data That Databases Forgot

Microsecond Latency in a Managed Language

Deadlock-Free by Construction (this post)

MVCC at Microsecond Scale (coming soon)

GitHub repo • 📬 Subscribe via RSS

Deadlocks are usually treated as a runtime problem. We treat them as a design bug.

That sounds like a slogan. It isn't. It's the actual reasoning behind three architectural decisions that, taken together, make a lock-dependency cycle impossible in Typhon — not unlikely, not rare, impossible. The engine ships without a deadlock detector. From the project's own concurrency overview:

Deadlock detection is explicitly not implemented — it would add overhead for a scenario that cannot occur in the current architecture.

This post is the how: how three structural decisions remove three classes of edges from the lock-dependency graph, and how that elimination cascades into "no cycle is possible." But it's also the why: why the constraint was set at project inception, before any code existed to deadlock, and what it cost.

The upfront bet

I didn't compare deadlock detection schemes before starting Typhon. I'd seen them in production at previous engines, and the pattern was always the same: a separate background scanner, a wait-for graph, victim selection heuristics, transaction abort and retry. A lot of code. A lot of edge cases. None of it bulletproof for the user, who still sees occasional one-second pauses or unexplained transaction failures under load.

So I made an upfront call, recorded as ADR-003, the project's first concurrency decision, dated 2024-01 (project inception):

Optimistic locking: No locks during execution; conflict detection only at commit.

(ADRs — Architecture Decision Records — are short documents capturing one design choice with its context and rationale. They're a paper trail for why a thing was built a particular way, not just what it does. Typhon has 49 of them so far. They live in the project's internal documentation, not in the public repo.)

That's the bet. No locks across data, whatever the architectural cost. Not because I had proof prevention would be faster — I didn't run those benchmarks — but because the implementation cost of detection is real, the result is never bulletproof, and trading an architectural cost up front for never paying a runtime cost later is the trade I wanted.

The three "pillars" that follow aren't a survey of alternatives I considered. They're what the architecture had to become once the constraint was set. MVCC was the obvious starting point. Optimistic Lock Coupling for indexes followed because traditional B+Tree latch coupling violated the constraint at the index level. The "no cross-table latching" rule emerged because anything else reintroduced the cycles I was trying to eliminate.

It's constraint-driven design, not survey-driven. And it's why this post claims a property — deadlock-free by construction — instead of a benchmark.

What a deadlock actually is

Briefly, because the rest of the post needs the picture.

Two transactions, T1 and T2. T1 holds a lock on row A and asks for a lock on row B. T2 holds B and asks for A. Neither can proceed. Each is waiting for the other; the wait will never end. That's a cycle in the lock-dependency graph — the directed graph whose nodes are transactions and whose edges are "is waiting for." A deadlock is a cycle in that graph. Detection-based databases scan for cycles and break them by aborting one transaction. Prevention-based databases make cycles impossible to form.

The three sections that follow each remove one class of edges from that graph. With every class removed, no cycle is possible.
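To make the picture concrete, here is a minimal sketch of the wait-for graph and the cycle check a detection-based engine would run over it. It is not Typhon code (Typhon ships no detector); the `WaitForGraph` type and its string transaction IDs are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;

public static class WaitForGraph
{
    // edges[t] lists the transactions that t is currently waiting for.
    public static bool HasCycle(Dictionary<string, List<string>> edges)
    {
        var visiting = new HashSet<string>(); // nodes on the current DFS path
        var done = new HashSet<string>();     // nodes fully explored, known cycle-free

        foreach (var start in edges.Keys)
        {
            if (Visit(start, edges, visiting, done)) return true;
        }
        return false;
    }

    private static bool Visit(string node, Dictionary<string, List<string>> edges,
                              HashSet<string> visiting, HashSet<string> done)
    {
        if (done.Contains(node)) return false;
        if (!visiting.Add(node)) return true; // already on the path: a cycle

        if (edges.TryGetValue(node, out var targets))
        {
            foreach (var next in targets)
            {
                if (Visit(next, edges, visiting, done)) return true;
            }
        }

        visiting.Remove(node);
        done.Add(node);
        return false;
    }
}
```

With T1 waiting for T2 and T2 waiting for T1, `HasCycle` returns true. Remove either edge and it returns false, which is the whole argument of this post: if an edge class cannot exist, the cycles that need it cannot exist either.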

Pillar 1: MVCC eliminates inter-transaction data locks

The textbook deadlock — T1 locks row A, T2 locks row B, both want the other — requires row-level locking between transactions. Typhon doesn't do that.

Reads are snapshot-consistent: every transaction is frozen at the global tick value when it began. A reader sees a stable view of the database for its entire lifetime. It never asks for a lock, because there's nothing to lock against — the snapshot is already immutable.

Writes don't lock existing rows either. They create new revisions, with the previous revision left intact for any transaction whose snapshot still references it. Two writers updating the same component don't fight over a lock; they each append a new revision to the chain. Conflict detection happens at commit time, as a single CAS operation: when the writer tries to install its new revision as the current one, the engine checks that the version it built on is still current. If not, the writer aborts and retries.

This removes the entire edge class of "data locks held across transactions." There are no row locks, no read locks, no write locks on data. The wait-for graph at the transaction level has no edges to form a cycle from.

This isn't free. Two writers updating the same component will conflict at commit, and one of them will retry. For game-server workloads where most components are written by exactly one system, conflicts are rare. For general OLTP workloads with high write contention, the trade shifts: no deadlocks, but more aborts. Different curve.
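The commit-time check can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the engine's implementation: `Revision` and `ComponentSlot` are hypothetical types, the payload is a single int, and the revision chain is a plain linked list.

```csharp
using System.Threading;

// Hypothetical immutable revision: a payload plus a link to the revision it replaced.
public sealed class Revision
{
    public readonly int Value;
    public readonly Revision Previous;
    public Revision(int value, Revision previous) { Value = value; Previous = previous; }
}

public sealed class ComponentSlot
{
    private Revision _head = new Revision(0, null);

    // Returns whichever revision is current right now.
    public Revision Read() => Volatile.Read(ref _head);

    // Succeeds only if the head is still the revision the writer built on.
    public bool TryCommit(Revision basedOn, int newValue)
    {
        var candidate = new Revision(newValue, basedOn);
        return Interlocked.CompareExchange(ref _head, candidate, basedOn) == basedOn;
    }

    // Retry loop: on conflict, re-read the baseline and recompute.
    public int Update(System.Func<int, int> compute)
    {
        while (true)
        {
            var baseline = Read();
            int next = compute(baseline.Value);
            if (TryCommit(baseline, next)) return next;
        }
    }
}
```

A writer holding a stale baseline simply gets `false` back from `TryCommit`. Nothing waits on anything, so there is no edge to add to the wait-for graph.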

Pillar 2: Optimistic Lock Coupling for index structures

Even without row locks, an index structure (B+Tree, R-Tree) is shared mutable state. Traditional databases serialize access through latch coupling: a reader holds a latch on the parent node while acquiring one on the child, releases the parent, advances. It's a chain of overlapping latches walking down the tree.

That pattern can deadlock. Reader R has the parent latched and wants the child; concurrent writer W has the child latched and walks back up to fix the parent. Two threads, two index latches, mutual wait.

Typhon uses Optimistic Lock Coupling (Leis et al., 2019) instead. Readers don't latch at all. Each B+Tree node carries a 32-bit version counter. The reader reads the version, traverses, then re-reads the version at the end — if it changed, the traversal data may have been mutated mid-flight, so the reader restarts.

// From Typhon.Engine/Data/Index/OlcLatch.cs
public int ReadVersion()
{
    int v = _version;
    return (v & 0b11) == 0 ? v : 0;  // locked (bit 0) or obsolete (bit 1) -> restart
}

public bool TryWriteLock()
{
    int v = _version;
    if ((v & 0b1) != 0) return false;
    return Interlocked.CompareExchange(ref _version, v | 0b1, v) == v;
}

public void WriteUnlock()
{
    int v = _version;
    _version = ((v >> 2) + 1) << 2 | (v & 0b10);  // version++, keep obsolete, clear lock
}

Bit 0 is the write-lock flag, bit 1 marks the node obsolete, and bits 2–31 are a monotonic version counter. ReadVersion returns 0 if the node is locked or obsolete, and the caller treats that as "restart." TryWriteLock is a single CAS. WriteUnlock bumps the version and clears the lock bit in one store, so a reader never sees an unlocked node that still carries the old version.

Writers latch only the modified nodes, and they acquire from root to leaf, in strict order. No reader ever blocks a writer. No writer ever holds a parent latch while waiting on a child. The same pattern is reused by the spatial R-Tree — same OlcLatch, same protocol — so this single mechanism covers both index families.
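The reader side of the protocol (read version, read data, re-check version, restart on mismatch) might look like the sketch below. It is a standalone approximation, not the real OlcLatch: `OlcNode`, `Validate`, `Write`, and `OptimisticRead` are hypothetical names, the node holds one int of payload, and I assume the version starts at 4 so that 0 is free to mean "restart."

```csharp
using System.Threading;

public sealed class OlcNode
{
    // Assumption: start at version 1 (value 4), unlocked, not obsolete.
    private int _version = 4;
    private int _payload;

    public int ReadVersion()
    {
        int v = Volatile.Read(ref _version);
        return (v & 0b11) == 0 ? v : 0; // locked or obsolete: tell the caller to restart
    }

    // True if nothing was written since `seen` was read.
    public bool Validate(int seen) => Volatile.Read(ref _version) == seen;

    public bool TryWriteLock()
    {
        int v = Volatile.Read(ref _version);
        if ((v & 0b1) != 0) return false;
        return Interlocked.CompareExchange(ref _version, v | 0b1, v) == v;
    }

    public void WriteUnlock()
    {
        int v = Volatile.Read(ref _version);
        // version++, keep the obsolete bit, clear the lock bit, all in one store
        Volatile.Write(ref _version, (((v >> 2) + 1) << 2) | (v & 0b10));
    }

    public void Write(int value)
    {
        while (!TryWriteLock()) Thread.SpinWait(1);
        _payload = value;
        WriteUnlock();
    }

    // Readers never take the latch: they read, then check that no writer interfered.
    public int OptimisticRead()
    {
        while (true)
        {
            int seen = ReadVersion();
            if (seen == 0) { Thread.SpinWait(1); continue; }
            int value = Volatile.Read(ref _payload);
            if (Validate(seen)) return value;
        }
    }
}
```

The reader's worst case is another trip around the loop. It never holds anything a writer could wait on.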

This removes the edge class of "index-level latch cycles."

Pillar 3: No cross-table latch holding

Two edge classes are gone. The third is the most boring and the most important: at any given moment, a thread never holds a latch in more than one table.

Each ComponentTable in Typhon has independent indexes, independent revision chains, independent page allocations. A transaction's commit path processes one table at a time. When the commit moves from table A to table B, all of A's latches are released first.

The only resource that would be shared across tables is the page cache. Latches there could form cycles across the entire engine. So the page cache doesn't use latches. That refactor is recorded as ADR-033, dated 2026-02-12:

Replace per-page reference counting with epoch-based protection. Each transaction enters an epoch scope that pins the current global epoch; pages accessed within the scope are stamped with that epoch; eviction defers any page whose epoch is still active.

The previous approach was reference counting: every page access incremented a counter, every release decremented it. A transaction touching 100 pages paid for 200 atomic operations — and atomics aren't free, each one stalls the CPU pipeline waiting for cache-line coherence. Epochs collapse that into two operations regardless of how many pages the transaction touches: one to enter the scope, one to exit.

But the deadlock-freedom payoff isn't the cost reduction. It's that the page cache never holds a lock anyone else could wait on. No latch, no waiter queue, no edge in the lock-dependency graph at all.
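For intuition, here is a deliberately small sketch of the epoch idea. It is not the ADR-033 implementation: `EpochTable`, the per-worker slot array, and the "evictable if older than every pinned epoch" rule are assumptions of mine, and the sketch ignores the race between reading the global epoch and publishing the pin.

```csharp
using System.Threading;

public sealed class EpochTable
{
    private long _global = 1;
    private readonly long[] _pinned; // one slot per worker, 0 = not inside a scope

    public EpochTable(int workers) { _pinned = new long[workers]; }

    // One store on entry: pin the current global epoch for this worker.
    public long Enter(int worker)
    {
        long epoch = Volatile.Read(ref _global);
        Volatile.Write(ref _pinned[worker], epoch);
        return epoch;
    }

    // One store on exit: clear the pin.
    public void Exit(int worker) => Volatile.Write(ref _pinned[worker], 0L);

    public long Advance() => Interlocked.Increment(ref _global);

    // A page stamped with pageEpoch may be evicted only if no active scope
    // pinned that epoch or an older one.
    public bool CanEvict(long pageEpoch)
    {
        for (int i = 0; i < _pinned.Length; i++)
        {
            long pinned = Volatile.Read(ref _pinned[i]);
            if (pinned != 0 && pinned <= pageEpoch) return false;
        }
        return true;
    }
}
```

Eviction in this sketch never waits: it reads the slots, and if a page is still protected it moves on to another candidate. That "defer, don't block" shape is what keeps the page cache out of the lock-dependency graph.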

This removes the last edge class — cross-structure cycles. With all three classes gone, there is no graph in which a cycle can form.

This pillar is the one I worry about most, and the one most likely to break in the future. It's enforced by convention, not by the type system. Future features — cross-table indexes, parallel query execution holding read latches across multiple tables, foreign-key constraints — would each require extending the lock-hierarchy discipline. The concurrency overview explicitly lists those scenarios as known risks. I'll have to introduce explicit lock ordering when I get there.

What the bet costs

Prevention isn't free; it just shifts the cost.

What's eliminated	What remains
Deadlocks (cycles in the lock graph)	Aborts at commit — local retry
Detection runtime overhead	OLC restarts under index contention
Wait-for-graph data structures	Livelock under heavy contention (different problem)

A writer that loses the commit-time CAS doesn't trigger a global abort — it retries from where it was, against a refreshed baseline. An OLC reader that sees a version change doesn't block a writer — it restarts the traversal. These are local costs. A Postgres deadlock victim aborts the entire transaction; a Typhon OLC restart is one tree traversal.

Livelock — repeated retries that never converge — is a different beast. It can't deadlock, but it can starve. Typhon's AdaptiveWaiter handles this with a spin-then-yield progression: 65,536 tight spin iterations first (most contention resolves there), then exponentially halving spin counts interleaved with Thread.Sleep(100µs). The 100µs sleep is below the OS scheduler quantum, so wake latency stays sub-millisecond. It bounds livelock probability without trading away the latency targets.
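The shape of that progression can be sketched as follows. This is an approximation, not AdaptiveWaiter itself: `SpinThenYieldWaiter` is a hypothetical type, and because the public Thread.Sleep overloads take whole milliseconds or a TimeSpan, the sketch uses Thread.Yield() as a stand-in for the short sleep.

```csharp
using System.Threading;

public sealed class SpinThenYieldWaiter
{
    public const int InitialSpins = 65536;
    private int _spins; // 0 until the first wait

    // First round: the full spin budget. Every later round: half the previous one, never below 1.
    public int NextSpinCount()
    {
        _spins = _spins == 0 ? InitialSpins : System.Math.Max(1, _spins / 2);
        return _spins;
    }

    // Call once per failed attempt to acquire the contended resource.
    public void WaitOnce()
    {
        bool firstRound = _spins == 0;
        int spins = NextSpinCount();
        if (!firstRound) Thread.Yield(); // stand-in for the engine's short sleep
        Thread.SpinWait(spins);
    }

    public void Reset() => _spins = 0;
}
```

The spin budget shrinks on every failed round, so a thread that keeps losing spends progressively more of its time off the CPU instead of hammering the contended cache line.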

So: deadlocks gone, aborts and restarts kept, livelock bounded by a spin policy.

What others do

I didn't survey these in depth before committing to prevention — the upfront "no locks" constraint was made on principle. But for the reader's context, here's the landscape Typhon sidesteps.

System	Strategy	Cost model
PostgreSQL	Wait-for graph, triggered after `deadlock_timeout` (1s default)	Detection deferred to ≥1s lock wait; cycle scan is expensive but rare
MySQL InnoDB	Wait-for graph + victim selection (the smallest transaction, by rows inserted, updated, or deleted, is rolled back)	Detection can be disabled on high-concurrency systems in favor of `innodb_lock_wait_timeout`
CockroachDB	Per-node in-memory lock tables + Raft-replicated write intents	Detection is near-instantaneous; cost shifted to lock-table maintenance
Typhon	Prevention by structure (three pillars above)	No detection runtime cost; cost shifted to OLC restarts and commit-time aborts

These are all sound engineering choices for their workloads. Postgres' deferred detection is rare-event optimization. InnoDB's "roll back the smaller transaction" rule is a pragmatic heuristic for the OLTP shapes it's tuned for. CockroachDB's instantaneous detection genuinely solves the latency problem detection has elsewhere. None of these are wrong. They're answering a different question: given that we accept locks, how do we manage their cycles?

Typhon answers a different question: given that we don't accept locks, what does the rest of the architecture have to look like? That's why the comparison isn't "Typhon is faster" — it's "Typhon paid the cost in a different layer." Each row above describes where the cost lives, not who's faster.

A footnote: TigerBeetle reaches the same end via a different upfront constraint — single-writer serializable execution. No concurrent transactions, no deadlocks. Different category, same conclusion: detection is the wrong layer to solve this.

What I'd flag for a reviewer

Three honest acknowledgments.

Pillar 3 is enforced by convention, not by the type system. The compiler won't catch a future PR that holds latches across two ComponentTables. The discipline lives in code review and architectural awareness, not in mechanically-checked invariants. To compensate, I've set up a list of explicit design rules that Claude Code enforces during design, development, and code review. Pillar 3's "no cross-table latching" invariant is on that list; any code that would violate it gets flagged before it reaches the diff.

OLC restart cost is bounded but not zero. Under heavy write contention on a hot B+Tree leaf, optimistic readers can restart a few times before getting a clean version. The restart is one traversal, not a transaction abort, but it's not free either.

The "deadlock-free" claim assumes the current feature set. Cross-table indexes, parallel queries holding read latches across tables, and foreign-key constraints are all listed as future scenarios that would require extending the discipline. The structural argument holds for what ships today; future features will need to maintain it consciously.

What's next

The next post drills into Pillar 1 — how Typhon's MVCC works without cloning rows. The big trick is per-component revision chains instead of per-row tuple versioning: an entity with eight components that updates one creates a single new revision, not eight. The visibility check is a single comparison against the transaction's snapshot tick. And the EnabledBits exception dictionary pattern — zero-overhead fast path, dictionary slow path — is the prettiest piece of code in the engine.

Microsecond Latency in a Managed Language: The Performance Philosophy Behind Typhon

Loïc Baumann — Sun, 12 Apr 2026 20:58:24 +0000

The first two posts in this series covered the why and the what. Why C# for a database engine. What happens when you combine ECS storage with database guarantees.

This post is the how. Specifically: the five design principles that guide every performance decision in Typhon. Not a bag of tricks — a philosophy. Individual optimizations come and go as the engine evolves, but these principles are stable. They're what let a managed language deliver sub-microsecond transaction latency.

When your tick budget is 16 milliseconds and you have 100,000 entities to process, every nanosecond of per-entity cost matters. And most of that cost comes from decisions made at design time, not runtime.

Principle 1: Control Memory Layout

Performance starts at the struct definition, not the algorithm. If your data layout causes cache misses, no algorithm can save you.

The most dramatic example: Typhon recently moved from per-entity hash-table lookups to cluster-based Structure of Arrays (SoA) storage. Same data, same queries, different memory layout. Measured on a Ryzen 9 7950X:

Path	ns / entity	vs baseline
Standard EntityAccessor	139 ns	1.0x
ArchetypeAccessor (cached)	94 ns	1.5x
Cluster iteration	2.5 ns	55x

That's a 55x improvement from changing memory layout alone. The reason: clusters pack N entities (8 to 64, auto-computed per archetype) in contiguous SoA memory. All positions together, all health values together. Every cache line the CPU loads is 100% useful data. For 100K entities, the working set dropped from scattered L3/DRAM access to ~2.5 MB that fits entirely in L2 cache — and L2 is 3x faster than L3 on Zen 4.

The cluster size isn't a magic constant. An auto-tuning algorithm evaluates every N from 8 to 64 and picks the one that maximizes entities per 8 KB page for a given archetype's component schema. Non-power-of-2 sizes often pack better: N=14 can yield 28 entities per page vs N=16 yielding only 16. The capacity is derived from the data, not from convention.
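A toy version of that search shows why non-power-of-2 sizes can win. The layout model here is an assumption for illustration only (a fixed per-cluster header plus N equally sized entity slots, no page header, no alignment padding), and `ClusterSizing` is a hypothetical name. The real algorithm works from the archetype's actual component schema.

```csharp
public static class ClusterSizing
{
    public const int PageBytes = 8192;

    // Hypothetical layout: headerBytes of bookkeeping plus n slots of entityBytes each.
    public static int EntitiesPerPage(int n, int entityBytes, int headerBytes)
    {
        int clusterBytes = headerBytes + n * entityBytes;
        int clustersPerPage = PageBytes / clusterBytes; // whole clusters only
        return clustersPerPage * n;
    }

    // Evaluates every N from 8 to 64 and keeps the one that packs the most entities.
    public static int BestClusterSize(int entityBytes, int headerBytes)
    {
        int bestN = 8, bestCount = -1;
        for (int n = 8; n <= 64; n++)
        {
            int count = EntitiesPerPage(n, entityBytes, headerBytes);
            if (count > bestCount) { bestCount = count; bestN = n; }
        }
        return bestN;
    }
}
```

With 256-byte entities and a 64-byte header, N=16 produces a 4,160-byte cluster, so only one fits per page (16 entities). N=14 produces 3,648 bytes, two fit, and the page holds 28. Under this toy model the search settles on N=31.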

False sharing is the other side of layout control. When multiple threads write to adjacent fields, the CPU bounces the shared cache line between cores — a 40-60 cycle penalty per bounce. Typhon wraps mutable per-thread state in 64-byte padded structs. The WAL commit buffer goes further: explicit padding fields isolating the producer's _tailPosition and the consumer's _drainPosition onto separate cache lines. Seven unused long fields between them, suppressed with #pragma warning, because the correct layout matters more than the linter's opinion.
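A padded wrapper of that kind can be as small as the sketch below. `PaddedLong` is a hypothetical stand-in, not one of Typhon's types. Note the assumption built into it: the struct occupies 64 bytes, but that alone does not align an array of them to a 64-byte boundary.

```csharp
using System.Runtime.InteropServices;

// One counter per cache line: the explicit Size pads the 8-byte field out to 64 bytes.
[StructLayout(LayoutKind.Explicit, Size = 64)]
public struct PaddedLong
{
    [FieldOffset(0)] public long Value;
}

public static class PaddedLongInfo
{
    // Size of one padded counter, in bytes.
    public static int SizeInBytes => Marshal.SizeOf<PaddedLong>();
}
```

In an array of `PaddedLong`, neighbouring counters sit 64 bytes apart, so two threads incrementing adjacent slots no longer compete for the same line.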

The same hardware awareness drives B+Tree node sizing:

[StructLayout(LayoutKind.Sequential, Pack = 4)]
unsafe public struct Index32Chunk
{
    // 256 bytes — fills four cache lines. Adjacent Line Prefetcher (ALP) on
    // Zen 4+/recent Intel automatically fetches paired 64-byte lines within
    // 128-byte regions, so two ALP triggers cover the full node.

    public const int Capacity = 29;

    public int Control;
    public int OlcVersion;       // bit 0 = locked, bit 1 = obsolete, bits 2-31 = version
    public int PrevChunk;
    public int NextChunk;
    public int LeftValue;
    public int HighKey;          // B-link upper bound
    public fixed int Values[Capacity];  // 29 × 4 = 116 bytes
    public fixed int Keys[Capacity];    // 29 × 4 = 116 bytes
}

This struct is exactly 256 bytes because of the CPU's prefetcher. The Adjacent Line Prefetcher on modern x86 fetches paired 64-byte lines within 128-byte aligned regions — so two ALP triggers cover the full node. A 256-byte node costs effectively the same as a 128-byte node in terms of memory access, but holds nearly twice the keys.

The capacity of 29 keys isn't a round number because it isn't derived from the algorithm. It's derived from the hardware: 256 bytes of budget minus 24 bytes of header, divided across Keys and Values arrays. Typhon has three B+Tree variants — 16-bit, 32-bit, and 64-bit keys — and all three hit exactly 256 bytes with different capacities (38, 29, and 19 keys respectively). Post #1 mentioned 128-byte nodes. We've since moved to 256 bytes after measuring ALP behavior on Zen 4 — capacity went up, lookup latency stayed flat.
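The capacity arithmetic is easy to check. The sketch below assumes what the Index32Chunk listing suggests but the post does not state for the other two variants: a 24-byte header and 4-byte values in all three. `NodeCapacity` is a hypothetical helper.

```csharp
public static class NodeCapacity
{
    public const int NodeBytes = 256;
    public const int HeaderBytes = 24; // six 4-byte fields, as in Index32Chunk

    // Entries that fit once the header is paid for; integer division rounds down.
    public static int For(int keyBytes, int valueBytes = 4)
        => (NodeBytes - HeaderBytes) / (keyBytes + valueBytes);
}
```

Under those assumptions the formula gives 38, 29, and 19 entries for 2-, 4-, and 8-byte keys, matching the capacities quoted above. For 32-bit keys the fit is exact: 24 + 29 × 8 = 256.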

Principle 2: Eliminate Allocations on Hot Paths

In .NET, every allocation is a future GC event. On hot paths, the cost isn't the allocation itself (~5 ns) — it's the Gen0/Gen1 collection later that pauses unrelated threads. The discipline is simple: allocate nothing in steady state.

ref struct is the primary weapon. A ref struct lives on the stack, dies when the scope ends, and the GC never knows it existed. Post #1 showed EntityRef (96 bytes, inline component cache). But ref structs are a systematic discipline in Typhon, not a one-off optimization:

OlcLatch: wraps a single ref int — the B+Tree node's version field. The entire optimistic lock coupling protocol (read version, validate, try-write-lock) in a struct that's basically a typed pointer. Allocated millions of times per second during tree traversal, at zero GC cost.
EpochGuard: RAII scope for epoch-based page protection. Enter and exit in 3.3 ns. Because it's a ref struct, it can't be boxed, captured in a closure, or passed to async code — exactly the constraints you want for a scope guard.
WalClaim: a Write-Ahead Log buffer claim containing a Span<byte> that points directly into native WAL memory. Can't escape to the heap by construction: a struct with a Span<byte> field only compiles when it is declared as a ref struct.
PointInTimeAccessor: a reusable snapshot attached to parallel workers. One per worker, stored in a flat array indexed by worker ID. Zero per-entity dictionary overhead — no Dictionary<EntityId, T> on the hot path.

For short-lived buffers, stackalloc with a threshold pattern: stack-allocate when the array is small (under 64 elements), fall back to the heap otherwise. Most arrays stay small, so they never touch the allocator.
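The threshold pattern looks roughly like this. It is a generic sketch, not code from the engine: `ScratchBuffers` and `SumOfSquares` are made-up names, and the workload is a placeholder for "fill a temporary buffer, then consume it."

```csharp
using System;

public static class ScratchBuffers
{
    private const int StackThreshold = 64;

    public static long SumOfSquares(int count)
    {
        // Small buffers live on the stack; larger ones fall back to a heap array.
        Span<int> buffer = count < StackThreshold
            ? stackalloc int[count]
            : new int[count];

        for (int i = 0; i < buffer.Length; i++) buffer[i] = i * i;

        long total = 0;
        foreach (int v in buffer) total += v;
        return total;
    }
}
```

Because both branches produce a Span<int>, the rest of the method doesn't care where the memory came from, and no unsafe block is needed.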

For larger long-lived buffers, the Pinned Object Heap: GC.AllocateArray<byte>(capacity, pinned: true). Zero-initialized, never moved by the GC, stable pointer for direct access. Typhon's HashMap uses this for its entire entry array.

For medium reusable buffers, ArrayPool<T>.Shared. FPI compression rents 9 KB buffers, returns them in a finally block. Query execution rents stream arrays sized for the common case (8 slots), doubles if needed.

Four strategies — ref struct for scoped access, stackalloc for small temporaries, POH for large long-lived buffers, ArrayPool for medium reusable buffers. The result: zero hot-path allocations in steady state.

Principle 3: Reduce Memory Indirections

Every pointer chase is a potential cache miss. An L3 hit costs ~100 cycles, a DRAM miss costs ~200+. The goal: minimize the number of hops from "I want this data" to "here's the data."

Post #1 showed the flagship example — the SIMD chunk accessor with its 3-tier lookup (MRU check, Vector256 search, clock-hand eviction). Each tier reduces indirection compared to the next.

Epoch-based page protection eliminates another class of indirection. The traditional approach: atomic increment on page access, atomic decrement on release. For N page accesses in a transaction, that's 2N atomic operations — each one a potential cache-line bounce. Typhon uses epoch-based protection instead: one stamp when entering a transaction scope, one clear when exiting. Pages accessed within an active epoch can't be evicted. Cost: 2 operations per transaction, regardless of how many pages are touched.

Zone maps eliminate entire clusters of indirection. Each indexed field maintains per-cluster min/max bounds. A range query like WHERE Level >= 50 checks two integers per cluster — if the cluster's maximum is below 50, skip every entity in it without loading a single component byte. The impact at different selectivities, measured on 100K entities:

Selectivity	Without zone maps	With zone maps	Speedup
100%	13.4 ms	1.3 ms	10x
50%	13.4 ms	0.65 ms	21x
10%	13.4 ms	0.16 ms	84x
1%	13.4 ms	0.05 ms	268x

The float ordering trick makes this work for non-integer types: an IEEE 754 sign-flip converts floats to a representation where integer comparison order equals numeric order, enabling the same two-comparison interval overlap check regardless of field type.
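Both pieces fit in a few lines. This is a sketch of the general technique rather than Typhon's code: `ZoneMap`, `ToSortableInt`, and `MayContain` are hypothetical names, and NaN is left out of scope.

```csharp
using System;

public static class ZoneMap
{
    // Maps a float to an int whose signed ordering matches the float's numeric ordering.
    // Non-negative floats already sort correctly; for negative floats the 31 bits below
    // the sign bit are flipped so that larger magnitudes become smaller integers.
    public static int ToSortableInt(float value)
    {
        int bits = BitConverter.SingleToInt32Bits(value);
        return bits ^ ((bits >> 31) & 0x7FFFFFFF);
    }

    // Two comparisons: can the cluster's [min, max] overlap the query's [min, max]?
    public static bool MayContain(int clusterMin, int clusterMax, int queryMin, int queryMax)
        => clusterMax >= queryMin && clusterMin <= queryMax;
}
```

For WHERE Level >= 50, a cluster with bounds [10, 40] fails `MayContain(10, 40, 50, int.MaxValue)` and is skipped without touching its component data. A float field goes through the same check after its bounds and the query constant pass through `ToSortableInt`.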

At the other end of the scale, division elimination saves cycles on every single chunk lookup:

// Field: precomputed at segment creation
// Replaces expensive division (~20-80 cycles) with multiply+shift (~3-4 cycles)
private readonly ulong _divMagic;

// Constructor: compute magic multiplier once
_divMagic = (0x1_0000_0000UL + (uint)_otherChunkCount - 1) / (uint)_otherChunkCount;

// Hot path: every chunk lookup uses this instead of idiv
var pageIndex = (int)((adjusted * _divMagic) >> 32);
var offset = (int)(adjusted - (uint)(pageIndex * _otherChunkCount));

Integer division (idiv on x64) is notoriously slow — 20 to 80 cycles depending on operand size. The magic multiplier replaces it with a multiply and a shift: 3-4 cycles. The precomputation happens once when a segment is created; the benefit repeats on every one of the millions of chunk lookups that follow. Six lines of math, 20x speedup on a hot path. This is a classic systems programming trick that most managed-language developers have never needed — but when your per-entity budget is 2.5 nanoseconds, you need it.
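The trick is easy to sanity-check against real division. The sketch below reuses the same formula in a standalone helper (`MagicDivision` and its method names are mine, not the engine's). One caveat worth stating: a 32-bit magic multiplier like this is exact only while the dividend stays comfortably below 2^32 divided by the divisor, which I assume holds for chunk indices.

```csharp
public static class MagicDivision
{
    // ceil(2^32 / divisor), computed once.
    public static ulong ComputeMagic(uint divisor)
        => (0x1_0000_0000UL + divisor - 1) / divisor;

    // Quotient and remainder via one multiply and one shift instead of a hardware divide.
    public static (int Quotient, int Remainder) DivRem(uint value, uint divisor, ulong magic)
    {
        int quotient = (int)((value * magic) >> 32);
        int remainder = (int)(value - (uint)quotient * divisor);
        return (quotient, remainder);
    }

    // Compares the shortcut with real division for every value below limit.
    public static bool MatchesDivision(uint divisor, uint limit)
    {
        ulong magic = ComputeMagic(divisor);
        for (uint value = 0; value < limit; value++)
        {
            var (q, r) = DivRem(value, divisor, magic);
            if (q != (int)(value / divisor) || r != (int)(value % divisor)) return false;
        }
        return true;
    }
}
```

`MatchesDivision(29, 100_000)` walks the first hundred thousand values and confirms that the multiply-and-shift result agrees with `/` and `%` on every one.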

Principle 4: Let the JIT Help

The JIT compiler is your optimization partner, not your enemy. Write code in patterns it can optimize, and it does work for you that you'd have to do manually in C or Rust.

Constrained generics give you monomorphization. When you write where TMask : struct, IArchetypeMask<TMask>, the JIT generates a separate native code path for each concrete type. ArchetypeMask256 (four ulong fields, bitwise operations) gets fully inlined — no vtable, no virtual dispatch. This is the same optimization Rust gets from generics, but opt-in through the struct constraint.

sealed enables devirtualization. DirtyBitmap and ArchetypeClusterInfo are both on hot paths and both sealed. The JIT knows no subclass can exist, so it converts virtual calls to direct calls and can inline them.

[MethodImpl(MethodImplOptions.AggressiveInlining)] eliminates call overhead on micro-operations. B+Tree binary search, transaction state validation, every lock acquire/release: the overhead of a method call (save registers, set up stack frame, restore) is 2-5 ns. On a path called millions of times, that compounds.

SoA layout enables auto-vectorization. When a cluster is fully occupied (all N slots in use), the iteration loop becomes a simple sequential walk over contiguous SoA arrays with no branches. The JIT can auto-vectorize this on AVX2 — processing 8 floats per SIMD instruction. The SoA layout isn't just about cache locality; it's about giving the JIT a pattern it can vectorize.

But the most surprising JIT trick is dead-code elimination through static readonly fields:

// TelemetryConfig.cs — field declarations
/// <summary>
/// static readonly fields allow the JIT to eliminate disabled telemetry code paths
/// entirely. When a readonly field is false, the JIT treats guarded blocks as dead
/// code and removes them completely in Tier 1 compilation.
/// </summary>
public static readonly bool Enabled;
public static readonly bool EcsEnabled;
public static readonly bool EcsActive;    // Combined: Enabled && EcsEnabled

// Static constructor — computed once at startup
static TelemetryConfig()
{
    var section = config.GetSection("Typhon:Telemetry");
    Enabled = section.GetValue("Enabled", false);
    EcsEnabled = ecsSection.GetValue("Enabled", false);
    EcsActive = Enabled && EcsEnabled;
}

// EcsQuery.cs — usage on hot path
if (TelemetryConfig.EcsActive)
{
    activity = TyphonActivitySource.StartActivity("ECS.Query.Execute");
    activity?.SetTag(TyphonSpanAttributes.EcsArchetype, typeof(TArchetype).Name);
}

When EcsActive is false, the JIT doesn't just short-circuit the branch — it eliminates the entire if block from the generated native code. No branch instruction, no condition check, zero cost. The static readonly field, initialized in a static constructor, is treated as a constant after Tier 1 JIT compilation. The dead branch and everything inside it vanish.

This gives you zero-cost observability. Full OpenTelemetry tracing when enabled; literally nothing — not even a branch — when disabled. Most C# developers don't know the JIT does this. It's worth structuring your telemetry and feature flags around this pattern.

Principle 5: Design for the Hardware

The CPU manual is a requirements document. Cache-line size, SIMD register width, TLB coverage, memory bandwidth — these aren't abstract numbers. They drive struct sizing, batch sizes, and allocation strategy.

Cache-line size (64 bytes on x86, 128 bytes on Apple Silicon) drives CacheLinePaddedInt sizing, B+Tree node alignment, and SoA array alignment. The ViewDeltaRingBuffer aligns each sub-buffer to 64-byte boundaries so that the hardware prefetcher doesn't waste bandwidth loading adjacent unrelated data.

SIMD width determines batch sizes. Typhon's SimdPredicateEvaluator uses three-tier CPU dispatch for filtering entities by field values: AVX-512 processes 16 integer comparisons per instruction, AVX2 processes 8, with a scalar fallback for older hardware. The AVX-512 path uses a workaround — .NET doesn't expose 512-bit gather intrinsics, so it performs two 256-bit AVX2 gathers and combines them into a Vector512 for the comparison step. The JIT emits a native vpcmpd instruction for the 16-wide comparison. On Zen 4 (which double-pumps 512-bit operations), throughput matches two AVX2 iterations but with half the loop overhead.

Software prefetch hides memory latency where it matters most. During HashMap resize, speculative prefetch computes the future entry's position in the resized table and issues Sse.Prefetch0 to start loading that cache line while the current entry is being processed. The JIT translates this to a prefetcht0 instruction — essentially free to issue, and it hides 100+ cycles of latency per entry.

BMI2 instructions accelerate spatial indexing. Morton key encoding (Z-order curves) uses Bmi2.ParallelBitDeposit to interleave X/Y coordinates in ~1 cycle. The scalar fallback costs ~10 cycles. Morton ordering places spatially adjacent grid cells at nearby array indices, improving cache locality during neighbor queries.
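Here is what the two paths can look like side by side for 16-bit coordinates. This is an illustrative sketch, not the engine's encoder: `Morton`, `Encode`, and `EncodeScalar` are hypothetical names, and placing X in the even bits and Y in the odd bits is my choice.

```csharp
using System.Runtime.Intrinsics.X86;

public static class Morton
{
    // Hardware path when BMI2 is available, bit-twiddling fallback otherwise.
    public static uint Encode(ushort x, ushort y)
    {
        if (Bmi2.IsSupported)
        {
            // pdep scatters the low bits of the value into the positions set in the mask.
            return Bmi2.ParallelBitDeposit(x, 0x55555555u)
                 | Bmi2.ParallelBitDeposit(y, 0xAAAAAAAAu);
        }
        return EncodeScalar(x, y);
    }

    public static uint EncodeScalar(ushort x, ushort y) => Spread(x) | (Spread(y) << 1);

    // Inserts a zero bit between each of the low 16 bits of v.
    private static uint Spread(uint v)
    {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }
}
```

For x = 3 and y = 5 both paths produce 39 (binary 100111): the bits of x land in positions 0 and 2, the bits of y in positions 1 and 5.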

TLB coverage constrains working set design. Without 2 MB huge pages, x86 L2 TLB covers only 8-12 MB. Every access beyond that risks a 15-20 ns page walk penalty on top of the data access itself. Typhon's cluster storage keeps 100K entities in ~2.5 MB — comfortably within L2 TLB coverage even without huge pages. For larger datasets, the page cache's 8 KB pages and sequential access patterns keep the hardware prefetcher effective.

Memory bandwidth (~50 GB/s on Zen 4) is the ceiling for bulk scans. If your SoA component scan isn't approaching this number, something is leaving performance on the table — unnecessary indirection, poor alignment, or branches that defeat the prefetcher.

All measurements in this post were taken on an AMD Ryzen 9 7950X with .NET 10, BenchmarkDotNet, release configuration.

The Numbers

Individual principles are nice. What matters is how they compound. Here's what the engine actually delivers:

Operation	Latency
Cluster iteration (per entity)	2.5 ns
CRUD lifecycle (spawn, read, update, destroy, commit)	2.95 μs
Transaction create-read-commit (100 entities)	3.6 μs
B+Tree point lookup (10K entries)	191 ns
Component read (1 MVCC version)	703 ns
Component read (50 MVCC versions)	720 ns
Uncontended RW lock acquire	7.5 ns
Page cache hit	5.5 ns
Chunk accessor MRU hit	1.1 ns
Epoch enter/exit	3.3 ns
Cascade delete 10K entities	7.6 μs

The version invariance number deserves a callout: reading a component with 50 MVCC revisions costs the same as reading one with a single revision. 703 ns vs 720 ns — within measurement noise. The revision chain design works.

These principles also scale to parallel execution:

Workers	Tick time	Speedup	Efficiency
1	~37 ms	1.0x	100%
2	~18 ms	2.1x	104%
4	~10 ms	3.8x	95%
8	~5.3 ms	7.1x	89%

89% parallel efficiency on 8 workers. The 16-worker result (6.7x, 42% efficiency) hits the L3 cache / CCD boundary on the 7950X — a hardware wall, not a software one.

To put these numbers in perspective, here's the concurrency cost hierarchy that drives Typhon's design decisions:

Level	Cost	Example
0: Thread-local	~2 ns	TLS counter, local variable
1: Uncontended atomic	5-10 ns	AccessControl read latch
2: Contended atomic	20-140 ns	Multiple writers, same lock
3: System call	500-1000 ns	Timestamp via syscall
4: Context switch	~10,000 ns	Blocking lock, futex wait
5: Oversubscription	100,000+ ns	More threads than cores

Each level is roughly 10x more expensive than the previous one. Typhon's AdaptiveWaiter (spin → yield → sleep progression) keeps most contention at Level 2, avoiding the 100x jump to Level 4. The cache-line padding from Principle 1 keeps parallel workers from bouncing each other between Level 1 and Level 2. Every design decision maps to staying as low in this hierarchy as possible.
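
Typhon's AdaptiveWaiter is not shown in this post, so here is a rough stand-in built on .NET's own `System.Threading.SpinWait`, which follows a similar escalation: busy-spin first, then yield, then sleep. `SketchLatch` is a hypothetical type for illustration, not Typhon's AccessControl.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var latch = new SketchLatch();
long counter = 0;

// Four workers contend for the same latch.
Parallel.For(0, 4, _ =>
{
    for (int i = 0; i < 10_000; i++)
    {
        latch.Enter();
        counter++;          // protected by the latch
        latch.Exit();
    }
});
Console.WriteLine(counter); // 40000

// Hypothetical exclusive latch used only to demonstrate the waiting strategy.
sealed class SketchLatch
{
    private int _held; // 0 = free, 1 = held

    public void Enter()
    {
        var waiter = new SpinWait();
        // Fast path: one uncontended compare-and-swap (Level 1 in the table).
        while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
        {
            // SpinOnce busy-spins at first, then starts yielding the time slice,
            // and may sleep if the wait keeps dragging on.
            waiter.SpinOnce();
        }
    }

    public void Exit() => Volatile.Write(ref _held, 0);
}
```

The design point is that short waits never leave user mode, so a briefly held latch costs nanoseconds rather than a context switch.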

Trade-offs

Unsafe is unsafe. These techniques require unsafe code — pointer arithmetic, raw memory access, manual layout control. One bug can corrupt the page cache. Roslyn analyzers catch some classes of errors at compile time, but not all. The safety net has holes.

Complexity budget. Magic multipliers, SIMD evaluators, epoch-based protection, zone maps — each one is simple in isolation. The combination creates a codebase that demands systems-level understanding to navigate. There's no shortcut around understanding the hardware.

Not all of this transfers. Most .NET applications don't need microsecond latency. Using CacheLinePaddedInt in a web API is premature optimization. These techniques are for when you've measured, profiled, and confirmed that memory access patterns are your bottleneck — not before.

What's Next

The next post dives into concurrency: "Deadlock-Free by Construction: How Typhon Eliminates Deadlocks Instead of Detecting Them." Most databases treat deadlocks as a runtime problem — detect the cycle, abort a transaction, retry. Typhon makes deadlocks structurally impossible through a three-pillar mathematical argument. No detection, no timeouts, no retries.

What Game Engines Know About Data That Databases Forgot

Loïc Baumann — Sun, 05 Apr 2026 22:20:39 +0000

Series: A Database That Thinks Like a Game Engine

Why I'm Building a Database Engine in C#

What Game Engines Know About Data That Databases Forgot (this post)

Microsecond Latency in a Managed Language (coming soon)

Game servers sit at an uncomfortable intersection. They need the raw throughput of a game engine — tens of thousands of entities updated every tick. But they also need what databases provide: transactions that don't corrupt state, queries that don't scan everything, and durability that survives crashes.

Today, game server teams pick one side and hack around the other. An Entity-Component-System framework for speed, with manual serialization to a database for persistence. Or a database for safety, with an impedance mismatch every time they touch game state.

Typhon draws from both traditions. It's a database engine that stores data the way game engines do — and provides the guarantees that game servers need. Here's why those two worlds aren't as far apart as they look.

Two Fields, One Problem

ECS architecture evolved in game engines. Relational databases evolved in enterprise software. They never talked to each other. But look at what they built:

ECS Concept	Database Concept	Shared Principle
Archetype	Table	Homogeneous, fixed-schema storage
Component	Column	Typed, blittable, bulk-iterable data
Entity	Row	Identity with dynamic composition
System	Query	Process all records matching a signature
Frame Budget (16ms)	Latency SLA	Hard real-time deadline

An ECS "archetype" is a table. A "component" is a column. A "system" is a query. The vocabulary is different, the underlying structure is the same. Two fields, separated by decades and industry boundaries, converged on structurally identical solutions because they were solving the same fundamental problem: managing structured data under performance constraints.

This convergence is why a synthesis is possible at all. It's not an accident — it's driven by the same physics. Data must be laid out for the CPU cache. Access patterns must be predictable. Latency budgets are real.

What We Learned From Game Engines

ECS taught the database world something important about how data should be stored. Three lessons Typhon draws directly from game engine architecture:

Cache locality by default. In a traditional row store, reading all player positions means loading entire rows — names, inventories, health, everything. Most of those bytes are wasted. In ECS, components are stored per type: all positions contiguous, all health values contiguous. Reading 10,000 positions is a linear memory scan where every byte is useful.

This matters more than most developers realize. An L1 cache hit costs roughly 1 nanosecond. A DRAM miss costs 60-70 ns — a 65x penalty. When your database layout forces cache misses, no amount of algorithmic cleverness can save you.
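
To make the layout difference concrete, here is a minimal sketch comparing a row-style array of structs with a per-component column. `PlayerRow` and `LayoutDemo` are hypothetical names, and the field set is invented for illustration; it is not Typhon's storage format.

```csharp
using System;
using System.Runtime.CompilerServices;

const int Count = 10_000;
var rows = new PlayerRow[Count]; // array of structures: one full row per entity
var posX = new float[Count];     // structure of arrays: only the X column
for (int i = 0; i < Count; i++)
{
    rows[i].X = i;
    posX[i] = i;
}

// Each X read in the row layout drags the rest of the row through the cache.
Console.WriteLine($"Row layout stride:    {Unsafe.SizeOf<PlayerRow>()} bytes per X read");
Console.WriteLine($"Column layout stride: {sizeof(float)} bytes per X read");
Console.WriteLine(LayoutDemo.SumRows(rows) == LayoutDemo.SumColumn(posX)); // same answer

struct PlayerRow
{
    public float X, Y, Z;
    public int Health;
    public long AccountId;
    public long Gold;
}

static class LayoutDemo
{
    public static double SumRows(PlayerRow[] rows)
    {
        double sum = 0;
        for (int i = 0; i < rows.Length; i++) sum += rows[i].X;
        return sum;
    }

    public static double SumColumn(float[] xs)
    {
        double sum = 0;
        for (int i = 0; i < xs.Length; i++) sum += xs[i];
        return sum;
    }
}
```

Both loops compute the same sum; the column version simply touches fewer cache lines to get there.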

Zero-copy is the default, not the optimization. In a traditional database, reading a record means deserializing from a storage page into a language-level object. In ECS, a component is already in memory in its final layout — you just hand back a pointer. Typhon preserves this: components are blittable unmanaged structs read directly from pinned memory pages. No serialization, no managed heap allocation, no GC involvement.

Entity as pure identity. In ECS, an entity is just an ID — a 64-bit number with no inherent structure. All data lives externally in component tables. This is the opposite of ORM thinking where the object is the entity. Typhon inherits this: EntityId is a lightweight value type, all state lives in typed component storage. This separation is what makes the rest of the architecture possible — per-component versioning, per-component storage modes, independent indexes per component type.

What We Learned From Databases

Traditional databases solved problems that ECS never had to face. Five capabilities Typhon draws from database architecture:

ACID transactions with per-component MVCC. Game engines typically have no isolation. Two systems modifying the same entity in the same tick is a race condition — and in a single-process game, you control the execution order so you can manage it. On a game server with concurrent player sessions, you can't.

Databases solved this decades ago with MVCC: snapshot isolation where readers never block writers, with conflict detection at commit time. Typhon brings this in — but with a twist. Traditional databases version entire rows. Typhon versions each component independently. An entity's PositionComponent and InventoryComponent each maintain their own revision chain: a circular buffer of 12-byte revision entries, each stamped with a 48-bit transaction sequence number.

// Simplified: finding the visible revision for a snapshot
foreach (var rev in WalkRevisions(entityId))
{
    if (rev.IsolationFlag && rev.TSN != myTransactionTSN)
        continue;  // Skip uncommitted revisions from other transactions

    if (rev.TSN <= snapshotTSN)
        return rev; // Most recent revision visible to our snapshot
}

This means a transaction reading a player's position sees a consistent frozen point-in-time across all component types simultaneously — without locking any of them. Writers never block readers. And because revisions are per-component rather than per-entity, updating a player's position doesn't create a new version of their inventory. Less data copied, less garbage to collect.
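
The visibility rule in the simplified snippet can be exercised on its own. The sketch below assumes the chain is walked newest-first and uses hypothetical types (`Revision`, `Visibility`) with an `int` payload standing in for component data; the real revision entries are the 12-byte records described above.

```csharp
using System;
using System.Collections.Generic;

var chain = new List<Revision>
{
    new Revision(40, true, 400),  // newest: uncommitted write by transaction 40
    new Revision(30, false, 300),
    new Revision(10, false, 100), // oldest
};

Console.WriteLine(Visibility.Find(chain, snapshotTsn: 35, myTsn: 35)?.Payload); // 300
Console.WriteLine(Visibility.Find(chain, snapshotTsn: 20, myTsn: 20)?.Payload); // 100
Console.WriteLine(Visibility.Find(chain, snapshotTsn: 40, myTsn: 40)?.Payload); // 400 (own write)

readonly record struct Revision(long Tsn, bool Uncommitted, int Payload);

static class Visibility
{
    // Returns the most recent revision this snapshot may see, or null if none.
    public static Revision? Find(IReadOnlyList<Revision> newestFirst, long snapshotTsn, long myTsn)
    {
        foreach (var rev in newestFirst)
        {
            // Skip uncommitted revisions written by other transactions.
            if (rev.Uncommitted && rev.Tsn != myTsn) continue;

            // The first revision at or below the snapshot is the visible one.
            if (rev.Tsn <= snapshotTsn) return rev;
        }
        return null;
    }
}
```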

Indexed selective access. This is the big one. ECS systems iterate everything matching a component signature every tick. That works brilliantly for particle simulations where every particle needs updating. But game servers often need to touch only a small fraction of their entities each tick:

Scenario	Total Entities	Processed Per Tick	Useful Work
Battle royale (per-client relevancy)	50,000 actors	500–2,000	1–4%
MMO area of interest	100,000	200–1,000	0.2–1%
Physics (awake bodies only)	All rigidbodies	Awake subset	5–20%

When you're processing 1–4% of your entities, scanning everything is doing 25–100x more work than necessary. ECS frameworks recognized this — Unity DOTS added enableable components, Flecs added group_by, Unreal MassEntity added LOD tiers. These are all clever workarounds for the same underlying issue: ECS was designed for bulk iteration, not selective access.

Databases solved this with indexes. B+Trees for value-based lookups, spatial trees for area-of-interest queries, selectivity estimation to decide when to scan versus when to seek. Typhon brings these into the component storage model — not as bolted-on workarounds, but as first-class citizens.

Spatial partitioning. For spatial access patterns specifically — the #1 selective access need in game servers — Typhon integrates a two-layer spatial index directly into the component storage:

Layer 1: Sparse hash map — maps coarse grid cells to entity counts. O(1) rejection of empty regions before the tree is even touched.
Layer 2: Page-backed R-Tree — AABB, radius, ray, frustum, and kNN queries. Same OLC-latched, SOA node architecture as the B+Trees.
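
As a rough illustration of the Layer 1 idea, the sketch below keeps a count per coarse grid cell and rejects a query box when none of the cells it overlaps is occupied. `CoarseGridSketch`, its 2D coordinates, and the 64-unit cell size are assumptions for the example, not Typhon's implementation.

```csharp
using System;
using System.Collections.Generic;

var grid = new CoarseGridSketch(cellSize: 64f);
grid.Add(10f, 10f);
grid.Add(70f, 5f);

Console.WriteLine(grid.MayContain(0f, 0f, 63f, 63f));       // True: cell (0,0) is occupied
Console.WriteLine(grid.MayContain(500f, 500f, 600f, 600f)); // False: skip the tree entirely

sealed class CoarseGridSketch
{
    private readonly float _cellSize;
    private readonly Dictionary<(int X, int Y), int> _counts = new();

    public CoarseGridSketch(float cellSize) => _cellSize = cellSize;

    private int Cell(float v) => (int)MathF.Floor(v / _cellSize);

    public void Add(float x, float y)
    {
        var key = (Cell(x), Cell(y));
        _counts.TryGetValue(key, out int current);
        _counts[key] = current + 1;
    }

    // Cheap pre-check: true means "ask the tree", false means "nothing here".
    public bool MayContain(float minX, float minY, float maxX, float maxY)
    {
        for (int cx = Cell(minX); cx <= Cell(maxX); cx++)
            for (int cy = Cell(minY); cy <= Cell(maxY); cy++)
                if (_counts.ContainsKey((cx, cy))) return true;
        return false;
    }
}
```

For a query box no larger than one cell, the check touches at most four cells, which is where the constant-time rejection comes from.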

Both layers run inside the same transactional model as everything else. No external spatial hash bolted on alongside your ECS. No cache locality destroyed by chasing pointers into a separate data structure.

Durability. A game client can afford to lose state on crash — reload the level. A game server cannot. Player inventories, economy state, progression data — all must survive process restarts and crashes. WAL-based crash recovery, checkpointing, configurable fsync — these are database fundamentals that game servers need but ECS frameworks never provided.

Query planning. When you have both indexes and sequential storage, someone needs to decide which access path to use. Databases have decades of work on cost-based query optimization — selectivity estimation, histogram statistics, index selection. Typhon brings a query planner into the ECS world: given a predicate on a component field, it automatically chooses full scan or B+Tree seek based on estimated selectivity.
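
A toy version of that decision might look like the following. The 10% crossover threshold and the names `PlannerSketch` and `AccessPath` are assumptions for illustration; the post does not state how Typhon's planner estimates selectivity or where it draws the line.

```csharp
using System;

Console.WriteLine(PlannerSketch.Choose(estimatedMatches: 1_000, totalRows: 100_000));  // IndexSeek
Console.WriteLine(PlannerSketch.Choose(estimatedMatches: 60_000, totalRows: 100_000)); // FullScan

enum AccessPath { FullScan, IndexSeek }

static class PlannerSketch
{
    // Assumed crossover point: below this fraction of matching rows, seeking wins.
    private const double SeekThreshold = 0.10;

    public static AccessPath Choose(long estimatedMatches, long totalRows)
    {
        if (totalRows == 0) return AccessPath.FullScan;

        double selectivity = (double)estimatedMatches / totalRows;
        return selectivity < SeekThreshold ? AccessPath.IndexSeek : AccessPath.FullScan;
    }
}
```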

Purpose-Built for Game Servers

Typhon doesn't glue ECS and database concepts together with duct tape. It synthesizes them into a single model designed for game server workloads.

A component in Typhon is simultaneously an ECS component and a database schema:

[Component]
public struct PlayerComponent
{
    [Field]
    public String64 Name;

    [Field]
    [Index]                    // B+Tree for fast lookups
    public int AccountId;

    [Field]
    public float Experience;
}

Blittable, unmanaged, fixed-size, stored contiguously per type — that's the ECS side. Typed fields with automatic B+Tree indexes on marked fields — that's the database side. One declaration, both worlds.

The query API makes the synthesis concrete:

var topPlayers = db.Query<Player>()
    .Where(p => p.Level >= 50)
    .OrderByDescending(p => p.Level)
    .Take(10)
    .ExecuteOrdered(tx);

ECS-style typed component access. Database-style predicate filtering with automatic index selection. Inside a transaction with snapshot isolation. The query planner chooses scan vs B+Tree based on selectivity — the developer doesn't have to.

And because game servers have different durability needs for different operations, Typhon lets you choose per unit of work:

// Position ticks: game-engine speed, batched durability
using var tickUow = dbe.CreateUnitOfWork(DurabilityMode.Deferred);

// Legendary item drop: database safety, immediate fsync
using var dropUow = dbe.CreateUnitOfWork(DurabilityMode.Immediate);

Same engine, same API. Deferred mode gives game-engine-class commit latency for position updates that can be re-simulated on crash. Immediate mode gives database-class guarantees for a transaction that grants a rare item worth real money. The game server decides per operation — not globally.

Storage Modes: Not All Data Is Equal

A game server doesn't treat all data the same. Player positions change 60 times per second and can be re-simulated on crash. Inventory mutations are rare but must never be lost. AI runtime state — current targets, threat scores, pathfinding waypoints — is recomputed every tick and worthless after a restart.

Traditional databases treat all data identically. Traditional ECS keeps everything in memory with no durability distinction. Typhon lets you choose per component type:

Mode	MVCC History	Persisted	Change Tracking	Best For
Versioned	Full revision chains	Yes (WAL + checkpoint)	Via MVCC	Inventory, economy, progression
SingleVersion	Current state only	Yes (WAL + checkpoint)	DirtyBitmap	Positions, health, frequently-updated state
Transient	Current state only	No	DirtyBitmap	AI blackboard, threat scores, pathfinding scratch

SingleVersion components skip the revision chain overhead entirely — no circular buffer, no per-write allocation. They track changes through a DirtyBitmap instead: one bit per entity, flipped on write, scanned on tick fence. This is how game engines track what changed, and it's the right model for data that updates every tick.
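
The one-bit-per-entity idea is easy to sketch. The version below is single-threaded and uses the hypothetical name `DirtyBitmapSketch`; a concurrent implementation would need atomic bit operations, and Typhon's actual DirtyBitmap may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

var dirty = new DirtyBitmapSketch(capacity: 256);
dirty.MarkDirty(3);
dirty.MarkDirty(64);
dirty.MarkDirty(64);  // marking twice is harmless
dirty.MarkDirty(130);

Console.WriteLine(string.Join(", ", dirty.DrainDirty())); // 3, 64, 130
Console.WriteLine(dirty.DrainDirty().Count);              // 0

sealed class DirtyBitmapSketch
{
    private readonly ulong[] _words;

    public DirtyBitmapSketch(int capacity) => _words = new ulong[(capacity + 63) / 64];

    // Flip the entity's bit on write.
    public void MarkDirty(int entityIndex) =>
        _words[entityIndex >> 6] |= 1UL << (entityIndex & 63);

    // At the tick fence: collect every set bit, clearing as we go.
    public List<int> DrainDirty()
    {
        var result = new List<int>();
        for (int w = 0; w < _words.Length; w++)
        {
            ulong bits = _words[w];
            _words[w] = 0;
            while (bits != 0)
            {
                int bit = BitOperations.TrailingZeroCount(bits);
                result.Add((w << 6) + bit);
                bits &= bits - 1; // clear the lowest set bit
            }
        }
        return result;
    }
}
```

Scanning costs one 64-bit load per 64 entities, and words that are entirely clean are skipped after a single comparison.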

Versioned components get full MVCC with snapshot isolation — readers see consistent historical state, writers don't block readers, conflicts are detected at commit time. This is how databases protect critical data, and it's the right model for things that must never be corrupted.

Transient components never touch disk at all — no WAL, no checkpoint, no recovery. Pure in-memory storage with the same query and indexing API as everything else. AI blackboard data that's recomputed every tick has no business paying persistence overhead.

The same engine, the same transaction API, but the storage layer does exactly what each component type needs. This is what "purpose-built for game servers" means in practice.

Views: The Bridge Between ECS Systems and Database Queries

In ECS, a "system" runs every tick, processing all matching entities. In a database, a "materialized view" maintains a cached result set and refreshes it incrementally. Typhon's Views are both:

using var view = db.Query<ItemData>()
    .Where(i => i.Rarity >= 3)
    .ToView();

// Game loop
while (running)
{
    using var tx = dbe.CreateQuickTransaction();
    view.Refresh(tx);  // Microsecond incremental refresh

    // React to changes — like an ECS system, but only for what changed
    var delta = view.GetDelta();
    foreach (var pk in delta.Added)   SpawnVisual(pk);
    foreach (var pk in delta.Removed) DespawnVisual(pk);
    foreach (var pk in delta.Modified) UpdateVisual(pk);
    view.ClearDelta();
}

The initial ToView() runs a full query. After that, Refresh() drains a lock-free ring buffer of changes pushed by the commit path — only entities whose indexed fields actually changed are re-evaluated. If 100,000 entities match your view but only 12 changed since last refresh, you do 12 evaluations, not 100,000.

This is the iterate-everything problem solved from the database side: don't re-scan, track deltas.
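
The delta-tracking idea can be sketched in a few lines. Here a `ConcurrentQueue` stands in for the lock-free ring buffer, entity keys are plain `int`s, and only Added and Removed are tracked; `ViewSketch` and `NotifyChanged` are hypothetical names, not Typhon's API.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var rarity = new Dictionary<int, int> { [1] = 5, [2] = 1, [3] = 4 };
var view = new ViewSketch(id => rarity.TryGetValue(id, out var r) && r >= 3);

foreach (var id in rarity.Keys) view.NotifyChanged(id);
view.Refresh(); // initial population: 1 and 3 match

rarity[2] = 3;    view.NotifyChanged(2); // now matches
rarity.Remove(1); view.NotifyChanged(1); // no longer exists
var (added, removed) = view.Refresh();
Console.WriteLine($"added: {string.Join(",", added)} removed: {string.Join(",", removed)}");

sealed class ViewSketch
{
    private readonly Func<int, bool> _predicate;
    private readonly HashSet<int> _members = new();
    private readonly ConcurrentQueue<int> _changes = new();

    public ViewSketch(Func<int, bool> predicate) => _predicate = predicate;

    // Called by the (hypothetical) commit path for every entity it touched.
    public void NotifyChanged(int id) => _changes.Enqueue(id);

    // Re-evaluates only the entities that changed since the last refresh.
    public (List<int> Added, List<int> Removed) Refresh()
    {
        var added = new List<int>();
        var removed = new List<int>();
        while (_changes.TryDequeue(out var id))
        {
            bool matches = _predicate(id);
            if (matches && _members.Add(id)) added.Add(id);
            else if (!matches && _members.Remove(id)) removed.Add(id);
        }
        return (added, removed);
    }
}
```

The work done by `Refresh` is proportional to the number of queued changes, not to the size of the view.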

Trade-offs

Specializing for game servers means giving things up.

Blittable components only. No string, no object references, no variable-length arrays inside components. Text uses fixed-size types like String64. This is the price of zero-copy reads and cache-friendly storage — and it's a constraint game developers are already familiar with from ECS frameworks.
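
In plain C# terms, the rule maps onto the `unmanaged` generic constraint. The sketch below uses hypothetical component types to show which structs would qualify; it does not reproduce Typhon's own validation.

```csharp
using System;
using System.Runtime.CompilerServices;

// Runtime check: does the type contain any managed references?
Console.WriteLine(RuntimeHelpers.IsReferenceOrContainsReferences<PositionSketch>()); // False
Console.WriteLine(RuntimeHelpers.IsReferenceOrContainsReferences<NamedSketch>());    // True

Console.WriteLine(ComponentRules.SizeOf<PositionSketch>()); // 12
// ComponentRules.SizeOf<NamedSketch>(); // would not compile: NamedSketch is not unmanaged

struct PositionSketch { public float X, Y, Z; }              // fixed size, no references
struct NamedSketch { public string Name; public int Level; } // holds a managed string

static class ComponentRules
{
    // The unmanaged constraint rejects reference-holding structs at compile time.
    public static int SizeOf<T>() where T : unmanaged => Unsafe.SizeOf<T>();
}
```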

Entity-centric relationships, not SQL JOINs. Typhon supports navigation links, 1:N and N:M relationships — but they follow entity references, closer to a graph database than a traditional SQL one. This matches how game servers naturally think about data (an entity has components, a guild contains members), but if your mental model is SELECT ... FROM a JOIN b ON a.x = b.y, it's a different paradigm.

Schema in code, not SQL. Components are C# structs with attributes, not DDL statements. Natural for game developers, unfamiliar territory for database administrators. If your team thinks in SQL, this is a paradigm shift.

What's Next

In the next post, I'll go deeper into the performance philosophy that makes all of this actually fast — data-oriented design, cache-line awareness, and zero-allocation hot paths. The principles that let a managed language hit microsecond-latency transactions.

If you want to follow along, the best way is to star the repo or subscribe to the RSS feed.

Why I'm Building a Database Engine in C#

Loïc Baumann — Sun, 29 Mar 2026 13:02:55 +0000

When I tell people I'm building an ACID database engine in C#, the first reaction is always the same: "But what about GC pauses?"

It's a fair question. Nobody builds high-performance database engines in .NET. The assumption is that you need C, C++, or Rust for this class of software — that managed languages are fundamentally disqualified from the microsecond-latency club.

After 30 years of building real-time 3D engines and systems software, I chose C# anyway. The project is called Typhon: an embedded ACID database engine targeting 1–2 microsecond transaction commits. And the reasons behind that choice might change how you think about what C# can do.

The Case Against C# (Let's Steel-Man It)

Before I make my case, let me honestly lay out every argument against choosing C# for this. These are real concerns, not strawmen.

The GC is non-deterministic. It can pause all your threads whenever it wants. For a database engine that promises microsecond latency, a 10ms Gen2 collection is catastrophic — that's 10,000x your latency budget.

You don't control memory layout. The managed heap decides where objects live. The GC can move them around during compaction. You can't guarantee that your B+Tree nodes sit on cache-line boundaries, or that your page cache buffer won't get relocated mid-transaction.

JIT warmup is real. The first call to any method pays the compilation cost. In a database engine, the first transaction after startup shouldn't be 100x slower than the steady state.

Virtual dispatch and bounds checking add overhead. Every array access has a hidden bounds check. Every interface call goes through a vtable. In a hot loop processing millions of entities, these nanoseconds compound.

These are all legitimate problems. I won't pretend they aren't. But here's what most people miss: modern C# has answers for every single one of them.

What Most People Don't Know About C#

The C# that most developers know — classes, garbage collection, LINQ — is only half the language. There's a whole other side that the .NET runtime team has been quietly building for a decade, and it looks nothing like what you'd expect.

unsafe gives you C-level control. Raw pointers, pointer arithmetic, stackalloc for stack buffers, fixed-size arrays — the JIT generates the same mov/cmp/jne instructions you'd get from C. Not "close to C." The same instructions.

GCHandle.Alloc with GCHandleType.Pinned makes the GC irrelevant where it matters. You can pin byte arrays so the GC never moves them. Typhon's entire page cache is pinned memory: the GC doesn't touch it, doesn't scan it, doesn't move it. It's just raw bytes at a fixed address, exactly like malloc in C.
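
A minimal sketch of pinning a single 8 KB buffer with the standard `GCHandle` API. The buffer size and variable names are illustrative; how Typhon allocates and organizes its page cache is not shown here.

```csharp
using System;
using System.Runtime.InteropServices;

byte[] page = new byte[8192];
GCHandle handle = GCHandle.Alloc(page, GCHandleType.Pinned);
try
{
    // The address stays valid until the handle is freed.
    IntPtr address = handle.AddrOfPinnedObject();

    // Write through the raw address, read back through the array.
    Marshal.WriteInt32(address, 0, 0x12345678);
    Console.WriteLine(BitConverter.ToInt32(page, 0) == 0x12345678); // True

    // A collection cannot relocate the pinned array.
    GC.Collect();
    Console.WriteLine(address == handle.AddrOfPinnedObject());      // True
}
finally
{
    handle.Free(); // unpin so the GC can manage the array again
}
```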

ref struct eliminates heap allocations on hot paths. A ref struct can never escape to the heap. It lives on the stack, dies when the scope ends, and the GC never knows it existed. Typhon's entity accessor (EntityRef) is a 96-byte ref struct — zero allocation, zero GC pressure.

Constrained generics give you true monomorphization. When you write where T : unmanaged, the JIT generates a separate native code path for each type parameter. sizeof(T) becomes a constant. Dead branches get eliminated. It's the same optimization Rust gets from generics — not a runtime dispatch, but compile-time specialization.

Hardware intrinsics are first-class. System.Runtime.Intrinsics gives you Vector256 and Sse42.Crc32, and System.Numerics adds BitOperations.TrailingZeroCount. These are the same SIMD and bit-manipulation instructions available in C/C++, with the same performance, and runtime feature detection so you can fall back gracefully.

[StructLayout(LayoutKind.Explicit)] gives you exact memory layout. Field offsets, padding, size: you control every byte. Cache-line alignment, false-sharing prevention, and bit-packing are all there.
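
As an illustration of false-sharing prevention, the sketch below pads a counter to 64 bytes so that adjacent array elements cannot share a 64-byte cache line. `PaddedCounterSketch` is a hypothetical type, not Typhon's CacheLinePaddedInt, and the 64-byte line size is an assumption that holds on current x86 parts.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

var counters = new PaddedCounterSketch[4];

// Each worker hammers its own slot; the padding keeps the slots 64 bytes apart.
Parallel.For(0, counters.Length, worker =>
{
    for (int i = 0; i < 100_000; i++)
        Interlocked.Increment(ref counters[worker].Value);
});

Console.WriteLine(Unsafe.SizeOf<PaddedCounterSketch>()); // 64
Console.WriteLine(counters[0].Value);                     // 100000

// Explicit layout: one 8-byte field at offset 0, total size forced to 64 bytes.
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounterSketch
{
    [FieldOffset(0)] public long Value;
}
```

The array base itself is not guaranteed to be 64-byte aligned, but because consecutive `Value` fields sit exactly 64 bytes apart, no two of them can land on the same line.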

This isn't "C# trying to be C." It's C# providing a genuine systems programming layer on top of a best-in-class managed ecosystem.

What Typhon Actually Looks Like

Theory is nice. Now let's look at real code.

Hardware-accelerated WAL checksums

Every page written to the Write-Ahead Log needs a CRC32C checksum. Here's what that looks like in C# — calling CPU instructions by name:

private static uint ComputePartial(uint crc, ReadOnlySpan<byte> data)
{
    if (Sse42.X64.IsSupported)   return ComputeSse42X64(crc, data);
    if (Sse42.IsSupported)       return ComputeSse42X32(crc, data);
    if (ArmCrc32.Arm64.IsSupported) return ComputeArm64(crc, data);
    return ComputeSoftware(crc, data);
}

private static uint ComputeSse42X64(uint crc, ReadOnlySpan<byte> data)
{
    ulong crc64 = crc;
    ref byte ptr = ref MemoryMarshal.GetReference(data);
    int offset = 0;
    int aligned = data.Length & ~7;

    while (offset < aligned)
    {
        crc64 = Sse42.X64.Crc32(crc64, Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref ptr, offset)));
        offset += 8;
    }

    uint crc32 = (uint)crc64;
    while (offset < data.Length)
    {
        crc32 = Sse42.Crc32(crc32, Unsafe.Add(ref ptr, offset));
        offset++;
    }
    return crc32;
}

Sse42.X64.Crc32() compiles to a single x86 crc32 instruction. The runtime detects the CPU capabilities, the JIT eliminates the dead branches, and what executes is the same code a C programmer would write — but with automatic fallback on platforms without SSE4.2. Result: ~1.3 µs per 8 KB page.
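
The ComputeSoftware fallback is not shown above. A plausible bit-at-a-time version of CRC32C (the Castagnoli polynomial, which is what the x86 crc32 instruction implements) is sketched below. It assumes the conventional initial value and final inversion are applied inside the function, which may differ from how Typhon splits the work across ComputePartial and its callers.

```csharp
using System;
using System.Text;

// "123456789" is the standard check input for CRC algorithms.
byte[] check = Encoding.ASCII.GetBytes("123456789");
Console.WriteLine($"{Crc32CSketch.Compute(check):X8}"); // E3069283

static class Crc32CSketch
{
    // Castagnoli polynomial in reflected (LSB-first) form.
    private const uint ReflectedPolynomial = 0x82F63B78;

    public static uint Compute(ReadOnlySpan<byte> data)
    {
        uint crc = 0xFFFFFFFF; // conventional initial value
        foreach (byte b in data)
        {
            crc ^= b;
            // Process one bit at a time; slow, but needs no tables or intrinsics.
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ ReflectedPolynomial : crc >> 1;
        }
        return ~crc; // conventional final inversion
    }
}
```

A table-driven variant would be several times faster, but the bitwise form is the easiest to check against the hardware path.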

The SIMD chunk accessor

This is Typhon's page cache hot path — a 16-slot cache that finds your data in one of three tiers:

// === ULTRA FAST PATH: MRU check ===
var mru = _mruSlot;
if (_pageIndices[mru] == pageIndex)
{
    var headerOffset = pageIndex == 0 ? _rootHeaderOffset : _otherHeaderOffset;
    return (byte*)_baseAddresses[mru] + headerOffset + offset * _stride;
}

// === FAST PATH: SIMD search through all 16 cached slots ===
fixed (int* indices = _pageIndices)
{
    var target = Vector256.Create(pageIndex);

    var v0 = Vector256.Load(indices);
    var mask0 = Vector256.Equals(v0, target).ExtractMostSignificantBits();
    if (mask0 != 0)
    {
        var slot = BitOperations.TrailingZeroCount(mask0);
        return GetFromSlot(slot, pageIndex, offset, dirty);
    }

    var v1 = Vector256.Load(indices + 8);
    var mask1 = Vector256.Equals(v1, target).ExtractMostSignificantBits();
    if (mask1 != 0)
    {
        var slot = 8 + BitOperations.TrailingZeroCount(mask1);
        return GetFromSlot(slot, pageIndex, offset, dirty);
    }
}

The _pageIndices array is a fixed int[16] — 64 bytes, one cache line, packed for SIMD. One Vector256.Equals compares 8 page indices in a single instruction. The MRU fast path handles the common case (repeated access to the same page) with a single branch — branch predictor friendly, near-zero cost.

Zero-copy entity reads

EntityRef is a ref struct — stack-only, 96 bytes, with an inline fixed array caching component locations:

public unsafe ref struct EntityRef
{
    internal readonly EntityId _id;
    internal readonly ArchetypeMetadata _archetype;
    internal readonly ArchetypeEngineState _engineState;
    internal readonly Transaction _tx;
    internal ushort _enabledBits;
    internal readonly bool _writable;
    private fixed int _locations[16];  // inline component chunk IDs

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ref readonly T Read<T>(Comp<T> comp) where T : unmanaged
    {
        byte slot = _archetype.GetSlot(comp._componentTypeId);
        int chunkId = _locations[slot];
        var table = _engineState.SlotToComponentTable[slot];
        return ref _tx.ReadEcsComponentData<T>(table, chunkId);
    }
}

That Read<T> call goes from method call → slot lookup → chunk ID → page cache → pointer arithmetic → ref readonly T pointing directly into a pinned memory page. Zero copies. Zero allocations. Zero GC involvement. The where T : unmanaged constraint means the JIT knows the exact layout — it compiles to pointer arithmetic, nothing more.

JIT-specialized hash functions

Even the hash functions exploit the JIT. Because the JIT compiles a separate method body for each value-type TKey, sizeof(TKey) is a constant inside each one and the dead branches vanish:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static uint ComputeHash<TKey>(TKey key) where TKey : unmanaged
{
    if (sizeof(TKey) == 4) return FastHash32(Unsafe.As<TKey, uint>(ref key));
    if (sizeof(TKey) == 8) return XxHash32_8Bytes(Unsafe.As<TKey, long>(ref key));
    return XxHash32_Bytes((byte*)Unsafe.AsPointer(ref key), sizeof(TKey));
}

When you call ComputeHash<int>(42), the JIT generates just the 4-byte path. The other two branches are completely eliminated. This is real monomorphization, not runtime dispatch.

The Productivity Argument

A database engine is more than its hot path. Around the core engine sits a large shell of infrastructure: configuration management, structured logging, telemetry, dependency injection, testing, benchmarking.

In C or Rust, you'd build much of this yourself or stitch together crates/libraries with varying quality. In .NET, this is production-grade and free: ILogger and OpenTelemetry for observability, BenchmarkDotNet for rigorous micro-benchmarks, NUnit for testing, IConfiguration for settings. All well-documented, all interoperable, all maintained by Microsoft or battle-tested OSS communities.

For a solo developer building a database engine, this is a genuine competitive advantage. I spend my time on concurrency primitives and page cache eviction, not on reinventing a logging framework.

It's the Memory Layout, Not the Language

Here's the insight that years of real-time 3D engines taught me: the bottleneck in a database engine is memory access patterns, not instruction throughput.

A cache miss to DRAM on a Ryzen 9 7950X costs 61–73 nanoseconds. That's ~250 CPU cycles doing nothing, waiting for data. A CAS operation hitting L1 costs 1.4 nanoseconds. The ratio is roughly 50:1.

No amount of "zero-cost abstractions" in your language can save you if your data structures cause cache misses. Conversely, if your data layout is cache-friendly — contiguous, aligned, predictable access patterns — the language barely matters. C# with unsafe generates identical machine code to C on hot paths. The JIT is that good.

What matters is:

Cache-line awareness: Typhon's B+Tree nodes are 128 bytes — two cache lines. The stride prefetcher on Zen4 covers the second line automatically. This alone cut insert latency by 53% and lookup latency by 30% versus 64-byte nodes.
Data-oriented design: Structure of Arrays over Array of Structures. SIMD-friendly layouts. Blittable types only.
Minimizing indirections: Every pointer chase is a potential cache miss. The SIMD chunk accessor's MRU hit avoids the chase entirely.

The language you write in matters far less than the memory layout you design.

The Numbers

All measurements on a Ryzen 9 7950X, .NET 10.0, BenchmarkDotNet, release configuration.

Operation	Latency	Throughput
CRUD lifecycle MVCC (spawn, read, update, destroy, commit)	1.2 µs	830K ops/sec
90 reads/10 updates workload (100 ops per tx, MVCC)	22 µs	~4.5M entity-ops/sec
B+Tree lookup (hit)	267 ns	3.7M ops/sec
B+Tree sequential scan (per key)	2.1 ns	479M keys/sec
Uncontended lock acquire	7.8 ns	128M ops/sec
Page cache hit	5.3 ns	—

Context: an uncontended CAS on Zen4 costs 1.4 ns. A DRAM round-trip costs 61–73 ns. Typhon's lock acquire (7.8 ns) is about 5 CAS operations — tight, considering it handles shared/exclusive arbitration with waiter tracking. The 267 ns B+Tree lookup implies 6–7 memory accesses, which matches a tree traversal through L2/L3 cache.

These are early alpha numbers. There's room to improve. But they validate the core thesis: C# is not the bottleneck.

Trade-offs

No choice is without cost. Here's what I'd tell someone considering the same path.

Memory safety is on you. In unsafe blocks, you can corrupt memory, dereference bad pointers, overflow buffers — the compiler won't save you. Span<T> is a slightly slower but totally safe alternative.

The GC hasn't been a problem, but it could be. Because the page cache is pinned and the hot paths use ref struct, Gen2 collections are rare and cheap. But I won't pretend this is guaranteed. A workload that allocates heavily in managed code between transactions could still see pauses. The answer is discipline: don't allocate on hot paths. The language lets you avoid allocations; it just doesn't force you to.

"But Rust would give you compile-time safety." True — the borrow checker catches ownership and lifetime bugs that unsafe C# can't. But C# has a trick Rust doesn't: Roslyn analyzers. I wrote a custom analyzer suite (TYPHON001–007) that enforces domain-specific safety rules as compiler errors:

[NoCopy] attribute + analyzer: performance-critical structs like ChunkAccessor cannot be passed by value — the compiler errors if you forget ref. This is the same guarantee Rust's borrow checker gives for move semantics, but scoped to the types that actually matter.
Ownership tracking: if you create a ChunkAccessor or Transaction and don't dispose it, that's a compiler error, not a runtime leak. The analyzer tracks ownership transfers through assignments, returns, and ref/out parameters. A [return: TransfersOwnership] attribute on a method declares the transfer explicitly so the analyzer can act accordingly.
Disposal completeness: if your type holds a critical disposable field and your Dispose() method misses it or has an early return that skips it — compiler error.

// This is a compile-time error in Typhon — TYPHON001
void Process(ChunkAccessor accessor) { ... }  // ✗ Error: must be passed by ref

void Process(ref ChunkAccessor accessor) { ... }  // ✓ OK

You don't get Rust's safety for free in C#. But you can build the exact subset you need as compiler errors, tailored to your domain. And unlike Rust's borrow checker, these rules carry domain context in the diagnostics: "causes page cache deadlock" is more actionable than "value moved here."

Rust's ecosystem for the surrounding infrastructure (logging, DI, configuration, testing) is also less mature than .NET's, and as a solo developer, my velocity matters. I chose the language where I ship faster.

JIT warmup is real but manageable. The first few transactions after cold start are slower. For an embedded engine (no separate server process), this is acceptable — the host application typically has its own warmup. For a server database, you'd want tiered compilation or AOT.

What's Next

In the next post, I'll explain why an ACID database engine borrows its storage architecture from game engines — specifically the Entity-Component-System pattern. Game engines and databases are solving the same fundamental problem: managing structured data with extreme performance constraints. They just evolved completely different solutions.

If you want to follow along, the best way is to star the repo or subscribe to the RSS feed.

This is Post #1 in a series about building a database engine in C#. Next up: "What Game Engines Know About Data That Databases Forgot".