DEV Community

Krishna Bajpai

Designing for Sub-Microsecond Latency

Lessons from Building a Minimal Execution Engine
Modern systems are fast — but predictable fast is rare.

Most frameworks optimize for throughput, developer velocity, or horizontal scalability. When you care about tail latency, determinism, and sub-microsecond critical paths, those abstractions often become liabilities.

I built SubMicro Execution Engine to explore what happens when latency — not features — is the primary design constraint. Below are a few practical lessons that shaped the system.

1. Latency Lives in the Edges, Not the Core Logic

The actual “work” a system performs is rarely the bottleneck.

Latency hides in:

  • memory allocation
  • cache-line contention
  • branch misprediction
  • scheduler handoffs
  • synchronization primitives

The engine minimizes these by:

  • keeping hot paths allocation-free

  • favoring flat, cache-friendly data layouts

  • avoiding implicit synchronization

  • designing execution flows that fit in L1/L2 cache

If you can’t draw the hot path from memory, you don’t control latency.

2. Determinism Beats Raw Throughput

A system that does 1M ops/sec sometimes is less useful than one that does 200k ops/sec always.

Design choices were guided by:

  • stable execution order
  • predictable scheduling
  • minimal dynamic behavior in hot paths

This trades peak throughput for tight latency distributions, which matter far more in real-time and trading-style systems.
3. Abstractions Have a Cost — Measure Them Ruthlessly

Abstractions aren’t bad, but unmeasured abstractions are dangerous.

In low-latency systems:

  • virtual dispatch can cost more than the logic itself
  • generic containers hide memory access patterns
  • “clean” interfaces often fragment the execution path

The engine favors:

  • explicit control over execution

  • visible data movement

  • simple, inspectable components

Code clarity is preserved by removing layers, not adding them.

4. Scheduling Is a Latency Feature

Schedulers decide when work happens — which is as important as what happens.

Design considerations include:

  • minimal context switching
  • optional busy-polling strategies
  • execution models that avoid OS interference in hot paths

The goal is to keep execution close to the CPU, not bouncing between queues and threads.
5. Measure the Tail, Not the Average

Average latency lies.

The engine is designed with the assumption that:

  • p99 and p99.9 matter more than the mean
  • occasional spikes break real-time systems
  • instrumentation must be lightweight enough for production use

If you don’t measure the tail, you are optimizing blind.

Closing Thoughts
Sub-microsecond systems are not built by adding optimizations — they’re built by removing uncertainty.

This project is intentionally minimal. It is not a framework. It is an exploration of how far you can push latency control when every design decision answers one question:

Does this reduce or increase unpredictability?

Repo: submicro-execution-engine
GitHub: https://github.com/krish567366/submicro-execution-engine
Website: https://submicro.krishnabajpai.me/
