Joseph Boone

When Stability Improves Performance (Threading)

The common assumption in concurrent systems is that stability and performance pull in opposite directions. You add safety mechanisms such as locks and routing constraints, and you pay for them in throughput. This post is about a case where that assumption turned out to be wrong.

The Premise

TokenGate is a token-managed concurrency system. Decorated functions return tokens instead of executing immediately. Those tokens are admitted through a wrapped decorator, routed to per-core mailboxes by weight class, and executed on thread pool workers.

The system also maintains a clean separation between async coordination and threaded execution, a boundary that is a common source of complexity in concurrent systems.

TokenGate aims to ease this by using tokens as a bridge between the async event loop and a thread pool. The async event loop manages the routing and coordination of tokens, while the thread pool handles the execution. The routing model assigns tokens to cores by weight and storage speed.
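To make that bridge concrete, here is a minimal sketch of the idea. Token, core_worker, and the mailbox layout are illustrative assumptions, not TokenGate's actual internals:

import asyncio
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Token:
    operation_type: str
    fn: Callable[..., Any]
    args: tuple
    weight: str = "medium"   # weight class decides which cores are eligible

async def core_worker(core: int, mailbox: asyncio.Queue, pool: ThreadPoolExecutor) -> None:
    loop = asyncio.get_running_loop()
    while True:
        token = await mailbox.get()                              # async side: coordination
        await loop.run_in_executor(pool, token.fn, *token.args)  # thread side: execution
        mailbox.task_done()

async def main() -> None:
    pool = ThreadPoolExecutor()
    mailboxes = [asyncio.Queue() for _ in range(4)]              # one mailbox per core
    workers = [asyncio.create_task(core_worker(i, q, pool)) for i, q in enumerate(mailboxes)]
    await mailboxes[2].put(Token("my_op", print, ("hello from core 2",)))
    await asyncio.gather(*(q.join() for q in mailboxes))
    for w in workers:
        w.cancel()

if __name__ == "__main__":
    asyncio.run(main())

The event loop never runs the work itself; it only decides which mailbox a token lands in, which is the separation described above.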

Operations are handled separately, keeping them distinct at every stage.

The weight classification determines which cores a task is eligible for. This enables "front-back-fill" scheduling patterns, where light work occupies the later cores while heavy work starts on the first core and spills toward the others as load increases.

Each weight class has a defined core range:

  • HEAVY → All Cores
  • MEDIUM → Core 2 +
  • LIGHT → Core 3 +

Within those ranges, a staggered position counter distributes tokens across workers in FIFO order, with the ability to interleave retry tokens.
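A toy version of that routing rule might look like the following (WEIGHT_RANGES, pick_core, and the per-class counters are illustrative assumptions, not TokenGate's actual scheduler):

from itertools import count

NUM_CORES = 4
WEIGHT_RANGES = {
    "heavy": range(1, NUM_CORES + 1),   # all cores
    "medium": range(2, NUM_CORES + 1),  # core 2 and up
    "light": range(3, NUM_CORES + 1),   # core 3 and up
}
_positions = {w: count() for w in WEIGHT_RANGES}  # staggered position counters

def pick_core(weight: str) -> int:
    cores = WEIGHT_RANGES[weight]
    # FIFO within the class, staggered across its eligible cores
    return cores[next(_positions[weight]) % len(cores)]

With four cores, pick_core("light") cycles 3, 4, 3, 4 while pick_core("heavy") walks 1 through 4, which gives the front-back-fill shape described above.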

What Was Built

Sticky Token Registry

Tokens are marked when they are seen to have related args: tokens with matching (operation_type, args) keys are pinned to the core that first receives them, keeping data locality clean.

When a token arrives, sticky_registry.mark() creates the sticky anchor that automatically groups these related tokens.

@task_token_guard(
    operation_type="my_op",
    tags={"weight": "medium", "sticky_anchor": "my_domain"},
)
def my_operation(n: int) -> int:
    ...

This is reactive: it catches collisions as they arrive. It handles the case where two tokens with identical logical identity are submitted concurrently and would otherwise end up spread across other core domains.
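A rough sketch of what that reactive pinning amounts to (StickyRegistry and mark here are illustrative stand-ins for the real registry):

from threading import Lock

class StickyRegistry:
    def __init__(self) -> None:
        self._anchors: dict[tuple, int] = {}
        self._lock = Lock()

    def mark(self, operation_type: str, args: tuple, core: int) -> int:
        key = (operation_type, args)
        with self._lock:
            # First arrival wins; later tokens with the same key are pinned to it.
            return self._anchors.setdefault(key, core)

If my_operation(42) first lands on core 2, a concurrent duplicate aimed at core 3 gets 2 back from mark() and is routed there instead.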

Hash Conductor

The second layer is proactive. Instead of waiting to see whether a collision happens, the conductor anchors an entire call chain to a domain before any child token is even routed. It's a heavier pattern, but it's a reliable way to ensure all related data is processed on the same core.

When a lead token is decorated with external_calls, a SHA-256 seed is
generated from the token ID and the call list:

seed = SHA-256( token_id + ":" + freeze(external_calls) )

Full 64-character hex digest. The token ID is included so two leads with
identical call lists still get independent domains.
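In plain Python the derivation reduces to something like this (freeze is a stand-in for however the call list is actually canonicalized):

import hashlib

def freeze(external_calls: list[str]) -> str:
    # Stand-in for the canonical form of the call list.
    return ",".join(external_calls)

def conductor_seed(token_id: str, external_calls: list[str]) -> str:
    material = f"{token_id}:{freeze(external_calls)}".encode()
    return hashlib.sha256(material).hexdigest()   # full 64-character hex digest

# Two leads with the same call list still get distinct domains,
# because the token ID is part of the hash input:
assert conductor_seed("tok-1", ["child_op"]) != conductor_seed("tok-2", ["child_op"])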

That seed is pinned to whichever core the lead lands on. Any token spawned during the lead's execution inherits the seed and is routed to the same core automatically. No configuration is needed inside call sites. No explicit passing. The seed propagates through a thread-local value, set in the executor thread before the lead function runs, and is read during token routing.
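A minimal sketch of that propagation (the names _conductor_ctx, run_lead, and current_seed are illustrative, not TokenGate's API):

import threading

_conductor_ctx = threading.local()

def run_lead(lead_fn, seed, *args, **kwargs):
    _conductor_ctx.seed = seed       # set by the executor thread before the lead runs
    try:
        return lead_fn(*args, **kwargs)
    finally:
        _conductor_ctx.seed = None   # cleared when the lead returns

def current_seed():
    # Token routing reads this when a child token is created inside the lead,
    # so the child inherits the lead's core domain with no explicit passing.
    return getattr(_conductor_ctx, "seed", None)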

The most important thing to consider is that workers within the core domain may still execute the tokens in parallel, respecting the workers' staggered routing. It's not a "nerf" to the system's capability: under saturated load conditions, noticeable performance gains were observed.

@task_token_guard(
    operation_type="lead_op",
    tags={"weight": "medium", "external_calls": ["child_op"]},
)
def lead_operation(n: int) -> list:
    # These children inherit the seed and land on the same core
    return [child_op(n + i) for i in range(4)]

The pending count for a lead operation starts at 1, increments for each subsequent external call at creation time, and decrements on every completion. When it reaches zero, the seed is released.
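In sketch form, that lifecycle looks roughly like this (ConductedDomain and its method names are assumptions for illustration):

import threading

class ConductedDomain:
    def __init__(self, seed: str) -> None:
        self.seed = seed
        self._pending = 1             # the lead itself
        self._lock = threading.Lock()

    def register_child(self) -> None:
        with self._lock:
            self._pending += 1        # a child token was created during the lead's run

    def complete(self) -> bool:
        with self._lock:
            self._pending -= 1        # every completion decrements
            return self._pending == 0 # True means the seed can be released

When complete() returns True, the router would drop the core pin for that seed and the domain disappears.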

The Benchmark

15 doubling waves. 131,068 tokens total.
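(Four tokens in wave 1, doubling each wave: 4 + 8 + ... + 65,536 = 4 × (2^15 − 1) = 131,068.)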

Wave   Tokens    Tok/s     Lat(ms)   Overlap
1      4         1386.2    0.721     1.44×
2      8         2391.2    0.418     2.48×
3      16        2744.8    0.364     4.82×
4      32        2812.7    0.356     11.32×
5      64        2880.0    0.347     22.01×
6      128       2907.6    0.344     29.78×
7      256       2846.8    0.351     37.98×
8      512       2811.5    0.356     41.81×
9      1024      2813.9    0.355     44.18×
10     2048      2644.3    0.378     44.86× ← peak overlap
11     4096      2816.3    0.355     38.34×
12     8192      2819.9    0.355     32.64×
13     16384     2765.0    0.362     27.92× ← better sustained performance
14     32768     2707.7    0.369     24.96× ←
15     65536     2789.5    0.358     24.21× ←

Zero failures. Avg latency 0.386ms/token.

The previous ceiling was around 17× overlap after saturation. This
run hit 44.86× at wave 10 and descended gracefully from there.
Latency moved 0.04ms across the entire run from wave 3 to wave 15.

Why Stability Produced More Concurrency

The previous 17× ceiling wasn't a capacity ceiling. It was a friction ceiling.

At that overlap level the old routing was generating cross-domain
traffic. Related tokens landing on different cores meant cache lines being
written back and refilled across the interconnect. The scheduler was spending a growing proportion of its time on coordination rather than execution.

Domain anchoring removed that wasted work. Tokens that belong together stay together. The cache lines loaded for a lead token's data are still warm when its children execute on the same core. The cross-core traffic that was expanding with overlap now barely exists for conducted chains. The scheduler has more execution headroom relative to coordination cost. The overlap ceiling rises.

This is why the overlap column reaches over 20× and holds flat latency while doing it. The system isn't working harder. It's working cleaner, scaling better.

Calling Production Crews

If you run concurrent Python workloads (task queues, async pipelines, anything with related operations that currently route freely), I'd like to know what you see when you look at this. Any poking at my work is helpful.

As a self-taught developer I'm open to criticism and would love to learn from trained or more experienced folks.

Registering calls for anchoring

The sticky registry and hash conductor are opt-in. Existing code routes normally.

Hashed domain anchoring and sticky tokens are intended to be the first "production ready" features of the project.

The cases I'm most interested in:

Anything that hits an unexpected concurrency ceiling without obvious cause.

The repo is public. Issues and observations welcome. GitHub

Leave some feedback! Tavari

TokenGate represents my journey through nearly 4,000 hours of hobbyist coding. Times change; I'm now opening up as a business. Stay tuned for the future of Tavari.
