The common assumption in concurrent systems is that stability and performance pull in opposite directions. You add safety mechanisms, locks, routing constraints, and you pay for them in throughput. This post is about a case where that assumption turned out to be wrong.
The Premise
TokenGate is a token-managed concurrency system. Decorated functions return tokens instead of executing immediately. Those tokens are admitted through a wrapped decorator, routed to per-core mailboxes by weight class, and executed on thread pool workers.
This system also enforces a clean separation between async coordination and threaded execution, a common source of complexity in concurrent systems.
TokenGate eases this by using tokens as the bridge between the async event loop and a thread pool. The event loop manages the routing and coordination of tokens, while the thread pool handles execution. The routing model assigns tokens to cores by weight and storage speed.
Operations are handled separately, staying distinct at every stage of the pipeline.
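To make that shape concrete, here is a minimal sketch of the flow under stated assumptions: Token, route, and worker are hypothetical names for illustration, not TokenGate's actual API.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Token:
    fn: Callable[..., Any]   # the decorated function, deferred instead of called
    args: tuple
    weight: str = "light"

async def route(token: Token, mailboxes: list[asyncio.Queue]) -> None:
    # Crude placement for illustration: heavy work starts at the front,
    # light work fills from the back.
    core = 0 if token.weight == "heavy" else len(mailboxes) - 1
    await mailboxes[core].put(token)

async def worker(mailbox: asyncio.Queue, pool: ThreadPoolExecutor) -> None:
    loop = asyncio.get_running_loop()
    while True:
        token = await mailbox.get()                               # async coordination
        await loop.run_in_executor(pool, token.fn, *token.args)   # threaded execution
```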
Weight classification determines which cores a task may run on. This enables a "front-back-fill" scheduling pattern: light work occupies the later cores while heavy work starts on the first core and spills toward the others as load increases.
Each weight class has a defined core range:

- HEAVY → all cores
- MEDIUM → core 2 and up
- LIGHT → core 3 and up
Within those ranges, a staggered position counter distributes tokens across workers in FIFO order, with the ability to interleave retry tokens.
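A minimal sketch of how that range-plus-counter placement could look, assuming a 4-core layout; CORE_RANGES and pick_core are illustrative names, and retry interleaving is omitted:

```python
# Hypothetical core ranges per weight class (0-indexed, 4-core layout assumed).
CORE_RANGES = {
    "heavy":  range(0, 4),   # all cores
    "medium": range(1, 4),   # core 2 and up
    "light":  range(2, 4),   # core 3 and up
}

_position = {w: 0 for w in CORE_RANGES}  # staggered FIFO counter per class

def pick_core(weight: str) -> int:
    cores = CORE_RANGES[weight]
    idx = _position[weight] % len(cores)  # round-robin within the range
    _position[weight] += 1
    return cores[idx]
```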
What Was Built
Sticky Token Registry
Tokens are marked when they are seen to share related args: tokens with matching (operation_type, args) keys are pinned to the core that first receives them, keeping data locality clean.
When a token arrives, sticky_registry.mark() creates a sticky anchor that subsequent related tokens are automatically drawn to.
```python
@task_token_guard(
    operation_type="my_op",
    tags={"weight": "medium", "sticky_anchor": "my_domain"},
)
def my_operation(n: int) -> int:
    ...
```
This is reactive: it catches collisions as they arrive. It handles the case where two tokens with identical logical identity are submitted concurrently and would otherwise scatter across different core domains.
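For intuition, a sticky registry can be as small as a first-writer-wins map. StickyRegistry below is an illustrative sketch, not TokenGate's implementation:

```python
# The first token with a given (operation_type, args) key pins its core;
# later tokens with the same key are routed to that same core.
class StickyRegistry:
    def __init__(self):
        self._pins: dict[tuple, int] = {}

    def mark(self, operation_type: str, args: tuple, core: int) -> int:
        key = (operation_type, args)       # args must be hashable
        return self._pins.setdefault(key, core)  # first writer wins

registry = StickyRegistry()
registry.mark("my_op", (42,), core=1)   # first arrival pins core 1
registry.mark("my_op", (42,), core=3)   # returns 1: collision routed home
```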
Hash Conductor
The second layer is proactive. Instead of waiting to see whether a collision will happen, the conductor anchors an entire call chain to a domain before any child token is even routed. It's a heavier pattern, but it's a reliable way to ensure all related data is processed on the same core.
When a lead token is decorated with external_calls, a SHA-256 seed is generated from the token ID and the call list:

seed = SHA-256( token_id + ":" + freeze(external_calls) )

The seed is the full 64-character hex digest. The token ID is included so that two leads with identical call lists still get independent domains.
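In code, the derivation might look like the sketch below. conductor_seed is an illustrative name, and the join is a stand-in for freeze(); only the SHA-256-over-token-ID-and-call-list structure comes from the design above.

```python
import hashlib

def conductor_seed(token_id: str, external_calls: list[str]) -> str:
    frozen = ",".join(external_calls)  # stand-in for freeze(): a stable serialization
    digest = hashlib.sha256(f"{token_id}:{frozen}".encode()).hexdigest()
    return digest  # full 64-character hex digest

# Two leads with identical call lists still get distinct domains:
assert conductor_seed("tok-a", ["child_op"]) != conductor_seed("tok-b", ["child_op"])
```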
That seed is pinned to whichever core the lead lands on. Any token spawned during the lead's execution inherits the seed and is routed to the same core automatically. No configuration needed inside call-sites. No explicit passing. The seed propagates through a thread-local slot, set in the executor thread before the lead function runs and read during token routing.
The most important thing to note is that workers within the core domain may still execute tokens in parallel, respecting the staggered worker routing. It's not a "nerf" to the system's capability: under saturated load conditions, noticeable performance gains were observed.
```python
@task_token_guard(
    operation_type="lead_op",
    tags={"weight": "medium", "external_calls": ["child_op"]},
)
def lead_operation(n: int) -> list:
    # These children inherit the seed and land on the same core
    return [child_op(n + i) for i in range(4)]
```
The lead's pending count starts at 1, increments for each subsequent external call at creation time, and decrements on every completion. When it reaches zero, the seed is released.
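A sketch of that lifecycle; SeedLease and release are illustrative names, and a real implementation would need the locking shown:

```python
import threading

class SeedLease:
    """Illustrative refcount for a conducted domain; not TokenGate's actual API."""
    def __init__(self, seed: str, release):
        self.seed = seed
        self.release = release          # callback that unpins the domain
        self.pending = 1                # the lead itself holds the first reference
        self._lock = threading.Lock()   # guard the count against concurrent updates

    def child_created(self):
        with self._lock:
            self.pending += 1           # incremented at child-token creation time

    def completed(self):
        with self._lock:
            self.pending -= 1
            done = self.pending == 0
        if done:
            self.release(self.seed)     # seed released once all work completes
```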
The Benchmark
15 doubling waves. 131,068 tokens total.
| Wave | Tokens | Tok/s | Lat (ms) | Overlap | Notes |
|------|--------|-------|----------|---------|-------|
| 1 | 4 | 1386.2 | 0.721 | 1.44× | |
| 2 | 8 | 2391.2 | 0.418 | 2.48× | |
| 3 | 16 | 2744.8 | 0.364 | 4.82× | |
| 4 | 32 | 2812.7 | 0.356 | 11.32× | |
| 5 | 64 | 2880.0 | 0.347 | 22.01× | |
| 6 | 128 | 2907.6 | 0.344 | 29.78× | |
| 7 | 256 | 2846.8 | 0.351 | 37.98× | |
| 8 | 512 | 2811.5 | 0.356 | 41.81× | |
| 9 | 1024 | 2813.9 | 0.355 | 44.18× | |
| 10 | 2048 | 2644.3 | 0.378 | 44.86× | peak overlap |
| 11 | 4096 | 2816.3 | 0.355 | 38.34× | |
| 12 | 8192 | 2819.9 | 0.355 | 32.64× | |
| 13 | 16384 | 2765.0 | 0.362 | 27.92× | sustained performance |
| 14 | 32768 | 2707.7 | 0.369 | 24.96× | sustained performance |
| 15 | 65536 | 2789.5 | 0.358 | 24.21× | sustained performance |
Zero failures. Avg latency 0.386ms/token.
The previous ceiling was around 17× overlap after saturation. This run hit 44.86× at wave 10 and descended gracefully from there. Latency varied by only about 0.04 ms across the entire run, from wave 3 to wave 15.
Why Stability Produced More Concurrency
The previous 17× ceiling wasn't a capacity ceiling. It was a friction ceiling.
At that overlap level the old routing was generating cross-domain
traffic. Related tokens landing on different cores meant cache lines being
written back and refilled across the interconnect. The scheduler was spending a growing proportion of its time on coordination rather than execution.
Domain anchoring removed that wasted work. Tokens that belong together stay together. The cache lines loaded for a lead token's data are still warm when its children execute on the same core. The cross-core traffic that was growing with overlap now barely exists for conducted chains. The scheduler spends less on coordination and has more headroom for execution. The overlap ceiling rises.
This is why the overlap column climbs past 40× and holds flat latency while doing it. The system isn't working harder. It's working cleaner, and scaling better.
Calling Production Crews
If you run concurrent Python workloads, task queues, async pipelines, anything with related operations that currently route freely, I'd like to know what you see here. Any poking at my work is helpful.
As a self-taught developer, I'm open to criticism and would love to learn from trained or seasoned folks.
Registering calls for anchoring
The sticky registry and hash conductor are opt-in. Existing code routes normally.
Hashed domain anchoring and sticky tokens are aiming to be the project's first "production ready" features.
The cases I'm most interested in:

- Anything that hits an unexpected concurrency ceiling without an obvious cause.
The repo is public. Issues and observations welcome. GitHub
Leave some feedback! Tavari
TokenGate represents nearly 4,000 hours of my journey through hobbyist coding. Times change, and I'm now opening up as a business. Stay tuned for the future of Tavari.