Joseph Boone

Maximum Concurrency (& Sub-Quadratic Scaling) By TokenGate

What is TokenGate?

TokenGate is a beta Python concurrency system built around a token-managed execution model. Instead of managing threads directly, you decorate synchronous functions and the system handles routing, admission, and worker assignment automatically.

The core idea is simple: every function call becomes a token. That token moves through a lifecycle — created, waiting, admitted, executing, completed — while the coordinator manages a pool of core-pinned workers underneath. You interact with the public API, TokenGate handles everything else.
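
As a mental model, that lifecycle reads as a small state machine. Here is an illustrative sketch only; the names track the prose above, not TokenGate's actual internals:

from enum import Enum, auto

class TokenState(Enum):   # illustrative names, not TokenGate's real API
    CREATED = auto()      # token minted when the function is called
    WAITING = auto()      # queued, waiting for admission
    ADMITTED = auto()     # accepted by the coordinator
    EXECUTING = auto()    # running on a core-pinned worker
    COMPLETED = auto()    # result available to whoever awaits the token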

As of v0.2.2.0, tokens are natively awaitable. That's what made this test possible to write cleanly:

result  = await token
results = await asyncio.gather(*tokens)

The decorated functions stay synchronous. The orchestrator stays async. TokenGate sits between them and keeps the pipeline full.
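
End to end, the calling pattern looks like this. A minimal sketch, assuming a call to a decorated function returns a token directly; the operation_type and tags values here are made up:

import asyncio

@task_token_guard(operation_type='demo', tags={'weight': 'light'})
def add(a: int, b: int) -> int:
    # A plain synchronous function; the decorator handles routing and admission.
    return a + b

async def main():
    token = add(2, 3)      # the call returns a token, not the result
    result = await token   # awaiting the token yields 5

asyncio.run(main())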

What I did

I dispatched 65,536 task tokens simultaneously and completed all of them in under 30 seconds, with zero failures. Here's what the numbers actually showed.

What I found was an unexpected Goldilocks zone: a region of sustainable throughput that not only exceeded worker capacity but held stable all the way to 65,536 simultaneous submissions:

  RESULTS SUMMARY

  Wave   Tokens   OK    Fail      Time    Tok/s   Lat(ms)    Conc   Overlap   ΣTask(ms)
  -------------------------------------------------------------------------------------
  1      4        4     0       0.002s   1718.7    0.582ms   1.00×     2.39×       5.56ms
  2      8        8     0       0.002s   3781.8    0.264ms   2.20×     3.79×       8.01ms
  3      16       16    0       0.004s   4067.5    0.246ms   2.37×     5.42×      21.31ms
  4      32       32    0       0.007s   4392.7    0.228ms   2.56×    11.90×      86.71ms
  5      64       64    0       0.015s   4268.0    0.234ms   2.48×    23.01×     345.10ms
  6      128      128   0       0.029s   4475.1    0.223ms   2.60×    38.72×    1107.58ms
  7      256      256   0       0.059s   4331.8    0.231ms   2.52×    52.89×    3125.85ms
  8      512      512   0       0.113s   4532.8    0.221ms   2.64×    60.30×    6810.70ms
  9      1024     1024  0       0.246s   4165.0    0.240ms   2.42×    61.87×   15211.02ms
  10     2048     2048  0       0.505s   4052.4    0.247ms   2.36×    33.82×   17091.60ms
  11     4096     4096  0       0.980s   4181.1    0.239ms   2.43×    33.15×   32476.06ms
  12     8192     8192  0       2.163s   3786.8    0.264ms   2.20×    30.14×   65197.90ms
  13     16384    16384 0       4.327s   3786.7    0.264ms   2.20×    23.91×  103450.88ms
  14     32768    32768 0       8.754s   3743.3    0.267ms   2.18×    18.72×  163876.79ms
  15     65536    65536 0      17.314s   3785.2    0.264ms   2.20×    17.16×  297095.85ms
  -------------------------------------------------------------------------------------
  TOTAL  131068   131068 0      86.845s

  Avg latency across waves  : 0.268 ms/token
  Peak concurrency ratio    : 2.64×
  Peak overlap ratio        : 61.87×

What "Sub-Quadratic Scaling" Really Means

This isn't a special property of AI or machine learning; it's basic math that shows up anywhere work can be parallelized. A calculator running two operations simultaneously instead of sequentially is doing the same thing at a smaller scale, and how much it saves depends on how the operations' return times overlap. The term just describes a curve: double the input, less than double the cost.

The overlap ratio — the rightmost metric — measures Σ(individual task execution times) divided by wave elapsed time. If every task ran back-to-back on a single thread it would read 1.0×. Values above 1× mean real parallel execution is happening. Values well above worker count mean something more interesting is going on.
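
In code, the metric is nothing more than a ratio. Plugging wave 9's numbers from the table back in reproduces the reported value:

def overlap_ratio(total_task_ms: float, elapsed_s: float) -> float:
    # Sum of individual task execution times divided by wave wall-clock time.
    return total_task_ms / (elapsed_s * 1000.0)

print(overlap_ratio(15211.02, 0.246))  # ~61.8x, matching wave 9's 61.87x up to rounding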

Workers that finish a short task don't wait for the wave to end. They immediately pull the next token. So a worker that cycles through eight short tasks within one wave window contributes eight task-durations to the overlap sum while only occupying one worker-slot in wall time. The total concurrent activity compounds.

This is sub-quadratic scaling: doubling the token count costs less than double the time, because additional tokens slide into gaps that already exist in the schedule rather than adding full sequential cost. The system does more work per unit time as load increases, up to a point. In this run, that point is wave 9.

At 1024 tokens the overlap ratio peaks at 61.87×. Wave 10 is the transition point: workers saturate fully, rapid cycling slows, and the ratio settles near the hardware worker count. From there it holds. Wave 15, at 65,536 tokens, delivers 17.16× sustained overlap, flat throughput, and zero failures. The system found its floor and stayed there even when heavily overloaded.
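
You can verify that floor straight from the table. A quick check on the last four waves, using only the reported token counts and elapsed times, shows throughput holding flat while the load grows eightfold:

# (tokens, elapsed_s) for waves 12-15, copied from the results table
waves = [(8192, 2.163), (16384, 4.327), (32768, 8.754), (65536, 17.314)]
for tokens, elapsed in waves:
    print(tokens, round(tokens / elapsed, 1))  # ~3787, 3786, 3743, 3785 tok/s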

What do these tasks look like?

These are deliberately varied and non-trivial — a prime sieve, string manipulation, list sorting, and an iterative SHA-256 chain. All four are plain synchronous functions. The decorator is the only thing that makes them token-aware:

import hashlib
import random
import string
from typing import List

# task_token_guard is TokenGate's decorator; the exact import path depends on your install.

@task_token_guard(operation_type='cpu_crunch', tags={'weight': 'light'})
def cpu_crunch(n: int) -> int:
    """Sum primes up to n — lightweight CPU-bound work."""
    total = 0
    for i in range(2, n):
        if all(i % j != 0 for j in range(2, int(i ** 0.5) + 1)):
            total += i
    return total


@task_token_guard(operation_type='string_ops', tags={'weight': 'light'})
def string_transform(seed: int) -> str:
    """Generate and mangle a string — lightweight string work."""
    rng = random.Random(seed)
    chars = [rng.choice(string.ascii_letters) for _ in range(300)]
    text = ''.join(chars)
    return text[::-1].upper().replace('A', '4').replace('E', '3').replace('I', '1')


@task_token_guard(operation_type='data_transform', tags={'weight': 'medium'})
def data_sort(size: int) -> List[int]:
    """Sort a random list — medium CPU work."""
    rng = random.Random(size)
    data = [rng.randint(0, 100_000) for _ in range(size)]
    return sorted(data)


@task_token_guard(operation_type='hash_compute', tags={'weight': 'heavy'})
def hash_chain(seed: str, iterations: int) -> str:
    """Iterative SHA-256 chain — heavier CPU-bound work."""
    h = seed.encode()
    for _ in range(iterations):
        h = hashlib.sha256(h).digest()
    return h.hex()

The entire orchestrator is async and touches no internal controls:

tokens  = submit_batch(target)
results = await asyncio.gather(*tokens, return_exceptions=True)

That's it. Submit, await, report.
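
The demo's submit_batch helper isn't reproduced here, but a plausible sketch is short: call the decorated functions and collect the tokens they return. The task mix and parameter ranges below are invented for illustration; the real helper lives in demo/ and may differ:

import asyncio

def submit_batch(target: int) -> list:
    # Hypothetical harness: round-robin over the four task types above.
    rng = random.Random(0)   # random is imported alongside the task definitions
    tokens = []
    for i in range(target):
        kind = i % 4
        if kind == 0:
            tokens.append(cpu_crunch(rng.randint(200, 800)))
        elif kind == 1:
            tokens.append(string_transform(i))
        elif kind == 2:
            tokens.append(data_sort(rng.randint(100, 1_000)))
        else:
            tokens.append(hash_chain(str(i), rng.randint(50, 200)))
    return tokens

async def run_wave(target: int):
    tokens = submit_batch(target)
    return await asyncio.gather(*tokens, return_exceptions=True)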

Closing

Making this system fail under normal conditions would require hundreds of thousands of simultaneous submissions, a load no single server faces in any realistic scenario. The architecture stabilizes under load rather than degrading. That's not an accident; it's a consequence of how token admission and worker pinning interact at scale.

Your sweet spot will land in a different place than mine, depending on your hardware. The test is in demo/; run it and see where your system peaks.

https://tavari.online/
