
Mitul Shah for Ferro Labs AI


AI Gateways Are Not I/O-Bound Proxies: I Benchmarked 5 of Them to Prove It

The wrong mental model

Most engineers think of AI gateways as thin reverse proxies. The mental model is something like nginx with proxy_pass: accept a request, forward it to OpenAI, stream the response back. I/O-bound. The runtime barely matters.

This model is wrong.

Here is what actually happens on every request through an AI gateway:

  • Parse the JSON body
  • Validate the API key
  • Check rate limits
  • Resolve the routing rule
  • Select an upstream provider
  • Mutate headers
  • Forward the request
  • Parse the streaming response
  • Log the event
  • Update usage meters

Some gateways add policy evaluation, retry logic, or response transformation on top.

None of that is I/O work. It is CPU work and it serializes under concurrent load.
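To make the point concrete, the steps above can be sketched as a Go handler pipeline. This is an illustrative toy, not any gateway's actual code; the function and field names are invented. Everything here runs on the CPU before a single byte goes upstream.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// request is a stand-in for an inbound gateway request.
type request struct {
	APIKey string
	Body   []byte
	Model  string
}

// parseBody is the per-request JSON decode — pure CPU work.
func parseBody(raw []byte) (map[string]any, error) {
	var m map[string]any
	return m, json.Unmarshal(raw, &m)
}

// validateKey stands in for the auth check; real gateways hit a
// key store or cache, but the hashing/comparison is still CPU work.
func validateKey(key string) bool {
	return strings.HasPrefix(key, "sk-")
}

// resolveRoute stands in for routing-rule evaluation.
func resolveRoute(model string) string {
	if strings.HasPrefix(model, "gpt-") {
		return "openai"
	}
	return "default"
}

// handle chains the CPU-bound stages that precede any upstream I/O.
// Rate limiting, header mutation, and metering would slot in here too,
// each serialized into the same per-request path.
func handle(r request) (string, error) {
	body, err := parseBody(r.Body)
	if err != nil {
		return "", err
	}
	if !validateKey(r.APIKey) {
		return "", fmt.Errorf("invalid key")
	}
	_ = body // a real gateway would transform and re-serialize this
	return resolveRoute(r.Model), nil
}

func main() {
	up, err := handle(request{
		APIKey: "sk-test",
		Body:   []byte(`{"model":"gpt-4o","messages":[]}`),
		Model:  "gpt-4o",
	})
	fmt.Println(up, err)
}
```

Under concurrent load, every request pays this pipeline in full, which is why the runtime executing it matters so much.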

I built Ferro Labs AI Gateway, so I have a stake in this argument. That is also why I ran the benchmark: to understand exactly where different architectures break under pressure, including my own. I profiled five open-source AI gateways with flamegraphs and traced each failure mode to its root cause. The results surprised me.

Methodology

Five gateways. One machine. Same mock upstream. No tuning.

| Gateway | Version | Language | Architecture |
| --- | --- | --- | --- |
| Ferro Labs | v1.0.0 | Go | Native binary |
| Kong OSS | 3.9.1 | Lua | nginx + OpenResty |
| Bifrost | v1.0.0 | Go | Native binary |
| LiteLLM | 1.82.6 | Python | FastAPI + uvicorn |
| Portkey | latest | TypeScript / Node.js | Docker, host network |

Hardware: GCP n2-standard-8 — 8 vCPU, 32 GB RAM, Debian 12.

Upstream: A Go mock server on localhost:9000 returning fixed OpenAI-compatible responses with a hardcoded 60ms latency. This is the critical methodological choice — 60ms is the theoretical minimum response time. Anything above that is pure gateway overhead. Any failures are the gateway's fault, not the upstream's.

Load tool: k6, constant-VU scenarios from 50 to 1,000 concurrent users. Each gateway tested in complete isolation: start, load, kill, wait 5 seconds for OS cleanup, repeat.

Every gateway ran its out-of-the-box configuration. No connection pool tuning, no worker count adjustments, no warmup tricks. If your defaults break at 300 VU, that is a data point.

One exception: Portkey required Docker rather than a native Node.js process. The native process throws TypeError: immutable under concurrent persistent HTTP connections — a bug in the bundled undici version where response Headers objects are mutated after being marked immutable on a reused connection. Single requests work. curl works. Portkey's own published benchmark runs behind a Kubernetes Network Load Balancer, which masks the issue. Docker with --network host resolves it by handling connection lifecycle differently. The results below reflect the Docker run. Streaming scenarios are excluded — Portkey's SSE implementation behaved inconsistently with the bench runner regardless of process model.

Full configs: ferro-labs/ai-gateway-performance-benchmarks at commit f6889a4.

The numbers

Throughput across concurrency levels

| Gateway | 150 VU | 300 VU | 500 VU | 1,000 VU | Memory |
| --- | --- | --- | --- | --- | --- |
| Ferro Labs | 2,447 | 4,890 | 8,014 | 13,925 | 32–135 MB |
| Kong AI Gateway | 2,443 | 4,885 | 8,133 | 15,891 | 43 MB flat |
| Bifrost | 2,441 | 0 † | 0 † | 0 † | 107–333 MB |
| LiteLLM | ~175 ‡ | ~175 ‡ | ~175 ‡ | ~175 ‡ | 335–1,124 MB |
| Portkey | 851 § | 843 § | 855 § | 891 § | n/a |

† Bifrost: 10M+ request failures at ≥300 VU — connection pool starvation.
‡ LiteLLM: ~175 RPS CPU-bound ceiling regardless of concurrency.
§ Portkey: latency degrades 3–6× above 150 VU; errors accumulate at 500+ VU.

Four patterns emerge. Ferro Labs and Kong scale linearly — double the VUs, roughly double the RPS. Bifrost hits a cliff. LiteLLM hits a ceiling. Portkey plateaus — throughput flatlines at ~850 RPS from 150 VU onward while latency compounds and errors accumulate. These are fundamentally different failure modes, and the flamegraphs explain why.

Ferro Labs latency under load

| VU | RPS | p50 | p99 | Memory |
| --- | --- | --- | --- | --- |
| 50 | 813 | 61.3ms | 64.1ms | 36 MB |
| 150 | 2,447 | 61.2ms | 63.4ms | 47 MB |
| 300 | 4,890 | 61.2ms | 64.4ms | 72 MB |
| 500 | 8,014 | 61.5ms | 72.9ms | 89 MB |
| 1,000 | 13,925 | 68.1ms | 111.9ms | 135 MB |

Remember the 60ms floor. At 500 VU, Ferro adds 1.5ms of overhead at p50. At 1,000 VU — nearly 14,000 requests per second — p50 overhead is 8.1ms and p99 overhead is 51.9ms. Memory grows linearly from 36 MB to 135 MB. No cliffs, no ceilings.
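Because the upstream latency is fixed, these numbers fall out of simple arithmetic on the table. A quick sanity check (the per-VU memory figure is a derived estimate from the two endpoints, not a measured value):

```go
package main

import "fmt"

func main() {
	const floorMS = 60.0 // the mock upstream's hardcoded latency

	// Gateway overhead = observed p50 minus the 60ms floor.
	fmt.Printf("overhead@500VU:  %.1fms\n", 61.5-floorMS)
	fmt.Printf("overhead@1000VU: %.1fms\n", 68.1-floorMS)

	// Linear memory growth: ~(135-36) MB spread over (1000-50) added VUs.
	fmt.Printf("memory/VU: ~%.0f KB\n", (135.0-36.0)*1024/(1000-50))
}
```

Roughly 100 KB of state per additional virtual user, with no inflection point, is what linear scaling looks like from the outside.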

Ferro Labs resource usage — linear CPU and memory growth from 50 to 1,000 VU

What the flamegraphs reveal

Numbers tell you what happens. Flamegraphs tell you why.

Ferro Labs — Go pprof, 50 VU, 30s

Ferro Labs CPU flamegraph — 32% in Syscall6 (network I/O), 8% in runtime.futex

The profile is dominated by Syscall6 at 32% of CPU time — the Go runtime making network read/write system calls. Another 8% goes to runtime.futex for goroutine scheduling. The actual gateway logic — routing, header mutation, request forwarding — is so thin it barely registers as a distinct flame.

This is what a well-behaved proxy should look like: nearly all time spent waiting on I/O, gateway work measured in microseconds. Adding concurrency adds throughput because there is no CPU bottleneck to serialize against.

LiteLLM — py-spy, 20 VU, 30s

LiteLLM CPU flamegraph — hot path through user_api_key_auth → FastAPI middleware → uvicorn

Different story entirely. The hot path runs through user_api_key_auth → the FastAPI middleware chain → uvicorn's event loop. Every single request burns CPU on Python middleware before a byte goes upstream.

This is the ~175 RPS ceiling made visible. At 20 VU, the Python process is already CPU-saturated. Adding virtual users adds middleware queue depth, not throughput.
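The ceiling is consistent with simple saturation arithmetic: a CPU-bound single thread tops out near 1/service_time requests per second, regardless of how many clients are waiting. A back-of-the-envelope check — note the ~5.7ms figure is derived from the measured ceiling, not measured directly:

```go
package main

import "fmt"

// ceilingRPS is the maximum throughput of a single thread that must
// spend cpuMS of compute on every request: 1000ms / cpuMS.
func ceilingRPS(cpuMS float64) float64 {
	return 1000.0 / cpuMS
}

func main() {
	// An observed ceiling of ~175 RPS implies ~5.7ms of middleware
	// CPU per request.
	fmt.Printf("%.1f ms/request\n", 1000.0/175.0)

	// And the inverse: ~5.7ms of per-request CPU caps one thread near
	// 175 RPS, no matter how many virtual users queue behind it.
	fmt.Printf("%.0f RPS\n", ceilingRPS(5.7))
}
```

This is also why adding VUs moves latency, not throughput: the extra requests only deepen the queue in front of the same saturated thread.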

To be fair: LiteLLM is not trying to be a high-throughput reverse proxy. It is a feature-rich LLM management layer — key management, spend tracking, model fallbacks, dozens of provider adapters. But if you are evaluating it as an inline proxy for production traffic, the throughput ceiling is a hard constraint.

Bifrost — linux perf, 50 VU, 30s

Bifrost CPU flamegraph — 7% in _raw_spin_unlock_irqrestore from futex contention

Bifrost is Go, like Ferro Labs, but the profile reveals a different pattern. 7% of CPU time sits in _raw_spin_unlock_irqrestore — a kernel function triggered by futex wakeups when goroutines contend on a shared lock.

A note on methodology: Bifrost's pprof endpoint returns an HTML stub in benchmark builds, so I used linux perf instead. The profile is less granular — kernel symbols rather than Go function names — but the contention pattern is clear.

At 50 VU, this contention is manageable. At 300 VU, it becomes catastrophic.

The failure modes

Bifrost: the cliff at 300 VU

At 150 VU, Bifrost is competitive — 2,441 RPS, within 0.2% of Ferro Labs and Kong. At 300 VU, it drops to zero: 10 million+ failed requests.

This is connection pool starvation. The upstream connection pool has a fixed capacity that holds under moderate load. When concurrency crosses a threshold, goroutines block waiting for a connection slot. Blocked goroutines hold resources — stack memory, request contexts, partial buffers — that other goroutines need. The system cascades into deadlock.

CPU and memory comparison across all five gateways — Bifrost's collapse, LiteLLM's CPU ceiling, and Portkey's event loop plateau

Bifrost resource usage — memory spike from 107 MB to 333 MB as blocked goroutines accumulate

The memory chart confirms it. Bifrost's memory climbs from 107 MB to 333 MB as goroutines pile up — each holding a stack, a request context, and partial state. The flamegraph's futex contention at 50 VU was the early warning sign of the cliff at 300.

This is not a tuning problem. It is an architectural constraint in how the connection pool interacts with Go's goroutine scheduler under burst load.

LiteLLM: the ceiling at 175 RPS

LiteLLM's failure mode is quieter and, in some ways, more dangerous. There is no cliff — requests do not fail. They just never get faster.

LiteLLM resource usage — memory climbing to 1.1 GB with CPU at ceiling

At ~175 RPS, the Python process is compute-saturated. The middleware chain — auth, routing, logging, metering — consumes all available CPU on every request. Memory climbs to 1.1 GB as request contexts queue up waiting for their turn through the middleware stack.

The flamegraph predicted this exactly: the hot path is pure application-level compute. Uvicorn's event loop can accept connections at any rate, but processing each one requires running the full middleware stack. This is a CPU-bound workload wearing an async costume.

Portkey: the event loop ceiling

Portkey tells a subtler story than Bifrost or LiteLLM. At 50 VU it is competitive — 783 RPS, 62.6ms p50, essentially zero overhead above the 60ms mock floor. Then throughput flatlines. At 150 VU: 851 RPS. At 1,000 VU: 891 RPS. The gateway absorbs a 20× increase in concurrency and converts it into a 4% increase in throughput.

The mechanism is event loop congestion. Portkey is a TypeScript/Node.js gateway running on a single-threaded event loop. Like Python's uvicorn, Node.js can accept connections asynchronously — but the per-request middleware work (auth, routing, header mutation, JSON serialization) runs synchronously on that one thread. Under load, requests queue behind each other. At 150 VU, p50 jumps from 62.6ms to 174ms. At 300 VU, it reaches 343ms. The p50 dip at 500 VU (293ms) is a queue-draining artifact, not a performance improvement.

Portkey resource usage — memory under 300 VU load

Portkey degrades more gracefully than Bifrost — no hard cliff, no cascading failure — but less gracefully than LiteLLM. Where LiteLLM slows down without dropping requests, Portkey accumulates errors: 2.96% at 500 VU, 7.57% at 1,000 VU. Node.js handles more I/O than Python's GIL allows (~850 vs. ~175 RPS), but both hit the same class of problem: a single-threaded runtime serializing CPU-bound middleware work.

A note on methodology: Portkey required Docker because the native Node.js process has a concurrency bug. Under persistent keep-alive connections — which Go's HTTP client uses by default — Portkey's undici layer throws TypeError: immutable when concurrent requests try to mutate response headers on a reused connection. This does not appear with curl (fresh connection per invocation) or behind a load balancer (which Portkey's own benchmark uses). Setting DisableKeepalive: true in the bench runner confirmed the root cause — it made the error disappear, but also made the benchmark non-comparable to the other gateways. Docker with --network host was the methodological middle ground: stable results, comparable network path. The throughput ceiling and latency degradation shown above are real characteristics of the gateway under load — Docker does not change the event loop constraints, only the connection handling layer that was triggering the bug. I do not have a flamegraph for Portkey — Docker containers do not lend themselves to the same profiling approach. This is a gap, same as Kong.

The Kong question

I would be cherry-picking if I did not address this: Kong outperforms Ferro Labs at 1,000 VU — 15,891 RPS vs. 13,925.

This is expected, and it is architecturally interesting.

Kong is nginx + OpenResty. Under the hood, it runs on nginx's event loop — 20 years of optimization for exactly this class of workload. Nginx does not spawn a goroutine per connection. It multiplexes connections across a small, fixed pool of worker processes with non-blocking I/O at the kernel level. At extreme concurrency, this model avoids the per-goroutine overhead that Go accumulates.

Kong's memory tells the same story: 43 MB flat across all concurrency levels. No linear growth. No accumulation. The event loop model simply does not allocate per-connection state the way goroutines do.

I do not have a flamegraph for Kong. Profiling nginx with Lua plugins requires instrumenting worker processes at the system level — there is no pprof-style endpoint to hit. This is a gap in the analysis.

At 150–500 VU, Ferro Labs and Kong are within 2% of each other. The divergence appears at the 1,000 VU extreme. Both architectures scale. The choice between them is about what else you need from your gateway stack.

The thesis, earned

AI gateways are not I/O-bound proxies. They are CPU and connection orchestration systems where runtime, pool design, and middleware architecture determine whether you scale linearly or hit a wall.

The evidence:

  • Go vs. Python is a 14× throughput gap at 150 VU (2,447 vs. 175 RPS), with Node.js (Portkey, ~850 RPS) sitting between them. This is not a language war — it is a physics constraint. Interpreted runtimes with single-threaded event loops hit CPU ceilings that Go's goroutine-per-connection model sidesteps entirely. Node.js handles more I/O than Python, but both serialize middleware work on one thread.
  • Within Go, architecture matters. Ferro Labs and Bifrost are both single-binary Go programs. One scales to 13,925 RPS. The other collapses at 300 VU. Connection pool design is the difference between linear scaling and cascading failure.
  • nginx's event loop is real. Kong's two decades of optimization show at the extremes. If you consistently push 10,000+ RPS through a single node, the event loop model has a genuine edge.

None of this is visible if you benchmark at 50 VU for 30 seconds and call it a day.

The benchmark repo is at ferro-labs/ai-gateway-performance-benchmarks. Clone it, bring your own hardware, run make setup && make bench. If you find different results, open an issue — that is how benchmarks earn trust.
