I'd routed the same one-word prompt to Claude Haiku and to Gemini 2.5 Flash. Flash has the lower per-token price, so this should have been an easy win. It wasn't. Flash is a thinking model: before it answered "Paris," it spent a few dozen tokens reasoning, and reasoning is billed as output. Haiku answered in 4 tokens. Flash spent about 28 — lower unit price, far more units, ~8.6× the bill per request. I only caught it because I'd instrumented every call: tokens, cost, latency, written to Postgres. And I instrument every call because that's what you do when you've spent years keeping payment systems honest.
That instinct is the whole point. For two and a half years I built cross-border real-time payments at NPCI — the kind of system where a rounding error is an incident and a downstream going dark is a Saturday night. This year I built an LLM gateway, and I kept reaching for the same tools. That's not a coincidence.
AI infrastructure looks like a new field. Underneath, it's the systems work backend engineers have always done. A model API is a downstream dependency: it's slow, it's occasionally down, it rate-limits you, and it bills you per call. You have integrated a dependency exactly like that before — a payment processor, a KYC provider, a partner bank. The model isn't magic. It's a new kind of expensive, flaky downstream, and the hard parts — reliability, cost control, failover — are the parts we already solved.
The gateway needed a circuit breaker per provider, and I'd built exactly that at NPCI — on a payments platform where every partner integration was a downstream in a different country, each with its own SLA and its own definition of "up." My fault-tolerant Go processors wrapped those partners in circuit breakers and rate limiters for one reason: when a partner started failing, the worst move was to keep hammering it while your own worker pools filled with stuck calls. So you trip, fail fast, and let a trickle of traffic probe for recovery. CLOSED, OPEN, HALF_OPEN. Wiring those same three states around Anthropic and Gemini, the only thing that changed was the noun — "partner bank" became "model provider."
The gateway needed to meter every call into Postgres. That's an audit log, and payments runs on audit logs. Across 20+ Go services I'd keyed everything on idempotency so a retried request never double-counted; the cost log keys on request_id for the identical reason. And money was always fixed-precision, never a float — which is why cost_usd is a NUMERIC, not a float. You don't approximate money, and you don't approximate spend.
Retries versus the breaker — transient versus sustained — was muscle memory. In payments a careless retry is a double-debit, so one timeout means an idempotent retry, never a blind resubmit; sustained failure means you stop sending. I wrote that distinction into financial workflows for years. Rewriting it for an LLM call, the cost of getting it wrong fell from someone's money to a wasted token — but the shape of the problem didn't move an inch.
I won't pretend it was all familiar. A few things were genuinely new, and they're all about the model itself. Token economics is a real discipline — the 8.6× surprise up top doesn't exist in any database I've ever billed for. A model is non-deterministic in a way a datastore isn't: the same input can return a different output, so you can't assert on exact strings, and "testing" becomes evals and distributions instead of equality. And a model quietly spending your output budget on its own reasoning has no equivalent in a payment switch. That's the new surface area — but it's a few weeks of new sitting on years of foundation, not a new career.
So if you're a backend engineer wondering whether your distributed-systems experience transfers to AI: it doesn't just transfer — it's the scarce part. Anyone can call an LLM API in an afternoon. Far fewer people can make that call reliable when the provider degrades, cheap when the token math turns against you, and observable when someone asks you to explain the bill. That's not AI expertise. That's the work you've already been doing, with a model attached.
The gateway is live at https://llm-gateway-python.onrender.com, and the code is on GitHub: https://github.com/Yogesh23012001/llm-gateway-python.
Top comments (1)
One of the biggest misconceptions in AI is that model expertise matters more than systems expertise. In production, the hard problems are often the same ones we’ve dealt with for years: reliability, observability, cost control, retries, failover, and auditing.The model is new. Distributed systems are not.