What Building a Multi-Model AI Gateway Taught Me About Reliability


I’m building OpenRain, an OpenAI-compatible AI API gateway. I originally thought the hard part would be integrating more providers. I was wrong. The hard part is absorbing inconsistency — and still giving developers something stable enough to trust in production.

Building an OpenAI-Compatible Multi-Model Gateway Taught Me What “Unified API” Really Means

When I first started thinking about a multi-model gateway, I had a fairly simple mental model.

Take a few major model providers. Put a clean API in front of them. Normalize the request shape. Forward the response. Add billing and routing. Done.

That picture survives for about five minutes.

The moment you try to make the system feel reliable to real developers, the problem changes completely. You stop thinking like someone stitching together vendor APIs, and start thinking like someone who is responsible for a production contract. That shift is where most of the complexity comes from.

A “unified AI API” sounds like an interface design problem. In practice, it turns into a systems problem: inconsistent semantics, unstable tail latency, streaming edge cases, partial failures, fuzzy usage data, and policy enforcement that has to happen in the hot path.

This post is a reflection on what surprised me most while building an OpenAI-compatible gateway, and why the word “compatible” hides much more engineering work than it seems.

1. I learned very quickly that “OpenAI-compatible” is an overloaded phrase

At first glance, compatibility looks straightforward. Many providers accept a similar request shape:


```json
{
  "model": "some-model",
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}
```

But request-shape compatibility is the easy part.

What breaks much sooner is behavior.

Different providers interpret system prompts differently. Some support tool calling, but only in a narrow sense. Some stream clean deltas; others stream in a way that is technically valid but operationally awkward. Some report usage consistently. Some are mostly compatible until you hit a multimodal edge case, or a timeout, or a high-concurrency burst.

This was one of the first lessons that changed how I thought about the architecture: if you want one API surface outside, you need a much stricter abstraction inside.

For me, that meant treating the gateway as four separate problems:

- defining one canonical internal request model
- maintaining a capability map for each provider and model
- translating requests into provider-specific formats
- normalizing responses and failures back into one stable contract

Without that separation, the abstraction leaks everywhere.

And once it leaks, developers feel it immediately.
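To make the first three of those pieces concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `CanonicalRequest` fields, the `CAPABILITIES` entries, and the assumed quirk that one provider has no system role are illustrations, not OpenRain's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalRequest:
    """Gateway-internal request shape (hypothetical field names)."""
    model: str
    messages: list[dict]
    stream: bool = False
    tools: list[dict] = field(default_factory=list)


# Capability map: what each provider/model pair supports (made-up entries).
CAPABILITIES = {
    "provider_a/chat-small": {"tools": True, "system_role": True},
    "provider_b/chat-large": {"tools": False, "system_role": False},
}


def translate(req: CanonicalRequest) -> dict:
    """Translate a canonical request into a provider-specific payload."""
    caps = CAPABILITIES[req.model]
    if req.tools and not caps["tools"]:
        # Refuse early instead of silently dropping the tools.
        raise ValueError(f"{req.model} does not support tool calling")

    messages = [dict(m) for m in req.messages]
    if not caps["system_role"]:
        # Assumed quirk: no system role, so fold system text into the
        # first remaining message rather than losing it.
        system = "\n".join(m["content"] for m in messages if m["role"] == "system")
        messages = [m for m in messages if m["role"] != "system"]
        if system and messages:
            messages[0]["content"] = f"{system}\n\n{messages[0]['content']}"

    payload = {
        "model": req.model.split("/", 1)[1],  # strip the provider prefix
        "messages": messages,
        "stream": req.stream,
    }
    if req.tools:
        payload["tools"] = req.tools
    return payload
```

The point of the sketch is the direction of dependency: callers only ever build a `CanonicalRequest`, and provider differences are resolved in one place against the capability map.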

2. I thought routing would be about model selection. It turned out to be about constraints.

A lot of people talk about “access to many models” as if the number itself is the value.

I don’t think that’s how developers actually experience it.

In real usage, they are usually not asking for “many models.” They are asking for a model that fits a constraint: lower latency, better reasoning, stronger coding, longer context, better Chinese, lower cost, more stable uptime.

That changes the design of routing.

If routing is just `if model_name == x`, then you are not really building a gateway. You are building a switch statement with branding around it.

A usable gateway needs a model registry that knows more than names. It needs to know what each model can do, how expensive it is, how it behaves under load, what regions are healthy, and what kinds of traffic it is appropriate for.

A simplified version looks something like this:

```json
{
  "model_id": "provider/model-version",
  "supports_stream": true,
  "supports_tools": true,
  "supports_vision": false,
  "supports_reasoning": true,
  "context_window": 128000,
  "latency_profile": {
    "p50_ms": 900,
    "p95_ms": 2400
  },
  "availability_score": 0.997
}
```

Once you have that, routing becomes something more honest. It becomes a policy engine.

You can send long-document workloads to long-context models. You can bias interactive traffic toward lower p95. You can control which keys can access which classes of models. You can choose fallback targets that are operationally safe, not just syntactically similar.

That was an important mindset shift for me: the real job of routing is not choosing a model. It is choosing a compromise.
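A rough sketch of constraint-based routing over such a registry might look like the following. The `ModelEntry` type mirrors a few fields from the registry entry above; `cost_per_1k_tokens`, the `route` function, and the sample registry values are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    model_id: str
    supports_tools: bool
    context_window: int
    p95_ms: int
    availability_score: float
    cost_per_1k_tokens: float  # hypothetical field, not in the entry above


def route(registry, *, needs_tools=False, min_context=0,
          max_p95_ms=None, min_availability=0.99):
    """Return the cheapest model meeting every hard constraint, or None."""
    candidates = [
        m for m in registry
        if (m.supports_tools or not needs_tools)
        and m.context_window >= min_context
        and (max_p95_ms is None or m.p95_ms <= max_p95_ms)
        and m.availability_score >= min_availability
    ]
    # Among equal-cost models, prefer the one with the lower tail latency.
    return min(candidates, key=lambda m: (m.cost_per_1k_tokens, m.p95_ms), default=None)


REGISTRY = [
    ModelEntry("provider_a/fast", True, 32_000, 1200, 0.998, 0.2),
    ModelEntry("provider_b/long", True, 128_000, 2400, 0.997, 0.6),
    ModelEntry("provider_c/cheap", False, 16_000, 900, 0.985, 0.1),
]

interactive = route(REGISTRY, max_p95_ms=1500)       # latency-bound traffic
long_doc = route(REGISTRY, min_context=100_000)      # long-document traffic
```

Here the hard constraints filter and cost decides among what is left. A real policy would likely weigh more signals, but the shape is the same: the caller states a constraint, not a model name.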

3. The latency problem was less about speed than about unpredictability

I used to think “fast enough” was mostly a p50 question.

It isn’t.

In user-facing AI systems, what people remember is variance.

A model that usually responds in one second but occasionally takes twelve feels unreliable, even if its average latency looks fine in a dashboard. This becomes especially obvious in chat, copilot-like experiences, and any workflow where the user is waiting synchronously.

So over time I became much more interested in p95 and p99 than in average speed.

That has consequences.

First, timeout budgets can’t be one-size-fits-all. A lightweight chat model and a heavier reasoning model should not inherit the same expectations.

Second, “provider is up” is not a very useful health model. A provider can be technically available and still be degraded enough to damage user experience. Health has to be dynamic and contextual: recent timeouts, regional instability, error rate, queueing behavior, tail latency.

Third, failover is not just a reliability feature. It is a semantics feature.

If you retry too aggressively across providers, you can create duplicate outputs, ambiguous billing, inconsistent tone, or broken streaming. By the time I appreciated this fully, I had stopped thinking of failover as “just try the next one.” It became a question of whether the substitution is safe, whether the request is still idempotent in practice, and whether the user experience survives the handoff.

That is a much less glamorous problem than “smart routing,” but in production it matters more.
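As a sketch of the second and third points, the snippet below tracks health from recent outcomes instead of a binary up/down flag, and only allows failover when the substitution looks safe. The class name, thresholds, error kinds, and the rule "never retry once tokens have reached the client" are assumptions chosen for illustration, not a description of any particular gateway.

```python
from collections import deque


class RollingHealth:
    """Health derived from a sliding window of recent request outcomes."""

    def __init__(self, window=100, max_error_rate=0.05, max_p95_ms=3000):
        self.samples = deque(maxlen=window)  # (ok, latency_ms) pairs
        self.max_error_rate = max_error_rate
        self.max_p95_ms = max_p95_ms

    def record(self, ok: bool, latency_ms: float) -> None:
        self.samples.append((ok, latency_ms))

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        return sum(1 for ok, _ in self.samples if not ok) / len(self.samples)

    def p95_ms(self) -> float:
        latencies = sorted(lat for _, lat in self.samples)
        if not latencies:
            return 0.0
        # Nearest-rank percentile: ceil(0.95 * n) in integer arithmetic.
        rank = max(1, (95 * len(latencies) + 99) // 100)
        return latencies[rank - 1]

    def is_degraded(self) -> bool:
        return (self.error_rate() > self.max_error_rate
                or self.p95_ms() > self.max_p95_ms)


RETRYABLE = {"connect_error", "rate_limited", "upstream_5xx"}


def failover_decision(tokens_emitted, attempts, error_kind,
                      fallback_health, max_attempts=2) -> str:
    """Return "retry_on_fallback" or "fail" for a failed upstream attempt."""
    if attempts >= max_attempts:
        return "fail"
    if tokens_emitted > 0:
        # Output already reached the client; a substitute answer would
        # duplicate or contradict it.
        return "fail"
    if error_kind not in RETRYABLE:
        # For example, an ambiguous timeout may still be billed upstream.
        return "fail"
    if fallback_health.is_degraded():
        return "fail"
    return "retry_on_fallback"
```

A provider with a 0% error rate can still come out as degraded here if its tail latency is bad, which is the distinction between "up" and "healthy" described above.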

4. Streaming is where the abstraction gets stress-tested

Normal request-response flows are relatively easy to make look clean.

Streaming is where the truth comes out.

This was one of the most humbling parts of the system for me. On paper, many providers support streaming. In reality, they do it differently enough that “supported” doesn’t mean “interchangeable.”

The differences show up everywhere:

- chunk structure
- delta semantics
- finish reasons
- heartbeat behavior
- tool call emission
- usage timing
- connection endings
- partial failures

You can hide some of this, but not by pretending it isn’t there. The only sustainable answer I found was to explicitly build a stream normalization layer:

```
provider stream -> parser -> event normalizer -> policy filter -> client stream
```

Even then, the problem doesn’t end at format normalization.

Once streaming traffic grows, you start worrying about backpressure, client disconnects, memory growth, timeout propagation, cleanup of half-open streams, and whether a stream that already emitted tokens can still be retried without making the product feel broken.

A lot of “compatible API” claims sound convincing until streaming enters the picture. That’s where the operational differences become impossible to ignore.
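A minimal sketch of the parser and normalizer stages, written as Python generators, is below. It assumes an upstream that sends OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`); the canonical event names (`text_delta`, `done`, `error`) and the `stream_truncated` code are made up for this example.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Parser stage: turn `data: {...}` lines into dicts."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and `: ping` heartbeat comments
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            return
        yield json.loads(body)


def normalize(chunks: Iterable[dict]) -> Iterator[dict]:
    """Normalizer stage: map provider chunks to canonical events."""
    finished = False
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield {"type": "text_delta", "text": text}
            if choice.get("finish_reason"):
                finished = True
                yield {"type": "done", "reason": choice["finish_reason"]}
    if not finished:
        # The connection ended without a finish reason. Say so explicitly
        # so the client can tell a truncated stream from a complete one.
        yield {"type": "error", "code": "stream_truncated"}


upstream = [
    ": ping",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}, "finish_reason": null}]}',
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "data: [DONE]",
]
events = list(normalize(parse_sse(upstream)))
```

Each provider would get its own parser and normalizer pair, while everything downstream (policy filter, client stream) only ever sees the canonical events. Backpressure, disconnect handling, and cleanup are deliberately out of scope for this sketch.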

5. Usage accounting ended up being much closer to distributed state management than to finance

Before building this, I would probably have described usage accounting as a billing concern.

Now I think that framing is incomplete.

In a multi-model gateway, usage is part of runtime truth. It is not just something you summarize later for a dashboard.

You have to decide what the system believes happened when a request is admitted, forwarded, partially streamed, completed, retried, interrupted, or compensated. You have to deal with providers that report usage at different times. You have to handle client disconnects, partial outputs, ambiguous timeouts, and request paths that were rejected before they ever reached upstream.

At some point, I realized the only sane way to think about this was as an internal ledger with explicit states, not as a single “token in / token out” record.

Something like:

- received
- admitted
- forwarded
- partially streamed
- completed
- failed
- compensated

That model is less elegant than a neat billing summary, but it is closer to what actually happens in a live system.

And if your internal truth is fuzzy, your billing, rate limiting, dashboards, and support tooling will all drift in different directions.
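One way to sketch such a ledger is as a small state machine that rejects impossible transitions. The state names come from the list above; the allowed transitions and the `LedgerEntry` class are assumptions of this example rather than a specification.

```python
from enum import Enum


class UsageState(Enum):
    RECEIVED = "received"
    ADMITTED = "admitted"
    FORWARDED = "forwarded"
    PARTIALLY_STREAMED = "partially_streamed"
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATED = "compensated"


S = UsageState

# Which states may follow which. Assumed for illustration only.
TRANSITIONS = {
    S.RECEIVED: {S.ADMITTED, S.FAILED},  # FAILED: rejected before upstream
    S.ADMITTED: {S.FORWARDED, S.FAILED},
    S.FORWARDED: {S.PARTIALLY_STREAMED, S.COMPLETED, S.FAILED},
    S.PARTIALLY_STREAMED: {S.COMPLETED, S.FAILED},
    S.COMPLETED: {S.COMPENSATED},
    S.FAILED: {S.COMPENSATED},
    S.COMPENSATED: set(),
}


class LedgerEntry:
    """One request's usage record, with an append-only state history."""

    def __init__(self, request_id: str):
        self.request_id = request_id
        self.state = S.RECEIVED
        self.history = [S.RECEIVED]
        self.output_tokens = 0

    def advance(self, new_state: UsageState, output_tokens: int = 0) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)
        self.output_tokens += output_tokens


# A stream that emitted 40 tokens, then failed, then was refunded.
entry = LedgerEntry("req-1")
entry.advance(S.ADMITTED)
entry.advance(S.FORWARDED)
entry.advance(S.PARTIALLY_STREAMED, output_tokens=40)
entry.advance(S.FAILED)
entry.advance(S.COMPENSATED)
```

Keeping the history rather than only the latest state is what lets billing, rate limiting, and support tooling answer the same question from the same record.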

6. Per-key control sounds administrative until you have to enforce it in real time

From the outside, things like unified billing, key-level restrictions, and model allowlists can sound like console features.

From the inside, they are request-path logic.

A real platform quickly accumulates rules like:

- this key can use only lower-cost models
- this project cannot call reasoning models
- this tenant has a spend ceiling
- this environment must stay within a region or provider set
- this customer can access multimodal features, but not image generation
- this internal service can stream, another cannot

None of that can be bolted on cleanly at the end. It has to be designed into the gateway itself.

That was another lesson I didn’t appreciate enough early on: authorization in AI infrastructure is no longer just about identity. It is about identity under constraints.

Who is making the call is only the first question. The more difficult question is: what are they allowed to invoke, under which policy, with which budget, and with what operational risk?
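As an illustration of "identity under constraints", here is a minimal request-path check in Python. The `KeyPolicy` and `CallContext` types, the model class names, and the denial reason strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeyPolicy:
    allowed_model_classes: set[str]
    allowed_providers: set[str]
    spend_ceiling_usd: float
    can_stream: bool = True


@dataclass
class CallContext:
    model_class: str
    provider: str
    stream: bool
    estimated_cost_usd: float


def authorize(policy: KeyPolicy, spent_usd: float,
              call: CallContext) -> tuple[bool, str]:
    """Decide in the hot path whether this key may make this call."""
    if call.model_class not in policy.allowed_model_classes:
        return False, "model_class_not_allowed"
    if call.provider not in policy.allowed_providers:
        return False, "provider_not_allowed"
    if call.stream and not policy.can_stream:
        return False, "streaming_not_allowed"
    if spent_usd + call.estimated_cost_usd > policy.spend_ceiling_usd:
        return False, "spend_ceiling_exceeded"
    return True, "ok"


policy = KeyPolicy(
    allowed_model_classes={"chat"},
    allowed_providers={"provider_a"},
    spend_ceiling_usd=10.0,
    can_stream=False,
)
decision = authorize(policy, 9.5, CallContext("chat", "provider_a", False, 0.25))
```

Returning a reason alongside the decision keeps the denial explainable to the caller, and it constrains routing too: a fallback target has to pass the same check as the primary.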

7. Error handling is one of the most human parts of the system

There is something deceptively important about error design.

Raw upstream errors are often inconsistent, noisy, or not actionable. Different vendors disagree on status codes, error shapes, retry semantics, and what even counts as a timeout.

If you pass all of that directly through, the user ends up integrating not with your platform, but with the chaos behind it.

So I gradually came to see error normalization as a product decision, not just an engineering cleanup task.

If the external API is unified, then the failure experience has to be unified too. Developers should not need to reverse-engineer five providers through your gateway.

A stable error contract does more than improve DX. It gives support, observability, and incident response a common language. That matters a lot once the system starts handling real traffic.
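A sketch of what such an error contract could look like follows. The error codes, the status mapping, and the convention that `status=None` means "no response before the deadline" are assumptions for this example; a real mapping would be tuned per provider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GatewayError:
    code: str          # stable, documented identifier
    http_status: int   # status the gateway returns to the caller
    retryable: bool
    provider: str      # kept for logs and support, not for client logic


def normalize_error(provider: str, status: Optional[int]) -> GatewayError:
    """Map an upstream HTTP status (None = timed out) to one contract."""
    if status is None:
        return GatewayError("upstream_timeout", 504, True, provider)
    if status == 429:
        return GatewayError("rate_limited", 429, True, provider)
    if status in (401, 403):
        # The gateway's own upstream credentials failed. From the caller's
        # point of view that is a gateway-side fault, not an auth error.
        return GatewayError("upstream_auth_error", 502, False, provider)
    if status in (400, 422):
        return GatewayError("invalid_request", 400, False, provider)
    if status >= 500:
        return GatewayError("upstream_unavailable", 502, True, provider)
    return GatewayError("upstream_error", 502, False, provider)
```

The explicit `retryable` flag is the part clients tend to rely on most, since it spares them from inferring retry semantics from status codes that differ between vendors.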

8. Observability turned out to be the part that makes everything else possible

If I had to pick one area that became more important over time, it would be observability.

Not because dashboards are nice to have, but because once multiple providers, regions, models, retries, and fallback paths sit behind one endpoint, you lose intuition very quickly.

You need to see what was requested, what was actually routed, which region it hit, whether it streamed, how many retries happened, how long it spent in each stage, whether the provider was degrading, whether the fallback rate is climbing, whether structured outputs are failing more often for one adapter than another.

Without that, routing is guesswork.

Without that, failover is guesswork.

Without that, debugging becomes a sequence of anecdotes.

Over time, I started to think of observability as the thing that converts a gateway from a thin aggregation layer into infrastructure you can improve deliberately.
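As a small illustration, a per-request trace record can capture most of those questions in one structured log line. The `RequestTrace` class and its field names are hypothetical; in practice this would usually feed whatever tracing or logging system is already in place.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class RequestTrace:
    request_id: str
    requested_model: str
    routed_model: str = ""
    region: str = ""
    streamed: bool = False
    retries: int = 0
    stage_ms: dict = field(default_factory=dict)

    @contextmanager
    def stage(self, name: str):
        """Accumulate wall-clock time spent in a named stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000
            self.stage_ms[name] = self.stage_ms.get(name, 0.0) + elapsed

    @property
    def fell_back(self) -> bool:
        """True when the request was served by a different model."""
        return bool(self.routed_model) and self.routed_model != self.requested_model

    def to_log_line(self) -> str:
        return json.dumps({**asdict(self), "fell_back": self.fell_back},
                          sort_keys=True)


trace = RequestTrace("req-1", requested_model="provider_a/fast")
with trace.stage("routing"):
    trace.routed_model = "provider_b/long"  # pretend the primary was degraded
    trace.region = "eu-west"
with trace.stage("upstream"):
    trace.retries += 1
line = trace.to_log_line()
```

Aggregating `fell_back` and `stage_ms` across requests is what turns "the fallback rate is climbing" from an anecdote into a number.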

9. The hardest part is that the target never stops moving

One thing I underestimated was how often the ecosystem changes beneath you.

Providers update model versions. Pricing changes. Context windows change. Rate limits shift. Streaming behavior changes. Response fields evolve. A provider can remain “compatible” in broad marketing terms while becoming materially different in an edge case that breaks your assumptions.

That means architecture needs to assume motion.

The patterns that have felt most durable to me are fairly unglamorous:

- isolate providers behind adapters
- keep one canonical internal schema
- keep routing policy configurable
- centralize capability metadata
- make quirks declarative when possible
- preserve enough telemetry to detect silent regressions

None of this is exciting to describe. All of it becomes important once you are no longer integrating a demo, but operating a surface that other developers are relying on.
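To show what "declarative quirks" can mean in practice, here is a small sketch where provider differences live in a data table and one generic function applies them. The provider names, parameter renames, and the temperature cap are invented for illustration.

```python
# Quirks as data: changing a provider's behavior means editing this table,
# not the adapter code. All entries are made up.
QUIRKS = {
    "provider_a": {
        "rename_params": {"max_tokens": "max_output_tokens"},
        "drop_params": ["logit_bias"],
    },
    "provider_b": {
        "max_temperature": 1.0,
    },
}


def apply_quirks(provider: str, payload: dict) -> dict:
    """Return a copy of payload adjusted for the provider's known quirks."""
    quirks = QUIRKS.get(provider, {})
    renames = quirks.get("rename_params", {})
    dropped = set(quirks.get("drop_params", []))

    out = {}
    for key, value in payload.items():
        if key in dropped:
            continue  # parameter the provider rejects
        out[renames.get(key, key)] = value

    cap = quirks.get("max_temperature")
    if cap is not None and "temperature" in out:
        out["temperature"] = min(out["temperature"], cap)
    return out


adjusted = apply_quirks(
    "provider_a",
    {"max_tokens": 100, "logit_bias": {}, "temperature": 1.5},
)
```

When a provider changes a field name or tightens a limit, the fix in this shape is a one-line table edit that can be reviewed and rolled back like any other configuration change.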

10. What people call “aggregation” is often really a search for reliability

From the outside, a multi-model gateway looks like a convenience layer.

Inside the system, it feels more like an attempt to absorb volatility.

Yes, unified access matters. So does lower switching cost. So does having one integration surface instead of many.

But the part that keeps revealing itself as the real value is reliability: consistent behavior, safer fallbacks, clearer errors, better visibility, better policy control, and fewer vendor-specific surprises leaking into application code.

That’s the part I did not fully understand before building this.

The hard part is not exposing more models.

The hard part is standing between a messy, fast-changing provider ecosystem and a developer who just wants their production system to behave predictably.

That, more than anything else, is what “unified API” has come to mean to me.

If you’re building in this space too, I’d genuinely be curious how you think about normalization, routing, streaming, and failover. I’ve found that those four topics reveal most of the real architecture very quickly.

Top comments (1)

sanreds • Jun 6

Gateway-level reliability is the part of the stack most posts skip past. The piece I'd love to see covered more is fallback semantics under cost constraints: when your primary model returns a 429, do you retry, downgrade, or fail fast? Each one breaks a different downstream assumption. Curious how you handled the fallback policy.