
Nishil Bhave

Posted on • Originally published at maketocreate.com

AI Gateway Architecture: 7 Cross-Cutting Concerns (2026)


Inside an AI Gateway: 7 Cross-Cutting Concerns That Don't Belong in Your App Code

Pull up the source of any production AI feature shipped in 2026 and you'll find the same mess. An OpenAI client wrapped in a try/except. A hardcoded retry. An if model == "claude" branch, a half-broken cache that nobody trusts, and a Slack alert that fires whenever Anthropic has a bad afternoon. Every team builds the same plumbing. Every team gets it wrong in the same places. AI gateway architecture is the pattern that ends this duplication.

OpenAI, Anthropic, and Google now command 89% of enterprise wallet share for AI (a16z, 2026). The average enterprise LLM bill grew from $4.5M to $7M in two years. It's projected to hit $11.6M next year. At that spend, the patterns scattered through your app code aren't ergonomics. They're architecture.

This is what AI gateway architecture solves. It's the API gateway pattern reincarnated for LLMs — a thin layer between your app and the model providers. The gateway owns the cross-cutting concerns nobody should be reimplementing per-feature. Let's walk through the seven concerns and what each looks like done right.

[Image: 4-layer API security framework covering gateway-level rate limiting, CORS, JWT, and injection prevention]

Key Takeaways

  • 89% of enterprise LLM spend goes to OpenAI, Anthropic, and Google combined (a16z, 2026). Multi-provider posture is now the default, not the exception.
  • 60% of LLM API errors are rate-limit failures (Datadog State of AI Engineering, 2026); fallback routing isn't a nice-to-have.
  • Semantic caching cuts LLM bills 30-50% in production at Portkey-scale (Portkey, 2026) and up to 90% with stacked L1+L2 layers.
  • LiteLLM (44.7K stars), Portkey (~8K), and Bifrost (4.3K) are the three open AI gateways shaping the category. Pick by traffic profile, not popularity.

What Is AI Gateway Architecture, and Why Does It Matter Now?

AI gateway architecture is built around a reverse proxy purpose-built for LLM traffic. It sits between your application and one or more model providers. It handles the concerns that don't belong inside any single feature: routing, retries, caching, rate-limiting, observability, redaction, and cost attribution. Think of it as the LLM-aware sibling of Kong, Envoy, or AWS API Gateway. Same architectural role, completely different traffic shape.

Why now? Three forces collided. First, multi-provider posture became standard. Most enterprise teams now route conversational tasks to OpenAI, coding tasks to Anthropic, and multimodal to Gemini in one app. Second, costs exploded. Worldwide AI spending will reach $2.52 trillion in 2026, a 44% jump year-over-year (Gartner, 2026). Most of that flows through inference APIs your finance team can't audit. Third, providers became unreliable enough that single-vendor bets started feeling reckless. OpenAI alone had four major outages in 2026.

[Chart] Source: a16z, "How 100 Enterprise CIOs Are Building and Buying Gen AI in 2026"

The unifying observation is simple. Every concern listed below shows up in every LLM-using app, no matter the feature. Concentrating them in AI gateway architecture is just the DRY principle applied to a new layer.


Concern 1: How Does Multi-Provider Routing Work?

Routing is the gateway's most visible job: take an incoming request and decide which provider, model, and region should serve it. The naive version is a config file mapping task_type → model_id. The production version is a policy engine.

37% of enterprises now run 5+ models in production, and Anthropic's enterprise share jumped from 12% to 32% year over year while OpenAI's dropped from 50% to 25% (Menlo Ventures, 2026). The implication: routing logic that hardcodes provider names ages in months, not years.

The routing dimensions that actually matter in practice, and that you'll regret merging into one config, are these:

  • Capability — which models can handle this prompt's input modality, context window, and tool-use requirements?
  • Cost class — premium, standard, or fast/cheap? Anthropic's Opus tier costs roughly 5x its Haiku tier.
  • Latency budget — interactive (sub-2s p95), background (sub-30s), or batch (best effort)?
  • Geography — does this request need to stay in-region for data residency?
  • Tenant tier — paid customers route to higher-quality models; free-tier traffic goes to cheap fallbacks.

In practice: I've seen teams collapse all five dimensions into a single hardcoded model string ("gpt-4o") and then spend months unpicking it when their first enterprise customer asks for EU-only routing. Build the policy table from day one. Even if every row says "use gpt-4o," the structure is what saves you later.

According to a16z's empirical study of 100 trillion tokens through OpenRouter, model-mix changed materially every quarter (a16z State of AI, 2026). Routing policy must be a hot config, not a deploy.
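
To make the policy-table idea concrete, here's a minimal Python sketch. The dimension values, model IDs, and matching rules are illustrative assumptions rather than any gateway's real schema; the point is that the rows live in hot-reloadable config and the selection logic stays dumb.

```python
# Illustrative policy rows. Load these from a config store so a routing
# change is a config push, not a deploy. Model IDs are examples only.
ROUTING_POLICY = [
    {"capability": "code", "cost_class": "premium", "latency": "background",
     "region": "any", "tier": "paid", "model": ("anthropic", "claude-opus")},
    {"capability": "chat", "cost_class": "standard", "latency": "interactive",
     "region": "eu", "tier": "paid", "model": ("openai", "gpt-4o")},
    {"capability": "chat", "cost_class": "cheap", "latency": "interactive",
     "region": "any", "tier": "free", "model": ("openai", "gpt-4o-mini")},
]

def select_model(req: dict) -> tuple[str, str]:
    """Return (provider, model) for the first policy row the request satisfies."""
    for row in ROUTING_POLICY:
        if row["capability"] != req["capability"] or row["tier"] != req["tenant_tier"]:
            continue
        # Data residency: a request pinned to a region must match a region-pinned row.
        if req.get("required_region") and row["region"] != req["required_region"]:
            continue
        return row["model"]
    raise LookupError(f"no routing policy matches {req['capability']}/{req['tenant_tier']}")
```

Even with one row, the shape forces every new feature to declare its capability, latency budget, and tenant tier instead of hardcoding a model name.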


Concern 2: Why Is Fallback and Retry Logic So Important?

Fallback is the gateway's safety net. When the primary provider fails (rate-limited, timed out, returning a 5xx, or simply down), the gateway must transparently retry against a different model or provider without your application code knowing.

5% of all LLM call spans report errors, and 60% of those errors are rate-limit failures (Datadog, 2026). Three of every five LLM failures, then, aren't bugs in your code. They're capacity problems in someone else's data center. A gateway that can fall back from claude-opus → claude-sonnet → gpt-4o recovers most of them silently.

The pattern that works (sketched in code after the list):

  1. Primary attempt with provider-specific timeout (e.g., 30s for Opus, 8s for Haiku).
  2. On 429/503/timeout: retry with exponential backoff, but only twice. Three retries against a degraded provider just compounds the latency.
  3. On final failure: switch model class. Drop quality before dropping the request.
  4. Hard ceiling: total wall-clock budget for the entire chain, enforced at the gateway. The user's request waits 12 seconds, not 12 seconds × 4 attempts.
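
A Python sketch of steps 1 through 4, assuming a call_model() coroutine that wraps the provider SDKs and two placeholder exception types for 429s and 5xx responses (none of these are real SDK names):

```python
import asyncio, random

class RateLimitError(Exception): pass   # placeholder: raised by the wrapper on 429
class ProviderError(Exception): pass    # placeholder: raised on provider 5xx

# Ordered fallback chain with per-model timeouts: drop quality before dropping the request.
CHAIN = [
    ("anthropic", "claude-opus", 30.0),    # primary, generous timeout
    ("anthropic", "claude-sonnet", 15.0),  # same provider, cheaper class
    ("openai", "gpt-4o", 15.0),            # different provider entirely
]

async def call_with_fallback(prompt: str, wall_clock_budget: float = 12.0) -> str:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + wall_clock_budget       # hard ceiling for the whole chain
    last_error = None
    for provider, model, timeout in CHAIN:
        for attempt in range(2):                     # at most two tries per model
            remaining = deadline - loop.time()
            if remaining <= 0:
                raise TimeoutError("wall-clock budget exhausted") from last_error
            try:
                # call_model() is an assumed helper around the provider SDKs
                return await asyncio.wait_for(call_model(provider, model, prompt),
                                              timeout=min(timeout, remaining))
            except (TimeoutError, asyncio.TimeoutError, RateLimitError, ProviderError) as e:
                last_error = e
                await asyncio.sleep(min(2 ** attempt + random.random(), 2.0))  # backoff + jitter
    raise last_error
```

The wall-clock deadline is the part teams forget: without it, a chain of well-meaning retries turns a 12-second budget into a minute of user-visible hang.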

Anthropic's Claude.ai uptime ran at 99.32% over the past 30 days as of February 2026 — nearly five hours of monthly downtime (Helicone status data, 2026). OpenAI saw four major outages in 2026: a multi-hour global event in June and a routing misconfiguration in December, among others. If your business depends on a single provider being up, your business is going down with them.

Worth noting: "Multi-provider fallback" sounds elegant until you realize prompts trained for Claude rarely produce identical output on GPT-4o. Build fallback assuming degraded but acceptable output, never identical. Test the failure path explicitly: pin the gateway to fallback mode in staging for a day per quarter and look at what breaks.


Concern 3: What Does Semantic Caching Actually Save You?

Semantic caching is the gateway's biggest cost lever. Unlike traditional caches that key on exact string match, semantic caches embed the prompt, find vector-similar prior prompts, and serve the cached completion if the similarity exceeds a threshold (typically cosine ≥ 0.95).

Most applications see 20-40% cost reduction in the first week with Portkey's semantic caching. Mature deployments often report 30-50% reductions (Portkey, 2026). Stacked L1 (exact) plus L2 (semantic) cache layers cut typical 10M-request-per-month workloads by 54% (TokenMix, 2026).


But the real-world numbers vary brutally by workload. Here's the data:

[Chart] Sources: Portkey real-world hit rate data; Helicone production benchmarks; TokenMix L1/L2 architecture analysis

Two non-obvious things to get right:

Cache key composition. A semantic cache that ignores temperature, system prompt, tool definitions, or user context will return wrong-but-similar answers. The cache key needs to include every input that could change the output, not just the user message.
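
A sketch of one way to compose that key: hash every non-text input that can change the output into a namespace, and only embed the user message within that namespace. Field names here are illustrative.

```python
import hashlib, json

def semantic_cache_key(user_msg: str, *, system_prompt: str, model: str,
                       temperature: float, tools: list[dict]) -> tuple[str, str]:
    """Return (namespace, text_to_embed). The namespace pins every non-text input
    that affects the output; only the user message is embedded for similarity."""
    namespace = hashlib.sha256(json.dumps({
        "system": system_prompt,
        "model": model,
        "temperature": temperature,
        "tools": tools,
    }, sort_keys=True).encode()).hexdigest()
    return namespace, user_msg
```

Two requests only become cache candidates for each other when their namespaces match exactly; similarity search runs inside the namespace, never across it.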

Stale-while-revalidate. For LLM responses, "fresh" doesn't always matter. Serve stale on cache hit, refresh asynchronously, and the user perceives sub-100ms latency. This pattern alone reduces p95 latency from ~2s to <200ms on cache-hit paths.
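
A minimal in-memory sketch of the pattern, assuming a call_model() helper for the real provider call; a production version would back this with Redis and a semantic lookup rather than a dict.

```python
import asyncio, time

CACHE: dict[str, tuple[str, float]] = {}   # key -> (response, stored_at)
TTL_FRESH = 3600                           # serve without refreshing (seconds)
TTL_STALE = 24 * 3600                      # serve, but refresh in the background

async def cached_completion(key: str, prompt: str) -> str:
    entry = CACHE.get(key)
    if entry:
        response, stored_at = entry
        age = time.time() - stored_at
        if age < TTL_FRESH:
            return response                             # fresh hit
        if age < TTL_STALE:
            asyncio.create_task(refresh(key, prompt))   # stale hit: refresh asynchronously
            return response                             # user still sees sub-100ms latency
    return await refresh(key, prompt)                   # miss: pay for the full call

async def refresh(key: str, prompt: str) -> str:
    response = await call_model(prompt)                 # assumed provider helper
    CACHE[key] = (response, time.time())
    return response
```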

From experience: The first semantic cache I built was useless because every prompt included a timestamp and request ID. Every embedding looked unique. Caching is an exercise in subtraction: strip from the prompt everything that doesn't change the answer, then embed.


Concern 4: How Should Rate Limiting and Quotas Be Enforced?

Rate limiting in LLM-land is two problems wearing a trenchcoat. There's the limit your provider imposes on you (TPM, RPM, concurrency), and the limit you impose on your customers. The gateway has to enforce both, in the same request path.

Provider-side rate limits are the leading cause of LLM failures — 60% of all errors per Datadog. The gateway needs to track token consumption per provider per minute. It then queues, throttles, or sheds traffic before hitting the wall. Naive token counters lag because token usage is only known after the response. The fix: estimate input tokens at request time with a tokenizer (tiktoken for OpenAI, Anthropic's tokenizer SDK), reserve capacity, then reconcile on response.
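
A sketch of that estimate-reserve-reconcile loop for the OpenAI side, using tiktoken for the estimate and Redis counters for the per-minute budget. The TPM limit and the 500-token output guess are placeholder numbers, and the model name has to be one tiktoken recognizes.

```python
import time, redis, tiktoken

r = redis.Redis()
TPM_LIMIT = 2_000_000   # illustrative provider-granted tokens-per-minute budget

def reserve_capacity(prompt: str, model: str = "gpt-4o", est_output: int = 500) -> tuple[str, int]:
    """Estimate tokens before the call and reserve them against this minute's budget."""
    enc = tiktoken.encoding_for_model(model)
    estimate = len(enc.encode(prompt)) + est_output
    window = f"tpm:{model}:{int(time.time() // 60)}"
    if r.incrby(window, estimate) > TPM_LIMIT:
        r.decrby(window, estimate)          # release the reservation
        raise RuntimeError("provider TPM budget exhausted: queue or shed this request")
    r.expire(window, 120)
    return window, estimate

def reconcile(window: str, estimated: int, actual_total_tokens: int) -> None:
    """After the response arrives, correct the reservation to actual usage."""
    r.incrby(window, actual_total_tokens - estimated)
```

The same shape works for Anthropic with its tokenizer SDK swapped in for the estimate.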

[Image: comparison of token bucket, sliding window, and fixed window rate-limiting algorithms]

Customer-side quotas need a different model entirely. Per-user TPM is too coarse for free tiers (one prompt-injection attack burns the daily budget). What works is multi-dimensional quotas:

  • Tokens per user per hour
  • Tokens per organization per day
  • Concurrent requests per user
  • Spend per user per month (priced, not token-counted)

The shift to spend-based quotas matters because output tokens cost 3-10x more than input tokens. A user submitting one paragraph who gets back a 2,000-word essay costs more than a user submitting twenty short questions. Token-counting hides this; dollar-counting reveals it.
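
A sketch of a dollar-denominated quota check. The tier budgets and per-million-token prices are placeholders; real prices differ per model and change often, so in practice they come from a price table the gateway keeps current.

```python
import redis

r = redis.Redis()
MONTHLY_BUDGET_USD = {"free": 1.00, "pro": 50.00}        # illustrative tiers
PRICE_PER_M_TOKENS = {"input": 3.00, "output": 15.00}    # illustrative prices

def charge_user(user_id: str, tier: str, month: str,
                prompt_tokens: int, completion_tokens: int) -> None:
    """Convert this call's token usage to dollars and add it to the user's monthly spend."""
    cost = (prompt_tokens * PRICE_PER_M_TOKENS["input"]
            + completion_tokens * PRICE_PER_M_TOKENS["output"]) / 1_000_000
    spent = r.incrbyfloat(f"spend:{user_id}:{month}", cost)
    if spent > MONTHLY_BUDGET_USD[tier]:
        # The current call already ran; this blocks the next one until the month rolls over.
        raise RuntimeError(f"user {user_id} exceeded ${MONTHLY_BUDGET_USD[tier]:.2f} for {month}")
```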

In practice: The single most-impactful rate-limit decision I've seen is separating queues by tenant tier. Free-tier abuse should never starve paying customers. One queue per tier, with priority weights, fixes 90% of "AI is slow today" complaints. Without it, your AI is slow because some bot found your endpoint.


Concern 5: What Does LLM Observability Require Beyond APM?

Traditional APM tools tell you that a request was slow. LLM observability has to tell you why the model said what it said. That's a strictly larger surface area.

A complete trace for a single LLM request has to capture a lot. The full prompt (inputs, system prompt, tool definitions). The model's response. Prompt and completion token counts. Computed cost in USD. Latency at each stage. The model and provider chosen. Fallback path if any, cache hit/miss, redaction events, and the user's tenant context. That's 12+ fields per call. Multiply by tens of thousands of calls per day.
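
One way to keep that manageable is a single structured record per call; the field names below are illustrative, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class LLMTrace:
    trace_id: str
    tenant_id: str
    feature_id: str
    provider: str
    model: str
    prompt: str                  # route to a restricted store, not the main trace index
    response: str                # same: see the retention note later in this section
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    latency_ms: float
    cache_hit: bool
    fallback_path: list[str] = field(default_factory=list)
    redaction_events: int = 0
```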

Datadog's State of AI Engineering 2026 found teams that added LLM observability saw their monitoring bills rise 40-200% depending on volume and custom metric instrumentation (Datadog, 2026). Some teams responded by simply not monitoring AI workloads properly, which means they get paged for outages instead of seeing them coming.


The trace shape that's actually useful:

```
trace_id: t_abc123
├── gateway.receive (12ms)
├── policy.evaluate → claude-opus (3ms)
├── cache.semantic.lookup → MISS (45ms)
├── redact.pii (2 PII removed, 8ms)
├── provider.anthropic.call
│   ├── prompt_tokens: 1,847
│   ├── completion_tokens: 412
│   ├── cost_usd: 0.0289
│   └── latency_ms: 1,842
├── cache.semantic.write (4ms)
└── response.stream (412 tokens)
total_cost_usd: 0.0289 | total_latency_ms: 1,914
```

Three things this enables that APM doesn't. Replay: resubmit a failing prompt against another model to compare output. Cost-by-feature: surface which product burned $12K in March 2026. Quality regression detection: catch when the same prompt starts producing worse answers after a model update.

Worth noting: Don't put raw prompts and completions in your standard trace store. They contain PII, customer data, and sometimes secrets. Route them to a separate, access-controlled store with shorter retention. Your traces graduate to forensic evidence the moment a customer asks "why did the AI say that about me?"


Concern 6: How Do You Redact PII Before It Leaves Your Network?

PII redaction is the concern that turns AI gateways from optional to mandatory in regulated environments. Once a prompt with a customer's SSN, credit card, or medical record reaches a third-party provider, you've lost control of it. Depending on jurisdiction, you may have just handed your DPO a reportable incident.

Roughly 4.7% of employees have pasted confidential data into ChatGPT, and ~11% of all employee-submitted data is classified as confidential (Cyberhaven research, cited by Pangea, 2026). State-of-the-art PII redaction tools hit only 92-95% accuracy by category (Statsig, 2026). Even the best systems leak 5-8% of sensitive entities.

The gateway-level redaction pipeline that works (a code sketch follows the steps):

  1. Pre-flight detection: regex + NER model (Presidio, AWS Comprehend) scans the prompt before any provider call.
  2. Tokenization: replace detected entities with reversible tokens ([EMAIL_001], [CC_002]).
  3. Provider call with the redacted prompt.
  4. Detokenization on response: re-substitute original values into the model's output before returning to the user.
  5. Audit: log redaction events with entity counts (never raw values) for compliance review.
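
A sketch of steps 1, 2, and 4 using Presidio's analyzer for detection and a hand-rolled reversible-token map; production pipelines would use Presidio's anonymizer or its pseudonymization support instead of this naive substitution.

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()   # needs a spaCy model available at runtime

def redact(prompt: str) -> tuple[str, dict[str, str]]:
    """Steps 1-2: detect entities, replace each with a reversible token."""
    results = sorted(analyzer.analyze(text=prompt, language="en"),
                     key=lambda res: res.start, reverse=True)
    mapping: dict[str, str] = {}
    for i, res in enumerate(results):
        token = f"[{res.entity_type}_{i:03d}]"
        mapping[token] = prompt[res.start:res.end]
        prompt = prompt[:res.start] + token + prompt[res.end:]   # replace from the end
    return prompt, mapping

def detokenize(completion: str, mapping: dict[str, str]) -> str:
    """Step 4: re-substitute original values into the model's output."""
    for token, original in mapping.items():
        completion = completion.replace(token, original)
    return completion
```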

The non-obvious failure mode: redaction often breaks the prompt's meaning. Ask "summarize the medical history of John Smith," then replace "John Smith" with [NAME_001], and you get a generic summary. A capable gateway preserves role (the model knows it's a person) while stripping identity. Presidio's pseudonymization mode handles this. Ad-hoc regex doesn't.

In practice: The single highest-leverage gateway feature for regulated industries is outbound redaction with audit logs. It's the difference between "we use AI" and "we can pass a SOC 2 audit while using AI."


Concern 7: Why Is Cost Attribution the Hardest Problem?

Cost attribution sounds boring. It is also the concern that determines whether your AI initiative survives its first board review.

Average enterprise AI spend grew from ~$4.5M to ~$7M in two years. It's projected to grow another 65% to ~$11.6M next year (a16z, 2026). At those numbers, "AI costs" is no longer a single line item. It's a portfolio. The CFO wants to know which products, which customers, and which features are profitable on AI economics. Without gateway-level attribution, nobody knows.

[Chart] Sources: Featherless LLM API Pricing 2026; Oplexa AI Inference Cost Crisis 2026; Iternal AI cost calculators

Most companies overpay 50-90% on LLM costs (Oplexa, 2026), and most apps underestimate token usage by 2-3x at planning. The reasons aren't mysterious; they're just invisible without a gateway:

  • Output tokens cost 3-10x more than input tokens. A short prompt that triggers a long answer is expensive.
  • Agentic workflows use 5-30x more tokens than single-prompt features. One user-facing "click" might fan out into 20 LLM calls behind the scenes.
  • RAG context tax: sending thousands of doc tokens with every query inflates per-call cost 2-5x.

The gateway tags every request with tenant_id, feature_id, user_id, and cost_usd and emits a billing event. Aggregate that stream and you have per-feature P&L by tomorrow morning.
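
A sketch of that event and the morning-after rollup; the field names mirror the tags above, and the transport (a Kafka topic, a queue, a log pipeline) is deliberately left out.

```python
import json, time
from collections import defaultdict

def billing_event(tenant_id: str, feature_id: str, user_id: str, cost_usd: float) -> str:
    """One event per request; ship this to whatever your billing/analytics stream is."""
    return json.dumps({"ts": time.time(), "tenant_id": tenant_id,
                       "feature_id": feature_id, "user_id": user_id,
                       "cost_usd": cost_usd})

def cost_by_feature(events: list[str]) -> dict[str, float]:
    """Aggregate the stream into per-feature spend: tomorrow morning's P&L view."""
    totals = defaultdict(float)
    for raw in events:
        event = json.loads(raw)
        totals[event["feature_id"]] += event["cost_usd"]
    return dict(totals)
```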

A finding worth surfacing: When teams I've worked with first turned on per-feature cost attribution, the top three features by cost were never the top three by user-perceived value. The biggest spend was almost always a debug or admin tool somebody forgot was wired to GPT-4. Attribution doesn't just enable budgeting. It surfaces waste no audit would have found.


AI Gateway Architecture Choices: LiteLLM vs Portkey vs Bifrost vs Build Your Own

The three most-adopted open AI gateways in 2026 occupy different niches. Pick by traffic profile and team capability, not by feature checklist.

| Gateway | GitHub stars | Language | Sweet spot | Trade-off |
| --- | --- | --- | --- | --- |
| LiteLLM | 44,728 | Python | Largest provider catalog, simplest install, vibrant community | Python overhead caps it at ~500 RPS per process |
| Portkey | ~8,000 | Hosted/SaaS | Deep observability, governance, enterprise audit trails | Less flexibility for custom routing logic; managed pricing |
| Bifrost | 4,305 | Go | 11μs overhead per request at 5,000 RPS, 50x faster than LiteLLM | Smaller community; fewer pre-built integrations |

LiteLLM is the right starting point for most teams: 100+ providers behind one OpenAI-compatible API, self-hostable, well-documented. When you outgrow Python (typically around 500 RPS), Bifrost's Go implementation handles 5,000+ RPS with negligible overhead (Maxim AI benchmarks, 2026). Portkey is the choice when audit, governance, and per-user trace quality matter more than low-level control, typical for regulated industries.

Building your own gateway makes sense in exactly two cases. First, you have unusual routing requirements (e.g., on-prem-only models with custom auth) that no OSS gateway supports. Second, you have <50 RPS and want full control with no operational dependency. For everyone else, the cost of reinventing semantic cache plus retry plus observability plus redaction will far exceed the savings.


Building a Minimal AI Gateway Architecture in 200 Lines

To make the architecture concrete, here's the smallest gateway that captures five of the seven concerns. It's intentionally crude — no semantic cache, no PII redaction — but it shows how the layers compose. In Python:

```python
import time, hashlib, redis
from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

# Assumed helpers, defined elsewhere: call_provider() wraps the SDK call and
# returns (response_text, cost_usd); emit_metric() ships a metrics/billing event.

class RateLimitError(Exception): pass   # raised on 429s and quota breaches
class ProviderError(Exception): pass    # raised on provider 5xx responses

redis_client = redis.Redis()
clients = {"anthropic": AsyncAnthropic(), "openai": AsyncOpenAI()}

ROUTING_POLICY = [
    {"task": "code", "primary": ("anthropic", "claude-opus"), "fallback": ("openai", "gpt-4o")},
    {"task": "chat", "primary": ("openai", "gpt-4o"), "fallback": ("anthropic", "claude-sonnet")},
]

async def gateway(prompt, task, tenant_id, user_id):
    start = time.monotonic()

    # 1. Routing: first policy row whose task matches
    policy = next(p for p in ROUTING_POLICY if p["task"] == task)
    candidates = [policy["primary"], policy["fallback"]]

    # 2. Exact-match cache, keyed on task + prompt hash
    cache_key = f"llm:{task}:{hashlib.sha256(prompt.encode()).hexdigest()}"
    if cached := redis_client.get(cache_key):
        emit_metric(tenant_id, user_id, task, cost_usd=0.0, cache="HIT")
        return cached.decode()

    # 3. Per-tenant rate limit (100 RPM, fixed one-minute window)
    tenant_key = f"rl:{tenant_id}:{int(time.time() // 60)}"
    if redis_client.incr(tenant_key) > 100:
        raise RateLimitError(f"tenant {tenant_id} exceeded 100 RPM")
    redis_client.expire(tenant_key, 60)

    # 4. Provider call with fallback: try primary, then fallback, keep the last error
    last_error = None
    for provider, model in candidates:
        try:
            response, cost = await call_provider(provider, model, prompt)
            redis_client.setex(cache_key, 3600, response)
            emit_metric(tenant_id, user_id, task, cost_usd=cost,
                        provider=provider, latency_ms=(time.monotonic() - start) * 1000)
            return response
        except (TimeoutError, RateLimitError, ProviderError) as e:
            last_error = e
            continue
    raise last_error
```

Five of the seven concerns are visible in a few dozen lines: routing policy, exact-match cache, rate limiting, provider fallback, and observability metric emission. Add a Presidio pass before the provider call and a vector-similarity cache layer with Pinecone or pgvector, and you're at six. The seventh — granular cost attribution — is just enriching the emit_metric call with feature_id and routing the events to a billing topic.

The point isn't that you should build this. It's that the architecture isn't mystical. Once you see the layering, the choice between rolling your own and adopting LiteLLM/Portkey/Bifrost becomes a straightforward operational-cost decision.



Frequently Asked Questions

Is AI gateway architecture just API gateway architecture with extra steps?

Architecturally yes, operationally no. Traditional API gateways (Kong, Envoy) handle HTTP-shaped traffic: auth, rate limiting, routing on path. AI gateway architecture adds LLM-specific concerns: token-aware rate limits, semantic caching, prompt-level redaction, model fallback, and per-token cost attribution. According to Datadog (2026), 60% of LLM errors are rate-limit failures. That's a class of error a generic gateway can't even detect, let alone route around.

Should I use a hosted AI gateway or self-host?

Self-host (LiteLLM, Bifrost) when you have predictable traffic, sensitive data, and engineering capacity to operate it. Hosted (Portkey, Cloudflare AI Gateway) when audit trails, observability dashboards, and zero ops matter more than per-request cost. Most teams under $5M in annual LLM spend (a16z, 2026) get more value from hosted; above that, self-hosting starts paying back the operational tax.

How much does an AI gateway actually save in production?

Semantic caching alone delivers 30-50% cost reduction in mature deployments and up to 90% with stacked L1+L2 layers (Portkey, 2026). Adding fallback recovers the 5% of requests Datadog measured as failed calls. Combined with cost attribution that surfaces waste, real-world teams report 40-60% total LLM bill reduction within six months of gateway adoption.

Will an AI gateway slow down my requests?

A well-designed gateway adds <50ms p99 overhead. Bifrost benchmarks at 11μs per request at 5,000 RPS (Maxim AI, 2026). Compare that to typical LLM call latency (1-5 seconds) and the gateway is rounding error. If your gateway is adding meaningful latency, the bottleneck is almost always synchronous semantic-cache embedding. Fix that with async refresh and stale-while-revalidate.

Can an AI gateway help with compliance (SOC 2, HIPAA, GDPR)?

Yes, and for many regulated industries this is the primary purchase driver. A gateway provides centralized PII redaction, audit logs of every prompt/response (hashed or stored separately), data residency enforcement via region-aware routing, and a single point for security review. It's far easier to certify one gateway as compliant than to certify every feature that calls an LLM.


Putting It Together

The pattern is clear once you stop looking at LLM features and start looking at LLM traffic. Routing, fallback, caching, rate limiting, observability, redaction, and cost attribution are all the same shape: cross-cutting concerns that show up in every feature, get implemented seven different ways, and rot under their own complexity.

AI gateway architecture centralizes them. The savings compound — 30-50% on inference costs, dramatically reduced incident MTTR, and per-feature P&L visibility. The architectural simplification compounds harder. New features stop shipping their own retry logic. New providers integrate by config change. Compliance reviews start from "we have one place to audit."

If you're building anything more than a single-prompt demo, your application code shouldn't know which provider is up, what last quarter's caching strategy was, or how much each user's session cost. That belongs one layer down. Start there — even if your first version is 200 lines of glue around LiteLLM. The leverage you get from getting the abstraction right will outlast every model you'll route through it.

