
Jangwook Kim

Originally published at effloow.com

Cloudflare AI Gateway: Zero-Config LLM Proxy for Production

Every production AI application hits the same wall eventually. You start calling OpenAI directly, usage grows, costs spike unpredictably, you have no idea which requests are slow, and when OpenAI has an outage your whole product goes down. The standard answer is to add an LLM proxy — but most options involve either running your own infrastructure (LiteLLM) or switching to a managed aggregator that locks you into their model catalog (OpenRouter).

Cloudflare AI Gateway takes a different position. It sits between your code and any AI provider as an edge-native proxy, adding caching, rate limiting, spend tracking, and logs without requiring you to change a single line of SDK code. If you are already deploying on Cloudflare — or plan to — it is the path of least resistance to production-grade AI observability.

This guide covers the full setup: AI Gateway configuration, the URL substitution pattern, Workers AI bindings, model fallback with automatic retry, and where the free tier ends and paid features begin.

Effloow Lab ran a sandbox PoC: Wrangler 4.87.0 was installed, a local dev server was launched at localhost:8787, and a Worker implementing the AI Gateway caching and fallback patterns was verified syntactically. URL patterns for seven providers were confirmed against the official Cloudflare documentation. Actual inference and live cache hit rate testing require an authenticated Cloudflare account.


What Cloudflare AI Gateway Actually Is

AI Gateway is not an AI provider. It does not host models. It is a reverse proxy at Cloudflare's edge that intercepts your API calls to external providers — OpenAI, Anthropic, Groq, Replicate, Hugging Face, Google AI Studio, and more — and adds a layer of control without any changes to your application logic.

The core value proposition fits in one sentence: replace api.openai.com/v1 in your request URL with a Cloudflare gateway URL, and your calls automatically get logged, cached, and protected by rate limits.

# Before
https://api.openai.com/v1/chat/completions

# After (same headers, same body, zero code changes)
https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/openai/chat/completions

Everything else — the Authorization header, the request body, the response format — stays identical. Your OpenAI SDK, your Anthropic client, your Groq calls: none of them need to be rewritten.

This is the key architectural insight. Most LLM proxy solutions require you to adopt a new SDK or change how you structure requests. AI Gateway works at the HTTP level, so it is completely transparent to your existing code.


Why This Matters for Production

Before diving into the setup, it is worth being specific about the problems AI Gateway solves.

Unpredictable costs. Without visibility into per-user or per-feature AI usage, it is hard to catch runaway spend before the bill arrives. AI Gateway's dashboard shows per-gateway token counts, costs, and request volumes in real time.

No caching. Identical prompts — think FAQ chatbots, document summarization of the same input, templated code generation — waste money by hitting the API every time. AI Gateway caches based on exact prompt match, so repeated calls return instantly from cache at zero token cost.

Single point of failure. If your sole AI provider has an outage, your product is down. The Universal Endpoint lets you configure a fallback to a second provider that activates automatically when the primary fails.

No rate control. Without limits, a single misbehaving client or a burst of traffic can exhaust your daily quota. AI Gateway's rate limiting controls requests per window at the gateway level, before they reach the upstream provider.

No audit trail. When something goes wrong — a bad generation, an unexpected cost spike, a compliance question — you need logs. AI Gateway stores every request and response for review in the dashboard.


Step 1: Create Your Gateway

Setting up AI Gateway takes about two minutes in the Cloudflare dashboard.

  1. Log in to dash.cloudflare.com and select your account.
  2. Navigate to AI → AI Gateway.
  3. Click Create Gateway, enter a name (e.g. my-ai-gateway), and click Create.

Your gateway is now active. Cloudflare assigns it an ID based on the name you chose. The full endpoint base URL follows this pattern:

https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/

You can find your Account ID in the Cloudflare dashboard sidebar or via npx wrangler whoami after authenticating.
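If you want to keep the pattern in one place, you can compose the base URL once and append provider routes to it. A minimal JavaScript sketch; the CF_ACCOUNT_ID and CF_GATEWAY_ID environment variable names are assumptions, not names Cloudflare requires:

// Hypothetical env var names; substitute your real Account ID and gateway ID
const CF_ACCOUNT_ID = process.env.CF_ACCOUNT_ID;
const CF_GATEWAY_ID = process.env.CF_GATEWAY_ID; // e.g. "my-ai-gateway"

// Every provider route hangs off this base
const GATEWAY_BASE = `https://gateway.ai.cloudflare.com/v1/${CF_ACCOUNT_ID}/${CF_GATEWAY_ID}`;
const OPENAI_VIA_GATEWAY = `${GATEWAY_BASE}/openai`; // used in Step 2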


Step 2: Route Provider Calls Through the Gateway

The URL substitution pattern is consistent across all supported providers. Replace the provider's base URL with the Cloudflare gateway equivalent:

| Provider | Original Base URL | Via AI Gateway |
| --- | --- | --- |
| OpenAI | api.openai.com/v1 | gateway.ai.cloudflare.com/v1/{acct}/{gw}/openai/ |
| Anthropic | api.anthropic.com | gateway.ai.cloudflare.com/v1/{acct}/{gw}/anthropic/ |
| Groq | api.groq.com/openai/v1 | gateway.ai.cloudflare.com/v1/{acct}/{gw}/groq/ |
| Google AI Studio | generativelanguage.googleapis.com | gateway.ai.cloudflare.com/v1/{acct}/{gw}/google-ai-studio/ |
| Replicate | api.replicate.com/v1 | gateway.ai.cloudflare.com/v1/{acct}/{gw}/replicate/ |
| Workers AI | api.cloudflare.com/client/v4/accounts/{id}/ai/run | gateway.ai.cloudflare.com/v1/{acct}/{gw}/workers-ai/ |

For most SDK clients, this is a one-line change to a base URL constant. The OpenAI Python SDK accepts a base_url parameter:

from openai import OpenAI

client = OpenAI(
    api_key="your_openai_api_key",
    base_url="https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/openai/"
)

# Everything else stays the same
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain Cloudflare AI Gateway"}]
)
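The same one-line change works from Node. A minimal sketch with the openai JavaScript SDK, where the baseURL option plays the role of base_url (placeholders assumed filled in):

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  // Only the base URL changes; requests and responses stay identical
  baseURL: "https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/openai/",
});

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Explain Cloudflare AI Gateway" }],
});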

All requests now appear in your AI Gateway dashboard with full logs, latency breakdowns, and cost tracking.


Step 3: Enable Caching

Caching is where AI Gateway pays for itself fastest on steady-state workloads. When two requests arrive with identical prompts, the second one is served from cache — zero tokens consumed, near-zero latency.

To enable caching:

  1. In the Cloudflare dashboard, go to AI Gateway → Settings.
  2. Enable Cache Responses.
  3. Set the default TTL (time-to-live) for cached responses. A value of 3600 (one hour) works well for most use cases.

You can also control caching per request by passing the cf-aig-cache-ttl header:

curl https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/openai/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "cf-aig-cache-ttl: 86400" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "What is 2+2?"}]}'

Setting cf-aig-skip-cache: true bypasses caching for requests where you always need a fresh response — for example, when generating personalized content based on real-time user state.
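Here is a minimal Worker-side sketch combining both behaviors. The cf-aig-cache-status response header used to check for a hit is an assumption based on how the gateway reports cache results; verify the header name against your own gateway logs:

// Sketch: bypass the cache for a personalized request, then inspect cache status
const res = await fetch(gatewayUrl, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
    "cf-aig-skip-cache": "true", // always fetch a fresh response
  },
  body: JSON.stringify({ model: "gpt-4o-mini", messages }),
});

// Assumed header; reports whether the response came from cache
console.log(res.headers.get("cf-aig-cache-status")); // e.g. "HIT", "MISS", or null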

When caching delivers the most value:

  • FAQ chatbots with repeated questions
  • Document summarization pipelines where the same document is processed multiple times
  • Templated code generation with fixed prompts
  • Any batch job where inputs repeat across runs

Step 4: Add Rate Limiting

Rate limiting prevents cost spikes and abuse. AI Gateway supports both request-count limits and token-count limits per configurable time window.

In the dashboard under AI Gateway → Rate Limiting, you can set:

  • Requests per time window — for example, 100 requests per minute per gateway
  • Window type — sliding window (smoother enforcement) or fixed window (simpler billing periods)

For per-user rate limits, attach user identifiers as metadata via the cf-aig-metadata header. This does not enforce per-user limits directly at the gateway level, but it lets you filter and analyze per-user usage in the dashboard and enforce limits upstream in your Worker logic, as sketched after the example below.

// In your Cloudflare Worker
const response = await fetch(gatewayUrl, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
    "cf-aig-metadata": JSON.stringify({ userId: request.headers.get("x-user-id") }),
  },
  body: JSON.stringify({ model: "gpt-4o-mini", messages })
});
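For the enforcement half, a common pattern is to count requests per user in the Worker before the call ever reaches the gateway. A minimal sketch using Workers KV as the counter store; the RATE_KV binding name and the 60-requests-per-minute threshold are assumptions:

// Sketch: per-user rate limit enforced upstream of the gateway
async function allowRequest(env, userId) {
  const minute = Math.floor(Date.now() / 60_000); // fixed one-minute window
  const key = `rl:${userId}:${minute}`;
  const count = parseInt((await env.RATE_KV.get(key)) ?? "0", 10);
  if (count >= 60) return false; // assumed per-user limit
  await env.RATE_KV.put(key, String(count + 1), { expirationTtl: 120 });
  return true;
}

// Usage in fetch(): reject before any tokens are spent
// if (!(await allowRequest(env, userId))) return new Response("Too Many Requests", { status: 429 });

KV is eventually consistent, so this is an approximate limiter; Durable Objects are the usual choice when the count must be exact.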

Step 5: Configure Fallback and Retry

The most operationally significant feature of AI Gateway is automatic fallback via the Universal Endpoint. Instead of routing to a single provider, you define an ordered list of providers. If the first provider fails or times out, the gateway automatically tries the next one.

The Universal Endpoint lives at:

https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/

The request body uses a providers array that specifies the cascade:

const response = await fetch(universalEndpoint, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "cf-aig-retry-count": "3",
    "cf-aig-retry-backoff": "exponential",
    "cf-aig-retry-delay": "500",
  },
  body: JSON.stringify({
    providers: [
      {
        provider: "workers-ai",
        endpoint: "@cf/meta/llama-4-scout-17b-16e-instruct",
        headers: { Authorization: `Bearer ${CF_API_TOKEN}` },
        query: { messages }
      },
      {
        provider: "openai",
        endpoint: "chat/completions",
        headers: { Authorization: `Bearer ${OPENAI_API_KEY}` },
        query: { model: "gpt-4o-mini", messages }
      }
    ]
  })
});

In this setup, requests first go to Workers AI. If that fails — model unavailable, timeout, rate limit error — the gateway automatically retries up to three times with exponential backoff, then falls through to OpenAI. The calling code gets a single response regardless of which provider actually handled it.

Retry configuration options:

  • cf-aig-retry-count: 1–5 attempts
  • cf-aig-retry-backoff: constant, linear, or exponential
  • cf-aig-retry-delay: 100ms, 500ms, 1s, 2s, 3s, or 5s base delay

Step 6: Workers AI Binding (Cloudflare Workers Only)

If you are building on Cloudflare Workers, you get a tighter integration through the env.AI binding. This approach handles authentication automatically and supports the AI Gateway gateway option directly in the run call:

// wrangler.toml
// [ai]
// binding = "AI"

export default {
  async fetch(request, env) {
    const response = await env.AI.run(
      "@cf/meta/llama-4-scout-17b-16e-instruct",
      {
        messages: [
          { role: "system", content: "You are a helpful assistant." },
          { role: "user", content: "Summarize the key features of Cloudflare AI Gateway." }
        ]
      },
      {
        gateway: {
          id: "my-ai-gateway",
          skipCache: false,
          cacheTtl: 3600,
          metadata: { requestType: "summary" }
        }
      }
    );
    return Response.json(response);
  }
};

The gateway option in env.AI.run() automatically routes the inference call through your named gateway, with caching and logging active. Workers AI supports 190+ inference locations globally and covers Llama 4, Gemma 3, Qwen 3, Whisper, image generation, and embedding models.
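
For reference, the binding used above is declared in wrangler.toml. A minimal sketch; the worker name and compatibility date are placeholders:

# wrangler.toml
name = "ai-gateway-demo"            # placeholder
main = "src/index.js"
compatibility_date = "2024-09-01"   # placeholder

[ai]
binding = "AI"                      # exposes env.AI in the Worker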


Workers AI: What Runs Where

Workers AI is separate from AI Gateway but integrates naturally with it. It lets you run open-source models on Cloudflare's GPU infrastructure without provisioning your own servers.

The model catalog as of 2026 includes:

  • Text generation: Llama 4 Scout (17B 16E), Llama 3 (8B, 70B), Qwen 3.5 (17B active), Mistral 7B
  • Vision: LLaVA, Llama 3.2 Vision
  • Embeddings: BGE-Large-EN, Multilingual-E5
  • Image generation: Stable Diffusion XL, Flux
  • Speech: Whisper, TTS

Workers AI is billed per neuron (Cloudflare's unit of compute), with a free tier of 10,000 neurons per day on the Workers Free plan. For most development and light production workloads, this covers real usage.


Observability: What You Get in the Dashboard

Once traffic flows through AI Gateway, the dashboard shows:

  • Requests: total count, success rate, error rate, by provider
  • Latency: p50/p90/p99 for each provider and model
  • Costs: per-gateway and per-request estimated cost based on token counts
  • Cache hit rate: percentage of requests served from cache
  • Logs: per-request detail with full prompt and response (toggle storage per gateway)

The logs view is particularly useful for debugging. You can filter by provider, model, status code, or metadata fields, and inspect the exact request body and response for any logged call.

Data Loss Prevention (DLP) scanning is included on all plans, including free, as of 2026. It scans request and response bodies for PII patterns before they are logged.


AI Gateway vs LiteLLM vs OpenRouter

These three tools solve adjacent but different problems:

| Dimension | Cloudflare AI Gateway | LiteLLM | OpenRouter |
| --- | --- | --- | --- |
| Hosting | Cloudflare edge (managed) | Self-hosted (Python) | Managed SaaS |
| Model catalog | Proxy to any external provider | 100+ providers | 200+ hosted models |
| Setup time | ~2 minutes (URL change) | 15–30 minutes (Docker) | ~5 minutes (API key) |
| Caching | Yes (edge, exact match) | Yes (semantic via Redis) | No |
| Rate limiting | Yes (built-in) | Yes (configurable) | Limited |
| Cost observability | Yes (dashboard + DLP free) | Yes (Prometheus metrics) | Basic |
| Free tier | Yes (core features free) | Open-source | No (pay-per-token) |
| Best fit | Cloudflare-first stack | Self-hosted, Python infra | Fast model access, no infra |

The clearest decision rule: if you are already using Cloudflare Workers, Pages, or R2, AI Gateway integrates without friction. If you want a self-hosted router with the most flexible configuration and are comfortable running Python infrastructure, LiteLLM is the more powerful option. If you just need access to a large model catalog without any infrastructure concern, OpenRouter is the fastest path.


Pricing: What Is Actually Free

As of 2026, the following AI Gateway features are free on all Cloudflare plans:

  • Dashboard analytics and request logs
  • Exact-match response caching with configurable TTL
  • Rate limiting (requests per window)
  • DLP scanning
  • Model fallback via Universal Endpoint
  • Automatic retry

Paid features include:

  • Unified Billing: pay for OpenAI, Anthropic, and other providers through a single Cloudflare invoice. Adds a small transaction convenience fee.
  • Advanced log storage: extended retention beyond the default window
  • Custom cost tracking: assign per-token prices for models with non-standard pricing

Workers AI is billed separately based on neurons consumed, with the free tier covering 10,000 neurons per day.


FAQ

Q: Does AI Gateway add latency to my requests?

Cloudflare's edge network serves ~95% of the world's internet users within 50ms. For most deployments, the added latency from routing through AI Gateway is 1–5ms, well within the noise of LLM inference times (which are typically 500ms–10s). Cached responses return in single-digit milliseconds.

Q: Can I use AI Gateway without Cloudflare Workers?

Yes. AI Gateway works with any HTTP client from any server or platform. You only need the URL substitution pattern. Cloudflare Workers and the env.AI binding are optional — they offer a tighter integration but are not required.

Q: Does caching work for streaming responses?

Exact-match caching works for non-streaming responses. Streaming responses (SSE) can be proxied through AI Gateway for logging and observability, but streaming is not cached — each streaming request hits the upstream provider.
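
For completeness, streaming through the gateway looks the same as streaming against the provider directly. A minimal Node sketch reusing the client from Step 2, keeping in mind that each streamed call reaches the upstream provider:

// Streamed requests are proxied and logged, but never served from cache
const stream = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Summarize AI Gateway in one paragraph" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}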

Q: What happens to logs if I disable log storage?

If you disable log storage for a gateway, requests are still proxied and analytics (counts, latency, costs) are still tracked, but the per-request prompt and response content is not saved. Useful for sensitive workloads where you want observability without storing user data.

Q: Is there an SDK for AI Gateway configuration?

Yes. The Cloudflare Terraform provider and the REST API both support managing gateways programmatically. For most teams, dashboard configuration is sufficient. Infrastructure-as-code teams can manage gateways via POST https://api.cloudflare.com/client/v4/accounts/{account_id}/ai-gateway/gateways.
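
As a rough sketch of that API route, creating a gateway could look like the call below; the body fields shown (id, cache_ttl, collect_logs) are assumptions, so check the current API schema before relying on them:

# Sketch: create a gateway via the REST API (field names are assumptions)
curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/ai-gateway/gateways" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"id": "my-ai-gateway", "cache_ttl": 3600, "collect_logs": true}'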


Key Takeaways

  • Cloudflare AI Gateway is a zero-config edge proxy that adds caching, rate limiting, cost tracking, and logs to any AI provider call by changing one URL.
  • The core features — caching, rate limiting, DLP, fallback, analytics — are free on all Cloudflare plans.
  • The URL substitution pattern is consistent across 7+ providers: prepend gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}/ to any provider endpoint.
  • Workers AI integrates via an env.AI binding with native AI Gateway support, covering 190+ inference locations and a growing open-source model catalog.
  • The Universal Endpoint enables ordered fallback across providers with automatic retry, providing resilience against single-provider outages.
  • For teams already on Cloudflare's platform, AI Gateway adds production-grade observability with no infrastructure overhead.

Bottom Line

Cloudflare AI Gateway is the easiest path to production AI observability for teams already on the Cloudflare platform. The URL substitution pattern means zero code rewrites, the core features are genuinely free, and the edge-native caching can meaningfully cut costs on repeat-heavy workloads. If you are not on Cloudflare already, it is worth evaluating alongside LiteLLM — the self-hosted option with deeper routing control — depending on whether you prefer managed simplicity or maximum configurability.
