Rakesh Roushan

Building Multi-Tenant AI Middleware on Cloudflare's Edge (No GPUs Required)

The Problem: AI APIs Are Expensive to Self-Host

Every AI startup eventually hits the same wall: you need to serve AI models to multiple customers, each with their own rate limits, usage tracking, and billing. The typical solution? Spin up GPU servers, add a reverse proxy, bolt on auth, and pray your $10K/month infrastructure bill doesn't explode.

I took a different approach. AgentFlare runs entirely on Cloudflare Workers — no GPU servers, no containers, no Kubernetes. Just edge-deployed TypeScript serving 50+ AI models with sub-50ms latency globally.

Here's how I built it and the architecture decisions that made it work.

Architecture Overview

AgentFlare is a multi-tenant AI middleware platform. Think of it as an OpenAI-compatible API gateway that runs on Cloudflare's edge network. The stack:

  • Runtime: Cloudflare Workers (V8 isolates, not containers)
  • Framework: Hono (lightweight, edge-native)
  • Database: D1 (SQLite at the edge)
  • State: Durable Objects (rate limiting, agent memory)
  • Queues: Cloudflare Queues (async usage processing)
  • AI: Workers AI (50+ models, no GPU management)

The key insight: Cloudflare already runs the GPUs. Workers AI gives you access to Llama, Mistral, FLUX, Whisper, and dozens more — all via a binding, not an HTTP call. Your code runs in the same data center as the model.
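
To make that concrete, here's roughly what a binding call looks like from a Hono route. This is an illustrative sketch, not AgentFlare's actual handler: the AI binding name comes from your wrangler config, and I'm typing the binding inline rather than pulling in @cloudflare/workers-types.

import { Hono } from 'hono';

// Minimal binding type for the sketch; in a real project this comes from @cloudflare/workers-types
type Bindings = {
  AI: { run(model: string, inputs: Record<string, unknown>): Promise<{ response?: string }> };
};

const app = new Hono<{ Bindings: Bindings }>();

app.post('/v1/chat/completions', async (c) => {
  const body = await c.req.json<{ messages: { role: string; content: string }[] }>();

  // The model runs in the same data center as this Worker: no outbound HTTP hop
  const result = await c.env.AI.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
    messages: body.messages,
  });

  return c.json(result);
});

export default app;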

The OpenAI-Compatible API Layer

The first design decision: make every endpoint OpenAI-compatible. This means any app using the OpenAI SDK works with AgentFlare by swapping the API key and base URL:

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'oi_your_agentflare_key',
  baseURL: 'https://api.agentflare.dev/v1',
});

const response = await client.chat.completions.create({
  model: 'llama-3.3-70b-fast',
  messages: [{ role: 'user', content: 'Explain edge computing' }],
});

Under the hood, the inference service maps friendly model IDs to Cloudflare's internal identifiers:

const MODEL_MAP: Record<string, string> = {
  'llama-3.3-70b-fast': '@cf/meta/llama-3.3-70b-instruct-fp8-fast',
  'flux-1-schnell': '@cf/black-forest-labs/flux-1-schnell',
  'whisper': '@cf/openai/whisper',
  'qwen-3-30b': '@cf/qwen/qwen3-30b-a3b-fp8',
  // 50+ more models...
};

This abstraction layer means I can swap providers, add new models, or route traffic between models without breaking any client integrations.
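
As a rough sketch of what the translation layer does (the function name and binding interface are mine, not AgentFlare's actual code): resolve the friendly ID, run inference, and shape the result as an OpenAI-style chat completion so existing SDKs can parse it.

// Minimal binding type for the sketch
interface AiBinding {
  run(model: string, inputs: Record<string, unknown>): Promise<{ response?: string }>;
}

async function handleChatCompletion(
  ai: AiBinding,
  body: { model: string; messages: { role: string; content: string }[] }
): Promise<Response> {
  const cfModel = MODEL_MAP[body.model];
  if (!cfModel) {
    return Response.json(
      { error: { message: `Unknown model: ${body.model}`, type: 'invalid_request_error' } },
      { status: 400 }
    );
  }

  const result = await ai.run(cfModel, { messages: body.messages });

  // Shape the output like an OpenAI chat completion so existing SDKs parse it
  return Response.json({
    id: crypto.randomUUID(),
    object: 'chat.completion',
    model: body.model,
    choices: [
      {
        index: 0,
        message: { role: 'assistant', content: result.response ?? '' },
        finish_reason: 'stop',
      },
    ],
  });
}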

Rate Limiting with Durable Objects

This is where it gets interesting. Traditional rate limiters use Redis or in-memory counters — both fail at the edge because there's no shared state between Workers in different data centers.

Durable Objects solve this. Each rate limiter is a single JavaScript object with a globally unique ID, running in exactly one location. Every request to that tenant routes to the same Durable Object:

import { DurableObject } from 'cloudflare:workers';

interface RateLimitState {
  requests: number[];   // timestamps of requests in the current window
  windowStart: number;
}

export class RateLimiterDurableObject extends DurableObject {
  private windowMs = 60_000; // 1 minute window

  async checkLimit(limit: number): Promise<{
    allowed: boolean;
    remaining: number;
    resetAt: number;
  }> {
    const now = Date.now();
    let stored = await this.ctx.storage.get<RateLimitState>('state');

    // Window expired (or first request): start a fresh window
    if (!stored || now - stored.windowStart >= this.windowMs) {
      stored = { requests: [], windowStart: now };
    }

    // Drop timestamps that have aged out of the window
    const active = stored.requests.filter(ts => now - ts < this.windowMs);

    if (active.length >= limit) {
      return { allowed: false, remaining: 0, resetAt: stored.windowStart + this.windowMs };
    }

    active.push(now);
    await this.ctx.storage.put('state', { requests: active, windowStart: stored.windowStart });

    return {
      allowed: true,
      remaining: limit - active.length,
      resetAt: stored.windowStart + this.windowMs,
    };
  }
}

The beauty: zero coordination overhead. No distributed locks, no consensus protocols, no Redis cluster. The Durable Object is the lock — it's a single-threaded actor with built-in persistence.
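
On the calling side, the Worker resolves the tenant to its Durable Object and calls the method over the RPC-style stub. A sketch, assuming a RATE_LIMITER binding and a helper name of my own invention:

async function enforceRateLimit(
  ns: DurableObjectNamespace<RateLimiterDurableObject>,
  tenantId: string,
  limit: number
): Promise<Response | null> {
  // idFromName is deterministic: every request for this tenant hits the same object
  const stub = ns.get(ns.idFromName(tenantId));
  const result = await stub.checkLimit(limit);

  if (!result.allowed) {
    return new Response('Rate limit exceeded', {
      status: 429,
      headers: {
        'Retry-After': String(Math.ceil((result.resetAt - Date.now()) / 1000)),
        'X-RateLimit-Remaining': '0',
      },
    });
  }

  return null; // within limits, let the request through
}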

Pre-Flight Metering: Don't Let Users Overdraw

One of the trickiest problems in multi-tenant AI: a user with $0.50 remaining sends a request that costs $2.00. By the time inference finishes, they've already overspent.

AgentFlare solves this with pre-flight metering — estimating cost before inference and reserving balance upfront:

function estimateCost(requestType: string, body?: RequestBody): number {
  let estimatedTokens = ESTIMATED_TOKENS[requestType] || 2000;

  if (body?.max_tokens) {
    estimatedTokens = Math.max(estimatedTokens, body.max_tokens);
  }

  // Add input tokens from message content
  if (body?.messages) {
    const inputChars = body.messages.reduce(
      (sum, m) => sum + String(m.content).length, 0
    );
    estimatedTokens += Math.ceil(inputChars / 4);
  }

  const tokenCost = (estimatedTokens / 1_000_000) * COST_PER_MILLION_TOKENS;
  return Math.max(tokenCost, COST_PER_ACTION);
}

After inference completes, we calculate the actual cost and reconcile the difference against the reserved balance asynchronously via Queues. Overcharged? We refund. Undercharged? We debit the remainder. The user never notices; they just see accurate billing.
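
A minimal sketch of that reconciliation path, assuming a USAGE_QUEUE binding, a D1 tenants table with a balance column, and a hypothetical adjustBalance helper (none of these are AgentFlare's actual names):

interface UsageEvent {
  tenantId: string;
  estimatedCost: number; // reserved before inference
  actualCost: number;    // measured after inference
}

// Producer: after inference, push the event to a queue instead of blocking the response
async function recordUsage(queue: Queue<UsageEvent>, event: UsageEvent): Promise<void> {
  await queue.send(event);
}

// Hypothetical helper: debit (positive delta) or refund (negative delta) the tenant in D1
async function adjustBalance(db: D1Database, tenantId: string, delta: number): Promise<void> {
  await db
    .prepare('UPDATE tenants SET balance = balance - ? WHERE id = ?')
    .bind(delta, tenantId)
    .run();
}

// Consumer: reconcile reserved balances against real costs in batches
export default {
  async queue(batch: MessageBatch<UsageEvent>, env: { DB: D1Database }): Promise<void> {
    for (const msg of batch.messages) {
      const { tenantId, estimatedCost, actualCost } = msg.body;
      await adjustBalance(env.DB, tenantId, actualCost - estimatedCost);
      msg.ack();
    }
  },
};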

Stateful AI Agents via Durable Objects

Most AI APIs are stateless — every request starts fresh. AgentFlare supports stateful agents where conversation history persists across requests using Durable Objects:

export class AgentDurableObject extends DurableObject {
  async addMessage(role: string, content: string): Promise<void> {
    const state = await this.getState();
    state.conversationHistory.push({ role, content, timestamp: Date.now() });

    // Rolling window: keep last 100 messages
    if (state.conversationHistory.length > 100) {
      state.conversationHistory = state.conversationHistory.slice(-100);
    }

    await this.ctx.storage.put('state', state);
  }

  // WebSocket support for real-time streaming
  async fetch(request: Request): Promise<Response> {
    if (request.headers.get('Upgrade') === 'websocket') {
      const pair = new WebSocketPair();
      this.ctx.acceptWebSocket(pair[1]);
      return new Response(null, { status: 101, webSocket: pair[0] });
    }
    // REST API fallback...
  }
}

Each agent gets its own Durable Object instance with persistent storage. WebSocket support means clients can stream responses in real-time without polling.
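
Routing to an agent is a one-liner in the Worker: the agent ID maps deterministically to a Durable Object, and the request (WebSocket upgrade included) is forwarded to its fetch() handler. A sketch with an assumed AGENT binding name and route path:

import { Hono } from 'hono';

type Bindings = { AGENT: DurableObjectNamespace };

const agents = new Hono<{ Bindings: Bindings }>();

agents.all('/v1/agents/:agentId/*', async (c) => {
  // Same agent ID, same Durable Object, same conversation history
  const stub = c.env.AGENT.get(c.env.AGENT.idFromName(c.req.param('agentId')));

  // Forward the request untouched; an Upgrade: websocket header reaches the agent's fetch()
  return stub.fetch(c.req.raw);
});

export default agents;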

Neuron-Based Cost Calculation

Cloudflare prices Workers AI in "neurons", an abstraction over GPU compute time. Different models consume neurons at different rates per million input and output tokens:

const NEURON_PRICING: Record<string, { input: number; output: number }> = {
  '@cf/meta/llama-3.2-1b-instruct':  { input: 2457,  output: 18252  },
  '@cf/meta/llama-3.3-70b-instruct': { input: 26668, output: 204805 },
  '@cf/qwen/qwq-32b':                { input: 60000, output: 90909  },
};

const NEURON_COST_PER_1K = 0.011; // $0.011 per 1,000 neurons

AgentFlare exposes this to tenants transparently. Each response includes X-Balance-Remaining and X-Estimated-Cost headers, so clients can implement their own budget controls.
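
The conversion itself is simple arithmetic. A sketch using the table above (the function name is illustrative):

function neuronCost(model: string, inputTokens: number, outputTokens: number): number {
  const rates = NEURON_PRICING[model];
  if (!rates) return 0;

  // Neurons consumed = (tokens / 1M) * neurons-per-million-tokens, per direction
  const neurons =
    (inputTokens / 1_000_000) * rates.input +
    (outputTokens / 1_000_000) * rates.output;

  return (neurons / 1000) * NEURON_COST_PER_1K; // dollars
}

// Example: 500 input + 1,000 output tokens on Llama 3.3 70B
// ≈ 13.3 + 204.8 ≈ 218.1 neurons ≈ $0.0024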

What I'd Do Differently

1. Token estimation is lossy. The chars / 4 heuristic works for English but breaks for CJK text and code. I'd integrate a proper tokenizer (tiktoken-wasm) if I rebuilt this.

2. Fail-closed metering is a tradeoff. If the UsageCounter Durable Object is unreachable, AgentFlare blocks the request (HTTP 503). For higher availability, I'd add a soft fallback that logs and allows the request, then reconciles later.

3. AI Gateway from day one. I initially used direct Workers AI bindings everywhere. Adding AI Gateway later for caching, fallbacks, and analytics required refactoring the inference service. Start with Gateway — the latency overhead is negligible and the observability is worth it.
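
For reference, a sketch of what routing through the Gateway looks like via the binding's options argument. The gateway ID is illustrative (you create it in the Cloudflare dashboard), and I'm typing the binding inline for the example:

interface GatewayAiBinding {
  run(
    model: string,
    inputs: Record<string, unknown>,
    options?: { gateway?: { id: string; cacheTtl?: number; skipCache?: boolean } }
  ): Promise<unknown>;
}

async function runWithGateway(
  ai: GatewayAiBinding,
  messages: { role: string; content: string }[]
) {
  return ai.run(
    '@cf/meta/llama-3.3-70b-instruct-fp8-fast',
    { messages },
    {
      gateway: {
        id: 'agentflare-gateway', // illustrative gateway name
        cacheTtl: 3600,           // cache identical prompts for an hour
      },
    }
  );
}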

Numbers

  • Cold start: ~5ms (V8 isolates, not containers)
  • Auth + rate limit check: ~2ms (KV cache hit) / ~15ms (D1 lookup)
  • Inference latency: Model-dependent (Llama 3.1 8B: ~200ms for short responses)
  • Global edge locations: 300+
  • Monthly cost for the platform itself: ~$5 (Workers Paid plan) + AI usage

The platform itself is essentially free to run; you pay for AI inference only when tenants use it.

Try It

If you're building multi-tenant AI products and tired of managing GPU infrastructure, this architecture might save you months of work. The entire pattern runs on Cloudflare Workers — deploy it on your account and you have a production-ready AI API in under an hour.


Shipping daily. Follow for more: @BuildWithRakesh
