Cloudflare Pages + FastAPI: Building a Hybrid Edge/Origin Architecture for Sub-100ms AI Response Times
I've spent the last two years building CitizenApp's AI features—9 different ones, from document summarization to entity extraction. And I've made every mistake you can make with latency: cold starts on Vercel, origin thrashing on Render, users waiting 2–3 seconds for responses they'd already generated yesterday.
Last quarter, I implemented a hybrid edge/origin architecture using Cloudflare Workers (via Pages Functions) + FastAPI. The result? Sub-100ms responses for cached AI inferences, and a 70% reduction in origin load. But here's the thing: most SaaS devs treat Cloudflare as a dumb CDN. It's not. It's a compute platform. And if you're running a backend at all, you're leaving latency on the table by not pushing business logic to the edge.
This post walks through the exact pattern I use in production. It's not theoretical. It's the pattern that cut our p95 latency from 1.8s to 0.3s for repeated queries.
Why Edge Caching + Origin Validation Matters for AI Workloads
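AI responses are cacheable. If user A asks "summarize this contract" and gets a response, user B in the same tenant asking the same question about the same document can be served that same response. Model output isn't strictly deterministic, so what you reuse is a known-good answer rather than a guaranteed-identical one. That's a cache hit waiting to happen.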
But caching isn't just about speeding up repeated queries. It's about shifting compute cost. Every Claude API call costs money. Every call to your origin costs CPU. Every millisecond a user waits is a millisecond they're not using your product.
I prefer edge caching + origin fallback over:
- Pure origin caching (memcached/Redis): Every hit still pays the full round trip from the user to your origin, which adds 50–100ms even when the lookup itself is fast.
- CDN caching without validation: You can't validate JWTs or check permissions at the CDN layer without building custom logic.
- Client-side caching: Users expect to see fresh data across devices, and cache invalidation is a nightmare at scale.
Cloudflare Workers let you cache at the edge while validating permissions before serving. That's the magic.
The Architecture: Layers and Responsibilities
Here's how it works:
- Cloudflare Worker (Edge): Validates JWT, checks permissions, and serves from cache if valid.
- FastAPI Origin: Handles cache misses, calls Claude API, stores results in Redis with TTL.
- Redis: Distributed cache shared between origins (I use Render's managed Redis).
The flow:
User Request
↓
Cloudflare Worker (validate JWT, check permissions)
↓
Cache hit? → Serve immediately (< 10ms)
↓
Cache miss? → Forward to FastAPI origin
↓
FastAPI (call Claude, cache result)
↓
Return to edge, serve to user
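The flow above can also be read as a small decision function. This is only an illustrative sketch: `RequestState` and `route` are hypothetical names, not part of the production Worker, which performs the same checks inline.

```typescript
// Hypothetical model of the request flow; names are illustrative only.
type Route = 'reject-401' | 'reject-403' | 'serve-from-cache' | 'forward-to-origin';

interface RequestState {
  tokenValid: boolean;    // JWT signature and expiry checked at the edge
  tenantMatches: boolean; // token tenant equals the tenant in the request
  hasPermission: boolean; // e.g. the token carries `ai:summarize`
  cacheHit: boolean;      // an entry exists for the tenant + content hash
}

// Auth always runs before the cache lookup, so a cached response
// is never served to a caller who fails validation.
function route(state: RequestState): Route {
  if (!state.tokenValid) return 'reject-401';
  if (!state.tenantMatches || !state.hasPermission) return 'reject-403';
  return state.cacheHit ? 'serve-from-cache' : 'forward-to-origin';
}
```

The ordering is the point: a request with an invalid token resolves to a 401 even when a cached entry exists for it.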
Cloudflare Worker: Edge Logic
I use Cloudflare Pages Functions, which are Workers with a simpler deployment model. Here's the pattern:
// functions/api/summarize.ts
import { verify } from '@noble/ed25519';

const JWT_PUBLIC_KEY = 'your-public-key-base64';
const CACHE_TTL = 86400; // 24 hours

interface CacheEntry {
  response: {
    summary: string;
    model: string;
    tokens: number;
  };
  permissions: string[];
  timestamp: number;
}

// Bindings this function expects: a KV namespace and the origin's base URL
interface Env {
  CACHE_KV: KVNamespace;
  ORIGIN_URL: string;
}

export const onRequest: PagesFunction<Env> = async (context) => {
  const { request, env } = context;

  if (request.method !== 'POST') {
    return new Response('Method not allowed', { status: 405 });
  }

  // Parse request (malformed JSON becomes a 400 instead of an unhandled exception)
  let body: { document_id?: string; document_hash?: string; tenant_id?: string } | null;
  try {
    body = (await request.json()) as typeof body;
  } catch {
    return new Response('Invalid JSON', { status: 400 });
  }
  const { document_id, document_hash, tenant_id } = body ?? {};
  if (!document_id || !document_hash || !tenant_id) {
    return new Response('Missing required fields', { status: 400 });
  }

  // Extract and validate JWT
  const authHeader = request.headers.get('authorization');
  if (!authHeader?.startsWith('Bearer ')) {
    return new Response('Unauthorized', { status: 401 });
  }
  const token = authHeader.slice(7);

  let decoded: any;
  try {
    decoded = await validateJWTAtEdge(token, JWT_PUBLIC_KEY);
  } catch (e) {
    return new Response('Invalid token', { status: 401 });
  }

  // Permission check (this is the KEY insight)
  if (decoded.tenant_id !== tenant_id) {
    return new Response('Forbidden', { status: 403 });
  }

  // Check if user has `ai:summarize` permission
  if (!decoded.permissions?.includes('ai:summarize')) {
    return new Response('Forbidden', { status: 403 });
  }

  // Build cache key: includes tenant + document hash (not ID, hash)
  // Why hash? So we cache by content, not by document ID.
  const cacheKey = `summarize:${tenant_id}:${document_hash}`;

  // Try cache
  const cached = await env.CACHE_KV.get<CacheEntry>(cacheKey, 'json');
  if (cached) {
    return new Response(JSON.stringify(cached.response), {
      headers: { 'content-type': 'application/json', 'x-cache': 'hit' },
    });
  }

  // Cache miss: forward to origin
  const originUrl = new URL('/api/summarize', env.ORIGIN_URL);
  const originRequest = new Request(originUrl, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-forwarded-by': 'cloudflare-worker',
      'x-tenant-id': tenant_id,
      // Forward JWT for origin to re-verify (belt and suspenders)
      'authorization': authHeader,
    },
    body: JSON.stringify({
      document_id,
      document_hash,
      tenant_id,
    }),
  });

  const originResponse = await fetch(originRequest);
  if (!originResponse.ok) {
    // Pass origin errors through untouched and never cache them
    return originResponse;
  }

  const responseData = await originResponse.json();

  // Cache the result without delaying the response to the user
  context.waitUntil(
    env.CACHE_KV.put(
      cacheKey,
      JSON.stringify({
        response: responseData,
        permissions: decoded.permissions,
        timestamp: Date.now(),
      }),
      { expirationTtl: CACHE_TTL }
    )
  );

  return new Response(JSON.stringify(responseData), {
    headers: { 'content-type': 'application/json', 'x-cache': 'miss' },
  });
};
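async function validateJWTAtEdge(token: string, publicKey: string): Promise<any> {
  // For production, use a JWT library. This is simplified.
  // I prefer @noble/ed25519 because it's isomorphic (browser + edge).
  // Assumes v1.x, where verify() is async; v2 exposes this as verifyAsync().
  const parts = token.split('.');
  if (parts.length !== 3) throw new Error('Invalid JWT');
  // JWT segments are base64url, so map them to plain base64 before atob()
  const decode = (s: string) => atob(s.replace(/-/g, '+').replace(/_/g, '/'));
  const toBytes = (s: string) => Uint8Array.from(s, (c) => c.charCodeAt(0));
  const header = JSON.parse(decode(parts[0]));
  if (header.alg !== 'EdDSA') throw new Error('Unsupported algorithm');
  // Verify signature over the raw "header.payload" bytes (expensive, but you'll cache this)
  const signed = new TextEncoder().encode(`${parts[0]}.${parts[1]}`);
  const valid = await verify(toBytes(decode(parts[2])), signed, toBytes(atob(publicKey)));
  if (!valid) throw new Error('Invalid signature');
  const payload = JSON.parse(decode(parts[1]));
  // exp is in seconds since the epoch
  if (typeof payload.exp === 'number' && payload.exp * 1000 <= Date.now()) {
    throw new Error('Token expired');
  }
  // In production, I cache verified JWTs for 5 minutes
  return payload;
}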
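As an aside on the comment about caching verified JWTs for 5 minutes, here is one way that could look. This is a minimal sketch, not the production code: `VerifiedTokenCache` and `Claims` are hypothetical names, and it assumes the cache lives in Worker memory, so each isolate still verifies a token the first time it sees it.

```typescript
// Hypothetical per-isolate cache of verified JWT payloads.
interface Claims {
  tenant_id: string;
  permissions?: string[];
  exp?: number; // seconds since the epoch, per the JWT spec
}

const VERIFIED_TTL_MS = 5 * 60 * 1000; // the 5 minutes mentioned above

class VerifiedTokenCache {
  private entries = new Map<string, { claims: Claims; expiresAt: number }>();

  // `now` is injectable so the expiry logic can be tested without waiting.
  get(token: string, now: number = Date.now()): Claims | undefined {
    const entry = this.entries.get(token);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(token);
      return undefined;
    }
    return entry.claims;
  }

  set(token: string, claims: Claims, now: number = Date.now()): void {
    // Never keep a token past its own `exp`, even if that is under 5 minutes away.
    const tokenExpiry = claims.exp !== undefined ? claims.exp * 1000 : Infinity;
    this.entries.set(token, {
      claims,
      expiresAt: Math.min(now + VERIFIED_TTL_MS, tokenExpiry),
    });
  }
}
```

A handler would call `get(token)` first and fall back to full signature verification on a miss, then `set` the result. The map here is unbounded, so a real version would also want a size cap.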
Why this matters:
- Permission check at the edge: You never send unauthorized requests to your origin. This is crucial for multi-tenant SaaS. I burned myself once sending a request to origin, having it fail the permission check, and returning 403 to the user after 2 seconds of latency. Not great.
- Cache by content hash, not ID: If the same contract text is uploaded twice with different IDs within a tenant, the second upload gets a cache hit. Document IDs change; content doesn't.
- TTL at edge: 24 hours for summarization (contracts don't change often). You can make this configurable per feature.
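The last two points can be sketched as a pair of small helpers. These are hypothetical, not the production code: the `entity-extraction` feature name and its TTL are illustrative, and the validation assumes `document_hash` is a hex-encoded SHA-256 digest, which the post does not specify.

```typescript
// Hypothetical helpers; only the 24-hour summarization TTL comes from the post.
const DEFAULT_TTL_SECONDS = 86400; // 24 hours

const FEATURE_TTL_SECONDS = new Map<string, number>([
  ['summarize', 86400],         // contracts don't change often
  ['entity-extraction', 3600],  // assumed shorter TTL, for illustration only
]);

function ttlFor(feature: string): number {
  return FEATURE_TTL_SECONDS.get(feature) ?? DEFAULT_TTL_SECONDS;
}

// Assumption: document_hash is a hex SHA-256 digest (64 hex characters).
const SHA256_HEX = /^[0-9a-f]{64}$/;
const TENANT_ID = /^[A-Za-z0-9_-]+$/;

// The hash arrives in the request body, so unexpected characters are rejected
// to keep a caller from smuggling ':' into the key and hitting other entries.
function buildCacheKey(feature: string, tenantId: string, documentHash: string): string {
  if (!TENANT_ID.test(tenantId)) throw new Error('Invalid tenant id');
  const hash = documentHash.toLowerCase();
  if (!SHA256_HEX.test(hash)) throw new Error('Invalid document hash');
  return `${feature}:${tenantId}:${hash}`;
}
```

With these, the KV write would pass `buildCacheKey('summarize', tenant_id, document_hash)` as the key and `ttlFor('summarize')` as `expirationTtl`.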
FastAPI Origin: Thin and Fast
Your origin is now only responsible for cache misses and Claude API calls:
# app/routes/summarize.py
import redis.asyncio as redis  # assumes get_redis yields an asyncio Redis client
from fastapi import APIRouter, Depends, HTTPException

from app.clients.claude import claude_client
from app.db import db  # assumed module path for the database helper
from app.dependencies import get_current_user, get_redis
from app.schemas import SummarizeRequest, SummarizeResponse

router = APIRouter(prefix="/api", tags=["ai"])


@router.post("/summarize")
async def summarize(
    req: SummarizeRequest,
    current_user: dict = Depends(get_current_user),
    redis_client: redis.Redis = Depends(get_redis),
) -> SummarizeResponse:
    """
    Called only on cache misses from Cloudflare Worker.
    """
    # Verify tenant ownership (defense in depth)
    if current_user["tenant_id"] != req.tenant_id:
        raise HTTPException(status_code=403, detail="Forbidden")

    if "ai:summarize" not in current_user.get("permissions", []):
        raise HTTPException(status_code=403, detail="Forbidden")

    # Fetch document from DB
    doc = await db.fetch_document(req.document_id, req.tenant_id)
    if not doc:
        raise HTTPException(status_code=404, detail="Document not found")

    # Verify document hash (ensures we're processing the right version)
    if doc.content_hash != req.document_hash:
        raise HTTPException(status_code=409, detail="Document changed")

    # Call Claude (expensive operation)
    result = await claude_client.summarize(doc.content, doc.language)

    response = SummarizeResponse(
        # Assumes the wrapper returns the Anthropic Message object, as the
        # `.usage` access below implies, so the text is the first content block
        summary=result.content[0].text,
        model="claude-3-5-sonnet",
        tokens=result.usage.input_tokens + result.usage.output_tokens,
    )

    # Cache in Redis (for other origins and fallback)
    cache_key = f"summarize:{req.tenant_id}:{req.document_hash}"
    await redis_client.setex(cache_key, 86400, response.model_dump_json())

    return response
Why thin?
Your origin shouldn't do anything it doesn't have to. JWT validation? That's CPU better spent