Deploy Claude API on Cloudflare Workers: Edge-Native AI with Durable Objects and KV in TypeScript
Cloudflare Workers run in 300+ data centers — sub-10ms cold starts, no servers. Combine that with Claude's API and you get AI inference at the edge, right next to your users. This guide covers building a production-grade AI assistant on Workers: streaming responses, Durable Objects for per-user conversation state, KV for prompt caching, and rate limiting all bundled into one deploy.
Why Cloudflare Workers for Claude
Standard Node/Bun deployments for Claude integrations mean round trips from your app server to Anthropic's API. A user in Tokyo hitting your US-East server adds 200ms before Claude even responds. Workers collapse that to near-zero: your code executes in the Cloudflare PoP closest to the user.
The other win is cost. Workers are billed per request with a generous free tier (100k req/day). For an AI assistant serving intermittent requests, Workers beat a persistent server that idles.
Requirements:
- Cloudflare account (free tier works)
- Wrangler CLI v3+
- Node 18+
- Anthropic API key
Project Setup
npm create cloudflare@latest claude-edge-ai -- --type worker
cd claude-edge-ai
npm install @anthropic-ai/sdk
wrangler.toml configuration:
name = "claude-edge-ai"
main = "src/index.ts"
compatibility_date = "2024-09-23"
compatibility_flags = ["nodejs_compat"]
kv_namespaces = [
{ binding = "CACHE", id = "YOUR_KV_NAMESPACE_ID" }
]
[[durable_objects.bindings]]
name = "CONVERSATIONS"
class_name = "ConversationSession"
[[migrations]]
tag = "v1"
new_classes = ["ConversationSession"]
[vars]
ANTHROPIC_MODEL = "claude-sonnet-4-6"
MAX_TOKENS = "1024"
Set the API key as a secret (never in wrangler.toml):
wrangler secret put ANTHROPIC_API_KEY
Durable Object for Conversation State
Durable Objects give each user a single-threaded, stateful actor with persistent storage — perfect for conversation history without a database.
// src/conversation.ts
import { DurableObject } from "cloudflare:workers";
interface Message {
role: "user" | "assistant";
content: string;
}
export class ConversationSession extends DurableObject {
private messages: Message[] = [];
private readonly MAX_HISTORY = 20;
async fetch(request: Request): Promise<Response> {
const { action, content } = await request.json<{
action: "add" | "get" | "clear";
content?: string;
role?: "user" | "assistant";
}>();
if (action === "add") {
const { role = "user" } = await request.json<any>();
this.messages.push({ role, content: content! });
// Keep last N messages to stay within context window
if (this.messages.length > this.MAX_HISTORY) {
this.messages = this.messages.slice(-this.MAX_HISTORY);
}
await this.ctx.storage.put("messages", this.messages);
return new Response("ok");
}
if (action === "get") {
const stored = await this.ctx.storage.get<Message[]>("messages");
this.messages = stored ?? [];
return Response.json(this.messages);
}
if (action === "clear") {
this.messages = [];
await this.ctx.storage.delete("messages");
return new Response("cleared");
}
return new Response("unknown action", { status: 400 });
}
}
The Durable Object persists across requests for the same session ID. No Redis, no external DB — state lives in Cloudflare's infrastructure.
Main Worker with Streaming
// src/index.ts
import Anthropic from "@anthropic-ai/sdk";
import { ConversationSession } from "./conversation";
export { ConversationSession };
interface Env {
ANTHROPIC_API_KEY: string;
ANTHROPIC_MODEL: string;
MAX_TOKENS: string;
CACHE: KVNamespace;
CONVERSATIONS: DurableObjectNamespace;
}
// Rate limiting via KV — lightweight leaky bucket
async function checkRateLimit(
env: Env,
userId: string,
maxPerMinute = 20
): Promise<boolean> {
const key = `ratelimit:${userId}:${Math.floor(Date.now() / 60000)}`;
const current = Number((await env.CACHE.get(key)) ?? "0");
if (current >= maxPerMinute) return false;
await env.CACHE.put(key, String(current + 1), { expirationTtl: 120 });
return true;
}
// KV-based prompt response cache for identical queries
async function getCachedResponse(
env: Env,
prompt: string
): Promise<string | null> {
const hash = await crypto.subtle.digest(
"SHA-256",
new TextEncoder().encode(prompt)
);
const key = `cache:${btoa(String.fromCharCode(...new Uint8Array(hash))).slice(0, 32)}`;
return env.CACHE.get(key);
}
async function setCachedResponse(
env: Env,
prompt: string,
response: string
): Promise<void> {
const hash = await crypto.subtle.digest(
"SHA-256",
new TextEncoder().encode(prompt)
);
const key = `cache:${btoa(String.fromCharCode(...new Uint8Array(hash))).slice(0, 32)}`;
// Cache for 1 hour — adjust for your use case
await env.CACHE.put(key, response, { expirationTtl: 3600 });
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
// CORS preflight
if (request.method === "OPTIONS") {
return new Response(null, {
headers: {
"Access-Control-Allow-Origin": "*",
"Access-Control-Allow-Methods": "POST, OPTIONS",
"Access-Control-Allow-Headers": "Content-Type, X-Session-ID",
},
});
}
if (request.method !== "POST" || new URL(request.url).pathname !== "/chat") {
return new Response("Not found", { status: 404 });
}
const sessionId = request.headers.get("X-Session-ID") ?? "anonymous";
const userId = request.headers.get("CF-Connecting-IP") ?? sessionId;
// Rate limit check
const allowed = await checkRateLimit(env, userId);
if (!allowed) {
return new Response(
JSON.stringify({ error: "Rate limit exceeded. Try again in a minute." }),
{ status: 429, headers: { "Content-Type": "application/json" } }
);
}
let body: { message: string; stream?: boolean };
try {
body = await request.json();
} catch {
return new Response(JSON.stringify({ error: "Invalid JSON" }), {
status: 400,
headers: { "Content-Type": "application/json" },
});
}
const { message, stream = true } = body;
if (!message?.trim()) {
return new Response(JSON.stringify({ error: "Message required" }), {
status: 400,
headers: { "Content-Type": "application/json" },
});
}
// Get conversation history from Durable Object
const doId = env.CONVERSATIONS.idFromName(sessionId);
const doStub = env.CONVERSATIONS.get(doId);
const historyResp = await doStub.fetch(
new Request("http://do/history", {
method: "POST",
body: JSON.stringify({ action: "get" }),
})
);
const history = await historyResp.json<Array<{ role: string; content: string }>>();
const client = new Anthropic({ apiKey: env.ANTHROPIC_API_KEY });
// For non-streaming: check cache first
if (!stream) {
const cached = await getCachedResponse(env, message);
if (cached) {
return Response.json({ response: cached, cached: true });
}
}
const messages = [
...history,
{ role: "user" as const, content: message },
];
if (stream) {
// Server-Sent Events streaming response
const { readable, writable } = new TransformStream();
const writer = writable.getWriter();
const encoder = new TextEncoder();
// Kick off streaming in background (Workers support this pattern)
(async () => {
let fullResponse = "";
try {
const stream = client.messages.stream({
model: env.ANTHROPIC_MODEL ?? "claude-sonnet-4-6",
max_tokens: Number(env.MAX_TOKENS ?? 1024),
system:
"You are a helpful AI assistant. Be concise and accurate.",
messages,
});
for await (const chunk of stream) {
if (
chunk.type === "content_block_delta" &&
chunk.delta.type === "text_delta"
) {
const text = chunk.delta.text;
fullResponse += text;
await writer.write(
encoder.encode(`data: ${JSON.stringify({ text })}\n\n`)
);
}
}
await writer.write(
encoder.encode(`data: ${JSON.stringify({ done: true })}\n\n`)
);
// Persist to conversation history
await doStub.fetch(
new Request("http://do/add", {
method: "POST",
body: JSON.stringify({ action: "add", role: "user", content: message }),
})
);
await doStub.fetch(
new Request("http://do/add", {
method: "POST",
body: JSON.stringify({
action: "add",
role: "assistant",
content: fullResponse,
}),
})
);
} catch (err) {
await writer.write(
encoder.encode(
`data: ${JSON.stringify({ error: "Stream error" })}\n\n`
)
);
} finally {
await writer.close();
}
})();
return new Response(readable, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
"Access-Control-Allow-Origin": "*",
},
});
} else {
// Non-streaming: await full response, cache it
const response = await client.messages.create({
model: env.ANTHROPIC_MODEL ?? "claude-sonnet-4-6",
max_tokens: Number(env.MAX_TOKENS ?? 1024),
system: "You are a helpful AI assistant. Be concise and accurate.",
messages,
});
const text =
response.content[0].type === "text" ? response.content[0].text : "";
await setCachedResponse(env, message, text);
// Persist exchange
await doStub.fetch(
new Request("http://do/add", {
method: "POST",
body: JSON.stringify({ action: "add", role: "user", content: message }),
})
);
await doStub.fetch(
new Request("http://do/add", {
method: "POST",
body: JSON.stringify({ action: "add", role: "assistant", content: text }),
})
);
return Response.json({ response: text, cached: false });
}
},
};
Client-Side Streaming Consumer
// Example: consume the SSE stream from a React component
async function chat(message: string, sessionId: string) {
const response = await fetch("https://claude-edge-ai.YOUR_SUBDOMAIN.workers.dev/chat", {
method: "POST",
headers: {
"Content-Type": "application/json",
"X-Session-ID": sessionId,
},
body: JSON.stringify({ message, stream: true }),
});
const reader = response.body!.getReader();
const decoder = new TextDecoder();
while (true) {
const { value, done } = await reader.read();
if (done) break;
const lines = decoder.decode(value).split("\n");
for (const line of lines) {
if (!line.startsWith("data: ")) continue;
const data = JSON.parse(line.slice(6));
if (data.text) process.stdout.write(data.text); // or update React state
if (data.done) break;
}
}
}
Deploy
# Create KV namespace
wrangler kv:namespace create CACHE
# Update wrangler.toml with the returned namespace ID, then:
wrangler deploy
Zero config, zero servers. Your Claude integration is now live on Cloudflare's global edge.
Performance Characteristics
On a production Workers deployment with this setup:
- P50 TTFB: ~80ms (user in same region as Cloudflare PoP)
- P99 TTFB: ~250ms (cross-PoP, Anthropic API latency dominates)
- Cold start: <5ms (Workers JS runtime, not V8 sandbox warm-up bottleneck)
- Cache hit: ~15ms end-to-end (KV read + response serialization)
The Anthropic API itself introduces 200–800ms for first-token depending on model and load. The edge deployment can't eliminate that, but it eliminates your infrastructure's contribution to the latency stack.
Adding Prompt Caching (Anthropic-Side)
For long system prompts that don't change per request, use Anthropic's prompt caching to cut costs by up to 90% and reduce TTFT:
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
system: [
{
type: "text",
text: "You are a helpful AI assistant for AcmeCorp. Here is our full product catalog and FAQ: [... large static content ...]",
cache_control: { type: "ephemeral" },
},
],
messages,
});
The first call creates the cache entry. Subsequent calls with the identical cached block hit it — you pay input cache read pricing (~10% of standard input) instead of full input pricing.
Shipping Faster
Building AI SaaS products with Cloudflare Workers + Claude? The AI SaaS Starter Kit ($99) includes a production-ready Workers template with Durable Objects, KV caching, Stripe billing integration, and a Next.js frontend already wired to your edge AI endpoint.
Need the Claude API patterns without the infrastructure boilerplate? The Ship Fast Skill Pack ($49) bundles everything from streaming SSE to prompt caching to multi-turn conversation management as reusable Claude Code skills.
For automated workflows that call your edge AI on a schedule or in response to events, see the Workflow Automator MCP ($15/mo) — connects n8n, Zapier, and Make.com to your Cloudflare Worker endpoints.
Tags: cloudflare workers, claude api, typescript, edge computing, durable objects, ai, streaming, serverless
Building your own multi-agent system? The full source code — orchestration, skills, and automation scripts — is open-sourced at https://github.com/Wh0FF24/whoff-agents.
Top comments (0)