Juan Torchia

Posted on • Originally published at juanchi.dev

Cloudflare as an Inference Layer for Agents: What It Promises and What Worries Me

There's a belief baked into the dev community that distributing AI inference close to the user is, by definition, good. More speed, less latency, better experience. And yeah, in the abstract it makes sense. The problem is that "distributed" and "decentralized" are not synonyms — there's a massive difference between the two that's getting completely lost in all the excitement around Cloudflare AI Platform.

When something runs across 300 PoPs around the world but everything flows through a single company, with a single usage policy, a single billing relationship, and a single corporate decision that can change the rules of the game overnight... that's not distribution. That's centralization with better latency.

And before you tell me I'm being paranoid: remember that we already talked about the opacity in token usage across AI tools. The pattern keeps repeating.

What Cloudflare Is Actually Betting On with Its AI Platform for Agents and Inference

Cloudflare Workers AI isn't new. They've had edge inference for a while now — models like Llama, Mistral, Phi running in their distributed data centers, accessible through a simple API from a Worker. The technical proposition is real and well-executed.

But what changed in the last few months is the focus. Cloudflare stopped talking about "AI inference" in general and started talking specifically about agents. And that changes the entire analysis.

The architecture they're pushing has a few concrete pieces:

Workers AI — The inference engine itself. Models running at the edge, close to the user, with latencies that are genuinely impressive in some cases.

Durable Objects — The mechanism for maintaining state between calls. If an agent needs to remember what it did in the previous step, that memory lives here.

Queues + Workflows — Async task orchestration. The agent fires off work, the work gets queued, another Worker processes it. Reasonably well thought out.

AI Gateway — The observability proxy. All AI traffic flows through here: logging, rate limiting, response caching, cost control.

On paper, it's a complete platform for building agentic systems. And what strikes me most is that it solves a real problem: right now, if you want to build an agent with persistent state, retry logic, and decent observability, you're duct-taping four different services from four different vendors together. Cloudflare offers that integrated.

```typescript
// A basic agent running on Cloudflare Workers
// The simplicity is real — and that's part of the problem too
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Inference runs at the edge, close to the user
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        {
          role: 'system',
          // Agent context lives here
          content: 'You are an agent that helps with code analysis'
        },
        {
          role: 'user',
          content: await request.text()
        }
      ],
      // Token control — important for costs
      max_tokens: 1024
    })

    return Response.json(response)
  }
}
```
```typescript
// Durable Object to maintain agent state across turns
export class AgentWithMemory implements DurableObject {
  private history: Array<{ role: string, content: string }> = []

  constructor(private state: DurableObjectState, private env: Env) {}

  async fetch(request: Request): Promise<Response> {
    const { message } = await request.json() as { message: string }

    // Retrieve persisted history (survives across requests)
    this.history =
      await this.state.storage.get<Array<{ role: string, content: string }>>('history') ?? []

    // Add the new message
    this.history.push({ role: 'user', content: message })

    // Inference with full context
    const response = await this.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: this.history
    })

    const responseText = (response as { response: string }).response
    this.history.push({ role: 'assistant', content: responseText })

    // Persist the updated history
    await this.state.storage.put('history', this.history)

    return Response.json({ response: responseText })
  }
}
```

This works. I tested it. The latency is noticeably better than hitting OpenAI from Buenos Aires. The DX is good. The problem isn't in the technical implementation.

The Gotchas Nobody Mentions When Talking About Cloudflare AI Platform for Agents

This is where I switch into reflective mode, because it connects to something I learned the hard way across 30 years of infrastructure work.

The pricing model gets opaque at scale. Workers AI has a generous free tier. But Durable Objects have their own billing. Queues too. AI Gateway too. When you assemble the complete stack for an agent in production, the real cost isn't the sum of the components — there are interactions between them that will surprise you. I've already talked about opacity in token consumption, and here the problem multiplies because you've got multiple resources billing in parallel.
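To see how the meters stack, here's a toy cost tally. Every rate below is a placeholder I made up for illustration, not Cloudflare's actual pricing (check their pricing pages for real numbers); the structural point is that one agent turn ticks several meters at once, so the metrics grow together.

```typescript
// Hypothetical per-component rates, NOT real Cloudflare prices.
const RATES = {
  workersAiPerKNeurons: 0.011, // placeholder
  doRequestPerMillion: 0.15,   // placeholder
  doStorageGbMonth: 0.2,       // placeholder
  queueOpsPerMillion: 0.4,     // placeholder
};

interface AgentMonthlyUsage {
  inferenceKNeurons: number;
  doRequests: number;
  doStorageGb: number;
  queueOps: number;
}

// One agent turn = inference + a DO read/write + possibly a queued task,
// so these inputs are correlated, not independent knobs.
function estimateMonthlyCost(u: AgentMonthlyUsage): number {
  return (
    u.inferenceKNeurons * RATES.workersAiPerKNeurons +
    (u.doRequests / 1e6) * RATES.doRequestPerMillion +
    u.doStorageGb * RATES.doStorageGbMonth +
    (u.queueOps / 1e6) * RATES.queueOpsPerMillion
  );
}
```

Double any one input and the bill moves less than you'd guess; double the number of agent turns and all four inputs move at once.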

Vendor lock-in with the flavor of an open platform. Workers look like standard JavaScript. The models are open source. But the integration between Workers AI + Durable Objects + Queues is Cloudflare-specific. If you decide to migrate tomorrow, you're not migrating code — you're redesigning architecture. That has a cost that doesn't show up in any pricing calculator.
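One partial mitigation is keeping agent logic behind a thin interface of your own so that at least the inference call is swappable. This is a sketch, the interface and names are mine, and it does nothing for Durable Object or Queue portability:

```typescript
// A thin inference interface so agent logic doesn't import Cloudflare types directly
interface ChatMessage { role: 'system' | 'user' | 'assistant'; content: string }

interface InferenceProvider {
  complete(messages: ChatMessage[]): Promise<string>;
}

// Cloudflare-backed implementation (the binding shape is simplified here)
function workersAiProvider(
  ai: { run(model: string, input: unknown): Promise<unknown> }
): InferenceProvider {
  return {
    async complete(messages) {
      const out = await ai.run('@cf/meta/llama-3.1-8b-instruct', { messages });
      return (out as { response: string }).response;
    },
  };
}

// Agent code depends only on the interface, so swapping providers touches one file
async function runAgent(provider: InferenceProvider, userInput: string): Promise<string> {
  return provider.complete([
    { role: 'system', content: 'You are an agent that helps with code analysis' },
    { role: 'user', content: userInput },
  ]);
}
```

The interface costs almost nothing up front; the architectural coupling to Durable Objects and Queues is the part no interface saves you from.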

The available models aren't the best models. Workers AI runs quantized models, optimized to run at the edge. Llama 3.1 8B quantized is not the same as Llama 3.1 70B at full precision. For many agent use cases — especially ones involving complex reasoning, multi-step planning, or decisions with real consequences — that difference matters. A lot.

Privacy has nuances you need to read carefully. Cloudflare has reasonable usage policies and doesn't claim it'll train on your data. But "reasonable" isn't the same as "legally guaranteed." If your agent processes sensitive information, remember what we already analyzed about the legal privilege of AI conversations — the layer where inference runs doesn't solve the question of what happens to that data.

Observability is good but control is limited. AI Gateway gives you logs, metrics, caching. Excellent. But if Cloudflare decides to change how rate limiting works, deprecate a model, or adjust free tier limits, you find out after it's already done. Centralizing inference means centralizing that operational risk too.

```typescript
// What looks simple has hidden layers of dependency
// This "innocent" code ties you to: Workers Runtime, AI Binding,
// Durable Objects API, Cloudflare Storage — all at once
export class DangerouslySimpleAgent implements DurableObject {
  constructor(private state: DurableObjectState, private env: Env) {}

  async fetch(request: Request): Promise<Response> {
    // Every single one of these lines is Cloudflare-specific
    // There's no abstraction that lets you swap the provider
    const memory = await this.state.storage.get('state')
    const inference = await this.env.AI.run('...', { messages: [] })
    await this.state.storage.put('state', inference)

    // This doesn't run anywhere else without significant rewriting
    return Response.json(inference)
  }
}
```

What gives me the most specific discomfort is this: the agents that actually matter — the ones that will have real impact — are going to make decisions with consequences. Send an email, execute a transaction, modify an external system. Concentrating the inference that feeds those decisions into a single platform, with the control limitations I just described, is an architectural decision with security implications that go way beyond surface-level compliance.

It's not that Cloudflare is malicious. It's that risk concentration is a structural problem regardless of the vendor's intentions.

FAQ: What People Actually Ask About Cloudflare AI Platform and Agents

Can Cloudflare Workers AI replace OpenAI for agents in production?
Depends on the use case. For tasks that need powerful models (GPT-4 level), not yet — the models available in Workers AI are capable but have reasoning limitations by comparison. For simpler tasks — classification, information extraction, structured text generation — it works well and with better latency. The real tradeoff is capability vs. latency vs. vendor lock-in, and that equation is yours to solve for your specific context.
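One way to act on that tradeoff is to route by task tier instead of committing to a single backend. The tiers and model choices below are illustrative assumptions, not a recommendation:

```typescript
// Route light tasks to the edge, heavy reasoning to a frontier model elsewhere.
// Tier boundaries and model ids here are my own illustrative picks.
type TaskKind = 'classification' | 'extraction' | 'generation' | 'planning';

function pickBackend(task: TaskKind): { backend: 'workers-ai' | 'external'; model: string } {
  switch (task) {
    case 'classification':
    case 'extraction':
      // Quantized 8B at the edge is plenty for these, and the latency win is real
      return { backend: 'workers-ai', model: '@cf/meta/llama-3.1-8b-instruct' };
    case 'generation':
      return { backend: 'workers-ai', model: '@cf/mistral/mistral-7b-instruct-v0.1' };
    case 'planning':
      // Multi-step reasoning with real consequences: pay the latency for capability
      return { backend: 'external', model: 'gpt-4o' }; // hypothetical external choice
  }
}
```

The routing itself is trivial; the honest work is deciding which of your agent's tasks actually tolerate a quantized 8B model.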

Are Durable Objects a good solution for long-term agent state?
They're a solid solution for short-to-medium-term conversational state. For long-term agent memory (remembering information from conversations weeks or months ago, doing semantic search over history), Durable Objects alone aren't enough — you need to combine them with Vectorize (Cloudflare's vector DB service) or an external solution. Which, again, adds more layers to the lock-in.

What happens to my data when I process sensitive information through Workers AI?
Cloudflare claims not to use Workers AI data to train models. But "claims" and "contractually guarantees with legal consequences" are different things. If you're processing health data, financial data, or anything regulated, you need to read the terms of service carefully and probably talk to someone who understands the legal implications in your jurisdiction. We've already seen that AI conversations have less legal protection than we assume.

Does it make sense to use Cloudflare AI Platform if I'm already using Vercel AI SDK?
They can coexist, but the stack gets complicated. Vercel AI SDK abstracts inference providers reasonably well, and Workers AI is one of those providers. But once you start using Durable Objects for state, you're outside the Vercel world. In practice, people who use Workers AI for inference tend to use the rest of the Cloudflare stack too, because the integration is the actual value. If you've already invested in Vercel, think hard about whether the latency benefit justifies the added complexity.

Does Cloudflare AI Gateway actually help control token costs?
Yes, genuinely. Response caching is useful for repetitive queries (common in agents that make the same tool calls repeatedly). Rate limiting helps avoid billing surprises. Logging gives you real visibility into what's consuming what. It's one of the strongest parts of the proposition. One nuance: out of the box it covers consumption within Cloudflare, but if you also route your external API calls (OpenAI, Anthropic, etc.) through the gateway, it captures those too, which makes it more useful, not less.
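For reference, calling a model through AI Gateway instead of the direct binding looks roughly like this. The account and gateway ids are placeholders, and the `cf-aig-cache-ttl` header follows AI Gateway's documented `cf-aig-*` convention as I understand it; double-check header names and the URL shape against current docs.

```typescript
// Build a request that goes through AI Gateway so logging, rate limiting,
// and caching apply. Returns plain data instead of calling fetch, so the
// routing logic is inspectable on its own.
function gatewayRequest(
  accountId: string,     // placeholder: your Cloudflare account id
  gatewayId: string,     // placeholder: the gateway you created
  model: string,
  body: unknown,
  apiToken: string,
  cacheTtlSeconds = 3600,
): { url: string; method: string; headers: Record<string, string>; body: string } {
  return {
    url: `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/workers-ai/${model}`,
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiToken}`,
      'Content-Type': 'application/json',
      // Ask the gateway to cache identical requests, a real saver for
      // agents that repeat the same tool calls
      'cf-aig-cache-ttl': String(cacheTtlSeconds),
    },
    body: JSON.stringify(body),
  };
}
```

Usage would be something like `const req = gatewayRequest(accountId, 'my-gateway', '@cf/meta/llama-3.1-8b-instruct', { messages }, token)` followed by `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.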

When DOES it make sense to bet heavily on Cloudflare as your inference layer for agents?
When latency is critical and your users are globally distributed. When the available models are sufficient for your use case. When your team already lives in the Cloudflare ecosystem. When request volume is high and AI Gateway caching can generate real savings. And when you have clarity about the lock-in tradeoffs and accept them consciously — not because you didn't see them, but because for your specific context, the value outweighs the risk.

Where I Land After Turning This Over for Weeks

Something similar happened to me as what I described with local inference: the alternative that seems obvious has limitations that don't appear in the initial pitch. With Cloudflare, the pitch is "distributed inference close to your users for your agents." What doesn't appear in that pitch is the risk concentration, the limits of the available models, and the depth of the lock-in.

None of this means Cloudflare AI Platform is a bad option. It means it's an option with specific tradeoffs that you need to understand before building your agent architecture on top of it.

What generates the most discomfort for me — and this is genuine, not FUD — is that the agent ecosystem is still at a stage where we don't have good curation tools for knowing what works and what's hype. In that context, a platform that offers full integration and excellent DX has a massive adoption advantage. And when something has a massive adoption advantage at an early stage, it tends to become the de facto standard even if it's not the best long-term technical choice.

I'm going to keep experimenting with Cloudflare AI Platform for specific cases. The latency is real, the DX is real, pieces like AI Gateway are genuinely useful. But my agent architecture isn't going to depend exclusively on any single vendor until the space matures enough that I can evaluate the options with more clarity.

That's the lesson I learned when I wiped production servers with rm -rf at 19: systems that look solid from the outside have failure points you only find when something goes wrong. And with agents making decisions with real consequences, I'd rather distribute that risk before we all have to learn the lesson the hard way.

Are you building agents on Cloudflare? I genuinely want to know what you've run into in production — real cases are always more informative than benchmarks.
