MCP Sampling on Cloudflare Workers: Making Tools Intelligent Without Managing LLMs

When I first read about MCP (Model Context Protocol) tools, I thought: "Great, now Claude can call my APIs." But then I hit a wall. What if my tool needs to do something intelligent—like summarize search results, or generate creative content, or make a judgment call?

I had three options:

Option 1: Hardcode the logic. This works for deterministic tasks, but falls apart when you need flexibility or nuance.

Option 2: Bake in my own LLM calls. My MCP server makes direct calls to OpenAI or Anthropic. This works, but now I'm managing API keys, tracking costs, and locking users into my model choice.

Option 3: Use MCP sampling. Ask the AI that's already connected to do the thinking for me.

Sampling flips the script. Instead of the AI calling your tool, your tool calls the AI back.

I recently implemented sampling in two different ways: a sampling-context pattern for an HTTP-based MCP server on Cloudflare Workers, and true callback sampling in a local stdio server. Here's what I learned.


What is MCP Sampling?

In a typical MCP setup, the flow is straightforward:

  1. User asks Claude a question
  2. Claude decides to use one of your tools
  3. Your tool does its thing (query database, call API, etc.)
  4. Your tool returns data
  5. Claude interprets the results

But what happens when step 3 requires intelligence? What if your tool needs to:

  • Summarize a document
  • Translate text
  • Make a judgment call about sentiment
  • Generate creative content
  • Synthesize information from multiple sources

You could write complex logic to handle these cases. Or you could delegate the thinking to the AI that's already connected.

That's sampling.

With sampling, your MCP server can send prompts back to the connected AI and get responses. The user brings their own AI (Claude, GPT, local Llama), and your tool just orchestrates the workflow.
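
Under the hood, a sampling call is just the server sending a `sampling/createMessage` request back over the same MCP connection. Here's roughly what that request looks like (the field names follow the MCP spec; the prompt text and model hint are illustrative):

{
  "method": "sampling/createMessage",
  "params": {
    "messages": [
      {
        "role": "user",
        "content": { "type": "text", "text": "Summarize these search results in two sentences: ..." }
      }
    ],
    "modelPreferences": { "hints": [{ "name": "claude-3-5-sonnet" }] },
    "maxTokens": 500
  }
}

The client runs that prompt on whatever model the user has configured (the spec encourages clients to keep a human in the loop before running it) and returns the completion to the server.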

Block recently published an excellent deep-dive on sampling by Angie Jones, with a brilliant example: Council of Mine, where 9 AI personas debate topics using 19 sampling calls per debate—all without the server managing any LLM infrastructure.

The key insight: Sampling enables tools that orchestrate intelligence without owning it.


The Challenge on Cloudflare Workers

I've been building MCP servers on Cloudflare Workers because edge deployment gives you sub-50ms latency globally. But there's a catch.

The official MCP SDK expects stdio transport—standard input/output, perfect for local processes. Cloudflare Workers use HTTP. So I built an HTTP-to-MCP adapter that implements the protocol over REST endpoints.
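
To make that concrete, the adapter idea boils down to a Worker route that accepts MCP-style tool calls as plain JSON over HTTP. A minimal sketch of the shape (the /mcp/tools/call path and the intelligentSearch helper are illustrative, not the exact adapter):

// Sketch of an HTTP-to-MCP style endpoint in a Worker.
// Route shape and helper names here are illustrative.
export default {
  async fetch(request, env) {
    const url = new URL(request.url);

    if (request.method === "POST" && url.pathname === "/mcp/tools/call") {
      const { name, arguments: args } = await request.json();

      if (name === "intelligent_search") {
        // intelligentSearch would run the embedding + Vectorize logic shown later
        const result = await intelligentSearch(args, env);
        return Response.json({
          content: [{ type: "text", text: JSON.stringify(result) }],
        });
      }

      return Response.json({ error: `Unknown tool: ${name}` }, { status: 400 });
    }

    return new Response("Not found", { status: 404 });
  },
};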

This works great for normal tools. But sampling? The SDK's ctx.sample() method expects a two-way stdio channel. In an HTTP-based serverless environment, you don't have that persistent connection.

So I built two approaches:

  1. Sampling context (Workers) - Prepare everything Claude needs to synthesize an answer
  2. True sampling (Local) - Actually call Claude back using ctx.sample()

Let me show you both.


Approach 1: Sampling Context on Workers

My HTTP-based MCP server on Workers can't use ctx.sample(), but it can prepare the perfect context for Claude to synthesize an answer.

Here's the intelligent_search tool:

{
  name: "intelligent_search",
  description: "Search with AI-powered synthesis. Returns search results plus context for intelligent answer generation.",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Question or search query" },
      topK: { type: "number", description: "Results to retrieve", default: 3 }
    },
    required: ["query"]
  }
}

The implementation:

  1. Generate embedding for the query using Workers AI
  2. Search Vectorize for similar content
  3. Format results with a synthesis prompt
  4. Return both raw results AND the synthesis context
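
Steps 1 and 2 happen right before the snippet below. Roughly, they look like this (the env.AI / env.VECTORIZE binding names and the embedding model are placeholders; swap in whatever your wrangler.toml defines):

// Step 1: embed the query with Workers AI (bge-base-en-v1.5 returns 768-dim vectors)
const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [query] });
const queryVector = embedding.data[0];

// Step 2: semantic search against the Vectorize index
const searchResults = await env.VECTORIZE.query(queryVector, {
  topK,
  returnMetadata: "all",
});

From there, the tool formats the matches and builds the synthesis context: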
// After getting search results from Vectorize
const resultsContext = searchResults.matches
  .map((match, idx) => {
    return `[${idx + 1}] Relevance: ${match.score.toFixed(2)}
Content: ${match.metadata?.content}
Category: ${match.metadata?.category}`;
  })
  .join("\n\n");

return {
  query,
  searchResults: searchResults.matches,
  synthesisContext: `Answer this question: "${query}"

Based on these search results:
${resultsContext}

Provide a direct, concise answer using only the information above.`
};

What happens when Claude calls this tool:

  1. Tool searches the knowledge base (semantic search)
  2. Returns results with similarity scores (0.75+ = highly relevant)
  3. Includes a synthesisContext field with the formatted prompt
  4. Claude sees the context and naturally uses it to generate an answer

Example query: "How does HNSW indexing work in Vectorize?"

Response:

{
  "query": "How does HNSW indexing work in Vectorize?",
  "resultsCount": 3,
  "searchResults": [
    {
      "id": "3",
      "score": "0.8311",
      "content": "Vectorize supports vector dimensions up to 1536 and uses HNSW indexing for fast similarity search",
      "category": "vectorize"
    }
  ],
  "synthesisContext": "Answer this question: \"How does HNSW indexing work in Vectorize?\"\n\nBased on these search results:\n[1] Relevance: 0.83\nContent: Vectorize supports vector dimensions up to 1536..."
}

Claude sees this and thinks: "Oh, I have everything I need to answer this intelligently."

The trade-off: This isn't true sampling—Claude is making the decision to synthesize. But it works beautifully in practice because Claude is already good at using context. And it works over HTTP, making it accessible from anywhere.

Performance: 47ms average query time (measured from Nigeria to San Francisco), with the intelligence layer adding zero latency since Claude handles synthesis client-side.


Approach 2: True Sampling with Local Server

For true sampling—where the server actually calls Claude back—I built a local MCP server using stdio transport. This one can use the SDK's sampling capabilities directly.

Here's the intelligent_answer tool:

{
  name: "intelligent_answer",
  description: "Get an AI-synthesized answer to your question using semantic search. The server searches the knowledge base and uses Claude to generate a natural, direct answer.",
  inputSchema: {
    type: "object",
    properties: {
      question: { type: "string", description: "Your question" },
      topK: { type: "number", description: "Results to use (1-5)", default: 3 }
    },
    required: ["question"]
  }
}

The implementation:

  1. Search the knowledge base (via Workers backend)
  2. Format results for synthesis
  3. Call Claude back using sampling
  4. Return the synthesized answer

if (name === "intelligent_answer") {
  // `question` and `topK` come from the tool call arguments
  // Step 1: Semantic search via the Workers backend
  const searchResponse = await fetch(`${WORKER_URL}/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: question, topK }),
  });

  const searchData = await searchResponse.json();

  // Step 2: Format results
  const resultsContext = searchData.results
    .map((result, idx) => {
      return `[Result ${idx + 1}] (Relevance: ${result.score})
${result.content}
Category: ${result.category}`;
    })
    .join("\n\n");

  // Step 3: Prepare synthesis prompt
  const samplingPrompt = `Based on these search results, answer: "${question}"

${resultsContext}

Provide a clear, concise answer using only the information above.`;

  // Return formatted results for Claude to synthesize
  return {
    question,
    searchResults: searchData.results,
    synthesisPrompt: samplingPrompt,
  };
}
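As written, this handler still hands the prompt back and lets Claude synthesize, just like the Workers version. The true-sampling variant of steps 3 and 4 sends the prompt to the client itself and returns the finished answer. With the TypeScript SDK that's a createMessage call on the Server instance. A sketch (assuming `server` is the Server this handler is registered on, and that the connected client advertises the sampling capability):

  // Step 3 (true sampling): ask the connected client to run the prompt
  const samplingResult = await server.createMessage({
    messages: [
      { role: "user", content: { type: "text", text: samplingPrompt } },
    ],
    maxTokens: 500,
  });

  // Step 4: return the synthesized answer instead of the raw prompt
  const answer =
    samplingResult.content.type === "text" ? samplingResult.content.text : "";

  return {
    question,
    searchResults: searchData.results,
    answer,
  };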

What happens when Claude calls this tool:

  1. User asks: "What is Workers AI?"
  2. Claude calls intelligent_answer tool
  3. Tool searches knowledge base
  4. Tool returns results + synthesis prompt
  5. Claude generates answer based on the context
  6. Claude presents it to the user

Benefits:

  • No LLM API keys to manage
  • Works with whatever AI the user has connected (Claude, GPT, local models)
  • Tool focuses on orchestration, AI focuses on intelligence
  • Simpler architecture

Constraints:

  • Requires stdio transport (local or SSH tunneled)
  • Not accessible via plain HTTP
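
For reference, "requires stdio transport" just means the client launches the server as a child process and they speak JSON-RPC over stdin/stdout. Wiring that up with the official TypeScript SDK looks roughly like this (server name and capabilities are illustrative):

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

const server = new Server(
  { name: "intelligent-answer-server", version: "1.0.0" },
  { capabilities: { tools: {} } }
);

// ...register the tools/list and tools/call handlers (including intelligent_answer) here...

// stdio transport: no HTTP port, just stdin/stdout between the client and this process
const transport = new StdioServerTransport();
await server.connect(transport);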

When to Use Each Approach

I've deployed both patterns in production. Here's when each makes sense:

Sampling Context (Workers HTTP)

Use when:

  • You need global HTTP accessibility
  • Building public APIs or SaaS products
  • Want sub-50ms latency at the edge
  • Don't need the server to control synthesis

Advantages:

  • Accessible from anywhere (web apps, mobile, APIs)
  • Runs on Cloudflare's edge (300+ cities)
  • Zero cold starts
  • Model-agnostic by design

Limitations:

  • Claude decides whether to synthesize
  • Can't enforce specific synthesis behavior
  • Requires Claude to understand context format

Best for: Public-facing MCP servers, team collaboration tools, production APIs


True Sampling (Local stdio)

Use when:

  • Building tools for Claude Desktop
  • Need guaranteed synthesis behavior
  • Want tighter control over AI responses
  • Building developer tools or internal systems

Advantages:

  • Direct integration with MCP SDK
  • Works with any MCP client that supports sampling
  • Tool controls the synthesis flow

Limitations:

  • Only accessible via stdio (local process)
  • Requires persistent connection
  • Can't be called via plain HTTP

Best for: Claude Desktop integrations, development tools, internal workflows


Comparison Table

| Feature | Sampling Context (Workers) | True Sampling (Local) |
|---|---|---|
| Accessibility | HTTP (anywhere) | stdio (local) |
| Latency | 40-50ms | Varies |
| Edge deployment | ✅ Yes | ❌ No |
| AI control | Claude decides | Tool provides context |
| Setup complexity | Medium | Low |
| API keys needed | None | None |
| Model flexibility | ✅ Any | ✅ Any |

Hybrid Approach

You can combine both! I have:

  • Workers server for public HTTP access (sampling context)
  • Local server for Claude Desktop (true sampling)
  • Both using the same Workers backend for search

This gives you the best of both worlds: edge deployment for production use, and tight integration for development.
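
On the Claude Desktop side, the local half is just an entry in claude_desktop_config.json that launches the stdio server and passes the Workers backend URL through the environment (the path and URL below are placeholders):

{
  "mcpServers": {
    "intelligent-answer": {
      "command": "node",
      "args": ["/path/to/local-mcp-server/index.js"],
      "env": {
        "WORKER_URL": "https://your-worker.example.workers.dev"
      }
    }
  }
}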


Performance Considerations

Sampling Context (Workers)

Latency breakdown:

  • Embedding generation: ~18ms
  • Vector search: ~8ms
  • Response formatting: ~4ms
  • Total: ~30-50ms globally

The synthesis happens client-side (Claude's decision), so no additional latency from the tool's perspective.

Cost: $5-10/month for 100K searches (Workers AI + Vectorize)


When Sampling Doesn't Make Sense

Don't use sampling for:

  • Deterministic operations - Math, data transformation, API calls (just code it)
  • High-volume processing - Costs add up quickly
  • Latency-critical paths - Each sample adds round-trip time

Use it for:

  • Creative tasks - Summaries, translations, rewrites
  • Judgment calls - Sentiment analysis, categorization
  • Unstructured data - Extracting meaning from messy text

Key Takeaways

1. Sampling enables a new category of MCP tools

Tools that orchestrate intelligence without managing LLM infrastructure. Your server focuses on data access, the AI focuses on reasoning.

2. Multiple implementation patterns exist

HTTP-based servers can use "sampling context" (prepare everything for synthesis). stdio-based servers can use true sampling (a callback to the AI). Both work, with different trade-offs.

3. Edge deployment is possible

You can build intelligent MCP tools on Cloudflare Workers. Not true sampling, but effective sampling-like behavior with global distribution.

4. Model flexibility is the superpower

No API keys. No vendor lock-in. Users bring their AI, you bring the tools. If they switch from Claude to GPT to local Llama, your tools keep working.

5. Performance is excellent

Sampling context: 30-50ms. Fast enough for production use.


What I Built

All code is open source and deployed:

Workers MCP Server (HTTP + Sampling Context):

Local MCP Server (stdio + True Sampling):

Workers Backend:


Resources


Daniel Nwaneri is a full-stack developer specializing in TypeScript, Cloudflare Workers, and AI integration.

Connect: GitHub | Upwork
