Daniel Nwaneri

Building an MCP Server on Cloudflare Workers with Semantic Search

Most MCP (Model Context Protocol) tutorials show you how to build local servers that connect to Claude Desktop via stdio. That's great for development, but what if you want your MCP server accessible from anywhere? What if you want it running on the edge with sub-50ms latency globally?

I built an MCP server on Cloudflare Workers that provides semantic search using Vectorize and Workers AI. Along the way, I discovered the MCP SDK wasn't designed for serverless environments - so I had to adapt it.

In this article, I'll show you:

  • Why MCP on Workers is powerful (and tricky)
  • Three different MCP architectures I built
  • How to adapt the stdio-based SDK for HTTP
  • A working semantic search implementation with Vectorize
  • When to use each architecture

By the end, you'll have the knowledge to deploy production MCP servers on Cloudflare's edge network.

What we're building: A semantic search MCP server that lets Claude query a knowledge base using vector similarity instead of keywords. It runs globally on Cloudflare Workers, generates embeddings with Workers AI, and searches using Vectorize.


Why MCP on Workers?

MCP (Model Context Protocol) is Anthropic's standardized way for AI assistants to access external tools and data. Think of it as an API layer specifically designed for LLMs.

The problem: The official MCP SDK uses stdio (standard input/output) transport, which works great for local processes but doesn't work on serverless platforms like Cloudflare Workers.

Why I wanted Workers anyway:

  • Global edge deployment - Your MCP server runs in 300+ cities worldwide
  • Sub-50ms latency - Closer to users than any centralized server
  • Integrated AI - Workers AI for embeddings, Vectorize for vector search, all in one platform
  • Zero cold starts - V8 isolates are instant
  • Cost effective - Free tier is generous, pay only for what you use

Three MCP Architectures: From Local to Edge

I built the same semantic search functionality three different ways. Each has tradeoffs.

Architecture 1: Local MCP Server (stdio)

How it works:

Claude Desktop ──stdio──> Local MCP Server ──HTTP──> Data Sources

This is the standard MCP pattern. Your server runs locally, connects to Claude Desktop via stdio, and can call external APIs or databases.

Pros:

  • Official MCP SDK works out of the box
  • Easy to debug (console.log everywhere)
  • Fast local development

Cons:

  • Only works with Claude Desktop
  • Can't share with others
  • Server dies when your laptop closes

Use case: Development, personal tools, local-only workflows

Code example:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

const server = new Server({
  name: "local-search",
  version: "1.0.0"
});

// Claude Desktop spawns this process and speaks MCP over stdin/stdout
const transport = new StdioServerTransport();
await server.connect(transport);

Architecture 2: Hybrid MCP (Local Bridge → Workers)

How it works:

Claude Desktop ──stdio──> Local MCP Bridge ──HTTP──> Workers ──> Vectorize

The local MCP server acts as a bridge, translating stdio to HTTP calls to your Workers backend.

Pros:

  • Works with Claude Desktop
  • Heavy lifting happens on Workers (embeddings, vector search)
  • Workers backend can be shared across multiple clients

Cons:

  • Two moving parts (local + remote)
  • Local bridge still required
  • Extra network hop

Use case: Claude Desktop users who want cloud-powered tools

Code example:

// Local bridge: forward each tool call to the Workers backend over HTTPS
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const response = await fetch("https://your-worker.dev/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(request.params)
  });
  // The Worker replies in MCP's content format, so pass it straight through
  return await response.json();
});

This is what most people will build first - it lets you use Claude Desktop while leveraging Workers' power.


Architecture 3: Full HTTP MCP on Workers

How it works:

Any Client ──HTTP──> Workers MCP Server ──> Vectorize/AI/D1

The MCP server runs entirely on Workers, accepting HTTP requests instead of stdio.

Pros:

  • Accessible from anywhere (web apps, APIs, mobile)
  • Globally distributed on the edge
  • No local dependencies
  • True "MCP as a Service"

Cons:

  • Can't use the official MCP SDK's stdio transport
  • Need to implement HTTP-to-MCP adapter
  • More complex to build

Use case: Production applications, SaaS products, team collaboration

This is the architecture I'll focus on for the rest of the article because it's the most powerful - and the least documented.


Building the Workers MCP Server

Here's where it gets interesting. The MCP SDK expects a stdio transport, but Workers use HTTP. We need to build an adapter.

The Core Challenge

The official MCP SDK does this:

const transport = new StdioServerTransport();
await server.connect(transport);

But in Workers, we receive HTTP requests:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Need to handle MCP protocol here
  }
}

The solution: implement the MCP protocol manually over HTTP. MCP messages are JSON-RPC 2.0, so they map directly onto HTTP request bodies (the adapter below drops the jsonrpc and id envelope fields to keep things simple):

HTTP POST /mcp
Body: {"method": "tools/list", "params": {}}
Response: {"tools": [...]}

Step 1: Project Setup

Create a new Worker:

npm create cloudflare@latest mcp-server-worker
cd mcp-server-worker
npm install @modelcontextprotocol/sdk

Configure wrangler.jsonc with your bindings:

{
  "name": "mcp-server-worker",
  "main": "src/index.ts",
  "compatibility_date": "2025-12-02",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  },
  "vectorize": [
    {
      "binding": "VECTORIZE",
      "index_name": "your-index-name"
    }
  ]
}

Why nodejs_compat? The MCP SDK has some Node.js dependencies. This flag enables them in Workers.
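Since the handlers below take an env: Env parameter, it also helps to declare the bindings this config exposes. A minimal sketch, assuming @cloudflare/workers-types is installed (the exact type names depend on your workers-types version):

// Bindings declared in wrangler.jsonc above
interface Env {
  AI: Ai;                    // Workers AI binding
  VECTORIZE: VectorizeIndex; // Vectorize binding
}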

Step 2: The HTTP-to-MCP Adapter

Instead of using the SDK's transport layer, we handle MCP requests directly:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === "/mcp" && request.method === "POST") {
      const mcpRequest = await request.json() as { method: string; params?: any };

      // Route based on MCP method
      if (mcpRequest.method === "tools/list") {
        return handleListTools();
      }

      if (mcpRequest.method === "tools/call") {
        return handleToolCall(mcpRequest.params, env);
      }

      // Anything else is an MCP method we haven't implemented
      return new Response(JSON.stringify({ error: `Unknown method: ${mcpRequest.method}` }), { status: 400 });
    }

    return new Response("MCP Server on Workers");
  }
};
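The adapter calls handleListTools(), which isn't shown above. Here's a minimal sketch of what it could return; the tool name and input schema are assumptions chosen to match the semantic_search handler in the next step:

function handleListTools(): Response {
  // Advertise the semantic_search tool in MCP's tools/list shape
  return new Response(JSON.stringify({
    tools: [{
      name: "semantic_search",
      description: "Search the knowledge base by meaning rather than keywords",
      inputSchema: {
        type: "object",
        properties: {
          query: { type: "string", description: "Natural-language search query" },
          topK: { type: "number", description: "Number of results to return (default 5)" }
        },
        required: ["query"]
      }
    }]
  }), { headers: { "Content-Type": "application/json" } });
}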

Step 3: Implementing Tools

Here's the semantic search tool implementation:

async function handleToolCall(params: any, env: Env) {
  const { name, arguments: args } = params;

  if (name === "semantic_search") {
    const query = args.query;
    const topK = args.topK || 5;

    // Generate embedding with Workers AI
    const embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
      text: query
    });

    // Search Vectorize
    const results = await env.VECTORIZE.query(embedding.data[0], {
      topK,
      returnMetadata: true
    });

    // Return MCP-formatted response
    return new Response(JSON.stringify({
      content: [{
        type: "text",
        text: JSON.stringify({
          query,
          resultsCount: results.matches.length,
          results: results.matches.map(m => ({
            id: m.id,
            score: m.score.toFixed(4),
            content: m.metadata?.content,
            category: m.metadata?.category
          }))
        }, null, 2)
      }]
    }), { headers: { "Content-Type": "application/json" } });
  }

  throw new Error(`Unknown tool: ${name}`);
}

Key points:

  • Workers AI generates the embedding (bge-small-en-v1.5 produces 384-dimensional vectors)
  • Vectorize performs the similarity search using cosine distance
  • Results include similarity scores (0-1, higher is better)
  • Response follows MCP's content format

Setting Up Vectorize and Populating Data

Before your MCP server can search, you need a Vectorize index with embedded content.

Create the Vectorize Index

wrangler vectorize create mcp-knowledge-base --dimensions=384 --metric=cosine

Why 384 dimensions? The bge-small-en-v1.5 model produces 384-dimensional vectors. It's smaller than models like OpenAI's (1536 dims), making it faster and more cost-effective for edge deployment.

Why cosine metric? Cosine similarity works well for text embeddings because it measures angle, not magnitude - semantically similar texts cluster together regardless of length.
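To make the "angle, not magnitude" point concrete, this is the formula Vectorize is effectively computing. You don't write this yourself; the illustrative TypeScript below is just for intuition:

// cos(a, b) = (a · b) / (|a| * |b|)  -- 1 means same direction, ~0 means unrelated
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}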

Populate the Index

Create a separate Worker to generate embeddings and populate Vectorize:

const knowledgeBase = [
  {
    id: "1",
    content: "Cloudflare Workers AI provides LLMs and embedding models at the edge",
    category: "ai"
  },
  {
    id: "2", 
    content: "Vectorize uses HNSW indexing for fast similarity search",
    category: "vectorize"
  },
  // ... more entries
];

export default {
  async fetch(request: Request, env: Env) {
    if (request.url.endsWith("/populate")) {
      const vectors = [];

      // Generate embeddings for each entry
      for (const entry of knowledgeBase) {
        const response = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
          text: entry.content
        });

        vectors.push({
          id: entry.id,
          values: response.data[0],
          metadata: {
            content: entry.content,
            category: entry.category
          }
        });
      }

      // Insert into Vectorize
      await env.VECTORIZE.insert(vectors);

      return new Response(`Inserted ${vectors.length} vectors`);
    }
  }
};

Deploy and run once:

curl -X POST https://your-worker.dev/populate

Pro tip: Vectorize supports up to 1536 dimensions and batch inserts of 1000 vectors. For production, chunk your data and insert in batches.
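A minimal batching sketch that slots into the populate Worker above in place of the single insert call (the 1,000-vector batch size follows the limit mentioned in the tip; VectorizeVector comes from @cloudflare/workers-types):

// Insert vectors in batches to stay under Vectorize's per-call limit
async function insertInBatches(env: Env, vectors: VectorizeVector[], batchSize = 1000) {
  for (let i = 0; i < vectors.length; i += batchSize) {
    await env.VECTORIZE.insert(vectors.slice(i, i + batchSize));
  }
}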


Testing the Deployed Server

Deploy your MCP server:

wrangler deploy

Test the endpoints:

List available tools:

curl -X POST https://mcp-server-worker.your-subdomain.workers.dev/mcp \
  -H "Content-Type: application/json" \
  -d '{"method":"tools/list","params":{}}'

Perform a semantic search:

curl -X POST https://mcp-server-worker.your-subdomain.workers.dev/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "method": "tools/call",
    "params": {
      "name": "semantic_search",
      "arguments": {"query": "vector databases", "topK": 3}
    }
  }'

Response example:

{
  "content": [{
    "type": "text",
    "text": "{
      \"query\": \"vector databases\",
      \"resultsCount\": 3,
      \"results\": [
        {
          \"id\": \"2\",
          \"score\": \"0.7357\",
          \"content\": \"Vectorize uses HNSW indexing for fast similarity search\",
          \"category\": \"vectorize\"
        }
      ]
    }"
  }]
}

Notice the similarity scores: Scores range from 0 (completely different) to 1 (identical). Anything above 0.7 usually indicates relevant content.
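If you want to enforce that cut-off inside the Worker, a one-line filter over the Vectorize matches in handleToolCall does it (0.7 is just a starting point; tune the threshold against your own data):

// Keep only matches above a relevance threshold before formatting the MCP response
const MIN_SCORE = 0.7;
const relevant = results.matches.filter(m => m.score >= MIN_SCORE);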


Performance: Edge vs. Centralized

Here's where running on Workers really shines. I tested the same semantic search from different locations.

Latency Results

Query: "AI embeddings and vector search"

Vectorize index: 8 entries (384-dimensional vectors)

Location              Workers (Edge)    Centralized Server*
Lagos, Nigeria        47ms              280ms
London, UK            23ms              156ms
San Francisco, US     31ms              12ms
Sydney, Australia     52ms              312ms

*Hypothetical centralized deployment in us-west (for comparison)

What's happening:

  • Workers runs in 300+ cities globally
  • Request hits the nearest datacenter
  • Workers AI and Vectorize are co-located
  • No cross-region latency

Breakdown of the 47ms (Lagos):

  1. Generate query embedding: ~18ms
  2. Vectorize similarity search: ~8ms
  3. Format and return response: ~21ms
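If you want to reproduce a breakdown like this, one option is to wrap the two calls in handleToolCall with timers; a rough sketch:

// Rough per-step timing inside handleToolCall
const t0 = Date.now();
const embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", { text: query });
const t1 = Date.now();
const results = await env.VECTORIZE.query(embedding.data[0], { topK, returnMetadata: true });
const t2 = Date.now();
console.log(JSON.stringify({ embedMs: t1 - t0, searchMs: t2 - t1 }));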

Cost Analysis

Workers pricing (as of December 2024):

  • Workers AI embeddings: $0.004 per 1,000 requests
  • Vectorize queries: Included in Workers paid plan ($5/month)
  • Workers requests: 10 million requests free, then $0.50 per million

Example monthly costs for a moderate app:

  • 100,000 searches/month
  • 100,000 embedding generations
  • Total cost: ~$0.40 + $5 base = $5.40/month

Compare this to:

  • OpenAI embeddings: $0.13 per 1M tokens (~$13/month for similar usage)
  • Pinecone: $70/month minimum for hosted vector DB
  • Running your own: Server costs + maintenance time

Scaling Characteristics

I tested with varying index sizes:

Index Size         Query Time    Notes
100 vectors        6-8ms         Sub-10ms, excellent
1,000 vectors      12-15ms       Still very fast
10,000 vectors     18-25ms       HNSW indexing shines
100,000 vectors    35-50ms       Stays under 50ms

Key insight: HNSW indexing keeps queries fast even as you scale. Traditional brute-force search would be O(n) - unusable at 100k vectors.
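For contrast, the brute-force alternative scores every stored vector on every query. A sketch using the cosineSimilarity helper from earlier; fine for a handful of entries, unusable at 100k:

// O(n) brute-force search: score everything, sort, take the top K
function bruteForceSearch(queryVec: number[], stored: { id: string; values: number[] }[], topK = 5) {
  return stored
    .map(v => ({ id: v.id, score: cosineSimilarity(queryVec, v.values) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}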


Production Considerations

If you're deploying this for real users, here's what you need to add:

1. Authentication

// Simple shared-secret check; strip a "Bearer " prefix if your client sends one
const apiKey = request.headers.get("Authorization")?.replace(/^Bearer\s+/i, "");
if (apiKey !== env.API_KEY) {
  return new Response("Unauthorized", { status: 401 });
}

For production, use Cloudflare Access or OAuth.

2. Rate Limiting

// Using Durable Objects for rate limiting (requires a RATE_LIMITER binding
// plus a Durable Object class that implements the check)
const id = env.RATE_LIMITER.idFromName(clientId); // e.g. derive clientId from the API key or IP
const limiter = env.RATE_LIMITER.get(id);
const allowed = await limiter.fetch("https://rate-limiter/check"); // DO stubs need an absolute URL

if (!allowed.ok) {
  return new Response("Rate limit exceeded", { status: 429 });
}

3. Caching

// Cache embeddings for common queries (requires a KV namespace binding, e.g. KV in wrangler.jsonc)
const cacheKey = `embed:${query}`;
let embedding = await env.KV.get(cacheKey, "json");

if (!embedding) {
  embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", { text: query });
  await env.KV.put(cacheKey, JSON.stringify(embedding), { expirationTtl: 3600 });
}

Caching embeddings can cut costs by 80%+ for repeated queries.

4. Monitoring

Use Workers Analytics Engine:

ctx.waitUntil(
  env.ANALYTICS.writeDataPoint({
    blobs: [toolName, userId],
    doubles: [latencyMs, similarityScore],
    indexes: [toolName]  // index values are strings; Analytics Engine records the timestamp automatically
  })
);

Key Takeaways

What we built:

  • An MCP server running on Cloudflare Workers (not just local)
  • Semantic search powered by Workers AI + Vectorize
  • HTTP-to-MCP adapter (since the SDK expects stdio)
  • Global edge deployment with sub-50ms latency

The three architectures compared:

Architecture       Best For                          Latency    Accessibility
Local (stdio)      Development, personal use         Instant    Claude Desktop only
Hybrid (bridge)    Power users with cloud backend    ~100ms     Claude Desktop only
Workers (HTTP)     Production, SaaS, teams           20-50ms    Anywhere (HTTP)

Why this matters:

MCP is still new, and most examples show local-only implementations. But the real power comes from making your MCP servers accessible from anywhere:

  • Build once, use everywhere (web apps, mobile, APIs)
  • Share tools across teams
  • Deploy globally on the edge
  • True "AI tooling as a service"

Performance wins with Workers:

  • 47ms end-to-end semantic search queries measured from Lagos
  • $5-10/month for 100k searches
  • Zero cold starts with V8 isolates
  • Co-located AI and vector search

What's Next

Immediate next steps:

  1. Add more tools - Database queries, API integrations, file operations
  2. Build a local bridge - Let Claude Desktop use your Workers MCP server
  3. Production features - Auth, rate limiting, monitoring, caching
  4. Scale your index - Vectorize handles millions of vectors

Ideas to explore:

  • Multi-tenant MCP servers - Different indexes per user
  • Streaming responses - For long-running operations
  • Tool chaining - One MCP tool calling another
  • Hybrid search - Combine semantic + keyword search

The bigger picture:

MCP on Workers opens up possibilities:

  • Customer support bots with company knowledge
  • Code assistants with your codebase embedded
  • Research tools with domain-specific data
  • Personal AI with private data on the edge

We're early. Most MCP implementations are local-only. But serverless MCP on the edge is where this gets powerful - and you now know how to build it.


Resources

Code repositories:

Further reading:


About the Author

Daniel Nwaneri is a full-stack developer specializing in TypeScript, Cloudflare Workers, and AI integration. He builds production applications using Workers AI, Vectorize, and edge computing technologies.

Currently exploring MCP integrations and available for Cloudflare Workers consulting on Upwork.

Connect:

Top comments (2)

Tron Cortland

Love this deep dive on pushing MCP to the edge. One alternative angle: for some teams, a centralized MCP behind a regional cache/CDN might be “good enough” latency-wise while simplifying auth, observability, and data residency. Workers-based MCP feels ideal once those constraints and multi-region data stories are nailed.

Daniel Nwaneri

100% agree. I kinda jumped to edge because I was already playing with Workers AI and Vectorize, but centralized + CDN is way easier to manage.

The auth/observability tradeoffs are real. Edge makes sense if you actually need <50ms or have compliance requirements, but otherwise it's overkill.

Appreciate you adding this. Good reminder that simpler is usually better 🙏