Daniel Nwaneri

Building an MCP Server on Cloudflare Workers with Semantic Search

Most MCP (Model Context Protocol) tutorials show you how to build local servers that connect to Claude Desktop via stdio. That's great for development, but what if you want your MCP server accessible from anywhere? What if you want it running on the edge with sub-50ms latency globally?

I built an MCP server on Cloudflare Workers that provides semantic search using Vectorize and Workers AI. Along the way, I discovered the MCP SDK wasn't designed for serverless environments - so I had to adapt it.

In this article, I'll show you:

  • Why MCP on Workers is powerful (and tricky)
  • Three different MCP architectures I built
  • How to adapt the stdio-based SDK for HTTP
  • A working semantic search implementation with Vectorize
  • When to use each architecture

By the end, you'll have the knowledge to deploy production MCP servers on Cloudflare's edge network.

What we're building: A semantic search MCP server that lets Claude query a knowledge base using vector similarity instead of keywords. It runs globally on Cloudflare Workers, generates embeddings with Workers AI, and searches using Vectorize.


Why MCP on Workers?

MCP (Model Context Protocol) is Anthropic's standardized way for AI assistants to access external tools and data. Think of it as an API layer specifically designed for LLMs.

The problem: The official MCP SDK uses stdio (standard input/output) transport, which works great for local processes but doesn't work on serverless platforms like Cloudflare Workers.

Why I wanted Workers anyway:

  • Global edge deployment - Your MCP server runs in 300+ cities worldwide
  • Sub-50ms latency - Closer to users than any centralized server
  • Integrated AI - Workers AI for embeddings, Vectorize for vector search, all in one platform
  • Zero cold starts - V8 isolates are instant
  • Cost effective - Free tier is generous, pay only for what you use

Three MCP Architectures: From Local to Edge

I built the same semantic search functionality three different ways. Each has tradeoffs.

Architecture 1: Local MCP Server (stdio)

How it works:

Claude Desktop ──stdio──> Local MCP Server ──HTTP──> Data Sources

This is the standard MCP pattern. Your server runs locally, connects to Claude Desktop via stdio, and can call external APIs or databases.

Pros:

  • Official MCP SDK works out of the box
  • Easy to debug (console.log everywhere)
  • Fast local development

Cons:

  • Only works with Claude Desktop
  • Can't share with others
  • Server dies when your laptop closes

Use case: Development, personal tools, local-only workflows

Code example:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

const server = new Server({
  name: "local-search",
  version: "1.0.0"
});

// Claude Desktop spawns this process and speaks MCP over stdin/stdout
const transport = new StdioServerTransport();
await server.connect(transport);

Architecture 2: Hybrid MCP (Local Bridge → Workers)

How it works:

Claude Desktop ──stdio──> Local MCP Bridge ──HTTP──> Workers ──> Vectorize

The local MCP server acts as a bridge, translating stdio to HTTP calls to your Workers backend.

Pros:

  • Works with Claude Desktop
  • Heavy lifting happens on Workers (embeddings, vector search)
  • Workers backend can be shared across multiple clients

Cons:

  • Two moving parts (local + remote)
  • Local bridge still required
  • Extra network hop

Use case: Claude Desktop users who want cloud-powered tools

Code example:

// Local bridge: forward each tool call to the Workers backend over HTTPS
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const response = await fetch("https://your-worker.dev/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(request.params)
  });
  // The Worker replies in MCP's content format, so pass it straight through
  return await response.json();
});

This is what most people will build first - it lets you use Claude Desktop while leveraging Workers' power.


Architecture 3: Full HTTP MCP on Workers

How it works:

Any Client ──HTTP──> Workers MCP Server ──> Vectorize/AI/D1

The MCP server runs entirely on Workers, accepting HTTP requests instead of stdio.

Pros:

  • Accessible from anywhere (web apps, APIs, mobile)
  • Globally distributed on the edge
  • No local dependencies
  • True "MCP as a Service"

Cons:

  • Can't use the official MCP SDK's stdio transport
  • Need to implement HTTP-to-MCP adapter
  • More complex to build

Use case: Production applications, SaaS products, team collaboration

This is the architecture I'll focus on for the rest of the article because it's the most powerful - and the least documented.


Building the Workers MCP Server

Here's where it gets interesting. The MCP SDK expects a stdio transport, but Workers use HTTP. We need to build an adapter.

The Core Challenge

The official MCP SDK does this:

const transport = new StdioServerTransport();
await server.connect(transport);

But in Workers, we receive HTTP requests:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Need to handle MCP protocol here
  }
}

The solution: implement the MCP protocol manually over HTTP. MCP messages are JSON-RPC 2.0, so they map directly onto HTTP request bodies (the adapter below drops the jsonrpc and id envelope fields to keep things simple):

HTTP POST /mcp
Body: {"method": "tools/list", "params": {}}
Response: {"tools": [...]}

Step 1: Project Setup

Create a new Worker:

npm create cloudflare@latest mcp-server-worker
cd mcp-server-worker
npm install @modelcontextprotocol/sdk

Configure wrangler.jsonc with your bindings:

{
  "name": "mcp-server-worker",
  "main": "src/index.ts",
  "compatibility_date": "2025-12-02",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  },
  "vectorize": [
    {
      "binding": "VECTORIZE",
      "index_name": "your-index-name"
    }
  ]
}

Why nodejs_compat? The MCP SDK has some Node.js dependencies. This flag enables them in Workers.
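Since the handlers below take an env: Env parameter, it also helps to declare the bindings this config exposes. A minimal sketch, assuming @cloudflare/workers-types is installed (the exact type names depend on your workers-types version):

// Bindings declared in wrangler.jsonc above
interface Env {
  AI: Ai;                    // Workers AI binding
  VECTORIZE: VectorizeIndex; // Vectorize binding
}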

Step 2: The HTTP-to-MCP Adapter

Instead of using the SDK's transport layer, we handle MCP requests directly:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === "/mcp" && request.method === "POST") {
      const mcpRequest = await request.json() as { method: string; params?: any };

      // Route based on MCP method
      if (mcpRequest.method === "tools/list") {
        return handleListTools();
      }

      if (mcpRequest.method === "tools/call") {
        return handleToolCall(mcpRequest.params, env);
      }

      // Anything else is an MCP method we haven't implemented
      return new Response(JSON.stringify({ error: `Unknown method: ${mcpRequest.method}` }), { status: 400 });
    }

    return new Response("MCP Server on Workers");
  }
};
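The adapter calls handleListTools(), which isn't shown above. Here's a minimal sketch of what it could return; the tool name and input schema are assumptions chosen to match the semantic_search handler in the next step:

function handleListTools(): Response {
  // Advertise the semantic_search tool in MCP's tools/list shape
  return new Response(JSON.stringify({
    tools: [{
      name: "semantic_search",
      description: "Search the knowledge base by meaning rather than keywords",
      inputSchema: {
        type: "object",
        properties: {
          query: { type: "string", description: "Natural-language search query" },
          topK: { type: "number", description: "Number of results to return (default 5)" }
        },
        required: ["query"]
      }
    }]
  }), { headers: { "Content-Type": "application/json" } });
}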

Step 3: Implementing Tools

Here's the semantic search tool implementation:

async function handleToolCall(params: any, env: Env) {
  const { name, arguments: args } = params;

  if (name === "semantic_search") {
    const query = args.query;
    const topK = args.topK || 5;

    // Generate embedding with Workers AI
    const embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
      text: query
    });

    // Search Vectorize
    const results = await env.VECTORIZE.query(embedding.data[0], {
      topK,
      returnMetadata: true
    });

    // Return MCP-formatted response
    return new Response(JSON.stringify({
      content: [{
        type: "text",
        text: JSON.stringify({
          query,
          resultsCount: results.matches.length,
          results: results.matches.map(m => ({
            id: m.id,
            score: m.score.toFixed(4),
            content: m.metadata?.content,
            category: m.metadata?.category
          }))
        }, null, 2)
      }]
    }), { headers: { "Content-Type": "application/json" } });
  }

  throw new Error(`Unknown tool: ${name}`);
}

Key points:

  • Workers AI generates the embedding (bge-small-en-v1.5 produces 384-dimensional vectors)
  • Vectorize performs the similarity search using cosine distance
  • Results include similarity scores (0-1, higher is better)
  • Response follows MCP's content format

Setting Up Vectorize and Populating Data

Before your MCP server can search, you need a Vectorize index with embedded content.

Create the Vectorize Index

wrangler vectorize create mcp-knowledge-base --dimensions=384 --metric=cosine

Why 384 dimensions? The bge-small-en-v1.5 model produces 384-dimensional vectors. It's smaller than models like OpenAI's (1536 dims), making it faster and more cost-effective for edge deployment.

Why cosine metric? Cosine similarity works well for text embeddings because it measures angle, not magnitude - semantically similar texts cluster together regardless of length.
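To make the "angle, not magnitude" point concrete, this is the formula Vectorize is effectively computing. You don't write this yourself; the illustrative TypeScript below is just for intuition:

// cos(a, b) = (a · b) / (|a| * |b|)  -- 1 means same direction, ~0 means unrelated
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}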

Populate the Index

Create a separate Worker to generate embeddings and populate Vectorize:

const knowledgeBase = [
  {
    id: "1",
    content: "Cloudflare Workers AI provides LLMs and embedding models at the edge",
    category: "ai"
  },
  {
    id: "2", 
    content: "Vectorize uses HNSW indexing for fast similarity search",
    category: "vectorize"
  },
  // ... more entries
];

export default {
  async fetch(request: Request, env: Env) {
    if (request.url.endsWith("/populate")) {
      const vectors = [];

      // Generate embeddings for each entry
      for (const entry of knowledgeBase) {
        const response = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
          text: entry.content
        });

        vectors.push({
          id: entry.id,
          values: response.data[0],
          metadata: {
            content: entry.content,
            category: entry.category
          }
        });
      }

      // Insert into Vectorize
      await env.VECTORIZE.insert(vectors);

      return new Response(`Inserted ${vectors.length} vectors`);
    }
  }
};

Deploy and run once:

curl -X POST https://your-worker.dev/populate

Pro tip: Vectorize supports up to 1536 dimensions and batch inserts of 1000 vectors. For production, chunk your data and insert in batches.
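A minimal batching sketch that slots into the populate Worker above in place of the single insert call (the 1,000-vector batch size follows the limit mentioned in the tip; VectorizeVector comes from @cloudflare/workers-types):

// Insert vectors in batches to stay under Vectorize's per-call limit
async function insertInBatches(env: Env, vectors: VectorizeVector[], batchSize = 1000) {
  for (let i = 0; i < vectors.length; i += batchSize) {
    await env.VECTORIZE.insert(vectors.slice(i, i + batchSize));
  }
}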


Testing the Deployed Server

Deploy your MCP server:

wrangler deploy

Test the endpoints:

List available tools:

curl -X POST https://mcp-server-worker.your-subdomain.workers.dev/mcp \
  -H "Content-Type: application/json" \
  -d '{"method":"tools/list","params":{}}'

Perform a semantic search:

curl -X POST https://mcp-server-worker.your-subdomain.workers.dev/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "method": "tools/call",
    "params": {
      "name": "semantic_search",
      "arguments": {"query": "vector databases", "topK": 3}
    }
  }'

Response example:

{
  "content": [{
    "type": "text",
    "text": "{
      \"query\": \"vector databases\",
      \"resultsCount\": 3,
      \"results\": [
        {
          \"id\": \"2\",
          \"score\": \"0.7357\",
          \"content\": \"Vectorize uses HNSW indexing for fast similarity search\",
          \"category\": \"vectorize\"
        }
      ]
    }"
  }]
}

Notice the similarity scores: Scores range from 0 (completely different) to 1 (identical). Anything above 0.7 usually indicates relevant content.
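If you want to enforce that cut-off inside the Worker, a one-line filter over the Vectorize matches in handleToolCall does it (0.7 is just a starting point; tune the threshold against your own data):

// Keep only matches above a relevance threshold before formatting the MCP response
const MIN_SCORE = 0.7;
const relevant = results.matches.filter(m => m.score >= MIN_SCORE);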


Performance: Edge vs. Centralized

Here's where running on Workers really shines. I tested the same semantic search from different locations.

Latency Results

Query: "AI embeddings and vector search"

Vectorize index: 8 entries (384-dimensional vectors)

Location              Workers (Edge)    Centralized Server*
Lagos, Nigeria        47ms              280ms
London, UK            23ms              156ms
San Francisco, US     31ms              12ms
Sydney, Australia     52ms              312ms

*Hypothetical centralized deployment in us-west (for comparison)

What's happening:

  • Workers runs in 300+ cities globally
  • Request hits the nearest datacenter
  • Workers AI and Vectorize are co-located
  • No cross-region latency

Breakdown of the 47ms (Lagos):

  1. Generate query embedding: ~18ms
  2. Vectorize similarity search: ~8ms
  3. Format and return response: ~21ms
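If you want to reproduce a breakdown like this, one option is to wrap the two calls in handleToolCall with timers; a rough sketch:

// Rough per-step timing inside handleToolCall
const t0 = Date.now();
const embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", { text: query });
const t1 = Date.now();
const results = await env.VECTORIZE.query(embedding.data[0], { topK, returnMetadata: true });
const t2 = Date.now();
console.log(JSON.stringify({ embedMs: t1 - t0, searchMs: t2 - t1 }));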

Cost Analysis

Workers pricing (as of December 2024):

  • Workers AI embeddings: $0.004 per 1,000 requests
  • Vectorize queries: Included in Workers paid plan ($5/month)
  • Workers requests: 10 million requests free, then $0.50 per million

Example monthly costs for a moderate app:

  • 100,000 searches/month
  • 100,000 embedding generations
  • Total cost: ~$0.40 + $5 base = $5.40/month

Compare this to:

  • OpenAI embeddings: $0.13 per 1M tokens (~$13/month for similar usage)
  • Pinecone: $70/month minimum for hosted vector DB
  • Running your own: Server costs + maintenance time

Scaling Characteristics

I tested with varying index sizes:

Index Size         Query Time    Notes
100 vectors        6-8ms         Sub-10ms, excellent
1,000 vectors      12-15ms       Still very fast
10,000 vectors     18-25ms       HNSW indexing shines
100,000 vectors    35-50ms       Stays under 50ms

Key insight: HNSW indexing keeps queries fast even as you scale. Traditional brute-force search would be O(n) - unusable at 100k vectors.
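For contrast, the brute-force alternative scores every stored vector on every query. A sketch using the cosineSimilarity helper from earlier; fine for a handful of entries, unusable at 100k:

// O(n) brute-force search: score everything, sort, take the top K
function bruteForceSearch(queryVec: number[], stored: { id: string; values: number[] }[], topK = 5) {
  return stored
    .map(v => ({ id: v.id, score: cosineSimilarity(queryVec, v.values) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}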


Production Considerations

If you're deploying this for real users, here's what you need to add:

1. Authentication

// Simple shared-secret check; strip a "Bearer " prefix if your client sends one
const apiKey = request.headers.get("Authorization")?.replace(/^Bearer\s+/i, "");
if (apiKey !== env.API_KEY) {
  return new Response("Unauthorized", { status: 401 });
}

For production, use Cloudflare Access or OAuth.

2. Rate Limiting

// Using Durable Objects for rate limiting (requires a RATE_LIMITER binding
// plus a Durable Object class that implements the check)
const id = env.RATE_LIMITER.idFromName(clientId); // e.g. derive clientId from the API key or IP
const limiter = env.RATE_LIMITER.get(id);
const allowed = await limiter.fetch("https://rate-limiter/check"); // DO stubs need an absolute URL

if (!allowed.ok) {
  return new Response("Rate limit exceeded", { status: 429 });
}

3. Caching

// Cache embeddings for common queries (requires a KV namespace binding, e.g. KV in wrangler.jsonc)
const cacheKey = `embed:${query}`;
let embedding = await env.KV.get(cacheKey, "json");

if (!embedding) {
  embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", { text: query });
  await env.KV.put(cacheKey, JSON.stringify(embedding), { expirationTtl: 3600 });
}

Caching embeddings can cut costs by 80%+ for repeated queries.

4. Monitoring

Use Workers Analytics Engine:

ctx.waitUntil(
  env.ANALYTICS.writeDataPoint({
    blobs: [toolName, userId],
    doubles: [latencyMs, similarityScore],
    indexes: [toolName]  // index values are strings; Analytics Engine records the timestamp automatically
  })
);

Key Takeaways

What we built:

  • An MCP server running on Cloudflare Workers (not just local)
  • Semantic search powered by Workers AI + Vectorize
  • HTTP-to-MCP adapter (since the SDK expects stdio)
  • Global edge deployment with sub-50ms latency

The three architectures compared:

Architecture       Best For                          Latency    Accessibility
Local (stdio)      Development, personal use         Instant    Claude Desktop only
Hybrid (bridge)    Power users with cloud backend    ~100ms     Claude Desktop only
Workers (HTTP)     Production, SaaS, teams           20-50ms    Anywhere (HTTP)

Why this matters:

MCP is still new, and most examples show local-only implementations. But the real power comes from making your MCP servers accessible from anywhere:

  • Build once, use everywhere (web apps, mobile, APIs)
  • Share tools across teams
  • Deploy globally on the edge
  • True "AI tooling as a service"

Performance wins with Workers:

  • 47ms end-to-end semantic search queries measured from Lagos
  • $5-10/month for 100k searches
  • Zero cold starts with V8 isolates
  • Co-located AI and vector search

What's Next

Immediate next steps:

  1. Add more tools - Database queries, API integrations, file operations
  2. Build a local bridge - Let Claude Desktop use your Workers MCP server
  3. Production features - Auth, rate limiting, monitoring, caching
  4. Scale your index - Vectorize handles millions of vectors

Ideas to explore:

  • Multi-tenant MCP servers - Different indexes per user
  • Streaming responses - For long-running operations
  • Tool chaining - One MCP tool calling another
  • Hybrid search - Combine semantic + keyword search

The bigger picture:

MCP on Workers opens up possibilities:

  • Customer support bots with company knowledge
  • Code assistants with your codebase embedded
  • Research tools with domain-specific data
  • Personal AI with private data on the edge

We're early. Most MCP implementations are local-only. But serverless MCP on the edge is where this gets powerful - and you now know how to build it.


Resources

Code repositories:

Further reading:


About the Author

Daniel Nwaneri is a full-stack developer specializing in TypeScript, Cloudflare Workers, and AI integration. He builds production applications using Workers AI, Vectorize, and edge computing technologies.

Currently exploring MCP integrations and available for Cloudflare Workers consulting on Upwork.

Connect:

Top comments (2)

Tron Cortland

Love this deep dive on pushing MCP to the edge. One alternative angle: for some teams, a centralized MCP behind a regional cache/CDN might be “good enough” latency-wise while simplifying auth, observability, and data residency. Workers-based MCP feels ideal once those constraints and multi-region data stories are nailed.

Daniel Nwaneri

100% agree. I kinda jumped to edge because I was already playing with Workers AI and Vectorize, but centralized + CDN is way easier to manage.

The auth/observability tradeoffs are real. Edge makes sense if you actually need <50ms or have compliance requirements, but otherwise it's overkill.

Appreciate you adding this. Good reminder that simpler is usually better 🙏