Most MCP (Model Context Protocol) tutorials show you how to build local servers that connect to Claude Desktop via stdio. That's great for development, but what if you want your MCP server accessible from anywhere? What if you want it running on the edge with sub-50ms latency globally?
I built an MCP server on Cloudflare Workers that provides semantic search using Vectorize and Workers AI. Along the way, I discovered the MCP SDK wasn't designed for serverless environments - so I had to adapt it.
In this article, I'll show you:
- Why MCP on Workers is powerful (and tricky)
- Three different MCP architectures I built
- How to adapt the stdio-based SDK for HTTP
- A working semantic search implementation with Vectorize
- When to use each architecture
By the end, you'll have the knowledge to deploy production MCP servers on Cloudflare's edge network.
What we're building: A semantic search MCP server that lets Claude query a knowledge base using vector similarity instead of keywords. It runs globally on Cloudflare Workers, generates embeddings with Workers AI, and searches using Vectorize.
Why MCP on Workers?
MCP (Model Context Protocol) is Anthropic's standardized way for AI assistants to access external tools and data. Think of it as an API layer specifically designed for LLMs.
The problem: The official MCP SDK uses stdio (standard input/output) transport, which works great for local processes but doesn't work on serverless platforms like Cloudflare Workers.
Why I wanted Workers anyway:
- Global edge deployment - Your MCP server runs in 300+ cities worldwide
- Sub-50ms latency - Closer to users than any centralized server
- Integrated AI - Workers AI for embeddings, Vectorize for vector search, all in one platform
- Zero cold starts - V8 isolates are instant
- Cost effective - Free tier is generous, pay only for what you use
Three MCP Architectures: From Local to Edge
I built the same semantic search functionality three different ways. Each has tradeoffs.
Architecture 1: Local MCP Server (stdio)
How it works:
Claude Desktop ──stdio──> Local MCP Server ──HTTP──> Data Sources
This is the standard MCP pattern. Your server runs locally, connects to Claude Desktop via stdio, and can call external APIs or databases.
Pros:
- Official MCP SDK works out of the box
- Easy to debug (console.log everywhere)
- Fast local development
Cons:
- Only works with Claude Desktop
- Can't share with others
- Server dies when your laptop closes
Use case: Development, personal tools, local-only workflows
Code example:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

const server = new Server({
  name: "local-search",
  version: "1.0.0"
});

const transport = new StdioServerTransport();
await server.connect(transport);
Architecture 2: Hybrid MCP (Local Bridge → Workers)
How it works:
Claude Desktop ──stdio──> Local MCP Bridge ──HTTP──> Workers ──> Vectorize
The local MCP server acts as a bridge, translating stdio to HTTP calls to your Workers backend.
Pros:
- Works with Claude Desktop
- Heavy lifting happens on Workers (embeddings, vector search)
- Workers backend can be shared across multiple clients
Cons:
- Two moving parts (local + remote)
- Local bridge still required
- Extra network hop
Use case: Claude Desktop users who want cloud-powered tools
Code example:
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// Local bridge: forward tool calls to the Workers backend
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const response = await fetch("https://your-worker.dev/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(request.params)
  });
  // The Worker returns an MCP-formatted result ({ content: [...] })
  return await response.json();
});
This is what most people will build first - it lets you use Claude Desktop while leveraging Workers' power.
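Hooking the bridge into Claude Desktop is a single entry in claude_desktop_config.json. A minimal sketch, assuming your compiled bridge lives at /absolute/path/to/dist/bridge.js (swap in your real path and server name):

{
  "mcpServers": {
    "workers-search": {
      "command": "node",
      "args": ["/absolute/path/to/dist/bridge.js"]
    }
  }
}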
Architecture 3: Full HTTP MCP on Workers
How it works:
Any Client ──HTTP──> Workers MCP Server ──> Vectorize/AI/D1
The MCP server runs entirely on Workers, accepting HTTP requests instead of stdio.
Pros:
- Accessible from anywhere (web apps, APIs, mobile)
- Globally distributed on the edge
- No local dependencies
- True "MCP as a Service"
Cons:
- Can't use the official MCP SDK's stdio transport
- Need to implement HTTP-to-MCP adapter
- More complex to build
Use case: Production applications, SaaS products, team collaboration
This is the architecture I'll focus on for the rest of the article because it's the most powerful - and the least documented.
Building the Workers MCP Server
Here's where it gets interesting. The MCP SDK expects a stdio transport, but Workers use HTTP. We need to build an adapter.
The Core Challenge
The official MCP SDK does this:
const transport = new StdioServerTransport();
await server.connect(transport);
But in Workers, we receive HTTP requests:
export default {
async fetch(request: Request, env: Env): Promise<Response> {
// Need to handle MCP protocol here
}
}
The solution: Implement the MCP protocol manually over HTTP. MCP uses JSON-RPC style requests, so we can map them directly:
HTTP POST /mcp
Body: {"method": "tools/list", "params": {}}
Response: {"tools": [...]}
Step 1: Project Setup
Create a new Worker:
npm create cloudflare@latest mcp-server-worker
cd mcp-server-worker
npm install @modelcontextprotocol/sdk
Configure wrangler.jsonc with your bindings:
{
"name": "mcp-server-worker",
"main": "src/index.ts",
"compatibility_date": "2025-12-02",
"compatibility_flags": ["nodejs_compat"],
"ai": {
"binding": "AI"
},
"vectorize": [
{
"binding": "VECTORIZE",
"index_name": "your-index-name"
}
]
}
Why nodejs_compat? The MCP SDK has some Node.js dependencies. This flag enables them in Workers.
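One note: the snippets below assume an Env interface for these bindings. Here's a sketch matching the config above (type names come from @cloudflare/workers-types; run wrangler types to generate the exact ones for your project):

// Bindings matching wrangler.jsonc above - a sketch; `wrangler types` generates the real thing
interface Env {
  AI: Ai;                    // Workers AI binding
  VECTORIZE: VectorizeIndex; // Vectorize index binding (may be `Vectorize` in newer type packages)
}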
Step 2: The HTTP-to-MCP Adapter
Instead of using the SDK's transport layer, we handle MCP requests directly:
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === "/mcp" && request.method === "POST") {
      const mcpRequest = await request.json<{ method: string; params?: any }>();

      // Route based on MCP method
      if (mcpRequest.method === "tools/list") {
        return handleListTools();
      }
      if (mcpRequest.method === "tools/call") {
        return handleToolCall(mcpRequest.params, env);
      }

      return new Response(`Unknown method: ${mcpRequest.method}`, { status: 400 });
    }

    return new Response("MCP Server on Workers");
  }
};
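handleListTools just returns the tool manifest. A minimal sketch for the single semantic_search tool in this article:

// Sketch: advertise the semantic_search tool and its input schema
function handleListTools(): Response {
  return new Response(JSON.stringify({
    tools: [{
      name: "semantic_search",
      description: "Search the knowledge base using vector similarity",
      inputSchema: {
        type: "object",
        properties: {
          query: { type: "string", description: "Natural-language search query" },
          topK: { type: "number", description: "Number of results to return (default 5)" }
        },
        required: ["query"]
      }
    }]
  }), { headers: { "Content-Type": "application/json" } });
}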
Step 3: Implementing Tools
Here's the semantic search tool implementation:
async function handleToolCall(params: any, env: Env) {
const { name, arguments: args } = params;
if (name === "semantic_search") {
const query = args.query;
const topK = args.topK || 5;
// Generate embedding with Workers AI
const embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
text: query
});
// Search Vectorize
const results = await env.VECTORIZE.query(embedding.data[0], {
topK,
returnMetadata: true
});
// Return MCP-formatted response
return new Response(JSON.stringify({
content: [{
type: "text",
text: JSON.stringify({
query,
resultsCount: results.matches.length,
results: results.matches.map(m => ({
id: m.id,
score: m.score.toFixed(4),
content: m.metadata?.content,
category: m.metadata?.category
}))
}, null, 2)
}]
}));
}
throw new Error(`Unknown tool: ${name}`);
}
Key points:
- Workers AI generates the embedding (bge-small-en-v1.5 produces 384-dimensional vectors)
- Vectorize performs the similarity search using cosine distance
- Results include similarity scores (0-1, higher is better)
- Response follows MCP's content format
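For reference, the shape handleToolCall returns boils down to this (simplified from the MCP spec):

// Simplified result shape returned by handleToolCall above
interface ToolResult {
  content: Array<{
    type: "text";
    text: string; // JSON-stringified search results in our case
  }>;
}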
Setting Up Vectorize and Populating Data
Before your MCP server can search, you need a Vectorize index with embedded content.
Create the Vectorize Index
wrangler vectorize create mcp-knowledge-base --dimensions=384 --metric=cosine
Why 384 dimensions? The bge-small-en-v1.5 model produces 384-dimensional vectors. It's smaller than models like OpenAI's (1536 dims), making it faster and more cost-effective for edge deployment.
Why cosine metric? Cosine similarity works well for text embeddings because it measures angle, not magnitude - semantically similar texts cluster together regardless of length.
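If you want the intuition in code: cosine similarity is just the normalized dot product. Vectorize computes this for you, so the snippet below is purely illustrative:

// Illustration only - Vectorize handles this internally
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}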
Populate the Index
Create a separate Worker to generate embeddings and populate Vectorize:
const knowledgeBase = [
{
id: "1",
content: "Cloudflare Workers AI provides LLMs and embedding models at the edge",
category: "ai"
},
{
id: "2",
content: "Vectorize uses HNSW indexing for fast similarity search",
category: "vectorize"
},
// ... more entries
];
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.url.endsWith("/populate")) {
      const vectors: VectorizeVector[] = [];

      // Generate embeddings for each entry
      for (const entry of knowledgeBase) {
        const response = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
          text: entry.content
        });
        vectors.push({
          id: entry.id,
          values: response.data[0],
          metadata: {
            content: entry.content,
            category: entry.category
          }
        });
      }

      // Insert into Vectorize
      await env.VECTORIZE.insert(vectors);
      return new Response(`Inserted ${vectors.length} vectors`);
    }

    return new Response("Not found", { status: 404 });
  }
};
Deploy and run once:
curl -X POST https://your-worker.dev/populate
Pro tip: Vectorize supports up to 1536 dimensions and batch inserts of 1000 vectors. For production, chunk your data and insert in batches.
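A rough sketch of batched inserts (insertInBatches and BATCH_SIZE are my own names, not a Vectorize API):

// Hypothetical helper - chunk the vectors and insert batch by batch
const BATCH_SIZE = 1000;

async function insertInBatches(env: Env, vectors: VectorizeVector[]): Promise<number> {
  let inserted = 0;
  for (let i = 0; i < vectors.length; i += BATCH_SIZE) {
    const batch = vectors.slice(i, i + BATCH_SIZE);
    await env.VECTORIZE.insert(batch);
    inserted += batch.length;
  }
  return inserted;
}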
Testing the Deployed Server
Deploy your MCP server:
wrangler deploy
Test the endpoints:
List available tools:
curl -X POST https://mcp-server-worker.your-subdomain.workers.dev/mcp \
-H "Content-Type: application/json" \
-d '{"method":"tools/list","params":{}}'
Perform a semantic search:
curl -X POST https://mcp-server-worker.your-subdomain.workers.dev/mcp \
-H "Content-Type: application/json" \
-d '{
"method": "tools/call",
"params": {
"name": "semantic_search",
"arguments": {"query": "vector databases", "topK": 3}
}
}'
Response example:
{
"content": [{
"type": "text",
"text": "{
\"query\": \"vector databases\",
\"resultsCount\": 3,
\"results\": [
{
\"id\": \"2\",
\"score\": \"0.7357\",
\"content\": \"Vectorize uses HNSW indexing for fast similarity search\",
\"category\": \"vectorize\"
}
]
}"
}]
}
Notice the similarity scores: Scores range from 0 (completely different) to 1 (identical). Anything above 0.7 usually indicates relevant content.
Performance: Edge vs. Centralized
Here's where running on Workers really shines. I tested the same semantic search from different locations.
Latency Results
Query: "AI embeddings and vector search"
Vectorize index: 8 entries (384-dimensional vectors)
| Location | Workers (Edge) | Centralized Server* |
|---|---|---|
| Lagos, Nigeria | 47ms | 280ms |
| London, UK | 23ms | 156ms |
| San Francisco, US | 31ms | 12ms |
| Sydney, Australia | 52ms | 312ms |
*Hypothetical centralized deployment in us-west (for comparison)
What's happening:
- Workers runs in 300+ cities globally
- Request hits the nearest datacenter
- Workers AI and Vectorize are co-located
- No cross-region latency
Breakdown of the 47ms (Lagos):
- Generate query embedding: ~18ms
- Vectorize similarity search: ~8ms
- Format and return response: ~21ms
Cost Analysis
Workers pricing (as of December 2024):
- Workers AI embeddings: $0.004 per 1,000 requests
- Vectorize queries: Included in Workers paid plan ($5/month)
- Workers requests: 10 million requests free, then $0.50 per million
Example monthly costs for a moderate app:
- 100,000 searches/month
- 100,000 embedding generations
- Total cost: ~$0.40 + $5 base = $5.40/month
Compare this to:
- OpenAI embeddings: $0.13 per 1M tokens (~$13/month for similar usage)
- Pinecone: $70/month minimum for hosted vector DB
- Running your own: Server costs + maintenance time
Scaling Characteristics
I tested with varying index sizes:
| Index Size | Query Time | Notes |
|---|---|---|
| 100 vectors | 6-8ms | Sub-10ms, excellent |
| 1,000 vectors | 12-15ms | Still very fast |
| 10,000 vectors | 18-25ms | HNSW indexing shines |
| 100,000 vectors | 35-50ms | Stays under 50ms |
Key insight: HNSW indexing keeps queries fast even as you scale. Traditional brute-force search would be O(n) - unusable at 100k vectors.
Production Considerations
If you're deploying this for real users, here's what you need to add:
1. Authentication
const apiKey = request.headers.get("Authorization");
if (apiKey !== env.API_KEY) {
return new Response("Unauthorized", { status: 401 });
}
For production, use Cloudflare Access or OAuth.
2. Rate Limiting
// Using Durable Objects for rate limiting
const id = env.RATE_LIMITER.idFromName(clientId);
const limiter = env.RATE_LIMITER.get(id);
// Durable Object stubs need an absolute URL; the hostname is arbitrary
const allowed = await limiter.fetch("https://rate-limiter/check");
if (!allowed.ok) {
  return new Response("Rate limit exceeded", { status: 429 });
}
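The RATE_LIMITER binding implies a Durable Object class you write yourself (declared under durable_objects in wrangler.jsonc). A rough fixed-window sketch - 60 requests per minute, with counts resetting if the object is evicted, so use Durable Object storage if you need strict guarantees:

// Rough sketch of the Durable Object behind env.RATE_LIMITER
export class RateLimiter {
  private count = 0;
  private windowStart = 0;

  constructor(private state: DurableObjectState, private env: Env) {}

  async fetch(_request: Request): Promise<Response> {
    const now = Date.now();
    if (now - this.windowStart > 60_000) {
      // Start a new one-minute window
      this.windowStart = now;
      this.count = 0;
    }
    this.count++;
    return this.count <= 60
      ? new Response("ok")
      : new Response("rate limit exceeded", { status: 429 });
  }
}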
3. Caching
// Cache embeddings for common queries
const cacheKey = `embed:${query}`;
let embedding = await env.KV.get(cacheKey, "json");
if (!embedding) {
embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", { text: query });
await env.KV.put(cacheKey, JSON.stringify(embedding), { expirationTtl: 3600 });
}
Caching embeddings can cut costs by 80%+ for repeated queries.
4. Monitoring
Use Workers Analytics Engine:
ctx.waitUntil(
env.ANALYTICS.writeDataPoint({
blobs: [toolName, userId],
doubles: [latencyMs, similarityScore],
indexes: [timestamp]
})
);
Key Takeaways
What we built:
- An MCP server running on Cloudflare Workers (not just local)
- Semantic search powered by Workers AI + Vectorize
- HTTP-to-MCP adapter (since the SDK expects stdio)
- Global edge deployment with sub-50ms latency
The three architectures compared:
| Architecture | Best For | Latency | Accessibility |
|---|---|---|---|
| Local (stdio) | Development, personal use | Instant | Claude Desktop only |
| Hybrid (bridge) | Power users with cloud backend | ~100ms | Claude Desktop only |
| Workers (HTTP) | Production, SaaS, teams | 20-50ms | Anywhere (HTTP) |
Why this matters:
MCP is still new, and most examples show local-only implementations. But the real power comes from making your MCP servers accessible from anywhere:
- Build once, use everywhere (web apps, mobile, APIs)
- Share tools across teams
- Deploy globally on the edge
- True "AI tooling as a service"
Performance wins with Workers:
- 47ms end-to-end semantic search queries from Lagos
- $5-10/month for 100k searches
- Zero cold starts with V8 isolates
- Co-located AI and vector search
What's Next
Immediate next steps:
- Add more tools - Database queries, API integrations, file operations
- Build a local bridge - Let Claude Desktop use your Workers MCP server
- Production features - Auth, rate limiting, monitoring, caching
- Scale your index - Vectorize handles millions of vectors
Ideas to explore:
- Multi-tenant MCP servers - Different indexes per user
- Streaming responses - For long-running operations
- Tool chaining - One MCP tool calling another
- Hybrid search - Combine semantic + keyword search
The bigger picture:
MCP on Workers opens up possibilities:
- Customer support bots with company knowledge
- Code assistants with your codebase embedded
- Research tools with domain-specific data
- Personal AI with private data on the edge
We're early. Most MCP implementations are local-only. But serverless MCP on the edge is where this gets powerful - and you now know how to build it.
Resources
Code repositories:
- MCP Server on Workers - Full HTTP-based implementation
- Vectorize Worker - Embedding generation and search
- Local MCP Bridge - Stdio to HTTP adapter
About the Author
Daniel Nwaneri is a full-stack developer specializing in TypeScript, Cloudflare Workers, and AI integration. He builds production applications using Workers AI, Vectorize, and edge computing technologies.
Currently exploring MCP integrations and available for Cloudflare Workers consulting on Upwork.
Connect:
- GitHub: github.com/dannwaneri
- Upwork: Daniel Nwaneri
Top comments (2)
Love this deep dive on pushing MCP to the edge. One alternative angle: for some teams, a centralized MCP behind a regional cache/CDN might be “good enough” latency-wise while simplifying auth, observability, and data residency. Workers-based MCP feels ideal once those constraints and multi-region data stories are nailed.
100% agree. I kinda jumped to edge because I was already playing with Workers AI and Vectorize, but centralized + CDN is way easier to manage.
The auth/observability tradeoffs are real. Edge makes sense if you actually need <50ms or have compliance requirements, but otherwise it's overkill.
Appreciate you adding this.
good reminder that simpler is usually better 🙏