TL;DR
I deployed a semantic search system on Cloudflare's edge that costs $5-10/month instead of the typical $100-200+. It's faster, follows composable enterprise MCP architecture patterns, and handles production traffic. Here's how.
The Problem: AI Search is Expensive
Last month, I looked at typical AI infrastructure costs and realized why so many startups struggle to add semantic search.
Traditional RAG stack (for ~10,000 searches/month):
- Pinecone vector database: $50-70/month (Standard plan minimum)
- OpenAI embeddings API: $30-50/month (usage-based)
- AWS EC2 server (t3.medium): $35-50/month
- Monitoring/logging: $15-20/month
Total: $130-190/month for a feature that should be table stakes.
For a bootstrapped startup trying to add "AI-powered search" to their docs? That's $1,560-2,280/year before you've made a single dollar from the feature.
Something had to change.
The Hypothesis: What If Everything Ran on the Edge?
I'd been building MCP servers on Cloudflare Workers (wrote about it here), and I kept thinking: Why can't RAG run entirely on the edge?
Traditional setup has way too many hops:
User → App Server → OpenAI (embeddings) → Pinecone (search) → User
Each hop adds latency. Each service adds cost.
What if we could do this instead:
User → Cloudflare Edge (embeddings + search + response) → User
All in one place. No round trips. No idle servers burning money.
The Architecture: Collocate Everything
Here's what I built:
Vectorize MCP Worker - A single Cloudflare Worker that handles:
- Embedding generation (Workers AI)
- Vector search (Vectorize)
- Results formatting (in-worker)
- Authentication (built-in)
The entire stack runs on Cloudflare's edge in 300+ cities globally.
Technical Stack
- Workers AI: bge-small-en-v1.5 model (384-dimensional embeddings)
- Vectorize: Cloudflare's managed vector database (HNSW indexing)
- TypeScript: Full type safety
- HTTP API: Works from anywhere
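For context, the worker's environment bindings look roughly like this (a minimal sketch using types from @cloudflare/workers-types; the binding names are assumptions and must match your wrangler.toml):

```typescript
// Sketch of the Env the worker expects (binding names assumed; they must
// match the [ai] and [[vectorize]] sections of wrangler.toml).
export interface Env {
  AI: Ai;                    // Workers AI binding for embedding generation
  VECTORIZE: VectorizeIndex; // Vectorize index binding for vector search
  API_KEY?: string;          // Optional secret that enables production auth
}
```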
Core Code (Simplified)
Search endpoint:
async function searchIndex(query: string, topK: number, env: Env) {
  const startTime = Date.now();

  // Generate embedding (runs on-edge); Workers AI returns { shape, data }
  const embeddingStart = Date.now();
  const embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
    text: query,
  });
  const embeddingTime = Date.now() - embeddingStart;

  // Search vectors (also on-edge); query takes a raw number[] vector,
  // so unwrap the first (and only) embedding from the response
  const searchStart = Date.now();
  const results = await env.VECTORIZE.query(embedding.data[0], {
    topK,
    returnMetadata: true,
  });
  const searchTime = Date.now() - searchStart;

  return {
    query,
    results: results.matches,
    performance: {
      embeddingTime: `${embeddingTime}ms`,
      searchTime: `${searchTime}ms`,
      totalTime: `${Date.now() - startTime}ms`
    }
  };
}
That's it. No complex orchestration. No service mesh. Just Workers AI + Vectorize.
Composable MCP Architecture in Practice
Recent enterprise MCP discussions (Workato's excellent series) highlight that most implementations fail by exposing raw APIs instead of composable skills.
The Problem with Naive MCP Implementations
Many teams build MCP servers by wrapping existing APIs:
- get_guest_by_email
- get_booking_by_guest
- create_payment_intent
- charge_payment_method
- send_receipt_email
- ... 47 tools total
The LLM must orchestrate 6+ API calls per task. Result: slow, error-prone, terrible UX.
The Composable Approach
Instead, this worker exposes high-level skills aligned with user intent:
- semantic_search - Find relevant information
- intelligent_search - Search with AI synthesis
One tool call. Complete result. Backend handles all complexity.
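To make that concrete, here's a hedged sketch of how the two skills might be declared as MCP tool definitions (descriptions and schemas are illustrative, not copied from the repo):

```typescript
// Illustrative: two intent-level skills instead of 47 API wrappers.
// The exact schemas in the repo may differ.
const tools = [
  {
    name: "semantic_search",
    description: "Find the most relevant documents for a natural-language query",
    inputSchema: {
      type: "object",
      properties: {
        query: { type: "string", description: "Natural-language question" },
        topK: { type: "number", description: "Results to return (defaults to 5)" },
      },
      required: ["query"],
    },
  },
  {
    name: "intelligent_search",
    description: "Search the index, then synthesize an answer from the top matches",
    inputSchema: {
      type: "object",
      properties: {
        query: { type: "string", description: "Natural-language question" },
      },
      required: ["query"],
    },
  },
];
```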
The 9 Enterprise Patterns
This implementation follows 8 of 9 recommended enterprise MCP patterns:
1. Business Identifiers Over System IDs
// Users search with natural language
{ "query": "How does edge computing work?" }
// Not with database IDs
{ "vector_id": "a0I8d000001pRmXEAU" }
2. Atomic Operations
One tool call handles the entire workflow:
- Generate embedding (Workers AI)
- Search vectors (Vectorize)
- Format results
- Return performance metrics
No multi-step orchestration needed.
3. Smart Defaults
{
"query": "required",
"topK": "defaults to 5" // Reduce cognitive load
}
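In code, applying that default is one destructuring step (a sketch; the repo's parsing may differ):

```typescript
// Sketch: topK falls back to 5 when the caller omits it.
const { query, topK = 5 } = await request.json<{ query: string; topK?: number }>();
```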
4. Authorization Built-In
// Production mode requires API key
// Dev mode allows testing without auth
// Tools are automatically scoped
if (env.API_KEY && !isAuthorized(request)) {
return new Response("Unauthorized", { status: 401 });
}
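The isAuthorized helper referenced above might look like this (an assumed implementation; the repo's exact signature may differ):

```typescript
// Sketch: compare a standard Bearer header against the configured secret.
function isAuthorized(request: Request, env: Env): boolean {
  const header = request.headers.get("Authorization") ?? "";
  return header === `Bearer ${env.API_KEY}`;
}
```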
5. Error Documentation
Every error includes actionable hints:
{
"error": "topK must be between 1 and 20",
"hint": "Adjust your topK parameter to a value between 1-20"
}
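A sketch of the validation that could produce that error shape (illustrative, not the repo's exact code):

```typescript
// Sketch: reject out-of-range topK with an actionable hint.
if (topK < 1 || topK > 20) {
  return Response.json(
    {
      error: "topK must be between 1 and 20",
      hint: "Adjust your topK parameter to a value between 1-20",
    },
    { status: 400 }
  );
}
```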
6. Observable Performance
Built-in timing for every request:
{
"performance": {
"embeddingTime": "142ms",
"searchTime": "223ms",
"totalTime": "365ms"
}
}
7. Natural Language Alignment
Tool names match how people actually talk:
- "Search for X" →
semantic_search - Not "query_vector_database_with_cosine_similarity"
8. Defensive Composition
The /populate endpoint is idempotent - safe to call multiple times.
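One way to get that idempotency is Vectorize's upsert keyed on stable document IDs: calling it twice with the same IDs overwrites instead of duplicating. A sketch, assuming a hypothetical docs array of { id, text } pairs:

```typescript
// Sketch: stable IDs make repeated /populate calls safe to retry.
for (const doc of docs) {
  const embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
    text: doc.text,
  });
  await env.VECTORIZE.upsert([
    { id: doc.id, values: embedding.data[0], metadata: { text: doc.text } },
  ]);
}
```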
Benchmark Comparison
Enterprise composable design (from Workato's benchmarks):
- Response time: 2-4 seconds
- Success rate: 94%
- Tools needed: 12
- Calls per task: 1.8
This implementation:
- Response time: 365ms (6-10x faster)
- Success rate: ~100% (deterministic)
- Tools needed: 2 (minimal)
- Calls per task: 1 (one-shot)
The difference: Edge deployment + proper abstraction.
The Architecture Principle
Following Workato's guidance:
"Let LLMs handle intent, let backends handle execution."
LLM Responsibilities (Non-deterministic):
- Understanding user queries
- Selecting semantic_search vs intelligent_search
- Interpreting results for users
Backend Responsibilities (Deterministic):
- Generating embeddings reliably
- Querying vectors atomically
- Handling errors gracefully
- Ensuring consistent performance
- Managing authentication
This separation creates reliable, fast, user-friendly MCP tools - not fragile API wrappers.
The Results: Better AND Cheaper
Performance (Real Production Data)
I tested this from Port Harcourt, Nigeria to Cloudflare's edge on December 23, 2024:
| Operation | Time |
|---|---|
| Embedding generation | 142ms |
| Vector search | 223ms |
| Response formatting | <5ms |
| Total | 365ms |
Note: Performance varies by region and load. These are actual measurements from production deployment.
Cost Analysis (Actual Usage)
For 10,000 searches/day (300K/month):
My Solution:
- Workers: ~$3/month (based on CPU time)
- Workers AI: ~$3-5/month (at $0.011 per 1K neurons)
- Vectorize: ~$2/month (queried vector dimensions)
- Total: $8-10/month
Traditional Alternatives (estimated for same volume):
- Pinecone Standard: $50-70/month (minimum + usage)
- Weaviate Cloud: $25-40/month (depends on storage)
- Self-hosted pgvector: $40-60/month (server + maintenance)
Savings: 85-95% depending on alternative chosen.
The Free Tier is Generous
Cloudflare's free tier covers:
- 100,000 Workers requests/day
- 10,000 AI neurons/day
- 30M Vectorize queried vector dimensions/month
Most side projects and small businesses never leave the free tier.
Production Features (Because It's Not Just a Demo)
1. Authentication
// Optional API key for production
if (env.API_KEY && !isAuthorized(request)) {
return new Response("Unauthorized", { status: 401 });
}
Dev mode works without auth. Production requires it. Simple.
2. Performance Monitoring
Every response includes timing:
{
"query": "edge computing",
"results": [...],
"performance": {
"embeddingTime": "142ms",
"searchTime": "223ms",
"totalTime": "365ms"
}
}
No separate APM tool needed. It's built in.
3. Self-Documenting API
Hit GET / for full API docs:
{
"name": "Vectorize MCP Worker",
"endpoints": {
"POST /search": "Search the index",
"POST /populate": "Add documents",
"GET /stats": "Index statistics"
}
}
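The handler behind that is likely just a static JSON response on the root route (a sketch):

```typescript
// Sketch: serve the API docs from GET /.
const url = new URL(request.url);
if (request.method === "GET" && url.pathname === "/") {
  return Response.json({
    name: "Vectorize MCP Worker",
    endpoints: {
      "POST /search": "Search the index",
      "POST /populate": "Add documents",
      "GET /stats": "Index statistics",
    },
  });
}
```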
4. CORS Support
Pre-configured for web apps. Just works.
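What "pre-configured" might mean in practice (a permissive sketch; lock the origin down for real deployments):

```typescript
// Sketch: answer preflight requests and attach CORS headers to responses.
const corsHeaders = {
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
  "Access-Control-Allow-Headers": "Authorization, Content-Type",
};
if (request.method === "OPTIONS") {
  return new Response(null, { status: 204, headers: corsHeaders });
}
```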
Use Cases I've Seen Work
Internal Documentation Search
50-person startup with docs scattered across Notion, Google Docs, Confluence.
Before: Manual search. Employees wasted 30 mins/day looking for answers.
After: Semantic search finds the right doc in seconds.
Cost: $5/month (vs. $70 for Algolia DocSearch)
Customer Support Knowledge Base
SaaS with 500 support articles.
Before: Keyword search missed relevant articles.
After: AI-powered search suggests perfect matches.
Cost: $10/month (vs. $200+ for enterprise solutions)
Research Assistant
Academic with 1,000 PDFs.
Before: Ctrl+F through individual files.
After: Query entire library semantically.
Cost: $8/month
What I Learned
What Worked
1. Edge-first architecture is transformative
Collocating everything on the edge eliminated network hops. The performance improvement is immediate and measurable.
2. Composable tool design beats API wrappers
Exposing high-level skills instead of raw APIs made the system both faster and more reliable. The LLM focuses on intent, not orchestration.
3. Serverless pricing changes everything
When you're not paying for idle servers, you can experiment freely. Launch on Friday, usage spikes? NBD. It scales automatically.
4. Simple HTTP beats fancy SDKs
No version conflicts. No dependency hell. Just curl or fetch. Works from Python, Node, Go, whatever.
What Could Be Better
1. Local dev is awkward
Vectorize doesn't work in wrangler dev. You have to deploy to test search. Trade-off: fast iteration on everything else, deploy for full tests.
2. Knowledge base updates require redeployment
Currently, you edit the code and redeploy. Future: dynamic upload API. Trade-off: security vs. convenience.
3. 384 dimensions might not be enough for specialized domains
The bge-small-en-v1.5 model is great for general text. Medical or legal domains might benefit from larger models. Trade-off: speed vs. precision.
Cost Comparison Details
Methodology: All costs estimated for 10,000 searches/day (300K/month) with 10,000 stored vectors at 384 dimensions.
| Solution | Monthly Cost | Notes |
|---|---|---|
| This Worker | $8-10 | Cloudflare's published rates |
| Pinecone Standard | $50-70 | $50 minimum + usage |
| Weaviate Serverless | $25-40 | Usage-based pricing |
| Self-hosted + pgvector | $40-60 | Server + maintenance |
Prices as of December 2024. Your actual costs may vary based on usage patterns.
How to Deploy This Yourself
It's open source: https://github.com/dannwaneri/vectorize-mcp-worker
5-minute setup:
# Clone
git clone https://github.com/dannwaneri/vectorize-mcp-worker
cd vectorize-mcp-worker
npm install
# Create vector index
wrangler vectorize create mcp-knowledge-base --dimensions=384 --metric=cosine
# Deploy
wrangler deploy
# Set API key for production
openssl rand -base64 32 | wrangler secret put API_KEY
# Populate with your data
curl -X POST https://your-worker.workers.dev/populate \
-H "Authorization: Bearer YOUR_KEY"
# Search
curl -X POST https://your-worker.workers.dev/search \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"query": "your question", "topK": 3}'
Live demo: https://vectorize-mcp-worker.fpl-test.workers.dev
The Business Case
If you're a:
Startup founder: Stop overpaying for AI infrastructure. Deploy this for $5/month and focus your budget on features that differentiate you.
Consultant/Agency: You can now profitably include AI search in fixed-price projects. No ongoing infrastructure headaches to manage.
Enterprise team: Deploy per-department search without getting budget approval for $1,500+/year per team.
MCP Server Builder: Use this as a reference implementation for composable tool design that follows enterprise best practices.
The economics make sense. What used to require a dedicated line item is now cheaper than your team's daily coffee budget.
What's Next
I'm working on:
- [ ] Dynamic document upload API (no code changes needed)
- [ ] Semantic chunking for long documents
- [ ] Multi-modal support (images, tables)
- [ ] Comprehensive test suite
And I'm helping a few companies deploy this for their use cases. If you're spending $100+/month on AI search or building MCP servers, let's talk.
Connect
- GitHub: @dannwaneri
- Upwork: Profile
- Twitter: @dannwaneri
Questions? Comments? Building composable MCP tools too? Drop them below.
And if you found this useful, star the repo: https://github.com/dannwaneri/vectorize-mcp-worker
Related: MCP Sampling on Cloudflare Workers - How to build intelligent MCP tools without managing LLMs
Why Edge Computing Forced Me to Write Better Code - The economic forcing function behind this architecture
Inspired by: Beyond Basic MCP: Why Enterprise AI Needs Composable Architecture and Designing Composable Tools for Enterprise MCP