Intro
I built a Retrieval-Augmented Generation (RAG) system using Cloudflare Workers, their new Vectorize offering, and FAISS. Here's what I learned: edge computing is sexy, but it's not a silver bullet. The stack works, but you'll hit friction points that traditional approaches avoid. This isn't a tutorial—it's a postmortem of real tradeoffs.
Section 1: The Architecture That Looked Good on Paper
Why I Picked This Stack
Cloudflare Workers promised serverless inference without cold starts. Vectorize offered managed vector storage at the edge. FAISS (Facebook AI Similarity Search) promised blazing-fast local retrieval. On paper: zero latency, zero ops overhead, cost efficiency.
The reality was messier.
The setup:
- Store embeddings in Vectorize (Cloudflare's managed vector database)
- Deploy a Worker that chunks documents and generates embeddings with a local embedding model (all-MiniLM-L6-v2)
- Use FAISS as a fallback for local-only inference during development
# Install dependencies
npm install @cloudflare/workers-types wrangler faiss-node
pip install faiss-cpu langchain sentence-transformers
# Configure wrangler.toml: bind the Vectorize index (a vars entry won't give you env.VECTORIZE_INDEX)
[[env.production.vectorize]]
binding = "VECTORIZE_INDEX"
index_name = "rag-index-prod"
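For orientation, the ingestion path (chunk, embed, upsert into Vectorize) looks roughly like this. It's a sketch, not the production Worker: chunkDocument and generateEmbedding stand in for the real chunking and embedding helpers.
// Sketch of the ingestion path, assuming the VECTORIZE_INDEX binding above.
// chunkDocument and generateEmbedding are stand-ins for the real helpers, not shown here.
export default {
  async fetch(request, env) {
    const { id, text } = await request.json();
    const vectors = [];
    for (const [i, chunk] of chunkDocument(text).entries()) {
      vectors.push({
        id: `${id}-${i}`,                        // stable per-chunk ID
        values: await generateEmbedding(chunk),  // array of floats from the embedding model
        metadata: { docId: id, chunk: i },
      });
    }
    await env.VECTORIZE_INDEX.upsert(vectors);   // one write for all chunk vectors
    return new Response(`indexed ${vectors.length} chunks`);
  }
};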
The architecture looked solid. The execution revealed three hard problems.
Section 2: Where Cloudflare Workers + Vectorize Actually Breaks
Problem 1: Worker Execution Timeout vs. Embedding Generation
Cloudflare Workers have a 30-second CPU timeout in production. Generating embeddings for documents longer than ~2,000 tokens consistently exceeded this limit[2].
The workaround? Offload heavy lifting to a background job or Durable Objects. That defeats the "serverless simplicity" pitch.
# This works locally. This fails at the edge.
from sentence_transformers import SentenceTransformer

def embed_document(text: str):
    # Loading the model on every invocation makes this even worse inside a Worker
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(text, show_progress_bar=True)  # 5-15 seconds per doc
    return embeddings
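For the record, the Queues flavor of that workaround looks roughly like this. It's a sketch, not the code I shipped, and the binding and queue names are invented.
// Offloading with Cloudflare Queues. Binding and queue names here are made up:
//   [[queues.producers]]  binding = "EMBED_QUEUE"  queue = "embed-jobs"
//   [[queues.consumers]]  queue = "embed-jobs"
export default {
  // Request path: enqueue and return immediately, staying far under the CPU limit
  async fetch(request, env) {
    const { docId } = await request.json();
    await env.EMBED_QUEUE.send({ docId });
    return new Response("queued", { status: 202 });
  },
  // Background path: chunking and embedding happen off the user-facing request
  async queue(batch, env) {
    for (const msg of batch.messages) {
      // fetch the document, chunk it, embed it, upsert into Vectorize
      msg.ack();
    }
  }
};
It works, but it puts a queue, a consumer, and retry semantics between "document uploaded" and "document searchable."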
Problem 2: Vectorize API Latency Isn't What They Advertise
Vectorize queries showed 200-400ms response times even for simple similarity searches. The marketing said "edge speed." The reality: you're still hitting a database round-trip. Local FAISS completed the same query in <5ms.
The gap compounds when you need to retrieve multiple document chunks and rerank them. Your RAG pipeline becomes a series of network hops, not a tight loop.
// Worker code - this is slower than it looks
export default {
  async fetch(request, env) {
    const query = await request.json();
    // generateEmbedding: my embedding helper (not shown here)
    const embedding = await generateEmbedding(query.text);
    // This API call = 200-400ms of waiting
    const results = await env.VECTORIZE_INDEX.query(embedding);
    return new Response(JSON.stringify(results));
  }
};
Problem 3: Cloudflare Workers Don't Handle Stateful AI Well
LLM inference needs consistent sampling settings (temperature, top_k) and a token budget tracked across requests. Workers are stateless and ephemeral. Storing inference state? You need KV or Durable Objects: more abstractions, more latency.
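If you're curious what that looks like, here's a rough sketch of parking sampling parameters in a Durable Object. The class name, keys, and defaults are mine; binding and migration config in wrangler.toml are omitted.
// Hypothetical Durable Object that pins sampling parameters for one session.
export class InferenceSession {
  constructor(state, env) {
    this.state = state;
  }

  async fetch(request) {
    if (request.method === "PUT") {
      // Persist new sampling settings for this session
      await this.state.storage.put("params", await request.json());
    }
    // Every read is a storage hop to recover what a local process keeps in a variable
    const params = (await this.state.storage.get("params")) ?? { temperature: 0.2, top_k: 40 };
    return new Response(JSON.stringify(params));
  }
}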
Running inference inside a Worker hit the CPU timeout wall immediately. Running it on external GPUs added network latency that erased any edge advantage.
Section 3: Why Local RAG Still Wins (For Now)
FAISS + Local Embeddings = Predictable Performance
I spun up a local stack with FAISS + Ollama (local LLM). Results:
- Embedding latency: <5ms (vs. 200ms Vectorize)
- Query latency: <10ms (vs. 300ms+ round-trip)
- Inference latency: ~500ms (vs. 1-2s API calls to external LLMs)
- Cost: $0 (vs. per-request pricing)
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load model once
model = SentenceTransformer('all-MiniLM-L6-v2')

# Index documents locally
documents = ["doc1", "doc2", "doc3"]
embeddings = np.array([model.encode(doc) for doc in documents], dtype="float32")  # FAISS expects float32
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Query: <5ms
query_embedding = model.encode("user question")
distances, indices = index.search(np.array([query_embedding], dtype="float32"), k=3)  # k <= number of indexed docs
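The ~500ms inference number comes from Ollama running on the same box; the generation step is just a local HTTP call. Roughly this, with the model name and prompt format as placeholders:
// Glue for the generation step: hand retrieved chunks to the local Ollama server.
// Assumes Ollama is listening on localhost:11434 with a model already pulled.
async function answer(question, topDocs) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",                                              // any locally pulled model
      prompt: `Context:\n${topDocs.join("\n---\n")}\n\nQuestion: ${question}`,
      stream: false                                                 // single JSON response, no streaming
    })
  });
  const data = await res.json();
  return data.response;                                             // Ollama returns { response: "..." }
}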
The Reranking Problem Cloudflare Can't Solve
RAG quality depends on reranking—running retrieved documents through a cross-encoder to filter noise. This is CPU-intensive and requires local state (loaded models).
Cloudflare Workers can't run this efficiently. You have to make another API call or accept poor retrieval quality. Local setups run reranking in the same process, no overhead.
# This should be local, not a network call
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
docs = ["document1", "document2"]  # chunks returned by the FAISS search
scores = reranker.predict([["query", doc] for doc in docs])

# Ranked results stay in memory, ready for the LLM
top_docs = [docs[i] for i in scores.argsort()[-5:][::-1]]
The Real Conclusions
Cloudflare Workers + Vectorize works for: Simple, stateless queries where latency tolerance is high (>1 second is acceptable).
Cloudflare Workers + Vectorize fails for: RAG pipelines requiring sub-200ms retrieval, reranking, or heavy inference. The abstraction leaks. You end up managing Durable Objects, KV fallbacks, and external services anyway.
Local RAG wins because:
- No network overhead. All operations run in-process.
- State management is trivial. Keep loaded models in memory.
- Inference quality is higher. Run smaller, faster models locally without timeout pressure.
- Cost is predictable. No per-request fees on retrieval.
Where edge computing actually makes sense: Lightweight inference (classification, routing), not RAG.
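That kind of fit looks like a tiny classification or routing Worker on Workers AI. A sketch, assuming an AI binding and one of Cloudflare's hosted classifier models:
// Lightweight edge inference: classify or route a request, no retrieval, no state.
// Assumes an [ai] binding named AI in wrangler.toml.
export default {
  async fetch(request, env) {
    const { text } = await request.json();
    // Small hosted classifier, comfortably inside Worker CPU limits
    const result = await env.AI.run("@cf/huggingface/distilbert-sst-2-int8", { text });
    return new Response(JSON.stringify(result));
  }
};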
The lesson: Don't use Cloudflare Workers because they're trendy. Use them because your problem fits their constraints. RAG doesn't. Deploy locally or to a traditional server with a GPU, and you'll ship faster and save money.
Sources
[1] https://mastafu.info/blog/chunking-express-jak-zbudowac-swoj-system-rag-przewodnik-eksperta
[2] https://www.unite.ai/pl/building-llm-agents-for-rag-from-scratch-and-beyond-a-comprehensive-guide/
[3] https://jangulczynski.pl/rag/
[4] https://codelabs.developers.google.com/build-google-quality-rag
[8] https://pl.asseco.com/kariera/blog/co-to-jest-rag-wstep-do-langchain-5358/