Three weeks ago I needed a way to query a large document corpus without sending everything to an LLM every time. The answer was a RAG (Retrieval-Augmented Generation) pipeline — but I wanted to build it inside n8n, not a Python script that I'd have to maintain separately.
Here's the architecture I landed on, and why each decision was made.
The Problem
I had 3,000+ pages of documentation spread across Google Drive. I needed Claude to answer questions about it accurately — not hallucinate, not miss context, not time out from context window limits.
Sending all 3,000 pages to Claude on every query wasn't viable. Cost, latency, and context limits made it impossible.
The solution: store embeddings in a vector database, retrieve only the relevant chunks at query time, then pass those chunks to Claude.
The Stack
- n8n (self-hosted) — orchestration layer
- Google Drive — document source
- OpenAI text-embedding-3-small — embedding model
- Supabase pgvector — vector storage and similarity search
- Claude Sonnet — answer generation
The Ingestion Pipeline
This workflow runs once (or whenever documents are added):
Google Drive Trigger
→ Split Into Chunks (1500 tokens, 200 overlap)
→ OpenAI Embeddings
→ Supabase INSERT (content + embedding + metadata)
The chunking step is a Code node that splits text by paragraph boundaries first, then falls back to character count. This keeps semantic units together better than pure character splitting.
// Chunking logic (simplified)
const chunks = [];
const paragraphs = text.split(/\n\n+/);
let currentChunk = '';
for (const para of paragraphs) {
if ((currentChunk + para).length > 1500) {
if (currentChunk) chunks.push(currentChunk.trim());
currentChunk = para;
} else {
currentChunk += '\n\n' + para;
}
}
if (currentChunk) chunks.push(currentChunk.trim());
The Query Pipeline
This runs on every user question:
Webhook (question input)
→ OpenAI Embeddings (embed the question)
→ Supabase RPC match_documents (vector similarity search)
→ Claude (question + retrieved chunks → answer)
→ Webhook Response
The Supabase similarity search uses a custom function:
CREATE OR REPLACE FUNCTION match_documents(
query_embedding vector(1536),
match_threshold float,
match_count int
)
RETURNS TABLE (content text, similarity float)
AS $$
SELECT content, 1 - (embedding <=> query_embedding) AS similarity
FROM documents
WHERE 1 - (embedding <=> query_embedding) > match_threshold
ORDER BY similarity DESC
LIMIT match_count;
$$ LANGUAGE sql STABLE;
The Results
- Latency: Under 5 seconds end-to-end on a 3,000-page corpus
- Accuracy: Significantly better than full-context approaches — Claude only sees relevant chunks
- Cost: ~$0.002 per query (embedding + Claude Sonnet)
- Maintenance: Zero. The n8n workflow runs on Supabase's free tier.
What I Would Do Differently
Hybrid search: pgvector's cosine similarity alone misses exact keyword matches. Adding a BM25 layer (keyword search) and combining scores improves recall on specific term queries. Supabase supports this natively now.
Reranking: After retrieval, running a reranker model (like Cohere Rerank) before passing to Claude improves answer quality — especially when chunks are topically similar.
Metadata filtering: Store document type, date, and source as metadata columns. Filter before vector search when the user specifies a document type. This cuts retrieval time in half.
The n8n Template
I packaged this as a ready-to-import n8n workflow with:
- Complete ingestion pipeline
- Query pipeline with error handling
- Supabase schema SQL
- Setup guide
Available on Gumroad: search "RAG Pipeline n8n Supabase"
If you want the complete workflow JSON or have questions about the Supabase setup, drop a comment below.
Top comments (0)