The pattern that keeps repeating
If you've built a RAG application, this probably sounds familiar:
- You pick an embedding model
- You set up a vector store
- You write chunking logic
- You wire everything together
- You realize the chunking doesn't work for your use case
- You rewrite half the pipeline
The models are the easy part. The pipeline glue is where projects slow down — and where most teams burn weeks they didn't plan for.
A support chatbot needs sentence-level chunks. A legal search tool needs paragraph-level chunks with overlap. An internal knowledge base needs something in between. But every time you change one component, you end up rewiring the whole pipeline.
The actual problem
It's not that building a RAG pipeline is hard. It's that iterating on one is painful.
You pick a chunking strategy, embed a few thousand documents, and your retrieval quality is... okay. Not great. So you want to try a different approach. But that means:
- Re-processing all your documents
- Re-generating all your embeddings
- Hoping the new strategy is actually better
- Doing all of this without breaking what's already working
Most teams don't experiment. They ship the first thing that "kind of works" and move on. Retrieval quality suffers, but the cost of iteration is too high.
What I'm building
I started working on klay+ — a composable RAG infrastructure layer where every component is independently swappable.
The core idea: your application code shouldn't change when you change your RAG strategy.
Here's what that looks like in practice:
Ingestion
Feed in PDFs, Markdown, HTML, or plain text. The content gets normalized automatically — no format-specific parsing logic in your app.
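To make "normalized automatically" concrete, here's a minimal sketch of what format normalization could look like. This is illustrative only — the function and type names are mine, not klay+'s actual API — and a real ingestion layer would use proper parsers rather than regexes:

```typescript
// Illustrative sketch: normalize HTML or Markdown input to plain text
// so downstream chunking never sees format-specific markup.
// (Names are hypothetical, not klay+'s real API.)

type SourceFormat = "html" | "markdown" | "text";

function normalize(content: string, format: SourceFormat): string {
  switch (format) {
    case "html":
      // Drop tags, collapse whitespace.
      return content.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
    case "markdown":
      // Strip common markers: headings, emphasis, inline code.
      return content
        .replace(/^#+\s*/gm, "")
        .replace(/[*_`]/g, "")
        .replace(/\s+/g, " ")
        .trim();
    case "text":
      return content.trim();
  }
}
```

The point is the shape: whatever the input format, the rest of the pipeline only ever sees plain text.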
Chunking
Choose your strategy per use case:
- Recursive — split by structure (headings, paragraphs, sentences)
- Sentence-aware — keep semantic units intact
- Fixed-size — predictable token counts for context windows
- Custom — bring your own logic
The key: switching from recursive to sentence-aware chunking doesn't require touching your application code or your retrieval logic.
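The "swap without touching application code" idea boils down to coding against an interface. Here's a rough sketch of the pattern, assuming a common `Chunker` interface (the names are illustrative, not klay+'s actual API):

```typescript
// Illustrative sketch: strategy-swappable chunking behind one interface.

interface Chunker {
  chunk(text: string): string[];
}

// Fixed-size: predictable chunk lengths (characters here for brevity;
// a real implementation would count tokens).
class FixedSizeChunker implements Chunker {
  constructor(private size: number) {}
  chunk(text: string): string[] {
    const out: string[] = [];
    for (let i = 0; i < text.length; i += this.size) {
      out.push(text.slice(i, i + this.size));
    }
    return out;
  }
}

// Sentence-aware: keep semantic units intact.
class SentenceChunker implements Chunker {
  chunk(text: string): string[] {
    return text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  }
}

// Application code depends only on the interface, so swapping
// strategies is a one-line change at the call site.
function ingest(text: string, chunker: Chunker): string[] {
  return chunker.chunk(text);
}
```

Swapping `new FixedSizeChunker(512)` for `new SentenceChunker()` changes the chunking behavior without touching `ingest` or anything downstream.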
Embedding
Plug in the provider that fits your stage:
- Hash-based — zero API cost, great for local development
- OpenAI / Cohere — production-grade quality
- Local models via WebLLM — self-hosted, no data leaves your infra
Swap providers without re-architecting. Your retrieval layer doesn't know or care which embedder generated the vectors.
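The same interface trick applies to embedders. Below is a sketch of a provider-agnostic `Embedder` with a hash-based implementation — the kind of zero-cost option useful for local development. Again, these names are illustrative, not klay+'s real API:

```typescript
// Illustrative sketch: provider-agnostic embedding interface with a
// hash-based implementation for zero-cost local development.

interface Embedder {
  embed(text: string): number[];
}

// Hash each token into a bucket of a fixed-size vector. No API calls,
// deterministic, and good enough to exercise the pipeline end to end.
class HashEmbedder implements Embedder {
  constructor(private dims: number = 64) {}
  embed(text: string): number[] {
    const vec = new Array(this.dims).fill(0);
    for (const token of text.toLowerCase().split(/\W+/)) {
      if (!token) continue;
      let h = 0;
      for (let i = 0; i < token.length; i++) {
        h = (h * 31 + token.charCodeAt(i)) | 0;
      }
      vec[((h % this.dims) + this.dims) % this.dims] += 1;
    }
    return vec;
  }
}
```

An OpenAI- or WebLLM-backed class would implement the same `embed` signature, so the retrieval layer genuinely never knows which provider produced a vector.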
Retrieval
Query by meaning, not keywords. Results come back ranked by relevance scores. Your application gets a clean interface regardless of what's happening underneath.
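"Ranked by relevance scores" typically means cosine similarity between the query vector and each stored vector. A minimal sketch of that ranking step (illustrative names; any real vector store does this with an index rather than a linear scan):

```typescript
// Illustrative sketch: rank stored chunks by cosine similarity to the
// query vector, regardless of which embedder produced the vectors.

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

interface ScoredChunk { text: string; score: number; }

function retrieve(
  query: number[],
  index: { text: string; vector: number[] }[],
  topK = 3
): ScoredChunk[] {
  return index
    .map((e) => ({ text: e.text, score: cosine(query, e.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```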
The part I'm most excited about: parallel projections
This is the feature that solves the iteration problem. You can generate a new projection — different chunking, different embedding, different strategy — side by side with your production index.
Compare retrieval quality before committing to a migration. No downtime, no risk.
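One simple way to compare two projections is to run the same queries against both and measure how much their top-k results agree. A sketch of that comparison metric (my own illustrative helper, not a klay+ API):

```typescript
// Illustrative sketch: overlap between the top-k result texts of two
// projections for the same query. 1.0 means both projections retrieve
// the same chunks; 0.0 means nothing in common.

function topKOverlap(current: string[], candidate: string[]): number {
  const candidateSet = new Set(candidate);
  const common = current.filter((t) => candidateSet.has(t)).length;
  return common / Math.max(current.length, 1);
}
```

Run this over a held-out query set: low overlap plus better spot-checked answers from the candidate projection is a strong signal the migration is worth it; high overlap means the new strategy changes little.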
Technical decisions
A few choices worth mentioning:
- Self-hostable — your documents don't leave your infrastructure if you don't want them to
- No vendor lock-in — every component has multiple provider options
- Static configuration — strategies are defined declaratively, not buried in application code
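To illustrate what declarative configuration might look like, here's a hypothetical pipeline config — field names and option values are mine, chosen to mirror the components above, not klay+'s actual schema:

```typescript
// Illustrative sketch: the RAG strategy lives in data, not in
// application code, so changing strategy means editing config.

interface PipelineConfig {
  chunking: {
    strategy: "recursive" | "sentence" | "fixed";
    maxTokens?: number;
    overlap?: number;
  };
  embedding: {
    provider: "hash" | "openai" | "cohere" | "webllm";
    model?: string;
  };
}

const config: PipelineConfig = {
  chunking: { strategy: "sentence", overlap: 1 },
  embedding: { provider: "hash" },
};
```

Switching to paragraph-level chunks with an OpenAI embedder would be a two-line config change, with no application code touched.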
Where it stands
klay+ is in early development. I'm collecting feedback from developers who are building with RAG to understand which pain points matter most.
If you've fought with RAG pipelines before, I'd genuinely love to hear:
- What part of the pipeline costs you the most time?
- How do you handle iteration on retrieval quality?
- What's your current stack and what would you swap if you could?
The landing page is here if you want to follow along: klay-plus-landing.vercel.app