The pattern that keeps repeating
If you've built a RAG application, this probably sounds familiar:
- You pick an embedding model
- You set up a vector store
- You write chunking logic
- You wire everything together
- You realize the chunking doesn't work for your use case
- You rewrite half the pipeline
The models are the easy part. The pipeline glue is where projects slow down — and where most teams burn weeks they didn't plan for.
A support chatbot needs sentence-level chunks. A legal search tool needs paragraph-level chunks with overlap. An internal knowledge base needs something in between. But every time you change one component, you end up rewiring the whole pipeline.
The actual problem
It's not that building a RAG pipeline is hard. It's that iterating on one is painful.
You pick a chunking strategy, embed a few thousand documents, and your retrieval quality is... okay. Not great. So you want to try a different approach. But that means:
- Re-processing all your documents
- Re-generating all your embeddings
- Hoping the new strategy is actually better
- Doing all of this without breaking what's already working
Most teams don't experiment. They ship the first thing that "kind of works" and move on. Retrieval quality suffers, but the cost of iteration is too high.
What I'm building
I started working on klay+ — a composable RAG infrastructure layer where every component is independently swappable.
The core idea: your application code shouldn't change when you change your RAG strategy.
Here's what that looks like in practice:
Ingestion
Feed in PDFs, Markdown, HTML, or plain text. The content gets normalized automatically — no format-specific parsing logic in your app.
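To make "normalized automatically" concrete, here's a minimal sketch of what format normalization could look like. This is illustrative only — the function and type names are mine, not klay+'s actual API — and a real ingestion layer would use proper parsers rather than regexes:

```typescript
// Illustrative sketch: normalize HTML or Markdown input to plain text
// so downstream chunking never sees format-specific markup.
// (Names are hypothetical, not klay+'s real API.)

type SourceFormat = "html" | "markdown" | "text";

function normalize(content: string, format: SourceFormat): string {
  switch (format) {
    case "html":
      // Drop tags, collapse whitespace.
      return content.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
    case "markdown":
      // Strip common markers: headings, emphasis, inline code.
      return content
        .replace(/^#+\s*/gm, "")
        .replace(/[*_`]/g, "")
        .replace(/\s+/g, " ")
        .trim();
    case "text":
      return content.trim();
  }
}
```

The point is the shape: whatever the input format, the rest of the pipeline only ever sees plain text.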
Chunking
Choose your strategy per use case:
- Recursive — split by structure (headings, paragraphs, sentences)
- Sentence-aware — keep semantic units intact
- Fixed-size — predictable token counts for context windows
- Custom — bring your own logic
The key: switching from recursive to sentence-aware chunking doesn't require touching your application code or your retrieval logic.
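The "swap without touching application code" idea boils down to coding against an interface. Here's a rough sketch of the pattern, assuming a common `Chunker` interface (the names are illustrative, not klay+'s actual API):

```typescript
// Illustrative sketch: strategy-swappable chunking behind one interface.

interface Chunker {
  chunk(text: string): string[];
}

// Fixed-size: predictable chunk lengths (characters here for brevity;
// a real implementation would count tokens).
class FixedSizeChunker implements Chunker {
  constructor(private size: number) {}
  chunk(text: string): string[] {
    const out: string[] = [];
    for (let i = 0; i < text.length; i += this.size) {
      out.push(text.slice(i, i + this.size));
    }
    return out;
  }
}

// Sentence-aware: keep semantic units intact.
class SentenceChunker implements Chunker {
  chunk(text: string): string[] {
    return text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  }
}

// Application code depends only on the interface, so swapping
// strategies is a one-line change at the call site.
function ingest(text: string, chunker: Chunker): string[] {
  return chunker.chunk(text);
}
```

Swapping `new FixedSizeChunker(512)` for `new SentenceChunker()` changes the chunking behavior without touching `ingest` or anything downstream.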
Embedding
Plug in the provider that fits your stage:
- Hash-based — zero API cost, great for local development
- OpenAI / Cohere — production-grade quality
- Local models via WebLLM — self-hosted, no data leaves your infra
Swap providers without re-architecting. Your retrieval layer doesn't know or care which embedder generated the vectors.
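The same interface trick applies to embedders. Below is a sketch of a provider-agnostic `Embedder` with a hash-based implementation — the kind of zero-cost option useful for local development. Again, these names are illustrative, not klay+'s real API:

```typescript
// Illustrative sketch: provider-agnostic embedding interface with a
// hash-based implementation for zero-cost local development.

interface Embedder {
  embed(text: string): number[];
}

// Hash each token into a bucket of a fixed-size vector. No API calls,
// deterministic, and good enough to exercise the pipeline end to end.
class HashEmbedder implements Embedder {
  constructor(private dims: number = 64) {}
  embed(text: string): number[] {
    const vec = new Array(this.dims).fill(0);
    for (const token of text.toLowerCase().split(/\W+/)) {
      if (!token) continue;
      let h = 0;
      for (let i = 0; i < token.length; i++) {
        h = (h * 31 + token.charCodeAt(i)) | 0;
      }
      vec[((h % this.dims) + this.dims) % this.dims] += 1;
    }
    return vec;
  }
}
```

An OpenAI- or WebLLM-backed class would implement the same `embed` signature, so the retrieval layer genuinely never knows which provider produced a vector.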
Retrieval
Query by meaning, not keywords. Results come back ranked by relevance scores. Your application gets a clean interface regardless of what's happening underneath.
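"Ranked by relevance scores" typically means cosine similarity between the query vector and each stored vector. A minimal sketch of that ranking step (illustrative names; any real vector store does this with an index rather than a linear scan):

```typescript
// Illustrative sketch: rank stored chunks by cosine similarity to the
// query vector, regardless of which embedder produced the vectors.

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

interface ScoredChunk { text: string; score: number; }

function retrieve(
  query: number[],
  index: { text: string; vector: number[] }[],
  topK = 3
): ScoredChunk[] {
  return index
    .map((e) => ({ text: e.text, score: cosine(query, e.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```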
The part I'm most excited about: parallel projections
This is the feature that solves the iteration problem. You can generate a new projection — different chunking, different embedding, different strategy — side by side with your production index.
Compare retrieval quality before committing to a migration. No downtime, no risk.
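One simple way to compare two projections is to run the same queries against both and measure how much their top-k results agree. A sketch of that comparison metric (my own illustrative helper, not a klay+ API):

```typescript
// Illustrative sketch: overlap between the top-k result texts of two
// projections for the same query. 1.0 means both projections retrieve
// the same chunks; 0.0 means nothing in common.

function topKOverlap(current: string[], candidate: string[]): number {
  const candidateSet = new Set(candidate);
  const common = current.filter((t) => candidateSet.has(t)).length;
  return common / Math.max(current.length, 1);
}
```

Run this over a held-out query set: low overlap plus better spot-checked answers from the candidate projection is a strong signal the migration is worth it; high overlap means the new strategy changes little.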
Technical decisions
A few choices worth mentioning:
- Self-hostable — your documents don't leave your infrastructure if you don't want them to
- No vendor lock-in — every component has multiple provider options
- Static configuration — strategies are defined declaratively, not buried in application code
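To illustrate what declarative configuration might look like, here's a hypothetical pipeline config — field names and option values are mine, chosen to mirror the components above, not klay+'s actual schema:

```typescript
// Illustrative sketch: the RAG strategy lives in data, not in
// application code, so changing strategy means editing config.

interface PipelineConfig {
  chunking: {
    strategy: "recursive" | "sentence" | "fixed";
    maxTokens?: number;
    overlap?: number;
  };
  embedding: {
    provider: "hash" | "openai" | "cohere" | "webllm";
    model?: string;
  };
}

const config: PipelineConfig = {
  chunking: { strategy: "sentence", overlap: 1 },
  embedding: { provider: "hash" },
};
```

Switching to paragraph-level chunks with an OpenAI embedder would be a two-line config change, with no application code touched.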
Where it stands
klay+ is in early development. I'm collecting feedback from developers who are building with RAG to understand which pain points matter most.
If you've fought with RAG pipelines before, I'd genuinely love to hear:
- What part of the pipeline costs you the most time?
- How do you handle iteration on retrieval quality?
- What's your current stack and what would you swap if you could?
The landing page is here if you want to follow along: klay-plus-landing.vercel.app