Stop Getting 'It Depends' Answers About RAG Architecture

#rag #llm #ai #opensource

Ask five AI engineers which vector database to use for your RAG system. You'll get five different answers, and they'll all start with "it depends."

It depends on your data volume. It depends on your query patterns. It depends on whether you need GDPR compliance. It depends on your team's infra maturity. It depends on your budget. It depends on whether you're doing hybrid search.

The "it depends" answer is technically correct and operationally useless. It turns an architecture decision into an unbounded research project.

I built RAG Readiness to make one specific recommendation per component — and explain why.

→ Full tool page

The Design Principle: Opinions, Not Options

Most RAG tooling and documentation presents you with a comparison table. Pinecone vs. Weaviate vs. Qdrant vs. Chroma. BM25 vs. dense vs. hybrid. ada-002 vs. text-embedding-3-large.

Comparison tables are useful if you already know which dimensions matter for your use case. They're paralyzing if you don't.

RAG Readiness is opinionated by design. You describe your use case, your data, your constraints. The tool returns one choice per component — with full reasoning.

If GDPR applies, managed cloud vector databases are eliminated from consideration before Claude, the LLM that writes the recommendation, is even called. That's a rule, not an LLM judgment. The recommendation you receive is already constraint-filtered.
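To make the "rule, not an LLM judgment" idea concrete, here is a minimal sketch of a hard constraint filter that runs before any model call. It is not the tool's actual code: the `VectorDB` record, the `CANDIDATES` table, the `filter_candidates` function, and the generic candidate names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorDB:
    name: str
    managed_cloud: bool  # True if the option is a vendor-hosted cloud service


# Hypothetical candidate table; names and hosting flags are placeholders.
CANDIDATES = [
    VectorDB("managed-service-a", managed_cloud=True),
    VectorDB("self-hosted-b", managed_cloud=False),
    VectorDB("embedded-c", managed_cloud=False),
]


def filter_candidates(candidates, gdpr_required):
    """Drop managed cloud options when GDPR applies; otherwise keep everything."""
    if not gdpr_required:
        return list(candidates)
    return [db for db in candidates if not db.managed_cloud]


allowed = filter_candidates(CANDIDATES, gdpr_required=True)
print([db.name for db in allowed])
```

Only the surviving candidates would then be passed into the prompt, so the model cannot recommend an option the rule has already excluded.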

Six Modes, One Tool

Architecture Recommendation

The core mode. Answer a structured set of questions about your use case — document types, query patterns, scale, compliance requirements, team capabilities. Get back:

Vector database: one specific choice with rationale
Embedding model: one specific choice
Chunking strategy: one specific approach with parameters
Retrieval method: dense / BM25 / hybrid — one answer
Reranker: whether you need one and which

python main.py audit --interactive
# or from file:
python main.py audit --file examples/usecase_legal_contracts.json --with-cost

Architecture Diagnosis

You already have a RAG system. It's not working. This mode takes your existing architecture and the problems you're seeing, and returns a root-cause analysis per component with severity levels and one specific fix.

Not "improve your chunking" but "switch from fixed 512-token chunks to parent-child hierarchical chunking with 512-token child nodes. Your documents have a multi-clause structure that fixed chunks split mid-clause."
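For readers who have not met parent-child chunking before, the sketch below shows the basic shape: small child chunks are what you embed and search, and each one keeps a pointer to a larger parent span that is handed to the LLM as context. The function name `parent_child_chunks`, the 2048-token parent size, and the whitespace "tokenizer" are assumptions made for this example. A real implementation would use the embedding model's tokenizer and align parent boundaries to clauses or sections rather than fixed offsets.

```python
def parent_child_chunks(text, parent_size=2048, child_size=512):
    """Split text into parent spans, then split each parent into child chunks.

    Whitespace splitting stands in for a real tokenizer (assumption).
    """
    tokens = text.split()
    chunks = []
    for parent_id, p_start in enumerate(range(0, len(tokens), parent_size)):
        parent_tokens = tokens[p_start:p_start + parent_size]
        parent_text = " ".join(parent_tokens)
        for c_start in range(0, len(parent_tokens), child_size):
            child_tokens = parent_tokens[c_start:c_start + child_size]
            chunks.append(
                {
                    "parent_id": parent_id,
                    "child": " ".join(child_tokens),  # embed and retrieve on this
                    "parent": parent_text,            # return this as LLM context
                }
            )
    return chunks


sample = " ".join(str(i) for i in range(5000))
print(len(parent_child_chunks(sample)))
```

Retrieval matches on the precise child text, while generation sees the surrounding parent, which is what keeps a clause from being read without its context.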

python main.py diagnose --file examples/diagnosis_pinecone_fixed.json

Example output:

overall_severity: critical

chunking_strategy — critical
  "Fixed 512-token chunks split mid-clause in long legal documents"
  Fix: Parent-child hierarchical chunking, 512-token child nodes

retrieval_method — high
  "Dense-only misses exact terms like dollar amounts and clause references"
  Fix: Hybrid BM25 + dense with RRF fusion

quick_fix: Enable 10% token overlap today. Takes 20 minutes, reduces
           the worst failures while you implement the full fix.
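The "RRF fusion" in the fix above refers to reciprocal rank fusion, which merges ranked lists by summing 1 / (k + rank) for each document across the lists. A minimal sketch follows; the function name `rrf_fuse` and the chunk IDs are illustrative, and k=60 is a commonly used default rather than a value taken from this tool.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["c3", "c1", "c7"]   # lexical ranking (illustrative IDs)
dense_hits = ["c1", "c5", "c3"]  # embedding ranking (illustrative IDs)
print(rrf_fuse([bm25_hits, dense_hits]))
```

Because RRF uses only ranks, it does not need BM25 and cosine scores to be on the same scale, which is why it is a common first choice for hybrid retrieval.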

Multi-Use-Case Session

Run up to 5 parallel audits in a single request — useful when you're scoping a RAG platform that needs to serve multiple internal teams.

The output includes cross-cutting insights: which components can be shared across use cases, where requirements conflict (the legal team needs GDPR-compliant storage; the sales team wants managed cloud), and which use case to build first for the highest return on the shared infrastructure investment.

Implementation Bundle

Once you have an architecture you trust, generate a complete implementation starter kit:

python main.py bundle <session-id>

Output: a requirements.txt, a docker-compose.yml, an .env.example, and a migration guide, all tailored to the recommended architecture. If you have an existing stack, you get ordered migration steps with rollback notes.

Cost Estimation

Rule-based monthly cost breakdown per component — no LLM call. Lookup tables for vector DB pricing tiers, embedding API costs, reranker inference, and LLM costs at your estimated query volume.

python main.py cost <session-id>

Returns a line-item breakdown, optimization tips (e.g., "switching to a self-hosted embedding model saves ~$800/month at this query volume"), and a hosting model classification (managed vs. self-hosted trade-off at your scale).
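A lookup-table estimator of this kind can be very small. The sketch below is an assumption about the general shape, not the tool's implementation. Every price in `PRICE_TABLE` is a made-up placeholder rather than real vendor pricing, and `estimate_monthly_cost` is a hypothetical name.

```python
# All figures are made-up placeholders, NOT real vendor pricing.
PRICE_TABLE = {
    "vector_db": {"self_hosted": 150.0, "managed": 300.0},  # flat USD per month
    "embedding_per_1k_queries": 0.10,
    "reranker_per_1k_queries": 1.00,
    "llm_per_1k_queries": 5.00,
}


def estimate_monthly_cost(queries_per_month, hosting, use_reranker):
    """Return a line-item monthly cost breakdown computed from lookup tables."""
    thousands = queries_per_month / 1000
    items = {
        "vector_db": PRICE_TABLE["vector_db"][hosting],
        "embedding": thousands * PRICE_TABLE["embedding_per_1k_queries"],
        "reranker": thousands * PRICE_TABLE["reranker_per_1k_queries"] if use_reranker else 0.0,
        "llm": thousands * PRICE_TABLE["llm_per_1k_queries"],
    }
    items["total"] = sum(items.values())
    return items


breakdown = estimate_monthly_cost(100_000, hosting="managed", use_reranker=True)
for component, usd in breakdown.items():
    print(f"{component}: ${usd:,.2f}")
```

Keeping this step free of LLM calls means the same inputs always produce the same numbers, which matters when the estimate ends up in a budget document.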

RAGAS Eval Dataset Generation

Generate evaluation questions grounded in your actual use case and query patterns — not generic retrieval questions.

python main.py eval-dataset <session-id> --num-questions 20

Output includes easy/medium/hard distribution, RAGAS metric mapping (which questions test faithfulness vs. answer relevancy vs. context precision), an annotation guide, and a time estimate for human review.

Session Persistence and Refinement

Every audit persists to SQLite. You can refine against new constraints:

python main.py refine <session-id> --feedback "Qdrant was too heavy for our infra team"

The tool re-runs with the feedback as an additional constraint. Refinement history is tracked — you can see how the recommendation evolved across iterations.
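One way to track refinement history in SQLite is to store each iteration as its own row. This sketch assumes that design; the table name `audits`, its columns, the helper names, and the placeholder recommendation values are all hypothetical and not taken from the project's schema.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the session store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audits (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               session_id TEXT NOT NULL,
               iteration INTEGER NOT NULL,
               feedback TEXT,
               recommendation TEXT NOT NULL
           )"""
    )
    return conn


def save_iteration(conn, session_id, recommendation, feedback=None):
    """Append a new iteration for a session and return its iteration number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(iteration), 0) FROM audits WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    iteration = row[0] + 1
    conn.execute(
        "INSERT INTO audits (session_id, iteration, feedback, recommendation) "
        "VALUES (?, ?, ?, ?)",
        (session_id, iteration, feedback, json.dumps(recommendation)),
    )
    conn.commit()
    return iteration


def history(conn, session_id):
    """Return (iteration, feedback, recommendation) tuples, oldest first."""
    rows = conn.execute(
        "SELECT iteration, feedback, recommendation FROM audits "
        "WHERE session_id = ? ORDER BY iteration",
        (session_id,),
    ).fetchall()
    return [(i, fb, json.loads(rec)) for i, fb, rec in rows]


conn = open_store()
save_iteration(conn, "demo", {"vector_db": "qdrant"})
save_iteration(conn, "demo", {"vector_db": "lighter-alternative"},
               feedback="Qdrant was too heavy for our infra team")
for entry in history(conn, "demo"):
    print(entry)
```

Appending rows instead of overwriting them is what lets you see how a recommendation changed from one iteration to the next.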

A Complete Quickstart

git clone https://github.com/swapnanil/rag-readiness
cd rag-readiness
cp .env.example .env  # add your ANTHROPIC_API_KEY
docker-compose up api

# New architecture audit (interactive)
python main.py audit --interactive

# Diagnose a broken stack
python main.py diagnose --interactive

# Multi-use-case session
python main.py multi-audit examples/multi_usecase_lexvault.json

# List sessions and refine
python main.py sessions
python main.py refine <session-id> --feedback "need self-hosted only"

# Cost breakdown and eval dataset
python main.py cost <session-id>
python main.py eval-dataset <session-id> --num-questions 20

The Pre-Scoring Layer

Before any LLM call, a rule-based pre-scorer computes a complexity score (1–10) from the use case inputs. The pre-scoring layer does two things:

It calibrates the LLM prompt — a complexity-1 use case gets a simpler, more direct recommendation; a complexity-9 use case gets a recommendation with more explicit trade-off reasoning.
It runs conflict detection — if your inputs contain contradictory constraints (e.g., "GDPR compliant" + "use Pinecone"), the conflict is flagged before Claude is called, not discovered in the output.
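The two steps above can be sketched in a few lines. The scoring weights, the input field names (`document_types`, `doc_count`, `gdpr`, `hybrid_search`, `preferred_vector_db`), and the `MANAGED_CLOUD_ONLY` set are assumptions invented for this example. Only the "GDPR compliant" plus "use Pinecone" conflict comes from the description above.

```python
# Assumption: options treated as managed-cloud-only for this sketch.
MANAGED_CLOUD_ONLY = {"pinecone"}


def complexity_score(use_case):
    """Rule-based complexity score clamped to 1..10 (weights are illustrative)."""
    score = 1
    score += min(len(use_case.get("document_types", [])), 3)
    if use_case.get("doc_count", 0) > 1_000_000:
        score += 2
    if use_case.get("gdpr"):
        score += 2
    if use_case.get("hybrid_search"):
        score += 1
    return max(1, min(score, 10))


def detect_conflicts(use_case):
    """Return human-readable conflicts found before any LLM call is made."""
    conflicts = []
    preferred = (use_case.get("preferred_vector_db") or "").lower()
    if use_case.get("gdpr") and preferred in MANAGED_CLOUD_ONLY:
        conflicts.append(
            f"GDPR compliance conflicts with preferred vector DB '{preferred}'"
        )
    return conflicts


use_case = {
    "document_types": ["pdf", "docx"],
    "doc_count": 2_000_000,
    "gdpr": True,
    "hybrid_search": True,
    "preferred_vector_db": "Pinecone",
}
print(complexity_score(use_case), detect_conflicts(use_case))
```

Both functions are plain deterministic code, so a contradiction is reported the same way every time instead of depending on how the model happens to phrase its answer.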

Who This Is For

AI engineers starting a new RAG project who want a structured starting point rather than a blank page
Engineering leads who need to scope a RAG system for a business use case and justify the architecture choices to non-technical stakeholders
Teams with an existing RAG system that isn't performing as expected and need a systematic diagnosis, not a hunch

The tool is open-source, runs locally, and persists everything to SQLite. Your use case details don't leave your environment beyond the single Claude API call per audit.

→ View the full tool page, docs, and GitHub repo