I Built a Production-Style RAG Backend — Focused on What Happens When Things Break
Most RAG tutorials show you the happy path.
Ingest document → generate embeddings → store in vector DB → search → return results.
It works great in demos. But what happens when:
- The worker crashes mid-processing?
- Kafka replays messages and you get duplicates?
- The database goes down during ingestion?
- A malformed document gets stuck in an infinite retry loop?

I built SmartSearch to answer those questions. It is a correctness-first ingestion and retrieval backend designed to handle failures deterministically.
The Problem With Most RAG Systems
Most RAG implementations are optimized for the happy path. They work well when everything goes right, and fail in unpredictable ways when things go wrong.
The result is systems where:
- A worker crash leaves jobs in an unknown state
- Kafka replays create duplicate embeddings
- A bad document retries forever and blocks the queue
- Nobody knows why a document isn't searchable

SmartSearch is built to make failures explicit, recoverable, and observable.
Architecture
Client
↓
API Service (Spring Boot)
↓
Kafka (async decoupling + replay)
↓
Worker (consumes, embeds, writes)
↓
Postgres + pgvector (embeddings + similarity search)
↓
Prometheus + Grafana (observability)
The key design decision: decouple ingestion from processing via Kafka. This gives you replay, retry, and resilience — at the cost of eventual consistency.
The Job Lifecycle State Machine
Every ingestion request has an explicit state:
PENDING → PROCESSING → READY
PROCESSING → FAILED
Why this matters:
- No hidden progress — you always know exactly where a job is
- Failures are visible — FAILED jobs appear in the system pressure dashboard
- Recovery is deterministic: on restart, PROCESSING jobs are retried

The lifecycle invariant: state transitions are monotonic. A job never goes backwards from PROCESSING to PENDING. Once FAILED, it stays FAILED unless explicitly retried.
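The lifecycle invariant can be enforced in code rather than by convention. Here is a minimal sketch of that idea, not the actual SmartSearch implementation: `JobState`, `IngestionJob`, and their methods are hypothetical names, and the explicit retry of a FAILED job is assumed to be a separate operation that is left out.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names for illustration; not taken from the SmartSearch code.
enum JobState {
    PENDING, PROCESSING, READY, FAILED;

    // Forward-only transitions from the lifecycle described in the text.
    Set<JobState> allowedNext() {
        switch (this) {
            case PENDING:
                return EnumSet.of(PROCESSING);
            case PROCESSING:
                return EnumSet.of(READY, FAILED);
            default:
                // Assumption: READY and FAILED are terminal here;
                // an explicit retry of a FAILED job is modeled elsewhere.
                return EnumSet.noneOf(JobState.class);
        }
    }

    boolean canTransitionTo(JobState next) {
        return allowedNext().contains(next);
    }
}

final class IngestionJob {
    private JobState state = JobState.PENDING;

    JobState state() {
        return state;
    }

    // Rejects any move the lifecycle does not allow, e.g. PROCESSING -> PENDING.
    void transitionTo(JobState next) {
        if (!state.canTransitionTo(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not allowed");
        }
        state = next;
    }
}
```

With a guard like this, a bug that tries to move a job backwards fails loudly at the transition instead of leaving the job in an ambiguous state.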
Idempotent Ingestion
With consumers that commit offsets only after processing, Kafka gives you at-least-once delivery. This means the same message can arrive multiple times: on retry, on replay, or after a broker restart.
SmartSearch handles this via unique constraints:
UNIQUE(doc_id, chunk_id)
If a chunk already exists, the write is a no-op. This means:
- Reprocessing the same message is always safe
- No duplicate embeddings, ever
- Workers can crash and restart without corrupting state

This is the idempotency invariant: reprocessing the same request does not change the final database state.
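To make the no-op write concrete, here is a small in-memory sketch of the same idea. It is an illustration under assumptions, not the SmartSearch code: `ChunkStore` and `insertIfAbsent` are hypothetical names, a map stands in for the table, and the SQL in the comment uses assumed table and column names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for a table with UNIQUE(doc_id, chunk_id).
// ChunkStore and insertIfAbsent are hypothetical names for illustration.
final class ChunkStore {
    // Against Postgres, the same effect is commonly written as
    // (table and column names are assumptions):
    //   INSERT INTO chunks (doc_id, chunk_id, embedding) VALUES (?, ?, ?)
    //   ON CONFLICT (doc_id, chunk_id) DO NOTHING
    private final Map<Map.Entry<String, Integer>, float[]> rows = new ConcurrentHashMap<>();

    // Returns true if the chunk was written, false if it already existed (no-op).
    boolean insertIfAbsent(String docId, int chunkId, float[] embedding) {
        return rows.putIfAbsent(Map.entry(docId, chunkId), embedding) == null;
    }

    int size() {
        return rows.size();
    }
}
```

Calling `insertIfAbsent` twice with the same `(docId, chunkId)` leaves one row in place, which is the behavior a replayed Kafka message needs.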
Failure Handling + DLQ
Workers retry failed jobs with bounded attempts. After exhausting retries:
- Job is marked FAILED
- Message is sent to a Dead Letter Queue (DLQ)
- The job stops blocking other work

This prevents poison messages from retrying forever and starving the queue.
The failure isolation invariant: a FAILED job does not corrupt other documents.
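The bounded-retry flow above can be sketched in a few lines. This is a simplified illustration, not the SmartSearch worker: `RetryingWorker`, `maxAttempts`, and `deadLetters` are hypothetical names, an in-memory list stands in for the DLQ topic, and backoff between attempts is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: an in-memory list stands in for the DLQ topic.
final class RetryingWorker {
    enum Outcome { READY, FAILED }

    private final int maxAttempts;
    private final List<String> deadLetters = new ArrayList<>();

    RetryingWorker(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    // Tries the handler up to maxAttempts times. On exhaustion the message
    // is parked in the DLQ and the job is reported as FAILED.
    Outcome process(String message, Consumer<String> handler) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return Outcome.READY;
            } catch (RuntimeException e) {
                // A real worker would likely log and back off before retrying.
            }
        }
        deadLetters.add(message);
        return Outcome.FAILED;
    }

    // Returns a copy so callers cannot modify the worker's DLQ.
    List<String> deadLetters() {
        return new ArrayList<>(deadLetters);
    }
}
```

Because the loop always terminates, a poison message costs at most `maxAttempts` handler calls before the worker moves on to the next message.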
Observability
The system exposes a /api/system/pressure endpoint showing live counts:
{
"pending": 12,
"processing": 3,
"ready": 847,
"failed": 2
}
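Counts like these boil down to grouping jobs by state. A minimal sketch of that aggregation, assuming a hypothetical `PressureSnapshot` helper and an in-memory list of states (a real service would more likely run a grouped count query against the jobs table):

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are for illustration only.
final class PressureSnapshot {
    enum State { PENDING, PROCESSING, READY, FAILED }

    // Counts jobs per state. States with no jobs are reported as 0
    // so the response always has the same four keys.
    static Map<State, Long> count(List<State> jobStates) {
        Map<State, Long> counts = new EnumMap<>(State.class);
        for (State s : State.values()) {
            counts.put(s, 0L);
        }
        for (State s : jobStates) {
            counts.merge(s, 1L, Long::sum);
        }
        return counts;
    }
}
```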
Prometheus metrics via Spring Boot Actuator:
- HTTP request rate and latency
- Ingestion pipeline metrics (received, succeeded, failed, retries, DLQ)
- Processing age — how long jobs wait before being processed
- Database connection pool metrics

Processing age is the metric most people overlook. Latency tells you how fast things are going. Processing age tells you how much work is piling up. A rising processing age is an early warning signal before latency spikes become visible.
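One way to compute processing age is to report the age of the oldest job that is still waiting. The sketch below assumes that definition; `ProcessingAge` and `oldestWaiting` are hypothetical names, and the enqueue timestamps are assumed to come from the jobs table.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical helper; assumes "processing age" means the age of the
// oldest job that has not been picked up yet.
final class ProcessingAge {
    // Returns the age of the oldest waiting job, or zero when nothing is waiting.
    static Duration oldestWaiting(List<Instant> enqueuedAt, Instant now) {
        Duration oldest = Duration.ZERO;
        for (Instant t : enqueuedAt) {
            Duration age = Duration.between(t, now);
            if (age.compareTo(oldest) > 0) {
                oldest = age;
            }
        }
        return oldest;
    }
}
```

Passing `now` in as a parameter keeps the calculation deterministic and easy to test. Exporting the result as a gauge would let a dashboard show the backlog growing while request latency still looks normal.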
Failure Matrix
| Failure Scenario | Expected Behavior |
|---|---|
| Worker crash mid-processing | Job retried, no duplicate chunks |
| Worker crash after DB write | Reprocessing occurs, idempotency holds |
| Kafka broker restart | Processing resumes, no message loss |
| Postgres outage | Worker retries, job eventually READY or FAILED |
| Poison message | Retries exhausted → FAILED + DLQ |
| Duplicate request | No duplicate embeddings created |
All six scenarios were tested and verified to behave as specified.
What I Learned
At-least-once + idempotency is the right default. Exactly-once semantics in Kafka are possible but operationally complex. At-least-once delivery with idempotent writes leaves the database in the same final state, with far less complexity.
The visibility invariant is underrated. A document should be searchable if and only if its state is READY. This simple rule prevents partial visibility and makes the system's behavior predictable under any failure scenario.
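In code, the visibility rule is a single filter on the search path. A minimal sketch, assuming hypothetical `VisibleDocs` and `Doc` types and an in-memory list (in the real system this would more likely be a condition in the search query):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical types for illustration only.
final class VisibleDocs {
    enum State { PENDING, PROCESSING, READY, FAILED }

    static final class Doc {
        final String id;
        final State state;

        Doc(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    // A document is searchable if and only if its state is READY.
    static List<String> searchable(List<Doc> docs) {
        return docs.stream()
                .filter(d -> d.state == State.READY)
                .map(d -> d.id)
                .collect(Collectors.toList());
    }
}
```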
Processing age is the most important metric nobody talks about. Every pipeline should expose how long work sits before being processed. It's the earliest signal of a system falling behind.
Kafka adds complexity but the tradeoffs are worth it. You get replay, retry, and resilience. The operational overhead is real, but for any system where correctness under failure matters, it's the right call.
Try It Yourself
git clone https://github.com/NasitSony/SmartSearch.git
cd SmartSearch
docker compose up -d
# API available at http://localhost:8080
# Grafana at http://localhost:3000
# Prometheus at http://localhost:9090
# Ingest a document
curl -X POST http://localhost:8080/api/documents \
-H "Content-Type: application/json" \
-d '{"content": "your document text here"}'
# Search
curl "http://localhost:8080/api/search?q=your+query"
# Check system pressure
curl http://localhost:8080/api/system/pressure
GitHub: https://github.com/NasitSony/SmartSearch
SmartSearch is the data pipeline layer of a larger AI infrastructure stack I've been building. The full stack story is covered in my article: I Built a Complete AI Infrastructure Stack from Scratch.
If you found this useful, a ⭐ on GitHub goes a long way!