Nasit Sony

Posted on May 29

I Built a Production-Style RAG Backend — Focused on What Happens When Things Break

#distributedsystems #mlops #java #opensource

Most RAG tutorials show you the happy path.

Ingest document → generate embeddings → store in vector DB → search → return results.

It works great in demos. But what happens when:

The worker crashes mid-processing?
Kafka replays messages and you get duplicates?
The database goes down during ingestion?
A malformed document gets stuck in an infinite retry loop?

I built SmartSearch to answer those questions. It is a correctness-first ingestion and retrieval backend designed to handle failures deterministically.

The Problem With Most RAG Systems

Most RAG implementations are optimized for the happy path. They work well when everything goes right, and fail in unpredictable ways when things go wrong.

The result is systems where:

A worker crash leaves jobs in an unknown state
Kafka replays create duplicate embeddings
A bad document retries forever and blocks the queue
Nobody knows why a document isn't searchable

SmartSearch is built to make failures explicit, recoverable, and observable.

Architecture

Client
  ↓
API Service (Spring Boot)
  ↓
Kafka (async decoupling + replay)
  ↓
Worker (consumes, embeds, writes)
  ↓
Postgres + pgvector (embeddings + similarity search)
  ↓
Prometheus + Grafana (observability)

The key design decision: decouple ingestion from processing via Kafka. This gives you replay, retry, and resilience — at the cost of eventual consistency.

The Job Lifecycle State Machine

Every ingestion request has an explicit state:

PENDING → PROCESSING → READY
                     → FAILED

Why this matters:

No hidden progress: you always know exactly where a job is
Failures are visible: FAILED jobs appear in the system pressure dashboard
Recovery is deterministic: on restart, PROCESSING jobs are retried

The lifecycle invariant: state transitions are monotonic. A job never goes backwards from PROCESSING to PENDING. Once FAILED, it stays FAILED unless explicitly retried.
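
The lifecycle above can be sketched as an enum plus a table of allowed transitions. This is a minimal illustration, not SmartSearch's actual code: the names JobState, JobLifecycle, and retryExplicitly are hypothetical, and re-queuing an explicitly retried job as PENDING is an assumption.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical names; the states mirror the lifecycle described in the post.
enum JobState { PENDING, PROCESSING, READY, FAILED }

final class JobLifecycle {
    // Forward-only transitions: nothing ever moves back to an earlier state.
    private static final Map<JobState, Set<JobState>> ALLOWED = Map.of(
        JobState.PENDING, EnumSet.of(JobState.PROCESSING),
        JobState.PROCESSING, EnumSet.of(JobState.READY, JobState.FAILED),
        JobState.READY, EnumSet.noneOf(JobState.class),
        JobState.FAILED, EnumSet.noneOf(JobState.class));

    static boolean canTransition(JobState from, JobState to) {
        return ALLOWED.get(from).contains(to);
    }

    // Applies a transition or rejects it loudly, so illegal moves never go unnoticed.
    static JobState transition(JobState from, JobState to) {
        if (!canTransition(from, to)) {
            throw new IllegalStateException("Illegal transition: " + from + " -> " + to);
        }
        return to;
    }

    // The only way out of FAILED is an explicit retry.
    // Assumption: an explicit retry re-queues the job as PENDING.
    static JobState retryExplicitly(JobState current) {
        if (current != JobState.FAILED) {
            throw new IllegalStateException("Only FAILED jobs can be retried, got " + current);
        }
        return JobState.PENDING;
    }
}
```

Keeping the table in one place makes the invariant checkable: any code path that tries PROCESSING to PENDING fails immediately instead of silently rewinding a job.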

Idempotent Ingestion

Kafka gives you at-least-once delivery when consumers commit offsets only after processing. This means the same message can arrive multiple times: on retry, on replay, or after a broker restart.

SmartSearch handles this via unique constraints:

UNIQUE(doc_id, chunk_id)

If a chunk already exists, the write is a no-op. This means:

Reprocessing the same message is always safe
No duplicate embeddings, ever
Workers can crash and restart without corrupting state

This is the idempotency invariant: reprocessing the same request does not change the final database state.
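
To make the no-op behavior concrete, here is a small in-memory sketch in which a map key stands in for the UNIQUE(doc_id, chunk_id) constraint. ChunkStore and insertIfAbsent are hypothetical names, and the real system writes to Postgres. There, one common way to get the same effect is an INSERT with ON CONFLICT (doc_id, chunk_id) DO NOTHING, though whether SmartSearch uses that exact statement is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the chunks table; hypothetical, for illustration only.
final class ChunkStore {
    // The map key plays the role of UNIQUE(doc_id, chunk_id).
    private final Map<String, float[]> rows = new ConcurrentHashMap<>();

    private static String key(String docId, int chunkId) {
        return docId + "#" + chunkId;
    }

    // Returns true if the chunk was written, false if it already existed.
    // A duplicate delivery therefore leaves the stored row untouched.
    boolean insertIfAbsent(String docId, int chunkId, float[] embedding) {
        return rows.putIfAbsent(key(docId, chunkId), embedding) == null;
    }

    int size() {
        return rows.size();
    }
}
```

Replaying the same message calls insertIfAbsent with the same key, so the second call returns false and the row count stays the same.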

Failure Handling + DLQ

Workers retry failed jobs with bounded attempts. After exhausting retries:

Job is marked FAILED
Message is sent to a Dead Letter Queue (DLQ)
The job stops blocking other work

This prevents poison messages from retrying forever and starving the queue.

The failure isolation invariant: a FAILED job does not corrupt other documents.
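
The retry-then-DLQ flow can be sketched as a bounded loop. This is a simplified illustration under stated assumptions: RetryingWorker is a hypothetical name, a plain list stands in for the DLQ topic, and backoff between attempts is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical worker wrapper; a list stands in for the real DLQ topic.
final class RetryingWorker {
    private final int maxAttempts;
    private final List<String> deadLetters = new ArrayList<>();

    RetryingWorker(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Tries the processor up to maxAttempts times.
    // Returns "READY" on success, or "FAILED" after parking the message in the DLQ.
    String handle(String message, Consumer<String> processor) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                processor.accept(message);
                return "READY";
            } catch (RuntimeException e) {
                // A real worker would log the error and back off before the next attempt.
            }
        }
        deadLetters.add(message); // retries exhausted: stop retrying so other work can proceed
        return "FAILED";
    }

    List<String> deadLetters() {
        return List.copyOf(deadLetters);
    }
}
```

Because the loop is bounded, a message that always throws costs at most maxAttempts tries before the worker moves on to the next message.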

Observability

The system exposes a /api/system/pressure endpoint showing live counts:

{
  "pending": 12,
  "processing": 3,
  "ready": 847,
  "failed": 2
}
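
A response like the one above boils down to counting jobs by state. The sketch below shows that aggregation in plain Java. SystemPressure and countByState are hypothetical names, and in the real service the counts would more likely come from a GROUP BY query against Postgres than from an in-memory list.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper that shapes job states into the pressure payload.
final class SystemPressure {
    static Map<String, Long> countByState(List<String> jobStates) {
        // Pre-fill every known state so empty states report 0 instead of disappearing.
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String state : List.of("pending", "processing", "ready", "failed")) {
            counts.put(state, 0L);
        }
        for (String state : jobStates) {
            counts.merge(state.toLowerCase(Locale.ROOT), 1L, Long::sum);
        }
        return counts;
    }
}
```

Reporting zeros explicitly keeps dashboards and alerts stable, since a missing key and a zero count would otherwise look different to a consumer.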

Prometheus metrics via Spring Boot Actuator:

HTTP request rate and latency
Ingestion pipeline metrics (received, succeeded, failed, retries, DLQ)
Processing age — how long jobs wait before being processed
Database connection pool metrics

Processing age is the metric most people overlook. Latency tells you how fast things are going. Processing age tells you how much work is piling up. A rising processing age is an early warning signal before latency spikes become visible.
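
One possible way to compute processing age is to take the age of the oldest job that is still waiting. The sketch below does that with java.time. ProcessingAge and oldestPendingAge are hypothetical names, and using the oldest waiting job rather than an average or a percentile is an assumption about how the metric is defined.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical helper; "age" here means time since the oldest waiting job was created.
final class ProcessingAge {
    // Returns Duration.ZERO when nothing is waiting.
    static Duration oldestPendingAge(List<Instant> pendingCreatedAt, Instant now) {
        return pendingCreatedAt.stream()
            .min(Instant::compareTo)                        // earliest creation time
            .map(oldest -> Duration.between(oldest, now))   // how long it has been waiting
            .orElse(Duration.ZERO);
    }
}
```

Passing the current time in as a parameter, instead of calling Instant.now() inside, keeps the calculation deterministic and easy to test.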

Failure Matrix

Failure Scenario	Expected Behavior
Worker crash mid-processing	Job retried, no duplicate chunks
Worker crash after DB write	Reprocessing occurs, idempotency holds
Kafka broker restart	Processing resumes, no message loss
Postgres outage	Worker retries, job eventually READY or FAILED
Poison message	Retries exhausted → FAILED + DLQ
Duplicate request	No duplicate embeddings created

All six scenarios were tested and verified to behave as specified.

What I Learned

At-least-once + idempotency is the right default. Exactly-once semantics in Kafka are possible but operationally complex. At-least-once delivery with idempotent writes leaves the database in the same final state, with far less complexity.

The visibility invariant is underrated. A document should be searchable if and only if its state is READY. This simple rule prevents partial visibility and makes the system's behavior predictable under any failure scenario.
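
The rule can be enforced at query time by dropping any candidate whose job is not READY. The sketch below is a plain-Java illustration with hypothetical names (VisibilityFilter, visible). In a Postgres-backed system the same check would more likely be a WHERE clause or a join on the job state, which is an assumption about the implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical filter applied to similarity-search candidates.
final class VisibilityFilter {
    // Keeps only documents whose state is exactly READY; unknown documents are hidden too.
    static List<String> visible(List<String> candidateDocIds, Map<String, String> stateByDocId) {
        return candidateDocIds.stream()
            .filter(id -> "READY".equals(stateByDocId.get(id)))
            .collect(Collectors.toList());
    }
}
```

Result order is preserved, so ranking from the similarity search survives the filter.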

Processing age is the most important metric nobody talks about. Every pipeline should expose how long work sits before being processed. It's the earliest signal of a system falling behind.

Kafka adds complexity but the tradeoffs are worth it. You get replay, retry, and resilience. The operational overhead is real, but for any system where correctness under failure matters, it's the right call.

Try It Yourself

git clone https://github.com/NasitSony/SmartSearch.git
cd SmartSearch
docker compose up -d

# API available at http://localhost:8080
# Grafana at http://localhost:3000
# Prometheus at http://localhost:9090

# Ingest a document
curl -X POST http://localhost:8080/api/documents \
  -H "Content-Type: application/json" \
  -d '{"content": "your document text here"}'

# Search
curl "http://localhost:8080/api/search?q=your+query"

# Check system pressure
curl http://localhost:8080/api/system/pressure

GitHub: https://github.com/NasitSony/SmartSearch

SmartSearch is the data pipeline layer of a larger AI infrastructure stack I've been building. The full stack story is covered in my article: I Built a Complete AI Infrastructure Stack from Scratch.

If you found this useful, a ⭐ on GitHub goes a long way!
