Nasit Sony

I Built a Production-Style RAG Backend — Focused on What Happens When Things Break

Most RAG tutorials show you the happy path.

Ingest document → generate embeddings → store in vector DB → search → return results.

It works great in demos. But what happens when:

  • The worker crashes mid-processing?
  • Kafka replays messages and you get duplicates?
  • The database goes down during ingestion?
  • A malformed document gets stuck in an infinite retry loop?

I built SmartSearch to answer those questions: a correctness-first ingestion and retrieval backend designed to handle failures deterministically.

The Problem With Most RAG Systems

Most RAG implementations are optimized for the happy path. They work well when everything goes right, and fail in unpredictable ways when things go wrong.

The result is systems where:

  • A worker crash leaves jobs in an unknown state
  • Kafka replays create duplicate embeddings
  • A bad document retries forever and blocks the queue
  • Nobody knows why a document isn't searchable

SmartSearch is built to make failures explicit, recoverable, and observable.

Architecture

Client
  ↓
API Service (Spring Boot)
  ↓
Kafka (async decoupling + replay)
  ↓
Worker (consumes, embeds, writes)
  ↓
Postgres + pgvector (embeddings + similarity search)
  ↓
Prometheus + Grafana (observability)

The key design decision: decouple ingestion from processing via Kafka. This gives you replay, retry, and resilience — at the cost of eventual consistency.


The Job Lifecycle State Machine

Every ingestion request has an explicit state:

PENDING → PROCESSING → READY
                     → FAILED

Why this matters:

  • No hidden progress: you always know exactly where a job is
  • Failures are visible: FAILED jobs appear in the system pressure dashboard
  • Recovery is deterministic: on restart, PROCESSING jobs are retried

The lifecycle invariant: state transitions are monotonic. A job never goes backwards from PROCESSING to PENDING. Once FAILED, it stays FAILED unless explicitly retried.
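
The lifecycle rules can be sketched as a small enum. This is only an illustration of the invariant, not the SmartSearch implementation: the names JobState, canTransitionTo and transitionTo are hypothetical, and an explicit retry of a FAILED job is assumed to be a separate operation that is not modeled here.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names: this enum is illustrative, not taken from the SmartSearch code.
enum JobState {
    PENDING, PROCESSING, READY, FAILED;

    // Forward transitions only: PENDING -> PROCESSING -> READY or FAILED.
    Set<JobState> next() {
        switch (this) {
            case PENDING:
                return EnumSet.of(PROCESSING);
            case PROCESSING:
                return EnumSet.of(READY, FAILED);
            default:
                // READY and FAILED are terminal in this sketch.
                return EnumSet.noneOf(JobState.class);
        }
    }

    boolean canTransitionTo(JobState target) {
        return next().contains(target);
    }

    // Returns the new state, or throws if the move would break monotonicity.
    JobState transitionTo(JobState target) {
        if (!canTransitionTo(target)) {
            throw new IllegalStateException(this + " -> " + target + " is not allowed");
        }
        return target;
    }
}
```

With this shape, a job found in PROCESSING after a restart is simply processed again in place. It never has to move back to PENDING.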

Idempotent Ingestion

A Kafka consumer that commits offsets only after processing gets at-least-once delivery. This means the same message can arrive multiple times: on retry, on replay, or after a worker or broker restart.

SmartSearch handles this via unique constraints:

UNIQUE(doc_id, chunk_id)

If a chunk with the same (doc_id, chunk_id) already exists, the write is a no-op. This means:

  • Reprocessing the same message is always safe
  • No duplicate embeddings, ever
  • Workers can crash and restart without corrupting state

This is the idempotency invariant: reprocessing the same request does not change the final database state.
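
A minimal in-memory sketch of the idea, assuming a map stands in for the table with the unique constraint. ChunkStore and insertIfAbsent are hypothetical names, and the SQL in the comment uses assumed table and column names. It shows one way such a write could look in Postgres, not necessarily how SmartSearch does it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for a table with UNIQUE(doc_id, chunk_id).
// ChunkStore and its methods are hypothetical, for illustration only.
class ChunkStore {
    // In Postgres, an equivalent idempotent write could be (names assumed):
    //   INSERT INTO chunks (doc_id, chunk_id, embedding) VALUES (?, ?, ?)
    //   ON CONFLICT (doc_id, chunk_id) DO NOTHING;
    private final Map<String, float[]> rows = new ConcurrentHashMap<>();

    // Returns true if the row was inserted, false if it already existed (no-op).
    boolean insertIfAbsent(String docId, int chunkId, float[] embedding) {
        String key = docId + "#" + chunkId;
        return rows.putIfAbsent(key, embedding) == null;
    }

    int size() {
        return rows.size();
    }
}
```

Replaying the same message calls insertIfAbsent with the same key, so the second call changes nothing and the row count stays the same.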

Failure Handling + DLQ

Workers retry failed jobs with bounded attempts. After exhausting retries:

  1. Job is marked FAILED
  2. Message is sent to a Dead Letter Queue (DLQ)
  3. The job stops blocking other work

This prevents poison messages from retrying forever and starving the queue.

The failure isolation invariant: a FAILED job does not corrupt other documents.
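
The retry-then-park flow might look roughly like the sketch below. Everything here is an assumption made for illustration: the class name RetryingWorker, the limit of 3 attempts, and the in-memory list standing in for a DLQ topic. A real worker would also back off between attempts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: the class name, MAX_ATTEMPTS = 3 and the in-memory DLQ are assumptions.
class RetryingWorker {
    static final int MAX_ATTEMPTS = 3;

    final List<String> deadLetters = new ArrayList<>(); // stands in for a DLQ topic
    int attemptsMade = 0;

    // Returns "READY" if processing succeeds within the attempt budget, otherwise "FAILED".
    String handle(String message, Consumer<String> processor) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            attemptsMade++;
            try {
                processor.accept(message);
                return "READY";
            } catch (RuntimeException e) {
                // A real worker would log the error and back off before retrying.
            }
        }
        // Retries exhausted: park the message so it stops blocking other work.
        deadLetters.add(message);
        return "FAILED";
    }
}
```

Because the loop is bounded, a poison message costs at most MAX_ATTEMPTS tries before the worker moves on to the next message.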


Observability

The system exposes a /api/system/pressure endpoint showing live counts:

{
  "pending": 12,
  "processing": 3,
  "ready": 847,
  "failed": 2
}

Prometheus metrics via Spring Boot Actuator:

  • HTTP request rate and latency
  • Ingestion pipeline metrics (received, succeeded, failed, retries, DLQ)
  • Processing age — how long jobs wait before being processed
  • Database connection pool metrics

Processing age is the metric most people overlook. Latency tells you how fast things are going. Processing age tells you how much work is piling up. A rising processing age is an early warning signal before latency spikes become visible.
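
As a rough sketch, both the pressure counts and processing age can be derived from job state and creation time. The names IngestJob and Pressure are hypothetical, and defining processing age as the wait time of the oldest PENDING job is an assumption of this example. The real metric may be computed differently.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical job record: field names are illustrative, not the SmartSearch schema.
class IngestJob {
    final String state;       // "PENDING", "PROCESSING", "READY" or "FAILED"
    final Instant createdAt;

    IngestJob(String state, Instant createdAt) {
        this.state = state;
        this.createdAt = createdAt;
    }
}

class Pressure {
    // Counts jobs per state, mirroring the shape of the pressure payload.
    static Map<String, Long> counts(List<IngestJob> jobs) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String s : new String[] {"pending", "processing", "ready", "failed"}) {
            out.put(s, 0L);
        }
        for (IngestJob j : jobs) {
            out.merge(j.state.toLowerCase(Locale.ROOT), 1L, Long::sum);
        }
        return out;
    }

    // Assumed definition: how long the oldest PENDING job has waited (zero if none).
    static Duration processingAge(List<IngestJob> jobs, Instant now) {
        Duration oldest = Duration.ZERO;
        for (IngestJob j : jobs) {
            if (j.state.equals("PENDING")) {
                Duration age = Duration.between(j.createdAt, now);
                if (age.compareTo(oldest) > 0) {
                    oldest = age;
                }
            }
        }
        return oldest;
    }
}
```

Passing the clock in as a parameter keeps the age calculation deterministic, which makes it easy to test.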

Failure Matrix

| Failure Scenario            | Expected Behavior                              |
|-----------------------------|------------------------------------------------|
| Worker crash mid-processing | Job retried, no duplicate chunks               |
| Worker crash after DB write | Reprocessing occurs, idempotency holds         |
| Kafka broker restart        | Processing resumes, no message loss            |
| Postgres outage             | Worker retries, job eventually READY or FAILED |
| Poison message              | Retries exhausted → FAILED + DLQ               |
| Duplicate request           | No duplicate embeddings created                |

Each scenario in the matrix was tested and verified to behave as specified.


What I Learned

At-least-once + idempotency is the right default. Exactly-once semantics in Kafka are possible but operationally complex. At-least-once delivery with idempotent writes gives you the same end state in the database (each chunk stored exactly once) with far less complexity.

The visibility invariant is underrated. A document should be searchable if and only if its state is READY. This simple rule prevents partial visibility and makes the system's behavior predictable under any failure scenario.
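
One way to picture the visibility rule is as a filter on the search path. In this sketch a substring match stands in for vector similarity, and IndexedDoc and VisibleSearch are hypothetical names. In a Postgres-backed system the same rule would more likely be a condition on the job state in the query itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical document row, for illustration only.
class IndexedDoc {
    final String id;
    final String state;   // "PENDING", "PROCESSING", "READY" or "FAILED"
    final String text;

    IndexedDoc(String id, String state, String text) {
        this.id = id;
        this.state = state;
        this.text = text;
    }
}

class VisibleSearch {
    // Returns ids of matching documents, but only those whose state is READY.
    // The substring match is a stand-in for a similarity search.
    static List<String> search(List<IndexedDoc> docs, String term) {
        List<String> hits = new ArrayList<>();
        for (IndexedDoc d : docs) {
            if (d.state.equals("READY") && d.text.contains(term)) {
                hits.add(d.id);
            }
        }
        return hits;
    }
}
```

A document that is still PROCESSING or has FAILED never shows up in results, even if some of its chunks were already written.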

Processing age is the most important metric nobody talks about. Every pipeline should expose how long work sits before being processed. It's the earliest signal of a system falling behind.

Kafka adds complexity but the tradeoffs are worth it. You get replay, retry, and resilience. The operational overhead is real, but for any system where correctness under failure matters, it's the right call.


Try It Yourself

git clone https://github.com/NasitSony/SmartSearch.git
cd SmartSearch
docker compose up -d

# API available at http://localhost:8080
# Grafana at http://localhost:3000
# Prometheus at http://localhost:9090

Once the containers are up, try the API:

# Ingest a document
curl -X POST http://localhost:8080/api/documents \
  -H "Content-Type: application/json" \
  -d '{"content": "your document text here"}'

# Search
curl "http://localhost:8080/api/search?q=your+query"

# Check system pressure
curl http://localhost:8080/api/system/pressure

GitHub: https://github.com/NasitSony/SmartSearch


SmartSearch is the data pipeline layer of a larger AI infrastructure stack I've been building. The full stack story is covered in my article: I Built a Complete AI Infrastructure Stack from Scratch.

If you found this useful, a ⭐ on GitHub goes a long way!
