Cayman Roden
Building a Production RAG Pipeline That Actually Works: Lessons from DocExtract

The Architecture (and Why It's 3 Services, Not 1)

DocExtract is split into three services: an API, a worker, and a frontend.

User uploads PDF
    → API validates and enqueues job (ARQ/Redis)
    → Worker picks up job asynchronously
        → chunk + embed → pgvector store
        → BM25 index built in memory on retrieval
    → API streams SSE progress to frontend
    → User queries with natural language
        → hybrid retrieval → Claude generates answer with citations

Why not one FastAPI service? Because document processing is slow (2-8 seconds per page), and you don't want your API workers blocked. The ARQ queue decouples upload from processing, which lets you scale workers independently and gives you a natural retry boundary.

The async split also means you can add real-time progress streaming (SSE) to the frontend without any threading complexity - the worker updates job state in Redis, the API polls it, and the frontend gets a 12-step progress bar that actually reflects what's happening.
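The progress events themselves are just SSE text frames over HTTP. As a rough sketch of the wire format (the step names below are made up for illustration - the post only says there are 12 steps):

```python
import json

# Hypothetical 12-step pipeline stages; the real step names aren't in the post.
STEPS = ["upload", "validate", "parse", "chunk", "embed", "store", "index",
         "ocr", "classify", "extract", "verify", "complete"]

def format_sse_event(job_id: str, step: str) -> str:
    """Render one job-progress update in Server-Sent Events wire format.

    SSE frames are plain text: an optional `event:` line, a `data:` line,
    and a blank line terminating the frame.
    """
    index = STEPS.index(step) + 1
    payload = {"job_id": job_id, "step": step, "progress": f"{index}/{len(STEPS)}"}
    return f"event: progress\ndata: {json.dumps(payload)}\n\n"
```

The frontend's EventSource handler only has to parse the `data:` JSON to drive the progress bar.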

The full system has 1,060 tests with 90%+ coverage across the extraction pipeline, retrieval paths, agent evaluator, guardrails, and infrastructure layer.


Why Pure Vector Search Fails (and What to Do About It)

Vector search is great at semantic similarity. "mortgage prequalification" and "home loan eligibility" have similar embeddings. But "Section 3.2(b)" - the exact contract clause a user is looking for - doesn't.

BM25 catches what embeddings miss. Exact product codes, invoice numbers, legal citations, acronyms - these score high on BM25 and often near-zero on cosine similarity.

The solution is Reciprocal Rank Fusion (RRF). You run both retrievers, rank the results independently, then combine the rank positions:

def rrf_score(vector_rank: int, rid: str) -> float:
    # bm25_ranks and record_ids come from the enclosing retrieval scope;
    # a record missing from the BM25 ranking gets the worst-case rank.
    bm25_rank = bm25_ranks.get(rid, len(record_ids))
    return 1 / (60 + bm25_rank) + 1 / (60 + vector_rank)

The constant 60 is the standard RRF smoothing constant. It dampens the contribution of top ranks so that one retriever's #1 result can't dominate the fused score, and it keeps scores stable when one system returns nothing for a query. A document ranked #1 by vector and #3 by BM25 scores higher than one ranked #1 by vector alone.
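To see that behavior concretely, here's a tiny self-contained fusion over two full rankings (the `rrf_fuse` name and list-based API are illustrative, not DocExtract's actual interface):

```python
def rrf_fuse(vector_ranking, bm25_ranking, k=60):
    """Fuse two rankings (lists of doc ids, best first) with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in (vector_ranking, bm25_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1/(k + rank); missing docs contribute nothing.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With vector ranking `["a", "b", "c"]` and BM25 ranking `["c", "a", "d"]`, doc `a` (#1 in one list, #2 in the other) tops the fused ranking, ahead of `c` (#3 and #1).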

The API exposes this as a mode parameter:

GET /api/v1/records/search?q=Section+3.2&mode=hybrid

mode=vector (default), mode=bm25, or mode=hybrid. The BM25 index is built in memory at query time from the retrieved vector candidates. No separate BM25 service, no sync complexity. It adds maybe 20ms.
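For reference, BM25 is small enough to build inline over a handful of candidates. Here's a from-scratch sketch of the scoring (DocExtract may well use a library instead; k1 and b below are the usual Okapi defaults):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each doc against the query with BM25, built entirely in memory.

    Mirrors the pattern in the post: index the (small) set of vector
    candidates at query time, so there is no separate BM25 service to sync.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency per term, counting each doc at most once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            )
        scores.append(score)
    return scores
```

Exact tokens like "3.2" score high here and near-zero on cosine similarity, which is the whole point of the hybrid mode.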

Result: 92.6% accuracy on a 16-fixture golden evaluation suite. Pure vector was at 86.9% on the same fixtures.


From Static RAG to Agentic RAG

Static RAG forces you to pick one retrieval mode per deployment. The real problem is that retrieval quality varies by query type, and you don't know at deploy time what queries you'll get.

The solution is a ReAct (Reasoning + Acting) agent that picks the right retrieval approach per-query. Each query goes through Think→Act→Observe cycles, choosing from five tools: search_vectors, search_bm25, search_hybrid, lookup_metadata, and rerank_results.

The agent is confidence-gated at 0.8 - if it reaches that threshold, it stops iterating. Max 3 iterations caps cost.

iteration = 0  # must be initialized before the loop
while iteration < max_iterations:
    thought = await agent.think(query, context)
    tool_name, tool_args = await agent.act(thought)
    observation = await tools[tool_name](**tool_args)
    context.append(observation)
    if agent.confidence >= confidence_threshold:
        break
    iteration += 1

The key insight: the agent does what a senior engineer does mentally. For "Section 3.2(b)" use BM25 - it's an exact citation. For "documents about loan eligibility" use vector - it's a concept. For ambiguous queries, use hybrid. The difference is the agent makes this decision per-query, not per-deployment.
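Stripped of the LLM, the routing decision looks roughly like this (a pure-heuristic stand-in for illustration - the real agent reasons its way to the same choice rather than matching regexes):

```python
import re

# Crude patterns for "exact-lookup" queries: clause references, product
# codes, invoice numbers. Illustrative only.
EXACT_PATTERN = re.compile(
    r"(section\s+\d+(\.\d+)*(\([a-z]\))?)|(\b[A-Z]{2,}-\d+\b)|(invoice\s*#?\s*\d+)",
    re.IGNORECASE,
)

def pick_retrieval_mode(query: str) -> str:
    has_exact = bool(EXACT_PATTERN.search(query))
    has_concept = len(query.split()) > 4  # crude proxy for a semantic query
    if has_exact and has_concept:
        return "hybrid"
    if has_exact:
        return "bm25"
    return "vector"
```

The agent's advantage over a heuristic like this is that it generalizes to query shapes nobody anticipated at deploy time.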

At 2-3x the latency of a single retrieval call, this isn't free. But if your users ask both structured queries (exact IDs, clause references) and semantic queries (concepts, summaries), a per-query agent consistently outperforms any static retrieval mode.


The Golden Eval CI Gate

This is the piece most RAG pipelines skip, and it's the most important one.

Without a regression gate, you can accidentally degrade retrieval quality during a refactor and not notice until a user complains. With a gate, a PR that drops accuracy by more than 2% gets blocked automatically.

The setup:

  1. 16 document fixtures (contracts, invoices, reports) with expected extraction output in a JSON file
  2. A pytest test that runs the full pipeline end-to-end against those fixtures
  3. A pass threshold of 92.6% (the current baseline); anything below 90.6% blocks the merge
ACCURACY_THRESHOLD = 0.906  # 92.6% baseline minus the 2% regression budget

def test_golden_eval_accuracy():
    results = run_eval_suite(fixtures=GOLDEN_FIXTURES)
    accuracy = results["accuracy"]
    assert accuracy >= ACCURACY_THRESHOLD, (
        f"Golden eval failed: {accuracy:.1%} < {ACCURACY_THRESHOLD:.1%} "
        f"({results['passed']}/{results['total']} passed)"
    )

The fixtures are real documents with personally identifying information removed. The expected outputs cover edge cases: multi-column tables, handwritten fields (which fail gracefully), and mixed-language documents.

This runs in CI on every PR. It catches prompt regressions, embedding model changes, and chunking strategy changes before they ship.


Evaluating the Agent, Not Just the Output

The golden eval gate measures extraction accuracy. For agentic retrieval, you also need to measure whether the agent's behavior was sensible - did it pick the right tools? Did it iterate efficiently?

RAGAS pipeline with three weighted metrics:

  • context_recall - weight 0.35
  • faithfulness - weight 0.40
  • answer_relevancy - weight 0.25

Faithfulness carries the highest weight because hallucination is the worst failure mode for a document extraction API. A retrieved context that gets misrepresented in the answer is more dangerous than a missed chunk.
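The aggregation itself is just a weighted average; a minimal sketch (the function name is assumed, the weights are from the post):

```python
# Weights from the RAGAS configuration described above.
WEIGHTS = {"context_recall": 0.35, "faithfulness": 0.40, "answer_relevancy": 0.25}

def composite_score(metrics: dict[str, float]) -> float:
    """Weighted average of the three RAGAS metrics (each in [0, 1])."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)
```

Because faithfulness carries 0.40 of the weight, dropping it from 1.0 to 0.5 costs the composite score twice as much as the same drop in answer relevancy.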

LLM-as-judge scores outputs against structured rubrics with few-shot examples. It extracts the evidence for each scoring decision - you get an auditable trace, not just a number.

Agent evaluation adds three dimensions:

  • Tool selection quality: Jaccard similarity against expected tool sequences when ground truth is known, redundancy penalty otherwise.
  • Iteration efficiency: linear decay from 1.0 at 1 iteration to 0.5 at the max iteration count.
  • Confidence calibration: trajectory trend and word-overlap with ground truth across iterations.
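The post doesn't give exact formulas beyond those descriptions, but the first two plausibly reduce to something like this (names and details are my reconstruction, not DocExtract's code):

```python
def tool_selection_score(used: list[str], expected: list[str]) -> float:
    """Jaccard similarity between the tool sets actually used and expected."""
    a, b = set(used), set(expected)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def iteration_efficiency(iterations: int, max_iterations: int) -> float:
    """Linear decay from 1.0 at one iteration to 0.5 at the iteration cap."""
    if max_iterations <= 1:
        return 1.0
    frac = (iterations - 1) / (max_iterations - 1)
    return 1.0 - 0.5 * frac
```

So an agent that answers "Section 3.2(b)" with a single search_bm25 call scores 1.0 on both dimensions; one that burns all three iterations on redundant tools scores 0.5 on efficiency.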

Both RAGAS and agent evaluation are feature-flagged (RAGAS_ENABLED, LLM_JUDGE_ENABLED) to avoid CI cost. The golden eval is the mandatory gate. These are the diagnostic layer - run them in staging, not on every PR.


Circuit Breakers for LLM Calls

LLM APIs fail. Rate limits, transient 5xx errors, model deprecations - your pipeline will experience all of them. A circuit breaker turns cascading failures into graceful degradation.

class CircuitState(Enum):
    CLOSED = "closed"       # Healthy - calls pass through
    OPEN = "open"           # Failing - calls rejected immediately
    HALF_OPEN = "half_open" # Recovering - one probe call allowed

class AsyncCircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
        half_open_max_calls: int = 1,
    ) -> None:
        ...

The state machine: after 5 consecutive failures, the circuit opens. All calls fail immediately (no network round-trip). After 60 seconds, one probe call is allowed. If it succeeds, back to CLOSED. If it fails, back to OPEN.
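A minimal synchronous sketch of those transitions (the real AsyncCircuitBreaker is async and lock-protected; the injectable clock here exists only to make the timeout testable):

```python
import time

class CircuitBreakerSketch:
    """CLOSED -> OPEN after N consecutive failures; OPEN -> HALF_OPEN after a
    timeout; HALF_OPEN -> CLOSED on probe success, back to OPEN on failure."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_call(self) -> bool:
        if self.state == "open" and self.clock() - self.opened_at >= self.recovery_timeout:
            self.state = "half_open"  # one probe call allowed
        return self.state != "open"

    def record_success(self):
        self.state, self.failures = "closed", 0

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "open", self.clock()
```

The key property: while OPEN, `allow_call` returns False without any network round-trip, so a dead upstream costs you nothing per request.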

The non-obvious design choice is inverting the fallback chain by task type:

  • Extraction (quality matters): Sonnet → Haiku fallback. Sonnet is more accurate; fall back to Haiku only under failure.
  • Classification (cost matters): Haiku → Sonnet fallback. Haiku is cheaper and fast enough; escalate to Sonnet only under failure.

Running classification on Haiku by default also reduces baseline cost - a side benefit of designing the fallback chain per task type rather than globally.

One more thing: distinguish transient errors from permanent ones. A 429 (rate limit) or 503 (overloaded) should trigger the circuit. A 400 (bad request) is your bug and should never trigger a fallback.

import anthropic

def _is_transient(exc: Exception) -> bool:
    """Only trigger circuit on errors that might resolve themselves."""
    if isinstance(exc, anthropic.RateLimitError):
        return True
    if isinstance(exc, anthropic.APIStatusError) and exc.status_code >= 500:
        return True
    return False

Observability That Actually Tells You Something

Three layers:

1. OpenTelemetry + Prometheus

Every LLM call emits four metrics:

  • llm_call_duration_ms - histogram, tagged by model and operation
  • llm_calls_total - counter, tagged by status (success/failure)
  • llm_tokens_total - counter, split by input/output
  • circuit_breaker_state - gauge (0=CLOSED, 1=HALF_OPEN, 2=OPEN)

The circuit breaker gauge is the critical one. If it flips to 2 at 2am, you want to know about it before your users do.

2. Grafana Dashboard

Pre-built dashboard with:

  • LLM latency p50/p95/p99 by model (spot the Sonnet vs Haiku difference immediately)
  • Calls/sec by status (surface rate limit bursts)
  • Circuit breaker state gauge (red when open, green when closed)
  • Token consumption rate over time (cost forecasting)

The whole observability stack runs locally with one command:

docker compose -f docker-compose.yml -f docker-compose.observability.yml up

That brings up Jaeger (distributed traces), Prometheus (metrics), and Grafana (pre-configured dashboard at localhost:3000).

3. LangSmith Tracing

For the retrieval path specifically, LangSmith gives you per-query traces: what was retrieved, what the final prompt looked like, what tokens were consumed. When the golden eval catches a regression, LangSmith shows you which document type is failing and why.

What I'd do differently: Ship observability on day one, not as an afterthought. When something breaks in production and you have no metrics, you're debugging blind.


Cost as a First-Class Metric

Token costs compound fast with agentic workloads. Each ReAct iteration is an LLM call. 3 iterations times 16 parallel workers equals 48 LLM calls per batch. At scale, that adds up in ways that surprise you if you're not tracking it.

CostTracker computes USD cost per request using Decimal arithmetic against a model pricing table. This matters: float arithmetic accumulates rounding errors across thousands of requests. Decimal doesn't.
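A sketch of that computation (the per-million-token prices below are illustrative placeholders - check current model pricing; the function name is assumed):

```python
from decimal import Decimal

# Illustrative prices in USD per million tokens - NOT authoritative.
PRICING = {
    "claude-sonnet": {"input": Decimal("3.00"), "output": Decimal("15.00")},
    "claude-haiku": {"input": Decimal("0.25"), "output": Decimal("1.25")},
}
MILLION = Decimal(1_000_000)

def call_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Exact USD cost of one call; Decimal avoids float rounding drift."""
    p = PRICING[model]
    return (p["input"] * input_tokens + p["output"] * output_tokens) / MILLION
```

Summing a thousand of these Decimal costs gives an exact total; summing the float equivalents would already show drift in the last digits.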

Model A/B testing: ModelABTest uses SHA-256 hashing of (user_id, experiment_id) for deterministic variant assignment. The same user always gets the same model - no session contamination from random assignment. Statistical significance is checked via two-sample z-test at n≥30 before drawing conclusions.
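The assignment itself is a few lines (the `:` separator and variant names here are assumptions, not from the repo):

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("sonnet", "haiku")) -> str:
    """Deterministic variant assignment: same (user, experiment) -> same model."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
    # Interpret the hash as an integer and bucket it across the variants.
    return variants[int(digest, 16) % len(variants)]
```

Because the hash is a pure function of the inputs, no assignment table needs to be stored, and restarting the service can't reshuffle users between models mid-experiment.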

Prompt versioning: prompts are stored as semver files (prompts/{category}/vX.Y.Z.txt), with the active version env-configurable. PromptRegressionTester runs the golden eval suite against two prompt versions and flags any regression above 2%. A prompt change that improves accuracy but increases cost gets surfaced as a tradeoff - not automatically accepted.

One concrete number: running classification on Haiku instead of Sonnet saves approximately $0.003 per document. At 10,000 documents per month, that's $30/month from one inverted fallback chain. Small per-call, meaningful at volume.


The Kubernetes Deploy (And Why It Matters for the Portfolio)

DocExtract now deploys to Kubernetes via 11 Kustomize manifests: namespace, deployments for all three services, services, ingress (nginx, SSE buffering disabled), HPA (API scales 2-8 replicas at 70% CPU, worker scales 2-6), configmap, and secrets template.

# Base deploy
kubectl apply -k deploy/k8s/

# Production overlay (higher replicas, resource limits)
K8S_ENV=production make k8s-apply

The production overlay overrides replica counts and resource requests without duplicating manifests. That's the Kustomize pattern.

The AWS Terraform provisions RDS PostgreSQL 16 and ElastiCache Redis 7 (managed, not containers on EC2). The alembic migrations run automatically on boot via a retry loop in user_data.sh - the worker waits up to 2 minutes for RDS to accept connections before starting.

GHCR CI publishes three Docker images (api, worker, frontend) tagged with latest and ${{ github.sha }} on every merge to main.


7 Lessons

  1. Hybrid search from the start. Adding BM25 to a pure vector system after the fact is straightforward, but designing your retrieval interface to support both modes from day one (via ?mode=hybrid) means you never break existing callers.

  2. Golden eval before launch, not after. Build your evaluation suite from real documents during development, not post-launch when you're debugging complaints. The cost is low; the signal is high.

  3. Circuit breakers are cheaper than incident response. Shipping a circuit breaker takes a day. An LLM API outage that cascades into your whole pipeline taking down a client takes much longer to recover from - and costs trust.

  4. Observability belongs in the infrastructure layer, not the application layer. The circuit breaker state is a Prometheus gauge emitted from emit_circuit_breaker_state(). The LLM call duration is emitted from a trace_llm_call() context manager. Neither the API routes nor the extraction logic know about metrics - they just call the tracer. That separation means you can add new metrics without touching business logic.

  5. Async workers are the right abstraction for long-running AI tasks. Don't block API workers with 8-second document processing. The ARQ queue gives you retries, concurrency control, and a clean separation between "job accepted" and "job complete."

  6. Agentic retrieval outperforms static when query types vary. If your users ask structured queries (exact IDs, clause references) and semantic queries (concepts, summaries) in the same system, a per-query retrieval agent consistently outperforms any single retrieval mode - at the cost of 2-3x latency.

  7. Track cost per LLM call from day one. Once you add agentic workflows with multiple iterations, cost compounds fast. A CostTracker built at day 1 costs a few hours; retrofitting it after the fact requires touching every LLM call site.


Links

  • GitHub: github.com/ChunkyTortoise/docextract
  • Live demo: docextract-frontend.onrender.com
  • MCP server: see docs/mcp-integration.md in the repo

If you're building a RAG pipeline and hit any of these problems, happy to discuss in the comments.
