DEV Community

Cayman Roden

Building a Production RAG Pipeline That Actually Survives Monday Morning

I spent three months building a document extraction API. The first version worked great in demos. It also silently hallucinated invoice totals, crashed when Claude hit rate limits, and had no way to tell me extraction quality was degrading until a customer filed a support ticket.

This is the story of three patterns that turned it into something I'd actually deploy: circuit breaker model fallback, a golden eval CI gate, and two-pass extraction with automatic correction.

The problem: documents are messy

Every company that processes documents at scale hits the same wall. PDFs arrive in different layouts. Scanned images have OCR artifacts. Emails have attachments nested inside attachments. Template-based extraction tools break the moment a vendor changes their invoice format.

I needed an API that could accept any document, figure out what it was, and extract the right fields without being pre-configured for each layout.

Architecture: three services, seven steps

Client -> FastAPI REST API -> Redis/ARQ Queue -> Worker Pipeline:
  1. MIME detection + routing
  2. Text extraction (PDF/image/email)
  3. Document classification (Haiku)
  4. Two-pass Claude extraction (Sonnet)
  5. Business rule validation
  6. pgvector HNSW embedding (768-dim)
  7. HMAC-signed webhook delivery

The API accepts uploads, deduplicates via SHA-256 hash, and queues a job. The ARQ worker runs the seven-step pipeline asynchronously. Clients get real-time progress via Server-Sent Events.
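The dedup step can be sketched as a content-hash lookup. This is a minimal in-memory version for illustration only (the class and method names are mine, not from the repo); the real service would back the lookup with a Postgres unique index on the digest rather than a dict:

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Content hash used as the dedup key."""
    return hashlib.sha256(data).hexdigest()


class UploadDeduplicator:
    """Tracks previously seen document hashes so a re-uploaded file
    reuses the existing job instead of re-running the pipeline."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # digest -> job id

    def register(self, data: bytes, job_id: str) -> tuple[str, bool]:
        digest = sha256_digest(data)
        if digest in self._seen:
            # Duplicate upload: hand back the original job, skip the queue.
            return self._seen[digest], False
        self._seen[digest] = job_id
        return job_id, True  # new document: safe to enqueue
```

The boolean tells the API layer whether to enqueue a new ARQ job or return the existing job's status URL.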

Three decisions shaped everything that followed.

Decision 1: Two-pass extraction catches silent failures

The single biggest failure mode in document extraction is silent bad data. The model returns a plausible-looking JSON response, but the invoice total is wrong or the vendor name is truncated. Nobody notices until downstream accounting breaks.

Two-pass extraction fixes this. Pass 1 calls Claude Sonnet with a structured JSON prompt and asks for a _confidence field. If confidence drops below 0.80, Pass 2 fires a second call using Claude's tool_use API. The model returns corrections as a structured apply_corrections tool call, which gets merged into the original extraction.

This catches roughly 15-20% of extractions that would otherwise produce bad data. The remaining 80-85% never pay for a second API call.

The per-document-type confidence thresholds are configurable: identity documents default to 0.90 (high stakes), receipts to 0.75 (more noise tolerance).
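The gating logic is just a threshold lookup plus a merge. A sketch of that decision, assuming hypothetical helper names (`needs_second_pass`, `apply_corrections`) and the threshold values from above; the actual Claude calls via the Anthropic SDK are omitted:

```python
# Per-document-type confidence thresholds (values from the article).
THRESHOLDS = {"identity_document": 0.90, "receipt": 0.75}
DEFAULT_THRESHOLD = 0.80


def needs_second_pass(extraction: dict, doc_type: str) -> bool:
    """Fire Pass 2 when the model's self-reported confidence is too low."""
    threshold = THRESHOLDS.get(doc_type, DEFAULT_THRESHOLD)
    return extraction.get("_confidence", 0.0) < threshold


def apply_corrections(extraction: dict, corrections: dict) -> dict:
    """Merge a Pass 2 apply_corrections tool payload over the Pass 1 fields."""
    merged = dict(extraction)
    merged.update(corrections.get("fields", {}))
    merged["_confidence"] = corrections.get(
        "confidence", merged.get("_confidence")
    )
    return merged
```

Because the threshold check runs before any second call, the ~80-85% of confident extractions exit after a single request.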

Decision 2: Circuit breakers prevent cascading failures

The first time Claude's API hit a rate limit during a batch job, my worker crashed, the queue backed up, and retries made the rate limiting worse. Classic cascading failure.

The fix was per-model circuit breakers with a fallback chain. Each model (Sonnet, Haiku) gets its own state machine: CLOSED (healthy), OPEN (failing, route to fallback), HALF_OPEN (probe recovery).

When Sonnet trips after 5 consecutive failures, extraction automatically routes to Haiku. Accuracy drops roughly 14%, but the system stays up. After 60 seconds, the breaker enters HALF_OPEN and probes Sonnet with a single call. If the probe succeeds, traffic is restored to Sonnet.

The fallback chains are intentionally inverted by role:

  • Extraction: Sonnet (primary) -> Haiku (fallback). Quality matters most.
  • Classification: Haiku (primary) -> Sonnet (fallback). Classification is simpler; Haiku-first saves cost without quality loss.

The circuit breaker actually reduces cost during outages by failing fast instead of burning through retry budgets.
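The state machine itself is small. A sketch using the thresholds described above (5 failures to trip, 60s recovery window); the names `CircuitBreaker` and `call_with_fallback` are mine, and the real model invocation is stubbed out:

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # healthy, route normally
    OPEN = "open"            # failing, route to fallback
    HALF_OPEN = "half_open"  # allow a single probe call


class CircuitBreaker:
    """Per-model breaker: trips OPEN after `max_failures` consecutive
    failures, probes again after `recovery_s` seconds."""

    def __init__(self, max_failures: int = 5, recovery_s: float = 60.0):
        self.max_failures = max_failures
        self.recovery_s = recovery_s
        self.failures = 0
        self.state = State.CLOSED
        self.opened_at = 0.0

    def allow(self) -> bool:
        if (self.state is State.OPEN
                and time.monotonic() - self.opened_at >= self.recovery_s):
            self.state = State.HALF_OPEN  # let one probe through
        return self.state is not State.OPEN

    def record_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures or self.state is State.HALF_OPEN:
            self.state = State.OPEN
            self.opened_at = time.monotonic()


def call_with_fallback(chain: list[str], breakers: dict) -> str:
    """Route to the first model in the chain whose breaker allows traffic."""
    for model in chain:
        if breakers[model].allow():
            return model  # a real system would invoke the model here
    raise RuntimeError("all models in the fallback chain are unavailable")
```

With this shape, the extraction chain is `["sonnet", "haiku"]` and the classification chain is `["haiku", "sonnet"]`; failing fast in OPEN state is what avoids burning retry budget during an outage.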

Decision 3: The eval gate makes quality a CI signal

This was the one that changed how I think about AI systems.

I built a golden eval suite: 24 test fixtures across 6 document types (invoices, receipts, purchase orders, bank statements, medical records, identity documents). Each fixture has ground truth expected output and a recorded model response so the eval runs without API calls.

The CI gate loads the golden fixtures, scores them against ground truth, and compares to a committed baseline (currently 94.6%). If the score drops more than 2%, the build fails.
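The gate check reduces to one comparison. A minimal sketch of that logic using the article's numbers; `eval_gate` is my name for it, and the fixture loading and per-case scoring are assumed to happen upstream:

```python
import statistics


def eval_gate(case_scores: list[float], baseline: float,
              tolerance: float = 0.02) -> tuple[float, bool]:
    """Compare the mean golden-fixture score to the committed baseline.
    Regressions beyond `tolerance` fail the build; improvements pass."""
    overall = statistics.mean(case_scores)
    return overall, overall >= baseline - tolerance
```

In CI this runs against the recorded model responses, so a failed gate costs zero API calls to reproduce locally.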

Eval Regression Gate -- PASS

| Metric          | Value  |
|-----------------|--------|
| Overall Score   | 0.9462 |
| Baseline        | 0.9462 |
| Tolerance       | +/-0.02|
| Cases           | 24     |
| Brier Score     | 0.0000 |

This makes extraction quality a first-class CI signal. The same way you wouldn't merge code that drops test coverage below 80%, you can't merge a prompt change that drops extraction accuracy more than two points below the committed baseline.

The eval includes 8 adversarial fixtures designed to break things: corrupted PDFs with null bytes, blank multi-page documents, scanned tables with OCR character substitution (0/O, l/1), duplicate pages, mixed Spanish/English invoices, and redacted bank statements.

Scoring uses weighted field-level accuracy: critical fields (invoice number, total amount) are weighted 2x. Lists use best-pair alignment. A Brier score measures calibration -- whether 80% confidence actually means 80% accuracy.
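The weighted accuracy and Brier score can be sketched in a few lines. This is a simplified version under stated assumptions (exact-match field comparison, a hardcoded critical-field set, and my own function names; the real scorer also handles list alignment):

```python
CRITICAL_FIELDS = {"invoice_number", "total_amount"}


def field_score(expected: dict, actual: dict) -> float:
    """Weighted field-level accuracy: critical fields count 2x."""
    total = correct = 0.0
    for field, truth in expected.items():
        weight = 2.0 if field in CRITICAL_FIELDS else 1.0
        total += weight
        if actual.get(field) == truth:
            correct += weight
    return correct / total if total else 0.0


def brier(confidence_outcome_pairs: list[tuple[float, float]]) -> float:
    """Mean squared gap between stated confidence and actual correctness;
    0.0 means perfectly calibrated."""
    return sum((c - o) ** 2 for c, o in confidence_outcome_pairs) / len(
        confidence_outcome_pairs
    )
```

A model that claims 0.8 confidence on every case but is right only half the time gets a poor Brier score even if its accuracy looks acceptable, which is exactly the silent-miscalibration failure the eval is meant to surface.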

What I measured

| Metric                        | Value                                    |
|-------------------------------|------------------------------------------|
| Extraction accuracy           | 94.6% (24 golden fixtures, 6 doc types)  |
| Tests                         | 1,135 passing in ~7 seconds              |
| Extraction latency (p50)      | 2.1s                                     |
| Extraction latency (p95)      | 6.8s                                     |
| Cost per extraction (Sonnet)  | ~$0.01                                   |
| Cost per extraction (Haiku)   | ~$0.001                                  |
| Circuit breaker recovery      | <60s                                     |

What I'd still change

Field-level confidence: Current confidence is document-level. Field-level scores (total: 0.97, address: 0.61) would let reviewers focus on specific uncertain fields instead of re-reviewing entire documents.

Multilingual prompts: Non-English documents extract with degraded accuracy because prompts are English-only. A language-detect layer would extend coverage without model changes.

The takeaway

The circuit breaker + eval gate combination is the piece I'd carry into any future AI pipeline. Circuit breakers give you availability. The eval gate gives you measurable, CI-enforced quality. Two-pass extraction gives you a way to catch your own mistakes before they reach users.

None of this is complicated individually. The compound effect of all three is what turns a prototype into something you'd trust with real invoices on a Monday morning.


Stack: FastAPI, ARQ, PostgreSQL + pgvector, Redis, Claude Sonnet/Haiku, Gemini Embeddings, OpenTelemetry, Prometheus, Streamlit

Code: github.com/ChunkyTortoise/docextract
