<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cayman Roden</title>
    <description>The latest articles on DEV Community by Cayman Roden (@chunkytortoise_57).</description>
    <link>https://dev.to/chunkytortoise_57</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3761700%2Fda8ff5ef-a77b-4f18-b979-77e62ddc197b.jpg</url>
      <title>DEV Community: Cayman Roden</title>
      <link>https://dev.to/chunkytortoise_57</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chunkytortoise_57"/>
    <language>en</language>
    <item>
      <title>DTI Beats FICO for Prime Borrowers: What SHAP Values Reveal About Credit Risk</title>
      <dc:creator>Cayman Roden</dc:creator>
      <pubDate>Fri, 03 Apr 2026 10:58:13 +0000</pubDate>
      <link>https://dev.to/chunkytortoise_57/dti-beats-fico-for-prime-borrowers-what-shap-values-reveal-about-credit-risk-3i6c</link>
      <guid>https://dev.to/chunkytortoise_57/dti-beats-fico-for-prime-borrowers-what-shap-values-reveal-about-credit-risk-3i6c</guid>
      <description>&lt;p&gt;If you build credit models, you probably treat FICO as your primary signal. You are not wrong, exactly, but you are almost certainly missing the highest-value improvement available to you. For your best borrowers, the ones above 720, FICO is already priced in. The risk that matters in that segment is somewhere else entirely.&lt;/p&gt;

&lt;p&gt;That somewhere else is debt-to-income ratio. And the way to see it is SHAP.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;The analysis runs on 50,000 synthetic consumer installment loans, calibrated to Lending Club historical distributions with a fixed seed (42) and a roughly 15% default rate. That calibration matters: the feature distributions mirror real-world portfolio shapes rather than toy data, so the findings are less likely to be artifacts of the generator.&lt;/p&gt;

&lt;p&gt;Three models were compared: Logistic Regression, Random Forest, and Gradient Boosting. The comparison is deliberate. Any single model can produce feature importance numbers that look plausible but are artifacts of that model's structure. Running three models side by side, and then applying SHAP values across all three, lets you distinguish genuine signal from modeling quirks.&lt;/p&gt;

&lt;p&gt;SHAP (SHapley Additive exPlanations) is the right tool here, not standard feature importance. Feature importance tells you which features the model uses most. SHAP tells you how each feature pushes each individual prediction higher or lower, with a sign. You can segment the SHAP values by any borrower characteristic, which is the only way to surface the finding described below.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Finding
&lt;/h2&gt;

&lt;p&gt;Across the full lending book, FICO dominates. It has the highest mean absolute SHAP value. This is expected. FICO is a compressed summary of payment history, credit utilization, length of history, and several other factors. Of course it predicts default.&lt;/p&gt;

&lt;p&gt;But segment the portfolio to FICO 720 and above, and the picture changes. In that segment, DTI ratio becomes the dominant predictor, explaining roughly 38% of default variance. FICO drops.&lt;/p&gt;

&lt;p&gt;The SHAP beeswarm plot makes this concrete. For the full population, FICO values fan out widely on both sides of zero, meaning high and low FICO scores are both doing significant explanatory work. In the 720+ segment, those FICO dots compress toward zero. The DTI dots spread out instead.&lt;/p&gt;

&lt;p&gt;Why does this happen? Prime borrowers have already passed a FICO floor. The lender screened on FICO, so FICO variance in the approved pool is low. When variance in a feature is low, that feature cannot explain much of the outcome variance. What is left? Income and debt load. A borrower with a 740 FICO and a 44% DTI is meaningfully different from a borrower with a 740 FICO and a 22% DTI, but FICO cannot see that distinction. DTI can.&lt;/p&gt;

&lt;p&gt;The practical implication is that lenders who screen only on FICO are systematically underestimating the tail risk sitting above their prime cutoff.&lt;/p&gt;
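&lt;p&gt;You can verify the flip numerically instead of eyeballing the beeswarm. This is a sketch with illustrative stand-in values, not the repo's code; the real inputs are the SHAP matrix and FICO column produced by the reproduction snippet later in the post:&lt;/p&gt;

```python
def mean_abs_shap(shap_rows, feature_names, keep=None):
    """Mean absolute SHAP value per feature, optionally within a borrower segment."""
    if keep is None:
        keep = [True] * len(shap_rows)
    rows = [row for row, flag in zip(shap_rows, keep) if flag]
    means = {name: sum(abs(row[i]) for row in rows) / len(rows)
             for i, name in enumerate(feature_names)}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative stand-ins: 4 borrowers, 2 features (real arrays come from shap)
features = ["fico_score", "dti"]
shap_rows = [[0.60, 0.05], [-0.55, 0.10], [0.02, 0.40], [-0.01, -0.35]]
fico = [650, 680, 740, 760]

full = mean_abs_shap(shap_rows, features)                            # FICO ranks first
prime = mean_abs_shap(shap_rows, features, keep=[f >= 720 for f in fico])  # DTI ranks first
```

&lt;p&gt;Ranking mean |SHAP| within the 720+ segment is exactly the comparison the beeswarm shows visually.&lt;/p&gt;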




&lt;h2&gt;
  
  
  The Dollar Math
&lt;/h2&gt;

&lt;p&gt;For a $100M consumer installment portfolio, the numbers are specific enough to put in a business case.&lt;/p&gt;

&lt;p&gt;Tightening DTI thresholds in the 720-760 FICO band from 45% to 38% reduces annual expected losses by an estimated $600K-$900K. The assumptions: 15% portfolio default rate, 40% loss given default, and approximately 38% of defaults in this band being DTI-driven rather than FICO-driven.&lt;/p&gt;
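&lt;p&gt;The arithmetic is easy to sanity-check. The stated inputs are the $100M book, 15% default rate, 40% LGD, and the ~38% DTI-driven share; the band's share of book-wide losses and the avoidable fraction are my illustrative assumptions, chosen only to show one path into the quoted range:&lt;/p&gt;

```python
portfolio = 100_000_000
default_rate = 0.15        # stated in the post
lgd = 0.40                 # stated loss given default
dti_driven_share = 0.38    # stated share of DTI-driven defaults in the band

annual_el = portfolio * default_rate * lgd   # book-wide expected loss: $6.0M/yr

# Hypothetical assumptions (not from the analysis): the 720-760 band's share of
# book-wide losses, and the fraction of its DTI-driven losses a 45% -> 38%
# threshold change actually removes.
band_loss_share = 0.35
avoidable = 0.80

savings = annual_el * band_loss_share * dti_driven_share * avoidable
```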

&lt;p&gt;The action does not require changing the FICO cutoff, and it does not decline additional applications. It re-weights the approval decision within the existing prime segment. This is a policy parameter change, not a model change. It can go into effect without a model validation cycle.&lt;/p&gt;

&lt;p&gt;That combination -- six-figure loss reduction with no new model, no approval volume impact, and no model governance overhead -- is rare. Most credit improvement levers require tradeoffs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Regulatory Angle
&lt;/h2&gt;

&lt;p&gt;There is a fair lending dimension worth paying attention to.&lt;/p&gt;

&lt;p&gt;FICO score can act as a proxy for protected class characteristics. This is well documented. ECOA (Reg B) and, for housing-related credit, the Fair Housing Act require that adverse action notices state specific, accurate reasons, and neutral, income-related factors are the easiest reasons to defend. DTI is precisely that: it measures a borrower's actual debt load relative to income. It is not a protected characteristic, and it does not carry the proxy concerns that FICO does.&lt;/p&gt;

&lt;p&gt;A DTI-first screening approach in the prime band is also more defensible under a fair lending examination. If a regulator asks why you denied a 725 FICO borrower, "DTI of 46% exceeds our prime-band threshold of 38%" is a cleaner answer than any explanation that depends on the FICO composite. Examiners know what goes into FICO.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical regulatory posture. CFPB supervisory guidance has repeatedly emphasized transparent, financially grounded underwriting factors over opaque composites, provided those factors are genuinely predictive. The SHAP analysis confirms that DTI is genuinely predictive in this segment.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Reproduce It
&lt;/h2&gt;

&lt;p&gt;The full analysis runs on the Credit Risk Explorer page of the live dashboard at &lt;a href="https://finance-analytics-portfolio.streamlit.app" rel="noopener noreferrer"&gt;finance-analytics-portfolio.streamlit.app&lt;/a&gt;. The source is at &lt;a href="https://github.com/ChunkyTortoise/finance-analytics-portfolio" rel="noopener noreferrer"&gt;github.com/ChunkyTortoise/finance-analytics-portfolio&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To run SHAP on the credit model locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shap&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;analysis.credit_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CREDIT_FEATURES&lt;/span&gt;

&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;random_forest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;explainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TreeExplainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;shap_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;explainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shap_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Segment to prime borrowers only
&lt;/span&gt;&lt;span class="n"&gt;prime_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fico_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;720&lt;/span&gt;
&lt;span class="n"&gt;shap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary_plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shap_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;prime_mask&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prime_mask&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The beeswarm plot that results from the segmented call is where the DTI flip becomes visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Every analyst building a credit model should run SHAP on their high-FICO segment separately, not just on the full portfolio. The full-population feature importance is almost always dominated by FICO, which makes it easy to conclude that FICO is the only signal worth tracking. That conclusion is wrong for your prime borrowers.&lt;/p&gt;

&lt;p&gt;DTI is the actionable variable in the segment where it matters most. The improvement is a policy change, not a modeling exercise. And it happens to be the more defensible choice under fair lending scrutiny.&lt;/p&gt;

&lt;p&gt;The finding generalizes: any model trained on a population that has already been filtered on a key feature will systematically underweight the variables that matter within that filtered population. SHAP, segmented by the filter criterion, is the fastest way to find those blind spots.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Cayman Roden is a data analyst specializing in financial services analytics. Full analysis at &lt;a href="https://finance-analytics-portfolio.streamlit.app" rel="noopener noreferrer"&gt;Finance Analytics Portfolio&lt;/a&gt; | &lt;a href="https://github.com/ChunkyTortoise/finance-analytics-portfolio" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Your AI Portfolio Needs Observability, Not More Repos</title>
      <dc:creator>Cayman Roden</dc:creator>
      <pubDate>Wed, 25 Mar 2026 07:42:00 +0000</pubDate>
      <link>https://dev.to/chunkytortoise_57/why-your-ai-portfolio-needs-observability-not-more-repos-2ac0</link>
      <guid>https://dev.to/chunkytortoise_57/why-your-ai-portfolio-needs-observability-not-more-repos-2ac0</guid>
      <description>&lt;p&gt;I have 15 GitHub repos and 10,800+ automated tests. My freelance rate was stuck at $40-55/hr. After running a multi-model research pipeline (Perplexity, Gemini, Grok, ChatGPT) to figure out what to build next, every model converged on the same answer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop building. Start observing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The production deployment gap is the #1 constraint on AI engineer hiring signal and rates in 2026. Only 11% of enterprises have deployed AI agents to production despite 66% experimenting. The engineers who can prove their systems work at runtime (not just at test time) are 10x rarer than the ones who can build the prototype.&lt;/p&gt;

&lt;p&gt;Here is what I changed and what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap: 1,109 Tests, Zero Traces
&lt;/h2&gt;

&lt;p&gt;My RAG pipeline (DocExtract) had everything a hiring manager might want to see in a repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agentic RAG with ReAct reasoning loop&lt;/li&gt;
&lt;li&gt;Circuit breaker model fallback (Sonnet to Haiku)&lt;/li&gt;
&lt;li&gt;HITL correction workflow with audit trail&lt;/li&gt;
&lt;li&gt;RAGAS evaluation + LLM-as-judge CI gate&lt;/li&gt;
&lt;li&gt;Kubernetes manifests, Terraform IaC, Docker multi-stage builds&lt;/li&gt;
&lt;li&gt;1,109 tests at 90%+ coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it did NOT have: any evidence it had ever processed a real request. No traces. No latency dashboards. No cost-per-request visibility. No runtime quality monitoring.&lt;/p&gt;

&lt;p&gt;A hiring manager clicking the repo would see impressive architecture docs and green CI badges. But they would have no way to verify the system actually works under real conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Sync Sidecar Pattern
&lt;/h2&gt;

&lt;p&gt;The key constraint: observability must never slow down the request path.&lt;/p&gt;

&lt;p&gt;The pattern is simple. FastAPI BackgroundTasks handle all trace submission after the response is sent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;background_tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BackgroundTasks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;langfuse_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;run_extraction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;background_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;langfuse_flush&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user gets their extraction result immediately. Langfuse receives the full trace (model, tokens, latency, confidence) in the background. This adds ~0ms to the request path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tiered Evaluation: The $0 CI Gate
&lt;/h2&gt;

&lt;p&gt;Running LLM-as-a-judge on every PR is financially unviable for a solo engineer. At $0.003+ per metric per test case, a 50-case golden set with three metrics costs roughly $0.45 per CI run -- $50-100/month at typical PR volume.&lt;/p&gt;

&lt;p&gt;The solution is tiered evaluation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 (every PR, $0):&lt;/strong&gt; Deterministic checks only&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema conformance (Pydantic validation)&lt;/li&gt;
&lt;li&gt;Confidence scores in 0.0-1.0 range&lt;/li&gt;
&lt;li&gt;Field completeness (no empty extractions)&lt;/li&gt;
&lt;li&gt;Citation grounding (extracted values appear in source text)&lt;/li&gt;
&lt;li&gt;Baseline accuracy above 90%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 (nightly cron, API cost):&lt;/strong&gt; LLM-as-a-judge via DeepEval&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contextual precision&lt;/li&gt;
&lt;li&gt;Faithfulness (hallucination detection)&lt;/li&gt;
&lt;li&gt;Answer relevancy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives fast, reliable feedback on every change while reserving expensive quality checks for nightly validation.&lt;/p&gt;
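&lt;p&gt;A Tier 1 gate can be a few dozen lines of plain Python. This sketch covers three of the checks above (confidence range, field completeness, citation grounding); the field names and dict shape are hypothetical, not DocExtract's actual schema:&lt;/p&gt;

```python
def tier1_checks(extraction, source_text):
    """Deterministic Tier-1 gate: no LLM calls, safe to run on every PR.
    `extraction` is an illustrative dict like {"vendor": ..., "_confidence": ...}."""
    failures = []
    conf = extraction.get("_confidence")
    if conf is None or not 1.0 >= conf >= 0.0:
        failures.append("confidence out of range")
    for field, value in extraction.items():
        if field.startswith("_"):          # skip metadata fields
            continue
        if value in (None, "", []):
            failures.append(f"empty field: {field}")
        elif isinstance(value, str) and value not in source_text:
            failures.append(f"ungrounded value: {field}")   # citation grounding
    return failures
```

&lt;p&gt;An empty list means the PR passes Tier 1; anything else fails the build with a named reason.&lt;/p&gt;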

&lt;h2&gt;
  
  
  PII Sanitization: Non-Negotiable Before Tracing
&lt;/h2&gt;

&lt;p&gt;You cannot send production user data to external tracing services in plain text. Before any trace leaves the application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Regex-based PII masking (SSN, credit card, phone, email)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sanitize_for_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sanitize_for_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero dependencies. Deterministic. Runs before every Langfuse trace submission.&lt;/p&gt;
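&lt;p&gt;The snippet assumes a module-level &lt;code&gt;_PATTERNS&lt;/code&gt;. A minimal version might look like this; the patterns are illustrative, not the repo's actual list, and real deployments need broader coverage:&lt;/p&gt;

```python
import re

# (mask_token, compiled_regex) pairs, applied in order
_PATTERNS = [
    ("[SSN]",   re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("[CARD]",  re.compile(r"\b(?:\d[ -]?){13,16}\b")),
]

def mask(text):
    """Apply every pattern to a string (the str branch of sanitize_for_trace)."""
    for token, pattern in _PATTERNS:
        text = pattern.sub(token, text)
    return text
```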

&lt;h2&gt;
  
  
  The HITL Data Advantage
&lt;/h2&gt;

&lt;p&gt;Human corrections are not just UX. They are data assets.&lt;/p&gt;

&lt;p&gt;DocExtract's review queue captures structured corrections: original extraction, corrected fields, error type, reviewer ID. This creates organic training data for future fine-tuning without the typical $2K-8K dataset curation cost.&lt;/p&gt;

&lt;p&gt;When correction volume reaches critical mass, this feeds directly into a QLoRA fine-tuning pipeline (DPO pairs already exported in JSONL format).&lt;/p&gt;
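&lt;p&gt;Turning a correction record into a DPO pair is mechanical: the corrected output is the preferred completion, the original extraction the rejected one. A sketch using the record fields named above; the prompt format and exact schema are my assumptions, not the repo's export code:&lt;/p&gt;

```python
import json

def correction_to_dpo(record):
    """Map one review-queue correction to a DPO preference pair (one JSONL line).
    Record fields follow the post: original extraction, corrected fields,
    error type, reviewer ID. The prompt template here is hypothetical."""
    prompt = f"Extract fields from document {record['doc_id']}"
    return {
        "prompt": prompt,
        "chosen": json.dumps(record["corrected_fields"], sort_keys=True),
        "rejected": json.dumps(record["original_extraction"], sort_keys=True),
    }

record = {
    "doc_id": "inv-0042",
    "original_extraction": {"total": "1,240.00"},
    "corrected_fields": {"total": "1,420.00"},
    "error_type": "digit_transposition",
    "reviewer_id": "r7",
}
line = json.dumps(correction_to_dpo(record))  # append to the JSONL export
```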

&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add Langfuse from day one.&lt;/strong&gt; Retrofitting observability onto 40+ endpoints is more work than building it in. The Sync Sidecar pattern adds about 10 lines per endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with cloud-managed everything.&lt;/strong&gt; Self-hosting Langfuse requires ClickHouse + Redis + S3. The cloud free tier (1M spans/month) is the right call for a solo engineer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tier the eval strategy earlier.&lt;/strong&gt; Running DeepEval on every PR sounded good in theory. The tiered approach (deterministic CI + nightly LLM-judge) is the sustainable pattern.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;The multi-model research consensus was clear: 15 repos with 10K+ tests is more code than most engineers at double my rate can show. The gap is not engineering skill. It is observable production signal.&lt;/p&gt;

&lt;p&gt;Monitoring and observability represent roughly 70% of production AI work, yet almost nobody puts that work in their portfolio. Adding Langfuse tracing, tiered DeepEval CI gates, PII sanitization, and cost tracking transforms a demo project into a production system.&lt;/p&gt;

&lt;p&gt;The code changes took 2 days. The positioning shift is worth 10x that in hiring signal.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stack: FastAPI, PostgreSQL + pgvector, Redis, Claude API, Langfuse, DeepEval, ARQ, Docker, Kubernetes, GitHub Actions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Repo: &lt;a href="https://github.com/ChunkyTortoise/docextract" rel="noopener noreferrer"&gt;github.com/ChunkyTortoise/docextract&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>rag</category>
      <category>career</category>
    </item>
    <item>
      <title>Building a Production RAG Pipeline That Actually Survives Monday Morning</title>
      <dc:creator>Cayman Roden</dc:creator>
      <pubDate>Wed, 25 Mar 2026 06:32:50 +0000</pubDate>
      <link>https://dev.to/chunkytortoise_57/building-a-production-rag-pipeline-that-actually-survives-monday-morning-216m</link>
      <guid>https://dev.to/chunkytortoise_57/building-a-production-rag-pipeline-that-actually-survives-monday-morning-216m</guid>
      <description>&lt;p&gt;I spent three months building a document extraction API. The first version worked great in demos. It also silently hallucinated invoice totals, crashed when Claude hit rate limits, and had no way to tell me extraction quality was degrading until a customer filed a support ticket.&lt;/p&gt;

&lt;p&gt;This is the story of three patterns that turned it into something I'd actually deploy: circuit breaker model fallback, a golden eval CI gate, and two-pass extraction with automatic correction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: documents are messy
&lt;/h2&gt;

&lt;p&gt;Every company that processes documents at scale hits the same wall. PDFs arrive in different layouts. Scanned images have OCR artifacts. Emails have attachments nested inside attachments. Template-based extraction tools break the moment a vendor changes their invoice format.&lt;/p&gt;

&lt;p&gt;I needed an API that could accept any document, figure out what it was, and extract the right fields without being pre-configured for each layout.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: three services, seven steps
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client -&amp;gt; FastAPI REST API -&amp;gt; Redis/ARQ Queue -&amp;gt; Worker Pipeline:
  1. MIME detection + routing
  2. Text extraction (PDF/image/email)
  3. Document classification (Haiku)
  4. Two-pass Claude extraction (Sonnet)
  5. Business rule validation
  6. pgvector HNSW embedding (768-dim)
  7. HMAC-signed webhook delivery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API accepts uploads, deduplicates via SHA-256 hash, and queues a job. The ARQ worker runs the seven-step pipeline asynchronously. Clients get real-time progress via Server-Sent Events.&lt;/p&gt;
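&lt;p&gt;The dedup step is worth spelling out, since it doubles as the idempotency key for the queue. A minimal sketch of the idea:&lt;/p&gt;

```python
import hashlib

def dedupe_key(file_bytes):
    """Content hash used as the dedup/idempotency key: same bytes, same job."""
    return hashlib.sha256(file_bytes).hexdigest()
```

&lt;p&gt;Identical bytes always map to the same key, so a re-upload can short-circuit to the existing job instead of enqueueing a duplicate.&lt;/p&gt;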

&lt;p&gt;Three decisions shaped everything that followed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision 1: Two-pass extraction catches silent failures
&lt;/h2&gt;

&lt;p&gt;The single biggest failure mode in document extraction is &lt;em&gt;silent&lt;/em&gt; bad data. The model returns a plausible-looking JSON response, but the invoice total is wrong or the vendor name is truncated. Nobody notices until downstream accounting breaks.&lt;/p&gt;

&lt;p&gt;Two-pass extraction fixes this. Pass 1 calls Claude Sonnet with a structured JSON prompt and asks for a &lt;code&gt;_confidence&lt;/code&gt; field. If confidence drops below 0.80, Pass 2 fires a second call using Claude's &lt;code&gt;tool_use&lt;/code&gt; API. The model returns corrections as a structured &lt;code&gt;apply_corrections&lt;/code&gt; tool call, which gets merged into the original extraction.&lt;/p&gt;

&lt;p&gt;This catches roughly 15-20% of extractions that would otherwise produce bad data. The remaining 80-85% never pay for a second API call.&lt;/p&gt;

&lt;p&gt;The per-document-type confidence thresholds are configurable: identity documents default to 0.90 (high stakes), receipts to 0.75 (more noise tolerance).&lt;/p&gt;
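&lt;p&gt;The control flow is worth seeing in one place. This is a sketch of the logic, not DocExtract's code: &lt;code&gt;run_correction_pass&lt;/code&gt; stands in for the tool_use correction call, and the threshold table is hypothetical beyond the two values quoted above:&lt;/p&gt;

```python
# Per-document-type confidence thresholds; identity and receipt values are
# from the post, the rest of the table (and the 0.80 default) is illustrative.
THRESHOLDS = {"identity": 0.90, "receipt": 0.75}

def two_pass_extract(doc_type, pass1, run_correction_pass):
    """Pass 1 result in, final extraction out. `pass1` is the structured JSON
    (including `_confidence`); `run_correction_pass` wraps the second call."""
    threshold = THRESHOLDS.get(doc_type, 0.80)
    if pass1.get("_confidence", 0.0) >= threshold:
        return pass1                                # most documents stop here
    corrections = run_correction_pass(pass1)        # apply_corrections tool call
    merged = dict(pass1)
    merged.update(corrections)                      # corrected fields overwrite
    return merged
```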

&lt;h2&gt;
  
  
  Decision 2: Circuit breakers prevent cascading failures
&lt;/h2&gt;

&lt;p&gt;The first time Claude's API hit a rate limit during a batch job, my worker crashed, the queue backed up, and retries made the rate limiting worse. Classic cascading failure.&lt;/p&gt;

&lt;p&gt;The fix was per-model circuit breakers with a fallback chain. Each model (Sonnet, Haiku) gets its own state machine: CLOSED (healthy), OPEN (failing, route to fallback), HALF_OPEN (probe recovery).&lt;/p&gt;

&lt;p&gt;When Sonnet trips after 5 consecutive failures, extraction automatically routes to Haiku. Accuracy drops roughly 14%, but the system stays up. After 60 seconds, the breaker enters HALF_OPEN and probes Sonnet with a single call. If it succeeds, traffic is restored.&lt;/p&gt;

&lt;p&gt;The fallback chains are intentionally inverted by role:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extraction&lt;/strong&gt;: Sonnet (primary) -&amp;gt; Haiku (fallback). Quality matters most.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classification&lt;/strong&gt;: Haiku (primary) -&amp;gt; Sonnet (fallback). Classification is simpler; Haiku-first saves cost without quality loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The circuit breaker actually &lt;em&gt;reduces&lt;/em&gt; cost during outages by failing fast instead of burning through retry budgets.&lt;/p&gt;
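&lt;p&gt;The state machine fits in a small class. A minimal sketch (not the repo's implementation), using the thresholds described above -- 5 consecutive failures, 60-second cooldown -- with an injectable clock so it's testable:&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Per-model breaker: CLOSED (healthy), OPEN (route to fallback),
    HALF_OPEN (probe recovery with a single call)."""
    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        """False means: skip this model and go straight to its fallback."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"          # cooldown elapsed: allow one probe
        return self.state != "OPEN"

    def record_success(self):
        self.state, self.failures = "CLOSED", 0

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"               # trip (or re-trip after failed probe)
            self.opened_at = self.clock()
            self.failures = 0
```

&lt;p&gt;The fallback chain then becomes: call the primary only if its breaker's &lt;code&gt;allow_request()&lt;/code&gt; returns True, otherwise try the next model.&lt;/p&gt;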

&lt;h2&gt;
  
  
  Decision 3: The eval gate makes quality a CI signal
&lt;/h2&gt;

&lt;p&gt;This was the one that changed how I think about AI systems.&lt;/p&gt;

&lt;p&gt;I built a golden eval suite: 24 test fixtures across 6 document types (invoices, receipts, purchase orders, bank statements, medical records, identity documents). Each fixture has ground truth expected output and a recorded model response so the eval runs without API calls.&lt;/p&gt;

&lt;p&gt;The CI gate loads the golden fixtures, scores them against ground truth, and compares to a committed baseline (currently 94.6%). If the score drops more than 2%, the build fails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Eval Regression Gate -- PASS

| Metric          | Value  |
|-----------------|--------|
| Overall Score   | 0.9462 |
| Baseline        | 0.9462 |
| Tolerance       | +/-0.02|
| Cases           | 24     |
| Brier Score     | 0.0000 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means extraction quality is a first-class CI signal. The same way you wouldn't merge code that drops test coverage below 80%, you can't merge a prompt change that drops extraction accuracy below 92%.&lt;/p&gt;

&lt;p&gt;The eval includes 8 adversarial fixtures designed to break things: corrupted PDFs with null bytes, blank multi-page documents, scanned tables with OCR character substitution (0/O, l/1), duplicate pages, mixed Spanish/English invoices, and redacted bank statements.&lt;/p&gt;

&lt;p&gt;Scoring uses weighted field-level accuracy: critical fields (invoice number, total amount) are weighted 2x. Lists use best-pair alignment. A Brier score measures calibration -- whether 80% confidence actually means 80% accuracy.&lt;/p&gt;
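&lt;p&gt;The scoring and the gate are each a handful of lines. A sketch with hypothetical field names; the 2x critical weighting, the 94.6% baseline, and the 2% tolerance mirror the numbers above, while list alignment and the Brier calculation are omitted for brevity:&lt;/p&gt;

```python
def weighted_field_score(expected, actual, critical=("invoice_number", "total_amount")):
    """Field-level accuracy with critical fields weighted 2x."""
    total_weight = earned = 0.0
    for field, truth in expected.items():
        weight = 2.0 if field in critical else 1.0
        total_weight += weight
        if actual.get(field) == truth:
            earned += weight
    return earned / total_weight

def eval_gate(case_scores, baseline=0.9462, tolerance=0.02):
    """Fail the build if the suite average drops more than `tolerance` below baseline."""
    overall = sum(case_scores) / len(case_scores)
    return overall >= baseline - tolerance, overall
```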

&lt;h2&gt;
  
  
  What I measured
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Extraction accuracy&lt;/td&gt;
&lt;td&gt;94.6% (24 golden fixtures, 6 doc types)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;1,135 passing in ~7 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction latency (p50)&lt;/td&gt;
&lt;td&gt;2.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction latency (p95)&lt;/td&gt;
&lt;td&gt;6.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per extraction (Sonnet)&lt;/td&gt;
&lt;td&gt;~$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per extraction (Haiku)&lt;/td&gt;
&lt;td&gt;~$0.001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Circuit breaker recovery&lt;/td&gt;
&lt;td&gt;&amp;lt;60s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I'd still change
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Field-level confidence&lt;/strong&gt;: Current confidence is document-level. Field-level scores (total: 0.97, address: 0.61) would let reviewers focus on specific uncertain fields instead of re-reviewing entire documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilingual prompts&lt;/strong&gt;: Non-English documents extract with degraded accuracy because prompts are English-only. A language-detect layer would extend coverage without model changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The circuit breaker + eval gate combination is the piece I'd carry into any future AI pipeline. Circuit breakers give you availability. The eval gate gives you measurable, CI-enforced quality. Two-pass extraction gives you a way to catch your own mistakes before they reach users.&lt;/p&gt;

&lt;p&gt;None of this is complicated individually. The compound effect of all three is what turns a prototype into something you'd trust with real invoices on a Monday morning.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stack: FastAPI, ARQ, PostgreSQL + pgvector, Redis, Claude Sonnet/Haiku, Gemini Embeddings, OpenTelemetry, Prometheus, Streamlit&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Code: &lt;a href="https://github.com/ChunkyTortoise/docextract" rel="noopener noreferrer"&gt;github.com/ChunkyTortoise/docextract&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building a Production RAG Pipeline That Actually Works: Lessons from DocExtract</title>
      <dc:creator>Cayman Roden</dc:creator>
      <pubDate>Mon, 23 Mar 2026 04:01:47 +0000</pubDate>
      <link>https://dev.to/chunkytortoise_57/building-a-production-rag-pipeline-that-actually-works-lessons-from-docextract-lmm</link>
      <guid>https://dev.to/chunkytortoise_57/building-a-production-rag-pipeline-that-actually-works-lessons-from-docextract-lmm</guid>
      <description>&lt;h2&gt;
  
  
  The Architecture (and Why It's 3 Services, Not 1)
&lt;/h2&gt;

&lt;p&gt;DocExtract is split into three services: an API, a worker, and a frontend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User uploads PDF
    → API validates and enqueues job (ARQ/Redis)
    → Worker picks up job asynchronously
        → chunk + embed → pgvector store
        → BM25 index built in memory on retrieval
    → API streams SSE progress to frontend
    → User queries with natural language
        → hybrid retrieval → Claude generates answer with citations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why not one FastAPI service? Because document processing is slow (2-8 seconds per page), and you don't want your API workers blocked. The ARQ queue decouples upload from processing, which lets you scale workers independently and gives you a natural retry boundary.&lt;/p&gt;

&lt;p&gt;The async split also means you can add real-time progress streaming (SSE) to the frontend without any threading complexity - the worker updates job state in Redis, the API polls it, and the frontend gets a 12-step progress bar that actually reflects what's happening.&lt;/p&gt;

&lt;p&gt;The full system has 1,060 tests with 90%+ coverage across the extraction pipeline, retrieval paths, agent evaluator, guardrails, and infrastructure layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Pure Vector Search Fails (and What to Do About It)
&lt;/h2&gt;

&lt;p&gt;Vector search is great at semantic similarity. "mortgage prequalification" and "home loan eligibility" have similar embeddings. But "Section 3.2(b)" - the exact contract clause a user is looking for - doesn't.&lt;/p&gt;

&lt;p&gt;BM25 catches what embeddings miss. Exact product codes, invoice numbers, legal citations, acronyms - these score high on BM25 and often near-zero on cosine similarity.&lt;/p&gt;

&lt;p&gt;The solution is Reciprocal Rank Fusion (RRF). You run both retrievers, rank the results independently, then combine the rank positions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rrf_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;bm25_rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bm25_ranks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_ids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bm25_rank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;vector_rank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The constant 60 is the standard RRF smoothing constant. It damps the advantage of the very top ranks, so no single retriever can dominate the fused score - and a document missing from one list simply receives that list's worst rank instead of a zero. A document ranked #1 by vector and #3 by BM25 scores higher than one ranked #1 by vector alone.&lt;/p&gt;
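&lt;p&gt;To make the fusion concrete, here's a self-contained toy example. The document IDs and rank data are invented for illustration:&lt;/p&gt;

```python
# Reciprocal Rank Fusion over two illustrative rank lists (k=60).
# Ranks are 0-based, matching the rrf_score function shown above.

def rrf(rank: int, k: int = 60) -> float:
    return 1.0 / (k + rank)

vector_ranks = {"doc_a": 0, "doc_b": 1, "doc_c": 2}   # doc_a is #1 by vector
bm25_ranks = {"doc_b": 0, "doc_a": 2}                 # doc_b is #1 by BM25; doc_c absent

num_candidates = 3  # documents missing from a list get the worst rank

def fused_score(doc_id: str) -> float:
    v = vector_ranks.get(doc_id, num_candidates)
    b = bm25_ranks.get(doc_id, num_candidates)
    return rrf(v) + rrf(b)

scores = sorted(
    ((d, fused_score(d)) for d in vector_ranks),
    key=lambda pair: pair[1],
    reverse=True,
)
# doc_a: 1/60 + 1/62 ≈ 0.0328; doc_b: 1/61 + 1/60 ≈ 0.0331
# doc_b wins: agreement across both retrievers beats a single top rank.
```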

&lt;p&gt;The API exposes this as a mode parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /api/v1/records/search?q=Section+3.2&amp;amp;mode=hybrid
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mode=vector&lt;/code&gt; (default), &lt;code&gt;mode=bm25&lt;/code&gt;, or &lt;code&gt;mode=hybrid&lt;/code&gt;. The BM25 index is built in memory at query time from the retrieved vector candidates. No separate BM25 service, no sync complexity. It adds maybe 20ms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 92.6% accuracy on a 16-fixture golden evaluation suite. Pure vector was at 86.9% on the same fixtures.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Static RAG to Agentic RAG
&lt;/h2&gt;

&lt;p&gt;Static RAG forces you to pick one retrieval mode per deployment. The real problem is that retrieval quality varies by query type, and you don't know at deploy time what queries you'll get.&lt;/p&gt;

&lt;p&gt;The solution is a ReAct (Reasoning + Acting) agent that picks the right retrieval approach per-query. Each query goes through Think→Act→Observe cycles, choosing from five tools: &lt;code&gt;search_vectors&lt;/code&gt;, &lt;code&gt;search_bm25&lt;/code&gt;, &lt;code&gt;search_hybrid&lt;/code&gt;, &lt;code&gt;lookup_metadata&lt;/code&gt;, and &lt;code&gt;rerank_results&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The agent is confidence-gated at 0.8 - if it reaches that threshold, it stops iterating. Max 3 iterations caps cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;thought&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;think&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;act&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thought&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;tool_args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;confidence_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: the agent does what a senior engineer does mentally. For "Section 3.2(b)" use BM25 - it's an exact citation. For "documents about loan eligibility" use vector - it's a concept. For ambiguous queries, use hybrid. The difference is the agent makes this decision per-query, not per-deployment.&lt;/p&gt;

&lt;p&gt;At 2-3x the latency of a single retrieval call, this isn't free. But if your users ask both structured queries (exact IDs, clause references) and semantic queries (concepts, summaries), a per-query agent consistently outperforms any static retrieval mode.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Golden Eval CI Gate
&lt;/h2&gt;

&lt;p&gt;This is the piece most RAG pipelines skip, and it's the most important one.&lt;/p&gt;

&lt;p&gt;Without a regression gate, you can accidentally degrade retrieval quality during a refactor and not notice until a user complains. With a gate, a PR that drops accuracy by more than 2% gets blocked automatically.&lt;/p&gt;

&lt;p&gt;The setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;16 document fixtures (contracts, invoices, reports) with expected extraction output in a JSON file&lt;/li&gt;
&lt;li&gt;A pytest test that runs the full pipeline end-to-end against those fixtures&lt;/li&gt;
&lt;li&gt;A pass threshold of 92.6% (the current baseline); anything below 90.6% blocks the merge
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_golden_eval_accuracy&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fixtures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GOLDEN_FIXTURES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;ACCURACY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Golden eval failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;lt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ACCURACY_THRESHOLD&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; passed)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fixtures are real documents with personally identifiable information removed. The expected outputs cover edge cases: multi-column tables, handwritten fields (which fail gracefully), and mixed-language documents.&lt;/p&gt;

&lt;p&gt;This runs in CI on every PR. It catches prompt regressions, embedding model changes, and chunking strategy changes before they ship.&lt;/p&gt;




&lt;h2&gt;
  
  
  Evaluating the Agent, Not Just the Output
&lt;/h2&gt;

&lt;p&gt;The golden eval gate measures extraction accuracy. For agentic retrieval, you also need to measure whether the agent's behavior was sensible - did it pick the right tools? Did it iterate efficiently?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAGAS pipeline&lt;/strong&gt; with three weighted metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;context_recall&lt;/code&gt; - weight 0.35&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;faithfulness&lt;/code&gt; - weight 0.40&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;answer_relevancy&lt;/code&gt; - weight 0.25&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Faithfulness carries the highest weight because hallucination is the worst failure mode for a document extraction API. A retrieved context that gets misrepresented in the answer is more dangerous than a missed chunk.&lt;/p&gt;
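&lt;p&gt;The weighted composite reduces to a dot product over the three metrics. A sketch, using the weights above - the helper name is mine:&lt;/p&gt;

```python
# Weighted RAGAS composite; weights from the article, helper name illustrative.
RAGAS_WEIGHTS = {
    "context_recall": 0.35,
    "faithfulness": 0.40,
    "answer_relevancy": 0.25,
}

def ragas_composite(metrics: dict) -> float:
    """Weighted sum of per-metric scores, each in [0, 1]."""
    return sum(RAGAS_WEIGHTS[name] * metrics[name] for name in RAGAS_WEIGHTS)

# A hallucination-heavy answer (low faithfulness) is punished hardest:
score = ragas_composite(
    {"context_recall": 0.9, "faithfulness": 0.5, "answer_relevancy": 0.9}
)
# 0.35*0.9 + 0.40*0.5 + 0.25*0.9 = 0.74
```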

&lt;p&gt;&lt;strong&gt;LLM-as-judge&lt;/strong&gt; scores outputs against structured rubrics with few-shot examples. It extracts the evidence for each scoring decision - you get an auditable trace, not just a number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent evaluation&lt;/strong&gt; adds three dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool selection quality: Jaccard similarity against expected tool sequences when ground truth is known, redundancy penalty otherwise.&lt;/li&gt;
&lt;li&gt;Iteration efficiency: linear decay from 1.0 at 1 iteration to 0.5 at the max iteration count.&lt;/li&gt;
&lt;li&gt;Confidence calibration: trajectory trend and word-overlap with ground truth across iterations.&lt;/li&gt;
&lt;/ul&gt;
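&lt;p&gt;The linear decay for iteration efficiency fits in a one-liner. The function name and exact interpolation below are my reading of the described behavior, not code from the repo:&lt;/p&gt;

```python
def iteration_efficiency(iterations: int, max_iterations: int = 3) -> float:
    """Linear decay: 1.0 at one iteration down to 0.5 at max_iterations."""
    if max_iterations <= 1:
        return 1.0
    frac = (iterations - 1) / (max_iterations - 1)
    return 1.0 - 0.5 * frac
```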

&lt;p&gt;Both RAGAS and agent evaluation are feature-flagged (&lt;code&gt;RAGAS_ENABLED&lt;/code&gt;, &lt;code&gt;LLM_JUDGE_ENABLED&lt;/code&gt;) to avoid CI cost. The golden eval is the mandatory gate. These are the diagnostic layer - run them in staging, not on every PR.&lt;/p&gt;




&lt;h2&gt;
  
  
  Circuit Breakers for LLM Calls
&lt;/h2&gt;

&lt;p&gt;LLM APIs fail. Rate limits, transient 5xx errors, model deprecations - your pipeline will experience all of them. A circuit breaker turns cascading failures into graceful degradation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CircuitState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;CLOSED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;       &lt;span class="c1"&gt;# Healthy - calls pass through
&lt;/span&gt;    &lt;span class="n"&gt;OPEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;           &lt;span class="c1"&gt;# Failing - calls rejected immediately
&lt;/span&gt;    &lt;span class="n"&gt;HALF_OPEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;half_open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Recovering - one probe call allowed
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AsyncCircuitBreaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;failure_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;recovery_timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;60.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;half_open_max_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The state machine: after 5 consecutive failures, the circuit opens. All calls fail immediately (no network round-trip). After 60 seconds, one probe call is allowed. If it succeeds, back to CLOSED. If it fails, back to OPEN.&lt;/p&gt;
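&lt;p&gt;For readers who want the full shape of that state machine, here's a minimal synchronous sketch - simplified from the async version above, with no locking and no half-open call budget:&lt;/p&gt;

```python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Minimal sync sketch of the state machine described above."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = CircuitState.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is CircuitState.OPEN:
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN  # allow one probe call
            else:
                raise CircuitOpenError("circuit open - failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or hitting the threshold, (re)opens the circuit.
            if self.state is CircuitState.HALF_OPEN or self.failures >= self.failure_threshold:
                self.state = CircuitState.OPEN
                self.opened_at = time.monotonic()
            raise
        else:
            # Any success resets the breaker to healthy.
            self.failures = 0
            self.state = CircuitState.CLOSED
            return result
```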

&lt;p&gt;The non-obvious design choice is &lt;strong&gt;inverting the fallback chain by task type&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extraction&lt;/strong&gt; (quality matters): Sonnet → Haiku fallback. Sonnet is more accurate; fall back to Haiku only under failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classification&lt;/strong&gt; (cost matters): Haiku → Sonnet fallback. Haiku is cheaper and fast enough; escalate to Sonnet only under failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This also means the fallback actually reduces costs in the classification path - a side benefit of designing for failure correctly.&lt;/p&gt;
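&lt;p&gt;The inverted chains can be expressed as plain data. A sketch with illustrative model names - not the repo's actual routing code:&lt;/p&gt;

```python
# Task-dependent fallback chains (model names illustrative).
FALLBACK_CHAINS = {
    "extraction": ["claude-sonnet", "claude-haiku"],      # quality first
    "classification": ["claude-haiku", "claude-sonnet"],  # cost first
}

def call_with_fallback(task: str, call_model):
    """Try each model in the task's chain; raise only if all fail."""
    last_exc = None
    for model in FALLBACK_CHAINS[task]:
        try:
            return call_model(model)
        except Exception as exc:  # in real code: only transient errors
            last_exc = exc
    raise last_exc
```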

&lt;p&gt;One more thing: distinguish transient errors from permanent ones. A 429 (rate limit) or 503 (overloaded) should trigger the circuit. A 400 (bad request) is your bug and should never trigger a fallback.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_transient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Only trigger circuit on errors that might resolve themselves.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RateLimitError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;APIStatusError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Observability That Actually Tells You Something
&lt;/h2&gt;

&lt;p&gt;Three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. OpenTelemetry + Prometheus&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every LLM call emits four metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llm_call_duration_ms&lt;/code&gt; - histogram, tagged by model and operation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llm_calls_total&lt;/code&gt; - counter, tagged by status (success/failure)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llm_tokens_total&lt;/code&gt; - counter, split by input/output&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;circuit_breaker_state&lt;/code&gt; - gauge (0=CLOSED, 1=HALF_OPEN, 2=OPEN)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The circuit breaker gauge is the critical one. If it flips to 2 at 2am, you want to know about it before your users do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Grafana Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pre-built dashboard with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM latency p50/p95/p99 by model (spot the Sonnet vs Haiku difference immediately)&lt;/li&gt;
&lt;li&gt;Calls/sec by status (surface rate limit bursts)&lt;/li&gt;
&lt;li&gt;Circuit breaker state gauge (red when open, green when closed)&lt;/li&gt;
&lt;li&gt;Token consumption rate over time (cost forecasting)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole observability stack runs locally with one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.yml &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.observability.yml up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That brings up Jaeger (distributed traces), Prometheus (metrics), and Grafana (pre-configured dashboard at localhost:3000).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. LangSmith Tracing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the retrieval path specifically, LangSmith gives you per-query traces: what was retrieved, what the final prompt looked like, what tokens were consumed. When the golden eval catches a regression, LangSmith shows you which document type is failing and why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd do differently&lt;/strong&gt;: Ship observability on day one, not as an afterthought. When something breaks in production and you have no metrics, you're debugging blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost as a First-Class Metric
&lt;/h2&gt;

&lt;p&gt;Token costs compound fast with agentic workloads. Each ReAct iteration is an LLM call. 3 iterations times 16 parallel workers equals 48 LLM calls per batch. At scale, that adds up in ways that surprise you if you're not tracking it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CostTracker&lt;/code&gt; computes USD cost per request using &lt;code&gt;Decimal&lt;/code&gt; arithmetic against a model pricing table. This matters: float arithmetic accumulates rounding errors across thousands of requests. &lt;code&gt;Decimal&lt;/code&gt; doesn't.&lt;/p&gt;
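&lt;p&gt;A minimal sketch of Decimal-based cost computation. The prices below are illustrative placeholders, not real model pricing, and the function name is mine:&lt;/p&gt;

```python
from decimal import Decimal

# Per-million-token prices in USD; numbers are placeholders, not real pricing.
PRICING = {
    "claude-sonnet": {"input": Decimal("3.00"), "output": Decimal("15.00")},
    "claude-haiku": {"input": Decimal("0.25"), "output": Decimal("1.25")},
}

MILLION = Decimal(1_000_000)

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Exact USD cost of one request - no float rounding drift at volume."""
    p = PRICING[model]
    return (
        p["input"] * Decimal(input_tokens) / MILLION
        + p["output"] * Decimal(output_tokens) / MILLION
    )

cost = request_cost_usd("claude-haiku", input_tokens=4_000, output_tokens=500)
# 0.25 * 4000/1e6 + 1.25 * 500/1e6 = 0.001 + 0.000625 = 0.001625
```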

&lt;p&gt;&lt;strong&gt;Model A/B testing&lt;/strong&gt;: &lt;code&gt;ModelABTest&lt;/code&gt; uses SHA-256 hashing of &lt;code&gt;(user_id, experiment_id)&lt;/code&gt; for deterministic variant assignment. The same user always gets the same model - no session contamination from random assignment. Statistical significance is checked via two-sample z-test at n≥30 before drawing conclusions.&lt;/p&gt;
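&lt;p&gt;Deterministic assignment is a few lines with hashlib. The function name and variant labels below are mine; the hash-then-bucket approach is the one described:&lt;/p&gt;

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants=("A", "B")) -> str:
    """Deterministic variant assignment: same (user, experiment) -> same model."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

&lt;p&gt;Because the hash is a pure function of the inputs, the assignment is stable across processes and restarts - no session store needed.&lt;/p&gt;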

&lt;p&gt;&lt;strong&gt;Prompt versioning&lt;/strong&gt;: prompts are stored as semver files (&lt;code&gt;prompts/{category}/vX.Y.Z.txt&lt;/code&gt;), with the active version env-configurable. &lt;code&gt;PromptRegressionTester&lt;/code&gt; runs the golden eval suite against two prompt versions and flags any regression above 2%. A prompt change that improves accuracy but increases cost gets surfaced as a tradeoff - not automatically accepted.&lt;/p&gt;

&lt;p&gt;One concrete number: switching classification from Sonnet to Haiku (when Sonnet circuit-opens) saves approximately $0.003 per document. At 10,000 documents per month, that's $30/month from one inverted fallback chain. Small per-call, meaningful at volume.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Kubernetes Deploy (And Why It Matters for the Portfolio)
&lt;/h2&gt;

&lt;p&gt;DocExtract now deploys to Kubernetes via 11 Kustomize manifests: namespace, deployments for all three services, services, ingress (nginx, SSE buffering disabled), HPA (API scales 2-8 replicas at 70% CPU, worker scales 2-6), configmap, and secrets template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Base deploy&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-k&lt;/span&gt; deploy/k8s/

&lt;span class="c"&gt;# Production overlay (higher replicas, resource limits)&lt;/span&gt;
&lt;span class="nv"&gt;K8S_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production make k8s-apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The production overlay overrides replica counts and resource requests without duplicating manifests. That's the Kustomize pattern.&lt;/p&gt;

&lt;p&gt;The AWS Terraform config provisions RDS PostgreSQL 16 and ElastiCache Redis 7 (managed services, not containers on EC2). Alembic migrations run automatically on boot via a retry loop in user_data.sh - the worker waits up to 2 minutes for RDS to accept connections before starting.&lt;/p&gt;

&lt;p&gt;GHCR CI publishes three Docker images (api, worker, frontend) tagged with &lt;code&gt;latest&lt;/code&gt; and &lt;code&gt;${{ github.sha }}&lt;/code&gt; on every merge to main.&lt;/p&gt;




&lt;h2&gt;
  
  
  7 Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid search from the start.&lt;/strong&gt; Adding BM25 to a pure vector system after the fact is straightforward, but designing your retrieval interface to support both modes from day one (via &lt;code&gt;?mode=hybrid&lt;/code&gt;) means you never break existing callers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Golden eval before launch, not after.&lt;/strong&gt; Build your evaluation suite from real documents during development, not post-launch when you're debugging complaints. The cost is low; the signal is high.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit breakers are cheaper than incident response.&lt;/strong&gt; Shipping a circuit breaker takes a day. An LLM API outage that cascades into your whole pipeline taking down a client takes much longer to recover from - and costs trust.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability belongs in the infrastructure layer, not the application layer.&lt;/strong&gt; The circuit breaker state is a Prometheus gauge emitted from &lt;code&gt;emit_circuit_breaker_state()&lt;/code&gt;. The LLM call duration is emitted from a &lt;code&gt;trace_llm_call()&lt;/code&gt; context manager. Neither the API routes nor the extraction logic know about metrics - they just call the tracer. That separation means you can add new metrics without touching business logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Async workers are the right abstraction for long-running AI tasks.&lt;/strong&gt; Don't block API workers with 8-second document processing. The ARQ queue gives you retries, concurrency control, and a clean separation between "job accepted" and "job complete."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agentic retrieval outperforms static when query types vary.&lt;/strong&gt; If your users ask structured queries (exact IDs, clause references) and semantic queries (concepts, summaries) in the same system, a per-query retrieval agent consistently outperforms any single retrieval mode - at the cost of 2-3x latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track cost per LLM call from day one.&lt;/strong&gt; Once you add agentic workflows with multiple iterations, cost compounds fast. A &lt;code&gt;CostTracker&lt;/code&gt; built at day 1 costs a few hours; retrofitting it after the fact requires touching every LLM call site.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: github.com/ChunkyTortoise/docextract&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live demo&lt;/strong&gt;: docextract-frontend.onrender.com&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP server&lt;/strong&gt;: see &lt;code&gt;docs/mcp-integration.md&lt;/code&gt; in the repo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building a RAG pipeline and hit any of these problems, happy to discuss in the comments.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
    <item>
      <title>Available for Hire: AI Engineer with 11 Production Repos and 8,500+ Tests</title>
      <dc:creator>Cayman Roden</dc:creator>
      <pubDate>Tue, 10 Feb 2026 05:34:10 +0000</pubDate>
      <link>https://dev.to/chunkytortoise_57/available-for-hire-ai-engineer-with-11-production-repos-and-8500-tests-g5n</link>
      <guid>https://dev.to/chunkytortoise_57/available-for-hire-ai-engineer-with-11-production-repos-and-8500-tests-g5n</guid>
      <description>&lt;h3&gt;
  
  
  About Me
&lt;/h3&gt;

&lt;p&gt;I'm Cayman Roden, an AI engineer who builds production systems — not demos. Over the past year, I've shipped 11 repositories covering RAG pipelines, multi-agent orchestration, BI dashboards, chatbot platforms, and web scraping infrastructure. Every repo has CI, typed code, and comprehensive test coverage (8,500+ tests total, all green).&lt;/p&gt;

&lt;p&gt;My stack: Python, FastAPI, LangChain, Streamlit, PostgreSQL, Redis, Docker.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Build
&lt;/h3&gt;

&lt;p&gt;Here are the numbers from production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;89% LLM cost reduction&lt;/strong&gt; through L1/L2/L3 intelligent caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;88% cache hit rate&lt;/strong&gt; across retrieval and orchestration layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;200ms orchestration overhead&lt;/strong&gt; for multi-model AI routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4.3M tool dispatches/sec&lt;/strong&gt; throughput in agent systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I specialize in the gap between "it works in a notebook" and "it runs in production." That means proper error handling, caching, monitoring, rate limiting, and deployment infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Services &amp;amp; Pricing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fiverr Gigs&lt;/strong&gt; (project-based):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Price Range&lt;/th&gt;
&lt;th&gt;Turnaround&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV/Excel to Interactive Dashboard&lt;/td&gt;
&lt;td&gt;$50-$200&lt;/td&gt;
&lt;td&gt;2-5 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG Document Q&amp;amp;A System&lt;/td&gt;
&lt;td&gt;$100-$500&lt;/td&gt;
&lt;td&gt;3-7 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI Chatbot (Lead Qualification, Support, Internal)&lt;/td&gt;
&lt;td&gt;$200-$500&lt;/td&gt;
&lt;td&gt;5-10 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Gumroad Products&lt;/strong&gt; (self-serve):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DocQA Engine&lt;/td&gt;
&lt;td&gt;$49&lt;/td&gt;
&lt;td&gt;Production RAG with hybrid search and caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentForge&lt;/td&gt;
&lt;td&gt;$39&lt;/td&gt;
&lt;td&gt;Multi-agent orchestration framework&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scrape-and-Serve&lt;/td&gt;
&lt;td&gt;$29&lt;/td&gt;
&lt;td&gt;Web scraping pipeline with API output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Insight Engine&lt;/td&gt;
&lt;td&gt;$39&lt;/td&gt;
&lt;td&gt;BI toolkit for CSV/Excel data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How to Work With Me
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browse my portfolio&lt;/strong&gt;: &lt;a href="https://chunkytortoise.github.io" rel="noopener noreferrer"&gt;chunkytortoise.github.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;See the code&lt;/strong&gt;: &lt;a href="https://github.com/ChunkyTortoise" rel="noopener noreferrer"&gt;github.com/ChunkyTortoise&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hire me on Fiverr&lt;/strong&gt;: &lt;a href="https://www.fiverr.com/caymanroden" rel="noopener noreferrer"&gt;fiverr.com/caymanroden&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buy a tool on Gumroad&lt;/strong&gt;: &lt;a href="https://caymanroden.gumroad.com" rel="noopener noreferrer"&gt;caymanroden.gumroad.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect on LinkedIn&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/caymanroden" rel="noopener noreferrer"&gt;linkedin.com/in/caymanroden&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm US-based (Pacific time), respond within a few hours, and can start most projects within 24-48 hours.&lt;/p&gt;

&lt;p&gt;If you have a Python or AI project that needs to actually work in production, let's talk.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
