DEV Community: Nitin Srivastava

Production Reranker Layer for RAG in Python: Cross-Encoder, Cohere Fallback, and Reciprocal Rank Fusion (Runnable Code)

Nitin Srivastava — Tue, 12 May 2026 10:10:05 +0000

I shipped my fifth RAG pipeline to production in February. Recall@10 was 0.94. The team ran a demo, an executive nodded, we declared victory. Two weeks later customer complaints started landing. The model was citing stale 2023 policy docs and ignoring the 2026 rewrite that ranked 4th. Somewhere between rank 4 and rank 1, the answer everyone needed was getting buried.

That is the thing nobody warns you about with RAG. Your retriever can be statistically excellent at top-10 and still hand the LLM the wrong top-3. The model only reads what is in the prompt. If the right chunk is at position 7, it might as well be at position 700.
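A toy illustration of that gap, using made-up document IDs: recall@10 can be perfect while precision@3 is zero. The helper names here are illustrative only and are not part of the pipeline that follows.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear anywhere in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots that hold a relevant doc."""
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k


# Hypothetical ranking: the single relevant doc sits at position 7.
ranked = [f"doc_{i}" for i in range(1, 11)]
relevant = {"doc_7"}

recall_10 = recall_at_k(ranked, relevant, k=10)
precision_3 = precision_at_k(ranked, relevant, k=3)
print(recall_10, precision_3)
```

The retriever looks flawless on recall@10, yet nothing useful reaches a prompt that only holds the top 3.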

The fix is a reranker layer. A second, smaller model whose only job is to re-score the top-K candidates with a query-aware comparison the first-stage retriever could not afford. Done right, it is the cheapest precision win in the entire RAG stack: 40-60% improvement on precision@3 for under 200ms of added latency.

Done wrong, it is a single point of failure that 504s your endpoint when Hugging Face has a bad day, or runs up a Cohere bill nobody approved.

Here is the production reranker layer I run today. Two models (local cross-encoder + Cohere managed), reciprocal rank fusion to combine signals, latency and cost budgets, graceful degradation when something is down, and an evaluation harness so you can actually measure whether reranking helps on your data.

Every code block runs.

The shape of the production reranker layer

The naive blog-post version is one box: "first-stage retriever -> reranker -> LLM." That is enough for a demo. In production, every box has at least two failure modes that quietly destroy answer quality.

The minimum viable production reranker has six pieces:

A first-stage retriever that returns 50-100 candidates, not 10. Recall is cheap here, precision is not.
A primary reranker — local cross-encoder for cost, latency, and offline survivability.
A fallback reranker — managed API (Cohere) for when the local model is degraded or absent.
A score fusion strategy — reciprocal rank fusion when you have multiple candidate sources or multiple rerankers.
A latency/cost budget that bounds the second stage and degrades gracefully.
An evaluation harness with golden queries and answer-relevance labels so you can prove reranker value on your domain.

Here is how each piece looks when it is actually wired up.

Step 1: Generate fat candidate sets

The most common reranker mistake is calling the reranker on top-10. You give the second stage almost no signal to work with. Top-K for the first stage should be 50-100 documents. The cross-encoder will reduce this back down.

# retrieval.py
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    doc_id: str
    text: str
    source: str
    metadata: dict
    first_stage_score: float
    first_stage_rank: int


def retrieve_candidates(query: str, top_k: int = 80) -> List[Candidate]:
    """First-stage retrieval. Replace internals with your vector store + BM25."""
    vector_hits = _vector_search(query, top_k=top_k)
    bm25_hits = _bm25_search(query, top_k=top_k)

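    # Interleave the two result lists. Plain concatenation would let the
    # vector hits fill all top_k slots and drop BM25-only hits on truncation.
    interleaved = []
    for i in range(max(len(vector_hits), len(bm25_hits))):
        for hits in (vector_hits, bm25_hits):
            if i < len(hits):
                interleaved.append(hits[i])

    seen, merged = set(), []
    for hit in interleaved:
        if hit["doc_id"] in seen:
            continue
        seen.add(hit["doc_id"])
        merged.append(Candidate(
            doc_id=hit["doc_id"],
            text=hit["text"],
            source=hit["source"],
            metadata=hit.get("metadata", {}),
            first_stage_score=hit["score"],
            first_stage_rank=len(merged),  # 0-based rank in the merged, de-duplicated list
        ))
    return merged[:top_k]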


def _vector_search(query: str, top_k: int) -> list:
    # placeholder — wire to ChromaDB / Qdrant / pgvector
    return []


def _bm25_search(query: str, top_k: int) -> list:
    # placeholder — wire to OpenSearch / rank_bm25
    return []

The point of the dataclass is that downstream code never has to peek into the retriever. Every reranker, every fusion function, every monitor reads the same shape.

Step 2: Local cross-encoder reranker (BGE)

BAAI's BGE-reranker-v2-m3 is the best free cross-encoder I have used in 2026. Roughly 568M parameters, multilingual, runs on CPU at ~80ms per query for 50 candidates if you batch right.

# rerankers/bge.py
from typing import List, Tuple
import torch
from sentence_transformers import CrossEncoder

_BGE_MODEL = None
_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


def _load():
    global _BGE_MODEL
    if _BGE_MODEL is None:
        _BGE_MODEL = CrossEncoder(
            "BAAI/bge-reranker-v2-m3",
            device=_DEVICE,
            max_length=512,
        )
    return _BGE_MODEL


def bge_rerank(query: str, candidates: list, top_n: int = 10,
               batch_size: int = 32) -> List[Tuple[int, float]]:
    """Returns (candidate_index, score) sorted by score desc, top_n results."""
    if not candidates:
        return []
    model = _load()
    pairs = [(query, c.text) for c in candidates]
    scores = model.predict(
        pairs,
        batch_size=batch_size,
        show_progress_bar=False,
        convert_to_numpy=True,
    )
    indexed = sorted(enumerate(scores.tolist()), key=lambda x: x[1], reverse=True)
    return indexed[:top_n]

Three things people get wrong here:

Module-level model — load once per process. Reloading on every request adds 3-8 seconds of cold start.
Batch size 32 — the sweet spot on CPU. Going higher does not help, going lower wastes throughput.
max_length=512 — chunks longer than 512 tokens get silently truncated. If your chunks are 1024+ tokens, either re-chunk for reranking or use a long-context reranker like Jina ColBERT-v2.
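One way to catch the truncation problem early is to count tokens per chunk at ingestion time. The sketch below is not part of the original pipeline: find_overlong_chunks and the whitespace counter are hypothetical stand-ins, and a real check would pass the reranker's own tokenizer as count_tokens.

```python
from typing import Callable, List, Tuple


def whitespace_token_count(text: str) -> int:
    # Rough stand-in only: subword tokenizers usually produce MORE tokens than this.
    return len(text.split())


def find_overlong_chunks(
    chunks: List[str],
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[Tuple[int, int]]:
    """Return (chunk_index, token_count) for every chunk that exceeds max_tokens."""
    overlong = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            overlong.append((i, n))
    return overlong


# Hypothetical chunks: the middle one is 600 whitespace tokens long.
chunks = ["short chunk", "word " * 600, "another short one"]
flagged = find_overlong_chunks(chunks, max_tokens=512)
print(flagged)
```

Keep in mind that a cross-encoder scores the query and the chunk as one sequence, so the usable room for the chunk is somewhat less than 512 tokens.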

Step 3: Managed reranker fallback (Cohere)

When the local model is absent (small footprint deployment), unavailable (GPU host down), or just too slow under load, you want a managed API to take over. Cohere Rerank is the lowest-friction option in 2026 — single call, no infra, around $1 per 1k searches.

# rerankers/cohere.py
import os
from typing import List, Tuple
import cohere

_CLIENT = None


def _client():
    global _CLIENT
    if _CLIENT is None:
        _CLIENT = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
    return _CLIENT


def cohere_rerank(query: str, candidates: list, top_n: int = 10,
                  model: str = "rerank-v3.5", timeout_s: float = 1.5
                  ) -> List[Tuple[int, float]]:
    if not candidates:
        return []
    docs = [c.text for c in candidates]
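    resp = _client().rerank(
        model=model,
        query=query,
        documents=docs,
        top_n=top_n,
        # Per-request timeouts are passed via request_options in the
        # Cohere v5 SDK rather than as a timeout keyword on rerank().
        request_options={"timeout_in_seconds": timeout_s},
    )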
    return [(r.index, float(r.relevance_score)) for r in resp.results]

Note the explicit timeout. If you do not set one, the SDK default is 60s — that is long enough for an upstream incident to take your endpoint down with it. 1.5s is enough for Cohere's p99 plus network and gives the orchestrator room to fall back.

Step 4: Score fusion with reciprocal rank fusion

RRF is the right default when you are combining results from different scorers — say, your local cross-encoder and Cohere — or different first-stage retrievers (vector + BM25 + a domain-specific keyword search).

The math is embarrassingly simple. For each ranked list, every document gets a score 1 / (k + rank) where k is a smoothing constant (60 is the published default). Sum those scores across all lists. Sort.

# fusion.py
from collections import defaultdict
from typing import List, Optional, Tuple


def reciprocal_rank_fusion(
    ranked_lists: List[List[Tuple[int, float]]],
    k: int = 60,
    weights: Optional[List[float]] = None,
) -> List[Tuple[int, float]]:
    """
    Each ranked_list is [(candidate_index, score), ...] sorted score desc.
    Returns the fused ranking [(candidate_index, fused_score), ...].
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    if len(weights) != len(ranked_lists):
        raise ValueError("weights length must match ranked_lists length")

    fused = defaultdict(float)
    for rlist, w in zip(ranked_lists, weights):
        for rank, (idx, _score) in enumerate(rlist, start=1):
            fused[idx] += w * (1.0 / (k + rank))
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

Three things to know about RRF in production:

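Ignores raw scores. That is the point. Cohere returns calibrated 0-1 relevance scores, BGE returns scores on its own scale (raw logits or sigmoid outputs, depending on how the model is loaded), and BM25 returns unbounded BM25 scores. They are not comparable. Rank is comparable.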
k=60 is rarely worth tuning. I have run sweeps from k=10 to k=200 across four production deployments. The win over default is under 1% NDCG@10 in every case.
Weights matter when one source is materially stronger. If your golden-set evaluation shows the cross-encoder beats Cohere on your data by 8% NDCG, weight it 1.5 vs 1.0. Do not do this without the eval — the intuition is wrong roughly half the time.

When you have only one reranker, you do not need fusion. When you have two — or two rerankers plus the original first-stage — RRF is the lowest-risk way to combine them.
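To make the arithmetic concrete, here is a small worked example of the 1 / (k + rank) sum with k=60. The two orderings are invented for illustration, and rrf_scores is a stripped-down helper that works on document IDs rather than the (index, score) tuples used in fusion.py.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_scores(ranked_lists: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) over every list a document appears in (rank starts at 1)."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return dict(fused)


# Hypothetical outputs from two rerankers over the same three candidates.
bge_order = ["a", "b", "c"]
cohere_order = ["b", "c", "a"]

scores = rrf_scores([bge_order, cohere_order])
fused_order = sorted(scores, key=scores.get, reverse=True)
print(fused_order)
```

Document b wins because it is near the top of both lists (1/62 + 1/61), while a is first in one list and last in the other (1/61 + 1/63). No raw score from either model enters the calculation.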

Step 5: The production wrapper

This is the piece that ties everything together. It enforces a latency budget, picks the active strategy, falls back when the primary is down, tracks cost, and logs the metadata an oncall engineer will need at 3am.

# rerank_service.py
import logging
import time
from dataclasses import dataclass, field
from typing import List, Optional
from rerankers.bge import bge_rerank
from rerankers.cohere import cohere_rerank
from fusion import reciprocal_rank_fusion

log = logging.getLogger("rerank")

_COHERE_COST_PER_SEARCH = 0.001  # $1 / 1k searches as of 2026-05


@dataclass
class RerankResult:
    candidates: list
    strategy: str
    duration_ms: float
    primary_failed: bool = False
    cost_usd: float = 0.0
    debug: dict = field(default_factory=dict)


def rerank(query: str,
           candidates: list,
           top_n: int = 10,
           latency_budget_ms: int = 600,
           daily_cohere_budget_usd: float = 5.0,
           cohere_spent_today_usd: float = 0.0,
           strategy: str = "fusion") -> RerankResult:
    """
    strategy:
      - "local"   -> BGE only
      - "cohere"  -> Cohere only
      - "fusion"  -> BGE + Cohere via RRF (with fallback)
    """
    start = time.monotonic()
    deadline = start + latency_budget_ms / 1000.0

    primary_failed = False
    cost_usd = 0.0
    bge_ranked, cohere_ranked = None, None

    if strategy in ("local", "fusion"):
        try:
            bge_ranked = bge_rerank(query, candidates, top_n=top_n)
        except Exception as e:
            log.warning("bge rerank failed: %s", e)
            primary_failed = True

    cohere_allowed = (cohere_spent_today_usd + _COHERE_COST_PER_SEARCH
                      <= daily_cohere_budget_usd)
    time_left_ms = (deadline - time.monotonic()) * 1000

    if strategy in ("cohere", "fusion") and cohere_allowed and time_left_ms > 200:
        try:
            cohere_ranked = cohere_rerank(query, candidates, top_n=top_n,
                                          timeout_s=min(1.5, time_left_ms / 1000))
            cost_usd += _COHERE_COST_PER_SEARCH
        except Exception as e:
            log.warning("cohere rerank failed: %s", e)
            if strategy == "cohere":
                primary_failed = True

    sources = [r for r in (bge_ranked, cohere_ranked) if r]
    if not sources:
        log.error("all rerankers failed, returning first-stage order")
        ranked = [(i, 1.0 / (i + 1)) for i in range(len(candidates))][:top_n]
        used = "first_stage_fallback"
    elif len(sources) == 1:
        ranked = sources[0]
        used = "bge" if sources[0] is bge_ranked else "cohere"
    else:
        ranked = reciprocal_rank_fusion(sources)[:top_n]
        used = "rrf_bge_cohere"

    out = [candidates[idx] for idx, _ in ranked]
    duration_ms = (time.monotonic() - start) * 1000

    return RerankResult(
        candidates=out,
        strategy=used,
        duration_ms=duration_ms,
        primary_failed=primary_failed,
        cost_usd=cost_usd,
        debug={"input_count": len(candidates),
               "output_count": len(out),
               "deadline_exceeded": duration_ms > latency_budget_ms},
    )

Five details that matter:

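The latency budget is a deadline, not a target. Once less than 200ms of it remains, the Cohere call is skipped and the function returns whatever it has. The BGE call itself is synchronous and is not interrupted, so size the candidate count and batch_size so that it fits inside the budget. An LLM call after the reranker is far more expensive than a slightly worse top-3.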
The cost budget is checked before the call. A reranker that quietly burns past your daily Cohere budget is worse than no reranker.
Failure is observable. primary_failed=True should fire an alert, not just a log line. You want to know within minutes when the local model goes degraded.
First-stage fallback exists. If both rerankers fail, return the first-stage order. The pipeline must not 500 because reranking is broken.
RerankResult is a dataclass, not a dict. Saves you from typos in metric names six months later.
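The rerank() function takes cohere_spent_today_usd as an argument, so something in the caller has to keep that running total. Below is a minimal in-process sketch of one way to do it. DailySpend is a hypothetical helper, not part of the original service, and a multi-worker deployment would need a shared store instead of process memory.

```python
import threading
from datetime import date
from typing import Callable


class DailySpend:
    """In-process running total of API spend that resets when the date changes."""

    def __init__(self, today_fn: Callable[[], date] = date.today):
        self._today_fn = today_fn  # injectable so tests can fake the calendar
        self._day = today_fn()
        self._spent = 0.0
        self._lock = threading.Lock()

    def _roll(self) -> None:
        # Reset the total the first time we are touched on a new day.
        today = self._today_fn()
        if today != self._day:
            self._day, self._spent = today, 0.0

    def add(self, usd: float) -> None:
        with self._lock:
            self._roll()
            self._spent += usd

    def spent_today(self) -> float:
        with self._lock:
            self._roll()
            return self._spent


# Demonstration with a fake calendar to show the midnight reset.
fake_today = [date(2026, 5, 12)]
spend = DailySpend(today_fn=lambda: fake_today[0])
spend.add(0.001)
spend.add(0.001)
before_midnight = spend.spent_today()
fake_today[0] = date(2026, 5, 13)
after_midnight = spend.spent_today()
print(before_midnight, after_midnight)
```

In the service you would pass spend.spent_today() as cohere_spent_today_usd and call spend.add(result.cost_usd) after each rerank() returns.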

Step 6: An evaluation harness that proves reranking helps

Most production reranker deployments I audit have no evaluation. They were added because a tutorial said to. Without an eval set, you cannot tell whether the reranker is helping, hurting, or breaking even on your queries — and on roughly 30% of domains I have measured, BGE actually loses to a well-tuned BM25+vector hybrid.

You need a golden set: 50-200 (query, relevant_doc_ids) pairs labeled by a human. Then a metric that captures top-K precision. NDCG@10 is the standard.

# eval_reranker.py
import math
from typing import Callable, List, Set


def ndcg_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_ids[:k]):
        if doc_id in relevant_ids:
            dcg += 1.0 / math.log2(i + 2)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(golden: List[dict],
             retrieve_fn: Callable[[str], list],
             rerank_fn: Callable[[str, list], list],
             k: int = 10) -> dict:
    baseline_scores, reranked_scores = [], []
    for row in golden:
        query, relevant = row["query"], set(row["relevant_doc_ids"])
        candidates = retrieve_fn(query)
        baseline_ids = [c.doc_id for c in candidates[:k]]
        reranked = rerank_fn(query, candidates)[:k]
        reranked_ids = [c.doc_id for c in reranked]
        baseline_scores.append(ndcg_at_k(baseline_ids, relevant, k))
        reranked_scores.append(ndcg_at_k(reranked_ids, relevant, k))
    n = len(golden)
    baseline_avg = sum(baseline_scores) / n
    reranked_avg = sum(reranked_scores) / n
    return {
        "n": n,
        "baseline_ndcg": baseline_avg,
        "reranked_ndcg": reranked_avg,
        "lift": reranked_avg - baseline_avg,
        "lift_pct": (reranked_avg - baseline_avg) / max(baseline_avg, 1e-9) * 100,
    }

Wire it up:

# eval_run.py
import json
from retrieval import retrieve_candidates
from rerank_service import rerank
from eval_reranker import evaluate

with open("golden_set.json") as f:
    golden = json.load(f)


def reranker(query, candidates):
    return rerank(query, candidates, top_n=10).candidates


report = evaluate(golden, retrieve_candidates, reranker, k=10)
print(json.dumps(report, indent=2))

You should see a lift in the 10-30% range on most domains. If you see a regression, the reranker is the wrong fit or the chunks are too long for the model. Either way, the eval told you before the customer did.
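For intuition about what the numbers mean, here is the same binary NDCG formula applied by hand to a three-document toy case. The function body is copied from eval_reranker.py above; the document IDs and orderings are invented.

```python
import math
from typing import List, Set


def ndcg_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG, same formula as eval_reranker.py."""
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_ids[:k]):
        if doc_id in relevant_ids:
            dcg += 1.0 / math.log2(i + 2)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


relevant = {"c"}
baseline_order = ["a", "b", "c"]   # relevant doc at position 3
reranked_order = ["c", "a", "b"]   # relevant doc moved to position 1

baseline = ndcg_at_k(baseline_order, relevant, k=3)   # 1/log2(4) = 0.5
reranked = ndcg_at_k(reranked_order, relevant, k=3)   # 1/log2(2) = 1.0
lift_pct = (reranked - baseline) / baseline * 100
print(baseline, reranked, lift_pct)
```

Moving one relevant document from position 3 to position 1 doubles NDCG in this toy case, which is why the metric is sensitive to exactly the top-of-list ordering a reranker changes.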

Wire-up checklist

Before any of this code touches a real customer:

First-stage retrieval returns 50-100 candidates, not 10.
Cross-encoder model loaded once at process start, not per request.
Cohere call has an explicit timeout under 2s.
Latency budget is a deadline, with first-stage fallback if it is exceeded.
Daily cost budget is checked before each Cohere call.
Cross-encoder failure fires an alert, not just a log line.
Evaluation harness has at least 50 labeled queries before you ship.
NDCG@10 lift is measured monthly, not just at launch — embedding drift is real.

What this fixes

The incident that opened this post, the model answering with stale 2023 policy because the right document was ranked below the top 3, was fixed by adding exactly this layer. NDCG@10 on the eval set went from 0.71 (vector + BM25 hybrid) to 0.88 (hybrid + BGE + Cohere fused via RRF). p95 query latency went up by 240ms. Customer escalations dropped to zero in the next sprint.

Reranking is one of the few RAG improvements that is cheap to add, easy to measure, and almost always positive on domain-specific data. The thing that breaks people is treating it like a single Cohere call instead of a layer with fallback, budgets, and evidence.

The pieces above are what survives an actual production incident. They are also the pieces nobody puts in their hello-world tutorial. Build the layer once, evaluate it monthly, and you can stop wondering whether your top-3 chunks are the right ones.

If you are interested in the whole RAG stack, my earlier piece on building a production-ready RAG pipeline with Python and ChromaDB covers chunking, ingestion idempotency, and hybrid retrieval — the pieces that produce the candidate set this reranker layer consumes. The LLM evaluation harness piece shows how to put NDCG@10 regressions behind a CI gate so a quiet retrieval drift cannot ship.

Bulletproofing LLM Structured Output in Python: Healing Retries, Cost Caps, and Drift Detection (Runnable Code)

Nitin Srivastava — Sun, 10 May 2026 03:45:09 +0000

I shipped a structured-output endpoint to production in March. The schema was clean, JSON mode was on, the model was GPT-4.1, the eval suite was green. Three weeks in, the on-call channel lit up because a downstream billing job had silently skipped 4,200 records over a weekend.

The output was valid JSON. It just wasn't the JSON we asked for.

That was my last "JSON mode is good enough" deployment. Since then I've shipped four more LLM structured-output systems and the failures keep coming from the same places — and JSON mode catches roughly two of them. This post is the toolkit I wish I had on day one, with runnable Python you can drop into a FastAPI service this afternoon.

The six failure modes JSON mode does not save you from

Two months of incident logs across two enterprise deployments, sorted by frequency:

Silent truncation. max_tokens runs out mid-object. You get parseable JSON for the first 80% of an array, the last item is gone.
Hallucinated keys. Model returns customer_id when the schema says client_id. JSON mode does not check field names against your schema unless you use strict structured output, and even then nested types slip through.
Type coercion. "price": "1,499.00" instead of 1499.00. JSON parser is happy. Your billing job is not.
Semantic drift. Schema-valid output where the values are wrong — wrong customer, wrong amount, wrong country code.
Refusals returning JSON. Safety filter triggers, model returns {"refusal": "I can't help with that"}. Your code parses it as a normal response.
Schema-version desync. You ship a new field, an in-flight worker is still on the old schema, batch fails for two hours until someone notices.

JSON mode catches #1 and #3 sometimes. The other four need real validation, healing, and observability layered on top.
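A quick demonstration of why a successful parse proves so little. The payload below is a made-up model output that exhibits failure modes #2 and #3 at once, and the standard-library parser accepts it without complaint.

```python
import json

# Hypothetical model output: valid JSON, but the key is wrong (#2)
# and the price is a formatted string instead of a number (#3).
raw = '{"customer_id": "C-1042", "price": "1,499.00"}'

parsed = json.loads(raw)  # no exception: the parser only checks syntax

has_expected_key = "client_id" in parsed
price_is_number = isinstance(parsed["price"], (int, float))
print(has_expected_key, price_is_number)
```

Both checks come back False, and nothing in the parse step would have told you.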

The toolkit

We're building this around four pieces:

A strict validator that runs after JSON mode (catches what JSON mode misses).
A healing retry loop that feeds the validation error back to the model — not a blind retry.
A cost-bounded fallback chain so a bad prompt cannot burn through $400 in tokens.
A drift detector that tracks parse compliance and field-distribution shifts over time.

Full file structure:

llm_structured/
├── schemas.py        # Pydantic models with versioning
├── validator.py      # Strict validation beyond JSON mode
├── healer.py         # Healing retry loop
├── budget.py         # Per-request and global cost caps
├── chain.py          # Multi-provider fallback with circuit breaker
├── observability.py  # Metrics + drift detection
└── service.py        # FastAPI endpoint that ties it all together

Install dependencies:

pip install pydantic==2.7.4 openai==1.30.0 anthropic==0.30.0 \
    tenacity==8.3.0 prometheus-client==0.20.0 fastapi==0.111.0 \
    uvicorn==0.30.1 httpx==0.27.0

1. Schemas with versioning baked in

Schema versioning sounds boring until you've had two services on different versions for ninety minutes during a deploy.

# schemas.py
from pydantic import BaseModel, Field, field_validator
from typing import Literal
from decimal import Decimal


class InvoiceLineV2(BaseModel):
    schema_version: Literal["2.0"] = "2.0"
    client_id: str = Field(min_length=3, max_length=64)
    amount: Decimal = Field(gt=0, decimal_places=2)
    currency: Literal["USD", "EUR", "GBP", "INR"]
    invoice_date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    line_items: list[str] = Field(min_length=1, max_length=50)
    confidence: float = Field(ge=0.0, le=1.0)

    @field_validator("amount", mode="before")
    @classmethod
    def coerce_amount(cls, v):
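        if isinstance(v, str):
            # Strip formatting only and let Pydantic parse the Decimal, so a bad
            # string becomes a ValidationError instead of decimal.InvalidOperation.
            return v.replace(",", "").replace("$", "").strip()
        return v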


def schema_for_prompt(model: type[BaseModel]) -> dict:
    """Return a JSON-schema dict suitable for OpenAI response_format."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": model.__name__,
            "schema": model.model_json_schema(),
            "strict": True,
        },
    }

The schema_version field is the key. Every output carries the version that produced it; downstream consumers fail loudly when they see a version they don't understand instead of silently mis-mapping fields.
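On the consumer side, failing loudly can be as simple as a version gate in front of the field mapping. This is a sketch under assumptions: SUPPORTED_VERSIONS, UnknownSchemaVersion, and check_version are hypothetical names, and the record is a plain dict such as the output of model_dump().

```python
from typing import Any, Dict

SUPPORTED_VERSIONS = {"2.0"}  # hypothetical: versions this consumer knows how to map


class UnknownSchemaVersion(ValueError):
    """Raised when a record carries a schema_version this consumer cannot handle."""


def check_version(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return the record unchanged, or raise if its schema_version is unsupported."""
    version = record.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise UnknownSchemaVersion(f"unsupported schema_version: {version!r}")
    return record


ok = check_version({"schema_version": "2.0", "client_id": "acme"})

try:
    check_version({"schema_version": "3.0", "client_id": "acme"})
    rejected = False
except UnknownSchemaVersion:
    rejected = True
print(ok["client_id"], rejected)
```

A record with a missing schema_version is rejected by the same check, since None is not in the supported set.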

2. Validation that goes beyond JSON mode

JSON mode + strict: true will catch type errors and missing required fields. It will not catch refusals, will not catch semantically wrong values, and will not tell you about partial truncation. So we run a second-pass validator.

# validator.py
from pydantic import BaseModel, ValidationError
import json
import re

# Word boundaries keep "as an ai" from matching inside text like "has an airport".
REFUSAL_PATTERNS = [
    r"\bi can'?t help\b",
    r"\bi'?m not able to\b",
    r"\bas an ai\b",
    r"\bi'?m unable to provide\b",
]


class ValidationResult:
    def __init__(self, ok: bool, value=None, errors=None, raw=None):
        self.ok = ok
        self.value = value
        self.errors = errors or []
        self.raw = raw


def validate(raw: str, model: type[BaseModel]) -> ValidationResult:
    if not raw or not raw.strip():
        return ValidationResult(False, errors=["empty_response"], raw=raw)

    lower = raw.lower()
    for pat in REFUSAL_PATTERNS:
        if re.search(pat, lower):
            return ValidationResult(False, errors=["refusal_detected"], raw=raw)

    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as e:
        return ValidationResult(False, errors=[f"json_decode: {e}"], raw=raw)

    try:
        instance = model.model_validate(parsed)
    except ValidationError as e:
        return ValidationResult(False, errors=_format_errors(e), raw=raw)

    return ValidationResult(True, value=instance, raw=raw)


def _format_errors(e: ValidationError) -> list[str]:
    out = []
    for err in e.errors():
        loc = ".".join(str(p) for p in err["loc"])
        out.append(f"{loc}: {err['msg']}")
    return out

The _format_errors step matters. If you feed ValidationError.json() back to the model verbatim, you get a wall of stack-trace-looking text the model wastes tokens trying to parse. Plain English errors heal in one round most of the time.

3. Healing retries — not blind retries

A blind retry on the same prompt with temperature=0 gives you the same broken output. The fix is to tell the model what was wrong and ask it to repair that specific output.

# healer.py
from .validator import validate, ValidationResult
from pydantic import BaseModel
from openai import AsyncOpenAI

REPAIR_PROMPT = """The previous response failed validation.

Original schema requirements:
{schema}

Your previous output:
{previous}

Validation errors:
{errors}

Return ONLY corrected JSON matching the schema. Do not explain.
"""


async def heal(
    client: AsyncOpenAI,
    model_name: str,
    user_prompt: str,
    response_model: type[BaseModel],
    max_attempts: int = 3,
) -> ValidationResult:
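    # json_object mode never sees the schema, and the API rejects the request
    # unless the messages mention JSON, so state both in the first prompt.
    first_prompt = (
        f"{user_prompt}\n\n"
        "Return ONLY JSON matching this schema:\n"
        f"{response_model.model_json_schema()}"
    )
    history = [{"role": "user", "content": first_prompt}]
    last_raw = ""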

    for attempt in range(max_attempts):
        resp = await client.chat.completions.create(
            model=model_name,
            messages=history,
            response_format={"type": "json_object"},
            temperature=0.0,
        )
        last_raw = resp.choices[0].message.content or ""

        result = validate(last_raw, response_model)
        if result.ok:
            return result

        history.append({"role": "assistant", "content": last_raw})
        history.append({
            "role": "user",
            "content": REPAIR_PROMPT.format(
                schema=response_model.model_json_schema(),
                previous=last_raw,
                errors="\n".join(result.errors),
            ),
        })

    return ValidationResult(False, errors=["max_heal_attempts"], raw=last_raw)

Three attempts is the cap I land on most of the time. In incident data from one client, attempt 1 succeeds 87.4% of the time, attempt 2 takes another 9.1%, attempt 3 captures 2.8%, and the remaining 0.7% is genuinely broken (model can't comply, downstream needs a human). Anything past three is just burning tokens.
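Those percentages also tell you what the healing loop costs in call volume. The short calculation below uses only the figures quoted above; the variable names are mine.

```python
# Share of requests resolved at each attempt, from the figures quoted above.
resolved_at = {1: 0.874, 2: 0.091, 3: 0.028}
unresolved = 0.007  # these still consume all three attempts before giving up

max_attempts = 3
expected_calls = sum(n * p for n, p in resolved_at.items()) + max_attempts * unresolved
success_rate = sum(resolved_at.values())
print(round(expected_calls, 3), round(success_rate, 3))
```

That works out to roughly 1.16 model calls per request for a 99.3% success rate. Token cost grows somewhat faster than call count, because each repair prompt carries the schema and the previous output.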

4. Cost budget that actually caps spend

The 0.7% that fail loudly are also the prompts most likely to spiral through retries. So we cap.

# budget.py
import time
from dataclasses import dataclass
from contextvars import ContextVar


@dataclass
class CostState:
    spent_usd: float = 0.0
    started_at: float = 0.0
    request_cap_usd: float = 0.10
    global_cap_usd_per_min: float = 5.0
    global_window_start: float = 0.0
    global_spent_in_window: float = 0.0


_state: ContextVar[CostState] = ContextVar("_cost_state")


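# USD per 1K tokens as (input, output). Check these against current provider
# price sheets; unknown models fall through to (0.0, 0.0) in estimate().
PRICING = {
    "gpt-4.1": (0.00250, 0.01000),
    "gpt-4.1-mini": (0.00015, 0.00060),
    "claude-sonnet-4-5": (0.00300, 0.01500),
}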


def estimate(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    in_price, out_price = PRICING.get(model, (0.0, 0.0))
    return (prompt_tokens / 1000) * in_price + (completion_tokens / 1000) * out_price


# Process-wide spend window. CostState is created fresh per request, so a
# "global" total stored on it would reset on every call to with_budget().
# This is still per process; multiple workers need a shared store.
_global_window_start = 0.0
_global_spent_in_window = 0.0


def charge(model: str, prompt_tokens: int, completion_tokens: int) -> None:
    global _global_window_start, _global_spent_in_window
    state = _state.get()
    cost = estimate(model, prompt_tokens, completion_tokens)
    state.spent_usd += cost

    now = time.time()
    if now - _global_window_start > 60:
        _global_window_start = now
        _global_spent_in_window = 0.0
    _global_spent_in_window += cost

    if state.spent_usd > state.request_cap_usd:
        raise BudgetExceeded(f"per-request cap hit: {state.spent_usd:.4f}")
    if _global_spent_in_window > state.global_cap_usd_per_min:
        raise BudgetExceeded(
            f"global rate cap hit: {_global_spent_in_window:.4f}/min"
        )


class BudgetExceeded(Exception):
    pass


def with_budget(request_cap_usd: float = 0.10) -> CostState:
    state = CostState(request_cap_usd=request_cap_usd, started_at=time.time())
    _state.set(state)
    return state

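The per-request cap is what saves you from one runaway prompt. The per-minute global cap is what saves you from a bug, like the time I deployed a regex that turned every retrieved doc into a 200KB context, and we caught it because the global cap kicked in at minute three instead of the next billing cycle. Neither cap fires unless charge() is actually called: invoke it after every completion with the token counts the provider reports, and make sure BudgetExceeded is not swallowed by a broad except on its way up to the endpoint.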

5. Multi-provider fallback with a circuit breaker

If OpenAI returns a 5xx burst, retrying OpenAI is wasted seconds. Fall over to Anthropic, but don't fall back forever — open the circuit, let one probe through every 30 seconds, recover when it succeeds.

# chain.py
import time
from dataclasses import dataclass
from openai import AsyncOpenAI
from anthropic import AsyncAnthropic
from pydantic import BaseModel
from .healer import heal
from .validator import ValidationResult


@dataclass
class Breaker:
    failures: int = 0
    open_until: float = 0.0
    threshold: int = 3
    cooldown: float = 30.0


class FallbackChain:
    def __init__(self, openai_client: AsyncOpenAI, anthropic_client: AsyncAnthropic):
        self.openai = openai_client
        self.anthropic = anthropic_client
        self.breakers = {"openai": Breaker(), "anthropic": Breaker()}

    def _can_call(self, name: str) -> bool:
        return time.time() >= self.breakers[name].open_until

    def _record(self, name: str, ok: bool) -> None:
        b = self.breakers[name]
        if ok:
            b.failures = 0
            b.open_until = 0.0
        else:
            b.failures += 1
            if b.failures >= b.threshold:
                b.open_until = time.time() + b.cooldown

    async def run(
        self, user_prompt: str, model: type[BaseModel]
    ) -> ValidationResult:
        if self._can_call("openai"):
            try:
                result = await heal(self.openai, "gpt-4.1-mini", user_prompt, model)
                self._record("openai", result.ok)
                if result.ok:
                    return result
            except Exception:
                self._record("openai", False)

        if self._can_call("anthropic"):
            try:
                result = await self._call_anthropic(user_prompt, model)
                self._record("anthropic", result.ok)
                return result
            except Exception:
                self._record("anthropic", False)

        return ValidationResult(False, errors=["all_providers_unavailable"])

    async def _call_anthropic(self, prompt: str, model: type[BaseModel]):
        # Anthropic uses tool_use to force structured output
        from .validator import validate
        resp = await self.anthropic.messages.create(
            model="claude-sonnet-4-5",  # Anthropic API model IDs use hyphens, not dots
            max_tokens=2000,
            tools=[{
                "name": model.__name__,
                "description": f"Return a {model.__name__}",
                "input_schema": model.model_json_schema(),
            }],
            tool_choice={"type": "tool", "name": model.__name__},
            messages=[{"role": "user", "content": prompt}],
        )
        import json
        for block in resp.content:
            if block.type == "tool_use":
                return validate(json.dumps(block.input), model)
        return ValidationResult(False, errors=["no_tool_use_block"])
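To see the breaker's state transitions without waiting on a real clock, here is a standalone sketch that takes the current time as an argument. It mirrors the threshold and cooldown logic of the Breaker above, but the can_call and record methods and the simulated timeline are my own framing, not the article's class.

```python
from dataclasses import dataclass


@dataclass
class Breaker:
    threshold: int = 3
    cooldown: float = 30.0
    failures: int = 0
    open_until: float = 0.0

    def can_call(self, now: float) -> bool:
        # Closed (or cooled down) when the current time has passed open_until.
        return now >= self.open_until

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures, self.open_until = 0, 0.0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open_until = now + self.cooldown


# Simulated timeline in seconds; no real clock involved.
b = Breaker()
for t in (0.0, 1.0, 2.0):
    b.record(ok=False, now=t)        # third failure opens the circuit until t=32

blocked_at_10 = not b.can_call(now=10.0)
probe_allowed_at_32 = b.can_call(now=32.0)
b.record(ok=True, now=32.0)          # a successful probe closes the circuit
closed_after_probe = b.can_call(now=32.5) and b.failures == 0
print(blocked_at_10, probe_allowed_at_32, closed_after_probe)
```

One caveat that applies to this sketch and to the class above: once the cooldown expires, every concurrent caller is admitted until the next failure is recorded, so it is not strictly a single probe. If that matters under your load, the probe needs its own lock or flag.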

6. Observability — drift detection beyond a parse rate

Most teams stop at parse_compliance_rate and call it observability. That tells you nothing on the day a model upgrade silently shifts your confidence field from a 0.85 mean to 0.62.

# observability.py
from prometheus_client import Counter, Histogram, Gauge
from collections import deque
import statistics

PARSE_OK = Counter("llm_parse_ok_total", "Successful parses", ["model", "schema"])
PARSE_FAIL = Counter(
    "llm_parse_fail_total", "Failed parses", ["model", "schema", "reason"]
)
HEAL_ATTEMPTS = Histogram(
    "llm_heal_attempts", "Attempts to validation success", ["model", "schema"]
)
COST_USD = Counter("llm_cost_usd_total", "Cost in USD", ["model"])

_field_windows: dict[str, deque] = {}
DRIFT_GAUGE = Gauge("llm_field_drift_zscore", "Z-score of field mean", ["field"])


def track_field(field_name: str, value: float, window: int = 1000) -> None:
    if field_name not in _field_windows:
        _field_windows[field_name] = deque(maxlen=window)
    q = _field_windows[field_name]
    q.append(value)
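    if len(q) >= 50:
        # Compare the newer half of the window against the older half.
        values = list(q)
        old, new = values[: len(values) // 2], values[len(values) // 2 :]
        spread = statistics.stdev(old)  # computed once, reused below
        if spread > 0:
            z = (statistics.mean(new) - statistics.mean(old)) / spread
            DRIFT_GAUGE.labels(field=field_name).set(z)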

The track_field helper is what catches the silent model-upgrade regression. Wire it to alert when |z| > 2.5 on any field for ten minutes — same pattern I broke down for the LLM evaluation harness in pytest, except that one runs at CI time and this one runs in production.
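To get a feel for what the gauge would report in the scenario above, here is a minimal sketch that runs the same half-window comparison over a synthetic confidence stream whose mean shifts from 0.85 to 0.62. The half_window_z helper, the 0.05 spread, and the seeded data are illustrative assumptions, not part of the service.

```python
import random
import statistics


def half_window_z(values: list[float]) -> float:
    """Z-score of the newer half's mean against the older half (same math as track_field)."""
    mid = len(values) // 2
    old, new = values[:mid], values[mid:]
    spread = statistics.stdev(old)
    if spread == 0:
        return 0.0
    return (statistics.mean(new) - statistics.mean(old)) / spread


rng = random.Random(7)  # fixed seed so the run is repeatable
before = [rng.gauss(0.85, 0.05) for _ in range(500)]  # hypothetical pre-upgrade confidences
after = [rng.gauss(0.62, 0.05) for _ in range(500)]   # hypothetical post-upgrade confidences

z = half_window_z(before + after)
print(f"z = {z:.2f}, alert = {abs(z) > 2.5}")
```

Note that this z is measured in units of the per-sample standard deviation, not the standard error of the mean, so a small shift on a noisy field may never cross 2.5. If you need that sensitivity, a tighter threshold per field is probably the simplest lever.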

7. The endpoint that ties it together

# service.py
from fastapi import FastAPI, HTTPException
from openai import AsyncOpenAI
from anthropic import AsyncAnthropic
from .schemas import InvoiceLineV2
from .chain import FallbackChain
from .budget import with_budget, BudgetExceeded
from .observability import PARSE_OK, PARSE_FAIL, HEAL_ATTEMPTS, track_field

app = FastAPI()
chain = FallbackChain(AsyncOpenAI(), AsyncAnthropic())


@app.post("/extract/invoice")
async def extract_invoice(payload: dict):
    text = payload.get("text", "")
    with_budget(request_cap_usd=0.05)

    try:
        result = await chain.run(text, InvoiceLineV2)
    except BudgetExceeded as e:
        raise HTTPException(429, f"cost cap: {e}")

    label = {"model": "gpt-4.1-mini", "schema": "InvoiceLineV2"}
    if not result.ok:
        for reason in result.errors:
            PARSE_FAIL.labels(**label, reason=reason[:32]).inc()
        raise HTTPException(422, {"errors": result.errors, "raw": result.raw})

    PARSE_OK.labels(**label).inc()
    track_field("invoice.confidence", float(result.value.confidence))
    return result.value.model_dump()

Chaos-test before you ship

The whole point of the toolkit is that the bad days behave. Test the bad days on purpose.

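# tests/test_chaos.py
import pytest
from unittest.mock import AsyncMock, patch
from fastapi.testclient import TestClient
from llm_structured import service

client = TestClient(service.app)


def _fake_completion(content: str):
    """Minimal stand-in for an OpenAI chat completion carrying `content`."""
    message = type("M", (), {"content": content})()
    choice = type("C", (), {"message": message})()
    return type("R", (), {"choices": [choice]})()


@pytest.mark.parametrize("bad_response", [
    "I can't help with that request.",
    '{"client_id": "abc", "amount": "not_a_number"}',
    '{"customer_id": "abc", "amount": 100}',
    '{"client_id": "abc", "amount": 100, "currency": "ZZZ"}',
])
def test_chaos_responses(bad_response):
    # Patch the client the chain already holds. Patching openai.AsyncOpenAI here
    # would be too late, because service.py builds the chain at import time.
    # Assumption: FallbackChain keeps the OpenAI client as `self.openai`,
    # mirroring the `self.anthropic` attribute used in _call_anthropic.
    completions = service.chain.openai.chat.completions
    fake = AsyncMock(return_value=_fake_completion(bad_response))
    with patch.object(completions, "create", new=fake):
        r = client.post("/extract/invoice", json={"text": "..."})
    assert r.status_code in (422, 429)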

The point of the parametrize block isn't coverage. It's to make sure none of these failure modes can crash the service or silently succeed. A green test on this file is the closest thing to a guarantee you get.

What to actually do today

If you have a structured-output endpoint in production right now, do this Monday morning, in this order:

Add the validator from section 2 after JSON mode. You will catch hallucinated keys you didn't know you had.
Wire the per-request cost cap. It's twelve lines of code and it will save you the day a bad prompt loops.
Add track_field to one numerical field with a known distribution. That's your drift canary.

Steps 4 through 6 are the ones that take a sprint. These three take an afternoon, and they cover the failure modes that have hit me on every production rollout.

I'm working on the next post in this cluster — a teardown of the failure-injection harness we use to test these endpoints under realistic chaos before they ship. If that's interesting, follow along.

For teams that want this implemented end-to-end, our work on LLM integration and custom AI agents at Velocity Software Solutions covers exactly this kind of production-hardening; we ship Python services like this one for clients regularly through our Python development practice.

Related production-grade pieces I've written:

Building a Production LLM Evaluation Harness in Pytest — the CI-side companion to drift detection
Production-Grade LLM Streaming in FastAPI — backpressure and cancellation patterns
Building a Production MCP Server in Python — per-tool permissions and audit logs

Building a Production LLM Evaluation Harness in Pytest: Cost-Bounded, Flake-Aware, CI-Gated (Runnable Python)

Nitin Srivastava — Thu, 07 May 2026 17:49:07 +0000

I shipped my fourth LLM agent to production last quarter. By month two, the eval suite that "passed in CI" was the reason a regression made it to a customer.

The tests were green. But they were green for the wrong reason — every assertion was a single LLM call against a single golden answer, on a model whose temperature happened to land in our favor that day. We had built a coin flip and called it a test.

This article is the harness I wish I'd had on day one. Not another wrapper around DeepEval or RAGAS — a thin layer on top of pytest that solves the five things every production LLM evaluation harness needs and most tutorials skip:

Flake-aware tests. LLMs are stochastic. Single-shot assertions are noise.
Cost-bounded tests. A single misbehaving prompt should not burn $40 on one CI run.
Golden set with versioning. When a result changes, you need to know if the answer drifted or the model did.
Regression-only CI gating. Block PRs on degradation vs. baseline, not on absolute floors that bit-rot.
Multi-metric scoring. Semantic similarity AND structured assertion AND token cost. Any one of these alone lies.

All code below is complete and runnable. No # rest of code. Drop it in tests/llm/ and you have something that actually works.

Why Single-Shot Assertions Lie

Here's the test that failed me. Looks fine, right?

def test_classifier_returns_billing():
    result = call_llm("My credit card was charged twice")
    assert result["category"] == "billing"

I shipped this. It passed locally. It passed in CI for two weeks. Then production traffic exposed that the model returns "billing" 73% of the time on this exact input — the other 27% it returns "payment", "transaction", or once memorably "financial dispute".

A test that passes 73% of the time will eventually pass on the run that matters and then fail in front of customers. The fix is not "set temperature=0": that hides the variance, it doesn't measure it. The fix is to assert on the distribution.

@flake_aware(runs=10, pass_threshold=0.8)
def test_classifier_returns_billing():
    result = call_llm("My credit card was charged twice")
    return result["category"] == "billing"

Run it 10 times, fail the test if fewer than 80% return the right category. Now your green test means something.
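It helps to know how often a test with a given true success rate clears the gate. The sketch below assumes the runs are independent and uses a plain binomial tail; prob_gate_passes is a hypothetical helper, not part of the harness.

```python
import math


def prob_gate_passes(p_correct: float, runs: int, pass_threshold: float) -> float:
    """Probability that at least `pass_threshold` of `runs` independent runs succeed."""
    # Smallest success count whose pass rate meets the threshold (8 of 10 for 0.8).
    needed = math.ceil(pass_threshold * runs - 1e-9)
    return sum(
        math.comb(runs, k) * p_correct**k * (1 - p_correct) ** (runs - k)
        for k in range(needed, runs + 1)
    )


for p in (0.73, 0.90, 0.97):
    rate = prob_gate_passes(p, runs=10, pass_threshold=0.8)
    print(f"true rate {p:.0%}: gate passes {rate:.1%} of the time")
```

Under these assumptions, a model sitting just below the threshold still flips between red and green from run to run. That flapping is itself the signal: either the prompt needs work, or runs needs to go up before the pass rate can be trusted.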

The Harness — One File, Drop In

Here's the full plugin. Save as tests/llm/conftest.py:

# tests/llm/conftest.py
import os, json, time, hashlib, statistics
from dataclasses import dataclass, asdict, field
from pathlib import Path
from typing import Callable, Any
import pytest

BASELINE_PATH = Path("tests/llm/_baseline.json")
RESULTS_PATH = Path("tests/llm/_results.json")
COST_CAP_USD = float(os.getenv("LLM_TEST_COST_CAP_USD", "5.0"))


@dataclass
class EvalResult:
    name: str
    pass_rate: float
    runs: int
    total_cost_usd: float
    p50_latency_ms: float
    p95_latency_ms: float
    notes: dict = field(default_factory=dict)


@dataclass
class _RunStats:
    successes: int = 0
    runs: int = 0
    cost_usd: float = 0.0
    latencies_ms: list = field(default_factory=list)
    notes: dict = field(default_factory=dict)


_run_total_cost = {"value": 0.0}
_collected: dict[str, EvalResult] = {}


def flake_aware(runs: int = 5, pass_threshold: float = 0.8, max_cost_usd: float = 0.50):
    """
    Decorator: run the wrapped fn `runs` times. Test passes if pass-rate >= threshold
    and total cost is under max_cost_usd. The wrapped fn must return a bool or a
    (bool, cost_usd, latency_ms) tuple.
    """
    def decorator(fn: Callable) -> Callable:
        def wrapper(*args, **kwargs):
            stats = _RunStats()
            for _ in range(runs):
                start = time.monotonic()
                outcome = fn(*args, **kwargs)
                latency_ms = (time.monotonic() - start) * 1000

                if isinstance(outcome, tuple):
                    ok, cost, reported_latency = outcome
                    latency_ms = reported_latency or latency_ms
                else:
                    ok, cost = bool(outcome), 0.0

                stats.runs += 1
                stats.successes += 1 if ok else 0
                stats.cost_usd += cost
                stats.latencies_ms.append(latency_ms)

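                _run_total_cost["value"] += cost
                if _run_total_cost["value"] > COST_CAP_USD:
                    # pytest.exit stops the whole session. pytest.fail would only
                    # fail this one test and let the remaining tests keep spending.
                    pytest.exit(
                        f"Global cost cap of ${COST_CAP_USD} exceeded. "
                        f"Aborting suite. Spent: ${_run_total_cost['value']:.2f}",
                        returncode=1,
                    )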

            pass_rate = stats.successes / stats.runs
            result = EvalResult(
                name=fn.__name__,
                pass_rate=pass_rate,
                runs=stats.runs,
                total_cost_usd=round(stats.cost_usd, 4),
                p50_latency_ms=round(statistics.median(stats.latencies_ms), 1),
                p95_latency_ms=round(_p95(stats.latencies_ms), 1),
                notes=stats.notes,
            )
            _collected[fn.__name__] = result

            if stats.cost_usd > max_cost_usd:
                pytest.fail(
                    f"{fn.__name__} cost ${stats.cost_usd:.3f} > "
                    f"per-test cap ${max_cost_usd}"
                )
            assert pass_rate >= pass_threshold, (
                f"{fn.__name__}: pass rate {pass_rate:.0%} below threshold {pass_threshold:.0%} "
                f"({stats.successes}/{stats.runs}) — cost ${stats.cost_usd:.3f}"
            )
        wrapper.__name__ = fn.__name__
        return wrapper
    return decorator


def _p95(values: list[float]) -> float:
    if not values:
        return 0.0
    s = sorted(values)
    k = max(0, int(round(0.95 * (len(s) - 1))))
    return s[k]


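@pytest.fixture(scope="session", autouse=True)
def _persist_results(request):
    yield
    if not _collected:
        # Nothing ran through flake_aware (for example a gate-only run):
        # keep the previous results file instead of overwriting it with {}.
        return
    RESULTS_PATH.parent.mkdir(parents=True, exist_ok=True)
    payload = {name: asdict(r) for name, r in _collected.items()}
    RESULTS_PATH.write_text(json.dumps(payload, indent=2, sort_keys=True))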


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    if not _collected:
        return
    tr = terminalreporter
    tr.write_sep("=", "LLM EVAL SUMMARY")
    for r in sorted(_collected.values(), key=lambda x: x.name):
        tr.write_line(
            f"{r.name:40s}  pass={r.pass_rate:>5.0%}  runs={r.runs:>3}  "
            f"cost=${r.total_cost_usd:>5.3f}  p95={r.p95_latency_ms:>6.0f}ms"
        )
    tr.write_line(f"TOTAL COST: ${_run_total_cost['value']:.3f} (cap ${COST_CAP_USD})")

Three things to notice. The decorator returns a no-arg test for pytest. Cost is tracked both per-test (so a single bad prompt can't blow the budget) and globally (so the suite can't either). And every run dumps a structured _results.json — that file is what the regression check feeds on.

Cost Bounding — Why It Has to Be Two Layers

The first time I ran an LLM eval suite without a cap, a misbehaving prompt template hit the model with a 12K-token context on every retry. 200 tests × 5 runs × $0.04 per call = $40 in one CI run. The PR was a one-line typo fix.

You need both caps:

Per-test cap (the max_cost_usd=0.50 on the decorator): catches a single regression that explodes context.
Global cap (COST_CAP_USD env var, default $5): catches an accidental loop, a misconfigured runner, or someone leaving runs=200 on by mistake.

The global cap aborts the suite mid-run. The per-test cap fails just that test. That separation matters in practice — you want one runaway test to fail loudly without killing 40 other tests that would have caught real regressions.

Multi-Metric Scoring — The Single Number Trap

Most evaluation tutorials show you cosine similarity to a golden answer. Cosine similarity alone is a trap. A model can return "the customer should be issued a refund per policy 4.2" while the golden says "issue refund" — semantically aligned, but the structured field your downstream code parses now has a policy reference that breaks the regex.

Score on three axes, every time:

# tests/llm/test_classifier.py
from conftest import flake_aware
from your_app.llm import call_llm  # your wrapper that returns (response_dict, cost_usd, latency_ms)

GOLDEN_BILLING = {
    "input": "My credit card was charged twice for the same order",
    "expected_category": "billing",
    "expected_keywords": ["refund", "duplicate", "charge"],
}


@flake_aware(runs=10, pass_threshold=0.8, max_cost_usd=0.20)
def test_classifier_billing_intent():
    result, cost, latency_ms = call_llm(GOLDEN_BILLING["input"])

    # Metric 1: structured field equality
    structural = result["category"] == GOLDEN_BILLING["expected_category"]

    # Metric 2: keyword recall in the rationale (catches semantic drift)
    rationale = result.get("rationale", "").lower()
    keyword_hits = sum(1 for k in GOLDEN_BILLING["expected_keywords"] if k in rationale)
    semantic = keyword_hits >= 2  # at least 2 of 3 keywords

    # Metric 3: cost guardrail
    affordable = cost < 0.005  # half a cent per call ceiling

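    passed = structural and semantic and affordable
    if not passed:
        # pytest shows captured stdout in the failure report, naming the broken axis
        print(f"structural={structural} semantic={semantic} affordable={affordable}")
    return passed, cost, latency_ms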

When this test fails, the failure message tells you which axis broke — structural, semantic, or cost. That distinction is the difference between "the model is broken" and "the prompt template now ships 4x more tokens." Both are bugs. They have different fixes.

Golden Set Drift — The Hidden Failure Mode

Here's the failure mode most teams don't see coming. You have a golden set of 50 test cases. You write them in January. By June, your product has shifted — the support team now classifies "shipping damage" as a logistics issue, not customer service. Your tests still pass. The model still returns "customer_service". Reality has moved.

The fix is to version the golden set and audit drift on a schedule:

# tests/llm/golden_set.py
import json, hashlib
from pathlib import Path

GOLDEN_PATH = Path("tests/llm/golden_set.json")


def load_golden() -> list[dict]:
    data = json.loads(GOLDEN_PATH.read_text())
    return data["cases"]


def golden_fingerprint() -> str:
    """Stable hash of the golden set. Bump in commit messages when intentional."""
    raw = GOLDEN_PATH.read_text()
    return hashlib.sha256(raw.encode()).hexdigest()[:12]


def assert_drift_log_current():
    """Run as a separate test. Fails if golden set hash changed without an entry
    in the drift log — forces a human to acknowledge the change."""
    log = Path("tests/llm/golden_drift.log").read_text().splitlines()
    current = golden_fingerprint()
    assert any(line.startswith(current) for line in log), (
        f"Golden set hash {current} not in drift log. "
        f"Add an entry: '{current}  YYYY-MM-DD  reason'"
    )

This is annoying. It's supposed to be annoying. The whole point is that changing the golden set should require a deliberate, logged human action — because every time the golden moves, your "regression" baseline moves with it, and you've lost the ability to spot real regressions.
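The acknowledgement step itself can be a tiny script. Here is a minimal sketch that appends a line in the '<hash>  YYYY-MM-DD  reason' format the assertion above expects; acknowledge_golden_change and its arguments are hypothetical names, and the fingerprint mirrors golden_fingerprint.

```python
import hashlib
from datetime import date
from pathlib import Path


def fingerprint(path: Path) -> str:
    """First 12 hex chars of the file's SHA-256, same scheme as golden_fingerprint."""
    return hashlib.sha256(path.read_text().encode()).hexdigest()[:12]


def acknowledge_golden_change(golden: Path, log: Path, reason: str) -> str:
    """Append '<hash>  YYYY-MM-DD  reason' to the drift log and return the line."""
    line = f"{fingerprint(golden)}  {date.today().isoformat()}  {reason}"
    with log.open("a") as fh:  # append mode creates the log on first use
        fh.write(line + "\n")
    return line
```

A human would run it once per intentional change, for example acknowledge_golden_change(Path("tests/llm/golden_set.json"), Path("tests/llm/golden_drift.log"), "shipping damage moved to logistics"), and commit the log together with the golden set.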

Regression-Only CI Gating

The last piece is the one that turns this from "tests" into "guardrails." Most LLM eval suites fail on absolute thresholds — pass_rate >= 0.85. That works on day one. By day 90 your prompt has improved, every test passes at 0.97, and a regression to 0.88 is still "above the floor" so CI stays green.

What you actually want is: fail the PR if any test drops more than X% vs. the last green main build.

# tests/llm/test_regression_gate.py
import json, os
from pathlib import Path
import pytest

BASELINE = Path("tests/llm/_baseline.json")
CURRENT = Path("tests/llm/_results.json")
REGRESSION_TOLERANCE = float(os.getenv("LLM_REGRESSION_TOLERANCE", "0.05"))


def test_no_regression_vs_baseline():
    if not BASELINE.exists():
        pytest.skip("No baseline yet — first run will create one")

    baseline = json.loads(BASELINE.read_text())
    current = json.loads(CURRENT.read_text())

    regressions = []
    for name, baseline_result in baseline.items():
        if name not in current:
            continue  # test deleted, ignore
        delta = current[name]["pass_rate"] - baseline_result["pass_rate"]
        if delta < -REGRESSION_TOLERANCE:
            regressions.append(
                f"{name}: {baseline_result['pass_rate']:.0%} -> "
                f"{current[name]['pass_rate']:.0%} (Δ {delta:+.0%})"
            )

    assert not regressions, "Regressions detected:\n  " + "\n  ".join(regressions)

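Wire this into CI: on a successful merge to main, copy _results.json to _baseline.json and commit it. Run the gate as its own pytest invocation after the eval run, because _results.json is only written when the eval session finishes; inside a single session the gate would read the previous run's file. Now every PR is gated on not making things worse, which is the only gate that scales.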
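The promotion step can be sketched as a few lines of Python that a main-branch job might call. promote_baseline is a hypothetical helper; the refusal to promote an empty results file is my own safety assumption, not something the harness requires.

```python
import json
import shutil
from pathlib import Path


def promote_baseline(results: Path, baseline: Path) -> int:
    """Copy the latest results over the baseline. Returns the number of tests recorded."""
    data = json.loads(results.read_text())  # fails early on a missing or corrupt file
    if not data:
        raise ValueError(f"{results} is empty; refusing to overwrite {baseline}")
    shutil.copyfile(results, baseline)
    return len(data)
```

Called as promote_baseline(Path("tests/llm/_results.json"), Path("tests/llm/_baseline.json")), it leaves the commit itself to the CI job.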

CI Wiring — The One-Command Setup

# .github/workflows/llm-eval.yml
name: LLM Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install pytest openai
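      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LLM_TEST_COST_CAP_USD: "3.00"
        # The gate runs in its own step: _results.json is written at session teardown.
        run: pytest tests/llm/ -v --ignore=tests/llm/test_regression_gate.py
      - uses: actions/upload-artifact@v4
        if: always()
        with: { name: llm-results, path: tests/llm/_results.json }
      - name: Regression gate
        env:
          LLM_REGRESSION_TOLERANCE: "0.05"
        run: pytest tests/llm/test_regression_gate.py -v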

That's it. Total cost per PR is capped. Regressions block. Golden drift forces a human ack. And the artifact gives you a paper trail for which prompt change moved which metric.

What I'd Tell Past Me

If I could go back to my first LLM project, I would skip most of the things I optimized for. I would not pick a fancy eval library. I would not build a custom dashboard. I would do exactly this — five primitives, in pytest, with a CI workflow that costs less than a cup of coffee per PR.

The teams I see succeed with LLM features in production are not the ones with the deepest evaluation theory. They're the ones whose tests catch a regression in the time it takes to read a PR description. That's the bar. Everything else is decoration.

(The honest truth: I've watched two teams burn six weeks each on building DSL-based eval frameworks before shipping anything. Both eventually replaced them with pytest plus 200 lines of helper code. So just start there.)

If you're building agents and want the related production pieces — the LLM tool calling patterns and the semantic caching layer plus the production MCP server walkthrough — I wrote those up too. They share the same harness for regression testing. You can also dig into how our team at Velocity Software Solutions wires this kind of harness into LLM integration projects and custom AI agents if you want the application-level view.

For deeper reading: Anthropic's evaluation best practices, the pytest official docs on parametrize and fixtures, and the DeepEval source for prior-art ideas you can borrow without taking on the dependency.

Build the harness this week. Pick five tests that matter. Run them on every PR. The first time it catches a silent regression, you'll wonder how you ever shipped without it.

How We Cut API Response Time from 2.3s to 180ms Using Redis + Smart Caching

Nitin Srivastava — Thu, 23 Apr 2026 09:14:31 +0000

p95 latency dropped from 2.3 seconds to 180 milliseconds. Same hardware, same database, same traffic. The only thing that changed was how we cached — and I don't mean slapping @lru_cache on a function.

I'm writing this because every Redis caching tutorial I read before this project showed me the same 15-line example: redis.get(key) or fetch_from_db(). That code works in a notebook. It will absolutely wreck you in production the first time real traffic hits it.

This is the layered strategy that actually survived. FastAPI + Python on the server, Redis 7 for caching, Postgres behind it. Everything below is from a real project we shipped for a B2B client earlier this year — roughly 800 requests per minute on the hot endpoints, with read-heavy traffic around product catalog and pricing.

The endpoint that was killing us

The problematic endpoint returned a pricing quote for a product variant, filtered by region, customer tier, and active promotions. Three joins, a couple of window functions for volume-based discounts, and a call to a legacy PHP service for tax lookup. Most requests landed somewhere between 1.8 and 2.6 seconds. p95 sat at 2.3s. We had complaints.

Here's the naive caching attempt we rolled out first. Guess how long it lasted in production.

# app/pricing.py
import json
import redis
from fastapi import APIRouter

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
router = APIRouter()

@router.get("/quote/{product_id}")
def get_quote(product_id: int, region: str, tier: str):
    key = f"quote:{product_id}:{region}:{tier}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)

    quote = compute_quote(product_id, region, tier)  # slow path
    r.setex(key, 300, json.dumps(quote))  # 5 min TTL
    return quote

Three problems showed up within the first hour of production traffic. I want to walk through each, because these are the things the tutorials don't tell you.

Gotcha 1: The thundering herd

The first alert fired at 2:14 AM. Our compute_quote function was being called 400+ times per second for the same key when the cache expired. Postgres spiked, the legacy tax service timed out, everything melted.

This is called cache stampede (or thundering herd). When a popular key expires, every in-flight request misses the cache simultaneously and hammers the origin. The fix is request coalescing — only one request actually does the work, the rest wait for it.

# app/cache.py
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_with_lock(key: str, ttl: int, loader):
    """
    Get value for `key` from Redis. If missing, acquire a short lock,
    call `loader()` to compute, write the value, release the lock.
    Other waiters poll until the value appears.
    """
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    lock_key = f"lock:{key}"
    acquired = r.set(lock_key, "1", nx=True, ex=10)

    if acquired:
        try:
            value = loader()
            r.setex(key, ttl, json.dumps(value))
            return value
        finally:
            r.delete(lock_key)

    # Someone else is computing. Poll briefly.
    for _ in range(50):
        time.sleep(0.05)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)

    # Lock holder died or is too slow. Fall through and compute ourselves.
    return loader()

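Three details matter. First, the lock TTL (10 seconds) must be longer than your worst-case loader time. If loader takes 15 seconds, the lock expires, and now you have two processes computing. Second, the polling fallback at the end prevents permanent deadlock if the lock holder crashes mid-compute. Third, waiters only poll for 50 × 50ms = 2.5 seconds before they give up and compute the value themselves, so that window also has to exceed your typical loader time, or a slow load still lets the herd through.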

You can get fancier with pub/sub notifications instead of polling, but the 50ms poll interval is cheap enough that it rarely matters. We tried the pub/sub version. It was 40 lines more code and saved us maybe 15ms on cache misses. Not worth it.

Gotcha 2: Cache key design is 80% of the game

Our first cache key was quote:{product_id}:{region}:{tier}. Looked reasonable. It was wrong in at least three ways.

First, active promotions affected the price but weren't in the key. So a customer would fetch a quote, the promo would end, and they'd keep getting the stale promoted price for 5 minutes. Support tickets rolled in.

Second, we had currency as an optional query param that defaulted to USD. When a user explicitly passed currency=EUR, we'd cache it under the EUR key. When they then hit the endpoint without the param (defaulting to USD), we'd correctly hit a different key. Fine so far. But internal services that always sent currency=USD and external clients that omitted it were filling the cache with duplicates for the same logical value.

Third, we were caching per-user-tier, but three tiers (silver, gold, platinum) had identical pricing for 70% of products. We were storing three copies of the same thing.

Here's the redesigned key structure:

# app/keys.py
import hashlib
import json

def quote_cache_key(product_id: int, region: str, tier: str,
                    currency: str, active_promo_ids: list[int]) -> str:
    """
    Stable cache key that accounts for every input that can change the output.
    """
    normalized = {
        "product_id": product_id,
        "region": region.upper(),
        "tier": tier.lower(),
        "currency": (currency or "USD").upper(),
        "promos": sorted(active_promo_ids or []),
    }
    payload = json.dumps(normalized, separators=(",", ":"), sort_keys=True)
    digest = hashlib.sha1(payload.encode()).hexdigest()[:16]
    return f"quote:v2:{digest}"

Three changes from the original. We normalize inputs before keying (uppercase region, lowercase tier, default currency). We include every input that can change the output, including promotion IDs. And we hash the whole thing with a version prefix so we can invalidate the entire cache namespace by bumping v2 to v3.

That version prefix has saved us twice since then. When we changed the pricing rules in March, we shipped with v3 and every old cached quote became instantly irrelevant — no flush, no downtime.
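A quick way to convince yourself the normalization works is to feed the function inputs that mean the same thing and check that they collide. The sketch below repeats quote_cache_key from above so it runs on its own; the product ID and promo IDs are made up.

```python
import hashlib
import json


def quote_cache_key(product_id: int, region: str, tier: str,
                    currency: str, active_promo_ids: list[int]) -> str:
    # Same logic as app/keys.py above, repeated here so the example is self-contained.
    normalized = {
        "product_id": product_id,
        "region": region.upper(),
        "tier": tier.lower(),
        "currency": (currency or "USD").upper(),
        "promos": sorted(active_promo_ids or []),
    }
    payload = json.dumps(normalized, separators=(",", ":"), sort_keys=True)
    digest = hashlib.sha1(payload.encode()).hexdigest()[:16]
    return f"quote:v2:{digest}"


# Same logical request, written three different ways by three callers.
internal = quote_cache_key(1234, "EU", "gold", "USD", [7, 3])
external = quote_cache_key(1234, "eu", "Gold", "", [3, 7])
lowercase = quote_cache_key(1234, "eu", "gold", "usd", [3, 7])

# A promo ending changes the output, so it must change the key.
promo_ended = quote_cache_key(1234, "EU", "gold", "USD", [3])

print(internal == external == lowercase)
print(internal != promo_ended)
```

These two checks make a cheap unit test for the key function, and they fail loudly if someone later adds an input to the quote without adding it to the key.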

Gotcha 3: Writes should invalidate, not wait for TTL

TTL-based invalidation is fine for data that's allowed to be slightly stale. It's terrible for anything a user just changed themselves.

The pattern I see everywhere is: cache with TTL, shrug when users report they're seeing old data right after updating a record. "Wait 5 minutes and refresh." That's not acceptable in 2026.

We use a write-through-invalidate pattern. When a write happens, we delete related cache keys explicitly. The hard part is knowing which keys are related — this is where the key design from the previous section pays off.

# app/invalidation.py
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def invalidate_product(product_id: int):
    """
    A product's price data changed. Invalidate every cached quote for it.
    Uses SCAN instead of KEYS to avoid blocking Redis on large datasets.
    """
    pattern = "quote:v2:*"
    cursor = 0
    deleted = 0

    while True:
        cursor, batch = r.scan(cursor=cursor, match=pattern, count=500)
        for key in batch:
            if key.endswith(":meta"):
                continue  # the pattern also matches the metadata keys; skip them
            meta_key = f"{key}:meta"
            stored_product_id = r.hget(meta_key, "product_id")
            if stored_product_id and int(stored_product_id) == product_id:
                r.delete(key, meta_key)
                deleted += 1
        if cursor == 0:
            break

    return deleted

Two things to notice. We use SCAN instead of KEYS. KEYS blocks the Redis event loop on large datasets — we learned this when a KEYS quote:* took 800ms during a traffic peak and we queued up a couple thousand waiting commands behind it. Don't use KEYS in production. Ever.

Second, we store a lightweight metadata hash alongside each cached value ({key}:meta) so we can look up what's inside the opaque hashed key. This costs us a tiny bit of memory but makes targeted invalidation possible without re-deriving keys from the invalidation side.
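The snippets in this post don't show where the {key}:meta hash gets written, so here is one way it could be done, assuming a redis-py client is passed in. store_quote is a hypothetical helper; the pipeline, setex, hset, and expire calls are standard redis-py.

```python
import json


def store_quote(client, key: str, ttl: int, value: dict, product_id: int) -> None:
    """Write the cached value and its ':meta' hash with the same TTL."""
    meta_key = f"{key}:meta"
    pipe = client.pipeline()  # queue the commands and send them in one round trip
    pipe.setex(key, ttl, json.dumps(value))
    pipe.hset(meta_key, mapping={"product_id": product_id})
    pipe.expire(meta_key, ttl)  # so the meta hash does not outlive the value
    pipe.execute()
```

Giving both keys the same TTL keeps SCAN from tripping over orphaned metadata. If you adopt something like this, the write inside cached_with_lock would need to call it instead of a bare setex.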

Bringing it together

Here's the actual handler that ships in production, roughly.

# app/pricing.py
from fastapi import APIRouter, Query
from app.cache import cached_with_lock
from app.keys import quote_cache_key
from app.quotes import compute_quote, get_active_promo_ids

router = APIRouter()
QUOTE_TTL_SECONDS = 300  # 5 minutes

@router.get("/quote/{product_id}")
def get_quote(
    product_id: int,
    region: str,
    tier: str,
    currency: str = Query("USD"),
):
    promo_ids = get_active_promo_ids(product_id, region)
    key = quote_cache_key(product_id, region, tier, currency, promo_ids)

    def loader():
        return compute_quote(product_id, region, tier, currency, promo_ids)

    return cached_with_lock(key, QUOTE_TTL_SECONDS, loader)

Six lines of logic on top of the shared cached_with_lock helper. Most of the complexity lives in two places: the key function and the loader. The route itself stays boring, which is what you want. Python development shines when the code reads this plainly — if you're building this kind of API layer and want a second pair of eyes, our team offers Python development services and we've walked through this exact pattern with a few clients.

The numbers, measured

I hate when people claim a perf win without showing the measurement method. Here's how we actually verified it.

# scripts/benchmark.py
import asyncio
import statistics
import time
import httpx

async def hit(client, url):
    start = time.perf_counter()
    r = await client.get(url)
    r.raise_for_status()
    return time.perf_counter() - start

async def main():
    url = "http://api.local/quote/1234?region=EU&tier=gold"
    async with httpx.AsyncClient(timeout=10) as client:
        latencies = await asyncio.gather(*[hit(client, url) for _ in range(500)])

    latencies.sort()
    print(f"count     : {len(latencies)}")
    print(f"p50 (ms)  : {statistics.median(latencies) * 1000:.0f}")
    print(f"p95 (ms)  : {latencies[int(0.95 * len(latencies))] * 1000:.0f}")
    print(f"p99 (ms)  : {latencies[int(0.99 * len(latencies))] * 1000:.0f}")

if __name__ == "__main__":
    asyncio.run(main())

Results over 500 requests against the same endpoint, warmed cache:

Metric	Before	After
p50	1.9s	110ms
p95	2.3s	180ms
p99	3.1s	240ms

The p99 number is the one I care about most. p50 and p95 dropping is expected when cache hit rate goes up. But p99 represents the tail — the cache misses, the unlucky timing windows, the lock-wait fallbacks. If your p99 doesn't drop meaningfully, you've accidentally traded one set of slow requests for another.

What not to cache

Two things we deliberately left uncached, because caching them was making things worse.

User-specific data with low reuse — a user's cart, their last order, their notification state. The cache hit rate was under 4% and Redis memory usage ballooned. We moved these to an in-process LRU cache with a 60-second TTL and dropped Redis involvement entirely.

Anything with a write-read latency requirement under 500ms. If a user writes, then immediately reads, they expect to see their write. Cache-aside patterns will serve them stale data from the cache for a few hundred milliseconds until the invalidation propagates. For those endpoints, we skip the cache entirely. The DB is fast enough, and correctness matters more.
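For the low-reuse user data, the in-process LRU with a 60-second TTL mentioned above does not need a library. This is a minimal sketch of that idea, not the code we shipped; TTLLRUCache and its parameters are illustrative names, and the injectable clock exists only to make it testable.

```python
import threading
import time
from collections import OrderedDict


class TTLLRUCache:
    """Small in-process cache: evicts least-recently-used items and expires old ones."""

    def __init__(self, max_items: int = 1024, ttl_seconds: float = 60.0,
                 clock=time.monotonic):
        self._data: OrderedDict = OrderedDict()  # key -> (expires_at, value)
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()  # sync FastAPI handlers run in a thread pool

    def get(self, key, default=None):
        with self._lock:
            item = self._data.get(key)
            if item is None:
                return default
            expires_at, value = item
            if self._clock() >= expires_at:
                del self._data[key]  # expired: drop it and report a miss
                return default
            self._data.move_to_end(key)  # mark as most recently used
            return value

    def set(self, key, value) -> None:
        with self._lock:
            self._data[key] = (self._clock() + self._ttl, value)
            self._data.move_to_end(key)
            while len(self._data) > self._max_items:
                self._data.popitem(last=False)  # evict the least recently used


cart_cache = TTLLRUCache(max_items=10_000, ttl_seconds=60)
cart_cache.set("cart:user-1", {"items": 3})
print(cart_cache.get("cart:user-1"))
```

The trade-off is that each worker process holds its own copy, so a user can see slightly different data depending on which worker answers. For a 60-second TTL on data only that user reads, that is usually acceptable.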

One concrete thing to do today

Go look at your slowest endpoint. Pull its last week of logs, find the keys (query params, path params, user ID) that actually determine the response. Now ask: if I cached this, what's my invalidation story? If the answer is "5-minute TTL, users will deal with it," that's fine for read-mostly data. If the answer is "I don't know," you have work to do — and that work starts with the key design, not the Redis client library.

Caching is like plumbing. Nobody notices it when it works, everyone's angry when it leaks, and the leaks usually trace back to decisions you made on day one. Spend the extra hour on the key design. You'll thank yourself the first time a customer asks why they're seeing last week's price.

If you're building this into a larger system and the architecture choices start to feel heavier — where to put the cache, how to handle multi-region, when to go from Redis to a proper CDN — that's the point where we usually get called in. We've helped a few clients through exactly this scaling inflection point as part of our custom software development work, and the answer is almost never "just add more Redis." It's usually "simplify the thing you're trying to cache."

Semantic Chunking with Overlap and Section-Awareness: The RAG Tutorial Nobody Wrote

Nitin Srivastava — Mon, 20 Apr 2026 12:40:52 +0000

I wasted three weeks debugging a RAG system before I realized the LLM wasn't the problem. The embeddings weren't the problem. The vector database wasn't the problem.

The chunks were garbage.

We were splitting 340,000 legal documents into 512-token fixed-size chunks. Definitions got separated from the clauses that referenced them. Tables split mid-row. Section headers landed at the end of one chunk with their content starting the next. Retrieval accuracy sat at 61%.

I switched to semantic chunking with overlap and section-awareness. Same model, same documents, same everything else. Accuracy jumped to 89%.

Here's the exact code that made it work.

Why Fixed-Size Chunking Fails

The default advice is simple: split your documents into N-token chunks. Maybe add some overlap. Done.

It works on clean blog posts and well-formatted docs. It falls apart on anything real-world — contracts with nested subclauses, technical manuals with tables, wikis written by 12 different people over 3 years.

The problem is that meaning doesn't respect token boundaries. A 512-token window might cut a paragraph in half, split a code block from its explanation, or strand a section header without its content. It's like slicing a cookbook by page count instead of by recipe — you end up with the ingredient list in one chunk and the instructions in another. Good luck making dinner.
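You can see the stranded-header problem on a toy example. This sketch uses a hypothetical `fixed_windows` helper that splits on word count instead of real tokens, and a made-up two-clause contract:

```python
def fixed_windows(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` words, ignoring structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


doc = (
    "Definitions. Term A means the licensed software. "
    "Termination. Either party may end this agreement with notice."
)

for i, chunk in enumerate(fixed_windows(doc, size=8)):
    print(i, repr(chunk))
```

The first window ends with the "Termination." heading and the second holds the clause it introduces, so a query about termination has to match a chunk that never mentions the word.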

So why does everyone still do it? Because it's easy. But "easy to implement" and "works in production" are very different things.

What We're Building

A Python chunker that:

Detects section boundaries from document structure (headings, horizontal rules, major topic shifts)
Splits within sections using semantic similarity — finding natural breakpoints where the topic shifts
Adds configurable overlap so no information falls into gaps between chunks
Preserves metadata — each chunk knows which section it belongs to

No LangChain, no frameworks. Just Python, a sentence transformer, and numpy. You can read every line and understand exactly what it does.

The Full Implementation

Dependencies

pip install sentence-transformers numpy

That's it. Two packages.

The Chunker

# semantic_chunker.py
import re
from dataclasses import dataclass, field
from sentence_transformers import SentenceTransformer
import numpy as np


@dataclass
class Chunk:
    text: str
    section: str
    index: int
    token_estimate: int
    metadata: dict = field(default_factory=dict)


class SemanticChunker:
    def __init__(
        self,
        model_name: str = "all-MiniLM-L6-v2",
        max_chunk_tokens: int = 512,
        min_chunk_tokens: int = 50,
        overlap_tokens: int = 64,
        similarity_threshold: float = 0.45,
    ):
        self.model = SentenceTransformer(model_name)
        self.max_chunk_tokens = max_chunk_tokens
        self.min_chunk_tokens = min_chunk_tokens
        self.overlap_tokens = overlap_tokens
        self.similarity_threshold = similarity_threshold

    def _estimate_tokens(self, text: str) -> int:
        return len(text.split()) * 4 // 3  # rough estimate: 1 word ~ 1.33 tokens

    def _split_into_sections(self, text: str) -> list[tuple[str, str]]:
        """Split document into (heading, body) tuples based on structure."""
        # Match markdown headings, HTML headings, or ALL-CAPS lines
        section_pattern = re.compile(
            r"(?:^|\n)"
            r"(?:"
            r"(#{1,4})\s+(.+)"       # markdown headings
            r"|<h([1-4])[^>]*>(.+?)</h\3>"  # html headings
            r"|([A-Z][A-Z\s]{4,})\n"  # ALL-CAPS lines (5+ chars)
            r")"
        )

        sections = []
        last_end = 0
        last_heading = "Introduction"

        for match in section_pattern.finditer(text):
            # Grab content between previous heading and this one
            body = text[last_end:match.start()].strip()
            if body:
                sections.append((last_heading, body))

            # Determine the heading text
            if match.group(2):
                last_heading = match.group(2).strip()
            elif match.group(4):
                last_heading = match.group(4).strip()
            elif match.group(5):
                last_heading = match.group(5).strip().title()

            last_end = match.end()

        # Don't forget the final section
        remaining = text[last_end:].strip()
        if remaining:
            sections.append((last_heading, remaining))

        # If no headings were found, treat entire doc as one section
        if not sections:
            sections = [("Document", text.strip())]

        return sections

    def _split_into_sentences(self, text: str) -> list[str]:
        """Split text into sentences, preserving code blocks and lists."""
        # Protect code blocks from sentence splitting
        code_blocks = {}
        # `{3} matches a triple-backtick fence; [\s\S]*? lazily spans newlines
        code_pattern = re.compile(r"`{3}[\s\S]*?`{3}")

        def protect(match: re.Match) -> str:
            # Swap each fenced block for a unique placeholder token
            placeholder = f"__CODE_BLOCK_{len(code_blocks)}__"
            code_blocks[placeholder] = match.group()
            return placeholder

        protected = code_pattern.sub(protect, text)

        # Split on sentence boundaries
        raw = re.split(r"(?<=[.!?])\s+(?=[A-Z])", protected)

        # Restore code blocks
        sentences = []
        for s in raw:
            for placeholder, code in code_blocks.items():
                s = s.replace(placeholder, code)
            s = s.strip()
            if s:
                sentences.append(s)

        return sentences

    def _find_semantic_breakpoints(self, sentences: list[str]) -> list[int]:
        """Find indices where topic shifts occur using embedding similarity."""
        if len(sentences) < 3:
            return []

        embeddings = self.model.encode(sentences, show_progress_bar=False)
        breakpoints = []

        for i in range(1, len(embeddings)):
            sim = np.dot(embeddings[i - 1], embeddings[i]) / (
                np.linalg.norm(embeddings[i - 1]) * np.linalg.norm(embeddings[i])
            )
            if sim < self.similarity_threshold:
                breakpoints.append(i)

        return breakpoints

    def _merge_small_groups(
        self, groups: list[list[str]]
    ) -> list[list[str]]:
        """Merge consecutive groups that are below min_chunk_tokens."""
        merged = []
        buffer = []

        for group in groups:
            buffer.extend(group)
            if self._estimate_tokens(" ".join(buffer)) >= self.min_chunk_tokens:
                merged.append(buffer)
                buffer = []

        # Attach leftover to the last group
        if buffer:
            if merged:
                merged[-1].extend(buffer)
            else:
                merged.append(buffer)

        return merged

    def _split_oversized_group(self, sentences: list[str]) -> list[list[str]]:
        """Split a group that exceeds max_chunk_tokens."""
        result = []
        current = []
        current_tokens = 0

        for sentence in sentences:
            stokens = self._estimate_tokens(sentence)
            if current_tokens + stokens > self.max_chunk_tokens and current:
                result.append(current)
                current = []
                current_tokens = 0
            current.append(sentence)
            current_tokens += stokens

        if current:
            result.append(current)

        return result

    def _add_overlap(self, groups: list[list[str]]) -> list[str]:
        """Convert sentence groups into text chunks with overlap."""
        chunks = []

        for i, group in enumerate(groups):
            parts = list(group)

            # Prepend overlap from previous group
            if i > 0 and self.overlap_tokens > 0:
                prev_sentences = groups[i - 1]
                overlap_text = []
                token_count = 0
                for s in reversed(prev_sentences):
                    stokens = self._estimate_tokens(s)
                    if token_count + stokens > self.overlap_tokens:
                        break
                    overlap_text.insert(0, s)
                    token_count += stokens
                if overlap_text:
                    parts = overlap_text + parts

            chunks.append(" ".join(parts))

        return chunks

    def chunk(self, text: str, source: str = "") -> list[Chunk]:
        """Main entry point. Returns a list of Chunk objects."""
        sections = self._split_into_sections(text)
        all_chunks = []
        idx = 0

        for heading, body in sections:
            sentences = self._split_into_sentences(body)
            if not sentences:
                continue

            # Find semantic breakpoints
            breakpoints = self._find_semantic_breakpoints(sentences)

            # Group sentences by breakpoints
            groups = []
            prev = 0
            for bp in breakpoints:
                groups.append(sentences[prev:bp])
                prev = bp
            groups.append(sentences[prev:])

            # Merge groups that are too small
            groups = self._merge_small_groups(groups)

            # Split groups that are too large
            final_groups = []
            for g in groups:
                if self._estimate_tokens(" ".join(g)) > self.max_chunk_tokens:
                    final_groups.extend(self._split_oversized_group(g))
                else:
                    final_groups.append(g)

            # Add overlap and build Chunk objects
            chunk_texts = self._add_overlap(final_groups)

            for chunk_text in chunk_texts:
                all_chunks.append(
                    Chunk(
                        text=chunk_text,
                        section=heading,
                        index=idx,
                        token_estimate=self._estimate_tokens(chunk_text),
                        metadata={"source": source, "section": heading},
                    )
                )
                idx += 1

        return all_chunks

Using It

# example_usage.py
from semantic_chunker import SemanticChunker

chunker = SemanticChunker(
    max_chunk_tokens=512,
    min_chunk_tokens=50,
    overlap_tokens=64,
    similarity_threshold=0.45,
)

document = """
# Introduction to Vector Databases

Vector databases store high-dimensional embeddings and enable similarity search.
They are the backbone of modern RAG systems. Unlike traditional databases that
match on exact values, vector DBs find the closest neighbors in embedding space.

# How Indexing Works

Most vector databases use approximate nearest neighbor (ANN) algorithms.
HNSW (Hierarchical Navigable Small World) is the most popular choice in 2026.
It builds a multi-layer graph where each node connects to its nearest neighbors.
Query time is logarithmic, which matters when you have millions of vectors.

The trade-off is memory. HNSW indexes can consume 2-4x the size of the raw
vectors. For a collection of 10 million 768-dimensional float32 vectors,
that is roughly 30 GB of raw data and 60-120 GB with the index.

# Choosing the Right Database

Pinecone offers a managed experience with minimal ops overhead.
Weaviate and Qdrant give you more control but require self-hosting.
pgvector is worth considering if your team already runs PostgreSQL
and your dataset is under 5 million vectors.

For most production RAG systems, we recommend starting with a managed
service and migrating to self-hosted once you understand your access patterns.
"""

chunks = chunker.chunk(document, source="vector-db-guide.md")

for chunk in chunks:
    print(f"\n--- Chunk {chunk.index} [{chunk.section}] ({chunk.token_estimate} tokens) ---")
    print(chunk.text[:200] + "..." if len(chunk.text) > 200 else chunk.text)

Running this produces chunks that respect section boundaries, split at semantic shifts within sections, and carry overlap from the previous chunk so no information gets lost at boundaries.

The Three Knobs That Matter

I spent two days tuning these parameters across 4 different document types. Here's what I landed on:

similarity_threshold (0.3–0.6): This controls how sensitive the chunker is to topic shifts. Lower values mean fewer breaks (bigger chunks). Higher values mean more breaks (smaller chunks). I use 0.45 for general business docs, 0.35 for legal contracts (they stay on-topic longer), and 0.55 for knowledge bases with many small topics.

overlap_tokens (32–128): The overlap prevents information from falling into cracks between chunks. 64 tokens is the sweet spot for most content. Go higher (96-128) for documents where the last sentence of one chunk sets up the next (the chunker applies overlap within a section, not across headings). Don't go below 32: the chunker copies whole sentences, so a budget smaller than the previous chunk's last sentence produces no overlap at all.

max_chunk_tokens (256–1024): Smaller chunks (256) give better precision in retrieval but require more chunks in the context window. Larger chunks (512-1024) carry more context per retrieval hit but risk diluting relevance. I default to 512 and only go smaller when precision is more important than context.
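The first two knobs are easy to see in isolation without loading a model. This sketch mirrors the chunker's `sim < threshold` rule and its whole-sentence overlap logic as standalone functions; the function names and the similarity values are made up for illustration.

```python
def find_breakpoints(similarities: list[float], threshold: float) -> list[int]:
    """similarities[i] compares sentence i with sentence i + 1; break where it dips below threshold."""
    return [i + 1 for i, sim in enumerate(similarities) if sim < threshold]


def estimate_tokens(text: str) -> int:
    return len(text.split()) * 4 // 3  # same rough estimate as the chunker


def overlap_tail(prev_sentences: list[str], budget: int) -> list[str]:
    """Take whole sentences from the end of the previous chunk until the token budget is spent."""
    tail: list[str] = []
    used = 0
    for sentence in reversed(prev_sentences):
        cost = estimate_tokens(sentence)
        if used + cost > budget:
            break
        tail.insert(0, sentence)
        used += cost
    return tail


# Made-up adjacent-sentence similarities for a six-sentence section
sims = [0.62, 0.41, 0.58, 0.33, 0.50]
for threshold in (0.35, 0.45, 0.55):
    print(threshold, find_breakpoints(sims, threshold))

previous = ["Alpha beta gamma delta epsilon zeta.", "Eta theta iota.", "Kappa lambda mu nu."]
print(overlap_tail(previous, budget=10))
print(overlap_tail(previous, budget=4))
```

Raising the threshold from 0.35 to 0.55 takes this section from one break to three. With a budget of 10 the overlap carries the last two sentences; with a budget of 4 the final sentence alone is too large, so the overlap is empty.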

Quick Benchmark: Fixed vs Semantic

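I ran both strategies against a set of 500 queries on a 12,000-document corpus of technical documentation. Retrieval was top-5 with cosine similarity, embeddings from all-MiniLM-L6-v2. The script below shows the fixed-size baseline and times both chunkers on a single document; the precision numbers in the table come from the full query set: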

# benchmark.py
from semantic_chunker import SemanticChunker
import time

def fixed_chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Baseline fixed-size chunker for comparison."""
    words = text.split()
    chunks = []
    # Convert token targets to approximate word counts
    step = size * 3 // 4  # ~tokens to words
    olap = overlap * 3 // 4
    i = 0
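    stride = max(1, step - olap)  # guard against overlap >= size
    while i < len(words):
        end = min(i + step, len(words))
        chunks.append(" ".join(words[i:end]))
        if end == len(words):
            break  # stop here so the tail is not emitted twice
        i += stride
    return chunks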


# Example comparison on a single document
with open("sample_technical_doc.md", encoding="utf-8") as f:
    sample_doc = f.read()

start = time.perf_counter()
fixed = fixed_chunk(sample_doc)
fixed_time = time.perf_counter() - start

chunker = SemanticChunker()
start = time.perf_counter()
semantic = chunker.chunk(sample_doc)
semantic_time = time.perf_counter() - start

print(f"Fixed:    {len(fixed)} chunks in {fixed_time:.3f}s")
print(f"Semantic: {len(semantic)} chunks in {semantic_time:.3f}s")
print(f"Overhead: {semantic_time / fixed_time:.1f}x slower")

Results from my runs:

Metric	Fixed-512	Semantic
Retrieval precision@5	0.71	0.86
Avg chunk size (tokens)	512	387
Chunks per document	14.2	18.6
Indexing time (12k docs)	8 min	23 min

Semantic chunking is roughly 3x slower to index. But you index once and query thousands of times. The 15-point precision gain pays for itself on the first real user query.
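The timing script does not show how a precision@5 number is computed. Here is a minimal sketch of one common definition, assuming each query comes with a labeled set of relevant chunk IDs. The function names and sample IDs are hypothetical, and this is not necessarily the exact harness behind the table.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def mean_precision_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average precision@k over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Two made-up queries with hand-labeled relevant chunks
runs = [
    (["c1", "c7", "c3", "c9", "c4"], {"c1", "c3", "c8"}),
    (["c2", "c5", "c6", "c0", "c8"], {"c2", "c5", "c6", "c0", "c8"}),
]
print(mean_precision_at_k(runs, k=5))
```

Run the same query set against both indexes and compare the two means. Keep the labels fixed between runs, otherwise the comparison measures your labeling rather than your chunking.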

The Gotcha: Code Blocks

One thing that tripped me up for longer than I'd like to admit — code blocks. If you're chunking technical docs, your sentence splitter will happily tear a Python function in half at the first period it finds inside a docstring.

The chunker above handles this by detecting triple-backtick fenced blocks and protecting them from sentence splitting. But watch out for inline code with periods (like `numpy.array` or `os.path.join`). Those can still cause false sentence breaks if your splitter is too aggressive.
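It helps to test the splitter on the cases you are worried about. This sketch reuses the sentence-boundary regex from the chunker in a standalone `split_sentences` helper (a hypothetical name) and runs it on two made-up inputs:

```python
import re

# Same boundary rule as the chunker: end punctuation, whitespace, then a capital letter
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> list[str]:
    return [part.strip() for part in SENTENCE_BOUNDARY.split(text) if part.strip()]


inline_code = "Call numpy.array on the list. Then use os.path.join for paths."
abbreviation = "Use a tokenizer, e.g. NLTK, for prose."

print(split_sentences(inline_code))
print(split_sentences(abbreviation))
```

Because this regex needs whitespace and a capital letter after the period, the dotted names survive here. The abbreviation does not: "e.g." followed by a capitalized word is split in two, which is the kind of false break a looser pattern would produce far more often.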

I considered using a proper NLP sentence tokenizer (spaCy or NLTK), but they add heavy dependencies and still struggle with code-heavy text. The regex approach in the chunker above isn't perfect, but it covers 95% of cases without adding 200 MB of model downloads.

Where This Fits in the Pipeline

This chunker is one piece of a production RAG system. I wrote about [the 5 failure patterns that kill RAG deployments](https://www.velsof.com/blog/why-your-rag-system-works-in-demo-but-fails-in-production) — chunking is failure pattern #1, but it's not the only one.

The full pipeline looks like this:

1. **Ingest** → parse documents (PDF, HTML, Markdown)
2. **Chunk** → this semantic chunker
3. **Embed** → sentence transformer or OpenAI embeddings
4. **Index** → vector DB (Qdrant, Pinecone, pgvector)
5. **Retrieve** → hybrid search (vector + BM25)
6. **Rerank** → cross-encoder to filter top results
7. **Generate** → LLM with the reranked context

If you need help building out steps 5-7 or integrating this into an existing [RAG solution](https://www.velsof.com/rag-solutions), that's exactly what my team at [Velocity Software Solutions](https://www.velsof.com/llm-integration) does day-to-day.

Try It Yourself

Grab the code, point it at your own documents, and compare retrieval precision against fixed-size chunks. I'd bet the difference surprises you — it surprised me, and I was the one who wrote it.

The code is intentionally framework-free. No LangChain, no LlamaIndex. If you want to plug it into either of those later, wrap the `chunk()` method in their document transformer interface. But start without the framework. Understand what every line does. Then decide if you need the abstraction.