Moon Robert

Posted on • Originally published at blog.rebalai.com

Building Production-Ready AI Pipelines: Lessons from Running 10K+ Generations

It was a Tuesday morning when I opened our Datadog dashboard and saw 847 silent failures from the previous night's batch job. No alerts. No exceptions in our logs. Just a queue that had quietly eaten thousands of tokens and returned nothing useful. Our pipeline had been "succeeding" in the sense that it wasn't throwing errors — it was just producing garbage and writing it to the database like everything was fine.

That was month two of running LLM-powered features in production. I thought I had it figured out by then. I did not.

Over the past eight months, on a three-person team, I've pushed somewhere north of 10,000 generations through production pipelines — across Claude 3.5, GPT-4o, and a brief, regrettable experiment with a self-hosted Mistral instance that I will get to. Here's what I actually learned, as opposed to what the documentation implied I would need to care about.

Retry Logic Is a Trap If You Do It Wrong

Every guide tells you to implement retries. What they don't tell you is that naive exponential backoff will bankrupt you during a rate limit storm, and that retrying on the wrong error codes will just make your problems worse, faster.

My first implementation looked roughly like this:

import time
import random

import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

def call_llm_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text
        except Exception as e:  # <-- here's the problem
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)

Looks fine, right? The problem is that broad except Exception. I was retrying on context length errors (HTTP 400) — deterministic failures where no amount of waiting fixes a prompt that's 2,000 tokens over the limit. I was also retrying on content policy rejections. And on malformed JSON responses from my own parsing layer, which weren't even API errors.

After a particularly bad Friday afternoon deploy where this pattern caused a cascade of 400 errors that chewed through our rate limit budget retrying requests that were never going to succeed, I got specific:

import anthropic
import time
import random
import logging

client = anthropic.Anthropic()

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 529}

def call_llm_with_retry(prompt: str, max_retries: int = 4) -> str:
    last_exception = None

    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text

        except anthropic.RateLimitError as e:
            # 429 — back off hard, respect the Retry-After header if present
            retry_after = e.response.headers.get("retry-after")
            wait = float(retry_after) if retry_after else (2 ** attempt) * 2 + random.uniform(0, 2)
            logging.warning(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
            last_exception = e

        except anthropic.APIStatusError as e:
            if e.status_code not in RETRYABLE_STATUS_CODES:
                # 400, 401, 403, 404 — these won't get better with retries
                logging.error(f"Non-retryable API error {e.status_code}: {e.message}")
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
            last_exception = e

        except anthropic.APIConnectionError as e:
            # Network issues — retry with backoff
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
            last_exception = e

    raise last_exception

The separation between retryable and non-retryable errors cut our wasted API spend by about 30% in the first week. Not because we were hitting tons of 400s — we weren't — but because when we did, they were expensive ones (long prompts) and we were burning budget retrying them six times.

Practical takeaway: type your exceptions. If you're using Anthropic's SDK, anthropic.RateLimitError, anthropic.APIStatusError, and anthropic.APIConnectionError are distinct and should be handled differently. Same pattern applies to OpenAI's SDK.

The Cost Math Will Surprise You

I thought I had a handle on costs. Did input/output token estimates, built a little calculator, felt confident. Then I saw the actual bill.

The issue wasn't the per-token cost. It was everything I hadn't accounted for: tokens burned on retries, on failed generations, on the system prompt I was including on every single request even when most of those requests didn't need the full context. My system prompt was 847 tokens. Across 10,000 requests, that's 8.47 million tokens of input just for boilerplate.

So I started being deliberate about prompt architecture. Short context = shorter system prompt. A simple classification task doesn't need the five-paragraph system prompt I wrote for open-ended generation. I built a prompt registry — nothing fancy, just a dict of prompt templates keyed by task type — and matched prompt complexity to task complexity.
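The registry really is just a dict. A minimal sketch of the idea — the task names and template text here are illustrative, not the actual prompts from my pipeline:

```python
# Minimal prompt registry: match prompt length to task complexity.
# Task names and templates are hypothetical examples.
PROMPT_REGISTRY = {
    "classify": "Label the input as one of: billing, technical, other. Reply with the label only.",
    "summarize": "Summarize the input in 2-3 sentences, preserving names and dates.",
}

def system_prompt_for(task: str) -> str:
    """Look up the template for a task; fail loudly on unknown tasks."""
    if task not in PROMPT_REGISTRY:
        raise KeyError(f"no prompt registered for task: {task}")
    return PROMPT_REGISTRY[task]
```

Failing loudly on an unregistered task matters more than it looks: a silent fallback to some default prompt is exactly the kind of quiet wrongness this post is about.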

One thing I noticed that genuinely surprised me: batch processing doesn't just save money on some APIs, it changes your failure mode profile entirely. With synchronous requests, latency spikes cause timeouts and downstream failures. With batch, the failure shows up hours later when you check results. Both are annoying; they're annoying in different ways, on different schedules. For our async summarization jobs, batch made sense. For anything user-facing, obviously not.

Also: watch your output token limits. I was setting max_tokens=4096 on everything out of habit. The model doesn't charge you for tokens it doesn't use, but it holds a connection open while generating. For tasks that reliably produce short outputs, tighter limits improve throughput and catch runaway generations early.
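What this looks like in practice is a per-task budget instead of a blanket max_tokens=4096. The numbers below are illustrative — tune them from your own stop_reason data:

```python
# Per-task output budgets instead of one habitual max_tokens value.
# These numbers are hypothetical; derive yours from observed output lengths.
MAX_TOKENS_BY_TASK = {
    "classify": 16,     # a label, nothing more
    "extract": 512,
    "summarize": 1024,
}
DEFAULT_MAX_TOKENS = 1024

def max_tokens_for(task: str) -> int:
    return MAX_TOKENS_BY_TASK.get(task, DEFAULT_MAX_TOKENS)
```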

Observability: What I Actually Watch

Before shipping, I imagined needing detailed traces of every reasoning step. What I actually monitor day-to-day is much simpler and more boring.

The signals that matter in my setup:

  • Generation latency (p50, p95, p99) — p95 being more than 3x p50 usually means something weird is happening upstream, or my prompts have gotten inconsistent
  • Token count per request — sudden spikes here mean prompt injection or a bug in my context-building logic
  • Stop reason distribution — if stop_reason: "max_tokens" climbs above ~5%, something is wrong with my output length assumptions
  • Error rate by error type — separated by retryable vs. non-retryable, as above
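The stop-reason signal in particular is cheap to track. A sketch of the counter I keep in-process (the 5% threshold is the heuristic from the list above, not a universal constant):

```python
from collections import Counter

class StopReasonMonitor:
    """Track the stop_reason distribution; flag when max_tokens climbs too high."""

    def __init__(self, max_tokens_threshold: float = 0.05):
        self.counts = Counter()
        self.threshold = max_tokens_threshold

    def record(self, stop_reason: str) -> None:
        self.counts[stop_reason] += 1

    def max_tokens_rate(self) -> float:
        total = sum(self.counts.values())
        return self.counts["max_tokens"] / total if total else 0.0

    def is_unhealthy(self) -> bool:
        # above ~5% truncated outputs, output-length assumptions are wrong
        return self.max_tokens_rate() > self.threshold
```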

Maybe at scale I'd need semantic similarity metrics, hallucination detection, all of that. But for 10K generations a month, the operational signals told me more about what was actually broken than anything in the "LLM observability" category.

The one exception: I log a random 1% sample of full prompt+response pairs to a separate store for offline review. This has caught more real bugs than any automated metric — things like my context truncation cutting off mid-sentence; a template interpolation bug that put {customer_name} literally in the prompt for about 200 requests before I noticed; and a system prompt that was accidentally instructing the model to respond in Spanish for reasons I still haven't fully traced. (I pushed a fix before finding the root cause. Bad habit. But it's fixed.)
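The sampling itself is a few lines. A minimal sketch — `sink` stands in for whatever separate store you write to, and the function name is mine, not from any library:

```python
import json
import random
import time

def maybe_log_sample(prompt: str, response: str, rate: float = 0.01,
                     sink=None, rng=random) -> bool:
    """Write a random `rate` fraction of prompt/response pairs for offline review.

    `sink` is any file-like object; in production it should be a separate
    store, not the main application log.
    """
    if rng.random() >= rate:
        return False
    record = {"ts": time.time(), "prompt": prompt, "response": response}
    if sink is not None:
        sink.write(json.dumps(record) + "\n")
    return True
```

Passing `rng` in makes the sampling decision testable; everything else is just a JSON line per sampled pair.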

Right, so — the self-hosted Mistral experiment. I spent three weeks running a quantized Mistral 7B instance on a rented A100, convinced I'd save money and gain latency control. The latency was fine. The output quality on my specific tasks (structured extraction from messy text) was noticeably worse, and the operational overhead of managing the inference server ate most of the cost savings. Your mileage may vary if you have a team with MLOps experience; I don't, really. We're primarily a web backend shop and it showed. Took it down in week four.

Structured Output Is More Fragile Than the Demos Suggest

Getting JSON out of a language model reliably was the part I underestimated most. The demos always work. Production does not.

Even with JSON mode or tool use, you need to handle partial outputs, schema mismatches, and the question of what to do when validation fails. I went through three iterations:

  1. Prompt-only JSON extraction — about 92% success rate, which sounds okay until you realize 8% silent failures is catastrophic at scale
  2. JSON mode with response_format: {type: "json_object"} — better, but this only enforces valid JSON, not your schema
  3. Tool use / function calling with strict schema — this is where I landed, and it's genuinely better, though you pay for it in prompt complexity

Even with strict tool use, I see maybe 1-2% of responses where the model technically calls the tool but fills optional fields with placeholder values ("N/A", "unknown", empty strings) instead of omitting them. I validate against a Pydantic model post-extraction and route those to a dead letter queue for human review rather than silently accepting them.

The dead letter queue was one of the better decisions I made. It gave me a place for "I'm not sure what to do with this" responses that wasn't "crash" or "silently corrupt the database." About 200 of those 847 initial failures would have been catchable with this pattern.
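The routing decision is simple enough to sketch. My real pipeline validates with Pydantic; this stdlib-only version shows just the placeholder check and the DLQ routing, with hypothetical field names:

```python
# Route responses whose optional fields are filled with placeholder values
# to a dead letter queue instead of silently accepting them.
PLACEHOLDERS = {"n/a", "unknown", "none", ""}

def has_placeholder_values(extracted: dict, optional_fields: list) -> bool:
    for field in optional_fields:
        value = extracted.get(field)
        if isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
            return True
    return False

def route(extracted: dict, optional_fields: list, dead_letter_queue: list) -> bool:
    """Return True if accepted; otherwise push to the DLQ for human review."""
    if has_placeholder_values(extracted, optional_fields):
        dead_letter_queue.append(extracted)
        return False
    return True
```

In production the DLQ is a real queue or table, not a list, but the shape of the decision is the same: accept, or set aside for a human — never silently coerce.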

What I'd Actually Recommend

If you're starting from scratch, my honest suggestion is to just use the managed APIs. I know the "self-host for control" argument is appealing — I made it to myself for three weeks before the Mistral experiment cured me of it. Unless you have a hard data residency requirement or you're processing volumes where the math definitively works out, the operational cost is real and it doesn't show up in the GPU rental price.

Error handling before observability. Seriously, in that order. One well-typed exception handler is worth more than a week of dashboards. Know which errors are retryable, retry those, raise the rest immediately. You cannot dashboard your way out of code that silently fails.

A dead letter queue — build it on day one, not when you're scrambling to understand your failure modes at day fifty. Every generation pipeline has some percentage of responses that don't fit the happy path. "Fail loudly and queue for human review" is so much better than "silently accept garbage and discover the problem during a production incident."

Log a random sample of prompts and responses — 1% to a separate store, or 0.1% at higher volumes. Not all of them; storage costs add up fast. But that sample will surface things no metric catches. The {customer_name} bug I mentioned? Found via sampling, not alerting.

And keep your prompts in version control. Treat prompt changes like code changes. I have a prompts/ directory, everything is versioned, and significant changes go through the same review process as code. I still see teams treating prompts as configuration rather than code, and they discover why that's a mistake when something breaks and they can't tell what changed.

The thing I keep coming back to: the hard part of AI pipelines isn't the AI. It's the same distributed systems problems — queueing, retries, schema validation, observability — just with a new failure mode where the output looks plausible even when it's completely wrong. That last part is what makes it genuinely harder than it sounds. A network timeout is obvious. A response that passes JSON validation but returns the wrong answer is not. Once I started treating it as a distributed systems problem with an extra validation layer, things got clearer. Not easy. Just clearer.
