Puneet Gupta

Posted on • Originally published at pg-blogs.netlify.app

Building Reliable LLM Applications in Python

Introduction

Calling an LLM API is easy. Building an application on top of one that is reliable — that fails predictably, doesn't hallucinate its way into wrong answers, and doesn't surprise you with a bill — is a real engineering discipline.

The core mindset shift: treat model output as a hypothesis to verify, not a fact to trust. This post covers the practices that make Python LLM applications production-grade, using Anthropic's Claude and the official anthropic SDK.


Pick the Right Model for the Task

Model choice is a decision, not a default. Match the tier to the difficulty of the task:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hard reasoning / agentic work → the strongest model
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    messages=[{"role": "user", "content": "..."}],
)

# High-volume, cost-sensitive classification → a cheaper capable model
cheap = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Classify sentiment: ..."}],
)

Never run an expensive model where a cheap one suffices, and never under-provision where quality matters. Cost and latency are features — track them.
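
One way to act on "track them" is to record token usage and wall-clock latency for every call. The sketch below is a minimal, standard-library illustration. `CallRecord`, `estimate_cost`, `summarize`, the model labels, and the prices in `PRICES_PER_MTOK` are all hypothetical placeholders rather than real pricing, so substitute the figures from your provider's current price list.

```python
from dataclasses import dataclass

# Hypothetical (input, output) prices in USD per million tokens.
# Placeholders for illustration only, not real pricing.
PRICES_PER_MTOK = {
    "big-model": (10.0, 50.0),
    "small-model": (1.0, 5.0),
}

@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float

def estimate_cost(record: CallRecord) -> float:
    """Estimate the USD cost of one call from its token counts."""
    price_in, price_out = PRICES_PER_MTOK[record.model]
    return (record.input_tokens * price_in + record.output_tokens * price_out) / 1_000_000

def summarize(records: list[CallRecord]) -> dict[str, float]:
    """Total estimated cost and worst-case latency across a batch of calls."""
    return {
        "total_cost_usd": sum(estimate_cost(r) for r in records),
        "max_latency_s": max((r.latency_s for r in records), default=0.0),
    }
```

In practice the token counts would come from `response.usage.input_tokens` and `response.usage.output_tokens`, and the latency from timing the call with `time.perf_counter()`.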


Get Structured Output — Don't Parse Prose

The single biggest source of fragility in LLM apps is scraping structured data out of free-form text. Prefer typed outputs validated against a schema. With the Anthropic SDK, messages.parse() validates the response against a Pydantic model for you:

from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total: float
    due_date: str

response = client.messages.parse(
    model="claude-opus-4-8",
    max_tokens=1024,
    messages=[{"role": "user", "content": f"Extract invoice fields:\n{raw_text}"}],
    output_format=Invoice,
)

invoice = response.parsed_output   # a validated Invoice instance
print(invoice.total)               # a float, guaranteed — no regex, no json.loads

The validation happens against a schema the model is constrained to, so you get a typed object instead of a string you have to hope parses. Structured output turns "the model usually returns JSON" into "the model returns this shape."
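
A schema guarantees shape, not correctness: a well-typed `total` can still be negative, and `due_date` is only a string. In keeping with treating output as a hypothesis, a second pass of plain-Python checks can catch that. The sketch below is illustrative only. `check_invoice` and its rules (non-empty vendor, non-negative total, ISO-formatted date) are hypothetical examples, not part of the SDK, and the function takes the three field values directly so that it runs without Pydantic.

```python
from datetime import date

def check_invoice(vendor: str, total: float, due_date: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the values look sane."""
    problems = []
    if not vendor.strip():
        problems.append("vendor is empty")
    if total < 0:
        problems.append(f"total is negative: {total}")
    try:
        date.fromisoformat(due_date)  # accepts ISO dates such as 2025-01-31
    except ValueError:
        problems.append(f"due_date is not an ISO date: {due_date!r}")
    return problems
```

It could be called as `check_invoice(invoice.vendor, invoice.total, invoice.due_date)`, with a non-empty result routed to a retry or to human review rather than silently accepted.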


Ground the Model — Don't Let It Hallucinate

An LLM will confidently invent facts. For anything that must be correct, give the model the source material and instruct it to answer only from that material — retrieval-augmented generation (RAG) in its simplest form:

prompt = f"""Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know."

<context>
{retrieved_documents}
</context>

Question: {user_question}"""

Two things make this reliable: the explicit "only from context" instruction, and an explicit escape hatch ("say I don't know") so the model isn't pressured to fabricate. Then cite — have the model point at which passage it used, so a human can verify.
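
Citations are most useful when code can check them before a human does. The sketch below is one possible approach, not a prescribed one: it numbers the passages in the prompt, asks for a verbatim quote, and then confirms the quote really occurs in the cited passage. `build_grounded_prompt`, `quote_is_supported`, and the extra "quote verbatim" instruction are assumptions added for illustration.

```python
def build_grounded_prompt(passages: list[str], question: str) -> str:
    """Number each passage so the model can cite it as [1], [2], ..."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer the question using ONLY the context below.\n"
        'If the answer is not in the context, say "I don\'t know."\n'
        "Quote the supporting sentence verbatim and give its passage number.\n\n"
        f"<context>\n{numbered}\n</context>\n\n"
        f"Question: {question}"
    )

def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case so trivial differences don't fail the check
    return " ".join(text.split()).lower()

def quote_is_supported(quote: str, passage_number: int, passages: list[str]) -> bool:
    """Check that the quoted text really appears in the passage the model cited."""
    if not 1 <= passage_number <= len(passages):
        return False
    needle = _normalize(quote)
    return bool(needle) and needle in _normalize(passages[passage_number - 1])
```

A substring match is deliberately strict: a paraphrased "quote" fails, which is the point when the goal is something a reviewer can verify at a glance.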


Handle the Unhappy Path

Networks fail and rate limits happen. The Anthropic SDK already retries transient errors (429, 5xx, connection errors) with exponential backoff — configure it rather than reinventing it:

client = anthropic.Anthropic(max_retries=4, timeout=30.0)

Catch the specific exceptions and branch on retryable vs. terminal:

try:
    response = client.messages.create(...)
except anthropic.RateLimitError as exc:
    retry_after = int(exc.response.headers.get("retry-after", "60"))
    # back off and retry
except anthropic.BadRequestError:
    # a 400 is a bug in our request — do NOT retry, fix the payload
    raise

For any operation with side effects (charging a card, sending an email based on a model decision), make it idempotent — the model, or a retry, may trigger the same action twice.
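
A common way to get idempotency is to derive a key from the action and its payload and refuse to run the same key twice. The sketch below is a minimal in-memory illustration under that assumption. `IdempotentExecutor` and its methods are hypothetical names, and a real system would need durable, shared storage (and locking) for the guarantee to hold across processes and restarts.

```python
import hashlib
import json

class IdempotentExecutor:
    """Run each distinct (action, payload) pair at most once and replay its result."""

    def __init__(self):
        # In-memory store for illustration only; not durable and not thread-safe.
        self._results = {}

    @staticmethod
    def key_for(action: str, payload: dict) -> str:
        # sort_keys makes the key independent of dict ordering
        canonical = json.dumps({"action": action, "payload": payload}, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def run(self, action: str, payload: dict, fn):
        key = self.key_for(action, payload)
        if key not in self._results:
            self._results[key] = fn(payload)  # first time: perform the side effect
        return self._results[key]             # repeats: return the stored result
```

For this to be safe, the payload should carry something that identifies the intent, such as an order ID. Otherwise two legitimate, identical actions would be collapsed into one.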


Put Control Flow in Code, Judgment in the Model

Use the model for judgment; use code for bookkeeping. Loops, branching, and fan-out belong in deterministic Python — not in a prompt asking the model to "keep going until done." For agentic tasks with tools, drive the loop yourself so you can intercept, validate, and log each tool call:

messages = [{"role": "user", "content": user_input}]
while True:
    response = client.messages.create(
        model="claude-opus-4-8", max_tokens=4096,
        tools=tools, messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # end_turn, max_tokens, etc.: no tool call to run, so stop looping
    messages.append({"role": "assistant", "content": response.content})

    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = execute_tool(block.name, block.input)   # YOUR validated code
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
    messages.append({"role": "user", "content": tool_results})

The model decides what to do; your code decides whether it's allowed and records what happened.
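
What "decides whether it's allowed" might look like in practice: the sketch below fills in a possible `execute_tool` with an allowlist, an argument check, and logging. The registry layout and the `get_weather` stand-in are hypothetical, chosen only to keep the example self-contained.

```python
import logging

logger = logging.getLogger("tools")

def get_weather(city: str) -> str:
    # Stand-in implementation for illustration
    return f"Sunny in {city}"

# Allowlist: tool name -> (callable, required argument names)
TOOL_REGISTRY = {
    "get_weather": (get_weather, {"city"}),
}

def execute_tool(name: str, tool_input: dict) -> str:
    """Run an allowlisted tool after checking its arguments; errors come back as text."""
    if name not in TOOL_REGISTRY:
        logger.warning("rejected unknown tool %r", name)
        return f"Error: unknown tool {name!r}"
    fn, required = TOOL_REGISTRY[name]
    if set(tool_input) != required:
        logger.warning("rejected %s with arguments %s", name, sorted(tool_input))
        return f"Error: {name} expects exactly these arguments: {sorted(required)}"
    logger.info("running %s(%s)", name, tool_input)
    try:
        return str(fn(**tool_input))
    except Exception as exc:  # report the failure to the model instead of crashing the loop
        logger.exception("tool %s failed", name)
        return f"Error: {name} failed: {exc}"
```

Returning the error as the tool result lets the model see what went wrong and try something else, while the log keeps a record of every accepted and rejected call.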


Evaluate Output Like Any Other Untrusted Input

You wouldn't ship a function without tests. Don't ship a prompt without evals. Build a small dataset of representative inputs with known-good outputs, and score the model against it whenever you change a prompt or model:

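import math

def evaluate(cases: list[dict]) -> float:
    """Return the fraction of cases whose extracted total matches the expected one."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = 0
    for case in cases:
        out = extract_invoice(case["input"])  # the function under test
        # Compare floats with a tolerance rather than ==
        if math.isclose(out.total, case["expected_total"], abs_tol=0.005):
            passed += 1
    return passed / len(cases)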

Evals catch the regression where a prompt tweak that helped one case quietly broke ten others — the LLM equivalent of a failing unit test.
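
An aggregate pass rate can hide exactly that kind of regression, since one fix and one break cancel out. A per-case diff between two runs makes it visible. The sketch below assumes each run is stored as a mapping from a case ID to pass/fail; `compare_runs` and the case IDs are hypothetical.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Diff per-case pass/fail results between two eval runs over the same cases."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
    }

# Both runs pass 2 of 3 cases, yet the candidate prompt broke inv-2.
old_prompt = {"inv-1": True, "inv-2": True, "inv-3": False}
new_prompt = {"inv-1": True, "inv-2": False, "inv-3": True}
report = compare_runs(old_prompt, new_prompt)
```

Here `report["regressed"]` names the case to inspect before shipping the new prompt, even though the overall score did not move.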


Cache Repeated Context to Cut Cost and Latency

If many requests share a large fixed prefix — a system prompt, a big document, few-shot examples — prompt caching serves that prefix at a fraction of the price and latency:

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": large_shared_context,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": question}],
)
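# Verify it worked: the first call writes the cache, later calls should read it.
print(response.usage.cache_creation_input_tokens)  # >0 on the call that wrote the cache
print(response.usage.cache_read_input_tokens)      # >0 means the cache was hit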

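Caching is a prefix match: keep the stable content first and put anything that varies per request (timestamps, the user's question) after it. If cache_read_input_tokens stays zero across repeated calls, the usual causes are something volatile invalidating the prefix, a prefix shorter than the model's minimum cacheable length, or calls spaced further apart than the cache lifetime.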
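
One way to hunt for a volatile prefix is to fingerprint it on every request and log the value: if the fingerprint changes between calls that should share a cache entry, something per-request has leaked in. The sketch below is a debugging aid under that assumption. `prefix_fingerprint` and `build_system` are hypothetical helpers, and the check covers only the system blocks, not tool definitions, which also sit ahead of the messages.

```python
import hashlib
import json

def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """Hash the system blocks; identical prefixes give identical fingerprints."""
    canonical = json.dumps(system_blocks, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def build_system(shared_context: str) -> list[dict]:
    """Stable content only; per-request values belong in the user message."""
    return [{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral"},
    }]

stable_a = prefix_fingerprint(build_system("POLICY TEXT"))
stable_b = prefix_fingerprint(build_system("POLICY TEXT"))
# A timestamp at the front of the prefix changes the fingerprint on every request
leaky = prefix_fingerprint(build_system("Now: 2025-01-01T00:00:01\nPOLICY TEXT"))
```

Logging the fingerprint next to `cache_read_input_tokens` makes it straightforward to tell a changed prefix apart from the other reasons a read can come back as zero.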


Practical Checklist

| Practice | Why it matters |
|---|---|
| Match model tier to task difficulty | Don't overpay or under-provision |
| Use structured outputs, not prose parsing | Typed data, no brittle regex |
| Ground answers in provided context + cite | Curbs hallucination |
| Configure SDK retries; branch on error type | Survive transient failures, fail fast on bugs |
| Make side-effecting actions idempotent | Retries and re-decisions are safe |
| Control flow in code, judgment in the model | Deterministic, debuggable |
| Keep an eval set; score on every change | Catch prompt/model regressions |
| Cache large shared prefixes | Lower cost and latency |
| Never send secrets/PII you don't need to | Anything sent externally may be retained |

Final Thoughts

Reliable LLM applications aren't built by finding the perfect prompt. They're built with the same engineering discipline as any other system: strong typing at the boundary, verification of untrusted output, deterministic control flow, graceful failure handling, and measurable evaluation.

The model provides judgment. Everything around it — the structure, the checks, the guardrails — is your job. Get that right, and the LLM becomes a dependable component instead of a source of surprises.
