DEV Community: Wade Allen

Catch LLM hallucinations with multi-model consensus

Wade Allen — Mon, 22 Jun 2026 15:30:02 +0000

A single model gives you a single point of failure: when it's confidently wrong, you get no signal that it's wrong. A cheap, surprisingly effective guard is to put the same question to a few independent models and use their agreement as a confidence signal.

The idea: fan the question out concurrently, then rank the answers by how much they agree with each other. When the models converge, you can place more trust in the answer. When they diverge, you flag it for review instead of shipping a guess.

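import asyncio
from difflib import SequenceMatcher
from pydantic_ai import Agent

MODELS = ['anthropic:claude-sonnet-4-6', 'openai:gpt-4.1']

async def answer(model: str, q: str) -> str:
    # One agent per model; .output holds the model's text reply
    return (await Agent(model).run(q)).output

async def ask(q: str) -> dict:
    # Fan the question out to every model concurrently
    answers = await asyncio.gather(*(answer(m, q) for m in MODELS))

    # agreement = mean similarity of answer i to every other answer
    def agree(i: int) -> float:
        # Exclude by index, not identity, so identical answers still count
        others = [b for j, b in enumerate(answers) if j != i]
        if not others:  # only one model configured: nothing to compare
            return 0.0
        return sum(SequenceMatcher(None, answers[i], b).ratio() for b in others) / len(others)

    best = max(range(len(answers)), key=agree)
    return {'answer': answers[best], 'agreement': round(agree(best), 2)}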

If agreement is high, the models independently reached the same place — a strong signal. If it's low, you've caught a disagreement before it reached a user, and you can route to a human, ask a tie-breaker model, or return "uncertain".
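One way to act on that number is a small gate in front of the response. The sketch below is an assumption about how you might wire it: the gate function and the 0.8 threshold are illustrative, not part of the code above, and the right cutoff depends on the similarity measure you use.

```python
def gate(result: dict, threshold: float = 0.8) -> dict:
    """Pass the answer through when agreement is high; otherwise flag it.

    `result` has the shape returned by ask(): {'answer': str, 'agreement': float}.
    The threshold is a placeholder to tune against your own traffic.
    """
    if result['agreement'] >= threshold:
        return {'status': 'ok', 'answer': result['answer']}
    # Low agreement: withhold the answer and hand off for review
    return {'status': 'uncertain', 'answer': None, 'agreement': result['agreement']}
```

The 'uncertain' branch is where a human queue or a tie-breaker model would plug in.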

You can make the similarity check smarter (embeddings instead of difflib), add a local model as a third vote, or weight by each model's self-reported confidence. But even this 20-line version turns "hope the model is right" into a measurable number you can gate on. Agreement isn't proof of truth — but disagreement is a reliable smoke alarm, and that's most of the value.

I package patterns like this as small open-source pydantic-ai + FastAPI templates — the repos are on GitHub, and complete, ready-to-run versions are on Gumroad. Feedback and issues welcome.

Route Every Prompt to the Cheapest Model: Building a Multi-LLM Cost Optimizer with Pydantic AI

Wade Allen — Mon, 22 Jun 2026 13:01:24 +0000

The Problem: Every Prompt Costs Money, But Not Every Prompt Needs GPT-4

You're running an AI system in production. Some requests need Claude's reasoning depth. Others are simple classification tasks that Groq can handle in milliseconds for a fraction of the cost. The trap most teams fall into: they pick one model and stick with it.

Here's the math that breaks you. Your engineering team sends a 50-token prompt asking "Is this email spam?" to GPT-4o. Cost: ~$0.015. Groq does the same thing for ~$0.0001. Scale that to 100k prompts per day across your product and, at those per-prompt figures, the gap is 100,000 × ($0.015 - $0.0001), or roughly $1,490 per day (about $45,000 per month), spent on decisions that don't need advanced reasoning.

The other trap: building classification logic by hand. You write heuristics to detect "simple" vs. "complex" prompts. You maintain these rules. Six months later, edge cases pile up. You're debugging prompts that should have routed to Claude but hit GPT-4o instead. Now you're paying for wrong answers.

The real problem isn't picking models. It's routing each individual prompt to the minimum sufficient model without building fragile classification logic. You need infrastructure that learns what "complex" means from actual patterns in your data, then makes routing decisions automatically.

The Approach: Structured Classification + Intelligent Routing

The solution uses three layers working together:

Layer 1: Complexity Classification - Instead of writing rules, you use a lightweight model (Claude Haiku) to analyze each incoming prompt and assign a complexity score. Haiku is cheap enough that the classification cost is negligible compared to the routing savings. It outputs structured JSON describing the prompt's requirements: reasoning depth needed, external knowledge, multi-step logic, etc.

Layer 2: Model Selection - Based on the classification output, you select the cheapest model that meets the requirements. This is a simple lookup table initially, but it becomes data-driven. You track which models succeed at which complexity levels, then optimize the routing rules over time.

Layer 3: Execution + Tracking - You execute the prompt on the selected model, track the cost, track the latency, and feed that back into your cost database. This becomes your ground truth for future optimizations.
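To make the "lookup table that becomes data-driven" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the model names are placeholders, the stats live in memory instead of a database, and the 0.9 success bar and 20-sample minimum are arbitrary starting points.

```python
from collections import defaultdict

# Candidate models per complexity level, cheapest first (names are placeholders)
CANDIDATES = {
    "simple": ["small-model", "mid-model"],
    "complex": ["mid-model", "large-model"],
}

# (complexity, model) -> [successes, attempts], updated from execution logs
stats: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])

def record(complexity: str, model: str, success: bool) -> None:
    """Feed one execution outcome back into the routing stats."""
    entry = stats[(complexity, model)]
    entry[0] += int(success)
    entry[1] += 1

def pick_model(complexity: str, min_success: float = 0.9, min_samples: int = 20) -> str:
    """Return the cheapest candidate whose observed success rate clears the bar."""
    candidates = CANDIDATES[complexity]
    for model in candidates:
        ok, total = stats[(complexity, model)]
        # Too little data: give the cheaper model the benefit of the doubt
        if total < min_samples or ok / total >= min_success:
            return model
    # Nothing clears the bar: fall back to the most capable candidate
    return candidates[-1]
```

How you define "success" (a validation pass, a user rating, no retry) is the real design decision; the routing itself stays a few lines.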

Why pydantic-ai over alternatives? LangChain is flexible but opaque. Building with plain requests/openai libraries means you're implementing routing logic yourself. pydantic-ai's structured outputs are type-safe, they integrate cleanly with Pydantic models for validation, and the agent-based pattern maps well to this workflow: an agent evaluates complexity, then plain deterministic code routes and executes.

The key design decision: make the complexity classification itself cheap and fast. If it costs more to classify a prompt than to just run it on the cheapest model, the system fails. By using Haiku for classification and caching classification patterns, you ensure that the overhead is minimal.
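A minimal sketch of the caching half of that idea, assuming an exact-match cache keyed on a normalised prompt. The function names and the in-memory dict are hypothetical; an exact-match cache only helps when prompts repeat, so treat it as a first step, not the whole of "caching classification patterns".

```python
import hashlib

_cache: dict[str, str] = {}

def cache_key(prompt: str) -> str:
    # Normalise case and whitespace so trivially different prompts share a key
    normalised = " ".join(prompt.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def classify_cached(prompt: str, classify) -> str:
    """Return a cached complexity label, calling `classify` only on a miss."""
    key = cache_key(prompt)
    if key not in _cache:
        _cache[key] = classify(prompt)
    return _cache[key]
```

Here `classify` stands in for whatever calls the classifier model; in a real service the dict would likely be Redis or a database table with an expiry.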

The Central Code Pattern: Classification-Driven Routing

Here's the pattern that makes this work:

from pydantic import BaseModel
from pydantic_ai import Agent
from enum import Enum
import litellm

class ComplexityLevel(str, Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"

class PromptAnalysis(BaseModel):
    complexity: ComplexityLevel
    reasoning_required: bool
    external_knowledge_needed: bool
    estimated_tokens: int
    reasoning: str

# Agent 1: Classify complexity
classifier_agent = Agent(
    'anthropic:claude-3-5-haiku-20241022',  # provider-prefixed name, as pydantic-ai expects
    output_type=PromptAnalysis,  # called result_type in early pydantic-ai releases
)

@classifier_agent.system_prompt
def classifier_system():
    return """You are a prompt complexity analyzer. Classify the incoming prompt.
    Return structured JSON with complexity level and reasoning."""

# Routing map: complexity -> list of litellm model names (ordered by cost)
MODEL_ROUTING = {
    ComplexityLevel.SIMPLE: ['groq/llama-3.1-8b', 'gpt-4o-mini', 'claude-3-5-haiku-20241022'],
    ComplexityLevel.MODERATE: ['gpt-4o-mini', 'claude-3-5-sonnet-20241022'],
    ComplexityLevel.COMPLEX: ['claude-3-5-sonnet-20241022', 'gpt-4-turbo'],
}

async def classify_and_route(user_prompt: str) -> tuple[PromptAnalysis, str]:
    """Classify prompt, select model, return both."""

    # Step 1: Classify (result.output is a validated PromptAnalysis)
    result = await classifier_agent.run(user_prompt)
    analysis = result.output

    # Step 2: Select model from routing map
    candidate_models = MODEL_ROUTING[analysis.complexity]
    selected_model = candidate_models[0]  # Cheapest first

    # Step 3: Log for cost tracking
    # (log_routing_decision and estimate_cost are assumed to be defined elsewhere)
    log_routing_decision(
        prompt=user_prompt,
        complexity=analysis.complexity,
        selected_model=selected_model,
        estimated_cost=estimate_cost(selected_model, analysis.estimated_tokens)
    )

    return analysis, selected_model

async def execute_with_routing(user_prompt: str) -> dict:
    """Full pipeline: classify, route, execute."""

    analysis, model = await classify_and_route(user_prompt)

    # Execute on selected model using litellm for unified interface.
    # acompletion is the async variant; the sync completion() would block the event loop.
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
    )

    # Track actual cost
    log_execution(
        model=model,
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        actual_cost=calculate_actual_cost(model, response.usage)
    )

    return {
        "analysis": analysis,
        "selected_model": model,
        "response": response.choices[0].message.content
    }

What each part does: The classifier agent uses Haiku (cheap) to produce structured PromptAnalysis output. The routing map is the decision tree: given complexity, pick the cheapest model that can handle it. The execution step uses litellm as a unified interface to multiple providers (Claude, GPT-4o, Groq all speak the same API through litellm). The logging is critical: without it, you can't optimize the routing over time.

Why this pattern works: it separates concerns cleanly. Classification logic is isolated from routing logic. Model selection is deterministic and auditable. Cost tracking is built in from the start, not bolted on later.
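The cost helpers are referenced above but not shown. Here is one hedged way they could look, assuming a static price table. The prices and model names are placeholders, not real quotes, and calculate_actual_cost here takes token counts directly, which differs from the usage-object call above. litellm also ships a completion_cost helper that may cover the post-call half for you.

```python
# Placeholder prices in USD per 1M tokens (input, output): illustrative, not real quotes
PRICES = {
    "cheap-model": (0.10, 0.40),
    "strong-model": (3.00, 15.00),
}

def estimate_cost(model: str, estimated_tokens: int) -> float:
    """Rough pre-call estimate: price every token at the input rate."""
    input_price, _ = PRICES[model]
    return estimated_tokens * input_price / 1_000_000

def calculate_actual_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Post-call cost from the token counts the provider reported."""
    input_price, output_price = PRICES[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000
```

Keeping the table in one place means a pricing change is a one-line update, which matters because the routing order depends on it.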

Integration: Making It Real with FastAPI

Here's where this lives in production:

from datetime import datetime

from fastapi import FastAPI, HTTPException
from fastapi.responses import JSONResponse

app = FastAPI()

@app.post("/ask")
async def ask_with_routing(request: dict) -> JSONResponse:
    """Main endpoint: user sends a prompt, gets routed response."""

    user_prompt = request.get("prompt")
    if not user_prompt:
        # Raised outside the try block so it is not rewrapped as a 500
        raise HTTPException(status_code=400, detail="prompt required")

    try:
        result = await execute_with_routing(user_prompt)
    except Exception as e:
        log_error(e, user_prompt)
        raise HTTPException(status_code=500, detail=str(e))

    # Return response + metadata for frontend
    return JSONResponse({
        "response": result["response"],
        "metadata": {
            "complexity": result["analysis"].complexity,
            "model_used": result["selected_model"],
            "estimated_cost": result["analysis"].estimated_tokens * 0.00001,  # approximate
        }
    })

@app.get("/costs/today")
async def get_daily_costs() -> dict:
    """Real-time cost tracking endpoint."""

    costs = fetch_costs_from_db(date=datetime.now().date())

    return {
        "total_spend": costs['total'],
        "by_model": costs['by_model'],
        "by_complexity": costs['by_complexity'],
        # Savings = what always-GPT-4o would have cost minus what was actually spent
        "savings_vs_always_gpt4o": (costs['total_prompts'] * 0.015) - costs['total']
    }

The data flow: user sends a prompt to /ask. FastAPI receives it, calls classify_and_route, executes on the selected model, logs everything to a database (PostgreSQL or similar), returns the response with metadata. A background worker periodically aggregates costs. The /costs/today endpoint gives real-time visibility into spending patterns.

One gotcha worth knowing: litellm does not lift provider rate limits; each provider still enforces its own. If you route 1000 simple prompts to Groq simultaneously, you'll hit their API limits. You need a queue (Bull, Celery, or just asyncio.Queue) in front of execution to throttle by provider, not just by request count. Without it, your routing optimizer works great until you scale, then you're retrying failed requests that now route to expensive fallback models.
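A lightweight way to throttle by provider inside a single process is one asyncio.Semaphore per provider. This is a sketch under assumptions: the limits are placeholder numbers, not real quotas, and the provider is guessed from litellm-style "provider/model" names. It caps concurrency only; a true requests-per-minute limit needs a token bucket or an external queue.

```python
import asyncio

# Max in-flight requests per provider: placeholder numbers, not real quotas
LIMITS = {"groq": 2, "default": 5}
_semaphores: dict[str, asyncio.Semaphore] = {}

def provider_of(model: str) -> str:
    # Assumes "provider/model" names; bare names share one "default" bucket
    return model.split("/", 1)[0] if "/" in model else "default"

async def throttled(model: str, call):
    """Run `call()` while holding the semaphore for the model's provider."""
    provider = provider_of(model)
    limit = LIMITS.get(provider, LIMITS["default"])
    sem = _semaphores.setdefault(provider, asyncio.Semaphore(limit))
    async with sem:
        return await call()
```

In the pipeline above, `call` would be a zero-argument coroutine function wrapping the litellm request for that model.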

Tradeoffs and When NOT to Use This

This approach adds complexity. You're running at least one extra LLM call per user prompt (the classifier), which adds 200-500ms latency. For real-time chat applications, this might be unacceptable.

You also need to maintain the routing map. As new models launch or pricing changes, you update MODEL_ROUTING. This is manageable but not automatic.

The classifier itself can be wrong. A prompt might be marked "simple" when it actually needs complex reasoning, and you'll get a wrong answer. You need monitoring and a fallback system (if the simple model fails, retry on a more capable model).
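The fallback can be as simple as walking the candidate list for that complexity level. In this sketch, call_model and is_acceptable are hypothetical hooks: the first wraps your provider call, the second is whatever check you trust (schema validation, a length floor, a refusal detector).

```python
def run_with_escalation(prompt: str, candidates: list[str], call_model, is_acceptable):
    """Try candidates cheapest-first; escalate when a call fails or is rejected.

    Returns (model, answer) for the first acceptable answer.
    """
    last_error = None
    for model in candidates:
        try:
            answer = call_model(model, prompt)
        except Exception as exc:  # e.g. rate limit or provider outage
            last_error = exc
            continue
        if is_acceptable(answer):
            return model, answer
    raise RuntimeError("no candidate produced an acceptable answer") from last_error
```

Log every escalation: a prompt that had to move up a tier is a labelled example of a misclassification.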

When to choose something simpler: If your prompts are homogeneous (they're all the same type), just pick one good model and move on. If you have fewer than 10k prompts per month, the optimization overhead costs more than the savings. If latency is critical and sub-200ms matters, this adds too much latency.

When this approach wins: you have diverse prompt types, cost is a serious constraint, and you can afford 200-500ms classification overhead. You're running thousands of prompts per month across different complexity levels.

Get the Full Implementation

I packaged this as an open-source template on GitHub: https://github.com/Reactance0083/pydantic-ai-multi-llm-cost-optimizer

The scaffold gives you the architecture and core patterns. It's a good foundation but lacks production details: error handling for rate limits, caching of classifications, database schema design, monitoring dashboards.

The full production version with tests, retry logic, cached classifications, and example dashboards is on Gumroad: https://reactance0083.gumroad.com/l/ztmlv

What specific part of multi-model routing is causing pain in your system? Are you currently over-provisioning to certain models, or is latency from classification a blocker for your use case? Drop a comment and let's talk through your constraints.

Stop hand-parsing LLM JSON: structured outputs with pydantic-ai

Wade Allen — Sat, 20 Jun 2026 15:32:38 +0000

If you have ever written json.loads(response) around an LLM call and then a defensive try/except because the model returned ```json fences, a trailing comma, or prose before the object, this is for you.

The fix is to stop treating the model's output as text you parse, and start treating it as a typed object the library validates for you. With pydantic-ai you declare the shape once and get a validated Python object back, with a retry on the model when it doesn't conform.

from pydantic import BaseModel, Field
from pydantic_ai import Agent

class Invoice(BaseModel):
    vendor: str
    total: float = Field(..., description='Grand total in USD')
    due_date: str | None = None

agent = Agent('anthropic:claude-sonnet-4-6', output_type=Invoice,
              system_prompt='Extract the invoice fields from the text.')

result = agent.run_sync('Acme Corp — $1,240.00 due 2026-07-01')
print(result.output.total)  # 1240.0  (a float, already validated)

What you no longer write: the JSON fence stripping, the KeyError guards, the float coercion, the "the model added an apology before the JSON" handling. If the model returns something that doesn't fit Invoice, pydantic-ai sends the validation error back to the model and asks it to try again — so your application code only ever sees a clean object.
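For contrast, here is roughly what that hand-written layer tends to look like. It is a sketch of the pattern being replaced, not code from any library, and the function name is made up for illustration.

```python
import json
import re

def parse_invoice_by_hand(raw: str) -> dict:
    """The kind of defensive parsing that a typed output makes unnecessary."""
    # Drop anything outside the outermost braces (code fences, apologies, prose)
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found")
    # Remove trailing commas before a closing brace or bracket
    cleaned = re.sub(r",\s*([}\]])", r"\1", match.group(0))
    data = json.loads(cleaned)
    if "vendor" not in data or "total" not in data:
        raise ValueError("missing required field")
    # Coerce "$1,240.00"-style strings to float
    total = float(str(data["total"]).replace("$", "").replace(",", ""))
    return {"vendor": str(data["vendor"]), "total": total, "due_date": data.get("due_date")}
```

Every line here is a guess about how the model might misbehave, and none of it asks the model to correct itself.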

Three things this buys you in production:

The type is the contract. Your endpoint signature, your tests, and your prompt all agree, because they reference the same model.
Failures are explicit. A field that won't validate raises where you can catch it, instead of silently becoming None three functions later.
It's auditable. result.output is a real object you can log, diff, and assert on.

The pattern scales from one field to a nested schema, and it's the same whether you're on Claude, GPT, or a local model. Once you've used it you stop writing parsers entirely.

How I Built an AI Email Triage Agent That Creates Linear Issues and Fires Slack Alerts

Wade Allen — Mon, 15 Jun 2026 13:05:07 +0000

Most engineering teams I talk to have the same problem: support emails pile up, someone manually reads them, decides what kind of issue it is, guesses at priority, creates a Linear ticket, and then pings the right person on Slack. That whole chain takes 5-15 minutes per email and happens inconsistently across team members.

The obvious answer sounds like "just automate it." But after building this system, I'd reframe the actual problem: triage is a synchronization problem, not just an automation problem. Engineering teams using both GitHub and Linear need the judgment calls - priority, story points, team assignment - to happen at the moment the email arrives, not hours later when a human finally gets to it. If a production outage report hits your inbox at 2am, the ticket should already exist with P0 priority and the on-call engineer assigned before anyone opens their laptop.

This article walks through how I built exactly that with pydantic-ai, FastAPI, Gmail IMAP, Linear API, and Slack webhooks.

The Problem: Manual Triage Doesn't Scale

Here is what the broken version looks like in practice.

An inbound email arrives: "Hey, users in the EU region can't log in, getting 503 errors, this started about 20 minutes ago." Someone on your team reads it, decides it's a bug of type auth, marks it P0, creates a Linear issue, and posts in #incidents on Slack. That whole process works fine with two engineers and 10 emails a day.

At 40 emails a day, it starts breaking. Things get missed. A critical auth failure sits in an inbox for 45 minutes because the person who usually triages it is in a meeting. Another email gets classified as P2 when it should be P0 because the triager didn't read closely enough. A third gets a Linear ticket created, but no Slack alert fires because the person forgot that step.

The deeper problem: your team's triage logic exists entirely in people's heads. There is no consistent definition of what makes something P0 vs P1. Different engineers make different calls. New hires make worse calls. And none of this is auditable.

What you actually want: a system that reads the email, applies your team's specific classification logic, creates the Linear issue with correct metadata, and fires Slack alerts for anything critical - all within seconds of the email arriving, at any hour.

The Approach: pydantic-ai + FastAPI for Structured Judgment

The key architectural insight here is that LLM outputs need to be trustworthy enough to trigger side effects. If your agent classifies an email as P0 and your code blindly creates a Linear issue and pages an on-call engineer, you need that classification to be a proper typed object, not a string you parse with regex.

This is exactly where pydantic-ai earns its place. Unlike plain OpenAI API calls where you prompt-engineer your way to JSON and hope it validates, pydantic-ai lets you define the output schema as a Pydantic model and the library enforces it. The agent either returns a valid TicketClassification object or raises an exception you can handle. No silent failures where priority comes back as "high" instead of "P1".

Why not LangChain? I've used it. The abstraction layer is heavy, debugging is painful, and structured output handling requires enough boilerplate that you end up writing similar amounts of code with less visibility into what's happening. pydantic-ai is thinner and more explicit - closer to writing a typed Python function that happens to call an LLM.

Why FastAPI over a simple script? Because you want this running as a service. Gmail IMAP polling runs on a background task. The Linear and Slack integrations are async HTTP calls. FastAPI gives you a health endpoint, request logging, and a clean place to hang background tasks with lifespan context managers. It also makes the system testable - you can POST a fake email payload to your /triage endpoint in tests without touching Gmail at all.

The design decision that makes this reliable: the classification and the side effects are separated. The agent produces a TicketClassification. Then separate, deterministic functions consume that object to create the Linear issue and fire the Slack alert. The LLM never touches the API clients directly. If Linear is down, your classification still works. If the classification fails validation, nothing gets created.

The Central Code Pattern

Here is the core of the system - the classification agent and the downstream dispatch:

from pydantic import BaseModel
from pydantic_ai import Agent
from enum import Enum

class Priority(str, Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"

class IssueType(str, Enum):
    BUG = "bug"
    FEATURE = "feature"
    QUESTION = "question"
    INCIDENT = "incident"

class TicketClassification(BaseModel):
    issue_type: IssueType
    priority: Priority
    title: str
    summary: str
    suggested_team: str
    story_points: int
    requires_immediate_alert: bool

# The agent - structured output enforced by pydantic-ai
triage_agent = Agent(
    "openai:gpt-4o",
    output_type=TicketClassification,  # called result_type in early pydantic-ai releases
    system_prompt="""
    You are an engineering triage agent. Classify inbound support emails.
    P0: production down, data loss, security breach affecting users.
    P1: major feature broken, significant user impact, no workaround.
    P2: partial functionality broken, workaround exists.
    P3: minor issue, cosmetic, low user impact.
    story_points: 1-8 based on estimated fix complexity.
    requires_immediate_alert: true only for P0 and P1 incidents.
    """
)

async def process_email(raw_email: str) -> None:
    # Classification - LLM call with validated output
    result = await triage_agent.run(raw_email)
    classification = result.output  # This is a real TicketClassification object

    # Deterministic side effects - no LLM involved past this point
    linear_issue = await create_linear_issue(classification)

    if classification.requires_immediate_alert:
        await fire_slack_alert(classification, linear_issue.url)

A few things worth noting here.

output_type=TicketClassification is doing the heavy lifting. pydantic-ai will retry the LLM call with validation feedback if the output doesn't conform to the schema. You get a real typed object back, not a dict.

The requires_immediate_alert boolean is intentional. You could derive this from priority in code (if classification.priority in [Priority.P0, Priority.P1]), but having the LLM make this call explicitly means it can account for context. A P2 email that mentions "this is affecting 10,000 users right now" might warrant an alert that pure priority logic would miss. Note that the system prompt above restricts the flag to P0 and P1 incidents, so you would need to relax that line for the model to raise an alert on a P2.

story_points gives Linear a starting estimate. It won't always be right, but having something there is better than creating every issue with no estimate.
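On the requires_immediate_alert point, a middle ground is to keep the LLM's flag but add a deterministic floor, so a missed flag can never silence a P0. This is a sketch of that idea, not the original implementation; should_alert is a hypothetical helper and the Priority enum is repeated only to keep the example self-contained.

```python
from enum import Enum

class Priority(str, Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"

def should_alert(priority: Priority, llm_flag: bool) -> bool:
    """Alert on any P0, or whenever the model asked for an alert."""
    # The P0 floor is deterministic; the flag adds context-driven alerts on top
    return priority is Priority.P0 or llm_flag
```

The LLM can still escalate a lower-priority email, but it cannot suppress an outage alert.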

Integration: Gmail to Linear to Slack

The data flow runs like this. A background task polls Gmail via IMAP every 60 seconds, fetching unread emails from a designated support address. Each email gets decoded from MIME format, stripped to plain text, and pushed to the classification pipeline. After classification, the Linear issue gets created via Linear's GraphQL API with the full metadata. If requires_immediate_alert is true, a Slack webhook posts to your #incidents or #support channel with the issue link, priority, and summary.

The Linear integration uses their GraphQL API directly rather than an SDK - the SDK adds little value here and the GraphQL call is straightforward:

async def create_linear_issue(c: TicketClassification) -> LinearIssue:
    mutation = """
    mutation CreateIssue($input: IssueCreateInput!) {
      issueCreate(input: $input) {
        issue { id url title }
      }
    }
    """
    variables = {
        "input": {
            "title": c.title,
            "description": c.summary,
            "priority": PRIORITY_MAP[c.priority],
            "estimate": c.story_points,
            "teamId": TEAM_ID_MAP[c.suggested_team],
        }
    }
    # ... execute mutation
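The execution step is elided above. As a rough sketch only, here is what posting a GraphQL mutation could look like with the standard library. Several things are assumptions, not taken from the original: the helper names, passing a personal API key directly in the Authorization header, and the PRIORITY_MAP values (Linear's numeric priority is assumed to run from 1 = urgent to 4 = low). In the async service you would swap urllib for an async HTTP client.

```python
import json
import urllib.request

LINEAR_URL = "https://api.linear.app/graphql"

# Assumed mapping onto Linear's numeric priority field (1 = urgent ... 4 = low)
PRIORITY_MAP = {"P0": 1, "P1": 2, "P2": 3, "P3": 4}

def build_request(api_key: str, mutation: str, variables: dict) -> urllib.request.Request:
    """Build the POST request for a GraphQL mutation without sending it."""
    body = json.dumps({"query": mutation, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        LINEAR_URL,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )

def execute(request: urllib.request.Request) -> dict:
    """Send the request and return the `data` payload, raising on GraphQL errors."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        payload = json.load(resp)
    # GraphQL can report failures in the body even when the HTTP status is 200
    if payload.get("errors"):
        raise RuntimeError(payload["errors"])
    return payload["data"]
```

Splitting request building from sending keeps the payload testable without network access.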

Gotcha worth knowing: Gmail IMAP with UNSEEN search will re-fetch emails if your process restarts before you mark them as read. You need to mark emails as SEEN immediately after fetching, before classification, not after. Otherwise a process crash between fetch and classification means you'll process the same email twice and create duplicate Linear issues.

Tradeoffs and Limitations

This system has real limitations you should know going in.

The classification quality depends heavily on your system prompt. GPT-4o is good, but it doesn't know your product's specific failure modes. A generic prompt will produce generic classifications. You will need to iterate on the prompt with real emails from your inbox before this is actually useful.

IMAP polling has latency. Sixty-second poll intervals mean a P0 incident email might sit for up to a minute before it creates a ticket. For most teams this is fine. For true real-time needs, you'd want Gmail push notifications via Pub/Sub instead.

This approach is probably overkill if your team gets fewer than 20 support emails per day. A simple Zapier workflow with a pre-defined category mapping would give you 80% of the value with zero maintenance. Use this when you have volume, varied email content, and need nuanced classification that rule-based systems get wrong.

Cost is real but small. At ~1000 tokens per email with GPT-4o, 100 emails/day runs about $2-3/day. Not a blocker, but worth accounting for.

Try It Yourself

I packaged this as an open-source template on GitHub: https://github.com/Reactance0083/pydantic-ai-email-linear-auto-triage

The scaffold gives you the agent structure, FastAPI setup, and integration patterns to get started. The full production version with tests, error handling, retry logic, duplicate detection, and complete docs is available here: https://reactance0083.gumroad.com/l/dcror

If you're running a similar triage setup or have tried other approaches (Zapier, custom scripts, other agent frameworks), I'd genuinely like to hear how it's working. Drop a comment with what broke first - that's usually the most useful part of any production story.

How I Built a Multi-Agent Prompt Engineering Runbook with pydantic-ai and FastAPI

Wade Allen — Mon, 08 Jun 2026 13:05:31 +0000

Most teams building AI tooling eventually hit the same wall: they have five different prompt patterns scattered across Notion docs, Slack threads, and someone's local Python file. Nobody agrees on the output format. The SWOT analysis prompt returns markdown sometimes and JSON sometimes. The code reviewer just dumps text. When something breaks in production, you spend 40 minutes figuring out which version of the prompt was actually running.

This article walks through an architecture that solves that problem using pydantic-ai, FastAPI, and structured Pydantic outputs. The result is a prompt engineering runbook: a single deployable service that handles SWOT analysis, social post generation, code review, multi-format summarisation, and a decision framework, all returning typed, validated responses.

The Problem: Prompt Sprawl Kills Reliability

Here is a concrete scenario that plays out in teams of five or more engineers.

Someone writes a useful SWOT analyser prompt in a Jupyter notebook. It works great. A teammate copies it into a FastAPI route, changes a few words, and hardcodes the model name. Three months later, a third person builds a Slack bot that uses a slightly different version. Now you have three SWOT analysers in production with no shared contract on what the output looks like.

Downstream systems start breaking because one version returns strengths as a list and another returns it as a comma-separated string. The code reviewer prompt just returns raw text, so the frontend has to parse it with regex. When you upgrade the model, you have no idea which of the six prompt functions will silently regress.

Teams that use Slack as their source of truth are the most exposed to this problem. Context lives in threads that expire from memory, decisions get buried, and when someone needs to extract structured insights from that context, they either do it manually or rely on informal scripts that nobody maintains. The chaos compounds because there is no single place that says "this is what our AI outputs look like."

The fix is not better prompt writing. It is a typed contract layer between your prompts and the rest of your system.

The Approach: pydantic-ai + FastAPI as a Typed Contract Layer

The core idea is simple: every agent in the runbook has a Pydantic model as its output type. pydantic-ai enforces that contract at the LLM call boundary. FastAPI exposes each agent as an endpoint with typed request and response bodies.

Why pydantic-ai over alternatives?

LangChain is the obvious comparison. LangChain has output parsers and structured output support, but the abstraction layer is thick. Debugging a failed parse means tracing through multiple internal chain objects. For a runbook that needs to be maintained by the whole team, that opacity is a liability.

Plain requests with instructor is closer to what this is doing, and honestly a valid choice. The tradeoff is that pydantic-ai gives you agent-level retries and tool support out of the box, which matters when you start adding context retrieval or multi-step reasoning.

Raw OpenAI structured outputs work but lock you to one provider. pydantic-ai is provider-agnostic, so swapping from OpenAI to Anthropic or a local model is a config change, not a rewrite.

The key design decision that makes this reliable: every agent declares a Pydantic model, not a string, as its output type. pydantic-ai will retry the LLM call if the output fails validation, feeding the validation errors back to the model. This is the thing that plain prompt engineering cannot give you on its own.

The FastAPI layer adds HTTP-level validation on the way in and serialisation on the way out. Every request and response is typed. Your frontend, your Slack bot, and your CI pipeline all talk to the same contract.

The Code Pattern: Typed Agents with Structured Outputs

Here is the central pattern. Everything in the runbook follows this shape.

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from fastapi import FastAPI, HTTPException

# 1. Define the output contract
class SWOTAnalysis(BaseModel):
    strengths: list[str] = Field(description="Internal positive factors")
    weaknesses: list[str] = Field(description="Internal negative factors")
    opportunities: list[str] = Field(description="External positive factors")
    threats: list[str] = Field(description="External negative factors")
    summary: str = Field(description="Two-sentence executive summary")

# 2. Define the input
class SWOTRequest(BaseModel):
    context: str = Field(description="Business or product context to analyse")
    focus_area: str | None = Field(default=None, description="Optional domain focus")

# 3. Create the agent with output_type enforcing the contract
swot_agent = Agent(
    model="openai:gpt-4o",
    output_type=SWOTAnalysis,  # called result_type in early pydantic-ai releases
    system_prompt=(
        "You are a strategic analyst. Analyse the provided context and return "
        "a structured SWOT analysis. Be specific and actionable. "
        "Each list should contain 3-5 items."
    ),
)

app = FastAPI()

# 4. Expose it as a typed FastAPI endpoint
@app.post("/analyse/swot", response_model=SWOTAnalysis)
async def analyse_swot(request: SWOTRequest) -> SWOTAnalysis:
    prompt = request.context
    if request.focus_area:
        prompt = f"Focus area: {request.focus_area}\n\nContext: {request.context}"

    try:
        result = await swot_agent.run(prompt)
        return result.output
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

What each part does and why it matters:

output_type=SWOTAnalysis is the critical line. This tells pydantic-ai to request structured output from the model and validate the response against your Pydantic schema. If the LLM returns malformed JSON or missing fields, pydantic-ai retries automatically.

response_model=SWOTAnalysis on the FastAPI route means the OpenAPI docs are generated from your actual output type. Your frontend developers can see exactly what fields are returned without reading the prompt.

result.output gives you the validated Pydantic instance directly. No JSON parsing, no .get() calls with fallbacks.

The same pattern is repeated for every agent in the runbook: code reviewer, social post generator, multi-format summariser, and decision framework. They each have a different Pydantic model and a different system prompt, but the structural shape is identical.

Integration: Connecting to External Sources

The runbook becomes genuinely useful when it is connected to external data sources. The most impactful integration for most teams is Slack.

The data flow looks like this:

Slack channel/thread
    -> Slack API (conversations.history or webhooks)
    -> extraction endpoint on the runbook
    -> summariser or SWOT agent
    -> structured output stored in Postgres or returned to Slack

For the Slack integration, you fetch message history using slack_sdk, concatenate the thread into a single context string, and pass it to whichever agent fits the use case. Decision threads go to the decision framework agent. Product discussion threads go to the SWOT analyser. Code snippets shared in chat go to the code reviewer.

from slack_sdk import WebClient

slack_client = WebClient(token=settings.slack_bot_token)

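def extract_thread_context(channel_id: str, thread_ts: str) -> str:
    # conversations.replies returns the parent message plus its replies.
    # Long threads are paginated, so follow next_cursor if you need all of them.
    response = slack_client.conversations_replies(
        channel=channel_id,
        ts=thread_ts
    )
    messages = response["messages"]
    # Human messages carry a user ID in "user"; bot messages may carry "username".
    return "\n".join(
        f"{msg.get('user') or msg.get('username', 'user')}: {msg.get('text', '')}"
        for msg in messages
    )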

One gotcha worth knowing: Slack message text contains user ID mentions in the format <@U12345>. These will confuse the LLM if left in. Preprocess the context string to replace user IDs with display names or generic placeholders before passing to any agent. You can do this with the users.info API call or by maintaining a local ID-to-name cache.
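
The mention clean-up described above can be sketched with a regular expression and a local cache. This is a minimal sketch, not the runbook's actual implementation: the names cache, the replace_mentions helper, and the "someone" placeholder are all assumptions for illustration, and populating the cache (for example from users.info) is left out.

```python
import re

# Matches Slack user mentions such as <@U12345> or <@U12345|alice>.
MENTION_RE = re.compile(r"<@([UW][A-Z0-9]+)(?:\|[^>]*)?>")


def replace_mentions(text: str, names: dict[str, str], placeholder: str = "someone") -> str:
    """Swap Slack user-ID mentions for display names from a local cache.

    `names` is a hypothetical ID-to-name cache. Unknown IDs fall back to a
    generic placeholder so that no raw ID reaches the model.
    """
    def _sub(match: re.Match) -> str:
        return "@" + names.get(match.group(1), placeholder)

    return MENTION_RE.sub(_sub, text)


cache = {"U12345": "dana"}
print(replace_mentions("<@U12345> can you review? cc <@U99999>", cache))
```

Running the replacement on the concatenated thread string, just before it is handed to an agent, keeps the Slack-specific clean-up in one place.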

Tradeoffs and Limitations

This architecture has real costs that you should weigh before building it.

Latency. Every request makes at least one LLM API call. For a code reviewer on a hot path, that is 1-3 seconds minimum. Do not use this for anything that needs sub-200ms response times.

Retry costs. pydantic-ai's automatic retries on validation failure mean a badly calibrated system prompt can silently double your API spend. Monitor retry rates and set the agent's retries parameter explicitly rather than relying on the default.

Overkill for small teams. If you have two engineers and three prompts, a shared Python module with well-named functions and type hints is probably the right answer. The FastAPI layer adds deployment overhead that only pays off when multiple systems are consuming the same agents.

Provider lock-in is deferred, not eliminated. Switching providers is easier than with raw OpenAI calls, but system prompts that are tuned for GPT-4o may behave differently on Claude or Gemini. You still need to test across providers if portability matters.

For teams with strict documentation habits already, the marginal value is lower. This runbook is most valuable when your AI prompts are currently scattered and your outputs are inconsistent.

Get the Code and Keep the Conversation Going

I packaged this as an open-source template on GitHub: https://github.com/Reactance0083/pydantic-ai-prompt-engineering-runbook

The scaffold gives you the core patterns for all five agents and the FastAPI setup. If you want the full production version with tests, error handling, provider configuration, logging middleware, and deployment docs, that is available here: https://reactance0083.gumroad.com/l/mdsbpc

If you are building something similar and have hit a different set of tradeoffs, specifically around retry strategies or multi-tenant prompt isolation, I would like to hear about it in the comments. This architecture has a few rough edges I am still working through and real-world feedback tends to surface the problems that local testing misses.

How I Built an Email Auto-Triage System with pydantic-ai, FastAPI, and Linear

Wade Allen — Thu, 04 Jun 2026 14:42:46 +0000

Support email is a graveyard of good intentions. Every team I've worked with has some version of the same problem: a shared inbox accumulates emails, someone manually reads them, decides it's a bug or a billing question, copies the text into a Linear ticket, assigns a priority based on gut feel, and maybe pings Slack if it seems urgent. This process takes 5-10 minutes per email on a good day, and it scales terribly.

This article walks through the architecture and key code patterns for an automated triage pipeline that handles the full loop: classify incoming emails, create structured Linear issues, and fire Slack alerts for anything critical, all without a human in the loop.

The Problem: Manual Triage Doesn't Scale

Here's the concrete scenario that motivated this build.

A small SaaS team receives 80-150 support emails per day. Three categories consistently matter: bugs (customer-reported crashes or broken features), billing issues (failed charges, incorrect invoices), and feature requests (nice-to-haves that need product review). Everything else is general inquiry or noise.

Without automation, what happens is this: emails pile up overnight. The first engineer on in the morning spends 45 minutes triaging before writing a single line of code. A P0 bug report from a paying customer that arrived at 2 AM sits unread until 9 AM. Billing issues that should route to a different Slack channel get lost in the engineering queue. Feature requests never make it into the backlog because nobody wants to do the copy-paste work.

The real cost isn't the minutes per email. It's the decisions made inconsistently, the critical tickets that sit too long, and the cognitive load that comes with context-switching into support mode at the start of every day. Manual triage is a process that looks manageable until you actually measure it.

The Architecture: pydantic-ai + FastAPI as the Spine

The core insight here is that email triage is a structured extraction problem, not a generative one. You're not asking an LLM to write anything creative. You're asking it to read text and fill out a form with specific fields: category, priority, summary, suggested assignee. That's exactly what pydantic-ai is designed for.

Why pydantic-ai over LangChain or plain OpenAI requests?

LangChain adds a lot of abstraction for problems that don't need it. Output parsers in LangChain feel bolted on. Plain OpenAI API calls require you to write JSON schema definitions manually and then validate the output yourself, which inevitably means writing brittle string parsing.

pydantic-ai lets you define a Pydantic model as your expected output, and the library handles the prompting strategy and validation loop. If the LLM returns something malformed, pydantic-ai retries with the validation error included in context. In practice, this means you get typed, validated objects back from every agent call rather than dictionaries you hope have the right keys.

FastAPI hosts the whole thing as a single service. A background task polls Gmail over IMAP (or you can swap in a push webhook), the handler runs each email through the agent, and then it fires the Linear and Slack API calls. This keeps the pipeline stateless and easy to deploy.

The key design decision: each email gets one agent call that returns a fully structured triage object. There's no chain of calls, no memory, no conversation state. This makes the system predictable, cheap to run, and easy to debug. A single email costs roughly 300-500 input tokens, which at current GPT-4o-mini pricing is fractions of a cent.

The Central Code Pattern: Structured Triage with pydantic-ai

Here's the core of the system, simplified but real:

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from enum import Enum
from typing import Optional


class TicketCategory(str, Enum):
    BUG = "bug"
    BILLING = "billing"
    FEATURE_REQUEST = "feature_request"
    GENERAL = "general"


class TicketPriority(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class TriageResult(BaseModel):
    category: TicketCategory
    priority: TicketPriority
    summary: str = Field(
        description="One sentence summary of the issue, max 100 characters"
    )
    customer_sentiment: str = Field(
        description="Brief assessment: frustrated, neutral, or positive"
    )
    suggested_team: str = Field(
        description="Which team should own this: engineering, billing, or product"
    )
    needs_immediate_slack_alert: bool = Field(
        description="True only if CRITICAL priority or customer mentions churn/legal"
    )


TRIAGE_AGENT = Agent(
    model="openai:gpt-4o-mini",
    result_type=TriageResult,
    system_prompt="""
    You are a support triage specialist. Analyze incoming support emails and 
    classify them accurately. Be conservative with CRITICAL priority - only 
    use it for active outages, data loss, or customers threatening to cancel.
    Billing issues are almost always HIGH, not CRITICAL, unless the customer 
    reports fraudulent charges.
    """,
)


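async def triage_email(subject: str, body: str, sender: str) -> TriageResult:
    # Truncate before interpolating to keep tokens predictable. A "#" comment
    # placed inside the f-string would be sent to the model as prompt text.
    truncated_body = body[:2000]
    email_content = (
        f"From: {sender}\n"
        f"Subject: {subject}\n\n"
        f"Body:\n{truncated_body}"
    )
    result = await TRIAGE_AGENT.run(email_content)
    return result.data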

A few things worth explaining here:

The Field(description=...) on each model field is not just documentation. pydantic-ai passes these descriptions into the schema that guides the LLM's output. This is how you constrain the model's behavior without writing verbose few-shot examples. The description on needs_immediate_slack_alert embeds your business logic directly into the type definition.

Body truncation at 2000 characters is deliberate. Support emails are either short (the important signal is in the first paragraph) or extremely long (forwarded threads, attached logs in pasted text). Truncating keeps costs predictable and prevents occasional emails from burning through your token budget.

The system_prompt includes explicit guidance about when NOT to use CRITICAL. Without this, LLMs tend to over-escalate because they have no sense of what your alert fatigue threshold is.

Integration: Gmail to Linear to Slack

The data flow works like this:

A FastAPI background task polls Gmail via IMAP every 60 seconds, fetching unread emails from the support inbox.
Each email runs through triage_email() and returns a TriageResult.
The result maps to a Linear issue via the Linear GraphQL API. Category becomes the label, priority maps to Linear's 1-4 scale, and the summary becomes the issue title.
If needs_immediate_slack_alert is true, the pipeline posts to a #critical-support Slack channel with the sender, summary, and a direct link to the newly created Linear issue.
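
The process_email function below relies on a PRIORITY_MAP that is not shown. One plausible definition is sketched here, assuming Linear's integer priority field uses 0 for no priority, 1 for urgent, 2 for high, 3 for medium, and 4 for low. The linear_priority helper is a hypothetical addition, and TicketPriority is repeated only so the snippet runs on its own.

```python
from enum import Enum


class TicketPriority(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Assumed mapping onto Linear's integer priority field
# (0 = no priority, 1 = urgent, 2 = high, 3 = medium, 4 = low).
PRIORITY_MAP: dict[TicketPriority, int] = {
    TicketPriority.CRITICAL: 1,
    TicketPriority.HIGH: 2,
    TicketPriority.MEDIUM: 3,
    TicketPriority.LOW: 4,
}


def linear_priority(priority: TicketPriority) -> int:
    """Return Linear's integer priority, defaulting to 0 (no priority)."""
    return PRIORITY_MAP.get(priority, 0)
```

Keying the map on the enum members rather than on raw strings means a new priority level added to the enum shows up as a missing key in review instead of as a silent routing bug.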

async def process_email(email: ParsedEmail):
    triage = await triage_email(email.subject, email.body, email.sender)

    linear_issue = await create_linear_issue(
        title=triage.summary,
        description=email.body,
        priority=PRIORITY_MAP[triage.priority],
        label=triage.category.value,
        team=triage.suggested_team,
    )

    if triage.needs_immediate_slack_alert:
        await post_slack_alert(
            channel="#critical-support",
            message=f"*Critical ticket created*\nFrom: {email.sender}\n"
                    f"Issue: {triage.summary}\nLinear: {linear_issue.url}",
        )

The gotcha worth knowing: Linear's GraphQL API requires you to fetch team IDs and label IDs before you can create issues. These IDs are workspace-specific and not human-readable. The production version caches these at startup rather than fetching them on every email, which matters when you're processing a burst of 20 emails after an incident.
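
A startup cache of that kind might look like the sketch below. It is an illustration under assumptions, not the production code: IdCache and the injected fetch coroutine are hypothetical names, and the actual GraphQL query that lists teams or labels is left to the caller.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical fetcher type: returns {"human-readable name": "workspace-specific ID"}.
Fetcher = Callable[[], Awaitable[dict[str, str]]]


class IdCache:
    """Fetch a name-to-ID mapping once and serve later lookups from memory."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch
        self._ids: dict[str, str] | None = None
        self._lock = asyncio.Lock()

    async def get(self, name: str) -> str:
        async with self._lock:  # avoid duplicate fetches during a burst of emails
            if self._ids is None:
                self._ids = await self._fetch()
        try:
            return self._ids[name]
        except KeyError:
            raise LookupError(f"no Linear ID cached for {name!r}") from None
```

Injecting the fetcher keeps the cache testable without network access. One instance for teams and one for labels would cover the two lookups mentioned above.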

Tradeoffs and Limitations

This approach works well for teams with relatively consistent email volume and well-defined categories. It does not handle a few things cleanly:

Thread context is lost. Each email is processed independently. If a customer replies to an existing thread, the system will create a duplicate Linear issue rather than appending to the existing one. You need email threading logic (matching by subject or Message-ID header) to solve this, which adds meaningful complexity.

LLM classification has a tail of errors. On roughly 3-5% of emails in testing, the category is wrong. Ambiguous emails ("Your tool deleted all my data but I also want to request a refund and ask about your enterprise plan") get assigned to whichever category the model prioritizes. You still want a human review queue for anything below HIGH priority.

IMAP polling is not ideal for high volume. If you're processing thousands of emails per day, you'll want to switch to Gmail's Pub/Sub push notifications or a proper email processing service. Polling every 60 seconds is fine for most support inboxes.

For very low email volume, this is probably over-engineered. A simple filter rule plus a Zapier workflow might be the right call.

Closing

This pipeline eliminated the morning triage ritual for the team that tested it. Engineers stopped starting their days by reading email. Critical tickets started landing in Slack within two minutes of arrival rather than hours later.

I packaged this as an open-source template you can deploy in an afternoon:

GitHub scaffold: https://github.com/Reactance0083/pydantic-ai-email-linear-auto-triage

The scaffold gives you the core architecture. The full production version with proper error handling, retry logic, email thread deduplication, test suite, and deployment config is available here:

Full production code: https://reactance0083.gumroad.com/l/dcror

If you've built something similar or run into different edge cases with LLM-based classification in production, I'd genuinely like to hear about it in the comments. Particularly curious whether anyone has solved the thread-matching problem cleanly.

How I Built an Email-to-Linear Auto-Triage Agent with pydantic-ai and FastAPI

Wade Allen — Mon, 01 Jun 2026 13:15:57 +0000

Support engineers at most companies share a quiet frustration: they spend a chunk of every morning doing work that feels robotic. Read email, decide what type it is, guess the priority, open Linear, create a ticket, paste in the details, and maybe ping someone on Slack if it looks urgent. The work itself is mechanical. The judgment it requires is not always trivial, but the process absolutely is.

I built a system that eliminates that loop using pydantic-ai, FastAPI, Gmail IMAP, the Linear API, and the Slack API. This article explains the architecture, the key code pattern, and the honest tradeoffs you should know before using something like this in production.

The Problem: Manual Triage Still Lives in Every Support Team

Here is what actually happens without automation: a support email arrives at 2:47 AM. It says something like "our entire checkout flow is broken, no orders are going through." It sits in a shared inbox. Someone sees it at 8 AM. They manually create a Linear ticket, label it P1, assign it to the on-call engineer, and then fire off a Slack message. By that point, the company has lost five hours of potential revenue recovery.

The frustrating part is that most teams have tried to fix this. Zapier rules break when email subjects change slightly. Regex-based classifiers require constant maintenance as new email patterns appear. Full LangChain pipelines feel like overkill and introduce significant prompt engineering overhead when all you need is a structured classification step.

The result: support teams manually drag emails into ticket systems because existing integrations are either too brittle or too heavy. What you actually need is a lightweight agent that can read an email, make a judgment call about its type and priority, and take structured action without requiring a custom rule for every new ticket category that emerges over time.

That gap is exactly what pydantic-ai is designed to close.

The Approach: Structured Outputs as the Glue Layer

The core insight here is that pydantic-ai lets you define exactly what you want an LLM to return, enforced at the library level. You are not hoping the model formats its response correctly. You are not parsing JSON out of a Markdown code block. The model's output is validated against a Pydantic model before your code ever sees it.

Here is why that matters for email triage specifically: classification is only useful if downstream systems can consume it reliably. Linear's API expects specific field types. Slack's alert logic needs a boolean or an enum, not a string that might say "critical" or "Critical" or "very urgent" depending on the day. Structured output makes the LLM behave like a typed function.

The architecture is straightforward:

FastAPI exposes a webhook endpoint that receives incoming email data (polled from Gmail via IMAP on a background scheduler).
pydantic-ai agent receives the raw email text, runs it through an LLM with a strict output schema, and returns a TriageResult object.
The TriageResult is used to create a Linear issue via their GraphQL API.
If requires_immediate_alert is true, a Slack alert fires to the on-call channel.

Why this over LangChain? LangChain's output parsers work, but they add layers of abstraction that obscure what is actually happening. When the parser fails in production, debugging is painful. pydantic-ai is closer to the metal: you define a Pydantic model, you get that model back. The failure modes are explicit and easy to handle.

Why FastAPI over a cron script? You get health check endpoints, async support, and easy deployment to any container environment. The IMAP polling runs as a background task, keeping the architecture clean and testable.

The Code Pattern: Defining the Agent with a Typed Output Schema

This is the piece developers need to understand before anything else. The entire system depends on this pattern working correctly.

from pydantic import BaseModel
from pydantic_ai import Agent
from enum import Enum

class TicketType(str, Enum):
    BUG = "bug"
    BILLING = "billing"
    FEATURE_REQUEST = "feature_request"
    OUTAGE = "outage"
    GENERAL = "general"

class Priority(str, Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"
    P4 = "P4"

class TriageResult(BaseModel):
    ticket_type: TicketType
    priority: Priority
    summary: str          # one sentence, max 120 chars
    suggested_team: str   # e.g. "backend", "billing", "platform"
    requires_immediate_alert: bool

triage_agent = Agent(
    model="openai:gpt-4o-mini",
    result_type=TriageResult,
    system_prompt=(
        "You are a support triage agent. Given an email, classify it accurately. "
        "Mark requires_immediate_alert=True only for outages or data loss scenarios. "
        "Keep summary under 120 characters. Be conservative with P1: reserve it for "
        "confirmed production outages affecting multiple users."
    ),
)

async def triage_email(raw_email_text: str) -> TriageResult:
    result = await triage_agent.run(raw_email_text)
    return result.data

A few things worth explaining here:

result_type=TriageResult is where the magic lives. pydantic-ai constructs the prompt scaffolding to coerce the model into returning a response that validates against this schema. If validation fails, it retries automatically (configurable).

The requires_immediate_alert boolean is intentional. Keeping alert logic inside the LLM's classification means you can tune it through the system prompt rather than adding conditional branches in your routing code. Want to tighten or loosen the alert threshold? Update the prompt. No code changes needed.

The suggested_team field is a free string rather than an enum because team names vary by organization. You validate it loosely downstream before routing.
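
Loose downstream validation of suggested_team could be as small as the sketch below. The team names, aliases, and fallback queue are all placeholders I am assuming for illustration. The point is only that free text from the model gets mapped onto a closed set before anything is routed.

```python
# Hypothetical team names and aliases; replace with your organization's own.
KNOWN_TEAMS = {"backend", "billing", "platform"}
ALIASES = {"engineering": "backend", "payments": "billing", "infra": "platform"}
FALLBACK_TEAM = "triage"


def normalize_team(suggested: str) -> str:
    """Map the model's free-text team suggestion onto a routable team name."""
    key = suggested.strip().lower().removesuffix(" team")
    key = ALIASES.get(key, key)
    return key if key in KNOWN_TEAMS else FALLBACK_TEAM


print(normalize_team(" Backend Team "))
print(normalize_team("marketing"))
```

Anything that lands in the fallback queue is also a useful signal that the system prompt or the alias table needs updating.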

The Integration: Email In, Linear Out, Slack on Fire

The data flow looks like this:

Gmail IMAP poll (every 60s)
    -> raw email extracted (subject + body)
    -> FastAPI background task queued
    -> pydantic-ai agent runs classification
    -> TriageResult returned
    -> Linear GraphQL mutation creates issue
    -> if requires_immediate_alert: Slack webhook fires
    -> email marked as read / label applied in Gmail

The Linear integration uses their GraphQL API. Creating an issue looks roughly like:

import httpx

LINEAR_API_URL = "https://api.linear.app/graphql"

async def create_linear_issue(result: TriageResult, team_id: str, api_key: str):
    priority_map = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}
    mutation = """
    mutation CreateIssue($title: String!, $description: String!, 
                         $teamId: String!, $priority: Int!) {
      issueCreate(input: {
        title: $title,
        description: $description,
        teamId: $teamId,
        priority: $priority
      }) {
        issue { id url }
      }
    }
    """
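    variables = {
        "title": result.summary,
        # Use .value: on recent Python versions an f-string renders a
        # str-mixin Enum as "TicketType.BUG" rather than "bug".
        "description": f"Type: {result.ticket_type.value}\nSuggested team: {result.suggested_team}",
        "teamId": team_id,
        "priority": priority_map[result.priority.value],
    }
    async with httpx.AsyncClient() as client:
        response = await client.post(
            LINEAR_API_URL,
            json={"query": mutation, "variables": variables},
            headers={"Authorization": api_key},
        )
    response.raise_for_status()  # surface HTTP-level failures early
    return response.json()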

One gotcha worth knowing: Gmail IMAP with OAuth2 needs XOAUTH2 authentication (a library such as IMAPClient supports it) plus token refresh handling. If you rely on simple password authentication (which Google is deprecating for standard accounts), you will hit auth failures that are easy to miss in some environments. Build in token refresh logic from day one, not as an afterthought.

Tradeoffs and Limitations

This architecture works well for well-defined triage scenarios, but it has real limitations you should understand before deploying it.

LLM cost at volume: If you are processing thousands of emails per day, even gpt-4o-mini adds up. For very high volume, you would want to add a fast pre-filter (keyword matching or a fine-tuned small model) before hitting the LLM classification step.

Hallucinated summaries: The summary field is free text generated by the model. Occasionally it will produce a summary that misrepresents the original email. This matters if your Linear issues are the system of record. Consider storing the raw email body as an attachment to the issue.

No threading awareness: The system treats each email as independent. Reply chains and escalations require additional logic that this template does not handle.

When to choose something simpler: If your email types are genuinely stable (three or four categories that never change), a rule-based system with regex matching will be cheaper, faster, and more predictable. LLM classification earns its complexity when the input space is messy and evolving.
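
The pre-filter mentioned under LLM cost at volume can start as plain keyword matching. The sketch below is one assumed shape for it: the patterns, sender prefixes, and the needs_llm name are illustrative and would need tuning against real traffic before being trusted.

```python
import re

# Hypothetical patterns for mail that never needs LLM classification.
NOISE_PATTERNS = [
    re.compile(r"\bout of office\b", re.IGNORECASE),
    re.compile(r"\bauto(?:matic)?[- ]reply\b", re.IGNORECASE),
    re.compile(r"\bdelivery status notification\b", re.IGNORECASE),
]
NOISE_SENDERS = ("no-reply@", "noreply@", "mailer-daemon@")


def needs_llm(sender: str, subject: str, body: str) -> bool:
    """Return False for obvious noise so only real tickets reach the model."""
    if sender.lower().startswith(NOISE_SENDERS):
        return False
    text = f"{subject}\n{body}"
    return not any(pattern.search(text) for pattern in NOISE_PATTERNS)
```

Keeping the filter deliberately conservative matters here: a false "noise" verdict drops a real ticket, while a false "needs LLM" verdict only costs one cheap model call.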

Get the Code and Share What You Build

I packaged this as an open-source scaffold on GitHub: https://github.com/Reactance0083/pydantic-ai-email-linear-auto-triage

The scaffold gives you the core structure: the pydantic-ai agent definition, the FastAPI app skeleton, and stub integrations for Linear and Slack.

The full production version with complete error handling, OAuth2 Gmail auth, retry logic, test coverage, and deployment docs is available here: https://reactance0083.gumroad.com/l/dcror

If you are already running something like this in production, or if you have hit edge cases I did not cover here (multi-language emails, CRM integration, SLA tracking), I would genuinely like to hear about it in the comments. The design decisions here are not the only valid ones, and the tradeoffs look different at different scales.

How I Built a Customer Support Auto-Responder with Confidence Scoring Using pydantic-ai and FastAPI

Wade Allen — Mon, 01 Jun 2026 13:07:55 +0000

Support teams are drowning in tickets. Not because there are too many questions, but because the tooling makes it hard to automate the ones that should be automatic. Most tickets asking "how do I reset my password?" or "what are your refund terms?" get routed through the same queue as complex billing disputes. The answer to the first two exists in your docs. The answer to the third requires a human.

The gap between "we have docs" and "the AI reliably answers from docs without hallucinating" is where most support automation projects die.

This article walks through a production-grade pattern I built: a ticket ingestion system that uses RAG against your own documentation, scores its own confidence on every response, auto-replies when it's sure, and escalates to a human agent with a pre-drafted reply attached when it's not. Every decision is logged for audit.

The Problem: Manual Triage at Scale Is Not a Strategy

Here is the real scenario. Your support team gets 200 tickets per day. About 60% are answerable directly from your documentation. But your existing helpdesk either requires custom code per email format or rigid keyword-matching rules that break the moment a user phrases something slightly differently.

The integration problem is worse than it looks. Most existing connectors expect emails in a predictable structure. Real users do not write like that. One person writes "how do I cancel," another writes "I need to stop my subscription immediately," and a third writes "billing is still happening after I closed my account." Same intent, wildly different phrasing.

Without structured output from the LLM, you cannot reliably extract: what is the intent, what is the relevant doc section, and how confident is the model in its answer. So you end up with one of two bad outcomes:

You auto-reply with a hallucinated answer and destroy user trust
You route everything to humans and waste their time on questions your docs already answer

What is missing is a structured decision layer that sits between raw LLM output and the action taken. That is exactly what pydantic-ai provides.

The Approach: Structured Outputs as the Decision Layer

The key insight is that pydantic-ai forces the LLM to return data in a validated schema rather than free text. This is not just cosmetic. When your model must produce a TicketResponse object with a confidence_score: float, a suggested_reply: str, and an escalate: bool, you can branch on those values programmatically. You are not parsing prose looking for signals. You have actual typed fields.

Here is why this architecture beats the alternatives:

vs. LangChain: LangChain is flexible but the abstractions leak constantly. Debugging why a chain behaved unexpectedly is painful. For a system where every decision must be auditable, you want to see exactly what the model returned and why. pydantic-ai keeps the model call and the output schema co-located. You can inspect the raw response and the validated output side by side.

vs. plain OpenAI/Anthropic requests: You can use response_format with JSON mode, but you still hand-roll the Pydantic models and the validation logic. pydantic-ai handles that contract automatically.

vs. rigid rule engines: Rules break on phrasing variations. A hybrid approach where the LLM handles intent extraction and the rules handle routing based on structured fields is much more robust.

The architecture is:

FastAPI endpoint receives the ticket payload
ChromaDB retrieves the top-k relevant doc chunks via embedding similarity
pydantic-ai agent runs inference with the retrieved context
The structured output determines: auto-reply, escalate with draft, or flag for review
Every decision object is written to a PostgreSQL audit log

The key design decision that makes this reliable is that the routing threshold is enforced in application logic rather than left to the prompt. The model must populate a validated confidence field, and your code decides what to do with it. This means you can tune the cut-off without touching the prompt.

The Code Pattern: Agent Definition and Confidence-Gated Routing

Here is the central pattern. This is simplified but structurally accurate:

from pydantic import BaseModel, Field
from pydantic_ai import Agent
import chromadb

# The structured output schema
class TicketResponse(BaseModel):
    intent: str = Field(description="Short label for ticket intent")
    suggested_reply: str = Field(description="Full draft reply to send or attach")
    confidence_score: float = Field(ge=0.0, le=1.0)
    escalate: bool
    escalation_reason: str | None = None
    doc_sources: list[str] = Field(default_factory=list)

# Agent with result type enforced
support_agent = Agent(
    model="anthropic:claude-3-5-sonnet-20241022",
    result_type=TicketResponse,
    system_prompt="""
    You are a support assistant. Use only the provided documentation context.
    If the answer is not clearly supported by context, set confidence_score below 0.7
    and escalate to True. Always cite which doc sections informed your reply.
    """
)

async def handle_ticket(ticket_text: str, chroma_collection) -> TicketResponse:
    # Retrieve relevant docs
    results = chroma_collection.query(
        query_texts=[ticket_text],
        n_results=4
    )
    context_chunks = "\n\n".join(results["documents"][0])

    prompt = f"""
    TICKET:
    {ticket_text}

    DOCUMENTATION CONTEXT:
    {context_chunks}
    """

    result = await support_agent.run(prompt)
    response = result.data  # Validated TicketResponse instance

    # Confidence-gated routing -- no ambiguity
    if response.escalate or response.confidence_score < 0.72:
        await route_to_human(response)
    else:
        await send_auto_reply(response)

    await log_decision(ticket_text, response)
    return response

What each piece does and why it matters:

result_type=TicketResponse is the contract. Your code never receives output that fails this schema: pydantic-ai validates the response, retries on validation errors, and raises if the retries are exhausted.
confidence_score with ge=0.0, le=1.0 enforced by Pydantic means you never get a string like "high" that you need to interpret. It is a float you can threshold on.
doc_sources gives you audit traceability. You can show support managers which doc chunk informed which reply.
The routing logic lives outside the prompt. This is intentional. Prompts drift. Application logic is version controlled.

The 0.72 threshold is arbitrary in this snippet. In production you tune it based on your false-positive tolerance, with audit logs providing the data to make that call.
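
Tuning from audit logs can be sketched as replaying past decisions against candidate thresholds. This assumes each logged decision has been spot-checked by a human and labeled correct or not. That label, the LoggedDecision shape, and evaluate_threshold are hypothetical and not part of the schema shown earlier. For simplicity the sketch ignores the escalate flag and looks at confidence alone.

```python
from dataclasses import dataclass


@dataclass
class LoggedDecision:
    confidence_score: float
    reply_was_correct: bool  # hypothetical label from a human spot check


def evaluate_threshold(decisions: list[LoggedDecision], threshold: float) -> dict[str, float]:
    """Report how a candidate threshold would have behaved on past tickets."""
    auto = [d for d in decisions if d.confidence_score >= threshold]
    wrong = [d for d in auto if not d.reply_was_correct]
    return {
        "auto_reply_rate": len(auto) / len(decisions) if decisions else 0.0,
        "bad_auto_reply_rate": len(wrong) / len(auto) if auto else 0.0,
    }


history = [
    LoggedDecision(0.95, True),
    LoggedDecision(0.80, True),
    LoggedDecision(0.75, False),
    LoggedDecision(0.60, False),
]
for candidate in (0.70, 0.80):
    print(candidate, evaluate_threshold(history, candidate))
```

The two numbers pull in opposite directions: raising the threshold lowers the share of bad auto-replies but sends more tickets to humans, so the right value depends on which cost hurts more.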

Integration: Email Ingestion to Helpdesk to Slack Escalation

The data flow end to end looks like this:

Inbound: Emails arrive via a webhook from your email provider (Postmark, SendGrid, or similar). FastAPI receives the parsed payload with subject, body, sender, and any attachments.

Processing: The ticket body hits the RAG pipeline. ChromaDB stores your docs as embeddings loaded at startup. The retrieval step happens in under 100ms for most collections under 50k chunks.

Outbound: If auto-reply triggers, the reply goes back through your email provider API. If escalation triggers, a Slack message goes to your #support-escalations channel with the ticket details, the confidence score, and the pre-drafted reply attached. The agent did the work. The human just reviews and hits send (or edits first).

Audit log: Every TicketResponse object is serialized to JSON and written to a ticket_decisions table. This includes the retrieved doc chunks used, the confidence score, whether it was auto-replied or escalated, and the timestamp.

Gotcha worth knowing: if you change embedding models mid-deployment, the vectors stored for your docs no longer match the vectors produced at query time. If you swap from all-MiniLM-L6-v2 to text-embedding-3-small, you need to re-embed your entire document collection or retrieval quality degrades silently. Build the embedding model and a doc version hash into your collection name so a mismatch cannot go unnoticed.
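
One way to make stale embeddings impossible to query by accident is to derive the collection name from the embedding model and a docs version. The helper below is a minimal sketch using only hashlib. The collection_name function, the base name, and the version string format are assumptions for illustration.

```python
import hashlib


def collection_name(base: str, embedding_model: str, docs_version: str) -> str:
    """Build a collection name that changes whenever the model or docs change."""
    fingerprint = f"{embedding_model}:{docs_version}".encode()
    digest = hashlib.sha256(fingerprint).hexdigest()[:12]
    return f"{base}-{digest}"


old = collection_name("support-docs", "all-MiniLM-L6-v2", "2026-06-01")
new = collection_name("support-docs", "text-embedding-3-small", "2026-06-01")
print(old)
print(new)
```

Because the two names differ, a deployment configured for the new model looks for a collection that does not exist yet and fails loudly, instead of querying vectors produced by the old one.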

Tradeoffs and Limitations

This architecture is not for every team. Honest assessment:

Latency: Each ticket goes through an embedding query plus an LLM call. Expect 1-3 seconds per ticket depending on model and collection size. For real-time chat this is borderline. For email-based support, it is fine.

RAG quality ceiling: If your docs are poorly structured, out of date, or missing coverage for common questions, no amount of prompt engineering fixes it. Garbage in, garbage out. Budget for doc maintenance.

Cost at volume: At 200 tickets per day with Claude Sonnet, you are spending a few dollars per day. At 2000 tickets per day, that scales linearly to tens of dollars per day, which adds up over a month. If budget is the constraint, a smaller model for the first triage pass plus a larger model only for borderline cases is a sensible optimization.
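The arithmetic behind that estimate is easy to sketch. Every number in the example calls below is a placeholder assumption (token counts per ticket, per-million-token prices, the share of borderline tickets), so plug in your own measurements and your provider's current price list.

```python
def daily_cost(tickets_per_day, input_tokens, output_tokens,
               input_price_per_mtok, output_price_per_mtok):
    """Daily spend in dollars when every ticket makes one call to a single model."""
    per_ticket = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return tickets_per_day * per_ticket


def two_tier_daily_cost(tickets_per_day, borderline_share,
                        small_cost_per_ticket, large_cost_per_ticket):
    """Every ticket pays for the small model; borderline ones also pay for the large one."""
    return tickets_per_day * (
        small_cost_per_ticket + borderline_share * large_cost_per_ticket
    )


# Placeholder inputs: 2000 input tokens and 300 output tokens per ticket,
# priced at 3.00 and 15.00 dollars per million tokens.
single_200 = daily_cost(200, 2000, 300, 3.0, 15.0)
single_2000 = daily_cost(2000, 2000, 300, 3.0, 15.0)

# Placeholder inputs: 20% of tickets are borderline and hit the larger model too.
tiered_2000 = two_tier_daily_cost(2000, 0.2, 0.001, 0.0105)
```

The two-tier setup only wins when the borderline share stays small. If most tickets end up on the larger model anyway, you pay for both calls and gain nothing.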

When to skip this pattern: If your ticket types are genuinely narrow and you can enumerate them, a smaller fine-tuned classifier plus templated replies is cheaper, faster, and more predictable. This pattern earns its complexity when ticket phrasing is diverse and your docs are the source of truth.

Get the Code

I packaged this as an open-source template on GitHub: https://github.com/Reactance0083/pydantic-ai-customer_support_ticket_ai_auto_responde

The scaffold shows the core agent setup, ChromaDB integration, and FastAPI routing. The full production version with test suite, error handling for malformed payloads, retry logic, Slack webhook integration, audit logging migrations, and deployment config is available here: https://reactance0083.gumroad.com/l/qbvpl

If you are running support at scale and have tried to automate it before, I am genuinely curious where it broke down for you. Was it retrieval quality, confidence calibration, the email parsing step, or something else entirely? Drop it in the comments. The edge cases in this space are worth discussing.

Build an LLM Router with pydantic-ai: Route Prompts to the Cheapest Model

Wade Allen — Mon, 25 May 2026 20:59:34 +0000

Why LLM Routing Matters

Every LLM-powered application has the same hidden problem: you're using one model for every task, even though tasks vary wildly in complexity.

A simple "classify this as spam or not spam" prompt doesn't need Claude Sonnet or GPT-4o. A $0.04/MTok model handles it at 99% accuracy. But a complex multi-step reasoning task absolutely needs the flagship model, and going cheap there is just a slow way to fail.

The result: you're either wasting money on over-provisioning, or getting silent failures from under-provisioning. Usually both at the same time, on different parts of your pipeline.

LLM routing solves this by classifying each prompt's complexity before routing it to the cheapest model that can actually handle it.

The Architecture

The multi-LLM cost optimizer I built uses three layers:

Complexity Classifier (Pydantic AI + Claude Haiku)
Model Router (LiteLLM + dynamic pricing lookup)
Cost Tracker (Real-time spend logging)

The key insight: the classifier (using a cheap fast model) pays for itself when it prevents expensive routing on simple tasks.
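That break-even point can be written down directly. A back-of-the-envelope sketch, assuming per-call costs are roughly constant for each tier; the dollar figures in the example are placeholders, not measured prices.

```python
def routing_savings_per_call(simple_share, flagship_cost, cheap_cost, classifier_cost):
    """Expected saving per call versus sending everything to the flagship model."""
    return simple_share * (flagship_cost - cheap_cost) - classifier_cost


def break_even_simple_share(flagship_cost, cheap_cost, classifier_cost):
    """Minimum share of simple prompts at which the classifier pays for itself."""
    return classifier_cost / (flagship_cost - cheap_cost)


# Placeholder per-call costs in dollars.
FLAGSHIP, CHEAP, CLASSIFIER = 0.01, 0.0005, 0.0003

min_share = break_even_simple_share(FLAGSHIP, CHEAP, CLASSIFIER)
saving_at_half = routing_savings_per_call(0.5, FLAGSHIP, CHEAP, CLASSIFIER)
```

Every call pays the classifier cost, but only the simple share gets the cheaper model, so routing is worth it once `simple_share` exceeds `classifier_cost / (flagship_cost - cheap_cost)`.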

The Core Pattern

Pydantic AI structured outputs are what make the classification reliable:
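The sketch below shows the contract in isolation: a typed object or an exception, nothing in between. To stay runnable without API keys it uses a standard-library dataclass and enum as a stand-in for a Pydantic model, and it parses a raw JSON string instead of calling an agent. The category names and the `reasoning` field are assumptions, not the template's actual schema.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Complexity(str, Enum):  # hypothetical categories
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


@dataclass(frozen=True)
class Classification:
    complexity: Complexity
    reasoning: str


def parse_classification(raw: str) -> Classification:
    """Return a typed Classification or raise ValueError; never a half-parsed dict."""
    try:
        data = json.loads(raw)
        return Classification(
            complexity=Complexity(data["complexity"]),
            reasoning=str(data["reasoning"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError subclass, so bad JSON lands here too.
        raise ValueError(f"invalid classifier output: {raw!r}") from exc


ok = parse_classification('{"complexity": "simple", "reasoning": "binary label"}')
```

With a real Pydantic model handed to the agent as its output type, the library would do this validation for you; the point is that downstream code only ever sees a `Classification`.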

Without structured outputs, you are back to parsing free-text, and the classifier becomes another source of bugs. With Pydantic AI, you get a typed object back or an exception - no ambiguity.

The router then picks the model based on the classified category:
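A minimal version of that lookup might look like this. The category names and model identifiers are placeholders; the real template resolves models through LiteLLM and a pricing lookup, which this sketch leaves out.

```python
# Placeholder model names: swap in whatever identifiers your provider layer expects.
ROUTES = {
    "simple": "small-model",
    "moderate": "mid-model",
    "complex": "flagship-model",
}
FALLBACK_MODEL = "flagship-model"


def pick_model(category: str) -> str:
    """Map a classified category to a model, defaulting to the strongest one."""
    return ROUTES.get(category, FALLBACK_MODEL)


chosen = pick_model("simple")
```

Falling back to the flagship model for an unknown category is a deliberate choice: an unexpected classifier output costs a bit more money rather than producing a quietly worse answer.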

The Real Trade-offs

Classification latency adds overhead. The complexity classifier runs before every routed call - around 200-400ms depending on the model. For interactive apps, cache classifications by semantic similarity so repeated similar prompts skip the classifier.
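Here is a rough sketch of such a cache. To keep it dependency-free it uses `difflib` string similarity as a stand-in for embedding-based semantic similarity, so it only catches near-duplicate wording rather than paraphrases. The class name, the 0.9 cutoff, and the eviction policy are all assumptions.

```python
from difflib import SequenceMatcher


class ClassificationCache:
    """Reuse a cached category when a new prompt closely matches an earlier one."""

    def __init__(self, min_similarity=0.9, max_entries=1000):
        self.min_similarity = min_similarity
        self.max_entries = max_entries
        self._entries = []  # list of (normalized prompt, category)

    @staticmethod
    def _normalize(prompt):
        # Lowercase and collapse whitespace so trivial differences do not matter.
        return " ".join(prompt.lower().split())

    def get(self, prompt):
        key = self._normalize(prompt)
        for cached_prompt, category in self._entries:
            if SequenceMatcher(None, key, cached_prompt).ratio() >= self.min_similarity:
                return category
        return None  # cache miss: run the classifier

    def put(self, prompt, category):
        if len(self._entries) >= self.max_entries:
            self._entries.pop(0)  # drop the oldest entry
        self._entries.append((self._normalize(prompt), category))


cache = ClassificationCache()
cache.put("Classify this email as spam or not spam: hello", "simple")
hit = cache.get("classify this email as spam or not spam:  hello!")
miss = cache.get("Prove that the square root of 2 is irrational, step by step")
```

The linear scan is fine for a few hundred entries. Past that, an embedding index with nearest-neighbor lookup is the more realistic option.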

Edge cases are real. Code-heavy prompts, domain-specific jargon, and ambiguous short prompts are where classifiers misfire. Build a feedback loop to log misclassifications so you can tune the routing thresholds over time.

Cheap models fail silently. A cheap model handed a task it cannot handle won't throw an error. It will just give you a worse answer. Add output validation downstream, not just routing logic upstream.
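One way to sketch downstream validation is to check each output and escalate to the next model when the check fails. The model names, the stubbed `fake_call_model`, and the label validator below are hypothetical; in practice `call_model` would wrap your real provider call and `validate` would check whatever your task requires.

```python
def is_valid_label(output, allowed=("spam", "not_spam")):
    """Accept only an exact label from the allowed set."""
    return output.strip().lower() in allowed


def run_with_validation(prompt, models, call_model, validate):
    """Try models from cheapest to strongest; return the first output that validates."""
    for model in models:
        output = call_model(model, prompt)
        if validate(output):
            return model, output
    raise RuntimeError("no model produced a valid output")


def fake_call_model(model, prompt):
    """Stub for illustration: the cheap model rambles, the flagship answers cleanly."""
    if model == "small-model":
        return "I think this might be spam?"
    return "spam"


used_model, answer = run_with_validation(
    "Classify as spam or not_spam: WIN A FREE CRUISE",
    ["small-model", "flagship-model"],
    fake_call_model,
    is_valid_label,
)
```

Raising when every model fails keeps the failure loud, which is the opposite of what an unvalidated cheap model gives you.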

Cold-start cost. LiteLLM manages provider connections, and the first call to a new provider carries connection overhead. Warm up your most-used routes at startup.

When to Use This Pattern

This pattern is high-value when:

You have mixed workloads: classification, summarization, generation, reasoning
Your API costs are already meaningful and growing
You have multiple providers available (Anthropic, OpenAI, Groq all supported)
You want a single FastAPI endpoint that handles routing transparently

It adds complexity, so a single-model setup is fine when workloads are homogeneous or costs are still low.

The Template

I packaged this as a drop-in FastAPI + pydantic-ai template that you can have running in under 10 minutes. It includes the complexity classifier, LiteLLM router, cost tracker, and a /stats endpoint for real-time spend visibility.

Get it at: https://reactance0083.gumroad.com/l/ztmlv

If you have questions about the routing logic or want to adapt it to a specific use case, open an issue on the GitHub repo: https://github.com/Reactance0083/pydantic-ai-multi-llm-cost-optimizer