Alex Metelli

From AI Demo to Production: How to Ship Quality Agentic Applications

AI developers are getting very good at building demos.

A prompt, a model call, maybe a tool call or two, and suddenly you have something that looks impressive in a workshop, a hackathon, or an internal prototype.

But production is where the illusion breaks.

The hard part is no longer proving that an LLM can answer a question. The hard part is knowing whether your AI system behaves correctly across messy real-world inputs, edge cases, model changes, tool failures, latency constraints, cost pressure, and business-specific policies.

That was the core theme of a recent Braintrust and Trainline workshop in London: shipping AI applications requires operational rigor, not just better prompts.

The prototype trap

A lot of AI applications start the same way:

import OpenAI from "openai";

const openai = new OpenAI();

// One monolithic call: classify the ticket, judge severity, and draft a reply in a single shot
const response = await openai.chat.completions.create({
  model: "gpt-5-mini",
  messages: [
    { role: "system", content: "You are a helpful support agent." },
    { role: "user", content: ticketText },
  ],
});

For a proof of concept, this is fine.

You pass in a ticket, get a structured response, and the output looks plausible. Maybe it categorizes the issue, assigns a severity, and drafts a customer reply.

The problem is that plausibility is not correctness.

A single prompt can work beautifully on three demo cases and fail badly once exposed to production inputs. This is especially true in business workflows where the model needs to understand implicit priority, policies, customer tiers, refunds, SLAs, billing impact, or escalation rules.

Traditional software has deterministic failure modes. AI systems do not.

In normal software, 1 + 1 = 2.

In LLM systems, the equivalent is closer to: “usually 2, unless the context, prompt, model, tool result, temperature, or previous step nudges it somewhere else.”

That does not mean AI systems are unusable. It means they need a different quality model.

Production AI needs both software engineering and ML discipline

Trainline described this well.

They operate both traditional ML systems and agentic AI systems at large scale. For example, they use machine learning to predict train disruptions, but they also run a travel assistant that can help users with refunds, alternative trains, and support workflows.

Those two worlds have different quality practices.

Classic software engineering gives you:

  • type checks
  • unit tests
  • integration tests
  • CI/CD
  • structured logging
  • service observability
  • release discipline

Machine learning gives you:

  • datasets
  • offline evaluation
  • online evaluation
  • model comparison
  • data quality checks
  • drift monitoring

Agentic AI sits in the middle.

Part of the system is deterministic: API calls, database lookups, validation, routing, tool execution.

Part of the system is nondeterministic: reasoning, classification, language generation, judgment, summarization.

So the correct quality model is not “just write tests” and not “just vibe-check the outputs.”

It is both.

Break the monolithic prompt into stages

One of the most useful engineering patterns from the workshop was taking a single prompt-based support agent and breaking it into a staged workflow.

Instead of one large model call that does everything, the system was split into clearer responsibilities:

  1. Context collection
    Gather deterministic context: account information, previous tickets, relevant help articles, customer tier, billing state, etc.

  2. Triage
    Classify the issue, infer severity, identify the affected domain, and decide whether more information is needed.

  3. Policy review
    Check whether the proposed action follows company policy, SLAs, refund rules, escalation rules, or compliance constraints.

  4. Reply writing
    Draft a customer-facing response in the correct tone.

  5. Final packaging
    Emit structured output for downstream systems, including escalation flags, internal notes, and customer reply.

This is not just cleaner architecture. It improves debuggability.

When a single prompt fails, you often do not know why. Was the categorization wrong? Did it miss account context? Did it misunderstand policy? Did the final reply hallucinate?

When the workflow is staged, each part can be traced, evaluated, and improved independently.

This is the same instinct software engineers already use when decomposing a monolith. The AI version is: do not put all reasoning, policy, tool usage, and writing into one giant prompt unless the problem is genuinely trivial.
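
In code, the staged version can be as small as a sequence of typed functions; a minimal sketch, where the stage functions and result type are illustrative placeholders rather than a prescribed API:

// Each stage has one responsibility and one output, so a failure can be
// pinned to a specific step. All helper names here are placeholders.
async function handleTicket(ticketText: string): Promise<TicketResult> {
  const context = await collectContext(ticketText);            // deterministic: account, history, help articles
  const triage = await runTriage(ticketText, context);         // LLM: category, severity, missing info
  const policy = await reviewPolicy(triage, context);          // LLM or rules: SLA, refund, escalation constraints
  const reply = await writeReply(ticketText, triage, policy);  // LLM: customer-facing draft
  return packageResult({ triage, policy, reply });             // deterministic: structured output for downstream systems
}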

Tool calls improve capability, but increase failure surface

Adding tools makes an AI application more useful.

A support agent can call tools to:

  • retrieve help articles
  • inspect account metadata
  • check previous incidents
  • look up billing state
  • fetch policy rules
  • create an escalation
  • draft or update a ticket

But every tool call adds another possible failure mode.

The tool may return stale data. The model may choose the wrong tool. The tool result may be incomplete. The model may ignore the result. The tool may succeed but the final answer may still misinterpret it.

This is why AI observability becomes mandatory.

If you cannot see which tools were called, with what inputs, what outputs they returned, and how the model used those outputs, you are effectively debugging production behavior blind.

Logs are not enough.

Logs tell you what happened at a shallow level. Tracing tells you how the system behaved internally.
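
One cheap way to shrink that failure surface is to validate every tool result before the model sees it, and to keep both sides of the call for the trace. A minimal sketch using Zod for validation; the tool, schema, and field names are hypothetical:

import { z } from "zod";

// Hypothetical schema for an account-lookup tool result
const AccountSchema = z.object({
  customerId: z.string(),
  tier: z.enum(["free", "pro", "enterprise"]),
  openTickets: z.number(),
});

// Run a tool, validate its output, and record the raw input/output on failure.
async function callTool<T>(
  name: string,
  input: unknown,
  fn: (input: unknown) => Promise<unknown>,
  schema: z.ZodType<T>
): Promise<T> {
  const raw = await fn(input);
  const parsed = schema.safeParse(raw);
  if (!parsed.success) {
    console.error(`tool ${name} returned an unexpected shape`, { input, raw, issues: parsed.error.issues });
    throw new Error(`tool ${name}: invalid result`);
  }
  return parsed.data;
}

// Usage (fetchAccount is hypothetical):
// const account = await callTool("fetch-account", { customerId }, fetchAccount, AccountSchema);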

Trace the full execution path

For production AI systems, tracing should capture the entire workflow:

  • parent request
  • child spans for each stage
  • model inputs
  • model outputs
  • tool calls
  • tool results
  • token usage
  • latency
  • cost
  • metadata
  • final structured output
  • scores or evaluation results

The important detail is nesting.

A lot of teams instrument only the top-level model call. That is not enough for agentic systems.

You want a trace that looks more like this:

support-ticket-run
  ├── collect-context
  │   ├── fetch-account
  │   ├── search-help-articles
  │   └── fetch-ticket-history
  ├── triage-specialist
  │   └── llm-call
  ├── policy-reviewer
  │   └── llm-call
  ├── reply-writer
  │   └── llm-call
  └── finalize-result
      └── maybe-escalate

This lets you answer the questions that matter:

  • Which step failed?
  • Did the model receive the right context?
  • Did the tool return the expected data?
  • Did latency come from retrieval, reasoning, or final generation?
  • Did the model output violate policy?
  • Did a cheaper model behave differently?
  • Did a prompt change improve one case but regress another?

Without tracing, you are guessing.

And guessing is not an engineering process.
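
Instrumentation does not have to start with a vendor SDK. A hypothetical withSpan helper is enough to sketch the nesting; the point is that every stage and tool call records its parent, inputs, outputs, and duration:

// Minimal nested-span sketch. A real setup would hand this data to an
// observability backend; here it only shows what each span should carry.
interface Span {
  name: string;
  parent?: string;
  startMs: number;
  durationMs?: number;
  input?: unknown;
  output?: unknown;
}

const spans: Span[] = [];

async function withSpan<T>(
  name: string,
  parent: string | undefined,
  input: unknown,
  fn: () => Promise<T>
): Promise<T> {
  const span: Span = { name, parent, startMs: Date.now(), input };
  spans.push(span);
  try {
    const output = await fn();
    span.output = output;
    return output;
  } finally {
    span.durationMs = Date.now() - span.startMs;
  }
}

// Usage: nest spans to mirror the tree above, e.g.
// await withSpan("support-ticket-run", undefined, ticketText, () =>
//   withSpan("collect-context", "support-ticket-run", ticketText, () => collectContext(ticketText))
// );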

Build a golden dataset before you trust the system

A production AI application needs evaluation data.

At the beginning, this can be small. It does not need to be perfect. But it needs to exist.

For the workshop support agent, the golden dataset contained representative support tickets with expected properties, such as:

  • correct category
  • expected severity
  • whether escalation is required
  • whether the output schema is valid
  • whether the reply follows policy
  • whether the customer-facing response is appropriate

This dataset becomes the baseline for safe iteration.
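
In code, that can start as nothing more than an array of hand-written cases; a minimal sketch, with illustrative field names rather than a required format:

interface GoldenCase {
  input: string;                               // raw ticket text
  expectedCategory: string;
  expectedSeverity: "low" | "medium" | "high";
  expectsEscalation: boolean;
}

const goldenDataset: GoldenCase[] = [
  {
    input: "I was charged twice for my March invoice.",
    expectedCategory: "billing",
    expectedSeverity: "medium",
    expectsEscalation: false,
  },
  {
    input: "The CSV export returns a 500 error for one of our workspaces.",
    expectedCategory: "technical",
    expectedSeverity: "high",
    expectsEscalation: true,
  },
];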

Every time you change the prompt, model, tool behavior, or workflow, you can rerun the evaluation and ask:

Did this change improve the system, or did it just look better on one example?

That distinction matters.

Prompt editing without evaluations is just production gambling.

Use deterministic scores where possible

Not every evaluation needs an LLM judge.

Some checks should be deterministic:

// SupportTicketOutputSchema is assumed to be a Zod (or similar) schema for the structured output
function hasValidSchema(output: unknown): boolean {
  return SupportTicketOutputSchema.safeParse(output).success;
}

Useful deterministic checks include:

  • schema validity
  • required fields present
  • escalation reason exists when escalation is required
  • severity is within allowed enum values
  • category matches expected label
  • reply is not empty
  • internal-only data is not present in customer reply

These checks are cheap, fast, stable, and should run often.

They are the AI equivalent of unit tests and type checks.
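
A few more of those checks, sketched as plain predicates over the structured output (the field names are assumptions about the output shape, not a required schema):

// Deterministic scorers are just boolean (or 0/1) functions over the output.
interface TicketOutput {
  severity: "low" | "medium" | "high";
  escalate: boolean;
  escalationReason?: string;
  customerReply: string;
  internalNotes: string;
}

const escalationHasReason = (o: TicketOutput) =>
  !o.escalate || Boolean(o.escalationReason);

const replyNotEmpty = (o: TicketOutput) =>
  o.customerReply.trim().length > 0;

const noInternalLeak = (o: TicketOutput) =>
  o.internalNotes.length === 0 || !o.customerReply.includes(o.internalNotes);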

Use LLM-as-judge for subjective quality

Some things cannot be reliably checked with simple assertions.

For example:

  • Is this reply helpful?
  • Is the tone appropriate?
  • Does it follow the refund policy?
  • Does it avoid overpromising?
  • Does it correctly reason about the customer’s situation?
  • Is the escalation decision justified?

For those, LLM-as-judge can be useful.

The key is to use it deliberately. Write clear rubrics. Score specific dimensions. Avoid vague prompts like “is this good?”

A better judge prompt asks for concrete criteria:

Evaluate whether the support response:
1. Correctly identifies the user’s issue.
2. Does not promise actions outside company policy.
3. Gives a clear next step.
4. Uses an appropriate customer-facing tone.
5. Escalates when business impact is high.

Return a score from 0 to 1 and a short rationale.

LLM judges are not magic. They are another probabilistic component. But when paired with deterministic checks and real production traces, they give you a scalable way to evaluate nuance.
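
Wiring a judge up is just another model call that returns a score. A minimal sketch against the same OpenAI client used earlier; the JSON-based response handling is a simplification, and the rubric above would go into the system prompt:

// Ask a judge model to score one response against a rubric.
// Assumes `openai` is the client from the earlier snippet.
async function judgeReply(ticket: string, reply: string): Promise<{ score: number; rationale: string }> {
  const completion = await openai.chat.completions.create({
    model: "gpt-5-mini",
    messages: [
      {
        role: "system",
        content:
          "You evaluate support responses against the rubric provided. " +
          'Respond with JSON: {"score": <number between 0 and 1>, "rationale": "<one sentence>"}',
      },
      { role: "user", content: `Ticket:\n${ticket}\n\nResponse:\n${reply}` },
    ],
    response_format: { type: "json_object" },
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}");
}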

Offline evaluation is not enough

A golden dataset gives you confidence before deployment.

But production data is where the real failures appear.

Users will phrase things oddly. They will omit important details. They will create conflicting signals. They will say something is “not urgent” while describing a CFO blocked before a board meeting.

That example came up in the workshop:

"This isn't urgent, but our CFO can't export the invoices before the board meeting."

A weak triage agent may classify this as low severity because the user said “not urgent.”

A better system understands the business context:

  • CFO
  • invoices
  • board meeting
  • likely time-sensitive
  • high business impact

This is exactly the kind of failure that does not always appear in initial test data.

So production traces should feed back into the evaluation loop.

When you find a failure:

  1. Capture the trace.
  2. Add the case to the dataset.
  3. Write or update the scoring rule.
  4. Fix the prompt, tool, policy, or workflow.
  5. Rerun evaluations.
  6. Compare against previous runs.
  7. Deploy only if the fix does not introduce regressions.

That loop is the real production AI workflow.
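
Online, the same scorers can run over a sample of production traces so failures surface as dataset candidates instead of support escalations. A minimal sketch, reusing the TicketOutput shape from the deterministic checks above; the trace shape and sampling approach are illustrative:

// Score a sample of production runs and collect anything that fails a check.
interface ProductionTrace {
  input: string;
  output: TicketOutput;
}

function findRegressionCandidates(
  traces: ProductionTrace[],
  scorers: Array<(o: TicketOutput) => boolean>,
  sampleRate = 0.1
): ProductionTrace[] {
  return traces
    .filter(() => Math.random() < sampleRate)                          // sample rather than score everything
    .filter((trace) => scorers.some((score) => !score(trace.output))); // keep failures for the golden dataset
}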

Model switching needs evaluation discipline

Trainline also described a very practical problem: model cost.

At scale, LLM bills can become painful fast. Teams naturally want to switch models, use cheaper models, reduce token usage, or route simpler requests to smaller models.

But model switching without evaluation is dangerous.

A cheaper model may perform similarly on simple tickets and fail on high-impact edge cases. A newer model may improve reasoning but change tone. A faster model may reduce latency but increase escalation mistakes.

The right question is not:

Can we switch to a cheaper model?

The right question is:

Can we switch to a cheaper model without degrading the scores that matter?

That requires offline evaluation, online monitoring, and trace-level comparison.
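
Concretely, that means running the same golden dataset through the pipeline with each candidate model and comparing the scores side by side. A minimal sketch; runAgent stands in for the real staged pipeline, parameterized by model, and the pass criteria are deliberately simple:

// Compare candidate models on the same golden dataset.
async function compareModels(models: string[], dataset: GoldenCase[]) {
  for (const model of models) {
    let passed = 0;
    for (const example of dataset) {
      const output = await runAgent(example.input, { model }); // placeholder for the staged pipeline
      if (
        output.severity === example.expectedSeverity &&
        output.escalate === example.expectsEscalation
      ) {
        passed++;
      }
    }
    console.log(`${model}: ${passed}/${dataset.length} cases passed`);
  }
}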

Managed prompts help cross-functional teams

Another practical point: prompts often become collaboration bottlenecks.

Engineers own the codebase, but product managers, support leads, legal reviewers, and domain experts often understand the desired behavior better than the engineering team.

If every prompt change requires a code change, review, deploy, and engineering handoff, iteration slows down.

Managed prompts and parameters solve part of this.

They allow teams to:

  • version prompts
  • track who changed what
  • compare prompt versions
  • update model parameters
  • collaborate with non-engineers
  • test changes before rollout
  • keep production behavior reproducible

This does not mean abandoning Git or software discipline.

The better pattern is to keep prompts, tools, and configuration synchronized with code where needed, while still giving teams a managed operational layer for experimentation and review.

For regulated industries, this matters even more. You need to know what changed, when it changed, who changed it, and what effect it had.
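
Even without a specific platform, the minimum bar is that every prompt running in production is a versioned record rather than a string literal buried in code. A sketch of what such a record might carry; the fields are illustrative:

// Enough metadata to answer: what changed, when, who changed it, and
// which evaluation run approved it.
interface ManagedPrompt {
  name: string;            // e.g. "triage-specialist"
  version: number;
  model: string;
  temperature: number;
  template: string;        // prompt text with placeholders
  updatedBy: string;
  updatedAt: string;       // ISO timestamp
  approvedByEvalRun?: string;
}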

The production AI flywheel

The workshop’s core operating model can be summarized as a flywheel:

Build → Trace → Evaluate → Find failures → Remediate → Deploy → Monitor → Repeat

More concretely:

  1. Start with a working agent
    Even if it is simple.

  2. Break it into explicit stages
    Separate context gathering, reasoning, policy checks, reply generation, and final output.

  3. Add tracing
    Capture model calls, tool calls, latency, token usage, cost, inputs, outputs, and metadata.

  4. Create a golden dataset
    Start with representative examples and known edge cases.

  5. Add deterministic scores
    Validate structure, required fields, categories, escalation rules, and other objective behavior.

  6. Add LLM judges where needed
    Evaluate tone, helpfulness, policy compliance, and reasoning quality.

  7. Run offline evaluations
    Before shipping prompt, model, or workflow changes.

  8. Score production traces
    Use online evaluation and sampling to detect real-world failures.

  9. Turn failures into tests
    Every production failure should improve your dataset.

  10. Compare experiments over time
    Do not trust a change unless you can see its effect.

The main lesson for AI developers

The future of AI engineering is not just better models.

It is better systems around models.

A production-grade AI application needs the same seriousness we already expect from software systems: observability, testing, versioning, review, deployment discipline, and feedback loops.

But it also needs ML-style evaluation: datasets, scoring, model comparison, judge rubrics, and continuous monitoring.

The teams that win will not be the ones with the fanciest demo.

They will be the ones that can answer, with evidence:

  • What changed?
  • Did quality improve?
  • Did cost increase?
  • Did latency regress?
  • Which failure modes remain?
  • Which users are affected?
  • Can we reproduce the issue?
  • Can we safely ship the fix?

That is the difference between an AI prototype and an AI product.

And for agentic systems, that difference is everything.
