Lei Ye

Posted on • Originally published at lei-ye.dev

What Breaks After Your AI Demo Works


A Short Story of How My AI Demo Worked and Failed

A few weeks ago I built a small AI API. Nothing fancy. Just a simple endpoint.

response = llm(prompt)

It worked.

Requests came in. The model responded. Everything looked good.
Until the second week.

The First Question

A teammate asked:

“Which request generated this output?”

I checked the logs. There was nothing useful there.

NO request ID.
NO trace.
NO connection between the prompt and the output.

The system worked — but it wasn’t traceable.

The Second Question

Very quickly another question appeared.

“Why did our AI bill jump yesterday?”

I had no answer.

We were calling models through an API wrapper, but we weren’t recording:

  • Token usage
  • Model pricing
  • Request-level cost

We had built an AI system that spent money invisibly.

The Third Question

Then something more subtle happened.

A user reported that an output looked wrong. The model had responded successfully, but the answer was clearly not useful. Which raised another question:

“How do we know if a model response is acceptable?”

We didn’t.

The API only knew whether the model responded, not whether the result made sense.

The Realization

The model wasn't the problem. The system around the model was. AI APIs are fundamentally different from traditional APIs. They introduce three operational challenges:

Challenge            What it means
Observability        Can we trace what happened?
Economics            How much did this request cost?
Output reliability   Was the response acceptable?

Without solving these, AI systems quickly become hard to operate. So I built a small reference project to explore this problem. I called it Maester.


The Minimal Reliability Architecture

A reliable AI API request should pass through a few structured steps.

Client Request
      ↓
API Middleware
(request_id + trace_id)
      ↓
Route Handler
      ↓
Model Gateway
      ↓
Cost Metering
      ↓
Evaluation
      ↓
Structured Logs
      ↓
Response

Each step adds operational clarity.
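As a rough sketch, the first two steps (middleware attaching identifiers) can be written framework-agnostically. The `ctx` dict and `new_id` helper here are illustrative stand-ins, not Maester's actual API:

```python
import uuid

def new_id() -> str:
    # Short unique identifier; a real system might use ULIDs or W3C trace IDs.
    return uuid.uuid4().hex

def attach_ids(ctx: dict) -> dict:
    """Middleware step: tag the request context before it reaches the route handler."""
    ctx["request_id"] = new_id()
    ctx["trace_id"] = new_id()
    return ctx
```

In a real web framework this logic would live in HTTP middleware, and the IDs would typically also be echoed back in response headers.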

1. Observability: Making AI Requests Traceable

The first primitive is observability. Every request should be traceable.
In Maester, middleware attaches:

request_id
trace_id

to the request context.

Example:

request_id = new_id()
trace_id = start_trace()

These identifiers propagate through the entire request lifecycle. Then operations are wrapped in spans:

with span("model_generate", model=model_name) as sp:
    response = gateway.generate(prompt)

The span records:

  • Operation name
  • Duration
  • Attributes

Example log output:

{
  "event": "span_end",
  "span": "model_generate",
  "duration_ms": 412,
  "model": "gpt-4o-mini"
}

This gives immediate insight into where time is spent.
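A `span` helper like the one above can be sketched as a small context manager that emits a JSON log line when the operation finishes. The implementation details here are illustrative, not Maester's actual code:

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def span(name, **attributes):
    """Time a named operation and emit a structured span_end event on exit."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(json.dumps({
            "event": "span_end",
            "span": name,
            "duration_ms": int((time.perf_counter() - start) * 1000),
            **attributes,
        }))
```

Because the timing lives in `finally`, the span is recorded even when the model call raises.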

2. Cost Metering: AI Systems Spend Money Per Request

Unlike traditional APIs, AI requests have direct monetary cost. Token usage translates into real spend. So every request should produce a cost record.

Example:

cost_record = meter.record(
    model=model_name,
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens,
)

The meter uses a pricing catalog:

MODEL_PRICING = {
    "gpt-4o-mini": {
        "input_per_1k": 0.00015,
        "output_per_1k": 0.00060,
    }
}

The request returns:

input_tokens
output_tokens
total_cost

Example response fragment:

{
  "input_tokens": 1200,
  "output_tokens": 350,
  "total_cost_usd": 0.00039
}

Now the API answers a critical question:

“What did this request cost?”
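A per-request meter can be sketched against a pricing catalog like the one above. The `CostMeter` and `CostRecord` shapes are illustrative, and the prices are the article's example values, not official rates:

```python
from dataclasses import dataclass

MODEL_PRICING = {
    "gpt-4o-mini": {
        "input_per_1k": 0.00015,
        "output_per_1k": 0.00060,
    }
}

@dataclass
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    total_cost_usd: float

class CostMeter:
    def __init__(self, pricing):
        self.pricing = pricing

    def record(self, model, input_tokens, output_tokens):
        """Convert token counts into a dollar figure using the catalog."""
        price = self.pricing[model]
        cost = (input_tokens / 1000) * price["input_per_1k"] \
             + (output_tokens / 1000) * price["output_per_1k"]
        return CostRecord(model, input_tokens, output_tokens, round(cost, 6))
```

For the example request above (1,200 input tokens, 350 output tokens), this yields $0.00039.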

3. Evaluation: Successful Calls Aren’t Always Correct

Even if a model responds successfully, the output may still be unusable. That is where evaluation comes in.

In Maester, responses pass through a simple evaluator:

result = evaluator.evaluate(prompt, response)

Current checks include:

  1. Non-empty response
  2. Required term presence
  3. Maximum length

Example evaluation result:

{
  "passed": true,
  "checks": {
    "non_empty": true,
    "required_terms": true,
    "max_length": true
  }
}
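The three checks can be sketched as a tiny evaluator class; the constructor arguments and default length limit here are illustrative:

```python
class Evaluator:
    def __init__(self, required_terms=None, max_length=4000):
        self.required_terms = required_terms or []
        self.max_length = max_length

    def evaluate(self, prompt, response):
        """Run cheap deterministic checks against a model response."""
        checks = {
            "non_empty": bool(response and response.strip()),
            "required_terms": all(
                term.lower() in response.lower() for term in self.required_terms
            ),
            "max_length": len(response) <= self.max_length,
        }
        return {"passed": all(checks.values()), "checks": checks}
```

Deterministic checks like these are deliberately cheap; they catch the obvious failures before any heavier validation runs.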

This pattern becomes more important as systems grow. Evaluation can evolve into:

  • Structured output validation
  • Hallucination detection
  • Policy enforcement
  • Safety filters

Why Not Just Use OpenTelemetry

I considered adopting OpenTelemetry at the very beginning of this project, but decided to build a home-grown layer instead, because OpenTelemetry solves a different problem. It provides:

  • Distributed tracing
  • Metrics exporters
  • Telemetry pipelines

But Maester focuses on application-level reliability primitives. Think of it as the layer that answers:

What happened in this AI request?
What model was called?
What did it cost?
Did the result pass validation?

These signals can later be exported to full observability stacks.

The Worker Path

AI systems rarely run only inside HTTP requests. Background jobs often run:

  • Batch inference
  • Evaluation pipelines
  • Data enrichment tasks

Maester includes a worker example to demonstrate that the same reliability primitives apply there. Worker execution uses the same tools:

  • Tracing
  • Cost metering
  • Evaluation
  • Structured logs

Reliability should not depend on the entrypoint.
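As a self-contained sketch of that idea, a worker job can emit the same signals as the HTTP path. The gateway is stubbed here, and the metering and evaluation are simplified inline versions of the components described above:

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def span(name, **attrs):
    # Minimal stand-in for the tracing helper described earlier.
    start = time.perf_counter()
    yield
    print(json.dumps({"event": "span_end", "span": name,
                      "duration_ms": int((time.perf_counter() - start) * 1000), **attrs}))

def fake_generate(prompt):
    # Stand-in for a real model gateway call.
    return {"text": f"summary of: {prompt}",
            "input_tokens": len(prompt), "output_tokens": 12}

def process_job(prompt):
    """Worker entrypoint: same tracing / metering / evaluation as the HTTP path."""
    with span("worker_generate", model="gpt-4o-mini"):
        result = fake_generate(prompt)
    cost = (result["input_tokens"] * 0.00015
            + result["output_tokens"] * 0.00060) / 1000
    passed = bool(result["text"].strip())
    return {"output": result["text"],
            "total_cost_usd": round(cost, 6),
            "passed": passed}
```

The job returns the same cost and evaluation fields the API response carries, so batch work shows up in the same logs and dashboards as interactive traffic.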

What This Architecture Achieves

With only a few modules, the system now answers:

Question                              Component
What request generated this output?   tracing
How long did the model call take?     spans
How many tokens were used?            cost meter
What did it cost?                     pricing model
Was the output valid?                 evaluator

These signals turn a black-box AI API into a traceable system.


Final Thought

Most reliability discussions around AI focus on models. But reliability often comes from system design, not model quality.

A simple architecture that records:

1. What happened
2. What it cost
3. Whether the result was acceptable

can dramatically improve how AI systems are operated.

The earlier these ideas are introduced into a system, the easier that system will be to maintain.





Note: This article was originally published on my engineering blog where I document the design of Maester, an AI SaaS infrastructure system built in public.
Original post: What Breaks After Your AI Demo Works.

Top comments (1)

Hamza KONTE

The observability gap is what turns a working demo into a production liability. "Which request generated this output?" has no answer without request IDs baked in from the start — and retrofitting trace IDs into a running system is significantly harder than starting with them.

The bill jump question is the more brutal one because it's invisible until it's already happened. Token usage without per-request attribution is the equivalent of running a server with no per-endpoint metrics — you know the aggregate is high but have no idea which path is burning the most.

There's a related gap on the prompt side: when prompts are unstructured blobs rather than typed blocks, you also lose the ability to audit why a particular response was generated. Structured prompts (explicit role, constraints, output_format) are the counterpart to request tracing — they make the intent visible, not just the output.

We built flompt (flompt.dev) for this — a visual prompt builder that makes prompt structure auditable. Open-source: github.com/Nyrok/flompt