A Short Story of How My AI Demo Worked and Failed
A few weeks ago I built a small AI API. Nothing fancy. Just a simple endpoint.
response = llm(prompt)
It worked.
Requests came in. The model responded. Everything looked good.
Until the second week.
The First Question
A teammate asked:
“Which request generated this output?”
I checked the logs. There was nothing useful there.
NO request ID.
NO trace.
NO connection between the prompt and the output.
The system worked — but it wasn’t traceable.
The Second Question
Very quickly another question appeared.
“Why did our AI bill jump yesterday?”
I had no answer.
We were calling models through an API wrapper, but we weren’t recording:
- Token usage
- Model pricing
- Request-level cost
We had built an AI system that spent money invisibly.
The Third Question
Then something more subtle happened.
A user reported that an output looked wrong. The model had responded successfully, but the answer was clearly not useful. Which raised another question:
“How do we know if a model response is acceptable?”
We didn’t.
The API only knew whether the model responded, not whether the result made sense.
The Realization
The model wasn't the problem. The system around the model was. AI APIs are fundamentally different from traditional APIs. They introduce three operational challenges:
| Challenge | What it means |
|---|---|
| Observability | Can we trace what happened? |
| Economics | How much did this request cost? |
| Output reliability | Was the response acceptable? |
Without solving these, AI systems quickly become hard to operate. So I built a small reference project to explore this problem. I called it Maester.
The Minimal Reliability Architecture
A reliable AI API request should pass through a few structured steps.
Client Request
↓
API Middleware
(request_id + trace_id)
↓
Route Handler
↓
Model Gateway
↓
Cost Metering
↓
Evaluation
↓
Structured Logs
↓
Response
Each step adds operational clarity.
1. Observability: Making AI Requests Traceable
The first primitive is observability. Every request should be traceable.
In Maester, middleware attaches two identifiers to the request context:
- request_id
- trace_id
Example:
request_id = new_id()
trace_id = start_trace()
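A minimal sketch of what these helpers might do, assuming UUID-based identifiers (the implementation is my guess, not Maester's actual code):

```python
import uuid

def new_id() -> str:
    # Unique identifier for this request, attached by middleware.
    return uuid.uuid4().hex

def start_trace() -> str:
    # Trace identifier shared by every span within the request.
    return uuid.uuid4().hex

request_id = new_id()
trace_id = start_trace()
```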
These identifiers propagate through the entire request lifecycle. Then operations are wrapped in spans:
with span("model_generate", model=model_name) as sp:
    response = gateway.generate(prompt)
The span records:
- Operation name
- Duration
- Attributes
Example log output:
{
  "event": "span_end",
  "span": "model_generate",
  "duration_ms": 412,
  "model": "gpt-4o-mini"
}
This gives immediate insight into where time is spent.
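One plausible implementation of the `span` helper is a small context manager that times the operation and emits a JSON log line on exit. This is a sketch under that assumption, not Maester's actual code:

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def span(name, **attributes):
    # Time the wrapped operation and emit a structured log line on exit.
    record = {"event": "span_end", "span": name, **attributes}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000)
        print(json.dumps(record))
```

Used exactly as in the snippet above; when the `with` block exits, it prints a `span_end` log line like the example output.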
2. Cost Metering: AI Systems Spend Money Per Request
Unlike traditional APIs, AI requests have direct monetary cost. Token usage translates into real spend. So every request should produce a cost record.
Example:
cost_record = meter.record(
    model=model_name,
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens,
)
The meter uses a pricing catalog:
MODEL_PRICING = {
    "gpt-4o-mini": {
        "input_per_1k": 0.00015,
        "output_per_1k": 0.00060,
    }
}
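Combining the catalog with token counts is simple arithmetic. A sketch of what `meter.record` could compute (the `Meter` class shape is my assumption):

```python
# Pricing catalog as above.
MODEL_PRICING = {
    "gpt-4o-mini": {"input_per_1k": 0.00015, "output_per_1k": 0.00060},
}

class Meter:
    def record(self, model: str, input_tokens: int, output_tokens: int) -> dict:
        # Cost = tokens / 1000 * per-1k price, summed over input and output.
        pricing = MODEL_PRICING[model]
        cost = (input_tokens / 1000) * pricing["input_per_1k"] \
             + (output_tokens / 1000) * pricing["output_per_1k"]
        return {
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_cost_usd": round(cost, 6),
        }

meter = Meter()
```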
The request returns:
- input_tokens
- output_tokens
- total_cost
Example response fragment:
{
  "input_tokens": 1200,
  "output_tokens": 350,
  "total_cost_usd": 0.00039
}
Now the API answers a critical question:
“What did this request cost?”
3. Evaluation: Successful Calls Aren’t Always Correct
Even if a model responds successfully, the output may still be unusable. That is where evaluation comes in.
In Maester, responses pass through a simple evaluator:
result = evaluator.evaluate(prompt, response)
Current checks include:
- Non-empty response
- Required term presence
- Maximum length
Example evaluation result:
{
  "passed": true,
  "checks": {
    "non_empty": true,
    "required_terms": true,
    "max_length": true
  }
}
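A sketch of an evaluator implementing those three checks (the class and parameter names are assumptions, not Maester's exact API):

```python
class Evaluator:
    def __init__(self, required_terms=(), max_length=4000):
        self.required_terms = required_terms
        self.max_length = max_length

    def evaluate(self, prompt: str, response: str) -> dict:
        # Each check is a boolean; the response passes only if all do.
        checks = {
            "non_empty": bool(response.strip()),
            "required_terms": all(
                term.lower() in response.lower() for term in self.required_terms
            ),
            "max_length": len(response) <= self.max_length,
        }
        return {"passed": all(checks.values()), "checks": checks}
```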
This pattern becomes more important as systems grow. Evaluation can evolve into:
- Structured output validation
- Hallucination detection
- Policy enforcement
- Safety filters
Why Not Just Use OpenTelemetry?
I considered adopting OpenTelemetry at the very beginning of this project, but decided to build a small homegrown layer instead, because OpenTelemetry solves a different problem. It provides:
- Distributed tracing
- Metrics exporters
- Telemetry pipelines
But Maester focuses on application-level reliability primitives. Think of it as the layer that answers:
- What happened in this AI request?
- What model was called?
- What did it cost?
- Did the result pass validation?
These signals can later be exported to full observability stacks.
The Worker Path
AI systems rarely run only inside HTTP requests. Background jobs often run:
- Batch inference
- Evaluation pipelines
- Data enrichment tasks
Maester includes a worker example to demonstrate that the same reliability primitives apply there. Worker execution uses the same tools:
- Tracing
- Cost metering
- Evaluation
- Structured logs
Reliability should not depend on the entrypoint.
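A hedged sketch of what such a worker path could look like, with `generate`, `meter_record`, and `evaluate` as stand-ins for the model gateway, cost meter, and evaluator from the HTTP path (the function names and wiring are my assumptions):

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def span(name, **attrs):
    # Minimal stand-in for the tracing helper used on the HTTP path.
    start = time.perf_counter()
    try:
        yield
    finally:
        attrs.update(event="span_end", span=name,
                     duration_ms=round((time.perf_counter() - start) * 1000))
        print(json.dumps(attrs))

def process_batch(items, generate, meter_record, evaluate):
    # Same primitives as the HTTP path: tracing, metering,
    # evaluation, structured logs -- only the entrypoint differs.
    results = []
    for item in items:
        with span("worker_generate", item_id=item["id"]):
            response = generate(item["prompt"])
        results.append({
            "id": item["id"],
            "response": response,
            "cost": meter_record(response),
            "evaluation": evaluate(item["prompt"], response),
        })
    return results
```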
What This Architecture Achieves
With only a few modules, the system now answers:
| Question | Component |
|---|---|
| What request generated this output? | tracing |
| How long did the model call take? | spans |
| How many tokens were used? | cost meter |
| What did it cost? | pricing model |
| Was the output valid? | evaluator |
These signals turn a black-box AI API into a traceable system.
Final Thought
Most reliability discussions around AI focus on models. But reliability often comes from system design, not model quality.
A simple architecture that records:
1. What happened
2. What it cost
3. Whether the result was acceptable
can dramatically improve how AI systems are operated.
The earlier these ideas are introduced into a system, the easier that system will be to maintain.
Note: This article was originally published on my engineering blog where I document the design of Maester, an AI SaaS infrastructure system built in public.
Original post: What Breaks After Your AI Demo Works.