A Short Story of How My AI Demo Worked and Failed
A few weeks ago I built a small AI API. Nothing fancy. Just a simple endpoint.
response = llm(prompt)
It worked.
Requests came in. The model responded. Everything looked good.
Until the second week.
The First Question
A teammate asked:
“Which request generated this output?”
I checked the logs. There was nothing useful there.
NO request ID.
NO trace.
NO connection between the prompt and the output.
The system worked — but it wasn’t traceable.
The Second Question
Very quickly another question appeared.
“Why did our AI bill jump yesterday?”
I had no answer.
We were calling models through an API wrapper, but we weren’t recording:
- Token usage
- Model pricing
- Request-level cost
We had built an AI system that spent money invisibly.
The Third Question
Then something more subtle happened.
A user reported that an output looked wrong. The model had responded successfully, but the answer was clearly not useful. Which raised another question:
“How do we know if a model response is acceptable?”
We didn’t.
The API only knew whether the model responded, not whether the result made sense.
The Realization
The model wasn't the problem. The system around the model was. AI APIs are fundamentally different from traditional APIs. They introduce three operational challenges:
| Challenge | What it means |
|---|---|
| Observability | Can we trace what happened? |
| Economics | How much did this request cost? |
| Output reliability | Was the response acceptable? |
Without solving these, AI systems quickly become hard to operate. So I built a small reference project to explore this problem. I called it Maester.
The Minimal Reliability Architecture
A reliable AI API request should pass through a few structured steps.
Client Request
↓
API Middleware
(request_id + trace_id)
↓
Route Handler
↓
Model Gateway
↓
Cost Metering
↓
Evaluation
↓
Structured Logs
↓
Response
Each step adds operational clarity.
1. Observability: Making AI Requests Traceable
The first primitive is observability. Every request should be traceable.
In Maester, middleware attaches two identifiers to the request context:
- request_id
- trace_id
Example:
request_id = new_id()
trace_id = start_trace()
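A minimal sketch of what these helpers might do, assuming UUID-based identifiers (the implementation is my guess, not Maester's actual code):

```python
import uuid

def new_id() -> str:
    # Unique identifier for this request, attached by middleware.
    return uuid.uuid4().hex

def start_trace() -> str:
    # Trace identifier shared by every span within the request.
    return uuid.uuid4().hex

request_id = new_id()
trace_id = start_trace()
```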
These identifiers propagate through the entire request lifecycle. Then operations are wrapped in spans:
with span("model_generate", model=model_name) as sp:
    response = gateway.generate(prompt)
The span records:
- Operation name
- Duration
- Attributes
Example log output:
{
  "event": "span_end",
  "span": "model_generate",
  "duration_ms": 412,
  "model": "gpt-4o-mini"
}
This gives immediate insight into where time is spent.
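One plausible implementation of the `span` helper is a small context manager that times the operation and emits a JSON log line on exit. This is a sketch under that assumption, not Maester's actual code:

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def span(name, **attributes):
    # Time the wrapped operation and emit a structured log line on exit.
    record = {"event": "span_end", "span": name, **attributes}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000)
        print(json.dumps(record))
```

Used exactly as in the snippet above; when the `with` block exits, it prints a `span_end` log line like the example output.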
2. Cost Metering: AI Systems Spend Money Per Request
Unlike traditional APIs, AI requests have direct monetary cost. Token usage translates into real spend. So every request should produce a cost record.
Example:
cost_record = meter.record(
    model=model_name,
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens,
)
The meter uses a pricing catalog:
MODEL_PRICING = {
    "gpt-4o-mini": {
        "input_per_1k": 0.00015,
        "output_per_1k": 0.00060,
    }
}
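Combining the catalog with token counts is simple arithmetic. A sketch of what `meter.record` could compute (the `Meter` class shape is my assumption):

```python
# Pricing catalog as above.
MODEL_PRICING = {
    "gpt-4o-mini": {"input_per_1k": 0.00015, "output_per_1k": 0.00060},
}

class Meter:
    def record(self, model: str, input_tokens: int, output_tokens: int) -> dict:
        # Cost = tokens / 1000 * per-1k price, summed over input and output.
        pricing = MODEL_PRICING[model]
        cost = (input_tokens / 1000) * pricing["input_per_1k"] \
             + (output_tokens / 1000) * pricing["output_per_1k"]
        return {
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_cost_usd": round(cost, 6),
        }

meter = Meter()
```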
The request returns:
- input_tokens
- output_tokens
- total_cost
Example response fragment:
{
  "input_tokens": 1200,
  "output_tokens": 350,
  "total_cost_usd": 0.00039
}
Now the API answers a critical question:
“What did this request cost?”
3. Evaluation: Successful Calls Aren’t Always Correct
Even if a model responds successfully, the output may still be unusable. That is where evaluation comes in.
In Maester, responses pass through a simple evaluator:
result = evaluator.evaluate(prompt, response)
Current checks include:
- Non-empty response
- Required term presence
- Maximum length
Example evaluation result:
{
  "passed": true,
  "checks": {
    "non_empty": true,
    "required_terms": true,
    "max_length": true
  }
}
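A sketch of an evaluator implementing those three checks (the class and parameter names are assumptions, not Maester's exact API):

```python
class Evaluator:
    def __init__(self, required_terms=(), max_length=4000):
        self.required_terms = required_terms
        self.max_length = max_length

    def evaluate(self, prompt: str, response: str) -> dict:
        # Each check is a boolean; the response passes only if all do.
        checks = {
            "non_empty": bool(response.strip()),
            "required_terms": all(
                term.lower() in response.lower() for term in self.required_terms
            ),
            "max_length": len(response) <= self.max_length,
        }
        return {"passed": all(checks.values()), "checks": checks}
```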
This pattern becomes more important as systems grow. Evaluation can evolve into:
- Structured output validation
- Hallucination detection
- Policy enforcement
- Safety filters
Why Not Just Use OpenTelemetry?
I considered adopting OpenTelemetry at the very beginning of this project, but decided to build a small homegrown layer instead, because OpenTelemetry solves a different problem. It provides:
- Distributed tracing
- Metrics exporters
- Telemetry pipelines
But Maester focuses on application-level reliability primitives. Think of it as the layer that answers:
- What happened in this AI request?
- What model was called?
- What did it cost?
- Did the result pass validation?
These signals can later be exported to full observability stacks.
The Worker Path
AI systems rarely run only inside HTTP requests. Background jobs often run:
- Batch inference
- Evaluation pipelines
- Data enrichment tasks
Maester includes a worker example to demonstrate that the same reliability primitives apply there. Worker execution uses the same tools:
- Tracing
- Cost metering
- Evaluation
- Structured logs
Reliability should not depend on the entrypoint.
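A hedged sketch of what such a worker path could look like, with `generate`, `meter_record`, and `evaluate` as stand-ins for the model gateway, cost meter, and evaluator from the HTTP path (the function names and wiring are my assumptions):

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def span(name, **attrs):
    # Minimal stand-in for the tracing helper used on the HTTP path.
    start = time.perf_counter()
    try:
        yield
    finally:
        attrs.update(event="span_end", span=name,
                     duration_ms=round((time.perf_counter() - start) * 1000))
        print(json.dumps(attrs))

def process_batch(items, generate, meter_record, evaluate):
    # Same primitives as the HTTP path: tracing, metering,
    # evaluation, structured logs -- only the entrypoint differs.
    results = []
    for item in items:
        with span("worker_generate", item_id=item["id"]):
            response = generate(item["prompt"])
        results.append({
            "id": item["id"],
            "response": response,
            "cost": meter_record(response),
            "evaluation": evaluate(item["prompt"], response),
        })
    return results
```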
What This Architecture Achieves
With only a few modules, the system now answers:
| Question | Component |
|---|---|
| What request generated this output? | tracing |
| How long did the model call take? | spans |
| How many tokens were used? | cost meter |
| What did it cost? | pricing model |
| Was the output valid? | evaluator |
These signals turn a black-box AI API into a traceable system.
Final Thought
Most reliability discussions around AI focus on models. But reliability often comes from system design, not model quality.
A simple architecture that records:
1. What happened
2. What it cost
3. Whether the result was acceptable
can dramatically improve how AI systems are operated.
The earlier these ideas are introduced into a system, the easier that system will be to maintain.
Note: This article was originally published on my engineering blog where I document the design of Maester, an AI SaaS infrastructure system built in public.
Original post: What Breaks After Your AI Demo Works.