One of the weirdest parts of shipping LLM features is how quickly debugging turns into folklore.
It worked yesterday.
It works in your happy-path test.
Then a user asks something slightly different and the output gets strange.
At that point most teams are missing the same answers:
- which prompt version actually ran?
- what context was sent with the request?
- how many tokens did it use?
- did the output pass any checks?
- was the problem the prompt, retrieval, model choice, or input shape?
In other words: the problem is not just prompting.
It is observability.
The easiest way to make AI features less mysterious is to stop treating them like magic and start treating them like APIs.
That means every call should have:
- an identity
- a trace
- structured inputs and outputs
- usage and latency data
- simple quality checks
You do not need a giant platform to do this. A thin wrapper and consistent logs already go a long way.
## The mental model: every LLM call is an event
For debugging, each request to a model should leave behind enough evidence to answer:
- what was asked?
- what context was provided?
- which prompt version and model were used?
- what came back?
- did it match the output contract?
That is the minimum viable observability layer.
## Start with a standard envelope
I like logging each LLM call with a shared schema.
At minimum:
- `trace_id`: ties multiple steps to one user request
- `span_id`: identifies this specific model call
- `prompt_id`, `prompt_version`, and `model`
- model parameters
- sanitized input
- context metadata
- sanitized output
- token usage
- latency
- validation checks
- error details if the call failed
Example TypeScript type:
```typescript
export type LlmCallLog = {
  ts: string;
  trace_id: string;
  span_id: string;
  prompt_id: string;
  prompt_version: string;
  model: string;
  params: { temperature: number; top_p?: number; max_tokens?: number };
  input: { user_message: string; variables: Record<string, string> };
  context?: { source: string; id: string; excerpt: string }[];
  output?: { text: string };
  finish_reason?: string;
  usage?: { input_tokens: number; output_tokens: number; total_tokens: number };
  latency_ms: number;
  checks?: {
    name: string;
    ok: boolean;
    details?: string;
    score?: number;
  }[];
  error?: { message: string; type?: string };
};
```
Even if you only print this as JSON logs at first, it is already useful.
## Generate trace IDs at the edge
A lot of debugging pain disappears once every user request has a trace ID.
If your API gateway already creates one, reuse it.
Otherwise generate one when the request enters your app.
```typescript
import { randomUUID } from "node:crypto";

export function getTraceId(req: any) {
  return req.headers["x-trace-id"] ?? randomUUID();
}

export function newSpanId() {
  return randomUUID();
}
```
Now when something looks wrong, you can follow the whole story across steps instead of staring at one isolated completion.
## Version prompts like code
The most common AI debugging question is also the most embarrassing one:
“Wait, which prompt was production using?”
Do not rely on vibes or copy-pasted prompt strings hidden in code.
Give each prompt:
- a stable `prompt_id` like `support_reply`
- a `prompt_version`, ideally a git SHA or semantic version
Then log both on every call.
This makes rollback, comparison, and incident review dramatically easier.
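One lightweight way to enforce this is a small in-code prompt registry, so a prompt can never run without an id and version attached. This is a sketch; the entry shape, the `support_reply` prompt, and its version string are illustrative:

```typescript
// Minimal in-code prompt registry (illustrative shapes and values).
// Every prompt carries a stable id and an explicit version string,
// so both can be logged on every call.
type PromptEntry = {
  prompt_id: string;
  prompt_version: string; // e.g. a git SHA or a semantic version
  template: string;
};

const PROMPTS: Record<string, PromptEntry> = {
  support_reply: {
    prompt_id: "support_reply",
    prompt_version: "b48e91",
    template:
      "You are a support agent. Answer using only the provided context.\n\n" +
      "{{context}}\n\nUser: {{user_message}}",
  },
};

export function getPrompt(id: string): PromptEntry {
  const entry = PROMPTS[id];
  // Fail loudly instead of silently running an unknown prompt.
  if (!entry) throw new Error(`Unknown prompt_id: ${id}`);
  return entry;
}
```

Because the registry is plain code, prompt edits show up in code review and git history, and the logged version always matches what actually ran.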
## Log inputs and context, but sanitize aggressively
Inputs matter.
So does retrieval context.
But raw logs can become a privacy problem fast.
A decent compromise:
- redact secrets
- cap log size
- store document IDs and short excerpts instead of full documents
- hash or mask sensitive user fields when needed
A tiny sanitizer is already better than nothing:
```typescript
const SECRET_PATTERNS = [
  /sk-[A-Za-z0-9]{20,}/g,
  /AKIA[0-9A-Z]{16}/g,
];

export function sanitize(text: string) {
  let out = text;
  for (const re of SECRET_PATTERNS) out = out.replace(re, "[REDACTED]");
  return out.slice(0, 4000);
}
```
For RAG-style workflows, log chunk metadata plus a short excerpt:
```json
{
  "source": "kb",
  "id": "doc_418:chunk_07",
  "excerpt": "Refunds are issued within 5–7 business days..."
}
```
That gives you enough evidence to answer whether retrieval was useful or garbage.
## Add cheap checks before you reach for fancy evaluation
A lot of production failures are boring.
The output is malformed. Too long. Missing a required field. Missing citations. Full of placeholders.
That is good news, because boring failures are easy to catch.
Useful first checks:
- valid JSON
- schema validation
- max length
- required section headings
- “must include” or “must not include” phrases
- placeholder detection like `TODO` or `<INSERT>`
Example JSON schema validation with Ajv:
```typescript
import Ajv from "ajv";

const ajv = new Ajv();
const validate = ajv.compile({
  type: "object",
  properties: {
    answer: { type: "string" },
    citations: { type: "array", items: { type: "string" } }
  },
  required: ["answer"],
  additionalProperties: true
});

export function jsonSchemaCheck(raw: string) {
  try {
    const parsed = JSON.parse(raw);
    const ok = validate(parsed);
    return { ok: !!ok, details: ok ? undefined : ajv.errorsText(validate.errors) };
  } catch {
    return { ok: false, details: "Invalid JSON" };
  }
}
```
Do not overcomplicate this early.
Consistent checks beat sophisticated but fragile ones.
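The other checks in the list above are just as small. A sketch of a length check and a placeholder check, with thresholds and patterns that are examples, not recommendations:

```typescript
// Cheap, deterministic output checks. The limit and the regex are
// illustrative; tune them per prompt.
type CheckResult = { name: string; ok: boolean; details?: string };

export function maxLengthCheck(text: string, limit = 2000): CheckResult {
  return {
    name: "max_length",
    ok: text.length <= limit,
    details: text.length > limit ? `length ${text.length} > ${limit}` : undefined,
  };
}

export function placeholderCheck(text: string): CheckResult {
  // Catches leftover scaffolding like TODO or <INSERT ...> markers.
  const re = /\bTODO\b|<INSERT[^>]*>/i;
  const hit = text.match(re);
  return {
    name: "no_placeholders",
    ok: !hit,
    details: hit ? `found placeholder: ${hit[0]}` : undefined,
  };
}
```

Each check returns the same `{ name, ok, details }` shape, so they all drop into the `checks` array of the log envelope unchanged.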
## Add one small score
Pass/fail checks are good.
A lightweight score is even better.
For example, if your app answers questions from retrieved documents, you might track a rough groundedness score by comparing the output to retrieved excerpts.
It will not be perfect. It does not need to be.
It just needs to be consistent enough to spot drift.
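One crude but consistent version, purely as a sketch, is plain word overlap between the output and the retrieved excerpts:

```typescript
// Rough groundedness: fraction of output words that also appear in the
// retrieved excerpts. Crude, but stable enough to spot drift over time.
export function groundednessScore(output: string, excerpts: string[]): number {
  // Lowercase alphanumeric tokens; everything else is ignored.
  const tokenize = (s: string) => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const contextWords = new Set(excerpts.flatMap(tokenize));
  const outWords = tokenize(output);
  if (outWords.length === 0) return 0;
  const hits = outWords.filter((w) => contextWords.has(w)).length;
  return hits / outWords.length;
}
```

Logged as a `score` on a check, the absolute number means little, but a sudden drop across many calls usually means retrieval broke or the model started answering from thin air.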
## Wrap your client once
The simplest implementation pattern is a wrapper around your model client.
That wrapper should:
- accept prompt and model metadata
- execute the call
- run validation checks
- emit one structured event
Pseudo-code:
```typescript
export async function runPrompt({
  trace_id,
  prompt_id,
  prompt_version,
  model,
  params,
  input,
  context
}: any) {
  const span_id = newSpanId();
  const t0 = Date.now();
  try {
    const res = await llm.chat({ model, params, messages: buildMessages(input, context) });
    const outputText = res.text ?? "";
    const checks = [
      { name: "json_schema", ...jsonSchemaCheck(outputText) }
    ];
    console.log(JSON.stringify({
      ts: new Date().toISOString(),
      trace_id,
      span_id,
      prompt_id,
      prompt_version,
      model,
      params,
      input: {
        user_message: sanitize(input.user_message),
        variables: input.variables
      },
      context: context?.map((c: any) => ({ ...c, excerpt: sanitize(c.excerpt) })),
      output: { text: sanitize(outputText) },
      finish_reason: res.finish_reason,
      usage: res.usage,
      latency_ms: Date.now() - t0,
      checks
    }));
    return res;
  } catch (e: any) {
    console.log(JSON.stringify({
      ts: new Date().toISOString(),
      trace_id,
      span_id,
      prompt_id,
      prompt_version,
      model,
      latency_ms: Date.now() - t0,
      error: { message: String(e?.message ?? e), type: e?.name }
    }));
    throw e;
  }
}
```
That wrapper pays for itself the first time someone says, “users are getting weird answers again.”
## What to look for in the first week
Once the logs exist, patterns show up quickly.
I would start with these views:
- prompt versions with the most check failures
- latency spikes by model
- token outliers that suggest bloated context
- calls with missing or low-value retrieval context
- failure rate before and after prompt changes
You do not need a fancy dashboard immediately.
A few scripts over structured logs can already tell you a lot.
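For example, the first view on that list is a few lines of TypeScript over JSONL logs. This sketch assumes one envelope object per line, with the field names from `LlmCallLog` above; the file path in the usage comment is an example:

```typescript
import { readFileSync } from "node:fs";

// Count calls with at least one failed check, grouped by prompt version.
// Assumes one LlmCallLog JSON object per line of the input string.
export function failuresByPromptVersion(jsonl: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    const log = JSON.parse(line);
    const failed = (log.checks ?? []).some((c: { ok: boolean }) => !c.ok);
    if (!failed) continue;
    const key = `${log.prompt_id}@${log.prompt_version}`;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Usage (path is illustrative):
// console.log(failuresByPromptVersion(readFileSync("llm_calls.jsonl", "utf8")));
```

The same loop-and-group shape covers most of the other views: swap the grouping key for `model` and the value for `latency_ms` or `usage.total_tokens`.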
## A practical example
Suppose your support assistant suddenly starts producing long, vague replies.
Without observability, you guess:
- maybe the model changed
- maybe the retrieval is worse
- maybe the prompt was edited
- maybe users are asking weirder questions
With observability, you can check:
- prompt version changed from `support_reply@a13f2d` to `support_reply@b48e91`
- average retrieved chunk count dropped from 5 to 1
- JSON schema failures increased after the prompt edit
- token usage doubled because the new prompt added verbose examples
That turns a spooky production issue into a normal debugging session.
## Build a tiny debug bundle
When something goes wrong, collect a small incident bundle:
- `trace_id`, `prompt_id`, and `prompt_version`
- sanitized inputs
- retrieval IDs and excerpts
- output text
- checks and scores
That bundle makes it much easier for engineers, product people, and reviewers to talk about the same failure without guessing.
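Assembling the bundle can be as simple as filtering the same JSONL logs by `trace_id`. A sketch against the envelope above, with illustrative shapes:

```typescript
// Collect every logged span for one trace into a single incident bundle.
// Field names follow the LlmCallLog envelope described earlier.
export function buildDebugBundle(jsonl: string, traceId: string) {
  const spans = jsonl
    .split("\n")
    .filter((line) => line.trim())
    .map((line) => JSON.parse(line))
    .filter((log) => log.trace_id === traceId);
  return {
    trace_id: traceId,
    spans: spans.map((s) => ({
      prompt_id: s.prompt_id,
      prompt_version: s.prompt_version,
      input: s.input,
      context: s.context,
      output: s.output,
      checks: s.checks,
    })),
  };
}
```

Because the inputs and outputs were already sanitized at log time, the bundle is safe to paste into an issue tracker or share in an incident channel.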
## Closing
Prompting gets most of the attention.
Observability is what makes iteration safe.
If you can answer:
- what ran
- with which inputs
- using which context
- at what cost
- and whether it passed simple checks
then your AI feature stops feeling mysterious.
It starts behaving like the rest of your system: imperfect, debuggable, and improvable.
That is a much nicer place to build from.