One of the weirdest parts of shipping LLM features is how quickly debugging turns into folklore.
It worked yesterday.
It works in your happy-path test.
Then a user asks something slightly different and the output gets strange.
At that point most teams are missing the same answers:
- which prompt version actually ran?
- what context was sent with the request?
- how many tokens did it use?
- did the output pass any checks?
- was the problem the prompt, retrieval, model choice, or input shape?
In other words: the problem is not just prompting.
It is observability.
The easiest way to make AI features less mysterious is to stop treating them like magic and start treating them like APIs.
That means every call should have:
- an identity
- a trace
- structured inputs and outputs
- usage and latency data
- simple quality checks
You do not need a giant platform to do this. A thin wrapper and consistent logs already go a long way.
## The mental model: every LLM call is an event
For debugging, each request to a model should leave behind enough evidence to answer:
- what was asked?
- what context was provided?
- which prompt version and model were used?
- what came back?
- did it match the output contract?
That is the minimum viable observability layer.
## Start with a standard envelope
I like logging each LLM call with a shared schema.
At minimum:
- `trace_id`: ties multiple steps to one user request
- `span_id`: identifies this specific model call
- `prompt_id`, `prompt_version`, and `model`
- model parameters
- sanitized input
- context metadata
- sanitized output
- token usage
- latency
- validation checks
- error details if the call failed
Example TypeScript type:
```typescript
export type LlmCallLog = {
  ts: string;
  trace_id: string;
  span_id: string;
  prompt_id: string;
  prompt_version: string;
  model: string;
  params: { temperature: number; top_p?: number; max_tokens?: number };
  input: { user_message: string; variables: Record<string, string> };
  context?: { source: string; id: string; excerpt: string }[];
  output?: { text: string };
  finish_reason?: string;
  usage?: { input_tokens: number; output_tokens: number; total_tokens: number };
  latency_ms: number;
  checks?: {
    name: string;
    ok: boolean;
    details?: string;
    score?: number;
  }[];
  error?: { message: string; type?: string };
};
```
Even if you only print this as JSON logs at first, it is already useful.
## Generate trace IDs at the edge
A lot of debugging pain disappears once every user request has a trace ID.
If your API gateway already creates one, reuse it.
Otherwise generate one when the request enters your app.
```typescript
import { randomUUID } from "node:crypto";

export function getTraceId(req: any) {
  return req.headers["x-trace-id"] ?? randomUUID();
}

export function newSpanId() {
  return randomUUID();
}
```
Now when something looks wrong, you can follow the whole story across steps instead of staring at one isolated completion.
## Version prompts like code
The most common AI debugging question is also the most embarrassing one:
“Wait, which prompt was production using?”
Do not rely on vibes or copy-pasted prompt strings hidden in code.
Give each prompt:
- a stable `prompt_id` like `support_reply`
- a `prompt_version`, ideally a git SHA or semantic version
Then log both on every call.
This makes rollback, comparison, and incident review dramatically easier.
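One lightweight way to enforce this is a small in-code prompt registry, so a prompt can never run without an id and version attached. This is a sketch; the entry shape, the `support_reply` prompt, and its version string are illustrative:

```typescript
// Minimal in-code prompt registry (illustrative shapes and values).
// Every prompt carries a stable id and an explicit version string,
// so both can be logged on every call.
type PromptEntry = {
  prompt_id: string;
  prompt_version: string; // e.g. a git SHA or a semantic version
  template: string;
};

const PROMPTS: Record<string, PromptEntry> = {
  support_reply: {
    prompt_id: "support_reply",
    prompt_version: "b48e91",
    template:
      "You are a support agent. Answer using only the provided context.\n\n" +
      "{{context}}\n\nUser: {{user_message}}",
  },
};

export function getPrompt(id: string): PromptEntry {
  const entry = PROMPTS[id];
  // Fail loudly instead of silently running an unknown prompt.
  if (!entry) throw new Error(`Unknown prompt_id: ${id}`);
  return entry;
}
```

Because the registry is plain code, prompt edits show up in code review and git history, and the logged version always matches what actually ran.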
## Log inputs and context, but sanitize aggressively
Inputs matter.
So does retrieval context.
But raw logs can become a privacy problem fast.
A decent compromise:
- redact secrets
- cap log size
- store document IDs and short excerpts instead of full documents
- hash or mask sensitive user fields when needed
A tiny sanitizer is already better than nothing:
```typescript
const SECRET_PATTERNS = [
  /sk-[A-Za-z0-9]{20,}/g,
  /AKIA[0-9A-Z]{16}/g,
];

export function sanitize(text: string) {
  let out = text;
  for (const re of SECRET_PATTERNS) out = out.replace(re, "[REDACTED]");
  return out.slice(0, 4000);
}
```
For RAG-style workflows, log chunk metadata plus a short excerpt:
```json
{
  "source": "kb",
  "id": "doc_418:chunk_07",
  "excerpt": "Refunds are issued within 5–7 business days..."
}
```
That gives you enough evidence to answer whether retrieval was useful or garbage.
## Add cheap checks before you reach for fancy evaluation
A lot of production failures are boring.
The output is malformed. Too long. Missing a required field. Missing citations. Full of placeholders.
That is good news, because boring failures are easy to catch.
Useful first checks:
- valid JSON
- schema validation
- max length
- required section headings
- “must include” or “must not include” phrases
- placeholder detection like `TODO` or `<INSERT>`
Example JSON schema validation with Ajv:
```typescript
import Ajv from "ajv";

const ajv = new Ajv();
const validate = ajv.compile({
  type: "object",
  properties: {
    answer: { type: "string" },
    citations: { type: "array", items: { type: "string" } }
  },
  required: ["answer"],
  additionalProperties: true
});

export function jsonSchemaCheck(raw: string) {
  try {
    const parsed = JSON.parse(raw);
    const ok = validate(parsed);
    return { ok: !!ok, details: ok ? undefined : ajv.errorsText(validate.errors) };
  } catch {
    return { ok: false, details: "Invalid JSON" };
  }
}
```
Do not overcomplicate this early.
Consistent checks beat sophisticated but fragile ones.
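The other checks in the list above are just as small. A sketch of a length check and a placeholder check, with thresholds and patterns that are examples, not recommendations:

```typescript
// Cheap, deterministic output checks. The limit and the regex are
// illustrative; tune them per prompt.
type CheckResult = { name: string; ok: boolean; details?: string };

export function maxLengthCheck(text: string, limit = 2000): CheckResult {
  return {
    name: "max_length",
    ok: text.length <= limit,
    details: text.length > limit ? `length ${text.length} > ${limit}` : undefined,
  };
}

export function placeholderCheck(text: string): CheckResult {
  // Catches leftover scaffolding like TODO or <INSERT ...> markers.
  const re = /\bTODO\b|<INSERT[^>]*>/i;
  const hit = text.match(re);
  return {
    name: "no_placeholders",
    ok: !hit,
    details: hit ? `found placeholder: ${hit[0]}` : undefined,
  };
}
```

Each check returns the same `{ name, ok, details }` shape, so they all drop into the `checks` array of the log envelope unchanged.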
## Add one small score
Pass/fail checks are good.
A lightweight score is even better.
For example, if your app answers questions from retrieved documents, you might track a rough groundedness score by comparing the output to retrieved excerpts.
It will not be perfect. It does not need to be.
It just needs to be consistent enough to spot drift.
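One crude but consistent version, purely as a sketch, is plain word overlap between the output and the retrieved excerpts:

```typescript
// Rough groundedness: fraction of output words that also appear in the
// retrieved excerpts. Crude, but stable enough to spot drift over time.
export function groundednessScore(output: string, excerpts: string[]): number {
  // Lowercase alphanumeric tokens; everything else is ignored.
  const tokenize = (s: string) => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const contextWords = new Set(excerpts.flatMap(tokenize));
  const outWords = tokenize(output);
  if (outWords.length === 0) return 0;
  const hits = outWords.filter((w) => contextWords.has(w)).length;
  return hits / outWords.length;
}
```

Logged as a `score` on a check, the absolute number means little, but a sudden drop across many calls usually means retrieval broke or the model started answering from thin air.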
## Wrap your client once
The simplest implementation pattern is a wrapper around your model client.
That wrapper should:
- accept prompt and model metadata
- execute the call
- run validation checks
- emit one structured event
Pseudo-code:
```typescript
export async function runPrompt({
  trace_id,
  prompt_id,
  prompt_version,
  model,
  params,
  input,
  context
}: any) {
  const span_id = newSpanId();
  const t0 = Date.now();
  try {
    const res = await llm.chat({ model, params, messages: buildMessages(input, context) });
    const outputText = res.text ?? "";
    const checks = [
      { name: "json_schema", ...jsonSchemaCheck(outputText) }
    ];
    console.log(JSON.stringify({
      ts: new Date().toISOString(),
      trace_id,
      span_id,
      prompt_id,
      prompt_version,
      model,
      params,
      input: {
        user_message: sanitize(input.user_message),
        variables: input.variables
      },
      context: context?.map((c: any) => ({ ...c, excerpt: sanitize(c.excerpt) })),
      output: { text: sanitize(outputText) },
      finish_reason: res.finish_reason,
      usage: res.usage,
      latency_ms: Date.now() - t0,
      checks
    }));
    return res;
  } catch (e: any) {
    console.log(JSON.stringify({
      ts: new Date().toISOString(),
      trace_id,
      span_id,
      prompt_id,
      prompt_version,
      model,
      latency_ms: Date.now() - t0,
      error: { message: String(e?.message ?? e), type: e?.name }
    }));
    throw e;
  }
}
```
That wrapper pays for itself the first time someone says, “users are getting weird answers again.”
## What to look for in the first week
Once the logs exist, patterns show up quickly.
I would start with these views:
- prompt versions with the most check failures
- latency spikes by model
- token outliers that suggest bloated context
- calls with missing or low-value retrieval context
- failure rate before and after prompt changes
You do not need a fancy dashboard immediately.
A few scripts over structured logs can already tell you a lot.
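For example, the first view on that list is a few lines of TypeScript over JSONL logs. This sketch assumes one envelope object per line, with the field names from `LlmCallLog` above; the file path in the usage comment is an example:

```typescript
import { readFileSync } from "node:fs";

// Count calls with at least one failed check, grouped by prompt version.
// Assumes one LlmCallLog JSON object per line of the input string.
export function failuresByPromptVersion(jsonl: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    const log = JSON.parse(line);
    const failed = (log.checks ?? []).some((c: { ok: boolean }) => !c.ok);
    if (!failed) continue;
    const key = `${log.prompt_id}@${log.prompt_version}`;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Usage (path is illustrative):
// console.log(failuresByPromptVersion(readFileSync("llm_calls.jsonl", "utf8")));
```

The same loop-and-group shape covers most of the other views: swap the grouping key for `model` and the value for `latency_ms` or `usage.total_tokens`.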
## A practical example
Suppose your support assistant suddenly starts producing long, vague replies.
Without observability, you guess:
- maybe the model changed
- maybe the retrieval is worse
- maybe the prompt was edited
- maybe users are asking weirder questions
With observability, you can check:
- prompt version changed from `support_reply@a13f2d` to `support_reply@b48e91`
- average retrieved chunk count dropped from 5 to 1
- JSON schema failures increased after the prompt edit
- token usage doubled because the new prompt added verbose examples
That turns a spooky production issue into a normal debugging session.
## Build a tiny debug bundle
When something goes wrong, collect a small incident bundle:
- `trace_id`, `prompt_id`, and `prompt_version`
- sanitized inputs
- retrieval IDs and excerpts
- output text
- checks and scores
That bundle makes it much easier for engineers, product people, and reviewers to talk about the same failure without guessing.
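Assembling the bundle can be as simple as filtering the same JSONL logs by `trace_id`. A sketch against the envelope above, with illustrative shapes:

```typescript
// Collect every logged span for one trace into a single incident bundle.
// Field names follow the LlmCallLog envelope described earlier.
export function buildDebugBundle(jsonl: string, traceId: string) {
  const spans = jsonl
    .split("\n")
    .filter((line) => line.trim())
    .map((line) => JSON.parse(line))
    .filter((log) => log.trace_id === traceId);
  return {
    trace_id: traceId,
    spans: spans.map((s) => ({
      prompt_id: s.prompt_id,
      prompt_version: s.prompt_version,
      input: s.input,
      context: s.context,
      output: s.output,
      checks: s.checks,
    })),
  };
}
```

Because the inputs and outputs were already sanitized at log time, the bundle is safe to paste into an issue tracker or share in an incident channel.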
## Closing
Prompting gets most of the attention.
Observability is what makes iteration safe.
If you can answer:
- what ran
- with which inputs
- using which context
- at what cost
- and whether it passed simple checks
then your AI feature stops feeling mysterious.
It starts behaving like the rest of your system: imperfect, debuggable, and improvable.
That is a much nicer place to build from.