plasma

Posted on Jun 26

My LLM API Calls Were Failing Silently. Here's the Logging Setup I Wish I Had Earlier

#ai #node #llm

The first few LLM API bugs I hit in production were easy to notice.

The request failed. The user saw an error. I opened the logs, found the stack trace, fixed the obvious thing, and moved on.

The harder bugs were quieter.

The API still returned a response, but it was slower than usual. A fallback model kicked in without anyone noticing. Token usage crept up over a few days. A retry made the request succeed, but doubled the latency. Streaming worked most of the time, except when it didn't.

Nothing looked "down." The app just started feeling worse.

That was when I realized my LLM logging was too thin.

I was logging errors, but not enough context to understand behavior.

The problem with normal API logs

For a typical REST API call, I might log:

request path
status code
latency
error message
user ID or request ID

That is useful, but LLM calls have a few extra dimensions.

A successful LLM request can still be a problem if:

it used the wrong model
it silently retried
it fell back to another model
it returned fewer tokens than expected
it took 18 seconds instead of 2
it streamed partially, then stopped
it cost more than the normal path
it failed only for long prompts
it failed only for tool calling or JSON mode

If all I log is `status: 200`, I miss almost everything that matters.

What I log for every LLM call

This is the basic shape I try to capture now:

{
  "event": "llm_request",
  "request_id": "req_123",
  "provider": "tokenbay",
  "model": "gpt-4.1-mini",
  "operation": "chat_completion",
  "status": "success",
  "latency_ms": 1842,
  "input_tokens": 812,
  "output_tokens": 244,
  "estimated_cost_usd": 0.0019,
  "retry_count": 0,
  "fallback_from": null,
  "fallback_to": null,
  "streaming": false,
  "error_type": null,
  "error_message": null
}

For failed requests:

{
  "event": "llm_request",
  "request_id": "req_124",
  "provider": "tokenbay",
  "model": "some-model",
  "operation": "chat_completion",
  "status": "error",
  "latency_ms": 5000,
  "input_tokens": null,
  "output_tokens": null,
  "estimated_cost_usd": null,
  "retry_count": 2,
  "fallback_from": "some-model",
  "fallback_to": "backup-model",
  "streaming": false,
  "error_type": "rate_limit",
  "error_message": "Rate limit exceeded"
}

The exact fields depend on your app, but the categories matter more than the names.

I want to know:

what model I asked for
what provider handled it
how long it took
whether retries happened
whether fallback happened
how many tokens were used
whether the request streamed
what kind of failure happened
roughly how much the request cost

That is the difference between "the AI feature feels slow today" and "requests to model X are retrying twice after 429s, then falling back to model Y."

A small Node.js wrapper

Here is a simple version using the OpenAI SDK.

It works with OpenAI directly, or with any OpenAI-compatible endpoint by changing `baseURL`.

Install:

npm install openai

Create llm-client.js. The code uses ES module syntax, so set "type": "module" in package.json:

import OpenAI from "openai";
import crypto from "node:crypto";

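const client = new OpenAI({
  apiKey: process.env.LLM_API_KEY,
  baseURL: process.env.LLM_BASE_URL || "https://api.openai.com/v1",
  // The SDK retries some failures on its own by default. Turn that off so a
  // hidden retry cannot make the retry_count logged below inaccurate.
  maxRetries: 0
});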

function nowMs() {
  return Number(process.hrtime.bigint() / 1000000n);
}

function promptHash(messages) {
  const text = JSON.stringify(messages);
  return crypto.createHash("sha256").update(text).digest("hex").slice(0, 16);
}

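function classifyError(error) {
  const status = error?.status;
  const code = String(error?.code || "").toLowerCase();
  const message = String(error?.message || "").toLowerCase();

  // Check specific causes first. Providers often report these as a generic
  // 400, so testing the status code first would hide them.
  // "context_length_exceeded" is the OpenAI-style code; other providers may differ.
  if (code === "context_length_exceeded" || message.includes("context length")) {
    return "context_length";
  }
  if (message.includes("content filter")) return "content_filter";

  if (status === 400) return "invalid_request";
  if (status === 401 || status === 403) return "auth_or_permission";
  if (status === 413) return "request_too_large";
  if (status === 429) return "rate_limit";
  if (status === 503) return "service_unavailable";
  if (status === 504) return "upstream_timeout";
  if (status >= 500) return "provider_5xx";

  // Client-side timeouts usually carry no HTTP status, only a message.
  if (message.includes("timeout") || message.includes("timed out")) return "timeout";

  return "unknown";
}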

function logLLMEvent(event) {
  console.log(JSON.stringify(event));
}

export async function createLoggedChatCompletion({
  requestId,
  provider = "default",
  model,
  messages,
  temperature = 0.2,
  maxTokens = 500,
  streaming = false
}) {
  const startedAt = nowMs();

  const baseEvent = {
    event: "llm_request",
    request_id: requestId,
    provider,
    model,
    operation: "chat_completion",
    prompt_hash: promptHash(messages),
    streaming,
    retry_count: 0,
    fallback_from: null,
    fallback_to: null
  };

  try {
    const response = await client.chat.completions.create({
      model,
      messages,
      temperature,
      max_tokens: maxTokens,
      stream: streaming
    });

    const latencyMs = nowMs() - startedAt;

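    if (streaming) {
      // With stream: true, the promise resolves when the stream opens, before
      // any tokens arrive. Log that as "stream_started" rather than "success":
      // latency_ms here is time-to-stream-open, and the stream can still fail
      // while the caller consumes it.
      logLLMEvent({
        ...baseEvent,
        status: "stream_started",
        latency_ms: latencyMs,
        input_tokens: null,
        output_tokens: null,
        estimated_cost_usd: null,
        error_type: null,
        error_message: null
      });

      return response;
    }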

    logLLMEvent({
      ...baseEvent,
      status: "success",
      latency_ms: latencyMs,
      input_tokens: response.usage?.prompt_tokens ?? null,
      output_tokens: response.usage?.completion_tokens ?? null,
      estimated_cost_usd: null,
      error_type: null,
      error_message: null
    });

    return response;
  } catch (error) {
    const latencyMs = nowMs() - startedAt;

    logLLMEvent({
      ...baseEvent,
      status: "error",
      latency_ms: latencyMs,
      input_tokens: null,
      output_tokens: null,
      estimated_cost_usd: null,
      error_type: classifyError(error),
      error_message: error?.message || "Unknown error"
    });

    throw error;
  }
}

Use it like this in app.js:

import crypto from "node:crypto";
import { createLoggedChatCompletion } from "./llm-client.js";

const response = await createLoggedChatCompletion({
  requestId: crypto.randomUUID(),
  provider: "openai-compatible",
  model: "gpt-4.1-mini",
  messages: [
    {
      role: "user",
      content: "Explain retries and exponential backoff in one paragraph."
    }
  ]
});

console.log(response.choices[0].message.content);

Run it:

LLM_API_KEY="your-api-key" node app.js

If you use TokenBay, the OpenAI-compatible base URL is:

LLM_API_KEY="your-tokenbay-api-key" \
LLM_BASE_URL="https://api.tokenbay.com/v1" \
node app.js

Same SDK shape. Different base URL.

Do not log raw prompts by default

This part matters.

It is tempting to log the full prompt because it makes debugging easier. I try not to do that by default.

Prompts can contain:

user messages
emails
names
customer data
internal business logic
secrets that users accidentally pasted
private documents

Instead, I usually log a hash of the prompt and a few safe metadata fields:

{
  "prompt_hash": "a3f9c01de81b7a22",
  "message_count": 4,
  "has_system_prompt": true,
  "input_chars": 3821
}

That lets me group repeated failures without storing the actual content.

For local development, raw prompt logging can be useful. For production, I want it behind a very explicit flag, with retention rules and access control.
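Here is a minimal sketch of how those safe fields could be derived from a messages array. The helper name promptMetadata is hypothetical, and it assumes each message's content is a plain string. Multi-part content (images, tool results) would need its own handling.

```javascript
// Hypothetical helper: derive loggable metadata without keeping the prompt text.
function promptMetadata(messages) {
  let inputChars = 0;
  for (const message of messages) {
    // Assumption: content is a plain string. Non-string content (for example
    // multi-part arrays) is skipped here and not counted.
    if (typeof message.content === "string") {
      inputChars += message.content.length;
    }
  }

  return {
    message_count: messages.length,
    has_system_prompt: messages.some((m) => m.role === "system"),
    input_chars: inputChars
  };
}

const meta = promptMetadata([
  { role: "system", content: "You are terse." },
  { role: "user", content: "Hi" }
]);
console.log(JSON.stringify(meta));
```

These fields can be spread into the same event object that carries prompt_hash.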

Provider logs are not enough

Provider-side usage logs are useful.

For example, TokenBay's Usage Logs page can show request-level details such as time, model, token count, and cost.

That is helpful, especially when you are using multiple models through one OpenAI-compatible API.

But provider logs usually do not know your application context.

They do not know that this request came from your support reply generator, or that the user had already waited through two failed attempts, or that the answer was discarded before being shown.

That is why I still keep app-side logs.

The provider can tell me what happened at the API layer.

My app logs tell me why it mattered.

The fields that helped me most

Some fields looked boring at first, but ended up being the most useful.

`model`

This sounds obvious until you have multiple models in production.

If your app can use GPT, Claude, Gemini, Qwen, DeepSeek, GLM, or smaller fallback models, you need to know which one actually handled the request.

Not which one the product team thinks is configured.

The actual model.

`provider`

This matters when using multiple vendors or an OpenAI-compatible API gateway.

The same model name can behave differently depending on the provider, gateway, account limits, or routing setup.

If latency spikes, I want to know whether it is model-specific or provider-specific.

`latency_ms`

Average latency is not enough.

I usually want p50, p95, and p99 by model and operation.

A chatbot can feel fine at p50 and awful at p95.
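If your log pipeline does not compute percentiles for you, a rough version is easy to run over exported events. This is a sketch using the nearest-rank method. The function names are mine, and it assumes each event has the model, operation, and latency_ms fields shown earlier.

```javascript
// Nearest-rank percentile over an array of numbers.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Group events by "model:operation" and report p50 / p95 / p99 per group.
function latencyPercentiles(events) {
  const groups = new Map();
  for (const e of events) {
    if (typeof e.latency_ms !== "number") continue;
    const key = `${e.model}:${e.operation}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(e.latency_ms);
  }

  const result = {};
  for (const [key, values] of groups) {
    result[key] = {
      count: values.length,
      p50: percentile(values, 50),
      p95: percentile(values, 95),
      p99: percentile(values, 99)
    };
  }
  return result;
}
```

With 18 requests at 2 seconds and 2 requests at 18 seconds, the p50 stays at 2 seconds while the p95 lands on 18 seconds, which is the "fine at p50, awful at p95" case.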

`retry_count`

Retries are sneaky.

They make reliability look better while quietly increasing latency and cost.

If a request succeeds after two retries, the user may not see an error, but the system still degraded.
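To log an honest retry_count, the retries have to happen somewhere you can count them. Below is a minimal sketch of an explicit retry loop with exponential backoff. The names withRetries and RETRYABLE_STATUSES are hypothetical, the status list is my own choice, and it assumes the SDK's automatic retries are turned off (the OpenAI Node SDK accepts a maxRetries client option) so attempts are not multiplied.

```javascript
// Statuses treated as transient in this sketch. Adjust for your provider.
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504]);

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Runs callFn, retrying transient failures, and reports how many retries
// were spent so the caller can put the number in the log event.
async function withRetries(callFn, { maxRetries = 2, baseDelayMs = 250 } = {}) {
  let retryCount = 0;
  for (;;) {
    try {
      const result = await callFn();
      return { result, retryCount };
    } catch (error) {
      const retryable = RETRYABLE_STATUSES.has(error?.status);
      if (!retryable || retryCount >= maxRetries) {
        // Attach the count so the error log also shows retries spent.
        if (error && typeof error === "object") error.retryCount = retryCount;
        throw error;
      }
      // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
      await sleep(baseDelayMs * 2 ** retryCount);
      retryCount += 1;
    }
  }
}
```

The wrapper would call something like withRetries(() => client.chat.completions.create(params)) and copy retryCount into the event instead of hard-coding 0. Note that latency_ms measured around the whole loop includes the backoff delays, which is what the user actually waited.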

`fallback_from` and `fallback_to`

Fallback is great until it hides the original problem.

If model A fails and model B saves the request, that is useful. But if it happens 30 percent of the time, I need to know.

Otherwise I might think model A is working fine.
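A quick way to check this from logged events is to compute a fallback rate per original model. This is a sketch, not a dashboard: the function name is hypothetical, and it assumes one event per user-facing request, where fallback_from holds the model the request started on.

```javascript
// Share of requests per original model that ended up on a fallback model.
function fallbackRateByModel(events) {
  const stats = new Map();
  for (const e of events) {
    // When fallback happened, attribute the request to the model it started on.
    const source = e.fallback_from ?? e.model;
    const entry = stats.get(source) ?? { total: 0, fellBack: 0 };
    entry.total += 1;
    if (e.fallback_to) entry.fellBack += 1;
    stats.set(source, entry);
  }

  const rates = {};
  for (const [model, { total, fellBack }] of stats) {
    rates[model] = fellBack / total;
  }
  return rates;
}
```

A rate near zero means fallback is doing its job as a backup. A rate that keeps climbing for one model is the signal that the primary path is quietly broken.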

`input_tokens` and `output_tokens`

Token usage explains a lot of cost surprises.

When a bill jumps, the cause is often not "the provider got expensive." It is more likely:

prompts got longer
retrieved context got larger
output limits were too high
retries increased
a more expensive model handled more traffic
tool calls caused extra rounds

You cannot see that from request count alone.
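Token counts also let you fill in estimated_cost_usd instead of leaving it null. A minimal sketch follows. The price table is a placeholder with made-up model names and numbers, so substitute your provider's actual price list, and the function name is hypothetical.

```javascript
// Placeholder prices in USD per 1 million tokens. These numbers are made up
// for illustration; replace them with your provider's current prices.
const PRICES_PER_MILLION = {
  "example-small": { input: 0.5, output: 1.5 },
  "example-large": { input: 5, output: 15 }
};

function estimateCostUsd(model, inputTokens, outputTokens) {
  const price = PRICES_PER_MILLION[model];
  // Unknown model or missing usage: return null rather than a misleading 0.
  if (!price || inputTokens == null || outputTokens == null) return null;
  return (inputTokens * price.input + outputTokens * price.output) / 1e6;
}
```

Because input and output are priced separately, the same request count can cost very different amounts once prompts or outputs grow.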

`error_type`

Raw error messages are messy.

One provider says `rate_limit_exceeded`. Another says `Too many requests`. Another gives you a 429 with a different body.

I normalize errors into categories:

const errorTypes = [
  "auth_or_permission",
  "invalid_request",
  "rate_limit",
  "request_too_large",
  "context_length",
  "content_filter",
  "provider_5xx",
  "service_unavailable",
  "upstream_timeout",
  "timeout",
  "stream_interrupted",
  "unknown"
];

This makes dashboards and alerts much easier.

Silent failures I now watch for

The worst failures are not always exceptions.

These are the ones I try to catch with logs and metrics:

1. Retry storms

A provider starts returning intermittent 429s, 503s, or 504s.

Your retry logic hides it.

The app still works, but latency doubles and costs rise.

Watch:

retry count by provider
retry count by model
p95 latency after retries
final status after retry

2. Fallback becoming the main path

Fallback should be the backup plan.

If fallback becomes normal, you may have a provider issue, a bad timeout setting, or a model that is no longer suitable.

Watch:

fallback rate
fallback source model
fallback target model
quality complaints after fallback

3. Token creep

This is when prompts slowly get larger over time.

Maybe you added more retrieved documents. Maybe the system prompt grew. Maybe conversation history is not being trimmed.

Nothing breaks immediately. The bill just gets heavier.

Watch:

average input tokens by feature
p95 input tokens by route
output tokens by model
token usage per customer or workspace
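One rough way to spot creep is to compare average input tokens per feature across two time windows. The sketch below assumes events carry the feature and input_tokens fields. The function names and the 25 percent threshold are my own placeholders.

```javascript
// Average input_tokens per feature for one window of events.
function avgInputTokensByFeature(events) {
  const sums = new Map();
  for (const e of events) {
    if (typeof e.input_tokens !== "number") continue;
    const s = sums.get(e.feature) ?? { total: 0, count: 0 };
    s.total += e.input_tokens;
    s.count += 1;
    sums.set(e.feature, s);
  }

  const averages = {};
  for (const [feature, { total, count }] of sums) {
    averages[feature] = total / count;
  }
  return averages;
}

// Flag features whose average grew by more than maxGrowth between windows.
function detectTokenCreep(previousEvents, currentEvents, maxGrowth = 1.25) {
  const before = avgInputTokensByFeature(previousEvents);
  const after = avgInputTokensByFeature(currentEvents);
  const flagged = [];
  for (const feature of Object.keys(after)) {
    // Skip features with no usable baseline in the previous window.
    if (!(feature in before) || before[feature] === 0) continue;
    const growth = after[feature] / before[feature];
    if (growth > maxGrowth) {
      flagged.push({ feature, before: before[feature], after: after[feature], growth });
    }
  }
  return flagged;
}
```

Running this week over week is usually enough to catch a growing system prompt or untrimmed history before the bill does.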

4. Streaming interruptions

Streaming can fail differently from normal responses.

Sometimes the first tokens arrive, then the stream stops. If you only log the initial request success, you miss the failure.

Watch:

stream started
stream completed
stream duration
chunks received
interruption reason
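Capturing those fields means logging when the stream ends, not when it opens. Here is a sketch of a consumer that does that. The helper name and the llm_stream event name are hypothetical. It assumes chunks shaped like OpenAI chat-completion stream chunks (choices[0].delta.content and choices[0].finish_reason) and treats a stream that ends without a finish_reason as interrupted, which is my own heuristic.

```javascript
// Consume an async iterable of stream chunks and log how the stream ended.
// `log` is whatever function writes a structured event.
async function consumeStreamWithLogging(stream, baseEvent, log) {
  const startedAt = Date.now();
  let chunksReceived = 0;
  let finishReason = null;
  let interruptionReason = null;
  let text = "";

  try {
    for await (const chunk of stream) {
      chunksReceived += 1;
      const choice = chunk.choices?.[0];
      if (choice?.delta?.content) text += choice.delta.content;
      if (choice?.finish_reason) finishReason = choice.finish_reason;
    }
    // Heuristic: a stream that ends with no finish_reason stopped early.
    if (finishReason === null) interruptionReason = "ended_without_finish_reason";
  } catch (error) {
    interruptionReason = error?.message || "stream_error";
  }

  const completed = interruptionReason === null;
  log({
    ...baseEvent,
    event: "llm_stream",
    stream_completed: completed,
    stream_duration_ms: Date.now() - startedAt,
    chunks_received: chunksReceived,
    finish_reason: finishReason,
    error_type: completed ? null : "stream_interrupted",
    interruption_reason: interruptionReason
  });

  // Return the partial text too, so the caller decides whether to show it.
  return { text, completed };
}
```

This version swallows the stream error and reports it through completed: false. Rethrowing after logging is an equally reasonable choice if the caller should handle it as an exception.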

5. Model mismatch

This happens when config changes, environment variables drift, or a gateway route points somewhere unexpected.

The app asks for one model, but production traffic goes somewhere else.

Watch:

requested model
resolved model, if available
provider
deployment environment
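OpenAI-style chat completion responses include a model field, which gives a cheap way to record the resolved model next to the requested one. A sketch, with a hypothetical helper name. It assumes your provider or gateway echoes the model that actually served the request, which is not guaranteed, and it treats a dated snapshot suffix as a match.

```javascript
// Compare the model the app asked for with the `model` value in the response.
function modelFields(requestedModel, response) {
  const resolvedModel = response?.model ?? null;
  // Assumption: a dated snapshot such as "<requested>-2025-04-14" still counts
  // as the requested model, so a "<requested>-" prefix is treated as a match.
  const matches =
    resolvedModel !== null &&
    (resolvedModel === requestedModel || resolvedModel.startsWith(`${requestedModel}-`));

  return {
    requested_model: requestedModel,
    resolved_model: resolvedModel,
    // null means "unknown": the response did not report a model.
    model_mismatch: resolvedModel === null ? null : !matches
  };
}
```

The prefix check is deliberately loose and will miss cases where one model name is a prefix of another, so an explicit alias table is safer if your model names overlap.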

A slightly better log event

After a few rounds, my log event usually grows into something like this:

{
  "event": "llm_request",
  "timestamp": "2026-06-26T08:30:00.000Z",
  "request_id": "req_abc",
  "user_id_hash": "user_91ab",
  "environment": "production",
  "feature": "support_reply_generator",
  "provider": "tokenbay",
  "model": "gpt-4.1-mini",
  "operation": "chat_completion",
  "streaming": false,
  "status": "success",
  "latency_ms": 1842,
  "retry_count": 1,
  "fallback_from": null,
  "fallback_to": null,
  "input_tokens": 812,
  "output_tokens": 244,
  "estimated_cost_usd": 0.0019,
  "prompt_hash": "a3f9c01de81b7a22",
  "error_type": null
}

This is not fancy observability.

It is just enough structure to answer practical questions.

Which feature got slower?

Which model is causing errors?

Did fallback save us or hide a bigger issue?

Did the cost increase because of traffic, tokens, retries, or model choice?

Where TokenBay fits into this

Disclosure: I work on TokenBay, so I am biased here.

One reason I care about this logging shape is that TokenBay is built around using multiple AI models through one OpenAI-compatible API.

That makes it convenient to switch between models, but it also makes observability more important.

TokenBay can show usage details at the API layer. I still want my own application logs because my app knows things the API layer cannot always know:

which product feature triggered the request
whether the answer was shown to the user
whether the request was part of a retry chain
whether fallback was intentional
whether the user abandoned the flow before the model answered

The more flexible your model setup becomes, the more important boring logs become.

My current rule

For every production LLM call, I want enough information to answer four questions:

Why did this request fail?
Why was this request slow?
Why did this request cost more than expected?
Which model actually answered?

If my logs cannot answer those, I am probably flying blind.

The annoying part is that you usually do not notice this on day one.

You notice it later, when something is already weird and your only log line says:

LLM request completed

Ask me how I know.

Top comments (4)

Tae Kim • Jun 26

The field I'd add is a context_utilization_ratio: (input_tokens / model_context_window_limit). The silent failures I've hit most often aren't errors -- they're quality degradation as the system prompt grows over time and you quietly cross 60-70% context utilization, where most models start losing fidelity on earlier context. It shows up as "the AI feature feels different lately" with zero error logs. The other addition I'd strongly suggest is an operation_name field tied to the user-facing feature rather than just model/provider. When you have 8 different features routing through the same model, that's the field that actually tells you which user journey is causing the p95 spike.

plasma • Jun 26

Haha yeah, exactly. But I might call the first one context_pressure: not just raw input tokens, but how close the request is getting to the model’s practical context limit.

Nazar Boyko • Jun 26

Really useful set of fields, and the one about fallback becoming the main path is the failure I think people miss most. Small thing on the promptHash helper though: hashing the full messages JSON means a single changed token, a timestamp or an injected ID in context, gives you a brand new hash every time. That quietly breaks the grouping you built it for. If you normalize first, strip the volatile bits or hash a templated version of the prompt, repeated failures actually cluster the way you want.

Tae Kim • Jun 26

The failure mode that bit me hardest was 200s with malformed JSON - the API returned success but the response body was unparseable, which only showed up when the downstream code tried to extract a field. I added a parse-and-validate step right after every API call that throws a custom exception distinguishable from network errors, so the retry logic could tell the difference between a transient connection issue and a model compliance failure.
