plasma

My LLM API Calls Were Failing Silently. Here's the Logging Setup I Wish I Had Earlier

The first few LLM API bugs I hit in production were easy to notice.

The request failed. The user saw an error. I opened the logs, found the stack trace, fixed the obvious thing, and moved on.

The harder bugs were quieter.

The API still returned a response, but it was slower than usual. A fallback model kicked in without anyone noticing. Token usage crept up over a few days. A retry made the request succeed, but doubled the latency. Streaming worked most of the time, except when it didn't.

Nothing looked "down." The app just started feeling worse.

That was when I realized my LLM logging was too thin.

I was logging errors, but not enough context to understand behavior.

The problem with normal API logs

For a typical REST API call, I might log:

  • request path
  • status code
  • latency
  • error message
  • user ID or request ID

That is useful, but LLM calls have a few extra dimensions.

A successful LLM request can still be a problem if:

  • it used the wrong model
  • it silently retried
  • it fell back to another model
  • it returned fewer tokens than expected
  • it took 18 seconds instead of 2
  • it streamed partially, then stopped
  • it cost more than the normal path
  • it failed only for long prompts
  • it failed only for tool calling or JSON mode

If all I log is status: 200, I miss almost everything that matters.

What I log for every LLM call

This is the basic shape I try to capture now:

{
  "event": "llm_request",
  "request_id": "req_123",
  "provider": "tokenbay",
  "model": "gpt-4.1-mini",
  "operation": "chat_completion",
  "status": "success",
  "latency_ms": 1842,
  "input_tokens": 812,
  "output_tokens": 244,
  "estimated_cost_usd": 0.0019,
  "retry_count": 0,
  "fallback_from": null,
  "fallback_to": null,
  "streaming": false,
  "error_type": null,
  "error_message": null
}

For failed requests:

{
  "event": "llm_request",
  "request_id": "req_124",
  "provider": "tokenbay",
  "model": "some-model",
  "operation": "chat_completion",
  "status": "error",
  "latency_ms": 5000,
  "input_tokens": null,
  "output_tokens": null,
  "estimated_cost_usd": null,
  "retry_count": 2,
  "fallback_from": "some-model",
  "fallback_to": "backup-model",
  "streaming": false,
  "error_type": "rate_limit",
  "error_message": "Rate limit exceeded"
}

The exact fields depend on your app, but the categories matter more than the names.

I want to know:

  • what model I asked for
  • what provider handled it
  • how long it took
  • whether retries happened
  • whether fallback happened
  • how many tokens were used
  • whether the request streamed
  • what kind of failure happened
  • roughly how much the request cost

That is the difference between "the AI feature feels slow today" and "requests to model X are retrying twice after 429s, then falling back to model Y."

A small Node.js wrapper

Here is a simple version using the OpenAI SDK.

It works with OpenAI directly, or with any OpenAI-compatible endpoint by changing baseURL.

Install:

npm install openai

Create llm-client.js. The code uses ES module imports, so set "type": "module" in package.json:

import OpenAI from "openai";
import crypto from "node:crypto";

const client = new OpenAI({
  apiKey: process.env.LLM_API_KEY,
  baseURL: process.env.LLM_BASE_URL || "https://api.openai.com/v1"
});

function nowMs() {
  return Number(process.hrtime.bigint() / 1000000n);
}

function promptHash(messages) {
  const text = JSON.stringify(messages);
  return crypto.createHash("sha256").update(text).digest("hex").slice(0, 16);
}

function classifyError(error) {
  const status = error?.status;
  const message = String(error?.message || "").toLowerCase();

  // Check message-based categories first. Providers often report these as a
  // plain 400, which would otherwise be swallowed by "invalid_request".
  if (message.includes("context length")) return "context_length";
  if (message.includes("content filter")) return "content_filter";

  if (status === 400) return "invalid_request";
  if (status === 401 || status === 403) return "auth_or_permission";
  if (status === 413) return "request_too_large";
  if (status === 429) return "rate_limit";
  if (status === 503) return "service_unavailable";
  if (status === 504) return "upstream_timeout";
  if (status >= 500) return "provider_5xx";

  // Client-side timeouts carry no HTTP status. The OpenAI SDK's message is
  // "Request timed out.", so match both spellings.
  if (message.includes("timeout") || message.includes("timed out")) {
    return "timeout";
  }

  return "unknown";
}

function logLLMEvent(event) {
  console.log(JSON.stringify(event));
}

export async function createLoggedChatCompletion({
  requestId,
  provider = "default",
  model,
  messages,
  temperature = 0.2,
  maxTokens = 500,
  streaming = false
}) {
  const startedAt = nowMs();

  const baseEvent = {
    event: "llm_request",
    request_id: requestId,
    provider,
    model,
    operation: "chat_completion",
    prompt_hash: promptHash(messages),
    streaming,
    retry_count: 0,
    fallback_from: null,
    fallback_to: null
  };

  try {
    const response = await client.chat.completions.create({
      model,
      messages,
      temperature,
      max_tokens: maxTokens,
      stream: streaming
    });

    const latencyMs = nowMs() - startedAt;

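    if (streaming) {
      // With stream: true the call resolves once the stream opens, so this
      // event only records that the stream started. latency_ms is the time
      // to stream open, and token counts are not known yet.
      logLLMEvent({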
        ...baseEvent,
        status: "success",
        latency_ms: latencyMs,
        input_tokens: null,
        output_tokens: null,
        estimated_cost_usd: null,
        error_type: null,
        error_message: null
      });

      return response;
    }

    logLLMEvent({
      ...baseEvent,
      status: "success",
      latency_ms: latencyMs,
      input_tokens: response.usage?.prompt_tokens ?? null,
      output_tokens: response.usage?.completion_tokens ?? null,
      estimated_cost_usd: null,
      error_type: null,
      error_message: null
    });

    return response;
  } catch (error) {
    const latencyMs = nowMs() - startedAt;

    logLLMEvent({
      ...baseEvent,
      status: "error",
      latency_ms: latencyMs,
      input_tokens: null,
      output_tokens: null,
      estimated_cost_usd: null,
      error_type: classifyError(error),
      error_message: error?.message || "Unknown error"
    });

    throw error;
  }
}

Use it like this, in app.js:

import crypto from "node:crypto";
import { createLoggedChatCompletion } from "./llm-client.js";

const response = await createLoggedChatCompletion({
  requestId: crypto.randomUUID(),
  provider: "openai-compatible",
  model: "gpt-4.1-mini",
  messages: [
    {
      role: "user",
      content: "Explain retries and exponential backoff in one paragraph."
    }
  ]
});

console.log(response.choices[0].message.content);

Run it:

LLM_API_KEY="your-api-key" node app.js

If you use TokenBay, point LLM_BASE_URL at its OpenAI-compatible endpoint:

LLM_API_KEY="your-tokenbay-api-key" \
LLM_BASE_URL="https://api.tokenbay.com/v1" \
node app.js

Same SDK shape. Different base URL.

Do not log raw prompts by default

This part matters.

It is tempting to log the full prompt because it makes debugging easier. I try not to do that by default.

Prompts can contain:

  • user messages
  • emails
  • names
  • customer data
  • internal business logic
  • secrets that users accidentally pasted
  • private documents

Instead, I usually log a hash of the prompt and a few safe metadata fields:

{
  "prompt_hash": "a3f9c01de81b7a22",
  "message_count": 4,
  "has_system_prompt": true,
  "input_chars": 3821
}

That lets me group repeated failures without storing the actual content.
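
A minimal sketch of how those metadata fields could be derived. The function name promptMetadata is hypothetical, and it assumes each message has the usual { role, content } shape. It deliberately leaves out the hash, which the promptHash helper in llm-client.js already covers.

```javascript
// Derive content-free metadata from a chat messages array.
// Assumption: content is a string, an array of parts, or null.
function promptMetadata(messages) {
  let inputChars = 0;

  for (const message of messages) {
    const content = message.content;
    if (typeof content === "string") {
      inputChars += content.length;
    } else if (content != null) {
      // Non-string content (for example multi-part messages) is measured by
      // its JSON length, which is only a rough proxy for size.
      inputChars += JSON.stringify(content).length;
    }
  }

  return {
    message_count: messages.length,
    has_system_prompt: messages.some((m) => m.role === "system"),
    input_chars: inputChars
  };
}

console.log(
  promptMetadata([
    { role: "system", content: "You are terse." },
    { role: "user", content: "Hi" }
  ])
);
```

Spreading the result into the log event next to prompt_hash keeps everything in one line per request.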

For local development, raw prompt logging can be useful. For production, I want it behind a very explicit flag, with retention rules and access control.

Provider logs are not enough

Provider-side usage logs are useful.

For example, TokenBay's Usage Logs page can show request-level details such as time, model, token count, and cost.

That is helpful, especially when you are using multiple models through one OpenAI-compatible API.

But provider logs usually do not know your application context.

They do not know that this request came from your support reply generator, or that the user had already waited through two failed attempts, or that the answer was discarded before being shown.

That is why I still keep app-side logs.

The provider can tell me what happened at the API layer.

My app logs tell me why it mattered.

The fields that helped me most

Some fields looked boring at first, but ended up being the most useful.

model

This sounds obvious until you have multiple models in production.

If your app can use GPT, Claude, Gemini, Qwen, DeepSeek, GLM, or smaller fallback models, you need to know which one actually handled the request.

Not which one the product team thinks is configured.

The actual model.

provider

This matters when using multiple vendors or an OpenAI-compatible API gateway.

The same model name can behave differently depending on the provider, gateway, account limits, or routing setup.

If latency spikes, I want to know whether it is model-specific or provider-specific.

latency_ms

Average latency is not enough.

I usually want p50, p95, and p99 by model and operation.

A chatbot can feel fine at p50 and awful at p95.
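
If the logs are not in a metrics system yet, percentiles can be computed straight from the JSON events. This is a rough sketch using the nearest-rank method. The function names are made up for illustration, and it assumes events carry the model, operation, and latency_ms fields shown earlier.

```javascript
// Nearest-rank percentile over an array of numbers (p between 0 and 100).
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Group llm_request events by "model:operation" and summarize latency.
function latencySummary(events) {
  const groups = new Map();
  for (const e of events) {
    const key = `${e.model}:${e.operation}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(e.latency_ms);
  }

  const summary = {};
  for (const [key, latencies] of groups) {
    summary[key] = {
      count: latencies.length,
      p50: percentile(latencies, 50),
      p95: percentile(latencies, 95),
      p99: percentile(latencies, 99)
    };
  }
  return summary;
}

const sample = [100, 200, 300, 400, 5000].map((latency_ms) => ({
  model: "gpt-4.1-mini",
  operation: "chat_completion",
  latency_ms
}));
console.log(latencySummary(sample));
```

In this sample the median is 300 ms while p95 is 5000 ms, which is exactly the kind of gap an average hides.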

retry_count

Retries are sneaky.

They make reliability look better while quietly increasing latency and cost.

If a request succeeds after two retries, the user may not see an error, but the system still degraded.
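
The wrapper above always logs retry_count: 0 because it never retries. One way to get a real number is to own the retry loop. The sketch below is an assumption about how that could look, not the only way to do it: withRetries, the set of retryable statuses, and the backoff values are all illustrative choices.

```javascript
// Statuses treated as transient here. This list is an assumption; tune it.
const RETRYABLE_STATUSES = new Set([429, 503, 504]);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls fn() up to maxRetries + 1 times and reports how many retries it took.
async function withRetries(fn, { maxRetries = 2, baseDelayMs = 250 } = {}) {
  let retryCount = 0;

  for (;;) {
    try {
      const result = await fn();
      return { result, retryCount };
    } catch (error) {
      const retryable = RETRYABLE_STATUSES.has(error?.status);
      if (!retryable || retryCount >= maxRetries) {
        // Surface the count on failures too, so the error log can include it.
        if (error && typeof error === "object") error.retryCount = retryCount;
        throw error;
      }
      // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
      await sleep(baseDelayMs * 2 ** retryCount);
      retryCount += 1;
    }
  }
}
```

Note that the OpenAI Node SDK retries some failures on its own (twice by default), and those attempts never reach application code. If the count should come from a loop like this one, constructing the client with maxRetries: 0 avoids double retrying.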

fallback_from and fallback_to

Fallback is great until it hides the original problem.

If model A fails and model B saves the request, that is useful. But if it happens 30 percent of the time, I need to know.

Otherwise I might think model A is working fine.
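
Here is one possible shape for a fallback helper that fills in those two fields. It is a sketch under assumptions: withFallback and callModel are hypothetical names, and callModel(model) stands for whatever function performs the request and throws on failure.

```javascript
// Tries each model in order and records whether a fallback happened.
async function withFallback(models, callModel) {
  if (models.length === 0) throw new Error("No models configured");

  const primary = models[0];
  let lastError;

  for (const model of models) {
    try {
      const result = await callModel(model);
      const fellBack = model !== primary;
      return {
        result,
        model, // the model that actually answered
        fallback_from: fellBack ? primary : null,
        fallback_to: fellBack ? model : null
      };
    } catch (error) {
      lastError = error; // remember why this model failed, then try the next
    }
  }

  throw lastError;
}
```

Logging the returned model alongside fallback_from makes it possible to count how often the primary model is quietly being rescued.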

input_tokens and output_tokens

Token usage explains a lot of cost surprises.

When a bill jumps, the cause is often not "the provider got expensive." It is more likely:

  • prompts got longer
  • retrieved context got larger
  • output limits were too high
  • retries increased
  • a more expensive model handled more traffic
  • tool calls caused extra rounds

You cannot see that from request count alone.
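
The wrapper earlier leaves estimated_cost_usd as null. Filling it in is simple arithmetic once there is a price table. In this sketch the model names and prices are placeholders I made up, not real rates, so load the actual price list from your provider.

```javascript
// Hypothetical prices in USD per 1M tokens. Placeholder numbers only.
const PRICES_PER_MILLION = {
  "example-small": { input: 0.5, output: 1.5 },
  "example-large": { input: 5, output: 15 }
};

// Returns null when the model is unknown or token counts are missing,
// so a gap in the data never turns into a misleading 0.
function estimateCostUsd(model, inputTokens, outputTokens) {
  const price = PRICES_PER_MILLION[model];
  if (!price || inputTokens == null || outputTokens == null) return null;
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

console.log(estimateCostUsd("example-small", 812, 244));
console.log(estimateCostUsd("unknown-model", 812, 244));
```

It is only an estimate: cached tokens, batch discounts, and per-provider pricing differences are not modeled here.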

error_type

Raw error messages are messy.

One provider says rate_limit_exceeded. Another says Too many requests. Another gives you a 429 with a different body.

I normalize errors into categories:

const errorTypes = [
  "auth_or_permission",
  "invalid_request",
  "rate_limit",
  "request_too_large",
  "context_length",
  "content_filter",
  "provider_5xx",
  "service_unavailable",
  "upstream_timeout",
  "stream_interrupted",
  "unknown"
];

This makes dashboards and alerts much easier.

Silent failures I now watch for

The worst failures are not always exceptions.

These are the ones I try to catch with logs and metrics:

1. Retry storms

A provider starts returning intermittent 429s, 503s, or 504s.

Your retry logic hides it.

The app still works, but latency doubles and costs rise.

Watch:

  • retry count by provider
  • retry count by model
  • p95 latency after retries
  • final status after retry
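
A rough sketch of how that watch list could be computed from a batch of log events. The function name and the output fields are illustrative, and it assumes every event carries provider, model, retry_count, and status.

```javascript
// Summarize retry behavior per "provider/model" pair.
function retryStats(events) {
  const stats = {};

  for (const e of events) {
    const key = `${e.provider}/${e.model}`;
    const s = (stats[key] ??= {
      requests: 0,
      retried: 0, // requests that needed at least one retry
      totalRetries: 0, // extra attempts, a proxy for wasted latency and cost
      failedAfterRetry: 0 // retried and still ended in an error
    });

    s.requests += 1;
    if (e.retry_count > 0) {
      s.retried += 1;
      s.totalRetries += e.retry_count;
      if (e.status === "error") s.failedAfterRetry += 1;
    }
  }

  for (const s of Object.values(stats)) {
    s.retryRate = s.retried / s.requests;
  }
  return stats;
}
```

A retry rate that climbs while the error rate stays flat is the signature of a storm being hidden by the retry logic.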

2. Fallback becoming the main path

Fallback should be the backup plan.

If fallback becomes normal, you may have a provider issue, a bad timeout setting, or a model that is no longer suitable.

Watch:

  • fallback rate
  • fallback source model
  • fallback target model
  • quality complaints after fallback
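
The first three items can be derived from the fallback_from and fallback_to fields. This is a minimal sketch: fallbackRates is a hypothetical name, and the 10 percent alert threshold is an arbitrary assumption, not a recommendation.

```javascript
// Fallback rate per originally requested model, with a simple alert flag.
function fallbackRates(events, alertThreshold = 0.1) {
  const byModel = {};

  for (const e of events) {
    // The model originally asked for: fallback_from when set, else model.
    const requested = e.fallback_from ?? e.model;
    const s = (byModel[requested] ??= { requests: 0, fallbacks: 0, targets: {} });

    s.requests += 1;
    if (e.fallback_to) {
      s.fallbacks += 1;
      s.targets[e.fallback_to] = (s.targets[e.fallback_to] ?? 0) + 1;
    }
  }

  for (const s of Object.values(byModel)) {
    s.fallbackRate = s.fallbacks / s.requests;
    s.alert = s.fallbackRate >= alertThreshold;
  }
  return byModel;
}
```

Quality complaints are not something this can measure. That needs a separate signal, such as user feedback joined on request_id.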

3. Token creep

This is when prompts slowly get larger over time.

Maybe you added more retrieved documents. Maybe the system prompt grew. Maybe conversation history is not being trimmed.

Nothing breaks immediately. The bill just gets heavier.

Watch:

  • average input tokens by feature
  • p95 input tokens by route
  • output tokens by model
  • token usage per customer or workspace
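
Creep is easiest to spot by comparing two time windows. The sketch below compares mean input tokens per feature between a baseline batch and a current batch of events. The function names and the 1.25x threshold are assumptions for illustration, and it relies on the feature field that appears in the larger log event later in this post.

```javascript
// Mean input_tokens per feature, skipping events without token counts.
function meanInputTokensByFeature(events) {
  const sums = {};
  for (const e of events) {
    if (e.input_tokens == null) continue;
    const s = (sums[e.feature] ??= { total: 0, count: 0 });
    s.total += e.input_tokens;
    s.count += 1;
  }
  return Object.fromEntries(
    Object.entries(sums).map(([feature, s]) => [feature, s.total / s.count])
  );
}

// Flag features whose mean input size grew by at least `threshold` times.
function tokenCreep(baselineEvents, currentEvents, threshold = 1.25) {
  const before = meanInputTokensByFeature(baselineEvents);
  const after = meanInputTokensByFeature(currentEvents);
  const flagged = [];

  for (const [feature, mean] of Object.entries(after)) {
    const base = before[feature];
    if (base && mean / base >= threshold) {
      flagged.push({ feature, before: base, after: mean, ratio: mean / base });
    }
  }
  return flagged;
}
```

Running this weekly against the previous week is usually enough, since creep happens over days rather than minutes.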

4. Streaming interruptions

Streaming can fail differently from normal responses.

Sometimes the first tokens arrive, then the stream stops. If you only log the initial request success, you miss the failure.

Watch:

  • stream started
  • stream completed
  • stream duration
  • chunks received
  • interruption reason
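
One way to capture these is to wrap the stream and log a second event when it ends. This is a sketch, not a drop-in: trackStream and the llm_stream_end event name are made up here, and it only assumes the stream is an async iterable, which is how the OpenAI SDK exposes streaming responses.

```javascript
// Wraps an async iterable of chunks and logs one event when the stream
// ends, whether it completed, threw, or was abandoned by the consumer.
async function* trackStream(stream, baseEvent, log = console.log) {
  const startedAt = Date.now();
  let chunks = 0;
  let firstChunkMs = null;
  // Default reason, kept if the consumer stops reading before the end.
  let reason = "consumer_stopped_early";

  try {
    for await (const chunk of stream) {
      if (chunks === 0) firstChunkMs = Date.now() - startedAt;
      chunks += 1;
      yield chunk;
    }
    reason = null; // the source signaled a normal end
  } catch (error) {
    reason = error?.message || "Unknown error";
    throw error;
  } finally {
    const completed = reason === null;
    log(JSON.stringify({
      ...baseEvent,
      event: "llm_stream_end",
      status: completed ? "success" : "error",
      error_type: completed ? null : "stream_interrupted",
      stream_completed: completed,
      interruption_reason: reason,
      stream_duration_ms: Date.now() - startedAt,
      time_to_first_chunk_ms: firstChunkMs,
      chunks_received: chunks
    }));
  }
}
```

The caller iterates over trackStream(response, baseEvent) instead of the raw stream, and everything else stays the same. Token counts for streamed responses depend on the provider. With OpenAI, requesting stream_options: { include_usage: true } adds a usage chunk at the end, but I would not assume every OpenAI-compatible endpoint supports it.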

5. Model mismatch

This happens when config changes, environment variables drift, or a gateway route points somewhere unexpected.

The app asks for one model, but production traffic goes somewhere else.

Watch:

  • requested model
  • resolved model, if available
  • provider
  • deployment environment
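
Chat completion responses include a model field, so a basic mismatch check is cheap. The sketch below is an assumption about what counts as "the same model": it treats a dated snapshot of the requested name as a match and anything else as a mismatch. The function names are hypothetical.

```javascript
// Matches a trailing snapshot date such as "-2025-04-14".
const DATE_SUFFIX = /^-\d{4}-\d{2}-\d{2}$/;

// True when resolved is the requested name, optionally with a date suffix.
function sameModel(requested, resolved) {
  if (resolved === requested) return true;
  return (
    resolved.startsWith(requested) &&
    DATE_SUFFIX.test(resolved.slice(requested.length))
  );
}

// Fields to merge into the log event after a successful response.
function modelResolution(requestedModel, response) {
  const resolved = response?.model ?? null;
  return {
    requested_model: requestedModel,
    resolved_model: resolved,
    // null means the provider did not report a model, so we cannot tell.
    model_mismatch: resolved === null ? null : !sameModel(requestedModel, resolved)
  };
}

console.log(modelResolution("gpt-4.1-mini", { model: "gpt-4.1-mini-2025-04-14" }));
console.log(modelResolution("gpt-4.1", { model: "gpt-4.1-mini" }));
```

Gateways that rename or alias models will need their own mapping, so treat model_mismatch as a prompt to investigate rather than proof of a bug.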

A slightly better log event

After a few rounds, my log event usually grows into something like this:

{
  "event": "llm_request",
  "timestamp": "2026-06-26T08:30:00.000Z",
  "request_id": "req_abc",
  "user_id_hash": "user_91ab",
  "environment": "production",
  "feature": "support_reply_generator",
  "provider": "tokenbay",
  "model": "gpt-4.1-mini",
  "operation": "chat_completion",
  "streaming": false,
  "status": "success",
  "latency_ms": 1842,
  "retry_count": 1,
  "fallback_from": null,
  "fallback_to": null,
  "input_tokens": 812,
  "output_tokens": 244,
  "estimated_cost_usd": 0.0019,
  "prompt_hash": "a3f9c01de81b7a22",
  "error_type": null
}

This is not fancy observability.

It is just enough structure to answer practical questions.

Which feature got slower?

Which model is causing errors?

Did fallback save us or hide a bigger issue?

Did the cost increase because of traffic, tokens, retries, or model choice?

Where TokenBay fits into this

Disclosure: I work on TokenBay, so I am biased here.

One reason I care about this logging shape is that TokenBay is built around using multiple AI models through one OpenAI-compatible API.

That makes it convenient to switch between models, but it also makes observability more important.

TokenBay can show usage details at the API layer. I still want my own application logs because my app knows things the API layer cannot always know:

  • which product feature triggered the request
  • whether the answer was shown to the user
  • whether the request was part of a retry chain
  • whether fallback was intentional
  • whether the user abandoned the flow before the model answered

The more flexible your model setup becomes, the more important boring logs become.

My current rule

For every production LLM call, I want enough information to debug four questions:

  • Why did this request fail?
  • Why was this request slow?
  • Why did this request cost more than expected?
  • Which model actually answered?

If my logs cannot answer those, I am probably flying blind.

The annoying part is that you usually do not notice this on day one.

You notice it later, when something is already weird and your only log line says:

LLM request completed

Ask me how I know.

Top comments (2)

Tae Kim

The field I'd add is a context_utilization_ratio: (input_tokens / model_context_window_limit). The silent failures I've hit most often aren't errors -- they're quality degradation as the system prompt grows over time and you quietly cross 60-70% context utilization, where most models start losing fidelity on earlier context. It shows up as "the AI feature feels different lately" with zero error logs. The other addition I'd strongly suggest is an operation_name field tied to the user-facing feature rather than just model/provider. When you have 8 different features routing through the same model, that's the field that actually tells you which user journey is causing the p95 spike.

plasma

Haha yeah, exactly. But I might call the first one context_pressure: not just raw input tokens, but how close the request is getting to the model’s practical context limit.