plasma

Posted on Jul 1

LLM API debugging checklist

#llm #ai #javascript #node

When an LLM feature breaks in production, my first instinct used to be: "the model got worse."

That was usually the wrong place to start.

Most of the painful bugs I've debugged around LLM APIs had nothing to do with model quality. They were boring infrastructure problems hiding behind a model response: missing request metadata, silent timeouts, partial streaming output, retry logic that made things worse, or logs that captured the prompt but not the thing that actually failed.

Here's the checklist I use now before blaming the model.

1. Can I Reproduce the Exact Request?

The first question is simple:

Can I replay the exact same request that failed?

For LLM calls, "exact same" means more than just the user prompt.

I want to capture:

model name
provider
base URL
request timestamp
request ID if available
full message array
temperature / top_p / max_tokens
tool definitions
response format settings
streaming vs non-streaming mode
timeout settings
retry attempt number
user/session/job ID from my own system

The mistake I used to make was logging only the final prompt string. That works for simple demos, but production calls usually depend on system messages, tool schemas, routing logic, and runtime options.

A minimal structured log might look like this:

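import { randomUUID } from "node:crypto";

// Metadata only: enough to identify the shape of the request without storing raw content.
const requestLog = {
  event: "llm_request_started",
  request_id: randomUUID(),
  provider: "openai-compatible-provider",
  model: "gpt-4.1-mini",
  stream: true,
  temperature: 0.2,
  max_tokens: 800,
  message_count: messages.length,
  // Guard against calls that pass no tools at all.
  has_tools: Array.isArray(tools) && tools.length > 0,
  timeout_ms: 30000,
  retry_attempt: 0,
  created_at: new Date().toISOString(),
};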

I usually avoid logging raw user content unless I have a clear privacy policy and retention plan. But I do log enough metadata to know what shape of request failed.
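One way to log the shape of a request without the content is to record roles and lengths instead of text. This is a minimal sketch, not a definitive implementation: `describeRequestShape` is a hypothetical helper name, and it assumes OpenAI-style messages (`role`, `content`) and tool definitions (`function.name`).

```javascript
// Hypothetical helper: summarizes a chat request without copying raw message text.
// Assumes OpenAI-style message and tool objects.
function describeRequestShape({ model, messages = [], tools = [], stream = false, ...options }) {
  return {
    model,
    stream,
    temperature: options.temperature ?? null,
    max_tokens: options.max_tokens ?? null,
    message_count: messages.length,
    // Roles and lengths describe the structure without exposing content.
    message_roles: messages.map((m) => m.role),
    message_chars: messages.map((m) =>
      typeof m.content === "string" ? m.content.length : null
    ),
    tool_names: tools.map((t) => t.function?.name ?? "unknown"),
  };
}

const shape = describeRequestShape({
  model: "gpt-4.1-mini",
  stream: true,
  temperature: 0.2,
  messages: [
    { role: "system", content: "Return only valid JSON." },
    { role: "user", content: "Hi" },
  ],
});

console.log(shape);
```

Lengths and roles are often enough to spot a missing system message or an unexpectedly large context, and they carry far less privacy risk than the text itself.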

2. Did the Request Actually Reach the Provider?

A surprising number of "LLM bugs" are not LLM bugs.

They are:

DNS issues
auth failures
proxy errors
gateway timeouts
invalid base URLs
SDK configuration mistakes
environment variables missing in one deploy target

Before looking at the model response, check whether the request reached the provider at all.

Log the transport layer separately from the model layer:

// Start the clock outside the try block so failures can report latency too.
const startedAt = Date.now();

try {
  const response = await client.chat.completions.create({
    model,
    messages,
    stream: false,
  });

  console.log({
    event: "llm_request_completed",
    latency_ms: Date.now() - startedAt,
    model,
    response_id: response.id,
    finish_reason: response.choices?.[0]?.finish_reason,
  });
} catch (error) {
  // status, code, and type are only present when the SDK attaches them;
  // a pure network failure may leave them undefined.
  console.error({
    event: "llm_request_failed",
    latency_ms: Date.now() - startedAt,
    model,
    error_name: error.name,
    error_message: error.message,
    status: error.status,
    code: error.code,
    type: error.type,
  });

  throw error;
}

The important part is separating:

network failure
HTTP error
provider error
model refusal
empty model output
malformed output
application parsing failure

Those are different problems. They should not all show up as a generic "LLM failed" error.
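To keep thrown errors out of one generic bucket, a small classifier can map them to categories. This is a rough sketch under assumptions: `classifyLlmError` is a hypothetical name, the category labels are my own, and the `status` and `code` fields depend on what your SDK or HTTP client actually attaches to the error. It only covers thrown errors; refusals and empty output have to be detected from the response itself.

```javascript
// Hypothetical classifier for thrown errors. The `status` and `code` fields
// are assumptions about the error object; check what your SDK attaches.
const NETWORK_CODES = new Set(["ENOTFOUND", "ECONNREFUSED", "ECONNRESET", "ETIMEDOUT"]);

function classifyLlmError(error) {
  // JSON.parse throws SyntaxError, which is an application-level failure.
  if (error instanceof SyntaxError) return "parse_failure";
  // Node.js system error codes mean the request never got a response.
  if (NETWORK_CODES.has(error.code)) return "network_failure";
  if (error.status === 401 || error.status === 403) return "auth_failure";
  if (error.status === 429) return "rate_limited";
  if (error.status >= 500) return "provider_error";
  if (error.status >= 400) return "invalid_request";
  return "unknown";
}

console.log(classifyLlmError({ code: "ENOTFOUND" }));
console.log(classifyLlmError({ status: 503 }));
```

Logging the category as its own field makes it possible to alert on `provider_error` without paging anyone for `parse_failure`.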

3. Was the Response Empty, Truncated, or Just Invalid?

There are three very different failure modes that often get mixed together:

The model returned nothing
The model returned a partial answer
The model returned something your app could not parse

They need different fixes.

For normal non-streaming calls, I log:

response ID
finish reason
output length
token usage if available
whether content was empty
whether JSON parsing failed
whether schema validation failed

Example:

const content = response.choices?.[0]?.message?.content ?? "";
const finishReason = response.choices?.[0]?.finish_reason;

console.log({
  event: "llm_response_received",
  response_id: response.id,
  model: response.model,
  finish_reason: finishReason,
  output_chars: content.length,
  prompt_tokens: response.usage?.prompt_tokens,
  completion_tokens: response.usage?.completion_tokens,
  total_tokens: response.usage?.total_tokens,
});

If finish_reason is "length", I do not treat it as a model quality issue. It usually means my max_tokens value was too low, the prompt asked for too much, or the response format was too verbose.

If JSON parsing fails, I log that as an application-level failure:

try {
  const parsed = JSON.parse(content);
  return parsed;
} catch (error) {
  console.error({
    event: "llm_json_parse_failed",
    model: response.model,
    response_id: response.id,
    finish_reason: finishReason,
    output_preview: content.slice(0, 500),
  });

  throw error;
}

That distinction matters. A provider outage and a JSON parse failure should not trigger the same incident response.
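The three failure modes can be told apart with one small function. This is a sketch under assumptions: `classifyResponse` is a hypothetical name, the labels are my own, and it assumes a non-streaming chat completion that is expected to contain JSON text. A response that carries only tool calls would be reported as empty here, so adjust it if you use tools.

```javascript
// Hypothetical helper: labels a non-streaming chat completion that is
// expected to contain JSON text. Assumes the OpenAI-style response shape.
function classifyResponse(response) {
  const choice = response.choices?.[0];
  const content = choice?.message?.content ?? "";

  // Check emptiness first: there is nothing to parse or truncate.
  if (content.trim() === "") return "empty";

  // A "length" finish reason means the output was cut off by the token limit.
  if (choice?.finish_reason === "length") return "truncated";

  try {
    JSON.parse(content);
    return "ok";
  } catch {
    return "invalid_json";
  }
}

console.log(
  classifyResponse({
    choices: [{ message: { content: '{"status":' }, finish_reason: "length" }],
  })
);
```

Checking truncation before parsing matters: a truncated JSON document always fails to parse, but the fix is a higher token limit, not a better parser.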

4. If Streaming Failed, How Far Did It Get?

Streaming makes debugging harder because failure can happen after the request has already started successfully.

For streaming calls, I want to know:

did the stream open?
when did the first token arrive?
how many chunks arrived?
how many characters were received?
did the stream end cleanly?
was there a final usage chunk?
did the client disconnect first?
did my server time out first?
did the provider close the stream?

Here's a simplified example:

let chunkCount = 0;
let outputChars = 0;
let firstChunkAt = null;
const startedAt = Date.now();

try {
  const stream = await client.chat.completions.create({
    model,
    messages,
    stream: true,
  });

  for await (const chunk of stream) {
    chunkCount += 1;

    if (!firstChunkAt) {
      firstChunkAt = Date.now();
    }

    const delta = chunk.choices?.[0]?.delta?.content ?? "";
    outputChars += delta.length;

    // Send delta to the client here
  }

  console.log({
    event: "llm_stream_completed",
    model,
    chunk_count: chunkCount,
    output_chars: outputChars,
    time_to_first_chunk_ms: firstChunkAt ? firstChunkAt - startedAt : null,
    total_latency_ms: Date.now() - startedAt,
  });
} catch (error) {
  console.error({
    event: "llm_stream_failed",
    model,
    chunk_count: chunkCount,
    output_chars: outputChars,
    time_to_first_chunk_ms: firstChunkAt ? firstChunkAt - startedAt : null,
    latency_before_error_ms: Date.now() - startedAt,
    error_name: error.name,
    error_message: error.message,
  });

  throw error;
}

This helps me narrow down whether I'm dealing with:

no response at all
slow first token
mid-stream failure
client disconnect
provider timeout
application-side streaming bug

Without this, all I know is "the stream broke," which is not enough.
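The counters collected during streaming can be turned into a first-pass label. This is only a sketch: `classifyStreamFailure` is a hypothetical name, the 5000 ms threshold is an arbitrary assumption, and `clientAborted` is assumed to come from your own server (for example, a close event on the incoming request). Telling a provider timeout apart from an application bug usually needs more signals than these.

```javascript
// Hypothetical helper: first-pass label for a failed stream, based on
// counters the application collected while reading chunks.
function classifyStreamFailure({
  chunkCount,
  timeToFirstChunkMs,
  clientAborted = false,
  slowFirstChunkMs = 5000, // arbitrary threshold, tune for your workload
}) {
  // If our own client went away, the provider is probably not at fault.
  if (clientAborted) return "client_disconnect";

  // The stream opened (or failed to open) but never produced a chunk.
  if (chunkCount === 0) return "no_response";

  if (timeToFirstChunkMs !== null && timeToFirstChunkMs > slowFirstChunkMs) {
    return "slow_first_token";
  }

  // Chunks arrived on time, then the stream broke.
  return "mid_stream_failure";
}

console.log(classifyStreamFailure({ chunkCount: 12, timeToFirstChunkMs: 400 }));
```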

5. Did My Retry Logic Make It Worse?

Retries are useful until they quietly multiply your problems.

For LLM APIs, I log every retry attempt with:

original request ID
retry attempt number
delay before retry
error that caused the retry
whether the retry used the same model
whether the retry used a fallback model
whether the operation was safe to retry

Example:

console.warn({
  event: "llm_retry_scheduled",
  original_request_id: requestId,
  retry_attempt: attempt + 1,
  delay_ms: delay,
  model,
  reason: error.message,
  status: error.status,
});

The big trap is retrying requests that already caused side effects.

For example, if the LLM call is part of a workflow that sends an email, creates a ticket, charges credits, or writes to a database, retries need idempotency keys and careful boundaries.

Otherwise, a timeout can turn into duplicate actions.

My rule is:

retry transport failures carefully
retry rate limits with backoff
do not blindly retry tool execution
do not retry user-visible side effects without idempotency
always log whether a response came back before retrying
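These rules can be written down as a single decision function. This is a sketch, not a definitive policy: `decideRetry` is a hypothetical name, `attempt` counts retries already made (starting at 0), the retryable status codes and network codes are assumptions, and the backoff has no jitter so the example stays deterministic. Production code would normally add jitter.

```javascript
// Hypothetical retry policy. `attempt` is the number of retries already made.
function decideRetry({
  error,
  attempt,
  maxAttempts = 3,
  hasSideEffects = false,
  idempotencyKey = null,
}) {
  if (attempt >= maxAttempts) {
    return { retry: false, reason: "max_attempts_reached" };
  }

  // Never repeat emails, tickets, charges, or writes without an idempotency key.
  if (hasSideEffects && !idempotencyKey) {
    return { retry: false, reason: "side_effects_without_idempotency" };
  }

  const retryableStatus = error.status === 429 || error.status >= 500;
  const transportFailure = ["ECONNRESET", "ETIMEDOUT"].includes(error.code);

  if (!retryableStatus && !transportFailure) {
    return { retry: false, reason: "not_retryable" };
  }

  // Exponential backoff: 1s, 2s, 4s, ... capped at 30s. No jitter in this sketch.
  const delayMs = Math.min(1000 * 2 ** attempt, 30000);

  return {
    retry: true,
    reason: retryableStatus ? "retryable_status" : "transport_failure",
    delayMs,
  };
}

console.log(decideRetry({ error: { status: 429 }, attempt: 0 }));
```

Returning a reason alongside the decision means the retry log can record why a retry was skipped, not only why one was scheduled.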

6. Is the Failure Model-Specific or Provider-Specific?

When using OpenAI-compatible APIs, it is easy to swap models or providers and assume everything else is equivalent.

It usually is not.

Different models and providers can vary in:

streaming behavior
tool calling support
JSON mode support
context window limits
rate limit headers
error response shape
usage reporting
timeout behavior
finish reason semantics

So I try to isolate the failure:

same prompt, same provider, different model
same prompt, same model family, different provider if available
same prompt, non-streaming instead of streaming
same prompt, no tools
same prompt, smaller context
same prompt, lower max tokens

This usually tells me whether I'm looking at a model behavior issue, a provider compatibility issue, or my own integration bug.
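The isolation steps are easier to run consistently if the variants are generated from the failing request. This is a minimal sketch: `buildIsolationVariants` and `omitKeys` are hypothetical helpers, the labels are my own, and the "smaller context" rule (system messages plus the last non-system message) is just one possible reduction. Switching provider means switching client configuration, so it is not covered here.

```javascript
// Hypothetical helper: returns a copy of obj without the given keys.
function omitKeys(obj, keys) {
  return Object.fromEntries(
    Object.entries(obj).filter(([key]) => !keys.includes(key))
  );
}

// Hypothetical helper: derives one-change-at-a-time variants of a failing request.
function buildIsolationVariants(base, { alternateModel = null, lowerMaxTokens = 256 } = {}) {
  const messages = base.messages ?? [];
  const systemMessages = messages.filter((m) => m.role === "system");
  const lastMessage = messages.filter((m) => m.role !== "system").slice(-1);

  const variants = [
    { label: "baseline", request: { ...base } },
    { label: "non_streaming", request: { ...base, stream: false } },
    { label: "no_tools", request: omitKeys(base, ["tools", "tool_choice"]) },
    {
      label: "smaller_context",
      request: { ...base, messages: [...systemMessages, ...lastMessage] },
    },
    { label: "lower_max_tokens", request: { ...base, max_tokens: lowerMaxTokens } },
  ];

  if (alternateModel) {
    variants.push({ label: "different_model", request: { ...base, model: alternateModel } });
  }

  return variants;
}

const variants = buildIsolationVariants(
  { model: "model-a", stream: true, messages: [{ role: "user", content: "Hi" }] },
  { alternateModel: "model-b" }
);

console.log(variants.map((v) => v.label));
```

Each variant changes exactly one thing, so the first variant that succeeds points at the setting worth investigating.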

7. Can I See the Full Timeline?

For production debugging, individual logs are not enough. I want the timeline.

A useful LLM request timeline looks like this:

00ms    request accepted by app
12ms    prompt assembled
18ms    provider selected
25ms    LLM request started
430ms   first token received
2840ms  stream chunk 50 received
5100ms  stream completed
5120ms  output parsed
5190ms  database write completed
5300ms  response sent to client

Or, for a bad request:

00ms     request accepted by app
10ms     prompt assembled
18ms     provider selected
30ms     LLM request started
30030ms  provider timeout
30035ms  retry scheduled
32040ms  retry started
62045ms  retry timeout
62050ms  request failed

This makes the real issue obvious.

Maybe the model was fine, but the timeout was too aggressive. Maybe the first token was fast, but the downstream parser failed. Maybe the app spent 4 seconds building context before the LLM call even started.

Without a timeline, it is too easy to blame the most mysterious part of the stack.
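A timeline like the ones above does not need a tracing system to get started. This is a minimal in-memory sketch: `createTimeline` is a hypothetical helper, and the clock is injectable only so the output can be tested. In a real service the marks would more likely be attached to the request ID and shipped with the structured logs.

```javascript
// Hypothetical helper: records labeled marks relative to when the timeline was created.
function createTimeline(now = Date.now) {
  const startedAt = now();
  const events = [];

  return {
    // Record a label with its offset from the start, in milliseconds.
    mark(label) {
      events.push({ label, offset_ms: now() - startedAt });
    },
    // Render one line per mark, with the offset padded to a fixed width.
    render() {
      return events
        .map((e) => `${e.offset_ms}ms`.padEnd(8) + e.label)
        .join("\n");
    },
  };
}

const timeline = createTimeline();
timeline.mark("request accepted by app");
timeline.mark("prompt assembled");
console.log(timeline.render());
```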

8. Do I Have a Small Test Case?

After I identify the likely failure, I try to reduce it to a tiny test.

For example:

// Run this as an ES module (for example, a .mjs file) so top-level await works.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.LLM_API_KEY,
  baseURL: process.env.LLM_BASE_URL,
});

const response = await client.chat.completions.create({
  model: process.env.LLM_MODEL,
  messages: [
    { role: "system", content: "Return only valid JSON." },
    { role: "user", content: "Return an object with one key: status." },
  ],
  temperature: 0,
  response_format: { type: "json_object" },
});

console.log(response.choices[0].message.content);

A small test case helps answer:

is my production prompt too complex?
is tool calling the problem?
is JSON mode supported?
is streaming the problem?
is the SDK configured correctly?
is the provider returning the shape I expect?

If the small test fails, the issue is probably integration-level: SDK configuration, credentials, base URL, or provider compatibility.

If the small test passes, the issue is probably in my application logic, prompt assembly, context size, or output parsing.

9. What I Put in My Default LLM Debug Log

My default log event for LLM calls usually has this shape:

{
  event: "llm_call",
  request_id: "...",
  user_id: "...",
  provider: "...",
  model: "...",
  stream: true,
  status: "completed",
  latency_ms: 3200,
  time_to_first_token_ms: 420,
  input_tokens: 1200,
  output_tokens: 300,
  finish_reason: "stop",
  retry_attempts: 0,
  error_type: null,
  error_message: null
}

For failures:

{
  event: "llm_call",
  request_id: "...",
  provider: "...",
  model: "...",
  stream: true,
  status: "failed",
  latency_ms: 30000,
  time_to_first_token_ms: null,
  chunks_received: 0,
  output_chars: 0,
  retry_attempts: 2,
  error_type: "timeout",
  error_message: "Request timed out after 30000ms"
}

That is usually enough to start debugging without drowning in logs.

10. The Short Checklist

Before blaming the model, I check:

Can I replay the exact request?
Did the request reach the provider?
Was it a network, HTTP, provider, model, parsing, or app failure?
Was the response empty, truncated, malformed, or refused?
If streaming failed, how many chunks did I receive?
Did the first token arrive?
Did my timeout happen before the provider finished?
Did retry logic duplicate the problem?
Is the issue model-specific or provider-specific?
Does the same request work without streaming?
Does the same request work without tools?
Can I reproduce it with a tiny test case?
Do I have a timeline from app request to final response?

Final Thought

The model is the easy thing to blame because it feels like a black box.

But a lot of LLM production bugs are regular distributed systems problems wearing a very expensive hat: timeouts, retries, partial responses, schema mismatches, bad observability, and unclear ownership between the app, SDK, provider, and model.

The fix is not always a better model.

Sometimes it is just better logs.

I work on TokenBay, so I spend a lot of time thinking about this layer between apps and model providers. The more models and providers you route through, the more valuable boring debugging discipline becomes.
