DEV Community

plasma

LLM API debugging checklist

When an LLM feature breaks in production, my first instinct used to be: "the model got worse."

That was usually the wrong place to start.

Most of the painful bugs I've debugged around LLM APIs had nothing to do with model quality. They were boring infrastructure problems hiding behind a model response: missing request metadata, silent timeouts, partial streaming output, retry logic that made things worse, or logs that captured the prompt but not the thing that actually failed.

Here's the checklist I use now before blaming the model.

1. Can I Reproduce the Exact Request?

The first question is simple:

Can I replay the exact same request that failed?

For LLM calls, "exact same" means more than just the user prompt.

I want to capture:

  • model name
  • provider
  • base URL
  • request timestamp
  • request ID if available
  • full message array
  • temperature / top_p / max_tokens
  • tool definitions
  • response format settings
  • streaming vs non-streaming mode
  • timeout settings
  • retry attempt number
  • user/session/job ID from my own system

The mistake I used to make was logging only the final prompt string. That works for simple demos, but production calls usually depend on system messages, tool schemas, routing logic, and runtime options.

A minimal structured log might look like this:

const requestLog = {
  event: "llm_request_started",
  request_id: crypto.randomUUID(),
  provider: "openai-compatible-provider",
  model: "gpt-4.1-mini",
  stream: true,
  temperature: 0.2,
  max_tokens: 800,
  message_count: messages.length,
  has_tools: Array.isArray(tools) && tools.length > 0, // tools may be undefined when the call uses none
  timeout_ms: 30000,
  retry_attempt: 0,
  created_at: new Date().toISOString(),
};

I usually avoid logging raw user content unless I have a clear privacy policy and retention plan. But I do log enough metadata to know what shape of request failed.
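One way to keep failed requests comparable without storing raw user content is to log a fingerprint of the message array next to the metadata. The sketch below is only an illustration under stated assumptions: `fingerprintMessages` and `buildRequestLog` are hypothetical helper names, and hashing the JSON-serialized messages with SHA-256 is an assumed choice, not something any SDK does for you.

```javascript
const { createHash, randomUUID } = require("node:crypto");

// Hypothetical helper: hash the serialized message array so two requests can
// be compared for equality without keeping the raw content in the logs.
// Note: JSON.stringify is sensitive to key order, so build messages consistently.
function fingerprintMessages(messages) {
  return createHash("sha256").update(JSON.stringify(messages)).digest("hex");
}

// Hypothetical helper: build the structured log entry from the request options.
function buildRequestLog({
  provider,
  model,
  messages,
  tools = [],
  stream = false,
  temperature,
  maxTokens,
  timeoutMs,
  retryAttempt = 0,
}) {
  return {
    event: "llm_request_started",
    request_id: randomUUID(),
    provider,
    model,
    stream,
    temperature,
    max_tokens: maxTokens,
    message_count: messages.length,
    messages_sha256: fingerprintMessages(messages),
    has_tools: tools.length > 0,
    timeout_ms: timeoutMs,
    retry_attempt: retryAttempt,
    created_at: new Date().toISOString(),
  };
}
```

If two failed calls share the same `messages_sha256`, the same input failed twice; if the hashes differ, prompt assembly changed between attempts.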

2. Did the Request Actually Reach the Provider?

A surprising number of "LLM bugs" are not LLM bugs.

They are:

  • DNS issues
  • auth failures
  • proxy errors
  • gateway timeouts
  • invalid base URLs
  • SDK configuration mistakes
  • environment variables missing in one deploy target

Before looking at the model response, check whether the request reached the provider at all.

Log the transport layer separately from the model layer:

try {
  const startedAt = Date.now();

  const response = await client.chat.completions.create({
    model,
    messages,
    stream: false,
  });

  console.log({
    event: "llm_request_completed",
    latency_ms: Date.now() - startedAt,
    model,
    response_id: response.id,
    finish_reason: response.choices?.[0]?.finish_reason,
  });
} catch (error) {
  console.error({
    event: "llm_request_failed",
    model,
    error_name: error.name,
    error_message: error.message,
    status: error.status,
    code: error.code,
    type: error.type,
  });

  throw error;
}

The important part is separating:

  • network failure
  • HTTP error
  • provider error
  • model refusal
  • empty model output
  • malformed output
  • application parsing failure

Those are different problems. They should not all show up as "LLM failed".
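A rough way to keep those categories apart is a small classifier that runs in the catch block. This is a sketch, not a complete taxonomy: `classifyFailure` is a hypothetical name, the `status`, `code`, and `name` fields mirror the ones logged in the example above, and real SDKs or providers may shape their errors differently.

```javascript
// Node.js system error codes that indicate the request never got a response.
const NETWORK_CODES = new Set([
  "ENOTFOUND",
  "ECONNREFUSED",
  "ECONNRESET",
  "ETIMEDOUT",
  "EAI_AGAIN",
]);

// Hypothetical classifier: map a caught error to a coarse failure category.
function classifyFailure(error) {
  // JSON.parse throws SyntaxError, which is an application-side problem.
  if (error instanceof SyntaxError) return "application_parsing_failure";
  if (NETWORK_CODES.has(error.code)) return "network_failure";
  if (error.name === "AbortError" || error.name === "TimeoutError") return "timeout";

  if (typeof error.status === "number") {
    if (error.status === 401 || error.status === 403) return "auth_failure";
    if (error.status === 429) return "rate_limited";
    if (error.status >= 500) return "provider_error";
    if (error.status >= 400) return "invalid_request";
  }
  return "unknown";
}
```

Logging the result as something like `failure_kind` lets alerts and dashboards separate a DNS problem from a rate limit without parsing error messages.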

3. Was the Response Empty, Truncated, or Just Invalid?

There are three very different failure modes that often get mixed together:

  1. The model returned nothing
  2. The model returned a partial answer
  3. The model returned something your app could not parse

They need different fixes.

For normal non-streaming calls, I log:

  • response ID
  • finish reason
  • output length
  • token usage if available
  • whether content was empty
  • whether JSON parsing failed
  • whether schema validation failed

Example:

const content = response.choices?.[0]?.message?.content ?? "";
const finishReason = response.choices?.[0]?.finish_reason;

console.log({
  event: "llm_response_received",
  response_id: response.id,
  model: response.model,
  finish_reason: finishReason,
  output_chars: content.length,
  prompt_tokens: response.usage?.prompt_tokens,
  completion_tokens: response.usage?.completion_tokens,
  total_tokens: response.usage?.total_tokens,
});

If finish_reason is "length", I do not treat that the same as a model quality issue. It usually means my max_tokens value was too low, the prompt asked for too much, or the response format was too verbose.

If JSON parsing fails, I log that as an application-level failure:

try {
  const parsed = JSON.parse(content);
  return parsed;
} catch (error) {
  console.error({
    event: "llm_json_parse_failed",
    model: response.model,
    response_id: response.id,
    finish_reason: finishReason,
    output_preview: content.slice(0, 500),
  });

  throw error;
}

That distinction matters. A provider outage and a JSON parse failure should not trigger the same incident response.
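The three failure modes from this section can be told apart with one small function. The sketch below assumes an OpenAI-style chat completion shape and a response that is supposed to be JSON text with no tool calls; `classifyResponse` and its `kind` labels are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: decide whether a response is empty, truncated,
// unparseable, or usable. Truncation is checked before parsing because a
// response cut off by max_tokens will usually also fail JSON.parse.
function classifyResponse(response) {
  const choice = response.choices?.[0];
  const content = choice?.message?.content ?? "";
  const finishReason = choice?.finish_reason ?? null;

  if (content.trim() === "") return { kind: "empty", finishReason };
  if (finishReason === "length") return { kind: "truncated", finishReason };

  try {
    return { kind: "ok", finishReason, parsed: JSON.parse(content) };
  } catch {
    return { kind: "invalid_json", finishReason };
  }
}
```

Each `kind` points at a different fix: "empty" at the provider or a refusal, "truncated" at max_tokens or prompt size, and "invalid_json" at the prompt, the response format settings, or the parser.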

4. If Streaming Failed, How Far Did It Get?

Streaming makes debugging harder because failure can happen after the request has already started successfully.

For streaming calls, I want to know:

  • did the stream open?
  • when did the first token arrive?
  • how many chunks arrived?
  • how many characters were received?
  • did the stream end cleanly?
  • was there a final usage chunk?
  • did the client disconnect first?
  • did my server time out first?
  • did the provider close the stream?

Here's a simplified example:

let chunkCount = 0;
let outputChars = 0;
let firstChunkAt = null;
const startedAt = Date.now();

try {
  const stream = await client.chat.completions.create({
    model,
    messages,
    stream: true,
  });

  for await (const chunk of stream) {
    chunkCount += 1;

    if (!firstChunkAt) {
      firstChunkAt = Date.now();
    }

    const delta = chunk.choices?.[0]?.delta?.content ?? "";
    outputChars += delta.length;

    // Send delta to the client here
  }

  console.log({
    event: "llm_stream_completed",
    model,
    chunk_count: chunkCount,
    output_chars: outputChars,
    time_to_first_chunk_ms: firstChunkAt ? firstChunkAt - startedAt : null,
    total_latency_ms: Date.now() - startedAt,
  });
} catch (error) {
  console.error({
    event: "llm_stream_failed",
    model,
    chunk_count: chunkCount,
    output_chars: outputChars,
    time_to_first_chunk_ms: firstChunkAt ? firstChunkAt - startedAt : null,
    latency_before_error_ms: Date.now() - startedAt,
    error_name: error.name,
    error_message: error.message,
  });

  throw error;
}

This tells me whether I'm dealing with:

  • no response at all
  • slow first token
  • mid-stream failure
  • client disconnect
  • provider timeout
  • application-side streaming bug

Without this, all I know is "the stream broke," which is not enough.
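The counters from the streaming example can be reduced to a single label. The following is a sketch under assumptions: `classifyStreamOutcome` is a hypothetical helper, the `clientAborted` and `timedOut` flags are assumed to be set by your own server code (for example from a connection close handler or your own timeout), and the 5000 ms threshold for a slow first chunk is an arbitrary placeholder.

```javascript
// Hypothetical classifier over the counters collected while reading a stream.
function classifyStreamOutcome({
  chunkCount,
  timeToFirstChunkMs,
  endedCleanly,
  clientAborted = false, // assumed: set when the downstream client went away
  timedOut = false, // assumed: set when your own timeout fired
  slowFirstChunkMs = 5000, // assumed threshold; tune to your latency budget
}) {
  if (clientAborted) return "client_disconnect";
  if (chunkCount === 0) {
    return timedOut ? "timeout_before_first_chunk" : "no_response";
  }
  if (!endedCleanly) {
    return timedOut ? "timeout_mid_stream" : "mid_stream_failure";
  }
  if (timeToFirstChunkMs > slowFirstChunkMs) return "slow_first_token";
  return "completed";
}
```

The order of the checks matters: a client disconnect is reported first because it usually explains everything that follows it.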

5. Did My Retry Logic Make It Worse?

Retries are useful until they quietly multiply your problems.

For LLM APIs, I log every retry attempt with:

  • original request ID
  • retry attempt number
  • delay before retry
  • error that caused the retry
  • whether the retry used the same model
  • whether the retry used a fallback model
  • whether the operation was safe to retry

Example:

console.warn({
  event: "llm_retry_scheduled",
  original_request_id: requestId,
  retry_attempt: attempt + 1,
  delay_ms: delay,
  model,
  reason: error.message,
  status: error.status,
});

The big trap is retrying requests that already caused side effects.

For example, if the LLM call is part of a workflow that sends an email, creates a ticket, charges credits, or writes to a database, retries need idempotency keys and careful boundaries.

Otherwise, a timeout can turn into duplicate actions.

My rule is:

  • retry transport failures carefully
  • retry rate limits with backoff
  • do not blindly retry tool execution
  • do not retry user-visible side effects without idempotency
  • always log whether a response came back before retrying
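Those rules can be written down as a small decision function. This is a minimal sketch, not a production retry policy: `decideRetry` and `backoffDelayMs` are hypothetical names, the set of retryable HTTP statuses is an assumption, `attempt` is zero-based, and the backoff has no jitter so that the example stays deterministic.

```javascript
// Assumed set of HTTP statuses worth retrying; adjust per provider.
const RETRYABLE_STATUS = new Set([408, 429, 500, 502, 503, 504]);

// Exponential backoff capped at maxMs: 500, 1000, 2000, 4000, 8000, 8000, ...
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical helper: decide whether a failed attempt may be retried.
function decideRetry({
  error,
  attempt, // zero-based index of the attempt that just failed
  maxAttempts = 3,
  hasSideEffects = false,
  idempotencyKey = null,
}) {
  if (attempt + 1 >= maxAttempts) {
    return { retry: false, reason: "max_attempts_reached" };
  }
  if (hasSideEffects && !idempotencyKey) {
    return { retry: false, reason: "side_effects_without_idempotency_key" };
  }
  // No HTTP status is treated as a transport failure (DNS, reset, timeout).
  const isTransport = error.status === undefined;
  if (!isTransport && !RETRYABLE_STATUS.has(error.status)) {
    return { retry: false, reason: "non_retryable_status" };
  }
  return {
    retry: true,
    reason: isTransport ? "transport_failure" : "retryable_status",
    delay_ms: backoffDelayMs(attempt),
  };
}
```

The returned `reason` is meant to go straight into the retry log event, so the log records why a retry happened or was refused. In a real system, adding random jitter to the delay helps avoid synchronized retries.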

6. Is the Failure Model-Specific or Provider-Specific?

When using OpenAI-compatible APIs, it is easy to swap models or providers and assume everything else is equivalent.

It usually is not.

Different models and providers can vary in:

  • streaming behavior
  • tool calling support
  • JSON mode support
  • context window limits
  • rate limit headers
  • error response shape
  • usage reporting
  • timeout behavior
  • finish reason semantics

So I try to isolate the failure:

  • same prompt, same provider, different model
  • same prompt, same model family, different provider if available
  • same prompt, non-streaming instead of streaming
  • same prompt, no tools
  • same prompt, smaller context
  • same prompt, lower max tokens

This usually tells me whether I'm looking at a model behavior issue, a provider compatibility issue, or my own integration bug.
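The isolation steps above can be generated mechanically so that each variant changes exactly one thing. The sketch below is illustrative only: `buildIsolationVariants` is a hypothetical helper, the request is assumed to use OpenAI-style field names (`messages`, `tools`, `tool_choice`, `stream`, `max_tokens`), and the cutoffs of 4 messages and 256 tokens are arbitrary.

```javascript
// Hypothetical helper: derive one-change-at-a-time variants of a failing request.
function buildIsolationVariants(request, { fallbackModel, keepLastMessages = 4 } = {}) {
  // Drop tool-related fields for the "no_tools" variant.
  const { tools, tool_choice, ...withoutTools } = request;

  // Keep system messages, then only the most recent conversation turns.
  const system = request.messages.filter((m) => m.role === "system");
  const rest = request.messages.filter((m) => m.role !== "system");
  const trimmed = [...system, ...rest.slice(-keepLastMessages)];

  const variants = [
    { label: "baseline", request },
    { label: "non_streaming", request: { ...request, stream: false } },
    { label: "no_tools", request: withoutTools },
    { label: "smaller_context", request: { ...request, messages: trimmed } },
    {
      label: "lower_max_tokens",
      request: { ...request, max_tokens: Math.min(request.max_tokens ?? 256, 256) },
    },
  ];

  if (fallbackModel) {
    variants.push({ label: "different_model", request: { ...request, model: fallbackModel } });
  }
  return variants;
}
```

Running each variant and recording which labels succeed gives a quick picture: if only "no_tools" passes, look at the tool schemas; if only "non_streaming" passes, look at the streaming path.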

7. Can I See the Full Timeline?

For production debugging, individual logs are not enough. I want the timeline.

A useful LLM request timeline looks like this:

00ms    request accepted by app
12ms    prompt assembled
18ms    provider selected
25ms    LLM request started
430ms   first token received
2840ms  stream chunk 50 received
5100ms  stream completed
5120ms  output parsed
5190ms  database write completed
5300ms  response sent to client

Or, for a bad request:

00ms     request accepted by app
10ms     prompt assembled
18ms     provider selected
30ms     LLM request started
30030ms  provider timeout
30035ms  retry scheduled
32040ms  retry started
62045ms  retry timeout
62050ms  request failed

This makes the real issue obvious.

Maybe the model was fine, but the timeout was too aggressive. Maybe the first token was fast, but the downstream parser failed. Maybe the app spent 4 seconds building context before the LLM call even started.

Without a timeline, it is too easy to blame the most mysterious part of the stack.
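A timeline like the ones above does not need a tracing system to get started. Here is a minimal sketch of an in-process recorder: `createTimeline`, `mark`, and `render` are hypothetical names, and the clock is injectable (defaulting to `Date.now`) so the output can be tested deterministically.

```javascript
// Hypothetical recorder: collect labeled events with offsets from the start.
function createTimeline(now = Date.now) {
  const startedAt = now();
  const events = [];

  return {
    events,
    // Record a label at the current offset from the start of the request.
    mark(label) {
      events.push({ offset_ms: now() - startedAt, label });
    },
    // Render one line per event, with the offsets padded into a column.
    render() {
      const stamps = events.map((e) => `${e.offset_ms}ms`);
      const width = Math.max(0, ...stamps.map((s) => s.length)) + 2;
      return events.map((e, i) => stamps[i].padEnd(width) + e.label).join("\n");
    },
  };
}
```

In use, one timeline is created per incoming request, `mark` is called at each stage (prompt assembled, provider selected, first token received, and so on), and `render()` or the raw `events` array is attached to the final log entry.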

8. Do I Have a Small Test Case?

After I identify the likely failure, I try to reduce it to a tiny test.

For example:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.LLM_API_KEY,
  baseURL: process.env.LLM_BASE_URL,
});

const response = await client.chat.completions.create({
  model: process.env.LLM_MODEL,
  messages: [
    { role: "system", content: "Return only valid JSON." },
    { role: "user", content: "Return an object with one key: status." },
  ],
  temperature: 0,
  response_format: { type: "json_object" },
});

console.log(response.choices[0].message.content);

A small test case helps answer:

  • is my production prompt too complex?
  • is tool calling the problem?
  • is JSON mode supported?
  • is streaming the problem?
  • is the SDK configured correctly?
  • is the provider returning the shape I expect?

If the small test fails, the issue is probably at the integration level: credentials, base URL, SDK configuration, or a feature the provider does not support.

If the small test passes, the issue is probably in my application logic, prompt assembly, context size, or output parsing.

9. What I Put in My Default LLM Debug Log

My default log event for LLM calls usually includes this shape:

{
  event: "llm_call",
  request_id: "...",
  user_id: "...",
  provider: "...",
  model: "...",
  stream: true,
  status: "completed",
  latency_ms: 3200,
  time_to_first_token_ms: 420,
  input_tokens: 1200,
  output_tokens: 300,
  finish_reason: "stop",
  retry_attempts: 0,
  error_type: null,
  error_message: null
}

For failures:

{
  event: "llm_call",
  request_id: "...",
  provider: "...",
  model: "...",
  stream: true,
  status: "failed",
  latency_ms: 30000,
  time_to_first_token_ms: null,
  chunks_received: 0,
  output_chars: 0,
  retry_attempts: 2,
  error_type: "timeout",
  error_message: "Request timed out after 30000ms"
}

That is usually enough to start debugging without drowning in logs.

10. The Short Checklist

Before blaming the model, I check:

  • Can I replay the exact request?
  • Did the request reach the provider?
  • Was it a network, HTTP, provider, model, parsing, or app failure?
  • Was the response empty, truncated, malformed, or refused?
  • If streaming failed, how many chunks did I receive?
  • Did the first token arrive?
  • Did my timeout happen before the provider finished?
  • Did retry logic duplicate the problem?
  • Is the issue model-specific or provider-specific?
  • Does the same request work without streaming?
  • Does the same request work without tools?
  • Can I reproduce it with a tiny test case?
  • Do I have a timeline from app request to final response?

Final Thought

The model is the easy thing to blame because it feels like a black box.

But a lot of LLM production bugs are regular distributed systems problems wearing a very expensive hat: timeouts, retries, partial responses, schema mismatches, bad observability, and unclear ownership between the app, SDK, provider, and model.

The fix is not always a better model.

Sometimes it is just better logs.

I work on TokenBay, so I spend a lot of time thinking about this layer between apps and model providers. The more models and providers you route through, the more valuable boring debugging discipline becomes.
