When an LLM feature breaks in production, my first instinct used to be: "the model got worse."
That was usually the wrong place to start.
Most of the painful bugs I've debugged around LLM APIs had nothing to do with model quality. They were boring infrastructure problems hiding behind a model response: missing request metadata, silent timeouts, partial streaming output, retry logic that made things worse, or logs that captured the prompt but not the thing that actually failed.
Here's the checklist I use now before blaming the model.
1. Can I Reproduce the Exact Request?
The first question is simple:
Can I replay the exact same request that failed?
For LLM calls, "exact same" means more than just the user prompt.
I want to capture:
- model name
- provider
- base URL
- request timestamp
- request ID if available
- full message array
- temperature / top_p / max_tokens
- tool definitions
- response format settings
- streaming vs non-streaming mode
- timeout settings
- retry attempt number
- user/session/job ID from my own system
The mistake I used to make was logging only the final prompt string. That works for simple demos, but production calls usually depend on system messages, tool schemas, routing logic, and runtime options.
A minimal structured log might look like this:
const requestLog = {
  event: "llm_request_started",
  request_id: crypto.randomUUID(),
  provider: "openai-compatible-provider",
  model: "gpt-4.1-mini",
  stream: true,
  temperature: 0.2,
  max_tokens: 800,
  message_count: messages.length,
  has_tools: tools.length > 0,
  timeout_ms: 30000,
  retry_attempt: 0,
  created_at: new Date().toISOString(),
};
I usually avoid logging raw user content unless I have a clear privacy policy and retention plan. But I do log enough metadata to know what shape of request failed.
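One way to capture the shape of a request without storing its text is to summarize the message array. This is a minimal sketch, not a fixed recipe: `summarizeMessages` is a hypothetical helper name, and it assumes each message has a `role` and a string `content`.

```javascript
// Hypothetical helper: describe the message array without logging its text.
// Assumes each message looks like { role: string, content: string }.
function summarizeMessages(messages) {
  const roleCounts = {};
  let totalChars = 0;
  let largestMessageChars = 0;

  for (const message of messages) {
    roleCounts[message.role] = (roleCounts[message.role] ?? 0) + 1;
    // Non-string content (for example multi-part content) is counted as 0 here.
    const chars = typeof message.content === "string" ? message.content.length : 0;
    totalChars += chars;
    largestMessageChars = Math.max(largestMessageChars, chars);
  }

  return {
    message_count: messages.length,
    role_counts: roleCounts,
    total_chars: totalChars,
    largest_message_chars: largestMessageChars,
  };
}

const summary = summarizeMessages([
  { role: "system", content: "Return only valid JSON." },
  { role: "user", content: "Hello" },
]);
console.log(summary);
```

Merging a summary like this into the request log keeps the log useful for spotting oversized or oddly shaped requests while leaving raw user content out of it.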
2. Did the Request Actually Reach the Provider?
A surprising number of "LLM bugs" are not LLM bugs.
They are:
- DNS issues
- auth failures
- proxy errors
- gateway timeouts
- invalid base URLs
- SDK configuration mistakes
- environment variables missing in one deploy target
Before looking at the model response, check whether the request reached the provider at all.
Log the transport layer separately from the model layer:
try {
  const startedAt = Date.now();
  const response = await client.chat.completions.create({
    model,
    messages,
    stream: false,
  });
  console.log({
    event: "llm_request_completed",
    latency_ms: Date.now() - startedAt,
    model,
    response_id: response.id,
    finish_reason: response.choices?.[0]?.finish_reason,
  });
} catch (error) {
  console.error({
    event: "llm_request_failed",
    model,
    error_name: error.name,
    error_message: error.message,
    status: error.status,
    code: error.code,
    type: error.type,
  });
  throw error;
}
The important part is separating:
- network failure
- HTTP error
- provider error
- model refusal
- empty model output
- malformed output
- application parsing failure
Those are different problems. They should not all show up as "LLM failed."
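To make that separation concrete, here is a rough sketch of a classifier that maps a thrown error to a coarse failure layer. `classifyFailure` and its category names are hypothetical. It assumes the error may carry a Node.js system error `code` and an HTTP `status`, like the fields logged in the catch block above. The groupings are illustrative: a 5xx can also come from a proxy or gateway rather than the provider itself, and some SDKs wrap network errors without exposing `code`.

```javascript
// Hypothetical classifier: map a thrown error to a coarse failure layer.
// Assumes the error may carry `code` (Node.js system error code) and
// `status` (HTTP status). Category names are illustrative only.
const NETWORK_CODES = new Set([
  "ENOTFOUND",
  "ECONNREFUSED",
  "ECONNRESET",
  "ETIMEDOUT",
  "EAI_AGAIN",
]);

function classifyFailure(error) {
  if (NETWORK_CODES.has(error.code)) return "network_failure";
  if (error.status === 401 || error.status === 403) return "auth_failure";
  if (error.status === 429) return "rate_limited";
  if (error.status >= 500) return "provider_error";
  if (error.status >= 400) return "http_error";
  // JSON.parse throws SyntaxError, so this is an application-side failure.
  if (error instanceof SyntaxError) return "application_parsing_failure";
  return "unknown";
}

console.log(classifyFailure({ code: "ENOTFOUND" }));
console.log(classifyFailure({ status: 503 }));
```

Logging the returned category as its own field means a dashboard can group failures by layer instead of by free-text error message.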
3. Was the Response Empty, Truncated, or Just Invalid?
There are three very different failure modes that often get mixed together:
- The model returned nothing
- The model returned a partial answer
- The model returned something your app could not parse
They need different fixes.
For normal non-streaming calls, I log:
- response ID
- finish reason
- output length
- token usage if available
- whether content was empty
- whether JSON parsing failed
- whether schema validation failed
Example:
const content = response.choices?.[0]?.message?.content ?? "";
const finishReason = response.choices?.[0]?.finish_reason;
console.log({
  event: "llm_response_received",
  response_id: response.id,
  model: response.model,
  finish_reason: finishReason,
  output_chars: content.length,
  prompt_tokens: response.usage?.prompt_tokens,
  completion_tokens: response.usage?.completion_tokens,
  total_tokens: response.usage?.total_tokens,
});
If finish_reason is "length", I do not treat that as a model quality issue. It usually means my max_tokens value was too low, the prompt asked for too much, or the response format was too verbose.
If JSON parsing fails, I log that as an application-level failure:
try {
  const parsed = JSON.parse(content);
  return parsed;
} catch (error) {
  console.error({
    event: "llm_json_parse_failed",
    model: response.model,
    response_id: response.id,
    finish_reason: finishReason,
    output_preview: content.slice(0, 500),
  });
  throw error;
}
That distinction matters. A provider outage and a JSON parse failure should not trigger the same incident response.
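The three failure modes from the start of this section can be told apart with a small check. The sketch below is one possible version, assuming the app expects JSON output. `classifyResponse` is a hypothetical helper, and its arguments correspond to the `content` and `finishReason` values read from the response earlier.

```javascript
// Hypothetical helper: sort a non-streaming response into one failure mode.
// Assumes the application expects the content to be a JSON document.
function classifyResponse(content, finishReason) {
  if (content.trim().length === 0) return "empty";
  // A "length" finish reason means the output hit the token limit.
  if (finishReason === "length") return "truncated";
  try {
    JSON.parse(content);
  } catch {
    return "unparseable";
  }
  return "ok";
}

console.log(classifyResponse("", "stop"));
console.log(classifyResponse('{"status": "ok"}', "stop"));
```

The order of the checks matters: truncated output is usually also unparseable, so checking the finish reason first points at the token limit rather than the parser.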
4. If Streaming Failed, How Far Did It Get?
Streaming makes debugging harder because failure can happen after the request has already started successfully.
For streaming calls, I want to know:
- did the stream open?
- when did the first token arrive?
- how many chunks arrived?
- how many characters were received?
- did the stream end cleanly?
- was there a final usage chunk?
- did the client disconnect first?
- did my server timeout first?
- did the provider close the stream?
Here's a simplified example:
let chunkCount = 0;
let outputChars = 0;
let firstChunkAt = null;
const startedAt = Date.now();
try {
  const stream = await client.chat.completions.create({
    model,
    messages,
    stream: true,
  });
  for await (const chunk of stream) {
    chunkCount += 1;
    if (!firstChunkAt) {
      firstChunkAt = Date.now();
    }
    const delta = chunk.choices?.[0]?.delta?.content ?? "";
    outputChars += delta.length;
    // Send delta to the client here
  }
  console.log({
    event: "llm_stream_completed",
    model,
    chunk_count: chunkCount,
    output_chars: outputChars,
    time_to_first_chunk_ms: firstChunkAt ? firstChunkAt - startedAt : null,
    total_latency_ms: Date.now() - startedAt,
  });
} catch (error) {
  console.error({
    event: "llm_stream_failed",
    model,
    chunk_count: chunkCount,
    output_chars: outputChars,
    time_to_first_chunk_ms: firstChunkAt ? firstChunkAt - startedAt : null,
    latency_before_error_ms: Date.now() - startedAt,
    error_name: error.name,
    error_message: error.message,
  });
  throw error;
}
This tells me whether I'm dealing with:
- no response at all
- slow first token
- mid-stream failure
- client disconnect
- provider timeout
- application-side streaming bug
Without this, all I know is "the stream broke," which is not enough.
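As a rough illustration, the counters from the streaming example are enough to label the first three cases automatically. `classifyStream` is a hypothetical helper, and the 5000 ms threshold is an arbitrary example value. Counters alone cannot tell a client disconnect from a provider timeout, which still needs the error details and the server's own connection events.

```javascript
// Hypothetical helper: label a finished or failed stream from its counters.
// `slowFirstChunkMs` is an arbitrary example threshold, not a recommendation.
function classifyStream(
  { chunkCount, timeToFirstChunkMs, endedCleanly },
  slowFirstChunkMs = 5000
) {
  if (chunkCount === 0) return "no_response";
  if (!endedCleanly) return "mid_stream_failure";
  if (timeToFirstChunkMs > slowFirstChunkMs) return "slow_first_token";
  return "completed";
}

console.log(
  classifyStream({ chunkCount: 12, timeToFirstChunkMs: 430, endedCleanly: false })
);
```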
5. Did My Retry Logic Make It Worse?
Retries are useful until they quietly multiply your problems.
For LLM APIs, I log every retry attempt with:
- original request ID
- retry attempt number
- delay before retry
- error that caused the retry
- whether the retry used the same model
- whether the retry used a fallback model
- whether the operation was safe to retry
Example:
console.warn({
  event: "llm_retry_scheduled",
  original_request_id: requestId,
  retry_attempt: attempt + 1,
  delay_ms: delay,
  model,
  reason: error.message,
  status: error.status,
});
The big trap is retrying requests that already caused side effects.
For example, if the LLM call is part of a workflow that sends an email, creates a ticket, charges credits, or writes to a database, retries need idempotency keys and careful boundaries.
Otherwise, a timeout can turn into duplicate actions.
My rule is:
- retry transport failures carefully
- retry rate limits with backoff
- do not blindly retry tool execution
- do not retry user-visible side effects without idempotency
- always log whether a response came back before retrying
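Those rules could be encoded roughly as follows. This is a sketch under stated assumptions, not a drop-in policy: `shouldRetry` and `backoffDelayMs` are hypothetical names, the status codes and limits are example values, and `hasSideEffects` and `idempotencyKey` are fields I am assuming the caller tracks for each operation.

```javascript
// Hypothetical retry policy. Status codes and limits are example values.
function shouldRetry(
  { status, code, attempt, hasSideEffects, idempotencyKey },
  maxAttempts = 3
) {
  if (attempt >= maxAttempts) return false;
  // Never repeat side effects unless the downstream call can be deduplicated.
  if (hasSideEffects && !idempotencyKey) return false;
  if (status === 429) return true; // rate limit: retry with backoff
  if (status >= 500) return true; // server-side error
  return code === "ETIMEDOUT" || code === "ECONNRESET"; // transport failure
}

// Exponential backoff with full jitter:
// a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

if (shouldRetry({ status: 429, attempt: 0 })) {
  console.log({ event: "llm_retry_scheduled", delay_ms: backoffDelayMs(0) });
}
```

Keeping the decision in one pure function makes it easy to log the reason a retry was or was not attempted, and the injectable `random` argument makes the delay testable.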
6. Is the Failure Model-Specific or Provider-Specific?
When using OpenAI-compatible APIs, it is easy to swap models or providers and assume everything else is equivalent.
It usually is not.
Different models and providers can vary in:
- streaming behavior
- tool calling support
- JSON mode support
- context window limits
- rate limit headers
- error response shape
- usage reporting
- timeout behavior
- finish reason semantics
So I try to isolate the failure:
- same prompt, same provider, different model
- same prompt, same model family, different provider if available
- same prompt, non-streaming instead of streaming
- same prompt, no tools
- same prompt, smaller context
- same prompt, lower max tokens
This usually tells me whether I'm looking at a model behavior issue, a provider compatibility issue, or my own integration bug.
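One way to run that isolation systematically is to derive one-change-at-a-time variants from the failing request. The helper below is a hypothetical sketch: it assumes a chat-completions style request object, and the `max_tokens` value of 64 is an arbitrary small number. Switching providers is not covered, since that normally means a different client or base URL rather than a different request body.

```javascript
// Hypothetical helper: derive one-change-at-a-time variants of a failing request.
// `baseRequest` is assumed to be a chat-completions style request object.
function buildIsolationVariants(baseRequest, alternateModel) {
  // Drop tool-related fields for the "no_tools" variant.
  const { tools, tool_choice, ...withoutTools } = baseRequest;
  return [
    { label: "different_model", request: { ...baseRequest, model: alternateModel } },
    { label: "non_streaming", request: { ...baseRequest, stream: false } },
    { label: "no_tools", request: withoutTools },
    {
      label: "smaller_context",
      // Keep only the system messages and the latest message.
      request: {
        ...baseRequest,
        messages: baseRequest.messages.filter(
          (m, i, all) => m.role === "system" || i === all.length - 1
        ),
      },
    },
    { label: "lower_max_tokens", request: { ...baseRequest, max_tokens: 64 } },
  ];
}

const variants = buildIsolationVariants(
  {
    model: "model-a",
    stream: true,
    max_tokens: 800,
    tools: [],
    messages: [
      { role: "system", content: "Return only valid JSON." },
      { role: "user", content: "First question" },
      { role: "user", content: "Latest question" },
    ],
  },
  "model-b"
);
console.log(variants.map((v) => v.label));
```

Because each variant changes exactly one thing, the first variant that succeeds points at the dimension worth investigating.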
7. Can I See the Full Timeline?
For production debugging, individual logs are not enough. I want the timeline.
A useful LLM request timeline looks like this:
00ms request accepted by app
12ms prompt assembled
18ms provider selected
25ms LLM request started
430ms first token received
2840ms stream chunk 50 received
5100ms stream completed
5120ms output parsed
5190ms database write completed
5300ms response sent to client
Or, for a bad request:
00ms request accepted by app
10ms prompt assembled
18ms provider selected
30ms LLM request started
30030ms provider timeout
30035ms retry scheduled
32040ms retry started
62045ms retry timeout
62050ms request failed
This makes the real issue obvious.
Maybe the model was fine, but the timeout was too aggressive. Maybe the first token was fast, but the downstream parser failed. Maybe the app spent 4 seconds building context before the LLM call even started.
Without a timeline, it is too easy to blame the most mysterious part of the stack.
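A timeline like the ones above does not need a tracing system to get started. Here is a minimal sketch of an in-memory recorder: `createTimeline` is a hypothetical helper, and the clock is injectable only so the behavior can be tested deterministically.

```javascript
// Hypothetical timeline recorder. `now` is injectable so it can be tested.
function createTimeline(now = Date.now) {
  const startedAt = now();
  const events = [];
  return {
    // Record a label with its offset from the start of the request.
    mark(label) {
      events.push({ offset_ms: now() - startedAt, label });
    },
    // Render one "<offset>ms <label>" line per recorded event.
    render() {
      return events.map((e) => `${e.offset_ms}ms ${e.label}`).join("\n");
    },
    events,
  };
}

const timeline = createTimeline();
timeline.mark("request accepted by app");
timeline.mark("prompt assembled");
console.log(timeline.render());
```

In a real service the `events` array would be attached to the request's log record, or replaced with spans from whatever tracing library is already in use.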
8. Do I Have a Small Test Case?
After I identify the likely failure, I try to reduce it to a tiny test.
For example:
import OpenAI from "openai";
const client = new OpenAI({
  apiKey: process.env.LLM_API_KEY,
  baseURL: process.env.LLM_BASE_URL,
});
const response = await client.chat.completions.create({
  model: process.env.LLM_MODEL,
  messages: [
    { role: "system", content: "Return only valid JSON." },
    { role: "user", content: "Return an object with one key: status." },
  ],
  temperature: 0,
  response_format: { type: "json_object" },
});
console.log(response.choices[0].message.content);
A small test case helps answer:
- is my production prompt too complex?
- is tool calling the problem?
- is JSON mode supported?
- is streaming the problem?
- is the SDK configured correctly?
- is the provider returning the shape I expect?
If the small test fails, the issue is probably integration-level: SDK configuration, credentials, or a provider that does not support the feature I am relying on.
If the small test passes, the issue is probably in my application logic, prompt assembly, context size, or output parsing.
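To make "passes" and "fails" unambiguous, the small test's output can be checked programmatically. The sketch below assumes the prompt from the example above, which asks for a JSON object with a `status` key. `checkSmokeTestOutput` is a hypothetical helper that returns the first expectation that failed.

```javascript
// Hypothetical check for the small test's output.
// Returns the first expectation that failed, or "ok".
function checkSmokeTestOutput(content) {
  if (typeof content !== "string" || content.trim() === "") return "empty_content";
  let parsed;
  try {
    parsed = JSON.parse(content);
  } catch {
    return "invalid_json";
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return "not_an_object";
  }
  if (!("status" in parsed)) return "missing_status_key";
  return "ok";
}

console.log(checkSmokeTestOutput('{"status": "ok"}'));
console.log(checkSmokeTestOutput("Sure! Here is your JSON:"));
```

A result of `invalid_json` on a request that sets `response_format` is a hint that the provider or model may not honor JSON mode, which is a different problem from a prompt that is too complex.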
9. What I Put in My Default LLM Debug Log
My default log event for LLM calls usually includes this shape:
{
  event: "llm_call",
  request_id: "...",
  user_id: "...",
  provider: "...",
  model: "...",
  stream: true,
  status: "completed",
  latency_ms: 3200,
  time_to_first_token_ms: 420,
  input_tokens: 1200,
  output_tokens: 300,
  finish_reason: "stop",
  retry_attempts: 0,
  error_type: null,
  error_message: null
}
For failures:
{
  event: "llm_call",
  request_id: "...",
  provider: "...",
  model: "...",
  stream: true,
  status: "failed",
  latency_ms: 30000,
  time_to_first_token_ms: null,
  chunks_received: 0,
  output_chars: 0,
  retry_attempts: 2,
  error_type: "timeout",
  error_message: "Request timed out after 30000ms"
}
That is usually enough to start debugging without drowning in logs.
10. The Short Checklist
Before blaming the model, I check:
- Can I replay the exact request?
- Did the request reach the provider?
- Was it a network, HTTP, provider, model, parsing, or app failure?
- Was the response empty, truncated, malformed, or refused?
- If streaming failed, how many chunks did I receive?
- Did the first token arrive?
- Did my timeout happen before the provider finished?
- Did retry logic duplicate the problem?
- Is the issue model-specific or provider-specific?
- Does the same request work without streaming?
- Does the same request work without tools?
- Can I reproduce it with a tiny test case?
- Do I have a timeline from app request to final response?
Final Thought
The model is the easy thing to blame because it feels like a black box.
But a lot of LLM production bugs are regular distributed systems problems wearing a very expensive hat: timeouts, retries, partial responses, schema mismatches, bad observability, and unclear ownership between the app, SDK, provider, and model.
The fix is not always a better model.
Sometimes it is just better logs.
I work on TokenBay, so I spend a lot of time thinking about this layer between apps and model providers. The more models and providers you route through, the more valuable boring debugging discipline becomes.