plasma

Posted on Jun 30

What I Log When an LLM API Call Fails Mid-Stream

#webdev #ai #programming

Streaming looks great in demos.

The UI feels faster, users see tokens immediately, and nobody has to stare at a spinner while the model thinks.

Then production happens.

A request starts fine. The first few tokens arrive. Maybe half the answer is already on screen. Then the stream dies.

No clean final response. No full usage object. Sometimes no helpful error body. Just a broken connection, a timeout, or a user asking why the answer stopped in the middle.

That was the point where I realized my normal API logging was not enough.

For regular HTTP APIs, I used to care about simple things:

request started
request finished
status code
latency
error message

For streaming LLM APIs, that misses the most important part: how far the request got before it failed.

Why mid-stream failures are different

A non-streaming LLM call usually has a simple shape:

send request
wait
receive complete response or error

A streaming call has more states:

send request
wait for headers
wait for first token
receive chunks
maybe receive tool calls
maybe receive usage
maybe receive a final done event
maybe fail somewhere in the middle

That means "request failed" is too vague.

These are very different failures:

the provider never accepted the request
the provider accepted it but never sent the first token
the stream started and then disconnected
the model generated partial text but never sent a final event
tool call JSON started but was cut off halfway
the client closed the connection
my own server timed out while the provider was still streaming

If all of those show up as "500 stream failed", debugging becomes guesswork.

The minimum fields I log

I try not to log user prompts or raw model outputs unless there is a very explicit reason and consent path.

But I still want enough metadata to debug failures.

This is the baseline event I want for every LLM request:

{
  "request_id": "req_01J...",
  "provider": "example-provider",
  "model": "example-model",
  "operation": "chat.completions.stream",
  "stream": true,
  "started_at": "2026-06-30T10:12:41.120Z",
  "status": "failed",
  "failure_stage": "after_first_token",
  "error_type": "stream_interrupted",
  "http_status": 200,
  "time_to_first_token_ms": 842,
  "stream_duration_ms": 4810,
  "chunks_received": 37,
  "output_chars_received": 1842,
  "finish_reason": null,
  "usage_available": false
}

The most important field here is failure_stage.

Without that, the rest is just noise.

The failure stages I use

I like to classify streaming failures by where they happened.

type StreamFailureStage =
  | "before_request"
  | "request_rejected"
  | "before_first_token"
  | "after_first_token"
  | "during_tool_call"
  | "after_finish_reason"
  | "client_aborted"
  | "unknown";

Here is how I think about each one.

before_request

The request failed before it reached the provider.

Common causes:

invalid local config
missing API key
bad base URL
serialization error
invalid request body built by my own app

This is usually my bug, not the provider's.

I log:

provider
model
route or feature name
config version
validation error
whether the API key was present, never the key itself
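
These checks can run before any network call is made. Here is a minimal sketch, assuming a hypothetical preflightCheck helper and config shape (neither comes from a specific SDK). It records only whether the key is present, never its value.

```typescript
// Hypothetical config shape, for illustration only.
type LlmConfig = {
  provider: string;
  model: string;
  baseUrl: string;
  apiKey?: string;
  configVersion: string;
};

type PreflightLog = {
  failure_stage: "before_request";
  provider: string;
  model: string;
  config_version: string;
  api_key_present: boolean; // presence only, never the key itself
  validation_error: string;
};

// Returns a log event if the request should not be sent, otherwise null.
function preflightCheck(config: LlmConfig, body: unknown): PreflightLog | null {
  const fail = (validation_error: string): PreflightLog => ({
    failure_stage: "before_request",
    provider: config.provider,
    model: config.model,
    config_version: config.configVersion,
    api_key_present: Boolean(config.apiKey),
    validation_error
  });

  if (!config.apiKey) return fail("missing_api_key");

  try {
    new URL(config.baseUrl); // throws on a malformed base URL
  } catch {
    return fail("bad_base_url");
  }

  try {
    JSON.stringify(body); // throws on circular references and BigInt values
  } catch {
    return fail("serialization_error");
  }

  return null;
}
```

If preflightCheck returns an event, I would log it and skip the provider call entirely.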

request_rejected

The provider returned a normal error before the stream started.

Common causes:

401 invalid key
403 forbidden model
404 model not found
429 rate limit
400 invalid tool schema
context length exceeded

I log:

HTTP status
provider error code
provider error type
normalized error category
retryable or not
request size metadata

Example:

{
  "failure_stage": "request_rejected",
  "http_status": 429,
  "error_type": "rate_limit",
  "retryable": true,
  "input_tokens_estimated": 1200
}

This is the cleanest kind of failure. Annoying, but at least the provider told you what happened.
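
A normalized error category and a retryable flag can be derived from the HTTP status alone. The sketch below uses a hypothetical classifyRejection function. The category names and the choice of which statuses count as retryable are my assumptions, so adjust them to what your provider actually returns.

```typescript
type RejectionInfo = {
  failure_stage: "request_rejected";
  http_status: number;
  error_type: string;
  retryable: boolean;
};

// Assumed mapping from status code to a provider-neutral category.
const REJECTION_TABLE: Record<number, { error_type: string; retryable: boolean }> = {
  400: { error_type: "invalid_request", retryable: false },
  401: { error_type: "invalid_key", retryable: false },
  403: { error_type: "forbidden", retryable: false },
  404: { error_type: "model_not_found", retryable: false },
  429: { error_type: "rate_limit", retryable: true }
};

function classifyRejection(httpStatus: number): RejectionInfo {
  const known = REJECTION_TABLE[httpStatus];

  if (known) {
    return { failure_stage: "request_rejected", http_status: httpStatus, ...known };
  }

  // Treat any 5xx as a transient server-side problem (an assumption).
  if (httpStatus >= 500) {
    return {
      failure_stage: "request_rejected",
      http_status: httpStatus,
      error_type: "server_error",
      retryable: true
    };
  }

  return {
    failure_stage: "request_rejected",
    http_status: httpStatus,
    error_type: "unknown_rejection",
    retryable: false
  };
}
```

The provider's own error code and type are still worth logging next to this, since the status alone cannot distinguish, say, an invalid tool schema from other 400 responses.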

before_first_token

This is where a lot of streaming pain starts.

The provider accepted the request, but no token arrived before the timeout.

I log:

time to headers
first-token timeout
model
provider
region if available
whether this was a fallback attempt
whether the request was retried

Example:

{
  "failure_stage": "before_first_token",
  "error_type": "first_token_timeout",
  "time_to_headers_ms": 210,
  "first_token_timeout_ms": 15000,
  "chunks_received": 0
}

This is different from a normal request timeout.

For streaming, I usually want separate timeout budgets:

connection timeout
headers timeout
first token timeout
idle stream timeout
total stream timeout

A single giant timeout value, such as timeout: 60000, is usually too blunt.
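
Here is one way to sketch the last three budgets. The names TimeoutBudgets, withTimeout, and nextReadBudget are hypothetical, and the sketch covers only first-token, idle, and total timeouts. Connection and headers timeouts would need to wrap the initial fetch call instead.

```typescript
type TimeoutKind = "first_token_timeout" | "idle_timeout" | "total_timeout";

// Hypothetical budget shape; the values you choose are up to you.
type TimeoutBudgets = {
  firstTokenMs: number; // measured from request start
  idleMs: number; // max gap between reads once tokens are flowing
  totalMs: number; // hard cap for the whole stream
};

class StreamTimeoutError extends Error {
  kind: TimeoutKind;

  constructor(kind: TimeoutKind) {
    super(kind);
    this.name = "StreamTimeoutError";
    this.kind = kind;
  }
}

// Rejects with StreamTimeoutError if `promise` does not settle within `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number, kind: TimeoutKind): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new StreamTimeoutError(kind)), ms);
  });

  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Picks the budget that will expire first for the next read.
function nextReadBudget(
  budgets: TimeoutBudgets,
  elapsedMs: number,
  gotFirstToken: boolean
): { ms: number; kind: TimeoutKind } {
  const totalLeft = budgets.totalMs - elapsedMs;
  const stageLeft = gotFirstToken ? budgets.idleMs : budgets.firstTokenMs - elapsedMs;
  const stageKind: TimeoutKind = gotFirstToken ? "idle_timeout" : "first_token_timeout";

  return totalLeft < stageLeft
    ? { ms: Math.max(0, totalLeft), kind: "total_timeout" }
    : { ms: Math.max(0, stageLeft), kind: stageKind };
}
```

Inside a read loop, the idea would be to compute a budget before each read and call withTimeout(reader.read(), budget.ms, budget.kind), so the thrown error's kind can go straight into error_type.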

after_first_token

This is the classic mid-stream failure.

The model started responding, the user saw something, and then the stream broke.

I log:

time to first token
chunks received
output characters received
stream duration
last chunk timestamp
whether a final done event arrived
whether finish reason was present
whether usage was present

Example:

{
  "failure_stage": "after_first_token",
  "error_type": "stream_interrupted",
  "time_to_first_token_ms": 730,
  "stream_duration_ms": 6200,
  "chunks_received": 58,
  "output_chars_received": 2410,
  "done_event_received": false,
  "finish_reason": null,
  "usage_available": false
}

This tells me the request was not simply "slow" or "rejected".

It was accepted, generated partial output, and then the stream was interrupted.

That matters for UX too. You probably do not want to show the same message as a normal failure.

during_tool_call

Tool calls make stream logging more interesting.

Sometimes the model starts emitting tool call arguments and the stream fails halfway through the JSON.

If I only log "stream interrupted", I miss the actual risk: I might have an incomplete tool call that should never be executed.

For tool calls, I log:

tool call started
tool name if available
argument JSON completed or not
tool call id if available
whether tool execution started

Example:

{
  "failure_stage": "during_tool_call",
  "error_type": "incomplete_tool_arguments",
  "tool_name": "create_invoice",
  "tool_arguments_completed": false,
  "tool_execution_started": false
}

The key rule in my apps is simple:

Never execute a tool call unless the tool arguments are complete and validated.

A half-streamed JSON object is not an instruction. It is debris.
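
That rule can be enforced with a guard that runs before any tool executes. This is a minimal sketch with hypothetical names (PendingToolCall, checkToolCall), and the validate callback stands in for whatever schema validation you already use.

```typescript
type PendingToolCall = {
  id?: string;
  name?: string;
  argumentsJson: string; // argument fragments concatenated from the stream
};

type ToolCallCheck =
  | { ok: true; name: string; args: Record<string, unknown> }
  | {
      ok: false;
      error_type: "incomplete_tool_arguments" | "invalid_tool_arguments" | "missing_tool_name";
    };

function checkToolCall(
  call: PendingToolCall,
  streamFinishedCleanly: boolean,
  validate: (name: string, args: Record<string, unknown>) => boolean
): ToolCallCheck {
  // If the stream broke, the arguments may be cut off even if they happen to parse.
  if (!streamFinishedCleanly) return { ok: false, error_type: "incomplete_tool_arguments" };

  if (!call.name) return { ok: false, error_type: "missing_tool_name" };

  let parsed: unknown;
  try {
    parsed = JSON.parse(call.argumentsJson);
  } catch {
    return { ok: false, error_type: "incomplete_tool_arguments" };
  }

  // Tool arguments are expected to be a JSON object, not an array or scalar.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error_type: "invalid_tool_arguments" };
  }

  const args = parsed as Record<string, unknown>;
  if (!validate(call.name, args)) return { ok: false, error_type: "invalid_tool_arguments" };

  return { ok: true, name: call.name, args };
}
```

Only an ok: true result should ever reach the code that runs the tool. The error_type from a failed check maps directly onto the log event above.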

after_finish_reason

This one is subtle.

Sometimes the model sends a finish reason, but the request still fails on my side before all cleanup logic runs.

For example:

the stream ended but usage metadata was missing
the final parser failed
my server failed while saving the assistant message
the client disconnected right after completion

I log:

finish reason
final event received
usage available
persistence status
response delivered to client or not

Example:

{
  "failure_stage": "after_finish_reason",
  "finish_reason": "stop",
  "usage_available": false,
  "message_persisted": false,
  "client_delivery_status": "unknown"
}

This helps separate model/provider failures from my own application failures.

client_aborted

Sometimes the user closes the tab. Sometimes the browser cancels the request. Sometimes a reverse proxy drops the connection.

That is not the same as the provider failing.

I log:

client abort timestamp
provider stream still active or not
generated tokens/chunks before abort
whether billing metadata was received
whether the server cancelled the upstream request

Example:

{
  "failure_stage": "client_aborted",
  "stream_duration_ms": 3100,
  "chunks_received": 22,
  "upstream_cancelled": true
}

This matters because client aborts can distort reliability metrics.

If 30% of your "failed" streams are just users navigating away, your provider is not necessarily the problem.
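
To keep client aborts separate in the logs, and to cancel the upstream request when they happen, the two connections can be linked with AbortController. This is a sketch under the assumption that your server framework can give you an AbortSignal for the client connection. The linkClientAbort helper is hypothetical.

```typescript
type ClientAbortLog = {
  failure_stage: "client_aborted";
  stream_duration_ms: number;
  chunks_received: number;
  upstream_cancelled: boolean;
};

// Returns a controller whose signal can be passed to the upstream fetch().
function linkClientAbort(
  clientSignal: AbortSignal,
  startedAtMs: number,
  getChunksReceived: () => number,
  onClientAbort: (log: ClientAbortLog) => void
): AbortController {
  const upstream = new AbortController();

  const handleAbort = () => {
    upstream.abort(); // cancel the provider request as well
    onClientAbort({
      failure_stage: "client_aborted",
      stream_duration_ms: Date.now() - startedAtMs,
      chunks_received: getChunksReceived(),
      upstream_cancelled: upstream.signal.aborted
    });
  };

  if (clientSignal.aborted) {
    handleAbort();
  } else {
    clientSignal.addEventListener("abort", handleAbort, { once: true });
  }

  return upstream;
}
```

The returned controller's signal goes into the fetch options for the provider call. When the client goes away, the resulting AbortError can then be logged as client_aborted rather than as a provider failure.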

A small Node.js example

Here is a simplified wrapper around a streaming call.

This example does not depend on a specific vendor beyond using an OpenAI-compatible chat completions endpoint.

type StreamLog = {
  requestId: string;
  provider: string;
  model: string;
  startedAt: string;
  status: "ok" | "failed";
  failureStage?: string;
  errorType?: string;
  httpStatus?: number;
  timeToFirstTokenMs?: number;
  streamDurationMs?: number;
  chunksReceived: number;
  outputCharsReceived: number;
  doneEventReceived: boolean;
  finishReason?: string | null;
};

function nowMs() {
  return Date.now();
}

function classifyError(error: unknown) {
  if (error instanceof Error && error.name === "AbortError") {
    return "client_or_timeout_abort";
  }

  if (error instanceof Error && error.message.includes("timeout")) {
    return "timeout";
  }

  return "unknown_stream_error";
}

export async function streamChatCompletion({
  requestId,
  apiKey,
  baseUrl,
  model,
  messages,
  onToken
}: {
  requestId: string;
  apiKey: string;
  baseUrl: string;
  model: string;
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
  onToken: (token: string) => void;
}) {
  const started = nowMs();

  const log: StreamLog = {
    requestId,
    provider: new URL(baseUrl).hostname,
    model,
    startedAt: new Date().toISOString(),
    status: "ok",
    chunksReceived: 0,
    outputCharsReceived: 0,
    doneEventReceived: false,
    finishReason: null
  };

  let firstTokenAt: number | null = null;

  try {
    const response = await fetch(`${baseUrl}/chat/completions`, {
      method: "POST",
      headers: {
        "authorization": `Bearer ${apiKey}`,
        "content-type": "application/json"
      },
      body: JSON.stringify({
        model,
        messages,
        stream: true
      })
    });

    log.httpStatus = response.status;

    if (!response.ok || !response.body) {
      log.status = "failed";
      log.failureStage = "request_rejected";
      log.errorType = `http_${response.status}`;
      return log;
    }

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    // Carries a partial SSE line over when a read ends mid-line.
    let buffer = "";

    while (true) {
      const { value, done } = await reader.read();

      // The body closing is not the same as receiving [DONE].
      // doneEventReceived is only set when the sentinel is seen below.
      if (done) break;

      buffer += decoder.decode(value, { stream: true });
      log.chunksReceived += 1; // network reads, not individual SSE events

      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // keep the trailing incomplete line

      for (const line of lines) {
        const trimmed = line.trim();

        if (!trimmed.startsWith("data:")) continue;

        const data = trimmed.slice("data:".length).trim();

        if (data === "[DONE]") {
          log.doneEventReceived = true;
          continue;
        }

        try {
          const parsed = JSON.parse(data);
          const token = parsed.choices?.[0]?.delta?.content ?? "";
          const finishReason = parsed.choices?.[0]?.finish_reason ?? null;

          if (finishReason) {
            log.finishReason = finishReason;
          }

          if (token) {
            if (firstTokenAt === null) {
              firstTokenAt = nowMs();
              log.timeToFirstTokenMs = firstTokenAt - started;
            }

            log.outputCharsReceived += token.length;
            onToken(token);
          }
        } catch {
          log.status = "failed";
          log.streamDurationMs = nowMs() - started;
          log.failureStage = firstTokenAt === null
            ? "before_first_token"
            : "after_first_token";
          log.errorType = "invalid_stream_chunk";
          return log;
        }
      }
    }

    log.streamDurationMs = nowMs() - started;

    if (!log.doneEventReceived) {
      log.status = "failed";
      log.failureStage = firstTokenAt === null
        ? "before_first_token"
        : "after_first_token";
      log.errorType = "missing_done_event";
    }

    return log;
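  } catch (error) {
    log.status = "failed";
    log.streamDurationMs = nowMs() - started;
    // No HTTP status means fetch threw before any response arrived
    // (DNS failure, refused connection, bad base URL).
    log.failureStage = log.httpStatus === undefined
      ? "before_request"
      : firstTokenAt === null
        ? "before_first_token"
        : "after_first_token";
    log.errorType = classifyError(error);
    return log;
  }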
}

This is not a complete production wrapper. It does not handle every SSE edge case, retry policy, proxy timeout, or tool call format.

But the shape is useful:

track whether the first token arrived
count chunks
track whether the final event arrived
separate request rejection from stream interruption
return structured logs instead of one generic error

What I avoid logging

The tempting thing is to log everything.

Raw prompt. Raw completion. Full request body. Full streamed chunks.

That makes debugging easy, but it creates a different problem.

For most production apps, I try to avoid logging:

raw user prompts
raw assistant responses
API keys or auth headers
full tool arguments if they may contain user data
uploaded file contents
personal identifiers unless explicitly needed

Instead, I prefer metadata:

prompt token estimate
output character count
message count
model name
feature name
latency buckets
error category
retry count
request id
user/account id only if your privacy policy allows it

You can debug a surprising amount without storing the actual prompt.
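
As an illustration, here is a small sketch that reduces a request to loggable metadata. The summarizeForLog name and the field names are hypothetical, and the token estimate uses a rough four-characters-per-token heuristic rather than a real tokenizer.

```typescript
type ChatMessage = { role: string; content: string };

type RequestMetadata = {
  request_id: string;
  model: string;
  feature: string;
  message_count: number;
  input_chars: number;
  prompt_tokens_estimated: number;
};

// Keeps counts and identifiers, drops all message content.
function summarizeForLog(
  requestId: string,
  model: string,
  feature: string,
  messages: ChatMessage[]
): RequestMetadata {
  const inputChars = messages.reduce((sum, m) => sum + m.content.length, 0);

  return {
    request_id: requestId,
    model,
    feature,
    message_count: messages.length,
    input_chars: inputChars,
    // Rough heuristic (assumption): about 4 characters per token for English text.
    prompt_tokens_estimated: Math.ceil(inputChars / 4)
  };
}
```

Nothing in the returned object can be used to reconstruct what the user typed, which is the point.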

The dashboard I want

Once these fields are logged, the useful charts become obvious.

I want to see:

failure rate by provider
failure rate by model
failures before first token vs after first token
p50/p95 time to first token
average chunks before interruption
stream interruptions by route or feature
retry success rate
client abort rate
missing usage rate

That last one matters more than people expect.

If usage metadata is missing after failed streams, your cost dashboard can lie. Not always dramatically, but enough to make optimization work fuzzy.
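
Most of those charts are simple aggregations over the structured logs. Here is a sketch of a few of them, assuming hypothetical LoggedStream records and a nearest-rank percentile. A real dashboard would likely do this in the metrics backend rather than in application code.

```typescript
type LoggedStream = {
  status: "ok" | "failed";
  failureStage?: string;
  timeToFirstTokenMs?: number;
  usageAvailable: boolean;
};

// Nearest-rank percentile; returns null when there is no data.
function percentile(values: number[], p: number): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)] ?? null;
}

function summarizeStreams(logs: LoggedStream[]) {
  const failuresByStage: Record<string, number> = {};
  let failed = 0;
  let missingUsage = 0;

  for (const log of logs) {
    if (!log.usageAvailable) missingUsage += 1;
    if (log.status !== "failed") continue;

    failed += 1;
    const stage = log.failureStage ?? "unknown";
    failuresByStage[stage] = (failuresByStage[stage] ?? 0) + 1;
  }

  const ttft = logs
    .map((log) => log.timeToFirstTokenMs)
    .filter((v): v is number => v !== undefined);

  const total = logs.length;

  return {
    total,
    failureRate: total === 0 ? 0 : failed / total,
    missingUsageRate: total === 0 ? 0 : missingUsage / total,
    failuresByStage,
    ttftP50Ms: percentile(ttft, 50),
    ttftP95Ms: percentile(ttft, 95)
  };
}
```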

Where OpenAI-compatible APIs get tricky

OpenAI-compatible APIs are great for switching providers, but compatibility does not mean every streaming behavior is identical.

In practice, I still check:

does the provider send [DONE]?
is finish_reason always present?
is usage included for streamed responses?
how are provider errors formatted?
what happens when the upstream model times out?
are tool calls streamed in the same structure?
does the SDK surface stream errors clearly?

This is one reason I like keeping my logging format provider-neutral.

Whether I am using a direct provider API or routing through something like TokenBay, I want my app logs to answer the same question:

Where did the stream fail, and how much work happened before it failed?

My current checklist

For every streamed LLM call, I want to know:

Did the provider reject the request?
Did we receive the first token?
How long did first token take?
How many chunks arrived?
Did the stream finish cleanly?
Did we get a finish reason?
Did we get usage metadata?
Was the failure caused by the provider, my app, or the client?
Is this safe to retry?
Did a tool call start?
Did a tool call complete?
Did a tool execute?
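
The "safe to retry" question in particular can be answered mechanically once the stage is logged. The policy below is only one possible set of choices, written as a hypothetical retryDecision function. The right answers depend on whether your tools are idempotent and how your UI handles partial output.

```typescript
type StreamFailureStage =
  | "before_request"
  | "request_rejected"
  | "before_first_token"
  | "after_first_token"
  | "during_tool_call"
  | "after_finish_reason"
  | "client_aborted"
  | "unknown";

type RetryDecision = "retry" | "retry_after_discarding_partial_output" | "do_not_retry";

// Assumed policy, not a universal rule.
function retryDecision(
  stage: StreamFailureStage,
  opts: { retryable?: boolean; toolExecutionStarted?: boolean } = {}
): RetryDecision {
  switch (stage) {
    case "before_request":
      return "do_not_retry"; // local bug or config problem; retrying will not help
    case "request_rejected":
      return opts.retryable ? "retry" : "do_not_retry";
    case "before_first_token":
      return "retry"; // the user has seen nothing yet
    case "after_first_token":
      return "retry_after_discarding_partial_output";
    case "during_tool_call":
      // Never re-run if a side effect may already have happened.
      return opts.toolExecutionStarted
        ? "do_not_retry"
        : "retry_after_discarding_partial_output";
    case "after_finish_reason":
      return "do_not_retry"; // generation completed; the failure is on the app side
    case "client_aborted":
      return "do_not_retry"; // nobody is waiting for the answer
    default:
      return "do_not_retry";
  }
}
```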

If my logs can answer those questions, debugging becomes much less painful.

Not effortless. This is still LLM infrastructure, after all.

But at least I am not staring at "stream failed" and pretending that is observability.
