DEV Community

plasma

What I Log When an LLM API Call Fails Mid-Stream

Streaming looks great in demos.

The UI feels faster, users see tokens immediately, and nobody has to stare at a spinner while the model thinks.

Then production happens.

A request starts fine. The first few tokens arrive. Maybe half the answer is already on screen. Then the stream dies.

No clean final response. No full usage object. Sometimes no helpful error body. Just a broken connection, a timeout, or a user asking why the answer stopped in the middle.

That was the point where I realized my normal API logging was not enough.

For regular HTTP APIs, I used to care about simple things:

  • request started
  • request finished
  • status code
  • latency
  • error message

For streaming LLM APIs, that misses the most important part: how far the request got before it failed.

Why mid-stream failures are different

A non-streaming LLM call usually has a simple shape:

  1. send request
  2. wait
  3. receive complete response or error

A streaming call has more states:

  1. send request
  2. wait for headers
  3. wait for first token
  4. receive chunks
  5. maybe receive tool calls
  6. maybe receive usage
  7. maybe receive a final done event
  8. maybe fail somewhere in the middle

That means "request failed" is too vague.

These are very different failures:

  • the provider never accepted the request
  • the provider accepted it but never sent the first token
  • the stream started and then disconnected
  • the model generated partial text but never sent a final event
  • tool call JSON started but was cut off halfway
  • the client closed the connection
  • my own server timed out while the provider was still streaming

If all of those show up as "500 stream failed", debugging becomes guesswork.

The minimum fields I log

I try not to log user prompts or raw model outputs unless there is a very explicit reason and consent path.

But I still want enough metadata to debug failures.

This is the baseline event I want for every LLM request:

{
  "request_id": "req_01J...",
  "provider": "example-provider",
  "model": "example-model",
  "operation": "chat.completions.stream",
  "stream": true,
  "started_at": "2026-06-30T10:12:41.120Z",
  "status": "failed",
  "failure_stage": "after_first_token",
  "error_type": "stream_interrupted",
  "http_status": 200,
  "time_to_first_token_ms": 842,
  "stream_duration_ms": 4810,
  "chunks_received": 37,
  "output_chars_received": 1842,
  "finish_reason": null,
  "usage_available": false
}

The most important field here is failure_stage.

Without it, the other fields are much harder to interpret.

The failure stages I use

I like to classify streaming failures by where they happened.

type StreamFailureStage =
  | "before_request"
  | "request_rejected"
  | "before_first_token"
  | "after_first_token"
  | "during_tool_call"
  | "after_finish_reason"
  | "client_aborted"
  | "unknown";
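
As a rough sketch of how a stage could be derived from what the wrapper observed, something like the following works for me. The StreamObservation shape and the inferFailureStage name are hypothetical, and the order of the checks is my own assumption, not a rule from any provider.

```typescript
type StreamFailureStage =
  | "before_request"
  | "request_rejected"
  | "before_first_token"
  | "after_first_token"
  | "during_tool_call"
  | "after_finish_reason"
  | "client_aborted"
  | "unknown";

// Hypothetical snapshot of what the wrapper saw before the failure.
interface StreamObservation {
  requestSent: boolean;
  httpStatus: number | null; // null when no response headers arrived
  firstTokenReceived: boolean;
  toolCallOpen: boolean; // tool arguments started but not completed
  finishReasonReceived: boolean;
  clientAborted: boolean;
}

function inferFailureStage(o: StreamObservation): StreamFailureStage {
  // Client aborts are checked first so they never count against the provider.
  if (o.clientAborted) return "client_aborted";
  if (!o.requestSent) return "before_request";
  // Assumption: a sent request with no status (network error) stays "unknown".
  if (o.httpStatus === null) return "unknown";
  if (o.httpStatus >= 400) return "request_rejected";
  if (o.finishReasonReceived) return "after_finish_reason";
  if (o.toolCallOpen) return "during_tool_call";
  if (o.firstTokenReceived) return "after_first_token";
  return "before_first_token";
}
```

The point is that the stage comes from recorded facts, not from parsing an error message after the fact.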

Here is how I think about each one.

before_request

The request failed before it reached the provider.

Common causes:

  • invalid local config
  • missing API key
  • bad base URL
  • serialization error
  • invalid request body built by my own app

This is usually my bug, not the provider's.

I log:

  • provider
  • model
  • route or feature name
  • config version
  • validation error
  • whether the API key was present, never the key itself
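
A minimal sketch of that preflight step, assuming a small config object. The LlmConfig shape, the field names, and the validation error strings are all hypothetical. The only part I care about is that the log carries a boolean for the key, never the key.

```typescript
// Hypothetical local config shape, not tied to any SDK.
interface LlmConfig {
  provider: string;
  model: string;
  baseUrl: string;
  apiKey?: string;
  configVersion: string;
}

interface PreflightFailureLog {
  failure_stage: "before_request";
  provider: string;
  model: string;
  config_version: string;
  api_key_present: boolean;
  validation_error: string;
}

// Returns a log event when the request should not be sent, otherwise null.
function preflightCheck(config: LlmConfig): PreflightFailureLog | null {
  const apiKeyPresent =
    typeof config.apiKey === "string" && config.apiKey.length > 0;

  let validationError: string | null = null;
  if (!apiKeyPresent) {
    validationError = "missing_api_key";
  } else if (!/^https?:\/\/[^\s\/]+/i.test(config.baseUrl)) {
    validationError = "bad_base_url";
  } else if (config.model.trim() === "") {
    validationError = "missing_model";
  }

  if (validationError === null) return null;

  return {
    failure_stage: "before_request",
    provider: config.provider,
    model: config.model,
    config_version: config.configVersion,
    api_key_present: apiKeyPresent, // a boolean only, never the key itself
    validation_error: validationError
  };
}
```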

request_rejected

The provider returned a normal error before the stream started.

Common causes:

  • 401 invalid key
  • 403 forbidden model
  • 404 model not found
  • 429 rate limit
  • 400 invalid tool schema
  • context length exceeded

I log:

  • HTTP status
  • provider error code
  • provider error type
  • normalized error category
  • retryable or not
  • request size metadata

Example:

{
  "failure_stage": "request_rejected",
  "http_status": 429,
  "error_type": "rate_limit",
  "retryable": true,
  "input_tokens_estimated": 1200
}

This is the cleanest kind of failure. Annoying, but at least the provider told you what happened.
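
Here is one way the normalized category and the retryable flag could be derived from the HTTP status. The category names and the retry decisions are my own assumptions. Providers differ, and a provider error code, when present, is usually a better signal than the status alone.

```typescript
type ErrorCategory =
  | "auth"
  | "forbidden"
  | "not_found"
  | "rate_limit"
  | "invalid_request"
  | "provider_error"
  | "unknown";

interface NormalizedError {
  error_type: ErrorCategory;
  retryable: boolean;
}

// Maps an HTTP status to a provider-neutral category.
// Assumption: only rate limits and 5xx responses are worth retrying.
function normalizeHttpError(status: number): NormalizedError {
  switch (status) {
    case 400:
      return { error_type: "invalid_request", retryable: false };
    case 401:
      return { error_type: "auth", retryable: false };
    case 403:
      return { error_type: "forbidden", retryable: false };
    case 404:
      return { error_type: "not_found", retryable: false };
    case 429:
      return { error_type: "rate_limit", retryable: true };
  }

  if (status >= 500 && status <= 599) {
    return { error_type: "provider_error", retryable: true };
  }

  return { error_type: "unknown", retryable: false };
}
```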

before_first_token

This is where a lot of streaming pain starts.

The provider accepted the request, but no token arrived before the timeout.

I log:

  • time to headers
  • first-token timeout
  • model
  • provider
  • region if available
  • whether this was a fallback attempt
  • whether the request was retried

Example:

{
  "failure_stage": "before_first_token",
  "error_type": "first_token_timeout",
  "time_to_headers_ms": 210,
  "first_token_timeout_ms": 15000,
  "chunks_received": 0
}

This is different from a normal request timeout.

For streaming, I usually want separate timeout budgets:

  • connection timeout
  • headers timeout
  • first token timeout
  • idle stream timeout
  • total stream timeout

A single giant "timeout: 60000" value is usually too blunt.
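
To make the separate budgets concrete, here is a small sketch that decides which budget, if any, has been exceeded at a given moment. The TimeoutBudgets and StreamTimestamps shapes are hypothetical, and it assumes the wrapper can record each timestamp. With plain fetch, for example, the connection time is usually not observable on its own.

```typescript
interface TimeoutBudgets {
  connectMs: number;
  headersMs: number;
  firstTokenMs: number;
  idleMs: number;
  totalMs: number;
}

// Millisecond timestamps recorded by the wrapper; null means "not yet".
interface StreamTimestamps {
  startedAt: number;
  connectedAt: number | null;
  headersAt: number | null;
  firstTokenAt: number | null;
  lastChunkAt: number | null;
}

type TimeoutKind =
  | "connection_timeout"
  | "headers_timeout"
  | "first_token_timeout"
  | "idle_stream_timeout"
  | "total_stream_timeout";

// Returns the exceeded budget, or null if the stream is still within limits.
function exceededBudget(
  b: TimeoutBudgets,
  t: StreamTimestamps,
  now: number
): TimeoutKind | null {
  if (now - t.startedAt > b.totalMs) return "total_stream_timeout";
  if (t.connectedAt === null) {
    return now - t.startedAt > b.connectMs ? "connection_timeout" : null;
  }
  if (t.headersAt === null) {
    return now - t.connectedAt > b.headersMs ? "headers_timeout" : null;
  }
  if (t.firstTokenAt === null) {
    return now - t.headersAt > b.firstTokenMs ? "first_token_timeout" : null;
  }
  // After the first token, measure silence since the most recent chunk.
  const lastActivity = t.lastChunkAt ?? t.firstTokenAt;
  return now - lastActivity > b.idleMs ? "idle_stream_timeout" : null;
}
```

Calling this on a timer while the stream is open gives a specific error_type for the log instead of a generic timeout.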

after_first_token

This is the classic mid-stream failure.

The model started responding, the user saw something, and then the stream broke.

I log:

  • time to first token
  • chunks received
  • output characters received
  • stream duration
  • last chunk timestamp
  • whether a final done event arrived
  • whether finish reason was present
  • whether usage was present

Example:

{
  "failure_stage": "after_first_token",
  "error_type": "stream_interrupted",
  "time_to_first_token_ms": 730,
  "stream_duration_ms": 6200,
  "chunks_received": 58,
  "output_chars_received": 2410,
  "done_event_received": false,
  "finish_reason": null,
  "usage_available": false
}

This tells me the request was not simply "slow" or "rejected".

It was accepted, generated partial output, and then the stream was interrupted.

That matters for UX too. You probably do not want to show the same message as a normal failure.
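
As a small illustration of that UX difference, the stage can drive both the message and whether the partial answer stays on screen. The failureUx helper, the message wording, and the keepPartialOutput flag are hypothetical. Treat this as a sketch of the idea, not a recommendation for exact copy.

```typescript
type VisibleFailureStage =
  | "request_rejected"
  | "before_first_token"
  | "after_first_token";

interface FailureUx {
  message: string;
  keepPartialOutput: boolean;
}

// Chooses what the user sees based on how far the stream got.
function failureUx(
  stage: VisibleFailureStage,
  outputCharsReceived: number
): FailureUx {
  if (stage === "after_first_token" && outputCharsReceived > 0) {
    // The user already saw text, so keep it and say it is incomplete.
    return {
      message: "The answer was cut off before it finished. You can retry.",
      keepPartialOutput: true
    };
  }

  if (stage === "before_first_token") {
    return {
      message: "The model did not respond in time. Please try again.",
      keepPartialOutput: false
    };
  }

  return {
    message: "The request could not be completed.",
    keepPartialOutput: false
  };
}
```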

during_tool_call

Tool calls make stream logging more interesting.

Sometimes the model starts emitting tool call arguments and the stream fails halfway through the JSON.

If I only log "stream interrupted", I miss the actual risk: I might have an incomplete tool call that should never be executed.

For tool calls, I log:

  • tool call started
  • tool name if available
  • argument JSON completed or not
  • tool call id if available
  • whether tool execution started

Example:

{
  "failure_stage": "during_tool_call",
  "error_type": "incomplete_tool_arguments",
  "tool_name": "create_invoice",
  "tool_arguments_completed": false,
  "tool_execution_started": false
}

The key rule in my apps is simple:

Never execute a tool call unless the tool arguments are complete and validated.

A half-streamed JSON object is not an instruction. It is debris.
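
A minimal sketch of that rule as a gate in front of tool execution. The PendingToolCall shape, the checkToolCall name, and the validate callback are hypothetical. In a real app the callback would be a proper schema validator for the specific tool.

```typescript
// Hypothetical accumulator for one streamed tool call.
interface PendingToolCall {
  id: string | null;
  name: string | null;
  argumentsJson: string; // concatenation of the streamed argument fragments
}

type ToolCallCheck =
  | { executable: true; args: Record<string, unknown> }
  | {
      executable: false;
      reason: "incomplete_tool_arguments" | "invalid_tool_arguments";
    };

function checkToolCall(
  call: PendingToolCall,
  streamCompleted: boolean,
  validate: (args: Record<string, unknown>) => boolean
): ToolCallCheck {
  // An interrupted stream is never trusted, even if the JSON happens to parse.
  if (!streamCompleted) {
    return { executable: false, reason: "incomplete_tool_arguments" };
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(call.argumentsJson);
  } catch {
    return { executable: false, reason: "incomplete_tool_arguments" };
  }

  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { executable: false, reason: "invalid_tool_arguments" };
  }

  const args = parsed as Record<string, unknown>;
  if (!validate(args)) {
    return { executable: false, reason: "invalid_tool_arguments" };
  }

  return { executable: true, args };
}
```

The reason string maps directly onto the error_type field in the log, and tool_execution_started only ever becomes true after this check passes.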

after_finish_reason

This one is subtle.

Sometimes the model sends a finish reason, but the request still fails before all of my cleanup logic runs.

For example:

  • the stream ended but usage metadata was missing
  • the final parser failed
  • my server failed while saving the assistant message
  • the client disconnected right after completion

I log:

  • finish reason
  • final event received
  • usage available
  • persistence status
  • response delivered to client or not

Example:

{
  "failure_stage": "after_finish_reason",
  "finish_reason": "stop",
  "usage_available": false,
  "message_persisted": false,
  "client_delivery_status": "unknown"
}

This helps separate model/provider failures from my own application failures.

client_aborted

Sometimes the user closes the tab. Sometimes the browser cancels the request. Sometimes a reverse proxy drops the connection.

That is not the same as the provider failing.

I log:

  • client abort timestamp
  • provider stream still active or not
  • generated tokens/chunks before abort
  • whether billing metadata was received
  • whether the server cancelled the upstream request

Example:

{
  "failure_stage": "client_aborted",
  "stream_duration_ms": 3100,
  "chunks_received": 22,
  "upstream_cancelled": true
}

This matters because client aborts can distort reliability metrics.

If 30% of your "failed" streams are just users navigating away, your provider is not necessarily the problem.
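
To show the distortion with numbers, here is a sketch that computes the failure rate with and without client aborts. The StreamOutcome shape and the reliabilitySummary name are hypothetical, and it assumes client aborts are stored with status "failed" and failure_stage "client_aborted".

```typescript
interface StreamOutcome {
  status: "ok" | "failed";
  failure_stage?: string;
}

interface ReliabilitySummary {
  rawFailureRate: number; // every failed stream, aborts included
  clientAbortRate: number; // share of streams the client abandoned
  adjustedFailureRate: number; // failures among streams that were not aborted
}

function reliabilitySummary(outcomes: StreamOutcome[]): ReliabilitySummary {
  const total = outcomes.length;
  const failed = outcomes.filter((o) => o.status === "failed").length;
  const aborted = outcomes.filter(
    (o) => o.status === "failed" && o.failure_stage === "client_aborted"
  ).length;

  const notAborted = total - aborted;

  return {
    rawFailureRate: total === 0 ? 0 : failed / total,
    clientAbortRate: total === 0 ? 0 : aborted / total,
    adjustedFailureRate: notAborted === 0 ? 0 : (failed - aborted) / notAborted
  };
}
```

With 10 streams where 6 succeed, 3 are client aborts, and 1 is interrupted mid-stream, the raw failure rate is 40%, but only 1 of the 7 streams that were not aborted actually failed.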

A small Node.js example

Here is a simplified wrapper around a streaming call.

This example does not depend on a specific vendor beyond using an OpenAI-compatible chat completions endpoint.

type StreamLog = {
  requestId: string;
  provider: string;
  model: string;
  startedAt: string;
  status: "ok" | "failed";
  failureStage?: string;
  errorType?: string;
  httpStatus?: number;
  timeToFirstTokenMs?: number;
  streamDurationMs?: number;
  chunksReceived: number;
  outputCharsReceived: number;
  doneEventReceived: boolean;
  finishReason?: string | null;
};

function nowMs() {
  return Date.now();
}

function classifyError(error: unknown) {
  if (error instanceof Error && error.name === "AbortError") {
    return "client_or_timeout_abort";
  }

  if (error instanceof Error && error.message.includes("timeout")) {
    return "timeout";
  }

  return "unknown_stream_error";
}

export async function streamChatCompletion({
  requestId,
  apiKey,
  baseUrl,
  model,
  messages,
  onToken
}: {
  requestId: string;
  apiKey: string;
  baseUrl: string;
  model: string;
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
  onToken: (token: string) => void;
}) {
  const started = nowMs();

  const log: StreamLog = {
    requestId,
    provider: new URL(baseUrl).hostname,
    model,
    startedAt: new Date().toISOString(),
    status: "ok",
    chunksReceived: 0,
    outputCharsReceived: 0,
    doneEventReceived: false,
    finishReason: null
  };

  let firstTokenAt: number | null = null;

  try {
    const response = await fetch(`${baseUrl}/chat/completions`, {
      method: "POST",
      headers: {
        "authorization": `Bearer ${apiKey}`,
        "content-type": "application/json"
      },
      body: JSON.stringify({
        model,
        messages,
        stream: true
      })
    });

    log.httpStatus = response.status;

    if (!response.ok || !response.body) {
      log.status = "failed";
      log.failureStage = "request_rejected";
      log.errorType = `http_${response.status}`;
      return log;
    }

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    // Carries an incomplete trailing line over to the next network read,
    // so an SSE "data:" line split across two reads is not parsed early.
    let buffer = "";

    while (true) {
      const { value, done } = await reader.read();

      // The byte stream ending is not a clean finish by itself;
      // only the explicit [DONE] sentinel sets doneEventReceived.
      if (done) break;

      buffer += decoder.decode(value, { stream: true });
      log.chunksReceived += 1; // counts network reads, not SSE events

      const lines = buffer.split("\n");
      buffer = lines.pop() ?? "";

      for (const line of lines) {
        const trimmed = line.trim();

        if (!trimmed.startsWith("data:")) continue;

        const data = trimmed.slice("data:".length).trim();

        if (data === "[DONE]") {
          log.doneEventReceived = true;
          continue;
        }

        let parsed: any;

        // Only JSON parsing is guarded here, so an exception thrown by
        // onToken is not misreported as an invalid chunk.
        try {
          parsed = JSON.parse(data);
        } catch {
          log.status = "failed";
          log.streamDurationMs = nowMs() - started;
          log.failureStage = firstTokenAt === null
            ? "before_first_token"
            : "after_first_token";
          log.errorType = "invalid_stream_chunk";
          return log;
        }

        const token = parsed.choices?.[0]?.delta?.content ?? "";
        const finishReason = parsed.choices?.[0]?.finish_reason ?? null;

        if (finishReason) {
          log.finishReason = finishReason;
        }

        if (token) {
          if (firstTokenAt === null) {
            firstTokenAt = nowMs();
            log.timeToFirstTokenMs = firstTokenAt - started;
          }

          log.outputCharsReceived += token.length;
          onToken(token);
        }
      }
    }

    log.streamDurationMs = nowMs() - started;

    if (!log.doneEventReceived) {
      log.status = "failed";
      log.failureStage = firstTokenAt === null
        ? "before_first_token"
        : "after_first_token";
      log.errorType = "missing_done_event";
    }

    return log;
  } catch (error) {
    log.status = "failed";
    log.streamDurationMs = nowMs() - started;
    log.failureStage = firstTokenAt === null
      ? "before_first_token"
      : "after_first_token";
    log.errorType = classifyError(error);
    return log;
  }
}

This is not a complete production wrapper. It does not handle every SSE edge case, retry policy, proxy timeout, or tool call format.

But the shape is useful:

  • track whether the first token arrived
  • count chunks
  • track whether the final event arrived
  • separate request rejection from stream interruption
  • return structured logs instead of one generic error

What I avoid logging

The tempting thing is to log everything.

Raw prompt. Raw completion. Full request body. Full streamed chunks.

That makes debugging easy, but it creates a different problem.

For most production apps, I try to avoid logging:

  • raw user prompts
  • raw assistant responses
  • API keys or auth headers
  • full tool arguments if they may contain user data
  • uploaded file contents
  • personal identifiers unless explicitly needed

Instead, I prefer metadata:

  • prompt token estimate
  • output character count
  • message count
  • model name
  • feature name
  • latency buckets
  • error category
  • retry count
  • request id
  • user/account id only if your privacy policy allows it

You can debug a surprising amount without storing the actual prompt.
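
Here is a sketch of what that metadata-only summary could look like. The summarizeRequest and latencyBucket helpers are hypothetical, the bucket boundaries are arbitrary, and the "4 characters per token" estimate is a rough rule of thumb for English text that I am assuming here, not a real tokenizer.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface RequestMetadata {
  feature: string;
  model: string;
  message_count: number;
  input_chars: number;
  input_tokens_estimated: number;
}

// Builds a log-safe summary. Message content is measured, never copied.
function summarizeRequest(
  feature: string,
  model: string,
  messages: ChatMessage[]
): RequestMetadata {
  const inputChars = messages.reduce((sum, m) => sum + m.content.length, 0);

  return {
    feature,
    model,
    message_count: messages.length,
    input_chars: inputChars,
    // Assumption: roughly 4 characters per token. Swap in a tokenizer if needed.
    input_tokens_estimated: Math.ceil(inputChars / 4)
  };
}

// Coarse latency buckets keep dashboards readable.
function latencyBucket(ms: number): string {
  if (ms < 500) return "lt_500ms";
  if (ms < 2000) return "500ms_2s";
  if (ms < 10000) return "2s_10s";
  return "gte_10s";
}
```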

The dashboard I want

Once these fields are logged, the useful charts become obvious.

I want to see:

  • failure rate by provider
  • failure rate by model
  • failures before first token vs after first token
  • p50/p95 time to first token
  • average chunks before interruption
  • stream interruptions by route or feature
  • retry success rate
  • client abort rate
  • missing usage rate

That last one matters more than people expect.

If usage metadata is missing after failed streams, your cost dashboard can lie. Not always dramatically, but enough to make optimization work fuzzy.
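
Most of these charts are simple aggregations over the logged events. As a sketch, here is how the time-to-first-token percentiles and the missing usage rate could be computed. The DashboardEvent shape is a hypothetical subset of the log event, and the percentile uses the nearest-rank method, which is one of several common definitions.

```typescript
interface DashboardEvent {
  time_to_first_token_ms?: number;
  usage_available: boolean;
}

// Nearest-rank percentile. Returns null when there is no data.
function percentile(values: number[], p: number): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  const picked = sorted[rank - 1];
  return picked === undefined ? null : picked;
}

function ttftPercentile(events: DashboardEvent[], p: number): number | null {
  const samples: number[] = [];
  for (const e of events) {
    // Streams that never produced a token have no TTFT and are skipped.
    if (typeof e.time_to_first_token_ms === "number") {
      samples.push(e.time_to_first_token_ms);
    }
  }
  return percentile(samples, p);
}

function missingUsageRate(events: DashboardEvent[]): number {
  if (events.length === 0) return 0;
  const missing = events.filter((e) => !e.usage_available).length;
  return missing / events.length;
}
```

Streams that failed before the first token are excluded from the percentile on purpose. They belong in the failure charts, not in the latency chart.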

Where OpenAI-compatible APIs get tricky

OpenAI-compatible APIs are great for switching providers, but compatibility does not mean every streaming behavior is identical.

In practice, I still check:

  • does the provider send [DONE]?
  • is finish_reason always present?
  • is usage included for streamed responses?
  • how are provider errors formatted?
  • what happens when the upstream model times out?
  • are tool calls streamed in the same structure?
  • does the SDK surface stream errors clearly?

This is one reason I like keeping my logging format provider-neutral.

Whether I am using a direct provider API or routing through something like TokenBay, I want my app logs to answer the same question:

Where did the stream fail, and how much work happened before it failed?
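
One way to keep the log format neutral while still respecting provider differences is to record what each provider is expected to send and compare that with what actually arrived. The ProviderExpectations shape and the missingSignals helper below are hypothetical, and the expectation values are things I would fill in from my own testing of each provider, not from any published guarantee.

```typescript
// What I expect a given provider to send on a healthy stream.
interface ProviderExpectations {
  sendsDoneSentinel: boolean;
  sendsFinishReason: boolean;
  sendsUsageWhenStreaming: boolean;
}

// What the wrapper actually observed by the end of the stream.
interface ObservedStreamEnd {
  doneEventReceived: boolean;
  finishReason: string | null;
  usageAvailable: boolean;
}

// Lists expected signals that never arrived. An empty list means a clean end.
function missingSignals(
  expected: ProviderExpectations,
  observed: ObservedStreamEnd
): string[] {
  const missing: string[] = [];

  if (expected.sendsDoneSentinel && !observed.doneEventReceived) {
    missing.push("done_event");
  }
  if (expected.sendsFinishReason && observed.finishReason === null) {
    missing.push("finish_reason");
  }
  if (expected.sendsUsageWhenStreaming && !observed.usageAvailable) {
    missing.push("usage");
  }

  return missing;
}
```

This way a provider that never sends [DONE] is not flagged on every request, while a provider that normally does send it is flagged when it goes missing.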

My current checklist

For every streamed LLM call, I want to know:

  • Did the provider reject the request?
  • Did we receive the first token?
  • How long did the first token take?
  • How many chunks arrived?
  • Did the stream finish cleanly?
  • Did we get a finish reason?
  • Did we get usage metadata?
  • Was the failure caused by the provider, my app, or the client?
  • Is this safe to retry?
  • Did a tool call start?
  • Did a tool call complete?
  • Did a tool execute?

If my logs can answer those questions, debugging becomes much less painful.

Not effortless. This is still LLM infrastructure, after all.

But at least I am not staring at "stream failed" and pretending that is observability.
