karchichen
5 production bugs I debugged in popular AI libraries this week (and the fix patterns you can steal)

If you're building anything with LangChain, OpenAI's Python SDK, or LangGraph in
production, you've probably run into the same class of bugs that have been
quietly breaking real apps for months. Over the last week I dug through ~70
open issues across these libraries while doing client debugging work, and there
are five patterns that come up constantly — each with a non-obvious root
cause and a 5-line fix.

Here they are, ranked by how much pain they cause.


1. _create_usage_metadata crashes when service_tier is set and cached_tokens is missing

Repo: langchain-ai/langchain (#36657)

The trap:

prompt_details = response.get("prompt_details", {})
usage["cached_tokens"] = prompt_details.get("cached_tokens")  # <-- this line is the bug

This stores None in the dict explicitly when cached_tokens is missing.
The subsequent .get(key, 0) then sees the key as present and returns None
instead of the default 0 — classic Python dict.get + explicit-None trap.

The fix:

prompt_details.get("cached_tokens") or 0

Use or 0 instead of the default-arg form; it handles both missing keys and an
explicit None.
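You can reproduce the whole trap in a few lines (the dict names here are illustrative, not LangChain's actual internals):

```python
# Provider response where prompt_details exists but cached_tokens is absent.
response = {"prompt_details": {}}
prompt_details = response.get("prompt_details", {})

# Buggy: .get() returns None, and storing it makes the key "present",
# so a later .get(key, 0) never falls back to the default.
usage = {"cached_tokens": prompt_details.get("cached_tokens")}
buggy = usage.get("cached_tokens", 0)

# Fixed: `or 0` coalesces both a missing key and an explicit None.
fixed = prompt_details.get("cached_tokens") or 0

print(buggy, fixed)  # None 0
```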

Why this matters: The bug only triggers when you've enabled cached tokens
and are using a service tier — which is exactly when you're optimizing cost,
i.e. production traffic. In dev with cached_tokens disabled, you never see it.


2. @langchain/aws crashes on empty AIMessage text blocks in Bedrock Converse

Repo: langchain-ai/langchainjs (#10782)

The trap: When converting a LangChain AIMessage to Bedrock Converse format,
empty text blocks (e.g. when an LLM emits a tool-call without any prose) fall
through to an else branch that throws.

The fix:

if (!textBlock.text || textBlock.text.length === 0) {
  continue;  // skip empty blocks instead of throwing
}

The subtler issue worth flagging upstream: if the only content is empty
text + tool calls, the resulting Converse message has an empty content array.
Bedrock's API rejects that with a different error. So you need a second guard
on the final message construction.
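Both guards fit in a short Python sketch (the block shape is simplified; this is the pattern, not @langchain/aws's actual converter):

```python
def to_converse_content(blocks):
    # Guard 1: keep only non-empty text blocks; skip empties instead of throwing.
    content = [{"text": b["text"]} for b in blocks if b.get("text")]
    # Guard 2: Bedrock's Converse API rejects a message whose content array is
    # empty, so fail loudly (or drop the message) before the request goes out.
    if not content:
        raise ValueError("message produced an empty Converse content array")
    return content
```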


3. SSRF bypass in validate_safe_url when LANGCHAIN_ENV=local_test

Repo: langchain-ai/langchain (#37297)

The trap: A local_test branch in the URL validator does an early return
before the hostname/IP validation runs. Set LANGCHAIN_ENV=local_test and
the entire SSRF check short-circuits — not just the localhost allowlist.

Attack surface: any code path that lets an attacker influence env vars (a
misconfigured container, a dev-staging promotion mistake, a shared CI runner)
gets a free pass on outbound URL filtering. SSRF to internal services becomes
one env-var-flip away.

The fix shape: local_test mode should widen which destinations are
permitted, not skip validation entirely:

if env == "local_test":
    allowlist = LOCAL_ALLOWLIST  # broader, but still validated
else:
    allowlist = PROD_ALLOWLIST
validate_against(url, allowlist)  # always runs

Why this matters: This is a CVE-class bug hiding behind what looks like
an env-gated dev convenience. If you're running LangChain in any env that
ingests user URLs, audit your validate_safe_url calls today.
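Here's a minimal, runnable version of that shape using only the stdlib (hostname allowlisting only; a real validator would also resolve DNS and check IP ranges — the names and allowlists are mine, not LangChain's):

```python
from urllib.parse import urlparse

PROD_ALLOWLIST = {"api.example.com"}
# local_test widens the set of permitted destinations...
LOCAL_ALLOWLIST = PROD_ALLOWLIST | {"localhost", "127.0.0.1"}

def validate_safe_url(url: str, env: str = "prod") -> str:
    # ...but validation itself always runs, regardless of env.
    allowlist = LOCAL_ALLOWLIST if env == "local_test" else PROD_ALLOWLIST
    host = urlparse(url).hostname
    if host not in allowlist:
        raise ValueError(f"blocked outbound URL host: {host!r}")
    return url
```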


4. create_agent exits prematurely on next turn with stale structured_response

Repo: langchain-ai/langchain (#36957)

The trap: create_agent doesn't clear or re-validate structured_response
from the checkpoint state at the start of a new turn. If the previous run
ended with a populated structured_response, the graph's conditional edge
sees it as a valid terminal signal and routes to END before the LLM even runs.

You'll see this as: "my agent worked perfectly on turn 1, then on turn 2 it
just... returns immediately with no LLM call." Easy to misdiagnose as a tool
binding issue.
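You can see the misrouting in miniature with a plain function standing in for the conditional edge (names are illustrative):

```python
def route(state):
    # The conditional edge treats a populated structured_response as "done".
    return "END" if state.get("structured_response") else "agent"

# Turn 1: fresh state, so the agent node runs.
print(route({}))                                        # agent
# Turn 2: the checkpoint still carries last turn's answer -> premature END.
print(route({"structured_response": {"answer": 42}}))   # END
```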

The fix: explicitly reset structured_response to None in your state
initializer or in a pre-processing node before the agent node executes each turn:

def reset_response(state):
    # clear last turn's terminal signal before the agent node runs
    return {"structured_response": None}

graph.add_node("reset", reset_response)
graph.add_edge(START, "reset")
graph.add_edge("reset", "agent")

5. Responses stream accumulator crashes when response.output_item.added has item=None

Repo: openai/openai-python (#3125)

The trap: The streaming accumulator hard-crashes when a provider sends a
response.output_item.added event where item is None. This happens
intermittently with some Responses API providers — especially during high
concurrency or when streaming through proxies that re-batch events.

The fix:

if event.item is None:
    return snapshot  # graceful no-op

Don't reach for getattr(event, "item", None) here: item is a defined attribute
on the Pydantic model, just nullable. The right check is event.item is None.

Defense-in-depth: the same guard belongs on response.output_item.done,
since a provider that sends null on added could trivially do the same on
done.
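As a standalone sketch of the guard (a stub event class, not the SDK's actual types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OutputItemAdded:
    item: Optional[dict]

def accumulate(snapshot: list, event: OutputItemAdded) -> list:
    if event.item is None:   # graceful no-op instead of a hard crash
        return snapshot
    return snapshot + [event.item]

snapshot = accumulate([], OutputItemAdded(item=None))              # unchanged: []
snapshot = accumulate(snapshot, OutputItemAdded(item={"id": "msg_1"}))
```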


The pattern across all five

Notice anything? Every single one of these bugs is at the boundary between
your code and an external system:

  • LLM provider returning ambiguous null vs missing
  • LLM provider streaming events that don't match schema
  • User-controlled input (env vars, URLs) flowing into a security check
  • Checkpoint state surviving across turns when it shouldn't

If you're building an AI app and want to harden it before something breaks at
2am, audit your boundary code first: provider responses, streaming events,
checkpoint deserialization, env-var-gated branches. In my experience, that's
where the vast majority of production AI bugs live.

If you've hit a similar bug and want a second set of eyes on it, my GitHub is
github.com/KarchiChen — I'm tracking issues
across LangChain / LangGraph / OpenAI Python / Anthropic SDK and happy to look
at specific repros.


Originally drafted from a week of debugging vibe-coded AI MVPs. Each issue
above links to the live GitHub thread — feel free to upvote / pile on if it
matches what you've seen.
