<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wu Jiang</title>
    <description>The latest articles on DEV Community by Wu Jiang (@wu_jiang_2ca3f4c2d1718f07).</description>
    <link>https://dev.to/wu_jiang_2ca3f4c2d1718f07</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3967328%2Fce83007f-2618-4635-a352-a9ad6551a6fc.jpg</url>
      <title>DEV Community: Wu Jiang</title>
      <link>https://dev.to/wu_jiang_2ca3f4c2d1718f07</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wu_jiang_2ca3f4c2d1718f07"/>
    <language>en</language>
    <item>
      <title>Why your LLM tool calls silently break — and a ~10µs fix</title>
      <dc:creator>Wu Jiang</dc:creator>
      <pubDate>Thu, 04 Jun 2026 02:05:27 +0000</pubDate>
      <link>https://dev.to/wu_jiang_2ca3f4c2d1718f07/why-your-llm-tool-calls-silently-break-and-a-10us-fix-15mj</link>
      <guid>https://dev.to/wu_jiang_2ca3f4c2d1718f07/why-your-llm-tool-calls-silently-break-and-a-10us-fix-15mj</guid>
      <description>&lt;p&gt;If you stream tool calls or structured output from an LLM, you have almost certainly seen one of these in production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 12 (char 11)
serde_json::Error: EOF while parsing a string at line 1 column 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It usually shows up under load, on your longest and most important responses, and it's maddening because &lt;em&gt;the model did its job&lt;/em&gt; — it just got cut off. This post is about why that happens, why the obvious fixes don't really work, and a small proxy (&lt;a href="https://github.com/tensorhq/suture-stream-repair" rel="noopener noreferrer"&gt;Suture&lt;/a&gt;) that fixes it on the wire in microseconds without touching your code or your API keys.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually breaks
&lt;/h2&gt;

&lt;p&gt;When you stream a chat completion, the provider doesn't send you one JSON document. It sends a long sequence of Server-Sent Events, each a &lt;em&gt;complete, valid&lt;/em&gt; little JSON object carrying a fragment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"tool_calls"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ci"&lt;/span&gt;&lt;span class="p"&gt;}}]}}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"tool_calls"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ty&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Par"&lt;/span&gt;&lt;span class="p"&gt;}}]}}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"tool_calls"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"is&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;}}]}}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;DONE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your SDK &lt;strong&gt;reassembles&lt;/strong&gt; the &lt;code&gt;arguments&lt;/code&gt; field across all those events into one string, &lt;code&gt;{"city":"Paris"}&lt;/code&gt;, and &lt;em&gt;then&lt;/em&gt; parses it. The catch: the thing that's actually JSON (the tool arguments, or your structured-output &lt;code&gt;content&lt;/code&gt;) lives &lt;em&gt;inside&lt;/em&gt; those fragments and is only complete once the whole stream arrives.&lt;/p&gt;

&lt;p&gt;So when the stream ends early — the model hits &lt;code&gt;max_tokens&lt;/code&gt;, blows the context window, or the socket just dies — you're left holding this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Par
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SSE envelope was fine. The reassembled JSON is not. Your parser throws.&lt;/p&gt;
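
&lt;p&gt;To make the failure concrete, here is a minimal sketch of the reassembly step, assuming the OpenAI-style event shape shown above. The helper names &lt;code&gt;reassemble_arguments&lt;/code&gt; and &lt;code&gt;make_event&lt;/code&gt; are made up for illustration; real SDKs do the equivalent internally and also track tool-call indexes, which this sketch ignores.&lt;/p&gt;

```python
import json

def reassemble_arguments(sse_lines):
    """Concatenate tool-call argument fragments from OpenAI-style SSE lines."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)  # every envelope is complete JSON on its own
        for choice in event.get("choices") or []:
            for call in choice.get("delta", {}).get("tool_calls") or []:
                parts.append(call.get("function", {}).get("arguments") or "")
    return "".join(parts)

def make_event(fragment):
    """Hypothetical helper: wrap one argument fragment in a delta envelope."""
    envelope = {"choices": [{"delta": {"tool_calls": [{"function": {"arguments": fragment}}]}}]}
    return "data: " + json.dumps(envelope)

full_stream = [make_event('{"ci'), make_event('ty":"Par'), make_event('is"}'), "data: [DONE]"]
cut_stream = full_stream[:2]  # the stream died after the second event

complete = reassemble_arguments(full_stream)   # parses fine
truncated = reassemble_arguments(cut_stream)   # every envelope parsed, the payload will not

try:
    json.loads(truncated)
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True
```

&lt;p&gt;Every call to &lt;code&gt;json.loads&lt;/code&gt; on an envelope succeeds in both runs; only the final parse of the reassembled string fails, which is why the error surfaces far from the network code.&lt;/p&gt;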

&lt;h2&gt;
  
  
  Why the obvious fixes don't work
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry the request.&lt;/strong&gt; You pay for the whole long generation again, and it may truncate again the same way. Expensive and non-deterministic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;try/except&lt;/code&gt; and move on.&lt;/strong&gt; You throw away a response the model spent real tokens producing — often you can see the answer right there, just missing a &lt;code&gt;"}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bigger &lt;code&gt;max_tokens&lt;/code&gt;.&lt;/strong&gt; Pushes the cliff back; doesn't remove it. Socket deaths don't care about your token budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hand-rolled "close the braces" logic in your app.&lt;/strong&gt; This is the right &lt;em&gt;idea&lt;/em&gt;, and it's also where people quietly ship bugs — see the next section.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  "Just close the braces" is harder than it looks
&lt;/h2&gt;

&lt;p&gt;The naive repair is "append the missing &lt;code&gt;]&lt;/code&gt; and &lt;code&gt;}&lt;/code&gt;." Consider a tool-args stream truncated right after a comma:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;194&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tempting fix is to append &lt;code&gt;]}&lt;/code&gt; → &lt;code&gt;{"items":[250,194,]}&lt;/code&gt;. That is &lt;strong&gt;invalid JSON&lt;/strong&gt; — a trailing comma. A correct repair has to &lt;em&gt;drop&lt;/em&gt; the dangling comma first, then close:&lt;br&gt;
&lt;code&gt;{"items":[250,194]}&lt;/code&gt;. The same trap hides in partial numbers (&lt;code&gt;1.&lt;/code&gt;, &lt;code&gt;1e&lt;/code&gt;), partial keywords (&lt;code&gt;tru&lt;/code&gt;), incomplete &lt;code&gt;\uXXXX&lt;/code&gt; escapes, and — the nastiest — a multibyte UTF-8 character sliced in half by the truncation, where naively appending &lt;code&gt;"&lt;/code&gt; produces invalid UTF-8 and a &lt;em&gt;different&lt;/em&gt; crash.&lt;/p&gt;

&lt;p&gt;Getting this right means treating it as what it is: a tiny, careful JSON parser. Suture's core is a byte-level state machine with one invariant, checked by a property test against &lt;code&gt;serde_json&lt;/code&gt;: &lt;em&gt;for any prefix of any valid JSON value, the repaired output parses.&lt;/em&gt; That test caught the trailing-comma bug, the partial-scalar bugs, and a UTF-8-splitting panic before any of them could ship.&lt;/p&gt;
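
&lt;p&gt;For a feel of what that involves, here is a rough Python sketch of prefix repair. It is not Suture's engine: it works on decoded text rather than bytes, so split UTF-8 sequences and split surrogate pairs are out of scope, and it assumes the input really is a prefix of valid JSON. The name &lt;code&gt;repair_prefix&lt;/code&gt; is invented for this example. The loop at the end is a small stand-in for the invariant, checking every prefix of a few hand-picked documents against &lt;code&gt;json.loads&lt;/code&gt;.&lt;/p&gt;

```python
import json
import re

KEYWORDS = ("true", "false", "null")

def repair_prefix(text):
    """Return parseable JSON for `text`, assumed to be a prefix of valid JSON."""
    stack = []         # open containers, innermost last
    in_string = False
    is_key = False     # does the most recent string sit in key position?
    esc_start = 0      # index of the backslash of the latest escape
    esc_need = 0       # characters still missing from that escape
    prev = ""          # last significant character outside strings
    for pos, ch in enumerate(text):
        if in_string:
            if esc_need:
                # the character right after the backslash decides the length
                esc_need = 4 if (pos == esc_start + 1 and ch == "u") else esc_need - 1
            elif ch == "\\":
                esc_start, esc_need = pos, 1
            elif ch == '"':
                in_string, prev = False, '"'
        elif ch == '"':
            in_string = True
            is_key = bool(stack) and stack[-1] == "{" and prev in ("{", ",")
        elif ch in "{[":
            stack.append(ch)
            prev = ch
        elif ch in "}]":
            stack.pop()
            prev = ch
        elif not ch.isspace():
            prev = ch
    if in_string:
        if esc_need:
            text = text[:esc_start]       # drop the unfinished escape sequence
        text += '"'
    else:
        body = text.rstrip()
        token = re.search(r'[^\s\[\]{},:"]*$', body).group()  # trailing scalar, if any
        head = body[:len(body) - len(token)]
        if token[:1].isalpha():
            token = next(k for k in KEYWORDS if k.startswith(token))  # tru becomes true
        else:
            token = token.rstrip("-+.eE")                             # 1. and 1e become 1
        text = head + token
    text = text.rstrip()
    if text.endswith(","):
        text = text[:-1]                  # a dangling comma must go before closing
    elif text.endswith(":"):
        text += "null"                    # key with no value yet
    elif text.endswith('"') and is_key:
        text += ":null"                   # key with no colon yet
    if not text:
        return "null"                     # nothing usable arrived
    return text + "".join("}" if c == "{" else "]" for c in reversed(stack))

docs = [
    r'{"items":[250,194],"ok":true,"n":null}',
    r'{"s":"a\"b\\ \u00e9 ,:{[","x":-1.5e+3,"o":{"k":[false,{}]}}',
    r' [ 1 , "two" , { "three" : 3.0E-2 } ] ',
]
for doc in docs:
    for cut in range(len(doc) + 1):
        json.loads(repair_prefix(doc[:cut]))  # raises if any repaired prefix fails to parse
```

&lt;p&gt;Note the design trade-off: a number cut mid-digit (&lt;code&gt;25&lt;/code&gt; out of &lt;code&gt;250&lt;/code&gt;) still parses, but as a different value. Repair makes the data parseable; it cannot tell you the value was complete.&lt;/p&gt;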
&lt;h2&gt;
  
  
  The approach: repair on the wire, see nothing you shouldn't
&lt;/h2&gt;

&lt;p&gt;Suture is a reverse proxy. You point your SDK's &lt;code&gt;base_url&lt;/code&gt; at it and change nothing else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8787/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It forwards your request verbatim (your key just passes through — Suture stores nothing), watches the streaming response, tracks the reassembled tool-args / structured content with the byte-level engine, and at end-of-stream emits exactly the characters needed to close it — as a final, well-formed delta event before the terminator. Your client reassembles valid JSON and never knows anything was wrong.&lt;/p&gt;
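
&lt;p&gt;As an illustration of that last step, the sketch below wraps a closing suffix in one more delta event and shows what the client ends up parsing. The suffix is supplied by hand here, the function name &lt;code&gt;closing_event&lt;/code&gt; is hypothetical, and the envelope simply mirrors the example events earlier in this post; the exact event Suture emits may differ.&lt;/p&gt;

```python
import json

def closing_event(suffix):
    """Wrap closing characters in one more tool-call delta event."""
    envelope = {"choices": [{"delta": {"tool_calls": [{"function": {"arguments": suffix}}]}}]}
    return "data: " + json.dumps(envelope, separators=(",", ":"))

received = ['{"ci', 'ty":"Par']   # argument fragments that arrived before the cut
suffix = '"}'                     # closing characters a repair engine would compute
event = closing_event(suffix)

# Client side: the extra event is just one more fragment to append before parsing.
delta = json.loads(event[len("data: "):])["choices"][0]["delta"]
fragment = delta["tool_calls"][0]["function"]["arguments"]
arguments = json.loads("".join(received) + fragment)
```

&lt;p&gt;The client code path is unchanged: it sees an ordinary delta, appends it, and gets &lt;code&gt;{"city": "Par"}&lt;/code&gt;, which is parseable but still reflects only what arrived.&lt;/p&gt;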

&lt;p&gt;Design choices that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It's append-only and passthrough.&lt;/strong&gt; Complete events stream straight through untouched; Suture only appends a closing delta at the end. Added latency is ~10µs of CPU per chunk (measured with &lt;code&gt;criterion&lt;/code&gt;) — three orders of magnitude under the time you spend waiting on the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's content-aware, not byte-naive.&lt;/strong&gt; It repairs the &lt;em&gt;reassembled&lt;/em&gt; field, and only JSON-bearing fields (tool arguments always; &lt;code&gt;content&lt;/code&gt; only when it's actually JSON), so it never mangles prose.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It handles compression and four providers.&lt;/strong&gt; gzip/brotli/deflate are decoded, repaired, and re-encoded on the fly; OpenAI, Anthropic, Google Vertex (Gemini + Claude-on-Vertex), and AWS Bedrock (&lt;code&gt;ConverseStream&lt;/code&gt;, a binary CRC-checked frame protocol) are all supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A note on keys, because it should be a note
&lt;/h2&gt;

&lt;p&gt;Suture forwards your credential and holds nothing. For &lt;strong&gt;AWS Bedrock&lt;/strong&gt; the guarantee is stronger still: with SigV4 signing the secret access key never crosses the wire at all, only a per-request signature does, so a compromised proxy can't steal a reusable AWS credential. (We also validate that the upstream &lt;code&gt;Host&lt;/code&gt; is an AWS endpoint; an SSRF that tried to exploit the &lt;code&gt;Host&lt;/code&gt; header was caught and fixed in review.)&lt;/p&gt;
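
&lt;p&gt;The reason the secret stays home is visible in SigV4's key derivation, sketched below with Python's standard library. This is the client-side signing math, not Suture code. The string to sign is a placeholder (the real one is built from a hash of the canonical request), the secret is AWS's documented example value, and the &lt;code&gt;bedrock&lt;/code&gt; service name is an assumption for illustration.&lt;/p&gt;

```python
import hashlib
import hmac

def _hmac(key, message):
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()

def derive_signing_key(secret_key, date, region, service):
    """SigV4 key derivation: chained HMAC-SHA256 scoped to one day, region, and service."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

def signature(secret_key, date, region, service, string_to_sign):
    """Hex signature that travels in the Authorization header; the secret itself does not."""
    key = derive_signing_key(secret_key, date, region, service)
    return hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()

secret = "wJalrXUtnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY"  # AWS's documented example secret, not a real one
to_sign = "placeholder string-to-sign"               # hypothetical stand-in for the real string to sign
sig = signature(secret, "20260604", "us-east-1", "bedrock", to_sign)
```

&lt;p&gt;A proxy in the middle sees only &lt;code&gt;sig&lt;/code&gt;, which is bound to one request and cannot be turned back into the secret access key.&lt;/p&gt;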

&lt;h2&gt;
  
  
  Honest limits
&lt;/h2&gt;

&lt;p&gt;This isn't magic and it isn't for everything. Providers are shipping native structured-output guarantees (strict schemas, constrained decoding) that reduce &lt;em&gt;malformed&lt;/em&gt; JSON, which is good. What they don't fix is &lt;strong&gt;truncation&lt;/strong&gt;: a stream cut at the token cap or by a dead socket still leaves you with JSON that is well-formed as far as it goes but incomplete, across the long tail of models, Bedrock, and older APIs. That residual is exactly what Suture is for. It also won't resurrect data that never arrived; it only makes what &lt;em&gt;did&lt;/em&gt; arrive parseable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Suture is written in Rust, dual-licensed under MIT/Apache-2.0, and covered by ~100 tests. It lives on GitHub:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/tensorhq/suture-stream-repair" rel="noopener noreferrer"&gt;https://github.com/tensorhq/suture-stream-repair&lt;/a&gt;&lt;/strong&gt;. The repair engine is a standalone library if you'd rather repair in-process and keep even the response bytes off the network.&lt;/p&gt;

&lt;p&gt;If your structured-output pipeline has ever thrown on a truncated stream, it's a one-line &lt;code&gt;base_url&lt;/code&gt; change to find out whether this helps.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
      <category>rust</category>
    </item>
  </channel>
</rss>
