<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Derek Cheng</title>
    <description>The latest articles on DEV Community by Derek Cheng (@derekcheng).</description>
    <link>https://dev.to/derekcheng</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3288971%2F421d5224-0139-4aff-843a-3e5feb734640.png</url>
      <title>DEV Community: Derek Cheng</title>
      <link>https://dev.to/derekcheng</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/derekcheng"/>
    <language>en</language>
    <item>
      <title>Managing Unreliable Compilers</title>
      <dc:creator>Derek Cheng</dc:creator>
      <pubDate>Wed, 28 Jan 2026 16:49:38 +0000</pubDate>
      <link>https://dev.to/derekcheng/managing-unreliable-compilers-5ah1</link>
      <guid>https://dev.to/derekcheng/managing-unreliable-compilers-5ah1</guid>
      <description>&lt;p&gt;Is software development done? Is it all over for a profession that has rewarded, empowered, and provided direction for 30 million people worldwide?&lt;/p&gt;

&lt;p&gt;The answer is clearly no: developers are needed as much as ever. More software will get built than ever before, and most of it will be in meaningfully complex domains and settings, requiring strong human judgment.&lt;/p&gt;

&lt;p&gt;But it is changing at an incredible pace.&lt;/p&gt;

&lt;h2&gt;The Unreliable Compiler&lt;/h2&gt;

&lt;p&gt;Many have analogized LLMs to compilers. Both transform a compact, higher-level description of behavior into more verbose, lower-level code. But there is a crucial difference: compilers are now incredibly reliable, so much so that “it was a compiler bug” gets you approximately the same reaction as “a cosmic ray flipped a bit in RAM”. LLMs and coding agents, on the other hand, are anything but: they make errors in logic and errors in judgment, resulting in functional bugs and slop.&lt;/p&gt;

&lt;p&gt;But they’re fast, and there are effectively infinitely many of them.&lt;/p&gt;

&lt;p&gt;The developer’s key role, then, is to figure out how to put all these unreliable compilers to work: specifying and structuring work clearly, delegating that work efficiently, then verifying and guardrailing imperfect outputs. In other words, developers have all just become first-time managers.&lt;/p&gt;

&lt;h2&gt;Mistakes Managers Make&lt;/h2&gt;

&lt;p&gt;First-time managers make two classic mistakes: under-delegation and over-delegation. I have seen, and made, both of these mistakes during my time as an engineering manager at Meta, Microsoft, and Atlassian.&lt;/p&gt;

&lt;p&gt;Under-delegation results in micro-management. This is incredibly common; the manager can’t let go, and insists on babysitting everything and everyone. This limits scale: you can’t take on more projects if you’re providing dense supervision over everything. You can see this in a lot of present-day coding agent usage: developers sitting in chat panels, watching as an LLM performs a task.&lt;/p&gt;

&lt;p&gt;Over-delegation is also a road to pain and suffering. This is the classic hands-off manager who is clueless about details and useless in a crisis. You see this pattern with present-day coding agent interactions as well: blindly one-shotting entire apps that turn out to be completely broken or unmaintainable. Fine for a one-off demo, fireable offense for any real production workload.&lt;/p&gt;

&lt;p&gt;The solution to both problems is to define a clear protocol with explicit hand-offs and well-defined points where you, as the manager, can weigh in: sparse but effective supervision that scales up.&lt;/p&gt;

&lt;h2&gt;Lifting the Barbell&lt;/h2&gt;

&lt;p&gt;A simple model for development is plan → code → verify. It applies at multiple scales, and real work is never purely linear or waterfall, but the model holds.&lt;/p&gt;

&lt;p&gt;In this model, it’s clear where human attention and judgment should be concentrated: at the endpoints. Planning is where you exercise judgment over significant technical decisions: what storage system to use, whether to factor something into a framework vs one-offs, whether logic should live on the client or server. And verification is where you exercise judgment over quality, both functional and non-functional. Just as with managers, a key duty here is to hold a high quality bar.&lt;/p&gt;

&lt;p&gt;This is the transformation that is upon us as developers: learning to switch from spending most of our time and energy on coding, to spending most of it at the endpoints. Our role remains critical, but has become barbell-shaped.&lt;/p&gt;

&lt;p&gt;We’re building &lt;strong&gt;&lt;a href="https://www.tonkotsu.ai" rel="noopener noreferrer"&gt;Tonkotsu&lt;/a&gt;&lt;/strong&gt; around this barbell. We give you powerful tools for planning and verification while orchestrating the middle so you don’t have to babysit. The profession isn’t ending. It’s scaling up.&lt;/p&gt;

</description>
      <category>career</category>
      <category>ai</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>“I've failed 17 consecutive times with the exact same error”</title>
      <dc:creator>Derek Cheng</dc:creator>
      <pubDate>Tue, 09 Dec 2025 16:57:26 +0000</pubDate>
      <link>https://dev.to/tonkotsu/ive-failed-17-consecutive-times-with-the-exact-same-error-1efn</link>
      <guid>https://dev.to/tonkotsu/ive-failed-17-consecutive-times-with-the-exact-same-error-1efn</guid>
      <description>&lt;p&gt;That was the verbatim response from our agent as it attempted to execute a coding task. Actually, it was a little more verbose in its self-scolding:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I’ve failed 17 consecutive times with the exact same error. I keep calling replace_file with only the file_path parameter and never include the content parameter.&lt;/p&gt;

&lt;p&gt;After 17 consecutive failures, I need to break this pattern. Let me think about what’s happening:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I’m not actually constructing the CSS content in the tool call&lt;/li&gt;
&lt;li&gt;I keep saying I will, but then I don’t do it&lt;/li&gt;
&lt;li&gt;I need to ACTUALLY write the complete CSS inside the content parameter&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my next response, I MUST write the COMPLETE tool call with BOTH parameters, including the actual CSS code in the content parameter. I cannot make this mistake an 18th time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Spoiler alert: it would repeat the mistake an 18th time. Even though it knew exactly what it was doing wrong. The strangest part wasn’t the failure; it was the self-awareness.&lt;/p&gt;

&lt;h2&gt;while (true)&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.tonkotsu.ai?utm_source=devto" rel="noopener noreferrer"&gt;Tonkotsu&lt;/a&gt; is a multi-agent management environment — the successor to IDEs. A core part of it is a coding agent that executes engineering tasks in parallel and without the need for micromanagement. The coding agent uses an LLM (Claude Sonnet mostly) and a set of coding tools focused on reading and writing to a git repo. The LLM is given a task specification and then calls tools over and over (reading relevant parts of the repo, making code edits, then validating its changes) until its task is accomplished. Pretty standard coding agent architecture.&lt;/p&gt;

&lt;p&gt;We track task failures in a daily review to make sure agent reliability and generated code quality meet high standards. We get to see LLM behavior at the edges, where things either perform shockingly well or fail in very bizarre ways. Starting in September, we saw that a large percentage of our task failures occurred because the LLM session exceeded our limit on the maximum number of messages. Upon inspecting these failing tasks, we could see that the LLM had fallen into an infinite loop: it would call a tool unsuccessfully, then repeat the same erroneous call over and over (often 30–40 times) until the limit was hit.&lt;/p&gt;
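&lt;p&gt;A loop guard for this failure mode is easy to sketch. The following is a minimal, hypothetical detector — the &lt;code&gt;ToolCall&lt;/code&gt; shape and &lt;code&gt;is_stuck&lt;/code&gt; helper are illustrative, not Tonkotsu’s actual code: if the last N tool calls in the session history are identical and all failed, interrupt the session instead of burning turns until the message limit is hit.&lt;/p&gt;

```python
# Hypothetical sketch: detect when an agent repeats the exact same failing
# tool call, so the session can be interrupted early. These names are
# illustrative assumptions, not a real agent framework API.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # sorted (key, value) pairs, so calls compare by value
    failed: bool


def is_stuck(history, window=5):
    """True if the last `window` calls are identical and all failed."""
    if len(history) >= window:
        tail = history[-window:]
        return all(c == tail[0] and c.failed for c in tail)
    return False
```

&lt;p&gt;A check like this would run after every tool result; the window size trades off catching loops quickly against tolerating legitimate retries.&lt;/p&gt;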

&lt;p&gt;We have a &lt;code&gt;replace_file&lt;/code&gt; tool that allows the LLM to overwrite an existing file (or create a new file) at &lt;code&gt;file_path&lt;/code&gt; with text provided in &lt;code&gt;content&lt;/code&gt;. Both parameters are identified as required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  name: "replace_file",
  description: "Write a file to the local filesystem. Overwrites the existing file if there is one.",
  input_schema: {
    type: "object",
    properties: {
      file_path: {
        type: "string",
        description: "Path to the file to replace or create"
      },
      content: {
        type: "string",
        description: "New content for the file"
      }
    },
    required: ["file_path", "content"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the failing tasks, the LLM repeatedly called &lt;code&gt;replace_file&lt;/code&gt; with a valid &lt;code&gt;file_path&lt;/code&gt; but no &lt;code&gt;content&lt;/code&gt; at all! And once it made a bad call, it would spiral into an infinite loop, calling &lt;code&gt;replace_file&lt;/code&gt; over and over in exactly the same way and never specifying &lt;code&gt;content&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;break;&lt;/h2&gt;

&lt;p&gt;Our initial mitigation was simple and direct. When receiving a bad tool call, we started returning a more verbose error message to the LLM, explicitly naming the parameter that was missing and clearly instructing it to think about the value of that parameter before making the call again. The fix was deployed and we found it had no observable effect at all — our first hint that this wasn’t just a run-of-the-mill mistake.&lt;/p&gt;
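&lt;p&gt;The mitigation itself is straightforward to sketch. This is a minimal, hypothetical version — the schema fragment mirrors the &lt;code&gt;replace_file&lt;/code&gt; definition above, but &lt;code&gt;validate_tool_call&lt;/code&gt; is our illustrative name, not a real API: check the call’s input against the tool’s required parameters and name every missing one in the error returned to the model.&lt;/p&gt;

```python
# Hypothetical sketch of the verbose-error mitigation: validate a tool
# call's input against the tool's JSON schema and name each missing
# required parameter in the error fed back to the model.
REPLACE_FILE_SCHEMA = {
    "name": "replace_file",
    "input_schema": {
        "required": ["file_path", "content"],
    },
}


def validate_tool_call(schema, tool_input):
    """Return None if the call is valid, else a verbose error message."""
    required = schema["input_schema"]["required"]
    missing = [p for p in required if p not in tool_input]
    if not missing:
        return None
    return (
        f"Error: call to {schema['name']} is missing required parameter(s): "
        + ", ".join(missing)
        + ". Think about the value of each missing parameter before calling again."
    )
```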

&lt;p&gt;Next, we tried a stronger intervention. When a bad tool call was made, we would disable tool calling entirely in the next LLM turn. We’d explicitly tell the model via a user message that tool calling was disabled, that the function call was missing a parameter, and that it should reflect on what the content of that parameter should be. The model would respond with an assistant text message (not tool call) with its thinking, and then we would re-enable tool calls on the subsequent turn. This was a much more invasive approach, pausing the entire trajectory to give the model a chance to think deeply.&lt;/p&gt;
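&lt;p&gt;In outline, the intervention looks something like this — a minimal sketch, where &lt;code&gt;call_llm&lt;/code&gt; is a stand-in for whatever client you use, not a real library function: append the reflection instruction as a user message, run one turn with no tools, record the model’s text response, and hand control back to the normal loop.&lt;/p&gt;

```python
# Hypothetical sketch of the reflection-turn intervention: after a
# malformed tool call, run one turn with tool calling disabled and an
# explicit reflection prompt, then let the caller re-enable tools.
def reflection_turn(call_llm, messages, missing_param):
    prompt = (
        "Tool calling is disabled for this turn. Your last function call was "
        f"missing the '{missing_param}' parameter. Reflect, in plain text, on "
        "what the content of that parameter should be."
    )
    messages = messages + [{"role": "user", "content": prompt}]
    reflection = call_llm(messages, tools=None)  # no tools this turn
    messages = messages + [{"role": "assistant", "content": reflection}]
    return messages  # caller resumes the loop with tools re-enabled
```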

&lt;p&gt;And it still didn’t work. The model would indeed think through what needed to be done, often with a dose of self-flagellation as in the quote at the start of this article. The fascinating thing: it knew precisely what was wrong. It could textually describe how to fix it. But then, on the next tool-call-enabled turn, it would immediately repeat the malformed call.&lt;/p&gt;

&lt;p&gt;At some point, the model also began referring to its own implementation details, suggesting that it internally emits tool calls as XML:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The issue is clear: I keep writing:&lt;/p&gt;


&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;invoke name="replace_file"&amp;gt;
&amp;lt;parameter name="file_path"&amp;gt;styles/styles.css&amp;lt;/parameter&amp;gt;
&amp;lt;/invoke&amp;gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;But I MUST write:&lt;/p&gt;


&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;invoke name="replace_file"&amp;gt;
&amp;lt;parameter name="file_path"&amp;gt;styles/styles.css&amp;lt;/parameter&amp;gt;
&amp;lt;parameter name="content"&amp;gt;THE ACTUAL CSS CODE HERE&amp;lt;/parameter&amp;gt;
&amp;lt;/invoke&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;We had stumbled upon some strange, deep-seated behavior of the model. We speculated it was a training artifact: the model had learned the value of retrying tool calls, and once it latched onto this failing pattern, it kept sampling the same tool call sequence again and again. It had fallen into a gravity well so strong that not only could it not correct the tool call, it couldn’t formulate any other strategy as a workaround either.&lt;/p&gt;

&lt;p&gt;At this point, we were stumped. Unsure exactly how to proceed, we kept experimenting and sought the advice of the Anthropic team. They suggested a tweak to our intervention: provide the LLM with the exact JSON template for the function call and ask it to fill it out during its tool-calls-disabled reflection turn. We didn’t expect much from this simple tweak but added it to our battery of experiments. We would now append this static prompt to our reflection instruction to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate the following JSON object to represent the correct tool call with real parameter values for replace_file. Conform to exactly this JSON structure:

  {
    'type': 'tool_use',
    'name': 'replace_file',
    'input': {
      'file_path': &amp;lt;FILE_PATH_HERE&amp;gt;,
      'content': &amp;lt;CONTENT_HERE&amp;gt;
    }
    }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shockingly, this simple tweak resulted in significant improvements! The model still occasionally generates incorrect tool calls, but is able to recover rather than spiral into an infinite loop — a much better result. In yet another bizarre aspect of the model’s behavior, this explicit JSON structure was enough to help the model climb out of the gravity well of the tool call loop.&lt;/p&gt;

&lt;p&gt;More recently, Anthropic released &lt;a href="https://platform.claude.com/docs/en/build-with-claude/structured-outputs" rel="noopener noreferrer"&gt;strict tool use&lt;/a&gt;, which should guarantee correct tool calls. We’re currently experimenting with this as well.&lt;/p&gt;

&lt;h2&gt;Parallel &amp;gt; Perfect&lt;/h2&gt;

&lt;p&gt;What’s striking is how familiar this all feels if you’ve ever been an engineering manager or even just an observant member of a team. You’ve probably worked with someone who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Repeats the same unproductive action in the face of increasingly explicit feedback&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is generally quite reasonable, but gets bizarrely stubborn on one issue&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can verbalize the solution to a problem, but simply can’t execute it&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Humans do this, and so do LLMs. Our bet is that the future isn’t perfect coworkers (agent or human); it’s the ability to effectively coordinate them all together to solve a big problem in parallel.&lt;/p&gt;

&lt;p&gt;👉 If you're interested in more write-ups on building with multi-agent LLM workflows, I’m writing about these experiences here → &lt;strong&gt;&lt;a href="https://blog.tonkotsu.ai" rel="noopener noreferrer"&gt;blog.tonkotsu.ai&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>agents</category>
      <category>ai</category>
      <category>devtool</category>
    </item>
  </channel>
</rss>
