Henry Ivry

Posted on Jun 18

May You Get What You Asked For

#ai #programming #software #agents

Recently, while working on an in-progress open-source framework called Projector, I ran into a (not particularly novel) issue: one of it's internal packages (core) had grown during this period, and was not nearly as flyweight as it needed to be in the browser. The result was 10-20kbs of unnecessary machinery getting pulled in.

I noticed this while running examples. I was consistently hitting a wall in bundle sizes that was surprisingly difficult to get past, even for someone as stubborn and relentless as I am. Naturally, I turned to Claude and ChatGPT to help me with this, and ended up using ChatGPT 5.5 with Codex as I find that, with the "precise" output mode, it tends to be a little more honest than Opus 4.8 these days.

I shared exported HAR network logs with it, having it go through the chunks to confirm where the bulk was; consistently, it confirmed that the issue was around an entangling of authoring/resolution code with runtime code in core that was pulling in too much to the browser.

The technical details here aren't really important, but I'm using them to illustrate a larger point.

We then iterated through a lot of different solutions—I setup a "goal" in codex with benchmarks to hit, and gave it a bunch of constraints, context, and tooling. Finally, after about 2-3 hours of looping against that goal, it completed.

Looking through the git diff, I noticed something odd—it had duplicated the result of the resolved module, so it could skip the resolution machinery and thus drop it from the browser bundle (again, technical details not really relevant).

It hit the rough kb benchmarks, respected all constraints, utilized all context and skills available, and avoided importing the machinery that we both aligned on being the core problem. It provided an elegant, coherent, well-written api, implemented a surgical, well-tested, well-designed solution, and convincingly defended its work when I queried about the implementation.

That sounds great, right? In fact, I think it highlights an emerging issue with frontier models—the ability to present sophisticated-looking, elegantly executed, not-wrong solutions with increasingly convincing reasoning, while simultaneously not being able to navigate ambiguity outside of a prompt's implicit parameters (yes, the problem is always user input and/or expectations, ie me; the LLM is just tooling).

May You Get What You Asked For

This, to me, is an increasing footgun in the advancement of LLMs, especially as the world seems to increasingly fixate on the abilities of LLMs to deliver "working" solutions and outcomes. It highlights how, as LLMs generate increasingly sophisticated output, it requires increasingly sophisticated design review and precise prompting to avoid potentially catastrophic land-mines hidden behind a very inhuman approach to problem solving. This is the kind of code design that make code bases increasingly harder to reason about for both humans and LLMs; a good sign of it is ever-increasing LOC and indirection. For my use case, it was a perfect example of an ironic saying I've heard somewhere that goes something like "may you get what you asked for", which I think perfectly sums up our relationships with LLMs.

When I looked at the code, it wasn't immediately clear what I disliked; all that I had was an intuition that something wasn't quite...right.

It gave me a convincing argument that, yes, what it added created some duplication and indirection, but that it was arguably acceptable and pragmatic here because otherwise we'd need to expose resolution internals from core that shouldn't necessarily be public. In fact, all of the arguments were perfectly sound and convincing.

I think a lot of engineers would have stopped there, because I almost did. But still, something irked me, and I'm annoying. It still added complexity to a system where this problem was already solved somewhere else internally. The case it presented boiled down to, why go through all that resolution machinery when we already know what the outcome should be in this particular scenario?

The answer is: that makes perfect sense...if you're an LLM looking for a solution amidst ambiguity. As a human, I saw that it added both complexity and potential for drift that outweighed the design, and hinted at a larger underlying problem in the codebase. It created a subtle caveat to a system designed to be reasonable and predictable. It made it harder to grep for a source-of-truth, or understand why everything else is modeled one way, except for this place, and added something that was similar-but-different in how we treated code here.

So I tried reframing the problem: if, in an ideal world, core was extremely cheap to import and run, would the ideal state be to use materialization on the browser? Or is the larger point that the resolution is not needed in the browser at all, because we already know what we need?

Its response was definitively different than its implementation:

"In an ideal world where core is extremely cheap to import and run, I would use materialization in the browser. Not because the browser needs ceremony, but because resolution is the mechanism that preserves the model as the source of truth. If that mechanism were cheap, there would be little reason to create a separate base fast path. The browser would simply materialize the surface service from the model-bound plan, and we would be done."

Again, ignore the technical details, but note that this is not what it implemented. To me, after all the work and iteration that we did, this was the most useful kernel of output—a clear articulation of the problem as a design issue in core that would require a more invasive refactor. We did that, and the kbs dropped below the target, the separation of concerns in core made a lot more sense, and overall the entire system became easier to reason about.

They're Not Human 😱

Again, this isn't about the LLM output or problem itself; it's not a particularly novel scenario. It's about how:

we guide LLMs through ambiguity
LLMs make a million decisions and choices all the time that we as humans wouldn't—even if they're trained on our data and tightly coached by us

As these models advance, they're providing increasingly convincing, well-designed, elegant, nuanced solutions, with increasingly convincing reasoning and rationale, while simultaneously potentially compromising the stability, scalability, and maintainability of a code base. The issue here is again, not the LLM, it's that it has no real point of view on how to resolve ambiguity; so when I lacked precision in my description of the problem, regardless of the guardrails I provided and benchmarks I told it to hit, it delivered a solution that would have potentially seriously compromised the scalability and integrity of my code base.

But note in the example I provided, it took me multiple steps to even be able to articulate the underlying issue:

I provided a detailed prompt, with constraints, context, a specialized harness, and a goal with benchmarks to hit
I checked in over multiple hours of iteration by the LLM to see where it was going
I performed a code design review
I went back-and-forth, debating over its implementation
I pursued further contextual investigation myself of the codebase based on my intuition and deep understanding of my principals, original goals, and lots of subconscious workings I'm probably not even privy to,
I reframed the topic to the LLM
We aligned on an actual root cause around an underlying design tension

Notably, this took significant time, effort, iteration, and discipline, along with a human intuition about what was subjectively "right" and "wrong" to arrive at an objectively better solution.

What is most interesting to me is how much better these models are getting in presenting convincing, reasonable arguments, and writing what is otherwise well-designed, ergonomic, well-tested code...that still might paper over a nuanced, underlying system design problem with real potential consequences. The latter half of that statement, the papering over, is nothing new to software engineers who regularly work with LLMs. What is increasingly new is the sophistication of the papering over as models become more capable, combined with the fact that we're asking it to do harder things in more ambiguous problem spaces.

This provides a growing opportunity for oversight gaps in detecting mismatches where the LLM designs solutions to problems that don't exist, solutions that solve the wrong problems, or solutions for a problem that you couldn't quite articulate (which, if you can't articulate a problem, you can't know if you've solved it). Not to mention the amount of time it takes to uncover these issues, where it would be far more stark if it was a less capable model.

They're...Really Not Human 😱

LLMs are inherently solution-oriented, while human engineers are curious, chaotic, and problem-oriented. We may say we're solution oriented, but the most rewarding parts of my career have been trying to solve incredibly hard problems through collaboration, iteration, and trial-and-error with team members, who contribute perspective, experience, and points-of-view that I may crave from an LLM, but will never get, because they are not human.

The danger I see is that the better and more convincing these models get, the more we may trust them to handle important tasks, like we would with more senior engineers. But it's not looking like they "scale" in the same way human engineers do...at all.

In human engineers, with increased skill and seniority also comes increasingly sophisticated solutions and problem-solving, sure. But it also comes with stronger points of view, and an increased ability to make sound judgment calls based on a number of intangible human factors to navigate ambiguity in problem spaces, like prioritization, accountability, responsibility, experience, judgement, and trust.

The differences are becoming more and more stark in that there's a fundamentally different evolution when it comes to evaluating the "seniority" of LLMs, and it's increasingly bolstering my gut instinct that when people say things like "this LLM can replace an x-level engineer, with oversight" just...doesn't make sense; it really boils down to trust.

My trust that an LLM understood how to solve a problem correlates directly with:

the level of precision in which I'm able to articulate the problem
the level of ambiguity inherent to the problem space

The more precision and the less ambiguity, the more I will trust the LLM to do a good job. I say "trust" and not "confidence", because I never trust that an LLM to ask me when it doesn't understand something.

What Are Companies Doing?

When I see companies like Meta doing mass layoffs of engineers that they think can be replaced with AI, my first thought is empathy as an engineer myself, but immediately followed by confusion. The level of hubris in thinking that LLMs can come close to replacing the utility of a human engineer's brain, process, and intuition to navigate ambiguity is astounding to me, as someone who works with LLMs 24/7.

Also, simply put: you cannot have senior engineers create more senior engineers without junior engineers. It's just the law of nature, folks!

Additionally, to not see that increasingly sophisticated and complex solutions from incredibly powerful yet novel tooling requires an equal amount of careful oversight from an increased amount of people spending more time on doing that demonstrates leadership that's shockingly out of touch with reality, people, and impossibly, even the technology they create. It speaks to a culture of using engineers to do mechanical work that can be replaced by automation, which is simultaneously a waste of company money, as well as a waste of engineering careers.

It also offloads an incredible burden to justify that level of investment in AI tooling to the engineers, while simultaneously prescribing it as a tool, because LLMs are a tool. While AI is incredibly useful and does increase productivity in specific scenarios, to use it right requires discipline, oversight, time, points of view, and context that (at the moment) only a human can provide. Not to mention, the value they're actually providing is incredibly subjective and use-case specific, often not even related to the code itself, but the iteration process with code.

Time Spent

For better or for worse, as models advances, I have found that, due to the increased sophistication of the code it writes and the design choices it makes, time commitment to code review is only growing, because the trade-offs of accepting AI-generated code become increasingly nuanced (not obsolete) while the risks inherent to developing and scaling software remain. Ironically, in the end, we inevitably end up being constrained by human factors anyway, because we as engineers can only carefully review so much before our brain collapses and we either approve a PR or stare at the wall for an hour thinking about nothing.

Would I rather review a sophisticated PR written by a human or an agent? Honestly, it's all code to me, but I trust a human far more than an agent, and if I ask why, a human will provide me with an honest response and an actual point of view.

Again, I love LLMs! This article began as a simple exploration into my thoughts on the evolving sophistication of them and the pit-falls that come with that. At the end of the day, I find the rapid advancement of LLM models fascinating, valuable, time-saving, time-consuming, and incredibly vexing; I find LLMs provide value in codegen, but also in quick iteration to help me discover that I might be solving the wrong problem much faster than if I had to do it all myself, which is value in time saved, but in a different way I might have expected. Unsurprisingly, as models become more capable, I ask more ambiguous and demanding tasks from them, and this is also the place where I find the most interesting shortcomings.

With increased abilities, I find myself spending more time reviewing LLMs with less trust, rather than spending less time with more trust, because I've learned the hard way that just because the argument and implementation it presents appears convincing, and the output is well-designed, articulated, and executed, does not mean it's correct. There's no shortcut to resolving ambiguity, in the end.

And for any organizations that think sophisticated code output outweighs the value of engineers across levels collaborating on ambiguous problem spaces...well, may you get what you asked for.

Top comments (3)

L. Cordero • Jun 19

"not-wrong solutions" - My exact pain point for the build I'm working on.

Alex Shev • Jun 18

The bundle-size example is a good agent lesson: the agent should not just propose a cleanup, it should prove the artifact changed.

For this kind of work I like forcing the loop through a measurable check: before/after bundle output, chunk ownership, and a short explanation of what moved. Otherwise "I removed the dependency" can mean the code changed while the browser payload stayed the same.

Henry Ivry • Jun 19

Totally agree in regard to forcing a loop as great hygiene—your check catches real issues.

I was actually poking at an insidious sibling: where the agent is actually correct and meeting every criteria, while working around a design issue and solving the wrong problem in a way that a measurable loop or benchmark wouldn't catch, because every number looks great, and it followed every instruction.

That's the part I find both fascinating—and a little unnerving—as the models get better. They're increasingly able to effortlessly work around serious design tensions that would normally trip up a human or lesser model and force a refactor in the right place. Instead, this class of workaround becomes camouflaged as well-designed and defended output that makes it harder (and more time consuming) to detect, potentially re-enforcing underlying design issues that can contribute to serious fragility and come back and bite later, or make it much harder to refactor in the future.

Thanks for the thoughtful read, it definitely gets me thinking about where verification helps and where it gives false comfort while trying to articulate that more clearly.