Iain Thomson for Daily Context

Posted on Jul 1

It’s Time To Put Humans Back In The Software

#aie #software #ai #agents

AI Engineer World's Fair Coverage

Software engineers have become overreliant on models to build applications, and it’s time to put humans more firmly in the loop, the AI Engineer World’s Fair was told on Tuesday.

The term software engineering came to prominence at a 1968 NATO conference, which was called to discuss the problems of building large-scale, reliable applications at speed. It was suggested that applying industrial manufacturing methods to code creation would help speed this process up, but with larger and larger amounts of code now being machine generated, cracks are starting to show in reliability.

“We are all racing to put AI coding into production, and there's been lots said about loop engineering, and we should probably write more loops,” argued Dex Horthy, the impressively mustachioed co-founder of HumanLayer.

“I'm here to convince you today that this is, in fact, not a joke, and that no amount of harness engineering or loops maxing can solve what is fundamentally a model training issue.”

The problem, he said, is that developers have become overreliant on badly trained models. While some agents are great at some parts of the development cycle, they are lousy at others. As a result, pull requests are up, code breaks are multiplying, and customers — and the companies supplying them — are losing out.

His proposal is to involve humans much more in the early stages of software design, particularly in planning and architecture decisions. Human steering at the start can solve a lot of problems before they become serious issues, he opined, and 30 more minutes of preplanning and alignment can save hours in review.

Loops are no magic bullet, Horthy said: engineers cannot simply be thrown at training agentic models to do their jobs for them. Loops can help, but they aren’t the panacea that many in the industry claim.

Part of the problem is the reward system for training, Horthy said. A serious flaw committed in vibed code might only be discovered months or even years after the error was made. There’s no way to “propagate that reward signal back across the gap,” he told the crowd.

Human code reviews are too often being skimped on, or even eliminated, in some software houses. AI can help human checkers considerably, but it can’t be relied on without human oversight, he said. And if you’re drowning in AI-generated pull requests, then that also indicates the models are at fault. He added that if machine-made code needs a 20% rewrite, then that’s actually pretty good for much of the agentic output.

Small fixes can be assigned to an agent, but for key system architecture decisions, humans need to review and plan where a product is going, how its parts fit together, and how to get there without too many pull requests. Mistakes will always be made, but it’s important to minimize them if you want to keep users from getting overly pissed off. The answer is a fusion of man and machine, and with some realignment the issue is fixable, he suggested.

“It's easy to hear all this and be a little bummed out. I really like the world where we can just not have to ever read code ever again,” he said.

“But we're engineers, and these are just constraints, and models are good at certain things, and they're not good at other things. And so, go figure out how to solve problems given a set of constraints. Use loops, they're great. But go solve the hard problems, seek leverage.”

Top comments (30)

Vinicius Pereira • Jul 1

Good writeup, and Horthy is right about the part everyone underinvests in, humans at the front. But I'd push back gently on "no amount of harness engineering can solve what is fundamentally a model training issue," because I don't think the harness was ever trying to solve the training problem. It's not there to make the model better, it's there to bound a model you already don't fully trust. Those are different jobs. You don't test your way to a smarter model, you test your way to a broken change not shipping. Framing the harness as a failed attempt to fix the model sets it up to fail at something it was never for.

And honestly his strongest point argues the other way. The reward-signal gap he describes, where a flaw surfaces months later and can't propagate back to training, is exactly why you need a harness. Training can't learn from a failure that shows up in a year, but a deterministic test or eval gives you the fast local reward at the moment of the change that training never gets. The harness is the substitute for the signal the delayed failure can't provide, not a rival to it.

Where I fully agree is that this is really three layers, not two, and the middle is the one everyone maxes. A human decides what's worth building, the model proposes, and the harness disposes on whether the change is correct. Loops and evals bound correctness, they don't tell you the thing was worth building, and that human-at-the-front judgment lives on a different axis than verification. So the fix isn't less harness, it's to stop asking the harness to cover the two layers it was never on. Good talk to surface.

Gamya • Jul 4

This three-layer breakdown is really useful. I'd add a beginner's angle to it: for someone still building foundational skills, the risk isn't just "the model writes the code"; it's that the model quietly makes the judgment calls you haven't developed the instinct for yet. The human at the front layer assumes you already have taste to apply. If you skip the years of small mistakes that normally build that taste, it's not clear you're actually contributing judgment at the "human" layer, or just rubber-stamping something you don't yet have the experience to question.

Vinicius Pereira • Jul 5

this is the sharpest version of the beginner problem i have seen written down. the taste question is real: judgment used to be a byproduct of writing and debugging your own mistakes, and if the model absorbs those reps, the default path produces reviewers who never earned the instinct they are being asked to apply.

but i do not think the answer is abstinence, it is relocating the reps. the habit i have watched work for people coming up fast: before reading what the model produced, write down what you expect the change to look like, then diff your expectation against the actual output. the gap between the two is exactly the lesson a small mistake used to teach, except it arrives in minutes instead of after a production incident. done honestly, you see more distinct failure modes in a month of reviewing model output than a pre-ai junior saw in a year of their own bugs, because a model generates a wider variety of plausibly-wrong code than any one person ever would. the raw material for taste got richer, not poorer. what kills the formation is the posture: accept-and-move-on treats the diff as an answer, predict-then-diff treats it as a hypothesis, and taste only grows in the second stance.
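
One way to make the predict-then-diff step mechanical is sketched below. It assumes the expectation is written as a rough code snippet rather than prose, which the comment above does not require, and the function name `surprise` is hypothetical. It uses Python's standard `difflib` to list only the lines where the actual change departs from what was written down first.

```python
import difflib


def surprise(expected: str, actual: str) -> list[str]:
    """Return the added/removed lines where `actual` departs from `expected`."""
    diff = list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )
    # The first two lines are the "---"/"+++" file headers; skip them, then
    # keep only real changes (context lines start with " ", hunks with "@@").
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    # Hypothetical prediction: a guard clause at the top, loop untouched.
    expected = "if not items:\n    return []\nfor item in items:\n    process(item)"
    # Hypothetical model output: no guard clause, loop changed instead.
    actual = "for item in items or []:\n    process(item)"
    for line in surprise(expected, actual):
        print(line)
```

An empty result means the prediction matched line for line; anything else is the gap worth reading closely.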

Gamya • Jul 6

The predict-then-diff framing might be the most useful reframe of this whole debate for me personally. I think what got me stuck before was treating the model's output as something to evaluate after the fact — which quietly puts you in the accept-and-move-on posture even when you think you're being careful, because you're only ever reacting to what's already on the screen. Writing the expectation down first forces the commitment before you've seen the answer, which is exactly the part that used to happen naturally when a bug forced you to have a theory before you could even start debugging. I'm going to actually try this instead of just reading the diff and nodding along to it.

Vinicius Pereira • Jul 6

yeah, the "only ever reacting to what's already on the screen" is the whole thing, and it's worse than it looks because reading the output moves your prior before you can catch it. once you've seen the diff your judgment is already anchored to it, so "careful review after the fact" isn't a weaker version of predict-then-diff, it's a different thing that can't recover the call you'd have made cold. the prediction written first is the only uncontaminated opinion you get.

one thing that decides whether it actually works when you try it: make the prediction specific enough to be wrong. "it should handle the empty case" kind of matches whatever shows up, so you get the ritual and none of the reps. "it'll add a guard clause at the top and leave the loop untouched" can be flatly wrong, and being wrong is the entire point, that's the rep. if your prediction can't be falsified you didn't make one. curious how it goes once you run it on real diffs, that's where you find out if the habit holds.

Gamya • Jul 8

"If your prediction can't be falsified you didn't make one" is going straight on a sticky note. That's the part I would have gotten wrong — I'd have written something vague enough to always feel right and wondered why the habit wasn't building anything.
The specificity requirement also reframes what being wrong means. A falsifiable prediction that turns out to be wrong isn't a failure of the exercise — it's the exercise working exactly as intended. The rep only lands when there's something concrete to be wrong about. Will report back once I've run it on real diffs.

Vinicius Pereira • Jul 8

the 'falsifiable-but-wrong is the exercise working' is the reframe that makes it survive contact, glad it landed. one thing for when you run it on real diffs, because it's the exact trap one step later: the same anchoring that contaminates the review contaminates the grading. after you see the diff it's very easy to decide past-you 'basically predicted that' and quietly round a miss up to a hit, and now the rep is building nothing again, just later. so freeze the prediction where you can't edit it, commit it, timestamp it, paste it in the PR before you open the diff. the point isn't ceremony, it's that a prediction you can still soften after the fact is as vague as one that was never specific, it just fails honest later instead of early. curious what your hit rate looks like the first week, it's usually humbling in a useful way.
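
A minimal sketch of the freeze step, assuming a plain SHA-256 digest is enough tamper evidence for a personal habit. The helper names `freeze_prediction` and `is_untouched` are hypothetical, and the digest only helps if it is stored somewhere the author cannot quietly rewrite, such as the PR description or a commit.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def _digest(prediction: str, frozen_at: str) -> str:
    """Hash the prediction together with its timestamp."""
    payload = json.dumps(
        {"prediction": prediction, "frozen_at": frozen_at}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def freeze_prediction(text: str, now: Optional[datetime] = None) -> dict:
    """Record the prediction, a UTC timestamp, and a digest of both."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {"prediction": text, "frozen_at": stamp, "sha256": _digest(text, stamp)}


def is_untouched(record: dict) -> bool:
    """True if the text and timestamp still match the digest taken at freeze time."""
    return _digest(record["prediction"], record["frozen_at"]) == record["sha256"]
```

Softening the wording after seeing the diff changes the digest, so a rounded-up near-miss shows up as a failed check rather than a quiet edit.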

Gamya • Jul 9

"A prediction you can still soften after the fact is as vague as one that was never specific" — that's the trap I would have fallen into without being told, absolutely. The instinct to retroactively round a near-miss up to a hit is so automatic you don't even notice you're doing it until the habit has been silently hollow for weeks.
Committing or timestamping the prediction before opening the diff is the right call — making it immutable is the only way to guarantee past-you can't negotiate with present-you after the answer is visible. Will report back on the hit rate. Based on everything you've described, I'm expecting the first week to be genuinely humbling.

Vinicius Pereira • Jul 9

You will run it well. One read for the humbling week: separate the misses that were wrong from the misses that were just too vague to grade. A specific prediction that was specifically wrong is the exercise paying off, you learned the code does something you did not expect. A miss you cannot even score cleanly taught you nothing, it only means the prediction was not sharp yet. Early on the real signal is not the hit rate, it is how many predictions were specific enough to lose cleanly. Report back, genuinely curious.

Gamya • Jul 9

"Specific enough to lose cleanly" is the better metric — hit rate measures confidence, this measures whether the predictions were real predictions at all. That's the prior question.

Vinicius Pereira • Jul 9

Exactly, and it is a gate, not a metric beside it. Until the predictions are specific enough to lose cleanly, the hit rate is not low or high, it is uninterpretable, you are scoring confidence on claims that could not have been wrong. First you earn the right to be scored, then the score means something.
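
The gate can be sketched as a scoring function that refuses to return a number until enough predictions were gradable. The `Outcome` labels and the 0.8 threshold are assumptions for illustration; the thread does not name a cutoff.

```python
from enum import Enum
from typing import Optional, Sequence


class Outcome(Enum):
    HIT = "hit"
    MISS = "miss"              # specific, and specifically wrong
    UNGRADABLE = "ungradable"  # too vague to score either way


def hit_rate(
    outcomes: Sequence[Outcome], min_gradable: float = 0.8
) -> Optional[float]:
    """Return the hit rate, or None while the gate is still closed."""
    graded = [o for o in outcomes if o is not Outcome.UNGRADABLE]
    if not graded or len(graded) / len(outcomes) < min_gradable:
        # Too few predictions could lose cleanly, so any score would be
        # uninterpretable rather than merely low or high.
        return None
    return sum(o is Outcome.HIT for o in graded) / len(graded)
```

Returning `None` instead of a number keeps a comfortable-looking early score from being read as a result.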

Gamya • Jul 10

"First you earn the right to be scored" — that reframe changes the whole setup. The feedback loop isn't broken if your hit rate looks fine early, it just means you haven't made real predictions yet.

Vinicius Pereira • Jul 10

Right, and it gives you something actionable to check instead of a vibe. The discipline that makes it work in practice: every prediction gets two fields filled in before it ships, what outcome counts as a miss, and when we will check. If either field is blank, it does not go in the ledger. Same rule as writing tests: a test that cannot fail is not a test, and a prediction that cannot miss is not a prediction. The early "fine-looking" hit rate you mention is exactly what that rule filters out.
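
As a rough sketch of the two-field rule, the ledger below rejects any entry whose miss condition or check date is blank. The class and field names are hypothetical; only the rule itself comes from the comment above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    claim: str
    miss_condition: str       # what outcome counts as a miss
    check_on: Optional[date]  # when we will check


class Ledger:
    """Accepts a prediction only when both required fields are filled in."""

    def __init__(self) -> None:
        self.entries: list[Prediction] = []

    def add(self, prediction: Prediction) -> None:
        if not prediction.miss_condition.strip():
            raise ValueError("miss_condition is blank: nothing could count as a miss")
        if prediction.check_on is None:
            raise ValueError("check_on is blank: no date to check against")
        self.entries.append(prediction)
```

The check is purely structural: it confirms the fields exist, not that their contents are any good.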

Gamya • Jul 11

Two fields, both required before it ships — that's the kind of constraint that actually holds under pressure because there's no judgment call about whether the prediction was good enough. Either both fields are filled or it doesn't count, full stop. The ambiguity of "was that a real prediction" disappears when the rule is structural rather than discretionary.

The test analogy is the right one too. A test that always passes isn't giving you information, it's just adding noise that looks like signal. A prediction you can always retroactively score as a hit is the same thing — comfort without feedback. The two fields force the prediction to do the thing a prediction is actually for, which is to be wrong in a way you can learn from rather than right in a way that teaches you nothing. Really taking this whole thread as a framework to actually implement rather than just agree with.

Vinicius Pereira • Jul 11

The two fields can be filled and still hollow: write a miss condition loose enough that nothing realistic trips it and the test-that-cannot-fail walks back in through the contents. So the miss field needs its own bar, it has to name an observable that could actually happen on the timeline you set. That is "earn the right to be scored" made operational.

Gamya • Jul 12

So the constraint isn't just that the field exists, it's that what goes in the field has to meet a standard — an observable that could actually happen on the timeline, not just a condition that sounds specific but is vague enough to never fire. Which means the discipline isn't really about the two-field structure at all, it's about being honest with yourself when you're filling them in, which is the hardest part of any system that depends on your own judgment to enforce it. The structure makes it harder to fool yourself accidentally, but it can't stop you from filling in something technically compliant that lets you off the hook. That last mile is still just you deciding whether you're actually doing the thing or doing a version of the thing that looks like the thing.

Vinicius Pereira • Jul 12

Right, and the honest move is to move as much of that last mile off your own judgment as you can. A miss condition that names a real observable is checkable by someone who isn't you: a reviewer, or a job that asserts the miss actually fires against a known-bad input before the field counts as valid. You can't automate the will to be honest, but you can make the dishonest version fail a check instead of just failing your conscience. The self-attested part shrinks to whatever you refused to make observable.
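
The job described here could look something like the sketch below, which accepts a miss condition only if it fires on every known-bad observation. The latency figures and function names are hypothetical stand-ins, not anything from the thread.

```python
from typing import Callable, Iterable, TypeVar

Observation = TypeVar("Observation")


def miss_can_fire(
    miss: Callable[[Observation], bool],
    known_bad: Iterable[Observation],
) -> bool:
    """True only if the miss condition fires on every known-bad observation."""
    cases = list(known_bad)
    if not cases:
        return False  # no evidence the condition can ever fire
    return all(miss(case) for case in cases)


# Hypothetical prediction: "p95 latency stays under 500 ms after this change".
KNOWN_BAD = [{"p95_latency_ms": 900}, {"p95_latency_ms": 2_000}]


def honest_miss(obs: dict) -> bool:
    """Fires whenever the stated limit is breached."""
    return obs["p95_latency_ms"] >= 500


def loose_miss(obs: dict) -> bool:
    """Technically filled in, but nothing realistic ever trips it."""
    return obs["p95_latency_ms"] >= 10_000_000
```

Here `honest_miss` passes the check and `loose_miss` does not, so the loose version fails visibly instead of relying on the author's conscience.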

Gamya • Jul 14

"You can't automate the will to be honest, but you can make the dishonest version fail a check instead of just failing your conscience" — that's the cleanest version of the whole principle I've seen across this entire thread. The conscience check is unreliable under pressure, under deadline, under the general human condition of wanting to be done. The external check doesn't care about any of that, it just fails or it doesn't.

The shrinking self-attested surface is the right way to think about the goal too — you're not trying to eliminate judgment, you're trying to minimize how much of the system depends on you being honest with yourself in a given moment, because that's the weakest link in any process that has to hold under real conditions. Whatever you refused to make observable is exactly where the process will eventually fail, and naming that honestly is already most of the discipline.

Vinicius Pereira • Jul 14

That's the honest way to draw it, both lines on the diagram beats a boundary that quietly claims more than it holds. The query-access-without-DB persona is the one I'd guard hardest, and keeping the resolve path auditable is the right first move there. Good thread.

Gamya • Jul 15

Yeah this thread turned out way better than I expected honestly — started as a comment on a conference talk and somehow became the most operationally useful conversation I've had on DEV. Appreciate you pushing every reply somewhere more specific than where it started.

Vinicius Pereira • Jul 15

Mutual. The best threads are the ones where nobody lets "roughly right" sit, and this had two of those. Ping me when you want to pressure-test the next one.

Gamya • Jul 15

Will do — and same goes the other way. The threads that go somewhere are the ones where someone pushes back on the comfortable version of the idea, and you did that consistently enough that the ideas actually got sharper rather than just getting agreed with. That's rare and I appreciate it.

Vinicius Pereira • Jul 15

Likewise, Gamya. Catch you in the next one that refuses to sit comfortable.

Gamya • Jul 16

Looking forward to it. 🌸

Mykola Kondratiuk • Jul 5

i’d push back on this - adding human review gates without designing what they’re actually checking just creates rubber stamps. seen it happen in a week. the bottleneck isn’t headcount, it’s verification design.

Nazar Boyko • Jul 1

Horthy's point about the reward signal never making it back across the gap is the real argument in the whole talk, and it's the part I'd want people to sit with. If a flaw that got vibed in only surfaces months later, there's no clean way to tell the model "that thing you did in March was wrong," so the training loop literally can't learn from its most expensive mistakes. That's a much stronger case for keeping humans in the planning seat than "AI makes errors," because it says the errors that matter most are structurally invisible to the process that's supposed to fix them. The 30 minutes of preplanning saving hours of review lands for the same reason, you're moving the catch to a point where the feedback is still cheap.

algorhymer • Jul 4

This is the full script of the conference

It is really important to read the original writings, since Daily Context's authors have a clear marketing bias to misuse and misrepresent facts for their own financial gains.

Gamya • Jul 4

This resonates with me from the other side of the experience curve. As someone still building foundational skills, I notice the temptation isn't just to "let the model write the code"—it's to let the model make the decisions you haven't developed the judgment for yet. Horthy's point about humans owning the planning stage assumes you already have the instinct for what good architecture looks like. For beginners, I think there's a real risk of skipping the years of small mistakes that normally build that instinct, because the model quietly absorbs the decision before you've had to sit with it. The harness and the human both matter, but if you never struggled through the wrong call yourself, it's not clear the "human in the loop" is actually contributing judgment—or just rubber-stamping.

Alex Shev • Jul 2

The human role gets more important as the factory gets faster. Someone still has to own taste, risk, priorities, and the decision to stop generating and ship the smaller correct thing.

Julian Neagu • Jul 3

The more I use AI, the more I treat it like a fast junior teammate. It can build a lot, but I still want humans making the early calls because changing direction later is always the expensive part.
