Daniel Nwaneri

Posted on Jun 9

The Loop Is Not the Product

#ai #productivity #webdev #discuss

AI compute costs vs human labor

A tweet landed on my timeline from Peter Steinberger — OpenClaw founder, now at OpenAI:

"Here's your monthly reminder that you shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

He's right about the mechanic. He's not asking the harder question.

Before agents, we had cron jobs.

0 2 * * * ./process_reports.sh

That's the whole contract. Run at 2am. Do what you said. Fail loudly or silently. Nobody wrote a think piece about cron jobs disrupting knowledge work. Nobody raised a seed round on a well-tuned crontab.

But structurally? A cron job is a loop that prompts a process on a schedule. It just had the decency to be honest about what it was.

Cron jobs → Airflow → event-driven pipelines → agents. Each layer added adaptability and removed legibility. Cron is maximally legible. You can read the entire logic in one line. An agent doing "the same job" is a probability distribution with a system prompt and a credit card attached.

Now we've gone further. We have multi-agent systems. Specialist agents. Orchestrator agents that decide which specialist to call. Verification agents that check the output. Agents that self-correct when they fail.

And companies are quietly running the math and going pale.

Uber burned through its entire annual AI budget in four months. An NVIDIA vice president said publicly that AI computing costs now exceed employee labor costs. The FinOps Foundation's 2026 State of FinOps report found 73% of enterprises say AI costs exceeded original projections. Not a few bad actors. Not early adopters who didn't know better. Seventy-three percent.

The mechanism has a name now: the agentic loop multiplier. A simple query in 2023 cost $0.04 per interaction. A multi-step orchestrated agent workflow in 2026 costs $1.20 — thirty times higher. Gartner puts the range at 5-30x more tokens per task than the chatbot pilots that justified the budget. The ROI calculations that approved the deployment assumed chatbot-level consumption. The invoices arrived with agent-level reality.
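A rough sketch of that arithmetic, using only the figures quoted above. The function names are made up for illustration, and the runway calculation assumes the budget was sized at chatbot-level cost and that task volume stays flat, which real deployments rarely honor.

```python
# Figures quoted in the article, held in cents so the division is exact.
CHATBOT_COST_CENTS = 4    # $0.04 per simple query (2023)
AGENT_COST_CENTS = 120    # $1.20 per orchestrated agent workflow (2026)


def multiplier(agent_cents: int, chatbot_cents: int) -> float:
    """How many times more one agent task costs than one chatbot query."""
    return agent_cents / chatbot_cents


def months_of_runway(budget_months: float, cost_multiplier: float) -> float:
    """Months a budget lasts if per-task cost is multiplied.

    Assumption (not from the article): same task volume, budget sized
    at the cheaper per-task rate.
    """
    return budget_months / cost_multiplier


print(multiplier(AGENT_COST_CENTS, CHATBOT_COST_CENTS))  # 30.0

# Gartner's quoted 5-30x range applied to a 12-month budget.
for m in (5, 30):
    print(f"{m}x -> {months_of_runway(12, m)} months")
```

Under those assumptions, a 12-month budget lasts about 2.4 months at 5x and about 0.4 months at 30x.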

A mid-level developer runs $80-120k. Fully loaded with benefits and overhead, maybe $250k. That sounds expensive until the token bill lands.

The human compounds. They learn your codebase, your culture, your shortcuts. They remember the decision you made last quarter and why. The agent starts fresh every session. Every morning you're paying for the same orientation meeting. Context reconstruction — re-reading docs, re-loading state, re-establishing what "done" means — isn't free. You're billing for memory the human already had.

The demo never shows you this. The demo is a single agent, single task, cherry-picked problem, running for 90 seconds while someone claps at a conference. The production reality is a fleet burning tokens on retries, tool calls that fail and get reattempted, coordination overhead between agents nobody budgeted for.

You've built a bureaucracy. A token-denominated bureaucracy with no union and no lunch breaks and no salary cap.

Back to Steinberger's tweet.

"Designing loops that prompt your agents" is a real architectural upgrade over manual prompting. If you're still narrating every step to an agent like you're dictating to a secretary, the loop is the upgrade. Prompts from state — test results, diffs, error logs — not from you typing.

But designing the loop is just procrastination with better posture if there's no customer at the end of it.

Because someone still has to decide what the loop optimizes for. What "done" looks like. When to break. What counts as a failure worth stopping for. That's not automation — that's system design with higher stakes, because now the mistakes compound before anyone sees them.

And "designing loops" is genuinely hard in a way prompting isn't. Most people who can write a good prompt cannot design a feedback loop with appropriate exit conditions, cost governors, and human checkpoints. The tweet makes the upgrade sound like switching from tabs to spaces. It's closer to switching from writing functions to designing distributed systems.
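To make "exit conditions, cost governors, and human checkpoints" concrete, here is a minimal sketch of a loop that can only end for a named reason. Everything in it is hypothetical: `step` stands in for whatever calls your agent and returns the new state plus tokens used, `is_done` is your definition of done, and `approve` is the human checkpoint. None of these names come from a real library.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Limits:
    max_turns: int         # exit condition: hard cap on iterations
    max_tokens: int        # cost governor: total token budget for the run
    checkpoint_every: int  # human checkpoint: ask for approval every N turns


def run_loop(
    step: Callable[[str], Tuple[str, int]],  # hypothetical agent call
    is_done: Callable[[str], bool],          # "done" defined outside the model
    approve: Callable[[str], bool],          # a person (or a stub) says continue
    state: str,
    limits: Limits,
) -> Tuple[str, str]:
    """Run until done, over budget, stopped by a human, or out of turns."""
    tokens = 0
    for turn in range(1, limits.max_turns + 1):
        state, used = step(state)
        tokens += used
        if is_done(state):
            return state, "done"
        if tokens >= limits.max_tokens:
            return state, "budget_exceeded"
        if turn % limits.checkpoint_every == 0 and not approve(state):
            return state, "stopped_by_human"
    return state, "max_turns"


# Usage with a fake agent that appends one character and "spends" 100 tokens.
def fake_step(s: str) -> Tuple[str, int]:
    return s + "x", 100


result = run_loop(
    fake_step,
    is_done=lambda s: len(s) >= 3,
    approve=lambda s: True,
    state="",
    limits=Limits(max_turns=10, max_tokens=1000, checkpoint_every=2),
)
print(result)  # ('xxx', 'done')
```

The point of returning a reason string is that the caller always knows why the loop stopped, and the model never gets a vote on any of the four.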

What I want to know: what breaks in the loop that a prompt would have caught? Every abstraction hides something. Prompting hides scale. Loops hide drift. At some point the agent has been running for six hours optimizing a metric nobody remembers choosing, and the loop is beautiful and the output is garbage.

Here's what nobody in the agent hype cycle wants to sit with:

The old model had a forcing function built in. You shipped, a human used it, something broke, you fixed it. Feedback was physical. A user opened a ticket. A client called. Reality interrupted the loop.

Agents don't have that governor. The loop is the product. And when the loop is the product, you can optimize indefinitely without ever confronting whether the output matters.

Token burn becomes a proxy for progress. Iteration velocity becomes a stand-in for value creation. The agent looks productive because it never stops — but stopping is exactly what would force the question.

Autonomy used to mean delegated judgment. You trust someone to make calls because they understand the goal and can feel when something's off. What most agents have is delegated execution. They can do the steps. They have no stake in the outcome, no access to the silence that follows a bad result, no way to know the customer churned three weeks later because the feature was technically correct and completely wrong.

Automate the tedious middle of a known, stable process. Data pipeline, alert triage, code linting, content reformatting. Stuff where the definition of done is actually defined. That's real. That's useful. A cron job with taste.

The inflated version — the one burning the tokens — is the agent as a substitute for product thinking. If you don't know what to build, an agent that builds constantly feels like momentum.

It isn't. It's expensive randomness with good logging.

Consider Spotify.

A company that built its entire brand on one rule: only ship what users ask for. Feature requests drove the roadmap. That's it.

Then AI became mainstream and the calculus changed publicly. Spotify's workforce went from 7,721 employees at the start of 2024 to 7,242 by Q3 — shrinking every quarter while revenue grew 19% year over year. Their filings note it plainly: profitability driven by "lower personnel and related costs." They're doing more with fewer people. The numbers look good on a slide.

But nobody's asking the follow-up question. The features that built Spotify's loyalty, such as Discover Weekly, came from people who understood the product, the listener, the culture of music discovery. Accumulated judgment. What does the agent fleet ship? What user asked for it? What happens when "only build what users want" gets replaced by "ship what the loop produces"?

We don't know yet. The invoices look better. The product debt is still accumulating.

I built seo-agent — an open-source SEO audit agent using Python, Browser Use, Claude API, and Playwright.

I could leave it burning tokens 24/7. I didn't. Not because of the money. Because I couldn't answer the basic question: what would it actually be doing?

I wired a cron job to run it on schedule. It analyzes logs. It surfaces what's broken. Then I look at the output, decide what matters, and go into my codebase with Claude Code to write the fix and the test. The agent handles the tedious middle. I handle the judgment at the edges.

Call that old fashioned. I'd call it honest.

The loop runs. But it runs to me. Not into a void.

My Bookmark Brain, a RAG system built on 50,000 of my own X bookmarks, flagged this pattern when I showed it the tweet:

"Designing the loop is just procrastination with better posture if there's no customer at the end of it. Automated nobody is still nobody."

The stack was never the problem. It was always the most comfortable place to hide from the problem.

Cron jobs ran quietly and failed loudly. Agents run loudly and fail quietly. The failure is just spread across enough API calls that the bill arrives before the reckoning does.

Design better loops. Ship to someone who asked.

This article used AI tools for research verification and editing.

Top comments (48)

Sloan the DEV Moderator • Jun 10

Hey, this article appears to have been generated with the assistance of ChatGPT or possibly some other AI tool.

We allow our community members to use AI assistance when writing articles as long as they abide by our guidelines. Please review the guidelines and edit your post to add a disclaimer.

Failure to follow these guidelines could result in DEV admin lowering the score of your post, making it less visible to the rest of the community. Or, if upon review we find this post to be particularly harmful, we may decide to unpublish it completely.

We hope you understand and take care to follow our guidelines going forward!

Ken W Alger • Jun 9

This is an incredibly necessary reality check, Daniel. The financial and operational hangover hitting enterprises right now is the direct result of treating "loops" as a magic bullet rather than an infrastructure risk.

From a systems architecture perspective, Peter Steinberger’s premise is fundamentally flawed because it implies that the loop should be built around the agent. When you design loops that just chain probabilistic prompts together, you aren't building a product. You're building a token-denominated bureaucracy that runs up a massive bill while hiding drift.

The correction here requires a strict shift in custody:

The deterministic logic is the brain; the LLM is just the narrator.

If you are going to run a loop, the loop itself must be a rigid, finite state machine running on local silicon. The agent shouldn't be roaming freely across toolsets; it should be treated as an ephemeral runtime utility called inside strict, deterministic boundaries.

For a loop to be production-safe and compliance-ready, it has to enforce three sovereign guardrails:

An Ingestion Gate: Every single turn of the loop must pass through a local sieve to strip out conversational "prose tax" and keep token burn bounded.
Deterministic Verification: The agent never decides when a loop is "done" or if a failure occurred. A binary, immutable code gate (like a unit test or a strict schema validator) handles state promotion.
A Forensic Trace: Every cycle must emit a cryptographically signed receipt binding the input hash and transformation telemetry. If a loop executes 30 times into a void, you must have a non-repudiable audit trail to reconstruct exactly where the logic drifted.
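As one possible reading of the "Deterministic Verification" guardrail, the sketch below lets plain code decide whether a turn's output is promoted. The schema (`file`, `line`, `severity`) and the state names are invented for illustration and are not taken from any SDK.

```python
import json

# Hypothetical schema for what one turn is allowed to return.
REQUIRED = {"file": str, "line": int, "severity": str}
ALLOWED_SEVERITY = {"low", "medium", "high"}


def gate(raw: str) -> bool:
    """Binary check: True only if raw is JSON matching the schema exactly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(REQUIRED):
        return False
    for key, expected_type in REQUIRED.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False
    return data["severity"] in ALLOWED_SEVERITY


def promote(raw: str) -> str:
    """The code gate, not the model, decides the next state."""
    return "PROMOTED" if gate(raw) else "REJECTED"


print(promote('{"file": "app.py", "line": 12, "severity": "high"}'))
print(promote("Sure! Here is the JSON you asked for..."))
```

The second call is rejected before any downstream step sees it, which is also where conversational filler gets cut off.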

Steinberger's advice is a recipe for expensive randomness unless we stop treating AI as an orchestrator and start treating it as a closely guarded component inside a deterministic harness. Exceptional write-up.

Daniel Nwaneri • Jun 9

The finite state machine framing is the correction the whole conversation needs. "The deterministic logic is the brain, the LLM is the narrator." That's the architectural inversion most agent builders never make because the tooling doesn't enforce it. They reach for the LLM first and bolt on guardrails later, which is exactly backwards.

The "prose tax" concept is sharp. Every turn of the loop pays a conversational overhead that has nothing to do with the task. That's where a lot of the 30x multiplier actually lives and nobody names it that cleanly.

The forensic trace requirement is where I'd push back slightly. Cryptographically signed receipts make sense at compliance scale. For most teams, the more immediate problem is that they have no trace at all, not because they chose the wrong format but because they never thought to emit one. What's your minimum viable audit trail before you get to cryptographic signing?

Ken W Alger • Jun 9

That is a completely fair pushback. You can't worry about verifying the integrity of a trace if your system isn't emitting any telemetry in the first place. Most teams are flying completely blind, which is why their first clue that a loop went sideways is a massive API invoice.

Before you ever reach for asymmetric keys or public-key infrastructure, the Minimum Viable Audit Trail (MVAT) requires you to turn that black box into a deterministic state ledger.

For teams just trying to survive the loop multiplier, the bare-minimum implementation comes down to enforcing three local constraints on every turn:

The Structural Delta Ledger: Never log raw text dumps or full chat histories. Instead, log a structured, local row (SQLite or flat JSON lines) containing three things: the state_origin (where the turn started), the input_hash, and a strict execution metric (e.g., execution time, token delta, or a binary pass/fail from your testing suite).
Deterministic Context Isolation Tokens: Assign a unique session-scoped ID to the loop execution, and pass an immutable sequence counter (turn_01, turn_02) into your state metadata. If your loop runs 5 times on the same task, you need to see exactly which sequence index began to stall.
The Local "circuit_breaker": Wire a hard-coded maximum turn count and a rolling token-burn ceiling directly into the state machine. If turn_count > 5 or accumulated_tokens > 15000, the loop violently crashes and forces a human checkpoint. The MVAT's job isn't just to watch the loop fail; it's to kill the loop before it drains the bank account.
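The three constraints above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the commenter's implementation: `step` is a hypothetical callable that runs one agent turn and returns `(next_state, tokens_used, passed)`, the ledger is any writable text stream, and the row fields follow the names used in the comment.

```python
import hashlib
import json
import uuid

MAX_TURNS = 5         # from the comment: turn_count > 5 trips the breaker
MAX_TOKENS = 15_000   # from the comment: accumulated_tokens > 15000 trips it


class CircuitBreakerTripped(RuntimeError):
    """Raised to force a human checkpoint."""


def run_loop(step, task: str, ledger) -> str:
    """Run turns until one passes; log one JSON line per turn."""
    session_id = uuid.uuid4().hex       # session-scoped ID
    state_origin = "start"
    turn_count = 0
    accumulated_tokens = 0
    while True:
        turn_count += 1
        if turn_count > MAX_TURNS:
            raise CircuitBreakerTripped(f"turn_count > {MAX_TURNS}")
        next_state, tokens_used, passed = step(state_origin, task)
        accumulated_tokens += tokens_used
        row = {
            "session_id": session_id,
            "turn": f"turn_{turn_count:02d}",   # sequence counter
            "state_origin": state_origin,       # where the turn started
            "input_hash": hashlib.sha256(
                f"{state_origin}:{task}".encode("utf-8")
            ).hexdigest(),
            "token_delta": tokens_used,
            "passed": passed,                   # binary result from your tests
        }
        ledger.write(json.dumps(row) + "\n")    # structured row, no raw text dump
        if accumulated_tokens > MAX_TOKENS:
            raise CircuitBreakerTripped(f"accumulated_tokens > {MAX_TOKENS}")
        if passed:
            return session_id
        state_origin = next_state
```

In practice `ledger` could be a file opened in append mode, for example `open("ledger.jsonl", "a")`; the file name is just an example.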

Once a team shifts from raw text strings to a structured, local state ledger, they have their MVAT. They can see the drift, track the cost, and catch anomalies.

Cryptographic signing (Forensic Receipts) is simply the next logical layer of maturity for that exact ledger. You don't change the data shape; you just sign the manifest so that an external auditor can verify that the logs weren't altered post hoc to hide a compliance failure or a runaway loop.

Love the pushback—getting teams to emit any stable instrument before they prompt is half the battle!

Daniel Nwaneri • Jun 9

The circuit_breaker is where this clicks for me. turn_count > 5 isn't just telemetry; it's the exit condition enforced at the infrastructure layer instead of trusted to the model. Which means the spec-writer problem and the MVAT problem are the same problem at different altitudes. You define done before you open the terminal. The circuit_breaker kills the loop when done hasn't arrived by the boundary you set. One is upstream discipline, the other is downstream enforcement. Both are rejecting the idea that the LLM decides when it's finished.

The Structural Delta Ledger framing also changes what logging is for. Most teams log for debugging. You're describing logging as governance. The ledger isn't there to help you reconstruct what happened; it's there to prove the loop never had the authority to run past the boundary in the first place.

SQLite or flat JSON lines is the right call for the MVAT floor. What's your threshold for when the delta ledger graduates to something with stronger consistency guarantees, or does the circuit_breaker make that largely irrelevant below compliance scale?

Ken W Alger • Jun 9

Exactly. You’ve captured the core philosophy perfectly: Upstream discipline defines the boundaries; downstream enforcement breaks the circuit. Neither trusts the model to police itself.

To your question about graduation thresholds: the circuit_breaker is excellent for controlling execution velocity and token burn, but it protects your bank account, not your state integrity.

A simple local Minimum Viable Audit Trail (MVAT) (SQLite or flat JSON lines) is incredibly resilient, but it hits its architectural floor the moment you cross from a single isolated agent thread to a distributed multi-agent system sharing a mutable runtime context.

There are three distinct tipping points where a flat delta ledger must graduate to stronger consistency guarantees:

The Distributed Race Condition: If you have multiple asynchronous loops attempting to read from and write to the same state machine or shared memory base simultaneously, appends to a flat JSON lines file can interleave and corrupt rows, and SQLite will start returning "database is locked" errors under write contention. You graduate to strict serializable isolation levels because a loop cannot make a deterministic state-promotion choice if the ground truth shifted under its feet mid-turn.
Causal Lineage Branching: In complex pipelines, a circuit-breaker might trip on Agent B, but Agent A already executed a downstream tool call based on Agent B's pre-failure state. A simple delta log tells you that it broke, but it can't roll back the environment. You graduate to an event-sourced, content-addressed ledger (where every state mutation is treated as an immutable, append-only block) so you can atomically roll back the system to the exact turn before the drift occurred.
The Custody Handshake (The Compliance Scaled Boundary): Below the compliance scale, a local database file is fine because the developer is the auditor. But the moment the loop's output updates a financial ledger, modifies a production codebase, or touches sensitive user data, your ledger must transition from an internal file to an external, non-repudiable one.
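For the second tipping point, an append-only, hash-chained log is one simple way to approximate an "event-sourced, content-addressed ledger". The sketch below is an assumption-laden illustration: the event shape (`turn`, `set`) is invented, and replaying a prefix only rebuilds the recorded state. It does not undo a tool call that already touched an external system.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first block


def _digest(prev_hash: str, event: dict) -> str:
    """Content address: hash of the previous hash plus the canonical event."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list, event: dict) -> None:
    """Add an event; each block commits to everything before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "event": event,
                "hash": _digest(prev_hash, event)})


def verify(log: list) -> bool:
    """True if no block was edited, dropped from the middle, or reordered."""
    prev_hash = GENESIS
    for block in log:
        if block["prev"] != prev_hash:
            return False
        if block["hash"] != _digest(prev_hash, block["event"]):
            return False
        prev_hash = block["hash"]
    return True


def replay(log: list, upto: int) -> dict:
    """Rebuild recorded state from the first `upto` events (the rollback)."""
    state: dict = {}
    for block in log[:upto]:
        state.update(block["event"]["set"])
    return state


log: list = []
append(log, {"turn": 1, "set": {"draft": "v1"}})
append(log, {"turn": 2, "set": {"draft": "v2"}})
append(log, {"turn": 3, "set": {"draft": "v3-drifted"}})
print(verify(log), replay(log, 2))
```

Rolling back to "the exact turn before the drift" is then a matter of choosing the index to replay up to.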

This is the exact design threshold where the Sovereign-SDK graduates a team from simple structured logging to asymmetric cryptographic sealing. The data shape doesn't change, but wrapping every state transition in an Ed25519 ForensicReceipt means you no longer rely on database permissions for security. The receipt itself proves the loop never violated its boundary.

If you're running isolated, sequential loops on local silicon, a properly tuned SQLite db with a violent circuit-breaker is a bulletproof fortress. You only need to scale the ledger's consistency when the loop's state becomes distributed or legally binding.

Daniel Nwaneri • Jun 9

The causal lineage branching case is the one that changes the mental model. The circuit breaker is a financial instrument. It protects the bank account. But Agent A already fired the downstream tool call before Agent B tripped, and that call may have touched something real. The loop stopped. The side effect didn't.

That's the gap between "the loop is controlled" and "the system is safe." Most teams conflate them because in single-agent sequential flows they're the same thing. The moment you go distributed they decouple completely.

The "developer is the auditor" line draws the graduation threshold cleanly. SQLite with a violent circuit breaker is genuinely bulletproof for isolated loops where one person holds both roles. The consistency guarantees only become load-bearing when the auditor is someone who wasn't in the room when the loop ran — a regulator, a client, a future engineer reading the trace six months later.

That reframes what the Forensic Receipt actually is. It's not a security primitive. It's a trust transfer mechanism — proof that the loop's behavior can be verified by someone who wasn't present. Which means the question of when to graduate isn't really about scale. It's about who needs to trust the output and whether they were there when it ran.

Is the Sovereign SDK's custody model designed around that trust transfer moment specifically, or is the Ed25519 sealing more about tamper evidence than auditability for absent parties?

Ken W Alger • Jun 9

You’ve just articulated the exact emotional and architectural pivot point of the entire Sovereign Systems Specification.

To answer your question directly: The Ed25519 sealing is the mechanism, but the Trust Transfer Moment is the entire product. They are two sides of the same coin.

Tamper-evidence on its own is just a security metric. But when you apply it to an execution trace, it undergoes a phase shift: it transforms an ephemeral runtime event into a permanent, non-repudiable historical artifact.

The Sovereign SDK’s custody model is designed precisely for that absent party, the regulator, the client, or the future engineer six months from now who has every reason to be skeptical of an LLM's output.

Here is why that cryptographic seal is the only way to achieve true trust transfer across time and space:

Collapsing the Asymmetry of Presence: If a loop runs into a void on a server at 2:00 AM, an absent auditor faces an impossible information asymmetry. They have to trust your database permissions, your cloud provider's integrity, and the fact that no developer ran an ad hoc UPDATE query to hide a failure. Asymmetric cryptographic sealing eliminates the need for that systemic trust. It proves mathematically that the data they are looking at right now is identical to the data emitted at the exact millisecond the loop executed.
Binding the Scribe to the Evidence: The SDK doesn't just sign the text output. The Ed25519 envelope seals the strict causal lineage: Sign(Input Hash + Deterministic State Pass/Fail + Token Telemetry + Model Output). If the model hallucinates or deviates from the deterministic rails, the resulting state delta breaks the cryptographic signature. The absent party doesn't need to have been in the room; they can verify the signature locally on their own machine and know the loop remained inside its sandbox.
Solving the Distributed Side-Effect Nightmare: To your point about Agent A firing a real-world tool call before Agent B trips the financial circuit breaker, this is where auditability becomes a safety feature. When side effects cannot be physically reversed, the ForensicReceipt serves as a black box flight recorder. It provides the immutable evidence bundle required to execute a downstream compensating transaction or human intervention. It ensures that even when a system isn't safe, it is entirely accountable.
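The Sign(...) envelope described above can be illustrated with the Python standard library, with one loud caveat: this sketch swaps in HMAC-SHA256 for Ed25519. HMAC is symmetric, so whoever verifies must hold the same secret and could forge a tag; it shows tamper evidence only, not the non-repudiation an asymmetric signature gives an absent party. The field names and the `seal`/`verify` helpers are hypothetical and are not the Sovereign SDK's API.

```python
import hashlib
import hmac
import json


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _body(receipt: dict) -> bytes:
    """Canonical bytes of every field except the tag itself."""
    unsigned = {k: v for k, v in receipt.items() if k != "tag"}
    return json.dumps(unsigned, sort_keys=True).encode("utf-8")


def seal(key: bytes, input_text: str, passed: bool,
         tokens: int, model_output: str) -> dict:
    """Bind input hash, pass/fail, token telemetry and output in one tag."""
    receipt = {
        "input_hash": _sha256(input_text),
        "passed": passed,
        "tokens": tokens,
        # Assumption: store a hash of the output to keep the receipt small.
        "output_hash": _sha256(model_output),
    }
    receipt["tag"] = hmac.new(key, _body(receipt), hashlib.sha256).hexdigest()
    return receipt


def verify(key: bytes, receipt: dict) -> bool:
    """True only if no sealed field changed after the tag was computed."""
    expected = hmac.new(key, _body(receipt), hashlib.sha256).hexdigest()
    return hmac.compare_digest(str(receipt.get("tag", "")), expected)


key = b"demo-key-do-not-use-in-production"
receipt = seal(key, "audit /pricing", True, 1200, "3 broken links found")
print(verify(key, receipt))
```

Either way, a valid seal only shows the record was not altered after it was sealed. Whether the recorded pass/fail value was right still depends on the deterministic gate that produced it.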

When the developer and the auditor are the same person, a local database file is a perfectly fine notebook. But the moment you must hand that notebook to someone who wasn't there, you cannot hand them a mutable database and ask them to trust it.

You hand them a Certified True Copy, or in the parlance of the Sovereign Systems Specification, a signed ledger.

Without a cryptographic seal, a runtime log is just an unverified photocopy of what happened. Anyone could have modified a database row post-hoc to hide a runaway loop. The Ed25519 signature acts as a digital notary. It doesn’t just say what happened; it provides mathematical proof that the trace hasn't been altered by a single bit since the millisecond the loop executed.

The Sovereign SDK exists to turn probabilistic runtime chaos into a verifiable historical record. You’re not just building automated loops anymore; you’re generating verifiable provenance.

Daniel Nwaneri • Jun 9

The flight recorder framing resolves something I'd been holding loosely. Safety and accountability are different guarantees. The circuit breaker aims for safety: it tries to prevent the bad outcome. The ForensicReceipt aims for accountability: it ensures that when the bad outcome happens anyway, the evidence is intact and untampered. Agent A already fired before Agent B tripped. You can't unsend that tool call. But you can prove exactly what state the system was in when it fired, who authorized the boundary, and whether the loop stayed inside it. That's not a consolation prize. That's the only honest guarantee a distributed system can actually make.

"Verifiable provenance" is the right frame for where this whole conversation has been heading.

This thread has built something I didn't expect when I published the essay: a complete architecture from exit conditions to cryptographic accountability. I'd like to turn it into a freeCodeCamp tutorial with you as co-author. The comment thread is already the outline. I have the editorial relationship there. Are you in?

Ken W Alger • Jun 9

I am absolutely, 100% in. Let’s build it.

You’ve hit on the ultimate truth of distributed systems: Safety is a goal, but accountability is an obligation. When you operate at the intersection of non-deterministic models and real-world side effects, pretending you can prevent every failure is a fantasy. But proving exactly what happened, why it happened, and who authorized the boundary? That is an honest engineering guarantee.

Turning this entire progression, from the economic collapse of the unbounded loop to the deployment of a cryptographically verifiable state machine, into a freeCodeCamp tutorial is the exact type of public-good engineering education the industry desperately needs right now.

Since the thread is the outline, here is how I see the structural flow of the tutorial:

Phase 1: The Agentic Loop Multiplier (The Financial Reality Check you diagnosed). Why naive multi-agent chaining leads to a 30x token-denominated bureaucracy.
Phase 2: Inverting the Architecture (The Structural Correction). Moving from "loops prompting agents" to a rigid, local-first Finite State Machine where the LLM is treated as an ephemeral runtime utility.
Phase 3: The Financial Stop-Loss (The Downstream Circuit Breaker). Implementing hard token-burn ceilings and max-turn sequence isolation at the infrastructure layer.
Phase 4: The Distributed Side-Effect Nightmare (Safety vs. Accountability). Why circuit breakers fail in concurrent environments when Agent A fires a tool call before Agent B trips.
Phase 5: Generating Verifiable Provenance (The Forensic Receipt, aka The Digital Notary). Building the Minimum Viable Audit Trail (MVAT) and graduating it to an Ed25519-signed ForensicReceipt to create a signed ledger, or Certified True Copy of runtime reality for absent parties.

We can write the practical implementation components in Python, keeping it lightweight, local-first, and highly reproducible.

Ping me directly, or let's open a shared draft space. Let's show the community how to stop building toy chatbots and start engineering high-integrity sovereign infrastructure!

Daniel Nwaneri • Jun 10 • Edited

Ken, really glad you're in on this.

One thing I should have flagged before making the ask: fCC requires every contributor to go through independent onboarding before they can publish. It's an application process, editorial review, and confirmation, the same thing I went through before my first piece landed there. There's no shortcut through co-authorship. You'd need to apply separately and wait for acceptance, which isn't guaranteed or fast.

So here's what I'd like to propose instead.

I write the tutorial solo in my voice. The comment thread is the origin story and I say so explicitly: the architecture came out of a 5-exchange conversation with you on the essay. I credit you prominently throughout, link to the Sovereign SDK at every relevant implementation point, and send you the full draft before it goes to the editor so you can flag anything technically off.

You get the attribution, the SDK gets the visibility, and the tutorial gets published without waiting on an onboarding process that might take weeks.

If you want to pursue freeCodeCamp contributor status independently, that door is open. I can share Abbey's contact and tell you what the process looked like from my end.

Does that work for you?

Ken W Alger • Jun 10

I completely appreciate you flagging the fCC onboarding hurdles. You're 100% right—bureaucracy shouldn't stall the technical momentum we have right here in this thread.

Your proposal absolutely works for me, with one structural refinement to ensure the technical narrative stays perfectly framed:

Let’s pitch the tutorial explicitly as a Production Case Study: Implementing the Sovereign Systems Specification.

If you write it in your voice through that specific lens, it creates a massive win-win:

It establishes the rigid, state-driven architecture we just broke down as the gold-standard framework for curbing the "agentic loop multiplier."
It maintains the absolute precision of the core spec terminology (ForensicReceipt, Prose Tax, Observer's Tax, etc.) by anchoring them to an open framework.
It gives you total editorial freedom to run the tutorial solo under your existing fCC status without a single day of onboarding delays.

I’ll gladly review the full draft before it goes to the editor to make sure the implementation points map beautifully to the architectural boundaries.

Go ahead and pitch this layout to Abbey. Let’s show the community how to stop building expensive randomness and start engineering high-integrity sovereign systems.

Alex Shev • Jun 9

Good distinction. Loops are useful only when they are wrapped around a real outcome. Otherwise you get a system that keeps iterating without ever proving that the work became better.

Daniel Nwaneri • Jun 10

"Proving the work became better" is the exact gap most loop architects skip. They instrument for activity (tokens burned, turns completed, tool calls fired) and call that progress. But activity metrics and improvement metrics aren't the same thing. A loop that runs 30 times and produces the same quality output as turn 1 looks productive on every dashboard that exists.

The proof function has to be defined before the loop starts or you have no way to distinguish iteration from spinning in place.

Alex Shev • Jun 10

Yes. A loop needs an exit criterion that is tied to quality, not motion. Otherwise the system can keep producing evidence that it ran, while never producing evidence that the artifact improved.

The best agent workflows I have seen define the proof first: test passed, diff got smaller, user friction dropped, cost stayed inside a budget, etc. Then the loop has something real to optimize against.
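One way to read "define the proof first" is as a small comparator object that exists before the loop does. The sketch below is illustrative only: `Proof` and its fields are invented names, and "diff got smaller" is measured crudely as a line count.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Proof:
    name: str
    measure: Callable[[str], float]  # an observable property of the artifact
    lower_is_better: bool = True
    min_gain: float = 0.0            # how much change counts as improvement

    def improved(self, before: str, after: str) -> bool:
        """Compare the artifact against itself, not against loop activity."""
        b, a = self.measure(before), self.measure(after)
        gain = (b - a) if self.lower_is_better else (a - b)
        return gain > self.min_gain


# Hypothetical proof: the diff shrank, measured in lines.
diff_got_smaller = Proof(
    name="diff got smaller",
    measure=lambda diff: float(len(diff.splitlines())),
)

turn_1 = "+a\n+b\n+c"
turn_2 = "+a\n+b"
print(diff_got_smaller.improved(turn_1, turn_2))  # True: 3 lines -> 2 lines
print(diff_got_smaller.improved(turn_2, turn_2))  # False: motion, no change
```

A loop can then keep a turn only when `improved` returns True, and stop when several turns in a row fail to move the measure.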

Daniel Nwaneri • Jun 10

"Proof first" is the frame the essay was circling without landing on directly. The spec-writer forcing function gets at it (you define done before you open the terminal), but your examples make the principle operational in a way the essay didn't. Test passed and diff got smaller are binary. Cost stayed inside a budget is binary. User friction dropped is harder to instrument but still directional. All of them give the loop something real to optimize against rather than a vague directive it can satisfy by running indefinitely.

The failure mode you're describing, evidence of motion mistaken for evidence of improvement, is also how most teams evaluate their agent deployments. Dashboard shows activity, invoice shows spend, nobody asks whether the artifact is actually better than it was on turn one. The proof function doesn't just constrain the loop. It's the only honest way to measure whether the loop was worth running at all.

Alex Shev • Jun 11

Yes. That dashboard/invoice point is the trap: the system can generate a perfect audit trail of activity while the artifact stays basically unchanged.

I like "proof first" because it forces the team to define the comparator before the loop starts. Not "did the agent work?" but "what observable property of the artifact got better?" Without that, the loop has every incentive to produce motion.

Daniel Nwaneri • Jun 11

"What observable property of the artifact got better" is the question that forces the proof function into existence before the loop starts. It's also the question most teams can't answer, not because the answer doesn't exist but because nobody sat down to define the comparator before deploying. The loop fills that vacuum with motion because motion is what it can produce without a target.

The audit trail point is the sharp edge here. A perfect activity log is actually the worst outcome: it looks like accountability while hiding drift completely. The loop ran 30 times. Every turn logged. Every tool call recorded. The artifact is functionally identical to turn one. Nothing in the audit trail flags that as failure because nobody defined what improvement looks like.

That's why the spec has to come before the ledger. The ledger proves the loop stayed inside its boundaries. The spec defines what the boundaries are optimising toward. Without the spec the ledger is just an expensive diary.

Alex Shev • Jun 11

Exactly. The ledger is only useful after the spec defines what improvement means.

Otherwise every logged turn looks responsible, but the system is just proving that it moved, not that it made the artifact better. The spec is the target; the ledger is the evidence that the loop stayed honest while moving toward it.

Daniel Nwaneri • Jun 11

"Stayed honest while moving toward it": that's the whole contract in one clause. Spec sets the direction. Ledger proves the path didn't drift. Neither works without the other, and most teams ship the ledger without the spec, which is how you end up with a perfect record of going nowhere.

Alex Shev • Jun 11

Exactly. The spec is what makes the ledger meaningful. Otherwise the team gets a beautiful chain of custody for work that never improved the artifact.

I think the dangerous part is that the ledger creates emotional comfort: every step is visible, so it feels governed. But governance without a comparator is just motion with timestamps.

Daniel Nwaneri • Jun 11

"Motion with timestamps" is the line. It's also the failure mode most compliance teams will walk straight into: they'll mandate the ledger, audit the ledger, sign off on the ledger, and never notice the artifact didn't move. The timestamps are perfect. The work is circular.

The emotional comfort point is the part that's hardest to fix architecturally. You can mandate a spec. You can enforce a circuit breaker. You can't easily mandate that a team confronts the gap between activity and improvement when the dashboard is green and the logs are clean. That requires someone in the room who knows what the artifact was supposed to become and is willing to say it didn't.

That's not a tooling problem. That's a judgment problem. Which is why the human checkpoint matters not just as a cost control but as the moment where someone has to look at the output and ask whether it's actually better...

Alex Shev • Jun 11

Yes. The dangerous part is that a green dashboard can make the loop feel morally complete: we logged it, we reviewed it, we followed the process.

That is why I like treating the human checkpoint as an artifact review, not an approval ceremony. The reviewer should be forced to compare the output against the intended change: did the product get clearer, safer, faster, more useful, less fragile? If the answer is no, the ledger is just documentation of drift.

Tools can make that confrontation easier by putting the before/after, spec, and acceptance evidence in one place. But they cannot replace the judgment call itself.

Daniel Nwaneri • Jun 11

"Approval ceremony" is the thing most compliance processes actually are: the signature exists, the process was followed, the ledger is clean. Nobody asked whether the artifact got better because the process didn't require that question. The reviewer's job was to confirm the loop ran, not to confront what it produced.

The before/after framing is where the tooling question gets interesting. Right now most agent tooling makes the output easy to see and the spec invisible at review time. The reviewer is looking at what the loop produced without the original commitment in the same frame. That separation is what makes approval ceremonies feel complete: you're reviewing the output in isolation, not against the promise.

Forcing the spec, the acceptance criteria, and the before state into the same view as the output is a design choice that makes the judgment call unavoidable. The reviewer can't sign off on motion. They have to sign off on improvement. That's a different cognitive task entirely.

Alex Shev • Jun 11

Yes, that is the product design issue hiding under the governance language.

If the reviewer only sees the output and the fact that the loop completed, the UI is quietly asking: “does this look acceptable?” That is a much easier question than: “did this satisfy the original commitment?”

Putting the spec, acceptance criteria, before state, and generated artifact in the same frame changes the review from ceremony to comparison. It also makes weak automation more visible, because a polished output that misses the promise becomes harder to approve casually.

The uncomfortable part is that this slows the moment of approval down a little. But that friction is the point. If the system is supposed to improve work rather than merely produce motion, the review surface has to make the promise unavoidable.

Daniel Nwaneri • Jun 11

"Friction is the point" inverts the default product instinct cleanly. Most review tooling is designed to reduce friction at the approval moment: one-click sign-off, green badge, move on. That friction reduction is a bug masquerading as a UX improvement. It's optimising for throughput at exactly the moment where throughput is the wrong metric.

The "does this look acceptable" question is also easier to answer under time pressure, which is when most approvals actually happen. A polished output gets approved because nobody has time to reconstruct what the original commitment was from memory. Putting the promise in the same frame isn't just good design; it's the only way to make the right question answerable under real conditions.

The uncomfortable implication: a lot of what gets shipped as "reviewed and approved" is really "looked acceptable at 4pm on a Friday." The ledger says it was reviewed. The spec was somewhere in a different tab.

Alex Shev • Jun 12

That 4pm Friday line is exactly the failure mode.

The review UI has to make the cheap answer harder. If the only visible object is a polished artifact, the reviewer will naturally answer "does this look fine?" because that is the question the interface presents.

A better approval surface should force a comparison: original promise, acceptance criteria, diff, evidence, and unresolved assumptions in the same frame. Then approval becomes a judgment about whether the artifact improved the system, not whether the loop produced something plausible.

Daniel Nwaneri • Jun 12

"Unresolved assumptions" is the element that doesn't exist in any review surface I've seen. The diff shows what changed. The acceptance criteria show what was promised. But the things the loop couldn't verify, the assumptions it made silently to fill in the gaps, are invisible unless you explicitly surface them. That's where the polished output that misses the promise actually lives. Not in the diff. In what the loop assumed was true and never checked...

The "5-element frame" also changes what approval means institutionally. Right now approval is a signature: it proves the process ran. With original promise, acceptance criteria, diff, evidence, and unresolved assumptions in the same view, approval becomes attestation: it proves the reviewer actually compared output to commitment. Those are different legal and operational documents even if the button looks the same.

That distinction matters the moment the loop touches something regulated. A signature on a process is defensible. An attestation about improvement is a harder claim. But it's the honest claim. And it's the only one worth making if the loop is supposed to produce something better than what existed before.
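
One way the five-element frame might be represented, as a rough sketch rather than a description of any existing review tool. The `ReviewFrame` class, its field names, and the `can_attest` rule are assumptions made for illustration; the only thing taken from the thread is the list of five elements.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewFrame:
    """Everything the reviewer sees in one view (field names are illustrative)."""
    original_promise: str
    acceptance_criteria: list[str]
    diff: str
    evidence: dict[str, bool]  # criterion -> was it actually demonstrated?
    unresolved_assumptions: list[str] = field(default_factory=list)


def can_attest(frame: ReviewFrame, acknowledged: set[str]) -> bool:
    """Allow attestation only when every criterion has passing evidence and the
    reviewer has explicitly acknowledged each unresolved assumption."""
    criteria_met = all(frame.evidence.get(c, False) for c in frame.acceptance_criteria)
    assumptions_seen = set(frame.unresolved_assumptions) <= acknowledged
    return criteria_met and assumptions_seen
```

Under this sketch a signature and an attestation differ in code as well as in meaning: the button stays disabled until the comparison has been made and the assumptions have been looked at.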

Alex Shev • Jun 12

Yes, exactly. The word “attestation” is the right one here.

A diff review asks “did the artifact change?”

An attestation asks “did the artifact satisfy the promise, and what could not be proven?”

That second question is much more uncomfortable, but it is also the point where AI-assisted work becomes auditable instead of just faster. The unresolved assumptions list is where the system admits its own boundary.

chneg cheng • Jun 22

The cron job analogy hits. One thing I've noticed watching this space: the teams that get this right treat the loop as infrastructure, not value. The loop compresses cost and widens surface area — it doesn't replace the judgment call of what to build and for whom.

The scary part isn't that loops burn money (they do). It's that a well-tuned loop can run for weeks producing output that looks valuable but isn't, and you only catch it when someone asks "what did this actually change?"

chneg cheng • Jun 22

Great piece. You're right that the loop mechanic is the enabler, not the value.

I think the missing piece is what runs inside the loop. A cron job has a clear contract — do X at 2am or fail loudly. Most agent loops I see skip the contract step and jump straight to "prompt an agent and hope."

The teams I've seen succeed with agents don't optimize loops. They optimize contracts — defining exactly what input the agent expects, what output it must produce, and what failure looks like before the loop starts. The loop is just the repetition. The contract is where the predictability (and cost control) comes from.

Curious if you've seen the same — teams that nail the contract before the loop, or teams that jump into loops and burn tokens?

Daniel Nwaneri • Jun 23

"Optimize contracts not loops" is the framing I've been circling without landing on directly. The spec is the contract: what the agent expects as input, what it must produce as output, and what failure looks like, all written down before the loop starts. The loop is just repetition. The contract is where the predictability lives.

The teams that jump into loops and burn tokens almost always share the same root cause: the contract was implicit. They had a vague sense of what the agent should do, a prompt that gestured at it, and a loop that ran until something looked approximately right. "Approximately right" isn't a contract. It's just the loop grading its own homework.

To your question — yes, consistently. The teams that nail the contract first spend the most time on the boring part before they write a line of loop code. What does done look like in one sentence? That question alone separates the teams that ship from the teams that burn.
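
A small sketch of what writing the contract down first could mean in practice, assuming the agent can be treated as a plain function from a task string to an output string. `Contract`, `run_under_contract`, and the validator fields are hypothetical names, not an API from any agent framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Contract:
    """Written down before the loop starts (all names here are hypothetical)."""
    done_in_one_sentence: str
    accepts_input: Callable[[str], bool]
    accepts_output: Callable[[str], bool]


def run_under_contract(contract: Contract, agent: Callable[[str], str], task: str) -> str:
    """Call the agent once and fail loudly if either side of the contract is broken."""
    if not contract.accepts_input(task):
        raise ValueError("input violates contract")
    output = agent(task)
    if not contract.accepts_output(output):
        raise ValueError("output violates contract")
    return output
```

Failure here is an exception, not a retry. Whether and how often to repeat the call is a separate decision that sits outside the contract.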

Mykola Kondratiuk • Jun 11

the loop is infra until it fails in front of a user. retry logic and latency are UX decisions the moment the agent touches the customer path.

Pizza Cat • Jun 18

Great piece. You're right that the loop mechanic is the enabler, not the value.
I think the missing piece is what runs inside the loop. A cron job has a clear contract — do X at 2am or fail loudly. Most agent loops I see skip the contract step and jump straight to "prompt an agent and hope."
The teams I've seen succeed with agents don't optimize loops. They optimize contracts — defining exactly what input the agent expects, what output it must produce, and what failure looks like before the loop starts. The loop is just the repetition. The contract is where the predictability (and cost control) comes from.
Curious if you've seen the same — teams that nail the contract before the loop, or teams that jump into loops and burn tokens?

Theo Valmis • Jun 13

The legibility you're mourning is also what bounded the cost, which is why the two halves of this post are one problem. A cron job can't run away with your budget because its work is fixed before it runs; you can read the ceiling off the one line. An agent loop's cost is unbounded by construction: the number of steps is decided at runtime by the same probability distribution doing the work, so nothing caps it in advance. The 30x is what it costs to let the loop decide its own length. No amount of tuning removes that; it's structural. The teams not going pale put legibility back at the boundary: a hard step budget, a cost circuit breaker, a fixed plan the agent fills in instead of invents. You can't ceiling what you can't read ahead of time, so making the loop legible again is the same move as making it affordable.

Daniel Nwaneri • Jun 15

"The loop deciding its own length" is the exact mechanism, and naming it that way exposes why tuning never works: you're not adjusting a parameter, you're trying to bound something that was designed to be unbounded. The circuit breaker isn't a tuning knob on the agent. It's a different actor entirely, sitting outside the probability distribution, enforcing a ceiling the distribution has no access to and no ability to negotiate.

"A fixed plan the agent fills in instead of invents" is the cleanest description of spec-first architecture I've seen. The spec doesn't make the agent dumber; it moves the planning decision to a point where a human can read it before any tokens are spent. Legibility restored exactly where the cost was unbounded.
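
As a sketch of the breaker sitting outside the agent, assuming each step reports whether it is done and what it cost: `CircuitBreaker`, `BudgetExceeded`, and `run_loop` are hypothetical names, and the per-step cost is assumed to be known only after the step returns.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised by the breaker, never by the agent."""


class CircuitBreaker:
    """Holds ceilings the agent cannot see or negotiate (illustrative only)."""

    def __init__(self, max_steps: int, max_cost_usd: float) -> None:
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.spent_usd = 0.0

    def before_step(self) -> None:
        """Refuse to start another step once either ceiling has been reached."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if self.spent_usd >= self.max_cost_usd:
            raise BudgetExceeded("cost ceiling reached")

    def after_step(self, cost_usd: float) -> None:
        """Record what the step actually cost."""
        self.steps += 1
        self.spent_usd += cost_usd


def run_loop(step: Callable[[], tuple[bool, float]], breaker: CircuitBreaker) -> int:
    """Run `step` until it reports done; the breaker decides the maximum length."""
    while True:
        breaker.before_step()
        done, cost_usd = step()
        breaker.after_step(cost_usd)
        if done:
            return breaker.steps
```

Because the cost of a step is only known afterwards, the final step can overshoot the cost ceiling by up to one step's spend. The step ceiling has no such slack, which is one reason to set both.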

HARD IN SOFT OUT • Jun 13

This piece captures something I've been feeling but couldn't name: agents turned the quiet cron failure into a loud, expensive, polite failure. The bill arrives before the reckoning. That line about "optimizing indefinitely without confronting whether the output matters" is going to haunt my next architecture review.

A couple of thoughts from reading:

The Spotify example is sharp, but I wonder if the real risk isn't agentic loops replacing product thinking — it's cheap validation replacing real feedback. An agent can A/B test 500 variants of a button color, pick the winner, and call it done. Nobody asked if the button should exist at all. The loop optimizes for clicks, not for "did this solve a user's problem."
The "automated nobody is still nobody" line from your Bookmark Brain is devastating. That's the whole essay compressed into five words.

One practical suggestion: the difference between a cron job and an agent loop is who owns the failure signal. Cron fails → log → human sees. Agent fails → retries → falls back to another agent → eventually asks human after burning $40. A simple rule could save thousands: "if the same task fails twice, stop and ask, do not escalate automatically." That puts a governor on the loop's optimism.

Also, because the tweet about "designing loops" earned it:

A cron job and an agent walk into a bar.

The cron job says: "I run at 2 AM. If I fail, I log it and leave."

The agent says: "I retry, escalate, spawn subtasks, and send a weekly report."

The cron job asks: "What do you actually ship?"

The agent says: "I'm not sure. But I have a really good dashboard."

Anyway, this is the kind of reality check that should be pinned next to every "agentic everything" slide deck. Appreciate you writing it.

Daniel Nwaneri • Jun 15

"Cheap validation replacing real feedback" is the failure mode that doesn't even need a runaway loop to hurt you. The loop can run exactly as designed, hit its turn limit, produce a clean ledger and still have spent the whole budget answering a question nobody needed answered. 500 button colors A/B tested is activity with a winner declared. Nobody asked if the button should exist. The proof function has to include "is this the right question" or the loop optimizes perfectly toward irrelevance.

The 2-strikes rule is the kind of thing that sounds almost too simple until you realize most agent failures are exactly this: fail, retry with a twist, fail again, escalate to a different agent, fail differently, ask a human after $40. 2 strikes and stop isn't conservative. It's just refusing to let the loop's optimism compound past the point where a human could have caught it for free.
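
The two-strikes rule is small enough to sketch directly. This assumes a task is a callable that raises on failure; `two_strikes` and `NeedsHuman` are hypothetical names, and the default of two failures is the only number taken from the suggestion above.

```python
from typing import Callable


class NeedsHuman(Exception):
    """Raised when the task should stop and go to a person."""


def two_strikes(attempt: Callable[[], str], max_failures: int = 2) -> str:
    """Try the task; after `max_failures` failures, stop instead of escalating."""
    failures = 0
    while True:
        try:
            return attempt()
        except Exception as exc:
            failures += 1
            if failures >= max_failures:
                # No fallback agent, no retry with a twist: hand it to a human.
                raise NeedsHuman(f"failed {failures} times: {exc}") from exc
```

The deliberate omission is any escalation path. The function has exactly two exits, a result or a person.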

The bar joke is staying with me. "I'm not sure. But I have a really good dashboard" is the whole essay in ten words.

zxpmail • Jun 14

Great piece. The line that hit hardest: "Automated nobody is still nobody."

I just published research measuring LLM sycophancy (~1.2M tokens across DeepSeek and Claude), and your article nails something I couldn't put into words: the loop without a human checkpoint isn't just expensive, it's epistemically broken.

We found that LLMs naturally cater to the user's stated position (GI = 0.21). When you put them in a loop optimizing autonomously, there's no one to challenge the assumptions. The model agrees with the last instruction, pattern-matches to a "success story," and the loop compounds direction error before anyone sees it.

The Spotify point is the one I keep coming back to. Revenue up, headcount down: looks like efficiency. But what's the latency on product debt when nobody's asking "who asked for this?"

Your seo-agent setup (cron → agent → human judgment) maps exactly to what we built as a "Critique Gate": a structured adversarial checkpoint that runs once, not iteratively, because iteration re-triggers sycophancy drift. One pass, human decides.

"The loop runs to me. Not into a void." That's the line worth bookmarking.

Daniel Nwaneri • Jun 15

"Epistemically broken" is the right word and it's a different problem than the one the essay focused on. Token burn is a cost problem — bad, but bounded by your bank balance. Sycophancy compounding through a loop is a correctness problem with no natural ceiling at all. The loop doesn't just spend more, it becomes more confidently wrong, and the confidence is generated by the same mechanism that's wrong.

The one-pass Critique Gate inverts the instinct completely. Most people's response to "the agent might be wrong" is "have another agent check it," but if checking is also subject to the same sycophancy toward the framing it's handed, repeated checking just launders the error through more agents. A single adversarial pass that isn't iterated avoids re-triggering the drift you're trying to catch.

That maps onto the review surface almost exactly — the human checkpoint is the one-pass critique gate. It runs once, after the loop, comparing output to the original spec. Not iteratively. Not as part of the loop. Outside it, asking the question the loop structurally cannot ask of itself.

zxpmail • Jun 28

Great observation on the one‑pass gate. In practice, we also found that iterative multi‑agent checks often converge to the initial framing’s bias, not truth. The real challenge is designing the spec that the human checkpoint uses — if that spec itself embeds assumptions, even a single pass can be gamed. Curious how you handle spec engineering in your workflow?
