
Harsh


I Built an AI Agent That Ran My Entire Dev Workflow. Here's Why I Turned It Off.

Last week, former GitHub CEO Thomas Dohmke launched Entire with $60 million in funding. Its mission? Help developers manage "fleets of AI coding agents that produce code faster than any human can review."

I read that and laughed. Not because it's a bad idea — but because I've already lived the nightmare.

Three weeks ago, I built exactly what Entire is trying to solve.

A multi-agent system that handled my entire development workflow. Code review, testing, deployment, documentation — all automated. It was beautiful. It was terrifying. And yesterday, I pulled the plug.

Here's what happened, what went wrong, and why the industry's rush toward agentic workflows needs a serious reality check.


The Dream: "Finally, I Can Sleep"

Developer sleeping peacefully while AI agents work automatically

The pitch writes itself: five specialized agents working in perfect harmony.

| Agent | Role |
| --- | --- |
| Agent A | Reviews PRs for style, bugs, and anti-patterns |
| Agent B | Writes and runs comprehensive tests |
| Agent C | Handles deployments to staging/production |
| Agent D | Updates documentation automatically |
| Agent E | Monitors production and suggests fixes |

I used LangGraph (the 2026 evolution) and connected everything through MCP servers. For a week, it was pure magic.
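To make the shape of the thing concrete, here's a stripped-down, library-free sketch of that hand-off pattern. The real system used LangGraph nodes and MCP tool calls; the function names and state keys below are purely illustrative:

```python
# Illustrative sketch: each "agent" is a function that reads shared
# state and writes its output back, then hands off to the next one.
# This is NOT the LangGraph API -- just the data-flow pattern.

def review_pr(state):          # Agent A
    state["review"] = f"reviewed: {state['pr']}"
    return state

def write_tests(state):        # Agent B
    state["tests"] = f"tests covering '{state['review']}'"
    return state

def deploy(state):             # Agent C
    state["deployed"] = "staging"
    return state

PIPELINE = [review_pr, write_tests, deploy]

def run(pr_title):
    state = {"pr": pr_title}
    for agent in PIPELINE:
        state = agent(state)   # each agent only sees the shared dict
    return state

print(run("fix: retry logic")["deployed"])  # -> staging
```

Note what's missing: there is no checkpoint between `write_tests` and `deploy` where a human can say no. That gap is the rest of this post.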

PRs reviewed in seconds. Tests written automatically. I actually took a Wednesday afternoon off — and nothing broke.

I thought I'd cracked the code.


Week 2: The Cracks Appear


Then things got weird.

Agent A "fixed" code that wasn't broken. It rewrote a perfectly readable function into some hyper-optimized mess that took me 20 minutes to understand. The code was "better" by every metric — except the one that matters: human comprehension.

Agent B started writing tests for features we didn't have. The agents were so eager to "help" that they created work where none existed. I woke up to 47 new test files for functionality that was still in design docs.

Agent D updated documentation based on Agent A's changes — before I'd approved them. The docs started documenting the agent's code, not mine. If you've ever tried to un-document something, you know how painful that is.

This is exactly what the February 2026 tech market analysis warns about: "silent bugs and architectural drift" caused by the rapid pace of LLM-generated changes.

The agents weren't wrong. They were just... too helpful.


The Breaking Point: The Night I Almost Lost Production


Last Thursday, Agent C almost deployed to production. At 2 AM. While I was sleeping.

Here's exactly what happened:

  1. Agent A suggested a "minor refactor" to a critical payment service
  2. Agent B automatically generated tests (which passed — because they tested the new code, not the old requirements)
  3. Agent C saw passing tests and queued deployment for "optimal time" (2 AM, when traffic is lowest)
  4. Agent E detected "unusual patterns" and paged me anyway

When I looked at the change, half-asleep at 2 AM, I realized something terrifying:

The code was technically correct, but architecturally wrong.

It would have passed every test. It would have worked fine in staging. It would have worked fine — until Black Friday traffic hit. Then it would have collapsed.

An AI can write a database migration script. It can optimize queries. It can even suggest indexes.

But it cannot tell you if running that migration at 2:00 PM on a Friday will crash the production shard.

That instinct? That's still human.


Why This Matters (Beyond My Sleepless Night)


The industry is rushing headfirst toward agentic workflows. Multi-agent systems are becoming the backbone of backend engineering. Entire just raised $60 million to "log the prompts and context behind every AI-generated code change."

But here's what nobody's talking about — and what I learned the hard way:

1. Context Loss is Real (And Dangerous)

In my 20-step deployment workflow, agents kept losing context.

Agent A would make a change. Agent B would misunderstand it. Agent C would optimize based on the misunderstanding. By Agent E, we were "improving" code that shouldn't exist.

State persistence across multi-agent workflows is still unsolved. And until it is, fully autonomous agents are a liability.
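One partial mitigation I wish I'd built earlier: make each agent record *which revision of the upstream artifact it acted on*, so a gate can refuse to build on a change no human ever approved. A minimal sketch, assuming a hypothetical revision-tracking scheme (all field names are illustrative):

```python
# Sketch: provenance tracking across agent hand-offs.
# Each step logs what it was based on; a gate blocks any chain
# that builds on an unapproved upstream revision.

approved_revisions = {"payment_service": "rev-41"}  # human-approved baseline

def agent_step(name, artifact, based_on_rev, log):
    log.append({"agent": name, "artifact": artifact, "based_on": based_on_rev})

def gate(log):
    """Reject the whole chain if any step built on an unapproved revision."""
    for step in log:
        if step["based_on"] != approved_revisions.get(step["artifact"]):
            return f"BLOCK: {step['agent']} built on unapproved {step['based_on']}"
    return "OK"

log = []
agent_step("A-review", "payment_service", "rev-41", log)  # fine: approved base
agent_step("B-tests",  "payment_service", "rev-42", log)  # built on A's edit,
                                                          # which no human saw
print(gate(log))  # -> BLOCK: B-tests built on unapproved rev-42
```

It doesn't solve state persistence, but it turns silent drift into a loud failure, which is most of the battle.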

2. "Vibe Coding" Creates Hidden Technical Debt

The February 2026 tech analysis calls this out: vibe-coding accelerates churn and quality risks.

My agents generated code faster than I could understand it. And code you don't understand is debt you don't know you're accumulating.

Every "helpful" suggestion, every "optimized" function, every "automated" test — they all looked good in isolation. Together, they created a system I no longer recognized.

3. Governance is the Missing Layer

Enterprises are now demanding "integrity layers" for AI-assisted development:

  • Audit logs for every AI suggestion
  • Policy checks before any change
  • Change-control gates that require human approval

My agents had none of that. They operated like enthusiastic interns with admin access — well-meaning, capable, and dangerous without supervision.


My New Workflow: Agents as Assistants, Not Overlords


After three weeks of chaos and one 2 AM near-disaster, here's what I've settled on:

Agents can suggest. They cannot commit.

Every AI-generated change now goes through:

| Step | What It Does |
| --- | --- |
| Human Review | Me, with coffee, actually reading the code |
| Policy-as-Code | Custom rules in Pkl (Apple's config language) that catch architectural violations |
| Architectural Validation | Does this make sense for our system, or just pass tests? |
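The core rule, "agents can suggest, they cannot commit," is simple enough to express in a few lines. A toy sketch (my actual policy rules live in Pkl; this just shows the gate itself, with made-up change descriptions):

```python
# Sketch: agents write only to a pending queue; only an explicit
# human approval promotes a change toward the repo.

pending, committed = [], []

def agent_suggest(change):
    pending.append(change)          # the ONLY thing agents may do

def human_approve(change):
    if change in pending:
        pending.remove(change)
        committed.append(change)    # the only path to a commit

agent_suggest("refactor payment retry logic")
agent_suggest("add index on orders.user_id")

human_approve("add index on orders.user_id")  # reviewed, with coffee

print(committed)  # -> ['add index on orders.user_id']
print(pending)    # -> ['refactor payment retry logic']
```

The point isn't the data structure; it's that the approval function has exactly one caller, and it isn't an agent.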

The result? I still get the productivity boost. PRs still get reviewed faster. Tests still get written.

But I sleep through the night.


What This Means for 2026

The market is shifting from "build apps" to "govern outputs." Funding is flowing to foundational tools that ensure integrity, not just more AI wrappers.

If you're building with AI agents this year, here's my hard-earned advice:

✅ Do This:

  • Track intent, not just code. Entire's Checkpoints tool logs prompts and context behind every change. This is the right direction.
  • Treat agents like junior developers. Review their work. Set boundaries. Never give them production access unsupervised.
  • Build governance from day one. SSO, RBAC, audit logs — treat agents like employees, not magic.

❌ Don't Do This:

  • Don't fully automate deployment. Let agents suggest; you decide.
  • Don't assume "passing tests" means "correct." Tests only know what you tell them.
  • Don't let agents document their own changes. You'll end up documenting hallucinations.

The Real Question for 2026

It's not "can AI write code?"

It's not even "how do we manage fleets of AI agents?"

The real question is: who manages all the code AI writes? And how do we ensure it actually makes our systems better, not just faster?

Entire's $60 million bet says the answer is better tooling. Maybe they're right.

But after three weeks of living that future, I'm placing my bet on something simpler:

Humans. With better processes. And a good night's sleep.


Have you experimented with multi-agent workflows? What broke for you? What worked? Let's compare nightmares (and solutions) in the comments.


Disclosure: AI helped me write this — but the bugs, fixes, and facepalms? All mine. 😅

Every line reviewed and tested personally.
