A week ago I committed a YAML file listing 50 features I wanted built, set a Claude Code Routine to wake up twice a day, and went to bed.
The next morning, a new feature was in production. The morning after, another one. Five days in, the pipeline was still running. I had written zero lines of production code that week.
When this first started working, I genuinely sat staring at the deploy log for longer than I'd admit. A commit had landed in main overnight on the RivoCalc repo. The author wasn't me. The feature worked. CI was green.
Then the cracks started.
The setup
A scheduled cloud agent reads a YAML backlog twice a day. Picks the next pending entry. Builds it — component, registry, build verification. Commits, pushes, deploys via Vercel.
I wrote the prompt once. I haven't touched it since. The features ship themselves.
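The backlog itself is nothing clever: one entry per feature, worked through in order. A trimmed-down sketch of the shape (field names are illustrative, not anything Claude Code requires):

```yaml
# backlog.yml, trimmed. One entry per feature; the loop only needs a way
# to tell "pending" from "done". Calculator names here are made up.
features:
  - id: loan-repayment-calculator
    title: "Loan repayment calculator"
    status: done        # shipped on an earlier run
  - id: compound-interest-calculator
    title: "Compound interest calculator"
    status: pending     # next pending entry, so the next run picks it up
  - id: bmi-calculator
    title: "BMI calculator"
    status: pending
```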
I'm going to walk through the four problems I had to fix before this loop became reliable. Each is the kind of thing nobody writes about because it's too specific to mention until you've hit it, and then suddenly it matters a lot.
1. The model copies your examples, even when you tell it not to
Day three, I noticed strange HTML comments leaking into the production page source:
<!-- BENCHMARK: 900 words | 5 subtopics | 2 gaps -->
This was internal prompt scaffolding — targets the agent was supposed to use while writing, not include in the output. By the time I caught it, nine pages were live with that comment in their HTML.
Tracing it back, my prompt had a line that said roughly: "register your word-count target and topic structure in a comment at the top of the content, then write the body." And it gave an example. A literal HTML comment with placeholder values.
The agent followed the example. Every. Single. Time.
My first fix was to add a warning. "This comment is for your reference only — don't include it in the final HTML." It didn't work. The agent kept doing it.
A concrete example beats an abstract warning almost every time. The example is a working pattern the model can imitate. The warning is a rule the model has to remember to apply. The pattern wins because it's right there in the prompt, doing useful work.
The real fix was to delete the example. I rewrote the instruction with no template at all: "Keep these targets in mind while writing, but never reproduce them anywhere in the output." No "here's what it looks like." No "avoid this." Just the abstract instruction.
Zero leaks after that.
Don't show a model what you don't want it to do. Even as a counter-example. The example is the lesson.
2. Your cloud sandbox has rules it didn't tell you about
At first, the agent pushed straight to main. Then those pushes started returning 403, while pushes to other branch names still went through.
I went looking. Anthropic's docs say Claude Code's cloud sandbox restricts pushes to branches prefixed claude/. By design. For safety — to prevent a runaway agent from rewriting your protected branches.
Which is reasonable. Which I would have seen if I'd read the security docs before building the pipeline.
But the result was a real problem: every overnight build that hit the wall was going to silently fail. The agent would build the work, get the 403, give up. Feature wouldn't ship. No notification.
The fix had two halves.
On the agent: explicit fallback in the prompt. If the push to main fails after retries, create a session branch named claude/run-<timestamp> and push there instead. The work stays committed somewhere.
On the CI: a GitHub Action that watches for pushes to claude/* branches. When one appears, the Action runs the build, merges the branch into main, deletes the source, and opens a GitHub issue titled Published: <feature> as a notification.
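A trimmed-down sketch of such a workflow (illustrative, not the exact file; the build step is whatever your project needs):

```yaml
# .github/workflows/publish-claude-branches.yml
# Sketch of the auto-merge Action described above, not the exact file.
name: Publish claude branches

on:
  push:
    branches:
      - "claude/**"

permissions:
  contents: write   # push the merge to main, delete the source branch
  issues: write     # open the notification issue

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # full clone, so both main and the session branch are available

      - name: Build the pushed branch
        run: |
          npm ci
          npm run build

      - name: Merge into main and delete the branch
        id: merge
        run: |
          git config user.name  "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git checkout main
          git merge --no-ff ${{ github.ref_name }}    # exits non-zero on conflict
          git push origin main
          git push origin --delete ${{ github.ref_name }}

      # In practice the issue title carries the feature name; the branch name stands in here.
      - name: Open the "Published" issue
        run: gh issue create --title "Published: ${{ github.ref_name }}" --body "Merged to main and deployed."
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```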
The flow now:
Agent builds → pushes to claude/run-<timestamp> → Action picks up → builds → merges to main → deletes branch → emails me.
Ten to fifteen minutes from agent waking to feature live. Zero involvement.
The broader point: every service in an automated pipeline has its own defaults. The sandbox has security rules. The git host has rate limits. The deploy platform has timeouts. You won't know about most of them until they fire. Build the fallbacks anyway.
3. Build artifacts in git will eventually betray you
A few days later, the auto-merge Action started failing with:
Your local changes to public/sitemap.xml would be overwritten by checkout
The cause was banal. My project uses next-sitemap in postbuild to regenerate the sitemap every time. Each build produces a new file with a fresh timestamp. The sitemap was tracked in git. So every CI build introduced local changes that blocked the next git checkout.
This problem doesn't exist when only you run builds. Your laptop's sitemap matches the one in git because you committed it last time.
CI runs builds too. And when CI tries to switch branches before merging, the regenerated sitemap blocks the checkout.
Two minutes to fix once I understood it: git rm --cached public/sitemap.xml plus one line in .gitignore.
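Spelled out, for anyone hitting the same wall:

```bash
# Untrack the generated sitemap without deleting the local copy,
# then tell git to ignore it so a build can never dirty the tree again.
git rm --cached public/sitemap.xml
echo "public/sitemap.xml" >> .gitignore
git add .gitignore
git commit -m "Stop tracking generated sitemap"
```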
Build outputs aren't source. They're derivatives of source. They belong on the build server, not in version control. Most developers know this. Most of us track them anyway when nothing forces us to be disciplined — and then automation forces us, and we lose half an hour figuring out why CI suddenly stopped working.
4. Don't make the agent idempotent. Make the merge step refuse collisions.
I manually triggered the agent at 15:02. I didn't notice the scheduled run was set for 15:00 with the usual few-minute delay. Both runs started within two minutes. Both read the same backlog. Both built the same next item.
The first run finished, pushed its branch, Action ran, merged. Feature live.
The second run finished a few minutes later. Pushed its own branch. Action ran. Build passed. The merge step failed: the files it was adding already existed in main. The Action opened an issue: Merge conflict in claude/run-<timestamp>.
No damage. The first run's work was safe in main. The second run's duplicate work was rejected at the gate. I deleted the orphan branch manually and moved on.
I could have made the agent itself idempotent. Locks, queues, coordination logic. That's expensive to design and easy to get subtly wrong.
Instead, the merge step refuses collisions. Cheap. Simple. And robust to every race condition I haven't thought of yet — they all hit the same gate.
Detection at the boundary beats correctness everywhere upstream.
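In workflow terms, the gate is one more step tacked onto the sketch above: the merge exits non-zero when the incoming branch conflicts with main, and a follow-up step turns that failure into the issue. Something like (again illustrative, not the exact file):

```yaml
      # Appended to the job in the earlier sketch. Runs only when the merge
      # step itself failed, e.g. the branch re-creates files main already has.
      - name: Flag a refused merge
        if: ${{ failure() && steps.merge.outcome == 'failure' }}
        run: gh issue create --title "Merge conflict in ${{ github.ref_name }}" --body "Build passed but the merge was refused. Branch left in place for manual review."
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```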
What's actually changed
A month ago, my model of AI in development was "the AI helps me code". Pair programming. Reviewing each commit. Always in the loop.
It's different now. "The AI ships code while I sleep." I'm at the perimeter. Watching for signals that something needs my attention.
The agent ships. CI catches failures. GitHub issues notify me when something needs eyes. I check the inbox in the morning. Sometimes I intervene. Most days I don't.
What I work on shifted too. I write fewer prompts. I think more about three questions:
- What can the agent do without me?
- When it goes wrong, how will I know?
- When I know, what tools do I have to recover?
Prompts are short and stable. Observability is where the engineering goes now. Escape hatches are the architecture.
The live result is at rivocalc.com — a calculator aggregator built and maintained autonomously. The calculators themselves aren't the interesting part. What's interesting is that nobody's writing them by hand.
Stack: Next.js 16.2, TypeScript, Tailwind v4, Vercel, Claude Code Routines, GitHub Actions.
If you're building something similar and you've hit different walls, I'd like to read about them. Most of this is still being learned in the open.