DEV Community: Teemu Piirainen

AI took my job. Twice in one year. I still have one.

Teemu Piirainen — Tue, 07 Jul 2026 06:22:52 +0000

What a year of coding with AI did to my work, my role, and me.

TL;DR: AI took over my job twice in one year: first the writing, then the checking. I had to relearn my role three times.
What remained for me was the harder part: deciding, trusting, and staying responsible for work produced by the system.
It multiplied what I can produce. It did not make the work lighter.

One evening, somewhere past seven, I had six separate AI agent teams running at once and no idea what any of them were doing.

Not in a panicked way. The agents were working, every check was green. One was rebuilding the way we store data, another was wiring up a feature I'd half-forgotten I asked for, a third was so deep into something I'd have to scroll back to even remember what it was. They were all making progress. That was the problem. They were producing faster than I could read, and at some point that evening I realized I'd stopped trying.

For most of my career I've been a software developer, the kind who writes code by hand. Back in the spring of 2025, I still read every line the AI wrote. I treated it like a very fast developer who couldn't be trusted: I wrote the prompt, it wrote the code, and I read all of it, every function and every edge case, because that was the job and because reading it was the part that still made me feel like a developer.

Now I read almost none of it. And I'm no longer sure I'm the same developer I was.

This is a story about what happens after the AI tools start working: every bottleneck I removed didn't disappear, it moved closer to me, until the thing slowing everything down wasn't the AI at all. It was me.

I'll tell this as a developer's story, because that's what I was. But I don't think it stays a developer's story for long. If AI is anywhere near your work, whether you write code, run a team, or hire the people who do, some version of this is heading your way.

A year ago, trust meant reading everything

A year ago the work looked nothing like this. One project, one session, one thing at a time. What fascinated me then was the AI model underneath and the prompt itself: phrase a request well and it did something clever, phrase it badly and it flailed. A good prompt was a small art.

The jump in productivity felt enormous at the time. Only later did I see how small it really was next to what was coming. But it didn't matter much then, because I wasn't optimizing for output. I was optimizing for staying the one who decided how each thing got built. Being good at the job meant writing good code with my own hands. That was the work, and the pride.

So I read everything. Not to catch the AI in a mistake, but because reading was how I understood the code, and understanding was the only kind of trust I had. If I could follow the logic with my own eyes, I could rely on it; if I couldn't, I couldn't. Trust, back then, was just another word for understanding.

Somewhere in those months, without any day I could point to, the writing itself quietly stopped being mine. That was the first time AI took my job. What was left to me was the reading, and I held onto it like it was the whole profession.

Then the agents got good enough to work for an hour on their own, and the trade I never saw coming arrived: trust is throughput. The more I trusted the system, the more I let it run before stepping in, and the more it produced. One stream of work became two, then several. The output curved upward, and for a while it felt like pure gain.

And then I crossed the second line without noticing: I stopped reading every line of code. There was no decision I could point to, just a week where I realized I'd shipped things I'd only skimmed and the sky hadn't fallen. For an entire career, reading the code had been the job, and letting it go felt like negligence in the costume of efficiency.

The bottleneck had only moved. The code stopped being the thing that slowed us down. I did. The factory could now produce faster than I could approve, and the fix wasn't to read faster. It was to build a system I could trust without reading it, one whose reliability came from process rather than my own eyes.

So I stopped writing code and built the thing that writes it

It started small. Instead of single tasks, I gave the AI a process: one agent built, another tested, another reviewed, and an orchestrator decided what happened when. I wasn't writing prompts anymore. I was taking the way I work, the order I do things in and the checks I refuse to skip, and encoding it into workflows the agents followed. The prompt was me telling the AI what to do once. The workflow was me teaching it how I think.

Trust without reading meant building the reading into the factory: validation agents that checked the work, gates that caught an error and bounced it back before it reached me. The AI model still made mistakes; they just no longer needed my eyes to get caught.

That was the second time AI took my job. First the writing, now the checking: the two halves of what I'd always understood a developer to be.

And then there were the rules, which turned out to matter most. Not rules about what the agent could build, but about what it was allowed to decide. An implementation detail it could settle on its own; a small change inside the existing structure, usually too. But anything that touched product direction, the user experience, the architecture, or security had to stop and come back to me.

The hard part wasn't writing those rules. It was that the whole factory's value lived in getting that line right: ask too much and I became the queue again; ask too little and the trust quietly fell apart.

It worked. By now a single agent could run for four hours on one task without stopping, sometimes ten, and the factory kept several of them going at once, all producing good work I never touched. And then, on schedule, the bottleneck moved again, this time to the front of the pipeline. The factory wasn't waiting on better code or on my approval. It was waiting on me to tell it what to build next. The factory doesn't stop because it can't code. It stops because it has no next approved decision.

Then I was a manager of machines, not a developer

By now my job had quietly stopped being software development. I was making product calls, decisions about the customer experience, judgments about what the business actually needed, not because anyone had promoted me but because the AI worked better the more I could tell it about how we work and where we're going.

It also pushed me sideways, into design and domain calls a year earlier I'd have left to someone who actually knew. That is a real gift with a quiet edge: AI can make you feel fluent in a field where you have no judgment at all, and the feeling is convincing.

The agents themselves had started to look less like a tool and more like an organization. Each one needed a role, and once I gave them real reach, permission to read this, change that, run the other, they needed something closer to identities. An agent with no permissions is a toy; an agent with permissions is an employee. Each session woke with a blank memory, so each needed onboarding like a new hire: here's your role, here's what you may touch, here's what you must never go near. I was, absurdly, doing something close to HR for software.

And the center of gravity shifted again. The AI model underneath had stopped mattering much: Claude 4.6, 4.7, whichever was strongest that month, the factory ran fine on any of them. The code stopped being precious too, because the factory could regenerate it on demand.

What mattered now was the factory itself: the rules, the workflows, and the memory of every decision made and mistake learned along the way. All of it lives in version control like everything else, but that was never where its value sat. You can copy the files in a second. What you can't copy is the year it took to get them right, one correction at a time.

I'd set out to write better software. I'd ended up running a small organization whose workers happened to be machines.

Until the day the factory failed

There's a version of this story where the factory just works, and I'd be lying if I told it that way.

One of the things it built was a customer-facing chat feature: a chain of steps, each meant to check, shape, or route a message before it reached a person. (LangGraph, for anyone who wants the detail.) The flow looked right. The tests were green. I didn't read it closely, because by then I didn't read anything closely.

What I hadn't done was write the architecture down. Nothing said: in this kind of case, the message has to pass through these specific steps before it goes out. So the factory did what it always does without a rule. It reasoned, decided the checks weren't needed here, and shipped a flow that sometimes sent the AI model's raw output straight to the user, past the very steps built to stop exactly that.

We found it in production.

The part that stays with me isn't the bug. It's the shape of it. The AI model didn't fail; it did exactly what I'd authorized it to do: make the call and build it from the documentation. I'd handed over the decision without handing over the architecture it needed to decide correctly, and the factory filled the gap with a guess. The failure was mine.

Now the slow part was me

For everything I'd handed off, one thing was left that I couldn't hand off: me. The numbers kept getting better and I kept getting worse.

Picture six agents running and a mind ping-ponging between six unrelated problems: switch into one, reload its entire context, make the call, switch to the next, reload that. The factory had gotten good at removing the easy questions, which was the point, so everything that reached me was the hard stuff, stripped of its simple parts and dense with context. I'd built a factory that filtered out everything I could answer easily and served me only what I couldn't.

That is a particular kind of tired: not the good tired of a long day, but the tired of being the slowest part of a system that never slows down. It had optimized my throughput, not my cognitive load, and those are not the same thing. The flow I used to get from coding never came back in the new work; efficiency went up, flow did not.

And this isn't a problem a better AI model fixes. The AI model shipped a new version every few weeks; I was running the one I was born with. You can make the factory faster. You can't make the human read, decide, or recover faster by wanting it badly enough.

What it cost me was knowing my own work

The hardest cost wasn't the tiredness. It was quieter.

For my whole career, the question am I good at this had a clean answer: can I write good code? Now I wasn't writing code, and I'm not a software developer anymore turned out to be an easy sentence to say and a hard one to live inside. You don't notice how much of yourself was load-bearing on the old definition until you pull it out.

Here is the moment it stopped being abstract. Early in 2026, someone asked me how one of our features actually worked. For years that question had a reflex answer: I'd say "it works like this," and if they wanted more, I'd open the code and point. Here. This is where it happens. It was the simplest thing in the world, and it was also quiet proof that I knew my own work.

This time I had nothing. I didn't know how it worked, or where in the code it lived, because I hadn't written it and I hadn't read it. The AI had built it, it was running fine, and I could not explain it. The first few times it happened, I dodged: I'd say I had to check and come back to them. It took me longer than I'd like to admit to just say the sentence out loud. I don't know how it's made. And "I don't know" was the part that actually hurt, because answering that question had always been the thing that made me a developer.

Underneath that sat a debt I couldn't pay down. My ability to produce had shot straight up; my understanding of what I produced only crawled behind it. The software works, the tests pass, the architecture holds, and still I can't explain every part of what I'm shipping from memory. Call it understanding debt.

It's tempting to dress it up as wisdom, to say I learned to let go, but I didn't close the gap. I made my peace with not closing it, which is a different thing, and the comfortable version is a lie a lot of us will be tempted to tell. So trust, not technical quality, became the real problem: my trust could no longer mean I read it and understood it, only I trust the process that produced it.

There's an uncomfortable reading of all this I won't pretend away: if one person can now produce what used to take a team, the rest of that team's work went somewhere. Whether it relocated the way mine did, or just disappeared, I honestly don't know. What I do know is this. If I had stayed the developer I was in the spring of 2025, the one whose value was writing the code, the AI really would have taken my job, and not slowly.

I kept it by retraining into whatever the AI couldn't do yet. The AI took my job twice that year, the writing and then the checking, and each time I climbed to the next thing: first the person who built the factory, then the one who ran it. Now I'm in the third role, the one who decides what it should build at all. Stand still and the role really does get optimized toward zero; keep moving and it doesn't, because the work never disappears, it relocates, and so must I. It will move again.

This isn't only a developer's problem

I said at the start that this doesn't stay a developer's story for long. Here's what I mean, and it matters most to the people who run teams rather than live inside them.

The temptation, for a company, is to treat all of this as a tool rollout: buy the licenses, run a training, wait for the numbers to climb. After this year, I think that misses almost everything. The tool is the easy part. What actually changes is the work around it: how tasks get written, how much of the decision-making has to be spelled out in advance, what documentation is even for, and what review means once nobody reads everything.

None of that comes from access. It's a redesign of how work is done, and if an organization skips the redesign, the cost doesn't disappear. It lands on whichever person is holding the system, usually as cognitive load that no productivity dashboard will show, and often on their own evenings and weekends. AI adoption isn't a tool you hand people. It's a change in the shape of the work itself, and someone has to do that reshaping on purpose.

What I'd tell you, if you're heading here too

I don't have best practices, but I have a few things I wish I'd known at the start. Most of them aren't really about software.

Watch where the bottleneck moves. The constraint never disappears, it relocates: from writing the work, to checking it, to feeding the AI, to you. When it lands on you, the move isn't to try harder; it's to let it shift again and follow your role when it does.

Write down what the agent may decide. Decide on purpose what it can settle alone and what has to come back to a person. The blanks you leave get filled with a guess, and you find out which guesses were wrong in production.

Treat documentation as operational memory, not a record. It's part of the factory now: if it's vague or missing, the agent doesn't pause, it improvises confidently, and you live with whatever it improvised.

Measure cognitive load, not just throughput. More output means more decisions, more context-switching, a denser day even when everything is technically fine. That load is what quietly burns people out, and no throughput number will ever show it.

Be honest about understanding debt. You'll understand less of what you ship than you used to. That can be a reasonable trade, but only if you make it on purpose instead of waking up one day to find it already true.

If you're just starting with AI, this may feel distant. If you've already watched it change your work, you probably recognized yourself somewhere back there. This is just what tends to come next, and what comes after that, none of us fully knows.

It's seven in the morning as I write this. A new day, and the first decision is mine:

not how the code gets written, but what the agents should spend the day building.

That's the work now.

The satisfaction I used to get from solving a problem with my own hands is mostly gone; what replaced it lives further out, in pointing the whole thing in the right direction. Some days that's enough. Some days I miss the code.

But I've stopped expecting to arrive. The job I have now isn't the one I had a year ago, and it won't be the one I have a year from now. The only part that keeps repeating is the learning: getting good at the next shape of the work before it learns me. That, it turns out, is the job I kept.

5 months. 100% Claude Code. Zero architectural drift.

Teemu Piirainen — Tue, 30 Jun 2026 07:00:55 +0000

In one web service project I have been building and maintaining for 5 months, every line of code has been written by Claude Code. In a second project, a macOS app in Swift built over 3 months, the same is true with Claude Code.

The architecture has not drifted.

What drift actually is with AI

AI failures usually get described as catastrophic. The agent broke the build. Shipped a bug. Those happen, but they are not what wears down a long project.

The failure that does is much smaller:

An error swallowed with a try/catch, where the rest of the codebase returns a Result.
The same problem solved one way here, a different way three files over.
A helper rewritten from scratch, where one already existed a folder away.

Each change is locally justified. None of them are globally correct. After five months, the codebase looks "almost right everywhere", and the architecture I started with is no longer what I have.

Better CLAUDE.md does not fix this. It works for one session, not five months.

How the setup works: three components

The system has three parts working together:

Locked Claude Code rules. The files that define how the agents themselves operate.
Locked architecture and code style. The files that define what the agents produce.
Independent validators. Separate agents that check every implementation against the rules in 1 and 2.

The first two share the same enforcement mechanism: a file lock at the server. The third is a different kind of layer that runs after every implementation. The rest of this article covers each.

Locked files: who is allowed to change what

The model that has worked, on both projects, is to separate every piece of agent instruction into two categories based on who is allowed to change them. Everything outside these two categories is left to the agents.

1. Claude Code rules

The files that define how the agents themselves operate:

CLAUDE.md            # project-level Claude Code config
.claude/agents/      # subagent definitions (architect, developer, validator)

Agents can read these and propose changes. They cannot push changes. Every modification goes through me.

The reason this category is locked is recursive. If AI can change the rules that constrain it, drift becomes self-reinforcing. The agent that is drifting becomes the agent that rewrites the rules to permit the drift. The human gate breaks the loop.

The system still has to improve. Agents propose changes to their own rules when they notice friction or a repeated mistake. The improvements that survive my review become permanent.

2. Architecture and code style

The files that define what the agents produce:

docs/code-documentation/
  architecture-overview.md      # system map, tech stack, data flow
  architecture-backend.md       # service patterns, REST boundaries
  architecture-frontend.md      # React structure, state boundaries
  architecture-shared-types.md  # cross-package contracts
  testing-guidelines.md         # test patterns, anti-patterns
  git-workflow.md               # branch rules, commit conventions

Same mechanism. Agents read and propose. They cannot push.

I enforce both categories with a single GitHub push ruleset. It restricts pushes to these paths:

CLAUDE.md
.claude/agents/**
docs/code-documentation/architecture-*.md
docs/code-documentation/testing-guidelines.md
docs/code-documentation/git-workflow.md

Only the repository admin role is on the bypass list. The AI agent pushes with its own token, tied to a non-admin identity, so it is not a bypass actor. Any push it attempts to a locked file is rejected at the server.

When the agent wants to change a locked file, it surfaces the proposal as a Product Decision: situation, options, trade-offs, recommendation. I decide. If I approve, I make the file change myself.

The naming prefix matters. architecture-* is not only an organizational hint. It is also part of the access rule. A new architecture document added later inherits the lock by name.

What this looks like in practice

Across both projects, the agents have proposed changes to locked files dozens of times. Reasonable proposals. About one in three I decline. Those implementations get rewritten within the existing rules, and the codebase does not gain the new pattern. The rest are genuine improvements, and I make those file changes myself.

The reason the architecture has not drifted is not that the agents stopped trying. It is that they cannot apply their proposals without going through me first.

My memory file currently has 50+ pitfall entries in each project. Many of them come from a drift attempt that was caught in review, recorded so the agent does not repeat it.

Validation has to be independent

The whole setup works because of a final layer: independent validator agents that check every implementation against the locked architecture and code style documents. The validator sub-agents do not see the implementation reasoning. They get a fresh context, the code and the locked docs, and they verify the code conforms. Nothing else. If the validator reads the implementation's reasoning, it starts agreeing with it. That is rubber-stamping, not validation.

But not all of that validation is a judgment call, and that is on purpose. Before an implementation is even eligible for review, the compiler has to pass in strict mode, the test suite has to be green, and coverage has to clear an 80% threshold. A deterministic check cannot be talked into agreeing with the code the way an LLM can. The agent validators sit on top of that floor. They catch what the machine cannot: whether the code conforms to the locked architecture, not just whether it runs.

I have written about the validator setup in more detail in an earlier article: How I validate quality when AI agents write my code. The two-category model in this article and the validation pipeline in that one are halves of the same system.

Setting up a model that holds

Identify the files in your project that define the agent rules, the architecture, and the code style.
Add a GitHub push ruleset that restricts pushes to those paths. Put only the repository admin role on the bypass list and keep yourself as the only admin.
Tell your planning agent to propose changes to those files as a Product Decision instead of editing them directly.
Add independent validator agents that check each implementation against the locked files. Give them in context the code and the docs only. No planning, no reasoning.
Decide on each proposal yourself. When you approve, you make the file change.

AI will not destroy your architecture in one mistake. It will replace it with a better one, one reasonable change at a time. Preventing that is the developer's job. Our skill matters most when we know how to set up the system that does it. This is one model that worked.

FOMO, FOMAT, FOMUT, and the Bottleneck That Turned Out to Be Me

Teemu Piirainen — Mon, 11 May 2026 06:30:00 +0000

First FOMO. Then FOMAT. Now FOMUT.

I have hit all three. And the biggest bottleneck in my AI workflow is no longer the agents. It is me.

I did not expect to write that sentence. A year ago, the bottleneck was clearly the agents. They were slow, they asked too many questions, they needed constant hand-holding. I spent my days answering their questions so they could keep working.

Now the agents run autonomously for ten hours straight. They do not ask me anything mid-run. Pull requests are waiting when I wake up.

And somehow, I feel more pressure than before.

There is a term for this: FOMAT (Fear Of Missing Agent Time). The feeling when you close your laptop and know the agents stop. The guilt when you take a break and realize idle agents are burning nothing but potential. The low-grade anxiety of being the human bottleneck in a system that could run faster without you.

I had the feeling for six months before I had the word.

The interruption loop

A year ago, my workflow looked like this:

I would start an agent on a task.
It would run for a few minutes.
Then it would stop and ask a question.
I would answer.
It would continue.
Three minutes later, another question.
I would answer again.

This sounds manageable. It was not.

The agent did few minutes of waiting per question. I did few minutes of context-switching per answer. But across a day, the agent's few minutes of waiting cost me hours of fragmented attention.

Every interruption pulled me out of whatever I was doing. Reviewing a PR. Writing a spec. Thinking through architecture. Each time, I dropped context to give the agent what it needed, then tried to reload what I had been doing. Flow state became a memory.

I tried the obvious fixes. Answer faster. Keep the laptop open at all times. Set up notifications so I would not miss a question. All of these optimized my day around the agent's idle time.

That was exactly the trap. I had become air traffic control for AI agents instead of doing deep work.

The problem was not that the agent asks too much. The problem was that I had not given the agents enough decision-making ability.

That was the turning point.

Rebuilding the system, not the agent

I rebuilt the workflow from the ground up. Not the agents. The system around them. Autonomy does not come from the model. It comes from the system around the model.

Move questions upfront

The core change was moving questions from inline to upfront. Instead of agents asking me things as they hit unknowns during execution, I now have a 1-2 hour alignment session with a dedicated orchestrator agent. It is a Claude Code session whose only job is to surface unknowns before work begins.

During that session, the orchestrator asks me dozens of questions. Scope. Priorities. Edge cases. Constraints. Trade-offs. "Should this feature support offline mode, or is network-required acceptable?" "What should happen if the API returns partial data?" "Is a 200ms response time a hard requirement or a nice-to-have?"

Every question the agent might have asked at 2am on a Sunday night gets asked at 10am on Saturday morning instead. The orchestrator tells me when alignment is sufficient. I do not guess when to stop.

The volume of questions did not go down. Their location in time changed.

The alignment session produces the inputs. The spec is what we (me + Claude) write afterward. I specify the why and the what, plus acceptance criteria. Not the how. Something like: "A user can complete the full onboarding flow without encountering an error state, and the result is persisted across sessions." The agents figure out implementation.

Give agents a decision framework

The second change was giving agents a decision framework for situations the spec does not cover. Three states instead of two:

If I know the answer, I act.
If I do not know but have enough context, I choose the simpler option, document why, and continue.
If I am genuinely blocked, I ask. Last resort.

That second state is the one that matters most. If the spec does not specify error handling for a network timeout, the agent picks a simple retry with exponential backoff, notes the decision in the PR description, and moves on. No midnight ping. No blocked pipeline.

That single change eliminated most mid-run interruptions. Agents stopped asking me about font sizes and error message wording. They made reasonable choices and moved on.

Trust the pipeline, not the run

The third piece was trusting the system, not the individual run. I wrote earlier about why quality in AI-assisted development is a pipeline, not a checkpoint. That same principle is what makes autonomous runs possible. A separate validator agent reviews every PR independently. It never saw the implementation agent's reasoning, so it cannot rationalize the same mistakes. Quality comes from layers: validators, multi-agent review, CI/CD gates, human review. The protection is structural, not surveillance.

The result: I cannot remember the last time an agent stopped mid-run to ask me something. 8+ hour autonomous sessions. I write specs during the day, launch agents in the evening, and review pull requests in the morning.

Things still go wrong. But the failure mode shifted. Now it is the agent doing the wrong thing, but doing it correctly. The spec was unclear, the direction was off, the edge case was not covered, and the agent executed exactly what was asked. The problem surfaces in review, not mid-run. Which means the problem is mine.

From FOMAT to FOMUT

Here is what I expected to happen: the pressure would disappear.

Here is what actually happened: the pressure transformed.

Before, the stress was reactive. An agent is waiting for my answer right now. Like being on call. I felt it as a pull. Something needed me, and every minute I did not respond was a minute wasted.

Now, the stress is proactive. I did not write enough specs today, so overnight capacity goes unused. Like a factory floor standing empty, not because something broke, but because I did not load it.

Same pressure. Completely different cause.

I started thinking of it differently after I had the word for it. Because what I have now is not really FOMAT anymore. FOMAT is the fear that agents are idle because they are waiting for you. What I have is closer to a term I have seen floating around as a joke: FOMUT (Fear Of Missing Unused Tokens). It is funny because it is a little too true. The fear that agents are idle because you did not prepare enough work for them.

The name comes from the flat-fee era, where idle agents felt like wasted subscription budget. With GitHub Copilot moving to usage-based billing in June 2026 and Cursor already there, the joke is losing its literal meaning. An unused token now costs nothing. But the feeling underneath was never really about tokens. It is about capacity that could have produced output and did not.

I went from one to the other.

The old pressure was: "An agent needs me right now and I am not responding." The new pressure is: "I have twenty hours of agent capacity available tonight and I only prepared six hours of work." The constraint shifted from my response time to my planning throughput.

This changes what a productive day looks like. I used to measure my day by hours spent coding. Now I measure it by how much agent capacity I utilized overnight. A good day is not one where I worked ten hours. A good day is one where agents ran at full capacity while I slept.

When the bottleneck moved

Three years ago, ten hours of careful coding produced a certain amount of output. Now I achieve the same output in about two hours of specifying and reviewing. But I did not pocket the eight hours. I set targets ten times higher.

I used to think about how much I had time to do. Now I think about how much capacity went unused.

The bottleneck moved from execution to direction-setting. The system can produce more than one human can specify. Product thinking (what to build, why, in what order) becomes the constraint. Not coding skill. Not agent capability. The ability to define good work fast enough to keep the line running. I recognized this pattern from manufacturing: when you add production capacity but the design team cannot produce enough blueprints, the factory floor sits idle. Not because it is broken, but because it is starving for instructions.

The developer role shifts. Less time writing code. More time as a product thinker, architect, and reviewer. I spend more of my day now thinking about what to build than how to build it. The value I add is in the alignment session and the spec, not in the execution.

A better problem to have

Despite everything I just described, the current state is a thousand times better.

Not because the pressure is gone. It is not. But the nature of the pressure changed entirely.

Reactive firefighting became proactive planning. Babysitting became reviewing. Constant interruption became focused alignment. I traded a problem I could not solve (being fast enough to keep up with agent questions all day) for a problem I can solve, which is getting better at specifying work upfront.

The biggest bottleneck in my AI workflow is me. And that turns out to be a much better problem than the one I started with.

How I Validate Quality When AI Agents Write My Code

Teemu Piirainen — Mon, 16 Mar 2026 07:04:39 +0000

Someone asked me the best question after I posted about managing AI agents like a dev team:

And how do you validate quality?

Fair point. If AI is writing the code, who's making sure it actually works?

My solution: a system of enforced gates that makes shipping bad code harder than shipping good code. Here's how I built that system.

The Mental Model: Quality Is a Pipeline, Not a Checkpoint

Often we think of quality as something you check at the end. Run the tests. Do a code review. Ship it.

But we have already learned this lesson with SDLC / SSDLC:

security and quality must be embedded in every phase, not bolted on at the end.

The same principle applies when AI writes the code. The difference is that you can't rely on AI agent developer discipline to follow the process. Your AI framework must enforce it through gates that agents cannot bypass.

AI agents can produce plausible-looking code that passes superficial inspection but drifts from requirements, violates architecture patterns, or introduces subtle bugs. I first tried the obvious approach: detailed instructions telling the coding agent to handle testing, architecture patterns, and edge cases all at once. It never worked reliably. The breakthrough came when I loosened the constraints. Let the LLM write its best code freely, then build independent validation gates with separate agents that catch what the first one missed.

My workflow has eight quality gates. Code must pass through all of them before it reaches production.

If issues surface at Gate 5, 6, or 7, the fix flows back through Gate 3 → 4 before proceeding. In my experience, most issues are caught at Gate 4.

Gate 1: Requirements Definition (~70% of My Time)

This is the most counterintuitive part. In an AI-native workflow, I spend roughly 70% of my time defining requirements, not writing code. My role has shifted from how to build it to what to build and why. The code is the agent's job. Getting the requirements right is mine.

Why does this matter for quality? Because agents are extremely literal. Give them vague instructions and they'll build something that technically matches what you said but misses what you meant. The quality of AI output is directly proportional to the clarity of input.

How It Works

I use a requirements-analyst agent that:

Reads the issue from our project management tool (Linear)
Researches business requirements documentation to map functional and non-functional requirements
Searches for industry patterns and best practices
Asks me clarifying questions until requirements are unambiguous
Decomposes epics into right-sized stories with clear acceptance criteria

Every issue gets a structured format:

## What
[Problem to solve]

## Why
[Business value]

## Context
[Constraints, dependencies, scope]

## Acceptance Criteria
- [ ] Criterion 1 (specific, testable)
- [ ] Criterion 2
- [ ] Criterion 3

The key insight: acceptance criteria are the contract between me and the agents. If a criterion is vague, the agent will interpret it however it wants. If it's specific and testable, the agent has a clear target, and so does the validator that checks the work later.

But requirements alone aren't enough. I also maintain architecture documentation: files that describe the project's patterns, conventions, data models, and design system. When a code-architect agent later designs the implementation, it reads these docs and follows established patterns rather than inventing its own. The requirements define what, the architecture docs constrain how.

What This Prevents

Scope creep (agents build exactly what's specified, nothing more)
Spec drift (each sub-task traces back to business requirements)
Wasted iterations (ambiguities are resolved before any code is written)

Gate 2: Architecture Design

Before any code is written, a code-architect agent takes the requirements from Gate 1 and the architecture documentation I maintain, then designs the implementation. For example, my project maintains docs like these:

docs/code-documentation/
├── architecture-backend.md
├── architecture-frontend.md
├── business-requirements.md
├── gcp-setup.md
├── design-system.md
├── testing-guidelines.md
└── ...

I typically maintain 10-20 such documents per project. These are living documents that evolve with the codebase. They serve as context for every agent, ensuring each one understands the project's patterns, conventions, and constraints before making any decisions.

The architect agent reads relevant docs before designing anything, so it follows established patterns instead of inventing its own. Its process:

Reads project architecture docs to understand established patterns and conventions
Analyzes the existing codebase for relevant precedents
Researches best practices for the specific technology stack
Designs the feature architecture with specific file paths, component responsibilities, and data flow
Breaks the work into ordered implementation phases
Creates sub-issues for each phase with its own acceptance criteria

I review the blueprint if it proposes changes to the general architecture. Personally, I want to understand and own the high-level design, but that's a preference, not a requirement of the system.

The blueprint specifies:

Every file to create or modify
Component responsibilities and boundaries
Data flow from entry points through transformations to outputs
Build sequence that defines which phases must complete before others can start

Each sub-issue carries its own acceptance criteria, which means the validator at Gate 4 has specific targets to check against. The quality chain is: requirements → architecture → implementation → validation, and each gate feeds the next.

What This Prevents

Architecture drift (agents follow established patterns, not their own ideas)
Integration failures (data flow is designed upfront, not discovered during integration)
Over-engineering (scope is bounded by the blueprint)

Gate 3: Implementation with Built-in Validation

The developer agents (separate for each domain, like backend, frontend etc.) don't just write code and hand it off. They have mandatory validation steps built into their process. Why separate agents? Each one has a focused prompt, isolated context window, and role-specific evaluation criteria. A backend agent doesn't need to know about React patterns, and vice versa.

Incremental Testing

After modifying or creating each file, the agent runs only the related test file, not the full suite. This is deliberate: running all tests after every file change slows the agent dramatically, especially as the project grows and integration tests get heavier. By scoping to the affected test file, feedback cycles stay at seconds instead of minutes. The agent must fix failures before moving to the next file. This works well when test boundaries are clear (one service = one test file), and catches issues at the smallest possible scope. The full test suite runs later at Gate 4.

Pre-Completion Validation

Before reporting back, every developer agent must run and pass three checks:

Type-check: zero errors
Lint: zero errors
Test suite: all tests pass + coverage >= 90% for new/modified files (as a minimum guardrail, not a quality metric, since high coverage alone doesn't prove tests are meaningful)

These checks use custom validation scripts that produce compact, structured output: a 5-line summary instead of hundreds of lines of test runner noise. This matters because verbose tool output slows AI agents down significantly. When agents can parse results in seconds instead of scrolling through walls of text, the feedback loop stays tight.

What This Prevents

Cascading failures (small scope means bugs are isolated to one subtask)
Test regressions (existing tests must still pass before moving on)
Untested code (90% coverage enforced per file)

Gate 4: Code Validator Agent

After each developer agent completes, a dedicated code-validator agent runs independently. This is the quality gate that blocks commits.

The validator:

Reads the issue and acceptance criteria
Inspects recent changes and existing tests
Runs the full test suite for affected packages
Generates and reviews coverage reports
Performs a code review focusing on correctness, edge cases, security, and project conventions
Decides: PASS or FAIL

This review focuses on the current subtask in isolation. The broader feature-level review happens at Gate 5.

Confidence Scoring

The validator rates each potential issue on a 0-100 confidence scale:

Score	Meaning
0	False positive, not a real issue
25	Might be real, might be false positive
50	Real issue, but minor
75	Verified real issue, will impact functionality
100	Confirmed critical issue

Only issues with confidence >= 75 are reported. The scoring uses structured prompts that require the agent to provide evidence for each finding. No evidence, no report. It's a pragmatic filtering mechanism that dramatically reduces noise and false positives.

The Hard Rule

Commits are blocked until the validator returns PASS. If it returns FAIL, the developer agent is re-invoked to fix the issues, and the validator runs again. The workflow enforces this automatically, so there's no way to skip it.

Developer Agent
  ↓
Validator (FAIL)
  ↓
Developer Agent (fix)
  ↓
Validator (PASS)
  ↓
Commit allowed

What This Prevents

Convention violations (code that works but doesn't follow project patterns)
Coverage regressions (no commit without meeting the threshold)
Blind spots from the writing agent (independent review catches what the author missed)

Gate 5: Multi-Agent Code Review

While Gate 4 validates each subtask in isolation, Gate 5 reviews the entire feature across all commits before creating a pull request. A code review skill runs multiple specialized agents in parallel:

Parallel Review Agents

Four agents run simultaneously, each with a different focus:

Architecture Compliance: Audit changes against architecture documentation, flag violations with exact rule citations
Bug Detection: Scan the diff for logic errors, null handling issues, and edge cases
Security Review: Check for vulnerabilities, injection risks, and unsafe patterns in changed code
E2E Test: Run an end-to-end test that exercises the new feature from the user's perspective

Validation Round

Each flagged issue goes through a separate validation agent that confirms the issue actually exists. This filters out false positives before any findings are reported.

High-Signal Only

The review explicitly does not flag:

Code style concerns (linters handle that)
Subjective improvements
Pre-existing issues not introduced in this change
Pedantic nitpicks
Patterns used consistently elsewhere in the codebase

What This Prevents

Architectural violations slipping through
Security issues in new code
Logic bugs that tests don't cover

Gate 6: CI/CD Pipeline

Gates 3-5 all ran on the agent's machine. Gate 6 is the first time code runs in a completely independent environment. When the pull request is marked ready for review, CI runs the full pipeline from scratch.

Detect Changed Packages
        ↓
  Lint & Type Check
        ↓
  ┌─────┼─────┐
  ↓     ↓     ↓
Pkg A Pkg B Pkg C   (tests in parallel)
  ↓     ↓     ↓
  └─────┼─────┘
        ↓
      Build
        ↓
  Static Scanners
        ↓
  Ready for Review

Smart Change Detection

The CI pipeline detects which packages changed and only runs their tests. If shared types change, all dependent packages are retested automatically because cascading dependencies are tracked. This keeps CI fast on small changes while still catching cross-package breakage.

What the Pipeline Runs

Lint & Type Check: Static analysis across all changed packages
Per-package tests: Unit and integration tests run in parallel for each affected package
Build: Full production build of all changed modules
Static Scanners: Run static analysis tools to catch potential security issues before merging

Draft PR Strategy

PRs are always created as drafts first. CI skips draft PRs to save CI minutes. When ready for review, the PR is marked as non-draft, which triggers the full pipeline. This means CI resources are only spent on code that's already passed all local gates (Gate 3 + Gate 4).

What This Prevents

Environment-specific failures (clean CI, not the developer's machine)
Cross-package breakage (shared type changes tested across all dependents)
Build failures in production configuration

Gate 7: Human Review and Merge

This is the only manual approval gate in the entire pipeline. After CI passes, I personally review the changes before merging. This is a critical checkpoint that forces me to consciously take ownership of delivered code. I want to understand what changed at a high level so that I'm able to steer future work and make informed decisions about architecture and design patterns.

The review is intentionally lightweight. By this point, the code has already passed five automated gates. I'm not hunting for bugs or style issues. I'm checking that the feature makes sense, the approach aligns with where the project is heading, and nothing looks fundamentally wrong.

What This Prevents

Losing architectural awareness (I stay informed about every change)
Autopilot merging (conscious decision to ship, not rubber-stamping)
Strategic drift (changes that technically work but move the project in the wrong direction)

Gate 8: Deployment Verification

On merge to main, automated release tooling creates a versioned release, and the deploy pipeline runs:

Validates environment variables before building (catches missing config early)
Builds all changed modules with production configuration
Deploys only changed components: backend, frontend, and infrastructure rules are deployed independently based on what actually changed
Verifies all deployments succeeded: if any component fails, the release is marked as failed with actionable retry instructions

What This Prevents

Deploying with missing or misconfigured environment variables
Deploying unchanged components unnecessarily
Silent partial failures (one component fails but the release looks successful)

The System in Practice

Here's what a typical feature looks like flowing through these gates:

1. Define requirements           [~1 hour]
2. Architecture design           [~10 min]
3. Implementation + tests        [~0,5 - 6 hours in total]
4. Validator after each phase    [~3 min each]
5. Code review before PR         [~5 min]
6. CI pipeline                   [~8 min]
7. I review and merge            [~10 min]
8. Deploy on merge               [~5 min]

Regardless of feature size, the validation overhead stays roughly bounded: about 20 minutes of automated checks. The implementation time scales with complexity, but the quality gates are much less variable. That's the point.

When the Pipeline Catches Something

Here's a real example. A developer agent implemented a new feature that added a field to a shared data model. Unit tests passed. Type-check passed. Coverage was above 90%. The agent reported success.

Then the validator ran. It detected that while the new field existed in the TypeScript interface and the backend service, the Firestore converter (responsible for translating between the database and the application) was never updated. Data would be written to the database but silently lost on read. The validator returned FAIL, the developer agent was re-invoked with the specific finding, and it fixed the converter in under a minute.

Without Gate 4, that bug would have shipped. Unit tests didn't catch it because they mocked the database layer. The type system didn't catch it because the converter used a spread operator that silently dropped unknown fields. Only an independent agent reviewing the full change against project conventions found the gap.

That failure became a permanent memory entry. Now every agent touching shared data models gets warned: "Converter updates require synchronized changes in 4+ locations."

Long-Term Memory: Quality That Improves Itself

Without persistent memory, every session starts from zero. Agents repeat the same mistakes, validators catch the same failures, and I re-explain the same constraints. The quality gates work, but they don't get better.

Long-term memory closes this gap. It forms a feedback loop with the gates: the validator catches a failure, that failure becomes a memory entry, and in the next session the developer agent gets warned before it writes a single line of code. The agent avoids the mistake. The validator confirms. Gates catch problems once. Memory prevents them forever.

This compounds. Early in a project, agents make more mistakes and the validator catches them frequently. After 10+ runs, agents start each session already knowing dozens of project-specific traps. Validator failures become rarer. The system gets faster because it spends less time fixing and re-running.

Here are a few real pitfalls that the pipeline caught and encoded:

Zod/TypeScript sync: Adding interface fields requires updating Zod schemas AND all consumers
Test mock indices: New LLM-calling nodes shift ALL mock call indices in integration tests
Config wiring: Adding a parameter signature without reading config is a silent no-op
Converter updates: New conversation fields require synchronized updates in 4+ locations

These aren't hypothetical. Each one caused a real failure, was caught by the validation pipeline, and became permanent institutional knowledge. Every project accumulates its own version of this list.

The result is a quality system that develops itself. Every feature it validates teaches it how to validate the next one better. No human intervention needed for this loop to run. This is fundamentally different from static tooling that works the same way on day one and day three hundred.

What I've Learned

1. Front-load requirements, not reviews

The biggest quality lever isn't better testing. It's clearer requirements. When I spend an hour defining exactly what a feature should do with specific acceptance criteria, the agents produce correct code on the first try far more often than when I rush through requirements and rely on review to catch problems.

2. Separate writing from validation

Don't ask the same agent to write code and verify it. That's like having students grade their own exams. The coding agent's job is to write the best code it can. The validator is a separate agent with a separate prompt, separate context, and explicit permission to fail the work. It has no incentive to pass. This separation is what makes the gates trustworthy.

3. One subtask at a time

The natural instinct is to implement a full feature and test at the end. That's where quality breaks down. Instead, break the work into small subtasks, implement one, validate it, commit it, then move to the next. Each commit is a known-good checkpoint. When something fails, the blast radius is one subtask, not an entire feature. This pattern is counterintuitive but it's the most practical change another developer could adopt immediately.

4. Enforce the process in the framework, not in prompts

You can't tell an AI agent to "be careful" and expect consistent results. The quality comes from a workflow that runs validation automatically after every subtask, not from instructions asking agents to remember to test. Bake the gates into the framework so they execute by default. When skipping a gate is harder than following it, quality becomes a property of the system rather than a hope.

5. This is an engineering problem, not an AI problem

The question isn't:

Can AI write good code?

It can.

The question is

Does your system prevent bad code from shipping?"

That requires overlapping automated gates, independent validation agents, long-term memory, and a workflow that enforces all of it. No single technique is enough. The system is the product.

Tools: Claude Code, Codex, specialized AI agents per role, skills, long-term memory for persistent learnings, git worktrees, Linear for issue tracking, GitHub Actions for CI/CD.

Your AI Coding Agents Are Slow Because Your Tools Talk Too Much

Teemu Piirainen — Sat, 07 Feb 2026 10:39:43 +0000

Our AI code validator agent took 608 seconds to report results from a test suite that runs in 96 seconds. The agent wasn't stupid. The tool output was.

Every developer tool we use (test runners, linters, compilers, build systems) was designed for humans reading a terminal. When an AI agent reads that same output through a context window, things break in ways you don't expect. This is one example of that problem, and a pattern for fixing it.

The Symptom

We run a TypeScript monorepo with ~12,000 tests across four packages. After each feature, a code-validator agent runs tests and reports pass/fail with coverage. Simple job.

Agent Task	Actual Test Time	Agent Time	Overhead
Backend (3,683 tests)	24s	224s	9.3x
Frontend (7,450 tests)	96s	608s	6.3x

The agent was spending 6-9x longer understanding the results than the tests took to run.

What The Agent Actually Did

We parsed the agent transcripts (every tool call, every reasoning step). Here's the backend agent's actual sequence:

1. npm run test:coverage           → 419KB output, truncated at 235KB
2. grep "Tests" /tmp/output.log    → matched console.log JSON, not summary
3. npm run test:coverage           → re-ran entire suite. Truncated again.
4. tail -20 /tmp/output.log        → got coverage table row, not summary
5. grep -E "passed|failed"         → matched 47 lines of noise
6. npm run test:coverage           → third complete re-run
   ... repeated 6 times total ...

12 tool calls. 6 complete test re-runs. 224 seconds. To answer a yes/no question.

The frontend agent was worse: 28 tool calls, 5 test re-runs, 13 different grep/tail/head combinations trying to parse a coverage text table. It even reported a false failure — incorrectly flagging coverage as below threshold because it parsed the wrong line.

Why? Because vitest produces this:

 ✓ src/services/__tests__/userService.test.ts (12 tests) 45ms
 ✓ src/services/__tests__/authService.test.ts (8 tests) 23ms
   ... 1,386 more files ...

 Test Files  1389 passed (1389)
      Tests  3683 passed (3683)
   Duration  24.1s

----------|---------|----------|---------|---------|
File      | % Stmts | % Branch | % Funcs | % Lines |
----------|---------|----------|---------|---------|
   ... 141 rows ...

419KB of human-readable output. The answer five numbers is at the bottom. The context window truncates from the bottom. The agent never sees it.

You wouldn't send 419KB of raw HTML to a mobile app and tell it to regex out the data. But that's exactly what we were doing with our agents.

The Fix

We stopped asking "how do we make the agent parse this better" and asked "can we give the agent a command that just outputs the answer?"

RESULT_FILE=$(mktemp)
trap 'rm -f "$RESULT_FILE"' EXIT

# JSON reporter writes structured data to file. Everything else → /dev/null.
(cd "$PKG_DIR" && npx vitest run \
  --reporter=json \
  --outputFile="$RESULT_FILE" \
) > /dev/null 2>&1

# Extract exactly what the agent needs
PASSED_TESTS=$(jq '.numPassedTests' "$RESULT_FILE")
FAILED_TESTS=$(jq '.numFailedTests' "$RESULT_FILE")
SUCCESS=$(jq '.success' "$RESULT_FILE")

echo "RESULT=$( [ "$SUCCESS" = "true" ] && echo "PASS" || echo "FAIL" )"
echo "TESTS=$PASSED_TESTS passed, $FAILED_TESTS failed"
echo "WALL_TIME=${WALL_TIME}s"

# On failure only: extract what failed
if [ "$SUCCESS" != "true" ]; then
  jq -r '.testResults[] | select(.status == "failed") |
    "FILE: \(.name)\n\([.assertionResults[] |
    select(.status == "failed") | "  - " + .fullName] | join("\n"))"
  ' "$RESULT_FILE" | head -30
fi

Three decisions:

--reporter=json — vitest writes structured JSON to a file
> /dev/null 2>&1 — 419KB of terminal noise disappears
jq — extracts five numbers from structured data

The agent now sees this:

=== VALIDATION: test:backend ===
RESULT=PASS
SUITES=1389 passed, 0 failed (1389 total)
TESTS=3683 passed, 0 failed (3683 total)
WALL_TIME=40s

Five lines. One tool call. No parsing, no ambiguity, no re-runs.

The Pattern Is Everywhere

This isn't a vitest problem. It's a tool output problem. Every developer tool your agent touches has the same issue:

Linters — ESLint's default output is human-friendly. eslint --format json gives your agent structured violations with file paths, line numbers, and severity — no parsing needed.
Type checkers — tsc --noEmit dumps errors to stderr as human-readable text. A 5-line wrapper that counts errors and captures file paths turns it into a structured report.
Build tools — docker build streams layers of progress output. The agent only needs: did it succeed, what's the image size, how long did it take.
Infrastructure — terraform plan produces pages of human-readable diff. terraform plan -json gives your agent a structured changeset it can reason about.

The pattern is always the same: the tool already has structured output (JSON, machine-readable flags), but the default is designed for a terminal. Switch the format, discard the noise, extract the answer.

The Results

Metric	Before	After
Backend: tool calls	12	1
Backend: agent time	224s	42s
Frontend: tool calls	28	1
Frontend: agent time	608s	66s
False failures	2	0
Test re-runs per agent	5-6	0

Same tests. Same agent. Same model. Same prompts.

The Takeaway

The industry is pouring effort into prompt engineering, model selection, and agent frameworks. Meanwhile, half the agent's context window is filled with ANSI color codes, progress bars, and output that was never meant for machine consumption. The context window is a scarce resource => treat it like memory, not a terminal screen.

When your agent is slow, don't start with the prompt. Start with what the tools are sending back. Audit every command your agent runs. If the output is more than a screenful, the agent is probably struggling with it. Most tools already support structured output: JSON flags, machine-readable formats, quiet modes. Use them. And where they don't exist, a simple wrapper script that filters noise and extracts the answer will do more for your agent's performance than any prompt rewrite.

The fastest agent isn't the one with the best reasoning. It's the one that doesn't have to reason about the data format at all.

Based on a real optimization on a production TypeScript monorepo with ~12,000 vitest tests. The pattern — structured output, noise suppression, answer extraction — applies to any tool your agents touch.

AI Agent - Lessons Learned

Teemu Piirainen — Mon, 11 Aug 2025 05:12:00 +0000

Who’s this for: Builders and skeptics who want honest numbers: did an AI coding agent really save time, money, and sanity or just make a mess faster?

TL;DR

⌛ ~60 h build time (↓~66 % from 180 h)
💸 $283 token spend
🚀 374 commits, 174 files, 16'215 lines of code
🤖 1 new teammate - writes code 10× faster but only listens if you give it rules

Series progress:

Control ▇▇▇▇▇ Build ▇▇▇▇▇ Release ▇▇▇▇▇ Retrospect ▇▇▢▢▢

This is Part 4, the final piece: my honest verdict.

Now the question: Was it worth it?

Did the numbers add up?
Where did the agent pay off?
Where did it backfire?
How would I push it further next time?

Series Roadmap - How This Blueprint Works

One last time, here’s the big picture:

Control - Control Stack & Rules → trust your AI agent won’t drift off course (Control - Part 1)
Build - AI agent starts coding → boundaries begin to show (Build - Part 2)
Release - CI/CD, secrets, real device tests → safe production deploy (Release - Part 3)
Retrospect - The honest verdict → what paid off, what blew up, what’s next

Why care?
The AI agent is a code machine that never sleeps, knows every library, and wants to push commits 24/7 but without your control, it has no clue what the end product should be.

👉 Let’s break it down - this is Part 4.

1. Did the Numbers Add Up?

Back in Part 1, I posed the core challenge:

Can we build a fully autonomous AI agent (one that an organisation can own and audit end-to-end) and make it deliver real, production-grade code, with just a fraction of human input?

That meant no black-box SaaS tools. No prompt-hacking toys. Just a scoped AI teammate working inside a real, observable control loop: Planner → Executor → Validator, backed by rules I could evolve and CI/CD pipelines I already trust.

Short answer: YES.

Here’s the breakdown:

Effort - ~60 h of my time with the AI agent delivered the same output I’d expect from 180 h solo
Money - $283 in Gemini 2.5 Pro tokens
Speed - Flutter work flew by 5–7× faster, native Swift/Kotlin dragged to <2×, landing a real-world ~3× boost
Delivery - 374 commits, 174 files, 16'215 lines of code
Trust - Every task passed through my control loop, tested and clean. Full control. Full traceability.

So was it cheap? Absolutely - but only because I stayed in the loop.

The $283 didn’t magically buy 180 hours of code. It bought an extra pair of hands that turned my 60 hours into a full 180h deliverable.

✅ Bottom line: The 3× boost didn’t come from magic, it came from structure.
The agent didn’t invent new skills; it scaled the ones I already had.
Sometimes it even surprised me, but only because the groundwork made it possible.

The stack I share here worked and was battle-tested in June 2025. Treat this write-up as a snapshot, not a rulebook.

2. Lessons Learned - What I’d Keep Next Time ✅

Some parts of the setup worked better than expected and these are the ones I’d repeat from day one. Most of them are invisible from the outside, but they made the difference between chaos and clarity.

2.1 Trust Isn’t Given - It’s Built

One of the biggest takeaways from integrating an AI agent into my workflow is that trust doesn’t happen by default. You don’t get it just because the agent can write “good” code. You earn it by proving, over and over, that the agent can operate reliably inside the same guardrails as the human team.

When trust is missing, adoption stalls. Every bug becomes a reason to sideline the agent rather than improve it. Pull requests sit unreviewed because no one wants to take responsibility. Eventually, the “AI teammate” becomes just another unused tool.

The turning point was treating the agent like a real developer:

Make its actions visible so everyone can see what it’s doing and why
Start small and collect wins before scaling up
Learn from mistakes and feed those lessons back into its instructions and rules
Require approval for every plan before coding starts
Apply the same rules as for human devs (no shortcuts because it’s an agent)

Built through visibility, shared rules, guardrails, and real accountability, that trust made me comfortable approving the agent’s work. Without it, none of the technical improvements would have mattered.

2.2 Control Stack First, Prompts Second

Giving the agent a state‑machine style loop (Planner → Executor → Validator), similar what Anthropic’s best‑practice write‑up. It forced the AI agent to think before splatting code and gave me natural checkpoints to cancel nonsense.

2.3 Rules as Live Documentation

/rules/airules.md began as nothing, ballooned into a 400-line beast, and finally slimmed down to a tight 70-liner that covers only what matters. By week’s end the agent spoke my dialect (thinking process, code architecture, commit style) with minimal reminders.

JetBrains’ Junie guideline files show the same pattern: write rules once, enforce forever. But “forever” takes discipline.

2.4 Ruthless Task Scoping

Start with a living PRD: Draft a concise Product Requirements Document that maps the entire service: user flows, non-functional needs, “nice-to-have” ideas, everything.
Slice every feature into bite-size tasks: Break big rocks into shovel-ready tickets. Do it yourself, subdivide the work, just be sure each task fits comfortably in a one AI agent task context.
Let the agent explode tasks into execution units: When implementation starts, the agent generates its own subtasks, acceptance notes, and edge cases, and keeps that checklist current as it commits code.

2.5 Secrets Stay Secret

Fine‑grained PAT plus GitHub Secrets meant the agent never held a signing key. The 2025 Wiz secrets‑leak report is proof that anything less is asking for page‑one headlines.

2.6 Real-Device CI/CD - The Only Trustworthy Loop

CI/CD isn’t optional when working with AI agents, it’s what turns speed into reliability. No pipeline, no autonomy - just faster mistakes.

Every pull request goes through the same pipeline: build, sign, ship to TestFlight and Play Console. That means the code lands on real hardware, gets tested by real eyes, and reveals real bugs the agent never saw coming.

The first sprint showed what works. These are the bets I’ll double down on next time to turn speed into consistency - not just more commits.

3. Lessons Learned - Where I’ll Push Further

3.1 Smarter Context, Longer Autonomy

Big LLMs forget fast. The fix isn’t just more tokens. It’s structured, real-time access to the whole repo, open tasks, recent merges - everything that defines “what’s really going on.”

The longer a single chat grows, the worse the output gets. So I’d like to push for smarter retrieval next time: live task lists, commit-aware reasoning, and context that updates as the codebase evolves.

Both Devin and Anthropic hit the same wall: without structured, evolving context, long autonomous runs just fall apart. Even though one favors single-agent and the other multi-agent setups.

In my own sprint, I tackled the same challenge by keeping tasks small and starting each with a clean context. A simple but surprisingly effective workaround.

3.2 Scaling the Team

One dev plus one agent is simple. But the moment you add more people (or more agents) things can get messy fast.

Who owns what? What changed while the agent was thinking? What if two agents fix the same bug? Without shared state and safe commit boundaries, you don’t get more speed. Just more conflict.

One thing I’d like to try: task leasing. AI agents (Planner) pick up tasks from a shared hub (like task.md), evaluates is there other tasks running parallel (by other agents / humans) that might impact to work and validate state (Validator) before pushing code. Paired with clean CI/CD and commit guards, that might keep the swarm aligned.

These ideas would need careful testing in real-world coding workflows, as current multi-agent systems often fail due to shared-state complexity.

3.3. True TDD Loop

Tests after code worked fine, but next run the agent will flip it: failing test first, fix second. This tightens the feedback loop and cuts down surprises at the Validator step.

Anthropic recommends the same test-first mindset in their Claude Code Best Practices: write the tests first, confirm they fail, then guide your agent to turn them green. The goal is the same: catch bugs early, not after they hit production.

3.4. Deeper Static Analysis

Syntax checks aren’t enough. Unlike experienced developers, AI agents don’t intuitively spot complex or fragile code structures. Adding tools like SonarQube or Qodana to the CI pipeline gives early feedback on code quality, helping catch issues the AI agent might repeat without realizing.

3.5. UAT Feedback Automation

One practical issue that I didn't solve was: how to get human testers' feedback into the agent's task list? When working in a team, I would create a separate hook integrated to Jira or Slack (or what ever tool the team is using), so that testers could report issues and the AI agent would pick them up and automatically create a linked task. But in this case I didn't have a team, so I just added the issues manually into task.md and let the agent handle them.

Mobile dev: In general, I think that we could get the biggest improvement if the AI agent would have access to Android emulator UI during the development. AI agent would have been able to run the tests and fix majority of issues automatically during the development. But as I mentioned earlier, due to technical limitations in Firebase Studio, real-time emulator access wasn’t available in this project.

4. MCP (Model Context Protocol) - The Next Frontier

Everything so far (coding speed-up, the Planner–Executor–Validator loop, and the CI/CD pipeline) was built and tested during the initial 30-day sprint. But while writing this recap, one thing became clear:

There’s still room to grow. And it starts with context.

The core limitation I ran into was this: the AI agent didn’t truly “know” what is happening outside the project. Each task was handled in context-isolation with one AI agent.

That missing context (the lack of shared memory or real-time awareness) is what I’ll explore next.

Why MCP matters

Model Context Protocol (MCP) lets an agent bolt on extra tools in real time. Need repo search? A test runner? Up to date docs? Hook it up as a structured API call instead of fragile prompt glue.

Below are some candidate features I’m excited to try:

Shared brain, real‑time

Every agent writes to (and reads from) a live task.md hub. Spin up ten runners, one planner, one validator; they all share the same queue and state, acting like one coherent engineer instead of a Slack channel on fire.

Always‑fresh docs (Context 7 FTW)

Because MCP streams docs in via context 7, the agent’s knowledge is never stale. Update the README, push to main, the next call sees it automatically.

Enterprise-ready glue

Forget brittle webhooks - MCP exposes Jira, GitHub, Slack as typed plugin calls. The agent plugs straight into your existing workflows without extra scaffolding.

Bottom line: MCP turns one clever AI agent into a fully-armed AI agent squad, all speaking the same language and pulling from the same live playbook.

5. Final Word

AI agents won’t replace you, but they will scale what you can deliver.

All this worked because I’ve got 20+ years of real dev work behind me: the blueprint, the rules, the guardrails all come from doing the work first.

If we hand every line of code to the AI, that craft fades and soon there’s nothing left to steer.

So protect your craft. Build the hard parts yourself. Keep your edge, so the agent stays a partner - not the boss.

Wire it tight, scope it clear, and let your AI agent prove it can keep up!

6. If You Want The 3× Boost - Do This ✅

Control Stack First. Planner → Executor → Validator - no shortcuts.
Keep /rules alive. Update instructions as your agent learns.
Scope tasks tight. Small tasks, clear acceptance notes, tracked commits.
Secrets stay secret. Repo-scoped PAT + CI/CD secrets only.
Real tests. Always run on real hardware, no emulator-only trust.
Watch, learn, tweak. Your agent only stays smart if you guide it.

Ready for next?

I’m planning to plug MCP into the All‑Hands AI framework next, linking multiple agents with a shared brain and tighter feedback loops. I’ll share how that turns out once I’ve pushed it far enough to see what breaks.

💬 Seen anything I missed? Or got your own battle story testing AI agents on real projects? Drop it in the comments. I read them all.

Release AI Agent Code Safely - Production CI/CD & Secrets

Teemu Piirainen — Mon, 04 Aug 2025 05:40:00 +0000

Who’s this for: Devs, team leads, and DevOps folks responsible for a production CI/CD pipeline - looking to integrate AI agents that generate code, without losing reliability or control.

TL;DR

Secrets, Pipelines, Real Tests:
Fine-grained Personal Access Tokens (PATs) protected my repo, GitHub Actions auto-built every PR, a second AI agent reviewed commits and human approved the PR. Real device tests closed the loop, still ~3× faster.

Series progress:

Control ▇▇▇▇▇ Build ▇▇▇▇▇ Release ▇▇▇▇▇ Retrospect ▢▢▢▢▢

Welcome to Part 3 of my deep-dive series on building an autonomous AI agent: how do you actually deploy AI agent code safely?

In Part 1, I locked my agent inside a clear Planner → Executor → Validator loop.
In Part 2, I proved it could blast through real Flutter tasks and handle native Swift/Kotlin with human guard-rails.

But shipping day is where AI agents usually faceplant: secrets leaks, who is actually responsible, app doesn't work on a real device.

This part breaks down:

How I kept secrets safe (fine-grained GitHub PATs, repo-scoped only)
How I automated CI/CD (GitHub Actions, PR reviews with a second AI)
How to integrate real device testing into the loop

Series Roadmap - How This Blueprint Works

Control - Control Stack & Rules → trust your AI agent won’t drift off course (Control - Part 1)
Build - AI agent starts coding → boundaries begin to show (Build - Part 2)
Release - CI/CD, secrets, real device tests → safe production deploy
Retrospect - The honest verdict → what paid off, what blew up, what’s next

Why care?
Without proper CI/CD controls and human-in-the-loop rules, your AI agent can go rogue - just like Replit’s did when it dropped a live database during a code freeze. No guardrails, no mercy.

👉 Let’s secure it - this is Part 3.

1. App Observability: Keep Analytics and Crash Reports Under Control

Firebase Studio can auto-generate configuration files, but I didn’t want it touching anything sensitive. It also offers a one-click setup for Crashlytics and Analytics, but using it would have meant linking my personal credentials. That wasn’t acceptable.

Instead, I handled the setup manually:

Created the Firebase project through the Firebase Console
Registered iOS and Android apps and downloaded the required config files
Added GoogleService-Info.plist and google-services.json to the project folders
Configured dependencies and updated the Podfile by hand

Using the Firebase CLI was an option, but running it inside Studio through an AI agent didn’t meet my security bar.

2. GitHub Access: Minimal Permissions, Full Workflow

Firebase Studio initially requested full GitHub access. That was not acceptable. It then suggested a general Personal Access Token, which was still too broad for my setup.

Instead, I configured a fine-grained PAT with only the permissions required for this single repository. That allowed the AI agent to commit code, open pull requests, and read comments, nothing more. I also installed the GitHub CLI and used the same token for PR management.

All sensitive keys stayed out of the repository. I stored key.properties and Apple certificates in GitHub Secrets. The pipeline injected them only during the build process. The AI agent had zero access to any secrets at rest, keeping the risk surface small and controlled.

3. CI/CD - Let the Pipeline Do the Dirty Work

Once the AI agent created a PR, my custom GitHub Actions pipeline took over. PR reviews were done first by the GitHub Copilot and then by human in the loop - me.

3.1 Who does the PR review and takes responsibility?

In a normal software lifecycle, we don’t let developers review their own code, not because we’re careless, but because we know we miss things. The same principle applies to AI agents, and arguably even more so.

In this setup, every pull request went through a two-phase review: first by GitHub Copilot, then by me. The code was originally written by Gemini 2.5 Pro, and I honestly expected Copilot to just nod along. But surprisingly, it flagged real issues, especially around edge cases and error handling.

Early on, I followed every line the AI agent wrote. But as the control stack matured, I trusted it more. By the end, I reviewed its pull requests just like I would with any human teammate.

3.2 GitHub Actions Pipeline

When it was time to create a release build, this was triggered manually by me utilizing create_release.yml workflow. Pipeline then took care of the whole release process.

Release notes and the whole CI/CD pipeline was very similar to what I would do with my real customers and human developer colleague (Dependabot, Release Drafter, analyzer, linters, test, build generation, signing, version bumps etc.). This setup worked the same way I use with human teammates.

Example of my .github folder structure:

.github/
├── CODEOWNERS
├── dependabot.yml
├── release-drafter.yml
└── workflows/
    ├── create_release.yml
    ├── labeler-pr.yml
    └── labeler-update-draft.yml

4. Branch and PR Flow

This is how the full development cycle played out with the AI agent in control.

Create a new branch
- Once the task prompt was clear and scoped, the AI agent created a new branch from main. I used trunk-based development, where all releases were built from main.
- The agent followed the git-workflow.instructions.md rules to stay aligned with my CI/CD pipeline.
Implement the task
- The agent executed the Planner → Executor → Validator loop.
- One commit per one subtask, so that it was easy to make rollbacks when (not if) AI agent went ballistics. ⚠️ And yes, this will happen. Be prepared to revert fast.
- It committed changes with descriptive messages, including the task ID (e.g., ID-1234: Add UI widget for xx).
Open PR
- After completing the full task, the agent opened a PR to main, which triggered the CI/CD pipeline.
- 💡 Pro tip: If you use secondary AI agent for PR review, request your coding AI agent to add relevant PR description and possible instructions for the reviewer. This way you can get better results from the PR review AI agent.
PR review
- The PR review agent left comments, which were then passed back to the coding agent. The coding agent addressed the feedback and pushed updates.
- 💡 Pro tip: Make sure your coding agent treats PR comments critically and does not blindly implement all suggestions. It's also important to distinguish between human and AI-generated comments.
PR approval
- After my review and approval, the CI/CD pipeline automatically merged the PR to main and started the build process.

4.1 Picking a Git Strategy Your Human + AI Crew Won’t Hate

This is my personal opinion but I would say that trunk-based (single main + short-lived feature branches) keeps merge hell minimal and CI green, exactly what an always-on coding AI agent needs. But copy–paste doesn’t fit every org, so sanity-check against your constraints.

💡 Pro tips for AI agent repos and team work

Single-source state: keep /rules, prompts and task.md on the same branch the agent edits, no “hidden” gist or Wiki versions.

Atomic commits per subtask: easier to revert when the AI agent goes rogue (git reset HEAD or git revert -m 1 HEAD saves the day).

Branch-naming conventions like feat/ID-1234-short-slug help the agent map Jira ↔ Git without spaghetti regexes.

5. Real Devices and Store Metadata: What Still Needs a Human

End-to-end testing in mobile development can’t rely on emulators alone. Once a feature was merged, my CI/CD pipeline shipped a staging build directly to real devices. This uncovered bugs that didn’t show up in simulators, issues in widgets, deep links, permissions, and screen behavior. The Validator phase kept test coverage high, but hands-on testing still revealed critical gaps.

Each bug I found was added to task.md as a tracked fix with a task ID, and the AI agent processed them through the same Planner → Executor → Validator loop. This kept the feedback loop tight and repeatable.

But automation stops at the app stores. Submitting release builds to Google Play and App Store Connect is still a manual process. Review feedback from the stores must be collected, analyzed, and addressed by a human. Many rejections can be avoided by setting correct metadata and permissions early. But when something does slip through, you need to decide whether it’s your job or the agent’s to fix it.

6. Run & Observe - Releasing Is Just the Beginning

Once the release pipeline is humming, flip the switch on observability and feed the data back into your development process.

Run & Observe Checklist

Crash & Error Rates - Use Firebase Crashlytics (or Sentry) to track crashes and errors on real devices, not just emulators. Auto-symbolication shows exactly where the agent’s code fails.
Performance & Responsiveness - Monitor App Store Connect and Google Play Console dashboards for frame drops, slow rendering warnings, and battery drain.
ANR & Startup Time - Critical for Android: watch for Application Not Responding (ANR) cases and slow app launches.
AI Agent Hit‑Rate - Custom metric: track AI-generated LOC merged vs. reverted and Defect Rate per Feature. If reverts or bugs climb, tighten your /rules and boost your tests.

💡 Pro tip: Don’t just collect these metrics, feed them straight back into your task.md plans. If crash rates, ANRs or defect rates creep up, adjust your AI agent’s scope, tighten testing, or split tasks smaller to keep that 3× boost real.

7. Recap – Parts 1 → 3 at Warp Speed

Phase	What Happened	Why It Mattered
Control (Part 1)	Locked the AI agent into the Planner → Executor → Validator loop and defined clear guardrails in the `/rules` folder to keep its scope tight and behavior predictable.	Gave the agent a sandbox it can’t break out of. No random rewrites, no scope creep.
Build (Part 2)	Turned the high-level PRD into `task.md`, it is the agent’s working brain.	Made sure the agent builds only what you planned, nothing more, nothing less.
Release (Part 3)	Fine-grained tokens, secrets locked in CI/CD, GitHub Actions pipeline, PR reviews, real-device test gates.	Closed the “works-on-my-machine” gap and hit production-ready confidently.

Bottom line: one well-guarded AI agent can turn a 180 h project into a 60 h sprint for <$300.

8. Key Takeaways - Part 3 ✅

Use fine-grained Personal Access Tokens (PATs). Never give your agent repo-wide access.
Keep secrets secret. Store keys in GitHub Secrets - never hardcode.
Automate checks. Use a second AI for PR reviews + human final pass.
Real device tests. Don’t trust emulators - deploy staging builds to real phones.
Trunk-based flow. Short branches, atomic commits, fast merges.

Next Up - The Brutal Reality Check:

So the AI agent built, tested, and shipped real mobile code with secrets locked and pipelines green. But was it really faster, cheaper, or safer? Did all those /rules and CI/CD gates pay off or just look good on paper?

In Part 4, I’ll break down exactly what worked, what blew up in my face, and how I’d tweak the setup to squeeze out more ROI next time.

💬 How are you keeping your secrets and pipelines locked down when you add AI into the mix? Got a trick or tool I should try next? Tell me below!

👉 [Part 4 → Coming Soon]

I Shipped 3x More Features with One AI Agent, All Production Ready

Teemu Piirainen — Mon, 28 Jul 2025 05:19:00 +0000

Who’s this for: Developers exploring AI agent in software development for real coding work. Especially those struggling with hallucinated code, unclear task boundaries, or when to keep a human in the loop.

TL;DR

With a scoped workflow, my AI agent delivered code up to 6× faster but only when the task matched its strengths.

6× in app UI, 4.5× in business logic, 1.7× in platform code.
1× in publishing and UI design. Still fully manual.
Net result: solid 3× productivity increase

The key? A 4-step flow:

Pick the right coder - me or the agent
Define the task clearly
Talk before code - ask, plan, refine
Code only after approval

Series progress:

Control ▇▇▇▇▇ Build ▇▇▇▇▇ Release ▢▢▢▢▢ Retrospect ▢▢▢▢▢

Welcome to Part 2 of my honest deep-dive: does an AI agent really hold up when you move from one code framework to a completely different tech stack?

In Part 1, I showed how the Planner → Executor → Validator loop + /rules folder kept my AI agent from rewriting files it shouldn’t touch.
Now it’s time for the real test:

Set up a real Flutter project, App Store Connect and Google Play Console.
Stress-test Flutter quirks, native iOS / Android bridges, OS-level permissions.
Measure what the agent did fast and where it wasted my time.

Even though this series uses a Flutter mobile app as the demo project, everything here - from the agent control loop to testing, PR review, and CI/CD - maps directly to backend, web or other SW dev work.

Spoiler: Pure Flutter work? Lightning-fast. Native iOS / Android? Still half manual and parts where you’ll still crack open Xcode and Android Studio.

Series Roadmap - How This Blueprint Works

Control - Control Stack & Rules → trust your AI agent won’t drift off course (Control - Part 1)
Build - AI agent starts coding → boundaries begin to show
Release - CI/CD, secrets, real device tests → safe production deploy (Release - Part 3)
Retrospect - The honest verdict → what paid off, what blew up, what’s next

Why care?

Give your AI agent the wrong job, and it’ll cheerfully break your build at lightning speed. Your role? Decide what the AI agent should do, what to protect from it, and when to step in as a teammate.

👉 Let’s break it down - this is Part 2.

1. Task.md - The Agent’s Single Source of Truth

Before I wrote a single prompt, I wrote a real blueprint planning.md (a.k.a Product Requirements Document - PRD): what I wanted to build, high level architecture, coding principles that I want to follow, folder structure, technical requirements, and what my delivery looked like.

Then I turned that blueprint into a detailed task.md (IDs, subtasks, clear acceptance notes). The AI agent was never in charge of defining scope. That’s still my job. I also bolted Jira to GitHub and linked to my task list, so I could see how well the agent was tracking my tasks.

task.md is where the whole Planner → Executor → Validator loop keeps the state in practice. In the Control phase, it keeps the AI agent on a tight leash: every task starts here, every plan gets approved here, and every bug or test failure loops back here. In Build phase, it turns the blueprint into concrete commits: it tracks what the AI agent built, what needed manual fixes, and what code parts still demanded human tweaks. By the Ship phase, task.md is still the single source of truth. PR reviews, CI/CD pipeline checks, and real-device test results all feed back into it, creating new tasks when blockers pop up.

It’s not just a to-do list. It’s the living playbook that ties blueprint, AI agent, and CI/CD pipeline together.

💡 Pro tip: Treat task.md like a living log - never let it freeze.
Whenever the agent spots edge cases, test failures, or blockers, make it write them back to task.md. That way your blueprint always matches reality, not just the plan on day one.

1.1 Designing a Project to Reveal AI Agent Boundaries

I didn’t want a hello‑world demo app. From day one, I scoped my mobile app so that the AI agent had to touch native iOS and Android code. You can’t cheat that with pure Dart. You need native Swift for iOS and native Kotlin for Android.

Home Widget on both OSs – a SwiftUI WidgetExtension for iOS, a Kotlin AppWidgetProvider on Android.
Platform-channel glue – JSON method channels moving data between Dart and native layers.
OS-level permissions & new targets – schedule tasks.

This setup helped me push the AI agent to its limits, shape a blueprint for splitting tasks the right way, and define when to bring in human hands and where to draw that line.

💡 Pro tip: When you kick off a new project, map out which parts are likely to trip up your AI agent (native glue, tricky permissions, platform quirks). Watch these spots closely and be ready to jump in yourself when the agent hits its limits.

2. Project and Firebase Studio’s setup

2.1 Cloud vs. Local environment

Local dev is fast and familiar: my Mac runs Flutter, emulators, and real devices just fine. But local isn’t built for agents that run around the clock.

That’s why I used a cloud setup from day one:

The agent stays live 24/7, not tied to any laptop.
Bugs go from Jira → webhook → Planner → Executor → Validator → PR. No manual wake-ups.
Adding more devs or agents? Everyone shares the same environment. No “works on my machine” issues.

🔍 Why Firebase Studio?
It supported Flutter and Android emulator well enough for a single sprint, but lacks full 24/7 agent support. (Next time I’ll likely use All Hands built for continuous agents from day one)
💡 Cloud setup tip: Private or public? Just make sure your agent can stay live, productive, and compliant with your org’s data rules.

2.2 Firebase Studio’s Flutter Setup - Avoiding the Trap

When I hit “Start coding an app” in Firebase Studio and picked Flutter, the Firebase scaffolded every wrong default: MyApp, a useless web/ folder, no iOS target (🚫 Firebase Studio does not support iOS), and the classic com.example.myapp package ID. I spent some time cleaning that up, but it was waste of time and decided to try another approach.

This was my mistake: I should have started at the first place with a clean Flutter project locally, then imported it into Firebase Studio. But I get why Firebase Studio does this: it’s meant as a playground setup, not production-ready.

I created a clean Flutter project locally, then imported it into Firebase Studio via Git. See the details below for how to do this.

How to create a clean Flutter project locally

# local init (tweak org "fi.awave" & name "sampleapp" for your own use case)
flutter create --platforms=android,ios \
               --org fi.awave \
               sampleapp

🍏 iOS: Targets, Deployment & Certificates

Trim targets – Xcode → Targets → General tab → delete unneeded macOS / visionOS schemes.
Deployment target – Xcode → Targets → General tab → update minimum iOS version + Display name.
Deployment target – platform :ios, '15.6' in Podfile
App category & capabilities – Targets → General tab → select category.
Signing – Xcode → Signing & Capabilities
- Select the correct team.
- Let Xcode manage certificates, or upload your .p12 + provisioning profiles. (tweak values for your own use case)

🤖 Android: SDK Levels & Signing

SDK Levels – android/app/build.gradle

   android {
     compileSdk = 36
     ndkVersion = "27.0.12077973"
     defaultConfig {
       minSdk = 30
       targetSdk = 36
     }
   }

Gradle & Kotlin – Bump to the latest wrapper and AGP.
Release signing – Android Studio → Build > Generate Signed Bundle…
- Generates keystore.jks and key.properties (keep them out of Git).
- Add debug / staging / release build flavours.
- Store key.properties outside Git and add these into GitHub secrets for CI/CD pipelines.
Firebase Studio -friendly tweak – In build.gradle, wrap the signingConfig block (see below details) (tweak values for your own use case)

🤖 Android: Gradle to skip signing on Firebase Studio debug version

// skip signing on Firebase Studio
val keystoreProperties = Properties()
val keystorePropertiesFile = rootProject.file("key.properties")
if (keystorePropertiesFile.exists()) {
    keystoreProperties.load(FileInputStream(keystorePropertiesFile))
} else {
    println("⚠️  Note: key.properties not found – release, staging build not possible to be signed.")
}

...

buildTypes {
    getByName("debug") {
        // uses default debug signing config located in: ~/.android/debug.keystore
    }
    getByName("staging") {
        if (keystorePropertiesFile.exists()) {
            signingConfig = signingConfigs.getByName("staging")
        }
    }
    getByName("release") {
        if (keystorePropertiesFile.exists()) {
            signingConfig = signingConfigs.getByName("release")
        }
    }
}

(tweak values for your own use case)

Run the basic Flutter app inside Firebase Studio
Firebase Studio offers a possibility to run the app in an Android emulator (iOS you need to run locally on your Mac). In general the emulator works ok(ish), but during my test I found it a bit slow and buggy. In many cases I ended up pulling the code locally and running the app on my own Android and iOS phones / simulator, which was much faster and more reliable.

Thinking the whole AI agent development process, I think that this is the biggest drawback compared to web development, where the AI agent can run the app in a browser and pull logs directly into chat context.

Quick checklist before importing to Firebase Studio

✅ Do locally first

[ ] Create app shells in Google Play Console and App Store Connect
[ ] flutter create … with proper --org, --no-web
[ ] Update app names & other meta data
[ ] 🍏 Add iOS target, set deployment version
[ ] 🤖 Generate key.properties, keep outside Git
[ ] 🤖 Upgrade dependencies + Gradle wrapper, set SDK targets
[ ] 🤖 Upgrade build.gradle to skip signing on Firebase Studio debug version
[ ] Replace icons & launch screens with vectors

🚀 After import

[ ] When importing project into Firebase Studio check “This is a Flutter project”
[ ] Try to run flutter doctor and flutter run in Firebase Studio to make sure everything works

⚠️ Heads-up: you’ll still create the app shells in Google Play Console and App Store Connect manually.

💡 Pro tip: Skip the boilerplate trap. Create & clean the project locally, push to Git, then import from Git to Firebase Studio. Firebase will create required environment dev.nix files when you import your project.

3. Initial Prompt - First the Plan, Then the Code

Before the AI agent could write a single line of code, I loaded a clean and complete context. This started with loading four critical pieces:

/rules/ - to activate the full System Prompt: coding rules, feedback loops, commit strategy, testing discipline (Control - Part 1)
planning.md - to understand the big picture and architecture, and avoid writing code that would later conflict with future features
task.md - to get context on what’s already done, what’s in progress, and any known constraints
Task ID - to know exactly which feature to focus on and avoid scope bleed

This reset was not optional, it was the countermeasure to context drift.

As covered in Part 1 → Context Hygiene, long-running chats quickly spiral into hallucination territory. Each task was treated as a clean slate with fresh context.

But context alone wasn’t enough. The agent wasn’t allowed to just start coding.

Instead, the first step was scoping the task properly. The agent had to pause, reflect, and write down every question or uncertainty it still had. No assumptions. No guessing silently.

This up-front conversation was the foundation for building the User Prompt, our shared understanding of what we’re about to build.

Example of the first prompt to start new feature development

- Read `/rules/airules.md` and other instruction files that are defined in the `airules.md`.
- Read `task.md` for the implementation context.
- Now we are working with the task: ID-123 Implement Feature X
- Think step by step how to make implementation.
- Write an implementation plan and ask my approval before starting the implementation.
- _Before you start, list anything unclear. If you don’t know, ask now._
- Do not start coding before I approve your plan.

This small ritual turned the agent from a prompt follower into a planning partner. Only after this plan was reviewed and approved by me did the Planner begin its work and that up-front clarity meant fewer surprises and less cleanup later.

3.1 User Prompt - What to Build, Together

An AI agent can’t guess what to build. Every task begins by establishing a shared mental model. A short-lived but precise agreement on what the feature is, how it should behave, and how we’ll know it’s done. That’s the User Prompt, and it’s where tactical reasoning happens.

The User Prompt is built from three sources:

planning.md + task.md these define the overall scope and architecture, but only the relevant slices are pulled in for the current task.
Initial prompt and conversation an active dialogue with the agent to review the scope, clarify anything unclear, and co-create the plan.

Unlike the System Prompt (Part 1 → System Prompt), which can always be reloaded from /rules/ to restore the same mindset, the User Prompt exists only in memory for the current conversation. When the chat ends, it’s gone, and must be rebuilt from scratch the next time.

But this isn’t a top-down instruction set. The agent has to ask, clarify, and plan and I have to approve. That shared clarity is what keeps the Planner focused, the Executor scoped, and the Validator relevant.

When the AI agent stumbles, it’s almost always because the User Prompt was vague or incomplete. That’s why I force the agent to ask questions and write a plan before it writes a single line of code. The implementation must be explicit, not assumed.

3.2 The Power of Two Prompts - Behavior Meets Implementation

When you combine a System Prompt with a task-specific User Prompt, you give the AI agent both its job description and its exact mission.

System Prompt shapes how the agent works: how it commits, how it asks questions, how it tests.
User Prompt defines what to build right now: the logic, constraints, edge cases.

Together, they turn the AI agent into a real teammate. By keeping the rules persistent and the scope specific, I could trust the agent to move fast without breaking things that weren’t part of the current task.

Without both, the agent either forgets the bigger picture or gets lost in implementation details.

4. Flutter Development - How the Agent Held Up

To test how far an AI agent could really go - in terms of code quality, delivery speed, and handling edge cases - I deliberately picked two tricky features from a fully built-out app. The goal was to push the agent in real-world scenarios and see if this workflow could actually save serious dev hours in practice:

An analog clock widget that ticks every second - constant UI and state updates, custom painter logic.
Riverpod-powered state management - scoped providers and reactive rebuilds keep the architecture clean but can trip up less experienced setups fast.

4.1 Speed & code quality

Analog clock: The AI agent drew the full custom-painter dial, hands and smooth tick animation in ≈ 10 minutes. A couple of follow-up prompts polished the details. Doing this by hand would have been hours of work.
State management: The AI agent didn’t make solid architectural choices on its own. I mapped out the high-level graph up front. After a fresh start and a few rounds of tweaks and clarifications, the raw wiring went smoothly and the AI agent handled the repetitive parts well.

The result: On pure Flutter UI tasks the AI agent outpaced me 6–0. On architectural decisions, it needed my direction but once pointed, it did the heavy lifting.

4.2 Where It Struggled

These tests also revealed clear weak spots.

No eyes: The AI agent can’t see what’s happening in the Firebase Studio Android emulator screen. Screenshots were mandatory: I captured the UI, dropped it into the chat and asked for pixel tweaks.
No design instinct: Without explicit references, the AI agent’s visual taste is pretty much 90s terminal style. Only after I shared reference images did the styling improve.
No architecture awareness: The AI agent had no real sense of what helper methods or reusable patterns were already in the codebase. It often rewrote logic that existed elsewhere or missed calling shared utils, unless I pointed it to the right files and explained how similar code was structured.

4.3 Key Takeaway

These tricky features were perfect to test real limits. For Flutter code generation and Riverpod wiring the AI agent is a monster accelerator - minutes instead of hours. But it stays blind and tasteless. You’re still the art director feeding it screenshots and clear visual direction. With that human guidance, though, the AI agent can paint pixels and write Dart faster than you can open Figma.

One more thing: always guide your AI agent to refactor code systematically. This is how you can catch duplicated logic and keep your codebase clean over time. Add regular refactoring checkpoints to your task.md or set clear /rules for how often and where the AI agent should look for patterns to merge.

5. Native iOS and Android development

Flutter part proved the AI agent could handle coding fast. But mobile apps need native code, here's where things got interesting.

I knew that adding new targets is a task that you must do manually in Xcode and Android Studio, so I didn’t even try to ask the AI agent to do that. Instead, I created the targets manually and then let the AI agent focus on writing the code that runs in those targets.

At first, the AI agent treated the native code like it was just another SwiftUI / Android screen. Ignoring all the platform-specific quirks related to newly added targets: permission, entitlements, manifest tweaks.

So I did the one thing AI agent does well when you nudge it right: I told it to read Apple’s and Google’s docs first, then come back with a plan.

Once it read the docs, it surprised me. The AI agent wrote decent Swift and Kotlin glue. Not perfect, but runnable.

In native-heavy features I broke the work into four crystal-clear stages, each one nudging the AI agent to focus on a single layer at a time:

Stage	What the agent does	My role
1. Plan	Draft a complete architecture map for the feature, covering Flutter, Android, and iOS parts. Clearly split each platform’s role so no piece gets missed.	Verify Flutter ↔ Native bridge logic, fix blind spots, and approve the plan.
2. Flutter	Generate all Dart code first: UI, state management (like Riverpod), and method channels for platform calls.	Check logic, tweak naming, and make sure native calls are explicit.
3. Android	Write the native Android glue in Kotlin: `AppWidgetProvider`.	Validate permissions, Gradle tweaks, and any custom OS-level hooks, utilize Android Studio tools to validate code.
4. iOS	Implement the native iOS side in Swift: `WidgetExtension`.	Open in Xcode check signing & provisioning,entitlements, plist updates, utilize Xcode tools to validate code.

Each stage flows into the next: Flutter first, so the AI agent nails down method channel contracts and data shapes. Then Android glue connects to Flutter. Finally, iOS mirrors the same bridge, patching up any extra parameters the AI agent spots during build.

By carving it up this way, I kept the AI agent focused, the commits clean, and the cross-platform glue tight, one layer at a time.

💡 Pro tip: Don’t just run the agent. Decide who does what, guide the architecture, and split tasks so each side plays to its strengths.

6. Where Native Code Still Needs You

People underestimate this. Flutter won’t do it for you, and your AI agent won’t either.
Some things are still better to do in Android Studio or Xcode (or there are no other ways to do these).

6.1 Native Glue Code - 40% AI, 60% Me (Both Platforms)

The moment we crossed into Swift or Kotlin, the speed advantage dropped. The AI agent could write glue code (small methods, platform channels, a bit of SwiftUI). But when it came to signing, entitlements, deployment targets, Xcode project tweaks, it didn't have possibilities to do these or didn't have up-to-date info what / how to do these. But this was easily solved by me doing these manually in Xcode and Android Studio, or instructing AI agent to read the latest docs.

Available tools in Firebase Studio just doesn’t match Xcode or Android Studio for native tools. So even when the agent gave me runnable Kotlin, I found plenty of deprecated calls, outdated syntax, or just missing parts.

Example: Kotlin code by AI vs. Android Studio
// Ai agent code

internal fun scheduleNextUpdate(context: Context) {
    ...
    val backgroundColor = Color.parseColor("#FF121212")
    ...
    try {
        alarmManager.setExactAndAllowWhileIdle(AlarmManager.RTC_WAKEUP, nextUpdateTime, pendingIntent)
    } catch (e: Exception) {
        Log.e("ClockWidget", "[$timestamp] Failed to schedule alarm", e)
    }
}

// Proper Android Studio / Kotlin code

@RequiresPermission(Manifest.permission.SCHEDULE_EXACT_ALARM)
internal fun scheduleNextUpdate(context: Context) {
    ...
    val backgroundColor = "#FF121212".toColorInt()
    ...
    try {
        alarmManager.setExactAndAllowWhileIdle(AlarmManager.RTC_WAKEUP, nextUpdateTime, pendingIntent)
    } catch (e: Exception) {
        Log.e("ClockWidget", "[$timestamp] Failed to schedule alarm", e)
    }
}

So the AI agent got me 40% there. That last 60% (making it follow native Android & iOS coding style & annotations) still needed my own eyes and Xcode + Android Studio.

6.2 Branding, Permissions & The Final Manual Mile

Some things just stay manual. Visual assets and OS-level entitlements still live in Xcode and Android Studio, not in an AI agent prompt.

Task	Tool	Notes
Replace app icon	Xcode Asset Catalog / Android Studio Image Asset wizard	Use SVG / PDF vectors - tools render all sizes.
Custom launch screen	Same wizards as above	Remove the stock Flutter logo; keep it light.
Permissions & capabilities	Xcode & Android Studio	Flip toggles, add targets, push to git, let the agent handle code only.

My rule? Do these by hand once, commit, then keep the agent busy where it adds real value.

7. Where the Agent Saved (and Didn’t) My Time

The numbers below tell the whole story. Whenever the work stayed inside pure Flutter territory (widgets, Dart models, lightweight state and their unit tests) the AI agent chewed through tasks ~5 × faster than I do by hand.

As soon as we crossed into native Swift/Kotlin glue the boost shrank to 1.7 ×, and for signing, entitlements or full end-to-end runs the speed-up vanished: those steps are still human-only.

Task Type	Estimated Work Hours	AI + Human Hours	Speed Factor
Planning & Architecture	~10	~10	1.0× (manual)
Flutter UI & Layout	~58	~11	5.3×
State Management & Logic	~44	~10	4.4×
Native Integrations (iOS & Android)	~26	~15	1.7×
Unit & Widget Testing	~30	~4	7.5×
End-to-End Testing & QA	~6	~6	1.0× (manual)
Icons, Store, Metadata	~6	~6	1.0× (manual)
Total	~180	~62	~3× overall

📏 See my real manual vs. agent benchmarks
1️⃣ Global dayTimePeriod state

I wired up dayTimePeriod to update globally in the app, syncing widgets and screens to the current time.

• Manual run: ~3 hours - scoped providers, tested edge cases, debugged rebuilds.

• AI agent run: ~35 minutes - nailed the providers and rebuild logic in few iterations.

Boost: ~5.1× faster.

2️⃣ Analog Clock: New dayTimePeriod color sector

I added a new color sector to the custom analog clock face, updating dynamically with the dayTimePeriod.

• Manual run: ~3 hours - tweak painter logic, test rendering on real devices.

• AI agent run: ~25 minutes - handled the custom painter math perfectly, needed few extra round to tweak visuals.

Boost: ~7.2× faster.

These sample benchmarks line up with the overall Speed Factor: pure Flutter tasks easily hit 5-7× boost when scoped right.

Notes on What’s Not in These Hours 🔍

The table above covers only the hands-on coding and test work I clocked directly related to work with the AI agent.

CI/CD pipeline & release automation → Not included in the hours. This part was 100% manual, but absolutely critical: the full pipeline work (versioning, signing flows, App Store / Play Console configs, GitHub Actions) is covered separately in Part 3.
UI / UX design work → Not counted here. The design (screens, flow, user journeys, final mockups) were assumed done up front. This breakdown focuses purely on implementing those assets in code.

If you’re reading this as a dev planning your own agent setup: don’t underestimate these “invisible” hours. A clean CI/CD flow will save you weeks later, and no AI agent yet replaces a good human designer with good taste 🎨.

💡 Pro tip: A disciplined agent can absolutely deliver a 7× “wow” factor for well structure SW dev work, but the minute you need to do something else than coding, you’re back on the tools. Net result: a very real 3× productivity jump as long as your task blueprint is crystal-clear and you keep the AI agent focused on the parts it excels at.

That moment when you realize you’re just watching your AI agent write, test, commit, push, open a PR and pass checks - all while you sip your coffee - is both magic and just a tiny bit unnerving. More than once I caught myself staring at my screen for 5 minutes doing… nothing. The code just… shipped.

8. Firebase Studio Issues

My overall verdict on Firebase Studio as an AI-coding sandbox: usable, but slower and quirkier than a local setup.

Day-to-day friction

Sluggish Android emulator - Boot times were glacial; I often fell back to local setup for testing.
GitHub auth hiccups - Setting correct access rights was hard. (I’ll dig into the hacky fix in the next article.)
Nice touch: the agent pipes terminal output straight into the chat pane, so I didn’t need to tail logs in a separate window.

Below some of the quirks I hit:

“No space left on device” 🐘
At ~20 GB of usage Firebase Studio simply froze with

no space left on device.

Root cause: Gradle cache alone had ballooned to 12 GB inside the dev.nix env.
Quick tries (flutter clean, gradle clean) did nothing.
Nuclear workaround:
1. rm -rf ~/.gradle inside the workspace.
2. Add the Gradle distro back into dev.nix.
3. Re-run nix environment.
4. Discover the Android emulator now refuses to boot 🙃
Final: spin up a new Firebase Studio project, git clone the repo, and delete the old one. Yet another reason to commit early & often.

Google’s support replied with a generic “list files, then remove them safely” doc link, no actual cache-pruning guide. If you hit the quota wall, be ready to start fresh.

Android license weirdness
Firebase Studio ships a fresh SDK image and normal license approvals were needed. flutter doctor --android-licenses failed five times in a row; then, a few days later, licenses were magically accepted. I never found a root cause...

Bottom line: Firebase Studio works, but expect sluggish Android emulator, storage quotas, and the occasional phantom SDK glitch. Keep your repo pushed, caches light, and patience handy.

9. Key Takeaways - Part 2 ✅

Keep task.md alive. It’s the agent’s working brain, every plan, blocker, and fix lives here. Keep it up-to-date so the Control loop never loses track.
Split tasks smart. Identify where the AI agent really shines and steer your task list to match its strengths.
Manual native setup. You handle iOS/Android configs, signing, and entitlements.
Feed visuals. The agent is blind → give UI references and screenshots.
Tight commits. Small, atomic steps keep bugs cheap to fix.

These aren’t magic switches the AI flips by itself, they all rely on clear rules and scoped tasks you write up front in /rules and task.md. It’s human work first, every project, every time.

Next Up - Keep It Safe:

So the AI agent can blast through mobile code but shipping it means locking secrets tight and keeping your pipeline bulletproof.

In Part 3, I’ll break down how I managed GitHub PATs, Firebase configs, signing keys, and real device testing. You’ll see exactly how my CI/CD flow kept the agent honest and my keys secure.

💬 I’m still refining how I scope tasks and split what the agent does vs. me. What’s worked (or backfired) in your dev flow? Might borrow a trick!

👉 Stay tuned - [Part 4 → Coming Soon]

Master Autonomous AI Agent - Control Stack for Production Code

Teemu Piirainen — Mon, 21 Jul 2025 05:22:00 +0000

Who’s this for: Software and mobile developers who want to move beyond AI demos and bring an autonomous AI agent into real daily coding work, all the way to production.

TL;DR

Control, not just code
Turn a single-prompt agent into a repeatable, autonomous teammate with three roles (Planner → Executor → Validator) and a /rules/ folder.
Ship production-ready code without babysitting.

Series progress:

Control ▇▇▇▇▇ Build ▢▢▢▢▢ Release ▢▢▢▢▢ Retrospect ▢▢▢▢▢

Welcome to Part 1 of my deep-dive series on building an autonomous AI agent in a real-world SW development project.

0. The Problem: AI Agents Don’t Survive Production

Local IDE AI agents/helpers still need you to approve every diff.
SaaS platforms (such as Lovable, Bolt.new ...) work well for MVPs, but their black-box control stacks and lack of security and quality controls are show-stoppers for many large organizations looking to use them in production.

My hypothesis:

We can build a fully autonomous AI agent that organizations own and audit end-to-end, meeting enterprise-level demands for security, compliance, and CI/CD.

To validate the idea, I built an AI agent workflow, set it loose on a real project, and tracked its performance with the same KPIs my clients use. I capped my own input to roughly 2 hours a day for 30 days, mirroring the stop-and-go rhythm of real-world development with human-in-the-loop pauses.

The Experiment Project

After 20+ years in web and mobile development, I know “hello‑world demos” are cheap; shipping is hard. So instead of a single‑page web app, I threw the AI agent into the deep end:

Mobile app with Apple and Google platform requirements
Flutter as the main framework for a real mobile app
Swift ↔ Kotlin native code and permissions that Flutter can’t hide
A CI/CD pipeline wired with deployment builds from day one (a mobile application that I’d normally budget ~180 h of manual coding)

Most studies report only a 1.2×–1.5× productivity lift (Reuters, 2025‑07).
This blueprint aims much higher, using a scoped agent and control loop.

Series Roadmap - How This Blueprint Works

This is a full process, not just a single trick:

Control - Control Stack & Rules → trust your AI agent won’t drift off course
Build - AI agent starts coding → boundaries begin to show (Build - Part 2)
Release - CI/CD, secrets, real device tests → safe production deploy (Release - Part 3)
Retrospect - The honest verdict → what paid off, what blew up, what’s next

⚠️ I’m using Firebase Studio with Gemini 2.5 Pro here, but the control‑stack principles apply to any agent framework or IDE. I also ran the same flow locally with VS Code + Claude Sonet 4 → same results. ⚠️

If you want an AI agent you can scale, one that follows your rules, works like the rest of your team, and delivers repeatable results, this series is for you.

👉 Let’s dive in - this is Part 1: The Control Stack Blueprint.

1. System Prompt - How to Align the AI Agent’s Mindset

Every dev has seen it. When you use agent mode in your IDE, the agent suddenly touches files it shouldn’t, skips tests, and leaves TODOs in code. We’ve all tested those limits made fixes by modifying prompts.

To make an AI agent behave like a reliable teammate (not a loose cannon) we need to define in much deeper level how it works, not just what it builds. In other words:

How to do the work → covered here in this Control phase article
What work to do → tackled in the next Build phase article

Let’s start with the foundation: shaping how the AI agent thinks, acts, and writes code by creating a System Prompt.

In my setup it’s not just one file; it’s the full /rules/ directory. Together, these files define the AI agent’s mindset: how it commits, how it asks for help, how it escalates risk, how it tests its own code, and how it avoids doing anything dumb. This System Prompt drives the agent’s strategic reasoning, guiding long‑term decision‑making, your organization's SDLC (Software Development Life Cycle) rules, trade‑offs, and actions that consider multiple future scenarios.

These rules don’t live in the prompt. They’re loaded at the start of every task, just like a human would check your coding guide or dev playbook before writing production code.

Unlike the user prompt (more on that in Part 2), there’s no negotiation with System Prompt. This is the law. The AI agent doesn’t get a vote; it just follows the values, rules, and expectations I defined.

This was the first step in enforcing the Control Stack. Shaping how the agent thinks before it even looks at the task.

2. Role Assignment: Planner → Executor → Validator

Once I gave the AI agent clear rules (System Prompt), another question popped up: how do I make sure it actually follows them?

Turns out, trying to make an AI agent do everything at once (plan, code, test, validate) is a great way to watch it spiral into spaghetti logic and forgotten files.

So I gave the agent three distinct roles, each with a clear mission:

Planner → Think before you code: break down tasks, plan architecture, define sub‑tasks, acceptance criteria, and hand a sub‑task to the Executor.
Executor → For a given sub‑task, write and test production‑grade code, and fix issues.
Validator → Check the results, validate coverage, test rigorously, and decide whether the task is actually done.

Why three? Because even humans suck at multitasking. Specialising each role kept the AI agent sharp, scoped, and far less likely to go rogue. It also helped me debug when things went sideways: I could see which role failed and fix that part of the loop.

💡 Pro tip: Once the three core roles are running smoothly, you can add specialised roles, such as Architect, Security, Product Manager, etc. (if needed)

To make this work in production, we should also have a human‑in‑the‑loop, but ideally outside the control loop. Guide at the edges by approving the plan and reviewing PRs. Do this right and you get autonomy with accountability, a teammate that thinks on its own but stays within boundaries. That is the sweet spot.

2.1 Planner - The Architect

The Planner is the brain. It doesn’t write code; it figures out what needs to be written.

Every time a new task begins, the Planner receives a ready‑made context: all relevant files are pre‑loaded (task.md, planning.md, and /rules/). This setup gives the Planner everything it needs to plan the implementation, including scoped instructions and constraints that form the task’s mission.

How this context is constructed, including the initial prompt and the user prompt (more about this in Part 2).

Based on the received context, the Planner:

Analyses the architecture and breaks the task into manageable sub‑tasks
Defines acceptance criteria for each sub‑task
Updates task.md with new task entries, IDs, and current progress
Highlights dependencies or missing inputs
Ensures the plan aligns with the project’s long‑term structure

The goal is clarity. If the Planner messes up, everything downstream goes sideways.

Once the plan is complete, AI agent moves to the next phase.

💡 Pro tip: Don’t over‑restrict your AI agent. Guide it just enough, you only learn this by testing.

2.2 Executor - The Builder

The Executor is the pair of hands. It picks up one sub‑task, utilizes the predefined context from the Planner, and starts writing production‑ready code.

Its job isn’t just to code; it also needs to:

Follow the System Prompt rules: coding style, test‑first discipline, safe commit strategy
Run static analysis (flutter analyze) and write unit tests
Respect architectural boundaries, don’t rewrite unrelated files
Solve issues if tests fail, using debug strategies (like printing state)

If something fails repeatedly, the Executor stops and escalates to me instead of grinding forever.

Executor - Lessons Learned

Honestly, this was the toughest part of the 30‑day sprint.

Rules were too vague. Without strict coding and testing rules in the System Prompt, the agent just improvised.
Context drift was real. Long chats or fuzzy task descriptions made it lose focus fast.
Doing too much at once (planning, building, and testing in one go) led to sloppy, unpredictable behavior.

The whole goal was to build an autonomous AI agent that writes production‑ready code I could trust without babysitting it 24/7. It took me about 15 days of constant tweaking to get this balance right, but in the end I found a good middle ground that worked for this project.

💡 Pro tip: Give your AI agent concrete targets. It won’t write code exactly like you do, but you can train it to stick to your quality bar. Think of it like a senior‑to‑junior dev relationship: mentor it, don’t micromanage it.

💡 Pro tip: Set clear guardrails. For example: “If you fail to fix the same issue five times, stop and ping me.” That one rule alone saved me hours.

2.3 Validator - The Safety Net

The Validator is the reviewer. Once the Executor thinks it’s done, the Validator steps in and double‑checks:

Run all relevant tests and linters again
Verify that acceptance criteria from the Planner are fully met
Look for missing coverage, skipped test logic, or flaky behavior
Confirm that only the expected files changed, nothing outside scope
Make a clean commit: one sub‑task, one atomic commit

If anything fails validation, it bounces the task back to the Planner or Executor with an error message and reason.

Once all checks pass, the Validator triggers the final commit and marks the sub‑task as complete in task.md. If multiple sub‑tasks are defined, the cycle repeats and each one goes through the same Planner → Executor → Validator loop.

The Validator helps avoid the “looks fine to me” trap. It forces the agent to prove quality, not just assume it.

Validator - Lessons Learned

By the end of the sprint I didn’t check every line by hand during development. Instead, the Validator’s task was to find problems automatically. If the AI agent hit a blocker it couldn’t fix, it flagged that sub‑task as blocked in task.md and then pinged me for input.

But your AI agent can get stuck in loops. One time my AI agent got stuck rewriting the same unit test about ten times in a row. It kept trying until I stopped it and together we found the right solution. After that, it finished in a few iterations.

Another lesson: the AI agent once wrote a test loop that spammed the console so badly it pushed massive logs straight into my Gemini prompt, nearly blowing up my token budget.

Luckily, Gemini has some sanity checks. Otherwise my bill would’ve gone straight to the moon. Consider adding your own guardrails to catch runaway output early → every token costs.

The input token count (3 076 984) exceeds the maximum number of tokens allowed (1 048 576).
══╡ EXCEPTION CAUGHT BY RENDERING LIBRARY^C
 *  The terminal process "bash '-c', '(set -o pipefail && flutter test test/features/home/view/home_screen_test.dart 2>&1 | tee /tmp/ai.1.log)'" terminated with exit code: 130. 
 *  Terminal will be reused by tasks, press any key to close it.

💡 Pro tip: Your AI agent tries to diagnose bugs by reading code and terminal logs. When I personally hit tricky bugs during development, I normally add extra debug prints. I instructed the AI agent to do the same.
Once it learned how and when to use extra debug prints, it solved most bugs in five tries or fewer.

3. My Rules Folder - The Real Secret Weapon

We’ve now defined the AI agent’s three roles (Planner, Executor, and Validator) each with clear responsibilities. So where do these responsibilities actually live? How does the agent know how to plan, execute, test, commit, or ask for help?

That’s what the /rules/ folder is for. It’s the instruction manual, the coding playbook, the System Prompt, all written down.

What’s in /rules/?

The folder is small enough for the AI agent to read on every pass, yet opinionated enough to steer it towards my coding style and production requirements.

These rules weren’t static. During the first fifteen days, I tweaked them daily. When the agent slipped up, I updated the rules. By the last half of the sprint I needed to tweak less and less, and the AI agent started to understand how I wanted to approach development.

💡 Pro tip: Don’t over‑constrain your AI agent, but don’t leave it wandering, either. In many cases, the AI agent actually knows better than you how to write a block of code, but only if you teach it how to behave, not exactly what lines to type.

Agent /rules/ folder content ↓

Expand to open full list of rules

My Rules Folder - What My AI Agent Reads (tweak for your own use case)
(each file is 10‑70 lines)

✅ airules.md
→ Core principles & how to chain docs.

✅ accessibility.instructions.md
→ WCAG‑AA, large touch targets, screen‑reader OK.

✅ code-reuse.instructions.md
→ Reuse core/utils and shared test helpers.

✅ context-management.instructions.md
→ Ask specific PLANNING slices, skip huge files.

✅ development-workflow.instructions.md
→ File‑by‑file, analyse/test, quality gates.

✅ flutter-mobile.instructions.md
→ Riverpod, feature‑first, ≤500 code lines per file.

✅ git-workflow.instructions.md
→ Short branches, ID‑xx commits, CLI only.

✅ pull-request.instructions.md
→ Auto‑PR via CLI, extract ID‑xx, shell‑safe text.

…and more whenever the agent slips up.

Real‑world tips for writing effective AI agent rules

Focus on defining clear behavior rules: how to ask questions, how to commit, how to test, what style to follow - not micromanaging every line.

Tell it to reuse existing files, methods, and structure; don’t start from scratch every time.

- Give your AI agent real responsibility.
- It won’t write code exactly like you do and *that’s fine*.
- Your job is to make sure it sticks to your quality bar.
- Think of it like a senior developer working with a junior: you’re the mentor, not the babysitter.

Set simple guardrails:
– If it fails the same fix five times, stop and ask you.
– If a bug is tricky, let it use extra debug prints *just like you would*.
– If the plan is unclear, force it to ask questions first.

Do this right and your AI agent will surprise you, not by guessing less, but by guessing better.

4. Keep It Clean - Context Hygiene

One thing I learned the hard way: the longer a single chat context grows, the more the AI agent’s code quality tanks. It starts to hallucinate, miss obvious clues, or spam half‑baked suggestions.

The fix is simple: keep tasks small and short. Treat every task like a fresh sprint: reset the context, give your AI agent a clean slate.

Small tasks + clear phases + fresh context = your agent stays sharp, predictable, and actually finishes what you ask.

💡 Pro tip: If you’re pair‑programming with your AI agent, open a new chat more often than you think you need. Context bloat is real, and fresh chats keep your AI agent focused.

5. Trust Is Earned, Even for an AI Agent

Treat the AI agent like a new developer who just joined your team: talented, but not yet at full speed.

Your Planner → Executor → Validator loop is its work audition. Every cycle shows you whether the agent actually writes code the way you expect it to.

When (not if) it stumbles, modify the control stack → update /rules, refine prompts, replay the loop. This learning loop only works if you build real features with the agent, not just ask it for code snippets.

💡 Pro tip: Start by pair‑programming with your AI agent.

Watch how it behaves and explains its reasoning; you’ll quickly spot whether it follows your Planner → Executor → Validator flow or just makes things up. Use these insights to sharpen your /rules/ files and tighten the control loop.

With a clear loop and hands‑on experience, the agent can grow from a curious intern into a trusted teammate. But only if you invest time to coach it.

6. Key Takeaways - Part 1 ✅

Lock your System Prompt early. It anchors the agent’s mental model and yields the same behavior consistently.
Externalize guardrails in /rules/. Each rule-file is a contract the agent must obey.
Roles over tasks. Planner → Executor → Validator keeps scope sharp and failures traceable.
Short context loops beat marathon chats. Reset state often to avoid drift.
Control ⇒ Trust. Pair-program first, then grant autonomy when the agent consistently passes the loop.

Next Up - The Real Code Test:

Now you’ve got the blueprint, the /rules folder, and a single, disciplined loop.

In Part 2, I’ll show you what happened when the AI agent hit real Flutter code, crossed into native iOS/Android glue, and how this blueprint started to turn into concrete results.

💬 Got your own way to keep AI agents from running wild? Drop your favourite guardrail tricks or rules in the comments. I’d love to compare notes.

👉 Stay tuned — [Part 4 → Coming Soon]