Master Autonomous AI Agent - Control Stack for Production Code

Who’s this for: Software and mobile developers who want to move beyond AI demos and bring an autonomous AI agent into real daily coding work, all the way to production.

TL;DR

  • Control, not just code
  • Turn a single-prompt agent into a repeatable, autonomous teammate with three roles (Planner → Executor → Validator) and a /rules/ folder.
  • Ship production-ready code without babysitting.

Series progress:

Control ▇▇▇▇▇ Build ▢▢▢▢▢ Release ▢▢▢▢▢ Retrospect ▢▢▢▢▢

Welcome to Part 1 of my deep-dive series on building an autonomous AI agent in a real-world SW development project.

0. The Problem: AI Agents Don’t Survive Production

  • Local IDE AI agents/helpers still need you to approve every diff.
  • SaaS platforms (such as Lovable, Bolt.new ...) work well for MVPs, but their black-box control stacks and lack of security and quality controls are show-stoppers for many large organizations looking to use them in production.

My hypothesis:

We can build a fully autonomous AI agent that organizations own and audit end-to-end, meeting enterprise-level demands for security, compliance, and CI/CD.

To validate the idea, I built an AI agent workflow, set it loose on a real project, and tracked its performance with the same KPIs my clients use. I capped my own input to roughly 2 hours a day for 30 days, mirroring the stop-and-go rhythm of real-world development with human-in-the-loop pauses.

The Experiment Project

After 20+ years in web and mobile development, I know “hello‑world demos” are cheap; shipping is hard. So instead of a single‑page web app, I threw the AI agent into the deep end:

  • Mobile app with Apple and Google platform requirements
  • Flutter as the main framework for a real mobile app
  • Swift ↔ Kotlin native code and permissions that Flutter can’t hide
  • A CI/CD pipeline wired with deployment builds from day one (a mobile application that I’d normally budget ~180 h of manual coding)

Most studies report only a 1.2×–1.5× productivity lift (Reuters, 2025‑07).
This blueprint aims much higher, using a scoped agent and control loop.


Series Roadmap - How This Blueprint Works

This is a full process, not just a single trick:

  1. Control - Control Stack & Rules → trust your AI agent won’t drift off course
  2. Build - AI agent starts coding → boundaries begin to show (Build - Part 2)
  3. Release - CI/CD, secrets, real device tests → safe production deploy (Release - Part 3)
  4. Retrospect - The honest verdict → what paid off, what blew up, what’s next

⚠️ I’m using Firebase Studio with Gemini 2.5 Pro here, but the control‑stack principles apply to any agent framework or IDE. I also ran the same flow locally with VS Code + Claude Sonnet 4 → same results. ⚠️

If you want an AI agent you can scale, one that follows your rules, works like the rest of your team, and delivers repeatable results, this series is for you.

👉 Let’s dive in - this is Part 1: The Control Stack Blueprint.


AI agent control stack

1. System Prompt - How to Align the AI Agent’s Mindset

Every dev has seen it. When you use agent mode in your IDE, the agent suddenly touches files it shouldn’t, skips tests, and leaves TODOs in code. We’ve all tested those limits and made fixes by tweaking our prompts.

To make an AI agent behave like a reliable teammate (not a loose cannon) we need to define at a much deeper level how it works, not just what it builds. In other words:

  • How to do the work → covered here in this Control phase article
  • What work to do → tackled in the next Build phase article

Let’s start with the foundation: shaping how the AI agent thinks, acts, and writes code by creating a System Prompt.

In my setup it’s not just one file; it’s the full /rules/ directory. Together, these files define the AI agent’s mindset: how it commits, how it asks for help, how it escalates risk, how it tests its own code, and how it avoids doing anything dumb. This System Prompt drives the agent’s strategic reasoning: long‑term decision‑making, your organization’s SDLC (Software Development Life Cycle) rules, trade‑offs, and actions that weigh multiple future scenarios.

These rules don’t live in a one‑off chat prompt. They’re loaded at the start of every task, just like a human would check the coding guide or dev playbook before writing production code.
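To make that concrete, a task kickoff instruction along these lines forces the agent to load the rules before it touches anything. This is a simplified illustration, not my literal prompt (the real initial prompt is covered in Part 2):

```
Before you do anything for this task:
1. Read every file in /rules/ and treat it as non-negotiable.
2. Read planning.md and task.md for the current project state.
3. State which role you are acting in (Planner, Executor, or Validator).
Only then start on the task itself.
```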

Unlike the user prompt (more on that in Part 2), there’s no negotiation with the System Prompt. This is the law. The AI agent doesn’t get a vote; it just follows the values, rules, and expectations I defined.

This was the first step in enforcing the Control Stack. Shaping how the agent thinks before it even looks at the task.


2. Role Assignment: Planner → Executor → Validator

Once I gave the AI agent clear rules (System Prompt), another question popped up: how do I make sure it actually follows them?

Turns out, trying to make an AI agent do everything at once (plan, code, test, validate) is a great way to watch it spiral into spaghetti logic and forgotten files.

So I gave the agent three distinct roles, each with a clear mission:

  • Planner → Think before you code: break down tasks, plan architecture, define sub‑tasks, acceptance criteria, and hand a sub‑task to the Executor.
  • Executor → For a given sub‑task, write and test production‑grade code, and fix issues.
  • Validator → Check the results, validate coverage, test rigorously, and decide whether the task is actually done.
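To give a feel for how the roles are encoded, here’s a simplified sketch of a role section in a rules file. The wording is illustrative, not a copy of my actual files:

```
## Roles (one active at a time)
- Planner: read /rules/, planning.md and task.md; split the task into
  sub-tasks with acceptance criteria. Never write code.
- Executor: take exactly one sub-task; write the code and tests, run
  flutter analyze. Never touch files outside the sub-task's scope.
- Validator: re-run tests and linters, check acceptance criteria,
  confirm only the expected files changed, then commit.
Each role hands off explicitly to the next; no role skips ahead.
```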

Why three? Because even humans suck at multitasking. Specialising each role kept the AI agent sharp, scoped, and far less likely to go rogue. It also helped me debug when things went sideways: I could see which role failed and fix that part of the loop.

💡 Pro tip: Once the three core roles are running smoothly, you can add specialised roles, such as Architect, Security, Product Manager, etc. (if needed)

To make this work in production, we should also have a human‑in‑the‑loop, but ideally outside the control loop. Guide at the edges by approving the plan and reviewing PRs. Do this right and you get autonomy with accountability, a teammate that thinks on its own but stays within boundaries. That is the sweet spot.

2.1 Planner - The Architect

The Planner is the brain. It doesn’t write code; it figures out what needs to be written.

Every time a new task begins, the Planner receives a ready‑made context: all relevant files are pre‑loaded (task.md, planning.md, and /rules/). This setup gives the Planner everything it needs to plan the implementation, including scoped instructions and constraints that form the task’s mission.

How this context is constructed, including the initial prompt and the user prompt, is covered in Part 2.

Based on the received context, the Planner:

  • Analyses the architecture and breaks the task into manageable sub‑tasks
  • Defines acceptance criteria for each sub‑task
  • Updates task.md with new task entries, IDs, and current progress
  • Highlights dependencies or missing inputs
  • Ensures the plan aligns with the project’s long‑term structure
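For illustration, a Planner-written task.md entry might look roughly like this (the format is a simplified example, not my exact template):

```
## ID-12: Add onboarding screen
Status: in progress

- [ ] ID-12.1 Create OnboardingScreen widget (feature-first folder)
      Acceptance: renders 3 slides, widget test passes, flutter analyze clean
- [ ] ID-12.2 Wire "Get started" button to the router
      Acceptance: navigates to /home, covered by a navigation test

Dependencies / open questions:
- Final slide copy still missing → ask the human before starting ID-12.1
```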

The goal is clarity. If the Planner messes up, everything downstream goes sideways.

Once the plan is complete, the AI agent moves to the next phase.

💡 Pro tip: Don’t over‑restrict your AI agent. Guide it just enough; you only learn how much is enough by testing.

2.2 Executor - The Builder

The Executor is the pair of hands. It picks up one sub‑task, utilizes the predefined context from the Planner, and starts writing production‑ready code.

Its job isn’t just to code; it also needs to:

  • Follow the System Prompt rules: coding style, test‑first discipline, safe commit strategy
  • Run static analysis (flutter analyze) and write unit tests
  • Respect architectural boundaries, don’t rewrite unrelated files
  • Solve issues if tests fail, using debug strategies (like printing state)

If something fails repeatedly, the Executor stops and escalates to me instead of grinding forever.
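In rule form, the Executor’s quality gate and escalation path look something like this (a condensed paraphrase, not the literal file contents):

```
## Executor quality gate
- After every change, run flutter analyze and the tests for the touched feature.
- A sub-task is done only when the analyzer and the tests pass with zero issues.
- If the same failure survives 5 fix attempts: stop, mark the sub-task as
  blocked in task.md, and ask the human instead of looping.
```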

Executor - Lessons Learned

Honestly, this was the toughest part of the 30‑day sprint.

  • Rules were too vague. Without strict coding and testing rules in the System Prompt, the agent just improvised.
  • Context drift was real. Long chats or fuzzy task descriptions made it lose focus fast.
  • Doing too much at once (planning, building, and testing in one go) led to sloppy, unpredictable behavior.

The whole goal was to build an autonomous AI agent that writes production‑ready code I could trust without babysitting it 24/7. It took me about 15 days of constant tweaking to get this balance right, but in the end I found a good middle ground that worked for this project.

💡 Pro tip: Give your AI agent concrete targets. It won’t write code exactly like you do, but you can train it to stick to your quality bar. Think of it like a senior‑to‑junior dev relationship: mentor it, don’t micromanage it.

💡 Pro tip: Set clear guardrails. For example: “If you fail to fix the same issue five times, stop and ping me.” That one rule alone saved me hours.

2.3 Validator - The Safety Net

The Validator is the reviewer. Once the Executor thinks it’s done, the Validator steps in and double‑checks:

  • Run all relevant tests and linters again
  • Verify that acceptance criteria from the Planner are fully met
  • Look for missing coverage, skipped test logic, or flaky behavior
  • Confirm that only the expected files changed, nothing outside scope
  • Make a clean commit: one sub‑task, one atomic commit

If anything fails validation, it bounces the task back to the Planner or Executor with an error message and reason.

Once all checks pass, the Validator triggers the final commit and marks the sub‑task as complete in task.md. If multiple sub‑tasks are defined, the cycle repeats and each one goes through the same Planner → Executor → Validator loop.
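The commit step itself is rule-driven too. Here’s a condensed example of the kind of convention the Validator follows (illustrative wording, not my actual git-workflow file):

```
## Commit rules
- One sub-task = one atomic commit; never mix unrelated changes.
- Commit message format: "ID-12.1: add OnboardingScreen widget + tests"
- Commit via CLI only, and only after every Validator check has passed.
```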

The Validator helps avoid the “looks fine to me” trap. It forces the agent to prove quality, not just assume it.

Validator - Lessons Learned

By the end of the sprint I didn’t check every line by hand during development. Instead, the Validator’s task was to find problems automatically. If the AI agent hit a blocker it couldn’t fix, it flagged that sub‑task as blocked in task.md and then pinged me for input.

But your AI agent can get stuck in loops. One time my AI agent got stuck rewriting the same unit test about ten times in a row. It kept trying until I stopped it and together we found the right solution. After that, it finished in a few iterations.

Another lesson: the AI agent once wrote a test loop that spammed the console so badly it pushed massive logs straight into my Gemini prompt, nearly blowing up my token budget.

Luckily, Gemini has some sanity checks. Otherwise my bill would’ve gone straight to the moon. Consider adding your own guardrails to catch runaway output early → every token costs.

The input token count (3 076 984) exceeds the maximum number of tokens allowed (1 048 576).
══╡ EXCEPTION CAUGHT BY RENDERING LIBRARY^C
 *  The terminal process "bash '-c', '(set -o pipefail && flutter test test/features/home/view/home_screen_test.dart 2>&1 | tee /tmp/ai.1.log)'" terminated with exit code: 130. 
 *  Terminal will be reused by tasks, press any key to close it. 
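A cheap guardrail is to cap how much terminal output can ever reach the model, for example by piping test runs through tail. This is a generic sketch, not my exact setup:

```
## Output guardrails
- Pipe long-running commands through a cap before reading the output, e.g.
  flutter test 2>&1 | tee /tmp/ai.log | tail -n 200
- Never paste more than ~200 lines of logs into the chat; summarise the rest
  and reference the log file path instead.
```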

💡 Pro tip: Your AI agent tries to diagnose bugs by reading code and terminal logs. When I personally hit tricky bugs during development, I normally add extra debug prints. I instructed the AI agent to do the same.
Once it learned how and when to use extra debug prints, it solved most bugs in five tries or fewer.
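Written down as a rule, that could look roughly like this (my paraphrase, not the literal file):

```
- If a test fails twice with the same error, add temporary debugPrint
  statements around the failing state, re-run, and read the output.
- Remove all temporary debug prints before the final commit.
```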


3. My Rules Folder - The Real Secret Weapon

We’ve now defined the AI agent’s three roles (Planner, Executor, and Validator), each with clear responsibilities. So where do these responsibilities actually live? How does the agent know how to plan, execute, test, commit, or ask for help?

That’s what the /rules/ folder is for. It’s the instruction manual, the coding playbook, the System Prompt, all written down.

What’s in /rules/?

The folder is small enough for the AI agent to read on every pass, yet opinionated enough to steer it towards my coding style and production requirements.

These rules weren’t static. During the first fifteen days, I tweaked them daily. When the agent slipped up, I updated the rules. By the second half of the sprint I needed to tweak less and less, and the AI agent started to understand how I wanted to approach development.

💡 Pro tip: Don’t over‑constrain your AI agent, but don’t leave it wandering, either. In many cases, the AI agent actually knows better than you how to write a block of code, but only if you teach it how to behave, not exactly what lines to type.

Agent /rules/ folder content ↓

My Rules Folder - What My AI Agent Reads (tweak for your own use case)
(each file is 10‑70 lines)

✅ airules.md
→ Core principles & how to chain docs.

✅ accessibility.instructions.md
→ WCAG‑AA, large touch targets, screen‑reader OK.

✅ code-reuse.instructions.md
→ Reuse core/utils and shared test helpers.

✅ context-management.instructions.md
→ Ask specific PLANNING slices, skip huge files.

✅ development-workflow.instructions.md
→ File‑by‑file, analyse/test, quality gates.

✅ flutter-mobile.instructions.md
→ Riverpod, feature‑first, ≤500 code lines per file.

✅ git-workflow.instructions.md
→ Short branches, ID‑xx commits, CLI only.

✅ pull-request.instructions.md
→ Auto‑PR via CLI, extract ID‑xx, shell‑safe text.

…and more whenever the agent slips up.
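To show the level of detail, here’s what one of the smaller files could look like in full. This is an illustrative reconstruction of the style, not a copy of my actual context-management.instructions.md:

```
# context-management.instructions.md
- Ask only for the PLANNING sections relevant to the current sub-task.
- Don't load huge files in full; request the specific classes or methods you need.
- Start every task with a fresh chat; never rely on earlier conversation history.
- If required context is missing, ask the human instead of guessing.
```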

Real‑world tips for writing effective AI agent rules
Focus on defining clear behavior rules: how to ask questions, how to commit, how to test, what style to follow - not micromanaging every line.

Tell it to reuse existing files, methods, and structure; don’t start from scratch every time.

- Give your AI agent real responsibility.
- It won’t write code exactly like you do and *that’s fine*.
- Your job is to make sure it sticks to your quality bar.
- Think of it like a senior developer working with a junior: you’re the mentor, not the babysitter.

Set simple guardrails:
– If it fails the same fix five times, stop and ask you.
– If a bug is tricky, let it use extra debug prints *just like you would*.
– If the plan is unclear, force it to ask questions first.

Do this right and your AI agent will surprise you, not by guessing less, but by guessing better.


4. Keep It Clean - Context Hygiene

One thing I learned the hard way: the longer a single chat context grows, the more the AI agent’s code quality tanks. It starts to hallucinate, miss obvious clues, or spam half‑baked suggestions.

The fix is simple: keep tasks small and short. Treat every task like a fresh sprint: reset the context, give your AI agent a clean slate.

Small tasks + clear phases + fresh context = your agent stays sharp, predictable, and actually finishes what you ask.

💡 Pro tip: If you’re pair‑programming with your AI agent, open a new chat more often than you think you need. Context bloat is real, and fresh chats keep your AI agent focused.


5. Trust Is Earned, Even for an AI Agent

Treat the AI agent like a new developer who just joined your team: talented, but not yet at full speed.

Your Planner → Executor → Validator loop is its work audition. Every cycle shows you whether the agent actually writes code the way you expect it to.

When (not if) it stumbles, modify the control stack → update /rules, refine prompts, replay the loop. This learning loop only works if you build real features with the agent, not just ask it for code snippets.

💡 Pro tip: Start by pair‑programming with your AI agent.

Watch how it behaves and explains its reasoning; you’ll quickly spot whether it follows your Planner → Executor → Validator flow or just makes things up. Use these insights to sharpen your /rules/ files and tighten the control loop.

With a clear loop and hands‑on experience, the agent can grow from a curious intern into a trusted teammate. But only if you invest time to coach it.


6. Key Takeaways - Part 1 ✅

  • Lock your System Prompt early. It anchors the agent’s mental model and yields the same behavior consistently.
  • Externalize guardrails in /rules/. Each rule-file is a contract the agent must obey.
  • Roles over tasks. Planner → Executor → Validator keeps scope sharp and failures traceable.
  • Short context loops beat marathon chats. Reset state often to avoid drift.
  • Control ⇒ Trust. Pair-program first, then grant autonomy when the agent consistently passes the loop.

Next Up - The Real Code Test:

Now you’ve got the blueprint, the /rules folder, and a single, disciplined loop.

In Part 2, I’ll show you what happened when the AI agent hit real Flutter code, crossed into native iOS/Android glue, and how this blueprint started to turn into concrete results.

💬 Got your own way to keep AI agents from running wild? Drop your favourite guardrail tricks or rules in the comments. I’d love to compare notes.

👉 Stay tuned — [Part 4 → Coming Soon]

Top comments (2)

Davide

Wow, that was very interesting and instructive — thanks for sharing!

Could you add more detail on how you managed to enforce this part so effectively?

“Unlike the user prompt (more on that in Part 2), there’s no negotiation with System Prompt. This is the law. The AI agent doesn’t get a vote; it just follows the values, rules, and expectations I defined.”

You hint at the solution in the Planner → Executor → Validator loop:

A strict /rules/ folder acts as the agent’s non-negotiable guide.

Each role reads and applies these rules at runtime.

The Validator, in particular, ensures code meets all criteria before anything is marked complete — which is especially hard to get right.

In my experience, getting agents to consistently follow rules — especially for validation and quality checks — is one of the hardest challenges. If you pulled it off, that’s seriously impressive.

Teemu Piirainen

Thanks — that’s a great observation, and honestly, I don’t have a silver bullet answer for it.
But here’s how I tried to tackle it:

Getting agents to follow rules reliably was tough. So instead of trying to “teach” the agent everything upfront, I broke the problem down structurally.

The key was scope: a well-defined task (the first part done by me, the second part via the initial prompt and a chat with the agent - more about this in Part 2).
Just like we do in real development, I defined tasks that were small, self-contained, and unambiguous. This gave the agent a narrow, well-defined execution context.

Then I wrapped that into a 3-part control stack:

  • Planner breaks the main task into smaller subtasks
  • Executor handles just one subtask at a time
  • Validator checks that subtask deterministically against a predefined checklist

Because each loop only handles a tiny part of the whole, the agent doesn’t need to hold everything in memory. That structure (plus clear system prompts with architectural principles and modularity rules) made it a lot easier to get consistent behavior, especially around validation.

Would be nice to hear what you’ve learned and how you’ve solved this challenge.