<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Teemu Piirainen</title>
    <description>The latest articles on DEV Community by Teemu Piirainen (@teppana88).</description>
    <link>https://dev.to/teppana88</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3343587%2F7cf2618a-9170-4210-a879-80d7f8244d91.png</url>
      <title>DEV Community: Teemu Piirainen</title>
      <link>https://dev.to/teppana88</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/teppana88"/>
    <language>en</language>
    <item>
      <title>How I Validate Quality When AI Agents Write My Code</title>
      <dc:creator>Teemu Piirainen</dc:creator>
      <pubDate>Mon, 16 Mar 2026 07:04:39 +0000</pubDate>
      <link>https://dev.to/teppana88/how-i-validate-quality-when-ai-agents-write-my-code-481c</link>
      <guid>https://dev.to/teppana88/how-i-validate-quality-when-ai-agents-write-my-code-481c</guid>
      <description>&lt;p&gt;Someone asked me the best question after I posted about &lt;a href="https://www.linkedin.com/posts/teemupiirainen_aidevelopment-softwareengineering-aiagents-activity-7426512810890747905-ie-e" rel="noopener noreferrer"&gt;managing AI agents like a dev team&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And how do you validate quality?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fair point. If AI is writing the code, who's making sure it actually works?&lt;/p&gt;

&lt;p&gt;My solution: a system of enforced gates that makes shipping bad code harder than shipping good code. Here's how I built that system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Model: Quality Is a Pipeline, Not a Checkpoint
&lt;/h2&gt;

&lt;p&gt;We often think of quality as something you check at the end. Run the tests. Do a code review. Ship it.&lt;/p&gt;

&lt;p&gt;But the industry has already learned this lesson from SDLC and SSDLC:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Security and quality must be embedded in every phase, not bolted on at the end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same principle applies when AI writes the code. The difference is that you can't rely on an AI agent's discipline to follow the process. Your AI framework must enforce it through gates that agents cannot bypass.&lt;/p&gt;

&lt;p&gt;AI agents can produce plausible-looking code that passes superficial inspection but drifts from requirements, violates architecture patterns, or introduces subtle bugs. I first tried the obvious approach: detailed instructions telling the coding agent to handle testing, architecture patterns, and edge cases all at once. It never worked reliably. The breakthrough came when I loosened the constraints. Let the LLM write its best code freely, then build independent validation gates with separate agents that catch what the first one missed.&lt;/p&gt;

&lt;p&gt;My workflow has &lt;strong&gt;eight quality gates&lt;/strong&gt;. Code must pass through all of them before it reaches production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxsxkg6skle8n1ujqpw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxsxkg6skle8n1ujqpw5.png" alt=" " width="800" height="950"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If issues surface at Gate 5, 6, or 7, the fix flows back through Gate 3 → 4 before proceeding. In my experience, most issues are caught at Gate 4.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Gate 1: Requirements Definition (~70% of My Time)
&lt;/h2&gt;

&lt;p&gt;This is the most counterintuitive part. In an AI-native workflow, I spend roughly &lt;strong&gt;70% of my time&lt;/strong&gt; defining requirements, not writing code. My role has shifted from &lt;em&gt;how to build it&lt;/em&gt; to &lt;em&gt;what to build and why&lt;/em&gt;. The code is the agent's job. Getting the requirements right is mine.&lt;/p&gt;

&lt;p&gt;Why does this matter for quality? Because agents are extremely literal. Give them vague instructions and they'll build something that technically matches what you said but misses what you meant. The quality of AI output is directly proportional to the clarity of input.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;I use a &lt;strong&gt;requirements-analyst agent&lt;/strong&gt; that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the issue from our project management tool (Linear)&lt;/li&gt;
&lt;li&gt;Researches business requirements documentation to map functional and non-functional requirements&lt;/li&gt;
&lt;li&gt;Searches for industry patterns and best practices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asks me clarifying questions&lt;/strong&gt; until requirements are unambiguous&lt;/li&gt;
&lt;li&gt;Decomposes epics into right-sized stories with clear acceptance criteria&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every issue gets a structured format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## What&lt;/span&gt;
[Problem to solve]

&lt;span class="gu"&gt;## Why&lt;/span&gt;
[Business value]

&lt;span class="gu"&gt;## Context&lt;/span&gt;
[Constraints, dependencies, scope]

&lt;span class="gu"&gt;## Acceptance Criteria&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Criterion 1 (specific, testable)
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Criterion 2
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Criterion 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;acceptance criteria are the contract between me and the agents.&lt;/strong&gt; If a criterion is vague, the agent will interpret it however it wants. If it's specific and testable, the agent has a clear target, and so does the validator that checks the work later.&lt;/p&gt;
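&lt;p&gt;As a sketch of what "specific and testable" can mean in practice, here is a minimal check over a hypothetical issue shape. The field names and the vague-word heuristic are illustrative, not the real requirements-analyst agent's logic:&lt;/p&gt;

```typescript
// Hypothetical sketch: field names and the vague-word heuristic are
// illustrative, not the real requirements-analyst agent's logic.
interface Issue {
  what: string;
  why: string;
  context: string;
  acceptanceCriteria: string[];
}

const VAGUE_WORDS = ["fast", "nice", "good", "clean", "intuitive"];

// A criterion is a usable validation target only if it is concrete enough;
// this heuristic just rejects obvious vagueness and one-liners.
function isTestable(criterion: string): boolean {
  const text = criterion.trim().toLowerCase();
  if (text.length >= 10) {
    return !VAGUE_WORDS.some((w) => text.split(/\W+/).includes(w));
  }
  return false;
}

// Returns the criteria that would give an agent no clear target.
function vagueCriteria(issue: Issue): string[] {
  return issue.acceptanceCriteria.filter((c) => !isTestable(c));
}
```

&lt;p&gt;"Make it fast" fails this check; "POST /orders returns 201 and persists the order" passes, and doubles as a target for the validator later.&lt;/p&gt;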

&lt;p&gt;But requirements alone aren't enough. I also maintain &lt;strong&gt;architecture documentation&lt;/strong&gt;: files that describe the project's patterns, conventions, data models, and design system. When a code-architect agent later designs the implementation, it reads these docs and follows established patterns rather than inventing its own. The requirements define &lt;em&gt;what&lt;/em&gt;, the architecture docs constrain &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Prevents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Scope creep (agents build exactly what's specified, nothing more)&lt;/li&gt;
&lt;li&gt;Spec drift (each sub-task traces back to business requirements)&lt;/li&gt;
&lt;li&gt;Wasted iterations (ambiguities are resolved before any code is written)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Gate 2: Architecture Design
&lt;/h2&gt;

&lt;p&gt;Before any code is written, a &lt;strong&gt;code-architect agent&lt;/strong&gt; takes the requirements from Gate 1 and the architecture documentation I maintain, then designs the implementation. For example, my project maintains docs like these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docs/code-documentation/
├── architecture-backend.md
├── architecture-frontend.md
├── business-requirements.md
├── gcp-setup.md
├── design-system.md
├── testing-guidelines.md
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I typically maintain 10-20 such documents per project. These are living documents that evolve with the codebase. They serve as context for every agent, ensuring each one understands the project's patterns, conventions, and constraints before making any decisions.&lt;/p&gt;

&lt;p&gt;The architect agent reads relevant docs before designing anything, so it follows established patterns instead of inventing its own. Its process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads project architecture docs to understand established patterns and conventions&lt;/li&gt;
&lt;li&gt;Analyzes the existing codebase for relevant precedents&lt;/li&gt;
&lt;li&gt;Researches best practices for the specific technology stack&lt;/li&gt;
&lt;li&gt;Designs the feature architecture with specific file paths, component responsibilities, and data flow&lt;/li&gt;
&lt;li&gt;Breaks the work into ordered implementation phases&lt;/li&gt;
&lt;li&gt;Creates sub-issues for each phase with its own acceptance criteria&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I review the blueprint if it proposes changes to the general architecture. Personally, I want to understand and own the high-level design, but that's a preference, not a requirement of the system.&lt;/p&gt;

&lt;p&gt;The blueprint specifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every file&lt;/strong&gt; to create or modify&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Component responsibilities&lt;/strong&gt; and boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data flow&lt;/strong&gt; from entry points through transformations to outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build sequence&lt;/strong&gt; that defines which phases must complete before others can start&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each sub-issue carries its own acceptance criteria, which means the validator at Gate 4 has specific targets to check against. The quality chain is: requirements → architecture → implementation → validation, and each gate feeds the next.&lt;/p&gt;
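&lt;p&gt;As a sketch, the blueprint's build sequence can be represented and checked like this. The field names are assumptions, not the architect agent's actual output format:&lt;/p&gt;

```typescript
// Illustrative blueprint phase: field names are assumptions, not the
// architect agent's real output format.
interface Phase {
  id: string;
  dependsOn: string[];            // phases that must complete first
  files: string[];                // every file to create or modify
  acceptanceCriteria: string[];   // targets for the Gate 4 validator
}

// The build sequence is valid only if no phase appears before one of its
// dependencies, i.e. the list is a topological order.
function isValidSequence(phases: Phase[]): boolean {
  const done = new Set();
  for (const phase of phases) {
    if (!phase.dependsOn.every((d) => done.has(d))) {
      return false;
    }
    done.add(phase.id);
  }
  return true;
}
```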

&lt;h3&gt;
  
  
  What This Prevents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Architecture drift (agents follow established patterns, not their own ideas)&lt;/li&gt;
&lt;li&gt;Integration failures (data flow is designed upfront, not discovered during integration)&lt;/li&gt;
&lt;li&gt;Over-engineering (scope is bounded by the blueprint)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Gate 3: Implementation with Built-in Validation
&lt;/h2&gt;

&lt;p&gt;The developer agents (one per domain, such as backend and frontend) don't just write code and hand it off. They have mandatory validation steps built into their process. Why separate agents? Each one has a focused prompt, an isolated context window, and role-specific evaluation criteria. A backend agent doesn't need to know about React patterns, and vice versa.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incremental Testing
&lt;/h3&gt;

&lt;p&gt;After modifying or creating &lt;strong&gt;each file&lt;/strong&gt;, the agent runs only the related test file, not the full suite. This is deliberate: running all tests after every file change slows the agent dramatically, especially as the project grows and integration tests get heavier. By scoping to the affected test file, feedback cycles stay at seconds instead of minutes. The agent must fix failures before moving to the next file. This works well when test boundaries are clear (one service = one test file), and catches issues at the smallest possible scope. The full test suite runs later at Gate 4.&lt;/p&gt;
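&lt;p&gt;The "one service = one test file" mapping can be sketched as a simple path convention. The directory layout here is an assumption for illustration, not my actual repo structure:&lt;/p&gt;

```typescript
// Sketch of scoped test selection: derive the single test file to run for
// a changed source file. The __tests__ layout is an assumed convention.
function testFileFor(sourcePath: string): string | null {
  if (sourcePath.includes("__tests__")) {
    return null; // the changed file is itself a test
  }
  const match = sourcePath.match(/^(.*)\/([^/]+)\.ts$/);
  if (!match) {
    return null; // not a TypeScript source file
  }
  const dir = match[1];
  const name = match[2];
  return dir + "/__tests__/" + name + ".test.ts";
}
```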

&lt;h3&gt;
  
  
  Pre-Completion Validation
&lt;/h3&gt;

&lt;p&gt;Before reporting back, every developer agent must run and pass three checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Type-check&lt;/strong&gt;: zero errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lint&lt;/strong&gt;: zero errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test suite&lt;/strong&gt;: all tests pass + coverage &amp;gt;= 90% for new/modified files (as a minimum guardrail, not a quality metric, since high coverage alone doesn't prove tests are meaningful)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These checks use custom validation scripts that produce compact, structured output: a 5-line summary instead of hundreds of lines of test runner noise. This matters because &lt;a href="https://dev.to/teppana88/your-ai-coding-agents-are-slow-because-your-tools-talk-too-much-24h6"&gt;verbose tool output slows AI agents down significantly&lt;/a&gt;. When agents can parse results in seconds instead of scrolling through walls of text, the feedback loop stays tight.&lt;/p&gt;
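&lt;p&gt;A minimal sketch of such a script's core. The input shape is hypothetical, loosely modeled on a JSON test reporter; the real validation scripts are more involved:&lt;/p&gt;

```typescript
// Sketch: collapse a test run into a fixed five-line summary an agent can
// parse at a glance. The RunResult shape is an assumption, loosely modeled
// on a JSON test reporter, not the real validation script.
interface RunResult {
  passed: number;
  failed: number;
  coveragePct: number;   // line coverage for new/modified files
  failures: string[];    // names of failing tests, if any
}

function summarize(r: RunResult, minCoverage = 90): string {
  let status = "FAIL";
  if (r.failed === 0) {
    if (r.coveragePct >= minCoverage) {
      status = "PASS";
    }
  }
  const topFailures = r.failures.slice(0, 3).join("; ") || "none";
  return [
    "STATUS: " + status,
    "TESTS: " + r.passed + " passed, " + r.failed + " failed",
    "COVERAGE: " + r.coveragePct + "% (min " + minCoverage + "%)",
    "FAILURES: " + topFailures,
    "ACTION: " + (status === "PASS" ? "proceed" : "fix and re-run"),
  ].join("\n");
}
```

&lt;p&gt;Five lines, fixed positions, no scrolling. The agent reads the first line and knows what to do.&lt;/p&gt;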

&lt;h3&gt;
  
  
  What This Prevents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cascading failures (small scope means bugs are isolated to one subtask)&lt;/li&gt;
&lt;li&gt;Test regressions (existing tests must still pass before moving on)&lt;/li&gt;
&lt;li&gt;Untested code (90% coverage enforced per file)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Gate 4: Code Validator Agent
&lt;/h2&gt;

&lt;p&gt;After each developer agent completes, a dedicated &lt;strong&gt;code-validator agent&lt;/strong&gt; runs independently. This is the quality gate that blocks commits.&lt;/p&gt;

&lt;p&gt;The validator:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the issue and acceptance criteria&lt;/li&gt;
&lt;li&gt;Inspects recent changes and existing tests&lt;/li&gt;
&lt;li&gt;Runs the full test suite for affected packages&lt;/li&gt;
&lt;li&gt;Generates and reviews coverage reports&lt;/li&gt;
&lt;li&gt;Performs a code review focusing on correctness, edge cases, security, and project conventions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decides: PASS or FAIL&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This review focuses on the current subtask in isolation. The broader feature-level review happens at Gate 5.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confidence Scoring
&lt;/h3&gt;

&lt;p&gt;The validator rates each potential issue on a 0-100 confidence scale:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;False positive, not a real issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;Might be real, might be false positive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Real issue, but minor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;Verified real issue, will impact functionality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Confirmed critical issue&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Only issues with confidence &amp;gt;= 75 are reported.&lt;/strong&gt; The scoring uses structured prompts that require the agent to provide evidence for each finding. No evidence, no report. It's a pragmatic filtering mechanism that dramatically reduces noise and false positives.&lt;/p&gt;
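&lt;p&gt;The filtering rule itself is simple enough to sketch. Field names here are illustrative, not the validator's real schema:&lt;/p&gt;

```typescript
// Sketch of the reporting filter: keep only findings at or above the
// confidence threshold that also carry evidence. Field names are
// illustrative, not the validator's real schema.
interface Finding {
  description: string;
  confidence: number;   // 0-100, per the scale above
  evidence: string[];   // file/line references backing the claim
}

function reportable(findings: Finding[], threshold = 75): Finding[] {
  return findings
    .filter((f) => f.confidence >= threshold)   // below threshold: noise
    .filter((f) => f.evidence.length > 0);      // no evidence, no report
}
```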

&lt;h3&gt;
  
  
  The Hard Rule
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Commits are blocked until the validator returns PASS.&lt;/strong&gt; If it returns FAIL, the developer agent is re-invoked to fix the issues, and the validator runs again. The workflow enforces this automatically, so there's no way to skip it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer Agent
  ↓
Validator (FAIL)
  ↓
Developer Agent (fix)
  ↓
Validator (PASS)
  ↓
Commit allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
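&lt;p&gt;The loop above can be sketched as plain control flow. The two callbacks stand in for LLM agent invocations, and the bound on rounds is an assumed safety valve, not a documented setting:&lt;/p&gt;

```typescript
// Sketch of the enforced fix/validate loop: commit is only reachable
// through a PASS. The callbacks stand in for LLM agent invocations, and
// maxRounds is an assumed safety bound, not a documented setting.
interface Verdict {
  pass: boolean;
  findings: string[];
}

function runGate4(
  develop: (findings: string[]) => void,
  validate: () => Verdict,
  maxRounds = 3,
): boolean {
  let findings: string[] = [];
  for (let round = maxRounds; round > 0; round--) {
    develop(findings);            // first round: implement; later: fix
    const verdict = validate();   // independent validator agent
    if (verdict.pass) {
      return true;                // commit allowed
    }
    findings = verdict.findings;  // feed failures back to the developer
  }
  return false;                   // escalate to a human
}
```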



&lt;h3&gt;
  
  
  What This Prevents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Convention violations (code that works but doesn't follow project patterns)&lt;/li&gt;
&lt;li&gt;Coverage regressions (no commit without meeting the threshold)&lt;/li&gt;
&lt;li&gt;Blind spots from the writing agent (independent review catches what the author missed)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Gate 5: Multi-Agent Code Review
&lt;/h2&gt;

&lt;p&gt;While Gate 4 validates each subtask in isolation, Gate 5 reviews the &lt;strong&gt;entire feature&lt;/strong&gt; across all commits before creating a pull request. A code review skill runs multiple specialized agents in parallel:&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel Review Agents
&lt;/h3&gt;

&lt;p&gt;Four agents run simultaneously, each with a different focus:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Architecture Compliance&lt;/strong&gt;: Audit changes against architecture documentation, flag violations with exact rule citations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug Detection&lt;/strong&gt;: Scan the diff for logic errors, null handling issues, and edge cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Review&lt;/strong&gt;: Check for vulnerabilities, injection risks, and unsafe patterns in changed code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E2E Test&lt;/strong&gt;: Run an end-to-end test that exercises the new feature from the user's perspective&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Validation Round
&lt;/h3&gt;

&lt;p&gt;Each flagged issue goes through a separate validation agent that confirms the issue actually exists. This filters out false positives before any findings are reported.&lt;/p&gt;

&lt;h3&gt;
  
  
  High-Signal Only
&lt;/h3&gt;

&lt;p&gt;The review explicitly &lt;strong&gt;does not flag&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code style concerns (linters handle that)&lt;/li&gt;
&lt;li&gt;Subjective improvements&lt;/li&gt;
&lt;li&gt;Pre-existing issues not introduced in this change&lt;/li&gt;
&lt;li&gt;Pedantic nitpicks&lt;/li&gt;
&lt;li&gt;Patterns used consistently elsewhere in the codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What This Prevents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Architectural violations slipping through&lt;/li&gt;
&lt;li&gt;Security issues in new code&lt;/li&gt;
&lt;li&gt;Logic bugs that tests don't cover&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Gate 6: CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;Gates 3-5 all run on the agent's machine. Gate 6 is the first time the code runs in a &lt;strong&gt;completely independent environment&lt;/strong&gt;. When the pull request is marked ready for review, CI runs the full pipeline from scratch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Detect Changed Packages
        ↓
  Lint &amp;amp; Type Check
        ↓
  ┌─────┼─────┐
  ↓     ↓     ↓
Pkg A Pkg B Pkg C   (tests in parallel)
  ↓     ↓     ↓
  └─────┼─────┘
        ↓
      Build
        ↓
  Static Scanners
        ↓
  Ready for Review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Smart Change Detection
&lt;/h3&gt;

&lt;p&gt;The CI pipeline detects which packages changed and only runs their tests. If shared types change, all dependent packages are retested automatically because cascading dependencies are tracked. This keeps CI fast on small changes while still catching cross-package breakage.&lt;/p&gt;
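&lt;p&gt;The cascade amounts to a reachability walk over the package dependency graph. A sketch, with an illustrative map shape and package names:&lt;/p&gt;

```typescript
// Sketch of cascading change detection: start from the directly changed
// packages and pull in everything that depends on them, transitively.
// The dependents map shape and package names are illustrative.
interface DependentsMap {
  [pkg: string]: string[];   // package -> packages that import it
}

function affectedPackages(changed: string[], dependents: DependentsMap) {
  const affected = new Set(changed);
  const queue = changed.slice();
  while (queue.length > 0) {
    const pkg = queue.pop()!;
    for (const dep of dependents[pkg] ?? []) {
      if (!affected.has(dep)) {
        affected.add(dep);
        queue.push(dep);
      }
    }
  }
  return affected;
}
```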

&lt;h3&gt;
  
  
  What the Pipeline Runs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lint &amp;amp; Type Check&lt;/strong&gt;: Static analysis across all changed packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-package tests&lt;/strong&gt;: Unit and integration tests run in parallel for each affected package&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt;: Full production build of all changed modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static Scanners&lt;/strong&gt;: Run static analysis tools to catch potential security issues before merging&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Draft PR Strategy
&lt;/h3&gt;

&lt;p&gt;PRs are always created as drafts first. CI skips draft PRs to save CI minutes. When ready for review, the PR is marked as non-draft, which triggers the full pipeline. This means CI resources are only spent on code that's already passed all local gates (Gate 3 + Gate 4).&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Prevents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Environment-specific failures (clean CI, not the developer's machine)&lt;/li&gt;
&lt;li&gt;Cross-package breakage (shared type changes tested across all dependents)&lt;/li&gt;
&lt;li&gt;Build failures in production configuration&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Gate 7: Human Review and Merge
&lt;/h2&gt;

&lt;p&gt;This is the only manual approval gate in the entire pipeline. After CI passes, I personally review the changes before merging. This is a critical checkpoint that forces me to consciously take ownership of delivered code. I want to understand what changed at a high level so that I'm able to steer future work and make informed decisions about architecture and design patterns.&lt;/p&gt;

&lt;p&gt;The review is intentionally lightweight. By this point, the code has already passed five automated gates. I'm not hunting for bugs or style issues. I'm checking that the feature makes sense, the approach aligns with where the project is heading, and nothing looks fundamentally wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Prevents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Losing architectural awareness (I stay informed about every change)&lt;/li&gt;
&lt;li&gt;Autopilot merging (conscious decision to ship, not rubber-stamping)&lt;/li&gt;
&lt;li&gt;Strategic drift (changes that technically work but move the project in the wrong direction)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Gate 8: Deployment Verification
&lt;/h2&gt;

&lt;p&gt;On merge to main, automated release tooling creates a versioned release, and the deploy pipeline runs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Validates environment variables&lt;/strong&gt; before building (catches missing config early)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Builds all changed modules&lt;/strong&gt; with production configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploys only changed components&lt;/strong&gt;: backend, frontend, and infrastructure rules are deployed independently based on what actually changed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verifies all deployments succeeded&lt;/strong&gt;: if any component fails, the release is marked as failed with actionable retry instructions&lt;/li&gt;
&lt;/ol&gt;
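&lt;p&gt;The environment check in step 1 can be sketched as follows. The variable names used in the example are made up for illustration:&lt;/p&gt;

```typescript
// Sketch of the pre-build environment check: report the full list of
// missing variables at once instead of failing mid-deploy. Variable names
// used in examples are made up for illustration.
interface Env {
  [name: string]: string | undefined;
}

function missingEnvVars(required: string[], env: Env): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

function assertEnv(required: string[], env: Env): void {
  const missing = missingEnvVars(required, env);
  if (missing.length > 0) {
    throw new Error("Deploy blocked, missing env vars: " + missing.join(", "));
  }
}
```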

&lt;h3&gt;
  
  
  What This Prevents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Deploying with missing or misconfigured environment variables&lt;/li&gt;
&lt;li&gt;Deploying unchanged components unnecessarily&lt;/li&gt;
&lt;li&gt;Silent partial failures (one component fails but the release looks successful)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The System in Practice
&lt;/h2&gt;

&lt;p&gt;Here's what a typical feature looks like flowing through these gates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Define requirements           [~1 hour]
2. Architecture design           [~10 min]
3. Implementation + tests        [~0.5-6 hours in total]
4. Validator after each phase    [~3 min each]
5. Code review before PR         [~5 min]
6. CI pipeline                   [~8 min]
7. I review and merge            [~10 min]
8. Deploy on merge               [~5 min]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Regardless of feature size, the validation overhead stays roughly bounded: about 20 minutes of automated checks. The implementation time scales with complexity, but the quality gates are much less variable. That's the point.&lt;/p&gt;

&lt;h3&gt;
  
  
  When the Pipeline Catches Something
&lt;/h3&gt;

&lt;p&gt;Here's a real example. A developer agent implemented a new feature that added a field to a shared data model. Unit tests passed. Type-check passed. Coverage was above 90%. The agent reported success.&lt;/p&gt;

&lt;p&gt;Then the validator ran. It detected that while the new field existed in the TypeScript interface and the backend service, the Firestore converter (responsible for translating between the database and the application) was never updated. Data would be written to the database but silently lost on read. The validator returned FAIL, the developer agent was re-invoked with the specific finding, and it fixed the converter in under a minute.&lt;/p&gt;

&lt;p&gt;Without Gate 4, that bug would have shipped. Unit tests didn't catch it because they mocked the database layer. The type system didn't catch it because the converter used a spread operator that silently dropped unknown fields. Only an independent agent reviewing the full change against project conventions found the gap.&lt;/p&gt;
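&lt;p&gt;To make the failure mode concrete, here is an illustrative reconstruction. The types and field names are invented, not the project's real data model:&lt;/p&gt;

```typescript
// Illustrative reconstruction of the converter gap: the interface gains a
// field, but the read path never copies it from the document, so data is
// written to the database and silently lost on read. Types and field
// names are invented, not the project's real model.
interface FirestoreDoc {
  [field: string]: unknown;
}

interface Conversation {
  id: string;
  title: string;
  archived: boolean;   // the newly added field
}

// Buggy read path: spreads defaults, then copies only the fields it
// knows. The type-checker is satisfied because `archived` gets the
// default value, so the dropped data goes unnoticed.
function fromFirestoreBuggy(doc: FirestoreDoc): Conversation {
  const defaults = { archived: false };
  return {
    ...defaults,
    id: String(doc.id),
    title: String(doc.title),
  };
}

// Fixed read path: every interface field is mapped explicitly.
function fromFirestoreFixed(doc: FirestoreDoc): Conversation {
  return {
    id: String(doc.id),
    title: String(doc.title),
    archived: Boolean(doc.archived ?? false),
  };
}
```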

&lt;p&gt;That failure became a permanent memory entry. Now every agent touching shared data models gets warned: &lt;em&gt;"Converter updates require synchronized changes in 4+ locations."&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Long-Term Memory: Quality That Improves Itself
&lt;/h2&gt;

&lt;p&gt;Without persistent memory, every session starts from zero. Agents repeat the same mistakes, validators catch the same failures, and I re-explain the same constraints. The quality gates work, but they don't get better.&lt;/p&gt;

&lt;p&gt;Long-term memory closes this gap. It forms a feedback loop with the gates: the validator catches a failure, that failure becomes a memory entry, and in the next session the developer agent gets warned &lt;em&gt;before it writes a single line of code&lt;/em&gt;. The agent avoids the mistake. The validator confirms. Gates catch problems once. Memory prevents them forever.&lt;/p&gt;

&lt;p&gt;This compounds. Early in a project, agents make more mistakes and the validator catches them frequently. After 10+ runs, agents start each session already knowing dozens of project-specific traps. Validator failures become rarer. The system gets faster because it spends less time fixing and re-running.&lt;/p&gt;

&lt;p&gt;Here are a few real pitfalls that the pipeline caught and encoded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zod/TypeScript sync&lt;/strong&gt;: Adding interface fields requires updating Zod schemas AND all consumers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test mock indices&lt;/strong&gt;: New LLM-calling nodes shift ALL mock call indices in integration tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config wiring&lt;/strong&gt;: Adding a parameter signature without reading config is a silent no-op&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Converter updates&lt;/strong&gt;: New conversation fields require synchronized updates in 4+ locations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't hypothetical. Each one caused a real failure, was caught by the validation pipeline, and became permanent institutional knowledge. Every project accumulates its own version of this list.&lt;/p&gt;
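&lt;p&gt;A hypothetical sketch of how such entries could be stored and retrieved before an agent starts work. The structure is an assumption, not the real memory system:&lt;/p&gt;

```typescript
// Hypothetical memory store: validator failures become entries keyed by
// the file paths they concern, and agents fetch matching warnings before
// writing any code. The structure is an assumption, not the real system.
interface MemoryEntry {
  pattern: RegExp;   // which file paths this lesson applies to
  warning: string;
}

function warningsFor(files: string[], memory: MemoryEntry[]): string[] {
  return memory
    .filter((m) => files.some((f) => m.pattern.test(f)))
    .map((m) => m.warning);
}
```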

&lt;p&gt;The result is a quality system that develops itself. Every feature it validates teaches it how to validate the next one better. No human intervention needed for this loop to run. This is fundamentally different from static tooling that works the same way on day one and day three hundred.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I've Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Front-load requirements, not reviews
&lt;/h3&gt;

&lt;p&gt;The biggest quality lever isn't better testing. It's clearer requirements. When I spend an hour defining exactly what a feature should do with specific acceptance criteria, the agents produce correct code on the first try far more often than when I rush through requirements and rely on review to catch problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Separate writing from validation
&lt;/h3&gt;

&lt;p&gt;Don't ask the same agent to write code and verify it. That's like having students grade their own exams. The coding agent's job is to write the best code it can. The validator is a separate agent with a separate prompt, separate context, and explicit permission to fail the work. It has no incentive to pass. This separation is what makes the gates trustworthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. One subtask at a time
&lt;/h3&gt;

&lt;p&gt;The natural instinct is to implement a full feature and test at the end. That's where quality breaks down. Instead, break the work into small subtasks, implement one, validate it, commit it, then move to the next. Each commit is a known-good checkpoint. When something fails, the blast radius is one subtask, not an entire feature. This pattern is counterintuitive but it's the most practical change another developer could adopt immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Enforce the process in the framework, not in prompts
&lt;/h3&gt;

&lt;p&gt;You can't tell an AI agent to "be careful" and expect consistent results. The quality comes from a workflow that runs validation automatically after every subtask, not from instructions asking agents to remember to test. Bake the gates into the framework so they execute by default. When skipping a gate is harder than following it, quality becomes a property of the system rather than a hope.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. This is an engineering problem, not an AI problem
&lt;/h3&gt;

&lt;p&gt;The question isn't:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can AI write good code?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It can.&lt;/p&gt;

&lt;p&gt;The question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does your system prevent bad code from shipping?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That requires overlapping automated gates, independent validation agents, long-term memory, and a workflow that enforces all of it. No single technique is enough. The system is the product.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tools: Claude Code, Codex, specialized AI agents per role, skills, long-term memory for persistent learnings, git worktrees, Linear for issue tracking, GitHub Actions for CI/CD.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>agents</category>
      <category>security</category>
    </item>
    <item>
      <title>Your AI Coding Agents Are Slow Because Your Tools Talk Too Much</title>
      <dc:creator>Teemu Piirainen</dc:creator>
      <pubDate>Sat, 07 Feb 2026 10:39:43 +0000</pubDate>
      <link>https://dev.to/teppana88/your-ai-coding-agents-are-slow-because-your-tools-talk-too-much-24h6</link>
      <guid>https://dev.to/teppana88/your-ai-coding-agents-are-slow-because-your-tools-talk-too-much-24h6</guid>
      <description>&lt;p&gt;Our AI code validator agent took &lt;strong&gt;608 seconds&lt;/strong&gt; to report results from a test suite that runs in 96 seconds. The agent wasn't stupid. The tool output was.&lt;/p&gt;

&lt;p&gt;Every developer tool we use (test runners, linters, compilers, build systems) was designed for humans reading a terminal. When an AI agent reads that same output through a context window, things break in ways you don't expect. This is one example of that problem, and a pattern for fixing it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Symptom
&lt;/h2&gt;

&lt;p&gt;We run a TypeScript monorepo with ~12,000 tests across four packages. After each feature, a code-validator agent runs tests and reports pass/fail with coverage. Simple job.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent Task&lt;/th&gt;
&lt;th&gt;Actual Test Time&lt;/th&gt;
&lt;th&gt;Agent Time&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backend (3,683 tests)&lt;/td&gt;
&lt;td&gt;24s&lt;/td&gt;
&lt;td&gt;224s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.3x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend (7,450 tests)&lt;/td&gt;
&lt;td&gt;96s&lt;/td&gt;
&lt;td&gt;608s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.3x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent was spending 6-9x longer &lt;em&gt;understanding the results&lt;/em&gt; than the tests took to run.&lt;/p&gt;

&lt;h2&gt;
  
  
  What The Agent Actually Did
&lt;/h2&gt;

&lt;p&gt;We parsed the agent transcripts (every tool call, every reasoning step). Here's the backend agent's actual sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. npm run test:coverage           → 419KB output, truncated at 235KB
2. grep "Tests" /tmp/output.log    → matched console.log JSON, not summary
3. npm run test:coverage           → re-ran entire suite. Truncated again.
4. tail -20 /tmp/output.log        → got coverage table row, not summary
5. grep -E "passed|failed"         → matched 47 lines of noise
6. npm run test:coverage           → third complete re-run
   ... repeated 6 times total ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;12 tool calls. 6 complete test re-runs. 224 seconds.&lt;/strong&gt; To answer a yes/no question.&lt;/p&gt;

&lt;p&gt;The frontend agent was worse: &lt;strong&gt;28 tool calls&lt;/strong&gt;, 5 test re-runs, 13 different grep/tail/head combinations trying to parse a coverage text table. It even reported a &lt;strong&gt;false failure&lt;/strong&gt; — incorrectly flagging coverage as below threshold because it parsed the wrong line.&lt;/p&gt;

&lt;p&gt;Why? Because vitest produces this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ✓ src/services/__tests__/userService.test.ts (12 tests) 45ms
 ✓ src/services/__tests__/authService.test.ts (8 tests) 23ms
   ... 1,386 more files ...

 Test Files  1389 passed (1389)
      Tests  3683 passed (3683)
   Duration  24.1s

----------|---------|----------|---------|---------|
File      | % Stmts | % Branch | % Funcs | % Lines |
----------|---------|----------|---------|---------|
   ... 141 rows ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;419KB of human-readable output. The answer, &lt;strong&gt;five numbers&lt;/strong&gt;, sits at the bottom. The context window truncates from the bottom, so the agent never sees it.&lt;/p&gt;

&lt;p&gt;You wouldn't send 419KB of raw HTML to a mobile app and tell it to regex out the data. But that's exactly what we were doing with our agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;We stopped asking "how do we make the agent parse this better" and asked "&lt;strong&gt;can we give the agent a command that just outputs the answer?&lt;/strong&gt;"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;RESULT_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;trap&lt;/span&gt; &lt;span class="s1"&gt;'rm -f "$RESULT_FILE"'&lt;/span&gt; EXIT

&lt;span class="c"&gt;# JSON reporter writes structured data to file. Everything else → /dev/null.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PKG_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npx vitest run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reporter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--outputFile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null 2&amp;gt;&amp;amp;1

&lt;span class="c"&gt;# Extract exactly what the agent needs&lt;/span&gt;
&lt;span class="nv"&gt;PASSED_TESTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;jq &lt;span class="s1"&gt;'.numPassedTests'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;FAILED_TESTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;jq &lt;span class="s1"&gt;'.numFailedTests'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;SUCCESS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;jq &lt;span class="s1"&gt;'.success'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"RESULT=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SUCCESS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"PASS"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FAIL"&lt;/span&gt; &lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"TESTS=&lt;/span&gt;&lt;span class="nv"&gt;$PASSED_TESTS&lt;/span&gt;&lt;span class="s2"&gt; passed, &lt;/span&gt;&lt;span class="nv"&gt;$FAILED_TESTS&lt;/span&gt;&lt;span class="s2"&gt; failed"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"WALL_TIME=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;WALL_TIME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s"&lt;/span&gt;

&lt;span class="c"&gt;# On failure only: extract what failed&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SUCCESS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.testResults[] | select(.status == "failed") |
    "FILE: \(.name)\n\([.assertionResults[] |
    select(.status == "failed") | "  - " + .fullName] | join("\n"))"
  '&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-30&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three decisions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reporter=json&lt;/code&gt;&lt;/strong&gt; — vitest writes structured JSON to a file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;gt; /dev/null 2&amp;gt;&amp;amp;1&lt;/code&gt;&lt;/strong&gt; — 419KB of terminal noise disappears&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;jq&lt;/code&gt;&lt;/strong&gt; — extracts five numbers from structured data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent now sees this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== VALIDATION: test:backend ===
RESULT=PASS
SUITES=1389 passed, 0 failed (1389 total)
TESTS=3683 passed, 0 failed (3683 total)
WALL_TIME=40s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five lines. One tool call. No parsing, no ambiguity, no re-runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern Is Everywhere
&lt;/h2&gt;

&lt;p&gt;This isn't a vitest problem. It's a tool output problem. Every developer tool your agent touches has the same issue:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linters&lt;/strong&gt; — ESLint's default output is human-friendly. &lt;code&gt;eslint --format json&lt;/code&gt; gives your agent structured violations with file paths, line numbers, and severity — no parsing needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type checkers&lt;/strong&gt; — &lt;code&gt;tsc --noEmit&lt;/code&gt; dumps errors to stderr as human-readable text. A 5-line wrapper that counts errors and captures file paths turns it into a structured report.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build tools&lt;/strong&gt; — &lt;code&gt;docker build&lt;/code&gt; streams layers of progress output. The agent only needs: did it succeed, what's the image size, how long did it take.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt; — &lt;code&gt;terraform plan&lt;/code&gt; produces pages of human-readable diff. &lt;code&gt;terraform plan -json&lt;/code&gt; gives your agent a structured changeset it can reason about.&lt;/li&gt;
&lt;/ul&gt;
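&lt;p&gt;The type-checker item above can be sketched concretely. This is a hypothetical wrapper, not our exact script: it assumes &lt;code&gt;tsc&lt;/code&gt;’s usual &lt;code&gt;file(line,col): error TSxxxx&lt;/code&gt; diagnostic format, and inline sample diagnostics stand in for a real &lt;code&gt;npx tsc --noEmit&lt;/code&gt; run:&lt;/p&gt;

```shell
# Hypothetical wrapper: turn tsc's human-readable diagnostics into a
# compact agent report. Sample lines stand in for: npx tsc --noEmit
LOG=$(mktemp)
printf '%s\n' \
  "src/a.ts(10,5): error TS2339: Property 'foo' does not exist on type 'Bar'." \
  "src/b.ts(3,1): error TS2304: Cannot find name 'baz'." > "$LOG"

# Count diagnostics and emit a one-glance verdict
ERRORS=$(grep -c "error TS" "$LOG")
if [ "$ERRORS" -eq 0 ]; then echo "RESULT=PASS"; else echo "RESULT=FAIL"; fi
echo "TYPE_ERRORS=$ERRORS"

# On failure only: cap the detail so the context window stays small
if [ "$ERRORS" -gt 0 ]; then grep "error TS" "$LOG" | head -20; fi
rm -f "$LOG"
```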

&lt;p&gt;The pattern is always the same: the tool already has structured output (JSON, machine-readable flags), but the default is designed for a terminal. &lt;strong&gt;Switch the format, discard the noise, extract the answer.&lt;/strong&gt;&lt;/p&gt;
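&lt;p&gt;For example, the linter case ends up looking almost identical to the vitest wrapper. A minimal sketch, assuming ESLint’s JSON formatter output (an array of objects with &lt;code&gt;filePath&lt;/code&gt;, &lt;code&gt;errorCount&lt;/code&gt;, &lt;code&gt;warningCount&lt;/code&gt;); inline sample data stands in for a real &lt;code&gt;npx eslint . --format json&lt;/code&gt; run:&lt;/p&gt;

```shell
# Hypothetical wrapper: summarize ESLint JSON output for an agent.
# Sample JSON stands in for: npx eslint . --format json -o "$RESULT_FILE"
RESULT_FILE=$(mktemp)
printf '%s' '[{"filePath":"src/a.ts","errorCount":2,"warningCount":1},
             {"filePath":"src/b.ts","errorCount":0,"warningCount":3}]' > "$RESULT_FILE"

# Sum counts across all files with jq
ERRORS=$(jq '[.[].errorCount] | add' "$RESULT_FILE")
WARNINGS=$(jq '[.[].warningCount] | add' "$RESULT_FILE")

if [ "$ERRORS" -eq 0 ]; then echo "RESULT=PASS"; else echo "RESULT=FAIL"; fi
echo "LINT=$ERRORS errors, $WARNINGS warnings"

# On failure only: offending files, capped
if [ "$ERRORS" -gt 0 ]; then
  jq -r '.[] | select(.errorCount > 0) | .filePath' "$RESULT_FILE" | head -20
fi
rm -f "$RESULT_FILE"
```

&lt;p&gt;The agent’s view collapses from hundreds of violation lines to a verdict plus two counts, with detail only when something actually failed.&lt;/p&gt;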

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backend: tool calls&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend: agent time&lt;/td&gt;
&lt;td&gt;224s&lt;/td&gt;
&lt;td&gt;42s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend: tool calls&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend: agent time&lt;/td&gt;
&lt;td&gt;608s&lt;/td&gt;
&lt;td&gt;66s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False failures&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test re-runs per agent&lt;/td&gt;
&lt;td&gt;5-6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same tests. Same agent. Same model. Same prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The industry is pouring effort into prompt engineering, model selection, and agent frameworks. Meanwhile, half the agent's context window is filled with ANSI color codes, progress bars, and output that was never meant for machine consumption. The context window is a scarce resource: treat it like memory, not a terminal screen.&lt;/p&gt;

&lt;p&gt;When your agent is slow, don't start with the prompt. Start with what the tools are sending back. Audit every command your agent runs. If the output is more than a screenful, the agent is probably struggling with it. Most tools already support structured output: JSON flags, machine-readable formats, quiet modes. Use them. And where they don't exist, a simple wrapper script that filters noise and extracts the answer will do more for your agent's performance than any prompt rewrite.&lt;/p&gt;

&lt;p&gt;The fastest agent isn't the one with the best reasoning. It's the one that doesn't have to reason about the data format at all.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Based on a real optimization on a production TypeScript monorepo with ~12,000 vitest tests. The pattern — structured output, noise suppression, answer extraction — applies to any tool your agents touch.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>react</category>
      <category>learning</category>
    </item>
    <item>
      <title>AI Agent - Lessons Learned</title>
      <dc:creator>Teemu Piirainen</dc:creator>
      <pubDate>Mon, 11 Aug 2025 05:12:00 +0000</pubDate>
      <link>https://dev.to/teppana88/ai-agent-lessons-learned-4lmg</link>
      <guid>https://dev.to/teppana88/ai-agent-lessons-learned-4lmg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who’s this for:&lt;/strong&gt; Builders and skeptics who want honest numbers: did an AI coding agent really save time, money, and sanity or just make a mess faster?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⌛ &lt;strong&gt;~60 h&lt;/strong&gt; build time (↓~66 % from 180 h)&lt;/li&gt;
&lt;li&gt;💸 &lt;strong&gt;$283&lt;/strong&gt; token spend&lt;/li&gt;
&lt;li&gt;🚀 &lt;strong&gt;374 commits, 174 files, 16,215 lines of code&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🤖 &lt;strong&gt;1 new teammate&lt;/strong&gt; - writes code 10× faster but only listens if you give it rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Series progress:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Control ▇▇▇▇▇ Build ▇▇▇▇▇ Release ▇▇▇▇▇ Retrospect ▇▇▢▢▢&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Part 4&lt;/strong&gt;, the final piece: my honest verdict.&lt;/p&gt;

&lt;p&gt;Now the question: &lt;strong&gt;Was it worth it?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the numbers add up?&lt;/li&gt;
&lt;li&gt;Where did the agent pay off?&lt;/li&gt;
&lt;li&gt;Where did it backfire?&lt;/li&gt;
&lt;li&gt;How would I push it further next time?&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Series Roadmap - How This Blueprint Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One last time, here’s the big picture:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Control&lt;/strong&gt; - Control Stack &amp;amp; Rules → trust your AI agent won’t drift off course (&lt;a href="https://dev.to/teppana88/master-autonomous-ai-agent-control-stack-for-production-code-27je"&gt;Control - Part 1&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; - AI agent starts coding → boundaries begin to show (&lt;a href="https://dev.to/teppana88/i-shipped-3x-more-features-with-one-ai-agent-all-production-ready-3lf"&gt;Build - Part 2&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release&lt;/strong&gt; - CI/CD, secrets, real device tests → safe production deploy (&lt;a href="https://dev.to/teppana88/release-ai-agent-code-safely-production-cicd-secrets-5ecj"&gt;Release - Part 3&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrospect&lt;/strong&gt; - The honest verdict → what paid off, what blew up, what’s next&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why care?&lt;/strong&gt;&lt;br&gt;
The AI agent is a &lt;strong&gt;code machine&lt;/strong&gt; that never sleeps, knows every library, and &lt;strong&gt;wants to push commits 24/7&lt;/strong&gt; - but without your control, it has no clue what the end product should be.&lt;/p&gt;

&lt;p&gt;👉 &lt;em&gt;Let’s break it down - this is Part 4.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Did the Numbers Add Up?
&lt;/h2&gt;

&lt;p&gt;Back in &lt;a href="https://dev.to/teppana88/master-autonomous-ai-agent-control-stack-for-production-code-27je"&gt;Part 1&lt;/a&gt;, I posed the core challenge:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Can we build a fully autonomous AI agent (one that an organisation can own and audit end-to-end) and make it deliver real, production-grade code, with just a fraction of human input?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That meant no black-box SaaS tools. No prompt-hacking toys. Just a scoped AI teammate working inside a real, observable control loop: &lt;strong&gt;Planner → Executor → Validator&lt;/strong&gt;, backed by rules I could evolve and CI/CD pipelines I already trust.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Short answer:&lt;/strong&gt; &lt;em&gt;YES.&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s the breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Effort&lt;/strong&gt; - ~60 h of my time with the AI agent delivered the same output I’d expect from 180 h solo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Money&lt;/strong&gt; - $283 in Gemini 2.5 Pro tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; - Flutter work flew by &lt;strong&gt;5–7× faster&lt;/strong&gt;, native Swift/Kotlin dragged to &lt;strong&gt;&amp;lt;2×&lt;/strong&gt;, landing a real-world &lt;strong&gt;~3× boost&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery&lt;/strong&gt; - 374 commits, 174 files, 16,215 lines of code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust&lt;/strong&gt; - Every task passed through my control loop, tested and clean. Full control. Full traceability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So was it cheap? &lt;strong&gt;Absolutely - but only because I stayed in the loop&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
The $283 didn’t &lt;em&gt;magically buy 180 hours of code&lt;/em&gt;. It bought an extra pair of hands that turned my 60 hours into a full 180-hour deliverable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Bottom line:&lt;/strong&gt; The &lt;strong&gt;3× boost&lt;/strong&gt; didn’t come from magic, it came from structure.&lt;br&gt;
The agent didn’t invent new skills; it scaled the ones I already had.&lt;br&gt;
Sometimes it even surprised me, but only because the groundwork made it possible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The stack I share here worked and was battle-tested in &lt;strong&gt;June 2025&lt;/strong&gt;. Treat this write-up as a snapshot, not a rulebook.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Lessons Learned - What I’d &lt;strong&gt;Keep&lt;/strong&gt; Next Time ✅
&lt;/h2&gt;

&lt;p&gt;Some parts of the setup worked better than expected and these are the ones I’d repeat from day one. Most of them are invisible from the outside, but they made the difference between chaos and clarity.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Trust Isn’t Given - It’s Built
&lt;/h3&gt;

&lt;p&gt;One of the biggest takeaways from integrating an AI agent into my workflow is that trust doesn’t happen by default. You don’t get it just because the agent can write “good” code. You earn it by proving, over and over, that the agent can operate reliably inside the same guardrails as the human team.&lt;/p&gt;

&lt;p&gt;When trust is missing, adoption stalls. Every bug becomes a reason to sideline the agent rather than improve it. Pull requests sit unreviewed because no one wants to take responsibility. Eventually, the “AI teammate” becomes just another unused tool.&lt;/p&gt;

&lt;p&gt;The turning point was treating the agent like a real developer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make its actions visible so everyone can see what it’s doing and why&lt;/li&gt;
&lt;li&gt;Start small and collect wins before scaling up&lt;/li&gt;
&lt;li&gt;Learn from mistakes and feed those lessons back into its instructions and rules&lt;/li&gt;
&lt;li&gt;Require approval for every plan before coding starts&lt;/li&gt;
&lt;li&gt;Apply the same rules as for human devs (no shortcuts because it’s an agent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built through visibility, shared rules, guardrails, and real accountability, that trust made me comfortable approving the agent’s work. Without it, none of the technical improvements would have mattered.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Control Stack First, Prompts Second
&lt;/h3&gt;

&lt;p&gt;I gave the agent a &lt;strong&gt;state‑machine style loop&lt;/strong&gt; (Planner → Executor → Validator), similar to what &lt;a href="https://www.anthropic.com/engineering/claude-code-best-practices" rel="noopener noreferrer"&gt;Anthropic’s best‑practice write‑up&lt;/a&gt; recommends. It forced the AI agent to think before splatting out code and gave me natural checkpoints to cancel nonsense.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 Rules as Live Documentation
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;/rules/airules.md&lt;/code&gt; began as &lt;strong&gt;nothing&lt;/strong&gt;, ballooned into a &lt;strong&gt;400-line beast&lt;/strong&gt;, and finally slimmed down to a &lt;strong&gt;tight 70-liner&lt;/strong&gt; that covers only what matters. By week’s end the agent spoke my dialect (thinking process, code architecture, commit style) with minimal reminders.&lt;/p&gt;

&lt;p&gt;JetBrains’ &lt;a href="https://www.jetbrains.com/help/junie/customize-guidelines.html" rel="noopener noreferrer"&gt;Junie guideline files&lt;/a&gt; show the same pattern: &lt;em&gt;write rules once, enforce forever&lt;/em&gt;. But “forever” takes discipline.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 Ruthless Task Scoping
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with a living PRD:&lt;/strong&gt;
Draft a concise &lt;strong&gt;Product Requirements Document&lt;/strong&gt; that maps the entire service: user flows, non-functional needs, “nice-to-have” ideas, everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slice every feature into bite-size tasks:&lt;/strong&gt;
Break big rocks into shovel-ready tickets. Do it yourself, subdivide the work, and make sure each task fits comfortably in a single AI-agent task context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let the agent explode tasks into execution units:&lt;/strong&gt;
When implementation starts, the agent generates its own &lt;strong&gt;subtasks, acceptance notes, and edge cases&lt;/strong&gt;, and keeps that checklist current as it commits code.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2.5 Secrets Stay Secret
&lt;/h3&gt;

&lt;p&gt;Fine‑grained PAT plus GitHub Secrets meant the agent never held a signing key. The 2025 &lt;a href="https://www.wiz.io/blog/leaking-ai-secrets-in-public-code" rel="noopener noreferrer"&gt;Wiz secrets‑leak report&lt;/a&gt; is proof that anything less is asking for page‑one headlines.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.6 Real-Device CI/CD - The Only Trustworthy Loop
&lt;/h3&gt;

&lt;p&gt;CI/CD isn’t optional when working with AI agents, it’s what turns speed into reliability. No pipeline, no autonomy - just faster mistakes.&lt;/p&gt;

&lt;p&gt;Every pull request goes through the same pipeline: build, sign, ship to &lt;strong&gt;TestFlight&lt;/strong&gt; and &lt;strong&gt;Play Console&lt;/strong&gt;. That means the code lands on real hardware, gets tested by real eyes, and reveals real bugs the agent never saw coming.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The first sprint showed what works. These are the bets I’ll double down on next time to turn speed into consistency - not just more commits.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Lessons Learned - Where I’ll &lt;strong&gt;Push Further&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Smarter Context, Longer Autonomy
&lt;/h3&gt;

&lt;p&gt;Big LLMs forget fast. The fix isn’t just more tokens. It’s structured, real-time access to the whole repo, open tasks, recent merges - everything that defines “what’s really going on.”&lt;/p&gt;

&lt;p&gt;The longer a single chat grows, the worse the output gets. So I’d like to push for smarter retrieval next time: live task lists, commit-aware reasoning, and context that updates as the codebase evolves.&lt;/p&gt;

&lt;p&gt;Both &lt;a href="https://cognition.ai/blog/dont-build-multi-agents#applying-the-principles" rel="noopener noreferrer"&gt;Devin&lt;/a&gt; and &lt;a href="https://www.anthropic.com/engineering/built-multi-agent-research-system" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; hit the same wall: without structured, evolving context, long autonomous runs just fall apart - even though one favors single-agent and the other multi-agent setups.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In my own sprint, I tackled the same challenge by keeping tasks small and starting each with a clean context. A simple but surprisingly effective workaround.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3.2 Scaling the Team
&lt;/h3&gt;

&lt;p&gt;One dev plus one agent is simple. But the moment you add more people (or more agents) things can get messy fast.&lt;/p&gt;

&lt;p&gt;Who owns what? What changed while the agent was thinking? What if two agents fix the same bug? Without shared state and safe commit boundaries, you don’t get more speed. Just more conflict.&lt;/p&gt;

&lt;p&gt;One thing I’d like to try: &lt;strong&gt;task leasing&lt;/strong&gt;. An AI agent (Planner) picks up a task from a shared hub (like &lt;code&gt;task.md&lt;/code&gt;), checks whether other tasks running in parallel (by other agents or humans) might impact the work, and validates the state (Validator) before pushing code. Paired with clean CI/CD and commit guards, that might keep the swarm aligned.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;These ideas would need careful testing in real-world coding workflows, as current multi-agent systems often fail due to shared-state complexity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3.3 True TDD Loop
&lt;/h3&gt;

&lt;p&gt;Tests after code worked fine, but next run I’ll have the agent flip it: failing test first, fix second. This tightens the feedback loop and cuts down on surprises at the Validator step.&lt;/p&gt;

&lt;p&gt;Anthropic recommends the same test-first mindset in their &lt;a href="https://www.anthropic.com/engineering/claude-code-best-practices" rel="noopener noreferrer"&gt;Claude Code Best Practices&lt;/a&gt;: write the tests first, confirm they fail, then guide your agent to turn them green. The goal is the same: catch bugs early, not after they hit production.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Deeper Static Analysis
&lt;/h3&gt;

&lt;p&gt;Syntax checks aren’t enough. Unlike experienced developers, AI agents don’t intuitively spot complex or fragile code structures. Adding tools like SonarQube or Qodana to the CI pipeline gives early feedback on code quality, helping catch issues the AI agent might repeat without realizing.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 UAT Feedback Automation
&lt;/h3&gt;

&lt;p&gt;One practical issue I didn’t solve: how do you get human testers’ feedback into the agent’s task list? On a team, I would create a hook integrated with Jira or Slack (or whatever tool the team uses) so that testers could report issues and the AI agent would pick them up and automatically create a linked task. But here I didn’t have a team, so I added the issues manually to &lt;code&gt;task.md&lt;/code&gt; and let the agent handle them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Mobile dev: I think the biggest improvement would come from giving the AI agent access to the Android emulator UI during development. It could have run the tests and fixed the majority of issues automatically as it coded. But as I mentioned earlier, due to technical limitations in Firebase Studio, real-time emulator access wasn’t available in this project.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. MCP (Model Context Protocol) - The Next Frontier
&lt;/h2&gt;

&lt;p&gt;Everything so far (coding speed-up, the Planner–Executor–Validator loop, and the CI/CD pipeline) was built and tested during the initial 30-day sprint. But while writing this recap, one thing became clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There’s still room to grow. And it starts with context.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The core limitation I ran into was this: the AI agent didn’t truly “know” what was happening outside the project. Each task was handled in context isolation by one AI agent.&lt;/p&gt;

&lt;p&gt;That missing context (the lack of shared memory or real-time awareness) is what I’ll explore next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why MCP matters&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt; lets an agent bolt on extra &lt;strong&gt;tools&lt;/strong&gt; in real time. Need repo search? A test runner? Up-to-date docs? Hook it up as a structured API call instead of fragile prompt glue.&lt;/p&gt;

&lt;p&gt;Below are some candidate features I’m excited to try:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared brain, real‑time&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every agent writes to (and reads from) a live &lt;strong&gt;&lt;code&gt;task.md&lt;/code&gt; hub&lt;/strong&gt;. Spin up ten runners, one planner, one validator; they all share the same queue and state, acting like one coherent engineer instead of a Slack channel on fire.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always‑fresh docs (Context7 FTW)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Because MCP streams docs in via &lt;code&gt;Context7&lt;/code&gt;, the agent’s knowledge is never stale. Update the README, push to main, and the next call sees it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise-ready glue&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Forget brittle webhooks - MCP exposes Jira, GitHub, Slack as typed plugin calls. The agent plugs straight into your existing workflows without extra scaffolding.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Bottom line: MCP turns one clever AI agent into a fully-armed AI agent squad, all speaking the same language and pulling from the same live playbook.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Final Word
&lt;/h2&gt;

&lt;p&gt;AI agents won’t replace you, but they will scale what you can deliver.  &lt;/p&gt;

&lt;p&gt;All this worked because I’ve got 20+ years of real dev work behind me: the blueprint, the rules, the guardrails all come from doing the work first.&lt;/p&gt;

&lt;p&gt;If we hand every line of code to the AI, that craft fades and soon there’s nothing left to steer.&lt;/p&gt;

&lt;p&gt;So protect your craft. Build the hard parts yourself. Keep your edge, so the agent stays a partner - not the boss.&lt;/p&gt;

&lt;p&gt;Wire it tight, scope it clear, and let your AI agent prove it can keep up!&lt;/p&gt;




&lt;h2&gt;
  
  
  6. If You Want The 3× Boost - Do This ✅
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control Stack First.&lt;/strong&gt; Planner → Executor → Validator - no shortcuts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep &lt;code&gt;/rules&lt;/code&gt; alive.&lt;/strong&gt; Update instructions as your agent learns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope tasks tight.&lt;/strong&gt; Small tasks, clear acceptance notes, tracked commits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets stay secret.&lt;/strong&gt; Repo-scoped PAT + CI/CD secrets only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real tests.&lt;/strong&gt; Always run on real hardware, no emulator-only trust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch, learn, tweak.&lt;/strong&gt; Your agent only stays smart if you guide it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ready for next?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I’m planning to plug MCP into the &lt;a href="https://www.all-hands.dev/" rel="noopener noreferrer"&gt;All‑Hands AI framework&lt;/a&gt; next, linking multiple agents with a shared brain and tighter feedback loops. I’ll share how that turns out once I’ve pushed it far enough to see what breaks.&lt;/p&gt;

&lt;p&gt;💬 Seen anything I missed? Or got your own battle story testing AI agents on real projects? Drop it in the comments. I read them all.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Release AI Agent Code Safely - Production CI/CD &amp; Secrets</title>
      <dc:creator>Teemu Piirainen</dc:creator>
      <pubDate>Mon, 04 Aug 2025 05:40:00 +0000</pubDate>
      <link>https://dev.to/teppana88/release-ai-agent-code-safely-production-cicd-secrets-5ecj</link>
      <guid>https://dev.to/teppana88/release-ai-agent-code-safely-production-cicd-secrets-5ecj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who’s this for:&lt;/strong&gt; Devs, team leads, and DevOps folks responsible for a production CI/CD pipeline - looking to integrate AI agents that generate code, without losing reliability or control.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Secrets, Pipelines, Real Tests:&lt;/strong&gt;&lt;br&gt;
Fine-grained Personal Access Tokens (PATs) protected my repo, GitHub Actions auto-built every PR, a second AI agent reviewed commits, and a human approved the PR. Real-device tests closed the loop - &lt;strong&gt;still ~3× faster&lt;/strong&gt;.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Series progress:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Control ▇▇▇▇▇ Build ▇▇▇▇▇ Release ▇▇▇▇▇ Retrospect ▢▢▢▢▢&lt;/p&gt;

&lt;p&gt;Welcome to &lt;strong&gt;Part 3&lt;/strong&gt; of my deep-dive series on building an autonomous AI agent: how do you &lt;strong&gt;actually deploy&lt;/strong&gt; AI agent code safely?&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 1&lt;/strong&gt;, I locked my agent inside a clear &lt;strong&gt;Planner → Executor → Validator&lt;/strong&gt; loop.&lt;br&gt;
In &lt;strong&gt;Part 2&lt;/strong&gt;, I proved it could blast through real Flutter tasks and handle native Swift/Kotlin with human guard-rails.&lt;/p&gt;

&lt;p&gt;But shipping day is where AI agents usually faceplant: leaked secrets, unclear ownership of responsibility, and apps that break on real devices.&lt;/p&gt;

&lt;p&gt;This part breaks down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How I kept secrets safe (fine-grained GitHub PATs, repo-scoped only)&lt;/li&gt;
&lt;li&gt;How I automated CI/CD (GitHub Actions, PR reviews with a second AI)&lt;/li&gt;
&lt;li&gt;How to integrate real device testing into the loop&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;&lt;strong&gt;Series Roadmap - How This Blueprint Works&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Control&lt;/strong&gt; - Control Stack &amp;amp; Rules → trust your AI agent won’t drift off course (&lt;a href="https://dev.to/teppana88/master-autonomous-ai-agent-control-stack-for-production-code-27je"&gt;Control - Part 1&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; - AI agent starts coding → boundaries begin to show (&lt;a href="https://dev.to/teppana88/i-shipped-3x-more-features-with-one-ai-agent-all-production-ready-3lf"&gt;Build - Part 2&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release&lt;/strong&gt; - CI/CD, secrets, real device tests → safe production deploy
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrospect&lt;/strong&gt; - The honest verdict → what paid off, what blew up, what’s next&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why care?&lt;/strong&gt;&lt;br&gt;
Without proper CI/CD controls and human-in-the-loop rules, your AI agent can go rogue - just like &lt;a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/" rel="noopener noreferrer"&gt;Replit’s did when it dropped a live database&lt;/a&gt; during a code freeze. No guardrails, no mercy.&lt;/p&gt;

&lt;p&gt;👉 &lt;em&gt;Let’s secure it - this is Part 3.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  1. App Observability: Keep Analytics and Crash Reports Under Control
&lt;/h2&gt;

&lt;p&gt;Firebase Studio can auto-generate configuration files, but I didn’t want it touching anything sensitive. It also offers a one-click setup for Crashlytics and Analytics, but using it would have meant linking my personal credentials. That wasn’t acceptable.&lt;/p&gt;

&lt;p&gt;Instead, I handled the setup manually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created the Firebase project through the Firebase Console
&lt;/li&gt;
&lt;li&gt;Registered iOS and Android apps and downloaded the required config files
&lt;/li&gt;
&lt;li&gt;Added &lt;code&gt;GoogleService-Info.plist&lt;/code&gt; and &lt;code&gt;google-services.json&lt;/code&gt; to the project folders
&lt;/li&gt;
&lt;li&gt;Configured dependencies and updated the Podfile by hand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using the Firebase CLI was an option, but running it inside Studio through an AI agent didn’t meet my security bar.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. GitHub Access: Minimal Permissions, Full Workflow
&lt;/h2&gt;

&lt;p&gt;Firebase Studio initially requested full GitHub access. That was not acceptable. It then suggested a general Personal Access Token, which was still too broad for my setup.&lt;/p&gt;

&lt;p&gt;Instead, I configured a fine-grained PAT with only the permissions required for this single repository. That allowed the AI agent to commit code, open pull requests, and read comments, nothing more. I also installed the GitHub CLI and used the same token for PR management.&lt;/p&gt;

&lt;p&gt;All sensitive keys stayed out of the repository. I stored &lt;code&gt;key.properties&lt;/code&gt; and Apple certificates in GitHub Secrets. The pipeline injected them only during the build process. The AI agent had zero access to any secrets at rest, keeping the risk surface small and controlled.&lt;/p&gt;
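&lt;p&gt;To make "zero access to secrets at rest" hard to break by accident, a pre-commit guard helps. Here's a minimal sketch (a hypothetical helper, not my exact hook; the file list is illustrative) that fails whenever a known secret file is staged:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Hypothetical pre-commit guard: reads staged file names from stdin and
# fails when a known secret file is about to be committed.
check_secrets() {
  status=0
  while IFS= read -r f; do
    case "$f" in
      *key.properties|*GoogleService-Info.plist|*google-services.json)
        echo "Blocked secret file: $f"
        status=1 ;;
    esac
  done
  return $status
}
```

&lt;p&gt;Wired into a hook as &lt;code&gt;git diff --cached --name-only | check_secrets&lt;/code&gt;, it rejects the commit before the token-scoped agent can push anything sensitive.&lt;/p&gt;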
&lt;h2&gt;
  
  
  3. CI/CD - Let the Pipeline Do the Dirty Work
&lt;/h2&gt;

&lt;p&gt;Once the AI agent created a PR, my custom GitHub Actions pipeline took over. PR reviews were done first by GitHub Copilot and then by the &lt;strong&gt;human in the loop&lt;/strong&gt; - me.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.1 Who does the PR review and takes responsibility?
&lt;/h3&gt;

&lt;p&gt;In a normal software lifecycle, we don’t let developers review their own code, not because we’re careless, but because we know we miss things. The same principle applies to AI agents, and arguably even more so.&lt;/p&gt;

&lt;p&gt;In this setup, every pull request went through a two-phase review: first by GitHub Copilot, then by me. The code was originally written by Gemini 2.5 Pro, and I honestly expected Copilot to just nod along. But surprisingly, it flagged real issues, especially around edge cases and error handling.&lt;/p&gt;

&lt;p&gt;Early on, I followed every line the AI agent wrote. But as the control stack matured, I trusted it more. By the end, I reviewed its pull requests just like I would with any human teammate.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.2 GitHub Actions Pipeline
&lt;/h3&gt;

&lt;p&gt;When it was time to create a release build, I triggered it manually via the &lt;code&gt;create_release.yml&lt;/code&gt; workflow. The pipeline then took care of the whole release process.&lt;/p&gt;

&lt;p&gt;The release notes and the whole CI/CD pipeline were very similar to what I use with real customers and human developer colleagues (Dependabot, Release Drafter, analyzer, linters, tests, build generation, signing, version bumps, etc.).&lt;/p&gt;

&lt;p&gt;Example of my &lt;code&gt;.github&lt;/code&gt; folder structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.github/
├── CODEOWNERS
├── dependabot.yml
├── release-drafter.yml
└── workflows/
    ├── create_release.yml
    ├── labeler-pr.yml
    └── labeler-update-draft.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Branch and PR Flow
&lt;/h2&gt;

&lt;p&gt;This is how the full development cycle played out with the AI agent in control.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create a new branch&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Once the task prompt was clear and scoped, the AI agent created a new branch from &lt;code&gt;main&lt;/code&gt;. I used trunk-based development, where all releases were built from &lt;code&gt;main&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The agent followed the &lt;code&gt;git-workflow.instructions.md&lt;/code&gt; rules to stay aligned with my CI/CD pipeline.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement the task&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The agent executed the Planner → Executor → Validator loop. &lt;/li&gt;
&lt;li&gt;One commit per subtask, so that it was easy to roll back when (not if) the AI agent went ballistic. &lt;strong&gt;⚠️ And yes, this will happen. Be prepared to revert fast.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It committed changes with descriptive messages, including the task ID (e.g., &lt;code&gt;ID-1234: Add UI widget for xx&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open PR&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;After completing the full task, the agent opened a PR to &lt;code&gt;main&lt;/code&gt;, which triggered the CI/CD pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; If you use a secondary AI agent for PR review, have your coding agent write a clear PR description, including any instructions for the reviewer. You’ll get noticeably better results from the review agent.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR review&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The PR review agent left comments, which were then passed back to the coding agent. The coding agent addressed the feedback and pushed updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Make sure your coding agent treats PR comments critically and does not blindly implement all suggestions. It's also important to distinguish between human and AI-generated comments.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR approval&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;After my review and approval, the CI/CD pipeline automatically merged the PR to &lt;code&gt;main&lt;/code&gt; and started the build process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
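&lt;p&gt;The &lt;code&gt;ID-1234: ...&lt;/code&gt; commit convention from step 2 is easy to enforce mechanically with a &lt;code&gt;commit-msg&lt;/code&gt; style check. A minimal sketch (the helper name and regex are illustrative, not my exact hook):&lt;/p&gt;

```shell
# Illustrative check: accept commit subjects shaped like "ID-1234: Add UI widget".
valid_commit_subject() {
  printf '%s' "$1" | grep -Eq '^ID-[0-9]+: .+'
}
```

&lt;p&gt;Dropped into a &lt;code&gt;commit-msg&lt;/code&gt; hook, it catches the agent's occasional lapses into vague one-word subjects before they hit the branch.&lt;/p&gt;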

&lt;h3&gt;
  
  
  4.1 Picking a Git Strategy Your Human + AI Crew Won’t Hate
&lt;/h3&gt;

&lt;p&gt;This is my personal opinion, but trunk-based development (a single &lt;code&gt;main&lt;/code&gt; + short-lived feature branches) keeps merge hell minimal and CI green, exactly what an always-on coding AI agent needs. But copy–paste doesn’t fit every org, so sanity-check against &lt;em&gt;your&lt;/em&gt; constraints.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tips&lt;/strong&gt; for AI agent repos and team work&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-source state:&lt;/strong&gt; keep &lt;code&gt;/rules&lt;/code&gt;, prompts and &lt;code&gt;task.md&lt;/code&gt; on the same branch the agent edits, no “hidden” gist or Wiki versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic commits per subtask:&lt;/strong&gt; easier to revert when the AI agent goes rogue (&lt;code&gt;git reset --hard HEAD~1&lt;/code&gt; to drop the last commit, or &lt;code&gt;git revert -m 1 HEAD&lt;/code&gt; to undo a merge, saves the day).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch-naming conventions&lt;/strong&gt; like &lt;code&gt;feat/ID-1234-short-slug&lt;/code&gt; help the agent map Jira ↔ Git without spaghetti regexes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
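&lt;p&gt;With the &lt;code&gt;feat/ID-1234-short-slug&lt;/code&gt; convention above, the Jira ↔ Git mapping really does reduce to one substitution. A sketch (the helper is my own illustration, assuming that exact naming scheme):&lt;/p&gt;

```shell
# Extract the task ID from a branch name like feat/ID-1234-short-slug.
task_id_from_branch() {
  printf '%s\n' "$1" | sed -n 's|^[a-z]*/\(ID-[0-9][0-9]*\)-.*|\1|p'
}
```

&lt;p&gt;For example, &lt;code&gt;task_id_from_branch feat/ID-1234-short-slug&lt;/code&gt; prints &lt;code&gt;ID-1234&lt;/code&gt;, and branches outside the convention print nothing, so scripts can fail loudly.&lt;/p&gt;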

&lt;h2&gt;
  
  
  5. Real Devices and Store Metadata: What Still Needs a Human
&lt;/h2&gt;

&lt;p&gt;End-to-end testing in mobile development can’t rely on emulators alone. Once a feature was merged, my CI/CD pipeline shipped a staging build directly to real devices. This uncovered bugs that didn’t show up in simulators, issues in widgets, deep links, permissions, and screen behavior. The Validator phase kept test coverage high, but hands-on testing still revealed critical gaps.&lt;/p&gt;

&lt;p&gt;Each bug I found was added to &lt;code&gt;task.md&lt;/code&gt; as a tracked fix with a task ID, and the AI agent processed them through the same Planner → Executor → Validator loop. This kept the feedback loop tight and repeatable.&lt;/p&gt;
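&lt;p&gt;That "bug → &lt;code&gt;task.md&lt;/code&gt; → agent" loop can be a one-line shell helper. A sketch (the entry format is illustrative; my real entries carry acceptance notes too):&lt;/p&gt;

```shell
# Append a device-test finding to task.md so the Planner picks it up next pass.
log_bug() {
  printf -- '- [ ] %s: %s (found in real-device testing)\n' "$1" "$2" >> task.md
}
```

&lt;p&gt;For example, &lt;code&gt;log_bug ID-2001 "Home widget blank after reboot"&lt;/code&gt; adds a tracked, ID'd fix without touching the rest of the file.&lt;/p&gt;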

&lt;p&gt;But automation stops at the app stores. Submitting release builds to Google Play and App Store Connect is still a manual process. Review feedback from the stores must be collected, analyzed, and addressed by a human. Many rejections can be avoided by setting correct metadata and permissions early. But when something does slip through, you need to decide whether it’s your job or the agent’s to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Run &amp;amp; Observe - Releasing Is Just the Beginning
&lt;/h2&gt;

&lt;p&gt;Once the release pipeline is humming, flip the switch on &lt;strong&gt;observability&lt;/strong&gt; and feed the data back into your development process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run &amp;amp; Observe Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Crash &amp;amp; Error Rates&lt;/strong&gt; - Use &lt;strong&gt;Firebase Crashlytics&lt;/strong&gt; (or Sentry) to track crashes and errors on real devices, not just emulators. Auto-symbolication shows exactly where the agent’s code fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance &amp;amp; Responsiveness&lt;/strong&gt; - Monitor &lt;strong&gt;App Store Connect&lt;/strong&gt; and &lt;strong&gt;Google Play Console&lt;/strong&gt; dashboards for frame drops, slow rendering warnings, and battery drain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ANR &amp;amp; Startup Time&lt;/strong&gt; - Critical for Android: watch for Application Not Responding (&lt;strong&gt;ANR&lt;/strong&gt;) cases and slow app launches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agent Hit‑Rate&lt;/strong&gt; - Custom metric: track &lt;strong&gt;AI-generated LOC merged vs. reverted&lt;/strong&gt; and &lt;strong&gt;Defect Rate per Feature&lt;/strong&gt;. If reverts or bugs climb, tighten your &lt;code&gt;/rules&lt;/code&gt; and boost your tests.&lt;/li&gt;
&lt;/ul&gt;
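&lt;p&gt;The hit-rate metric doesn't need a dashboard to get started; git history already has the signal. A rough sketch (it counts revert commits against all commits; the acceptable threshold is yours to pick):&lt;/p&gt;

```shell
# Rough AI hit-rate signal: share of commits on HEAD that are reverts.
revert_rate() {
  total=$(git rev-list --count HEAD)
  reverts=$(git log --oneline --grep='^Revert' | wc -l)
  awk -v r="$reverts" -v t="$total" 'BEGIN { printf "%d%%\n", (t ? 100 * r / t : 0) }'
}
```

&lt;p&gt;Run it inside the repo after each sprint; if the percentage climbs, that's the cue to tighten &lt;code&gt;/rules&lt;/code&gt; before the next batch of tasks.&lt;/p&gt;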

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Don’t just collect these metrics, feed them straight back into your &lt;code&gt;task.md&lt;/code&gt; plans. If crash rates, ANRs or defect rates creep up, adjust your AI agent’s scope, tighten testing, or split tasks smaller to keep that 3× boost real.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  7. Recap – Parts 1 → 3 at Warp Speed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What Happened&lt;/th&gt;
&lt;th&gt;Why It Mattered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Control&lt;/strong&gt; (Part 1)&lt;/td&gt;
&lt;td&gt;Locked the AI agent into the &lt;strong&gt;Planner → Executor → Validator&lt;/strong&gt; loop and defined clear guardrails in the &lt;code&gt;/rules&lt;/code&gt; folder to keep its scope tight and behavior predictable.&lt;/td&gt;
&lt;td&gt;Gave the agent a sandbox it can’t break out of. No random rewrites, no scope creep.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Build&lt;/strong&gt; (Part 2)&lt;/td&gt;
&lt;td&gt;Turned the high-level PRD into &lt;code&gt;task.md&lt;/code&gt;, the agent’s working brain.&lt;/td&gt;
&lt;td&gt;Made sure the agent builds only what you planned, nothing more, nothing less.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Release&lt;/strong&gt; (Part 3)&lt;/td&gt;
&lt;td&gt;Fine-grained tokens, secrets locked in CI/CD, GitHub Actions pipeline, PR reviews, real-device test gates.&lt;/td&gt;
&lt;td&gt;Closed the “works-on-my-machine” gap and hit &lt;strong&gt;production-ready&lt;/strong&gt; confidently.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; one well-guarded AI agent can turn a &lt;strong&gt;180 h&lt;/strong&gt; project into a &lt;strong&gt;60 h&lt;/strong&gt; sprint for &amp;lt;$300.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Key Takeaways - Part 3 ✅
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use fine-grained Personal Access Tokens (PATs).&lt;/strong&gt; Never give your agent repo-wide access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep secrets secret.&lt;/strong&gt; Store keys in GitHub Secrets - never hardcode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate checks.&lt;/strong&gt; Use a second AI for PR reviews + human final pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real device tests.&lt;/strong&gt; Don’t trust emulators - deploy staging builds to real phones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trunk-based flow.&lt;/strong&gt; Short branches, atomic commits, fast merges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Next Up - The Brutal Reality Check:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
So the AI agent built, tested, and shipped real mobile code with secrets locked and pipelines green. But was it really faster, cheaper, or safer? Did all those &lt;code&gt;/rules&lt;/code&gt; and CI/CD gates pay off or just look good on paper?&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 4&lt;/strong&gt;, I’ll break down exactly what worked, what blew up in my face, and how I’d tweak the setup to squeeze out more ROI next time.&lt;/p&gt;

&lt;p&gt;💬 How are you keeping your secrets and pipelines locked down when you add AI into the mix? Got a trick or tool I should try next? Tell me below!&lt;/p&gt;

&lt;p&gt;👉 &lt;em&gt;[Part 4 → Coming Soon]&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>I Shipped 3x More Features with One AI Agent, All Production Ready</title>
      <dc:creator>Teemu Piirainen</dc:creator>
      <pubDate>Mon, 28 Jul 2025 05:19:00 +0000</pubDate>
      <link>https://dev.to/teppana88/i-shipped-3x-more-features-with-one-ai-agent-all-production-ready-3lf</link>
      <guid>https://dev.to/teppana88/i-shipped-3x-more-features-with-one-ai-agent-all-production-ready-3lf</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who’s this for:&lt;/strong&gt; Developers exploring AI agents in software development for real coding work. Especially those struggling with hallucinated code, unclear task boundaries, or deciding when to keep a human in the loop.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;With a scoped workflow, my AI agent delivered code up to &lt;strong&gt;6× faster&lt;/strong&gt; but only when the task matched its strengths.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6× in app UI, 4.5× in business logic, 1.7× in platform code.&lt;/li&gt;
&lt;li&gt;1× in publishing and UI design. Still fully manual.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Net result: solid 3× productivity increase&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The key? A 4-step flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick the right coder - me or the agent
&lt;/li&gt;
&lt;li&gt;Define the task clearly
&lt;/li&gt;
&lt;li&gt;Talk before code - ask, plan, refine
&lt;/li&gt;
&lt;li&gt;Code only after approval&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Series progress:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Control ▇▇▇▇▇ Build ▇▇▇▇▇ Release ▢▢▢▢▢ Retrospect ▢▢▢▢▢&lt;/p&gt;

&lt;p&gt;Welcome to &lt;strong&gt;Part 2&lt;/strong&gt; of my honest deep-dive: does an AI agent really hold up when you move from one code framework to a completely different tech stack?&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 1&lt;/strong&gt;, I showed how the &lt;strong&gt;Planner → Executor → Validator&lt;/strong&gt; loop + &lt;code&gt;/rules&lt;/code&gt; folder kept my AI agent from rewriting files it shouldn’t touch.&lt;br&gt;
Now it’s time for the real test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up a real Flutter project, App Store Connect and Google Play Console.&lt;/li&gt;
&lt;li&gt;Stress-test Flutter quirks, native iOS / Android bridges, OS-level permissions.&lt;/li&gt;
&lt;li&gt;Measure what the agent did fast and where it wasted my time.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Even though this series uses a Flutter mobile app as the demo project, everything here - from the agent control loop to testing, PR review, and CI/CD - maps directly to backend, web or other SW dev work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Spoiler:&lt;/strong&gt; Pure Flutter work? Lightning-fast. Native iOS / Android? Still half manual, with parts where you’ll crack open Xcode and Android Studio.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Series Roadmap - How This Blueprint Works&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Control&lt;/strong&gt; - Control Stack &amp;amp; Rules → trust your AI agent won’t drift off course (&lt;a href="https://dev.to/teppana88/master-autonomous-ai-agent-control-stack-for-production-code-27je"&gt;Control - Part 1&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; - AI agent starts coding → boundaries begin to show
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release&lt;/strong&gt; - CI/CD, secrets, real device tests → safe production deploy (&lt;a href="https://dev.to/teppana88/release-ai-agent-code-safely-production-cicd-secrets-5ecj"&gt;Release - Part 3&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrospect&lt;/strong&gt; - The honest verdict → what paid off, what blew up, what’s next&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why care?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Give your AI agent the wrong job, and it’ll cheerfully break your build at lightning speed. Your role? Decide what the AI agent should do, what to protect from it, and when to step in as a teammate.&lt;/p&gt;

&lt;p&gt;👉 &lt;em&gt;Let’s break it down - this is Part 2.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Task.md - The Agent’s Single Source of Truth
&lt;/h2&gt;

&lt;p&gt;Before I wrote a single prompt, I wrote a real blueprint &lt;code&gt;planning.md&lt;/code&gt; (a.k.a Product Requirements Document - PRD): what I wanted to build, high level architecture, coding principles that &lt;strong&gt;I want to follow&lt;/strong&gt;, folder structure, technical requirements, and what my delivery looked like.&lt;/p&gt;

&lt;p&gt;Then I turned that blueprint into a detailed &lt;code&gt;task.md&lt;/code&gt; (IDs, subtasks, clear acceptance notes). The AI agent was never in charge of defining scope. That’s still my job. I also bolted Jira to GitHub and linked to my task list, so I could see how well the agent was tracking my tasks.&lt;/p&gt;
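&lt;p&gt;To make that concrete, a &lt;code&gt;task.md&lt;/code&gt; entry looked roughly like this (illustrative IDs and wording, not a verbatim copy from my project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## ID-1234: Home screen widget (iOS + Android)
- [x] ID-1234.1: Dart model and platform-channel contract
- [ ] ID-1234.2: SwiftUI WidgetExtension
- [ ] ID-1234.3: Kotlin AppWidgetProvider
Acceptance: widget renders the latest app data on both platforms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;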

&lt;p&gt;&lt;code&gt;task.md&lt;/code&gt; is where the whole &lt;strong&gt;Planner → Executor → Validator&lt;/strong&gt; loop keeps its state in practice. In the &lt;strong&gt;Control&lt;/strong&gt; phase, it keeps the AI agent on a tight leash: every task starts here, every plan gets approved here, and every bug or test failure loops back here. In the &lt;strong&gt;Build&lt;/strong&gt; phase, it turns the blueprint into concrete commits: it tracks what the AI agent built, what needed manual fixes, and what code parts still demanded human tweaks. By the &lt;strong&gt;Ship&lt;/strong&gt; phase, &lt;code&gt;task.md&lt;/code&gt; is still the single source of truth. PR reviews, CI/CD pipeline checks, and real-device test results all feed back into it, creating new tasks when blockers pop up.&lt;/p&gt;

&lt;p&gt;It’s not just a to-do list. It’s the living playbook that ties blueprint, AI agent, and CI/CD pipeline together.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Treat &lt;code&gt;task.md&lt;/code&gt; like a living log - never let it freeze.&lt;br&gt;
Whenever the agent spots edge cases, test failures, or blockers, make it write them back to &lt;code&gt;task.md&lt;/code&gt;. That way your blueprint always matches reality, not just the plan on day one.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  1.1 Designing a Project to Reveal AI Agent Boundaries
&lt;/h3&gt;

&lt;p&gt;I didn’t want a hello‑world demo app. From day one, I scoped my mobile app so that the AI agent &lt;em&gt;had&lt;/em&gt; to touch native iOS and Android code. You can’t cheat that with pure Dart. You need native Swift for iOS and native Kotlin for Android.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Home Widget on both OSs – a SwiftUI WidgetExtension for iOS, a Kotlin AppWidgetProvider on Android.&lt;/li&gt;
&lt;li&gt;Platform-channel glue – JSON method channels moving data between Dart and native layers.&lt;/li&gt;
&lt;li&gt;OS-level permissions &amp;amp; new targets – schedule tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup helped me push the AI agent to its limits, shape a blueprint for splitting tasks the right way, and define when to bring in human hands and where to draw that line.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; When you kick off a new project, map out which parts are likely to trip up your AI agent (native glue, tricky permissions, platform quirks). Watch these spots closely and be ready to jump in yourself when the agent hits its limits.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  2. Project and Firebase Studio’s setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.1 Cloud vs. Local environment
&lt;/h3&gt;

&lt;p&gt;Local dev is fast and familiar: my Mac runs Flutter, emulators, and real devices just fine. But local isn’t built for agents that run around the clock.&lt;/p&gt;

&lt;p&gt;That’s why I used a &lt;strong&gt;cloud setup from day one&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent stays live 24/7, not tied to any laptop.&lt;/li&gt;
&lt;li&gt;Bugs go from Jira → webhook → Planner → Executor → Validator → PR. No manual wake-ups.&lt;/li&gt;
&lt;li&gt;Adding more devs or agents? Everyone shares the same environment. No “works on my machine” issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🔍 &lt;strong&gt;Why Firebase Studio?&lt;/strong&gt;&lt;br&gt;
It supported Flutter and the Android emulator well enough for a single sprint, but lacks full 24/7 agent support. (Next time I’ll likely use &lt;a href="https://www.all-hands.dev" rel="noopener noreferrer"&gt;All Hands&lt;/a&gt;, which is built for continuous agents from day one.)&lt;br&gt;
💡 Cloud setup tip: Private or public? Just make sure your agent can stay live, productive, and compliant with your org’s data rules.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  2.2 Firebase Studio’s Flutter Setup - Avoiding the Trap
&lt;/h3&gt;

&lt;p&gt;When I hit &lt;strong&gt;“Start coding an app”&lt;/strong&gt; in &lt;a href="https://firebase.studio" rel="noopener noreferrer"&gt;Firebase Studio&lt;/a&gt; and picked Flutter, it scaffolded every wrong default: &lt;code&gt;MyApp&lt;/code&gt;, a useless &lt;code&gt;web/&lt;/code&gt; folder, no iOS target (🚫 Firebase Studio does not support iOS), and the classic &lt;code&gt;com.example.myapp&lt;/code&gt; package ID. I spent some time cleaning that up, but it was a waste of time, so I decided to try another approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This was my mistake:&lt;/strong&gt; I should have started with a clean Flutter project locally in the first place, then imported it into Firebase Studio. But I get why Firebase Studio does this: it’s meant as a playground setup, not production-ready.&lt;/p&gt;

&lt;p&gt;I created a clean Flutter project locally, then imported it into Firebase Studio via Git. See the details below for how to do this.&lt;/p&gt;

&lt;p&gt;
  How to create a clean Flutter project locally
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# local init (tweak org "fi.awave" &amp;amp; name "sampleapp" for your own use case)&lt;/span&gt;
flutter create &lt;span class="nt"&gt;--platforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;android,ios &lt;span class="se"&gt;\&lt;/span&gt;
               &lt;span class="nt"&gt;--org&lt;/span&gt; &lt;span class="k"&gt;fi&lt;/span&gt;.awave &lt;span class="se"&gt;\&lt;/span&gt;
               sampleapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;
  🍏 iOS: Targets, Deployment &amp;amp; Certificates
  &lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trim targets&lt;/strong&gt; – Xcode → Targets → General tab → delete unneeded macOS / visionOS schemes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment target&lt;/strong&gt; – Xcode → Targets → General tab → update minimum iOS version + Display name.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment target&lt;/strong&gt; – &lt;code&gt;platform :ios, '15.6'&lt;/code&gt; in the Podfile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;App category &amp;amp; capabilities&lt;/strong&gt; – Targets → General tab → select category.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signing&lt;/strong&gt; – Xcode → Signing &amp;amp; Capabilities

&lt;ul&gt;
&lt;li&gt;Select the correct team.&lt;/li&gt;
&lt;li&gt;Let Xcode manage certificates, or upload your .p12 + provisioning profiles.
(tweak values for your own use case)
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;






&lt;/p&gt;
&lt;p&gt;
  🤖 Android: SDK Levels &amp;amp; Signing
  &lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SDK Levels&lt;/strong&gt; – &lt;code&gt;android/app/build.gradle&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gradle"&gt;&lt;code&gt;   &lt;span class="n"&gt;android&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
     &lt;span class="n"&gt;compileSdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;36&lt;/span&gt;
     &lt;span class="n"&gt;ndkVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"27.0.12077973"&lt;/span&gt;
     &lt;span class="n"&gt;defaultConfig&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
       &lt;span class="n"&gt;minSdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
       &lt;span class="n"&gt;targetSdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;36&lt;/span&gt;
     &lt;span class="o"&gt;}&lt;/span&gt;
   &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Gradle &amp;amp; Kotlin&lt;/strong&gt; – Bump to the latest wrapper and AGP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release signing&lt;/strong&gt; – Android Studio → Build &amp;gt; Generate Signed Bundle…

&lt;ul&gt;
&lt;li&gt;Generates keystore.jks and key.properties (keep them out of Git).&lt;/li&gt;
&lt;li&gt;Add debug / staging / release build flavours.&lt;/li&gt;
&lt;li&gt;Store &lt;code&gt;key.properties&lt;/code&gt; outside Git and add these into GitHub secrets for CI/CD pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firebase Studio&lt;/strong&gt;-friendly tweak – in &lt;code&gt;build.gradle&lt;/code&gt;, wrap the &lt;code&gt;signingConfig&lt;/code&gt; block (see details below; tweak values for your own use case)
&lt;/li&gt;
&lt;/ol&gt;




&lt;/p&gt;
&lt;p&gt;
  🤖 Android: Gradle snippet to skip signing for Firebase Studio debug builds
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gradle"&gt;&lt;code&gt;&lt;span class="c1"&gt;// skip signing on Firebase Studio&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="n"&gt;keystoreProperties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Properties&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="n"&gt;keystorePropertiesFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rootProject&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"key.properties"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keystorePropertiesFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exists&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;keystoreProperties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;load&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FileInputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keystorePropertiesFile&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"⚠️  Note: key.properties not found – release, staging build not possible to be signed."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;buildTypes&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;getByName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"debug"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// uses default debug signing config located in: ~/.android/debug.keystore&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;getByName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"staging"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keystorePropertiesFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exists&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;signingConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;signingConfigs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"staging"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;getByName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"release"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keystorePropertiesFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exists&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;signingConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;signingConfigs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"release"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;(tweak values for your own use case)&lt;br&gt;
&lt;/p&gt;

&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;
  Run the basic Flutter app inside Firebase Studio
  &lt;br&gt;
Firebase Studio lets you run the app in an Android emulator (iOS you need to run locally on your Mac). In general the emulator works ok(ish), but during my tests I found it a bit slow and buggy. In many cases I ended up pulling the code locally and running the app on my own Android and iOS phones / simulators, which was much faster and more reliable.

&lt;blockquote&gt;
&lt;p&gt;Thinking about the whole AI agent development process, this is the biggest drawback compared to web development, where the AI agent can run the app in a browser and pull logs directly into the chat context.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Quick checklist before importing to Firebase Studio&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Do &lt;em&gt;locally&lt;/em&gt; first&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Create app shells in Google Play Console and App Store Connect&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;flutter create …&lt;/code&gt; with proper &lt;code&gt;--org&lt;/code&gt;, &lt;code&gt;--no-web&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;[ ] Update app names &amp;amp; other metadata&lt;/li&gt;
&lt;li&gt;[ ] 🍏 Add iOS target, set deployment version&lt;/li&gt;
&lt;li&gt;[ ] 🤖 Generate &lt;code&gt;key.properties&lt;/code&gt;, keep outside Git&lt;/li&gt;
&lt;li&gt;[ ] 🤖 Upgrade dependencies + Gradle wrapper, set SDK targets&lt;/li&gt;
&lt;li&gt;[ ] 🤖 Update &lt;code&gt;build.gradle&lt;/code&gt; to skip signing for Firebase Studio debug builds&lt;/li&gt;
&lt;li&gt;[ ] Replace icons &amp;amp; launch screens with vectors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🚀 After import&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] When importing project into Firebase Studio check “This is a Flutter project”&lt;/li&gt;
&lt;li&gt;[ ] Try to run &lt;code&gt;flutter doctor&lt;/code&gt; and &lt;code&gt;flutter run&lt;/code&gt; in Firebase Studio to make sure everything works
&lt;/li&gt;
&lt;/ul&gt;




&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;⚠️ Heads-up:&lt;/strong&gt; you’ll still create the app shells in Google Play Console and App Store Connect manually.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Skip the boilerplate trap. Create &amp;amp; clean the project locally, push to Git, &lt;em&gt;then&lt;/em&gt; import from Git to Firebase Studio. Firebase will create required environment &lt;code&gt;dev.nix&lt;/code&gt; files when you import your project.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  3. Initial Prompt - First the Plan, Then the Code
&lt;/h2&gt;

&lt;p&gt;Before the AI agent could write a single line of code, I loaded a clean and complete context, starting with four critical pieces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/rules/&lt;/code&gt;&lt;/strong&gt; - to activate the full System Prompt: coding rules, feedback loops, commit strategy, testing discipline (&lt;a href="https://dev.to/teppana88/master-autonomous-ai-agent-control-stack-for-production-code-27je"&gt;Control - Part 1&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;planning.md&lt;/code&gt;&lt;/strong&gt; - to understand the big picture and architecture, and avoid writing code that would later conflict with future features
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;task.md&lt;/code&gt;&lt;/strong&gt; - to get context on what’s already done, what’s in progress, and any known constraints
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task ID&lt;/strong&gt; - to know exactly which feature to focus on and avoid scope bleed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reset was not optional; it was the countermeasure to context drift.&lt;br&gt;&lt;br&gt;
As covered in &lt;a href="https://dev.to/teppana88/master-autonomous-ai-agent-control-stack-for-production-code-27je"&gt;Part 1 → Context Hygiene&lt;/a&gt;, long-running chats quickly spiral into hallucination territory. Each task was treated as a clean slate with fresh context.&lt;/p&gt;

&lt;p&gt;But context alone wasn’t enough. The agent wasn’t allowed to just start coding.&lt;/p&gt;

&lt;p&gt;Instead, the first step was scoping the task properly. The agent had to pause, reflect, and write down every question or uncertainty it still had. No assumptions. No guessing silently.&lt;/p&gt;

&lt;p&gt;This up-front conversation was the foundation for building the &lt;strong&gt;User Prompt&lt;/strong&gt;, our shared understanding of what we’re about to build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of the first prompt to start new feature development&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Read &lt;span class="sb"&gt;`/rules/airules.md`&lt;/span&gt; and other instruction files that are defined in the &lt;span class="sb"&gt;`airules.md`&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; Read &lt;span class="sb"&gt;`task.md`&lt;/span&gt; for the implementation context.
&lt;span class="p"&gt;-&lt;/span&gt; Now we are working with the task: ID-123 Implement Feature X
&lt;span class="p"&gt;-&lt;/span&gt; Think step by step how to make implementation.
&lt;span class="p"&gt;-&lt;/span&gt; Write an implementation plan and ask my approval before starting the implementation.
&lt;span class="p"&gt;-&lt;/span&gt; _Before you start, list anything unclear. If you don’t know, ask now._
&lt;span class="p"&gt;-&lt;/span&gt; Do not start coding before I approve your plan.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This small ritual turned the agent from a prompt follower into a planning partner. Only after the plan was reviewed and approved by me did the &lt;strong&gt;Planner&lt;/strong&gt; begin its work, and that up-front clarity meant fewer surprises and less cleanup later.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 User Prompt - What to Build, Together
&lt;/h3&gt;

&lt;p&gt;An AI agent can’t guess what to build. Every task begins by establishing a shared mental model. A short-lived but precise agreement on what the feature is, how it should behave, and how we’ll know it’s done. That’s the &lt;strong&gt;User Prompt&lt;/strong&gt;, and it’s where tactical reasoning happens.&lt;/p&gt;

&lt;p&gt;The User Prompt is built from three sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;planning.md&lt;/code&gt;&lt;/strong&gt; - defines the overall scope and architecture; only the slices relevant to the current task are pulled in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;task.md&lt;/code&gt;&lt;/strong&gt; - gives context on what’s already done, what’s in progress, and any known constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initial prompt and conversation&lt;/strong&gt; - an active dialogue with the agent to review the scope, clarify anything unclear, and co-create the plan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike the &lt;strong&gt;System Prompt&lt;/strong&gt; (&lt;a href="https://dev.to/teppana88/master-autonomous-ai-agent-control-stack-for-production-code-27je"&gt;Part 1 → System Prompt&lt;/a&gt;), which can always be reloaded from &lt;code&gt;/rules/&lt;/code&gt; to restore the same mindset, the &lt;strong&gt;User Prompt&lt;/strong&gt; exists only in memory for the current conversation. When the chat ends, it’s gone, and must be rebuilt from scratch the next time.&lt;/p&gt;

&lt;p&gt;But this isn’t a top-down instruction set. The agent has to ask, clarify, and plan, and I have to approve. That shared clarity is what keeps the Planner focused, the Executor scoped, and the Validator relevant.&lt;/p&gt;

&lt;p&gt;When the AI agent stumbles, it’s almost always because the User Prompt was vague or incomplete. That’s why I force the agent to ask questions and write a plan before it writes a single line of code. The implementation must be explicit, not assumed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 The Power of Two Prompts - Behavior Meets Implementation
&lt;/h3&gt;

&lt;p&gt;When you combine a &lt;strong&gt;System Prompt&lt;/strong&gt; with a task-specific &lt;strong&gt;User Prompt&lt;/strong&gt;, you give the AI agent both its job description and its exact mission.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System Prompt&lt;/strong&gt; shapes how the agent works: how it commits, how it asks questions, how it tests.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Prompt&lt;/strong&gt; defines what to build right now: the logic, constraints, edge cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they turn the AI agent into a real teammate. By keeping the rules persistent and the scope specific, I could trust the agent to move fast without breaking things that weren’t part of the current task.&lt;/p&gt;

&lt;p&gt;Without both, the agent either forgets the bigger picture or gets lost in implementation details.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fv4egmsi4qnqx5crip4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fv4egmsi4qnqx5crip4.png" alt="Firebase.studio UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Flutter Development - How the Agent Held Up
&lt;/h2&gt;

&lt;p&gt;To test how far an AI agent could really go - in terms of code quality, delivery speed, and handling edge cases - I deliberately picked two tricky features from a fully built-out app. The goal was to push the agent in real-world scenarios and see if this workflow could actually &lt;strong&gt;save serious dev hours&lt;/strong&gt; in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;An analog clock widget that ticks every second&lt;/strong&gt; - constant UI and state updates, custom painter logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Riverpod-powered state management&lt;/strong&gt; - scoped providers and reactive rebuilds keep the architecture clean but can trip up less experienced setups fast.&lt;/li&gt;
&lt;/ol&gt;
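&lt;p&gt;The custom-painter work in feature 1 boils down to angle math that’s easy to get subtly wrong. Here is a minimal sketch of that math - my own illustration in Kotlin rather than the project’s Dart, with names of my choosing:&lt;/p&gt;

```kotlin
// Angles for the three clock hands; 0° points at 12 and angles grow clockwise.
// The hour and minute hands drift continuously, so a repaint every second
// produces a smooth sweep instead of discrete jumps.
data class HandAngles(val hourDeg: Double, val minuteDeg: Double, val secondDeg: Double)

fun handAngles(hour: Int, minute: Int, second: Int): HandAngles = HandAngles(
    hourDeg = (hour % 12) * 30.0 + minute * 0.5,  // 360° / 12 h, plus minute drift
    minuteDeg = minute * 6.0 + second * 0.1,      // 360° / 60 min, plus second drift
    secondDeg = second * 6.0,                     // 360° / 60 s
)
```

A custom painter then just rotates each hand by its angle on every tick.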

&lt;h3&gt;
  
  
  4.1 Speed &amp;amp; code quality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analog clock:&lt;/strong&gt; The AI agent drew the full custom-painter dial, hands and smooth tick animation in &lt;strong&gt;≈ 10 minutes&lt;/strong&gt;. A couple of follow-up prompts polished the details. Doing this by hand would have been hours of work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State management:&lt;/strong&gt; The AI agent didn’t make solid architectural choices on its own. I mapped out the high-level graph up front. After a fresh start and a few rounds of tweaks and clarifications, the raw wiring went smoothly and the AI agent handled the repetitive parts well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: On pure Flutter UI tasks the AI agent outpaced me &lt;strong&gt;6–0&lt;/strong&gt;. On architectural decisions, it needed my direction but once pointed, it did the heavy lifting.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Where It Struggled
&lt;/h3&gt;

&lt;p&gt;These tests also revealed clear weak spots.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No eyes:&lt;/strong&gt; The AI agent can’t &lt;em&gt;see&lt;/em&gt; what’s happening in the Firebase Studio Android emulator screen. Screenshots were mandatory: I captured the UI, dropped it into the chat and asked for pixel tweaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No design instinct:&lt;/strong&gt; Without explicit references, the AI agent’s visual taste is pretty much 90s terminal style. Only after I shared reference images did the styling improve.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No architecture awareness:&lt;/strong&gt; The AI agent had no real sense of what helper methods or reusable patterns were already in the codebase. It often rewrote logic that existed elsewhere or missed calling shared utils, unless I pointed it to the right files and explained how similar code was structured.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.3 Key Takeaway
&lt;/h3&gt;

&lt;p&gt;These tricky features were perfect to test real limits. For Flutter code generation and Riverpod wiring the AI agent is a monster accelerator - &lt;strong&gt;minutes instead of hours&lt;/strong&gt;. But it stays blind and tasteless. &lt;strong&gt;You’re still the art director&lt;/strong&gt; feeding it screenshots and clear visual direction. With that human guidance, though, the AI agent can paint pixels and write Dart faster than you can open Figma.&lt;/p&gt;

&lt;p&gt;One more thing: always guide your AI agent to refactor code systematically. This is how you catch duplicated logic and keep your codebase clean over time. Add regular refactoring checkpoints to your &lt;code&gt;task.md&lt;/code&gt; or set clear &lt;code&gt;/rules&lt;/code&gt; for how often and where the AI agent should look for patterns to merge.&lt;/p&gt;
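&lt;p&gt;For example, a refactoring checkpoint in your &lt;code&gt;/rules&lt;/code&gt; file could be as simple as this (the wording is my own suggestion - adapt it to your setup):&lt;/p&gt;

```markdown
- After completing every third task, scan the touched feature folders for
  duplicated logic and propose a refactoring plan before picking up new work.
- Before writing a new helper, search the existing utils for a similar
  function and reuse or extend it instead of duplicating it.
```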




&lt;h2&gt;
  
  
  5. Native iOS and Android development
&lt;/h2&gt;

&lt;p&gt;The Flutter part proved the AI agent could handle coding fast. But mobile apps need native code, and that’s where things got interesting.&lt;/p&gt;

&lt;p&gt;I knew that adding new targets is a task that you &lt;strong&gt;must&lt;/strong&gt; do manually in Xcode and Android Studio, so I didn’t even try to ask the AI agent to do that. Instead, I created the targets manually and then let the AI agent focus on writing the code that runs in those targets.&lt;/p&gt;

&lt;p&gt;At first, the AI agent treated the native code like it was just another SwiftUI / Android screen, ignoring all the platform-specific quirks of the newly added targets: permissions, entitlements, manifest tweaks.&lt;/p&gt;

&lt;p&gt;So I did the one thing an AI agent does well when you nudge it right: I told it to &lt;strong&gt;read Apple’s and Google’s docs first&lt;/strong&gt;, then come back with a plan.&lt;/p&gt;

&lt;p&gt;Once it read the docs, it surprised me. The AI agent wrote decent Swift and Kotlin glue. Not perfect, but runnable.&lt;/p&gt;

&lt;p&gt;In native-heavy features I broke the work into four crystal-clear stages, each one nudging the AI agent to focus on a single layer at a time:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;What the agent does&lt;/th&gt;
&lt;th&gt;My role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Plan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Draft a complete architecture map for the feature, covering &lt;strong&gt;Flutter&lt;/strong&gt;, &lt;strong&gt;Android&lt;/strong&gt;, and &lt;strong&gt;iOS&lt;/strong&gt; parts. Clearly split each platform’s role so no piece gets missed.&lt;/td&gt;
&lt;td&gt;Verify Flutter ↔ Native bridge logic, fix blind spots, and approve the plan.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Flutter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generate all Dart code first: UI, state management (like Riverpod), and method channels for platform calls.&lt;/td&gt;
&lt;td&gt;Check logic, tweak naming, and make sure native calls are explicit.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Android&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Write the native Android glue in Kotlin: &lt;code&gt;AppWidgetProvider&lt;/code&gt;.&lt;/td&gt;
&lt;td&gt;Validate permissions, Gradle tweaks, and any custom OS-level hooks, utilize Android Studio tools to validate code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. iOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Implement the native iOS side in Swift: &lt;code&gt;WidgetExtension&lt;/code&gt;.&lt;/td&gt;
&lt;td&gt;Open in Xcode; check signing &amp;amp; provisioning, entitlements, plist updates; utilize Xcode tools to validate code.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each stage flows into the next: &lt;strong&gt;Flutter first&lt;/strong&gt;, so the AI agent nails down method channel contracts and data shapes. Then Android glue connects to Flutter. Finally, iOS mirrors the same bridge, patching up any extra parameters the AI agent spots during build.&lt;/p&gt;
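&lt;p&gt;Pinning down those method channel contracts early pays off. One way to write a contract down - sketched here in plain Kotlin with hypothetical names, since the real payloads live in the project - is as a data class with explicit (de)serialization to the primitive map the channel actually transports, mirrored field for field on the Dart, Kotlin, and Swift sides:&lt;/p&gt;

```kotlin
// A platform channel ultimately transports maps of primitives, so the
// Flutter-to-native contract can be captured as one explicit data shape.
data class ClockWidgetState(
    val dayTimePeriod: String,  // e.g. "morning", "evening"
    val tickIntervalMs: Long,
)

// Serialize for sending over the channel.
fun ClockWidgetState.toChannelMap(): Map<String, Any> = mapOf(
    "dayTimePeriod" to dayTimePeriod,
    "tickIntervalMs" to tickIntervalMs,
)

// Deserialize on the receiving side; fails loudly if the contract drifts.
fun channelStateFrom(map: Map<String, Any?>): ClockWidgetState = ClockWidgetState(
    dayTimePeriod = requireNotNull(map["dayTimePeriod"]) as String,
    tickIntervalMs = requireNotNull(map["tickIntervalMs"]) as Long,
)
```

Because both directions go through the same shape, a drifted field name or type shows up as an immediate failure instead of a silent null on the other platform.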

&lt;p&gt;By carving it up this way, I kept the AI agent focused, the commits clean, and the cross-platform glue tight, one layer at a time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Don’t just run the agent. Decide who does what, guide the architecture, and split tasks so each side plays to its strengths.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Where Native Code Still Needs You
&lt;/h2&gt;

&lt;p&gt;People underestimate this. Flutter won’t do it for you, and your AI agent won’t either.&lt;br&gt;
Some things are still better done in Android Studio or Xcode (or there is simply no other way to do them).&lt;/p&gt;
&lt;h3&gt;
  
  
  6.1 Native Glue Code - 40% AI, 60% Me (Both Platforms)
&lt;/h3&gt;

&lt;p&gt;The moment we crossed into Swift or Kotlin, the speed advantage dropped. The AI agent could write glue code (small methods, platform channels, a bit of SwiftUI). But when it came to signing, entitlements, deployment targets, and Xcode project tweaks, it either had no way to do them or lacked up-to-date knowledge of what to do and how. This was easily solved by doing those steps manually in Xcode and Android Studio, or by instructing the AI agent to read the latest docs.&lt;/p&gt;

&lt;p&gt;The tooling available in Firebase Studio just doesn’t match Xcode or Android Studio for native work. So even when the agent gave me runnable Kotlin, I found plenty of deprecated calls, outdated syntax, or just missing parts.&lt;/p&gt;

&lt;p&gt;
  Example: Kotlin code by AI vs. Android Studio
  &lt;br&gt;
// AI agent code&lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;scheduleNextUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;backgroundColor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Color&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parseColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"#FF121212"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;alarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setExactAndAllowWhileIdle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AlarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RTC_WAKEUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nextUpdateTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pendingIntent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;e&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ClockWidget"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"[$timestamp] Failed to schedule alarm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;// Proper Android Studio / Kotlin code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@RequiresPermission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SCHEDULE_EXACT_ALARM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;scheduleNextUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;backgroundColor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"#FF121212"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toColorInt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;alarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setExactAndAllowWhileIdle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AlarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RTC_WAKEUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nextUpdateTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pendingIntent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;e&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ClockWidget"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"[$timestamp] Failed to schedule alarm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;So the AI agent got me 40% there. That last 60% (making it follow native Android &amp;amp; iOS coding style &amp;amp; annotations) still needed my own eyes and Xcode + Android Studio.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.2 Branding, Permissions &amp;amp; The Final Manual Mile
&lt;/h3&gt;

&lt;p&gt;Some things just stay manual. Visual assets and OS-level entitlements still live in Xcode and Android Studio, not in an AI agent prompt.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Replace app icon&lt;/td&gt;
&lt;td&gt;Xcode &lt;strong&gt;Asset Catalog&lt;/strong&gt; / Android Studio &lt;strong&gt;Image Asset&lt;/strong&gt; wizard&lt;/td&gt;
&lt;td&gt;Use &lt;strong&gt;SVG / PDF&lt;/strong&gt; vectors - tools render all sizes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom launch screen&lt;/td&gt;
&lt;td&gt;Same wizards as above&lt;/td&gt;
&lt;td&gt;Remove the stock Flutter logo; keep it light.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Permissions &amp;amp; capabilities&lt;/td&gt;
&lt;td&gt;Xcode &amp;amp; Android Studio&lt;/td&gt;
&lt;td&gt;Flip toggles, add targets, push to git, let the agent handle code only.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My rule? Do these by hand once, commit, then keep the agent busy where it adds real value.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Where the Agent Saved (and Didn’t) My Time
&lt;/h2&gt;

&lt;p&gt;The numbers below tell the whole story. Whenever the work stayed inside &lt;strong&gt;pure Flutter territory&lt;/strong&gt; (widgets, Dart models, lightweight state and their unit tests) the AI agent chewed through tasks &lt;strong&gt;~5× faster&lt;/strong&gt; than I do by hand.&lt;/p&gt;

&lt;p&gt;As soon as we crossed into &lt;strong&gt;native Swift/Kotlin glue&lt;/strong&gt; the boost shrank to 1.7×, and for signing, entitlements or full end-to-end runs the speed-up vanished: those steps are still human-only.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Estimated Work Hours&lt;/th&gt;
&lt;th&gt;AI + Human Hours&lt;/th&gt;
&lt;th&gt;Speed Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Planning &amp;amp; Architecture&lt;/td&gt;
&lt;td&gt;~10&lt;/td&gt;
&lt;td&gt;~10&lt;/td&gt;
&lt;td&gt;1.0× (manual)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flutter UI &amp;amp; Layout&lt;/td&gt;
&lt;td&gt;~58&lt;/td&gt;
&lt;td&gt;~11&lt;/td&gt;
&lt;td&gt;5.3×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State Management &amp;amp; Logic&lt;/td&gt;
&lt;td&gt;~44&lt;/td&gt;
&lt;td&gt;~10&lt;/td&gt;
&lt;td&gt;4.4×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native Integrations (iOS &amp;amp; Android)&lt;/td&gt;
&lt;td&gt;~26&lt;/td&gt;
&lt;td&gt;~15&lt;/td&gt;
&lt;td&gt;1.7×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unit &amp;amp; Widget Testing&lt;/td&gt;
&lt;td&gt;~30&lt;/td&gt;
&lt;td&gt;~4&lt;/td&gt;
&lt;td&gt;7.5×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-End Testing &amp;amp; QA&lt;/td&gt;
&lt;td&gt;~6&lt;/td&gt;
&lt;td&gt;~6&lt;/td&gt;
&lt;td&gt;1.0× (manual)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Icons, Store, Metadata&lt;/td&gt;
&lt;td&gt;~6&lt;/td&gt;
&lt;td&gt;~6&lt;/td&gt;
&lt;td&gt;1.0× (manual)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~180&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~62&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3× overall&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;
  📏 See my real manual vs. agent benchmarks
  &lt;br&gt;
&lt;strong&gt;1️⃣ Global dayTimePeriod state&lt;/strong&gt;

&lt;p&gt;I wired up &lt;code&gt;dayTimePeriod&lt;/code&gt; to update globally in the app, syncing widgets and screens to the current time.&lt;br&gt;&lt;br&gt;
• &lt;strong&gt;Manual run:&lt;/strong&gt; ~3 hours - scoped providers, tested edge cases, debugged rebuilds.&lt;br&gt;&lt;br&gt;
• &lt;strong&gt;AI agent run:&lt;/strong&gt; ~35 minutes - nailed the providers and rebuild logic in a few iterations.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Boost:&lt;/strong&gt; ~5.1× faster.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;2️⃣ Analog Clock: New dayTimePeriod color sector&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I added a new color sector to the custom analog clock face, updating dynamically with the &lt;code&gt;dayTimePeriod&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
• &lt;strong&gt;Manual run:&lt;/strong&gt; ~3 hours - tweak painter logic, test rendering on real devices.&lt;br&gt;&lt;br&gt;
• &lt;strong&gt;AI agent run:&lt;/strong&gt; ~25 minutes - handled the custom painter math perfectly, needed a few extra rounds to tweak the visuals.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Boost:&lt;/strong&gt; ~7.2× faster.&lt;/p&gt;

&lt;p&gt;These sample benchmarks line up with the overall Speed Factor: pure Flutter tasks easily hit a &lt;strong&gt;5-7×&lt;/strong&gt; boost when scoped right.&lt;br&gt;
&lt;/p&gt;

&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes on What’s Not in These Hours 🔍&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The table above covers only the hands-on coding and testing work I clocked while working directly with the AI agent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CI/CD pipeline &amp;amp; release automation → Not included in the hours. This part was 100% manual, but absolutely critical: the full pipeline work (versioning, signing flows, App Store / Play Console configs, GitHub Actions) is covered separately in Part 3.&lt;/li&gt;
&lt;li&gt;UI / UX design work → Not counted here. The design (screens, flows, user journeys, final mockups) was assumed done up front. This breakdown focuses purely on implementing those assets in code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re reading this as a dev planning your own agent setup: don’t underestimate these “invisible” hours. A clean CI/CD flow will save you weeks later, and &lt;strong&gt;no AI agent yet replaces a good human designer with good taste 🎨&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; A disciplined agent can absolutely deliver a &lt;strong&gt;7× “wow” factor&lt;/strong&gt; for well-structured software dev work, but the minute you need to do something other than &lt;em&gt;coding&lt;/em&gt;, you’re back on the tools. Net result: a very real &lt;strong&gt;3× productivity jump&lt;/strong&gt; as long as your task blueprint is crystal-clear and you keep the AI agent focused on the parts it excels at.&lt;/p&gt;

&lt;p&gt;That moment when you realize you’re just watching your AI agent write, test, commit, push, open a PR and pass checks - all while you sip your coffee - is both magic and just a tiny bit unnerving. More than once I caught myself staring at my screen for 5 minutes doing… nothing. The code just… shipped.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  8. Firebase Studio Issues
&lt;/h2&gt;

&lt;p&gt;My overall verdict on Firebase Studio as an AI-coding sandbox: &lt;strong&gt;usable, but slower and quirkier than a local setup&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day-to-day friction:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sluggish Android emulator&lt;/strong&gt; - Boot times were glacial; I often fell back to local setup for testing.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub auth hiccups&lt;/strong&gt; - Setting correct access rights was hard. (I’ll dig into the hacky fix in the next article.)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nice touch:&lt;/strong&gt; the agent pipes terminal output straight into the chat pane, so I didn’t need to tail logs in a separate window.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are some of the quirks I hit:&lt;/p&gt;

&lt;p&gt;
  &lt;strong&gt;“No space left on device” 🐘&lt;/strong&gt;
  &lt;br&gt;
At ~20 GB of usage, Firebase Studio simply froze with&lt;br&gt;&lt;br&gt;
&lt;code&gt;no space left on device&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;Root cause: Gradle cache alone had ballooned to &lt;strong&gt;12 GB&lt;/strong&gt; inside the &lt;code&gt;dev.nix&lt;/code&gt; env.
&lt;/li&gt;
&lt;li&gt;Quick tries (&lt;code&gt;flutter clean&lt;/code&gt;, &lt;code&gt;gradle clean&lt;/code&gt;) did nothing.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nuclear workaround:&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;rm -rf ~/.gradle&lt;/code&gt; inside the workspace.
&lt;/li&gt;
&lt;li&gt;Add the Gradle distro back into &lt;code&gt;dev.nix&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Rebuild the Nix environment.
&lt;/li&gt;
&lt;li&gt;Discover the Android emulator now refuses to boot 🙃
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Final fix: spin up a &lt;strong&gt;new Firebase Studio project&lt;/strong&gt;, &lt;code&gt;git clone&lt;/code&gt; the repo, and delete the old one. Yet another reason to commit early &amp;amp; often.&lt;/li&gt;
&lt;/ul&gt;
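&lt;p&gt;If you hit the same wall, the recovery amounts to a few shell commands. A sketch (the cache path assumes Firebase Studio’s default; the &lt;code&gt;dev.nix&lt;/code&gt; rebuild and &lt;code&gt;flutter clean&lt;/code&gt; steps still happen in the IDE):&lt;/p&gt;

```shell
# Sketch: free disk space in a Firebase Studio / dev.nix workspace.
# GRADLE_CACHE assumes the default ~/.gradle location; adjust if yours differs.
GRADLE_CACHE="${GRADLE_CACHE:-$HOME/.gradle}"

du -sh "$GRADLE_CACHE" 2>/dev/null || true   # confirm the cache is the culprit
rm -rf "$GRADLE_CACHE"                       # nuke the Gradle cache
# Then re-add the Gradle distro to dev.nix, rebuild the Nix environment,
# and run `flutter clean` to clear project-level build artifacts too.
```

&lt;p&gt;If the emulator still refuses to boot afterwards, the fresh-project route above remains the most reliable fallback.&lt;/p&gt;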

&lt;p&gt;Google’s support replied with a generic “list files, then remove them safely” doc link, with no actual cache-pruning guide. If you hit the quota wall, be ready to start fresh.&lt;br&gt;
&lt;/p&gt;

&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;
  &lt;strong&gt;Android license weirdness&lt;/strong&gt;
  &lt;br&gt;
Firebase Studio ships a fresh SDK image, so the normal license approvals were needed. &lt;code&gt;flutter doctor --android-licenses&lt;/code&gt; failed five times in a row; then, a few days later, the licenses were magically accepted. I never found a root cause...&lt;br&gt;


&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; Firebase Studio works, but expect &lt;strong&gt;sluggish Android emulator, storage quotas, and the occasional phantom SDK glitch&lt;/strong&gt;. Keep your repo pushed, caches light, and patience handy.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  9. Key Takeaways - Part 2 ✅
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep &lt;code&gt;task.md&lt;/code&gt; alive.&lt;/strong&gt; It’s the agent’s working brain: every plan, blocker, and fix lives there. Keep it up to date so the Control loop never loses track.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split tasks smartly.&lt;/strong&gt; Identify where the AI agent really shines and steer your task list to match its strengths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual native setup.&lt;/strong&gt; You handle iOS/Android configs, signing, and entitlements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed visuals.&lt;/strong&gt; The agent is blind → give UI references and screenshots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight commits.&lt;/strong&gt; Small, atomic steps keep bugs cheap to fix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren’t magic switches the AI flips by itself; they all rely on clear rules and scoped tasks you write up front in &lt;code&gt;/rules&lt;/code&gt; and &lt;code&gt;task.md&lt;/code&gt;. It’s human work first, every project, every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next Up - Keep It Safe:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
So the AI agent can blast through mobile code, but shipping it means locking secrets tight and keeping your pipeline bulletproof.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;&lt;a href="https://dev.to/teppana88/release-ai-agent-code-safely-production-cicd-secrets-5ecj"&gt;Part 3&lt;/a&gt;&lt;/strong&gt;, I’ll break down how I managed GitHub PATs, Firebase configs, signing keys, and real device testing. You’ll see exactly how my CI/CD flow kept the agent honest and my keys secure.&lt;/p&gt;

&lt;p&gt;💬 I’m still refining how I scope tasks and split what the agent does vs. me. What’s worked (or backfired) in your dev flow? Might borrow a trick!&lt;/p&gt;

&lt;p&gt;👉 &lt;em&gt;Stay tuned - [Part 4 → Coming Soon]&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>softwaredevelopment</category>
      <category>flutter</category>
    </item>
    <item>
      <title>Master Autonomous AI Agent - Control Stack for Production Code</title>
      <dc:creator>Teemu Piirainen</dc:creator>
      <pubDate>Mon, 21 Jul 2025 05:22:00 +0000</pubDate>
      <link>https://dev.to/teppana88/master-autonomous-ai-agent-control-stack-for-production-code-27je</link>
      <guid>https://dev.to/teppana88/master-autonomous-ai-agent-control-stack-for-production-code-27je</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who’s this for:&lt;/strong&gt; Software and mobile developers who want to move beyond AI demos and bring an autonomous AI agent into real daily coding work, all the way to production.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Control, not just code&lt;/li&gt;
&lt;li&gt;Turn a single-prompt agent into a repeatable, autonomous teammate with three roles (Planner → Executor → Validator) and a &lt;code&gt;/rules/&lt;/code&gt; folder.&lt;/li&gt;
&lt;li&gt;Ship production-ready code without babysitting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Series progress:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Control ▇▇▇▇▇ Build ▢▢▢▢▢ Release ▢▢▢▢▢ Retrospect ▢▢▢▢▢&lt;/p&gt;

&lt;p&gt;Welcome to &lt;strong&gt;Part 1&lt;/strong&gt; of my deep-dive series on building an autonomous AI agent in a real-world SW development project.&lt;/p&gt;
&lt;h2&gt;
  
  
  0. The Problem: AI Agents Don’t Survive Production
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Local IDE AI agents/helpers still need &lt;em&gt;you&lt;/em&gt; to approve every diff.
&lt;/li&gt;
&lt;li&gt;SaaS platforms (such as Lovable, Bolt.new ...) work well for MVPs, but their black-box control stacks and lack of security and quality controls are show-stoppers for many large organizations looking to use them in production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  My hypothesis:
&lt;/h3&gt;

&lt;p&gt;We can build a fully autonomous AI agent that organizations own and audit end-to-end, meeting enterprise-level demands for security, compliance, and CI/CD.&lt;/p&gt;

&lt;p&gt;To validate the idea, &lt;strong&gt;I built an AI agent workflow&lt;/strong&gt;, set it loose on a real project, and tracked its performance with the same KPIs my clients use. I capped my own input to roughly 2 hours a day for 30 days, mirroring the stop-and-go rhythm of real-world development with human-in-the-loop pauses.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Experiment Project
&lt;/h3&gt;

&lt;p&gt;After 20+ years in web and mobile development, I know “hello‑world demos” are cheap; shipping is hard. So instead of a single‑page web app, I threw the AI agent into the deep end:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mobile app&lt;/strong&gt; with Apple and Google &lt;strong&gt;platform&lt;/strong&gt; requirements
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flutter&lt;/strong&gt; as the main framework for a real mobile app
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swift ↔ Kotlin&lt;/strong&gt; native code and permissions that Flutter can’t hide
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;CI/CD pipeline&lt;/strong&gt; wired with deployment builds from day one
(a mobile application that I’d normally budget &lt;strong&gt;~180 h&lt;/strong&gt; of manual coding)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Most studies report only a &lt;a href="https://www.reuters.com/business/ai-slows-down-some-experienced-software-developers-study-finds-2025-07-10/" rel="noopener noreferrer"&gt;1.2×–1.5× productivity lift (Reuters, 2025‑07)&lt;/a&gt;.&lt;br&gt;
This blueprint aims much higher, using a scoped agent and control loop.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;&lt;strong&gt;Series Roadmap - How This Blueprint Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a full process, not just a single trick:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Control&lt;/strong&gt; - Control Stack &amp;amp; Rules → trust your AI agent won’t drift off course&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; - AI agent starts coding → boundaries begin to show (&lt;a href="https://dev.to/teppana88/i-shipped-3x-more-features-with-one-ai-agent-all-production-ready-3lf"&gt;Build - Part 2&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release&lt;/strong&gt; - CI/CD, secrets, real device tests → safe production deploy (&lt;a href="https://dev.to/teppana88/release-ai-agent-code-safely-production-cicd-secrets-5ecj"&gt;Release - Part 3&lt;/a&gt;) &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrospect&lt;/strong&gt; - The honest verdict → what paid off, what blew up, what’s next&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ I’m using Firebase Studio with Gemini 2.5 Pro here, but the control‑stack principles apply to any agent framework or IDE. I also ran the same flow locally with VS Code + Claude Sonnet 4 → same results. ⚠️&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you want an AI agent you can &lt;strong&gt;scale&lt;/strong&gt;, one that follows your rules, works like the rest of your team, and delivers repeatable results, this series is for you.&lt;/p&gt;

&lt;p&gt;👉 Let’s dive in - this is Part 1: &lt;strong&gt;The Control Stack Blueprint&lt;/strong&gt;.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyk49mvyurnij5xgrmhj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyk49mvyurnij5xgrmhj.png" alt="AI agent control stack"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  1. System Prompt - How to Align the AI Agent’s Mindset
&lt;/h2&gt;

&lt;p&gt;Every dev has seen it. When you use agent mode in your IDE, the agent suddenly touches files it shouldn’t, skips tests, and leaves TODOs in code. We’ve all tested those limits and made fixes by tweaking prompts.&lt;/p&gt;

&lt;p&gt;To make an AI agent behave like a reliable teammate (not a loose cannon) we need to define at a &lt;strong&gt;much deeper level&lt;/strong&gt; &lt;em&gt;how&lt;/em&gt; it works, not just &lt;em&gt;what&lt;/em&gt; it builds. In other words:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to do the work&lt;/strong&gt; → covered here in this &lt;strong&gt;Control phase&lt;/strong&gt; article
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What work to do&lt;/strong&gt; → tackled in the next &lt;strong&gt;Build phase&lt;/strong&gt; article&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s start with the foundation: shaping how the AI agent thinks, acts, and writes code by creating a &lt;strong&gt;System Prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In my setup it’s not just one file; it’s the full &lt;code&gt;/rules/&lt;/code&gt; directory. Together, these files define the AI agent’s mindset: how it commits, how it asks for help, how it escalates risk, how it tests its own code, and how it avoids doing anything dumb. This System Prompt drives the agent’s &lt;strong&gt;strategic reasoning&lt;/strong&gt;: long‑term decision‑making, adherence to your organization’s SDLC (Software Development Life Cycle) rules, trade‑offs, and actions that consider multiple future scenarios.&lt;/p&gt;

&lt;p&gt;These rules don’t live in the prompt itself. They’re loaded at the start of every task, just as a human would check the team’s coding guide or dev playbook before writing production code.&lt;/p&gt;

&lt;p&gt;Unlike the user prompt (more on that in Part 2), there’s no negotiating with the &lt;strong&gt;System Prompt&lt;/strong&gt;. This is the law. The AI agent doesn’t get a vote; it just follows the values, rules, and expectations I defined.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This was the first step in enforcing the Control Stack: shaping how the agent thinks before it even looks at the task.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  2. Role Assignment: Planner → Executor → Validator
&lt;/h2&gt;

&lt;p&gt;Once I gave the AI agent clear rules (System Prompt), another question popped up: how do I make sure it actually follows them?&lt;/p&gt;

&lt;p&gt;Turns out, trying to make an AI agent do everything at once (plan, code, test, validate) is a great way to watch it spiral into spaghetti logic and forgotten files.&lt;/p&gt;

&lt;p&gt;So I gave the agent three distinct roles, each with a clear mission:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Planner&lt;/strong&gt; → Think before you code: break down tasks, plan architecture, define sub‑tasks, acceptance criteria, and hand a sub‑task to the Executor.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executor&lt;/strong&gt; → For a given sub‑task, write and test production‑grade code, and fix issues.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validator&lt;/strong&gt; → Check the results, validate coverage, test rigorously, and decide whether the task is actually done.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why three? Because even humans suck at multitasking. Specialising each role kept the AI agent sharp, scoped, and far less likely to go rogue. It also helped me debug when things went sideways: I could see which role failed and fix that part of the loop.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Once the three core roles are running smoothly, you can add specialised roles such as Architect, Security, or Product Manager as needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To make this work in production, we should also have a &lt;strong&gt;human‑in‑the‑loop&lt;/strong&gt;, but ideally outside the control loop. Guide at the edges by approving the plan and reviewing PRs. Do this right and you get autonomy with accountability, a teammate that thinks on its own but stays within boundaries. That is the sweet spot.&lt;/p&gt;
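&lt;p&gt;As a mental model (not my actual Firebase Studio wiring), the loop can be sketched in a few lines of Python; &lt;code&gt;run_role&lt;/code&gt; stands in for an LLM call with the matching role prompt:&lt;/p&gt;

```python
# Minimal sketch of the Planner -> Executor -> Validator control loop.
# run_role() stands in for a real LLM call with the matching role prompt;
# everything here is illustrative, not a finished framework.

def escalate_to_human(subtask):
    # Guardrail: surface the blocker instead of grinding forever.
    print(f"BLOCKED: {subtask} needs human input")

def control_loop(task, run_role, max_rounds=5):
    """Drive one task through Planner -> Executor -> Validator."""
    plan = run_role("planner", {"task": task})        # break task into sub-tasks
    done = []
    for subtask in plan["subtasks"]:
        for _ in range(max_rounds):
            result = run_role("executor", {"subtask": subtask})
            verdict = run_role("validator", {"result": result})
            if verdict["passed"]:                     # acceptance criteria met
                done.append(subtask)                  # one sub-task, one commit
                break
        else:
            escalate_to_human(subtask)                # loop exhausted: escalate
    return done
```

&lt;p&gt;The point isn’t the code, it’s the shape: each role sees only its own scoped context, failed attempts are counted, and the human sits outside the loop, pulled in only on escalation.&lt;/p&gt;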
&lt;h3&gt;
  
  
  2.1 Planner - The Architect
&lt;/h3&gt;

&lt;p&gt;The Planner is the brain. It doesn’t write code; it figures out what needs to be written.&lt;/p&gt;

&lt;p&gt;Every time a new task begins, the Planner receives a ready‑made context: all relevant files are pre‑loaded (&lt;code&gt;task.md&lt;/code&gt;, &lt;code&gt;planning.md&lt;/code&gt;, and &lt;code&gt;/rules/&lt;/code&gt;). This setup gives the Planner everything it needs to plan the implementation, including scoped instructions and constraints that form the task’s mission.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How this context is constructed, including the initial prompt and the user prompt, is covered in Part 2.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Based on the received context, the Planner:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyses the architecture and breaks the task into manageable sub‑tasks
&lt;/li&gt;
&lt;li&gt;Defines acceptance criteria for each sub‑task
&lt;/li&gt;
&lt;li&gt;Updates &lt;code&gt;task.md&lt;/code&gt; with new task entries, IDs, and current progress
&lt;/li&gt;
&lt;li&gt;Highlights dependencies or missing inputs
&lt;/li&gt;
&lt;li&gt;Ensures the plan aligns with the project’s long‑term structure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is clarity. If the Planner messes up, everything downstream goes sideways.&lt;/p&gt;

&lt;p&gt;Once the plan is complete, the AI agent moves to the next phase.&lt;/p&gt;
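&lt;p&gt;What does a &lt;code&gt;task.md&lt;/code&gt; entry look like? Here’s an illustrative sketch (the fields and ID are mine for this example, not a fixed schema):&lt;/p&gt;

```markdown
## ID-42: Add offline caching to HomeScreen
- Status: in-progress   (Planner sets, Validator updates)
- Acceptance criteria:
  - [ ] Cached feed renders when the device is offline
  - [ ] Unit tests cover the cache-miss path
- Blockers: none
- Notes: attempt 2 failed `flutter analyze`; fixed in attempt 3
```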

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Don’t over‑restrict your AI agent. Guide it just enough; you only learn the right balance by testing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  2.2 Executor - The Builder
&lt;/h3&gt;

&lt;p&gt;The Executor is the pair of hands. It picks up one sub‑task, utilizes the predefined context from the Planner, and starts writing production‑ready code.&lt;/p&gt;

&lt;p&gt;Its job isn’t just to code; it also needs to:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow the System Prompt rules: coding style, test‑first discipline, safe commit strategy
&lt;/li&gt;
&lt;li&gt;Run static analysis (&lt;code&gt;flutter analyze&lt;/code&gt;) and write unit tests
&lt;/li&gt;
&lt;li&gt;Respect architectural boundaries, don’t rewrite unrelated files
&lt;/li&gt;
&lt;li&gt;Solve issues if tests fail, using debug strategies (like printing state)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If something fails repeatedly, the Executor stops and escalates to me instead of grinding forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Executor - Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Honestly, this was the toughest part of the 30‑day sprint.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rules were too vague. Without strict coding and testing rules in the System Prompt, the agent just improvised.
&lt;/li&gt;
&lt;li&gt;Context drift was real. Long chats or fuzzy task descriptions made it lose focus fast.
&lt;/li&gt;
&lt;li&gt;Doing too much at once (planning, building, and testing in one go) led to sloppy, unpredictable behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole goal was to build an autonomous AI agent that writes production‑ready code I could trust without babysitting it 24/7. It took me about 15 days of constant tweaking to get this balance right, but in the end I found a good middle ground that worked for this project.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Give your AI agent concrete targets. It won’t write code exactly like you do, but you can train it to stick to your quality bar. Think of it like a senior‑to‑junior dev relationship: mentor it, don’t micromanage it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Set clear guardrails. For example: “If you fail to fix the same issue five times, stop and ping me.” That one rule alone saved me hours.&lt;/p&gt;
&lt;/blockquote&gt;
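&lt;p&gt;In practice these guardrails are just plain prose in the rules files. A hypothetical excerpt (the wording is illustrative, not a required syntax):&lt;/p&gt;

```markdown
# development-workflow.instructions.md (excerpt, illustrative)

- Run `flutter analyze` and the relevant tests after every file change.
- If the same fix fails 5 times, STOP: mark the sub-task as blocked
  in task.md and ask the human for input.
- When a bug is unclear, add temporary debug prints, re-run, and read
  the output before editing more code.
- Never modify files outside the sub-task's declared scope.
```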
&lt;h3&gt;
  
  
  2.3 Validator - The Safety Net
&lt;/h3&gt;

&lt;p&gt;The Validator is the reviewer. Once the Executor thinks it’s done, the Validator steps in and double‑checks:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run all relevant tests and linters again
&lt;/li&gt;
&lt;li&gt;Verify that acceptance criteria from the Planner are fully met
&lt;/li&gt;
&lt;li&gt;Look for missing coverage, skipped test logic, or flaky behavior
&lt;/li&gt;
&lt;li&gt;Confirm that only the expected files changed, nothing outside scope
&lt;/li&gt;
&lt;li&gt;Make a clean commit: one sub‑task, one atomic commit
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If anything fails validation, it bounces the task back to the Planner or Executor with an error message and reason.&lt;/p&gt;

&lt;p&gt;Once all checks pass, the Validator triggers the final commit and marks the sub‑task as complete in &lt;code&gt;task.md&lt;/code&gt;. If multiple sub‑tasks are defined, the cycle repeats and each one goes through the same Planner → Executor → Validator loop.&lt;/p&gt;

&lt;p&gt;The Validator helps avoid the “looks fine to me” trap. It forces the agent to prove quality, not just assume it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validator - Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By the end of the sprint I didn’t check every line by hand during development. Instead, the &lt;strong&gt;Validator’s&lt;/strong&gt; task was to find problems automatically. If the AI agent hit a blocker it couldn’t fix, it flagged that sub‑task as &lt;em&gt;blocked&lt;/em&gt; in &lt;code&gt;task.md&lt;/code&gt; and then pinged me for input.&lt;/p&gt;

&lt;p&gt;But your AI agent can get &lt;strong&gt;stuck in loops&lt;/strong&gt;. Mine once rewrote the same unit test about ten times in a row until I stopped it; together we found the right solution, and after that it finished in a few iterations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Another lesson:&lt;/strong&gt; the AI agent once wrote a test loop that spammed the console so badly it pushed massive logs straight into my Gemini prompt, nearly blowing up my token budget.&lt;br&gt;&lt;br&gt;
Luckily, Gemini has some sanity checks. Otherwise my bill would’ve gone straight to the moon. Consider adding your own guardrails to catch runaway output early → every token costs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The input token count (3 076 984) exceeds the maximum number of tokens allowed (1 048 576).
══╡ EXCEPTION CAUGHT BY RENDERING LIBRARY^C
 *  The terminal process "bash '-c', '(set -o pipefail &amp;amp;&amp;amp; flutter test test/features/home/view/home_screen_test.dart 2&amp;gt;&amp;amp;1 | tee /tmp/ai.1.log)'" terminated with exit code: 130. 
 *  Terminal will be reused by tasks, press any key to close it. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
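&lt;p&gt;A cheap guardrail for this failure mode is to clamp any terminal output before it re-enters the prompt. A sketch (the character limit is arbitrary; tune it to your model’s token budget):&lt;/p&gt;

```python
# Sketch: cap runaway terminal output before it reaches the model's context.
MAX_CHARS = 20_000  # arbitrary cap; tune to your token budget

def clamp_log(text, max_chars=MAX_CHARS):
    """Keep the head and tail of a huge log; the middle is rarely useful."""
    if len(text) > max_chars:
        half = max_chars // 2
        dropped = len(text) - max_chars
        return text[:half] + f"\n...[skipped {dropped} chars]...\n" + text[-half:]
    return text
```

&lt;p&gt;Piping the &lt;code&gt;tee&lt;/code&gt; output through a filter like this would have kept that 3M-token blowup at a fixed, predictable cost.&lt;/p&gt;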



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Your AI agent tries to diagnose bugs by reading code and terminal logs. When I personally hit tricky bugs during development, I normally add extra debug prints. I instructed the AI agent to do the same.&lt;br&gt;
Once it learned how and when to use extra debug prints, it solved most bugs in five tries or fewer.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. My Rules Folder - The Real Secret Weapon
&lt;/h2&gt;

&lt;p&gt;We’ve now defined the AI agent’s three roles (Planner, Executor, and Validator) each with clear responsibilities. So where do these responsibilities actually live? How does the agent know &lt;em&gt;how&lt;/em&gt; to plan, execute, test, commit, or ask for help?&lt;/p&gt;

&lt;p&gt;That’s what the &lt;code&gt;/rules/&lt;/code&gt; folder is for. It’s the instruction manual, the coding playbook, the &lt;strong&gt;System Prompt&lt;/strong&gt;, all written down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s in &lt;code&gt;/rules/&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The folder is small enough for the AI agent to read on every pass, yet opinionated enough to steer it towards my coding style and production requirements.&lt;/p&gt;

&lt;p&gt;These rules weren’t static. During the first fifteen days, I tweaked them daily. When the agent slipped up, I updated the rules. By the second half of the sprint I needed to tweak less and less, and the AI agent started to understand how I wanted to approach development.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Don’t over‑constrain your AI agent, but don’t leave it wandering, either. In many cases, the AI agent actually knows better than you how to write a block of code, but only if you teach it how to behave, not exactly what lines to type.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Agent &lt;code&gt;/rules/&lt;/code&gt; folder content ↓&lt;/strong&gt;&lt;br&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;My Rules Folder - What My AI Agent Reads (tweak for your own use case)
(each file is 10‑70 lines)

✅ airules.md
→ Core principles &amp;amp; how to chain docs.

✅ accessibility.instructions.md
→ WCAG‑AA, large touch targets, screen‑reader OK.

✅ code-reuse.instructions.md
→ Reuse core/utils and shared test helpers.

✅ context-management.instructions.md
→ Ask specific PLANNING slices, skip huge files.

✅ development-workflow.instructions.md
→ File‑by‑file, analyse/test, quality gates.

✅ flutter-mobile.instructions.md
→ Riverpod, feature‑first, ≤500 code lines per file.

✅ git-workflow.instructions.md
→ Short branches, ID‑xx commits, CLI only.

✅ pull-request.instructions.md
→ Auto‑PR via CLI, extract ID‑xx, shell‑safe text.

…and more whenever the agent slips up.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;
  &lt;strong&gt;Real‑world tips for writing effective AI agent rules&lt;/strong&gt;
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Focus on defining clear behavior rules: how to ask questions, how to commit, how to test, what style to follow - not micromanaging every line.

Tell it to reuse existing files, methods, and structure; don’t start from scratch every time.

- Give your AI agent real responsibility.
- It won’t write code exactly like you do and *that’s fine*.
- Your job is to make sure it sticks to your quality bar.
- Think of it like a senior developer working with a junior: you’re the mentor, not the babysitter.

Set simple guardrails:
– If it fails the same fix five times, stop and ask you.
– If a bug is tricky, let it use extra debug prints *just like you would*.
– If the plan is unclear, force it to ask questions first.

Do this right and your AI agent will surprise you, not by guessing less, but by guessing better.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Keep It Clean - Context Hygiene
&lt;/h2&gt;

&lt;p&gt;One thing I learned the hard way: the longer a single chat context grows, the more the AI agent’s code quality tanks. It starts to hallucinate, miss obvious clues, or spam half‑baked suggestions.&lt;/p&gt;

&lt;p&gt;The fix is simple: keep tasks small and short. Treat every task like a fresh sprint: reset the context, give your AI agent a clean slate.&lt;/p&gt;

&lt;p&gt;Small tasks + clear phases + fresh context = your agent stays sharp, predictable, and actually finishes what you ask.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; If you’re pair‑programming with your AI agent, open a new chat more often than you think you need. Context bloat is real, and fresh chats keep your AI agent focused.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Trust Is Earned, Even for an AI Agent
&lt;/h2&gt;

&lt;p&gt;Treat the AI agent like a new developer who just joined your team: talented, but not yet at full speed.&lt;/p&gt;

&lt;p&gt;Your Planner → Executor → Validator loop is &lt;strong&gt;its work audition&lt;/strong&gt;. Every cycle shows you whether the agent actually writes code the way you expect it to.&lt;/p&gt;

&lt;p&gt;When (not if) it stumbles, modify the control stack → update &lt;code&gt;/rules&lt;/code&gt;, refine prompts, replay the loop. This learning loop only works if you &lt;em&gt;build real features&lt;/em&gt; with the agent, not just ask it for code snippets.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pro tip:&lt;/strong&gt; Start by pair‑programming with your AI agent.&lt;br&gt;&lt;br&gt;
Watch how it behaves and explains its reasoning; you’ll quickly spot whether it follows your &lt;strong&gt;Planner → Executor → Validator&lt;/strong&gt; flow or just makes things up. Use these insights to sharpen your &lt;code&gt;/rules/&lt;/code&gt; files and tighten the control loop.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With a clear loop and hands‑on experience, the agent can grow from a curious intern into a trusted teammate. But only if &lt;strong&gt;you invest time&lt;/strong&gt; to coach it.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Key Takeaways - Part 1 ✅
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lock your System Prompt early.&lt;/strong&gt; It anchors the agent’s mental model and yields the same behavior consistently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Externalize guardrails in &lt;code&gt;/rules/&lt;/code&gt;.&lt;/strong&gt; Each rule-file is a contract the agent must obey.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roles over tasks.&lt;/strong&gt; Planner → Executor → Validator keeps scope sharp and failures traceable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short context loops beat marathon chats.&lt;/strong&gt; Reset state often to avoid drift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control ⇒ Trust.&lt;/strong&gt; Pair-program first, then grant autonomy when the agent consistently passes the loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Next Up - The Real Code Test:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Now you’ve got the blueprint, the &lt;code&gt;/rules&lt;/code&gt; folder, and a single, disciplined loop.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, I’ll show you what happened when the AI agent hit real Flutter code, crossed into native iOS/Android glue, and how this blueprint started to turn into concrete results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 Got your own way to keep AI agents from running wild? Drop your favourite guardrail tricks or rules in the comments. I’d love to compare notes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;👉 &lt;em&gt;Stay tuned — [Part 4 → Coming Soon]&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>softwaredevelopment</category>
    </item>
  </channel>
</rss>
