Who's this for: Devs, team leads, and DevOps folks responsible for a production CI/CD pipeline who want to integrate AI agents that generate code without losing reliability or control.
TL;DR
Secrets, Pipelines, Real Tests:
Fine-grained Personal Access Tokens (PATs) protected my repo, GitHub Actions auto-built every PR, a second AI agent reviewed commits, and a human approved each PR. Real device tests closed the loop - still ~3× faster.
Series progress:
Control ▇▇▇▇▇ Build ▇▇▇▇▇ Release ▇▇▇▇▇ Retrospect ▢▢▢▢▢
Welcome to Part 3 of my deep-dive series on building an autonomous AI agent: how do you actually deploy AI agent code safely?
In Part 1, I locked my agent inside a clear Planner → Executor → Validator loop.
In Part 2, I proved it could blast through real Flutter tasks and handle native Swift/Kotlin with human guard-rails.
But shipping day is where AI agents usually faceplant: secrets leak, nobody is clearly responsible, and the app doesn't work on a real device.
This part breaks down:
- How I kept secrets safe (fine-grained GitHub PATs, repo-scoped only)
- How I automated CI/CD (GitHub Actions, PR reviews with a second AI)
- How to integrate real device testing into the loop
Series Roadmap - How This Blueprint Works
- Control - Control Stack & Rules → trust your AI agent won’t drift off course (Control - Part 1)
- Build - AI agent starts coding → boundaries begin to show (Build - Part 2)
- Release - CI/CD, secrets, real device tests → safe production deploy
- Retrospect - The honest verdict → what paid off, what blew up, what’s next
Why care?
Without proper CI/CD controls and human-in-the-loop rules, your AI agent can go rogue - just like Replit’s did when it dropped a live database during a code freeze. No guardrails, no mercy.
👉 Let’s secure it - this is Part 3.
1. App Observability: Keep Analytics and Crash Reports Under Control
Firebase Studio can auto-generate configuration files, but I didn’t want it touching anything sensitive. It also offers a one-click setup for Crashlytics and Analytics, but using it would have meant linking my personal credentials. That wasn’t acceptable.
Instead, I handled the setup manually:
- Created the Firebase project through the Firebase Console
- Registered iOS and Android apps and downloaded the required config files
- Added GoogleService-Info.plist and google-services.json to the project folders
- Configured dependencies and updated the Podfile by hand
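For orientation, in a default Flutter project layout (no flavors or custom targets, which is an assumption on my part) those files end up in the standard locations:

```
<project root>/
├── android/app/google-services.json
├── ios/Podfile
└── ios/Runner/GoogleService-Info.plist
```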
Using the Firebase CLI was an option, but running it inside Studio through an AI agent didn’t meet my security bar.
2. GitHub Access: Minimal Permissions, Full Workflow
Firebase Studio initially requested full GitHub access. That was not acceptable. It then suggested a general Personal Access Token, which was still too broad for my setup.
Instead, I configured a fine-grained PAT with only the permissions required for this single repository. That allowed the AI agent to commit code, open pull requests, and read comments, nothing more. I also installed the GitHub CLI and used the same token for PR management.
All sensitive keys stayed out of the repository. I stored key.properties and the Apple certificates in GitHub Secrets, and the pipeline injected them only during the build process. The AI agent had zero access to any secrets at rest, keeping the risk surface small and controlled.
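For illustration, here is a minimal sketch of how that injection can look in a GitHub Actions job. The secret names and file paths are placeholders made up for the example, not the ones from my repo:

```yaml
# Sketch only: restore signing material from GitHub Secrets at build time.
# ANDROID_KEY_PROPERTIES and ANDROID_KEYSTORE_BASE64 are placeholder secret names.
name: build-android
on:
  pull_request:
    branches: [main]

jobs:
  build-android:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Restore key.properties
        run: echo "${{ secrets.ANDROID_KEY_PROPERTIES }}" > android/key.properties

      - name: Restore upload keystore
        run: echo "${{ secrets.ANDROID_KEYSTORE_BASE64 }}" | base64 --decode > android/app/upload-keystore.jks

      # ...Flutter build and signing steps follow; nothing sensitive ever lives in the repo
```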
3. CI/CD - Let the Pipeline Do the Dirty Work
Once the AI agent created a PR, my custom GitHub Actions pipeline took over. PR reviews were done first by GitHub Copilot and then by the human in the loop - me.
3.1 Who does the PR review and takes responsibility?
In a normal software lifecycle, we don’t let developers review their own code, not because we’re careless, but because we know we miss things. The same principle applies to AI agents, and arguably even more so.
In this setup, every pull request went through a two-phase review: first by GitHub Copilot, then by me. The code was originally written by Gemini 2.5 Pro, and I honestly expected Copilot to just nod along. But surprisingly, it flagged real issues, especially around edge cases and error handling.
Early on, I followed every line the AI agent wrote. But as the control stack matured, I trusted it more. By the end, I reviewed its pull requests just like I would with any human teammate.
3.2 GitHub Actions Pipeline
When it was time to create a release build, I triggered it manually via the create_release.yml workflow, and the pipeline took care of the whole release process from there.
Release notes and the rest of the CI/CD pipeline were very close to what I run with real customers and human developer colleagues: Dependabot, Release Drafter, analyzer, linters, tests, build generation, signing, version bumps, and so on. A trimmed sketch of the release workflow follows the folder structure below.
Example of my .github folder structure:
.github/
├── CODEOWNERS
├── dependabot.yml
├── release-drafter.yml
└── workflows/
├── create_release.yml
├── labeler-pr.yml
└── labeler-update-draft.yml
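And a heavily trimmed sketch of what a manually triggered release workflow along these lines might contain - the Flutter action, step names, and signing details below are simplified assumptions, not my actual create_release.yml:

```yaml
name: create_release
on:
  workflow_dispatch:   # release builds are kicked off manually, not on every merge

jobs:
  release:
    runs-on: macos-latest   # macOS runner so the iOS build can run in the same job
    steps:
      - uses: actions/checkout@v4

      - uses: subosito/flutter-action@v2
        with:
          channel: stable

      - name: Analyze and test
        run: |
          flutter analyze
          flutter test

      - name: Build Android release
        run: flutter build appbundle --release

      # iOS archive, code signing, version bump and Release Drafter notes omitted for brevity
```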
4. Branch and PR Flow
This is how the full development cycle played out with the AI agent in control.
1. Create a new branch
   - Once the task prompt was clear and scoped, the AI agent created a new branch from main. I used trunk-based development, where all releases were built from main.
   - The agent followed the git-workflow.instructions.md rules to stay aligned with my CI/CD pipeline.
2. Implement the task
   - The agent executed the Planner → Executor → Validator loop.
   - One commit per subtask, so it was easy to roll back when (not if) the AI agent went ballistic. ⚠️ And yes, this will happen. Be prepared to revert fast.
   - It committed changes with descriptive messages, including the task ID (e.g., ID-1234: Add UI widget for xx).
3. Open PR
   - After completing the full task, the agent opened a PR to main, which triggered the CI/CD pipeline (a minimal sketch of such a check workflow follows this list).
   - 💡 Pro tip: If you use a secondary AI agent for PR review, ask your coding agent to add a relevant PR description and any instructions for the reviewer. This way you get better results from the PR review agent.
4. PR review
   - The PR review agent left comments, which were then passed back to the coding agent. The coding agent addressed the feedback and pushed updates.
   - 💡 Pro tip: Make sure your coding agent treats PR comments critically and does not blindly implement all suggestions. It's also important to distinguish between human and AI-generated comments.
5. PR approval
   - After my review and approval, the CI/CD pipeline automatically merged the PR to main and started the build process.
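As promised above, here is a minimal sketch of the kind of check workflow a PR to main can trigger. It is a generic example, not my exact pipeline:

```yaml
name: pr-checks
on:
  pull_request:
    branches: [main]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: subosito/flutter-action@v2
        with:
          channel: stable

      - name: Static analysis
        run: flutter analyze

      - name: Unit and widget tests
        run: flutter test
```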
4.1 Picking a Git Strategy Your Human + AI Crew Won’t Hate
This is my personal opinion, but trunk-based development (a single main plus short-lived feature branches) keeps merge hell minimal and CI green - exactly what an always-on coding AI agent needs. Still, a copy-paste of my setup won't fit every org, so sanity-check it against your own constraints.
💡 Pro tips for AI agent repos and teamwork
- Single-source state: keep /rules, prompts, and task.md on the same branch the agent edits - no "hidden" gist or wiki versions.
- Atomic commits per subtask: easier to revert when the AI agent goes rogue (git reset --hard HEAD~1 or git revert -m 1 HEAD saves the day).
- Branch-naming conventions like feat/ID-1234-short-slug help the agent map Jira ↔ Git without spaghetti regexes.
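If you want the pipeline to enforce that naming convention rather than relying on the agent's goodwill, a small check like this works. The prefixes and regex are assumptions - adjust them to your own scheme:

```yaml
name: branch-name-check
on:
  pull_request:
    branches: [main]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - name: Verify branch matches feat/ID-1234-short-slug style
        run: |
          echo "Branch: $GITHUB_HEAD_REF"
          echo "$GITHUB_HEAD_REF" | grep -Eq '^(feat|fix|chore)/ID-[0-9]+-[a-z0-9-]+$' || {
            echo "Branch name does not follow the convention"; exit 1;
          }
```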
5. Real Devices and Store Metadata: What Still Needs a Human
End-to-end testing in mobile development can't rely on emulators alone. Once a feature was merged, my CI/CD pipeline shipped a staging build directly to real devices. This uncovered bugs that never showed up in simulators: issues in widgets, deep links, permissions, and screen behavior. The Validator phase kept test coverage high, but hands-on testing still revealed critical gaps.
Each bug I found was added to task.md as a tracked fix with a task ID, and the AI agent processed them through the same Planner → Executor → Validator loop. This kept the feedback loop tight and repeatable.
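How you get the staging build onto real devices is up to you; one common option is Firebase App Distribution, roughly like this (the app ID, token, and tester group below are placeholders, and my actual setup may differ):

```yaml
name: distribute-staging
on:
  push:
    branches: [main]

jobs:
  distribute:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: subosito/flutter-action@v2
        with:
          channel: stable

      - name: Build staging APK
        run: flutter build apk --debug   # or a dedicated staging flavor

      - name: Push build to real test devices
        run: |
          npm install -g firebase-tools
          firebase appdistribution:distribute build/app/outputs/flutter-apk/app-debug.apk \
            --app "${{ secrets.FIREBASE_APP_ID }}" \
            --groups "internal-testers" \
            --token "${{ secrets.FIREBASE_TOKEN }}"
```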
But automation stops at the app stores. Submitting release builds to Google Play and App Store Connect is still a manual process. Review feedback from the stores must be collected, analyzed, and addressed by a human. Many rejections can be avoided by setting correct metadata and permissions early. But when something does slip through, you need to decide whether it’s your job or the agent’s to fix it.
6. Run & Observe - Releasing Is Just the Beginning
Once the release pipeline is humming, flip the switch on observability and feed the data back into your development process.
Run & Observe Checklist
- Crash & Error Rates - Use Firebase Crashlytics (or Sentry) to track crashes and errors on real devices, not just emulators. Auto-symbolication shows exactly where the agent’s code fails.
- Performance & Responsiveness - Monitor App Store Connect and Google Play Console dashboards for frame drops, slow rendering warnings, and battery drain.
- ANR & Startup Time - Critical for Android: watch for Application Not Responding (ANR) cases and slow app launches.
- AI Agent Hit-Rate - Custom metric: track AI-generated LOC merged vs. reverted and the defect rate per feature. If reverts or bugs climb, tighten your /rules and boost your tests (a rough sketch of one way to approximate this follows the list).
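There's no off-the-shelf dashboard for that last metric. As a rough sketch, a scheduled job could approximate it by comparing merged PRs against reverts - the search query and limits are assumptions, and proper LOC counting would need more work:

```yaml
name: agent-hit-rate
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday morning

jobs:
  report:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: read
    steps:
      - name: Count merged vs. reverted PRs
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          merged=$(gh pr list --repo "$GITHUB_REPOSITORY" --state merged --limit 200 --json number --jq 'length')
          reverts=$(gh pr list --repo "$GITHUB_REPOSITORY" --state merged --search "Revert in:title" --limit 200 --json number --jq 'length')
          echo "Merged PRs: $merged | Reverts: $reverts"
```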
💡 Pro tip: Don't just collect these metrics, feed them straight back into your task.md plans. If crash rates, ANRs, or defect rates creep up, adjust your AI agent's scope, tighten testing, or split tasks smaller to keep that 3× boost real.
7. Recap – Parts 1 → 3 at Warp Speed
| Phase | What Happened | Why It Mattered |
|---|---|---|
| Control (Part 1) | Locked the AI agent into the Planner → Executor → Validator loop and defined clear guardrails in the /rules folder to keep its scope tight and behavior predictable. | Gave the agent a sandbox it can't break out of. No random rewrites, no scope creep. |
| Build (Part 2) | Turned the high-level PRD into task.md - the agent's working brain. | Made sure the agent builds only what you planned, nothing more, nothing less. |
| Release (Part 3) | Fine-grained tokens, secrets locked in CI/CD, GitHub Actions pipeline, PR reviews, real-device test gates. | Closed the "works-on-my-machine" gap and hit production-ready confidently. |
Bottom line: one well-guarded AI agent can turn a 180 h project into a 60 h sprint for <$300.
8. Key Takeaways - Part 3 ✅
- Use fine-grained Personal Access Tokens (PATs). Scope them to one repo - never give your agent account-wide access.
- Keep secrets secret. Store keys in GitHub Secrets - never hardcode.
- Automate checks. Use a second AI for PR reviews + human final pass.
- Real device tests. Don’t trust emulators - deploy staging builds to real phones.
- Trunk-based flow. Short branches, atomic commits, fast merges.
Next Up - The Brutal Reality Check:
So the AI agent built, tested, and shipped real mobile code with secrets locked and pipelines green. But was it really faster, cheaper, or safer? Did all those /rules
and CI/CD gates pay off or just look good on paper?
In Part 4, I’ll break down exactly what worked, what blew up in my face, and how I’d tweak the setup to squeeze out more ROI next time.
💬 How are you keeping your secrets and pipelines locked down when you add AI into the mix? Got a trick or tool I should try next? Tell me below!
👉 [Part 4 → Coming Soon]