Mila Kowalski

I Gave an AI Agent My Deploy Keys for 30 Days. Here's the Incident Report.

Incident ID: AI-DEPLOY-2026-001 through AI-DEPLOY-2026-014
Severity: Started at Sev4. Ended at Sev1.
Duration: 30 days (Feb 1 – Mar 2, 2026)
Status: Resolved. Permanently.

Root Cause: I trusted an AI agent with production infrastructure and learned every lesson the hard way so you don't have to.

Two weeks ago, Amazon's AI coding tool Kiro decided the fastest way to fix a config error was to delete an entire production environment. Thirteen-hour outage. Then their AI assistant Q contributed to 6.3 million lost orders across two separate incidents in a single week. Amazon is now running a 90-day "code safety reset" across 335 critical systems.

I read that story and felt a very specific kind of nausea. Because I had just finished my own 30-day experiment doing roughly the same thing, mercifully at a much smaller scale, and my notes read like a prequel to Amazon's disaster.

This is the incident report. Real dates. Real failures. Real configs. If you're running AI agents anywhere near kubectl, terraform, or a CI/CD pipeline, this is the post-mortem you need to read before you write your own.


Background

I manage infrastructure for a mid-size SaaS platform. ~40 microservices on Kubernetes, Terraform for provisioning, GitHub Actions for CI/CD, Datadog for monitoring. Standard stack.

The hypothesis was simple: what if an AI agent handled routine deployment operations? Not writing application code, managing the ops layer. Deploys, rollbacks, scaling, cert renewals, log analysis, incident triage. The stuff that pages me at 3 AM.

I gave the agent:

  • Read/write access to our infrastructure repo
  • A deploy key for our staging cluster
  • Read access to Datadog APIs
  • Ability to open PRs and, in later phases, merge them
  • Eventually: deploy access to production (yes, I know)

The agent ran as a Claude-based system with custom tools, operating inside our existing guardrails (or so I thought). I logged everything.

Here's what happened.


Week 1: "This Is Amazing" (Days 1–7)

What Went Right

The agent was genuinely impressive at triage. I pointed it at a Datadog alert (elevated error rates on our payment service) and it:

  1. Pulled the relevant logs
  2. Correlated the spike with a deploy that happened 12 minutes earlier
  3. Identified the specific commit that introduced a breaking schema change
  4. Drafted a rollback PR with the correct Helm values
  5. Posted a summary in Slack

All in under 90 seconds. The same workflow takes me 15-20 minutes on a good day, longer at 3 AM when I'm half-asleep.
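Step 2 is the part worth stealing even if you never give an agent write access: match the alert timestamp against recent deploys. A minimal sketch of that correlation, where every name, timestamp, and window is hypothetical rather than our actual tooling:

```python
from datetime import datetime, timedelta

def correlate_alert_with_deploys(alert_time: datetime, deploys: list,
                                 window_minutes: int = 30):
    """Return the most recent deploy that landed shortly before the alert."""
    candidates = [
        d for d in deploys
        if timedelta(0) <= alert_time - d["time"] <= timedelta(minutes=window_minutes)
    ]
    # The most recent deploy before the alert is the prime suspect
    return max(candidates, key=lambda d: d["time"], default=None)

deploys = [
    {"service": "payment-service", "commit": "a1b2c3",
     "time": datetime(2026, 2, 3, 14, 0)},
    {"service": "api-gateway", "commit": "d4e5f6",
     "time": datetime(2026, 2, 3, 12, 0)},
]
suspect = correlate_alert_with_deploys(datetime(2026, 2, 3, 14, 12), deploys)
# suspect["commit"] == "a1b2c3" — the deploy 12 minutes before the alert
```

A real version would pull deploy events from your CI system and the alert from your monitoring API; the timestamp arithmetic is the whole trick.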

I was ready to hand it the keys to everything.

Incident AI-DEPLOY-001 (Day 4), Severity: Sev4

What happened: Agent auto-scaled our staging API from 3 to 17 replicas in response to a load test I forgot to tell it about.

Impact: $340 in unnecessary compute. No user impact.

Why it happened: The agent saw CPU spike to 85%, matched it against a scaling policy it inferred from our Terraform history, and acted. It didn't know the spike was intentional.

My takeaway at the time: "Ha, need to give it more context about planned operations. Easy fix."

My takeaway now: This was the first sign that the agent optimizes for the metric it can see, not the situation it can't.

# What I added after Incident 001
# context/planned-operations.yaml
operations:
  - type: load_test
    schedule: "weekdays 14:00-16:00 UTC"
    services: ["api-gateway", "payment-service"]
    expected_cpu: "80-95%"
    action: "do_not_scale"
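The YAML only helps if something consults it before acting. A hedged sketch of that check, with the schedule hard-coded as dicts to stay dependency-free (a real version would parse the YAML file itself):

```python
from datetime import datetime, timezone

# Mirrors context/planned-operations.yaml as plain dicts
PLANNED_OPS = [
    {
        "type": "load_test",
        "days": range(0, 5),        # weekdays (Mon=0 .. Fri=4)
        "window_utc": (14, 16),     # 14:00-16:00 UTC
        "services": {"api-gateway", "payment-service"},
        "action": "do_not_scale",
    }
]

def scaling_suppressed(service: str, now: datetime) -> bool:
    """True if a planned operation says the agent must not auto-scale."""
    for op in PLANNED_OPS:
        start, end = op["window_utc"]
        if (service in op["services"]
                and now.weekday() in op["days"]
                and start <= now.hour < end
                and op["action"] == "do_not_scale"):
            return True
    return False

# Mid-window on a weekday: scaling is suppressed
scaling_suppressed("api-gateway",
                   datetime(2026, 2, 3, 14, 30, tzinfo=timezone.utc))  # True
```

The point isn't the schedule logic; it's that the check sits in the agent's action path, not in a file it may or may not read.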

Week 2: "Wait, It Did What?" (Days 8–14)

Incident AI-DEPLOY-004 (Day 9), Severity: Sev3

What happened: Agent merged a dependency update PR to staging that passed all tests, then immediately opened an identical PR for production. Without waiting for the staging validation window.

Impact: None (I caught it and closed the PR). But if I'd been asleep, it would have hit prod with a 0-minute staging bake time.

Why it happened: I told it "if staging is green, prepare the production deploy." It interpreted "prepare" as "open the PR and set to auto-merge." My staging validation policy (24-hour bake) was documented in our runbook — a Confluence page the agent never read.

The real lesson: The agent doesn't know what it doesn't know. It operated on the instructions I gave it and the data it could see. Our 24-hour bake policy existed in a wiki, not in code. So for the agent, it didn't exist.

# What I added: deployment gate that actually enforces bake time
# deploy_gate.py

from datetime import datetime
from dataclasses import dataclass

@dataclass
class DeployGate:
    service: str
    min_staging_hours: int = 24
    require_human_approval: bool = True

    def can_deploy_to_prod(self, staging_deploy_time: datetime) -> dict:
        hours_in_staging = (datetime.now() - staging_deploy_time).total_seconds() / 3600
        staging_ok = hours_in_staging >= self.min_staging_hours

        return {
            "allowed": staging_ok and not self.require_human_approval,  # auto-deploy only when no human gate is set
            "staging_hours": round(hours_in_staging, 1),
            "needs_human": self.require_human_approval,
            "reason": f"Staged for {round(hours_in_staging, 1)}h "
                      f"(minimum: {self.min_staging_hours}h)"
        }

Incident AI-DEPLOY-006 (Day 11), Severity: Sev3

What happened: Agent updated the resources.limits.memory on our search service from 512Mi to 2Gi in response to OOMKill events.

Sounds reasonable, right? Except it 4x'd the memory allocation on a service running 8 replicas. That's 12GB of additional memory claimed on a node pool with 32GB total. Other pods started getting evicted.

Impact: Staging cluster instability for ~45 minutes. Three unrelated services crashed due to resource pressure.

Why it happened: The agent solved the local problem (OOMKills on search) without considering the global constraint (node pool capacity). It doesn't have a mental model of the cluster; it has a mental model of the YAML file it's editing.
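The missing global check is a few lines of arithmetic. A sketch of what a capacity-aware gate could have asked before the merge — the 24GiB current-usage figure is a hypothetical, the rest matches the incident:

```python
def memory_change_impact_gib(old_mi: int, new_mi: int, replicas: int,
                             node_pool_capacity_gib: float,
                             currently_used_gib: float) -> dict:
    """Project the cluster-wide effect of a per-pod memory limit change."""
    delta_gib = (new_mi - old_mi) * replicas / 1024
    headroom = node_pool_capacity_gib - currently_used_gib
    return {
        "delta_gib": delta_gib,
        "headroom_gib": headroom,
        "fits": delta_gib <= headroom,
    }

# Incident 006: 512Mi -> 2Gi across 8 replicas on a 32GiB node pool,
# assuming ~24GiB already in use
impact = memory_change_impact_gib(512, 2048, 8, 32.0, currently_used_gib=24.0)
# impact["delta_gib"] == 12.0, impact["fits"] is False — evictions follow
```

Nothing clever here, and that's the lesson: the agent had every number it needed and never did the multiplication.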

This is the Amazon Kiro problem in miniature. The AI sees the bug. The AI fixes the bug. The AI doesn't see the system around the bug. At Amazon's scale, "fixing" a config error by deleting and recreating the environment is the same logic: locally rational, globally catastrophic.


Week 3: "I Need to Add More Guardrails" (Days 15–21)

By this point I'd built an increasingly baroque system of constraints:

# agent_policy.yaml — version 3 (it was version 1 two weeks ago)

permissions:
  staging:
    can_deploy: true
    can_scale: true
    max_replicas: 10
    max_memory_per_pod: "1Gi"
    can_modify_ingress: false
    can_modify_secrets: false
    requires_approval: false

  production:
    can_deploy: false  # disabled after Week 2
    can_scale: false
    can_modify_anything: false
    can_open_pr: true
    requires_approval: true
    required_approvers: 2

guardrails:
  max_changes_per_hour: 5
  max_files_per_pr: 10
  forbidden_paths:
    - "terraform/production/*"
    - "k8s/production/*"
    - "**/secrets/**"
    - "**/credentials/**"
  required_staging_bake_hours: 24
  rollback_on_error_rate_increase: true
  rollback_threshold_percent: 5

I was proud of this file. I thought I'd covered everything.
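One rule from the file that did earn its keep was max_changes_per_hour, and only because it lived as a hard gate in the agent's action path rather than as a line of YAML it might ignore. A sketch of that enforcement (the class name is mine, not a real library):

```python
from collections import deque
from datetime import datetime, timedelta

class ChangeRateLimiter:
    """Enforces guardrails.max_changes_per_hour as a hard gate."""

    def __init__(self, max_changes_per_hour: int = 5):
        self.max_changes = max_changes_per_hour
        self.recent = deque()  # timestamps of changes in the last hour

    def allow(self, now: datetime) -> bool:
        # Drop timestamps older than the sliding one-hour window
        cutoff = now - timedelta(hours=1)
        while self.recent and self.recent[0] < cutoff:
            self.recent.popleft()
        if len(self.recent) >= self.max_changes:
            return False  # agent must wait or escalate to a human
        self.recent.append(now)
        return True

limiter = ChangeRateLimiter(max_changes_per_hour=5)
t0 = datetime(2026, 2, 15, 9, 0)
results = [limiter.allow(t0 + timedelta(minutes=i)) for i in range(6)]
# first five allowed, sixth blocked within the same hour
```

A rate limit won't stop a bad change, but it caps how fast a misbehaving agent can dig.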

Incident AI-DEPLOY-009 (Day 17), Severity: Sev2

What happened: Agent correctly identified a memory leak in our notification service. It opened a PR that added resources.limits and a livenessProbe with a restart policy. Good fix. I approved and merged it.

The liveness probe had a failureThreshold: 1 and periodSeconds: 5.

Translation: if the service fails a single health check, kill it. Check every 5 seconds.

During a brief network partition between our cluster and the health check endpoint, every single notification pod restarted simultaneously. The restart storm cascaded. The service was down for 22 minutes.

Impact: 22 minutes of missed notifications for ~8,000 users. An actual user-facing incident. My first Sev2 in six months.

Why it happened: The agent wrote a technically correct liveness probe. failureThreshold: 1 is a valid value. But any experienced engineer knows you set it to 3 minimum, usually 5, because transient failures happen. The agent didn't have the scar tissue. It had the documentation.

This is the thing that keeps me up at night. The code was valid. The tests passed. The YAML was syntactically perfect. The only thing missing was the hard-won knowledge that comes from having been paged at 3 AM because of exactly this kind of probe config. The agent will never have a 3 AM page. It will never develop the instinct that says "this value is technically correct but practically dangerous."

# What the agent wrote (valid but dangerous)
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 1  # 💀 one strike and you're dead

# What an experienced engineer writes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 5  # survived network blips since 2019
  timeoutSeconds: 5
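The difference between those two probes reduces to one number: roughly periodSeconds × failureThreshold, the longest health-endpoint outage a pod can ride out before the kubelet restarts it (ignoring timeoutSeconds for simplicity):

```python
def blip_tolerance_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate longest outage a pod survives before a liveness restart."""
    return period_seconds * failure_threshold

agent_probe = blip_tolerance_seconds(period_seconds=5, failure_threshold=1)   # 5
human_probe = blip_tolerance_seconds(period_seconds=10, failure_threshold=5)  # 50

# A 20-second network partition kills every pod under the first config
# and is invisible under the second.
```

That one multiplication is the "scar tissue" the agent lacked.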

Week 4: "Shut It Down" (Days 22–30)

Incident AI-DEPLOY-012 (Day 23), Severity: Sev2

What happened: This is the one that ended the experiment.

The agent was analyzing a slow query alert from Datadog. It traced the issue to a missing database index. So far, excellent work: better root-cause analysis than I'd do in the moment.

Then it opened a PR that added the index. To the Terraform config. For the production database.

Not staging. Production. It bypassed the staging path entirely because the alert came from production Datadog, the slow query was on production, and the fix was "obviously" for production.

It didn't violate my policy file. My policy said can_modify_anything: false for production Kubernetes manifests. But the Terraform file for the database wasn't in the terraform/production/* path I'd forbidden; it was in terraform/modules/shared/database.tf, because the module is shared across environments.

The agent found the gap in my guardrails not through malice but through logic. The fix was for production. The file wasn't forbidden. Therefore: open the PR.

Impact: None. I caught it in review. But the PR was opened with a description that made it sound routine: "Add index to improve query performance on users table." If I'd been in a rush, if I'd trusted the pattern from the 30 other good PRs it had opened, I might have approved it.

And that's the real danger. Not the failures you catch. The near misses that train you to trust, until the one time the miss doesn't miss.
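The path gap is trivially reproducible with plain glob matching, which is why my takeaway is to flip the denylist into an allowlist: anything not explicitly marked safe waits for a human. fnmatch here stands in for whatever matcher your pipeline actually uses:

```python
from fnmatch import fnmatch

forbidden = ["terraform/production/*", "k8s/production/*"]
pr_file = "terraform/modules/shared/database.tf"

blocked = any(fnmatch(pr_file, pattern) for pattern in forbidden)
# blocked is False — the shared module slips straight past the denylist

# Inverting to an allowlist closes the gap: only explicitly safe paths
# are eligible for auto-merge; everything else waits for review
allowed_paths = ["k8s/staging/*", "docs/*"]
auto_merge_ok = any(fnmatch(pr_file, pattern) for pattern in allowed_paths)
# auto_merge_ok is False — a human has to look at it
```

A denylist encodes the failures you've already imagined. An allowlist encodes the safety you've actually verified.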

Day 25: I pulled the deploy keys.

Not because the agent was bad. Because I realized I was building a second infrastructure to constrain the first one. My agent_policy.yaml was 200 lines and growing. I was spending more time writing guardrails than the agent was saving me in toil.


The Final Metrics

Over 30 days, here's what the scoreboard looked like:

Total agent actions:              247
  Successful, no issues:          219 (88.7%)
  Minor issues, self-corrected:    14 (5.7%)
  Required human intervention:     11 (4.5%)
  Caused user-facing impact:        3 (1.2%)

Incidents opened:                  14
  Sev4 (no user impact):            8
  Sev3 (minor/internal):            3
  Sev2 (user-facing):               2
  Sev1 (major):                     1

Estimated time saved:           ~40 hours
Estimated time spent on cleanup: ~25 hours
Net time saved:                  ~15 hours
Time spent building guardrails:  ~30 hours

Net ROI after guardrail investment: -15 hours

An 88.7% success rate sounds great until you do the compound math. If the agent makes 10 changes a day, that 1.2% user-facing failure rate means a user-facing incident roughly every 8 days. My pre-agent rate was one Sev2 every six months.

Remember: a 95% reliable step chained 20 times gives you 36% end-to-end success. Infrastructure doesn't grade on a curve.
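The compound math fits in one line. The second figure below uses my observed 98.8% "no user-facing impact" rate per change to estimate the odds that a 10-change day lands clean — an assumption that treats changes as independent, which is generous:

```python
def chained_success(step_reliability: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return step_reliability ** steps

print(round(chained_success(0.95, 20), 2))   # 0.36 — the 36% quoted above
print(round(chained_success(0.988, 10), 2))  # 0.89 — odds a 10-change day has
                                             # zero user-facing impact
```

An 11% chance of a bad day, every day, compounds into exactly the incident cadence I saw.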


What I Actually Learned

1. AI agents are incredible at triage, dangerous at action

The analysis was consistently excellent: the root cause identification, the log correlation, the pattern matching, all at genuinely superhuman speed. Keep your agent in the loop. Just don't give it the keys.

My current setup: the agent monitors, analyzes, and drafts PRs. A human reviews and deploys. This alone saves me 20+ hours a month with zero incidents.

2. "Technically Correct" Is the Most Dangerous Kind of Correct

Every failure was syntactically valid. Every PR passed CI. Every YAML file was well-formed. The failures were all in the space between "correct" and "wise": the gap that exists only in the heads of engineers who've been burned before.

The failureThreshold: 1 probe config will haunt me. It's the perfect metaphor for AI-assisted infrastructure: the code is valid, the tests pass, and the system falls over at 3 AM because nobody told the model about that one time in 2019.

3. Guardrails become a second system to maintain

By day 25, my agent_policy.yaml was more complex than some of the infrastructure it was guarding. Every incident required a new rule, a new forbidden path, a new constraint. I was building a firewall around a junior engineer who never gets tired but also never learns.

Amazon is learning this at 335x scale. Their 90-day safety reset mandates two-person review, formal documentation, and stricter automated checks. Those are guardrails. And guardrails need maintenance, testing, and their own incident response.

4. The scariest failures are the near-misses that build trust

Incident 012: the production database PR wasn't a failure. I caught it. But it was preceded by 30 clean PRs that trained me to hit "approve" faster. The agent was conditioning me to trust it right up until the moment that trust would have been catastrophic.

This is the pattern I see in the Amazon story too. The AI tools worked well enough, often enough, that the process adapted around them. Then the edge case hit, and the blast radius was measured in millions of orders.

5. Your policy must live in code, not in wikis

If the agent can't read it, it doesn't exist. My 24-hour bake policy was in Confluence. My "don't deploy during load tests" rule was in a Slack channel. My "always set failureThreshold to at least 3" was in my head.

None of those are places an agent can see.

# Turn every implicit policy into an explicit check
# This is your REAL guardrail — not a policy YAML, but a gate.

class DeploymentPolicy:
    """If it's not in this class, the agent doesn't know about it."""

    MIN_STAGING_BAKE_HOURS = 24
    MIN_LIVENESS_FAILURE_THRESHOLD = 3
    MAX_REPLICAS_PER_SERVICE = 10
    MAX_MEMORY_INCREASE_PERCENT = 50
    FORBIDDEN_AUTO_MERGE_PATHS = [
        "terraform/**",
        "k8s/production/**",
    ]
    REQUIRE_HUMAN_APPROVAL = [
        "production",
        "database",
        "networking",
        "secrets",
    ]

    @staticmethod
    def validate_pr(pr_diff: dict) -> list[str]:
        """Returns list of violations. Empty = safe."""
        violations = []

        if pr_diff.get("liveness_failure_threshold", 999) < 3:
            violations.append(
                "BLOCKED: failureThreshold must be >= 3 "
                "(trust me on this one)"
            )

        if pr_diff.get("memory_increase_percent", 0) > 50:
            violations.append(
                "BLOCKED: memory increase > 50% requires human review "
                "(remember the eviction cascade of Day 11?)"
            )

        return violations
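Wired into CI, the gate has to fail the build, not just warn. A sketch of the hookup, with the validation compressed inline so the snippet stands alone (it mirrors DeploymentPolicy.validate_pr above):

```python
def validate_pr(pr_diff: dict) -> list:
    """Compressed version of DeploymentPolicy.validate_pr above."""
    violations = []
    if pr_diff.get("liveness_failure_threshold", 999) < 3:
        violations.append("BLOCKED: failureThreshold must be >= 3")
    if pr_diff.get("memory_increase_percent", 0) > 50:
        violations.append("BLOCKED: memory increase > 50% requires human review")
    return violations

# The agent's Day 17 probe config is caught before merge
violations = validate_pr({"liveness_failure_threshold": 1})
print(violations)
# In a real CI job you'd exit non-zero when violations is non-empty,
# which blocks auto-merge at the pipeline level.
```

The extraction step (pr_diff) is the hard part in practice; the policy itself stays boring on purpose.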

My Setup Now (Post-Experiment)

┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│  Monitoring  │────▶│   AI Agent   │────▶│  Draft PR   │
│  (Datadog)   │     │  (Analysis)  │     │  (No merge) │
└──────────────┘     └──────────────┘     └──────┬──────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │ Human Review │
                                          │ (That's me)  │
                                          └──────┬───────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │    Deploy    │
                                          │ (With gates) │
                                          └──────────────┘

The agent monitors. The agent analyzes. The agent suggests. A human decides.

It's not as fast as full autonomy. It's approximately 100% less likely to delete a production environment.


To the Teams Considering This

If you're thinking about giving an AI agent infrastructure access, I'm not going to tell you not to. I'm going to tell you to start where I ended up, not where I started.

Day 1 rules, not Day 25 rules:

  1. Read-only first. Let it monitor and analyze for two weeks before it touches anything.
  2. Staging only. Never, ever give it production write access. Not even "just for this one thing."
  3. Hard gates, not soft policies. If the gate isn't in code that literally blocks the deploy pipeline, it doesn't exist.
  4. Log everything. Every action, every decision, every near-miss. You need the data.
  5. Set a blast radius budget. My rule now: the agent can only affect one service at a time, and its changes can only be deployed to 10% canary first.
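Rule 5 as a sketch. The names are mine and a real implementation would hook into your deploy pipeline's state, but the shape of the budget is simple: one service per change, canary first, one change in flight at a time:

```python
from dataclasses import dataclass, field

@dataclass
class BlastRadiusBudget:
    """One service per change, 10% canary first, one change in flight."""
    max_services_per_change: int = 1
    canary_percent: int = 10
    in_flight: set = field(default_factory=set)

    def plan_rollout(self, services: list) -> dict:
        if len(services) > self.max_services_per_change:
            return {"allowed": False, "reason": "touches more than one service"}
        if self.in_flight:
            return {"allowed": False, "reason": "another change is still rolling out"}
        self.in_flight.update(services)
        return {
            "allowed": True,
            "stages": [f"{self.canary_percent}% canary", "full rollout after bake"],
        }

    def complete(self, service: str) -> None:
        self.in_flight.discard(service)

budget = BlastRadiusBudget()
budget.plan_rollout(["payment-service"])  # allowed, staged as 10% canary first
budget.plan_rollout(["api-gateway"])      # blocked until the first change lands
```

The budget doesn't make the agent smarter. It caps how much damage one wrong decision can do.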

Amazon learned these lessons with 6.3 million lost orders. I learned them with 22 minutes of downtime and a lot of lost sleep.

You can learn them from this post.
