DEV Community

Pahud Hsieh

AI Can't Fix What It Can't See: How cdk diagnose Enables Autonomous CDK Remediation


Current Behavior vs. What We Want

Today, when a CDK deployment fails through a pipeline, the remediation loop looks like this:

Developer ──▶ Push code ──▶ Pipeline ──▶ CFN deploy ──▶ ❌ Fails
                                                          │
    ┌─────────────────────────────────────────────────────┘
    │
    ▼
Developer manually:
    1. Opens pipeline UI
    2. Finds the failed stage
    3. Navigates to CloudFormation console
    4. Locates the failed change set
    5. Reads the CFN error message
    6. Mentally translates CFN → CDK
    7. Edits code, pushes, waits for pipeline again

🤖 Developer: "AI, fix this deployment for me"
🤖 AI: "Sure! I'll fix the CloudFormation template for you."
🤖 Developer: "...that's not how CDK works."

The AI has no access to the error, no construct path, no source location.
The best it can do is guess — and it guesses wrong, offering to edit
CloudFormation YAML instead of your CDK source code. Totally useless.

So what do we actually need for AI to address the root cause from CDK's perspective?

We need a diagnosis report that's actionable for AI to fix CDK code — not CloudFormation templates. Specifically, the AI needs:

  1. What failed — which CloudFormation resource was rejected and why
  2. Where in CDK — the construct path (MyStack/MyFunction/LogGroup) that maps the CFN logical ID back to your construct tree
  3. Where in source code — the exact file and line number (lib/my-stack.ts:8:5) where the construct was created
  4. What to do — enough context to reason about the fix (set a feature flag? import the existing resource? change the removal policy?)

That's exactly what cdk diagnose provides.
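Concretely, a machine-readable version of that report might look like the following TypeScript shape. This is a sketch for illustration only — the field names are mine, not the actual `cdk diagnose` schema:

```typescript
// Hypothetical shape of the diagnosis an AI agent needs.
// Field names are illustrative, not the real `cdk diagnose` schema.
interface ResourceDiagnosis {
  resourceType: string;   // what failed, e.g. "AWS::Logs::LogGroup"
  errorMessage: string;   // the raw CloudFormation error
  constructPath: string;  // where in CDK, e.g. "MyStack/MyFunction/LogGroup"
  sourceLocation: string; // where in source, e.g. "lib/my-stack.ts:8:5"
}

// Example populated with the four pieces of context listed above:
const diagnosis: ResourceDiagnosis = {
  resourceType: 'AWS::Logs::LogGroup',
  errorMessage:
    "Resource of type 'AWS::Logs::LogGroup' with identifier '/aws/lambda/MyFunction' already exists.",
  constructPath: 'MyStack/MyFunction/LogGroup',
  sourceLocation: 'lib/my-stack.ts:8:5',
};
```

Each field maps one-to-one to the four requirements: what failed, where in CDK, where in source, and enough context to reason about the fix.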

With cdk diagnose, the expected behavior becomes:

Developer ──▶ Push code ──▶ Pipeline ──▶ CFN deploy ──▶ ❌ Fails
                                                          │
    ┌─────────────────────────────────────────────────────┘
    │
    ▼
Developer (or AI agent):
    1. Runs: cdk diagnose MyStack
    2. Gets: construct path + error + source location
    3. Fixes the code
    4. Redeploys ✅

🤖 AI agent can do steps 1–4 autonomously.
   The entire loop is CLI-driven, machine-readable,
   and can be designed as an autonomous agent that
   diagnoses, fixes, and redeploys — without human intervention.

The difference: cdk diagnose turns a manual, console-bound, human-only workflow into a single command that both humans and AI agents can use. This is what makes AI-assisted remediation possible.

The Problem in Detail

Here's a scenario every CDK developer has lived through.

You write a perfectly valid CDK application. You run cdk synth — clean output, valid CloudFormation template, no errors. You push your code, the pipeline picks it up, and then... somewhere in the deployment, CloudFormation rejects it.

Now what?

If you deployed with cdk deploy, you're fine — the CLI catches the error, enriches it with your CDK construct path, and even points you to the source location in your code. But most teams don't deploy that way. They push to a pipeline — CDK Pipelines, CodePipeline, or an internal CI/CD system — and CDK only runs synth. The actual deployment happens through CloudFormation APIs directly.

When that deployment fails, the error is buried. You open the pipeline UI. Click through to the failed stage. Find the CloudFormation stack. Federate to the console. Navigate to the change set. And finally, you see something like:

Resource of type 'AWS::S3::Bucket' with identifier
'my-app-bucket' already exists.

Three to five clicks deep, in CloudFormation's language, with no connection back to your CDK code.

This is the gap that cdk diagnose fills.

┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐
│  Developer   │────▶│  cdk synth   │────▶│  Pipeline / CI/CD    │
│  writes CDK  │     │  ✅ Looks     │     │  deploys to CFN      │
│              │     │     great!    │     │                      │
└──────────────┘     └──────────────┘     └──────────┬───────────┘
                                                     │
                                                     ▼
                                          ┌──────────────────────┐
                                          │   CloudFormation     │
                                          │   ❌ Deploy fails     │
                                          └──────────┬───────────┘
                                                     │
                          ┌──────────────────────────┴──────────────────────────┐
                          │                                                     │
                          ▼                                                     ▼
               ┌─────────────────────┐                            ┌──────────────────────┐
               │  WITHOUT diagnose   │                            │  WITH cdk diagnose   │
               │                     │                            │                      │
               │  1. Open Pipeline   │                            │  $ cdk diagnose      │
               │  2. Find stage      │                            │                      │
               │  3. Find stack      │                            │  ❌ MyFunction        │
               │  4. Open console    │                            │    🛑 LogGroup        │
               │  5. Find changeset  │                            │       already exists  │
               │  6. Read CFN error  │                            │    📍 stack.ts:8:5   │
               │  7. Translate to    │                            │  One command.         │
               │     CDK manually    │                            │  Source location.     │
               │                     │                            │  AI can act on this.  │
               │  🤖 AI can't help   │                            │  🤖 AI fixes code ✅  │
               └─────────────────────┘                            └──────────────────────┘

What is cdk diagnose?

cdk diagnose is a new CDK CLI subcommand that inspects a CloudFormation stack's last failed deployment and surfaces the root cause with CDK-aware context — construct paths, source locations, and actionable fix suggestions.

cdk --unstable=diagnose diagnose MyStack

It queries CloudFormation directly via DescribeChangeSet and related APIs, then enriches the raw error using CDK metadata (aws:cdk:path) to map CloudFormation logical IDs back to your constructs and source code.

The key insight: it works regardless of how the stack was deployed. Pipeline, cdk deploy, manual CloudFormation API call — doesn't matter. If the stack exists and it failed, cdk diagnose can tell you why.
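The enrichment hinges on the aws:cdk:path metadata entry that CDK writes into each synthesized resource (when path metadata is enabled). Here's a minimal sketch of that lookup — a simplified illustration of the idea, not the CLI's actual code:

```typescript
// Sketch of the logical-ID → construct-path mapping that cdk diagnose
// performs. Simplified illustration only, not the real implementation.
type CfnTemplate = {
  Resources: Record<
    string,
    { Type: string; Metadata?: { 'aws:cdk:path'?: string } }
  >;
};

function constructPathFor(
  template: CfnTemplate,
  logicalId: string,
): string | undefined {
  return template.Resources[logicalId]?.Metadata?.['aws:cdk:path'];
}

// A fragment of a synthesized template (logical ID suffix is made up):
const template: CfnTemplate = {
  Resources: {
    MyFunctionLogGroupA1B2C3: {
      Type: 'AWS::Logs::LogGroup',
      Metadata: { 'aws:cdk:path': 'MyAppStack/MyFunction/LogGroup/Resource' },
    },
  },
};
```

Given the logical ID from a failed CloudFormation event, this lookup recovers the construct path, and the construct path in turn points back to the source location recorded at synth time.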

A Real Example: The CDK Upgrade That Breaks Everything

Let me walk through a scenario that hit hundreds of real CDK users as a P0 issue (aws-cdk#34612). It's the kind of failure that makes cdk diagnose invaluable — because the developer did nothing wrong.

The Setup

You have a Lambda function that's been running in production for months:

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

export class MyAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new lambda.Function(this, 'MyFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      logRetention: cdk.aws_logs.RetentionDays.ONE_WEEK,
    });
  }
}

Standard code. Deployed and working. No issues.

Then you upgrade aws-cdk-lib from 2.199.0 to 2.200.0 — a routine version bump. You change nothing else in your code.

Synth: Looks Perfect

$ cdk synth

No errors. The template looks fine. You push to your pipeline.

Deploy: Fails

CREATE_FAILED | AWS::Logs::LogGroup | MyFunctionLogGroup
Resource of type 'AWS::Logs::LogGroup' with identifier
'/aws/lambda/MyFunction' already exists.

Wait — what? You didn't add a log group. You didn't change anything. Why is CloudFormation suddenly trying to create one?

What Happened

CDK 2.200.0 introduced a new feature flag @aws-cdk/aws-lambda:useCdkManagedLogGroup that defaults to true. This causes CDK to add an explicit AWS::Logs::LogGroup resource to your template for every Lambda function. The intent is good — CDK wants to manage the log group lifecycle so it can set retention policies and clean up on deletion.

But here's the catch: when your Lambda function first ran, AWS Lambda automatically created a log group named /aws/lambda/MyFunction. That log group already exists. Now CDK's template tries to create the same log group, and CloudFormation rejects it.

The developer did nothing wrong. The synth output looks correct. The failure only happens at deploy time because it depends on the runtime state of the AWS account.

Diagnose: Root Cause + Source Location

$ cdk --unstable=diagnose diagnose MyAppStack

🔍 Synthesizing with debug information. This may take a bit longer.
❌ Stack MyAppStack:
 └─ MyAppStack
     └─ MyFunction
         └─ LogGroup  (AWS::Logs::LogGroup MyFunctionLogGroupXXXXXXXX)
            🛑 Resource of type 'AWS::Logs::LogGroup' with identifier
               '/aws/lambda/MyFunction' already exists.
            Source Location: new MyAppStack (lib/my-app-stack.ts:8:5)

Now an AI agent has everything it needs:

  • What failed: AWS::Logs::LogGroup for /aws/lambda/MyFunction already exists
  • Where in CDK: MyAppStack/MyFunction/LogGroup — it's the log group associated with the Lambda construct
  • Where in source: lib/my-app-stack.ts:8:5 — the new lambda.Function(...) call
  • Context to reason about the fix: this is a known issue with the useCdkManagedLogGroup feature flag in CDK 2.200+
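An agent can lift those fields straight out of the report. As a sketch, here's a regex extraction of the source location from output like the block above — the pattern assumes the exact format shown in this post:

```typescript
// Extract file:line:column from a "Source Location:" line in the
// diagnose output. The regex assumes the format shown in this post.
function parseSourceLocation(
  output: string,
): { file: string; line: number; column: number } | undefined {
  const m = output.match(/Source Location: .*\((.+):(\d+):(\d+)\)/);
  return m
    ? { file: m[1], line: Number(m[2]), column: Number(m[3]) }
    : undefined;
}

const report =
  'Source Location: new MyAppStack (lib/my-app-stack.ts:8:5)';
const loc = parseSourceLocation(report);
```

With the file, line, and column in hand, the agent knows exactly which `new lambda.Function(...)` call to inspect.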

The AI-Assisted Fix

An AI agent reading this diagnosis can reason through the fix:

  1. The log group already exists because Lambda auto-created it
  2. CDK 2.200+ now tries to explicitly manage it
  3. Fix option A: Set the feature flag to false to restore previous behavior:
   // cdk.json
   { "context": { "@aws-cdk/aws-lambda:useCdkManagedLogGroup": false } }
  4. Fix option B: Import the existing log group so CDK can manage it going forward

This isn't a trivial "remove the hardcoded name" fix. It requires understanding CDK feature flags, Lambda log group lifecycle, and the tradeoffs between the two fix options. That's exactly the kind of reasoning AI agents excel at — when they have the right input.

Another Example: The Named Resource Trap

The Lambda log group issue is subtle — it only surfaces during a CDK upgrade. But there's an even more common class of failures that hits teams every day: named resources that already exist (aws-cdk#16686, aws-cdk#6183).

// inside a Stack's constructor
new s3.Bucket(this, 'DataBucket', {
  bucketName: 'my-team-data-bucket',
});

Perfectly valid CDK. Synth passes.

(screenshot: cdk synth output)

But if that bucket already exists in the account — from a previous stack that was torn down, from another team, or from a manual aws s3 mb — CloudFormation rejects it:

Resource of type 'AWS::S3::Bucket' with identifier
'my-team-data-bucket' already exists.

Same pattern with IAM roles, SQS queues, or any resource with a hardcoded physical name. CDK can't catch this at synth time because it's a runtime check against the actual state of your AWS account.

And here's what you see in the CloudFormation console — not helpful at all:

(screenshot: CloudFormation console error)

With cdk diagnose:

$ cdk --unstable=diagnose diagnose MyAppStack

❌ Stack MyAppStack:
 └─ MyAppStack
     └─ DataBucket
         └─ Resource  (AWS::S3::Bucket DataBucketXXXXXXXX)
            🛑 Resource of type 'AWS::S3::Bucket' with identifier
               'my-team-data-bucket' already exists.
            Source Location: new MyAppStack (lib/my-app-stack.ts:6:5)

An AI agent sees this and can reason: "The bucket name conflicts with an existing resource. I should either remove the hardcoded bucketName to let CloudFormation generate a unique name, or import the existing bucket with cdk import."

(screenshot: cdk diagnose output)

Simple to understand, but impossible for AI to act on without cdk diagnose surfacing the error in the first place.
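That reasoning step can itself be sketched as a tiny heuristic — a toy mapping from the error text to candidate fixes. The categories are illustrative; a real agent would weigh them against the full construct path and source code:

```typescript
// Toy heuristic: map a diagnosis to candidate fixes. Illustrative only;
// a real agent reasons over the full construct path and source code.
interface Diagnosis {
  resourceType: string;
  errorMessage: string;
}

function candidateFixes(d: Diagnosis): string[] {
  if (/already exists/.test(d.errorMessage)) {
    return [
      'remove the hardcoded physical name so CloudFormation generates one',
      `adopt the existing ${d.resourceType} with cdk import`,
    ];
  }
  return ['escalate to a human'];
}

const fixes = candidateFixes({
  resourceType: 'AWS::S3::Bucket',
  errorMessage:
    "Resource of type 'AWS::S3::Bucket' with identifier 'my-team-data-bucket' already exists.",
});
```

The point isn't the pattern match — it's that once the error is surfaced as structured text on the CLI, building this kind of decision logic becomes possible at all.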

Why This Matters for AI

Here's where it gets interesting.

Without cdk diagnose, an AI agent has no way to help you fix a deployment failure. The error is locked behind a multi-step console navigation that requires browser interaction, AWS console federation, and human eyeballs. There's no CLI command, no API, no machine-readable output for the agent to consume.

With cdk diagnose, the entire remediation loop becomes automatable:

Pipeline fails
    │
    ▼
AI runs: cdk diagnose MyStack
    │
    ▼
AI reads structured output:
  "LogGroup '/aws/lambda/MyFunction' already exists
   at MyAppStack/MyFunction/LogGroup
   source: lib/my-app-stack.ts:8:5"
    │
    ▼
AI reasons: "CDK 2.200+ feature flag issue.
  Fix: set @aws-cdk/aws-lambda:useCdkManagedLogGroup to false"
    │
    ▼
AI edits cdk.json, redeploys
    │
    ▼
✅ Deployment succeeds

This is AI-assisted remediation. The AI agent can:

  1. Diagnose — run cdk diagnose to get the structured error with construct path and source location
  2. Reason — understand this is a CDK version upgrade issue involving feature flags and Lambda log group lifecycle
  3. Fix — edit cdk.json to set the feature flag, or import the existing log group
  4. Verify — redeploy and confirm the fix works

This isn't a simple text substitution. The AI needs to understand CDK concepts, feature flags, and AWS service behavior to pick the right fix. But it can only do that reasoning if it has the structured diagnosis as input. cdk diagnose is that input.
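The four steps above form a loop that any orchestrator can drive. Here's a skeleton with each step injected as a function — the stubs are mine; a real agent would shell out to `cdk` and edit files:

```typescript
// Skeleton of the diagnose → reason → fix → verify loop. Each step is
// injected so only the control flow is shown; a real agent would shell
// out to the CDK CLI and edit source files in these steps.
async function remediate(steps: {
  diagnose: () => Promise<string>;       // e.g. run `cdk diagnose MyStack`
  reason: (report: string) => string;    // LLM call in a real agent
  applyFix: (fix: string) => Promise<void>; // edit cdk.json / source
  verify: () => Promise<boolean>;        // e.g. run `cdk deploy`
}): Promise<boolean> {
  const report = await steps.diagnose();
  const fix = steps.reason(report);
  await steps.applyFix(fix);
  return steps.verify();
}
```

Everything interesting happens inside `reason` — but the loop only closes because `diagnose` returns machine-readable text instead of requiring console clicks.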

Putting It Together: Kiro CLI as the Autonomous Remediation Agent

So we have cdk diagnose producing structured, machine-readable error output. But who runs it? Who reads the output, reasons about the fix, edits the code, and redeploys?

This is where Kiro CLI comes in. Kiro CLI's chat subcommand supports a headless mode — set the KIRO_API_KEY environment variable and use --no-interactive, and Kiro runs programmatically without a browser or interactive session. Same tools, same agents, same capabilities — but fully automated.

# After a failed pipeline deployment, just run:
KIRO_API_KEY=your-api-key \
kiro-cli chat --no-interactive --trust-all-tools \
  "My CDK deployment of MyAppStack failed. \
   Run cdk diagnose to find the root cause and fix it."

The KIRO_API_KEY lets Kiro authenticate without a browser — essential for CI/CD pipelines and automated workflows. The --no-interactive flag executes the task and exits. The --trust-all-tools flag lets the agent run shell commands (like cdk diagnose and cdk deploy) without pausing for approval.

Here's what happens under the hood:

┌─────────────────────────────────────────────────────────┐
│  kiro chat (headless agent)                             │
│                                                         │
│  1. Runs: cdk --unstable=diagnose diagnose MyAppStack   │
│                                                         │
│  2. Reads output:                                       │
│     "LogGroup '/aws/lambda/MyFunction' already exists"  │
│     "Source: lib/my-app-stack.ts:8:5"                   │
│                                                         │
│  3. Reads lib/my-app-stack.ts to understand context     │
│                                                         │
│  4. Reasons: "CDK 2.200+ feature flag issue.            │
│     The log group was auto-created by Lambda.            │
│     Fix: set useCdkManagedLogGroup to false"            │
│                                                         │
│  5. Edits cdk.json:                                     │
│     + "@aws-cdk/aws-lambda:useCdkManagedLogGroup": false│
│                                                         │
│  6. Runs: cdk deploy                                    │
│                                                         │
│  7. ✅ Deployment succeeds                               │
└─────────────────────────────────────────────────────────┘

The key: Kiro CLI operates entirely in the terminal. No browser, no console, no clicking. It can run in a CI/CD pipeline's post-failure hook, in an SSH session, or on a developer's laptop. Combined with cdk diagnose, it closes the full loop from failure detection to automated fix.

This is what the autonomous remediation workflow looks like end to end:

Pipeline deploys ──▶ CFN fails
                        │
                        ▼
              Post-failure hook triggers:
              KIRO_API_KEY=$SECRET \
              kiro-cli chat --no-interactive --trust-all-tools \
                "diagnose and fix MyAppStack"
                        │
                        ▼
              Kiro CLI (headless):
                cdk diagnose → read error → edit code → cdk deploy
                        │
                        ▼
                   ✅ Fixed and redeployed
                   (or: opens PR with the fix for human review)

No human in the loop for the diagnosis and fix. A human reviews the PR if you want that gate — but the heavy lifting is done.
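As one concrete (hypothetical) wiring, the post-failure hook could be a CI job that fires only when the deploy job fails — sketched here as a GitHub Actions job. The job and secret names are placeholders; the kiro-cli flags are the ones shown above:

```yaml
# Hypothetical GitHub Actions job: runs only when the deploy job failed.
# Job/secret names are placeholders; kiro-cli flags are as shown above.
auto-remediate:
  needs: deploy
  if: ${{ failure() }}
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Diagnose and fix with Kiro CLI
      env:
        KIRO_API_KEY: ${{ secrets.KIRO_API_KEY }}
      run: |
        kiro-cli chat --no-interactive --trust-all-tools \
          "My CDK stack MyAppStack failed to deploy. \
           Run cdk diagnose, find the root cause, and open a PR with the fix."
```

Asking the agent to open a PR rather than push directly keeps a human review gate in the loop.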

Here's what it looks like in practice — Kiro CLI diagnosing, fixing, and redeploying a failed stack in under 2 minutes:

$ kiro-cli chat --no-interactive --trust-all-tools \
    "My CDK stack CdkDiagnoseStack failed to deploy. \
     Run cdk diagnose to find the root cause and fix it."

> Let me start by running cdk diagnose to understand what went wrong.

🔍 Synthesizing with debug information. This may take a bit longer.
❌ Stack CdkDiagnoseStack:
Early validation failed for change set cdk-deploy-change-set:
 └─ CdkDiagnoseStack
     └─ MyBucket
         └─ Resource  (AWS::S3::Bucket MyBucketF68F3FF0)
            🛑 Resource of type 'AWS::S3::Bucket' with identifier
               'cdk-diagnose-demo-903779448426-us-east-1' already exists.
            Source Location:
              new CdkDiagnoseStack (lib/cdk-diagnose-stack.ts:12:5)

> The issue is clear: the S3 bucket has a hardcoded name that already
  exists. The fix is to remove the hardcoded name and let CloudFormation
  generate a unique one.

Updating: lib/cdk-diagnose-stack.ts
  12, 12:     new s3.Bucket(this, 'MyBucket', {
- 13    :       bucketName: 'cdk-diagnose-demo-903779448426-us-east-1',
  14, 13:       removalPolicy: cdk.RemovalPolicy.DESTROY,

> Now let me destroy the failed stack and redeploy:

 ✅  CdkDiagnoseStack: destroyed

CdkDiagnoseStack: deploying... [1/1]
CdkDiagnoseStack | 7/7 | CREATE_COMPLETE | AWS::CloudFormation::Stack

 ✅  CdkDiagnoseStack

✨  Deployment time: 36.28s

> ✅ Done. Here's what happened:
  1. cdk diagnose identified the root cause: hardcoded bucket name
     already existed, causing "resource already exists" error.
  2. Fix: Removed the hardcoded bucketName property.
  3. Redeployed successfully.

 ▸ Credits: 1.74 • Time: 1m 38s

Try It

cdk diagnose is available now in the latest CDK CLI (v2.1120.0+) behind the unstable flag:

npx cdk --unstable=diagnose diagnose <stack-name>

The --unstable flag signals that the command's interface may still change between releases, but the feature is available and working today.

To try the full autonomous remediation flow with Kiro CLI:

# 1. Install Kiro CLI: https://kiro.dev/downloads/
# 2. Generate an API key at https://app.kiro.dev (account settings)
# 3. After a failed deployment:
KIRO_API_KEY=your-api-key \
kiro-cli chat --no-interactive --trust-all-tools \
  "My CDK stack MyAppStack failed to deploy. \
   Run cdk diagnose, find the root cause, and fix the code."

If you want a hands-on demo, here's a minimal CDK app that will fail on deploy:

// lib/my-app-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

export class MyAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new s3.Bucket(this, 'MyBucket', {
      bucketName: 'cdk-diagnose-demo-bucket',
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
    });
  }
}

Then run:

# 1. Pre-create the bucket so it conflicts
aws s3api create-bucket --bucket cdk-diagnose-demo-bucket

# 2. Deploy — this will fail
npx cdk deploy --require-approval never

# 3. Diagnose
npx cdk --unstable=diagnose diagnose MyAppStack

# 4. Let Kiro fix it
kiro-cli chat --no-interactive --trust-all-tools \
  "My CDK stack MyAppStack failed to deploy. \
   Run cdk diagnose, find the root cause, and fix the code."

# 5. Cleanup
aws s3 rb s3://cdk-diagnose-demo-bucket --force
aws cloudformation delete-stack --stack-name MyAppStack

The gap between "deployment failed" and "here's what to fix" just got a lot smaller. With cdk diagnose and Kiro CLI, it can be fully automated.


cdk diagnose was implemented by Rico Huijbers with contributions from Momo Kornher on the AWS CDK team. The feature landed in aws-cdk-cli#1378.
