The Autonomous Infrastructure Blueprint: Extending Cloud Success with AWS AI Agents, FinOps, and DevSecOps Automation

Level 300

The Infrastructure Maturity Problem

The industry-standard maturity model for Infrastructure Lifecycle Management (ILM) — popularized as the "Enterprise Blueprint for Cloud Success" — defines three stages of adoption across a Day 0 / Day 1 / Day 2+ operations model:

| Stage | Focus | Day 0 (Build) | Day 1 (Deploy) | Day 2+ (Manage) |
|---|---|---|---|---|
| 1. Adopting | Individual teams provisioning with IaC | Compose infrastructure code, scaffold with agent templates + Kiro IDE, collaborate via VCS | Provision through CLI/CI | Manual monitoring |
| 2. Standardizing | Platform team emerges | Publish tested modules/constructs + golden images to registry | Enforce policy as code (OPA, Rego, BDD) | Patch management |
| 3. Scaling | Self-service for the organization | No-code templates (IDP/Backstage) | Self-service provisioning with guardrails | Drift detection, auto-remediation, decommissioning |

This model has guided thousands of organizations from manual provisioning to self-service platforms. However, in practice, most enterprise teams across the industry hit a ceiling at Stage 3. Operations remain reactive: security findings accumulate in dashboards, drift goes undetected until the next audit, cost anomalies surface weeks after the damage, and incident response depends on the one engineer who remembers how the VPC was configured.

The missing stage is Autonomous — where AI agents make infrastructure decisions, remediate findings without tickets, and prevent incidents before they manifest.

This post extends the blueprint with a Stage 4 reference architecture, combining AWS AI agent services (Continuum, DevOps Agent, and FinOps Agent) with the ThothCTL framework to create a closed-loop infrastructure lifecycle that goes beyond self-service into self-operating.

Architecture Overview

The extended blueprint adds a fourth column — Autonomous — to the maturity model, continuing the Day 0/1/2 pattern:

[Figure: Infrastructure Lifecycle Maturity Grid]

The AI Agents: Capabilities and Responsibilities

AWS Continuum (Security at Machine Speed)

AWS Continuum (announced June 2026) is a frontier AI-native security platform that works the full lifecycle of a vulnerability at machine speed — from discovery through prioritization, exploitability validation, and remediation — within guardrails you define.

| Capability | Technical Detail | Trigger |
|---|---|---|
| Threat Modeling | Auto-generates STRIDE threat models from design documents or source code | Manual / pre-development |
| Code Scanning | PR-level and full repository scans against org-defined standards | GitHub webhook on PR |
| Penetration Testing | Multi-step exploitable attack chains (OWASP Top 10 + business logic) validated in isolated sandbox | On-demand / CI pipeline |
| Vulnerability Prioritization | Context graph of your environment + business to rank what actually matters | Continuous |
| Exploitability Validation | Builds reproducible proof in sandbox to confirm which findings are real | Post-scan |
| Auto-Remediation | Fast, reversible mitigations within guardrails → durable fixes through your review/deploy process | Post-validation |
| Blast Radius Visibility | Shows impact of each fix + rollback capability | Pre-remediation |

What changed from "Security Agent": Continuum absorbs and extends the previous AWS Security Agent capabilities. Penetration testing and code scanning remain (now as "Continuum penetration testing" and "Continuum code scanning"), but the key addition is the closed-loop vulnerability lifecycle — Continuum doesn't just find; it prioritizes, proves exploitability, and remediates with rollback safety.

Key architectural decision: Continuum works at the application code layer (Python, Java, Node.js). For infrastructure code (Terraform, HCL), you need complementary IaC-specific scanners (Checkov, Trivy, KICS, OPA) that understand Terraform resource semantics. Continuum catches SQL injection in Lambda handlers but won't flag an S3 bucket missing encryption in HCL. You need both layers.
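The two-layer split described above can be sketched as a small routing step that groups the files changed in a PR by the scanner layer that should own them. This is a minimal illustration, not part of Continuum or ThothCTL: the function names and the extension sets are assumptions chosen for the example.

```python
from pathlib import PurePosixPath

# Hypothetical extension sets; adjust them to the languages your repositories use.
IAC_SUFFIXES = {".tf", ".hcl", ".tfvars"}
APP_SUFFIXES = {".py", ".java", ".js"}


def scanner_layer(path: str) -> str:
    """Return the scanning layer that should own a changed file."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in IAC_SUFFIXES:
        return "iac"          # IaC-specific scanners (Checkov, Trivy, KICS, OPA)
    if suffix in APP_SUFFIXES:
        return "application"  # application code layer (Continuum code scanning)
    return "unscanned"        # surfaced for human review rather than silently skipped


def route_changed_files(paths: list[str]) -> dict[str, list[str]]:
    """Group the files changed in a PR by the layer that should scan them."""
    routed: dict[str, list[str]] = {"iac": [], "application": [], "unscanned": []}
    for path in paths:
        routed[scanner_layer(path)].append(path)
    return routed
```

Keeping an explicit "unscanned" bucket makes gaps between the two layers visible instead of letting files fall through unnoticed.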

How it fits Stage 4: Continuum shifts security teams from manual triage to setting direction and approving outcomes. The human defines guardrails (approved libraries, encryption requirements, severity thresholds); Continuum operates autonomously within those boundaries. This maps directly to the "self-heal" column in our maturity model.

⚠️ Availability: AWS Continuum for code vulnerabilities is in gated preview (June 2026). Threat modeling is in preview. Works alongside GuardDuty and Security Hub.

AWS DevOps Agent

AWS DevOps Agent spans release management (preview) and production operations (GA).

| Capability | Technical Detail | Integration |
|---|---|---|
| Release Readiness Review | Evaluates code against production requirements, dependency safety, user-defined standards | CI/CD pipeline |
| Autonomous Release Testing | Generates change-specific test plans, runs in sandbox environments | Pre-merge |
| Incident Investigation | Correlates telemetry + code + deployments → automated RCA | Webhook (PagerDuty, Datadog, Grafana, Splunk) |
| Proactive Prevention | Learns operational patterns, alerts before incidents materialize | Continuous |
| Custom AI Agents | User-defined prompts + tools + skills for recurring SRE tasks | On-demand |

Integration methods (critical for architecture):

  • MCP (Model Context Protocol) — register external tool servers (Streamable HTTP + OAuth/SigV4)
  • Webhooks — HMAC or Bearer token from monitoring tools
  • Agent Client Protocol — programmatic invocation
  • CI/CD — GitHub/GitLab native integration
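For the HMAC webhook option in the list above, the general technique is to sign the raw request body with a shared secret and compare signatures in constant time on the receiving side. The sketch below shows that technique with Python's standard library only. The article does not state the exact header name, encoding, or hash that DevOps Agent expects, so HMAC-SHA256 with a hex digest is an assumption here, as are the function names.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body.

    Assumption: SHA-256 and hex encoding; check the receiver's documentation
    for the scheme it actually requires.
    """
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Verify a received signature using a constant-time comparison."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received_signature)
```

Signing the raw bytes, rather than a re-serialized JSON object, avoids mismatches caused by key ordering or whitespace differences between sender and receiver.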

Key architectural insight: DevOps Agent supports registering custom MCP servers, so any tool exposing an MCP-compliant endpoint can be called during incident investigation or release validation. In our projects, we registered our internal ThothCTL MCP server so the DevOps Agent can query IaC drift status and module inventory during incident triage, which reduces context switching for on-call engineers.

⚠️ Availability: AWS DevOps Agent Release Management is in preview (us-east-1, us-west-2 as of June 2026). Incident Investigation is GA in 10+ regions. Check AWS Regional Services for current status.

AWS FinOps Agent

AWS FinOps Agent (public preview, June 2026) is a frontier AI agent for cloud financial management that investigates cost anomalies, answers cost questions in natural language, and runs recurring FinOps workflows autonomously.

| Capability | Technical Detail | Trigger |
|---|---|---|
| Cost Anomaly Investigation | Correlates cost spikes with CloudTrail events → root cause + responsible owner | Event-triggered (Cost Anomaly Detection) |
| Natural-Language Cost Queries | Engineers ask "Why did my cost go up?" → response with services, usage drivers | On-demand |
| Recurring Cost Reporting | Scheduled reports (daily/weekly/monthly) in HTML, PDF, or PPT | Cron schedule |
| Optimization Surfacing | Pulls from Cost Optimization Hub + Compute Optimizer → Jira tickets | Scheduled or on-demand |
| Memory & Context | Organization-specific context files (account-to-owner maps, tagging conventions) | Persistent across sessions |

Integration methods:

  • Jira — opens tickets with investigation findings routed to the resource owner
  • Slack — posts anomaly summaries to team channels
  • AWS Management Console — web application with conversational UI

Key architectural insight: Unlike the manual FinOps pattern of reactive monthly reviews, the FinOps Agent runs continuously. We configure it to investigate only anomalies above a dollar threshold set per team, which reduces alert fatigue while ensuring high-impact changes are caught within hours, not weeks.
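The per-team threshold idea can be illustrated with a small filter. This is a sketch of the filtering logic only, not the FinOps Agent's configuration format: the `CostAnomaly` type, the team names, and the dollar values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostAnomaly:
    team: str
    service: str
    impact_usd: float  # estimated extra spend attributed to the anomaly


# Hypothetical per-team thresholds in USD; teams not listed use the default.
TEAM_THRESHOLDS_USD = {"platform": 500.0, "data": 2000.0}
DEFAULT_THRESHOLD_USD = 1000.0


def should_investigate(anomaly: CostAnomaly) -> bool:
    """Return True when the anomaly meets or exceeds its team's threshold."""
    threshold = TEAM_THRESHOLDS_USD.get(anomaly.team, DEFAULT_THRESHOLD_USD)
    return anomaly.impact_usd >= threshold


def anomalies_to_investigate(anomalies: list[CostAnomaly]) -> list[CostAnomaly]:
    """Keep only the anomalies worth an investigation, preserving input order."""
    return [a for a in anomalies if should_investigate(a)]
```

A threshold per team matters because the same dollar amount can be noise for a data platform and a significant spike for a small service team.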

⚠️ Availability: AWS FinOps Agent is in public preview in us-east-1 only (June 2026). It manages cost data across all commercial regions. Free during preview with monthly usage limits.

We integrate ThothCTL's cost analysis directly into our CI/CD pipelines as a pre-deploy budget gate. Unlike the FinOps Agent (which investigates anomalies post-deploy), this gate prevents over-budget deployments from reaching production. ThothCTL analyzes both Terraform plans and CloudFormation templates offline — no AWS credentials required for the estimate — making it safe to run in any pipeline stage:

# Pre-deployment cost estimation — supports Terraform plans AND CloudFormation templates
tofu plan -out=tfplan && tofu show -json tfplan > tfplan.json

# Check against budget using ThothCTL (auto-detects tfplan.json or CloudFormation .yaml/.json)
thothctl check iac -type cost-analysis --recursive
# Output: Monthly estimate $1,240 — exceeds team budget ($1,000)
# Action: Block deploy, notify FinOps team

# Also works with CloudFormation templates directly (no plan step needed)
# thothctl check iac -type cost-analysis -d ./cloudformation/

# Post-deployment: AWS FinOps Agent handles anomaly detection + investigation automatically
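The blocking behavior of a pre-deploy budget gate can be sketched as a comparison that yields a pipeline exit code. This is an illustration of the gate logic under stated assumptions, not ThothCTL's implementation: the function names are hypothetical, and the monthly estimate is assumed to have already been extracted from the cost analysis output.

```python
from decimal import Decimal


def budget_gate(monthly_estimate: Decimal, monthly_budget: Decimal) -> tuple[bool, str]:
    """Compare an estimated monthly cost with a team budget.

    Returns (allowed, message). Decimal avoids float rounding on currency values.
    """
    if monthly_estimate > monthly_budget:
        overage = monthly_estimate - monthly_budget
        return False, (
            f"BLOCK: estimate ${monthly_estimate:,.2f} exceeds "
            f"budget ${monthly_budget:,.2f} by ${overage:,.2f}"
        )
    return True, (
        f"PASS: estimate ${monthly_estimate:,.2f} is within "
        f"budget ${monthly_budget:,.2f}"
    )


def gate_exit_code(monthly_estimate: Decimal, monthly_budget: Decimal) -> int:
    """Map the gate result to a process exit code a CI step can act on."""
    allowed, message = budget_gate(monthly_estimate, monthly_budget)
    print(message)
    return 0 if allowed else 1
```

A non-zero exit code is what lets a CI system stop the deploy stage without any tool-specific integration.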

The Unified Decision Flow

Here is where the agents compose into a single automated pipeline:

We learned that no single agent owns the full lifecycle: Continuum doesn't understand IaC semantics, DevOps Agent doesn't know your module inventory, and FinOps Agent can't assess blast radius. The breakthrough comes from composing them through MCP, with ThothCTL's decision engine as the orchestration point that aggregates signals from all sources and produces a single, auditable decision per PR. The human sets the thresholds; the system operates within them at machine speed.
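To make "a single, auditable decision per PR" concrete, here is a minimal rule-based sketch of signal aggregation. It is not ThothCTL's decision engine (which the article describes as score-based); the `Signal` and `DecisionRecord` types, the source names, and the rules are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    NEEDS_HUMAN = "needs_human"


@dataclass(frozen=True)
class Signal:
    source: str     # hypothetical names, e.g. "iac-scan", "app-scan", "cost-gate"
    passed: bool
    blocking: bool  # whether a failure from this source may reject the PR by itself
    detail: str = ""


@dataclass(frozen=True)
class DecisionRecord:
    decision: Decision
    reasons: tuple[str, ...]  # kept with the decision so it can be audited later


def decide(signals: list[Signal], required_sources: set[str]) -> DecisionRecord:
    """Aggregate signals from several tools into one decision with its reasons."""
    missing = sorted(required_sources - {s.source for s in signals})
    if missing:
        # Never auto-approve when a required source did not report.
        return DecisionRecord(
            Decision.NEEDS_HUMAN, tuple(f"missing signal: {m}" for m in missing)
        )
    blocking = [s for s in signals if not s.passed and s.blocking]
    if blocking:
        return DecisionRecord(
            Decision.REJECT, tuple(f"{s.source}: {s.detail}" for s in blocking)
        )
    advisory = [s for s in signals if not s.passed]
    if advisory:
        return DecisionRecord(
            Decision.NEEDS_HUMAN, tuple(f"{s.source}: {s.detail}" for s in advisory)
        )
    return DecisionRecord(Decision.APPROVE, ("all required signals passed",))
```

Treating a missing signal as "needs human" rather than "approve" is the design choice that keeps an unreachable agent from silently weakening the gate.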

[Figure: The Unified Decision Flow]

MCP as the Integration Layer

The architectural key to composing multiple agents is MCP (Model Context Protocol). Both AWS DevOps Agent and IaC tooling can expose/consume MCP interfaces:

[Figure: MCP as the Integration Layer]
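At the wire level, an MCP client invokes a server tool with a JSON-RPC 2.0 request whose method is `tools/call`. The sketch below only builds such a request body; transport and authentication are out of scope. The tool name `get_drift_status` and its `stack` argument are hypothetical and are not taken from the ThothCTL MCP server.

```python
import json
from itertools import count

# Monotonically increasing JSON-RPC request ids for this process.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical usage: ask an IaC tool server for the drift status of one stack.
example_request = build_tool_call("get_drift_status", {"stack": "network"})
```

Because every tool is reached through the same request shape, an agent needs no tool-specific client code to call a newly registered server.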

Responsibility Boundary Matrix

Clear boundaries prevent agent overlap and define escalation paths:

| Lifecycle Phase | IaC Security Scanner | AWS Continuum | AWS DevOps Agent | AWS FinOps Agent |
|---|---|---|---|---|
| Design | | STRIDE threat model (auto from docs/code) | | |
| IaC Development | Scan (Checkov, Trivy, KICS, OPA) | | | |
| PR Review | AI-powered decision (approve/reject) | App code review + fix | Release readiness | |
| CI/CD | Policy enforcement (terraform-compliance BDD) | Penetration testing + exploitability validation | Autonomous test gen | Pre-deploy cost gate |
| Deploy | | | Validation + canary | |
| Production | Drift detection + remediation | Continuous vulnerability lifecycle (prioritize → validate → mitigate) | Incident response | Anomaly investigation + Jira routing |
| Optimization | Module version tracking | | Proactive prevention | Rightsizing + recurring reports |

Key insight: There is no overlap between IaC scanning and Continuum — they operate at different code layers. Continuum handles the full vulnerability lifecycle for application code; ThothCTL handles IaC-specific scanning, drift, and PR decisions. DevOps Agent is the post-deploy specialist. FinOps Agent is the continuous cost layer. MCP is the interoperability protocol that lets them call each other.

When Things Go Wrong: Agent Failure Modes

Autonomous doesn't mean unsupervised. At GFT, we designed explicit fallback paths for each agent failure mode:

| Failure Mode | Impact | Fallback Strategy |
|---|---|---|
| Continuum timeout on large PR | PR stuck in review | 15-min timeout → fall back to Checkov-only scan + human review |
| Continuum false-positive remediation | Unnecessary code change | Guardrail: require human approval for production; 24h rollback window |
| DevOps Agent false-positive incident | Unnecessary rollback | Confidence threshold (< 80% → alert only, don't act) |
| Cost estimation misses resources | Deploy exceeds budget | Hard budget action caps + weekly CUR reconciliation |
| AI decision engine hallucination | Incorrect approve/reject | --dry-run default for first 2 weeks; human audit of all rejections |
| MCP server unreachable | Agent loses IaC context | Circuit breaker pattern; cache last-known state for 1 hour |
| Drift detection false positive | Noise fatigue | .driftignore + severity filtering; only alert on critical/high |
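The "MCP server unreachable" fallback combines two ideas: a circuit breaker that stops calling a failing server, and a cache that serves the last known state for up to one hour. The sketch below shows one way to combine them. The class name, the default thresholds other than the one-hour cache lifetime, and the `fetch` callable are assumptions, not code from ThothCTL or DevOps Agent.

```python
import time
from typing import Any, Callable, Optional


class CachedCircuitBreaker:
    """Wrap a fetch call with a circuit breaker and a last-known-state cache."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        failure_threshold: int = 3,
        reset_after_s: float = 60.0,
        cache_ttl_s: float = 3600.0,  # "cache last-known state for 1 hour"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._cache_ttl_s = cache_ttl_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._cached: Any = None
        self._cached_at: Optional[float] = None

    def get(self) -> tuple[Any, str]:
        """Return (value, origin) where origin is "live" or "cached"."""
        now = self._clock()
        circuit_open = (
            self._opened_at is not None
            and now - self._opened_at < self._reset_after_s
        )
        if not circuit_open:
            try:
                value = self._fetch()
            except Exception:
                # Count the failure; open (or re-open) the circuit at the threshold.
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    self._opened_at = now
            else:
                self._failures = 0
                self._opened_at = None
                self._cached, self._cached_at = value, now
                return value, "live"
        if self._cached_at is not None and now - self._cached_at <= self._cache_ttl_s:
            return self._cached, "cached"
        raise RuntimeError("server unavailable and no fresh cached state")
```

Returning the origin alongside the value lets the caller tell a human when a decision was made on cached, possibly stale, IaC context.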

Key design principle: Every autonomous action should be auditable and reversible. The decision engine logs every score calculation, and the thothctl ai-review history command provides full traceability.

You can find the thothctl project and docs at:

thothforge / thothctl

A command line interface tool designed for efficient management and automation within your internal developer platform.


Thoth Framework

ThothCTL MCP

Thoth Framework helps you create and manage Internal Developer Platform tasks for infrastructure, DevOps, DevSecOps, software development, and platform engineering teams, aligned with these business objectives:

  1. Minimize mistakes
  2. Increase velocity
  3. Improve products
  4. Enforce compliance
  5. Reduce lock-in

Mapping Mechanisms

| Business Objective | Mechanism | Implementation |
|---|---|---|
| Minimize mistakes | Meaningful defaults | Templates |
| Increase velocity | Automation | IaC Scripts |
| Improve products | Fill product gaps | New components |
| Enforce compliance | Restrict choices | Wrappers |
| Reduce lock-in | Abstraction | Service layers |

Thoth allows you to extend and operate your Developer Control Plane and enables the developer experience of the internal developer platform through the command line.

Thoth and DCP

Tools

ThothCTL

A package for accelerating the adoption of internal frameworks and enabling reuse of, and interaction with, the Internal Developer Platform.

Use cases

  • Template Engine:

    • Build and configure any kind of template
    • Handling templates to create, add, remove or update components
    • Code generation
  • Automate tasks:

    • Create and bootstrap local development environment
    • Extend…




Conclusion

The infrastructure maturity journey doesn't end at self-service provisioning. Stage 4: Autonomous represents the convergence of three AWS frontier agents and open-source DevSecOps automation:

  1. AWS Continuum — full vulnerability lifecycle at machine speed: threat modeling, code scanning, exploitability validation, and auto-remediation with rollback
  2. AWS DevOps Agent — release validation, incident investigation, and proactive prevention
  3. AWS FinOps Agent — cost anomaly investigation, natural-language cost queries, and recurring optimization workflows
  4. ThothCTL — IaC scanning, AI-powered PR decisions, drift detection, and auto-fix generation

The glue is MCP — allowing these agents to call each other's capabilities during decision making. The result is an infrastructure platform that not only provisions resources but actively defends, optimizes, and heals itself.


Alejandro Velez, Platform Engineering Latam Lead @ GFT | AWS Ambassador

