Alejandro Velez for AWS Community Builders

Posted on Jul 1

The Autonomous Infrastructure Blueprint: Extending Cloud Success with AWS AI Agents, FinOps, and DevSecOps Automation

#ai #aws #devops #infrastructureascode

Level 300

The Infrastructure Maturity Problem

The industry-standard maturity model for Infrastructure Lifecycle Management (ILM) — popularized as the "Enterprise Blueprint for Cloud Success" — defines three stages of adoption across a Day 0 / Day 1 / Day 2+ operations model:

Stage	Focus	Day 0 (Build)	Day 1 (Deploy)	Day 2+ (Manage)
1. Adopting	Individual teams provisioning with IaC	Compose infrastructure code, scaffold with agent templates + Kiro IDE, collaborate via VCS	Provision through CLI/CI	Manual monitoring
2. Standardizing	Platform team emerges	Publish tested modules/constructs + golden images to registry	Enforce policy as code (OPA, Rego, BDD)	Patch management
3. Scaling	Self-service for the organization	No-code templates (IDP/Backstage)	Self-service provisioning with guardrails	Drift detection, auto-remediation, decommissioning

This model has guided thousands of organizations from manual provisioning to self-service platforms. However, in practice, most enterprise teams across the industry hit a ceiling at Stage 3. Operations remain reactive: security findings accumulate in dashboards, drift goes undetected until the next audit, cost anomalies surface weeks after the damage, and incident response depends on the one engineer who remembers how the VPC was configured.

The missing stage is Autonomous — where AI agents make infrastructure decisions, remediate findings without tickets, and prevent incidents before they manifest.

This post extends the blueprint with a Stage 4 reference architecture, combining AWS AI agent services (Continuum, DevOps Agent, and FinOps Agent) with the ThothCTL framework to create a closed-loop infrastructure lifecycle that goes beyond self-service into self-operating.

Architecture Overview

The extended blueprint adds a fourth column, Autonomous, to the maturity model, continuing the Day 0/1/2 pattern.

The AI Agents: Capabilities and Responsibilities

AWS Continuum (Security at Machine Speed)

AWS Continuum (announced June 2026) is a frontier AI-native security platform that works the full lifecycle of a vulnerability at machine speed — from discovery through prioritization, exploitability validation, and remediation — within guardrails you define.

Capability	Technical Detail	Trigger
Threat Modeling	Auto-generates STRIDE threat models from design documents or source code	Manual / pre-development
Code Scanning	PR-level and full repository scans against org-defined standards	GitHub webhook on PR
Penetration Testing	Multi-step exploitable attack chains (OWASP Top 10 + business logic) validated in isolated sandbox	On-demand / CI pipeline
Vulnerability Prioritization	Context graph of your environment + business to rank what actually matters	Continuous
Exploitability Validation	Builds reproducible proof in sandbox — confirms which findings are real	Post-scan
Auto-Remediation	Fast, reversible mitigations within guardrails → durable fixes through your review/deploy process	Post-validation
Blast Radius Visibility	Shows impact of each fix + rollback capability	Pre-remediation

What changed from "Security Agent": Continuum absorbs and extends the previous AWS Security Agent capabilities. Penetration testing and code scanning remain (now as "Continuum penetration testing" and "Continuum code scanning"), but the key addition is the closed-loop vulnerability lifecycle — Continuum doesn't just find; it prioritizes, proves exploitability, and remediates with rollback safety.

Key architectural decision: Continuum works at the application code layer (Python, Java, Node.js). For infrastructure code (Terraform, HCL), you need complementary IaC-specific scanners (Checkov, Trivy, KICS, OPA) that understand Terraform resource semantics. Continuum catches SQL injection in Lambda handlers but won't flag an S3 bucket missing encryption in HCL. You need both layers.
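
The two-layer split can be sketched as a simple routing step in a pipeline. This is a minimal illustration, assuming files are classified by extension; the function name and the layer labels are hypothetical and are not options of Continuum or of any scanner.

```shell
# Illustrative routing only: the labels "iac-scanner" and "app-code-scanner"
# are hypothetical names for the two layers described above.
scan_layer_for() {
  # $1 = path of a changed file
  case "$1" in
    *.tf|*.tfvars|*.hcl) echo "iac-scanner" ;;      # e.g. Checkov, Trivy, KICS, OPA
    *.py|*.java|*.js)    echo "app-code-scanner" ;; # e.g. Continuum code scanning
    *)                   echo "unclassified" ;;
  esac
}

# Example: classify the files touched by a change
for f in modules/s3/main.tf handlers/orders.py README.md; do
  printf '%s -> %s\n' "$f" "$(scan_layer_for "$f")"
done
```

A real pipeline would likely run both layers on every PR and use a mapping like this only to decide which findings belong to which tool.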

How it fits Stage 4: Continuum shifts security teams from manual triage to setting direction and approving outcomes. The human defines guardrails (approved libraries, encryption requirements, severity thresholds); Continuum operates autonomously within those boundaries. This maps directly to the "self-heal" column in our maturity model.

⚠️ Availability: AWS Continuum for code vulnerabilities is in gated preview (June 2026). Threat modeling is in preview. Works alongside GuardDuty and Security Hub.

AWS DevOps Agent

AWS DevOps Agent spans release management (preview) and production operations (GA).

Capability	Technical Detail	Integration
Release Readiness Review	Evaluates code against production requirements, dependency safety, user-defined standards	CI/CD pipeline
Autonomous Release Testing	Generates change-specific test plans, runs in sandbox environments	Pre-merge
Incident Investigation	Correlates telemetry + code + deployments → automated RCA	Webhook (PagerDuty, Datadog, Grafana, Splunk)
Proactive Prevention	Learns operational patterns, alerts before incidents materialize	Continuous
Custom AI Agents	User-defined prompts + tools + skills for recurring SRE tasks	On-demand

Integration methods (critical for architecture):

MCP (Model Context Protocol) — register external tool servers (Streamable HTTP + OAuth/SigV4)
Webhooks — HMAC or Bearer token from monitoring tools
Agent Client Protocol — programmatic invocation
CI/CD — GitHub/GitLab native integration
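
For the webhook option, the sender signs each request body with a shared secret and the receiver recomputes the signature. The sketch below shows the general HMAC technique only. It assumes HMAC-SHA256 with hex encoding, which is a common choice but is not confirmed here as the scheme DevOps Agent expects, and the function names are hypothetical.

```shell
# Assumption: HMAC-SHA256 over the raw body, hex-encoded. Check the webhook
# settings of your monitoring tool and of DevOps Agent for the real scheme.
sign_payload() {
  # $1 = shared secret, $2 = raw request body
  # awk keeps only the last field because the openssl prefix differs by version
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

signatures_match() {
  # $1 = expected signature, $2 = received signature
  # Note: a plain string comparison is not constant-time
  [ "$1" = "$2" ]
}
```

The signature has to be computed over the exact bytes that are sent, so the body should be signed after serialization rather than before.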

Key architectural insight: DevOps Agent supports registering custom MCP servers, so any tool exposing an MCP-compliant endpoint can be called during incident investigation or release validation. In our projects, we registered our internal ThothCTL MCP server so that DevOps Agent can query IaC drift status and module inventory during incident triage, which reduces context-switching for on-call engineers.

⚠️ Availability Note: AWS DevOps Agent Release Management is in preview (us-east-1, us-west-2 as of June 2026). Incident Investigation is GA in 10+ regions. Check AWS Regional Services for current status.

AWS FinOps Agent

AWS FinOps Agent (public preview, June 2026) is a frontier AI agent for cloud financial management that investigates cost anomalies, answers cost questions in natural language, and runs recurring FinOps workflows autonomously.

Capability	Technical Detail	Trigger
Cost Anomaly Investigation	Correlates cost spikes with CloudTrail events → root cause + responsible owner	Event-triggered (Cost Anomaly Detection)
Natural-Language Cost Queries	Engineers ask "Why did my cost go up?" → response with services, usage drivers	On-demand
Recurring Cost Reporting	Scheduled reports (daily/weekly/monthly) in HTML, PDF, or PPT	Cron schedule
Optimization Surfacing	Pulls from Cost Optimization Hub + Compute Optimizer → Jira tickets	Scheduled or on-demand
Memory & Context	Organization-specific context files (account-to-owner maps, tagging conventions)	Persistent across sessions

Integration methods:

Jira — opens tickets with investigation findings routed to the resource owner
Slack — posts anomaly summaries to team channels
AWS Management Console — web application with conversational UI

Key architectural insight: Unlike the manual FinOps pattern of reactive monthly reviews, the FinOps Agent runs continuously. We configure it to only investigate anomalies above a dollar threshold relevant to each team, reducing alert fatigue while ensuring high-impact changes are caught within hours, not weeks.
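
The per-team threshold idea can be sketched as a small filter. The team names, dollar values, and function names below are hypothetical, and this is not the FinOps Agent's configuration format; it only illustrates the decision rule.

```shell
# Hypothetical per-team thresholds in USD; replace with your own values.
threshold_for_team() {
  case "$1" in
    platform) echo 500 ;;
    data)     echo 2000 ;;
    *)        echo 100 ;;   # default for teams without an explicit entry
  esac
}

should_investigate() {
  # $1 = team, $2 = anomaly impact in USD (decimals allowed)
  local limit
  limit="$(threshold_for_team "$1")"
  # awk performs the numeric comparison because shell tests are integer-only
  awk -v impact="$2" -v limit="$limit" 'BEGIN { exit !(impact > limit) }'
}

# Example: the same 750.25 USD anomaly matters to one team and not the other
should_investigate platform 750.25 && echo "platform: investigate"
should_investigate data 750.25 || echo "data: below threshold, skip"
```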

⚠️ Availability: AWS FinOps Agent is in public preview in us-east-1 only (June 2026). It manages cost data across all commercial regions. Free during preview with monthly usage limits.

We integrate ThothCTL's cost analysis directly into our CI/CD pipelines as a pre-deploy budget gate. Unlike the FinOps Agent (which investigates anomalies post-deploy), this gate prevents over-budget deployments from reaching production. ThothCTL analyzes both Terraform plans and CloudFormation templates offline — no AWS credentials required for the estimate — making it safe to run in any pipeline stage:

# Pre-deployment cost estimation — supports Terraform plans AND CloudFormation templates
tofu plan -out=tfplan && tofu show -json tfplan > tfplan.json

# Check against budget using ThothCTL (auto-detects tfplan.json or CloudFormation .yaml/.json)
thothctl check iac -type cost-analysis --recursive
# Output: Monthly estimate $1,240 — exceeds team budget ($1,000)
# Action: Block deploy, notify FinOps team

# Also works with CloudFormation templates directly (no plan step needed)
# thothctl check iac -type cost-analysis -d ./cloudformation/

# Post-deployment: AWS FinOps Agent handles anomaly detection + investigation automatically
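
The gate decision itself reduces to comparing the estimate with the team budget. The sketch below shows that comparison as a pipeline step. It assumes the estimate has already been extracted as a number; how to read it from the cost analysis output is not shown because that format is not described here, and `budget_gate` is a hypothetical helper, not a ThothCTL command.

```shell
# Sketch only: the estimate is passed in as an argument. Extracting it from
# the cost analysis output is left out on purpose.
budget_gate() {
  # $1 = estimated monthly cost in USD, $2 = team budget in USD
  local estimate="$1" budget="$2"
  if awk -v e="$estimate" -v b="$budget" 'BEGIN { exit !(e > b) }'; then
    echo "BLOCK: monthly estimate \$${estimate} exceeds team budget \$${budget}"
    return 1   # non-zero status fails the pipeline stage
  fi
  echo "ALLOW: monthly estimate \$${estimate} is within team budget \$${budget}"
}
```

Calling `budget_gate 1240 1000` as its own pipeline step would fail the stage, while `budget_gate 800 1000` would let it continue.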

The Unified Decision Flow

Here is where the agents compose into a single automated pipeline.

We learned that no single agent owns the full lifecycle: Continuum doesn't understand IaC semantics, DevOps Agent doesn't know your module inventory, and FinOps Agent can't assess blast radius. The breakthrough comes from composing them through MCP, with ThothCTL's decision engine as the orchestration point that aggregates signals from all sources and produces a single, auditable decision per PR. The human sets the thresholds; the system operates within them at machine speed.
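
To make the "single decision per PR" idea concrete, here is a minimal sketch of signal aggregation. It is an illustration under stated assumptions and not ThothCTL's actual decision logic: the inputs, the rule (reject if any gate fails), and the function name are all hypothetical.

```shell
# Illustrative aggregation only; not ThothCTL's scoring algorithm.
decide_pr() {
  # $1 = critical IaC findings, $2 = critical app-code findings,
  # $3 = cost gate result (allow|block), $4 = max critical findings tolerated
  local iac="$1" app="$2" cost="$3" max_critical="$4"
  local reasons=""
  if [ "$iac" -gt "$max_critical" ]; then reasons="${reasons} iac-findings"; fi
  if [ "$app" -gt "$max_critical" ]; then reasons="${reasons} app-findings"; fi
  if [ "$cost" = "block" ]; then reasons="${reasons} over-budget"; fi

  if [ -n "$reasons" ]; then
    echo "reject:${reasons}"   # the reasons double as an audit trail
  else
    echo "approve"
  fi
}

decide_pr 2 0 block 0
```

Returning the reasons together with the verdict keeps the decision explainable, which matters once humans only review exceptions.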

MCP as the Integration Layer

The architectural key to composing multiple agents is MCP (Model Context Protocol). Both AWS DevOps Agent and IaC tooling can expose/consume MCP interfaces.
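
MCP messages are JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The sketch below only builds such a request body. The tool name `get_drift_status` and its argument are hypothetical and do not describe a documented ThothCTL tool; a real session also needs the initialization handshake and transport authentication, which are omitted.

```shell
# Builds a JSON-RPC 2.0 "tools/call" request body.
# Tool name and arguments are hypothetical examples.
mcp_tool_call() {
  # $1 = request id (integer), $2 = tool name, $3 = JSON object with arguments
  printf '{"jsonrpc":"2.0","id":%s,"method":"tools/call","params":{"name":"%s","arguments":%s}}\n' \
    "$1" "$2" "$3"
}

mcp_tool_call 1 "get_drift_status" '{"stack":"network-prod"}'
```

The helper does no escaping, so it is only suitable for fixed tool names and pre-validated JSON arguments.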

Responsibility Boundary Matrix

Clear boundaries prevent agent overlap and define escalation paths:

Lifecycle Phase	IaC Security Scanner	AWS Continuum	AWS DevOps Agent	AWS FinOps Agent
Design	—	STRIDE threat model (auto from docs/code)	—	—
IaC Development	Scan (Checkov, Trivy, KICS, OPA)	—	—	—
PR Review	AI-powered decision (approve/reject)	App code review + fix	Release readiness	—
CI/CD	Policy enforcement (terraform-compliance BDD)	Penetration testing + exploitability validation	Autonomous test gen	Pre-deploy cost gate
Deploy	—	—	Validation + canary	—
Production	Drift detection + remediation	Continuous vulnerability lifecycle (prioritize → validate → mitigate)	Incident response	Anomaly investigation + Jira routing
Optimization	Module version tracking	—	Proactive prevention	Rightsizing + recurring reports

Key insight: There is no overlap between IaC scanning and Continuum — they operate at different code layers. Continuum handles the full vulnerability lifecycle for application code; ThothCTL handles IaC-specific scanning, drift, and PR decisions. DevOps Agent is the post-deploy specialist. FinOps Agent is the continuous cost layer. MCP is the interoperability protocol that lets them call each other.

When Things Go Wrong: Agent Failure Modes

Autonomous doesn't mean unsupervised. At GFT, we designed explicit fallback paths for each agent failure mode:

Failure Mode	Impact	Fallback Strategy
Continuum timeout on large PR	PR stuck in review	15-min timeout → fall back to Checkov-only scan + human review
Continuum false-positive remediation	Unnecessary code change	Guardrail: require human approval for production; 24h rollback window
DevOps Agent false-positive incident	Unnecessary rollback	Confidence threshold (< 80% → alert only, don't act)
Cost estimation misses resources	Deploy exceeds budget	Hard budget action caps + weekly CUR reconciliation
AI decision engine hallucination	Incorrect approve/reject	`--dry-run` default for first 2 weeks; human audit of all rejections
MCP server unreachable	Agent loses IaC context	Circuit breaker pattern; cache last-known state for 1 hour
Drift detection false positive	Noise fatigue	`.driftignore` + severity filtering; only alert on critical/high
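
The confidence-threshold fallback from the table can be sketched as follows. This assumes the agent's confidence is available as a percentage; the function name and the way the value is obtained are hypothetical.

```shell
# Below the threshold the agent's finding is only surfaced as an alert.
incident_action() {
  # $1 = confidence as a percentage (0-100), $2 = threshold (defaults to 80)
  local confidence="$1" threshold="${2:-80}"
  if awk -v c="$confidence" -v t="$threshold" 'BEGIN { exit !(c < t) }'; then
    echo "alert-only"
  else
    echo "act"
  fi
}

incident_action 72     # below the default threshold
incident_action 93.5   # above the default threshold
```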

Key design principle: Every autonomous action should be auditable and reversible. The decision engine logs every score calculation, and the thothctl ai-review history command provides full traceability.
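
As a sketch of what an auditable decision trail could look like, the helper below appends one JSON line per decision. The record fields and the function name are assumptions for illustration and do not mirror how ThothCTL stores its review history.

```shell
# Hypothetical append-only audit log, one JSON object per line.
log_decision() {
  # $1 = log file, $2 = PR identifier, $3 = numeric score, $4 = decision,
  # $5 = UTC timestamp (optional, defaults to the current time)
  local ts="${5:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  printf '{"ts":"%s","pr":"%s","score":%s,"decision":"%s"}\n' \
    "$ts" "$2" "$3" "$4" >> "$1"
}
```

Appending rather than rewriting keeps earlier records intact, so a later reversal shows up as a new entry instead of an edited one.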

You can find the thothctl project and docs in:

thothforge / thothctl

A command line interface tool designed for efficient management and automation within your internal developer platform.

Thoth Framework

Thoth Framework is a framework to create and manage the Internal Developer Platform tasks for infrastructure, devops, devsecops, software developers, and platform engineering teams aligned with the business objectives:

Minimize mistakes
Increase velocity
Improve products
Enforce compliance
Reduce lock-in

Mapping Mechanisms

Business Objective	Mechanism	Implementation
Minimize mistakes	Meaningful defaults	Templates
Increase velocity	Automation	IaC Scripts
Improve products	Fill product gaps	New components
Enforce compliance	Restrict choices	Wrappers
Reduce lock-in	Abstraction	Service layers

Thoth allows you to extend and operate your Developer Control Plane, and to enable the developer experience with the internal developer platform through the command line.

Tools

ThothCTL

Package for accelerating the adoption of internal frameworks and enabling reuse of, and interaction with, the Internal Developer Platform.

Use cases

Template Engine:
- Build and configure any kind of template
- Handling templates to create, add, remove or update components
- Code generation
Automate tasks:
- Create and bootstrap local development environment
- Extend…


Conclusion

The infrastructure maturity journey doesn't end at self-service provisioning. Stage 4: Autonomous represents the convergence of three AWS frontier agents and open-source DevSecOps automation:

AWS Continuum — full vulnerability lifecycle at machine speed: threat modeling, code scanning, exploitability validation, and auto-remediation with rollback
AWS DevOps Agent — release validation, incident investigation, and proactive prevention
AWS FinOps Agent — cost anomaly investigation, natural-language cost queries, and recurring optimization workflows
ThothCTL — IaC scanning, AI-powered PR decisions, drift detection, and auto-fix generation

The glue is MCP — allowing these agents to call each other's capabilities during decision making. The result is an infrastructure platform that not only provisions resources but actively defends, optimizes, and heals itself.

✨ Alejandro Velez, Platform Engineering Latam Lead @ GFT | AWS Ambassador
