Level 300
The Infrastructure Maturity Problem
The industry-standard maturity model for Infrastructure Lifecycle Management (ILM) — popularized as the "Enterprise Blueprint for Cloud Success" — defines three stages of adoption across a Day 0 / Day 1 / Day 2+ operations model:
| Stage | Focus | Day 0 (Build) | Day 1 (Deploy) | Day 2+ (Manage) |
|---|---|---|---|---|
| 1. Adopting | Individual teams provisioning with IaC | Compose infrastructure code, scaffold with agent templates + Kiro IDE, collaborate via VCS | Provision through CLI/CI | Manual monitoring |
| 2. Standardizing | Platform team emerges | Publish tested modules/constructs + golden images to registry | Enforce policy as code (OPA, Rego, BDD) | Patch management |
| 3. Scaling | Self-service for the organization | No-code templates (IDP/Backstage) | Self-service provisioning with guardrails | Drift detection, auto-remediation, decommissioning |
This model has guided thousands of organizations from manual provisioning to self-service platforms. However, in practice, most enterprise teams across the industry hit a ceiling at Stage 3. Operations remain reactive: security findings accumulate in dashboards, drift goes undetected until the next audit, cost anomalies surface weeks after the damage, and incident response depends on the one engineer who remembers how the VPC was configured.
The missing stage is Autonomous — where AI agents make infrastructure decisions, remediate findings without tickets, and prevent incidents before they manifest.
This post extends the blueprint with a Stage 4 reference architecture, combining AWS AI agent services (Continuum, DevOps Agent, and FinOps Agent) with the ThothCTL framework to create a closed-loop infrastructure lifecycle that goes beyond self-service into self-operating.
Architecture Overview
The extended blueprint adds a fourth column — Autonomous — to the maturity model, continuing the Day 0/1/2 pattern:
The AI Agents: Capabilities and Responsibilities
AWS Continuum (Security at Machine Speed)
AWS Continuum (announced June 2026) is a frontier AI-native security platform that works the full lifecycle of a vulnerability at machine speed — from discovery through prioritization, exploitability validation, and remediation — within guardrails you define.
| Capability | Technical Detail | Trigger |
|---|---|---|
| Threat Modeling | Auto-generates STRIDE threat models from design documents or source code | Manual / pre-development |
| Code Scanning | PR-level and full repository scans against org-defined standards | GitHub webhook on PR |
| Penetration Testing | Multi-step exploitable attack chains (OWASP Top 10 + business logic) validated in isolated sandbox | On-demand / CI pipeline |
| Vulnerability Prioritization | Context graph of your environment + business to rank what actually matters | Continuous |
| Exploitability Validation | Builds reproducible proof in sandbox — confirms which findings are real | Post-scan |
| Auto-Remediation | Fast, reversible mitigations within guardrails → durable fixes through your review/deploy process | Post-validation |
| Blast Radius Visibility | Shows impact of each fix + rollback capability | Pre-remediation |
What changed from "Security Agent": Continuum absorbs and extends the previous AWS Security Agent capabilities. Penetration testing and code scanning remain (now as "Continuum penetration testing" and "Continuum code scanning"), but the key addition is the closed-loop vulnerability lifecycle — Continuum doesn't just find; it prioritizes, proves exploitability, and remediates with rollback safety.
Key architectural decision: Continuum works at the application code layer (Python, Java, Node.js). For infrastructure code (Terraform, HCL), you need complementary IaC-specific scanners (Checkov, Trivy, KICS, OPA) that understand Terraform resource semantics. Continuum catches SQL injection in Lambda handlers but won't flag an S3 bucket missing encryption in HCL. You need both layers.
How it fits Stage 4: Continuum shifts security teams from manual triage to setting direction and approving outcomes. The human defines guardrails (approved libraries, encryption requirements, severity thresholds); Continuum operates autonomously within those boundaries. This maps directly to the "self-heal" column in our maturity model.
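To make the guardrail idea concrete, here is a minimal sketch of how a proposed remediation could be checked against human-defined boundaries. Everything in it is hypothetical and for illustration only: `Guardrails`, `ProposedFix`, `evaluate`, and the severity ranking are assumed names, not Continuum's configuration format, which is defined by the preview documentation.

```python
from dataclasses import dataclass

# Hypothetical severity ordering used only for this illustration.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass(frozen=True)
class Guardrails:
    """Boundaries set by humans (assumed shape, not a real Continuum schema)."""
    approved_libraries: frozenset
    min_severity_to_act: str = "high"
    approval_required_envs: frozenset = frozenset({"production"})


@dataclass(frozen=True)
class ProposedFix:
    severity: str
    environment: str
    new_dependencies: tuple = ()


def evaluate(fix: ProposedFix, rails: Guardrails) -> str:
    """Return 'reject', 'defer', 'needs_approval', or 'apply'."""
    # A fix that pulls in a library outside the approved list is never applied.
    if any(dep not in rails.approved_libraries for dep in fix.new_dependencies):
        return "reject"
    # Findings below the severity threshold are left for normal triage.
    if SEVERITY_RANK[fix.severity] < SEVERITY_RANK[rails.min_severity_to_act]:
        return "defer"
    # Sensitive environments always keep a human in the loop.
    if fix.environment in rails.approval_required_envs:
        return "needs_approval"
    return "apply"
```

The order of the checks is the design choice here: hard constraints come first, so a high-severity finding can never be used to justify an unapproved dependency.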
⚠️ Availability: AWS Continuum for code vulnerabilities is in gated preview (June 2026). Threat modeling is in preview. Works alongside GuardDuty and Security Hub.
AWS DevOps Agent
AWS DevOps Agent spans release management (preview) and production operations (GA).
| Capability | Technical Detail | Integration |
|---|---|---|
| Release Readiness Review | Evaluates code against production requirements, dependency safety, user-defined standards | CI/CD pipeline |
| Autonomous Release Testing | Generates change-specific test plans, runs in sandbox environments | Pre-merge |
| Incident Investigation | Correlates telemetry + code + deployments → automated RCA | Webhook (PagerDuty, Datadog, Grafana, Splunk) |
| Proactive Prevention | Learns operational patterns, alerts before incidents materialize | Continuous |
| Custom AI Agents | User-defined prompts + tools + skills for recurring SRE tasks | On-demand |
Integration methods (critical for architecture):
- MCP (Model Context Protocol) — register external tool servers (Streamable HTTP + OAuth/SigV4)
- Webhooks — HMAC or Bearer token from monitoring tools
- Agent Client Protocol — programmatic invocation
- CI/CD — GitHub/GitLab native integration
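For the webhook path, the receiving side has to prove that a payload really came from the monitoring tool. The sketch below shows generic HMAC verification with Python's standard library. It assumes HMAC-SHA256 over the raw request body with a hex-encoded signature; the exact algorithm, encoding, and header name the agent expects are assumptions here and should be taken from the service documentation.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check a received signature without leaking timing information."""
    expected = sign(secret, body)
    # compare_digest runs in constant time; encoding to bytes avoids a
    # TypeError if the received value contains non-ASCII characters.
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

Verifying against the raw bytes, before any JSON parsing, matters: re-serializing the payload can change whitespace or key order and invalidate an otherwise correct signature.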
Key architectural insight: DevOps Agent supports registering custom MCP servers. This means any tool exposing an MCP-compliant endpoint can be called during incident investigation or release validation. In our projects, we registered our internal ThothCTL MCP server so the DevOps Agent can query IaC drift status and module inventory during incident triage, reducing context-switching for on-call engineers.
⚠️ Availability: AWS DevOps Agent Release Management is in preview (us-east-1, us-west-2 as of June 2026). Incident Investigation is GA in 10+ regions. Check AWS Regional Services for current status.
AWS FinOps Agent
AWS FinOps Agent (public preview, June 2026) is a frontier AI agent for cloud financial management that investigates cost anomalies, answers cost questions in natural language, and runs recurring FinOps workflows autonomously.
| Capability | Technical Detail | Trigger |
|---|---|---|
| Cost Anomaly Investigation | Correlates cost spikes with CloudTrail events → root cause + responsible owner | Event-triggered (Cost Anomaly Detection) |
| Natural-Language Cost Queries | Engineers ask "Why did my cost go up?" → response with services, usage drivers | On-demand |
| Recurring Cost Reporting | Scheduled reports (daily/weekly/monthly) in HTML, PDF, or PPT | Cron schedule |
| Optimization Surfacing | Pulls from Cost Optimization Hub + Compute Optimizer → Jira tickets | Scheduled or on-demand |
| Memory & Context | Organization-specific context files (account-to-owner maps, tagging conventions) | Persistent across sessions |
Integration methods:
- Jira — opens tickets with investigation findings routed to the resource owner
- Slack — posts anomaly summaries to team channels
- AWS Management Console — web application with conversational UI
Key architectural insight: Unlike the manual FinOps pattern of reactive monthly reviews, the FinOps Agent runs continuously. We configure it to only investigate anomalies above a dollar threshold relevant to each team, reducing alert fatigue while ensuring high-impact changes are caught within hours, not weeks.
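The per-team threshold described above amounts to a small filter that combines an account-to-owner map with a dollar cutoff. The following is a minimal sketch of that logic only; the `Anomaly` shape, the account IDs, team names, and dollar values are all made up for illustration and do not reflect the FinOps Agent's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    account_id: str
    service: str
    impact_usd: float


# Hypothetical context: which team owns which account, and how large an
# anomaly must be (in USD) before that team wants an investigation.
ACCOUNT_OWNERS = {"111111111111": "payments", "222222222222": "data-platform"}
TEAM_THRESHOLDS_USD = {"payments": 200.0, "data-platform": 1000.0}
DEFAULT_THRESHOLD_USD = 500.0


def anomalies_to_investigate(anomalies):
    """Return (owner, anomaly) pairs whose impact meets the owner's threshold."""
    selected = []
    for anomaly in anomalies:
        owner = ACCOUNT_OWNERS.get(anomaly.account_id, "unassigned")
        threshold = TEAM_THRESHOLDS_USD.get(owner, DEFAULT_THRESHOLD_USD)
        if anomaly.impact_usd >= threshold:
            selected.append((owner, anomaly))
    return selected
```

Accounts without a known owner fall back to a default threshold and an `unassigned` owner, so a gap in the ownership map produces a routing problem rather than a silently dropped anomaly.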
⚠️ Availability: AWS FinOps Agent is in public preview in us-east-1 only (June 2026). It manages cost data across all commercial regions. Free during preview with monthly usage limits.
We integrate ThothCTL's cost analysis directly into our CI/CD pipelines as a pre-deploy budget gate. Unlike the FinOps Agent (which investigates anomalies post-deploy), this gate prevents over-budget deployments from reaching production. ThothCTL analyzes both Terraform plans and CloudFormation templates offline — no AWS credentials required for the estimate — making it safe to run in any pipeline stage:
```bash
# Pre-deployment cost estimation — supports Terraform plans AND CloudFormation templates
tofu plan -out=tfplan && tofu show -json tfplan > tfplan.json

# Check against budget using ThothCTL (auto-detects tfplan.json or CloudFormation .yaml/.json)
thothctl check iac -type cost-analysis --recursive
# Output: Monthly estimate $1,240 — exceeds team budget ($1,000)
# Action: Block deploy, notify FinOps team

# Also works with CloudFormation templates directly (no plan step needed)
# thothctl check iac -type cost-analysis -d ./cloudformation/

# Post-deployment: AWS FinOps Agent handles anomaly detection + investigation automatically
```
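To show what an offline budget gate does conceptually, here is a small sketch that reads the `resource_changes` array from a Terraform/OpenTofu JSON plan and compares a projected monthly cost against a budget. It is not ThothCTL's estimator: the price table contains placeholder numbers rather than real AWS pricing, and `estimate_monthly_delta` and `budget_gate` are hypothetical names.

```python
# Placeholder monthly prices in USD per resource type. These are NOT real
# AWS prices; a real estimator would use a maintained pricing source.
PLACEHOLDER_MONTHLY_USD = {
    "aws_instance": 70.0,
    "aws_db_instance": 250.0,
    "aws_nat_gateway": 35.0,
}


def estimate_monthly_delta(plan: dict) -> float:
    """Sum the placeholder cost of created resources minus deleted ones."""
    total = 0.0
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        price = PLACEHOLDER_MONTHLY_USD.get(change.get("type"), 0.0)
        if "create" in actions and "delete" not in actions:
            total += price      # brand-new resource
        elif actions == ["delete"]:
            total -= price      # resource going away
        # Replacements (delete + create), updates, and no-ops are treated
        # as cost-neutral in this simplified sketch.
    return total


def budget_gate(plan: dict, current_monthly_usd: float, budget_usd: float) -> dict:
    """Return the projected spend and whether the deploy may proceed."""
    projected = current_monthly_usd + estimate_monthly_delta(plan)
    return {
        "projected_usd": projected,
        "budget_usd": budget_usd,
        "allow": projected <= budget_usd,
    }
```

In a pipeline, the dictionary loaded from `tfplan.json` with `json.load` would be passed as `plan`, and the job would exit non-zero when `allow` is false. Because the gate only reads the plan file, it needs no cloud credentials.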
The Unified Decision Flow
Here is where the agents compose into a single automated pipeline:
We learned that no single agent owns the full lifecycle: Continuum doesn't understand IaC semantics, DevOps Agent doesn't know your module inventory, and FinOps Agent can't assess blast radius. The breakthrough comes from composing them through MCP, with ThothCTL's decision engine as the orchestration point that aggregates signals from all sources and produces a single, auditable decision per PR. The human sets the thresholds; the system operates within them at machine speed.
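One way to picture "aggregate signals, produce a single auditable decision" is a weighted score with hard blockers. The sketch below is an assumption about how such an engine could work, not ThothCTL's actual scoring: the `Signal` type, source names, weights, and approval threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    source: str             # which tool or agent produced the signal
    score: float            # 0.0 (worst) to 1.0 (best), normalized by the caller
    blocking: bool = False  # True when the source reports a hard failure


# Hypothetical weights and threshold; in practice humans would tune these.
WEIGHTS = {"iac-scan": 0.4, "app-security": 0.3, "release-readiness": 0.2, "cost": 0.1}
APPROVE_THRESHOLD = 0.75


def decide(signals) -> dict:
    """Combine per-source signals into one decision with its full inputs."""
    blockers = [s.source for s in signals if s.blocking]
    total_weight = sum(WEIGHTS.get(s.source, 0.0) for s in signals)
    weighted = sum(WEIGHTS.get(s.source, 0.0) * s.score for s in signals)
    score = weighted / total_weight if total_weight else 0.0

    if blockers:
        verdict = "reject"                # any hard failure wins
    elif score >= APPROVE_THRESHOLD:
        verdict = "approve"
    else:
        verdict = "needs_human_review"    # uncertain cases go to a person

    # Returning the inputs alongside the verdict keeps the decision auditable.
    return {
        "verdict": verdict,
        "score": round(score, 3),
        "blockers": blockers,
        "inputs": [(s.source, s.score) for s in signals],
    }
```

Normalizing by the weight of the signals that actually arrived means a missing source lowers coverage rather than silently dragging the score down; whether that is the right trade-off depends on how much you trust each source to always report.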
MCP as the Integration Layer
The architectural key to composing multiple agents is MCP (Model Context Protocol). Both AWS DevOps Agent and IaC tooling can expose/consume MCP interfaces:
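At the wire level, MCP is JSON-RPC 2.0, and invoking a tool is a `tools/call` request carrying the tool name and its arguments. The sketch below builds such a request and extracts text from a response. The tool name `get_drift_status` and its arguments are hypothetical; a real server advertises its tools through `tools/list`, and transport and authentication are out of scope here.

```python
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def tool_call_request(tool_name: str, arguments: dict) -> dict:
    """Build an MCP 'tools/call' request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def text_from_result(response: dict) -> str:
    """Return the concatenated text content of a tool result, or raise."""
    if "error" in response:
        # Protocol-level failure (unknown method, invalid params, ...).
        raise RuntimeError(response["error"].get("message", "unknown error"))
    result = response["result"]
    if result.get("isError"):
        # The tool itself ran but reported a failure.
        raise RuntimeError("tool reported an error")
    return "\n".join(
        item["text"] for item in result.get("content", []) if item.get("type") == "text"
    )
```

For example, `tool_call_request("get_drift_status", {"stack": "network"})` produces the message an agent would send to an IaC tool's MCP server during incident triage, assuming the server exposes a tool by that name.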
Responsibility Boundary Matrix
Clear boundaries prevent agent overlap and define escalation paths:
| Lifecycle Phase | IaC Security Scanner | AWS Continuum | AWS DevOps Agent | AWS FinOps Agent |
|---|---|---|---|---|
| Design | — | STRIDE threat model (auto from docs/code) | — | — |
| IaC Development | Scan (Checkov, Trivy, KICS, OPA) | — | — | — |
| PR Review | AI-powered decision (approve/reject) | App code review + fix | Release readiness | — |
| CI/CD | Policy enforcement (terraform-compliance BDD) | Penetration testing + exploitability validation | Autonomous test gen | Pre-deploy cost gate |
| Deploy | — | — | Validation + canary | — |
| Production | Drift detection + remediation | Continuous vulnerability lifecycle (prioritize → validate → mitigate) | Incident response | Anomaly investigation + Jira routing |
| Optimization | Module version tracking | — | Proactive prevention | Rightsizing + recurring reports |
Key insight: There is no overlap between IaC scanning and Continuum — they operate at different code layers. Continuum handles the full vulnerability lifecycle for application code; ThothCTL handles IaC-specific scanning, drift, and PR decisions. DevOps Agent is the post-deploy specialist. FinOps Agent is the continuous cost layer. MCP is the interoperability protocol that lets them call each other.
When Things Go Wrong: Agent Failure Modes
Autonomous doesn't mean unsupervised. At GFT, we designed explicit fallback paths for each agent failure mode:
| Failure Mode | Impact | Fallback Strategy |
|---|---|---|
| Continuum timeout on large PR | PR stuck in review | 15-min timeout → fall back to Checkov-only scan + human review |
| Continuum false-positive remediation | Unnecessary code change | Guardrail: require human approval for production; 24h rollback window |
| DevOps Agent false-positive incident | Unnecessary rollback | Confidence threshold (< 80% → alert only, don't act) |
| Cost estimation misses resources | Deploy exceeds budget | Hard budget action caps + weekly CUR reconciliation |
| AI decision engine hallucination | Incorrect approve/reject | --dry-run default for first 2 weeks; human audit of all rejections |
| MCP server unreachable | Agent loses IaC context | Circuit breaker pattern; cache last-known state for 1 hour |
| Drift detection false positive | Noise fatigue | .driftignore + severity filtering; only alert on critical/high |
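The "MCP server unreachable" fallback combines two ideas: a circuit breaker that stops hammering a failing endpoint, and a cache that serves the last-known state for a bounded time. Here is a minimal single-threaded sketch of that combination. The class name, the default thresholds, and the `fetch` callable are assumptions for illustration; the one-hour cache lifetime mirrors the table.

```python
import time


class CachedCircuitBreaker:
    """Wrap a fetch function with a circuit breaker and a last-known cache."""

    def __init__(self, fetch, failure_threshold=3, cooldown_s=30.0,
                 cache_ttl_s=3600.0, clock=time.monotonic):
        self._fetch = fetch
        self._failure_threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._cache_ttl_s = cache_ttl_s
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # time the circuit last opened, if any
        self._cached = None          # (value, stored_at)

    def get(self):
        """Return (value, 'live' | 'cached') or raise if nothing usable exists."""
        now = self._clock()
        circuit_open = (self._opened_at is not None
                        and now - self._opened_at < self._cooldown_s)
        if not circuit_open:
            try:
                value = self._fetch()
            except Exception:
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    self._opened_at = now   # open (or re-open) the circuit
            else:
                self._failures = 0
                self._opened_at = None
                self._cached = (value, now)
                return value, "live"
        # Either the circuit is open or the live call just failed.
        if self._cached is not None and now - self._cached[1] <= self._cache_ttl_s:
            return self._cached[0], "cached"
        raise RuntimeError("IaC context unavailable and cache is empty or stale")
```

Returning the `"live"` or `"cached"` marker lets the calling agent state in its findings that it reasoned over possibly stale drift data, which keeps the fallback honest rather than invisible.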
Key design principle: Every autonomous action should be auditable and reversible. The decision engine logs every score calculation, and the thothctl ai-review history command provides full traceability.
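A simple way to make autonomous actions auditable is to record every decision, including the ones the system declined to act on, in a tamper-evident log. The sketch below pairs the "below 80% confidence, alert only" rule from the table with a hash-chained audit entry. The hash chain, field names, and function names are assumptions added for illustration; they do not describe how ThothCTL stores its review history.

```python
import hashlib
import json

CONFIDENCE_TO_ACT = 0.80  # mirrors the "< 80% -> alert only" fallback


def gate_action(confidence: float, proposed_action: str) -> str:
    """Downgrade low-confidence actions to an alert."""
    return proposed_action if confidence >= CONFIDENCE_TO_ACT else "alert_only"


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def audit_record(previous_hash: str, agent: str, confidence: float,
                 proposed_action: str, rollback_ref: str) -> dict:
    """Build one log entry that is chained to the previous entry's hash."""
    entry = {
        "agent": agent,
        "confidence": confidence,
        "proposed_action": proposed_action,
        "executed_action": gate_action(confidence, proposed_action),
        "rollback_ref": rollback_ref,   # where to find the undo path
        "previous_hash": previous_hash,
    }
    entry["hash"] = _digest(entry)      # digest taken before 'hash' is added
    return entry


def chain_is_intact(entries) -> bool:
    """Verify that no entry was altered, removed, or reordered."""
    previous = ""
    for entry in entries:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["previous_hash"] != previous or entry["hash"] != _digest(body):
            return False
        previous = entry["hash"]
    return True
```

Logging both the proposed and the executed action is deliberate: the gap between them shows reviewers how often the confidence gate intervened, which is useful evidence when tuning the threshold.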
You can find the thothctl project and docs in:
thothforge/thothctl
A command line interface tool designed for efficient management and automation within your internal developer platform.
Thoth Framework
Thoth Framework is a framework for creating and managing Internal Developer Platform tasks for infrastructure, DevOps, DevSecOps, software development, and platform engineering teams, aligned with these business objectives:
- Minimize mistakes
- Increase velocity
- Improve products
- Enforce compliance
- Reduce lock-in
Mapping Mechanisms
| Business Objective | Mechanism | Implementation |
|---|---|---|
| Minimize mistakes | Meaningful defaults | Templates |
| Increase velocity | Automation | IaC scripts |
| Improve products | Fill product gaps | New components |
| Enforce compliance | Restrict choices | Wrappers |
| Reduce lock-in | Abstraction | Service layers |
Thoth allows you to extend and operate your Developer Control Plane, and it enables the developer experience with the internal developer platform through the command line.
Tools
ThothCTL
A package for accelerating the adoption of internal frameworks and enabling reuse of, and interaction with, the Internal Developer Platform.
Use cases
- Build and configure any kind of template
- Handle templates to create, add, remove, or update components
- Code generation
- Automate tasks:
  - Create and bootstrap local development environment
  - Extend…
Conclusion
The infrastructure maturity journey doesn't end at self-service provisioning. Stage 4: Autonomous represents the convergence of three AWS frontier agents and open-source DevSecOps automation:
- AWS Continuum — full vulnerability lifecycle at machine speed: threat modeling, code scanning, exploitability validation, and auto-remediation with rollback
- AWS DevOps Agent — release validation, incident investigation, and proactive prevention
- AWS FinOps Agent — cost anomaly investigation, natural-language cost queries, and recurring optimization workflows
- ThothCTL — IaC scanning, AI-powered PR decisions, drift detection, and auto-fix generation
The glue is MCP — allowing these agents to call each other's capabilities during decision making. The result is an infrastructure platform that not only provisions resources but actively defends, optimizes, and heals itself.
✨ Alejandro Velez, Platform Engineering Latam Lead @ GFT | AWS Ambassador
References
Enterprise Blueprint (Prior Art)
- Enterprise Blueprint for Cloud Success — Adopting, Standardizing, Scaling (Infographic)
- Terraform Recommended Practices — Evaluating Maturity Stages



