Elijah Chimera

Posted on May 5

Agentic DevOps: Automating Security Remediation on AWS Using AWS DevOps Agent.

Project Overview

The evolution of cloud operations has reached a critical inflection point characterized by the transition from reactive infrastructure management to autonomous, self-healing systems. As distributed computing architectures scale in complexity, the cognitive load placed on Site Reliability Engineering (SRE) and cybersecurity teams has dramatically exceeded the capabilities of manual telemetry correlation, traditional heuristic-based alerting, and static posture management. The general availability of the AWS DevOps Agent as of March 31, 2026, marks the maturation of "frontier agents" which are autonomous artificial intelligence systems capable of end-to-end incident ownership, sophisticated root cause analysis (RCA), and proactive remediation across hybrid, multi-cloud, and on-premises environments.

This comprehensive technical report details the architectural paradigms, security implications, and operational workflows introduced by the AWS Dev Ops Agent. It specifically addresses the integration of the Model Context Protocol (MCP), an open-source standard that enables the agentic reasoning engine to securely interface with localized telemetry data sources, such as Linux-based development environments which in my case I will be running Fedora.

To demonstrate the practical application of these technologies, this report presents an exhaustive, step-by-step walk-through focused on architecting a self-healing security environment. The implementation involves constructing a continuous integration and continuous deployment (CI/CD) pipeline for a web application using Infrastructure as Code (IaC) via Terraform.

The infrastructure is intentionally deployed with specific security misconfigurations such as overly permissive Identity and Access Management (IAM) roles and publicly exposed Amazon Simple Storage Service (S3) buckets. By simulating sophisticated cyber incidents, including brute force attacks and unauthorized configuration drift, the analysis illustrates how the AWS DevOps Agent bridges the gap between pen-testing methodologies and autonomous Dev Ops remediation.

The evidence establishes that integrating topology-aware frontier agents reduces mean time to resolution (MTTR) by up to 75%, shifting the organizational posture from reactive firefighting to proactive, AI-driven resilience.

Key Concepts

To establish a foundational understanding of the interconnected technologies governing autonomous cloud operations in this architecture, the following critical concepts must be defined:

Architectural Framework of Autonomous Operations

The structural design of the AWS Dev Ops Agent is predicated on a dual-console model, which strictly segregates administrative governance from operational execution. The Administration Console, housed within the AWS Management Console, facilitates the configuration of integrations, permissions, and security perimeters by platform administrators. Conversely, the Operations Web App provides a dedicated workspace for SRE and cybersecurity teams to interact with the agent, review investigation journals, and approve mitigation plans via natural language interfaces.

Agent Spaces and Multi-Layered Security Boundaries

Central to the architecture of the AWS Dev Ops Agent is the Agent Space. An Agent Space functions as an isolated logical container that determines the agent's operational boundary. The configuration of an Agent Space requires a precise balance; an overly restrictive configuration deprives the agent of the telemetry required to synthesize an accurate root cause analysis, whereas an excessively broad space increases computational overhead and introduces unnecessary security exposure.

The security of the AWS Dev Ops Agent is maintained through a multi-layered defense-in-depth strategy. Even if broader permissions are granted to the agent's underlying IAM role, the agent enforces internal access controls to limit the scope of its actions. The agent relies on regional processing capabilities, meaning it retrieves operational data from AWS regions across all AWS accounts granted access within the configured Agent Space. While the underlying Amazon Bedrock models automatically select optimal regions for inference processing to maximize compute availability, the architectural design guarantees that customer data remains stored only in the region where the Agent Space was instantiated, ensuring data residency compliance.

The Investigation Journal and Immutable Audit Trails

When an incident is triggered whether via an automated web hook from an Amazon CloudWatch alarm, a ServiceNow ticket, or manual invocation the agent initiates a systematic diagnostic workflow. The core operational artifact produced during this process is the Investigation Journal.

The Investigation Journal provides a transparent, real-time, and immutable chronicle of the agent's reasoning process. It records the initial hypothesis generation, documenting every telemetry query, metric evaluation, and log parsing action executed. Because these logs are integrated with AWS CloudTrail, they serve as a cryptographically verifiable audit trail. This transparency is critical for cybersecurity compliance, allowing human operators to retrospectively analyze exactly why the AI deemed a specific IAM role overly permissive or why it concluded that a database connection failure was the result of a Transit Gateway routing anomaly.

The Model Context Protocol (MCP): Bridging Hybrid and Local Environments

While the AWS Dev Ops Agent possesses native integration with centralized AWS services, enterprise environments operate complex hybrid topologies incorporating on-premises data centers, private code repositories, localized developer workstations, and third-party observability platforms. To extend the agent's topological awareness into these disparate domains, the architecture leverages the Model Context Protocol (MCP).

MCP resolves the fragility of bespoke API engineering by implementing a standardized client-server architecture. The AI application acts as the MCP Client, requesting access to external data, while the MCP Server exposes specific tools, resources, and capabilities.

Transport Mechanisms and Linux Implementations

The protocol operates over two primary transport mechanisms: stdio (Standard Input/Output) and Streamable HTTP.

The stdio mechanism is optimized for localized execution. For example, a security analyst or system administrator operating a Linux workstation can deploy the linux-mcp-serverlocally. The MCP server runs as a subprocess, inheriting the granular file system permissions of the host operator. This architecture enables the cloud-based AWS DevOps Agent to query the local Linux machine's system state, review journalctl logs, analyze running processes, and diagnose performance bottlenecks directly from the local environment. By providing the agent direct access to system information, manual extraction and pasting of terminal output into a web interface are eliminated, transforming the AI into an active participant in the troubleshooting process.

For remote or enterprise-grade deployments, MCP utilizes Streamable HTTP with Server-Sent Events (SSE) to facilitate real-time streaming. This mechanism is critical for connecting the AWS DevOps Agent to persistent infrastructure tools such as self-hosted GitLab instances, private Terraform registries, or containerized observability stacks running on Amazon Elastic Container Service (ECS) with AWS Fargate.

Private Connections and Amazon VPC Lattice

A critical security consideration when bridging the AWS DevOps Agent with internal enterprise tools is network isolation. Exposing internal databases or localized MCP servers to the public internet violates fundamental zero-trust principles. To resolve this, the architecture utilizes Amazon VPC Lattice to establish secure "Private Connections".

VPC Lattice operates as an application networking layer that establishes a secure, encrypted transit path between the Agent Space and the target resource residing in an isolated Virtual Private Cloud (VPC). The connection process involves deploying service-managed Elastic Network Interfaces (ENIs) directly into the specified subnets. Traffic from the DevOps Agent is routed through these ENIs which act as a resource gateway ensuring that requests to private MCP servers, self-hosted Grafana dashboards, or internal documentation APIs never traverse the public internet. The protocol strictly enforces HTTPS communication with a minimum of TLS 1.2, guaranteeing data-in-transit encryption.

Walkthrough: Deploying the Self-Healing Security Architecture

To effectively demonstrate the capabilities of the AWS DevOps Agent in mitigating vulnerabilities and configuration drift, this section details the deployment of a controlled laboratory environment. The objective is to construct a CI/CD pipeline for a web application utilizing Terraform, deliberately introducing security misconfigurations—specifically, an overly permissive IAM role and an exposed S3 bucket. Subsequently, the environment is monitored by the AWS DevOps Agent, which is configured to detect, investigate, and autonomously remediate simulated cyber incidents.

1. Local Environment Preparation on Fedora Linux

The initialization phase requires configuring the local developer workstation running Linux. This involves installing the requisite MCP server to allow the AWS DevOps Agent to read local development configurations and interact with the local Terraform state during the investigation phase.First, the system dependencies are updated and installed via the package manager it could be apt opr dnf depending on your distribution:


shell
Bash
# Update local Fedora repositories and install Node.js
sudo dnf update -y
sudo dnf install nodejs npm 
brew install awscli 
brew tap hashicorp/tap
brew install hashicorp/tap/terraform



# Verify installations
node --version
terraform --version
aws --version

Next, the linux-mcp-server is deployed. This utility grants the AI agent context regarding the local operating system environment, which is highly beneficial when analyzing hybrid configurations or diagnosing connectivity issues originating from the developer's workstation.Execute this directly in your terminal:

Bash
# Execute the Linux MCP server via NPX for local standard I/O transport
curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install linux-mcp-server
linux-mcp-server

Simultaneously, the HashiCorp Terraform MCP server must be configured to provide the AWS DevOps Agent with real-time access to the public Terraform Registry, allowing the agent to reference accurate, up-to-date documentation when generating remediation code. The server is initialized locally using the Streamable HTTP mode to facilitate remote connections from the cloud-based Agent Space.Open a seperate terminal window and run:


Bash
# Start the Terraform MCP server on port 8080
brew install terraform-mcp-server
terraform-mcp-server streamable-http \
  --transport-port 8080 \
  --transport-host 0.0.0.0 \
  --log-level info \
  --toolsets terraform,registry

2.Infrastructure as Code Configuration

The foundational infrastructure is codified using Terraform. This ensures that the deployment is reproducible and that subsequent remediations generated by the AWS DevOps Agent can be seamlessly integrated into the version control system. The following configuration defines the core networking, the deliberately vulnerable S3 bucket, and the highly permissive IAM role.

The implementation utilizes the hashicorp/aws provider for standard resources and the hashicorp/awscc provider to interact with the AWS Cloud Control API, which is required to provision the awscc_devopsagent_agent_spaceresource. Create a file named main.tf and paste the following code:


terraform {
  required_providers {
    aws   = { source = "hashicorp/aws", version = "~> 5.0" }
    awscc = { source = "hashicorp/awscc", version = "~> 1.0" }
  }
}

provider "aws" {
  region = "us-east-1"
}

provider "awscc" {
  region = "us-east-1"
}

data "aws_caller_identity" "current" {}

# -------------------------------------------------------------------------
# 1. IAM ROLE FOR EC2 (CloudWatch Logs Shipping)
# -------------------------------------------------------------------------
resource "aws_iam_role" "ec2_logs_role" {
  name = "EC2LogShippingRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "cw_agent_policy" {
  role       = aws_iam_role.ec2_logs_role.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

resource "aws_iam_instance_profile" "ec2_profile" {
  name = "EC2LogShippingProfile"
  role = aws_iam_role.ec2_logs_role.name
}

# -------------------------------------------------------------------------
# 2. SECURITY GROUP & EC2 INSTANCES
# -------------------------------------------------------------------------
resource "aws_security_group" "lab_sg" {
  name = "free-tier-lab-sg"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "vulnerable_target" {
  ami                    = "ami-0c101f26f147fa7fd"
  instance_type          = "t3.micro"
  iam_instance_profile   = aws_iam_instance_profile.ec2_profile.name
  vpc_security_group_ids = [aws_security_group.lab_sg.id]

  user_data = <<-EOF
              #!/bin/bash
              echo "root:password123" | chpasswd
              sed -i 's/^#PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
              sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
              systemctl restart sshd

              dnf install -y amazon-cloudwatch-agent
              cat <<ETX > /opt/aws/amazon-cloudwatch-agent/bin/config.json
              {
                "logs": {
                  "logs_collected": {
                    "files": {
                      "collect_list": [
                        {
                          "file_path": "/var/log/secure",
                          "log_group_name": "ssh-auth-logs",
                          "log_stream_name": "{instance_id}"
                        }
                      ]
                    }
                  }
                }
              }
              ETX
              /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
              EOF

  tags = { Name = "Victim-Server" }
}

resource "aws_instance" "attacker_node" {
  ami                    = "ami-0c101f26f147fa7fd"
  instance_type          = "t3.micro"
  vpc_security_group_ids = [aws_security_group.lab_sg.id]

  user_data = <<-EOF
              #!/bin/bash
              dnf install -y hydra
              echo -e "password123\n123456\nadmin\nroot" > /home/ec2-user/common-passwords.txt
              EOF

  tags = { Name = "Attacker-Hydra" }
}

# -------------------------------------------------------------------------
# 3. CLOUDWATCH LOG GROUP
# -------------------------------------------------------------------------
resource "aws_cloudwatch_log_group" "ssh_logs" {
  name              = "ssh-auth-logs"
  retention_in_days = 1
}

# -------------------------------------------------------------------------
# 4. LAMBDA DISPATCHER
# -------------------------------------------------------------------------
resource "aws_iam_role" "lambda_role" {
  name = "LambdaDispatcherRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_lambda_function" "incident_dispatcher" {
  filename         = "lambda_function.zip"
  function_name    = "DevOpsAgent-Dispatcher"
  role             = aws_iam_role.lambda_role.arn
  handler          = "index.handler"
  runtime          = "python3.11"
  source_code_hash = filebase64sha256("lambda_function.zip")

  environment {
    variables = {
      AGENT_SPACE_ID = awscc_devopsagent_agent_space.security_lab_space.id
    }
  }
}

resource "aws_lambda_permission" "allow_cloudwatch" {
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.incident_dispatcher.function_name
  principal     = "logs.us-east-1.amazonaws.com"
  source_arn    = "${aws_cloudwatch_log_group.ssh_logs.arn}:*"
}

resource "aws_cloudwatch_log_subscription_filter" "brute_force_filter" {
  name            = "detect-failed-login"
  log_group_name  = aws_cloudwatch_log_group.ssh_logs.name
  filter_pattern  = "Failed password"
  destination_arn = aws_lambda_function.incident_dispatcher.arn
}

# -------------------------------------------------------------------------
# 5. INTENTIONAL VULNERABILITIES (S3 + Overly Permissive Role)
# -------------------------------------------------------------------------
resource "aws_s3_bucket" "app_data_bucket" {
  bucket = "vulnerable-app-data-lab-2026"
}

resource "aws_s3_bucket_public_access_block" "app_data_public_access" {
  bucket                  = aws_s3_bucket.app_data_bucket.id
  block_public_acls       = false
  block_public_policy     = false
  ignore_public_acls      = false
  restrict_public_buckets = false
}

resource "aws_s3_bucket_policy" "allow_public_read" {
  bucket = aws_s3_bucket.app_data_bucket.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "PublicRead"
        Effect    = "Allow"
        Principal = "*"
        Action    = "s3:GetObject"
        Resource  = "${aws_s3_bucket.app_data_bucket.arn}/*"
      }
    ]
  })
}

resource "aws_iam_role" "app_execution_role" {
  name = "AppExecutionRole-OverlyPermissive"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "admin_access_attachment" {
  role       = aws_iam_role.app_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}

# -------------------------------------------------------------------------
# 6. DEVOPS AGENT IAM ROLE 
# -------------------------------------------------------------------------
resource "aws_iam_role" "devops_agent_role" {
  name = "DevOpsAgentMonitoringRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "aidevops.amazonaws.com"
        }
        Action = "sts:AssumeRole"
        Condition = {
          StringEquals = {
            "aws:SourceAccount" = data.aws_caller_identity.current.account_id
          }
          ArnLike = {
            "aws:SourceArn" = "arn:aws:aidevops:us-east-1:${data.aws_caller_identity.current.account_id}:agentspace/*"
          }
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "devops_agent_policy" {
  role       = aws_iam_role.devops_agent_role.name
  policy_arn = "arn:aws:iam::aws:policy/AIDevOpsAgentAccessPolicy"
}

resource "aws_iam_role_policy" "devops_agent_inline" {
  name = "DevOpsAgentExtraPermissions"
  role = aws_iam_role.devops_agent_role.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "iam:CreateServiceLinkedRole"
        ]
        Resource = "arn:aws:iam::*:role/aws-service-role/resource-explorer-2.amazonaws.com/AWSServiceRoleForResourceExplorer"
        Condition = {
          StringEquals = {
            "iam:AWSServiceName" = "resource-explorer-2.amazonaws.com"
          }
        }
      }
    ]
  })
}

# -------------------------------------------------------------------------
# 7. AWS DEVOPS AGENT CONFIGURATION
# -------------------------------------------------------------------------
resource "awscc_devopsagent_agent_space" "security_lab_space" {
  name        = "self-healing-security-lab"
  description = "Agent Space dedicated to monitoring and remediating the vulnerable web application pipeline."
}

resource "awscc_devopsagent_association" "account_association" {
  agent_space_id = awscc_devopsagent_agent_space.security_lab_space.id
  service_id     = "aws"

  configuration = {
    aws = {
      assumable_role_arn = aws_iam_role.devops_agent_role.arn
      account_id         = data.aws_caller_identity.current.account_id
      account_type       = "monitor"
      resources          = []
    }
  }

  depends_on = [
    awscc_devopsagent_agent_space.security_lab_space,
    aws_iam_role.devops_agent_role,
    aws_iam_role_policy_attachment.devops_agent_policy,
    aws_iam_role_policy.devops_agent_inline
  ]
}

3. The Dispatcher Logic file index.py

Because CloudWatch sends its logs in a compressed format, the Lambda function requires a small Python script to "unpack" the data before transmitting it to the AWS DevOps Agent. Create a file named index.py, compress it into lambda_function.zip, and place it in the same directory as your Terraform file:


import json
import gzip
import base64
import os

def handler(event, context):
    # 1. Decode and decompress the CloudWatch log data
    cw_data = event['awslogs']['data']
    compressed_payload = base64.b64decode(cw_data)
    uncompressed_payload = gzip.decompress(compressed_payload)
    payload = json.loads(uncompressed_payload)

    print(f"Brute force detected in logs! Events: {len(payload['logEvents'])}")

    # 2. Logic to notify DevOps Agent
    # In a lab, we log the intent. In prod, you'd POST to the Agent endpoint.
    return {
        "status": "incident_reported",
        "agent_space": os.environ,
        "source_ip": "Extracted from log line"
    }

The infrastructure is then deployed to the AWS environment using standard Terraform execution commands:


Bash
terraform init
terraform plan -out=tfplan
terraform apply "tfplan"

Upon successful execution, the intentionally vulnerable S3 bucket, the permissive IAM role, and the AWS DevOps Agent Space are active within the specified AWS region.

Simulating Security Incidents and Configuration Drift

With the foundational architecture established, the next phase involves simulating cybersecurity incidents that trigger the autonomous response capabilities of the AWS DevOps Agent. This methodology aligns directly with pentesting practices, where understanding the mechanics of an exploit is prerequisite to engineering an automated defense.

Incident Simulation 1: Configuration Drift

Configuration drift occurs when the operational state of the infrastructure diverges from the intended state defined in the Terraform configuration. In this scenario, a simulated malicious actor or a negligent administrator accesses the AWS Management Console and manually alters the S3 bucket policy to grant s3:PutObject(write access) to the public, escalating the risk from a data leak to a potential data corruption or malware hosting vector.

Furthermore, the administrator manually attaches a secondary inline policy to the AppExecutionRole-OverlyPermissive granting cross-account STS assume role capabilities, an action that was not codified in the main.tf file.

Step-by-Step Simulation:

Sign in to the AWS Management Console.
Search for S3 in the top search bar and open the S3 service.
Locate and select the bucket named vulnerable-app-data-lab-2026.
Navigate to the Permissions tab.
Click Edit on the Bucket Policy. Alter the JSON payload to explicitly grant the s3:PutObject(write access) action to Principal: "*". Click Save changes. This escalates the risk from a potential data leak to a data corruption/malware hosting vector.
Next, search for IAM in the top search bar and open the IAM console.
Click on Roles in the left navigation pane and search for AppExecutionRole-OverlyPermissive.
You will see two permissions created from our code as follows:

Incident Simulation 2: SSH Brute Force Attack

To validate the agent's capability to respond to active threats, a brute force attack is simulated against the EC2 instance associated with the application.

Step-by-Step Simulation:

Using AWS Systems Manager (SSM) Session Manager or SSH, log into the Attacker-Hydra EC2 instance created by Terraform.

 ssh -i "your-key.pem" ec2-user"ec2 instance public ip address"

The user data script has already installed Hydra and created a dictionary file. Execute the following command against the Victim-Server private IP address.

 Bash
  # Simulating a brute force attack using Hydra from the attacker node
  hydra -l root -P /home/ec2-user/common-passwords.txt ssh://<victim-server-private-ip>

If hydra did not install run the following commands first :

  # 1. Update system
  sudo dnf update -y 
 # 2. Install dependencies (very important)
  sudo dnf install -y gcc make libssh-devel openssl-devel libidn-devel \  pcre-devel libpq-devel libmemcached-devel libpcap-devel \                     git 
  # 3. Download Hydra source code
 git clone https://github.com/vanhauser-thc/thc-hydra.git
  cd thc-hydra

  # 4. Compile and install
 ./configure
  make
  sudo make install

  # 5. Verify installation
 hydra -h

Navigate to the CloudWatch console, then Log Management then Log groups , and open ssh-auth-logs. You will see the victim server actively shipping its authentication failure logs.

The Subscription Filter catches the "Failed password" regex, immediately triggering the DevOpsAgent-Dispatcher Lambda function. This zero-cost mechanism subsequently fires the secure webhook, initiating the agent's autonomous RCA workflow.

Autonomous RCA and Pull Request Orchestration

Upon receiving the alert trigger, the AWS DevOps Agent transitions into the active investigation phase. Operating independently of human interaction, the agent correlates the disparate telemetry streams to formulate a cohesive understanding of the incident.

The Investigation Workflow

The agent's activities are transparently recorded in the Operations Web App's Investigation Journal.To follow along with the agent's logic:

Step-by-Step RCA Observation

In the AWS Management Console, search for AWS DevOps Agent and open the service.
Click on the Agent Spaces menu and select the self-healing-security-lab space you deployed earlier.
Select Configure Web App and leave everything as it is and proceed to Open the Operations Web App and navigate to the Incident Response tab to view active investigations.
Click on the incidents tab. You are now viewing the Incidence Response Dashboard. Click Start investigation and fill in the following in the blank spaces and Click Start investigating :

Investigation details
Please investigate a recent multi-vector security incident. First, determine the source and impact of repeated SSH brute force attempts against our EC2 instance. Second, investigate unauthorized configuration drift that granted public write access to the vulnerable-app-data-lab-2026 S3 bucket and escalated the permissions of the AppExecutionRole-OverlyPermissive IAM role.
Investigation starting point
Start by analyzing the ssh-auth-logs CloudWatch log group for "Failed password" events. Then, cross-reference AWS CloudTrail management events to identify who recently modified the S3 bucket policy and who attached the secondary STS inline policy to the IAM role.
Name your investigation
(You can replace the default timestamp with something more readable, such as:) Incident-Security-Drift-Brute-Force

1. Investigation findings
When the investigation completes, the agent populates the Investigation summary and Root cause tabs within the Operations Web App. What makes the DevOps Agent so powerful is its ability to contextually differentiate between intended Infrastructure as Code (IaC) deployments and malicious or negligent out-of-band changes..

a. The Root Causes (Attributing the Drift)
The agent successfully maps the multi-vector incident back to two primary root causes:
Unauthorized Manual Actions:
It explicitly flags that the IAM user Chimera bypassed the Terraform pipeline and used the Firefox browser (from IP 196.96.170.247) to escalate the S3 bucket policy to public-read-write and attach the dangerous sts:AssumeRole inline policy. Crucially, the agent notes that this user session was created without MFA authentication, identifying weak access controls as a core vulnerability.
Insecure Infrastructure Baseline:
The agent accurately determines that the foundational infrastructure was deployed via Terraform by a different user (ECR-Access). However, it points out that this baseline was inherently insecure to begin with—it featured a security group allowing SSH access from 0.0.0.0/0, disabled S3 Public Access Blocks, and an overly permissive AdministratorAccess managed policy attached to an EC2 role.
2. Key Findings (The Brute Force Attack)
The agent correctly identifies the mechanics of the simulated attack. It notes that the Attacker-Hydra EC2 instance executed a deliberate brute force using the Hydra password cracking tool. Because both instances resided on the same subnet with a shared, wide-open security group, the attack succeeded against the weak password123 credentials defined in the EC2 user data.
3. Investigation Gaps (The Observability Moment)
Perhaps the most impressive part of the investigation is the agent's ability to identify what it couldn't see. While investigating the SSH attack, it flagged three critical observability gaps:
Broken Telemetry: The ssh-auth-logs log group existed but contained 0 bytes of data, meaning the CloudWatch agent on the Victim-Server failed to successfully ship the logs or was disrupted.
Missing Network Visibility: It attempted to query DescribeFlowLogs but returned an empty array, highlighting that VPC Flow Logs were entirely disabled, depriving the environment of network-level traffic evidence.
Missing Threat Detection: Finally, it checked for Amazon GuardDuty detectors and found none, noting that without GuardDuty, there was no AWS-native threat intelligence available for the account.
2. Topology Mapping:

The topology below shows a simulated AWS security scenario where an attacker EC2 instance (**Attacker-Hydra**) is directly connected to a target EC2 instance (**Victim-Server**), indicating that network access most likely SSH on port 22 is allowed between them. The victim is protected by a security group (**free-tier-lab-sg**), which appears to permit inbound access broadly (a common misconfiguration), making it vulnerable to brute-force login attempts. This setup represents a typical attack path where an exposed service is reachable from an external or malicious host.
At the same time, the architecture includes a logging pipeline: the victim instance is attached to an instance profile (**EC2LogShippingProfile**) and IAM role (**EC2LogShippingRole**) that allow it to send authentication logs to a CloudWatch log group (**ssh-auth-logs**). This means the system is instrumented for monitoring and detection of suspicious activity like repeated failed SSH logins, but not necessarily protected against them.

Generating the Remediation

To achieve this, click the Go to root cause button in the timeline.

Having completed the Root Cause Analysis (RCA), the agent formulates a mitigation strategy. The objective is twofold: halt the immediate brute force threat and permanently remediate the underlying configuration drift through codified infrastructure changes.

Once the root cause is identified, clicking the Generate mitigation plan button reveals exactly why frontier agents are replacing static runbooks. Instead of just offering a generic suggestion to "fix the S3 bucket," the AWS DevOps Agent generates a comprehensive, five-phase SRE mitigation runbook of AWS CLI commands.

Here is a breakdown of the generated mitigation plan and why each step matters for enterprise operations:

Phase 1: Prepare (State Capture)

Before modifying any resources, the agent instructs you to capture the current, drifted state. This is a critical safety mechanism.

Step 1.1 & 1.2:

It provides aws iam get-role-policy and aws s3api get-bucket-policy commands to save the exact JSON payloads of the drifted policies.

Step 1.3:

It captures the aws s3api get-bucket-policy-status to establish a baseline for post validation.

Why this matters: If reverting the drift breaks an unexpected dependency, you have the exact JSON needed to undo the change immediately.

Phase 2: Pre-Validate (Impact Assessment)

A human engineer might rush to delete the overly permissive policy, accidentally breaking a legitimate workload. The agent prevents this by validating preconditions.

Step 2.1:

It runs aws iam list-role-policies to confirm the rogue STSAssumeRole policy actually still exists before attempting deletion.

Step 2.2:

It runs aws ec2 describe-instances filtered by the target IAM profile. This is genius. It checks if any live EC2 instances are currently using the compromised role, allowing you to coordinate with stakeholders before pulling the plug.

Step 2.3 & 2.4:

It uses aws iam generate-service-last-accessed-details to see if the dangerous sts:AssumeRole permission has actually been exploited recently.

Phase 3: Apply (The Tactical Fix)

Having validated the environment, the agent provides the exact, surgical CLI commands to revert the drift without touching the underlying, legitimate IaC.

Step 3.1:

It runs aws iam delete-role-policy to explicitly drop the manual STSAssumeRole inline policy, neutralizing the lateral movement threat.

Step 3.2:

It runs aws s3api put-bucket-policy with a newly generated, hardened JSON payload that explicitly strips the s3:PutObject permission, locking the bucket back down to public-read-only to match the Terraform baseline.

Phase 4: Post-Validate (Verification)

The agent doesn't assume the commands worked; it verifies them.

Steps 4.1 & 4.2:

It reruns the commands from Phase 1 (list-role-policies and get-bucket-policy) and asks you to confirm that the output no longer contains the malicious permissions.

Phase 5: Rollback

Finally, the agent provides a complete Rollback script using aws iam put-role-policy and aws s3api put-bucket-policy, populated with the exact variables needed to restore the drifted state if the fix causes an unexpected outage.

By automating the creation of this peer-reviewable, highly defensive CLI runbook, the AWS DevOps Agent collapses the mitigation lifecycle from hours of stressful, manual engineering to mere minutes of execution.

Environment Cleanup

To avoid incurring any unexpected charges and to cleanly shut down your lab environment, follow these detailed steps to delete all AWS resources and local configurations.

Step 1: Empty the S3 Bucket

Terraform cannot destroy an S3 bucket if it contains objects (like Terraform state files or application data)

.Open the AWS Management Console and navigate to S3.Select the vulnerable-app-data-lab-2026 bucket.Click Empty, type permanently delete to confirm, and click Empty.

Step 2: Destroy Terraform Infrastructure

In yourterminal, navigate to the directory containing your main.tf file and execute the destruction command:


terraform destroy -auto-approve

Note: If you manually created the AWS DevOps Agent Space in the console earlier because it failed to deploy, you will need to manually delete it from the AWS DevOps Agent console.

Step 3: Clean Up Local Environment

Stop any running MCP servers and remove the installed packages to clean up your local workstation.

Stop Processes: Switch to the terminal windows running your linux-mcp-server and terraform-mcp-server and press Ctrl+C to terminate the processes.

Uninstall Linux MCP Server: Since this was installed using uv, remove it by running:


uv tool uninstall linux-mcp-server

Uninstall Terraform MCP Server: Depending on how you installed it, remove it via Homebrew or by deleting the binary:


# If installed via Homebrew
brew uninstall terraform-mcp-server

# If installed via binary download
sudo rm /usr/local/bin/terraform-mcp-server

Remove Lab Files: Finally, delete the Terraform state and lab files from your directory to ensure a clean slate:



rm -rf.terraform terraform.tfstate terraform.tfstate.backup tfplan lambda_function.zip index.py main.tf

Conclusion and Future Implications

The integration of agentic AI into the operational fabric of cloud infrastructure fundamentally disrupts traditional paradigms of software engineering, systems administration, and cybersecurity. From a project management and strategic cybersecurity perspective, the empirical evidence derived from this implementation is profound. The deployment of the AWS DevOps Agent yields an MTTR reduction of up to 75%, increases root cause accuracy to 94%, and accelerates overall incident resolution speeds by a factor of 3 to 5.

However, the most significant implication is the radical reduction of operational toil. By offloading the "undifferentiated heavy lifting" of log correlation, anomaly detection, configuration drift remediation, and HCL syntax generation to an autonomous system, the cognitive load on human engineers is drastically alleviated. This cognitive liberation allows cybersecurity professionals and Project Managers to pivot their focus from reactive firefighting and manual patching to strategic architectural planning, advanced threat modeling, and proactive innovation.

Furthermore, the implementation of the Model Context Protocol democratizes access to these agentic capabilities. By standardizing the interface between the AI reasoning engine and the underlying data layer, organizations are no longer constrained by monolithic, vendor-specific integrations. They can deploy lightweight, highly specialized MCP servers tailored to their proprietary internal systems whether that is a localized Fedora environment, a custom CI/CD pipeline, or a legacy on-premises database ensuring that the frontier agent has comprehensive, unobstructed visibility across the entire technological estate.

Ultimately, the adoption of agentic operations is not merely an enhancement of existing DevOps workflows; it is a fundamental architectural evolution. As demonstrated by the capacity to automatically detect security misconfigurations, synthesize root causes, and generate deployable, peer-reviewable infrastructure code, the AWS DevOps Agent establishes the foundation for true self-healing systems a requisite advancement for managing the scale, security, and sustainability demands of the next decade of cloud computing.