Masaki Okuda

Posted on Mar 1

Testing AWS DevOps Agent with Multi-Account

#aws #devops #testing #tutorial

One of the key features of AWS DevOps Agent is multi-account support. In this hands-on article, we'll verify the multi-account functionality and discuss practical considerations for real-world implementation.

Disclaimer

This blog content is based on testing during the preview phase.
The information may change as updates are released.

Target Audience

Those who want to explore AWS DevOps Agent's multi-account functionality
Those who want to understand considerations when testing multi-account setups

Key Takeaways

Fault detection in multi-account configurations is achievable
Multi-account setup itself is straightforward, but practical implementation requires careful consideration
IAM role names and permission management need to be determined before implementing DevOps Agent

About the Test Environment

We'll conduct verification using the test sample code from the official AWS DevOps Agent documentation. (The sample code provides two scenarios: an EC2 CPU stress test and a Lambda function that intentionally raises errors.)

Creating a Test Environment

The implementation steps are described in the guide, so we'll omit them from this blog.

Note: Since AWS DevOps Agent is in preview, please create resources in the US East (N. Virginia) region (us-east-1).

Note: The YAML files for Test A and Test B require indentation corrections.

Test A

AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS DevOps Agent EC2 CPU Test Stack'
Parameters:
  MyIP:
    Type: String
    Description: Your current IP address for SSH access (find at https://whatismyipaddress.com)
    Default: '0.0.0.0/0'
Resources:
  # Security Group for SSH access
  TestSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: AWS-DevOpsAgent-test-sg
      GroupDescription: AWS DevOps Agent beta testing security group
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: !Ref MyIP
          Description: SSH access from your IP
      Tags:
        - Key: Name
          Value: AWS-DevOpsAgent-Test-SG
        - Key: Purpose
          Value: AWS-DevOpsAgent-Testing
  # Key Pair for SSH access
  TestKeyPair:
    Type: AWS::EC2::KeyPair
    Properties:
      KeyName: AWS-DevOpsAgent-test-key
      KeyType: rsa
      Tags:
        - Key: Name
          Value: AWS-DevOpsAgent-Test-Key
        - Key: Purpose
          Value: AWS-DevOpsAgent-Testing
  # EC2 Instance for CPU testing
  TestInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t3.micro
      ImageId: '{{resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64}}'
      KeyName: !Ref TestKeyPair
      SecurityGroupIds:
        - !Ref TestSecurityGroup
      UserData:
        Fn::Base64:
          !Sub |
          #!/bin/bash
          yum update -y
          yum install -y htop

          # Create the CPU stress test script
          cat > /home/ec2-user/cpu-stress-test.sh << 'EOF'
          #!/bin/bash
          echo "Starting AWS DevOpsAgent CPU Stress Test"
          echo "Time: $(date)"
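          # AL2023 enforces IMDSv2 by default, so request a session token first
          TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
          echo "Instance: $(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)"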
          echo ""

          # Get number of CPU cores
          CORES=$(nproc)
          echo "CPU Cores: $CORES"
          echo ""

          echo "Starting stress test (5 minutes)..."
          echo "This will generate >70% CPU usage to trigger CloudWatch alarm"
          echo ""

          # Create CPU load using yes command
          echo "Starting CPU load processes..."
          for i in $(seq 1 $CORES); do
              (yes > /dev/null) &
              CPU_PID=$!
              echo "Started CPU load process $i (PID: $CPU_PID)"
              echo $CPU_PID >> /tmp/cpu_test_pids
          done

          # Auto-cleanup after 5 minutes
          (sleep 300 && echo "Stopping CPU load processes..." && kill $(cat /tmp/cpu_test_pids 2>/dev/null) 2>/dev/null && rm -f /tmp/cpu_test_pids) &

          echo ""
          echo "CPU load processes started for 5 minutes"
          echo "Check CloudWatch for alarm trigger in 3-5 minutes"
          EOF

          chmod +x /home/ec2-user/cpu-stress-test.sh
          chown ec2-user:ec2-user /home/ec2-user/cpu-stress-test.sh

          # Create auto-shutdown script (safety mechanism)
          cat > /home/ec2-user/auto-shutdown.sh << 'SHUTDOWN_EOF'
          #!/bin/bash
          echo "Auto-shutdown scheduled for 2 hours from now: $(date)"
          sleep 7200
          echo "Auto-shutdown executing at: $(date)"
          sudo shutdown -h now
          SHUTDOWN_EOF

          chmod +x /home/ec2-user/auto-shutdown.sh
          nohup /home/ec2-user/auto-shutdown.sh > /home/ec2-user/auto-shutdown.log 2>&1 &

          echo "AWS DevOpsAgent test setup completed at $(date)" > /home/ec2-user/setup-complete.txt
      Tags:
        - Key: Name
          Value: AWS-DevOpsAgent-Test-Instance
        - Key: Purpose
          Value: AWS-DevOpsAgent-Testing
  # CloudWatch Alarm for CPU utilization
  CPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: AWS-DevOpsAgent-EC2-CPU-Test
      AlarmDescription: AWS-DevOpsAgent beta test - EC2 CPU utilization alarm
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: 70
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref TestInstance
      TreatMissingData: notBreaching
Outputs:
  InstanceId:
    Description: EC2 Instance ID for testing
    Value: !Ref TestInstance

  SecurityGroupId:
    Description: Security Group ID
    Value: !Ref TestSecurityGroup

  AlarmName:
    Description: CloudWatch Alarm Name
    Value: !Ref CPUAlarm

  SSHCommand:
    Description: SSH command to connect to instance
    Value: !Sub 'ssh -i "AWS-DevOpsAgent-test-key.pem" ec2-user@${TestInstance.PublicDnsName}'

Test B

AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS DevOpsAgent Lambda Error Test Stack'
Resources:
  # IAM Role for Lambda function
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: AWS-DevOpsAgentLambdaTestRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Tags:
        - Key: Name
          Value: AWS-DevOpsAgent-Lambda-Test-Role
        - Key: Purpose
          Value: AWS-DevOpsAgent-Testing
  # Lambda function that generates errors
  TestLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: AWS-DevOpsAgent-test-lambda
      Runtime: python3.12
      Handler: index.lambda_handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          import json
          import random
          import time
          from datetime import datetime
          def lambda_handler(event, context):
              print(f"AWS DevOpsAgent Test Lambda - {datetime.now()}")
              print(f"Event: {json.dumps(event)}")

              # Intentionally generate errors for testing
              error_scenarios = [
                  "Simulated database connection timeout",
                  "Test API rate limit exceeded",
                  "Intentional validation error for AWS DevOpsAgent testing"
              ]

              # Always throw an error for testing purposes
              error_message = random.choice(error_scenarios)
              print(f"Generating test error: {error_message}")

              # This will create a Lambda error that CloudWatch will detect
              raise Exception(f"AWS DevOpsAgent Test Error: {error_message}")
      Description: AWS DevOpsAgent beta test function - intentionally generates errors
      Timeout: 30
      Tags:
        - Key: Name
          Value: AWS-DevOpsAgent-Test-Lambda
        - Key: Purpose
          Value: AWS-DevOpsAgent-Testing
  # CloudWatch Alarm for Lambda errors
  LambdaErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: AWS-DevOpsAgent-Lambda-Error-Test
      AlarmDescription: AWS-DevOpsAgent beta test - Lambda error rate alarm
      MetricName: Errors
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: FunctionName
          Value: !Ref TestLambdaFunction
      TreatMissingData: notBreaching
Outputs:
  LambdaFunctionName:
    Description: Lambda Function Name for testing
    Value: !Ref TestLambdaFunction

  LambdaFunctionArn:
    Description: Lambda Function ARN
    Value: !GetAtt TestLambdaFunction.Arn

  AlarmName:
    Description: CloudWatch Alarm Name
    Value: !Ref LambdaErrorAlarm

  TestCommand:
    Description: AWS CLI command to test the function
    Value: !Sub 'aws lambda invoke --function-name ${TestLambdaFunction} --payload "{\"test\":\"AWS DevOpsAgent validation\"}" response.json'

Once the CloudFormation processing completes successfully, the test environment is ready.
Since it takes time for error conditions to occur, let's configure AWS DevOps Agent in the meantime.
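
The Lambda alarm only fires once the function has actually been invoked. As a rough sketch, the helper below builds the AWS CLI argument list modeled on the template's TestCommand output. The `--cli-binary-format raw-in-base64-out` flag assumes AWS CLI v2 is in use (v2 otherwise expects a base64-encoded payload), the `--region` default follows the N. Virginia note above, and `build_invoke_command` is a hypothetical helper name, not part of the sample code.

```python
import json
import shlex


def build_invoke_command(function_name, payload, region="us-east-1",
                         outfile="response.json"):
    """Return the argv list for one `aws lambda invoke` call."""
    return [
        "aws", "lambda", "invoke",
        "--function-name", function_name,
        "--payload", json.dumps(payload),
        # Assumption: AWS CLI v2, which needs this flag for raw JSON payloads.
        "--cli-binary-format", "raw-in-base64-out",
        "--region", region,
        outfile,
    ]


if __name__ == "__main__":
    cmd = build_invoke_command(
        "AWS-DevOpsAgent-test-lambda",
        {"test": "AWS DevOpsAgent validation"},
    )
    # Print a copy-pasteable shell command; running it a few times
    # produces errors within the alarm's 60-second period.
    print(shlex.join(cmd))
```

Building the command as a list and quoting it with `shlex.join` avoids hand-escaping the JSON payload in the shell.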

AWS DevOps Agent Multi-Account Configuration

First, create an AWS DevOps Agent for multi-account integration.
Incidentally, the initial screen shown when creating an AgentSpace has changed.
Since this is a preview version, breaking changes like this are likely to keep occurring.

Link the target AWS account as a Secondary Source by clicking the Add button.

After clicking, IAM role information to be created in the target account is displayed. Navigate to the target AWS account and create the IAM role.

As an experiment, let's create the role with a name different from the one shown in the instructions.

The IAM role has been created, but don't forget to configure the inline policy.

Looking at the inline policy, it grants only a minimal set of permissions. If you need more, for example to read logs stored in S3 or to monitor Control Tower, you'll have to tune the inline policy yourself.
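
As an illustration of that kind of tuning, the sketch below appends a read-only S3 statement to a policy document. It is a minimal example under stated assumptions: the bucket name `example-app-logs` and both helper functions are hypothetical, and the base policy is an empty placeholder because the actual inline policy shown in the console is not reproduced here.

```python
import json


def s3_log_read_statement(bucket_name):
    """Build one IAM statement granting read-only access to a log bucket."""
    return {
        "Sid": "ReadLogBucket",
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{bucket_name}",    # the bucket itself (ListBucket)
            f"arn:aws:s3:::{bucket_name}/*",  # objects in the bucket (GetObject)
        ],
    }


def extend_policy(policy, statement):
    """Return a copy of `policy` with `statement` appended."""
    extended = dict(policy)
    extended["Statement"] = list(policy.get("Statement", [])) + [statement]
    return extended


if __name__ == "__main__":
    # Placeholder for the inline policy copied from the console.
    base_policy = {"Version": "2012-10-17", "Statement": []}
    # "example-app-logs" is a hypothetical bucket name.
    updated = extend_policy(base_policy, s3_log_read_statement("example-app-logs"))
    print(json.dumps(updated, indent=2))
```

Keeping the extra statement separate from the console-provided policy makes it easier to review exactly what was added on top of the defaults.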

The IAM role for AWS DevOps Agent should follow these rules:

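By now the alarms should be visible in CloudWatch, so let's check them.
The Lambda alarm went into the ALARM state, but the EC2 alarm did not. The stress test script wasn't running either: the template's UserData only writes cpu-stress-test.sh to disk and never executes it, so it has to be started manually over SSH. It may be easier to create your own verification resources.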
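
The difference between the two alarms can be illustrated with a simplified model of a single evaluation period. Both templates use `GreaterThanThreshold`, `EvaluationPeriods: 1`, and `TreatMissingData: notBreaching`. The function below is only a sketch of that comparison, not CloudWatch's actual evaluation logic, and the idle CPU value is an illustrative number rather than a measurement.

```python
def evaluate_alarm(datapoint, threshold, treat_missing_as_breaching=False):
    """Simplified one-period evaluation for a GreaterThanThreshold alarm.

    `datapoint` is the metric value for the period, or None when no data
    arrived. CloudWatch's real missing-data handling is more detailed.
    """
    if datapoint is None:
        return "ALARM" if treat_missing_as_breaching else "OK"
    return "ALARM" if datapoint > threshold else "OK"


if __name__ == "__main__":
    # Lambda alarm: Sum of Errors > 0 over 60 seconds.
    print(evaluate_alarm(4, threshold=0))      # 4 errors in the period
    # EC2 alarm: Average CPUUtilization > 70 over 60 seconds.
    print(evaluate_alarm(3.5, threshold=70))   # idle instance (illustrative)
    print(evaluate_alarm(None, threshold=70))  # no datapoint, notBreaching
```

With a threshold of 0, a single failed invocation is enough to trip the Lambda alarm, whereas the EC2 alarm stays in OK until the CPU load is actually generated.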

Return to the AWS console on the AWS DevOps Agent side and configure multi-account settings. Configure the following:

Once configuration is complete, the AWS account for multi-account setup will be displayed in the Secondary Source.

An error occurs if the IAM role name doesn't match.

Once the IAM role name is corrected, the status changes to Valid.
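
A mismatch like this can be caught before touching the console by comparing the agreed role name with the one actually created. The following is a minimal sketch: the role names and the account ID `123456789012` are placeholders, both functions are hypothetical helpers, and the exact string comparison is a conservative assumption about how the name is validated.

```python
def role_arn(account_id, role_name):
    """Build the ARN of an IAM role in the given account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def check_role_name(expected_name, created_name):
    """Return (ok, message) after an exact comparison of the two names."""
    if expected_name == created_name:
        return True, "role name matches"
    return False, f"expected {expected_name!r} but found {created_name!r}"


if __name__ == "__main__":
    # Hypothetical names: the first is what was agreed, the second what was created.
    ok, message = check_role_name("ExampleAgentRole", "ExampleAgentRole-custom")
    print(ok, message)
    print(role_arn("123456789012", "ExampleAgentRole"))
```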

Testing with AWS DevOps Agent

Now let's verify whether AWS DevOps Agent can detect Lambda failures.

The test alarm name from the target account appears, so the multi-account integration is functioning properly.

The following was output as the investigation result.
The cause was identified as intentional Lambda code designed to output test errors, confirming that detection is working properly.

The Lambda function AWS-DevOpsAgent-test-lambda, intentionally designed to generate test errors, experienced a 100% error rate (4 errors out of 4 invocations) between 13:40:00 and 13:41:00 UTC. This function was deployed via CloudFormation at 13:15:04 UTC with the description "AWS DevOpsAgent beta test function - intentionally generates errors". Log analysis revealed that the function code (index.py, line 21) intentionally raises exceptions with three different test scenarios: "Test API rate limit exceeded", "Intentional validation error for AWS DevOpsAgent testing", and "Simulated database connection timeout". All errors follow the pattern "AWS DevOpsAgent Test Error: {error_message}". Metrics confirmed these were immediate failures (5-19 millisecond durations) without throttling or timeout issues, consistent with intentional error generation. The CloudWatch alarm "AWS-DevOpsAgent-Lambda-Error-Test" triggered as designed when detecting these errors. This is expected behavior for this test function and not an incident requiring remediation in production.

Insights Gained Through Verification

Through multi-account configuration verification, we gained several insights.

Insight 1: Configuration itself is not difficult with the guide

Since the AWS DevOps Agent multi-account configuration screen provides guidance on what policies and roles to configure, even beginners can handle the setup. However, the default policy permissions are minimal, so customization may be necessary depending on the project.

Insight 2: The current specifications carry risks when configuring user environments

When providing operational support or monitoring, settings are likely needed in both the vendor's AWS account (where DevOps Agent is set up) and the user's AWS account (where the IAM roles are configured). If IAM role names and permission settings are not agreed on in advance, the desired troubleshooting may not be possible.

Insight 3: Introducing multi-account makes DevOps Agent operations more complex

In this case we targeted only one AWS account, so management isn't too complex. However, once many AWS accounts are managed through DevOps Agent, it becomes difficult to track which AgentSpace manages which AWS account, because each AgentSpace's accounts have to be checked one by one.
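
One way to reduce that lookup cost is to keep your own inventory and invert it, so an account ID can be looked up directly. The sketch below assumes the mapping is maintained by hand; the AgentSpace names and account IDs are made up, and nothing here calls a DevOps Agent API.

```python
from collections import defaultdict


def accounts_to_agentspaces(agentspaces):
    """Invert {agentspace: [account_id, ...]} into {account_id: [agentspace, ...]}."""
    index = defaultdict(list)
    for space, accounts in agentspaces.items():
        for account_id in accounts:
            index[account_id].append(space)
    return dict(index)


if __name__ == "__main__":
    # Hypothetical inventory, maintained manually or in version control.
    inventory = {
        "agentspace-prod": ["111111111111", "222222222222"],
        "agentspace-dev": ["222222222222", "333333333333"],
    }
    for account_id, spaces in sorted(accounts_to_agentspaces(inventory).items()):
        print(account_id, spaces)
```

An account that appears under more than one AgentSpace shows up with several entries, which is worth knowing before investigating an incident.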

Future Outlook

Writing this blog has raised the need to consider the following:

Permission management that accounts for multi-account configurations (IAM Identity Center)
Managing DevOps Agent itself through IaC
Verifying investigation accuracy when multiple accounts are registered and the same failure occurs in them simultaneously

Additionally, we'd like to conduct verification from these perspectives:

Integration with unsupported observability tools
Implementation ideas for enterprise environments
