DEV Community: Usman Ahmad

Building a Self-Healing Infrastructure on AWS with Amazon Bedrock

Usman Ahmad — Mon, 29 Jun 2026 07:46:45 +0000

When your server fixes itself before you even open your laptop
Have you ever been woken up at 2 AM because Nginx went down on a production server?

What if your infrastructure could detect the failure, analyze it with AI, and restart the service, all before you even picked up your phone?

That is exactly what I built. In this article, I will walk you through Demo: Automated AIOps Self-Healing Pipeline, a real, working system on AWS that monitors an EC2-hosted Nginx server, uses Amazon Bedrock (Claude, model claude-sonnet-4-6) to analyze failures, and automatically remediates them via AWS Systems Manager.

No SSH. No manual intervention. No 2 AM alerts.

What We Are Building

The pipeline follows a simple but powerful flow:

EC2 (Nginx) → CloudWatch Agent → CloudWatch Logs
                                        ↓
                              Subscription Filter
                                        ↓
                               Lambda Function
                                        ↓
                            Amazon Bedrock (claude-sonnet-4-6)
                                        ↓
                           SSM Run Command → Auto-Fix ✅

When Nginx stops for any reason, the system detects it, understands why it stopped, and brings it back up automatically.

The Architecture — Five Components

EC2 + Nginx: A standard Amazon Linux 2023 EC2 instance running Nginx. The CloudWatch Agent is installed to ship error logs to CloudWatch in near real time.

CloudWatch Log Group + Subscription Filter: Nginx error logs land in /aiops/ec2/nginx/error-logs. A subscription filter with a blank pattern forwards every log batch to Lambda.
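
As a rough illustration of the subscription filter described above, the request could be assembled like this. It is a minimal sketch, not the setup from the repository: the filter name and the Lambda ARN are placeholders, and the helper names are hypothetical. CloudWatch Logs also needs permission to invoke the function, which is not shown here.

```python
def build_subscription_filter_request(log_group, lambda_arn,
                                      filter_name="nginx-errors-to-lambda"):
    """Build keyword arguments for the CloudWatch Logs put_subscription_filter call.

    filter_name is a placeholder chosen for this example.
    """
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": "",  # blank pattern: every log event is forwarded
        "destinationArn": lambda_arn,
    }


def create_subscription_filter(request):
    """Send the request to CloudWatch Logs (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    logs_client = boto3.client("logs")
    logs_client.put_subscription_filter(**request)


request = build_subscription_filter_request(
    "/aiops/ec2/nginx/error-logs",
    # Placeholder ARN; replace with the ARN of your own function.
    "arn:aws:lambda:eu-central-1:123456789012:function:aiops-self-healing",
)
print(request)
```

Keeping the pattern blank moves all filtering into the Lambda pre-checks, which is why those checks matter for cost.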

AWS Lambda: The brain of the pipeline. The function runs a two-stage pre-check before calling Bedrock:

Pre-check 1 — if logs are startup-only (Nginx just came back up), exit early. No Bedrock call needed.
Pre-check 2 — if the service is in the maintenance list, exit early. Bedrock and SSM are both skipped.
If both checks pass, logs are sent to Bedrock for analysis.

GitHub Repository Link (Lambda Function): https://github.com/engr-usman/aws-aiops-demo-repo/tree/main/demo-3-complete-aiops

import json
import boto3
import base64
import gzip
import re
import time
import os
from datetime import datetime

# ─────────────────────────────────────────────
# CONFIGURATION — loaded from environment variables
# ─────────────────────────────────────────────
AWS_REGION          = os.environ.get('AWS_REGION', 'eu-central-1')
ANALYSIS_LOG_GROUP  = os.environ.get('ANALYSIS_LOG_GROUP', '/aiops/lambda/nginx-analysis')
MODEL_ID            = os.environ.get('MODEL_ID', 'global.anthropic.claude-sonnet-4-6')

# ─────────────────────────────────────────────
# AWS CLIENTS
# Note: Bedrock client stays in us-east-1 because the cross-region
#       inference profile (global.*) is hosted there.
#       All other clients use the Lambda function's own region.
# ─────────────────────────────────────────────
bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
ssm     = boto3.client('ssm',             region_name=AWS_REGION)
logs    = boto3.client('logs',            region_name=AWS_REGION)
ec2     = boto3.client('ec2',             region_name=AWS_REGION)

# ─────────────────────────────────────────────
# REMEDIATION ACTION MAP
# Maps Bedrock-returned action keys to shell commands executed
# on the EC2 instance via SSM Run Command (no SSH required).
# ─────────────────────────────────────────────
REMEDIATION_ACTIONS = {
    "nginx_service_stopped": {
        "command": "sudo systemctl start nginx && sudo systemctl status nginx",
        "description": "Starting Nginx service"
    },
    "nginx_service_failed": {
        "command": "sudo systemctl restart nginx && sudo systemctl status nginx",
        "description": "Restarting failed Nginx service"
    },
    "nginx_config_error": {
        "command": "sudo nginx -t && sudo systemctl reload nginx",
        "description": "Testing and reloading Nginx config"
    },
    "nginx_port_conflict": {
        "command": "sudo systemctl stop nginx && sudo fuser -k 80/tcp && sudo systemctl start nginx",
        "description": "Resolving port conflict and restarting Nginx"
    },
    "disk_full": {
        "command": "sudo journalctl --vacuum-size=100M && sudo find /var/log/nginx -name '*.log' -mtime +7 -delete",
        "description": "Clearing old logs to free disk space"
    },
    "permission_error": {
        "command": "sudo chown -R nginx:nginx /var/log/nginx && sudo chmod 755 /var/log/nginx",
        "description": "Fixing Nginx file permissions"
    },
    "general_restart": {
        "command": "sudo systemctl restart nginx && sudo systemctl status nginx",
        "description": "General Nginx service restart"
    }
}


# ═════════════════════════════════════════════
# MAIN HANDLER
# ═════════════════════════════════════════════
def lambda_handler(event, context):
    print("🚀 AIOps Self-Healing Lambda triggered")
    print(f"📥 Event received: {json.dumps(event)[:200]}")

    try:
        # ── Guard: only process CloudWatch Logs subscription filter events ──
        if 'awslogs' not in event:
            print("⚠️  Not a CloudWatch Logs event — possibly a manual test or wrong trigger source")
            print(f"   Received event keys: {list(event.keys())}")
            return {
                "statusCode": 400,
                "body": "Event does not contain 'awslogs' data. Trigger this function via a CloudWatch Logs subscription filter."
            }

        # ── Step 1: Decode the compressed CloudWatch log payload ──
        log_data   = decode_cloudwatch_logs(event)
        log_events = log_data.get('logEvents', [])
        log_group  = log_data.get('logGroup', 'unknown')
        log_stream = log_data.get('logStream', 'unknown')

        print(f"📋 Log Group  : {log_group}")
        print(f"📋 Log Stream : {log_stream}")
        print(f"📋 Events received: {len(log_events)}")

        if not log_events:
            return {"statusCode": 200, "body": "No log events to process"}

        # Combine all log messages into a single string for analysis
        combined_logs = "\n".join([e.get('message', '') for e in log_events])

        # ── Step 2: Extract the EC2 instance ID from the log stream name ──
        # Stream name format: i-<instance-id>/nginx-error
        instance_id = extract_instance_id(log_stream, log_group)
        print(f"🖥️  EC2 Instance ID: {instance_id}")

        # ── Step 2.5: Pre-check — skip Bedrock if logs are startup-only ──
        # When Nginx starts after auto-remediation it writes startup [notice]
        # logs. These do not require analysis. Early-exiting here saves cost
        # and avoids triggering a second remediation cycle.
        if is_startup_only_logs(combined_logs):
            print("✅ Pre-check 2.5: Startup-only logs detected — service is healthy, skipping Bedrock")
            return {
                "statusCode": 200,
                "body": json.dumps({
                    "status": "healthy",
                    "reason": "Startup-only logs detected — no analysis needed",
                    "log_group": log_group,
                    "events_count": len(log_events)
                })
            }

        # ── Step 2.6: Pre-check — skip Bedrock if service is under maintenance ──
        # Operators set MAINTENANCE_MODE_SERVICES env var (e.g. "nginx" or
        # "nginx,mysql") to signal a planned maintenance window. When a service
        # in that list is detected in the incoming logs, Bedrock is bypassed
        # entirely — avoiding unnecessary analysis cost and preventing
        # auto-remediation from restarting a service that was intentionally stopped.
        maintenance_env      = os.environ.get('MAINTENANCE_MODE_SERVICES', '').strip()
        maintenance_services = []
        if maintenance_env and maintenance_env.lower() != 'none':
            maintenance_services = [s.strip().lower() for s in maintenance_env.split(',') if s.strip()]

        if maintenance_services:
            detected_service = detect_service_from_logs(combined_logs)
            print(f"🔍 Detected service from logs : {detected_service}")
            print(f"📋 Maintenance list           : {maintenance_services}")

            if detected_service and detected_service in maintenance_services:
                print(f"🔶 MAINTENANCE MODE (pre-Bedrock): '{detected_service}' is in the maintenance list — skipping Bedrock entirely")
                return {
                    "statusCode": 200,
                    "body": json.dumps({
                        "status": "maintenance_mode",
                        "reason": f"Service '{detected_service}' is under planned maintenance",
                        "maintenance_list": maintenance_services,
                        "message": "Bedrock analysis skipped. Remove the service from MAINTENANCE_MODE_SERVICES to re-enable auto-remediation.",
                        "log_group": log_group,
                        "instance_id": instance_id
                    })
                }

        # ── Step 3: Send logs to Amazon Bedrock (Claude) for AI analysis ──
        print("🤖 Sending logs to Bedrock for analysis...")
        analysis = analyze_with_bedrock(combined_logs, instance_id)
        print(f"✅ Bedrock Analysis Complete: {json.dumps(analysis, indent=2)}")

        # ── Step 4: Persist the analysis result to a dedicated CloudWatch log group ──
        log_analysis_to_cloudwatch(analysis, instance_id, combined_logs)

        # ── Step 5: Evaluate remediation action ──
        remediation_action = analysis.get('remediation_action', 'none')
        affected_service   = detect_affected_service(remediation_action)

        print(f"🔧 Remediation Action : {remediation_action}")
        print(f"🛠️  Affected Service   : {affected_service}")
        print(f"📋 Maintenance List   : {maintenance_services if maintenance_services else 'Empty (no maintenance)'}")

        if affected_service and affected_service.lower() in maintenance_services:
            # Service is in maintenance — log the event but take no action
            print(f"🔶 MAINTENANCE MODE: '{affected_service}' is in the maintenance list — skipping remediation")
            analysis['remediation'] = {
                "status": "maintenance_mode",
                "reason": f"Service '{affected_service}' is under planned maintenance",
                "maintenance_list": maintenance_services,
                "action_would_have_been": remediation_action,
                "message": "Remove the service from MAINTENANCE_MODE_SERVICES to re-enable auto-remediation."
            }
            log_remediation_result(analysis['remediation'], instance_id)

        elif remediation_action == 'none':
            # Bedrock determined the service is healthy — no action needed
            print("ℹ️  No remediation required — Bedrock confirmed the service is healthy")
            analysis['remediation'] = {
                "status": "skipped",
                "reason": "Bedrock analysis determined no remediation is required"
            }

        elif instance_id is None:
            # Cannot remediate without knowing which EC2 instance to target
            print("⚠️  Cannot remediate — EC2 instance ID could not be determined")
            analysis['remediation'] = {
                "status": "failed",
                "reason": "EC2 instance ID could not be determined from the log stream name"
            }

        else:
            # All checks passed — trigger SSM Run Command for auto-remediation
            print(f"🔧 Starting auto-remediation for instance: {instance_id}")
            remediation_result      = perform_remediation(instance_id, analysis)
            analysis['remediation'] = remediation_result
            log_remediation_result(remediation_result, instance_id)

        return {
            "statusCode": 200,
            "body": json.dumps(analysis)
        }

    except Exception as e:
        print(f"❌ Lambda error: {str(e)}")
        raise


# ═════════════════════════════════════════════
# HELPER FUNCTIONS
# ═════════════════════════════════════════════

def decode_cloudwatch_logs(event):
    """
    Decode the base64-encoded, gzip-compressed payload delivered by
    a CloudWatch Logs subscription filter.
    """
    encoded      = event['awslogs']['data']
    compressed   = base64.b64decode(encoded)
    decompressed = gzip.decompress(compressed)
    return json.loads(decompressed)


def extract_instance_id(log_stream, log_group):
    """
    Extract the EC2 instance ID from the CloudWatch log stream name.

    Primary  — regex match on the stream name (format: i-<id>/nginx-error).
    Fallback — describe EC2 instances filtered by Name tag 'aiops-nginx-demo'
               in case the stream name does not follow the expected pattern.
    """
    match = re.search(r'(i-[a-f0-9]{8,17})', log_stream)
    if match:
        return match.group(1)

    # Fallback: look up the instance by tag
    try:
        response = ec2.describe_instances(
            Filters=[
                {'Name': 'instance-state-name', 'Values': ['running']},
                {'Name': 'tag:Name',            'Values': ['aiops-nginx-demo']}
            ]
        )
        reservations = response.get('Reservations', [])
        if reservations:
            return reservations[0]['Instances'][0]['InstanceId']
    except Exception as e:
        print(f"⚠️  Could not determine instance ID via EC2 describe: {e}")

    return None


def is_startup_only_logs(log_text):
    """
    Return True if the log batch contains only Nginx startup messages
    and no shutdown, error, or crash signals.

    Used as a fast pre-check (Step 2.5) to skip Bedrock calls when
    Nginx has just been started by a previous remediation cycle —
    avoiding unnecessary cost and preventing remediation loops.
    """
    shutdown_signals = [
        'sigquit',
        'sigterm',
        'shutting down',
        'gracefully shutting down',
        'worker process exited',    # specific — avoids matching startup "start worker process" lines
        'exited with code',
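        '[error]',
        '[crit]',
        '[alert]',
        '[emerg]',                  # e.g. "[emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)"
        'connection refused',
        'no space left',
        'bind() to',                # Nginx writes "bind() to <addr> failed", so 'bind() failed' never matches
        'open() "',                 # Nginx writes 'open() "<path>" failed'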
    ]
    log_lower = log_text.lower()

    for signal in shutdown_signals:
        if signal in log_lower:
            print(f"🔍 Pre-check 2.5: shutdown/error signal detected → '{signal}' — proceeding to Bedrock")
            return False

    print("✅ Pre-check 2.5 passed: startup-only logs, no shutdown signals detected")
    return True


def detect_service_from_logs(log_text):
    """
    Identify which service is referenced in the log content.

    This function is intentionally generic so it can be extended to
    support additional services (apache2, mysql, node, pm2, java, etc.)
    without modifying the main handler.

    Returns the service name as a lowercase string, or None if unknown.
    """
    log_lower = log_text.lower()

    # Nginx — matched via process signatures and shutdown keywords
    nginx_signals = [
        'nginx/',                    # e.g. "nginx/1.30.2"
        'nginx:',                    # e.g. "nginx: configuration file"
        'nginx[',                    # e.g. "nginx[12345]"
        'signal 3 (sigquit)',
        'signal 15 (sigterm)',
        'gracefully shutting down',
        'worker process exited',
        'exited with code'
    ]
    for signal in nginx_signals:
        if signal in log_lower:
            return 'nginx'

    # ── Add additional service detectors below as needed ──
    # Example:
    # apache_signals = ['apache2', 'httpd', 'apachectl']
    # for signal in apache_signals:
    #     if signal in log_lower:
    #         return 'apache2'

    return None


def detect_affected_service(remediation_action):
    """
    Map a Bedrock-returned remediation action key to the name of the
    affected service. Used to cross-check the maintenance list after
    Bedrock analysis (Step 5).

    Extend this map when new service action keys are added to
    REMEDIATION_ACTIONS.
    """
    service_map = {
        'nginx_service_stopped': 'nginx',
        'nginx_service_failed':  'nginx',
        'nginx_config_error':    'nginx',
        'nginx_port_conflict':   'nginx',
        'general_restart':       'nginx',
        'disk_full':             None,   # not service-specific
        'permission_error':      None,   # not service-specific
        'none':                  None
    }
    return service_map.get(remediation_action, None)


# ═════════════════════════════════════════════
# BEDROCK ANALYSIS
# ═════════════════════════════════════════════

def analyze_with_bedrock(log_text, instance_id):
    """
    Send Nginx error logs to Amazon Bedrock (Claude) for AI-powered
    root-cause analysis and remediation recommendation.

    The prompt uses explicit, rule-based instructions to ensure
    deterministic JSON output suitable for automated processing.
    """
    prompt = f"""You are an automated self-healing infrastructure system analyzing Nginx logs.
Your job is to detect if Nginx is DOWN and prescribe the correct remediation action.

EC2 Instance: {instance_id or 'unknown'}

Nginx Logs to Analyze:
<logs>
{log_text}
</logs>

─── MANDATORY DECISION RULES ───

RULE 1 — Nginx is STOPPED → remediation_action = "nginx_service_stopped"
  These log signals ALWAYS mean Nginx has stopped — no exceptions:
  • "signal 3 (SIGQUIT) received"
  • "signal 15 (SIGTERM) received"
  • "gracefully shutting down"
  • "worker process exited with code 0"
  • "worker process N exited"
  • "exit" appearing after any shutdown signal
  • Any combination of the above

  ⚠️  IMPORTANT: Do NOT consider whether the shutdown was graceful, intentional,
  or triggered by systemd/init. If Nginx has stopped → always use "nginx_service_stopped".
  This is an automated system. Nginx must always be running.

RULE 2 — Nginx FAILED or CRASHED → remediation_action = "nginx_service_failed"
  • "failed to start" / "start request repeated too quickly"
  • Process exited with non-zero code

RULE 3 — Config Error → remediation_action = "nginx_config_error"
  • "nginx: configuration file ... test failed"
  • "unknown directive" / "invalid parameter"

RULE 4 — Port Conflict → remediation_action = "nginx_port_conflict"
  • "bind() to 0.0.0.0:80 failed (98: Address already in use)"

RULE 5 — Disk Full → remediation_action = "disk_full"
  • "No space left on device"

RULE 6 — Permission Error → remediation_action = "permission_error"
  • "permission denied" on log files or pid files

RULE 7 — remediation_action = "none" ONLY when ALL of these are true:
  • Nginx startup lines present (e.g. "start worker process", "using the epoll event method")
  • ZERO shutdown/exit/SIGQUIT/SIGTERM signals in logs
  • Service is confirmed running with active worker processes

─── OUTPUT FORMAT ───

Respond ONLY with this exact JSON (no markdown, no explanation, no extra text):
{{
  "issue": "one line summary of what happened",
  "root_cause": "technical explanation of the root cause",
  "severity": "LOW|MEDIUM|HIGH|CRITICAL",
  "fix": "exact command or steps to fix this",
  "remediation_action": "nginx_service_stopped|nginx_service_failed|nginx_config_error|nginx_port_conflict|disk_full|permission_error|general_restart|none",
  "prevention": "steps to prevent this in future",
  "estimated_impact": "which users or services are affected"
}}

Severity guide:
  CRITICAL = complete outage, no recovery possible without intervention
  HIGH     = service is down, auto-remediation required immediately
  MEDIUM   = degraded performance or partial failure
  LOW      = informational only, service is healthy"""

    raw_text = ""  # defined up front so the JSONDecodeError handler below can always read it
    try:
        response = bedrock.invoke_model(
            modelId=MODEL_ID,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1000,
                "messages": [{"role": "user", "content": prompt}]
            }),
            contentType='application/json',
            accept='application/json'
        )

        result   = json.loads(response['body'].read())
        raw_text = result['content'][0]['text'].strip()

        # Strip markdown code fences if present (defensive parsing)
        raw_text = re.sub(r'```json|```', '', raw_text).strip()
        analysis = json.loads(raw_text)
        analysis['analyzed_at'] = datetime.utcnow().isoformat()
        analysis['model_used']  = MODEL_ID
        return analysis

    except json.JSONDecodeError as e:
        print(f"⚠️  JSON parse error from Bedrock response: {e}")
        # Return a safe fallback so the pipeline can still attempt remediation
        return {
            "issue": "Log Analysis Completed — JSON parse error",
            "root_cause": raw_text[:500],
            "severity": "MEDIUM",
            "fix": "Manual review required",
            "remediation_action": "general_restart",
            "prevention": "Review logs manually",
            "estimated_impact": "Unknown",
            "analyzed_at": datetime.utcnow().isoformat()
        }


# ═════════════════════════════════════════════
# SSM AUTO-REMEDIATION
# ═════════════════════════════════════════════

def perform_remediation(instance_id, analysis):
    """
    Execute the recommended fix on the target EC2 instance using
    AWS Systems Manager (SSM) Run Command — no SSH or key pairs required.

    Steps:
      1. Validate that the instance is reachable via SSM.
      2. Send the shell command from REMEDIATION_ACTIONS.
      3. Wait up to 60 s for the command to complete.
      4. Return a structured result including exit code and stdout.
    """
    action_key  = analysis.get('remediation_action', 'general_restart')

    if action_key == 'none':
        return {
            "status": "skipped",
            "reason": "Bedrock determined no remediation is needed"
        }

    action      = REMEDIATION_ACTIONS.get(action_key, REMEDIATION_ACTIONS['general_restart'])
    command     = action['command']
    description = action['description']

    print(f"🔧 Remediation action : {action_key}")
    print(f"💻 Command            : {command}")

    try:
        if not is_instance_ssm_ready(instance_id):
            return {
                "status": "failed",
                "reason": "Instance is not reachable via SSM",
                "action_attempted": action_key
            }

        response = ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName='AWS-RunShellScript',
            Parameters={
                'commands': [
                    command,
                    'echo "--- Post-Fix Status ---"',
                    'sudo systemctl is-active nginx && echo "NGINX_STATUS: RUNNING" || echo "NGINX_STATUS: STOPPED"',
                    'echo "--- Nginx Process ---"',
                    'ps aux | grep nginx | grep -v grep || echo "No nginx process found"'
                ],
                'executionTimeout': ['120']
            },
            Comment=f"AIOps Auto-Remediation: {description}",
            TimeoutSeconds=300
        )

        command_id = response['Command']['CommandId']
        print(f"✅ SSM Command sent: {command_id}")

        result = wait_for_ssm_command(command_id, instance_id)

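        # Report success only when SSM reports a terminal 'Success' status;
        # a failed, cancelled, or timed-out command must not be logged as a fix.
        return {
            "status": "success" if result.get('status') == 'Success' else "failed",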
            "action_taken": action_key,
            "description": description,
            "command_executed": command,
            "ssm_command_id": command_id,
            "output": result.get('output', '')[:500],
            "exit_code": result.get('exit_code', 'unknown'),
            "executed_at": datetime.utcnow().isoformat()
        }

    except Exception as e:
        print(f"❌ SSM remediation failed: {str(e)}")
        return {
            "status": "failed",
            "action_attempted": action_key,
            "error": str(e),
            "executed_at": datetime.utcnow().isoformat()
        }


def is_instance_ssm_ready(instance_id):
    """
    Check whether the target EC2 instance is registered with SSM
    and currently reachable (PingStatus == 'Online').
    """
    try:
        response  = ssm.describe_instance_information(
            Filters=[{'Key': 'InstanceIds', 'Values': [instance_id]}]
        )
        instances = response.get('InstanceInformationList', [])
        if instances:
            status = instances[0].get('PingStatus', '')
            print(f"📡 SSM Ping Status: {status}")
            return status == 'Online'
        return False
    except Exception as e:
        print(f"⚠️  SSM readiness check failed: {e}")
        return False


def wait_for_ssm_command(command_id, instance_id, max_wait=60):
    """
    Poll SSM for the result of a Run Command invocation.
    Returns as soon as a terminal status is reached or max_wait seconds elapse.
    """
    start = time.time()

    while time.time() - start < max_wait:
        try:
            response = ssm.get_command_invocation(
                CommandId=command_id,
                InstanceId=instance_id
            )
            status = response['Status']

            if status in ['Success', 'Failed', 'Cancelled', 'TimedOut']:
                return {
                    "status":    status,
                    "output":    response.get('StandardOutputContent', ''),
                    "error":     response.get('StandardErrorContent', ''),
                    "exit_code": response.get('ResponseCode', -1)
                }

            print(f"⏳ SSM command status: {status} — waiting...")
            time.sleep(5)

        except ssm.exceptions.InvocationDoesNotExist:
            # Command not yet registered on the SSM side — retry shortly
            time.sleep(3)

    return {"status": "timeout", "output": "Command timed out", "exit_code": -1}


# ═════════════════════════════════════════════
# CLOUDWATCH LOGGING HELPERS
# ═════════════════════════════════════════════

def log_analysis_to_cloudwatch(analysis, instance_id, original_logs):
    """
    Persist the Bedrock analysis result to a dedicated CloudWatch log group
    (/aiops/lambda/nginx-analysis) for audit, dashboarding, and future review.

    Log stream naming convention: <instance-id>/YYYY/MM/DD
    """
    log_stream = f"{instance_id or 'unknown'}/{datetime.utcnow().strftime('%Y/%m/%d')}"

    # Create log group and stream if they do not already exist
    for create_fn, kwargs in [
        (logs.create_log_group,  {'logGroupName': ANALYSIS_LOG_GROUP}),
        (logs.create_log_stream, {'logGroupName': ANALYSIS_LOG_GROUP, 'logStreamName': log_stream}),
    ]:
        try:
            create_fn(**kwargs)
        except logs.exceptions.ResourceAlreadyExistsException:
            pass

    log_entry = {
        "timestamp":           datetime.utcnow().isoformat(),
        "instance_id":         instance_id,
        "analysis":            analysis,
        "original_log_sample": original_logs[:300]
    }

    try:
        logs.put_log_events(
            logGroupName=ANALYSIS_LOG_GROUP,
            logStreamName=log_stream,
            logEvents=[{
                'timestamp': int(datetime.utcnow().timestamp() * 1000),
                'message':   json.dumps(log_entry, default=str)
            }]
        )
        print(f"📝 Analysis logged to: {ANALYSIS_LOG_GROUP}/{log_stream}")
    except Exception as e:
        print(f"⚠️  Failed to write analysis to CloudWatch: {e}")


def log_remediation_result(remediation_result, instance_id):
    """
    Print a structured remediation summary to the Lambda log stream
    for observability and debugging.
    """
    print("📊 Remediation Summary:")
    print(f"   Status    : {remediation_result.get('status')}")
    print(f"   Action    : {remediation_result.get('action_taken', remediation_result.get('action_attempted', 'N/A'))}")
    print(f"   Exit Code : {remediation_result.get('exit_code', 'N/A')}")
    if remediation_result.get('output'):
        print(f"   Output    : {remediation_result['output'][:200]}")
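
To exercise the handler without a live subscription filter, a test event can be built locally in the same base64 plus gzip envelope that CloudWatch Logs delivers. The sketch below is an assumption about how one might test it, not part of the repository: make_test_event and decode_event are hypothetical helpers, and the instance ID and timestamps are placeholders.

```python
import base64
import gzip
import json


def make_test_event(messages, log_group="/aiops/ec2/nginx/error-logs",
                    log_stream="i-0123456789abcdef0/nginx-error"):
    """Build a fake CloudWatch Logs subscription event for local testing."""
    payload = {
        "messageType": "DATA_MESSAGE",
        "logGroup": log_group,
        "logStream": log_stream,  # placeholder instance ID in the stream name
        "logEvents": [
            {"id": str(i), "timestamp": 0, "message": message}
            for i, message in enumerate(messages)
        ],
    }
    compressed = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(compressed).decode("ascii")}}


def decode_event(event):
    """Apply the same decoding steps as decode_cloudwatch_logs above."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


event = make_test_event(["signal 3 (SIGQUIT) received from 1, shutting down"])
decoded = decode_event(event)
print(decoded["logStream"], len(decoded["logEvents"]))
```

Passing such an event to lambda_handler avoids the 400 response that the 'awslogs' guard returns for hand-written test payloads.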

Lambda-AIOps-Policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SSMRunCommand",
            "Effect": "Allow",
            "Action": [
                "ssm:SendCommand",
                "ssm:GetCommandInvocation",
                "ssm:ListCommandInvocations",
                "ssm:DescribeInstanceInformation"
            ],
            "Resource": "*"
        },
        {
            "Sid": "EC2Describe",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceStatus"
            ],
            "Resource": "*"
        }
    ]
}

Amazon Bedrock (Claude Sonnet): Claude analyzes the logs and returns structured JSON with issue, root cause, severity, fix, and a remediation_action key that maps directly to a shell command.

AWS SSM Run Command: No SSH. No key pairs. Lambda sends the remediation command directly to the EC2 instance via SSM. The output (including post-fix Nginx status) is captured and logged.
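
Because the remediation_action key decides which shell command runs, it can be worth validating the model output against the known keys before acting on it. The following is a sketch of one possible hardening step, not what the function above does: parse_analysis is a hypothetical helper, and it fails closed to "none", whereas the function above falls back to general_restart.

```python
import json

# Keys copied from REMEDIATION_ACTIONS, plus "none".
ALLOWED_ACTIONS = {
    "nginx_service_stopped", "nginx_service_failed", "nginx_config_error",
    "nginx_port_conflict", "disk_full", "permission_error",
    "general_restart", "none",
}


def parse_analysis(raw_text):
    """Parse the model reply and guarantee a known remediation_action.

    Unparseable replies and unknown action keys are mapped to "none",
    so no command is ever chosen from unvalidated text.
    """
    fallback = {"issue": "unparseable model output", "remediation_action": "none"}
    try:
        analysis = json.loads(raw_text)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(analysis, dict):
        return fallback
    if analysis.get("remediation_action") not in ALLOWED_ACTIONS:
        analysis["remediation_action"] = "none"
    return analysis


print(parse_analysis('{"remediation_action": "nginx_service_stopped"}'))
print(parse_analysis('{"remediation_action": "something_else"}'))
```

Whether to fail closed or to restart by default is a design choice; failing closed trades a possible missed recovery for never restarting a service on a malformed reply.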

A Smart Feature: Maintenance Mode

One of my favorite parts of this design is Maintenance Mode.

If you need to stop Nginx intentionally — for a deployment, an OS patch, or a config change — you do not want the system restarting it behind your back.

The solution is a single Lambda environment variable:

MAINTENANCE_MODE_SERVICES = nginx

When this is set, the Lambda function detects that the affected service is under planned maintenance and exits immediately — before even calling Bedrock. Zero analysis cost. Zero unintended restarts.

To re-enable auto-remediation, clear the variable.

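This pattern is designed to be generic. Set the variable to nginx,mysql,apache2 and the same check applies to each service, provided detect_service_from_logs has a detector for it (the version shown here only detects Nginx).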
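
One way the maintenance check could be made table-driven, so that new services only need a new entry, is sketched below. This is an illustration under assumptions: the helper names are hypothetical, and the signal strings for apache2 and mysql are guesses that would need checking against real logs.

```python
# Substrings that identify a service in lowercase log text.
# The apache2 and mysql entries are illustrative assumptions.
SERVICE_SIGNALS = {
    "nginx": ["nginx/", "nginx:", "nginx["],
    "apache2": ["apache2", "httpd"],
    "mysql": ["mysqld"],
}


def parse_maintenance_list(value):
    """Turn 'nginx, MySQL' into ['nginx', 'mysql']; '' or 'none' means no maintenance."""
    value = (value or "").strip()
    if not value or value.lower() == "none":
        return []
    return [item.strip().lower() for item in value.split(",") if item.strip()]


def detect_service(log_text):
    """Return the first service whose signals appear in the logs, else None."""
    log_lower = log_text.lower()
    for service, signals in SERVICE_SIGNALS.items():
        if any(signal in log_lower for signal in signals):
            return service
    return None


def in_maintenance(log_text, env_value):
    """True if the service seen in the logs is listed in the maintenance variable."""
    service = detect_service(log_text)
    return service is not None and service in parse_maintenance_list(env_value)


print(in_maintenance("nginx[123]: shutting down", "nginx,mysql"))
print(in_maintenance("mysqld: Shutdown complete", "nginx"))
```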

What the Logs Look Like
Here is the Lambda log output from a real test run:

When Nginx stops unexpectedly:

🚀 AIOps Self-Healing Lambda triggered
📋 Events received: 11
🖥️  EC2 Instance ID: i-0496654c0724d703f
🔍 Pre-check: shutdown signal found → 'sigquit'
🤖 Sending logs to Bedrock for analysis...
✅ Bedrock: { "severity": "CRITICAL", "remediation_action": "nginx_service_stopped" }
📡 SSM Ping Status: Online
✅ SSM Command sent: cmd-xxxxxxxxxxxxxxxxx
📊 Status: success | Exit Code: 0
   Output: ● nginx.service ... Active: active (running)

When Nginx starts back up (second trigger — skipped automatically):

✅ Pre-check 2.5 passed: startup-only logs, no shutdown signals detected
✅ Startup-only logs detected — service is healthy, skipping Bedrock

The system detects that the second trigger is just Nginx startup noise and exits without making any Bedrock call.

Key Lessons from Building This

Never assume log levels. I wasted time debugging why the subscription filter was not triggering — the filter was set to ERROR, but Nginx shutdown logs are [notice]. Always verify what level a service actually logs at.
LLMs need explicit rules, not hints. In early iterations, Bedrock correctly identified a graceful Nginx shutdown as intentional and returned remediation_action: "none". Technically correct — operationally wrong. The fix was a prompt with mandatory, signal-based rules: if SIGQUIT is present, always return nginx_service_stopped.
Region matters more than you think. The Lambda was in eu-central-1, but SSM and EC2 clients were hardcoded to us-east-1. Everything appeared to work until the SSM command silently failed. Always derive region from the Lambda runtime environment variable.
SSH-less design is the right design. Using SSM Run Command instead of SSH eliminates key pair management, improves security posture, and makes the system cleaner. If you are building automation on AWS, there is rarely a reason to use SSH.
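
For the first lesson, a quick way to verify which levels a service actually logs at is to count them in a sample of the log before choosing a filter pattern. The sketch below assumes the standard Nginx error-log prefix (date, time, then the level in square brackets); the sample lines are made up for illustration, and count_levels is a hypothetical helper.

```python
import re
from collections import Counter

# Nginx error-log lines start with: YYYY/MM/DD HH:MM:SS [level] pid#tid: message
LEVEL_RE = re.compile(r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \[(\w+)\]")


def count_levels(lines):
    """Count how many lines were logged at each level."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        counts[match.group(1) if match else "unparsed"] += 1
    return counts


# Made-up sample lines in the Nginx error-log format.
sample = [
    "2026/06/29 07:40:01 [notice] 1234#1234: signal 3 (SIGQUIT) received from 1, shutting down",
    "2026/06/29 07:40:01 [notice] 1234#1234: gracefully shutting down",
    "2026/06/29 07:41:10 [emerg] 1300#1300: bind() to 0.0.0.0:80 failed (98: Address already in use)",
]
levels = count_levels(sample)
print(dict(levels))
```

In this sample no line is logged at the error level, so a filter that only matches errors would forward nothing.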

Try It Yourself

The complete source code — Lambda function and README with full setup instructions — is available on GitHub:

🔗 github.com/engr-usman/aws-aiops-demo-repo

The README covers everything: EC2 setup, CloudWatch Agent configuration, IAM roles, Lambda deployment, and test cases.

What Is Next
This is Demo 3 in an ongoing AIOps series. Upcoming additions include:

SNS/email alerts on remediation events
CloudWatch Dashboard for live pipeline visibility
Support for additional services (Apache, MySQL, Node.js)
Multi-instance support

If you found this useful, follow me for more hands-on AWS and AI content. Questions or suggestions? Drop them in the comments.

Difference between AWS VPC Peering and AWS Transit Gateway

Usman Ahmad — Thu, 13 Apr 2023 07:07:50 +0000

AWS VPC Peering and AWS Transit Gateway are two different ways to connect multiple Virtual Private Clouds (VPCs) within an AWS environment.

VPC Peering allows you to connect two VPCs within the same AWS account or across different AWS accounts, using private IP addresses. It provides a direct network connection between the VPCs, allowing them to communicate with each other securely and efficiently. VPC peering is suitable for scenarios where you need to connect a few VPCs and have a simple network topology.

AWS Transit Gateway, on the other hand, is a fully managed service that provides a centralized hub for connecting multiple VPCs and on-premises networks. It simplifies network management by allowing you to create a single transit gateway and attach multiple VPCs and VPN connections to it. This eliminates the need for creating multiple VPC peering connections, which can be difficult to manage and scale as the number of VPCs grows.
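To put a number on the scaling difference: VPC peering is not transitive, so letting every VPC talk to every other VPC needs one peering connection per pair, while a transit gateway needs one attachment per VPC. The small sketch below only does that arithmetic; the function names are made up for illustration.

```python
def full_mesh_peering_connections(vpc_count):
    """Peering connections needed so that every VPC reaches every other VPC."""
    if vpc_count < 0:
        raise ValueError("vpc_count must not be negative")
    # One connection per unordered pair of VPCs: n * (n - 1) / 2.
    return vpc_count * (vpc_count - 1) // 2


def transit_gateway_attachments(vpc_count):
    """Attachments needed when every VPC connects to one transit gateway."""
    if vpc_count < 0:
        raise ValueError("vpc_count must not be negative")
    return vpc_count


if __name__ == "__main__":
    for n in (3, 10, 50):
        print(n, full_mesh_peering_connections(n), transit_gateway_attachments(n))
```

For 10 VPCs that is 45 peering connections against 10 attachments, which is where the management overhead of a peering mesh starts to show.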

In summary, AWS VPC peering is suitable for connecting a few VPCs with a simple network topology, while AWS Transit Gateway is designed for managing complex network topologies with multiple VPCs and on-premises networks.

Modify the AWS RDS Instance size using Lambda Function

Usman Ahmad — Thu, 30 Mar 2023 06:20:48 +0000

This article shows how to use an AWS Lambda function, written in Python, to change the instance class of an RDS instance. You do not need to stop the instance first, although you should expect a short interruption while the new class is applied.

Let’s walk through the steps.

Steps to create AWS Lambda Function for AWS RDS Instance class

We use the following steps to configure the Lambda function.

Step1: Create an IAM Policy

The first step is to create an IAM policy that grants access to the required RDS actions and to AWS CloudWatch log events.

Navigate to IAM in the services and click on Policies => Create Policy.
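The policy document itself is not reproduced in the text. As a minimal sketch, assuming the function only needs to modify and describe RDS instances and write its own logs, it could look like the document generated below. The action list, the Sid values and the wildcard resources are assumptions; narrow the resources to your own ARNs where you can.

```python
import json


def build_policy():
    """Build a sketch of the Lambda_RDS_modification_policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ModifyRdsInstance",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["rds:ModifyDBInstance", "rds:DescribeDBInstances"],
                "Resource": "*",  # assumption: restrict to your DB instance ARN
            },
            {
                "Sid": "WriteLambdaLogs",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Paste the printed JSON into the policy editor in the IAM console.
    print(json.dumps(build_policy(), indent=2))
```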

Step2: Create an IAM Role and attach “Lambda_RDS_modification_policy”

In this step, we create an IAM role and attach the policy created in the previous step. Click on Roles -> Create Role:

Step3: Create an AWS Lambda Function

Now we will create the AWS Lambda function that modifies the RDS instance class. Select “Author from scratch”, then set Function Name -> Runtime (Python 3.7 or 3.8) -> Existing Role “RDS_Lambda_Role”.

Add inline policy in existing IAM Role

Now open the IAM role in a new tab and edit the existing role RDS_Lambda. On the Summary page, click Add Inline Policy.

In the inline policy editor, paste the following JSON. Note that the Resource section contains the ARN of the Lambda function; replace it with the ARN of your own function.

A Lambda function ARN follows the format: arn:aws:lambda:Region:AWS-Account-ID:function:Function-Name

{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Action": "lambda:GetFunctionConfiguration",
         "Resource": "arn:aws:lambda:us-east-1:111111111111:function:RDS_Instance_Modification_Function"
      }
   ]
}

Step4: Function Code: Scroll down and paste the Python code into the editor. Make sure the runtime matches the language of the code; I use Python 3.8 here.

Python Code:

Here we are using 2 environment variables:

DBInstanceName
DBinstanceClass

import json

import boto3
from botocore.exceptions import ClientError


def lambda_handler(event, context):
    rds = boto3.client('rds')
    lambda_client = boto3.client('lambda')
    print('Trying to get environment variables')
    try:
        # Read this function's own configuration to get its environment variables.
        func_response = lambda_client.get_function_configuration(
            FunctionName='RDS_Instance_Modification_Function'
        )
        variables = func_response['Environment']['Variables']
        db_instance = variables['DBInstanceName']
        db_instance_class = variables['DBinstanceClass']

        print(f'Modifying RDS instance: {db_instance}')
        print(f'Target RDS instance class: {db_instance_class}')

        # ApplyImmediately=True applies the change now instead of waiting
        # for the next maintenance window.
        response = rds.modify_db_instance(
            DBInstanceIdentifier=db_instance,
            DBInstanceClass=db_instance_class,
            ApplyImmediately=True,
        )

        print(f'Success :: {response}')
        return json.dumps(dict(status='modification requested', instance=db_instance))
    except ClientError as e:
        # Return the AWS error message instead of failing the invocation.
        return json.dumps(dict(error=str(e)))

Creating Environment Variables:

Step5: I have already created an RDS instance named testdb for testing

Step6: Now we will test the Lambda Function

Click on the “Test” button

The first time you run it, “DBinstanceClass” is still set to “db.t2.micro”, the class the instance already has, so the result will be like:

Now I change “DBinstanceClass” to “db.t2.small”, so this time the function modifies the class of the AWS RDS instance “testdb” to “db.t2.small”

You will get the following logs after running the Lambda function

Here you can see that the RDS instance is now in the “modifying” status. It takes some time to finish, as the instance class is being changed from “db.t2.micro” to “db.t2.small”

Final result: after the modification, the instance class is “db.t2.small” and the status is “Available”
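Instead of refreshing the console, the final state can also be checked from a script. The sketch below uses the boto3 RDS waiter db_instance_available and then reads back the class and status. The function name and its defaults are assumptions, and it is meant to run from a workstation rather than inside the Lambda function, since a class change can take longer than a Lambda timeout allows. Note that the status may still read "available" for a short moment right after the modify call, so the waiter can return early if it is started immediately.

```python
def wait_until_available(rds_client, db_instance_id, delay=30, max_attempts=60):
    """Block until the instance is available, then return (class, status).

    rds_client is expected to be a boto3 RDS client, e.g. boto3.client("rds").
    """
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(
        DBInstanceIdentifier=db_instance_id,
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
    # Read the instance back to confirm which class it ended up with.
    description = rds_client.describe_db_instances(
        DBInstanceIdentifier=db_instance_id
    )
    instance = description["DBInstances"][0]
    return instance["DBInstanceClass"], instance["DBInstanceStatus"]
```

With the instance from this article, wait_until_available(boto3.client("rds"), "testdb") would be expected to return the new class together with the status "available" once the change has finished.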

How to use Kubernetes Secret to pull private Docker Images from DockerHub

Usman Ahmad — Mon, 27 Mar 2023 09:26:33 +0000

In this article, you will learn how to pull a private Docker image from Docker Hub using a Kubernetes Secret and create a Kubernetes Pod from that image.

Docker Hub:

Docker Hub is a hosted repository service provided by Docker for finding and sharing container images with your team. Key features include:

Private Repositories: Push and pull container images.
Automated Builds: Automatically build container images from GitHub and Bitbucket and push them to Docker Hub.

Kubernetes Secrets:

A Secret is an object that contains a small amount of sensitive data such as a password, a token, or a key. Such information might otherwise be put in a Pod specification or in a container image. Using a Secret means that you don’t need to include confidential data in your application code.

Example:

To use a secret to pull a private image from a container registry, add an “imagePullSecrets” field to your deployment or pod YAML file. Here’s an example:

Step1: Create a secret

kubectl create secret docker-registry my-registry-secret \
--docker-server=DOCKER_REGISTRY_SERVER \
--docker-username=DOCKER_USER \
--docker-password=DOCKER_PASSWORD \
--docker-email=DOCKER_EMAIL

Replace the DOCKER_REGISTRY_SERVER, DOCKER_USER, DOCKER_PASSWORD, and DOCKER_EMAIL with your container registry server address, username, password, and email respectively.
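It can help to see what this command stores. A docker-registry secret is a Secret of type kubernetes.io/dockerconfigjson whose single data key holds a base64-encoded Docker config. The sketch below builds an equivalent manifest by hand; the function name is made up for illustration, and the default server value assumes Docker Hub.

```python
import base64
import json

# Assumption: Docker Hub is the registry; use your own server address otherwise.
DOCKER_HUB_SERVER = "https://index.docker.io/v1/"


def build_registry_secret(name, username, password, email, server=DOCKER_HUB_SERVER):
    """Build a Secret manifest equivalent to a docker-registry secret."""
    # The "auth" field is base64 of "username:password".
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    docker_config = {
        "auths": {
            server: {
                "username": username,
                "password": password,
                "email": email,
                "auth": auth,
            }
        }
    }
    # Secret data values must themselves be base64-encoded.
    encoded = base64.b64encode(json.dumps(docker_config).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }
```

This also shows that the credentials are only encoded, not encrypted, so access to the secret should be restricted like access to the password itself.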

Step2: My Dockerhub account, where I have my private docker image

Step3: Create a deployment file with “imagePullSecrets”

Modify your deployment or pod YAML file to include the imagePullSecrets field:

In this example, we added the imagePullSecrets field to the deployment YAML file, and set the value to the name of the secret we created in step 1 (my-registry-secret). Kubernetes will use this secret to authenticate with the container registry when pulling the private-registry/my-image image.

When you apply the modified YAML file to your cluster, Kubernetes will use the specified secret to authenticate with the container registry and pull the private image.
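The deployment file itself is not reproduced in the text. As a minimal sketch of where imagePullSecrets sits, the script below builds a Deployment manifest and prints it as JSON, which kubectl apply -f accepts just like YAML. The deployment name my-app and the helper name are assumptions; the secret name and image name are the ones used in this article.

```python
import json


def build_deployment(name, image, pull_secret, replicas=1):
    """Build a Deployment manifest that pulls its image with a registry secret."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    # imagePullSecrets lives in the Pod spec, next to containers.
                    "imagePullSecrets": [{"name": pull_secret}],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_deployment(
        "my-app", "private-registry/my-image", "my-registry-secret"
    )
    print(json.dumps(manifest, indent=2))
```

The secret must exist in the same namespace as the deployment, otherwise the image pull still fails with an authentication error.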

Step4: Final result

For this article I am using a “minikube” cluster. As you can see, before creating the deployment the docker image “usm87/jenkins-cicd-maven-project:v4” is not present on the cluster.

After creating the deployment, below are the Pod event logs

Now you can see we have the docker image “usm87/jenkins-cicd-maven-project:v4” pulled from the docker hub successfully.

WordPress Installation on the AWS Ubuntu 20.04 Instance

Usman Ahmad — Fri, 13 Jan 2023 14:10:54 +0000

Steps:

AWS Account
Ubuntu 20.04 instance
Prerequisite to setup WordPress
Final verification of WordPress website

AWS Account: You need an AWS account to perform this task.

We are using Ubuntu 20.04 OS to set up WordPress.

If you are setting up WordPress just for practice, a “t2.micro” instance is fine for this work. If you plan to use this instance for a real website, choose an instance type that matches the expected load and traffic on your website.

The Configure Instance and Add Storage options can remain at their defaults, but if you are using this instance for your actual website you should follow AWS best practices.
In Configure Security Group, open SSH and HTTP. HTTP can be open to the world; as a best practice, allow SSH (port 22) only from your own IP address and remove that ingress rule once you have finished your work.

Now review and launch.
SSH into your instance using PuTTY or the command line.

Prerequisite to setup WordPress

Install php with all packages
Install apache2
Install mysql-server

Commands: Run the commands below

sudo apt update -y
sudo apt install -y php libapache2-mod-php php-mysql php-redis
sudo apt install -y php-curl php-gd php-mbstring php-xml php-xmlrpc php-soap php-intl php-zip
sudo apt install -y apache2
sudo apt install -y mysql-server
sudo systemctl enable apache2 mysql
wget -c https://wordpress.org/latest.tar.gz
tar -xzvf latest.tar.gz
sudo cp -R wordpress /var/www/
sudo chown -R www-data:www-data /var/www/wordpress
sudo chmod -R 775 /var/www/wordpress
sudo mysql -u root

Now we create the database and a user with a password

CREATE DATABASE wordpress;
CREATE USER 'wpuser'@'localhost' IDENTIFIED BY 'wppassword';
GRANT ALL PRIVILEGES ON wordpress.* TO 'wpuser'@'localhost';
FLUSH PRIVILEGES;
exit

Now we will create a .conf file inside sites-available

sudo nano /etc/apache2/sites-available/wordpress.conf

Now add the below details to it

<VirtualHost *:80>
    DocumentRoot /var/www/wordpress
    <Directory /var/www/wordpress/>
        AllowOverride All
    </Directory>
</VirtualHost>

Now run below commands

sudo a2ensite wordpress
sudo a2dissite 000-default
sudo a2enmod rewrite

Now test the configuration

sudo apache2ctl configtest

Result:

Syntax OK

Now restart apache2

sudo systemctl restart apache2

Now access the “wp-config.php” file and add the database credentials
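A fresh WordPress download ships wp-config-sample.php, which contains the placeholders database_name_here, username_here and password_here. One way to produce wp-config.php is to copy the sample and fill in those three values. The sketch below does only the text replacement; the function name is made up for illustration, and it assumes the values contain no single quotes or backslashes, which would need escaping in PHP.

```python
def fill_db_settings(sample_text, db_name, db_user, db_password):
    """Replace the database placeholders from wp-config-sample.php."""
    replacements = {
        "database_name_here": db_name,
        "username_here": db_user,
        "password_here": db_password,
    }
    for placeholder, value in replacements.items():
        if placeholder not in sample_text:
            # Fail loudly rather than write a config with missing credentials.
            raise ValueError(f"placeholder not found: {placeholder}")
        sample_text = sample_text.replace(placeholder, value)
    return sample_text
```

To use it, read /var/www/wordpress/wp-config-sample.php, pass its text together with the database name, user and password created above (wordpress, wpuser, wppassword), and write the result to /var/www/wordpress/wp-config.php.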

Now we will access the WordPress website through the public IP address of our EC2 instance

Here we have our WordPress website

Now you just need to configure it with your website details and enjoy :)