In this article, we’ll show how to automate EBS volume upsizing using CloudWatch alarms, SNS notifications, AWS Systems Manager runbooks, and Lambda. By the end, we will have a hands-free setup that detects when volumes are running low on space, triggers an automated workflow, and expands them without downtime, keeping our workloads running smoothly.
The Problem
This solution was created to fix a specific problem: the SAP workloads running on our EC2 instances sometimes run out of disk space, and we keep a close eye on costs, so overprovisioning storage up front is not an option. The challenge is that not all disks are mounted on the same mount point: some are under /hana/data, others are under /usr/. Also, AWS Lambda currently has no way or library to see which volumes exist inside our EC2 instances or where they are mounted. We can certainly increase the size of an EBS volume; however, we also need to extend that storage space inside the instance. For this, we created the following automation, which can easily be adapted to other disks, not only SAP solutions
Requirements
- AWS account / AWS CLI
Walkthrough
Before we begin, let's make sure the EBS volume that we are going to monitor is attached to an EC2 instance that has the CloudWatch agent installed. This way, we can leverage the agent to collect metrics
Accessing through Session Manager
First, let's configure an EC2 instance so it has access to Systems Manager. In the AWS Console > Search for EC2 > Instances > Pick your instance > Security
If the instance doesn't have an instance profile role attached, go to Actions > Security > Modify IAM Role
Create a new role or use an existing one that has the policy AmazonSSMManagedInstanceCore
attached. This role enables the instance to communicate with the AWS Systems Manager API
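If you prefer the CLI, here is a rough sketch of the same setup; the role and instance-profile name (EC2SSMRole) and the instance ID are placeholders, not values from this article:
# Hypothetical names/IDs -- adapt to your environment
aws iam create-role \
  --role-name EC2SSMRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy \
  --role-name EC2SSMRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam create-instance-profile --instance-profile-name EC2SSMRole
aws iam add-role-to-instance-profile --instance-profile-name EC2SSMRole --role-name EC2SSMRole
aws ec2 associate-iam-instance-profile \
  --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=EC2SSMRole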
Note: The Systems Manager (SSM) Agent comes preinstalled on AMIs like AL2 and AL2023. If you are using another type of AMI, you should install the SSM Agent first
Now, if our EC2 instance is not public (it has neither a public nor an Elastic IP), we need to create VPC endpoints so it can reach Systems Manager privately. For this, in the AWS Console go to VPC > Endpoints > Create endpoint
Here, we need to create 3 interface endpoints in our VPC, in the subnets where our EC2 instances live (adjust the region to match yours):
com.amazonaws.us-east-1.ssm
com.amazonaws.us-east-1.ssmmessages
com.amazonaws.us-east-1.ec2messages
After picking your VPC and subnets, be sure to pick or create a security group that allows inbound port 443 with the instance's security group as the source; feel free to use the same SG for each endpoint
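If you'd rather script this, a sketch with the CLI could look like the following; all IDs are placeholders, and you repeat the command for ssmmessages and ec2messages:
# Create one interface endpoint per service
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ssm \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled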
After waiting a few minutes, this should be enough to have our instance ready! In the AWS Console search for EC2 > Connect > Session Manager
Installing the Amazon CloudWatch Agent on EC2 (AL2023)
We begin by downloading the RPM file
curl -O https://amazoncloudwatch-agent.s3.amazonaws.com/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo dnf install -y ./amazon-cloudwatch-agent.rpm
We then execute the configuration wizard; it will walk us through a series of steps to configure the agent on our instance
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
For the first option we pick Linux; the rest of the answers are below:
- On which OS are you planning to use the agent? 1. Linux
- Trying to fetch the default region based on EC2 metadata. 1. EC2
- Which user are you planning to run the agent? 1. cwagent
- Do you want to turn on StatsD daemon? 2. No
- Do you want to monitor metrics from CollectD? WARNING: CollectD must be installed or the Agent will fail to start. 2. No
- Do you want to monitor any host metrics? e.g. CPU, memory, etc. 1. Yes
- Do you want to monitor cpu metrics per core? 2. No (we are only focusing on storage here)
- Do you want to add ec2 dimensions (ImageId, InstanceId, InstanceType, AutoScalingGroupName) into all of your metrics if the info is available? 1. Yes
- Do you want to aggregate ec2 dimensions (InstanceId)? 2. No
- Would you like to collect your metrics at high resolution (sub-minute resolution)? This enables sub-minute resolution for all metrics, but you can customize for specific metrics in the output json file. 4. 60s
- Which default metrics config do you want? 2. Standard
Final output:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "metrics_collection_interval": 60,
        "totalcpu": false
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "diskio": {
        "measurement": [
          "io_time"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      },
      "swap": {
        "measurement": [
          "swap_used_percent"
        ],
        "metrics_collection_interval": 60
      }
    }
  }
}
Now let's make sure the CloudWatch agent is running by starting and enabling it
sudo systemctl start amazon-cloudwatch-agent
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl status amazon-cloudwatch-agent
If the agent fails to start with an error saying it cannot find its configuration:
The configuration the wizard generated was probably created under the bin directory; copy it into the etc folder
sudo cp /opt/aws/amazon-cloudwatch-agent/bin/config.json /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
After that, restart the agent
sudo systemctl restart amazon-cloudwatch-agent
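To confirm the agent picked up the configuration, you can also query its status directly:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status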
Note: If you want to retrieve metrics from inside the EC2 instance, be sure to have the cloudwatch:GetMetricStatistics permission in the instance role, or copy temporary credentials onto the instance
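As a quick check, assuming the instance role has that permission, a query like this from inside the instance should return datapoints; the dimension values are examples, and the metric must be queried with the same dimensions the agent published:
aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=<your-instance-id> Name=path,Value=/ Name=device,Value=nvme0n1p1 Name=fstype,Value=xfs \
  --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Average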
If you only want to track disk usage percentage, which is what this article aims for, this is the minimal config for that. Edit the file /opt/aws/amazon-cloudwatch-agent/bin/config.json
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      }
    }
  }
}
After that, reload the configuration with the following command
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 \
-c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
Now let's proceed to create a dashboard to actually track the disk usage. For this, in the AWS Console go to CloudWatch > Dashboards > Create dashboard
Inside the dashboard we are going to create a metric widget of type Gauge
In the Source tab, we are going to add the following code:
{
  "region": "us-east-1",
  "title": "Disk Used % (Exact dims, /, <instance-id>)",
  "view": "gauge",
  "stat": "Average",
  "period": 60,
  "setPeriodToTimeRange": true,
  "yAxis": {
    "left": {
      "min": 0,
      "max": 100
    }
  },
  "metrics": [
    [ "CWAgent", "disk_used_percent", "InstanceId", "<your-instance-id>", "path", "/", "device", "<device-name>", "fstype", "<fstype>", "ImageId", "<ami-id>", "InstanceType", "<instance-type>" ]
  ]
}
After adding the source, click on Save; you should see a gauge graphic showing your current disk usage
For CloudWatch, the metric must be queried with the exact set of dimensions it was published with, so include the instance dimensions shown above; if the agent isn't appending the EC2 dimensions, you can query with only the host dimension.
Since I'm tracking an instance with a single root volume, I only check the device name of the root; if you are using more than one volume, add the other attached devices as well
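If you are unsure which dimension values the agent published, you can list them instead of guessing:
# Lists every disk_used_percent metric the agent has pushed, with its exact dimensions
aws cloudwatch list-metrics --namespace CWAgent --metric-name disk_used_percent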
Now let's create our custom runbook. For this, in the AWS Console we search for Systems Manager; inside the left menu bar we go to Change Management Tools > Documents
Here we are going to create a document of type Command.
We can fill it like this:
Name: ExtendFilesystemRunbook
Target Type: --
Document Type: Command
For the content we are going to use YAML and paste the following code:
schemaVersion: "2.2"
description: "Robust filesystem extension with proper error handling"
parameters:
  device:
    type: "String"
    description: "Device name from CloudWatch"
    default: ""
  mountPath:
    type: "String"
    description: "Mount path from CloudWatch alarm"
    default: "/"
  executionTimeout:
    type: "String"
    default: "900"
mainSteps:
  - action: "aws:runShellScript"
    name: "extendFilesystem"
    inputs:
      timeoutSeconds: "{{ executionTimeout }}"
      runCommand:
        - "#!/bin/bash"
        - "set -e"
        - ""
        - "echo '=== Starting filesystem extension ==='"
        - "echo 'Device: {{ device }}'"
        - "echo 'Mount Path: {{ mountPath }}'"
        - ""
        - "# Install required tools"
        - "echo 'Installing required tools...'"
        - "sudo yum install -y parted cloud-utils-growpart util-linux"
        - ""
        - "# Wait for EBS changes and optimization"
        - "echo 'Waiting for EBS optimization to complete...'"
        - "sleep 60"
        - ""
        - "# Find the actual device first; the readiness loop below depends on it"
        - "echo 'Finding actual device for mount path {{ mountPath }}...'"
        - "ACTUAL_DEVICE=$(df '{{ mountPath }}' | tail -1 | awk '{print $1}')"
        - "echo \"Device found: $ACTUAL_DEVICE\""
        - ""
        - "# Check if volume is still optimizing"
        - "echo 'Checking volume status...'"
        - "for i in {1..10}; do"
        - "  echo \"Check $i/10 - waiting for volume optimization...\""
        - "  sleep 30"
        - "  # Try to read the device"
        - "  if sudo blockdev --getsize64 $ACTUAL_DEVICE > /dev/null 2>&1; then"
        - "    echo 'Volume appears ready'"
        - "    break"
        - "  fi"
        - "done"
        - ""
        - "# Show current filesystem info"
        - "echo 'Current filesystem info:'"
        - "lsblk $ACTUAL_DEVICE"
        - "blkid $ACTUAL_DEVICE"
        - ""
        - "# Extract base device and partition number"
        - "if [[ $ACTUAL_DEVICE =~ nvme.*p[0-9] ]]; then"
        - "  BASE_DEVICE=$(echo $ACTUAL_DEVICE | sed 's/p[0-9]*$//')"
        - "  PARTITION=$(echo $ACTUAL_DEVICE | sed 's/.*p//')"
        - "else"
        - "  BASE_DEVICE=$(echo $ACTUAL_DEVICE | sed 's/[0-9]*$//')"
        - "  PARTITION=$(echo $ACTUAL_DEVICE | sed 's/.*[^0-9]//')"
        - "fi"
        - ""
        - "echo \"Base device: $BASE_DEVICE\""
        - "echo \"Partition: $PARTITION\""
        - ""
        - "# Show partition table before"
        - "echo 'Partition table before:'"
        - "sudo fdisk -l $BASE_DEVICE"
        - ""
        - "# Grow the partition"
        - "echo 'Growing partition...'"
        - "sudo growpart $BASE_DEVICE $PARTITION"
        - ""
        - "# Show partition table after"
        - "echo 'Partition table after:'"
        - "sudo fdisk -l $BASE_DEVICE"
        - ""
        - "# Force kernel to re-read partition table"
        - "echo 'Forcing kernel to re-read partition table...'"
        - "sudo partprobe $BASE_DEVICE"
        - "sync"
        - ""
        - "# Get filesystem type"
        - "FS_TYPE=$(lsblk -no FSTYPE $ACTUAL_DEVICE)"
        - "echo \"Filesystem type: $FS_TYPE\""
        - ""
        - "# Extend filesystem based on type"
        - "case $FS_TYPE in"
        - "  'ext2'|'ext3'|'ext4')"
        - "    echo 'Extending ext filesystem...'"
        - "    sudo e2fsck -f $ACTUAL_DEVICE || echo 'e2fsck failed, continuing...'"
        - "    sudo resize2fs $ACTUAL_DEVICE"
        - "    ;;"
        - "  'xfs')"
        - "    echo 'Extending XFS filesystem...'"
        - "    sudo xfs_growfs '{{ mountPath }}'"
        - "    ;;"
        - "  *)"
        - "    echo \"Unknown or unsupported filesystem type: $FS_TYPE\""
        - "    echo 'Trying resize2fs anyway...'"
        - "    sudo resize2fs $ACTUAL_DEVICE || echo 'resize2fs failed'"
        - "    ;;"
        - "esac"
        - ""
        - "echo '=== Filesystem extension completed! ==='"
        - "echo 'Final result:'"
        - "df -h '{{ mountPath }}'"
        - "lsblk $ACTUAL_DEVICE"
This AWS Systems Manager runbook automates filesystem extension so the OS can use the newly added storage. It takes the device name and mount path reported by the CloudWatch alarm (plus an execution timeout), resolves the actual block device behind the mount path, grows the partition with growpart, and then extends the filesystem: resize2fs for ext2/3/4, xfs_growfs for XFS
Feel free to add tags if necessary, and then click on Create Document
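Before wiring everything together, you can smoke-test the runbook by invoking it manually; the instance ID and device below are placeholders:
# Manual test of the document against one instance
aws ssm send-command \
  --document-name ExtendFilesystemRunbook \
  --instance-ids i-0123456789abcdef0 \
  --parameters 'device=nvme0n1p1,mountPath=/' \
  --comment 'Manual test of filesystem extension'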
Now we proceed to create our Lambda. Its purpose is to actually increase the volume size at the AWS level and then call the runbook to extend the filesystem.
For this, we search for the Lambda service and click Functions > Create function.
We select Author from scratch and fill in the following data:
Function name: ExtendFileSystemLambda
Runtime: Python 3.13
Architecture: x86_64
Execution Role: Create a new role with basic Lambda permissions
Feel free to add tags, and then click on Create function
Note: There is no need for the Lambda to be in the same VPC, since it only talks to AWS APIs
Before we proceed, we are going to add an additional permission needed to run the runbook. For this, we go to Configuration > Permissions > click on the role name
Inside the IAM Role of the lambda we are going to click on Add Permissions > Create inline policy > Choose JSON
And we copy the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:ModifyVolume"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand",
        "ssm:GetCommandInvocation"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
This IAM policy lets the Lambda view and resize EC2 storage volumes, execute commands on EC2 instances through Systems Manager, and write logs for monitoring
After pasting the policy into the policy editor, we click Next, give it a policy name like CommandExecutionResizeFileSystem, and click Create policy
Now we can go back to our Lambda. In the Code tab, paste the following code into the lambda_function.py file:
import json
import boto3
import logging
import time
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """AWS Lambda to resize EBS volume based on CloudWatch alarm."""
    try:
        # Configuration
        volume_increase_gb = int(os.environ.get('VOLUME_INCREASE_GB', '1'))
        wait_time = int(os.environ.get('WAIT_TIME_SECONDS', '30'))

        # AWS clients
        ec2 = boto3.client('ec2')

        # Parse SNS message
        sns_message = json.loads(event['Records'][0]['Sns']['Message'])
        alarm_name = sns_message['AlarmName']

        # Extract instance ID and device from alarm dimensions
        dimensions = sns_message.get('Trigger', {}).get('Dimensions', [])
        instance_id = None
        device = None
        for dim in dimensions:
            if dim['name'] == 'InstanceId':
                instance_id = dim['value']
            elif dim['name'] == 'device':
                device = dim['value']

        if not instance_id:
            raise ValueError("No instance ID found in alarm")

        logger.info(f"Processing alarm '{alarm_name}' for instance {instance_id}, device: {device}")

        # Get instance to find the volume
        response = ec2.describe_instances(InstanceIds=[instance_id])
        instance = response['Reservations'][0]['Instances'][0]

        # Find the volume for the specific device
        # (guard against a missing device dimension so we don't crash on None)
        target_volume = None
        device_variations = []
        if device:
            device_variations = [device, f"/dev/{device}", device.replace('xvd', 'sd'), device.replace('sd', 'xvd')]

        for block_device in instance.get('BlockDeviceMappings', []):
            device_name = block_device['DeviceName']
            volume_id = block_device['Ebs']['VolumeId']
            # Check if this device matches our target (handle naming variations);
            # removeprefix strips the literal '/dev/' prefix, unlike lstrip which strips a character set
            if any(device_name.endswith(var.removeprefix('/dev/')) for var in device_variations):
                target_volume = volume_id
                logger.info(f"Found target volume {volume_id} on device {device_name}")
                break

        if not target_volume:
            # Fallback: if no device matched, resize the first attached volume
            volumes = [bd['Ebs']['VolumeId'] for bd in instance.get('BlockDeviceMappings', [])]
            if volumes:
                target_volume = volumes[0]
                logger.info(f"No specific device found, using first volume: {target_volume}")

        if not target_volume:
            raise ValueError(f"No volume found to resize for instance {instance_id}")

        # Get current volume size
        vol_response = ec2.describe_volumes(VolumeIds=[target_volume])
        current_size = vol_response['Volumes'][0]['Size']
        new_size = current_size + volume_increase_gb

        # Resize the volume
        ec2.modify_volume(VolumeId=target_volume, Size=new_size)
        logger.info(f"Resized volume {target_volume}: {current_size}GB -> {new_size}GB")

        # Wait for modification to complete
        logger.info(f"Waiting {wait_time} seconds for volume modification...")
        time.sleep(wait_time)

        # Optional: Run SSM command to extend filesystem
        runbook_name = os.environ.get('RUNBOOK_NAME', 'ExtendFilesystemRunbook')
        if runbook_name:
            ssm = boto3.client('ssm')

            # Get mount path and device from alarm dimensions
            disk_path = "/"  # default
            disk_device_raw = None
            for dim in dimensions:
                if dim['name'] == 'path':
                    disk_path = dim['value']
                elif dim['name'] == 'device':
                    disk_device_raw = dim['value']

            # Find the actual device by looking at the instance
            actual_device = None
            for block_device in instance.get('BlockDeviceMappings', []):
                device_name = block_device['DeviceName']
                # Strip /dev/ prefix for comparison
                device_short = device_name.removeprefix('/dev/')
                # Match against the alarm device (handle xvda vs nvme naming)
                if (disk_device_raw and
                        (device_short.startswith(disk_device_raw) or
                         disk_device_raw.startswith(device_short) or
                         device_short.replace('nvme0n1p', 'xvd').replace('1', 'a') == disk_device_raw or
                         disk_device_raw.replace('xvd', 'nvme0n1p').replace('a', '1') == device_short)):
                    actual_device = device_short  # device without /dev/ prefix
                    break

            if not actual_device:
                # Fallback: use the first device
                actual_device = instance.get('BlockDeviceMappings', [{}])[0].get('DeviceName', '').removeprefix('/dev/')

            # Prepare parameters for the runbook
            ssm_params = {
                'device': [actual_device],   # Send device without /dev/ prefix
                'mountPath': [disk_path]     # Send the actual mount path from alarm
            }

            logger.info(f"Running runbook with device='{actual_device}' and mountPath='{disk_path}'")
            command_response = ssm.send_command(
                InstanceIds=[instance_id],
                DocumentName=runbook_name,
                Parameters=ssm_params,
                Comment=f'Filesystem extension for alarm: {alarm_name}',
                TimeoutSeconds=300
            )
            command_id = command_response['Command']['CommandId']
            logger.info(f"Started filesystem extension: {command_id}")
        else:
            command_id = None
            logger.info("No SSM runbook configured, skipping filesystem extension")

        # Success response
        result = {
            'success': True,
            'instance_id': instance_id,
            'volume_id': target_volume,
            'old_size_gb': current_size,
            'new_size_gb': new_size,
            'alarm_name': alarm_name
        }
        if command_id:
            result['command_id'] = command_id

        return {
            'statusCode': 200,
            'body': json.dumps(result)
        }

    except Exception as e:
        logger.error(f"Error: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps({
                'success': False,
                'error': str(e)
            })
        }
After copying the code, click on Deploy
This Lambda function automatically responds to CloudWatch alarms by expanding disk storage. When triggered by an SNS notification from a storage alarm, it identifies the EC2 instance and device from the alarm dimensions, increases the matching volume by VOLUME_INCREASE_GB (1 GB by default), waits for the change to take effect, then runs the filesystem extension runbook to make the additional space available to the database
Now, to test our Lambda, create a test event with the following body. It replicates the payload we would receive from a CloudWatch alarm, so the Lambda will resize the disk in AWS and invoke the runbook we previously configured
{
  "Records": [
    {
      "Sns": {
        "Message": "{\n \"AlarmName\": \"DiskUsed% CRIT\",\n \"AWSAccountId\": \"111882299112\",\n \"NewStateValue\": \"ALARM\",\n \"Trigger\": {\n \"Namespace\": \"CWAgent\",\n \"MetricName\": \"disk_used_percent\",\n \"Dimensions\": [\n {\"name\": \"InstanceId\", \"value\": \"<your-instance-id>\"},\n {\"name\": \"device\", \"value\": \"nvme0n1p1\"},\n {\"name\": \"path\", \"value\": \"/\"}\n ]\n }\n}"
      }
    }
  ]
}
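You can run this test from the console's Test tab, or from the CLI; a sketch, assuming the JSON above is saved as event.json:
aws lambda invoke \
  --function-name ExtendFileSystemLambda \
  --cli-binary-format raw-in-base64-out \
  --payload file://event.json \
  response.json
cat response.json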
Once we have tested our Lambda, we can verify that it actually increased the space, both in AWS and inside the instance
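For example (the volume ID is a placeholder):
# New size at the AWS level
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 --query 'Volumes[0].Size'
# New size inside the instance, e.g. over Session Manager
df -h /
lsblk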
Now, for our final step, we are going to configure the automated part. We will create a CloudWatch alarm with an SNS topic that triggers our Lambda: the Lambda receives the event, increments the space on the volume, and then expands it at the OS level. For this we go to AWS Console > CloudWatch > All alarms > Create alarm.
In the first step Metric, we copy the same source we used in our dashboard.
{
  "region": "us-east-1",
  "title": "Disk Used % (Exact dims, /, <instance-id>)",
  "view": "gauge",
  "stat": "Average",
  "period": 60,
  "setPeriodToTimeRange": true,
  "yAxis": {
    "left": {
      "min": 0,
      "max": 100
    }
  },
  "metrics": [
    [ "CWAgent", "disk_used_percent", "InstanceId", "<your-instance-id>", "path", "/", "device", "<device-name>", "fstype", "<fstype>", "ImageId", "<ami-id>", "InstanceType", "<instance-type>" ]
  ]
}
This should show us the current metric we are tracking
Under Conditions, for testing purposes, we set the threshold right at the disk's current usage so the alarm fires immediately; otherwise, set it to a threshold you see fit
Under Notification, pick In alarm > Create new topic > set up your SNS topic; the topic should have the Lambda as its subscription endpoint so the alarm triggers it
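If you create the subscription outside the SNS console, remember that SNS also needs permission to invoke the function; a sketch with placeholder ARNs:
# Allow the topic to invoke the Lambda
aws lambda add-permission \
  --function-name ExtendFileSystemLambda \
  --statement-id sns-disk-alarm \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn arn:aws:sns:us-east-1:123456789012:disk-usage-alarm
# Subscribe the Lambda to the topic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:disk-usage-alarm \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:ExtendFileSystemLambda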
Next, in alarm details, feel free to add a name and a description; after that, create the alarm.
After waiting a few minutes, we should see the alarm being triggered
When the alarm fires it will trigger our Lambda, which resizes the volume and then runs our runbook to extend the filesystem
Note that an EBS volume can only be modified once every 6 hours, so repeated alarms inside that window will not resize it again
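If an execution fails with a modification error, you can check whether the volume is still inside that window (the volume ID is a placeholder):
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0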
After the Lambda execution, we can see that our resize was successful
Lessons learned and conclusion
Automating EBS growth from disk usage alarms is solid, but there are some details:
- Don’t rely on device names. Rely on the VolumeId. You won’t get it from CloudWatch; resolve it on the instance via SSM using the device and path the alarm provides
- Pick a single integration pattern and wire it correctly. For code that expects SNS, use Alarm → SNS → Lambda and the SNS principal in your Lambda policy
- Bias toward safety. Remove “first volume” fallbacks, add idempotency checks, and guard with clear alarms and logs