In this article, we’ll show how to automate EBS volume upsizing using CloudWatch alarms, SNS notifications, AWS Systems Manager runbooks, and Lambda. By the end, we will have a hands-free setup that detects when volumes are running low on space, triggers an automated workflow, and expands them without downtime, keeping our workloads running smoothly.
The Problem
This solution was created to fix a specific problem: the SAP workloads running on our EC2 instances sometimes run out of disk space, and we keep a close eye on costs, so overprovisioning storage up front is not an option. The challenge is that not all disks are mounted on the same mount point: some are under /hana/data, others are under /usr/. Also, AWS Lambda currently has no way or library to see which volumes exist inside our EC2 instances or where they are mounted. We can certainly increase the size of an EBS volume; however, we also need to extend that storage space inside the instance. For this, we created the following automation, which can easily be adapted to other disks, not only SAP solutions
Requirements
- AWS account / AWS CLI
Walkthrough
Before we begin, let's make sure the EBS volume that we are going to monitor is attached to an EC2 instance that has the CloudWatch agent installed. This way, we can leverage the agent to collect metrics
Accessing through Session Manager
First, let's configure an EC2 instance so it has access to Systems Manager. In the AWS Console > Search for EC2 > Instances > Pick your instance > Security
If the instance doesn't have an instance profile role attached, go to Actions > Security > Modify IAM Role
Create a new role or use an existing one that has the policy AmazonSSMManagedInstanceCore
attached. This role enables the instance to communicate with the AWS Systems Manager API
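If you prefer the CLI, here is a rough sketch of the same setup; the role and instance-profile name (EC2SSMRole) and the instance ID are placeholders, not values from this article:
# Hypothetical names/IDs -- adapt to your environment
aws iam create-role \
  --role-name EC2SSMRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy \
  --role-name EC2SSMRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam create-instance-profile --instance-profile-name EC2SSMRole
aws iam add-role-to-instance-profile --instance-profile-name EC2SSMRole --role-name EC2SSMRole
aws ec2 associate-iam-instance-profile \
  --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=EC2SSMRole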
Note: The Systems Manager (SSM) Agent comes preinstalled on AMIs like AL2 and AL2023. If you are using another type of AMI, you should install the SSM Agent first
Now, if our EC2 instance is not public (it has neither a public nor an Elastic IP), we need to create VPC endpoints so it can reach Systems Manager privately. For this, in the AWS Console go to VPC > Endpoints > Create endpoint
Here, we need to create 3 interface endpoints in our VPC, in the subnets where our EC2 instances live (adjust the region to match yours):
com.amazonaws.us-east-1.ssm
com.amazonaws.us-east-1.ssmmessages
com.amazonaws.us-east-1.ec2messages
After picking your VPC and subnets, be sure to pick or create a security group that allows inbound port 443 with the instance's security group as the source; feel free to use the same SG for each endpoint
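If you'd rather script this, a sketch with the CLI could look like the following; all IDs are placeholders, and you repeat the command for ssmmessages and ec2messages:
# Create one interface endpoint per service
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ssm \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled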
After waiting a few minutes, this should be enough to have our instance ready! In the AWS Console search for EC2 > Connect > Session Manager
Installing the Amazon CloudWatch Agent on EC2 (AL2023)
We begin by downloading the RPM file
curl -O https://amazoncloudwatch-agent.s3.amazonaws.com/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo dnf install -y ./amazon-cloudwatch-agent.rpm
We then execute the configuration wizard; it will walk us through a series of steps to configure the agent on our instance
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
For the first option we pick Linux; the rest of the answers are below:
- On which OS are you planning to use the agent? 1. Linux
- Trying to fetch the default region based on EC2 metadata. 1. EC2
- Which user are you planning to run the agent? 1. cwagent
- Do you want to turn on StatsD daemon? 2. No
- Do you want to monitor metrics from CollectD? WARNING: CollectD must be installed or the Agent will fail to start. 2. No
- Do you want to monitor any host metrics? e.g. CPU, memory, etc. 1. Yes
- Do you want to monitor cpu metrics per core? 2. No (we are only focusing on storage here)
- Do you want to add ec2 dimensions (ImageId, InstanceId, InstanceType, AutoScalingGroupName) into all of your metrics if the info is available? 1. Yes
- Do you want to aggregate ec2 dimensions (InstanceId)? 2. No
- Would you like to collect your metrics at high resolution (sub-minute resolution)? This enables sub-minute resolution for all metrics, but you can customize for specific metrics in the output json file. 4. 60s
- Which default metrics config do you want? 2. Standard
Final output:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "metrics_collection_interval": 60,
        "totalcpu": false
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "diskio": {
        "measurement": [
          "io_time"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      },
      "swap": {
        "measurement": [
          "swap_used_percent"
        ],
        "metrics_collection_interval": 60
      }
    }
  }
}
Now let's make sure the CloudWatch agent is running by starting and enabling it
sudo systemctl start amazon-cloudwatch-agent
sudo systemctl enable amazon-cloudwatch-agent
sudo systemctl status amazon-cloudwatch-agent
If the agent fails to start with an error saying it cannot find its configuration:
The configuration the wizard generated was probably created under the bin directory; copy it into the etc folder
sudo cp /opt/aws/amazon-cloudwatch-agent/bin/config.json /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
After that, restart the agent
sudo systemctl restart amazon-cloudwatch-agent
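To confirm the agent picked up the configuration, you can also query its status directly:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status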
Note: If you want to retrieve metrics from inside the EC2 instance, be sure to have the cloudwatch:GetMetricStatistics permission in the instance role, or copy temporary credentials onto the instance
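As a quick check, assuming the instance role has that permission, a query like this from inside the instance should return datapoints; the dimension values are examples, and the metric must be queried with the same dimensions the agent published:
aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=<your-instance-id> Name=path,Value=/ Name=device,Value=nvme0n1p1 Name=fstype,Value=xfs \
  --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Average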
If you only want to track disk usage percentage, which is what this article aims for, this is the minimal config for that. Edit the file /opt/aws/amazon-cloudwatch-agent/bin/config.json
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      }
    }
  }
}
After that, reload the configuration with the following command
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 \
-c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
Now let's proceed to create a dashboard to actually track the disk usage. For this, in the AWS Console go to CloudWatch > Dashboards > Create dashboard
Inside the dashboard we are going to create a metric widget of type Gauge
In the Source tab, we are going to add the following code:
{
  "region": "us-east-1",
  "title": "Disk Used % (Exact dims, /, <instance-id>)",
  "view": "gauge",
  "stat": "Average",
  "period": 60,
  "setPeriodToTimeRange": true,
  "yAxis": {
    "left": {
      "min": 0,
      "max": 100
    }
  },
  "metrics": [
    [ "CWAgent", "disk_used_percent", "InstanceId", "<your-instance-id>", "path", "/", "device", "<device-name>", "fstype", "<fstype>", "ImageId", "<ami-id>", "InstanceType", "<instance-type>" ]
  ]
}
After adding the source, click on Save; you should see a gauge graphic showing your current disk usage
For CloudWatch, the metric must be queried with the exact set of dimensions it was published with, so include the instance dimensions shown above; if the agent isn't appending the EC2 dimensions, you can query with only the host dimension.
Since I'm tracking an instance with a single root volume, I only check the device name of the root; if you are using more than one volume, add the other attached devices as well
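If you are unsure which dimension values the agent published, you can list them instead of guessing:
# Lists every disk_used_percent metric the agent has pushed, with its exact dimensions
aws cloudwatch list-metrics --namespace CWAgent --metric-name disk_used_percent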
Now let's create our custom runbook. For this, in the AWS Console we search for Systems Manager; inside the left menu bar we go to Change Management Tools > Documents
Here we are going to create a document of type Command.
We can fill it like this:
Name: ExtendFilesystemRunbook
Target Type: --
Document Type: Command
For the content we are going to use YAML and paste the following code:
schemaVersion: "2.2"
description: "Robust filesystem extension with proper error handling"
parameters:
  device:
    type: "String"
    description: "Device name from CloudWatch"
    default: ""
  mountPath:
    type: "String"
    description: "Mount path from CloudWatch alarm"
    default: "/"
  executionTimeout:
    type: "String"
    default: "900"
mainSteps:
  - action: "aws:runShellScript"
    name: "extendFilesystem"
    inputs:
      timeoutSeconds: "{{ executionTimeout }}"
      runCommand:
        - "#!/bin/bash"
        - "set -e"
        - ""
        - "echo '=== Starting filesystem extension ==='"
        - "echo 'Device: {{ device }}'"
        - "echo 'Mount Path: {{ mountPath }}'"
        - ""
        - "# Install required tools"
        - "echo 'Installing required tools...'"
        - "sudo yum install -y parted cloud-utils-growpart util-linux"
        - ""
        - "# Wait for EBS changes and optimization"
        - "echo 'Waiting for EBS optimization to complete...'"
        - "sleep 60"
        - ""
        - "# Find the actual device first; the readiness loop below depends on it"
        - "echo 'Finding actual device for mount path {{ mountPath }}...'"
        - "ACTUAL_DEVICE=$(df '{{ mountPath }}' | tail -1 | awk '{print $1}')"
        - "echo \"Device found: $ACTUAL_DEVICE\""
        - ""
        - "# Check if volume is still optimizing"
        - "echo 'Checking volume status...'"
        - "for i in {1..10}; do"
        - "  echo \"Check $i/10 - waiting for volume optimization...\""
        - "  sleep 30"
        - "  # Try to read the device"
        - "  if sudo blockdev --getsize64 $ACTUAL_DEVICE > /dev/null 2>&1; then"
        - "    echo 'Volume appears ready'"
        - "    break"
        - "  fi"
        - "done"
        - ""
        - "# Show current filesystem info"
        - "echo 'Current filesystem info:'"
        - "lsblk $ACTUAL_DEVICE"
        - "blkid $ACTUAL_DEVICE"
        - ""
        - "# Extract base device and partition number"
        - "if [[ $ACTUAL_DEVICE =~ nvme.*p[0-9] ]]; then"
        - "  BASE_DEVICE=$(echo $ACTUAL_DEVICE | sed 's/p[0-9]*$//')"
        - "  PARTITION=$(echo $ACTUAL_DEVICE | sed 's/.*p//')"
        - "else"
        - "  BASE_DEVICE=$(echo $ACTUAL_DEVICE | sed 's/[0-9]*$//')"
        - "  PARTITION=$(echo $ACTUAL_DEVICE | sed 's/.*[^0-9]//')"
        - "fi"
        - ""
        - "echo \"Base device: $BASE_DEVICE\""
        - "echo \"Partition: $PARTITION\""
        - ""
        - "# Show partition table before"
        - "echo 'Partition table before:'"
        - "sudo fdisk -l $BASE_DEVICE"
        - ""
        - "# Grow the partition"
        - "echo 'Growing partition...'"
        - "sudo growpart $BASE_DEVICE $PARTITION"
        - ""
        - "# Show partition table after"
        - "echo 'Partition table after:'"
        - "sudo fdisk -l $BASE_DEVICE"
        - ""
        - "# Force kernel to re-read partition table"
        - "echo 'Forcing kernel to re-read partition table...'"
        - "sudo partprobe $BASE_DEVICE"
        - "sync"
        - ""
        - "# Get filesystem type"
        - "FS_TYPE=$(lsblk -no FSTYPE $ACTUAL_DEVICE)"
        - "echo \"Filesystem type: $FS_TYPE\""
        - ""
        - "# Extend filesystem based on type"
        - "case $FS_TYPE in"
        - "  'ext2'|'ext3'|'ext4')"
        - "    echo 'Extending ext filesystem...'"
        - "    sudo e2fsck -f $ACTUAL_DEVICE || echo 'e2fsck failed, continuing...'"
        - "    sudo resize2fs $ACTUAL_DEVICE"
        - "    ;;"
        - "  'xfs')"
        - "    echo 'Extending XFS filesystem...'"
        - "    sudo xfs_growfs '{{ mountPath }}'"
        - "    ;;"
        - "  *)"
        - "    echo \"Unknown or unsupported filesystem type: $FS_TYPE\""
        - "    echo 'Trying resize2fs anyway...'"
        - "    sudo resize2fs $ACTUAL_DEVICE || echo 'resize2fs failed'"
        - "    ;;"
        - "esac"
        - ""
        - "echo '=== Filesystem extension completed! ==='"
        - "echo 'Final result:'"
        - "df -h '{{ mountPath }}'"
        - "lsblk $ACTUAL_DEVICE"
This AWS Systems Manager runbook automates filesystem extension so the OS can use the newly added storage. It takes the device name and mount path reported by the CloudWatch alarm (plus an execution timeout), resolves the actual block device behind the mount path, grows the partition with growpart, and then extends the filesystem: resize2fs for ext2/3/4, xfs_growfs for XFS
Feel free to add tags if necessary, and then click on Create Document
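Before wiring everything together, you can smoke-test the runbook by invoking it manually; the instance ID and device below are placeholders:
# Manual test of the document against one instance
aws ssm send-command \
  --document-name ExtendFilesystemRunbook \
  --instance-ids i-0123456789abcdef0 \
  --parameters 'device=nvme0n1p1,mountPath=/' \
  --comment 'Manual test of filesystem extension'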
Now we proceed to create our Lambda. Its purpose is to actually increase the volume size at the AWS level and then call the runbook to extend the filesystem.
For this, we search for the Lambda service and click Functions > Create function.
We select Author from scratch and fill in the following data:
Function name: ExtendFileSystemLambda
Runtime: Python 3.13
Architecture: x86_64
Execution Role: Create a new role with basic Lambda permissions
Feel free to add tags, and then click on Create function
Note: There is no need for the Lambda to be in the same VPC, since it only talks to AWS APIs
Before we proceed, we are going to add an additional permission needed to run the runbook. For this, we go to Configuration > Permissions > click on the role name
Inside the IAM Role of the lambda we are going to click on Add Permissions > Create inline policy > Choose JSON
And we copy the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:ModifyVolume"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand",
        "ssm:GetCommandInvocation"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
This IAM policy lets the Lambda view and resize EC2 storage volumes, execute commands on EC2 instances through Systems Manager, and write logs for monitoring
After pasting the policy into the policy editor, we click Next, give it a policy name like CommandExecutionResizeFileSystem, and click Create policy
Now we can go back to our Lambda. In the Code tab, paste the following code into the lambda_function.py file:
import json
import boto3
import logging
import time
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """AWS Lambda to resize EBS volume based on CloudWatch alarm."""
    try:
        # Configuration
        volume_increase_gb = int(os.environ.get('VOLUME_INCREASE_GB', '1'))
        wait_time = int(os.environ.get('WAIT_TIME_SECONDS', '30'))

        # AWS clients
        ec2 = boto3.client('ec2')

        # Parse SNS message
        sns_message = json.loads(event['Records'][0]['Sns']['Message'])
        alarm_name = sns_message['AlarmName']

        # Extract instance ID and device from alarm dimensions
        dimensions = sns_message.get('Trigger', {}).get('Dimensions', [])
        instance_id = None
        device = None
        for dim in dimensions:
            if dim['name'] == 'InstanceId':
                instance_id = dim['value']
            elif dim['name'] == 'device':
                device = dim['value']

        if not instance_id:
            raise ValueError("No instance ID found in alarm")

        logger.info(f"Processing alarm '{alarm_name}' for instance {instance_id}, device: {device}")

        # Get instance to find the volume
        response = ec2.describe_instances(InstanceIds=[instance_id])
        instance = response['Reservations'][0]['Instances'][0]

        # Find the volume for the specific device
        # (guard against a missing device dimension so we don't crash on None)
        target_volume = None
        device_variations = []
        if device:
            device_variations = [device, f"/dev/{device}", device.replace('xvd', 'sd'), device.replace('sd', 'xvd')]

        for block_device in instance.get('BlockDeviceMappings', []):
            device_name = block_device['DeviceName']
            volume_id = block_device['Ebs']['VolumeId']
            # Check if this device matches our target (handle naming variations);
            # removeprefix strips the literal '/dev/' prefix, unlike lstrip which strips a character set
            if any(device_name.endswith(var.removeprefix('/dev/')) for var in device_variations):
                target_volume = volume_id
                logger.info(f"Found target volume {volume_id} on device {device_name}")
                break

        if not target_volume:
            # Fallback: if no device matched, resize the first attached volume
            volumes = [bd['Ebs']['VolumeId'] for bd in instance.get('BlockDeviceMappings', [])]
            if volumes:
                target_volume = volumes[0]
                logger.info(f"No specific device found, using first volume: {target_volume}")

        if not target_volume:
            raise ValueError(f"No volume found to resize for instance {instance_id}")

        # Get current volume size
        vol_response = ec2.describe_volumes(VolumeIds=[target_volume])
        current_size = vol_response['Volumes'][0]['Size']
        new_size = current_size + volume_increase_gb

        # Resize the volume
        ec2.modify_volume(VolumeId=target_volume, Size=new_size)
        logger.info(f"Resized volume {target_volume}: {current_size}GB -> {new_size}GB")

        # Wait for modification to complete
        logger.info(f"Waiting {wait_time} seconds for volume modification...")
        time.sleep(wait_time)

        # Optional: Run SSM command to extend filesystem
        runbook_name = os.environ.get('RUNBOOK_NAME', 'ExtendFilesystemRunbook')
        if runbook_name:
            ssm = boto3.client('ssm')

            # Get mount path and device from alarm dimensions
            disk_path = "/"  # default
            disk_device_raw = None
            for dim in dimensions:
                if dim['name'] == 'path':
                    disk_path = dim['value']
                elif dim['name'] == 'device':
                    disk_device_raw = dim['value']

            # Find the actual device by looking at the instance
            actual_device = None
            for block_device in instance.get('BlockDeviceMappings', []):
                device_name = block_device['DeviceName']
                # Strip /dev/ prefix for comparison
                device_short = device_name.removeprefix('/dev/')
                # Match against the alarm device (handle xvda vs nvme naming)
                if (disk_device_raw and
                        (device_short.startswith(disk_device_raw) or
                         disk_device_raw.startswith(device_short) or
                         device_short.replace('nvme0n1p', 'xvd').replace('1', 'a') == disk_device_raw or
                         disk_device_raw.replace('xvd', 'nvme0n1p').replace('a', '1') == device_short)):
                    actual_device = device_short  # device without /dev/ prefix
                    break

            if not actual_device:
                # Fallback: use the first device
                actual_device = instance.get('BlockDeviceMappings', [{}])[0].get('DeviceName', '').removeprefix('/dev/')

            # Prepare parameters for the runbook
            ssm_params = {
                'device': [actual_device],   # Send device without /dev/ prefix
                'mountPath': [disk_path]     # Send the actual mount path from alarm
            }

            logger.info(f"Running runbook with device='{actual_device}' and mountPath='{disk_path}'")
            command_response = ssm.send_command(
                InstanceIds=[instance_id],
                DocumentName=runbook_name,
                Parameters=ssm_params,
                Comment=f'Filesystem extension for alarm: {alarm_name}',
                TimeoutSeconds=300
            )
            command_id = command_response['Command']['CommandId']
            logger.info(f"Started filesystem extension: {command_id}")
        else:
            command_id = None
            logger.info("No SSM runbook configured, skipping filesystem extension")

        # Success response
        result = {
            'success': True,
            'instance_id': instance_id,
            'volume_id': target_volume,
            'old_size_gb': current_size,
            'new_size_gb': new_size,
            'alarm_name': alarm_name
        }
        if command_id:
            result['command_id'] = command_id

        return {
            'statusCode': 200,
            'body': json.dumps(result)
        }

    except Exception as e:
        logger.error(f"Error: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps({
                'success': False,
                'error': str(e)
            })
        }
After copying the code, click on Deploy
This Lambda function automatically responds to CloudWatch alarms by expanding disk storage. When triggered by an SNS notification from a storage alarm, it identifies the EC2 instance and device from the alarm dimensions, increases the matching volume by VOLUME_INCREASE_GB (1 GB by default), waits for the change to take effect, then runs the filesystem extension runbook to make the additional space available to the database
Now, to test our Lambda, create a test event with the following body. It replicates the payload we would receive from a CloudWatch alarm, so the Lambda will resize the disk in AWS and invoke the runbook we previously configured
{
  "Records": [
    {
      "Sns": {
        "Message": "{\n \"AlarmName\": \"DiskUsed% CRIT\",\n \"AWSAccountId\": \"111882299112\",\n \"NewStateValue\": \"ALARM\",\n \"Trigger\": {\n \"Namespace\": \"CWAgent\",\n \"MetricName\": \"disk_used_percent\",\n \"Dimensions\": [\n {\"name\": \"InstanceId\", \"value\": \"<your-instance-id>\"},\n {\"name\": \"device\", \"value\": \"nvme0n1p1\"},\n {\"name\": \"path\", \"value\": \"/\"}\n ]\n }\n}"
      }
    }
  ]
}
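You can run this test from the console's Test tab, or from the CLI; a sketch, assuming the JSON above is saved as event.json:
aws lambda invoke \
  --function-name ExtendFileSystemLambda \
  --cli-binary-format raw-in-base64-out \
  --payload file://event.json \
  response.json
cat response.json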
Once we have tested our Lambda, we can verify that it actually increased the space, both in AWS and inside the instance
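For example (the volume ID is a placeholder):
# New size at the AWS level
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 --query 'Volumes[0].Size'
# New size inside the instance, e.g. over Session Manager
df -h /
lsblk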
Now, for our final step, we are going to configure the automated part. We will create a CloudWatch alarm with an SNS topic that triggers our Lambda: the Lambda receives the event, increments the space on the volume, and then expands it at the OS level. For this we go to AWS Console > CloudWatch > All alarms > Create alarm.
In the first step Metric, we copy the same source we used in our dashboard.
{
  "region": "us-east-1",
  "title": "Disk Used % (Exact dims, /, <instance-id>)",
  "view": "gauge",
  "stat": "Average",
  "period": 60,
  "setPeriodToTimeRange": true,
  "yAxis": {
    "left": {
      "min": 0,
      "max": 100
    }
  },
  "metrics": [
    [ "CWAgent", "disk_used_percent", "InstanceId", "<your-instance-id>", "path", "/", "device", "<device-name>", "fstype", "<fstype>", "ImageId", "<ami-id>", "InstanceType", "<instance-type>" ]
  ]
}
This should show us the current metric we are tracking
Under Conditions, for testing purposes, we set the threshold right at the disk's current usage so the alarm fires immediately; otherwise, set it to a threshold you see fit
Under Notification, pick In alarm > Create new topic > set up your SNS topic; the topic should have the Lambda as its subscription endpoint so the alarm triggers it
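If you create the subscription outside the SNS console, remember that SNS also needs permission to invoke the function; a sketch with placeholder ARNs:
# Allow the topic to invoke the Lambda
aws lambda add-permission \
  --function-name ExtendFileSystemLambda \
  --statement-id sns-disk-alarm \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn arn:aws:sns:us-east-1:123456789012:disk-usage-alarm
# Subscribe the Lambda to the topic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:disk-usage-alarm \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:ExtendFileSystemLambda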
Next, in alarm details, feel free to add a name and a description; after that, create the alarm.
After waiting a few minutes, we should see the alarm being triggered
When the alarm fires it will trigger our Lambda, which resizes the volume and then runs our runbook to extend the filesystem
Note that an EBS volume can only be modified once every 6 hours, so repeated alarms inside that window will not resize it again
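If an execution fails with a modification error, you can check whether the volume is still inside that window (the volume ID is a placeholder):
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0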
After the Lambda execution, we can see that our resize was successful
Lessons learned and conclusion
Automating EBS growth from disk usage alarms is solid, but there are some details:
- Don’t rely on device names. Rely on the VolumeId. You won’t get it from CloudWatch; resolve it on the instance via SSM using the device and path the alarm provides
- Pick a single integration pattern and wire it correctly. For code that expects SNS, use Alarm → SNS → Lambda and the SNS principal in your Lambda policy
- Bias toward safety. Remove “first volume” fallbacks, add idempotency checks, and guard with clear alarms and logs