<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Susseta Bose</title>
    <description>The latest articles on DEV Community by Susseta Bose (@susseta).</description>
    <link>https://dev.to/susseta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3578839%2F96a443c2-8d75-4e75-803b-ed1801b7838a.PNG</url>
      <title>DEV Community: Susseta Bose</title>
      <link>https://dev.to/susseta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/susseta"/>
    <language>en</language>
    <item>
      <title>Auto-Orphan-Volume-Cleanup-Automation</title>
      <dc:creator>Susseta Bose</dc:creator>
      <pubDate>Sun, 28 Dec 2025 16:36:40 +0000</pubDate>
      <link>https://dev.to/susseta/auto-orphan-volume-cleanup-automation-33ej</link>
      <guid>https://dev.to/susseta/auto-orphan-volume-cleanup-automation-33ej</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;:&lt;br&gt;
In modern cloud environments, unused resources often accumulate silently, driving up costs and creating operational inefficiencies. One of the most common culprits is orphaned EBS volumes that remain unattached and unnoticed after workloads are terminated. This project was initiated to address that challenge by introducing an automated, secure, and auditable workflow for discovering and cleaning up unused volumes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Statement:&lt;/strong&gt;&lt;br&gt;
In dynamic cloud environments, unused resources often accumulate unnoticed. One of the most common examples is orphaned Amazon EBS volumes left behind after instances are terminated. These unused volumes not only increase storage costs but also pose governance and compliance challenges. For example, a 100 GB General Purpose SSD (gp3) volume costs about $8 per month, while a 500 GB Provisioned IOPS SSD (io2) volume with 20,000 IOPS can exceed $1,250 per month. Snapshots add further hidden expenses at $0.05 per GB-month. When multiplied across multiple accounts and regions, these orphaned volumes can silently drive up bills by hundreds or even thousands of dollars monthly. Manual cleanup is error-prone and time-consuming, especially at scale. The need is clear: an automated and secure solution to manage the EBS volume lifecycle and eliminate unnecessary costs.&lt;/p&gt;
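&lt;p&gt;To make the arithmetic above concrete, the sketch below recomputes those figures from per-GB list prices. The rates are assumptions taken from us-east-1 pricing at the time of writing (the io2 IOPS rate of $0.065 per provisioned IOPS-month is the first pricing tier); always check the current AWS pricing page.&lt;/p&gt;

```python
# Rough monthly cost of an orphaned EBS volume plus its snapshots.
# All rates below are assumed us-east-1 list prices, not authoritative.
GP3_PER_GB = 0.08        # $/GB-month, General Purpose SSD (gp3)
IO2_PER_GB = 0.125       # $/GB-month, Provisioned IOPS SSD (io2)
IO2_PER_IOPS = 0.065     # $/IOPS-month, io2 first pricing tier
SNAPSHOT_PER_GB = 0.05   # $/GB-month, EBS snapshot storage

def monthly_cost(size_gb, vol_type="gp3", iops=0, snapshot_gb=0):
    storage = size_gb * (IO2_PER_GB if vol_type == "io2" else GP3_PER_GB)
    provisioned = iops * IO2_PER_IOPS if vol_type == "io2" else 0.0
    return storage + provisioned + snapshot_gb * SNAPSHOT_PER_GB

print(monthly_cost(100))                      # -> 8.0 (100 GB gp3)
print(monthly_cost(500, "io2", iops=20_000))  # -> 1362.5 (500 GB io2)
```

&lt;p&gt;Even a handful of forgotten io2 volumes per account quickly dominates the bill, which is why discovery has to be continuous rather than ad hoc.&lt;/p&gt;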

&lt;p&gt;&lt;strong&gt;Current Workflow State:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditionally, teams rely on manual scripts or periodic audits to identify unused volumes. This approach suffers from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Lack of visibility across accounts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No centralized approval mechanism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Risk of accidental deletion of critical data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High operational overhead.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Target Workflow State&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The goal was to design a fully automated system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Discovers unused EBS volumes regularly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provides a centralized approval interface before deletion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensures secure, auditable cleanup with minimal human intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integrates seamlessly with existing CI/CD pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture Diagram&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8aqz72tcj0i4i00hs2zf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8aqz72tcj0i4i00hs2zf.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What AWS Services We Used:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AWS Lambda – Discovery and deletion of EBS volumes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon DynamoDB – Metadata and approval status storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon ECS + ALB – Hosting the Streamlit approval application&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon EventBridge – Scheduling automated discovery jobs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon ECR – Container image repository for the web app&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS IAM &amp;amp; KMS – Security and encryption&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon CloudWatch – Logging and monitoring&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Explanation of the Entire System Workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Volume Discovery Lambda runs on a schedule via EventBridge, scanning for unattached EBS volumes and storing their details in DynamoDB. It also exports the details as an Excel report to an S3 bucket and sends an SNS email notification to end users with a link to the report.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Streamlit web app (deployed on ECS) provides a user-friendly interface to review discovered volumes. An approver marks a volume for deletion by setting its DeleteConfirmation column to Approved; saving the change in the application updates the corresponding items in the DynamoDB table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Delete Volume Lambda is triggered when a change to DeleteConfirmation is detected on the table's DynamoDB stream (delivered through an EventBridge Pipe). It identifies the approved volumes and deletes them one by one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD Pipeline ensures infrastructure and application updates are deployed consistently using GitLab and Terraform.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1rmnjrjx8ys2dmp2rhr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1rmnjrjx8ys2dmp2rhr.png" alt=" " width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How We Implemented It:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Infrastructure as Code: Terraform provisions all AWS resources including Lambdas, DynamoDB, ECS, ALB, and EventBridge rules.&lt;/p&gt;

&lt;p&gt;Containerization: The Streamlit app is packaged with Docker and pushed to ECR.&lt;/p&gt;

&lt;p&gt;Automation: GitLab CI/CD pipeline builds images, pushes them to ECR, and applies Terraform changes.&lt;/p&gt;

&lt;p&gt;Security: IAM roles follow least privilege principles, ECS logs are encrypted with KMS, and VPC security groups enforce isolation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:.
│   .gitlab-ci.yml
│   Dockerfile
│   README.md
│
├───DeleteVolumeFunction
│       lambda_function.py
│
├───StreamlitApplication
│       approval.py
│
├───Terraform
│       alb.tf
│       data.tf
│       dynamodb.tf
│       ecs.tf
│       eventbridge.tf
│       iam.tf
│       lambda.tf
│       output.tf
│       provider.tf
│       terraform.tfvars
│       variable.tf
│
└───VolumeDiscoveryFunction
        available_vol_discovery.py
        lambda_function.py
        push_to_dynamodb.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lambda Scripts&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;VolumeDiscoveryFunction/available_vol_discovery.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import os
import sys
import pandas as pd
from datetime import datetime
#~~~~~~~~~~~~~ Create EC2 client and describe volumes ~~~~~~~~~~~~~#

client = boto3.client('ec2')
sns_client = boto3.client('sns')

#~~~~~~~~~~~~~ Loop through the response and extract relevant information ~~~~~~~~~~~~~#
def vol_discovery():
    # Keep this list local: a module-level list would persist across warm
    # Lambda invocations and produce duplicate rows
    volume_data = []

    response = client.describe_volumes()
    for vol in response['Volumes']:
        if vol['State'] == 'available':
            Volume_ID = vol['VolumeId']
            Size = vol['Size']
            State = vol['State']
            Creation_time = vol['CreateTime']
            # Drop tz info (if present) so strftime/Excel handle it cleanly
            Creation_time = Creation_time.replace(tzinfo=None)
            Creation_time = Creation_time.strftime("%Y-%m-%d %H:%M:%S")
            Vol_Type = vol['VolumeType']
            # Fall back to "Unknown" when the Type/Owner tags are missing
            Disk_Type = next((tag['Value'] for tag in vol.get('Tags', []) if tag['Key'] == 'Type'), 'Unknown')
            Owner = next((tag['Value'] for tag in vol.get('Tags', []) if tag['Key'] == 'Owner'), 'Unknown')

            data = {
                "VolumeID": Volume_ID,
                "Size": Size,
                "State": State,
                "Created": Creation_time,
                "VolumeType": Vol_Type,
                "DiskType": Disk_Type,
                "Owner": Owner,
                "DeleteConfirmation": "Pending"
            }
            volume_data.append(data)

    #~~~~~~~~~~~~~ Create a Excel file with extracted volume information ~~~~~~~~~~~~~#
    time = datetime.now().strftime("%H%M%S")
    df = pd.DataFrame(volume_data)
    output_file = f'discovery_available_ebsvol-{time}.xlsx'
    df.to_excel(f'/tmp/{output_file}', index=False)

    #~~~~~~~~~~~~~ Upload the Excel file to S3 ~~~~~~~~~~~~~#

    s3 = boto3.client('s3')
    bucket_name = os.environ['BUCKET_NAME']
    file_path = f'/tmp/{output_file}'
    s3_object_key = f'OrphanEBSReport/{output_file}' # Desired object key in S3

    try:
        s3.upload_file(file_path, bucket_name, s3_object_key)
        s3_url = f"https://{bucket_name}.s3.amazonaws.com/{s3_object_key}"
        print(f"File uploaded to S3: {s3_url}")
    except Exception as e:
        print(f"Error uploading file to S3: {e}")
        raise  # fail the invocation; exit() would mask the error in Lambda
    #~~~~~~~~~~~~~ Send an email with the Excel file as an attachment ~~~~~~~~~~~~~#

    snsarn = os.environ['SNS_ARN']
    body = f"Hi Team, \n\nPlease be informed that the following EBS volumes are in 'available' state and not attached to any EC2 instances. Kindly review the excel report from below link.\n\nLink: {s3_url}\n\nPlease click on http://ALB-External-364496655.us-east-1.elb.amazonaws.com to provide an approval.\n\nBest Regards,\nSystems Management Team."
    res = sns_client.publish(
        TopicArn = snsarn,
        Subject = 'Orphan EBS Volume Discovery Report',
        Message = str(body)
        )
    return volume_data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VolumeDiscoveryFunction/lambda_function.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import os
import sys
import available_vol_discovery
import push_to_dynamodb


def lambda_handler(event, context):
    #~~~~~~~~~~~~~ Call the volume discovery function ~~~~~~~~~~~~~#
    volume_data = available_vol_discovery.vol_discovery()
    #~~~~~~~~~~~~~~ Call the function to push data to DynamoDB ~~~~~~~~~~~~~#
    push_to_dynamodb.push_data_to_dynamodb(volume_data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VolumeDiscoveryFunction/push_to_dynamodb.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import os

def push_data_to_dynamodb(volume_data):
    dynamodb = boto3.resource("dynamodb")
    table_name = os.environ.get("TABLE_NAME")
    table = dynamodb.Table(table_name)
    for data in volume_data:
        table.put_item(Item=data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;StreamlitApplication/approval.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import streamlit as st
import boto3
import pandas as pd
from boto3.dynamodb.types import TypeDeserializer

db = boto3.client('dynamodb')

def list_tables():
    # Return all table names as a tuple for the sidebar selectbox
    return tuple(db.list_tables()['TableNames'])

def get_columns_from_table(table_name):
    response = db.scan(TableName=table_name, Limit=100)
    columns = set()
    for item in response.get('Items', []):
        columns.update(item.keys())
    # Handle pagination if table is large
    while 'LastEvaluatedKey' in response:
        response = db.scan(TableName=table_name,
                           ExclusiveStartKey=response['LastEvaluatedKey'],
                           Limit=100)
        for item in response.get('Items', []):
            columns.update(item.keys())
    return tuple(columns)

def get_table_key_schema(table_name):
    response = db.describe_table(TableName=table_name)
    key_schema = response['Table']['KeySchema']
    attribute_definitions = {attr['AttributeName']: attr['AttributeType'] for attr in response['Table']['AttributeDefinitions']}
    return key_schema, attribute_definitions

def get_items(table_name, select_col_name, possible_val):
    deserializer = TypeDeserializer()
    response = db.scan(TableName=table_name, Limit=100)
    items = response['Items']
    clean_items = [{k: deserializer.deserialize(v) for k, v in item.items()} for item in items]
    df = pd.DataFrame(clean_items)
    # Apply the sidebar filter (e.g. DeleteConfirmation == "Pending")
    if select_col_name and select_col_name in df.columns:
        df = df[df[select_col_name] == possible_val]
    return df


def streamlit_approval():
    # ---- Streamlit UI ----
    st.title("✨DynamoDB Table Viewer✨")
    st.sidebar.title("Filter Table options")
    table_name = st.sidebar.selectbox("Select Table", list_tables())
    select_col_name = st.sidebar.selectbox("Filter the Column", ['DeleteConfirmation'] if table_name else [])
    possible_val = st.sidebar.selectbox("Select The Column Value", ("Pending",)) if select_col_name == 'DeleteConfirmation' else []

    if st.sidebar.button("Get Data"):
        df = get_items(table_name, select_col_name, possible_val)
        st.session_state.df = df
        st.session_state.table_name = table_name

    if 'df' in st.session_state:
        st.write(f"### DynamoDB Table: `{st.session_state.table_name}`")
        df_filtered = st.session_state.df
        column_config = {}
        for col in df_filtered.columns:
            if col == "DeleteConfirmation":
                column_config[col] = st.column_config.SelectboxColumn(
                    "DeleteConfirmation",
                    options=["Pending", "Approved"],
                    help="Change delete status"
                )
            else:
                column_config[col] = st.column_config.TextColumn(
                    label=col,
                    disabled=True
                )
        edited_df = st.data_editor(df_filtered, column_config=column_config, use_container_width=True, key="only_delete_editable", num_rows="fixed")

        if st.button("Save Changes"):
            key_schema, attribute_definitions = get_table_key_schema(st.session_state.table_name)
            for index, row in edited_df.iterrows():
                if row['DeleteConfirmation'] == 'Approved':
                    # Build the key based on table schema
                    key = {}
                    for key_attr in key_schema:
                        attr_name = key_attr['AttributeName']
                        attr_type = attribute_definitions[attr_name]
                        key_value = row[attr_name]
                        key[attr_name] = {attr_type: str(key_value)}

                    db.update_item(
                        TableName=st.session_state.table_name,
                        Key=key,
                        UpdateExpression='SET DeleteConfirmation = :val',
                        ExpressionAttributeValues={':val': {'S': 'Approved'}}
                    )
            st.success("Changes saved successfully!")

streamlit_approval()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DeleteVolumeFunction/lambda_function.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import os
import json

def lambda_handler(event, context):
    # EventBridge Pipes delivers the stream records as a plain JSON list
    for ev in event:
        if ev['dynamodb']['NewImage']['DeleteConfirmation']['S'] == 'Approved' and ev['dynamodb']['OldImage']['DeleteConfirmation']['S'] == 'Pending':
            volume_id = ev['dynamodb']['OldImage']['VolumeID']['S']
            region = ev['awsRegion']
            ec2 = boto3.client('ec2', region_name=region)
            try:
                ec2.delete_volume(VolumeId=volume_id)
                print(f"Successfully deleted volume: {volume_id}")
                ## Delete the item of that volume id from DynamoDB
                dynamodb = boto3.resource('dynamodb', region_name=region)
                table_name = os.environ['DYNAMODB_TABLE_NAME']
                table = dynamodb.Table(table_name)
                table.delete_item(
                    Key={
                        'VolumeID': volume_id  # table hash key is "VolumeID"
                    }
                )

            except Exception as e:
                print(f"Error deleting volume {volume_id}: {str(e)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
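&lt;p&gt;A note on the event shape: because records arrive through an EventBridge Pipe rather than a direct event source mapping, the Lambda receives a plain JSON list, not a dict with a Records key. The filter logic can be exercised offline against a hand-built record (a sketch with a made-up volume ID; approved_volume_ids is our illustrative helper, not part of the deployed function):&lt;/p&gt;

```python
# Pull out volume IDs whose approval just flipped Pending -> Approved,
# given a batch of DynamoDB Stream records as EventBridge Pipes delivers
# them (a plain list). The volume ID below is made up.
def approved_volume_ids(records):
    ids = []
    for ev in records:
        old = ev["dynamodb"]["OldImage"]["DeleteConfirmation"]["S"]
        new = ev["dynamodb"]["NewImage"]["DeleteConfirmation"]["S"]
        if old == "Pending" and new == "Approved":
            ids.append(ev["dynamodb"]["OldImage"]["VolumeID"]["S"])
    return ids

sample_batch = [{
    "awsRegion": "us-east-1",
    "dynamodb": {
        "OldImage": {"VolumeID": {"S": "vol-0abc123"},
                     "DeleteConfirmation": {"S": "Pending"}},
        "NewImage": {"VolumeID": {"S": "vol-0abc123"},
                     "DeleteConfirmation": {"S": "Approved"}},
    },
}]

print(approved_volume_ids(sample_batch))  # -> ['vol-0abc123']
```

&lt;p&gt;The EventBridge Pipe's filter criteria (shown in eventbridge.tf below) already enforce the same Pending-to-Approved condition server-side, so the in-function check is a defensive double-check.&lt;/p&gt;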



&lt;p&gt;&lt;strong&gt;Terraform Codes:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Terraform/alb.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_lb_target_group" "this_tg" {
  name     = var.TG_conf["name"]
  port     = var.TG_conf["port"]
  protocol = var.TG_conf["protocol"]
  vpc_id   = data.aws_vpc.this_vpc.id
  health_check {
    enabled           = var.TG_conf["enabled"]
    healthy_threshold = var.TG_conf["healthy_threshold"]
    interval          = var.TG_conf["interval"]
    path              = var.TG_conf["path"]
  }
  target_type = var.TG_conf["target_type"]
  tags = {
    Attached_ALB_dns = aws_lb.this_alb.dns_name
  }
}


resource "aws_lb" "this_alb" {
  name               = var.ALB_conf["name"]
  load_balancer_type = var.ALB_conf["load_balancer_type"]
  ip_address_type    = var.ALB_conf["ip_address_type"]
  internal           = var.ALB_conf["internal"]
  security_groups    = [data.aws_security_group.ext_alb.id]
  subnets            = [data.aws_subnet.web_subnet_1a.id, data.aws_subnet.web_subnet_1b.id]
  tags               = merge(var.alb_tags)
}

resource "aws_lb_listener" "this_alb_lis" {
  for_each          = var.Listener_conf
  load_balancer_arn = aws_lb.this_alb.arn
  port              = each.value["port"]
  protocol          = each.value["protocol"]
  default_action {
    type             = each.value["type"]
    target_group_arn = aws_lb_target_group.this_tg.arn
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/data.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# vpc details :

data "aws_vpc" "this_vpc" {
  state = "available"
  filter {
    name   = "tag:Name"
    values = ["custom-vpc"]
  }
}
# subnets details :

data "aws_subnet" "web_subnet_1a" {
  vpc_id = data.aws_vpc.this_vpc.id
  filter {
    name   = "tag:Name"
    values = ["weblayer-pub1-1a"]
  }
}

data "aws_subnet" "web_subnet_1b" {
  vpc_id = data.aws_vpc.this_vpc.id
  filter {
    name   = "tag:Name"
    values = ["weblayer-pub2-1b"]
  }
}

# ALB security group details :
data "aws_security_group" "ext_alb" {
  filter {
    name   = "tag:Name"
    values = ["ALBSG"]
  }
}

data "aws_security_group" "streamlit_app" {
  filter {
    name   = "tag:Name"
    values = ["StreamlitAppSG"]
  }
}

# Lambda execution role
data "aws_iam_role" "lambda_role" {
  name = var.lambda_role
}

# sns topic details
data "aws_sns_topic" "sns_topic_info" {
  name = var.sns
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/dynamodb.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_dynamodb_table" "dynamodb-table" {
  name             = var.dynamodb_table
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "VolumeID"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"
  attribute {
    name = "VolumeID"
    type = "S" # String type
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/ecs.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AWS ECR Repository~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

resource "aws_ecr_repository" "aws-ecr" {
  name = var.ecr_repo
  tags = var.ecr_tags
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AWS ECS Cluster~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

resource "aws_ecs_cluster" "aws-ecs-cluster" {
  name = var.ecs_details["Name"]
  configuration {
    execute_command_configuration {
      kms_key_id = aws_kms_key.kms.arn
      logging    = var.ecs_details["logging"]
      log_configuration {
        cloud_watch_encryption_enabled = true
        cloud_watch_log_group_name     = aws_cloudwatch_log_group.log-group.name
      }
    }
  }
  tags = var.custom_tags
}

resource "aws_ecs_task_definition" "taskdef" {
  family = var.ecs_task_def["family"]
  container_definitions = jsonencode([
    {
      # Reference variables directly so booleans/numbers keep their types;
      # quoting them in "${...}" would emit strings, which the ECS API
      # rejects for fields like "essential", "cpu", and "memory".
      # (networkMode is a task-level setting, configured below.)
      "name" : var.ecs_task_def["cont_name"],
      "image" : "${aws_ecr_repository.aws-ecr.repository_url}:v1",
      "entrypoint" : [],
      "essential" : var.ecs_task_def["essential"],
      "logConfiguration" : {
        "logDriver" : var.ecs_task_def["logdriver"],
        "options" : {
          "awslogs-group" : aws_cloudwatch_log_group.log-group.id,
          "awslogs-region" : var.region,
          "awslogs-stream-prefix" : "app-prd"
        }
      },
      "portMappings" : [
        {
          "containerPort" : var.ecs_task_def["containerport"]
        }
      ],
      "cpu" : var.ecs_task_def["cpu"],
      "memory" : var.ecs_task_def["memory"]
    }
  ])

  requires_compatibilities = var.ecs_task_def["requires_compatibilities"]
  network_mode             = var.ecs_task_def["networkmode"]
  memory                   = var.ecs_task_def["memory"]
  cpu                      = var.ecs_task_def["cpu"]
  execution_role_arn       = aws_iam_role.ecsTaskExecutionRole.arn
  task_role_arn            = aws_iam_role.ecsTaskExecutionRole.arn

  tags = var.custom_tags
}



#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AWS CloudWatch Log Group~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

resource "aws_cloudwatch_log_group" "log-group" {
  name = var.cw_log_grp
  tags = var.custom_tags
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AWS KMS Key~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

resource "aws_kms_key" "kms" {
  description             = var.kms_key["description"]
  deletion_window_in_days = var.kms_key["deletion_window_in_days"]
  tags                    = var.custom_tags
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/eventbridge.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;



resource "aws_pipes_pipe" "event_pipe" {
  depends_on  = [data.aws_iam_role.lambda_role]
  name        = var.eventbridge_pipe
  description = "EventBridge Pipe to process DynamoDB Stream data to Lambda"
  role_arn    = data.aws_iam_role.lambda_role.arn
  source      = aws_dynamodb_table.dynamodb-table.stream_arn
  target      = aws_lambda_function.lambda_2.arn

  source_parameters {
    dynamodb_stream_parameters {
      starting_position = "LATEST"
    }

    filter_criteria {
      filter {
        pattern = jsonencode({
          dynamodb = {
            OldImage = {
              DeleteConfirmation = {
                S = ["Pending"]
              }
            },
            NewImage = {
              DeleteConfirmation = {
                S = ["Approved"]
              }
            }
          }
        })
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/iam.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_iam_role" "ecsTaskExecutionRole" {
  name               = var.ecs_role
  assume_role_policy = data.aws_iam_policy_document.assume_role_policy.json
}

data "aws_iam_policy_document" "assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

locals {
  # NOTE: AdministratorAccess is broader than least privilege; scope this
  # down to the specific actions the task needs in production
  policy_arn = [
    "arn:aws:iam::aws:policy/AdministratorAccess",
    "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role",
    "arn:aws:iam::669122243705:policy/CustomPolicyECS"
  ]
}
resource "aws_iam_role_policy_attachment" "ecsTaskExecutionRole_policy" {
  count      = length(local.policy_arn)
  role       = aws_iam_role.ecsTaskExecutionRole.name
  policy_arn = element(local.policy_arn, count.index)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/lambda.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Archive the Codespaces~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

data "archive_file" "lambda_zip_1" {
  type        = "zip"
  source_dir  = "${path.module}/../VolumeDiscoveryFunction"
  output_path = "${path.module}/../VolumeDiscoveryFunction/lambda.zip"
}

data "archive_file" "lambda_zip_2" {
  type        = "zip"
  source_dir  = "${path.module}/../DeleteVolumeFunction"
  output_path = "${path.module}/../DeleteVolumeFunction/lambda.zip"
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Lambda Functions~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
#&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;Lambda Func: VolumeDiscoveryFunction&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;#

resource "aws_lambda_function" "lambda_1" {
  filename         = data.archive_file.lambda_zip_1.output_path
  function_name    = var.lambda_function_1
  role             = data.aws_iam_role.lambda_role.arn
  handler          = "lambda_function.lambda_handler"
  source_code_hash = data.archive_file.lambda_zip_1.output_base64sha256

  runtime     = "python3.13"
  layers      = [var.lambda_layer_arn]
  timeout     = 60
  memory_size = 900
  ephemeral_storage {
    size = 1024
  }

  environment {
    variables = {
      BUCKET_NAME = var.s3_bucket_name
      SNS_ARN     = data.aws_sns_topic.sns_topic_info.arn
      TABLE_NAME  = var.dynamodb_table
    }
  }

}


#&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;Lambda Func: DeleteVolumeFunction&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;#

resource "aws_lambda_function" "lambda_2" {
  filename         = data.archive_file.lambda_zip_2.output_path
  function_name    = var.lambda_function_2
  role             = data.aws_iam_role.lambda_role.arn
  handler          = "lambda_function.lambda_handler"
  source_code_hash = data.archive_file.lambda_zip_2.output_base64sha256

  runtime     = "python3.13"
  timeout     = 60
  memory_size = 250
  environment {
    variables = {
      TABLE_NAME = var.dynamodb_table
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/output.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AWS ECR Repository~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
output "ecr_arn" {
  value = aws_ecr_repository.aws-ecr.arn
}

output "ecr_registry_id" {
  value = aws_ecr_repository.aws-ecr.registry_id
}

output "ecr_url" {
  value = aws_ecr_repository.aws-ecr.repository_url
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AWS ALB~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
output "arn" {
  value = [aws_lb.this_alb.arn]
}

output "dns_name" {
  value = [aws_lb.this_alb.dns_name]
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AWS ECS Cluster~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
output "ecs_arn" {
  value = aws_ecs_cluster.aws-ecs-cluster.id
}

output "cw_log_group_arn" {
  value = aws_cloudwatch_log_group.log-group.arn
}

output "kms_id" {
  value = aws_kms_key.kms.id
}

output "kms_arn" {
  value = aws_kms_key.kms.arn
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AWS Lambda~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

output "lambda" {
  value = {
    lambda_1 = aws_lambda_function.lambda_1.arn
    lambda_2 = aws_lambda_function.lambda_2.arn
  }
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~AWS DynamoDB~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

output "dynamodb_table_name" {
  value = aws_dynamodb_table.dynamodb-table.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/provider.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "6.17.0"
    }
    archive = {
      source  = "hashicorp/archive"
      version = "2.7.1"
    }
  }

  backend "s3" {
    bucket = "terraform0806"
    key    = "TerraformStateFiles"
    region = "us-east-1"
  }
}


provider "aws" {
  # Configuration options
  region = "us-east-1"
}
provider "archive" {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/terraform.tfvars&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Terraform/terraform.tfvars of ALB~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

TG_conf = {
  enabled           = true
  healthy_threshold = "2"
  interval          = "30"
  name              = "TargetGroup-External"
  port              = "8501"
  protocol          = "HTTP"
  target_type       = "ip"
  path              = "/"
}

ALB_conf = {
  internal           = false
  ip_address_type    = "ipv4"
  load_balancer_type = "application"
  name               = "ALB-External"
}

Listener_conf = {
  "1" = {
    port     = "80"
    priority = 100
    protocol = "HTTP"
    type     = "forward"
  }
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Terraform/terraform.tfvars of ECS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

ecs_details = {
  Name                           = "Streamlit-cluster"
  logging                        = "OVERRIDE"
  cloud_watch_encryption_enabled = true
}

ecs_task_def = {
  family                   = "custom-task-definition"
  cont_name                = "streamlit"
  cpu                      = 256
  memory                   = 512
  essential                = true
  logdriver                = "awslogs"
  containerport            = 8501
  networkmode              = "awsvpc"
  requires_compatibilities = ["FARGATE", ]
}


cw_log_grp = "cloudwatch-log-group-ecs-cluster"

kms_key = {
  description             = "log group encryption"
  deletion_window_in_days = 7
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Terraform/terraform.tfvars of Lambda~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

lambda_role       = "custom-lambda-role"
lambda_function_1 = "VolumeDiscoveryFunction-1"
lambda_function_2 = "DeleteVolumeFunction-1"
s3_bucket_name    = "terraform0806"
dynamodb_table    = "AvailableEBSVolume-1"
sns               = "SNSEmailNotification"
lambda_layer_arn  = "arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python313:4"
eventbridge_pipe  = "Custom-Eventbridge-Pipe"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform/variable.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Variables of ALB~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

variable "TG_conf" {
  type = object({
    name              = string
    port              = string
    protocol          = string
    target_type       = string
    enabled           = bool
    healthy_threshold = string
    interval          = string
    path              = string
  })
}

variable "ALB_conf" {
  type = object({
    name               = string
    internal           = bool
    load_balancer_type = string
    ip_address_type    = string
  })
}

variable "Listener_conf" {
  type = map(object({
    port     = string
    protocol = string
    type     = string
    priority = number
  }))
}

variable "alb_tags" {
  description = "provides the tags for ALB"
  type = object({
    Environment = string
    Email       = string
    Type        = string
    Owner       = string
  })
  default = {
    Email       = "dasanirban9019@gmail.com"
    Environment = "Dev"
    Owner       = "Anirban Das"
    Type        = "External"
  }
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Variables of ECR~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

variable "ecr_repo" {
  description = "Name of repository"
  default     = "streamlit-repo"
}

variable "ecr_tags" {
  type = map(any)
  default = {
    "AppName" = "StreamlitApp"
    "Env"     = "Dev"
  }
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Variables of ECS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

variable "region" {
  type    = string
  default = "us-east-1"
}

variable "ecs_role" {
  description = "ecs roles"
  default     = "ecsTaskExecutionRole"
}

variable "ecs_details" {
  description = "details of ECS cluster"
  type = object({
    Name                           = string
    logging                        = string
    cloud_watch_encryption_enabled = bool
  })
}

variable "ecs_task_def" {
  description = "defines the configurations of task definition"
  type = object({
    family                   = string
    cont_name                = string
    cpu                      = number
    memory                   = number
    essential                = bool
    logdriver                = string
    containerport            = number
    networkmode              = string
    requires_compatibilities = list(string)

  })
}


variable "cw_log_grp" {
  description = "defines the log group in cloudwatch"
  type        = string
  default     = ""
}

variable "kms_key" {
  description = "defines the kms key"
  type = object({
    description             = string
    deletion_window_in_days = number
  })
}

variable "custom_tags" {
  description = "defines common tags"
  type        = object({})
  default = {
    AppName = "StreamlitApp"
    Env     = "Dev"
  }
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Variables of Lambda~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

variable "lambda_role" {
  type = string
}

variable "lambda_function_1" {
  type = string
}

variable "lambda_function_2" {
  type = string
}

variable "s3_bucket_name" {
  type = string
}

variable "dynamodb_table" {
  type = string
}

variable "sns" {
  type = string
}

variable "lambda_layer_arn" {
  type = string
}

variable "eventbridge_pipe" {
  type = string
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
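&lt;p&gt;The tfvars above wire up two Lambda functions (&lt;code&gt;VolumeDiscoveryFunction-1&lt;/code&gt; and &lt;code&gt;DeleteVolumeFunction-1&lt;/code&gt;). As a rough, hypothetical sketch (not the project's actual code), the core of the discovery function boils down to filtering EC2 volume records for ones that are unattached. The function name and sample records below are illustrative only:&lt;/p&gt;

```python
# Hypothetical sketch of the discovery logic behind a Lambda like
# "VolumeDiscoveryFunction-1" -- illustrative, not the author's code.
# It filters EC2 describe_volumes-style records for unattached volumes.

def find_orphaned_volumes(volumes):
    """Return IDs of EBS volumes that are unattached (state 'available')."""
    return [
        v["VolumeId"]
        for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]

# In the real Lambda these records would come from
# boto3.client("ec2").describe_volumes()["Volumes"].
volumes = [
    {"VolumeId": "vol-1", "State": "in-use",
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-2", "State": "available", "Attachments": []},
]
print(find_orphaned_volumes(volumes))  # ['vol-2']
```

&lt;p&gt;In the actual workflow, each matching volume ID would then be written to the DynamoDB table for approval before the delete function ever touches it.&lt;/p&gt;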



&lt;p&gt;&lt;strong&gt;Impact Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cost Savings: Automated cleanup reduces unnecessary storage costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operational Efficiency: Eliminates manual audits and cleanup scripts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Governance: Approval workflow ensures accountability and prevents accidental deletions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability: Works seamlessly across multiple accounts and regions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
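&lt;p&gt;To make the cost-savings point concrete, here is a small illustrative calculation using the roughly $0.08 per GB-month gp3 rate cited in the problem statement. Treat the rate and numbers as assumptions for illustration, not a billing formula:&lt;/p&gt;

```python
# Illustrative only: estimate the monthly storage cost of orphaned
# gp3 volumes. The $0.08/GB-month rate is an approximation from the
# article's problem statement, not an official AWS price quote.

GP3_RATE_PER_GB_MONTH = 0.08

def monthly_orphan_cost(volume_sizes_gb, rate=GP3_RATE_PER_GB_MONTH):
    """Sum the estimated monthly storage cost of unattached volumes."""
    return sum(volume_sizes_gb) * rate

# Ten forgotten 100 GB volumes quietly cost about $80 every month.
print(round(monthly_orphan_cost([100] * 10), 2))  # 80.0
```

&lt;p&gt;Multiplied across accounts and regions, even modest volumes like these add up, which is what makes automated discovery and cleanup worthwhile.&lt;/p&gt;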

&lt;p&gt;&lt;strong&gt;Future Improvement Possibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Extend support to unused snapshots and AMIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add multi-account aggregation using AWS Organizations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integrate with Slack or Teams notifications for approval requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enhance the web app with role-based access control (RBAC).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implement machine learning-based recommendations for identifying safe-to-delete volumes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Auto Orphan Volume Cleanup project demonstrates how automation, infrastructure-as-code, and approval workflows can transform cloud resource management. By combining AWS services with a simple web interface, we achieved a secure, scalable, and cost-efficient solution to a common cloud challenge. This approach not only saves money but also strengthens governance and operational hygiene in AWS environments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>Monthly Golden Image Build Process using Packer &amp; Ansible</title>
      <dc:creator>Susseta Bose</dc:creator>
      <pubDate>Tue, 23 Dec 2025 04:43:31 +0000</pubDate>
      <link>https://dev.to/susseta/monthly-golden-image-build-process-using-packer-ansible-191i</link>
      <guid>https://dev.to/susseta/monthly-golden-image-build-process-using-packer-ansible-191i</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;In IT operations, imagine we're working in a cloud organization that deploys hundreds of EC2 instances every month. Each instance needs to be secure and compliant, and manually configuring each one is a nightmare. Instead, you want a golden image: a reusable AMI that's pre-hardened and provisioned with all the necessary tools.&lt;/p&gt;

&lt;p&gt;This is where Packer comes in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Statement:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In most organizations, EC2 instances are launched frequently to support various workloads. The catch is that each instance needs to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Secure and compliant&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Equipped with monitoring and security agents&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistently configured&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Scenario:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose you’re part of a security-conscious enterprise. Every EC2 instance must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Follow CIS benchmarks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Have CrowdStrike and Qualys agents installed&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of configuring each instance post-launch, you want to create a golden AMI that’s already hardened and provisioned. This image will serve as the base for all future deployments — saving time and ensuring consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools Involved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Packer&lt;/li&gt;
&lt;li&gt;AWS EC2&lt;/li&gt;
&lt;li&gt;Ansible&lt;/li&gt;
&lt;li&gt;Gitlab CI/CD&lt;/li&gt;
&lt;li&gt;Amazon SSM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture Diagram:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiwql5nlfclj0j1skzzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiwql5nlfclj0j1skzzf.png" alt=" " width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This workflow automates the creation of a secure AMI by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Launching a temporary EC2 instance from a base image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Running provisioning scripts to apply OS hardening (CIS benchmarks, firewall rules, SSH configs).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creating a new AMI from the configured instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Terminating the temporary instance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implementation Steps :&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Install Packer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42bflm2ao27pe95xeb3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42bflm2ao27pe95xeb3b.png" alt=" " width="583" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Create Packer Template with Ansible Provisioning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This Packer template automates the creation of a custom Amazon Machine Image (AMI) by launching a temporary EC2 instance in a specific AWS VPC and subnet, using a designated SSH key pair for secure access.&lt;/p&gt;

&lt;p&gt;Packer template (aws-linux.pkr.hcl)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;packer {
  required_plugins {
    amazon = {
      version = "&amp;gt;= 1.2.8"
      source  = "github.com/hashicorp/amazon"
    }
    ansible = {
      version = "~&amp;gt; 1"
      source = "github.com/hashicorp/ansible"
    }  
  }
}
variable "ami_prefix" {
  type = string
  default = ""
}

variable "reference_image" {
  type = string
  default = ""
}

locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

variable "privatekey"{
  type = string
  default = ""
}

source "amazon-ebs" "amazon_linux" {
  ami_name      = "${var.ami_prefix}-${local.timestamp}"
  instance_type = "t2.micro"
  region        = "ap-south-1"
  vpc_id = "vpc-07b2ce11f9b189f3b"
  subnet_id = "subnet-063ebc661edd9fb37"
  security_group_id = "sg-04e9ae673095b02e9"
  ssh_interface = "private_ip"
  associate_public_ip_address = true
  ssh_keypair_name = "runner_key"
  ssh_private_key_file = var.privatekey

  source_ami_filter {
    filters = {
      name                = "${var.reference_image}"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    most_recent = true
    owners      = [""]
  }
  ssh_username = "ec2-user"
}

build {
  name = "learn-packer"
  sources = [
    "source.amazon-ebs.amazon_linux"
  ]

  provisioner "shell" {
  inline = [
    "sleep 20",
    "echo '--- Running AMI pre-check ---'",
    "set -e",
    "sudo mkdir -p /usr/lib",
    "# Ensure SFTP subsystem path exists",
    "if [ ! -f /usr/lib/sftp-server ]; then",
    "  if [ -f /usr/libexec/openssh/sftp-server ]; then",
    "    sudo ln -s /usr/libexec/openssh/sftp-server /usr/lib/sftp-server",
    "    echo 'Linked /usr/libexec/openssh/sftp-server -&amp;gt; /usr/lib/sftp-server';",
    "  else",
    "    echo 'Warning: sftp-server not found, installing openssh-server...';",
    "    sudo yum install -y openssh-server || sudo apt-get install -y openssh-server;",
    "  fi;",
    "fi",

    "# Basic network sanity check",
    "sudo yum clean all || true",
    "sudo yum update -y || true",
    "echo '--- Pre-check complete ---'"
  ]
  }

  provisioner "shell" {
  inline = [
    "sudo mkdir -p /tmp/.ansible",
    "sudo chmod 777 /tmp/.ansible"
  ]
  }

  provisioner "ansible" {
    playbook_file = "./playbook/main.yml"
    use_proxy = false
    extra_arguments = ["--vault-password-file=/home/gitlab-runner/.vault_pass",
    "-e", "ansible_remote_tmp=/tmp/.ansible",
    "-e", "ansible_local_tmp=/tmp/.ansible",
    "-e", "ansible_scp_if_ssh=True",
    "-e", "ansible_python_interpreter=/usr/bin/python3",
    "-e", "ansible_ssh_transfer_method=scp"]
  }

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ansible Main Block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
- name: Create users and providing sudo access
  hosts: all
  become: true
  gather_facts: true
  vars_files:
    - ../vars/useradd.yml
    - ../vars/vault.yml
  roles:
    - ../roles/useradd
    - ../roles/sudo

- name: Set hostnames
  hosts: all
  become: true
  gather_facts: false
  vars_files:
    - ../vars/var.yml
  roles:
    - ../roles/hostnamectl

- name: Enable or Set miscellaneous services
  hosts: all
  gather_facts: false
  become: true
  roles:
     - ../roles/ssh
     - ../roles/login_banner
     - ../roles/services
     - ../roles/timezone
    # - ../roles/fs_integrity
    #  - ../roles/selinux
    #  - ../roles/firewalld
    #  - ../roles/log_management
     - ../roles/rsyslog
    #  - ../roles/cron
    #  - ../roles/journald

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ansible User Creation block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
- hosts: all
  become: true
  gather_facts: true
  vars_files:
    - ../vars/useradd.yml
    - ../vars/vault.yml
  roles:
    - ../roles/useradd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This template builds a custom AMI by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Launching a VM in a specific VPC and subnet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using a defined SSH key pair for access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Running shell scripts and Ansible to configure the instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Saving the final image with a unique name for future use.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/strong&gt; I've also configured a CI/CD variable in GitLab to securely store the private key content used for SSH access during the Packer build.&lt;/p&gt;

&lt;p&gt;GitLab CI/CD Variable for Private Key&lt;br&gt;
In GitLab, CI/CD variables allow us to store sensitive data like passwords, tokens, or SSH keys securely.&lt;/p&gt;

&lt;p&gt;I've created a variable (SSH_PRIVATE_KEY, the name referenced in the pipeline) that contains the entire private key content (not just a path to it).&lt;/p&gt;

&lt;p&gt;This variable is injected into the pipeline at runtime, allowing tools like Packer to use it without hardcoding the key or exposing it in our repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;packer validate -var-file="ami.pkrvars.hcl" -var "privatekey=runner_key.pem" aws-linux.pkr.hcl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: GitLab Pipeline Stages&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This GitLab CI/CD job automates the AMI creation process by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Securely injecting an SSH private key from a CI/CD variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validating and building the Packer template.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provisioning an EC2 instance in a specific VPC/subnet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Saving the final image for future use.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default: 
    tags:
      - gitlab_runner

stages:
  - image_build

Image Build: 
  stage: image_build
  script: 
    - echo "$SSH_PRIVATE_KEY" &amp;gt; runner_key.pem
    - chmod 400 runner_key.pem
    - packer init .
    - echo "Validating packer template..."
    - packer validate -var-file="ami.pkrvars.hcl" -var "privatekey=runner_key.pem" aws-linux.pkr.hcl
    - echo "Building AMI..."
    - packer build -var-file="ami.pkrvars.hcl" -var "privatekey=runner_key.pem" aws-linux.pkr.hcl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; In this article, we tackled a common challenge in cloud operations: ensuring every EC2 instance is secure, compliant, and consistently configured, without manual intervention.&lt;/p&gt;

&lt;p&gt;By combining Packer, Ansible, and GitLab CI/CD, we built a fully automated pipeline that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Launches a temporary EC2 instance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Applies CIS hardening and installs security agents&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Saves a golden AMI for future use&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Secures credentials using GitLab CI/CD variables&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach not only boosts security and compliance but also saves hours of manual effort, reduces human error, and ensures every deployment starts from a trusted baseline.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br&gt;
Susseta Bose&lt;/p&gt;

</description>
      <category>automation</category>
      <category>aws</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
