<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Philippe borribo</title>
    <description>The latest articles on DEV Community by Philippe borribo (@philippe_borribo).</description>
    <link>https://dev.to/philippe_borribo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3165494%2F601b308e-60d8-442a-bc31-445da37f5e48.png</url>
      <title>DEV Community: Philippe borribo</title>
      <link>https://dev.to/philippe_borribo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/philippe_borribo"/>
    <language>en</language>
    <item>
      <title>FinOps Optimization: Reducing AWS Bills Through Automated EC2 Shutdowns</title>
      <dc:creator>Philippe borribo</dc:creator>
      <pubDate>Fri, 16 May 2025 12:58:29 +0000</pubDate>
      <link>https://dev.to/philippe_borribo/finops-optimization-reducing-aws-bills-through-automated-ec2-shutdowns-1d9e</link>
      <guid>https://dev.to/philippe_borribo/finops-optimization-reducing-aws-bills-through-automated-ec2-shutdowns-1d9e</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;1. Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As organizations increasingly migrate workloads to the cloud, cost management becomes a critical component of sustainable operations. While cloud services like AWS offer flexibility and scalability, they also come with the risk of escalating expenses if not carefully monitored. This is where FinOps — a practice that blends financial accountability with cloud engineering — plays a key role.&lt;br&gt;
One of the most common sources of unnecessary cloud expenditure is idle or underutilized resources, particularly Amazon EC2 instances that continue running outside of business hours. Many teams spin up instances for development, testing, or internal applications, but fail to shut them down when not in use — often because manual shutdown is inconvenient or simply forgotten.&lt;br&gt;
To address this issue, organizations can implement a simple yet powerful solution: automating the shutdown and restart of EC2 instances based on usage schedules. By stopping non-critical instances during off-hours (e.g., overnight or on weekends), companies can drastically reduce their cloud bills without compromising productivity or system availability.&lt;br&gt;
In this article, we explore how automated EC2 shutdowns can support FinOps goals through a real-world use case. We’ll break down the cost savings, demonstrate how to implement the automation, and calculate the return on investment (ROI) of this approach. Whether you're managing a startup environment or a large-scale enterprise infrastructure, the strategies discussed here can help you take meaningful control of your AWS spending.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Understanding EC2 Cost Structures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To effectively reduce AWS costs through automation, it's essential to first understand how EC2 (Elastic Compute Cloud) pricing works. AWS offers a flexible pricing model for EC2 instances, but this flexibility can also lead to overspending if not managed carefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.1 EC2 Pricing Models&lt;/strong&gt;&lt;br&gt;
AWS provides multiple pricing options for EC2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;On-Demand Instances: These are billed per second (with a minimum of 60 seconds) and are the most flexible option. They are ideal for short-term, unpredictable workloads, but also the most expensive if left running continuously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reserved Instances (RIs): Offer significant discounts (up to 75%) in exchange for committing to use a specific instance type in a specific region for a 1- or 3-year term. While cost-effective, they require predictable workloads and upfront planning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spot Instances: Let you use spare EC2 capacity at discounts of up to 90% off the on-demand price (Spot no longer uses bidding; you pay the current Spot price). However, AWS can reclaim the capacity at any time with only a two-minute interruption notice.&lt;br&gt;
In this article, we focus on on-demand EC2 instances, as they are most commonly used in development and testing environments where usage is dynamic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
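&lt;p&gt;To put numbers on this, here is a quick back-of-the-envelope calculation of what one on-demand instance costs when left running 24/7. The $0.0832/hour rate is illustrative (roughly a t3.large in us-east-1); check current AWS pricing for your region and instance type.&lt;/p&gt;

```shell
# Rough monthly cost of one on-demand instance running 24/7,
# at an assumed (illustrative) rate of $0.0832/hour.
RATE=0.0832
HOURS=$((24 * 30))   # hours in a 30-day month
awk -v r="$RATE" -v h="$HOURS" 'BEGIN { printf "Monthly cost: $%.2f\n", r * h }'
```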
&lt;h2&gt;
  
  
  &lt;strong&gt;3. Use Case Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To better illustrate the practical value of automated EC2 shutdowns, let’s consider a real-world use case that many organizations can relate to: a development and testing environment running on AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.1 Scenario Context&lt;/strong&gt;&lt;br&gt;
A mid-sized software company uses AWS EC2 to host several environments for internal development, testing, and QA purposes. These instances are not mission-critical, and developers typically work from 05:00 to 00:00 UTC, leaving a 5-hour window each night when the machines are idle.&lt;/p&gt;

&lt;p&gt;Historically, the EC2 instances in this environment remained running 24/7, even though no one was using them during the early morning hours. This led to thousands of dollars in unnecessary compute costs over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.2 Objective&lt;/strong&gt;&lt;br&gt;
The goal of this cost-optimization initiative was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Automatically shut down all non-critical EC2 instances at midnight (00:00 UTC) and restart them at 05:00 UTC, seven days a week.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This schedule ensured that development teams would always find the environment ready when their workday began, without any delays or disruptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.3 Target Resources&lt;/strong&gt;&lt;br&gt;
The team identified which EC2 instances could be safely stopped without affecting production. To streamline this, they applied a simple tagging policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key: AutoShutdown  
Value: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allowed the automation script to select only the relevant EC2 instances, avoiding the risk of stopping production workloads or other critical infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.4 Results Expected&lt;/strong&gt;&lt;br&gt;
By implementing this scheduled shutdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The company aimed to save at least 20% of the compute cost for each affected instance.&lt;/li&gt;
&lt;li&gt;They expected to automate the process entirely, removing reliance on manual shutdowns.&lt;/li&gt;
&lt;li&gt;The solution had to be scalable, so it could be applied to dozens (or eventually hundreds) of EC2 instances.&lt;/li&gt;
&lt;/ul&gt;
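&lt;p&gt;The 20% target follows directly from the schedule: the instances are stopped 5 hours out of every 24. A one-liner confirms the arithmetic:&lt;/p&gt;

```shell
# Fraction of each day the instances are stopped (00:00-05:00 UTC),
# i.e. the expected per-instance compute saving on on-demand pricing.
awk 'BEGIN { printf "Expected saving: %.1f%%\n", 5 / 24 * 100 }'
```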

&lt;p&gt;This use case sets the stage for the technical implementation and ROI analysis that follows, demonstrating that small changes in operational discipline — when automated — can lead to significant financial impact.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Automation Strategy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To achieve reliable and repeatable cost savings, the shutdown and startup of EC2 instances must be fully automated. This section outlines the tools, methodology, and implementation strategy used to automate EC2 lifecycle operations based on a fixed schedule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4.1 Tools Used&lt;/strong&gt;&lt;br&gt;
A variety of tools can be used to implement EC2 automation, depending on your organization’s existing infrastructure and preferences. For this use case, we’ll focus on a simple, script-based solution using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS CLI: To interact with EC2 via terminal commands (or scripts).&lt;/li&gt;
&lt;li&gt;Crontab (Linux Scheduler): To schedule scripts to run at specific times.&lt;/li&gt;
&lt;li&gt;EC2 Instance Tags: To selectively identify which instances should be managed.&lt;/li&gt;
&lt;li&gt;IAM Roles: To securely authorize the automation script with the correct permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;For more complex or cloud-native implementations, AWS Systems Manager, Lambda, or EventBridge can be used — but here, simplicity and portability are prioritized.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;4.2 Step-by-Step Implementation&lt;/strong&gt;&lt;br&gt;
&lt;u&gt;Step 1: Tag Your EC2 Instances&lt;/u&gt;&lt;br&gt;
Add a custom tag to all EC2 instances that should be included in the shutdown/startup cycle. &lt;br&gt;
For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key: AutoShutdown
Value: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tag will serve as a filter for the automation script.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Step 2: Create IAM Permissions&lt;/u&gt;&lt;br&gt;
Ensure the script (or the instance running the script) has an IAM role or user with at least the following permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:StopInstances",
        "ec2:StartInstances"
      ],
      "Resource": "*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Step 3: Write the Shutdown Script&lt;/u&gt;&lt;br&gt;
Here’s an example in Bash using AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash

# Get instance IDs of EC2s with the tag AutoShutdown=true and that are running
INSTANCES=$(aws ec2 describe-instances \
  --filters "Name=tag:AutoShutdown,Values=true" "Name=instance-state-name,Values=running" \
  --query "Reservations[*].Instances[*].InstanceId" --output text)

if [ -z "$INSTANCES" ]; then
  echo "No instances to stop."
else
  echo "Stopping instances: $INSTANCES"
  aws ec2 stop-instances --instance-ids $INSTANCES
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;Step 4: Write the Startup Script&lt;/u&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash

# Get instance IDs of EC2s with the tag AutoShutdown=true and that are stopped
INSTANCES=$(aws ec2 describe-instances \
  --filters "Name=tag:AutoShutdown,Values=true" "Name=instance-state-name,Values=stopped" \
  --query "Reservations[*].Instances[*].InstanceId" --output text)

if [ -z "$INSTANCES" ]; then
  echo "No instances to start."
else
  echo "Starting instances: $INSTANCES"
  aws ec2 start-instances --instance-ids $INSTANCES
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These scripts can be placed on a dedicated automation instance, or embedded into a lightweight management server.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Step 5: Schedule with Crontab&lt;/u&gt;&lt;br&gt;
Use crontab -e to schedule the jobs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Stop instances every day at 00:00 UTC
0 0 * * * /path/to/stop-ec2.sh &amp;gt;&amp;gt; /var/log/ec2_stop.log 2&amp;gt;&amp;amp;1

# Start instances every day at 05:00 UTC
0 5 * * * /path/to/start-ec2.sh &amp;gt;&amp;gt; /var/log/ec2_start.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the environment running these cron jobs has valid AWS credentials or an appropriate IAM role attached.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4.3 Error Handling and Logging&lt;/strong&gt;&lt;br&gt;
To ensure reliability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add logging to capture success/failure of each operation.&lt;/li&gt;
&lt;li&gt;Use email alerts or Slack/webhook notifications for failed jobs (optional).&lt;/li&gt;
&lt;li&gt;Use AWS CloudWatch Logs or a central syslog server for log aggregation.&lt;/li&gt;
&lt;/ul&gt;
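&lt;p&gt;As a minimal sketch of this logging approach, the helper below timestamps each message and appends it to a log file. The log path and the commented-out Slack webhook are placeholders to adapt to your environment:&lt;/p&gt;

```shell
#!/bin/bash
# Minimal logging helper for the stop/start scripts (a sketch; adapt paths).
LOG_FILE="/var/log/ec2_schedule.log"

log() {
  # Prepend a UTC timestamp and a severity level to each message.
  echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$1] $2" | tee -a "$LOG_FILE"
}

# Example usage inside the stop script:
#   if aws ec2 stop-instances --instance-ids $INSTANCES; then
#     log INFO "Stopped: $INSTANCES"
#   else
#     log ERROR "Stop failed for: $INSTANCES"
#     # Optional Slack alert via an incoming webhook (placeholder URL):
#     # curl -s -X POST -H 'Content-type: application/json' \
#     #   --data '{"text":"EC2 stop job failed"}' "$SLACK_WEBHOOK_URL"
#   fi
```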

&lt;p&gt;&lt;strong&gt;4.4 Scalability Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As the environment grows, you can switch to AWS Lambda + EventBridge for more scalable, serverless management.&lt;/li&gt;
&lt;li&gt;For production environments, consider using State Manager in AWS Systems Manager to enforce desired instance states at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By automating EC2 management with lightweight tools and best practices, organizations can ensure cost savings are applied consistently and securely, without adding operational overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Risks and Considerations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While automating the shutdown and startup of EC2 instances is an effective FinOps strategy, it's not without potential challenges. Failing to account for these risks could lead to service interruptions, data loss, or operational inefficiencies. This section outlines the key considerations every organization should evaluate before deploying an EC2 automation policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5.1 Risk of Stopping Critical Instances&lt;/strong&gt;&lt;br&gt;
The biggest risk in automated shutdowns is inadvertently stopping production or critical infrastructure. If an instance that supports live services is mistakenly tagged (or not properly excluded), it could result in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service downtime&lt;/li&gt;
&lt;li&gt;User disruption&lt;/li&gt;
&lt;li&gt;Violation of SLAs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation Strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use strict tagging policies with defined naming conventions (AutoShutdown=true only on non-critical resources).&lt;/li&gt;
&lt;li&gt;Maintain a list of "Do Not Touch" tags (NoShutdown=true or Critical=true) to exclude sensitive resources.&lt;/li&gt;
&lt;li&gt;Implement a manual approval workflow for newly tagged instances before they are included in automation.&lt;/li&gt;
&lt;/ul&gt;
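&lt;p&gt;One way to honor the exclusion tags is a two-pass filter: query the IDs tagged AutoShutdown=true, query the IDs tagged Critical=true (both lists would normally come from aws ec2 describe-instances with the respective tag filters; they are shown as canned values here for illustration), and only act on the difference:&lt;/p&gt;

```shell
# Exclusion-pass sketch: only instances tagged AutoShutdown=true that
# do NOT also carry Critical=true are considered safe to stop.
# Canned IDs stand in for real describe-instances results.
CANDIDATES="i-0aaa1111 i-0bbb2222 i-0ccc3333"   # tagged AutoShutdown=true
EXCLUDED="i-0bbb2222"                           # tagged Critical=true

SAFE=""
for id in $CANDIDATES; do
  # Keep the ID only if it is absent from the excluded list.
  echo "$EXCLUDED" | grep -qw "$id" || SAFE="$SAFE $id"
done
echo "Safe to stop:$SAFE"
```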

&lt;p&gt;&lt;strong&gt;5.2 Delayed Availability at Startup&lt;/strong&gt;&lt;br&gt;
Even when automation works as expected, there may be a delay in instance availability due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boot time for the OS and applications&lt;/li&gt;
&lt;li&gt;Service warm-up (e.g., databases or backend processes)&lt;/li&gt;
&lt;li&gt;Dependency resolution (e.g., connections to external services)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, a large EC2 instance running a containerized microservice might take 3–5 minutes to fully initialize.&lt;/p&gt;

&lt;p&gt;Mitigation Strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build startup buffers into the schedule (e.g., restart instances at 04:45 instead of 05:00).&lt;/li&gt;
&lt;li&gt;Use health checks to confirm service readiness.&lt;/li&gt;
&lt;li&gt;Leverage pre-warming scripts that prepare the environment immediately after the instance boots.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5.3 Incomplete or Inconsistent Tagging&lt;/strong&gt;&lt;br&gt;
Automation relies on accurate and consistent tagging. In practice, many environments suffer from tagging drift, where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instances are launched without appropriate tags.&lt;/li&gt;
&lt;li&gt;Old instances retain outdated or incorrect tags.&lt;/li&gt;
&lt;li&gt;Developers bypass policies due to lack of enforcement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation Strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enforce tagging at provisioning time using IAM policies, Service Control Policies (SCPs), or Infrastructure-as-Code (IaC) templates like Terraform or CloudFormation.&lt;/li&gt;
&lt;li&gt;Periodically run audit scripts to detect untagged or misconfigured resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use AWS Config rules or Tag Policies in AWS Organizations to maintain compliance.&lt;/p&gt;
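&lt;p&gt;A tag audit can be as simple as scanning a tabular dump of instances. In this sketch, the sample text stands in for the output of an aws ec2 describe-instances query listing each instance ID and its AutoShutdown tag value, with a dash marking a missing tag:&lt;/p&gt;

```shell
# Audit sketch: flag instances with no AutoShutdown tag. The two-column
# sample stands in for a real describe-instances text dump ("-" = no tag).
SAMPLE="i-0aaa1111 true
i-0bbb2222 -
i-0ccc3333 true"

echo "$SAMPLE" | awk '$2 == "-" { print "Untagged instance:", $1 }'
```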

&lt;p&gt;&lt;strong&gt;5.4 Security and Permissions Misconfiguration&lt;/strong&gt;&lt;br&gt;
Automated scripts require IAM permissions. Poorly scoped permissions can either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expose security vulnerabilities (if overly permissive), or&lt;/li&gt;
&lt;li&gt;Break the automation (if too restrictive)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation Strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow the principle of least privilege when defining IAM roles.&lt;/li&gt;
&lt;li&gt;Use instance profiles instead of hardcoded credentials.&lt;/li&gt;
&lt;li&gt;Regularly rotate keys and audit IAM policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5.5 Dependency on a Single Point of Failure&lt;/strong&gt;&lt;br&gt;
If the automation relies on a single server (e.g., a Linux VM running cron jobs), its failure could break the entire process.&lt;/p&gt;

&lt;p&gt;Mitigation Strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use high-availability setups or run scripts from AWS Lambda.&lt;/li&gt;
&lt;li&gt;Monitor task success/failure using CloudWatch alarms, email alerts, or observability tools.&lt;/li&gt;
&lt;li&gt;Keep manual override scripts available to quickly restart or stop instances if needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5.6 Compliance and Auditability&lt;/strong&gt;&lt;br&gt;
In regulated environments, automated changes must be traceable. Stopping or starting instances without proper logging could violate audit requirements.&lt;/p&gt;

&lt;p&gt;Mitigation Strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable CloudTrail to log all EC2 actions (stop/start/terminate).&lt;/li&gt;
&lt;li&gt;Centralize logs in CloudWatch Logs or a SIEM for review.&lt;/li&gt;
&lt;li&gt;Document the automation policy and keep it aligned with governance frameworks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Best Practices for EC2 Cost Optimization&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Successfully automating the shutdown and startup of EC2 instances is only the beginning. To sustain the benefits and avoid backsliding into inefficient usage, organizations must adopt a set of operational and cultural best practices. This section provides practical guidance for embedding EC2 automation into your long-term FinOps and cloud governance strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6.1 Establish a Strong Tagging Policy&lt;/strong&gt;&lt;br&gt;
Tagging is the backbone of instance targeting for automation. A consistent, enforced tagging strategy helps you scale cost-saving efforts and reduce errors.&lt;/p&gt;

&lt;p&gt;Recommendations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define required tags such as:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;AutoShutdown=true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Environment=dev/test/prod&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Owner=team-name&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CostCenter=12345&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AWS Tag Policies and IAM tag enforcement rules to ensure compliance.&lt;/li&gt;
&lt;li&gt;Automate tagging through Infrastructure as Code (IaC) templates like Terraform, CloudFormation, or Pulumi.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6.2 Involve DevOps and Developers Early&lt;/strong&gt;&lt;br&gt;
Automation policies should be collaborative, not imposed. Developers and DevOps engineers often know which workloads are safe to shut down — and which are not.&lt;/p&gt;

&lt;p&gt;Tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Involve teams in identifying auto-shutdown candidates.&lt;/li&gt;
&lt;li&gt;Provide them with tooling or dashboards to opt in/out.&lt;/li&gt;
&lt;li&gt;Educate teams about the cost impact of idle resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6.3 Use AWS Native Services for Scaling&lt;/strong&gt;&lt;br&gt;
While shell scripts and crontabs work well for small-scale environments, growing organizations benefit from AWS-native, serverless automation.&lt;/p&gt;

&lt;p&gt;Advanced options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS EventBridge: Schedule events to trigger instance actions.&lt;/li&gt;
&lt;li&gt;AWS Lambda: Run scripts without managing infrastructure.&lt;/li&gt;
&lt;li&gt;AWS Systems Manager Automation Documents (SSM): Define and execute EC2 stop/start workflows with tracking and audit logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These services are more resilient, monitorable, and maintainable over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6.4 Track Savings with Tag-Based Cost Allocation&lt;/strong&gt;&lt;br&gt;
Use AWS Cost Explorer and Cost Allocation Tags to measure the financial impact of your automation initiative.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Activate the AutoShutdown tag in the AWS billing console.&lt;/li&gt;
&lt;li&gt;Use Cost Explorer filters to compare costs before and after implementation.&lt;/li&gt;
&lt;li&gt;Present monthly reports to stakeholders showing realized savings per instance, team, or environment.&lt;/li&gt;
&lt;/ul&gt;
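&lt;p&gt;A simple before/after comparison of two monthly totals exported from Cost Explorer is often all stakeholders need (the figures below are illustrative):&lt;/p&gt;

```shell
# Compare monthly cost before and after enabling the shutdown schedule.
# Both totals are illustrative values exported from Cost Explorer.
awk -v before=1240.50 -v after=978.30 \
  'BEGIN { printf "Saved: $%.2f (%.1f%%)\n", before - after, (before - after) / before * 100 }'
```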

&lt;p&gt;This reinforces accountability and justifies future FinOps investments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6.5 Monitor and Iterate&lt;/strong&gt;&lt;br&gt;
Automation is not a "set it and forget it" strategy. EC2 usage evolves, new teams spin up instances, and requirements change.&lt;/p&gt;

&lt;p&gt;Recommendations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up alerts and monitoring to track failed or missed shutdowns.&lt;/li&gt;
&lt;li&gt;Schedule quarterly reviews to refine automation rules and schedules.&lt;/li&gt;
&lt;li&gt;Maintain documentation and onboarding guides for new teams.&lt;/li&gt;
&lt;li&gt;Use automation scripts stored in version control (e.g., GitHub, CodeCommit) to enable version tracking and collaborative updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6.6 Integrate into Broader FinOps Practice&lt;/strong&gt;&lt;br&gt;
Automating EC2 shutdowns should be part of a larger FinOps maturity model, which may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rightsizing instances&lt;/li&gt;
&lt;li&gt;Buying Reserved Instances or Savings Plans&lt;/li&gt;
&lt;li&gt;Deleting unused volumes or snapshots&lt;/li&gt;
&lt;li&gt;Optimizing S3 storage tiers&lt;/li&gt;
&lt;li&gt;Tracking per-project or per-department costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By integrating EC2 scheduling into a FinOps culture, organizations can align cloud usage with business value.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Automating the shutdown and startup of EC2 instances may seem like a simple technical task, but it represents a powerful and scalable FinOps strategy with measurable impact. By reducing idle compute time, especially outside business hours, organizations can unlock significant cost savings without compromising performance or productivity.&lt;br&gt;
As demonstrated, this approach requires more than a few lines of script: it calls for a disciplined framework involving tagging policies, stakeholder alignment, security governance, and continuous monitoring. When combined with AWS-native tools and a culture of cloud cost awareness, EC2 automation becomes a key pillar of operational efficiency.&lt;br&gt;
Ultimately, the success of any cost optimization initiative depends not just on the tools you use, but on how consistently and intelligently you apply them. With the right practices in place, automated EC2 scheduling can serve as a launchpad for broader FinOps maturity, transforming cloud infrastructure from a cost center into a strategic advantage.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>automation</category>
      <category>crontab</category>
    </item>
    <item>
      <title>Never get caught off guard: Receive your server's health every minute by Email</title>
      <dc:creator>Philippe borribo</dc:creator>
      <pubDate>Thu, 15 May 2025 08:17:16 +0000</pubDate>
      <link>https://dev.to/philippe_borribo/never-get-caught-off-guard-receive-your-servers-health-every-minute-by-email-412f</link>
      <guid>https://dev.to/philippe_borribo/never-get-caught-off-guard-receive-your-servers-health-every-minute-by-email-412f</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For system administrators, surprises are rarely a good thing. Whether it's a sudden spike in CPU usage, memory running dangerously low, or disk space nearing its limit, these issues can quickly escalate into major problems: from system slowdowns to full outages.&lt;br&gt;
The key to avoiding such disasters? Proactive monitoring.&lt;br&gt;
Imagine receiving a quick health check of your server every single minute: delivered straight to your inbox. No need to log in and manually check metrics. Instead, you stay one step ahead, spotting trouble before it affects users or services.&lt;br&gt;
In this article, you'll learn how to set up a simple but powerful script that runs automatically every minute. It collects key system metrics like CPU, memory, and disk usage, then emails the report to your system administrator. It's fast, lightweight, and incredibly effective - a perfect fit for anyone managing production environments or personal servers alike.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;1. Why monitor your server every minute?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When it comes to server health, timing is everything. Even a few minutes of downtime can have serious consequences, from lost revenue to broken user trust. While traditional monitoring solutions may check system status every 5 or 10 minutes, some environments require faster feedback and immediate awareness.&lt;br&gt;
&lt;strong&gt;a. Real-Time awareness&lt;/strong&gt;&lt;br&gt;
Monitoring your server every minute gives you near real-time visibility into its behavior. This is especially useful for detecting:&lt;br&gt;
Sudden CPU spikes due to rogue processes.&lt;br&gt;
Memory leaks that gradually degrade performance.&lt;br&gt;
Rapid disk consumption caused by logging errors, backups, or attacks.&lt;br&gt;
By checking these metrics every 60 seconds, you drastically reduce the time between a problem appearing and it being detected.&lt;br&gt;
&lt;strong&gt;b. Faster incident response&lt;/strong&gt;&lt;br&gt;
Early detection means faster reaction. If you receive an email showing high CPU usage or critical memory shortage, you can investigate and act before the issue escalates into service disruption. In production environments, this can save hours of downtime and avoid emergency interventions.&lt;br&gt;
&lt;strong&gt;c. Lightweight alternative to complex monitoring tools&lt;/strong&gt;&lt;br&gt;
While tools like Nagios, Zabbix, or Prometheus are powerful, they can be overkill for smaller setups or individual servers. A simple script combined with a cron job offers a minimalist approach to monitoring - no dashboards, no agents, no third-party services - just email notifications that work.&lt;br&gt;
&lt;strong&gt;d. Ideal for headless or remote servers&lt;/strong&gt;&lt;br&gt;
If you're managing cloud instances, VPS setups, or physical servers in remote locations, having frequent performance reports in your inbox provides peace of mind. You're always informed, even when you're away from the terminal.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. What metrics should be tracked?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To keep your server healthy and responsive, it's important to track a few key performance indicators. These metrics provide a snapshot of your system's current state and help you spot problems before they become critical.&lt;br&gt;
Here are the most essential metrics you should include in your monitoring script:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. CPU usage&lt;/strong&gt;&lt;br&gt;
CPU usage shows how much of your processor's capacity is being used at a given moment. High or constantly maxed-out CPU usage can indicate:&lt;br&gt;
Inefficient code or processes running in loops&lt;br&gt;
Malware or unauthorized scripts&lt;br&gt;
Overloaded services during peak traffic&lt;/p&gt;

&lt;p&gt;You can retrieve this metric using commands like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;top -bn1 | grep "Cpu(s)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or, if the sysstat package is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mpstat 1 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
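&lt;p&gt;Once you have that line, you still need to turn it into a single number. A common trick (a sketch; the field layout varies slightly between top versions) is to subtract the idle percentage from 100. The sample line below stands in for live top output:&lt;/p&gt;

```shell
# Derive CPU usage as 100 minus the idle percentage. The sample line
# stands in for: top -bn1 | grep "Cpu(s)"  (layout varies by top version).
SAMPLE="%Cpu(s):  7.4 us,  2.1 sy,  0.0 ni, 89.2 id,  0.8 wa,  0.0 hi,  0.5 si,  0.0 st"

# Split before "id", take the last comma-separated value (the idle figure),
# strip everything but digits and the dot, and subtract from 100.
echo "$SAMPLE" | awk -F'id' '{ n = split($1, a, ","); gsub(/[^0-9.]/, "", a[n]); printf "CPU usage: %.1f%%\n", 100 - a[n] }'
```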



&lt;p&gt;&lt;strong&gt;b. Memory usage&lt;/strong&gt;&lt;br&gt;
Memory consumption is another critical metric. If your system runs out of RAM, it will start using swap space, which is significantly slower and can cause serious performance degradation.&lt;br&gt;
Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Total memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Used memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Free memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Swap usage&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Command example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;free -m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
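&lt;p&gt;From this output you can derive a single used-memory percentage. The canned sample below stands in for live free -m output (recent procps column layout):&lt;/p&gt;

```shell
# Used-memory percentage from `free -m`: used / total * 100.
# The canned lines stand in for real output.
SAMPLE="              total        used        free      shared  buff/cache   available
Mem:           7976        3210         512         201        4254        4302
Swap:          2047           0        2047"

echo "$SAMPLE" | awk '/^Mem:/ { printf "Memory used: %.1f%%\n", $3 / $2 * 100 }'
```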



&lt;p&gt;&lt;strong&gt;c. Disk usage&lt;/strong&gt;&lt;br&gt;
Running out of disk space can crash applications, break databases, or prevent logging. Your script should report:&lt;br&gt;
Total and used space per mounted partition&lt;br&gt;
Alert thresholds (e.g., warn if a partition is over 90%)&lt;/p&gt;

&lt;p&gt;Use this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df -h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
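&lt;p&gt;To apply the 90% threshold mentioned above, a short awk filter over the df output works well. The sample lines are illustrative:&lt;/p&gt;

```shell
# Warn about any partition above 90% usage. The canned lines stand in
# for real `df -h` output.
SAMPLE="Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   46G  4.0G  92% /
/dev/sdb1       200G   80G  120G  40% /data"

# Skip the header, strip the % sign, compare numerically.
echo "$SAMPLE" | awk 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > 90) print "WARNING:", $6, "is at", $5 "%" }'
```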



&lt;p&gt;&lt;strong&gt;d. Optional: load average&lt;/strong&gt;&lt;br&gt;
The load average reflects how many processes are actively running or waiting for CPU time. It gives a broader view of overall system strain.&lt;br&gt;
Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uptime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat /proc/loadavg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;e. Optional: Uptime and reboot detection&lt;/strong&gt;&lt;br&gt;
Knowing how long your server has been running helps identify unexpected reboots or instability.&lt;br&gt;
Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uptime -p
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;3. Tools You'll Need&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Setting up an automated monitoring script that sends email reports every minute doesn't require complex infrastructure or expensive software. All you need are a few standard tools, most of which are already available on typical Linux distributions. Here's a breakdown of what you'll need and why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. A Linux server&lt;/strong&gt;&lt;br&gt;
This guide assumes you're running a Unix-based system like Ubuntu, Debian, CentOS, or Red Hat. Most cloud instances (AWS, Azure, DigitalOcean, etc.) run some flavor of Linux. The commands and tools used are native to these environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. A Scripting Language (Bash or Python)&lt;/strong&gt;&lt;br&gt;
You'll need a script that collects system metrics and formats them into an email message.&lt;br&gt;
Bash is ideal for simplicity and direct access to system commands.&lt;br&gt;
Python offers more flexibility, better formatting options, and error handling.&lt;br&gt;
Choose the one you're more comfortable with - both are excellent for the task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Crontab&lt;/strong&gt;&lt;br&gt;
cron is the time-based job scheduler built into Linux. It allows you to run scripts at fixed intervals, such as every minute.&lt;br&gt;
To edit the scheduled jobs, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;crontab -e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where you'll tell the system to execute your monitoring script regularly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d. Mail Utility (mailx, msmtp, or sendmail)&lt;/strong&gt;&lt;br&gt;
To send system emails, your server needs a mail transfer agent (MTA) or a mail client that can send messages from the command line.&lt;br&gt;
Some options include:&lt;br&gt;
mailx: Simple and commonly pre-installed. Often used with sendmail or postfix.&lt;br&gt;
msmtp: Lightweight and easy to configure with an external SMTP service (like Gmail).&lt;br&gt;
sendmail or postfix: Full-fledged MTAs but heavier to configure.&lt;/p&gt;

&lt;p&gt;For example, to install msmtp on Ubuntu:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install msmtp msmtp-mta
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a configuration file at ~/.msmtprc:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano ~/.msmtprc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And add your SMTP credentials (for Gmail, use an app password rather than your account password):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defaults
auth on
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
logfile ~/.msmtp.log
account gmail
host smtp.gmail.com
port 587
from your.email@gmail.com
user your.email@gmail.com
password your_app_password
account default : gmail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



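&lt;p&gt;msmtp refuses to read a configuration file that other users can access, so restrict its permissions, then verify the setup with a one-line test message (the recipient address below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chmod 600 ~/.msmtprc
echo "msmtp test message" | msmtp destination@email.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
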
&lt;p&gt;&lt;strong&gt;e. A Valid Email Address for Delivery&lt;/strong&gt;&lt;br&gt;
In your script, you'll specify the recipient email.&lt;br&gt;
Make sure:&lt;br&gt;
The email is actively monitored.&lt;br&gt;
SMTP settings are correct to ensure deliverability.&lt;br&gt;
You whitelist your server's address, if needed, to avoid spam filters.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Writing the Monitoring Script&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that you know what metrics to track and which tools to use, it's time to write the core of the solution: the monitoring script. This script will collect the server's CPU, memory, and disk usage data, format it into a readable message, and send it to the system administrator via email.&lt;br&gt;
We'll write a simple Python script, which is readable, easy to extend, and widely available across Linux distributions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Script Overview&lt;/strong&gt;&lt;br&gt;
The script will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Collect system performance metrics (CPU, RAM, Disk).&lt;/li&gt;
&lt;li&gt;Format the data into a plain-text report.&lt;/li&gt;
&lt;li&gt;Email the report to the system administrator.&lt;/li&gt;
&lt;li&gt;Optionally include a timestamp and hostname.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;b. Sample script (monitor.py)&lt;/strong&gt;&lt;br&gt;
The script relies on the psutil library; install it first with pip install psutil.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import psutil
import socket
import datetime
import subprocess

# Collect system metrics
def get_metrics():
    hostname = socket.gethostname()
    cpu = psutil.cpu_percent(interval=1)
    ram = psutil.virtual_memory()
    disk = psutil.disk_usage('/')
    now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    message = f"""\
Subject: [ALERT] Server Metrics - {hostname}
From: Monitoring &amp;lt;monitor@localhost&amp;gt;
To: your@email.com

Date       : {now}
Server     : {hostname}

CPU     : {cpu}% used
RAM     : {ram.percent}% used ({round(ram.used / (1024**3), 2)} GB / {round(ram.total / (1024**3), 2)} GB)
Disk    : {disk.percent}% used ({round(disk.used / (1024**3), 2)} GB / {round(disk.total / (1024**3), 2)} GB)
"""
    return message

# Send the email using msmtp
def send_email(body):
    process = subprocess.Popen(
        ['msmtp', 'destination@email.com'],
        stdin=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    stdout, stderr = process.communicate(input=body.encode())

    if process.returncode != 0:
        print("Failed to send email:", stderr.decode())
    else:
        print("Email sent successfully.")

# Main
if __name__ == "__main__":
    metrics = get_metrics()
    send_email(metrics)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;c. Make the Script Executable&lt;/strong&gt;&lt;br&gt;
Save the script as monitor.py, then make it executable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chmod +x monitor.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;d. Test the Script&lt;/strong&gt;&lt;br&gt;
Run it manually first to ensure it works as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python monitor.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zu9k091kmpx1lwk50o6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zu9k091kmpx1lwk50o6.png" alt="Image description" width="800" height="76"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check the recipient inbox (or spam folder if necessary) to confirm receipt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg119zztn2hhf6tq3jai2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg119zztn2hhf6tq3jai2.png" alt="Image description" width="800" height="656"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Automating the Script with Crontab&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To send metrics every minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;crontab -e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add the following line, using absolute paths, since cron runs with a minimal environment (replace /path/to/monitor.py with the script's actual location; which python3 shows the interpreter path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* * * * * /usr/bin/python3 /path/to/monitor.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



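
&lt;p&gt;Once saved, it is worth confirming that the job is registered and actually firing (the syslog path below applies to Debian/Ubuntu; other distributions may log cron activity elsewhere):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;crontab -l
grep CRON /var/log/syslog | tail -n 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
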
&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Automating the monitoring of server health is not just good practice; it's essential for maintaining system reliability, identifying performance issues early, and preventing downtime. With a simple Python script, a few core Linux tools, and a properly configured email system, you can build a lightweight but powerful alerting mechanism that keeps your system administrator informed every minute.&lt;br&gt;
This approach doesn't require expensive monitoring platforms or complex dashboards. It leverages native system utilities and the flexibility of cron to deliver real-time performance snapshots directly to the sysadmin's inbox. By tracking key metrics such as CPU usage, memory consumption, and disk space, you create a proactive culture of system care rather than waiting for something to break.&lt;br&gt;
In environments where uptime and responsiveness matter, such a script can be the first line of defense: a quiet guardian running in the background, ensuring your infrastructure stays healthy and your admin stays informed.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>python</category>
      <category>automation</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
