Thesius Code
Cloud Cost Optimization: Cutting Your AWS/Azure Bill by 40%

The average company wastes 32% of its cloud spend — that's not a guess, it's what Flexera's State of the Cloud report has found year after year. If your team hasn't done a formal cost review in the last quarter, you're almost certainly burning money on oversized instances, forgotten resources, and on-demand pricing you could replace with commitments today.

This guide covers practical, immediately actionable strategies for cutting your AWS and Azure bills. No theoretical frameworks — just the specific optimizations that save the most money, ranked by impact.

The Cloud Cost Pyramid

Start from the top — biggest savings first:

         ┌──────────────┐
         │  Commitment  │  ← 30-60% savings
         │  Discounts   │
         ├──────────────┤
         │ Right-Sizing │  ← 20-40% savings
         │              │
         ├──────────────┤
         │   Storage    │  ← 10-30% savings
         │ Optimization │
         ├──────────────┤
         │   Network    │  ← 5-15% savings
         │   Costs      │
         ├──────────────┤
         │  Scheduling  │  ← 5-20% savings
         │  & Cleanup   │
         └──────────────┘

1. Commitment Discounts (30-60% Savings)

This is the single highest-impact optimization. If you're running workloads 24/7 on on-demand pricing, you're leaving money on the table.

AWS Savings Plans

Savings Plan Types:
1. Compute Savings Plans (up to 66% off)
   - Applies to EC2, Fargate, Lambda
   - Flexible: any instance family, region, OS
   - Best for most organizations

2. EC2 Instance Savings Plans (up to 72% off)
   - Locked to instance family in a region
   - Slightly cheaper than Compute plans
   - Good when you know your instance types

3. SageMaker Savings Plans (up to 64% off)
   - For ML workloads

How to calculate your commitment:

# Simple Savings Plan calculator
import json


def calculate_savings_plan(
    monthly_on_demand: float,
    baseline_percentage: float = 0.60,
    savings_rate: float = 0.40,
    commitment_years: int = 1,
) -> dict:
    """Calculate optimal Savings Plan commitment.

    Args:
        monthly_on_demand: Current monthly on-demand spend
        baseline_percentage: % of spend that's consistent (0.0-1.0)
        savings_rate: Discount rate for the plan (0.30-0.66)
        commitment_years: 1 or 3 year commitment
    """
    # Your baseline = what you consistently use
    baseline_monthly = monthly_on_demand * baseline_percentage

    # Commitment amount (hourly)
    commitment_hourly = (baseline_monthly * (1 - savings_rate)) / 730

    # Annual savings
    annual_savings = baseline_monthly * savings_rate * 12

    # Total commitment
    total_commitment = commitment_hourly * 730 * 12 * commitment_years

    return {
        "current_monthly_spend": monthly_on_demand,
        "baseline_monthly": baseline_monthly,
        "commitment_hourly": round(commitment_hourly, 2),
        "monthly_savings": round(baseline_monthly * savings_rate, 2),
        "annual_savings": round(annual_savings, 2),
        "total_commitment": round(total_commitment, 2),
        "break_even_months": round(
            total_commitment / (baseline_monthly * savings_rate), 1
        ) if baseline_monthly * savings_rate > 0 else float("inf"),
    }


# Example: $50K/month on-demand EC2 spend
result = calculate_savings_plan(
    monthly_on_demand=50000,
    baseline_percentage=0.65,  # 65% is consistent baseline
    savings_rate=0.40,         # 40% discount with 1-year Compute plan
)
print(json.dumps(result, indent=2))
# Annual savings: ~$156,000

Azure Reserved Instances

Azure's equivalent — commit to a 1- or 3-year term for significant discounts.

Azure RI discounts:
- VMs: up to 72% (3-year)
- SQL Database: up to 55%
- Cosmos DB: up to 65%
- Azure Databricks: up to 49% (with pre-purchase)
- Storage: up to 38% (reserved capacity)

Rule of thumb: If a resource runs more than 50% of the time, a 1-year reservation saves money. If it runs more than 30% of the time, a 3-year reservation saves money.
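The rule of thumb follows from simple arithmetic: a reservation bills for every hour of the term at a discounted rate, while on-demand bills only for the hours you actually run. A minimal sketch of that break-even logic (the discount figures below are illustrative assumptions, not quoted prices):

```python
def reservation_break_even(discount: float) -> float:
    """Minimum utilization at which a reservation beats on-demand.

    A reservation costs (1 - discount) x the on-demand rate for every hour
    of the term; on-demand costs utilization x the rate. The reservation
    wins once utilization exceeds (1 - discount).
    """
    return 1.0 - discount


def cheaper_option(utilization: float, discount: float) -> str:
    """Pick the cheaper pricing model for a given utilization fraction."""
    return "reserve" if utilization > reservation_break_even(discount) else "on-demand"


# Illustrative discounts: ~45% for a 1-year term, ~65% for a 3-year term
print(cheaper_option(utilization=0.60, discount=0.45))  # reserve
print(cheaper_option(utilization=0.40, discount=0.45))  # on-demand
print(cheaper_option(utilization=0.40, discount=0.65))  # reserve
```

The deeper the discount, the lower the utilization needed to justify the commitment — which is exactly why 3-year terms pay off for resources running as little as a third of the time.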

2. Right-Sizing (20-40% Savings)

Most instances are oversized. Engineers pick a "safe" instance size during initial setup and never revisit it.

Finding Oversized Instances

import boto3
from datetime import datetime, timedelta


def find_oversized_instances(region: str = "eu-west-1") -> list[dict]:
    """Find EC2 instances with consistently low CPU utilization."""

    ec2 = boto3.client("ec2", region_name=region)
    cloudwatch = boto3.client("cloudwatch", region_name=region)

    instances = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )

    oversized = []

    for reservation in instances["Reservations"]:
        for instance in reservation["Instances"]:
            instance_id = instance["InstanceId"]
            instance_type = instance["InstanceType"]

            # Get average CPU over last 14 days
            response = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[
                    {"Name": "InstanceId", "Value": instance_id}
                ],
                StartTime=datetime.utcnow() - timedelta(days=14),
                EndTime=datetime.utcnow(),
                Period=86400,  # Daily averages
                Statistics=["Average", "Maximum"],
            )

            if response["Datapoints"]:
                avg_cpu = sum(
                    d["Average"] for d in response["Datapoints"]
                ) / len(response["Datapoints"])
                max_cpu = max(
                    d["Maximum"] for d in response["Datapoints"]
                )

                if avg_cpu < 20 and max_cpu < 50:
                    oversized.append({
                        "instance_id": instance_id,
                        "instance_type": instance_type,
                        "avg_cpu": round(avg_cpu, 1),
                        "max_cpu": round(max_cpu, 1),
                        "recommendation": "Downsize by 1-2 sizes",
                    })

    return oversized


# Run the analysis
results = find_oversized_instances()
for r in results:
    print(
        f"{r['instance_id']} ({r['instance_type']}): "
        f"avg CPU {r['avg_cpu']}%, max {r['max_cpu']}% "
        f"-> {r['recommendation']}"
    )

Right-Sizing Decision Matrix

| Avg CPU | Max CPU | Memory Usage | Recommendation |
|---------|---------|--------------|----------------|
| < 10%   | < 30%   | < 30%        | Downsize by 2 sizes or consider serverless |
| 10-30%  | < 50%   | < 50%        | Downsize by 1 size |
| 30-60%  | < 80%   | < 70%        | Current size OK |
| > 60%   | > 80%   | > 70%        | Consider upsizing |

Instance Family Selection

Don't just resize — pick the right family:

Common mistake: Using m5.xlarge for a CPU-bound workload
Better choice: c5.large (compute-optimized, half the cost)

Common mistake: Using m5.2xlarge for an in-memory cache
Better choice: r5.xlarge (memory-optimized — same RAM with fewer vCPUs, roughly a third cheaper)

AWS instance families:
- t3/t4g: Burstable, web servers, dev environments
- m6i/m7g: General purpose, balanced workloads
- c6i/c7g: CPU-intensive (data processing, batch)
- r6i/r7g: Memory-intensive (caches, databases)
- g5: GPU (ML training/inference)
- i3/i4i: Storage-optimized (databases)

Graviton (ARM) instances: 20% cheaper, often better performance
Switch t3 → t4g, m5 → m7g, c5 → c7g for instant savings

3. Storage Optimization (10-30% Savings)

Storage costs creep up silently. Nobody notices until the bill is $20K/month.

S3 Lifecycle Policies

{
  "Rules": [
    {
      "ID": "MoveToIA",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555
      }
    },
    {
      "ID": "CleanupIncomplete",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    },
    {
      "ID": "DeleteOldVersions",
      "Status": "Enabled",
      "Filter": {},
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 30,
          "StorageClass": "STANDARD_IA"
        }
      ],
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 90
      }
    }
  ]
}

S3 Cost Comparison

| Storage Class       | Cost per GB/month    | Retrieval Cost   | Use Case                |
|---------------------|----------------------|------------------|-------------------------|
| Standard            | $0.023               | None             | Active data             |
| Intelligent-Tiering | $0.023 + monitoring  | None             | Unknown access patterns |
| Standard-IA         | $0.0125              | $0.01/GB         | Monthly access          |
| Glacier Instant     | $0.004               | $0.03/GB         | Quarterly access        |
| Glacier Flexible    | $0.0036              | $0.01/GB + time  | Annual access           |
| Deep Archive        | $0.00099             | $0.02/GB + 12hrs | Compliance archives     |

Quick win: Enable S3 Intelligent-Tiering on buckets with unknown access patterns. It automatically moves data between tiers and typically saves 30-40%.
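Picking a tier comes down to comparing storage cost plus expected retrieval cost. A rough sketch using the prices from the table above — it deliberately ignores request fees and minimum-storage-duration charges, so treat it as a first approximation:

```python
# (storage $/GB-month, retrieval $/GB) -- prices from the table above
STORAGE_CLASSES = {
    "STANDARD": (0.023, 0.00),
    "STANDARD_IA": (0.0125, 0.01),
    "GLACIER_IR": (0.004, 0.03),
    "DEEP_ARCHIVE": (0.00099, 0.02),
}


def monthly_cost(storage_gb: float, retrieved_gb: float, storage_class: str) -> float:
    """Estimated monthly cost: storage plus retrieval, nothing else."""
    per_gb, retrieval = STORAGE_CLASSES[storage_class]
    return storage_gb * per_gb + retrieved_gb * retrieval


# 10 TB of logs, 100 GB read back per month
for cls in STORAGE_CLASSES:
    print(f"{cls:>12}: ${monthly_cost(10_000, 100, cls):,.2f}/month")
```

For that access pattern, the archive tiers win by an order of magnitude — which is exactly what the lifecycle policy above automates.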

EBS Volume Cleanup

import boto3


def find_unused_ebs_volumes(region: str = "eu-west-1") -> list[dict]:
    """Find unattached EBS volumes costing you money."""
    ec2 = boto3.client("ec2", region_name=region)

    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )

    unused = []
    total_cost = 0

    for vol in volumes["Volumes"]:
        size_gb = vol["Size"]
        vol_type = vol["VolumeType"]

        # Approximate monthly cost
        cost_per_gb = {
            "gp2": 0.10, "gp3": 0.08, "io1": 0.125,
            "io2": 0.125, "st1": 0.045, "sc1": 0.015,
        }
        monthly_cost = size_gb * cost_per_gb.get(vol_type, 0.10)
        total_cost += monthly_cost

        unused.append({
            "volume_id": vol["VolumeId"],
            "size_gb": size_gb,
            "type": vol_type,
            "monthly_cost": round(monthly_cost, 2),
            "created": str(vol["CreateTime"]),
        })

    print(f"Found {len(unused)} unused volumes")
    print(f"Total wasted: ${total_cost:.2f}/month")
    return unused

4. Network Cost Reduction (5-15% Savings)

Data transfer is the hidden cloud tax. Cross-AZ, cross-region, and internet egress add up fast.

Key Network Cost Rules

AWS Data Transfer Costs:
- Same AZ, same VPC: FREE
- Cross-AZ (within region): $0.01/GB each way
- Cross-region: $0.02/GB
- Internet egress: $0.09/GB (first 10TB)
- CloudFront egress: $0.085/GB (cheaper than direct)

Cost reduction strategies:
1. Use VPC endpoints for AWS services (S3, DynamoDB)
   - Eliminates NAT Gateway charges ($0.045/GB)
   - Free for Gateway endpoints (S3, DynamoDB)

2. Keep traffic in the same AZ when possible
   - Use AZ-aware routing in ALB
   - Configure services to prefer same-AZ replicas

3. Use CloudFront for egress
   - Cheaper than direct internet egress
   - Also reduces latency

4. Compress data in transit
   - Enable gzip/brotli on ALB
   - Compress S3 objects before transfer
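With the rates above, egress planning is plain arithmetic. A sketch comparing routes for a given monthly volume — rates copied from the list, first pricing tier only, so real bills diverge past 10 TB:

```python
# $/GB rates from the list above (first tier only)
TRANSFER_RATES = {
    "same_az": 0.00,
    "cross_az": 0.02,       # $0.01/GB each way
    "cross_region": 0.02,
    "internet": 0.09,
    "cloudfront": 0.085,
    "nat_gateway": 0.045,   # NAT processing charge, on top of transfer
}


def transfer_cost(gb_per_month: float, route: str) -> float:
    """Estimated monthly data transfer cost for one route."""
    return gb_per_month * TRANSFER_RATES[route]


# 1 TB/month to S3: NAT Gateway vs a free Gateway VPC endpoint
nat = transfer_cost(1_000, "nat_gateway")
endpoint = transfer_cost(1_000, "same_az")  # Gateway endpoints are free
print(f"NAT Gateway: ${nat:.2f}/month, VPC endpoint: ${endpoint:.2f}/month")
```

That $45/month delta per TB is the case for the Terraform endpoint below.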

VPC Endpoint Cost Savings

# Terraform: S3 Gateway Endpoint (FREE)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"

  route_table_ids = aws_route_table.private[*].id
}

# This eliminates NAT Gateway charges for S3 traffic
# If you transfer 1TB/month to S3:
#   Without endpoint: 1000 GB × $0.045 = $45/month (NAT)
#   With endpoint: $0/month

5. Scheduling and Cleanup (5-20% Savings)

Non-production environments don't need to run 24/7.

Auto-Shutdown for Dev/Staging

import boto3
from datetime import datetime


def manage_dev_instances(action: str = "stop"):
    """Stop dev instances outside business hours."""
    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Find instances tagged as dev/staging
    instances = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev", "staging"]},
            {"Name": "tag:AutoShutdown", "Values": ["true"]},
            {
                "Name": "instance-state-name",
                "Values": ["running" if action == "stop" else "stopped"],
            },
        ]
    )

    instance_ids = [
        i["InstanceId"]
        for r in instances["Reservations"]
        for i in r["Instances"]
    ]

    if not instance_ids:
        print(f"No instances to {action}")
        return

    if action == "stop":
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped {len(instance_ids)} instances")
    elif action == "start":
        ec2.start_instances(InstanceIds=instance_ids)
        print(f"Started {len(instance_ids)} instances")


# Schedule with EventBridge:
# Stop at 7 PM: manage_dev_instances("stop")
# Start at 8 AM: manage_dev_instances("start")
# = 13 hours off per weekday + full weekends
# = 113 of 168 weekly hours, a ~67% cut in dev instance costs
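The savings figure in that comment is pure schedule arithmetic — a quick check, with function and parameter names my own:

```python
def scheduled_savings(off_hours_per_weekday: float, weekend_off: bool = True) -> float:
    """Fraction of weekly instance-hours saved by an on/off schedule."""
    hours_off = 5 * off_hours_per_weekday + (48 if weekend_off else 0)
    return hours_off / (7 * 24)  # 168 hours in a week


# Stop 7 PM - 8 AM on weekdays (13 h/night) plus full weekends
print(f"{scheduled_savings(13):.0%} of dev instance-hours eliminated")  # 67%
```

Note this only cuts compute charges — attached EBS volumes keep billing while an instance is stopped.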

Resource Cleanup Automation

import boto3
from datetime import datetime, timedelta


def cleanup_old_resources(dry_run: bool = True) -> dict:
    """Find and optionally delete old/unused resources."""
    ec2 = boto3.client("ec2", region_name="eu-west-1")
    savings = {"monthly_savings": 0, "resources": []}

    # 1. Old snapshots (> 90 days, not referenced by any of our AMIs)
    ami_snapshot_ids = {
        bdm["Ebs"]["SnapshotId"]
        for image in ec2.describe_images(Owners=["self"])["Images"]
        for bdm in image.get("BlockDeviceMappings", [])
        if "SnapshotId" in bdm.get("Ebs", {})
    }
    snapshots = ec2.describe_snapshots(OwnerIds=["self"])
    cutoff = datetime.utcnow() - timedelta(days=90)

    for snap in snapshots["Snapshots"]:
        if snap["SnapshotId"] in ami_snapshot_ids:
            continue  # Skip snapshots backing an AMI
        if snap["StartTime"].replace(tzinfo=None) < cutoff:
            size_gb = snap["VolumeSize"]
            cost = size_gb * 0.05  # $0.05/GB/month for snapshots
            savings["monthly_savings"] += cost
            savings["resources"].append({
                "type": "snapshot",
                "id": snap["SnapshotId"],
                "size_gb": size_gb,
                "monthly_cost": round(cost, 2),
                "age_days": (
                    datetime.utcnow() - snap["StartTime"].replace(tzinfo=None)
                ).days,
            })

            if not dry_run:
                ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])

    # 2. Unattached Elastic IPs ($3.60/month each if not attached)
    addresses = ec2.describe_addresses()
    for addr in addresses["Addresses"]:
        if "AssociationId" not in addr:
            savings["monthly_savings"] += 3.60
            savings["resources"].append({
                "type": "elastic_ip",
                "id": addr["AllocationId"],
                "monthly_cost": 3.60,
            })

            if not dry_run:
                ec2.release_address(AllocationId=addr["AllocationId"])

    print(f"Potential monthly savings: ${savings['monthly_savings']:.2f}")
    print(f"Resources to clean: {len(savings['resources'])}")
    return savings

Monthly Cost Review Checklist

Run this checklist on the first of every month:

| Check                  | Tool                  | Target                     |
|------------------------|-----------------------|----------------------------|
| Unused instances       | AWS Compute Optimizer | Downsize or terminate      |
| Unattached EBS volumes | Cost Explorer         | Delete or snapshot         |
| Old snapshots          | Custom script         | Delete if > 90 days        |
| Unattached Elastic IPs | Console/script        | Release                    |
| S3 access patterns     | S3 Analytics          | Apply lifecycle policies   |
| Reserved coverage      | Savings Plans report  | Cover 60-70% baseline      |
| NAT Gateway traffic    | VPC Flow Logs         | Replace with VPC endpoints |
| Cross-AZ data transfer | Cost Explorer         | Optimize routing           |

Summary

Cloud cost optimization isn't a one-time project — it's an ongoing practice:

| Strategy             | Typical Savings | Effort | Time to Impact |
|----------------------|-----------------|--------|----------------|
| Savings Plans/RIs    | 30-60%          | Low    | Immediate      |
| Right-sizing         | 20-40%          | Medium | 1-2 weeks      |
| Storage tiering      | 10-30%          | Low    | Days           |
| Network optimization | 5-15%           | Medium | 1-2 weeks      |
| Scheduling/cleanup   | 5-20%           | Low    | Immediate      |

Start with commitment discounts and right-sizing — that's where 70% of the savings come from.


If you want ready-made scripts, dashboards, and automation templates for cloud cost optimization and data infrastructure, check out DataStack Pro.
