Aman Singh

Posted on May 26

Identifying Idle and Underutilized AWS Resources: Signals, Metrics, and Patterns

#ai #finops #architecture #cloudskills

Idle resource detection is still one of the highest-ROI levers in cloud cost optimization. Even mature FinOps orgs see silent waste accumulate because resources drift from their original purpose, ownership fragments, and migrations leave infrastructure behind.

The problem isn't awareness. It's complexity. Modern AWS environments span dozens of accounts, multiple regions, and distributed teams. Idle instances, unused storage, stale load balancers, and abandoned network components don't break anything, so they go unnoticed until the budget damage compounds.

This post covers the exact signals and metrics to detect idle and underutilized resources by service type, plus how to operationalize detection at scale.

What Counts as Idle or Underutilized?

Idle means no meaningful compute, network, or storage activity over a defined period: a Lambda with zero invocations, an EC2 instance with negligible CPU and network traffic, an unattached EBS volume still generating charges.

Underutilized means the workload exists, but allocated capacity far exceeds actual consumption: EC2 running at 10% CPU for weeks, overprovisioned RDS with minimal connections, GPUs deployed for sporadic batch jobs.

Underutilization usually comes from oversizing, conservative defaults, or bad assumptions about peak demand. Idle resources come from operational anti-patterns: forgotten experiments, incomplete decommissioning, or zombie infrastructure left behind after deployments.

Detection Signals by Service

1. EC2 Instances (Low CPU, Memory, Network)

An EC2 instance is idle when CPU stays under ~10–15%, network throughput is negligible, and disk operations approach zero over 14–30 days.

Detection tools:

CloudWatch metrics: CPUUtilization, NetworkIn/Out, VolumeReadOps/WriteOps
AWS Compute Optimizer rightsizing insights
AWS CLI: describe-instances, filtered by tags, to enumerate candidates (utilization data itself comes from CloudWatch)

These show up most often in dev/test environments, lift-and-shift migrations, and orphaned nodes from Auto Scaling Groups.
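
As a minimal sketch of the EC2 check above, the following classifies an instance from daily datapoints. The function names and the 10% CPU and 5 MB/day network thresholds are illustrative assumptions, not AWS defaults; `fetch_daily_cpu` uses the real CloudWatch `get_metric_statistics` call and needs boto3 plus credentials.

```python
from statistics import mean


def is_idle_ec2(cpu_avg_pct, net_bytes, cpu_threshold=10.0, net_threshold_bytes=5_000_000):
    """Classify an instance from daily datapoints.

    cpu_avg_pct: daily average CPUUtilization values (percent)
    net_bytes:   daily NetworkIn + NetworkOut sums (bytes)
    Thresholds are assumptions for illustration; tune them per workload.
    """
    if not cpu_avg_pct or not net_bytes:
        return False  # missing data is not proof of idleness
    return max(cpu_avg_pct) < cpu_threshold and mean(net_bytes) < net_threshold_bytes


def fetch_daily_cpu(instance_id, days=14, region="us-east-1"):
    """Pull daily average CPUUtilization from CloudWatch, oldest first."""
    import boto3  # imported lazily so the classifier works without boto3
    from datetime import datetime, timedelta, timezone

    cw = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return [p["Average"] for p in points]
```

Using the maximum daily average rather than the overall mean keeps an instance with a single busy day out of the idle list.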

2. Stopped EC2 With Billing Attachments

A stopped EC2 looks harmless but retains chargeable components:

Attached EBS volumes (especially gp3 with provisioned throughput/IOPS)
Elastic IPs not associated with a running instance
Orphaned snapshots
Marketplace AMI subscription fees (for subscription-priced AMIs; hourly-priced AMIs only bill when running)

Use describe-volumes, describe-addresses, and describe-snapshots to enumerate all dependent artifacts for instances stopped longer than 7 days.
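
One way to find the long-stopped instances is sketched below. It assumes input shaped like the `Instances` entries returned by `describe-instances`, and it relies on the stop timestamp that EC2 typically embeds in `StateTransitionReason` (for example `User initiated (2024-06-01 08:00:00 GMT)`). The helper names are hypothetical, and instances whose reason string lacks a timestamp are skipped rather than guessed at.

```python
import re
from datetime import datetime, timedelta, timezone

_STOP_RE = re.compile(r"\((\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) GMT\)")


def stopped_since(instance):
    """Return the stop time parsed from StateTransitionReason, or None."""
    if instance.get("State", {}).get("Name") != "stopped":
        return None
    match = _STOP_RE.search(instance.get("StateTransitionReason", ""))
    if not match:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def long_stopped_volumes(instances, now, min_days=7):
    """Map instance ID -> attached EBS volume IDs for instances stopped > min_days."""
    result = {}
    for inst in instances:
        since = stopped_since(inst)
        if since and now - since > timedelta(days=min_days):
            result[inst["InstanceId"]] = [
                bdm["Ebs"]["VolumeId"]
                for bdm in inst.get("BlockDeviceMappings", [])
                if "Ebs" in bdm
            ]
    return result
```

The resulting volume IDs can then be passed to describe-volumes and describe-snapshots to size the storage still being billed.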

3. Idle Load Balancers (ALB/NLB/CLB)

ALBs, NLBs, and Classic LBs incur hourly charges even with zero traffic. An idle LB shows zero ActiveConnectionCount, no healthy registered targets, and no request volume.

Check CloudWatch metrics (ActiveConnectionCount, RequestCount, HealthyHostCount) and use describe-load-balancers + describe-target-health to find LBs with no registered targets. These accumulate heavily in Kubernetes environments where service deletions don't automatically clean up associated AWS load balancers.
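
The load balancer check can be sketched as a small predicate. It assumes you have already collected, per target group, the `TargetHealthDescriptions` list that `describe-target-health` returns, plus a summed RequestCount for the review window; the function name and input layout are illustrative.

```python
def lb_is_idle(target_health_by_group, request_count_sum):
    """Return True when a load balancer has no healthy targets and no requests.

    target_health_by_group: {target_group_arn: [TargetHealthDescription, ...]}
    request_count_sum:      total RequestCount over the review window
    """
    healthy = sum(
        1
        for descriptions in target_health_by_group.values()
        for d in descriptions
        if d["TargetHealth"]["State"] == "healthy"
    )
    return healthy == 0 and request_count_sum == 0
```

A load balancer with no target groups at all yields an empty dict and is treated as idle when it has also served no requests.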

4. Lambda Functions With Zero Invocations

Inactive Lambda functions still generate indirect costs through CloudWatch log groups, retained log storage, and any provisioned concurrency. A function is idle when it shows zero or near-zero Invocations over 30–90 days.

Use the Lambda console Monitor tab or list-functions + get-metric-statistics via CLI. Always verify event sources (API Gateway, S3 notifications, EventBridge rules) before deleting; some functions handle infrequent but critical workflows.
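
A rough sketch of the event-source check follows. Push-based triggers such as S3, API Gateway, and EventBridge appear as service principals in the function's resource policy (the JSON string returned by Lambda's `get_policy`), while poll-based sources appear in `list_event_source_mappings`. The function names here are hypothetical and the logic is deliberately conservative.

```python
import json


def invoking_services(policy_json):
    """Return the AWS service principals allowed to invoke a function."""
    services = set()
    for stmt in json.loads(policy_json).get("Statement", []):
        principal = stmt.get("Principal", {})
        if isinstance(principal, dict):
            svc = principal.get("Service", [])
            services.update([svc] if isinstance(svc, str) else svc)
    return sorted(services)


def is_deletion_candidate(total_invocations, policy_json, event_source_mappings):
    """Candidate only if nothing invoked it and no trigger is still wired up."""
    return (
        total_invocations == 0
        and not invoking_services(policy_json)
        and not event_source_mappings
    )
```

A function with zero invocations but a live S3 or EventBridge permission is left for human review instead of being flagged.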

5. Idle NAT Gateways

NAT Gateways are one of the most common sources of silent waste: steady hourly charges plus data processing fees, regardless of active traffic.

Look for near-zero BytesOutToDestination, BytesInFromSource, and ActiveConnectionCount in CloudWatch, then validate which route tables still point to the gateway. Many idle NAT Gateways persist after VPC refactoring or application migrations where engineers update routes but don't remove the original gateway.
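
The route-table validation can be sketched as follows, assuming inputs shaped like the `NatGateways` and `RouteTables` lists from `describe-nat-gateways` and `describe-route-tables`. The function names are illustrative; a gateway that no active route references is a strong cleanup candidate, though traffic metrics should still be checked.

```python
def nat_gateways_in_routes(route_tables):
    """Collect NAT gateway IDs referenced by active routes."""
    referenced = set()
    for table in route_tables:
        for route in table.get("Routes", []):
            if route.get("NatGatewayId") and route.get("State") == "active":
                referenced.add(route["NatGatewayId"])
    return referenced


def unreferenced_nat_gateways(nat_gateways, route_tables):
    """Return IDs of available NAT gateways that no active route points to."""
    referenced = nat_gateways_in_routes(route_tables)
    return [
        gw["NatGatewayId"]
        for gw in nat_gateways
        if gw.get("State") == "available" and gw["NatGatewayId"] not in referenced
    ]
```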

6. EBS Volumes and Snapshots

A volume is idle when it's unattached for 7+ days, shows no read/write operations, or was provisioned with gp3 throughput/IOPS settings that exceed actual workload requirements. Snapshots accumulate silently when automated backup processes retain long chains without pruning.

Use describe-volumes and describe-snapshots to surface unattached volumes, low-activity volumes, and snapshot sprawl.
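
A minimal sketch of the unattached-volume check, assuming input shaped like the `Volumes` list from `describe-volumes`. Note the assumption baked in: the API does not report when a volume was detached, so this measures age from `CreateTime`, which can overstate how long a volume has been idle. CloudTrail DetachVolume events would give a more precise answer.

```python
from datetime import datetime, timedelta, timezone


def stale_unattached_volumes(volumes, now=None, min_age_days=7):
    """Return (volume_ids, total_gib) for 'available' volumes older than min_age_days.

    Age is measured from CreateTime because detach time is not exposed.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    stale = [
        v
        for v in volumes
        if v["State"] == "available"
        and not v.get("Attachments")
        and v["CreateTime"] <= cutoff
    ]
    return [v["VolumeId"] for v in stale], sum(v["Size"] for v in stale)
```

Returning the total size alongside the IDs makes it easier to rank findings by likely savings.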

7. Elastic Network Interfaces (ENIs)

Unattached ENIs rarely incur significant direct cost, but they create operational overhead, clutter VPCs, and can complicate subnet scaling and security group management.

Filter ENIs in the available state via describe-network-interfaces, then confirm zero traffic using VPC Flow Logs queried through CloudWatch Logs Insights or Athena.

8. S3 Buckets and Storage Classes

Idle S3 spend shows up as cold data sitting in expensive storage classes: buckets with little or no read activity over 30–90 days, objects that never transitioned to lower-cost tiers, and accumulated logs or exports no longer referenced by any application.

Use S3 Storage Lens for high-level activity trends, and query S3 server access logs through Athena to identify cold prefixes. Apply Lifecycle Policies to transition objects to Standard-IA, Glacier, or Deep Archive, or delete obsolete buckets entirely.
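
For illustration, a lifecycle configuration for a cold prefix might be built like this. The helper name and the 30/90/180-day schedule are assumptions, not recommendations; the returned dict follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`.

```python
def build_lifecycle_config(prefix, ia_days=30, glacier_days=90,
                           deep_archive_days=180, expire_days=None):
    """Build a lifecycle configuration that tiers a prefix down over time."""
    rule = {
        "ID": f"tier-down-{prefix.strip('/') or 'all'}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
            {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }
    if expire_days is not None:
        rule["Expiration"] = {"Days": expire_days}  # delete objects after this age
    return {"Rules": [rule]}
```

Applying it would look like `s3.put_bucket_lifecycle_configuration(Bucket=name, LifecycleConfiguration=build_lifecycle_config("logs/"))`. That call replaces the bucket's existing lifecycle rules, so merge with any current configuration first.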

If you're weighing commitment strategies for the resources you do use, we covered the tradeoffs in detail here: AWS Savings Plans vs Reserved Instances: A Practical Guide to Buying Commitments

9. RDS, Aurora, ElastiCache, Redshift

Database and caching services are frequently overprovisioned because teams size for peak conditions that rarely materialize. An instance is underutilized when it consistently shows CPU below ~5–10%, minimal active connections, and low read/write throughput.

Watch for replica drift: read replicas or cluster nodes persisting after the workload that justified them has scaled down or changed architecture. Use describe-db-instances, describe-cache-clusters, and describe-clusters to flag sustained low activity.

10. Kubernetes Node Underutilization

EKS clusters carry hidden waste when node capacity exceeds actual pod demand. Look for low CPU/memory consumption, low pod density, or workloads requesting far more resources than they consume (conservative requests preventing efficient pod packing).

Analyze resource requests vs. actual usage through Prometheus, CloudWatch Container Insights, or the Kubernetes Metrics API.
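
The request-versus-usage comparison can be sketched as below. It assumes you have already collected, per pod, the CPU request and observed usage as Kubernetes quantity strings (for example `500m`, `2`, or the nanocore form `1500000n` that the Metrics API often reports). The function names and the 4x ratio are illustrative assumptions.

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity ('250m', '2', '1500000n') to cores."""
    suffixes = {"n": 1e-9, "u": 1e-6, "m": 1e-3}
    if quantity and quantity[-1] in suffixes:
        return float(quantity[:-1]) * suffixes[quantity[-1]]
    return float(quantity)


def overrequested_pods(pods, min_ratio=4.0):
    """Return names of pods whose CPU request is at least min_ratio x usage.

    pods: [{"name": str, "request": str, "usage": str}, ...]
    """
    flagged = []
    for pod in pods:
        requested = parse_cpu(pod["request"])
        used = parse_cpu(pod["usage"])
        if requested > 0 and (used == 0 or requested / used >= min_ratio):
            flagged.append(pod["name"])
    return flagged
```

A single usage sample is noisy; in practice the usage figure would be a high percentile over days or weeks rather than a point-in-time reading.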

Detecting Idle Resources at Scale

Single-account detection is straightforward. Multi-account detection is not.

Triangulated Detection

A single CloudWatch metric rarely classifies a resource as idle with confidence. Combine:

CloudWatch Metrics: consumption signals (CPU, network, ops, connections)
CloudTrail Events: API activity revealing creation, modification, or operational intent
VPC Flow Logs: confirms whether a resource still participates in active network paths

Metrics show consumption. Events and logs reveal intent.
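
One possible way to combine the three sources is a simple decision rule. The `Signals` fields and the three verdict labels below are hypothetical names chosen for this sketch; how each boolean is derived (thresholds, lookback window) is left to the detection code for each service.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    low_metrics: bool          # CloudWatch: consumption under threshold
    recent_api_activity: bool  # CloudTrail: events touching the resource in the window
    network_flows: bool        # VPC Flow Logs: flows to or from the resource


def classify(signals):
    """Return 'active', 'review', or 'idle' from the combined signals."""
    if not signals.low_metrics:
        return "active"
    if signals.recent_api_activity or signals.network_flows:
        return "review"  # quiet metrics, but something still touches it
    return "idle"
```

Only resources where all three sources agree land in the idle bucket; anything with conflicting evidence goes to a person.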

Multi-Account Strategy

Idle detection at scale requires centralizing telemetry through AWS Organizations, AWS Cost Explorer, or custom ingestion pipelines aggregating CloudWatch and CloudTrail activity. Mandatory tagging policies (via AWS Config or SCPs) are critical: without consistent owner, team, environment, and lifecycle tags, cleanup accountability breaks down.

CLI and Automation Scripts

Automated scripts using AWS CLI or boto3 can scan entire regions or accounts in seconds:

Query EC2 instances filtered by low CPU over the last 14–30 days
List unattached EBS volumes and idle snapshots across all regions
Enumerate Lambda functions with zero invocations over a defined window
Check NAT Gateways with minimal BytesOutToDestination and BytesInFromSource

These can run on a schedule via Lambda or Step Functions and push findings to Slack, Jira, or a FinOps dashboard.
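
A scheduled scanner along these lines could be structured as a loop over regions with a pluggable per-region check. This is a sketch under assumptions: `scan_regions`, `format_summary`, and the finding layout (`type`, `id`) are hypothetical, and the `scanner` callable stands in for whatever boto3 queries you run per region.

```python
def scan_regions(regions, scanner):
    """Run scanner(region) for each region and collect findings and errors.

    scanner returns a list of finding dicts; each gets a 'region' key added.
    """
    findings, errors = [], {}
    for region in regions:
        try:
            for finding in scanner(region):
                findings.append({**finding, "region": region})
        except Exception as exc:  # one failing region should not abort the scan
            errors[region] = str(exc)
    return findings, errors


def format_summary(findings):
    """Render a short plain-text summary suitable for a chat or ticket body."""
    counts = {}
    for finding in findings:
        counts[finding["type"]] = counts.get(finding["type"], 0) + 1
    lines = [f"{kind}: {count}" for kind, count in sorted(counts.items())]
    return "Idle resource findings\n" + "\n".join(lines)
```

Recording per-region errors separately matters in multi-account setups, where a disabled region or a missing permission would otherwise look like a clean result.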

Policy-Based Continuous Detection

Ad-hoc audits don't prevent waste from returning. AWS Config rules can:

Flag EC2 instances with no network activity for 7 days
Detect EBS volumes in available state
Ensure S3 buckets have lifecycle policies
Validate required cost-allocation tags

Config rules can auto-remediate (snapshot and delete an unattached volume) or trigger alerts for human review. Combined with guardrails in AWS Organizations, this shifts idle detection from reactive cleanup to proactive prevention.

Choosing between 1-year and 3-year commitments affects how aggressively you can rightsize after cleanup. We broke down the decision here: How to Choose Between 1-Year and 3-Year AWS Commitments

Keeping Costs Low After Cleanup

Removing idle resources solves one half of the problem. The other half is ensuring you're not overpaying for the resources you keep.

Usage.ai handles this with Flex Commitments, a dynamic purchasing engine that continuously adapts to real usage patterns across AWS, Azure, and GCP. Once your team remediates idle infrastructure, Usage.ai adjusts commitments so your new usage baseline is always covered at the lowest effective rate. It also issues rebates on unused commitment portions, which native cloud commitments don't do.

The loop: identify and remove idle resources → Usage.ai re-optimizes commitments against the new baseline → as workloads evolve, it adjusts again automatically.

What's the biggest source of idle waste you've found in your AWS environment, and what finally surfaced it?

Continue reading the full technical analysis here → How to Identify Idle & Underutilized AWS Resources: A Comprehensive Technical Guide for 2026