Cloud environments naturally drift. Costs creep up. Security posture degrades. Manual audits do not scale, and periodic reviews miss issues that emerge between checks.
I needed an auditing tool that could:
- Identify cost waste — idle EC2 instances, unattached EBS volumes, orphaned snapshots
- Detect security misconfigurations — public S3 buckets, overly permissive security groups, weak IAM hygiene
- Map findings to a known framework — CIS AWS Foundations Benchmark
- Operate safely — strictly read-only, no automated deletion or remediation
This article walks through the key design decisions, trade-offs, and lessons learned from building it.
Core Design Principles
1. Read-Only by Design
Decision: No auto-remediation. No destructive permissions.
Automatically deleting or modifying cloud resources is risky, especially in production. An instance that appears idle may be a disaster-recovery standby, a scheduled batch worker, or part of a failover strategy.
The tool’s role is to surface risk and waste, not to make irreversible decisions.
IAM policy scope:
```json
{
  "Effect": "Allow",
  "Action": [
    "ec2:Describe*",
    "s3:Get*",
    "s3:List*",
    "iam:List*",
    "iam:Get*",
    "cloudwatch:GetMetricStatistics"
  ],
  "Resource": "*"
}
```
No Delete, Terminate, or Modify permissions. The blast radius is limited to discovery only.
This constraint shaped the entire architecture and made the tool safe to run against live accounts.
2. Defining “Idle” Using CloudWatch Metrics
Problem: “Idle” is ambiguous in cloud systems.
CPU utilization is an imperfect signal, but it is widely available and easy to reason about. I defined idle EC2 instances as those with:
- Average CPU utilization < 5%
- Observed over a 7-day window
Implementation:
```python
from datetime import datetime, timedelta

def get_cpu_utilization(self, instance_id: str, days: int = 7) -> float:
    """Average CPU utilization over the window, or -1.0 if no datapoints."""
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(days=days)
    response = self.cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=86400,  # one datapoint per day
        Statistics=['Average'],
    )
    if response['Datapoints']:
        return sum(dp['Average'] for dp in response['Datapoints']) / len(response['Datapoints'])
    return -1.0  # sentinel: CloudWatch reported no metrics for this instance
```
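One subtlety worth handling at the call site is the -1.0 sentinel, which means "no datapoints" rather than "idle". A minimal sketch of the classification step (the `is_idle` helper and threshold constant are illustrative names, not from the repository):

```python
IDLE_CPU_THRESHOLD = 5.0  # percent, matching the 7-day definition above

def is_idle(avg_cpu: float, threshold: float = IDLE_CPU_THRESHOLD) -> bool:
    """Classify an instance as idle from its average CPU utilization.

    A negative value means CloudWatch returned no datapoints (the sentinel
    used by get_cpu_utilization), so the instance is NOT flagged — missing
    data should not become a false positive.
    """
    if avg_cpu < 0:
        return False
    return avg_cpu < threshold
```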
Trade-offs:
- Misses bursty or scheduled workloads (batch jobs, ML training)
- Flags instances that are intentionally dormant
Rather than hide these limitations, the tool documents them explicitly. Transparency builds trust in tooling.
3. Aligning Findings with CIS Benchmarks
Raw findings are less useful without context. Mapping issues to the CIS AWS Foundations Benchmark adds structure and credibility.
Example:
```python
findings.append({
    'bucket_name': bucket_name,
    'issue': 'Public access enabled',
    'severity': 'HIGH',
    'cis_control': '2.1.5',
    'remediation': 'Enable S3 Block Public Access',
})
```
This approach:
- Makes findings actionable
- Aligns with how security teams think
- Signals familiarity with compliance-driven environments
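For context, the check that produces the public-bucket finding above can be reduced to inspecting the Block Public Access configuration. A sketch, assuming the response shape of boto3's `s3.get_public_access_block()`; the helper name is mine:

```python
def bucket_publicly_accessible(pab_config: dict) -> bool:
    """Return True if any of the four S3 Block Public Access settings is off.

    pab_config is the 'PublicAccessBlockConfiguration' dict returned by
    s3.get_public_access_block(); a missing key is treated as disabled.
    """
    required = (
        'BlockPublicAcls',
        'IgnorePublicAcls',
        'BlockPublicPolicy',
        'RestrictPublicBuckets',
    )
    return not all(pab_config.get(flag, False) for flag in required)
```

In the real call path you would also handle the `NoSuchPublicAccessBlockConfiguration` error, which means the feature was never configured for the bucket and should be treated the same as all settings off.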
Results from a Real AWS Account
Running the auditor against my own AWS account produced the following:
```text
SECURITY FINDINGS: 32
CRITICAL: 11 (SSH/RDP open to 0.0.0.0/0)
HIGH: 9 (IAM users without MFA, public S3 buckets)
MEDIUM: 12 (stale access keys, permissive policies)
```
Even a relatively small account accumulated meaningful security drift. Continuous auditing is not optional at scale.
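The CRITICAL entries come from security-group rules exposing SSH or RDP to the world. Stripped of the boto3 plumbing, that check is a pure function over the `IpPermissions` structure returned by `ec2.describe_security_groups()` (the function name and port set below are my own, not the repository's):

```python
ADMIN_PORTS = {22, 3389}  # SSH and RDP

def open_admin_ports(ip_permissions: list) -> list:
    """Return admin ports exposed to 0.0.0.0/0 by a security group's rules."""
    exposed = set()
    for rule in ip_permissions:
        world_open = any(
            r.get('CidrIp') == '0.0.0.0/0' for r in rule.get('IpRanges', [])
        )
        if not world_open:
            continue
        if rule.get('IpProtocol') == '-1':  # "all traffic" rule: every port open
            exposed |= ADMIN_PORTS
            continue
        lo, hi = rule.get('FromPort'), rule.get('ToPort')
        if lo is not None and hi is not None:
            exposed |= {p for p in ADMIN_PORTS if lo <= p <= hi}
    return sorted(exposed)
```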
Handling False Positives
One flagged issue was my own AuditorToolReadOnly IAM policy using:
"Resource": "*"
At first glance, this looks overly permissive. In practice, it is required: discovery APIs such as iam:List*, iam:Get*, and ec2:Describe* enumerate resources whose ARNs are unknown until the call runs, and many of them do not support resource-level scoping at all.
Key point:
- Not all flagged issues are actionable
- False positives should be documented, not ignored
This distinction is critical in real operational environments.
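One way to act on that distinction is a documented suppression list, so reviewed false positives stay visible in reports without raising alarms. A sketch; the key shape and names are illustrative, not the repository's format:

```python
# Documented exceptions: finding keys that were reviewed and accepted.
SUPPRESSED = {
    ('iam_policy', 'AuditorToolReadOnly', 'wildcard_resource'),
}

def triage(findings: list) -> tuple:
    """Split findings into (actionable, suppressed) lists.

    Suppressed findings are still returned so reports can show them
    as documented exceptions rather than silently dropping them.
    """
    actionable, suppressed = [], []
    for f in findings:
        key = (f['type'], f['resource'], f['issue'])
        (suppressed if key in SUPPRESSED else actionable).append(f)
    return actionable, suppressed
```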
What I Would Improve for a Production Deployment
If this were moving beyond a portfolio project:
- Multi-region scanning instead of single-region execution
- Historical persistence using DynamoDB for trend analysis
- AWS Cost Explorer integration for real billing data
- Alerting via Slack or SNS for critical findings
- IAM Access Analyzer integration for deeper policy analysis
The current scope balances realism against complexity: enough to be credible, without overengineering a portfolio project.
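Of the improvements above, alerting is the simplest to sketch. Assuming a boto3 SNS client and an existing topic (the function name, subject prefix, and message format are illustrative), a critical-findings notifier might look like:

```python
import json

def alert_critical(sns_client, topic_arn: str, findings: list) -> int:
    """Publish one SNS message per CRITICAL finding; return the count sent.

    sns_client is a boto3 SNS client and topic_arn the target topic; the
    auditor itself stays read-only against AWS resources — publishing a
    notification is the only write.
    """
    sent = 0
    for f in findings:
        if f.get('severity') != 'CRITICAL':
            continue
        sns_client.publish(
            TopicArn=topic_arn,
            Subject=f"[AUDIT CRITICAL] {f.get('issue', 'finding')}",
            Message=json.dumps(f, default=str),
        )
        sent += 1
    return sent
```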
Key Takeaways
- Read-only audits reduce risk and build trust when running against live environments
- Cost and security signals are more useful when tied to metrics and known frameworks
- Not every finding should be auto-remediated; judgment still matters
These principles guided the design choices throughout this project.
Try It Yourself
Repository:
https://github.com/cypher682/aws-cost-security-auditor
Run locally:
```bash
git clone https://github.com/cypher682/aws-cost-security-auditor
cd aws-cost-security-auditor
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python src/full_audit.py --profile auditor-role
```
See the remediation guidance in docs/REMEDIATION_PLAYBOOK.md.
What’s Next
Planned extensions:
- Multi-account support (AWS Organizations)
- RDS idle detection
- Lambda cost analysis
This project is part of my portfolio focused on production-grade cloud platform engineering.