cypher682
Building a Production-Grade AWS Cost & Security Auditor

Cloud environments naturally drift. Costs creep up. Security posture degrades. Manual audits do not scale, and periodic reviews miss issues that emerge between checks.

I needed an auditing tool that could:

  1. Identify cost waste — idle EC2 instances, unattached EBS volumes, orphaned snapshots
  2. Detect security misconfigurations — public S3 buckets, overly permissive security groups, weak IAM hygiene
  3. Map findings to a known framework — CIS AWS Foundations Benchmark
  4. Operate safely — strictly read-only, no automated deletion or remediation

This article walks through the key design decisions, trade-offs, and lessons learned from building it.


Core Design Principles

1. Read-Only by Design

Decision: No auto-remediation. No destructive permissions.

Automatically deleting or modifying cloud resources is risky, especially in production. An instance that appears idle may be a disaster-recovery standby, a scheduled batch worker, or part of a failover strategy.

The tool’s role is to surface risk and waste, not to make irreversible decisions.

IAM policy scope:

{
  "Effect": "Allow",
  "Action": [
    "ec2:Describe*",
    "s3:Get*",
    "s3:List*",
    "iam:List*",
    "iam:Get*"
  ],
  "Resource": "*"
}

No Delete, Terminate, or Modify permissions. The blast radius is limited to discovery only.

This constraint shaped the entire architecture and made the tool safe to run against live accounts.
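One way to enforce this constraint in code is to lint the policy document at startup and refuse to run if any mutating verb appears. The verb list and the `is_read_only` helper below are illustrative assumptions, not part of the repository:

```python
# Sketch: reject any IAM statement that allows a mutating action.
# MUTATING_VERBS and is_read_only() are illustrative, not from the tool.
MUTATING_VERBS = ("Delete", "Terminate", "Modify", "Create", "Put", "Update")

def is_read_only(statement: dict) -> bool:
    """Return True when no allowed action begins with a mutating verb."""
    actions = statement.get("Action", [])
    if isinstance(actions, str):
        actions = [actions]
    for action in actions:
        verb = action.split(":", 1)[-1]  # "ec2:Describe*" -> "Describe*"
        if any(verb.startswith(v) for v in MUTATING_VERBS):
            return False
    return True

statement = {
    "Effect": "Allow",
    "Action": ["ec2:Describe*", "s3:Get*", "s3:List*", "iam:List*", "iam:Get*"],
    "Resource": "*",
}
print(is_read_only(statement))  # True: the policy above is discovery-only
```

A guard like this turns the read-only principle from a convention into an invariant the tool checks on every run.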


2. Defining “Idle” Using CloudWatch Metrics

Problem: “Idle” is ambiguous in cloud systems.

CPU utilization is an imperfect signal, but it is widely available and easy to reason about. I defined idle EC2 instances as those with:

  • Average CPU utilization < 5%
  • Observed over a 7-day window

Implementation:

from datetime import datetime, timedelta, timezone

def get_cpu_utilization(self, instance_id: str, days: int = 7) -> float:
    """Average CPUUtilization over the window; -1.0 means no datapoints."""
    end_time = datetime.now(timezone.utc)  # utcnow() is deprecated in 3.12
    start_time = end_time - timedelta(days=days)

    response = self.cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=86400,  # one datapoint per day
        Statistics=['Average']
    )

    datapoints = response['Datapoints']
    if datapoints:
        return sum(dp['Average'] for dp in datapoints) / len(datapoints)
    return -1.0  # sentinel: no metrics (e.g. stopped or very new instance)

Trade-offs:

  • Misses bursty or scheduled workloads (batch jobs, ML training)
  • Flags instances that are intentionally dormant

Rather than hiding these limitations, they are explicitly documented. Transparency builds trust in tooling.
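Applying the threshold then reduces to a small, pure classification step. The sketch below is an assumed shape for that step (the function name and labels are mine, not the repository's); the important detail is that the `-1.0` sentinel is reported separately and never counted as idle:

```python
IDLE_CPU_THRESHOLD = 5.0  # percent, matching the definition above

def classify_instance(avg_cpu: float) -> str:
    """Map a 7-day average CPU value to an audit label.

    An avg_cpu of -1.0 is the sentinel for "no datapoints" (a stopped
    or very new instance) and is surfaced as NO_DATA, never as IDLE.
    """
    if avg_cpu < 0:
        return "NO_DATA"
    return "IDLE" if avg_cpu < IDLE_CPU_THRESHOLD else "ACTIVE"

print(classify_instance(2.3))   # IDLE
print(classify_instance(47.0))  # ACTIVE
print(classify_instance(-1.0))  # NO_DATA
```

Keeping the classification pure also makes the thresholds trivial to unit-test without touching AWS.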


3. Aligning Findings with CIS Benchmarks

Raw findings are less useful without context. Mapping issues to the CIS AWS Foundations Benchmark adds structure and credibility.

Example:

findings.append({
    'bucket_name': bucket_name,
    'issue': 'Public access enabled',
    'severity': 'HIGH',
    'cis_control': '2.1.5',
    'remediation': 'Enable S3 Block Public Access'
})

This approach:

  • Makes findings actionable
  • Aligns with how security teams think
  • Signals familiarity with compliance-driven environments
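Once findings carry a severity field, a report summary like the one shown in the next section falls out of a simple aggregation. This is a sketch of how such a summary could be produced; `summarize` and `SEVERITY_ORDER` are assumptions, not the tool's actual report code:

```python
from collections import Counter

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

def summarize(findings: list) -> str:
    """Render a severity-ordered count summary of audit findings."""
    counts = Counter(f["severity"] for f in findings)
    lines = [f"SECURITY FINDINGS: {len(findings)}"]
    for sev in SEVERITY_ORDER:
        if counts[sev]:  # skip severities with no findings
            lines.append(f"  {sev}: {counts[sev]}")
    return "\n".join(lines)

# Toy data shaped like the findings dicts above
findings = (
    [{"severity": "CRITICAL"}] * 11
    + [{"severity": "HIGH"}] * 9
    + [{"severity": "MEDIUM"}] * 12
)
print(summarize(findings))
```

Sorting by a fixed severity order, rather than alphabetically, keeps the most urgent findings at the top of every report.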

Results from a Real AWS Account

Running the auditor against my own AWS account produced the following:

SECURITY FINDINGS: 32
  CRITICAL: 11  (SSH/RDP open to 0.0.0.0/0)
  HIGH:      9  (IAM users without MFA, public S3 buckets)
  MEDIUM:   12  (stale access keys, permissive policies)

Even a relatively small account accumulated meaningful security drift. Continuous auditing is not optional at scale.


Handling False Positives

One flagged issue was my own AuditorToolReadOnly IAM policy using:

"Resource": "*"

At first glance, this looks overly permissive. In practice, it is required. Read-only IAM and EC2 discovery APIs (List*, Describe*) cannot be scoped to specific ARNs that are not yet known.

Key point:

  • Not all flagged issues are actionable
  • False positives should be documented, not ignored

This distinction is critical in real operational environments.
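A lightweight way to document rather than ignore such findings is an explicit suppression table, where every accepted finding must carry a written justification. The structure below is an illustrative sketch, not the repository's mechanism:

```python
# Illustrative suppression mechanism, not from the repository.
# Each entry records *why* a finding is accepted, so false positives
# stay documented instead of silently disappearing from reports.
SUPPRESSIONS = {
    ("iam_policy", "AuditorToolReadOnly", "wildcard-resource"):
        "Discovery APIs (List*/Describe*) cannot be scoped to unknown ARNs",
}

def filter_findings(findings):
    """Split findings into (kept, suppressed) using the justification table."""
    kept, suppressed = [], []
    for f in findings:
        key = (f["type"], f["resource"], f["rule"])
        (suppressed if key in SUPPRESSIONS else kept).append(f)
    return kept, suppressed

findings = [
    {"type": "iam_policy", "resource": "AuditorToolReadOnly",
     "rule": "wildcard-resource", "severity": "MEDIUM"},
    {"type": "s3", "resource": "my-bucket",
     "rule": "public-access", "severity": "HIGH"},
]
kept, suppressed = filter_findings(findings)
print(len(kept), len(suppressed))  # 1 1
```

Because each suppression key carries a human-readable reason, the report can still list accepted risks in a separate section instead of dropping them.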


What I Would Improve for a Production Deployment

If this were moving beyond a portfolio project:

  1. Multi-region scanning instead of single-region execution
  2. Historical persistence using DynamoDB for trend analysis
  3. AWS Cost Explorer integration for real billing data
  4. Alerting via Slack or SNS for critical findings
  5. IAM Access Analyzer integration for deeper policy analysis

The current scope balances realism against complexity: enough depth to be useful, without overengineering a portfolio project.


Key Takeaways

  • Read-only audits reduce risk and build trust when running against live environments
  • Cost and security signals are more useful when tied to metrics and known frameworks
  • Not every finding should be auto-remediated; judgment still matters

These principles guided the design choices throughout this project.


Try It Yourself

Repository:
https://github.com/cypher682/aws-cost-security-auditor

Run locally:

git clone https://github.com/cypher682/aws-cost-security-auditor
cd aws-cost-security-auditor
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python src/full_audit.py --profile auditor-role

See the remediation guidance in docs/REMEDIATION_PLAYBOOK.md.


What’s Next

Planned extensions:

  1. Multi-account support (AWS Organizations)
  2. RDS idle detection
  3. Lambda cost analysis

This project is part of my portfolio focused on production-grade cloud platform engineering.
