cypher682
Building a Production-Grade AWS Cost & Security Auditor

Cloud environments naturally drift. Costs creep up. Security posture degrades. Manual audits do not scale, and periodic reviews miss issues that emerge between checks.

I needed an auditing tool that could:

  1. Identify cost waste — idle EC2 instances, unattached EBS volumes, orphaned snapshots
  2. Detect security misconfigurations — public S3 buckets, overly permissive security groups, weak IAM hygiene
  3. Map findings to a known framework — CIS AWS Foundations Benchmark
  4. Operate safely — strictly read-only, no automated deletion or remediation

This article walks through the key design decisions, trade-offs, and lessons learned from building it.


Core Design Principles

1. Read-Only by Design

Decision: No auto-remediation. No destructive permissions.

Automatically deleting or modifying cloud resources is risky, especially in production. An instance that appears idle may be a disaster-recovery standby, a scheduled batch worker, or part of a failover strategy.

The tool’s role is to surface risk and waste, not to make irreversible decisions.

IAM policy scope:

{
  "Effect": "Allow",
  "Action": [
    "ec2:Describe*",
    "s3:Get*",
    "s3:List*",
    "iam:List*",
    "iam:Get*"
  ],
  "Resource": "*"
}

No Delete, Terminate, or Modify permissions. The blast radius is limited to discovery only.

This constraint shaped the entire architecture and made the tool safe to run against live accounts.
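One way to enforce this constraint in code is to lint the policy document at startup and refuse to run if any mutating verb appears. The verb list and the `is_read_only` helper below are illustrative assumptions, not part of the repository:

```python
# Sketch: reject any IAM statement that allows a mutating action.
# MUTATING_VERBS and is_read_only() are illustrative, not from the tool.
MUTATING_VERBS = ("Delete", "Terminate", "Modify", "Create", "Put", "Update")

def is_read_only(statement: dict) -> bool:
    """Return True when no allowed action begins with a mutating verb."""
    actions = statement.get("Action", [])
    if isinstance(actions, str):
        actions = [actions]
    for action in actions:
        verb = action.split(":", 1)[-1]  # "ec2:Describe*" -> "Describe*"
        if any(verb.startswith(v) for v in MUTATING_VERBS):
            return False
    return True

statement = {
    "Effect": "Allow",
    "Action": ["ec2:Describe*", "s3:Get*", "s3:List*", "iam:List*", "iam:Get*"],
    "Resource": "*",
}
print(is_read_only(statement))  # True: the policy above is discovery-only
```

A guard like this turns the read-only principle from a convention into an invariant the tool checks on every run.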


2. Defining “Idle” Using CloudWatch Metrics

Problem: “Idle” is ambiguous in cloud systems.

CPU utilization is an imperfect signal, but it is widely available and easy to reason about. I defined idle EC2 instances as those with:

  • Average CPU utilization < 5%
  • Observed over a 7-day window

Implementation:

from datetime import datetime, timedelta, timezone

def get_cpu_utilization(self, instance_id: str, days: int = 7) -> float:
    """Average CPUUtilization over the window; -1.0 means no datapoints."""
    end_time = datetime.now(timezone.utc)  # utcnow() is deprecated in 3.12
    start_time = end_time - timedelta(days=days)

    response = self.cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=86400,  # one datapoint per day
        Statistics=['Average']
    )

    datapoints = response['Datapoints']
    if datapoints:
        return sum(dp['Average'] for dp in datapoints) / len(datapoints)
    return -1.0  # sentinel: no metrics (e.g. stopped or very new instance)

Trade-offs:

  • Misses bursty or scheduled workloads (batch jobs, ML training)
  • Flags instances that are intentionally dormant

Rather than hiding these limitations, they are explicitly documented. Transparency builds trust in tooling.
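Applying the threshold then reduces to a small, pure classification step. The sketch below is an assumed shape for that step (the function name and labels are mine, not the repository's); the important detail is that the `-1.0` sentinel is reported separately and never counted as idle:

```python
IDLE_CPU_THRESHOLD = 5.0  # percent, matching the definition above

def classify_instance(avg_cpu: float) -> str:
    """Map a 7-day average CPU value to an audit label.

    An avg_cpu of -1.0 is the sentinel for "no datapoints" (a stopped
    or very new instance) and is surfaced as NO_DATA, never as IDLE.
    """
    if avg_cpu < 0:
        return "NO_DATA"
    return "IDLE" if avg_cpu < IDLE_CPU_THRESHOLD else "ACTIVE"

print(classify_instance(2.3))   # IDLE
print(classify_instance(47.0))  # ACTIVE
print(classify_instance(-1.0))  # NO_DATA
```

Keeping the classification pure also makes the thresholds trivial to unit-test without touching AWS.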


3. Aligning Findings with CIS Benchmarks

Raw findings are less useful without context. Mapping issues to the CIS AWS Foundations Benchmark adds structure and credibility.

Example:

findings.append({
    'bucket_name': bucket_name,
    'issue': 'Public access enabled',
    'severity': 'HIGH',
    'cis_control': '2.1.5',
    'remediation': 'Enable S3 Block Public Access'
})

This approach:

  • Makes findings actionable
  • Aligns with how security teams think
  • Signals familiarity with compliance-driven environments
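Once findings carry a severity field, a report summary like the one shown in the next section falls out of a simple aggregation. This is a sketch of how such a summary could be produced; `summarize` and `SEVERITY_ORDER` are assumptions, not the tool's actual report code:

```python
from collections import Counter

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

def summarize(findings: list) -> str:
    """Render a severity-ordered count summary of audit findings."""
    counts = Counter(f["severity"] for f in findings)
    lines = [f"SECURITY FINDINGS: {len(findings)}"]
    for sev in SEVERITY_ORDER:
        if counts[sev]:  # skip severities with no findings
            lines.append(f"  {sev}: {counts[sev]}")
    return "\n".join(lines)

# Toy data shaped like the findings dicts above
findings = (
    [{"severity": "CRITICAL"}] * 11
    + [{"severity": "HIGH"}] * 9
    + [{"severity": "MEDIUM"}] * 12
)
print(summarize(findings))
```

Sorting by a fixed severity order, rather than alphabetically, keeps the most urgent findings at the top of every report.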

Results from a Real AWS Account

Running the auditor against my own AWS account produced the following:

SECURITY FINDINGS: 32
  CRITICAL: 11  (SSH/RDP open to 0.0.0.0/0)
  HIGH:      9  (IAM users without MFA, public S3 buckets)
  MEDIUM:   12  (stale access keys, permissive policies)

Even a relatively small account accumulated meaningful security drift. Continuous auditing is not optional at scale.


Handling False Positives

One flagged issue was my own AuditorToolReadOnly IAM policy using:

"Resource": "*"

At first glance, this looks overly permissive. In practice, it is required. Read-only IAM and EC2 discovery APIs (List*, Describe*) cannot be scoped to specific ARNs that are not yet known.

Key point:

  • Not all flagged issues are actionable
  • False positives should be documented, not ignored

This distinction is critical in real operational environments.
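A lightweight way to document rather than ignore such findings is an explicit suppression table, where every accepted finding must carry a written justification. The structure below is an illustrative sketch, not the repository's mechanism:

```python
# Illustrative suppression mechanism, not from the repository.
# Each entry records *why* a finding is accepted, so false positives
# stay documented instead of silently disappearing from reports.
SUPPRESSIONS = {
    ("iam_policy", "AuditorToolReadOnly", "wildcard-resource"):
        "Discovery APIs (List*/Describe*) cannot be scoped to unknown ARNs",
}

def filter_findings(findings):
    """Split findings into (kept, suppressed) using the justification table."""
    kept, suppressed = [], []
    for f in findings:
        key = (f["type"], f["resource"], f["rule"])
        (suppressed if key in SUPPRESSIONS else kept).append(f)
    return kept, suppressed

findings = [
    {"type": "iam_policy", "resource": "AuditorToolReadOnly",
     "rule": "wildcard-resource", "severity": "MEDIUM"},
    {"type": "s3", "resource": "my-bucket",
     "rule": "public-access", "severity": "HIGH"},
]
kept, suppressed = filter_findings(findings)
print(len(kept), len(suppressed))  # 1 1
```

Because each suppression key carries a human-readable reason, the report can still list accepted risks in a separate section instead of dropping them.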


What I Would Improve for a Production Deployment

If this were moving beyond a portfolio project:

  1. Multi-region scanning instead of single-region execution
  2. Historical persistence using DynamoDB for trend analysis
  3. AWS Cost Explorer integration for real billing data
  4. Alerting via Slack or SNS for critical findings
  5. IAM Access Analyzer integration for deeper policy analysis

The current scope balances realism against complexity: enough depth to be useful, without overengineering a portfolio project.


Key Takeaways

  • Read-only audits reduce risk and build trust when running against live environments
  • Cost and security signals are more useful when tied to metrics and known frameworks
  • Not every finding should be auto-remediated; judgment still matters

These principles guided the design choices throughout this project.


Try It Yourself

Repository:
https://github.com/cypher682/aws-cost-security-auditor

Run locally:

git clone https://github.com/cypher682/aws-cost-security-auditor
cd aws-cost-security-auditor
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python src/full_audit.py --profile auditor-role

See the remediation guidance in docs/REMEDIATION_PLAYBOOK.md.


What’s Next

Planned extensions:

  1. Multi-account support (AWS Organizations)
  2. RDS idle detection
  3. Lambda cost analysis

This project is part of my portfolio focused on production-grade cloud platform engineering.
