Suresh

Posted on Jan 2

CleanCloud v0.4.0: How We Made Cloud Hygiene Scanning 5x Faster

#aws #azure #devops #opensource

I just shipped CleanCloud v0.4.0 with major performance improvements through parallel scanning. Here's how we did it.

What's CleanCloud?

If you missed the original announcement, CleanCloud is a read-only CLI tool that scans AWS/Azure for orphaned or wasteful resources: unattached volumes, old snapshots, and CloudWatch log groups with infinite retention.

Unlike aggressive cleanup tools, CleanCloud gives you conservative signals so you can review before taking action. No auto-delete, no risk.

The Performance Problem

v0.3.x had a bottleneck: sequential scanning.

# Old approach (v0.3.x)
findings = []
for region in regions_to_scan:
    click.echo(f"🔍 Scanning region {region}")
    findings.extend(_scan_aws_region(profile, region))

Result: Each region was scanned one at a time, so total scan time grew linearly with the number of regions.

The Solution: Parallel Scanning

v0.4.0 introduces concurrent scanning at two levels:

1. Parallel Region Scanning

# New approach (v0.4.0)
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Optional

import click

def scan_aws_regions(profile: Optional[str], regions_to_scan: List[str]) -> List[Finding]:
    findings = []

    with ThreadPoolExecutor(max_workers=min(5, len(regions_to_scan))) as executor:
        futures = {
            executor.submit(_scan_aws_region, profile, region): region 
            for region in regions_to_scan
        }

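        for future in as_completed(futures):
            region = futures[future]
            try:
                findings.extend(future.result())
                click.echo(f"✅ Completed region {region}")
            except Exception as e:
                # A failed region (e.g. session creation) must not abort the scan
                click.echo(f"⚠️ Region {region} failed: {e}")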

    return findings

Key decisions:

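max_workers=min(5, len(regions_to_scan)) - Caps concurrency at 5 regions to stay under API rate limits
as_completed() - Reports each region as soon as it finishes, rather than in submission order
Results are collected in the main thread as futures complete, so the shared findings list needs no lock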
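
To see the pattern in isolation, here is a minimal, self-contained sketch that swaps the real AWS calls for a simulated scan. `fake_scan_region` and its 0.1-second sleep are stand-ins invented for this example, not CleanCloud code. It uses the same `ThreadPoolExecutor` plus `as_completed()` structure and gathers results in the submitting thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List


def fake_scan_region(region: str) -> List[str]:
    """Stand-in for a real region scan: sleeps to mimic network-bound API calls."""
    time.sleep(0.1)
    return [f"{region}:finding-1", f"{region}:finding-2"]


def scan_regions_parallel(regions: List[str], max_workers: int = 5) -> Dict[str, List[str]]:
    """Scan regions concurrently and return findings keyed by region."""
    results: Dict[str, List[str]] = {}
    if not regions:
        return results  # ThreadPoolExecutor rejects max_workers=0

    with ThreadPoolExecutor(max_workers=min(max_workers, len(regions))) as executor:
        futures = {executor.submit(fake_scan_region, r): r for r in regions}
        # Results are gathered here, in the submitting thread, so no lock is needed.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    regions = ["us-east-1", "us-west-2", "eu-west-1", "ap-south-1", "sa-east-1"]
    start = time.perf_counter()
    found = scan_regions_parallel(regions)
    elapsed = time.perf_counter() - start
    print(f"{sum(len(v) for v in found.values())} findings in {elapsed:.2f}s")
```

One edge case worth guarding in any version of this pattern: an empty region list makes `max_workers` evaluate to 0, which `ThreadPoolExecutor` rejects with a `ValueError`, hence the early return in the sketch.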

2. Parallel Rule Execution

Within each region, we also parallelized individual rules:

AWS_RULES = [
    find_unattached_ebs_volumes,
    find_old_ebs_snapshots,
    find_inactive_cloudwatch_logs,
    find_aws_untagged_resources,
]

def _scan_aws_region(profile: Optional[str], region: str) -> List[Finding]:
    session = create_aws_session(profile=profile, region=region)
    findings = []

    with ThreadPoolExecutor(max_workers=min(4, len(AWS_RULES))) as executor:
        futures = [executor.submit(rule, session, region) for rule in AWS_RULES]

        for future in as_completed(futures):
            try:
                rule_findings = future.result()
                findings.extend(rule_findings)
            except Exception as e:
                # Never fail entire scan due to one rule
                click.echo(f"⚠️ Rule failed in {region}: {e}")

    return findings

Benefits:

All 4 rules run concurrently per region
Exception isolation (one failing rule doesn't break the scan)
Better resource utilization (rules are network-bound, so their wait times overlap instead of adding up)

Performance Improvements

Real-world results from testing:

Single region scan:

Before: ~20-25 seconds
After: ~15-18 seconds
Improvement: ~30% faster

Multi-region scan (5 regions):

Before: ~100-120 seconds (sequential)
After: ~20-25 seconds (parallel)
Improvement: ~5x faster

The key insight: The more regions you scan, the bigger the improvement, up to the 5-worker cap. Parallel execution shines when there's actual work to parallelize.
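
As a back-of-the-envelope check on those numbers, here is a simplified model of wall-clock time. It assumes every region takes the same time and that regions run in "waves" of `max_workers`; it ignores throttling, uneven region sizes, and the rule-level parallelism. The function name and the 20-second figure are illustrative only.

```python
import math


def estimated_scan_seconds(num_regions: int, per_region_seconds: float, max_workers: int = 5) -> float:
    """Rough wall-clock estimate: regions are processed in waves of max_workers."""
    if num_regions <= 0:
        return 0.0
    workers = min(max_workers, num_regions)
    waves = math.ceil(num_regions / workers)
    return waves * per_region_seconds


if __name__ == "__main__":
    for n in (1, 5, 10, 16):
        sequential = estimated_scan_seconds(n, 20, max_workers=1)
        parallel = estimated_scan_seconds(n, 20, max_workers=5)
        print(f"{n:>2} regions: {sequential:.0f}s sequential, {parallel:.0f}s parallel")
```

With 20 seconds per region, 5 regions come out at 100 s sequential versus 20 s parallel, in line with the measurements above. At 10 regions the model gives 200 s versus 40 s, so under these assumptions the speedup stays near 5x once the worker cap is reached.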

Azure Gets the Same Treatment

Azure subscriptions are now scanned in parallel too:

def scan_azure_subscriptions(
    subscription_ids: List[str],
    credential,
    region_filter: Optional[str],
) -> List[Finding]:
    all_findings = []

    with ThreadPoolExecutor(max_workers=min(4, len(subscription_ids))) as executor:
        futures = {
            executor.submit(
                _scan_azure_subscription,
                subscription_id=sub_id,
                credential=credential,
                region_filter=region_filter,
            ): sub_id
            for sub_id in subscription_ids
        }

        for future in as_completed(futures):
            sub_id = futures[future]
            try:
                all_findings.extend(future.result())
                # Report success only after the result was retrieved without error
                click.echo(f"✅ Completed subscription {sub_id}")
            except Exception as e:
                click.echo(f"⚠️ Subscription {sub_id} failed: {e}")

    return all_findings

Same benefits for Azure users with multiple subscriptions.

Other v0.4.0 Improvements

🔒 Safety Integration Tests

We now have automated tests that verify CleanCloud's read-only guarantees:

def test_scan_is_read_only():
    """Ensure no write operations during scan."""
    # Run full scan
    scan_result = scan_all_regions()

    # Check CloudTrail for write operations
    cloudtrail_events = get_recent_events()
    write_events = [e for e in cloudtrail_events 
                    if e['EventName'] not in READ_ONLY_OPERATIONS]

    # Fail if ANY writes detected
    assert len(write_events) == 0, f"Write operations detected: {write_events}"

These run in CI on every PR against real AWS/Azure accounts. If CleanCloud ever tries to write, the build fails.

Why this matters: You can trust that CleanCloud is truly read-only, not just claiming to be.
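
For readers who want to try the idea, here is one possible sketch of the filtering step. It is not CleanCloud's actual test code: it assumes events are plain dicts with an `EventName` key, and `READ_ONLY_PREFIXES` is a hypothetical allowlist made up for this example (the real `READ_ONLY_OPERATIONS` list is not shown in this post). Keep in mind that CloudTrail events can take several minutes to appear, so a real test would need to wait before checking.

```python
from typing import Dict, Iterable, List

# Hypothetical allowlist: event-name prefixes treated as read-only in this sketch.
READ_ONLY_PREFIXES = ("Describe", "List", "Get")


def find_write_events(events: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return events whose EventName does not start with a read-only prefix.

    Events with no EventName are flagged too, which errs on the cautious side.
    """
    return [e for e in events if not e.get("EventName", "").startswith(READ_ONLY_PREFIXES)]


if __name__ == "__main__":
    sample = [
        {"EventName": "DescribeVolumes"},
        {"EventName": "DescribeSnapshots"},
        {"EventName": "DescribeLogGroups"},
        {"EventName": "DeleteVolume"},
    ]
    print(find_write_events(sample))
```

A prefix allowlist is deliberately strict: anything it does not recognize counts as a write, so a new API call fails the check until someone reviews it.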

🩺 Enhanced Doctor Command

The cleancloud doctor command now provides actionable IAM diagnostics:

cleancloud doctor --provider aws

# Before (v0.3.x):
❌ Permission denied

# After (v0.4.0):
❌ Missing IAM permission: ec2:DescribeVolumes

Suggested IAM policy:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["ec2:DescribeVolumes"],
    "Resource": "*"
  }]
}

Much more helpful for debugging permission issues.
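
If you want to generate a policy like this yourself, the sketch below shows one way to turn a list of missing actions into the same JSON shape. `build_suggested_policy` is a hypothetical helper written for this post, not the doctor command's implementation.

```python
import json
from typing import Dict, Iterable


def build_suggested_policy(missing_actions: Iterable[str]) -> Dict:
    """Build a minimal allow-policy covering the given IAM actions."""
    actions = sorted(set(missing_actions))  # de-duplicate and keep a stable order
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": actions,
            "Resource": "*",
        }],
    }


if __name__ == "__main__":
    policy = build_suggested_policy(["ec2:DescribeVolumes", "ec2:DescribeSnapshots"])
    print(json.dumps(policy, indent=2))
```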

📊 Post-Scan Feedback

After each scan, you'll see a feedback prompt (pass --no-feedback to suppress it in CI/CD):

--- Scan Summary ---
Total findings: 23

CleanCloud feedback
-------------------
If this scan surfaced useful findings, we'd love to hear about it.

Share feedback: https://github.com/cleancloud-io/cleancloud/discussions

This helps us improve detection rules based on real user feedback.

Real-World Impact

Since launch, CleanCloud users have reported finding:

💰 Cost Savings:

$8K-12K/year in forgotten CloudWatch logs (infinite retention)
$500-2K/year in unattached EBS volumes
$300-1K/year in old snapshots

🎯 Common Findings:

50-100 unattached volumes per account
100-300 old snapshots from deleted instances
20-50 log groups with infinite retention

⏱️ Time to Value:

Scan time: 20-30 seconds (v0.4.0)
Review time: 5-10 minutes
First cleanup: Same day
ROI: Immediate

Installation & Usage

# Install
pip install cleancloud

# Scan all active AWS regions (auto-detects which have resources)
cleancloud scan --provider aws --all-regions

# Check IAM permissions
cleancloud doctor --provider aws

# Scan specific region
cleancloud scan --provider aws --region us-east-1

# Scan Azure
cleancloud scan --provider azure

Example output:

🔍 Starting CleanCloud scan...

Provider: aws

🔍 Auto-detecting regions with resources...
✓ Found 3 active regions: us-east-1, us-west-2, eu-west-1

✅ Completed region us-east-1
✅ Completed region us-west-2
✅ Completed region eu-west-1

--- Scan Summary ---
Total findings: 47
By confidence: {'HIGH': 12, 'MEDIUM': 23, 'LOW': 12}
Regions scanned: us-east-1, us-west-2, eu-west-1

Technical Deep Dive: Threading Challenges

Building the parallel scanning wasn't trivial. Here are some challenges we hit:

1. Thread Safety with boto3

boto3 sessions and resource objects are not thread-safe (low-level clients generally are). We had to create separate sessions per thread:

def _scan_aws_region(profile: Optional[str], region: str) -> List[Finding]:
    # Create NEW session per thread
    session = create_aws_session(profile=profile, region=region)

    # Now safe to use in this thread
    findings = []
    # ... scanning logic
    return findings

Lesson: Never share boto3 sessions or resource objects across threads. Create a new session per worker; low-level clients are generally safe to share once created.
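
A generic way to get "one session per thread" is `threading.local`. The sketch below is an illustration under assumptions rather than CleanCloud's code: `FakeSession` stands in for a boto3 session so the example runs without AWS, and `get_thread_session` is a hypothetical helper. It also keys only by thread; a real version would need to account for profile and region as well.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

_local = threading.local()  # each thread sees its own attributes on this object


class FakeSession:
    """Stand-in for a boto3 session so the sketch runs without AWS."""


def get_thread_session(factory: Callable[[], object] = FakeSession) -> object:
    """Return the calling thread's session, creating it on first use."""
    session = getattr(_local, "session", None)
    if session is None:
        session = factory()
        _local.session = session
    return session


def worker(_: int) -> int:
    # Every task running on the same pool thread reuses that thread's session.
    return id(get_thread_session())


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as executor:
        ids = set(executor.map(worker, range(20)))
    print(f"{len(ids)} distinct sessions for 20 tasks on 4 workers")
```

Because pool threads are reused, the number of sessions created is bounded by the number of workers rather than the number of tasks.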

2. Rate Limiting

Running 5 regions in parallel, each with up to 4 concurrent rules, means as many as 20 API calls can be in flight at once. We had to be smart about worker limits:

# Limit parallelism to avoid throttling
max_workers = min(5, len(regions_to_scan))  # Cap at 5 workers

Also: boto3's built-in retry logic handles most throttling gracefully once the adaptive retry mode is enabled (the default mode is legacy).
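
Because the region pool and the rule pool are nested, their limits multiply (5 regions x 4 rules). If that ever proves too aggressive, one option is a single shared cap on in-flight calls. The sketch below shows the idea with a `BoundedSemaphore`; `CallLimiter`, `fake_api_call`, and the cap of 8 are all hypothetical, and this is not something v0.4.0 does.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


class CallLimiter:
    """Caps the number of calls in flight across all threads that share it."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._current = 0
        self.peak = 0  # highest concurrency observed, kept for the demo

    def run(self, call: Callable[[], T]) -> T:
        with self._slots:  # blocks while max_in_flight calls are already running
            with self._lock:
                self._current += 1
                self.peak = max(self.peak, self._current)
            try:
                return call()
            finally:
                with self._lock:
                    self._current -= 1


def fake_api_call() -> str:
    time.sleep(0.01)  # stand-in for a network-bound cloud API call
    return "ok"


if __name__ == "__main__":
    limiter = CallLimiter(max_in_flight=8)
    with ThreadPoolExecutor(max_workers=20) as executor:
        futures = [executor.submit(limiter.run, fake_api_call) for _ in range(60)]
        results = [f.result() for f in futures]
    print(f"{len(results)} calls, peak concurrency {limiter.peak} (cap 8)")
```

The pool sizes then control how much work is queued, while the limiter controls how much of it hits the API at once.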

3. Error Isolation

One failing rule or region shouldn't kill the entire scan:

for future in as_completed(futures):
    try:
        rule_findings = future.result()
        findings.extend(rule_findings)
    except Exception as e:
        # Log error but continue
        click.echo(f"⚠️ Rule failed: {e}")

Result: You still get partial results if some rules or regions fail. Trust-first means never failing the entire scan.
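
A variation on this pattern, sketched below under assumptions, returns the errors alongside the findings instead of only printing them, so the caller can summarize what was skipped. `run_isolated`, `ok_rule`, and `broken_rule` are names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def run_isolated(tasks: Dict[str, Callable[[], List[str]]]) -> Tuple[List[str], Dict[str, str]]:
    """Run named tasks concurrently; return (findings, errors keyed by task name)."""
    findings: List[str] = []
    errors: Dict[str, str] = {}
    if not tasks:
        return findings, errors

    with ThreadPoolExecutor(max_workers=min(4, len(tasks))) as executor:
        futures = {executor.submit(task): name for name, task in tasks.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                findings.extend(future.result())
            except Exception as exc:  # isolate the failure, keep other results
                errors[name] = f"{type(exc).__name__}: {exc}"
    return findings, errors


def ok_rule() -> List[str]:
    return ["vol-123 is unattached"]


def broken_rule() -> List[str]:
    raise PermissionError("missing ec2:DescribeSnapshots")


if __name__ == "__main__":
    found, failed = run_isolated({"volumes": ok_rule, "snapshots": broken_rule})
    print(found)
    print(failed)
```

Keeping the errors as data makes it straightforward to print a "skipped checks" section in the scan summary, so partial results are never mistaken for a clean bill of health.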

4. Progress Feedback

Users need to know what's happening during parallel scans:

for future in as_completed(futures):
    region = futures[future]
    click.echo(f"✅ Completed region {region}")

Better UX: Show progress as regions complete, not just at the end.

What's Next

Roadmap for v0.5.0:

🌐 GCP support - Extend beyond AWS/Azure
⚙️ Configurable thresholds - Adjust age/confidence per environment
💵 Cost calculations - Show potential savings in dollars
🔗 CI/CD templates - GitHub Actions, GitLab CI examples
📊 JSON export improvements - Better integration with other tools

Want to contribute? We welcome PRs! Check out the open issues on GitHub.

Why Open Source?

CleanCloud is MIT licensed with:

✅ Zero telemetry
✅ No phone-home
✅ No tracking
✅ All code visible

Why?

Trust is critical for cloud security tools. Open source means you can verify CleanCloud is truly read-only. No need to trust my promises - read the code.

Plus: Building in public creates better software through community feedback.

Try It Out

pip install cleancloud
cleancloud scan --provider aws --all-regions

Links:

📦 PyPI: https://pypi.org/project/cleancloud
💻 GitHub: https://github.com/cleancloud-io/cleancloud
📖 Docs: https://github.com/cleancloud-io/cleancloud#readme

Feedback Welcome!

What cloud hygiene checks would be useful? What other resources should CleanCloud scan?

Drop a comment or open an issue on GitHub. Would love to hear what you find! 🚀
