A few weeks ago my manager asked me a question that seemed simple:
"Can it be scheduled to arrive on its own every week?"
The script was already scanning more than 20 AWS accounts. It was already detecting Access Keys from 2018 still active in production. It was already generating a dashboard that any CISO could read without opening a spreadsheet. Technically, the work was done.
But "the work was done" meant someone had to remember to run it. Someone had to have Docker installed, credentials configured, and free time on a Monday morning. On a security team with multiple open fronts, that "someone" is exactly the link that breaks.
Automation wasn't a cosmetic improvement. It was the step that turned a tool into a service.
The constraint that defines the architecture
The first decision wasn't technical — it was about constraints.
The report needs to run once a week. It takes minutes. When it's done, there's nothing to keep alive. Paying for infrastructure that sits idle 99.9% of the time isn't just a cost problem — it's a design problem.
With that clear, the options narrow themselves.
Lambda? The 15-minute execution limit is the problem. In an Organization with many accounts, the script can take longer — and a silent timeout halfway through an audit is worse than not running at all. Lambda is designed for millisecond-to-minute workloads, not for audit processes that traverse dozens of accounts in sequence.
ECS Service? A Service is designed for processes that run indefinitely — an API, a worker listening to a queue. Keeping a Service alive for a weekly job is exactly the antipattern we wanted to avoid. You pay for availability you'll never use.
EC2? More attack surface, more OS management, more base cost. Discarded.
The right answer is ECS Fargate Task — no Service, no persistent instances. A Task is ephemeral by design: it spins up, executes, and disappears. Nothing is running between executions. Nothing to patch, monitor, or pay for when not in use.
That's FinOps applied to security: the cheapest architecture isn't the one with fewer features — it's the one that doesn't spend on what it doesn't need.
The architecture
The complete flow has four steps and no unnecessary pieces.
EventBridge Scheduler fires an event every Monday at 9am Lima time — cron(0 14 ? * MON *), since the expression is evaluated in UTC and Lima is UTC-5. No server waiting, no process sleeping. EventBridge simply remembers it has something to do, and does it.
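In Terraform, that schedule can be sketched with an `aws_scheduler_schedule` resource. Resource names, the cluster reference, and the networking values below are illustrative — not the repo's actual configuration:

```hcl
resource "aws_scheduler_schedule" "iam_audit_weekly" {
  name                = "iam-audit-weekly"       # assumed name
  schedule_expression = "cron(0 14 ? * MON *)"   # Monday 9am Lima (UTC-5)

  flexible_time_window {
    mode = "OFF"
  }

  target {
    arn      = aws_ecs_cluster.security.arn   # assumed cluster
    role_arn = aws_iam_role.scheduler.arn     # role allowed to run the task

    ecs_parameters {
      task_definition_arn = aws_ecs_task_definition.iam_audit.arn
      launch_type         = "FARGATE"

      network_configuration {
        subnets          = var.private_subnet_ids
        assign_public_ip = false
      }
    }
  }
}
```

Note the `flexible_time_window { mode = "OFF" }` block: the audit should fire at a predictable time, not within a window.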
That event spins up an ECS Fargate Task in the Security account. The task runs the iam-audit Docker image — the same one you were running locally with a single command in Part 2 — but now on AWS, with an assigned IAM role, no hardcoded credentials, no human intervention.
When the task finishes, it uploads the report to a dedicated S3 bucket. The bucket has a 90-day lifecycle policy — older reports are deleted automatically. No indefinite accumulation, no silently growing costs.
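The 90-day retention can be expressed as a lifecycle rule on the bucket — the rule id and bucket reference here are assumptions:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "reports" {
  bucket = aws_s3_bucket.reports.id

  rule {
    id     = "expire-old-reports"
    status = "Enabled"

    filter {}   # apply to every object in the bucket

    expiration {
      days = 90   # reports older than 90 days are deleted automatically
    }
  }
}
```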
The last step is notification. The task generates a presigned URL valid for 48 hours and sends it via Slack. Whoever receives the message has two days to open the dashboard — after that the link expires. The report never leaves your AWS account; what travels through Slack is only the temporary access.
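A minimal sketch of that last step in Python, assuming boto3 is available inside the task image. The function names and message layout are illustrative, not the repo's actual API:

```python
EXPIRES_SECONDS = 48 * 3600  # 48-hour presigned URL (X-Amz-Expires=172800)


def presigned_report_url(bucket: str, key: str) -> str:
    """Generate a temporary link to the report in the private bucket.

    Requires AWS credentials — inside the Fargate task, the task role
    provides them automatically.
    """
    import boto3  # available in the task image

    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=EXPIRES_SECONDS,
    )


def slack_payload(dashboard_url: str, accounts_audited: int) -> dict:
    """Build the Slack webhook body: only the temporary URL travels,
    never the report itself."""
    return {
        "text": (
            "🔍 iam-audit | Weekly report\n"
            f"Organization: {accounts_audited} accounts audited\n"
            f"📊 View dashboard → {dashboard_url}\n"
            "⏳ Link valid for 48 hours"
        )
    }
```

The payload is then POSTed to the webhook URL the task reads from its environment; once the 48 hours pass, the link in old Slack messages simply stops working.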
EventBridge Scheduler (Monday 9am Lima)
        │
        ▼
ECS Fargate Task
  └─ image: gerardokaztro/iam-audit
  └─ role: iam-audit-task-role
  └─ secret: Slack webhook URL (Secrets Manager)
        │
        ├──▶ S3 bucket (report + presigned URL 48h)
        │
        └──▶ Slack (presigned URL)
Everything lives in the Security account of the Organization. Not in the management account, not in an application account. The Security account is the right place for tools that touch the entire organization — isolated, with controlled access, audited separately.
If you want to deploy it in a different account, you can do so without touching anything structural. Just adjust the configuration values — bucket name, SSO profile, environment variables — and Terraform handles the rest the same way.
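Those values live as plain Terraform variables — a sketch of what that interface might look like (variable names are illustrative):

```hcl
variable "reports_bucket" {
  description = "S3 bucket that receives the weekly reports"
  type        = string
}

variable "aws_profile" {
  description = "SSO profile used to deploy"
  type        = string
}

variable "aws_region" {
  description = "Region where the stack is deployed"
  type        = string
  default     = "us-east-1"
}
```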
Terraform and the partial backend
The entire stack is defined in Terraform. But before writing a single resource, there's a language limitation that needs to be understood — and if you don't know it, it leads you straight to hardcoding things that shouldn't be in code.
The backend block in Terraform initializes before the variable system. That means this doesn't work:
terraform {
  backend "s3" {
    bucket  = var.state_bucket   # ❌ not valid
    region  = var.aws_region     # ❌ not valid
    profile = var.aws_profile    # ❌ not valid
  }
}
Terraform rejects it at init. The variables simply don't exist yet at that point in the lifecycle.
The obvious solution — and the wrong one — is to hardcode the values directly:
terraform {
  backend "s3" {
    bucket  = "my-state-bucket"   # ❌ now it's in the repo
    region  = "us-east-1"
    profile = "my-sso-profile"
  }
}
It works. But if the repo is public, you just exposed your state bucket name and SSO profile. And if the repo is private today, it might not be tomorrow.
The correct solution is the partial backend: leave the block empty in main.tf and pass the values in a separate file that goes in .gitignore.
main.tf:
terraform {
  backend "s3" {}
}
backend.hcl (in .gitignore, never in the repo):
bucket = "your-state-bucket"
region = "us-east-1"
profile = "your-sso-profile"
And the init looks like this:
terraform init -backend-config=backend.hcl
The repo includes a backend.hcl.example with the structure and example values. Whoever clones the project copies the file, fills in their values, and runs init. No friction, no exposed secrets.
This isn't a workaround — partial backend configuration is the pattern Terraform itself documents for exactly this case. A language limitation, turned into a security best practice.
The Task Definition and secrets
When you define an ECS Task Definition in Terraform, you have two ways to pass values to the container: environment and secrets. They seem equivalent. They're not.
environment passes the value directly as an environment variable — visible in plain text in the ECS console, in the task logs, and in any describe-task-definition that someone with account access runs.
secrets does something different: it tells the task to fetch the value from AWS Secrets Manager at execution time, inject it as an environment variable in memory, and never write it anywhere. The value doesn't appear in the task definition. It doesn't appear in the logs. It doesn't appear in the console.
The Slack webhook URL is exactly the kind of value that shouldn't be in environment. Anyone with that URL can send messages to your Slack channel on behalf of the system — with no additional authentication, no traceability. It's a credential, not a configuration.
In the Task Definition it looks like this:
secrets = [
  {
    name      = "SLACK_WEBHOOK_URL"
    valueFrom = aws_secretsmanager_secret.slack_webhook.arn
  }
]
The value is created once in Secrets Manager and Terraform only references the ARN. The container receives the variable at runtime — the Python code reads it with os.environ["SLACK_WEBHOOK_URL"] like any environment variable, but it was never exposed in any definition.
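A hedged sketch of that split — Terraform manages the secret's existence, while the value itself is put once out of band so it never touches the repo or the plan output (the secret name is an assumption):

```hcl
resource "aws_secretsmanager_secret" "slack_webhook" {
  name = "iam-audit/slack-webhook"   # assumed name
}

# The value is set once, outside Terraform:
#   aws secretsmanager put-secret-value \
#     --secret-id iam-audit/slack-webhook \
#     --secret-string "$SLACK_WEBHOOK_URL"
```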
The task's IAM role
The iam-audit-task-role is the role the container assumes at runtime. It's the equivalent of the AWS profile you were using locally in the first two posts — but now it's a role with permissions explicitly defined in Terraform, no long-lived credentials, no ~/.aws to mount.
What the task needs to function is exactly this and nothing more:
# List accounts in the Organization
statement {
  effect    = "Allow"
  actions   = ["organizations:ListAccounts"]
  resources = ["*"]
}

# Assume the audit role in each member account
statement {
  effect    = "Allow"
  actions   = ["sts:AssumeRole"]
  resources = ["arn:aws:iam::*:role/iam-audit-role"]
}

# Upload the report to the S3 bucket
statement {
  effect    = "Allow"
  actions   = ["s3:PutObject", "s3:GetObject"]
  resources = ["${aws_s3_bucket.reports.arn}/*"]
}

# Read the Slack secret
statement {
  effect    = "Allow"
  actions   = ["secretsmanager:GetSecretValue"]
  resources = [aws_secretsmanager_secret.slack_webhook.arn]
}
No * in resources where it isn't necessary. No AdministratorAccess because "it's easier." Each permission has a specific reason and a scoped target.
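For those permissions to apply, the role also needs a trust policy that lets ECS tasks assume it — a minimal sketch, with resource names as assumptions:

```hcl
data "aws_iam_policy_document" "task_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "iam_audit_task" {
  name               = "iam-audit-task-role"
  assume_role_policy = data.aws_iam_policy_document.task_assume.json
}
```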
There's something worth noting: this is the role of the tool that audits least privilege across the entire organization. If that role had excessive permissions, we'd be auditing a principle we don't apply at home. Consistency isn't just aesthetic — it's what makes the project credible.
The result
Every Monday at 9am, without anyone doing anything, this arrives in Slack:
🔍 iam-audit | Weekly report
Organization: more than 20 accounts audited
📊 View dashboard → https://s3.amazonaws.com/...?X-Amz-Expires=172800
⏳ Link valid for 48 hours
Nothing to run. Nothing to remember. No engineer who had to remember on a Monday morning that this tool existed.
The Fargate Task spun up, audited, uploaded the report, generated the presigned URL, notified, and disappeared. The total cost of that execution is cents — literally. An ephemeral task that runs minutes per week doesn't generate a visible line in the monthly billing.
That's what it means to automate well: not just that it runs on its own, but that it runs on its own without leaving infrastructure or cost behind.
For a security team in LATAM operating on a tight budget with multiple open fronts, this isn't a minor detail. It's the difference between a tool that gets used and a tool that gets forgotten.
Closing
Three posts. Three versions of the same problem.
The first was a question: who has access, with what credentials, and since when? The answer was a Python script that traversed the entire Organization in minutes and surfaced what nobody was looking at.
The second was a tension: the data was there, but it didn't communicate to all audiences. The answer was a dashboard anyone could read, root account detection, and a Docker image that eliminated the friction of running it.
The third was an operational constraint: someone had to run it. The answer was turning the tool into a service — ephemeral, automated, secure, and with a cost that doesn't justify a line in the budget.
Visibility. Communication. Automation.
That's what we built. Not with commercial platforms, not with an enterprise budget, not with a team of ten people. With Python, Docker, Terraform, and the right design decisions.
If you're building AWS security in LATAM with the resources you have — not the ones you wish you had — I hope this series gave you something concrete to take with you. Not a framework to memorize. A tool you can run today.
The repository is on GitHub. The image is on Docker Hub. The IaC is in Terraform. All open, all documented, all yours.
🔗 GitHub: gerardokaztro/iam-audit
🐳 Docker Hub: gerardokaztro/iam-audit
About the author
Gerardo Castro is an AWS Security Hero and Cloud Security Engineer focused on LATAM. Founder and Lead Organizer of the AWS Security Users Group LatAm. He believes the best way to learn cloud security is by building real things — not memorizing frameworks. He writes about what he builds, what he finds, and what he learns along the way.
🔗 GitHub: gerardokaztro
🔗 LinkedIn: gerardokaztro


