DEV Community: Tom Williams

Fixing a 4-Year-Old Terraform Timeout in the AWS Provider

Tom Williams — Tue, 16 Jun 2026 15:52:06 +0000

Deploying AWS Config conformance packs across an Organization sounds like a solved problem until Terraform sits on Still creating... for half an hour and then gives up on a pack that deployed fine five minutes in.

I hit this running Config from a delegated admin account, found a four-year-old GitHub issue full of people hitting the same wall, and ended up getting a fix merged into the Terraform AWS provider. Here's what was actually going on.

The symptom

You apply an organization conformance pack. CloudFormation shows CREATE_COMPLETE within a few minutes. The AWS Config console says Deployment Successful. Terraform, meanwhile, is still waiting:

aws_config_organization_conformance_pack.cis: Still creating... [37m5s elapsed]
...
Error: error waiting for Config Organization Conformance Pack (cis-conformance-pack) to be created:
timeout while waiting for state to become 'CREATE_SUCCESSFUL'
(last state: 'CREATE_IN_PROGRESS', timeout: 30m0s)

Then it gets annoying. Because the create "failed", Terraform taints the resource and tries to destroy and recreate it on the next apply. The pack is deployed and working. Terraform just has no idea.

Issue #24545 had been open since 2022 with a steady trickle of "me too" comments. Bumping the create timeout didn't help. Swapping between template_body and template_s3_uri didn't help. People ended up doing manual uploads or shelling out with local-exec.

What's actually happening

The creation waiter polls DescribeOrganizationConformancePackStatuses and blocks until the aggregate status flips to CREATE_SUCCESSFUL. From the Organization management account, that works.

From a delegated administrator account, which is how most sensible landing zones run Config, it doesn't. The aggregate status never moves off CREATE_IN_PROGRESS, even once every member account has finished. The per-account detail is correct the whole time. It's only the rolled-up summary that gets stuck. So the waiter sits there polling a value that's never going to change, and eventually times out.

That's really an AWS API quirk rather than a Terraform bug. But Terraform is where you feel it, so that's where the workaround has to live.

The fix

There's a second, more granular API the provider was already using elsewhere: GetOrganizationConformancePackDetailedStatus. It returns status per member account instead of one aggregate value, and it stays accurate even from a delegated admin account.

So the fix leaves the aggregate poll alone and adds a fallback. When the aggregate status comes back as CREATE_IN_PROGRESS, ask the detailed API how many accounts are genuinely still in progress. If the answer is zero, the deployment has finished and we can treat it as successful:

status := output.Status

// The DescribeOrganizationConformancePackStatuses API may not
// transition the aggregate status from CREATE_IN_PROGRESS to
// CREATE_SUCCESSFUL when called from a delegated administrator
// account, even after all member accounts have completed
// deployment. Work around this by checking per-account detailed
// statuses: if no accounts are still in progress, the
// deployment has finished.
if status == types.OrganizationResourceStatusCreateInProgress {
    if v, err := findOrganizationConformancePackDetailedStatusesByTwoPartKey(
        ctx, conn, name, types.OrganizationResourceDetailedStatusCreateInProgress,
    ); err == nil && len(v) == 0 {
        status = types.OrganizationResourceStatusCreateSuccessful
    }
}

if status == types.OrganizationResourceStatusUpdateInProgress {
    if v, err := findOrganizationConformancePackDetailedStatusesByTwoPartKey(
        ctx, conn, name, types.OrganizationResourceDetailedStatusUpdateInProgress,
    ); err == nil && len(v) == 0 {
        status = types.OrganizationResourceStatusUpdateSuccessful
    }
}

return output, string(status), nil

The same thing happens on UPDATE_IN_PROGRESS, so it gets the same treatment. The issue had reports of unexpected state 'UPDATE_IN_PROGRESS' for exactly that reason.

A few things kept this safe rather than reckless:

It only reaches for the detailed API when the aggregate status is one of the two states known to get stuck. The normal management-account path is untouched.
If the detailed call errors, it falls back to the original aggregate status instead of guessing. A transient API blip never gets read as "done".
Querying with a CREATE_IN_PROGRESS filter and getting nothing back is a clean signal that every member account has moved on.

Testing it

This one is genuinely awkward to test, because it only shows up with a real multi-account Organization and a delegated admin configured. That's not something CI has lying around. The PR adds a delegatedAdministrator acceptance test built on the provider's alternate-account and organization-member pre-checks to stand that topology up.

Alongside that I verified it by hand against a live Organization, where I confirmed the aggregate status stayed pinned on CREATE_IN_PROGRESS while CloudFormation reported CREATE_COMPLETE in the delegated admin account, and that with the fix applied the waiter detected completion and returned cleanly.

What I took from it

The resource was never broken. It deployed every time. What was broken was its view of its own status from one particular calling context, and multi-account AWS is full of that. Plenty of APIs behave slightly differently depending on whether you're calling from the management account, a delegated admin, or a member, and it's worth assuming that's a possibility whenever something works in one account and hangs in another.

The other thing: a four-year-old issue with a wall of frustrated comments looks intimidating, but the actual change was small. A fallback in one status function and a test to cover it. The hard part wasn't the code, it was that nobody had pinned down the root cause yet.

The fix is merged and will ship in an upcoming provider release. If you've been working around this with manual uploads or shell-outs, you can drop them once it lands.

Links: PR #47072 and Issue #24545

Centralizing Firewall in a Multi-Account Org: Network Firewall's New Transit Gateway Attachment

Tom Williams — Tue, 02 Jun 2026 19:55:39 +0000

If you run a multi-account AWS Organization, you've almost certainly built (or inherited) a centralized inspection VPC. It works, but it's fiddly: dedicated firewall subnets, a separate Transit Gateway attachment subnet per AZ, appliance mode, and a small pile of route tables you have to get exactly right or traffic silently stops being inspected.

AWS recently made that whole pattern a lot simpler. Network Firewall now supports native attachment to Transit Gateway — you attach the firewall directly to the TGW and skip the inspection VPC plumbing entirely. This post covers what changed, how it maps onto a Control Tower / AFT landing zone, and when it's worth migrating.

The old model: an inspection VPC you have to babysit

In the classic centralized design, traffic between spoke VPCs (and to/from on-prem) is hairpinned through a dedicated inspection VPC that you own and operate. The moving parts:

Two subnets per AZ in the inspection VPC — one for the TGW attachment, one for the firewall endpoint.
Two TGW route tables — a spoke route table (default route pointing at the inspection VPC attachment) and a firewall route table (spoke routes propagated in so return traffic has a path home).
A VPC route table per TGW subnet, each with a 0.0.0.0/0 default pointing at the firewall endpoint in the same AZ.
Appliance mode enabled on the TGW attachment, so the flow hash pins a connection to a single firewall ENI and return traffic comes back symmetrically through the same AZ.

None of this is hard, exactly. But it's a lot of state to keep correct across every account you onboard, and it's the kind of thing that breaks subtly — an asymmetric route, a forgotten propagation — in ways that are annoying to debug.

The new model: attach the firewall to the TGW directly

With native TGW attachment, you create the firewall and tell it which Transit Gateway to attach to. That's the bulk of it. AWS deploys the firewall endpoints into an AWS-managed VPC on your behalf, and from your side the firewall shows up as a Transit Gateway attachment that you route traffic to — like any other attachment.

What disappears:

No inspection VPC to create or own.
No firewall/attachment subnets to lay out per AZ.
No per-subnet route tables pointing at AZ-local endpoints.
Appliance-mode symmetry is handled as part of the integration rather than something you wire up by hand.

You still control the firewall policy, rule groups, and logging exactly as before. What you stop managing is the network scaffolding around the firewall.

Where this fits in a Control Tower / AFT landing zone

In most landing zones, inspection lives in a dedicated network or security account, with the TGW shared out via RAM and spoke VPCs attached as accounts get vended through AFT. The native attachment model slots into that cleanly:

Network account owns the firewall. Create the Network Firewall and its TGW attachment in the centralized networking account that already owns the Transit Gateway.
Spoke routing stays the same. AFT-vended account VPCs attach to the TGW and associate with a spoke route table whose default route points at the firewall attachment. The big win: there's no longer an inspection-VPC attachment to special-case in your account baseline.
Codify it. Whether you template the firewall in your AFT global-customizations or a separate networking pipeline, the resource definition shrinks — you're declaring a firewall and an attachment, not a VPC, subnets, and a route-table matrix.

For new orgs, this means one fewer bespoke component to stand up. For new accounts, onboarding is just "attach to TGW, point default route at the firewall attachment" — no inspection-VPC-aware customization needed.

Should you migrate an existing inspection VPC?

If you already have a working centralized inspection VPC, there's no fire drill here — the old model still works fine. But native attachment is worth planning a migration toward if any of these resonate:

You're tired of maintaining the inspection-VPC subnet/route-table boilerplate across AZs.
Your account baseline carries special-case logic for the inspection attachment that you'd love to delete.
You're scaling AZs or regions and don't want to hand-replicate the subnet layout each time.

A few things to check before you commit:

Region availability — native TGW support rolled out broadly, but confirm it's live in every region your org operates in.
Cutover is a routing change. You'll be repointing TGW route tables from the old inspection-VPC attachment to the new firewall attachment. Plan it as a controlled flow cutover and watch your firewall logs to confirm traffic is still being inspected (not bypassed) on both directions.
Logging and policy parity. Reuse your existing firewall policy and rule groups so behaviour is identical post-cutover — only the attachment model changes.

Takeaway

The native Transit Gateway attachment doesn't change what Network Firewall does — it removes the inspection-VPC scaffolding you used to have to build and maintain around it. In a multi-account org, that's less per-account routing state, a simpler account baseline, and one fewer thing to get subtly wrong. If you're standing up a new landing zone, start here. If you're running the classic model today, it's worth putting a migration on the roadmap.

Governance You Hold, Not Governance You Rent — A Stratum Case Study

Tom Williams — Sun, 31 May 2026 13:58:09 +0000

Most AWS governance tools ask you to do something quietly radical: hand a third party a role into your Organization and let your IAM policies, CloudTrail events, and resource inventory flow out to someone else's SaaS backend. It works. It also tends to be the exact thing that stalls in procurement for three months while security asks where the data goes.

I build the other kind. Stratum is a self-hosted AWS governance platform that deploys into the customer's own AWS Organization, scans every member account, and produces prioritised, deduplicated findings with concrete remediation paths — without the data plane ever leaving the customer's AWS boundary.

This is a case study, not a product page. I'm writing it up because Stratum is how I operationalise governance work for clients, and it's a fair proxy for how I approach the job generally: primitives first, append-only audit trails, decisions written down, and no magic.

The problem: multi-account estates rot

If you run a growing AWS estate, you already know the failure mode. It isn't one big breach. It's slow decay:

Drift. A landing zone that was clean at launch accumulates one-off exceptions, hand-edited security groups, and "temporary" public buckets.
Alert fatigue. Five tools each emit findings in their own schema. Nobody triages all five, so nothing gets triaged.
Orphaned findings. A finding gets noticed, half-fixed, and lost. Six months later it resurfaces in an audit.
Evidence scrambles. SOC 2 season arrives and three engineers spend a week screenshotting consoles to prove controls were operating.

The common thread is that governance is treated as a periodic event rather than a running system. Stratum's job is to make it a running system — one a human can actually keep up with.

What Stratum is

A scanner-plus-frontend platform that lives entirely in the customer's AWS Organization. Scanners run on a schedule, enumerate accounts through the Organizations API, assume a read-only cross-account role in each member account, and emit findings into one shared schema. A frontend reads those findings through a versioned API and renders a single, scope-parameterised view: whole-org, per-account, or per-module.

The headline differentiator is boundary, not features:

Nothing that touches real AWS data leaves the customer's account. Scanners, the findings store, and the evidence bucket all live in the customer's tooling account.

That single property is what gets a tool like this through a security review instead of dying in it. You're not auditing a vendor's data-handling claims — there's no data leaving to handle.

Coverage that lands in one queue

Breadth only helps if it converges. Stratum runs around eleven independent scanner modules:

IAM posture
Compute (EC2 / EBS)
Networking (VPC, security groups, flow logs, Transit Gateway)
S3
RDS
DynamoDB
Lambda
AWS Config
SSM patch coverage
Cost
…plus a SOC 2 compliance-mapping layer that sits on top

Every module writes to the same findings schema, so the output is one deduplicated, prioritised queue a human can triage — not eleven dashboards in eleven dialects. Modules don't import from each other; where cross-module context matters (say, an internet-exposed Lambda or RDS instance), module-local correlators enrich a finding by reading the shared open-findings index. There's deliberately no shared aggregation god-layer to rot.

Architecture worth showing

A few choices that reflect how I like to build:

Terraform manages CloudFormation StackSets; StackSets deploy the modules. Each module is independently deployable. One module failing to roll out never takes the others down with it — you get partial coverage and a clear signal, not an all-or-nothing apply.

A server-side read model keeps triage fast at scale. The system is designed for large estates producing tens of thousands of findings. The frontend never talks to the database directly; it calls a versioned API with an explicit contract. That decoupling is also why the platform can later lift its control plane out of the customer account without a rewrite — the data plane stays put either way.

One apply per customer. Root configs compose modules and own their own backend and providers; a single modules = { ... } toggle decides which scanners are enabled. Turning governance on for a new account is configuration, not a project.

Trust and safety by design

A platform with org-wide reach has to be conservative about reach. Stratum is read-only by default — the cross-account scanner role carries a permission boundary that blocks every write action, org-wide, full stop. The default deployment cannot change your infrastructure even if it wanted to.

Remediation is designed as a deliberate, opt-in path rather than a default. When it's enabled — module by module, account by account — an action runs as an SSM Automation document under a separate, tightly-scoped per-module role whose permission boundary denies destructive operations (no touching admin/root roles, no *:* wildcards). Every action is logged immutably. The posture is: prove you trust a module before you ever let it write, and even then it writes inside a box you defined.

Findings you can put in front of an auditor

Findings are immutable and append-only. Each finding is written once; its status (open, suppressed, resolved) is computed from an append-only event stream of status, note, and assignee changes — not by mutating a row. Evidence lands in an S3 bucket with Object Lock and long retention. The result is a tamper-evident history: you can show not just the current state, but exactly when something was detected, acknowledged, and closed.

The SOC 2 layer maps technical findings to Trust Services Criteria — honestly. It evidences control state and detection. It does not print a green "you are 94% compliant" number, because that number is fiction and auditors know it. It tells you which controls have supporting evidence and which don't. That honesty is the point; a compliance layer that flatters you is worse than none.

Why this is the case study

Stratum isn't the deliverable I'm selling — it's proof of how I work. The same instincts that shaped it are the ones I bring to client engagements:

Primitives first. StackSets, SSM, DynamoDB, Object Lock — well-understood AWS building blocks composed deliberately, not a tower of bespoke abstractions.
Append-only audit trails. If it matters, it's reconstructable from an event stream.
Decisions written down. Architecture decisions are recorded so the why survives the people.
No overclaiming. Read-only until you say otherwise; honest compliance mapping; partial coverage surfaced rather than hidden.

If your multi-account estate is growing faster than your ability to govern it — drift, alert fatigue, audit-time scrambles — that's exactly the shape of problem I work on.

Let's talk

I'm happy to spend 30 minutes, no pitch, looking at your AWS Organizations, IAM, and multi-account posture and telling you where the real risk sits. If that's useful, reach out and we'll find a time.

Turning a Mac mini Into a Home Server for Self-Hosted Services

Tom Williams — Sun, 17 May 2026 13:59:41 +0000

The Mac mini is one of the more underrated pieces of homelab hardware you can buy. It's small, near-silent, sips power, and ships with an absurd amount of CPU per watt thanks to Apple Silicon. I've been running mine as a 24/7 home server for a while now, and this post is the writeup I wish I'd had when I started.

Why a Mac mini and not a Raspberry Pi or a NUC?

The honest answer: I already had one sitting on a shelf. But there are a few reasons it turned out to be a great fit:

Performance per watt. An M-series Mac mini will idle around 4–7W and still rip through container workloads that would make a Pi cry.
It's silent. No fans spinning up when Plex transcodes or when a backup job kicks off.
macOS is a real Unix. You get Homebrew, launchd, ZFS via OpenZFS if you really want it, and a familiar shell environment.
It just stays up. Mine has been running for months without a reboot beyond the occasional OS update.

The trade-offs are real though — you're paying Apple prices, you don't get ECC RAM, and storage expansion means hanging things off USB or Thunderbolt. If you need 40TB of spinning rust, build a NAS. For everything else, the Mac mini is excellent.

Initial macOS setup

A few settings to change before anything else:

Disable sleep. In System Settings → Energy:

Prevent automatic sleeping when the display is off: on
Start up automatically after a power failure: on
Wake for network access: on

Enable auto-login so the machine comes back up unattended after a power blip. Yes, this trades some security for uptime — make sure the box lives somewhere physically safe.

Turn on Remote Login (SSH) in System Settings → General → Sharing. While you're there, give the machine a sensible hostname.

Install the command line tools and Homebrew:

xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Container runtime: OrbStack

I run almost everything in containers, and on Apple Silicon, OrbStack beats Docker Desktop handily. It's faster to start, uses less RAM, has better filesystem performance, and the networking just works.

brew install --cask orbstack

Set it to start at login and you're done. The docker and docker compose CLIs work exactly as you'd expect.

Reverse proxy: Caddy

For routing traffic to services and handling TLS, Caddy is the path of least resistance. A single Caddyfile and you have automatic HTTPS from Let's Encrypt for every service.

jellyfin.home.example.com {
    reverse_proxy localhost:8096
}

homeassistant.home.example.com {
    reverse_proxy localhost:8123
}

I run Caddy in a container alongside everything else, with the config mounted from ~/srv/caddy/Caddyfile.

Remote access without opening ports: Tailscale

This is the single most important piece of the setup. Do not port-forward your home router to expose services to the internet. Instead, install Tailscale on the Mac mini and on every device you want to reach it from:

brew install --cask tailscale

Now your Mac mini has a stable 100.x.x.x IP that's only reachable by your own devices. Combine it with MagicDNS and you can hit http://mac-mini:8123 from your phone, anywhere in the world, with no firewall changes.

For services I want family members to reach, Tailscale's Funnel feature exposes a single service to the public internet through their edge, with TLS handled for you.

The services I actually run

Here's the current lineup, all in Docker Compose:

Service	What it does
Jellyfin	Media server. Hardware transcoding works on Apple Silicon with the right flags.
Home Assistant	Smart home brain. Runs in a container, talks to Zigbee via a USB stick.
Pi-hole	Network-wide ad blocking. DNS for the whole house points here.
Paperless-ngx	OCR'd document archive. Scan once, search forever.
Vaultwarden	Self-hosted Bitwarden-compatible password manager.
Uptime Kuma	Tells me when something I forgot about has fallen over.
Syncthing	Folder sync across all my machines without going through the cloud.

Each one is a folder in ~/srv/ with its own docker-compose.yml and persistent volumes underneath. Boring, predictable, easy to back up.

Storage and backups

My internal SSD holds the OS, container images, and small databases. For media and bulk data I have a Thunderbolt enclosure with a couple of SSDs in it, mounted at /Volumes/data.

For backups I run two layers:

Time Machine to a separate external drive, for the whole system.
Restic to Backblaze B2 for the irreplaceable stuff — Vaultwarden, Paperless, Home Assistant configs, photos. Encrypted, deduplicated, cheap.

A nightly launchd job kicks off the Restic run and pings a healthcheck URL when it succeeds. If the ping doesn't arrive, Uptime Kuma yells at me.

Monitoring

Nothing fancy: Uptime Kuma for "is the service responding," and Beszel for lightweight host metrics (CPU, RAM, disk, container health). Both run in containers, both have web UIs I can hit over Tailscale.

What I'd do differently

Buy the bigger SSD. Container images, databases, and Time Machine snapshots eat space faster than you think.
Get the Thunderbolt enclosure sooner. USB 3 is fine until you're moving real volumes of data.
Document the compose files in a git repo from day one. Past-me thought he'd remember which env vars he set. Past-me was wrong.

Wrapping up

A Mac mini won't replace a rack of Dell servers, but for a home setup that quietly hosts a dozen useful services and disappears into a shelf, it's hard to beat. The combination of low power draw, silent operation, and a real Unix environment makes it a genuinely lovely machine to run things on.

If you've got one gathering dust, give it a job.

Lessons from Migrating 9TB of File Shares to FSx

Tom Williams — Sun, 22 Mar 2026 14:00:29 +0000

Migrating a Windows file server sounds straightforward until you're staring at 9TB of data across 14 shares and trying to work out what's actually worth moving.

This is what I learned doing exactly that — moving a legacy EC2-hosted Windows file server to FSx for Windows File Server, with a detour through S3 Glacier for the data nobody was using.

Start with discovery, not migration

The temptation is to spin up FSx, robocopy everything across, and call it done. Resist that. You'll end up paying FSx prices for terabytes of data that hasn't been touched in years.

I wrote a PowerShell script to scan every share and classify files by age. This immediately surfaced that a significant portion of the data was cold — files that hadn't been written to in over two years.

The LastAccessTime trap

Here's the gotcha that cost me a day: the server had DisableLastAccess set to 1. This is a common Windows performance optimisation, but it means LastAccessTime is unreliable — it wasn't being updated when files were read.

That left LastWriteTime as the only trustworthy timestamp. It's a reasonable proxy (if nobody's modified a file in two years, it's probably cold), but it's not perfect. A file that's read daily but never edited would appear cold.

The fix: I enabled LastAccessTime tracking early in the project timeline and let it run for a few weeks before the final classification scan. This gave us a more accurate picture before committing to the archival decisions.

Lesson: Check fsutil behavior query DisableLastAccess on day one of any file migration project.

Archive before you migrate

With the data classified, the approach was:

Archive cold data to S3 Glacier (cheap, still retrievable if needed)
Migrate only active data to FSx
Keep the original EC2 instance read-only for a transition period

This significantly reduced the FSx storage footprint and brought the monthly cost down to something sensible.

Things I'd do differently

Automate the archival pipeline end-to-end: I used a semi-manual process with AWS DataSync. Next time I'd script the full workflow including verification and cleanup.
Set up monitoring on FSx from day one: Storage growth on FSx can surprise you. CloudWatch alarms on free storage space are essential.
Communicate the archive process to users early: People get nervous when they hear "we're archiving your files." Setting expectations about retrieval times and the safety net of Glacier avoids unnecessary panic.

Was FSx worth it?

Yes. Automated backups, native AD integration, no more patching a Windows Server instance, and the storage scales without us managing disks. The migration was a few weeks of focused work, but the operational overhead dropped permanently.

Why Event-Driven Infrastructure Beats Cron Jobs

Tom Williams — Sun, 22 Mar 2026 12:51:30 +0000

If you've spent any time managing infrastructure at scale, you've probably written a cron job that polls for something. Maybe it checks for untagged resources every hour, or scans for missing CloudWatch alarms on a schedule. It works. It's simple. And it's almost always the wrong long-term answer.

I recently rebuilt one of these systems — a compliance remediation tool that ensures every EC2 instance in our multi-account AWS organisation has CloudWatch CPU alarms — and the shift from scheduled polling to event-driven architecture made a surprising difference.

The cron approach

The original setup ran a Lambda on a CloudWatch Events schedule every 30 minutes. It would:

Assume a role into each member account
List all EC2 instances
Check for the existence of CloudWatch alarms
Create any that were missing

This worked, but had problems:

Latency: A new instance could run for up to 30 minutes without monitoring
Cost: Every run scanned every instance, even if nothing had changed
Complexity: The Lambda needed to handle pagination across dozens of accounts, manage rate limiting, and deal with partial failures gracefully
Noise: CloudWatch Logs filled up with successful "nothing to do" runs

The event-driven approach

The replacement uses EventBridge rules deployed to each member account via StackSets. When an EC2 instance launches or has its tags modified, the event is forwarded to a central event bus where a Lambda evaluates and applies alarms.

The reconciliation Lambda still exists — it runs daily as a safety net — but it catches edge cases rather than doing the heavy lifting.

What changed

Remediation time: From up to 30 minutes to under 60 seconds
Lambda invocations: Dropped significantly — we only run when something actually happens
Code complexity: The event-driven Lambda handles one instance at a time, not a full cross-account sweep
Terraform: The module became simpler because each component has a single, clear responsibility

When cron still wins

Event-driven isn't always the answer. Use scheduled runs when:

There's no reliable event source for the change you care about
You need a full reconciliation sweep (drift detection, for example)
The event volume would be higher than the polling cost

But for "react when something changes" — which is what most compliance automation is doing — EventBridge is the better tool.

Getting started

If you're currently running a polling Lambda and want to shift:

Identify the AWS API action that triggers the change you care about
Create an EventBridge rule matching that event pattern
Keep your existing Lambda as a daily reconciliation fallback
Deploy the rule to member accounts via StackSets

The two patterns complement each other. Events handle the real-time path, scheduled runs handle the "trust but verify" path.