Automating Externally-Managed Stateful Rule Groups in AWS Network Firewall Using Terraform

Santanu Das

TL;DR:
Managing AWS Network Firewall via Terraform often leads to a tug-of-war when external automation (like a Lambda) needs to update dynamic rules (e.g., FQDNs for ephemeral tenants).

In this post, I'll share an architecture that solves this by:

  • Decoupling Ownership: Splitting rule groups into "Self-Managed" (Terraform-owned) and "Custom-Managed" (Automation-owned).
  • Preventing State Drift: Using the lifecycle { ignore_changes = [...] } block on custom resources so Terraform ignores updates made by external APIs.
  • Enforcing Order: Implementing a stateful_rule_group_order list in the parent module to ensure a deterministic evaluation of Suricata, Domain Lists, and AWS-managed rules.
  • Scaling Gracefully: Using a placeholder strategy to initialize high-capacity rule groups that are ready for high-frequency updates without ever re-running a pipeline.

Since migrating our edge security stack from a combination of F5, HAProxy, and standalone Suricata to AWS Network Firewall (NFW), the relationship has been a bit sweet-and-sour.

Having relied on Suricata for years, I was excited to see that NFW supports Suricata-compatible IPS rules natively. Based on that, our migration goals were very clear:

  • Retain Suricata Semantics: Full support for TLS SNI filtering, IDS/IPS signatures, and custom rulesets.
  • Preserve our Hostname Model: Maintain our existing user-defined hostname filtering logic without regression.
  • Pure GitOps: The entire stack must be Infrastructure-as-Code (IaC) via Terraform and Terragrunt.
  • Zero Manual Intervention: Eliminate "humans editing allow-lists" at all costs to ensure security and consistency.

However, the path to a fully automated, stateful firewall wasn't as linear as the documentation suggested. This article breaks down the architectural hurdles we encountered, the technical trade-offs we faced, and the specific automation patterns I used to reach the summit.

Background Overview

We operate a high-scale, multi-tenant SaaS platform. In our environment, customers can dynamically provision their own subdomains through a self-service user portal. The result is a constantly evolving list of hostnames, such as:

${TENANT_1}.examples.site
${TENANT_2}.examples.site
...

The backend application processes these tenant registrations and exposes them to the public internet as fully-fledged websites. The traffic flow follows a standard high-availability pattern: Internet → Network Load Balancer (NLB) → AWS Network Firewall (NFW).

Architectural Requirements

To support this model at scale, our firewall layer had to meet five non-negotiable criteria:

  1. L3/L4 Resilience: Robust protection against volumetric DDoS attacks.
  2. Suricata-grade IDS/IPS: Deep packet inspection using industry-standard Suricata signatures.
  3. Tenant-Specific Filtering: Granular hostname filtering to ensure traffic isolation and security.
  4. Zero-Touch Onboarding: A fully automated pipeline where a new tenant registration triggers a firewall update without manual intervention.
  5. In-line Propagation: Real-time availability for new sites to ensure a seamless user experience.

Beyond these dynamic, user-generated endpoints, the firewall must also manage a collection of static internal endpoints. These are traditional infrastructure dependencies that require a stable, code-defined allow-list within our primary firewall policy.

The Core Challenge: Dynamic Ingress at Scale

In our SaaS environment, hostnames are ephemeral — they are created (and removed) constantly. This means the ingress rule-set must be agile enough to reflect these changes in near real-time.

The hurdle here is one of context: the Infrastructure/Networking layer has no inherent knowledge of which hosts are being added or removed; that metadata lives exclusively within the Application layer. To bridge this gap, we needed a delivery mechanism that could update our Access Control Lists (ACLs) the moment a change occurred in the database.

The War: Declarative Tooling in Ephemeral Environments

Initially, the challenge was more engineering-focused than architectural. Because Terraform is declarative, my first instinct was to trigger a CI/CD pipeline to run terraform apply every time a tenant was added. This would rebuild the entire ruleset with the new entries.

However, we quickly realized this approach was unsustainable:

  • Latency: Running a full Terraform plan/apply cycle for every single tenant change was painfully slow.
  • Fragility: Frequent, automated applies increased the risk of state locks and race conditions, making the process highly error-prone.

...And Peace: A Hybrid Ownership Model

To solve this, I moved toward a partitioned management strategy:

  • Static Governance: Traditional Suricata rules (system-wide signatures and internal endpoints) remain strictly managed within Terraform.
  • External Agility: Dynamic tenant rules are offloaded to a separate, externally-managed stateful rule group.
  • Clean State: The dynamic group can be safely automated via a Lambda or API call without Terraform ever knowing (or caring) that the rule content has changed.
  • Enforced Logic: The evaluation order of all groups — both static and dynamic — is centrally controlled via the Firewall Policy to ensure consistent security.

Engineering Overview

This article focuses on a hybrid architecture that balances static infrastructure with dynamic security updates. The primary goal is to maintain a clean Terraform state while allowing external automation to handle high-frequency rule changes.

The diagram below is a conceptual view of the end-to-end process:

[Architectural diagram]

Key Architectural Concepts

  • Hybrid Ownership: As shown in the Management Plane, the module orchestrates four logical "families" of stateful rule groups. This allows for a mix of static, code-defined groups and the dynamic, externally-managed groups targeted by the Lambda Function.

  • Deterministic Evaluation: By enforcing STRICT_ORDER within the stateful_engine_options, we ensure that rules are evaluated in a specific sequence rather than the default Action Order.

  • Managed Safety Net: AWS-managed rule groups (e.g., threat signatures and botnet filters) are appended to the end of our evaluation chain, acting as a global security layer for traffic that passes our custom filters.

  • State Decoupling: To prevent tug-of-war deployments, Terraform initializes the dynamic groups but explicitly ignores subsequent rule changes, handing off control to external processes.

Policy Structure & Rule Families

The child module (tf-module-nfw) is responsible for building a prioritized stateful_rule_group_order. It categorizes rules into four distinct logical families, with the Evaluation Priority as listed below:

| Family | Source | Ownership |
| --- | --- | --- |
| self_managed_domainlist | FQDN Filter | Terraform: managed via code |
| self_managed_suricata | Suricata IPS | Terraform: managed via code |
| extl_managed_suricata | Suricata IPS | External: initialized by TF; updated by Lambda |
| extl_managed_domainlist | FQDN Filter | External: feed-driven updates |
| aws_managed_groups | Threat Intel | AWS: automatically updated by AWS |

This module concludes by constructing a flattened, ordered list of ARNs for the Firewall Policies, ensuring that the rules are applied in the exact sequence required by our security policy.

The Conflict Resolution

To prevent Terraform from overwriting updates made by external automation, I used a specific lifecycle strategy (as described below):

  1. Initialize: Terraform creates the rule group shell.
  2. Handoff: A dedicated Lambda (detailed in Part 2) updates the rules.
  3. Ignore: Terraform is configured to ignore changes to the rules_source, ensuring our ${variable} definitions in CI/CD remain stable.

The Rules of Engagement

To ensure a resilient and scalable architecture, I established these five non-negotiable principles for the module:

1️⃣ Terraform as the Immutable Source of Truth
Static Suricata rulesets are stored in S3, rendered through Terraform, and deployed via NFW Stateful Rule Groups. To maintain security integrity, no external process or manual intervention is permitted to modify these core groups.

2️⃣ External Management for Dynamic Tenant Rules
Since tenant FQDNs are ephemeral, re-running a full Terragrunt pipeline for every hostname change is unacceptable. The dynamic rule groups must be designed for external updates (e.g., via Lambda) without triggering a conflict with the Terraform state.

3️⃣ Unified Ordering Mechanism
The firewall policy includes a diverse mix of rule types:

  • Static Suricata (Standard signatures)
  • Dynamic Suricata (Tenant-specific)
  • Domain List Groups (FQDN filtering)
  • AWS-Managed Groups (Global threat intel)

Regardless of the source, I required a single, centralized configuration to manage the evaluation order across all groups.

4️⃣ Strict Ordering and Explicit Ownership
For auditability and debugging, I enforced a clear Separation of Concerns:

  • rule_order = "STRICT_ORDER": Enforced for all stateful Suricata groups to ensure predictable packet flow.
  • Explicit Ownership: Every rule group is tagged as either Terraform-owned or Automation-owned, preventing overlapping authority.

5️⃣ Elimination of State Drift Tug-of-War
The most critical requirement is preventing Terraform from attempting to revert updates made by external automation. The module must be configured to explicitly ignore attributes that the Lambda is authorized to change, ensuring that terraform plan remains clean even after high-frequency rule updates.

The rest of this article is about the Terraform side only.

The Parent Module: Orchestrating Rule Groups

At the parent level, I defined the architectural "blueprint." This is where I declared which rule families to include and precisely how I wanted them to be prioritized during packet inspection.

1️⃣ Defining the Evaluation Hierarchy:

I injected the global ordering policy as a simple list. This dictated the sequence in which the AWS Network Firewall engine processed traffic:

## Rule evaluation order for stateful rule groups
stateful_rule_group_order = [
  "self_managed_domainlist",
  "self_managed_suricata",
  "extl_managed_suricata",
]

The resulting evaluation logic is deterministic:

  1. Priority 1: Any self_managed_domainlist groups are placed at the top of the stack.
  2. Priority 2: All self_managed_suricata groups follow.
  3. Priority 3: extl_managed_suricata groups are inserted next.
  4. Baseline: Finally, the module automatically appends AWS-managed groups to the end of the chain as a final safety layer.

By decoupling the order from the child module, I achieved the ability to adjust the security posture per environment (e.g., Dev vs. Prod) without modifying the underlying infrastructure code.
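
For instance, a production environment could promote the IPS groups ahead of the FQDN allow-list simply by reordering the list. Here is a hypothetical per-environment override (the file name and ordering are illustrative, not our actual configuration):

## prod.tfvars (illustrative): evaluate Suricata IPS before the FQDN allow-list
stateful_rule_group_order = [
  "self_managed_suricata",
  "extl_managed_suricata",
  "self_managed_domainlist",
]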

2️⃣ Defining Suricata Rule Groups:

A single list is used to define both the static and dynamic Suricata groups. To distinguish between them, I introduced a self_managed boolean flag, which is the primary pivot for how the child module handles the lifecycle of each group.

## Suricata Firewall Rule Groups
suricata_stateful_rule_group = [
  # 1) Static Suricata rule group (Terraform-owned)
  {
    name = format("${var.service_nfw.id}-suricata-%s-sf",
           substr(sha256(local.nfw_rules_sets), 0, 7))
    capacity       = 100
    description    = "Stateful Suricata rule-sets to allow static-hosts"
    rules_content  = base64decode(
      aws_s3_object.aso_sct_ingress[each.value].content_base64
    )
    rule_variables = local.nfw_rule_variables
    self_managed   = true
  },

  # 2) Dynamic Suricata rule group (externally-managed)
  {
    name           = "${var.service_nfw.id}-suricata-dynamic-sf"
    capacity       = 8000
    description    = "Stateful Suricata rule-sets for tenant-hosts"
    rules_file     = "${path.module}/templates/suricata_dynamic_placeholder"
    rule_variables = local.nfw_rule_variables
    self_managed   = false
  },
]

Key Implementation Details:

  • The Static Group (self_managed = true): For these rules, I actually formed a multi-stage pipeline: first, Terraform generated the Suricata rule-set from a template file; then it pushed this rendered output to S3 to serve as the ultimate source of truth; and finally, the NFW resource pulled the rules_content directly from that S3 object. This ensured that even though the rules were generated from templates, the firewall was always pinned to a versioned, verifiable object in storage. I also used a sha256 hash in the name to ensure any change in the S3 content triggered a clean resource update.

  • The Dynamic Group (self_managed = false): For the tenant-specific rules, I pointed rules_file to a simple placeholder template. This allowed me to satisfy the AWS Network Firewall requirement of having a valid Suricata structure at creation time. I set the capacity significantly higher (8,000 units) to provide enough "headroom" for the Lambda to inject thousands of tenant hostnames later without hitting architectural limits.

  • The Logic Pivot: The self_managed flag was the brain of the operation. Inside the child module, I used this to separate groups that Terraform must protect from those that Terraform must ignore after the initial creation.
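
For context, here is a minimal sketch of how the child module's input variable could be typed to support that pivot. This is an assumption about the shape, not our exact definition, and optional() requires Terraform >= 1.3:

variable "suricata_stateful_rule_group" {
  description = "Suricata rule groups; self_managed = true keeps Terraform as the owner"
  type = list(object({
    name           = string
    capacity       = number
    description    = string
    rules_content  = optional(string) # inline rules rendered via S3 (static groups)
    rules_file     = optional(string) # path to a placeholder file (dynamic groups)
    rule_variables = optional(any)
    self_managed   = bool
  }))
}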

3️⃣ Implementing Domain List Rule Groups:

For the sake of completeness and architectural flexibility, I applied the exact same pattern to Domain List rule groups. While my primary focus was on Suricata for deep packet inspection and SNI filtering, I wanted the module to remain type-agnostic.

domain_stateful_rule_group = [
  {
    name           = "${local.template_name}-domains-sf"
    capacity       = 100
    description    = "Stateful rule with domain list option"
    actions        = "ALLOWLIST"
    protocols      = ["TLS_SNI"]
    domain_list    = ["placeholder.invalid"]
    rule_variables = local.nfw_rule_variables
    self_managed   = true
  }
]

Even though our core logic relies heavily on Suricata signatures, I included support for Domain Lists as a secondary mechanism. This allowed us to handle simple Allow-listing for standard HTTP/HTTPS traffic using AWS's native FQDN filtering when the full power of Suricata's regex and header inspection wasn't required.

4️⃣ Integrating AWS-Managed Rule Groups:

To round out our security posture, I wanted to leverage the threat intelligence provided by AWS. Instead of managing complex ARNs in the parent configuration, I designed the module to accept simple, human-readable names which I then programmatically converted into full ARNs.

aws_managed_rule_group_arns = [
  for name in var.aws_managed_rule_groups :
  "arn:aws:network-firewall:${var.region}:aws-managed:stateful-rulegroup/${name}"
]

The module was purposely structured to append these AWS-managed groups (such as ThreatSignaturesBotnetStrictOrder or ThreatSignaturesMalwareStrictOrder) to the end of our evaluation chain.

By placing them after our custom Suricata and Domain List groups, I ensured that our specific business logic (our Allow rules) was evaluated first, while the AWS-managed groups acted as a global safety-net for any traffic that hadn't already been explicitly matched.
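
As a usage sketch, the parent-level input then reduces to a list of plain names. The two shown here are AWS-managed stateful groups compatible with strict-order policies; substitute whichever groups fit your threat model:

aws_managed_rule_groups = [
  "ThreatSignaturesBotnetStrictOrder",
  "ThreatSignaturesMalwareStrictOrder",
]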

The Child Module: Orchestrating the Rule Groups

With the parent module's requirements defined, the child module (tf-module-nfw) translates those lists into actual AWS resources. I split the responsibilities across four distinct resource types so that different lifecycle rules and management logic can be applied to each.

Each rule family is mapped to a specific Terraform resource block:

  • self_managed_suricata → aws_networkfirewall_rule_group.sf_self_suricata
  • extl_managed_suricata → aws_networkfirewall_rule_group.sf_custom_suricata
  • self_managed_domainlist → aws_networkfirewall_rule_group.sf_self_domain_list
  • extl_managed_domainlist → aws_networkfirewall_rule_group.sf_custom_domain_list

Creating separate resource blocks for Self (Terraform-owned) vs. Custom (externally-owned) groups provides granular control over how Terraform interacts with the AWS API. This was the foundation for solving the state-drift problem: I could tell Terraform to strictly enforce the rules for the Self groups while essentially looking the other way for the Custom ones.

1️⃣ Static Suricata — Terraform-owned:

For the Self-Managed resource, I designed a filter that only processes rule groups where the self_managed flag is set to true. This ensured that Terraform maintained absolute authority over these specific groups.

resource "aws_networkfirewall_rule_group" "sf_self_suricata" {
  for_each = {
    for rg in var.suricata_stateful_rule_group :
    rg.name => rg
    if rg.self_managed
  }

  type        = "STATEFUL"
  name        = each.value.name
  description = each.value.description
  capacity    = each.value.capacity

  rule_group {
    # Rule variables: HOME_NET, EXTERNAL_NET, etc. 
    rules_source {
      # Static Suricata rules come from S3 → Terraform → rules_content
      rules_string = each.value.rules_content
    }

    stateful_rule_options {
      rule_order = "STRICT_ORDER"
    }
  }

  lifecycle {
    create_before_destroy = true
  }

  tags = merge(var.tags)
}

In this block, Terraform owns everything: the rules, the metadata, and the resource lifecycle. This was my go-to choice for stable, platform-owned domains and internal network signatures.

  • The Power of for_each Filtering:
    By using a conditional for_each loop, I kept the parent module interface clean while allowing the child module to selectively manage only the static groups.

  • Enforcing STRICT_ORDER:
    I explicitly set the rule_order to STRICT_ORDER. This was critical because it allowed me to predict exactly how packets would traverse my custom rules versus the AWS-managed ones later in the chain.

  • Lifecycle Management:
    create_before_destroy = true is added specifically to prevent security blind spots. If a static rule group needs a replacement (like a name change due to a new hash), the new group is provisioned before the old one is torn down.
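
For completeness, the rule_variables block elided from the resource above (it sits inside rule_group, alongside rules_source) might look like the following sketch; the CIDRs are placeholders, not our real ranges:

rule_variables {
  ip_sets {
    key = "HOME_NET"
    ip_set {
      definition = ["10.0.0.0/16"] # placeholder: your VPC CIDR(s)
    }
  }
  ip_sets {
    key = "EXTERNAL_NET"
    ip_set {
      definition = ["0.0.0.0/0"]
    }
  }
}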

2️⃣ Dynamic Suricata — Externally-owned:

This is arguably the most important section of this entire process. For the dynamic rule groups, I implemented a resource block that specifically targets objects where self_managed is set to false. This is where the architectural hand-off occurs.

resource "aws_networkfirewall_rule_group" "sf_custom_suricata" {
  for_each = {
    for rg in var.suricata_stateful_rule_group :
    rg.name => rg
    if !rg.self_managed
  }

  type        = "STATEFUL"
  name        = each.value.name
  description = each.value.description
  capacity    = each.value.capacity

  rule_group {
    # Same rule_variables block as above (HOME_NET / EXTERNAL_NET etc.)

    rules_source {
      # Placeholder Suricata, just to create the group.
      rules_string = file(each.value.rules_file)
    }

    stateful_rule_options {
      rule_order = "STRICT_ORDER"
    }
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes = [
      # Lambda will take ownership of the rules_string
      rule_group[0].rules_source[0].rules_string,
    ]
  }

  tags = merge(var.tags)
}

The Secret Sauce: lifecycle.ignore_changes
The real magic happens in the lifecycle block. Configuring Terraform in this manner establishes a clear lifecycle for dynamic security rules:

  • Creation: Terraform provisions the rule group container using the high capacity I defined (8,000 units) and populates it with a basic Suricata placeholder file to satisfy the AWS API.

  • Delegation: Once the resource exists, ownership of the rules_string attribute is effectively transferred to the external automation.

  • Coexistence: Because of the ignore_changes directive, Terraform no longer attempts to correct or revert the rules during subsequent plan or apply cycles.

This approach is what allowed me to safely call UpdateRuleGroup from a Lambda function (which I'll cover in Part 2) without triggering a state-drift fight. Terraform manages the infrastructure shell, while the Lambda manages the security intelligence.

3️⃣ Domain List Rule Groups:

The exact same blueprint is followed for Domain List rule groups. Mirroring the logic used for Suricata ensures the module remains predictable and easy to maintain.

These are split into two distinct resources:

  • sf_self_domain_list: The Terraform-owned resource for static, code-defined FQDNs.
  • sf_custom_domain_list: The externally-owned resource for dynamic lists.

These are handled in an identical fashion to the Suricata groups—utilizing the self_managed flag to trigger either a strict management cycle or a handoff to external automation via ignore_changes.
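
For reference, here is a condensed sketch of what the externally-owned variant could look like. The rules_source_list block is the provider's native Domain List syntax; the exact ignore_changes path shown is my assumption for this resource type:

resource "aws_networkfirewall_rule_group" "sf_custom_domain_list" {
  for_each = {
    for rg in var.domain_stateful_rule_group :
    rg.name => rg
    if !rg.self_managed
  }

  type        = "STATEFUL"
  name        = each.value.name
  description = each.value.description
  capacity    = each.value.capacity

  rule_group {
    rules_source {
      rules_source_list {
        generated_rules_type = each.value.actions     # e.g. "ALLOWLIST"
        target_types         = each.value.protocols   # e.g. ["TLS_SNI"]
        targets              = each.value.domain_list # placeholder at creation
      }
    }

    stateful_rule_options {
      rule_order = "STRICT_ORDER"
    }
  }

  lifecycle {
    create_before_destroy = true
    # external automation owns the domain targets after creation
    ignore_changes = [
      rule_group[0].rules_source[0].rules_source_list[0].targets,
    ]
  }

  tags = merge(var.tags)
}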

The most important takeaway here isn't the resource type itself, but that these groups are now prepared to participate in the unified ordering mechanism, which is the final piece of the puzzle.

Building the Final Stateful Rule Group Order

This approach turns the design into an elegant, structured solution rather than an ad-hoc one. By leveraging Terraform locals, a scattered set of resources is transformed into a strictly ordered pipeline.

The child module utilizes a three-step logic:

  1. Map the Categories: The ARNs of all created resources are grouped into a structured map.
  2. Flatten by Priority: Those groups are then reordered and flattened based on the stateful_rule_group_order list provided by the parent.
  3. Inject AWS Intelligence: The AWS-managed ARNs are appended as the final elements.

The Transformation Logic:

Here is how I handled that data transformation using HCL:

locals {
  # 1. Categorize rule-group ARNs into logical families
  sf_rule_group_arns = {
    self_managed_suricata = {
      for KV in var.suricata_stateful_rule_group :
      KV.name => aws_networkfirewall_rule_group.sf_self_suricata[KV.name].arn
      if KV.self_managed
    }

    extl_managed_suricata = {
      for KV in var.suricata_stateful_rule_group :
      KV.name => aws_networkfirewall_rule_group.sf_custom_suricata[KV.name].arn
      if !KV.self_managed
    }

    self_managed_domainlist = {
      for KV in var.domain_stateful_rule_group :
      KV.name => aws_networkfirewall_rule_group.sf_self_domain_list[KV.name].arn
      if KV.self_managed
    }

    extl_managed_domainlist = {
      for KV in var.domain_stateful_rule_group :
      KV.name => aws_networkfirewall_rule_group.sf_custom_domain_list[KV.name].arn
      if !KV.self_managed
    }
  }

  # 2. Flatten categories using the parent's stateful_rule_group_order
  stateful_group_arns = concat(
    concat([
      for key in var.stateful_rule_group_order :
      values(local.sf_rule_group_arns[key])
    ]...),

    # then, tack on AWS-managed rule groups
    var.aws_managed_rule_group_arns,
  )
}
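
To make the flattening concrete, here is a hypothetical trace with a single rule group per family (ARNs abbreviated, names invented for illustration):

## given the parent's order ["self_managed_domainlist", "self_managed_suricata", "extl_managed_suricata"],
## stateful_group_arns evaluates to something like:
##   arn:...:stateful-rulegroup/static-domains-sf     (self_managed_domainlist)
##   arn:...:stateful-rulegroup/static-suricata-sf    (self_managed_suricata)
##   arn:...:stateful-rulegroup/suricata-dynamic-sf   (extl_managed_suricata)
##   arn:aws:network-firewall:<region>:aws-managed:stateful-rulegroup/<name>  (appended last)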

The Final Policy:

With a flat, ordered list of ARNs established, applying them to the firewall policy becomes a straightforward process. I then used a dynamic block to ensure the policy stayed updated whenever the list changed.

resource "aws_networkfirewall_firewall_policy" "this" {
  name = "${var.prefix}-${var.service_name}-policy"

  firewall_policy {
    dynamic "stateful_rule_group_reference" {
      for_each = local.stateful_group_arns
      content {
        resource_arn = stateful_rule_group_reference.value
      }
    }

    # stateless config omitted
  }

  tags = merge(var.tags)
}

The Resulting Impact:

This modular architecture yields three major wins:

  • Centralized Control: The parent module uses one simple variable to dictate the entire evaluation sequence.
  • True Reusability: The child module is now a generic engine that can be dropped into any service, account, or environment (see the sketch after this list).
  • Future-Proofing: New categories (such as an external domain feed) can be added seamlessly without requiring a refactor of the parent module's core logic.
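
As an illustration of that reusability, here is a hypothetical instantiation; the source URL and ref are placeholders, and the inputs mirror the variables used throughout this article:

module "edge_nfw" {
  # placeholder source; point this at wherever tf-module-nfw is hosted
  source = "git::https://example.com/network/tf-module-nfw.git?ref=v1.2.0"

  stateful_rule_group_order    = var.stateful_rule_group_order
  suricata_stateful_rule_group = var.suricata_stateful_rule_group
  domain_stateful_rule_group   = var.domain_stateful_rule_group
  aws_managed_rule_group_arns  = local.aws_managed_rule_group_arns
  tags                         = var.tags
}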

Lessons Learned (Part 1)

Building this architecture was a journey that reinforced several core principles of cloud networking and Infrastructure-as-Code. Here is what I took away from the experience:

1️⃣ Separation of Concerns is Mandatory
I realized early on that pushing every single rule through a Terraform pipeline is a recipe for frustration once you reach high-velocity, frequently changing lists. Splitting our firewall into Static (TF-owned) and Dynamic (Externally-owned) rule groups created a much healthier, more predictable management pattern.

2️⃣ The Danger of Joint Custody
One of my biggest takeaways was that Terraform and Lambda must never share ownership of a single attribute. If both tools think they are allowed to modify the rules_string, you will spend your life resolving state drift. By using separate resources (sf_self_* vs. sf_custom_*) and leveraging ignore_changes, I established a clean hand-off that kept our pipelines green.

3️⃣ Null Values are Not Empty
When defining input objects with optional fields like rules_content and rules_file, I found it's easy to accidentally treat a null value as "present." I had to ensure my module logic explicitly tested for null rather than just checking if a key existed, preventing several silent deployment failures.
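
In HCL terms, the guard ended up looking something like this simplified sketch (not our exact code):

locals {
  # Prefer inline rules_content when it is non-null; otherwise read rules_file.
  # Testing for null explicitly avoids treating a present-but-null key as content.
  resolved_rules = {
    for rg in var.suricata_stateful_rule_group :
    rg.name => rg.rules_content != null ? rg.rules_content : file(rg.rules_file)
  }
}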

4️⃣ Explicit Ordering is Worth the Overhead
The stateful_rule_group_order list might seem like extra configuration at first, but it completely removed the ambiguity from our security posture. In an environment that combines custom Suricata signatures, FQDN filters, and AWS-managed threat intel, knowing exactly which rule fires first is critical for both security and troubleshooting.

What’s Coming in Part-2

While we’ve built the infrastructure "shell" and enforced the ordering logic, the "Externally Managed" part of this architecture is where the automation truly comes to life.

In the next post, I’ll deep-dive into the Delivery Mechanism, featuring:

  • The Rule-Engine Lambda: How I built a function to read tenant_hosts.json from S3 and dynamically expand base domains into full Suricata SNI rules.
  • Concurrency & Safety: How to handle the UpdateRuleGroup API safely using the required UpdateToken to prevent race conditions.
  • Security & Identity: Managing cross-account access via S3 bucket policies and KMS key permissions.
  • Observability: Implementing Microsoft Teams notifications that include:
    • Diff Analysis: Clearly showing which hosts were added or removed.
    • Smart Truncation: Handling long lists gracefully so notifications remain readable.
    • Status Cards: Visual Green (Success) and Red (Failure) indicators for immediate feedback.

That’s where the Self-Service aspect of our firewall becomes a reality. See you in Part 2!
