<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jakub</title>
    <description>The latest articles on DEV Community by Jakub (@jakops).</description>
    <link>https://dev.to/jakops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891075%2Fb7b9a8ea-900b-4e28-862d-a944694725e3.jpg</url>
      <title>DEV Community: Jakub</title>
      <link>https://dev.to/jakops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jakops"/>
    <language>en</language>
    <item>
      <title>Secure Private EKS Access and SSO-Protected Frontends with Cloudflare Tunnel on EC2</title>
      <dc:creator>Jakub</dc:creator>
      <pubDate>Mon, 18 May 2026 17:47:29 +0000</pubDate>
      <link>https://dev.to/jakops/secure-private-eks-access-and-sso-protected-frontends-with-cloudflare-tunnel-on-ec2-26dg</link>
      <guid>https://dev.to/jakops/secure-private-eks-access-and-sso-protected-frontends-with-cloudflare-tunnel-on-ec2-26dg</guid>
      <description>&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;The system uses a Cloudflare Tunnel running on a single EC2 instance to replace traditional VPN infrastructure. It provides zero-trust VPC access via WARP for engineers and identity-aware frontend application delivery through a private ALB, exposing services on public subdomains without opening inbound firewall ports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai38xtl4rgtgo75h3hc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai38xtl4rgtgo75h3hc3.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The runtime infrastructure is organized around a single egress-only tunnel instance that serves two access models:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EC2 instance&lt;/strong&gt; — A t4g.micro instance running RHEL 9 in a private subnet with no public IP, no SSH access, and no inbound ports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare Tunnel&lt;/strong&gt; — A daemon running as a systemd service on the EC2 instance that opens outbound QUIC connections to Cloudflare's edge network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare Access Applications&lt;/strong&gt; — Edge configurations mapping public subdomains to the internal load balancer while enforcing identity provider SSO.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Private ALB&lt;/strong&gt; — An AWS Application Load Balancer with no internet-facing listener, fronting frontend services inside EKS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Secrets Manager&lt;/strong&gt; — Secure persistent storage holding the tunnel token retrieved by the instance at boot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security group&lt;/strong&gt; — An AWS network firewall configured with restrictive egress-only rules for the tunnel instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM role&lt;/strong&gt; — An execution role scoped strictly to read permissions for AWS Secrets Manager and AWS Systems Manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS cluster security group rule&lt;/strong&gt; — A network policy rule allowing internal ingress traffic from the tunnel instance.&lt;/p&gt;

&lt;p&gt;All system management occurs via AWS Systems Manager Session Manager.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Technical Behavior
&lt;/h2&gt;

&lt;p&gt;At runtime, the EC2 instance retrieves the authentication token from AWS Secrets Manager and starts the cloudflared daemon. The process initiates outbound QUIC connections to Cloudflare's edge network over ports 7844 and 7845.&lt;/p&gt;
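&lt;p&gt;As a rough illustration of the boot-time token lookup, here is a minimal Python sketch. It assumes the token is stored as a plain string secret; the function name, the secret name, and the stub client are hypothetical and not taken from the original setup. On the instance, the client would be &lt;code&gt;boto3.client("secretsmanager")&lt;/code&gt;.&lt;/p&gt;

```python
def fetch_tunnel_token(secrets_client, secret_id):
    """Return the tunnel token stored in Secrets Manager under secret_id.

    secrets_client can be boto3.client("secretsmanager") on the instance,
    or any object with a compatible get_secret_value method.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    # Assumption: the token is stored as a plain string, not as JSON.
    token = response["SecretString"].strip()
    if not token:
        raise ValueError("secret %r is empty" % secret_id)
    return token


class StubSecretsClient:
    """Offline stand-in that mimics the single call used above."""

    def __init__(self, secrets):
        self._secrets = secrets

    def get_secret_value(self, SecretId):
        return {"SecretString": self._secrets[SecretId]}


if __name__ == "__main__":
    # Hypothetical secret name and value, for a dry run without AWS access.
    client = StubSecretsClient({"cloudflare/tunnel-token": "example-token\n"})
    print(fetch_tunnel_token(client, "cloudflare/tunnel-token"))
```

&lt;p&gt;Passing the client in as an argument keeps the lookup testable without AWS credentials, which is a design choice of this sketch rather than a description of the original bootstrap script.&lt;/p&gt;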

&lt;p&gt;For engineer network routing, users connect via the local Cloudflare WARP client. Traffic destined for the VPC CIDR routes through the tunnel, giving direct network access to the EKS API server, internal RDS databases, and private cluster services.&lt;/p&gt;

&lt;p&gt;For web traffic, Cloudflare Access serves as an identity-aware reverse proxy. External web requests to public domains hit the Cloudflare edge, which evaluates user sessions against an identity provider. Authenticated requests pass through the QUIC tunnel to the private ALB, which forwards traffic directly to frontend pods running inside EKS.&lt;/p&gt;

&lt;p&gt;Frontend Request Flow&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → app.jakops.cloud (Cloudflare Edge, TLS + SSO)
     → Cloudflare Tunnel (encrypted QUIC)
     → EC2 cloudflared instance (private subnet)
     → Private ALB (VPC)
     → EKS frontend pods

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instance Egress Security Group Rule&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7844&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7845&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"udp"&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow outbound QUIC for Cloudflare Tunnel"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VPC Internal Ingress Security Group Rule&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr_block&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow all traffic from VPC for WARP routing"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Engineering Decisions
&lt;/h2&gt;

&lt;p&gt;Dual-purpose tunnel consolidates L3/L4 network proxying via WARP and L7 application routing via Cloudflare Access onto a single EC2 footprint.&lt;/p&gt;

&lt;p&gt;Public domains with private infrastructure keep public DNS records pointed at Cloudflare's edge while the AWS footprint exposes no public endpoints.&lt;/p&gt;

&lt;p&gt;SSO at the edge forces authentication before traffic ever enters the AWS network, removing the requirement for application-level authentication gates on internal frontends.&lt;/p&gt;

&lt;p&gt;Private ALB configuration removes internet-facing listeners, eliminating public DDoS surface area, public security group tracking, and AWS-side certificate rotation.&lt;/p&gt;

&lt;p&gt;ARM architecture selection leverages t4g.micro instances to save approximately 20% on compute cost compared to t3 variants while running the lightweight cloudflared Go binary.&lt;/p&gt;

&lt;p&gt;Hardcoded AMI pinning prevents unintended teardown and recreation of the instance during terraform apply when upstream OS images change.&lt;/p&gt;

&lt;p&gt;Single instance deployment without an Auto Scaling Group trades automated sub-minute failover for simplified configuration on staging and developer environments.&lt;/p&gt;

&lt;p&gt;IMDSv2 requirement mitigates SSRF-based IAM credential theft from the EC2 instance metadata service endpoint.&lt;/p&gt;

&lt;p&gt;A dedicated system user runs the cloudflared binary as a non-login account with no shell, limiting the blast radius of a compromised process.&lt;/p&gt;

&lt;p&gt;KMS-encrypted EBS enforces protection of the root volume data at rest via a customer-managed key.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trade-offs
&lt;/h2&gt;

&lt;p&gt;Optimized for: cost, simplicity, zero-trust posture, unified access control, operational minimalism, elimination of public attack surface.&lt;/p&gt;

&lt;p&gt;Sacrificed: high availability (single instance), self-healing (no ASG), automated AMI rotation, independent scaling of frontend proxy vs. WARP routing, centralized logging (logs stay on-instance via journald).&lt;/p&gt;




&lt;h2&gt;
  
  
  Results / Cost Impact
&lt;/h2&gt;

&lt;p&gt;The implementation reduced ongoing infrastructure spend to approximately $7.33 per month:&lt;/p&gt;

&lt;p&gt;t4g.micro (on-demand) — ~$6.13&lt;br&gt;
8GB GP3 EBS — ~$0.80&lt;br&gt;
Secrets Manager secret — ~$0.40&lt;/p&gt;

&lt;p&gt;This architecture replaced a managed VPN product and a public ALB setup including WAF, certificate validation, and public DNS operations that had cost over $75 per month. Operational overhead for certificate rotation, WAF rule maintenance, and network auditing was removed.&lt;/p&gt;
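&lt;p&gt;The arithmetic behind these figures can be checked with a short Python sketch. The line items are the approximate values quoted above; the $75 baseline is treated as a lower bound because the previous setup cost "over $75", so the computed saving is a minimum, not an exact figure. The function names are illustrative.&lt;/p&gt;

```python
from decimal import Decimal

# Approximate monthly line items in USD, as quoted in the article.
MONTHLY_COSTS = {
    "t4g.micro (on-demand)": Decimal("6.13"),
    "8GB GP3 EBS": Decimal("0.80"),
    "Secrets Manager secret": Decimal("0.40"),
}

# Lower bound for the previous setup ("over $75 per month").
PREVIOUS_MINIMUM = Decimal("75")


def monthly_total(costs):
    """Sum the line items using exact decimal arithmetic."""
    return sum(costs.values(), Decimal("0"))


def minimum_monthly_saving(previous_minimum, costs):
    """Return the smallest saving consistent with the quoted baseline."""
    return previous_minimum - monthly_total(costs)


if __name__ == "__main__":
    print("total:", monthly_total(MONTHLY_COSTS))
    print("saving of at least:", minimum_monthly_saving(PREVIOUS_MINIMUM, MONTHLY_COSTS))
```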




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A single Cloudflare Tunnel instance provides concurrent infrastructure routing for developers and SSO-gated public domain ingress for external stakeholders. By keeping the target AWS load balancer private, the entire internal network remains closed to inbound public traffic while supporting edge-authenticated web delivery.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Combining WARP routing with Cloudflare Access applications on a single tunnel gives you both L3 infrastructure access and L7 application delivery with SSO on real domains, with zero public infrastructure, for under $8/month.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For the full implementation details, see the complete article at &lt;a href="https://jakops.cloud/secure-private-eks-access-sso-frontends-cloudflare-tunnel-ec2" rel="noopener noreferrer"&gt;jakops.cloud&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Need Help?
&lt;/h2&gt;

&lt;p&gt;If you want to deploy a zero-trust setup including Cloudflare Tunnel on EC2, WARP routing, private ALB ingress for EKS, and Terraform modules with IMDSv2 and KMS encryption, you can find assistance directly at &lt;a href="https://jakops.cloud" rel="noopener noreferrer"&gt;https://jakops.cloud&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cloudflaretunnel</category>
      <category>cloudflared</category>
      <category>zerotrust</category>
      <category>eks</category>
    </item>
    <item>
      <title>Migrating a Terraform Monolith to Terragrunt: State Slicing Without Downtime</title>
      <dc:creator>Jakub</dc:creator>
      <pubDate>Fri, 08 May 2026 07:39:10 +0000</pubDate>
      <link>https://dev.to/jakops/migrating-a-terraform-monolith-to-terragrunt-state-slicing-without-downtime-1flb</link>
      <guid>https://dev.to/jakops/migrating-a-terraform-monolith-to-terragrunt-state-slicing-without-downtime-1flb</guid>
      <description>&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I decomposed a monolithic Terraform state containing 19 logical AWS infrastructure components into a Terragrunt monorepo. This migration established isolated state files for each component—including VPC, EKS, and RDS—to enable independent locking, reduced blast radius, and faster plan performance without triggering any infrastructure changes or downtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flesy6mdq6hibx7skwf1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flesy6mdq6hibx7skwf1h.png" alt="Migrating a Terraform Monolith to Terragrunt State Slicing Without Downtime" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Monolith State&lt;/strong&gt; — A single S3-backed state file containing all 19 infrastructure components under a nested module hierarchy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terragrunt Modules&lt;/strong&gt; — 13 independent module directories, each inheriting root configuration and managing a unique S3 state key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency Graph&lt;/strong&gt; — Explicit inter-module wiring using Terragrunt dependency blocks to pass versioned outputs between isolated states.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Technical Behavior
&lt;/h2&gt;

&lt;p&gt;The system runtime behavior changed from a single global lock to a per-component locking model. In the monolith, any change to a load balancer rule required a full re-evaluation of the entire stack, including RDS and EKS clusters. By slicing the state, I isolated the execution flow so that Terraform only reconciles the resources relevant to a specific logical component.&lt;/p&gt;

&lt;p&gt;The migration process relied on address rewriting to drop the top-level parent prefixes used in the monolith. For example, a resource originally located at module.client_stage.module.database.module.rds.aws_db_instance.this[0] was moved to module.rds.aws_db_instance.this[0] within the new isolated rds module state.&lt;/p&gt;
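&lt;p&gt;A minimal Python sketch of this kind of address rewriting is shown below. It is an illustration under assumptions, not the migration script itself: the function name is hypothetical, and the parent prefix is passed in explicitly rather than discovered from the state.&lt;/p&gt;

```python
import re


def rewrite_address(address, parent_prefix):
    """Drop parent_prefix and the dot that follows it from a resource address.

    Addresses that do not start with the prefix are returned unchanged.
    """
    # re.escape keeps dots and brackets in the prefix literal.
    pattern = re.compile(r"^" + re.escape(parent_prefix) + r"\.")
    return pattern.sub("", address, count=1)


if __name__ == "__main__":
    old = "module.client_stage.module.database.module.rds.aws_db_instance.this[0]"
    print(rewrite_address(old, "module.client_stage.module.database"))
```

&lt;p&gt;Anchoring the pattern at the start of the address and requiring a trailing dot avoids partial matches, for example a prefix ending in &lt;code&gt;module.data&lt;/code&gt; accidentally matching &lt;code&gt;module.database&lt;/code&gt;.&lt;/p&gt;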

&lt;p&gt;Pulling the monolith state to a local file for immutable processing&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform state pull &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; monolith.tfstate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dynamically discovering direct child modules from the state list&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;DIRECT_MODULES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STATE_LIST&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"^&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODULE_PREFIX&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;module&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s2"&gt;"s|^&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODULE_PREFIX&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;module&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;||"&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/^\([^.[]*\).*/\1/'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
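&lt;p&gt;For readers who prefer Python over a grep and sed pipeline, the same discovery step can be sketched as follows. This is an assumed equivalent, not the original script: the function name and the sample addresses are hypothetical, and the input is the text output of a state listing, one address per line.&lt;/p&gt;

```python
import re


def direct_modules(state_list, module_prefix):
    """Return sorted, unique names of modules directly under module_prefix."""
    # Capture the module name up to the next dot or index bracket.
    head = re.compile(r"^" + re.escape(module_prefix) + r"\.module\.([^.\[]+)")
    names = set()
    for line in state_list.splitlines():
        match = head.match(line.strip())
        if match:
            names.add(match.group(1))
    return sorted(names)


if __name__ == "__main__":
    # Hypothetical addresses for illustration only.
    sample = "\n".join([
        "module.client_stage.module.rds.aws_db_instance.this[0]",
        'module.client_stage.module.eks["blue"].aws_eks_cluster.this',
        "module.client_stage.module.vpc.aws_vpc.this",
        "module.client_stage.aws_s3_bucket.logs",
    ])
    print(direct_modules(sample, "module.client_stage"))
```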



&lt;p&gt;Executing the state move from the local monolith source to individual module states&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform state &lt;span class="nb"&gt;mv&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MONOLITH_STATE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-state-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TARGET_STATE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$resource&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$new_address&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Final runtime verification involved a run-all plan across the dependency graph. This confirmed that downstream modules could successfully read VPC IDs and RDS endpoints from upstream modules via typed outputs stored in the new isolated state files.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Engineering Decisions
&lt;/h2&gt;

&lt;p&gt;Script-driven slicing over manual commands was implemented to ensure the move of hundreds of resources across 13 modules remained reproducible and free of manual typos.&lt;/p&gt;

&lt;p&gt;Immutable source state management used separate -state and -state-out files to ensure the local monolith snapshot was never modified during the slicing process, allowing for clean retries.&lt;/p&gt;

&lt;p&gt;Dynamic module discovery derived module names directly from the state list rather than a hardcoded inventory, preventing the silent omission of existing infrastructure from the migration.&lt;/p&gt;

&lt;p&gt;Python-based regex processing was utilized for address rewriting to correctly handle dot-separated and bracket-indexed resource patterns that are not safely handled by standard shell tools.&lt;/p&gt;

&lt;p&gt;Local backend validation was performed before migrating to S3 to verify each module against a zero-diff plan, ensuring the state perfectly matched live infrastructure before pushing to remote storage.&lt;/p&gt;
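&lt;p&gt;One way to automate a zero-diff check is to rely on the &lt;code&gt;-detailed-exitcode&lt;/code&gt; flag of &lt;code&gt;terraform plan&lt;/code&gt;, which exits with 0 for an empty diff, 1 for an error, and 2 when changes are pending. The Python sketch below assumes that approach; the function names are hypothetical, and the article does not state how the original validation was scripted.&lt;/p&gt;

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {0: "clean", 1: "error", 2: "diff"}


def classify_plan(exit_code):
    """Map a plan exit code to clean, error, diff, or unknown."""
    return PLAN_OUTCOMES.get(exit_code, "unknown")


def plan_module(module_dir):
    """Run a plan in module_dir and classify it.

    Requires terraform on PATH and an already initialized module directory.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=module_dir,
        check=False,
    )
    return classify_plan(result.returncode)


def dirty_modules(outcomes):
    """Return the names of modules whose outcome is anything other than clean."""
    return sorted(name for name, outcome in outcomes.items() if outcome != "clean")
```

&lt;p&gt;A migration gate could then refuse to push any state to S3 while &lt;code&gt;dirty_modules&lt;/code&gt; returns a non-empty list.&lt;/p&gt;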




&lt;h2&gt;
  
  
  Trade-offs
&lt;/h2&gt;

&lt;p&gt;Optimized for: blast radius reduction, per-module state locking, and faster iteration via targeted plan/apply cycles.&lt;/p&gt;

&lt;p&gt;Sacrificed: operational simplicity during the migration window, requiring a change freeze to prevent drift while state existed in both monolithic and sliced forms.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results / Cost Impact
&lt;/h2&gt;

&lt;p&gt;The platform now operates 13 independent state files in S3, each protected by its own DynamoDB lock.&lt;/p&gt;

&lt;p&gt;Parallel workstreams no longer block each other, as a Kubernetes deployment change no longer locks the VPC or database state.&lt;/p&gt;

&lt;p&gt;The system enforces explicit ownership boundaries, where changes are restricted to specific infrastructure concerns without the risk of affecting adjacent resources in the same state file.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This migration turned a monolithic bottleneck into a scalable management boundary by performing state surgery instead of infrastructure re-creation. The resulting system shows zero drift relative to the original monolith while enabling the team to execute parallel changes with isolated failure modes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The correctness of a state migration is guaranteed when every isolated module produces a clean plan with zero diff.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For the full implementation details, see the complete article at &lt;a href="https://jakops.cloud/terraform-monolith-to-terragrunt-state-migration" rel="noopener noreferrer"&gt;jakops.cloud&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Need Help?
&lt;/h2&gt;

&lt;p&gt;If you're working on a similar state decomposition or evaluating Terragrunt adoption for a growing SaaS platform, feel free to reach out at &lt;a href="mailto:hello@jakops.cloud"&gt;hello@jakops.cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jakops.cloud" rel="noopener noreferrer"&gt;https://jakops.cloud&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>terragrunt</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>Athena Cost Kill Switch: Automated IAM Credential Revocation with CloudWatch, EventBridge, and Lambda</title>
      <dc:creator>Jakub</dc:creator>
      <pubDate>Wed, 06 May 2026 12:16:38 +0000</pubDate>
      <link>https://dev.to/jakops/athena-cost-kill-switch-automated-iam-credential-revocation-with-cloudwatch-eventbridge-and-1a1c</link>
      <guid>https://dev.to/jakops/athena-cost-kill-switch-automated-iam-credential-revocation-with-cloudwatch-eventbridge-and-1a1c</guid>
      <description>&lt;p&gt;How to design an automated kill switch for an Athena data platform that disables service credentials within seconds of a scan threshold breach.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;This system provides an automated response to excessive AWS Athena scan costs generated by external services. It monitors Athena workgroup metrics and immediately revokes IAM access keys when pre-defined data processing thresholds are exceeded, preventing unmonitored cost spikes without requiring human intervention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xguqaiae0hdyidr3xy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xguqaiae0hdyidr3xy3.png" alt="How to design an automated kill switch for an Athena data platform" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The architecture is composed of four distinct layers operating in sequence to monitor, route, and execute the revocation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Athena Workgroups&lt;/strong&gt; - Dedicated workgroups for PowerBI and OpenMetadata that enforce a 1 GB per-query scan cutoff and publish CloudWatch metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Alarms&lt;/strong&gt; - Three independent alarms monitoring the OpenMetadata workgroup for sustained high usage, high failure rates, and rapid consumption spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EventBridge Rule&lt;/strong&gt; - A routing layer that pattern-matches CloudWatch Alarm State Change events to trigger the execution logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Kill Switch&lt;/strong&gt; - A Python-based function that retrieves service credentials from Secrets Manager and executes the IAM revocation call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets Manager&lt;/strong&gt; - A KMS-encrypted store for the OpenMetadata IAM username and access key ID, keeping the execution logic stateless.&lt;/li&gt;
&lt;/ul&gt;
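&lt;p&gt;To make the routing layer concrete, the sketch below shows an event pattern for CloudWatch Alarm State Change events together with a deliberately simplified matcher. The pattern fields follow the event shape CloudWatch emits; the matcher covers only exact-value matching and is not a reimplementation of EventBridge. The alarm name is hypothetical, and the article's actual rule may filter further, for example by alarm name.&lt;/p&gt;

```python
# Event pattern matching alarms that have entered the ALARM state.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern, event):
    """Simplified matcher: lists mean 'any of', dicts recurse into the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def sample_event(state):
    """Build a trimmed-down alarm event; the alarm name is made up."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {"alarmName": "example-spike-alarm", "state": {"value": state}},
    }


if __name__ == "__main__":
    print(matches(ALARM_PATTERN, sample_event("ALARM")))
    print(matches(ALARM_PATTERN, sample_event("OK")))
```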


&lt;h2&gt;
  
  
  Core Technical Behavior
&lt;/h2&gt;

&lt;p&gt;The system remains passive until a threshold is breached. CloudWatch tracks ProcessedBytes and query failure counts at the workgroup level. When a metric crosses a threshold, the alarm transitions to the ALARM state.&lt;/p&gt;

&lt;p&gt;EventBridge detects this state change and triggers the Lambda function. The Lambda performs two operations: it fetches the target IAM username and access key ID from Secrets Manager, then calls the IAM API to set that access key's status to Inactive.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Disable the IAM access key via Boto3
&lt;/span&gt;&lt;span class="n"&gt;iam_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_access_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;UserName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AccessKeyId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;access_key_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inactive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
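&lt;p&gt;Putting the two operations together, a fuller version of the kill switch might look like the sketch below. It is an assumption-laden illustration: the JSON field names in the secret, the secret name, and the stub classes are hypothetical, and only &lt;code&gt;get_secret_value&lt;/code&gt; and &lt;code&gt;update_access_key&lt;/code&gt; are real Boto3 calls.&lt;/p&gt;

```python
import json


def disable_access_key(secrets_client, iam_client, secret_id):
    """Read the IAM user and key ID from the secret, then deactivate the key."""
    raw = secrets_client.get_secret_value(SecretId=secret_id)["SecretString"]
    secret = json.loads(raw)
    # Assumption: these JSON field names are illustrative, not the author's.
    username = secret["username"]
    access_key_id = secret["access_key_id"]
    iam_client.update_access_key(
        UserName=username,
        AccessKeyId=access_key_id,
        Status="Inactive",
    )
    return {"username": username, "access_key_id": access_key_id, "status": "Inactive"}


class StubSecrets:
    """Offline stand-in for the Secrets Manager client."""

    def __init__(self, payload):
        self._payload = payload

    def get_secret_value(self, SecretId):
        return {"SecretString": json.dumps(self._payload)}


class StubIam:
    """Offline stand-in for the IAM client that records each call."""

    def __init__(self):
        self.calls = []

    def update_access_key(self, **kwargs):
        self.calls.append(kwargs)
```

&lt;p&gt;Inside a real Lambda handler, the two clients would be &lt;code&gt;boto3.client("secretsmanager")&lt;/code&gt; and &lt;code&gt;boto3.client("iam")&lt;/code&gt;; the stubs exist only so the logic can be exercised without AWS access.&lt;/p&gt;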



&lt;p&gt;The execution flow is asynchronous. While the Lambda disables the credential, SNS simultaneously sends email notifications to the engineering team. Once the key is inactive, all subsequent Athena queries from the external service fail with authentication errors until manual rotation or reactivation occurs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Engineering Decisions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An IAM user with static credentials was used because OpenMetadata does not support IAM role assumption. Disabling the access key provides the fastest available revocation without modifying IAM policies or workgroup configurations.&lt;/li&gt;
&lt;li&gt;Storing the access key ID and IAM username in Secrets Manager keeps the Lambda stateless. This ensures that credential rotation can occur within the security layer without requiring code changes or redeployments of the Lambda infrastructure.&lt;/li&gt;
&lt;li&gt;Three independent alarms were chosen over a composite alarm to ensure any single failure mode (sustained volume, high failure rates, or sudden spikes) triggers the switch immediately. A composite alarm built on an AND rule would have required multiple conditions to be met simultaneously.&lt;/li&gt;
&lt;li&gt;Direct EventBridge-to-Lambda integration was selected over Step Functions for this path. While the platform's S3-triggered Glue pipeline uses Step Functions for stateful orchestration, the kill switch is a single, stateless API call where added orchestration would only increase latency.&lt;/li&gt;
&lt;li&gt;The use of configurable Terraform variables for thresholds allows for environment-specific tuning. This enables tighter cost controls in staging and more relaxed limits in production without modifying the underlying logic.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Trade-offs
&lt;/h2&gt;

&lt;p&gt;Optimized for: speed of response and operational simplicity. The system executes in seconds with a minimal codebase and no external dependencies beyond native AWS APIs.&lt;/p&gt;

&lt;p&gt;Sacrificed: self-healing. The system requires a deliberate manual action by a platform engineer to investigate the root cause and re-enable or rotate credentials.&lt;/p&gt;

&lt;p&gt;The system lacks a dead-letter queue on the Lambda invocation. If the IAM API call fails or Secrets Manager is throttled, the system relies on standard Lambda async retries without a secondary alerting path for the kill switch's own failure.&lt;/p&gt;

&lt;p&gt;The spike alarm uses a fixed 60-second period. This fixed window cannot be adjusted via Terraform variables, meaning a legitimate high-volume schema discovery scan could trigger a false positive that requires a code change to tune.&lt;/p&gt;

&lt;p&gt;The high-usage alarm does not have a direct Lambda action assigned in its configuration. It relies entirely on the EventBridge rule pattern match for routing, which differs from how the other alarms utilize direct actions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results / Cost Impact
&lt;/h2&gt;

&lt;p&gt;The system introduces negligible ongoing costs as it is entirely event-driven. It eliminates the response window between a threshold breach and human intervention, which is critical for Athena where billing is processed per byte scanned. The platform team receives immediate SNS notifications while the automated response ensures that financial exposure is capped within seconds of a breach.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This architecture uses CloudWatch, EventBridge, and Lambda to create a production-grade cost control mechanism for managed query services. By targeting the IAM credential layer, the system provides a reversible but immediate response to misbehaving external connectors.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Automated cost control is most effective when it targets well-defined service boundaries with fast event routing and manual recovery.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For the full implementation details, see the complete article at &lt;a href="https://jakops.cloud/athena-cost-kill-switch-cloudwatch-eventbridge-lambda/" rel="noopener noreferrer"&gt;https://jakops.cloud/athena-cost-kill-switch-cloudwatch-eventbridge-lambda/&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Need Help?
&lt;/h2&gt;

&lt;p&gt;If you're working on similar infrastructure challenges around AWS cost control, data platform access governance, or IAM-level automation, feel free to reach out at &lt;a href="mailto:hello@jakops.cloud"&gt;hello@jakops.cloud&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>athena</category>
      <category>lambda</category>
      <category>infrastructure</category>
      <category>cloudwatch</category>
    </item>
  </channel>
</rss>
