DEV Community: Jakub

Running RabbitMQ on EKS Without Bitnami — A Cluster-Operator-Based Setup

Jakub — Fri, 17 Jul 2026 10:07:23 +0000

What I Built

I built a custom RabbitMQ deployment on AWS EKS from the ground up to replace Bitnami-dependent charts for a client's Django and Celery-based SaaS platform. The system uses the official RabbitMQ Cluster Operator orchestrated via ArgoCD and a custom, thin Helm chart to manage core task queues across multiple environments.

System Architecture

The infrastructure consists of two separate ArgoCD Application resources that deploy the messaging stack onto AWS EKS to serve downstream applications:

RabbitMQ Cluster Operator — Installed directly into its own rabbitmq-system namespace from upstream release manifests to manage the cluster lifecycle.

Thin Helm Chart — A custom-owned, minimal manifest used solely to render a single RabbitmqCluster custom resource without bundled third-party images or vendor defaults.

CRM Application Components — The downstream API, backend workers, beat scheduler, and Flower monitoring interface that consume the message broker.

Core Technical Behavior

The deployment follows an operator-driven pattern rather than rendering every Kubernetes object from Helm templates. The RabbitMQ Cluster Operator watches the cluster for RabbitmqCluster custom resources and automatically reconciles each one into a StatefulSet, Services, and local cluster configuration.

ArgoCD enforces explicit deployment sequencing at runtime. The operator application applies a negative sync-wave annotation, guaranteeing that CRDs and webhooks are created and verified before the second application attempts to deploy the actual broker resource.
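The ordering rule can be illustrated with a short Python sketch. It only models how sync-wave values sort; the application names and the wave value of -1 are assumptions for illustration, and the real sequencing is done by ArgoCD itself, not by code like this.

```python
SYNC_WAVE = "argocd.argoproj.io/sync-wave"


def sync_order(apps):
    """Return application names sorted by sync-wave (missing annotation = wave 0)."""
    def wave(app):
        annotations = app.get("metadata", {}).get("annotations", {})
        return int(annotations.get(SYNC_WAVE, "0"))

    return [app["metadata"]["name"] for app in sorted(apps, key=wave)]


# Hypothetical application manifests, reduced to the fields that matter here.
apps = [
    {"metadata": {"name": "rabbitmq-cluster"}},
    {"metadata": {"name": "rabbitmq-operator", "annotations": {SYNC_WAVE: "-1"}}},
]
print(sync_order(apps))
```

Lower waves are applied first, so the operator application with the negative wave lands ahead of the broker application that keeps the default wave.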

Data ingestion and task execution rely on dynamic credential synchronization. Downstream application components bind directly to an internal cluster secret generated entirely by the operator, removing the necessity of storing explicit broker connection strings in Git repositories.
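As a rough sketch of the consuming side, the Python below assembles a Celery-style broker URL from environment variables. The variable names are hypothetical, and it assumes the operator-generated secret exposes username, password, host, and port values that the workload maps into its environment.

```python
import os
from urllib.parse import quote


def broker_url(env=os.environ, vhost="/"):
    """Build an AMQP URL from credentials injected out of the operator's secret."""
    user = quote(env["RABBITMQ_USERNAME"], safe="")      # hypothetical variable name
    password = quote(env["RABBITMQ_PASSWORD"], safe="")  # hypothetical variable name
    host = env["RABBITMQ_HOST"]                          # hypothetical variable name
    port = env.get("RABBITMQ_PORT", "5672")
    # The default vhost "/" must be percent-encoded inside the URL path.
    return f"amqp://{user}:{password}@{host}:{port}/{quote(vhost, safe='')}"


example_env = {
    "RABBITMQ_USERNAME": "user",
    "RABBITMQ_PASSWORD": "p@ss",
    "RABBITMQ_HOST": "rabbitmq.example.svc",
}
print(broker_url(example_env))
```

Percent-encoding the credentials matters because generated passwords can contain characters that would otherwise break URL parsing.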

Scaling and runtime scheduling change based on target parameters. Multi-replica environments inject pod anti-affinity and zone-aware topology spread constraints to distribute workloads safely across nodes, while single-replica staging setups strip these constraints to allow scheduling on smaller node groups.
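The conditional logic can be sketched in Python as a function that returns scheduling constraints only for multi-replica clusters. The structures follow the standard Kubernetes pod spec fields; the label selector, the maxSkew value, and the whenUnsatisfiable policy are assumptions, as is where the chart attaches these blocks on the RabbitmqCluster resource.

```python
def scheduling_constraints(replicas, cluster_name):
    """Return anti-affinity and zone spread settings, or nothing for a single replica."""
    if replicas <= 1:
        # Single-replica staging: no constraints, so small node groups can schedule the pod.
        return {}
    # Assumed pod label; adjust to whatever labels the broker pods actually carry.
    selector = {"matchLabels": {"app.kubernetes.io/name": cluster_name}}
    return {
        "affinity": {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {"labelSelector": selector, "topologyKey": "kubernetes.io/hostname"}
                ]
            }
        },
        "topologySpreadConstraints": [
            {
                "maxSkew": 1,
                "topologyKey": "topology.kubernetes.io/zone",
                "whenUnsatisfiable": "ScheduleAnyway",
                "labelSelector": selector,
            }
        ],
    }


print(scheduling_constraints(1, "crm-rabbitmq"))
print(sorted(scheduling_constraints(3, "crm-rabbitmq")))
```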

Traffic to the management interface is gated conditionally. The custom chart templates an Ingress resource behind an internal Application Load Balancer only when the opt-in flag is enabled, iterating over the configured list of hosts.

Key Engineering Decisions

Operator-first architecture over community charts. This shifts operational clustering, leader election, and automated upgrade mechanics from static, vendor-maintained Helm templates into active code maintained by the core RabbitMQ team.

Elimination of inherited vendor packaging defaults. Every configuration field inside the custom Helm chart was added because the client environment specifically required it, creating a values structure focused strictly on explicit storage classes, plugins, and resource parameters.

Sync-wave ordering as an architectural requirement. Utilizing a negative sync-wave combined with server-side apply ensures the operator’s structural dependencies exist first, while excluding CRD status fields and webhook values from ArgoCD's diff prevents reconciliation loops.

Environment-scaled termination grace periods. Production brokers use an elongated termination grace period of seven days to give active queues ample time to drain and rebalance during node churn, whereas development deployments override this setting to sixty seconds for speed.
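For reference, the two durations convert to seconds as follows. This is a minimal sketch: the environment keys are hypothetical, and it assumes the values end up in a seconds-based field such as terminationGracePeriodSeconds.

```python
from datetime import timedelta

# Durations taken from the text; the dictionary keys are illustrative names.
GRACE_PERIODS = {
    "production": timedelta(days=7),
    "development": timedelta(seconds=60),
}


def termination_grace_seconds(environment):
    """Return the termination grace period for an environment, in whole seconds."""
    return int(GRACE_PERIODS[environment].total_seconds())


print(termination_grace_seconds("production"))   # 604800
print(termination_grace_seconds("development"))  # 60
```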

Scoping configuration parameters and plugins by environment profile. The core platform toggles advanced cross-cluster forwarding using shovel management and specific memory watermarks in production, while running a stripped-down, light management footprint for standard development work.

Trade-offs

Optimized for: absolute control over exposed broker settings, a standardized deployment pattern across cluster environments, and independence from external chart packaging lifecycles.

Sacrificed: the immediate out-of-the-box convenience of large community charts, which include bundled metrics tools, complex default secret mappings, and pre-wired configurations.

Results / Cost Impact

The client removed third-party image deprecation and licensing risk from their production background-job path.

The architecture established the same deployment pattern in every environment, with differences limited to explicit per-environment values, and allowed non-production configurations to scale down to avoid paying for idle infrastructure.

Conclusion

Transitioning to the RabbitMQ Cluster Operator replaces broad community charts with a lean manifest blueprint where core clustering logic is maintained by its actual authors. This setup provides predictable GitOps delivery and clean environmental scaling for critical message broker dependencies.

Offloading cluster state and failover mechanics to an official operator is usually a better trade than maintaining custom, heavy forks of community charts.

Need Help?

If you are building similar systems, feel free to reach out at hello@jakops.cloud.

https://jakops.cloud

Production SigNoz on EKS: Cost-Optimized Observability with Tiered Storage and Auto-Instrumented APM

Jakub — Wed, 17 Jun 2026 11:47:14 +0000

What I Built

A SaaS client running multiple workloads on EKS had outgrown CloudWatch dashboards and needed correlated telemetry — metrics, traces, and logs — with months of retention. Commercial observability vendors were off the table due to per-seat and per-GB pricing.

I designed and delivered a production SigNoz deployment on their existing EKS cluster, balancing retention depth against storage cost while keeping the ingestion pipeline elastic under bursty load — without adding dedicated infrastructure team overhead.

System Architecture

Application Instrumentation — OpenTelemetry Operator injecting the Python auto-instrumentation agent at pod startup, emitting traces, spans, and runtime metrics without code changes.

Ingestion — OpenTelemetry Collectors running as scalable deployments, receiving telemetry from instrumented pods and scraping Prometheus endpoints from Karpenter, KEDA, ArgoCD, LiteLLM, and Valkey.

Processing — ClickHouse as the primary TSDB for hot telemetry data, backed by Zookeeper for coordination.

Cold Storage — S3 with a three-stage lifecycle policy moving aging data from Standard → Standard-IA → Glacier IR → expiration.

Metadata — An encrypted PostgreSQL RDS instance storing SigNoz application state (dashboards, alerts, users), decoupled from ClickHouse.

The entire stack is defined in Terraform (infrastructure) and Helm (workloads), deployed via ArgoCD.

Core Technical Behavior

Zero-Code APM Injection

The OpenTelemetry Operator manages an Instrumentation custom resource that injects the Python agent into pods at startup via a mutating webhook. Application teams opt in with a single annotation — no SDK calls, no coordinated rollout.

Instrumentation CR:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: backend-otl
spec:
  exporter:
    endpoint: "http://signoz-otel-collector.signoz.svc:4317"
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "1"
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest

Pod opt-in annotation and service identity:

podAnnotations:
  instrumentation.opentelemetry.io/inject-python: "backend-otl"
env:
  - name: OTEL_SERVICE_NAME
    value: "backend"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.namespace=backend,deployment.environment=prod"

The agent captures distributed traces across HTTP handlers, Django ORM queries, Valkey cache operations, and inter-service calls. Celery tasks become individual spans with task name, queue, execution duration, and retry metadata. Unhandled exceptions are captured as span events with full stack traces.

The sampler is parentbased_traceidratio — the sampling decision propagates from the trace entry point. This prevents orphaned child spans from partially-sampled request flows.
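The decision logic behind a parent-based ratio sampler can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the OpenTelemetry SDK's implementation: the function name and the parent_sampled parameter are hypothetical.

```python
TRACE_ID_MASK = (1 << 64) - 1  # only the low 64 bits of the trace ID are compared


def should_sample(trace_id, ratio, parent_sampled=None):
    """Follow the parent's decision if there is one; otherwise sample by trace ID ratio."""
    if parent_sampled is not None:
        # Child spans inherit the decision made at the trace entry point.
        return parent_sampled
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


print(should_sample(0x1234, 1.0))                        # True: a ratio of 1 keeps every root trace
print(should_sample(0x1234, 1.0, parent_sampled=False))  # False: the parent already opted out
```

Because every span in a trace either has a parent decision or shares the same trace ID, the whole request flow is kept or dropped together.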

The Operator runs with two replicas and topology spread constraints, keeping the mutating webhook available during node rotations.

Collector Scaling and Backpressure

Collectors scale via KEDA on actual ingestion pressure, not CPU thresholds. Telemetry volume tracks application traffic — CPU is a poor proxy for this.

Collector processor configuration:

otelCollector:
  keda:
    enabled: true
  config:
    processors:
      memory_limiter:
        limit_mib: 1000
        check_interval: 5s
      batch:
        timeout: 10s
        send_batch_size: 1000

The memory limiter applies backpressure at 1000 MiB (just under 1 GiB): a collector approaching that limit refuses incoming data rather than being OOM-killed, so telemetry can be dropped if senders do not retry. The batch processor aggregates telemetry into 1000-item batches with a 10-second flush window, reducing write amplification on ClickHouse.
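The size-or-timeout flushing behaviour of a batch processor can be modelled with a small Python class. This is an illustrative model only, not the collector's code (which is written in Go): the class name, the tick method, and the injectable clock are assumptions made so the behaviour can be exercised in isolation.

```python
import time


class Batcher:
    """Collect items and flush when the batch is full or the flush window has elapsed."""

    def __init__(self, send, max_items=1000, timeout_s=10.0, clock=time.monotonic):
        self.send = send
        self.max_items = max_items
        self.timeout_s = timeout_s
        self.clock = clock
        self.items = []
        self.started = None

    def add(self, item):
        if not self.items:
            self.started = self.clock()  # the window starts with the first queued item
        self.items.append(item)
        if len(self.items) >= self.max_items:
            self.flush()

    def tick(self):
        """Call periodically; flushes a partial batch once the timeout has passed."""
        if self.items and self.clock() - self.started >= self.timeout_s:
            self.flush()

    def flush(self):
        if self.items:
            self.send(self.items)
            self.items = []


sent = []
batcher = Batcher(sent.append, max_items=3, timeout_s=10.0)
for span in ("a", "b", "c"):
    batcher.add(span)
print(sent)  # [['a', 'b', 'c']]
```

Fewer, larger writes are the point: each flush becomes one insert instead of one per span.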

A separate metrics/infra pipeline handles infrastructure scraping (Karpenter, KEDA, ArgoCD server, repo-server, application-controller, LiteLLM, Valkey exporter) with the same limiter and batch configuration. This keeps infrastructure metrics isolated from application trace ingestion at the pipeline level.

Collectors run on Spot instances. They are stateless and the batch processor ensures minimal data loss on graceful termination. Brief gaps in scrape-based metrics can occur during node reclamation.

Tiered Storage Lifecycle

ClickHouse offloads older partitions to S3. The lifecycle is managed via Terraform:

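resource "aws_s3_bucket_lifecycle_configuration" "signoz_lifecycle" {
  # Required argument; the bucket resource name is an assumption, not from the original.
  bucket = aws_s3_bucket.signoz.id

  rule {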
    id     = "expire-old-telemetry"
    status = "Enabled"
    filter {}

    transition {
      days          = var.signoz_retention_standard_ia_days
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = var.signoz_retention_glacier_days
      storage_class = "GLACIER_IR"
    }
    expiration {
      days = var.signoz_retention_expire_days
    }
  }
}

Recent data stays on EBS in ClickHouse for fast queries. At S3 list prices, Standard-IA is roughly 45% cheaper per GB than Standard, and Glacier IR is roughly 68% cheaper than Standard-IA (over 80% cheaper than Standard). Bucket versioning is disabled because telemetry is append-only and reproducible from source. AES256 server-side encryption is enforced and all public access is blocked.
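The percentages can be checked with a few lines of Python. The per-GB prices below are assumptions (us-east-1 list prices for the first storage tier, which may have changed since), so treat the output as a sanity check rather than a quote. Under these prices, the 68% figure corresponds to Glacier IR compared with Standard-IA.

```python
# Assumed storage prices in USD per GB-month; verify against current AWS pricing.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
}


def percent_cheaper(tier, baseline="STANDARD"):
    """Return how much cheaper a tier is than the baseline, as a percentage."""
    ratio = PRICE_PER_GB_MONTH[tier] / PRICE_PER_GB_MONTH[baseline]
    return round(100 * (1 - ratio), 1)


print(percent_cheaper("STANDARD_IA"))                # vs Standard
print(percent_cheaper("GLACIER_IR", "STANDARD_IA"))  # vs Standard-IA
print(percent_cheaper("GLACIER_IR"))                 # vs Standard
```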

Decoupled RDS Metadata Store

resource "aws_db_instance" "signoz" {
  engine                       = "postgres"
  instance_class               = var.instance_class
  storage_type                 = "gp3"
  storage_encrypted            = true
  multi_az                     = var.multi_az
  deletion_protection          = true
  monitoring_interval          = 1
  performance_insights_enabled = true
}

Credentials are sourced from Secrets Manager and injected via Kubernetes secrets. SSL is enforced via rds.force_ssl = 1. The security group restricts access to EKS node security groups and specific internal CIDRs.

A ClickHouse failure does not corrupt dashboard state. The metadata store can be independently backed up, scaled, or restored.

Key Engineering Decisions

OTel Operator over SDK instrumentation. Manual SDK integration across a multi-service Python stack — API, workers, beat, flower — requires coordinated developer effort and ongoing maintenance per service. The Operator centralizes instrumentation control on the platform team. Any new service gets full APM coverage via annotation.

KEDA over HPA for collectors. A plain HPA scales on CPU and memory by default, and those metrics have no consistent relationship to telemetry throughput. KEDA scales on actual ingestion pressure, matching collector capacity to load.

Internal ALB with shared group. The SigNoz frontend joins an existing ALB group via AWS Load Balancer Controller's group feature, avoiding a dedicated load balancer provisioned solely for this workload (~$16/month base cost saved).

IRSA for S3 cold storage access. ClickHouse pods assume an IAM role via EKS service account annotation. No long-lived credentials are used for bucket access.

Separate OTEL_SERVICE_NAME per component. Each deployment — api, backend, worker, beat, flower — reports a distinct service name with shared namespace attributes. SigNoz's service map reflects the actual component topology, making queue-level latency visible without filtering through a monolithic service.

Trade-offs

Optimized for: long-term retention cost, ingestion elasticity, operational durability, security posture, developer velocity via zero-code instrumentation.

Sacrificed: query latency on cold data (partitions read from S3 are slower to query than those on EBS, and Glacier IR adds per-GB retrieval charges; acceptable for historical investigations, not for real-time alerting), operational simplicity from managing RDS separately, and instrumentation precision (auto-instrumentation captures fewer custom business attributes than hand-written spans).

The auto-instrumentation agent adds approximately 50–80 MiB memory overhead per pod with negligible request latency impact.

Cost & Operational Impact

A system ingesting 50 GB/day pays full EBS rates only for the hot window. After lifecycle transitions, the effective per-GB cost drops to roughly one-third of the initial rate.

Spot instances for collectors produce 60–70% savings on that compute tier.

The zero-code APM approach eliminates weeks of developer instrumentation work across services. Every service gets identical trace propagation, sampling strategy, and attribute enrichment regardless of which team owns it.

Conclusion

The system delivers vendor-independent observability with months of queryable retention and full distributed tracing across a Python stack. Cost scales sub-linearly with data volume because only the hot window pays full storage rates — everything behind it transitions through progressively cheaper tiers.

The Operator injection model and KEDA-based scaling mean the platform team controls instrumentation coverage and ingestion capacity centrally, without coordinating with application teams on each change.

Combining zero-code OTel injection with event-driven collector scaling and tiered object storage is what makes retention cost predictable at scale — the expensive tier stays bounded by the hot window alone.

Need Help?

If you're working on observability or APM infrastructure on EKS, reach me at hello@jakops.cloud.

Deploying Stirling PDF on EKS with Helm, SSO, and Persistent Storage

Jakub — Thu, 28 May 2026 10:08:32 +0000

What I Built

The system is a self-hosted, compliance-aligned PDF processing platform running on AWS EKS, built to replace third-party SaaS alternatives. It meets the requirements for data auditability and control by integrating with the existing corporate identity provider.

System Architecture

The setup relies entirely on the following internal and infrastructure components:

stirling-pdf-chart — upstream application chart version 3.1.0 declared as a clean dependency alias

Persistent Volume Claims — three 1Gi gp3 volumes providing non-shared storage for application data paths

AWS Secrets Manager / SecretStore — infrastructure provider for decoupling and pulling runtime SSO credentials

AWS Load Balancer Controller — routing component handling multi-service integration on a single load balancer

Core Technical Behavior

At runtime, the wrapper chart dynamically provisions three distinct 1Gi gp3 Persistent Volume Claims mapped directly to internal paths: /configs, /pipeline, and /usr/share/tessdata. The /usr/share/tessdata mount explicitly retains OCR language assets across container life cycles, preventing runtime re-downloading whenever a pod restarts.

Pod initialization is slow because of security checks and OCR engine setup, so health checks use conservative readiness and liveness timings, described after the chart excerpts below.

Upstream chart dependency block

dependencies:
  - name: stirling-pdf-chart
    alias: stirling-pdf
    version: "3.1.0"
    repository: "https://stirling-tools.github.io/Stirling-PDF-chart"

Persistent volume loop layout

persistence:
  additionalVolumes:
    - name: configs
      size: 1Gi
    - name: pipeline
      size: 1Gi
    - name: tessdata
      size: 1Gi

Liveness checks delay for 120 seconds and retry every 30 seconds, tolerating up to 5 consecutive failures before forcing a restart. Readiness checks delay for 90 seconds and run every 15 seconds, hitting the /api/v1/info/status path to track initialization progress.
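To make the timings concrete, the Python sketch below expresses both probes as dictionaries using standard Kubernetes probe field names and estimates how long a pod that never becomes healthy survives before a restart. The port number is an assumption, and the estimate ignores probe timeouts and kubelet scheduling jitter.

```python
LIVENESS = {"initialDelaySeconds": 120, "periodSeconds": 30, "failureThreshold": 5}
READINESS = {
    "initialDelaySeconds": 90,
    "periodSeconds": 15,
    # The path comes from the text; the port is an assumption.
    "httpGet": {"path": "/api/v1/info/status", "port": 8080},
}


def earliest_restart_seconds(probe):
    """Approximate seconds from container start to the final consecutive failure."""
    extra_probes = probe["failureThreshold"] - 1
    return probe["initialDelaySeconds"] + extra_probes * probe["periodSeconds"]


print(earliest_restart_seconds(LIVENESS))  # 240
```

Roughly four minutes of tolerance is what lengthens rollouts: a broken release is not restarted, or noticed by the liveness probe, any sooner than that.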

Security parameters strictly enforce authentication via external parameters:

SSO environment declaration

envsFrom:
  - secretRef:
      name: stirling-pdf-sso-secret

The runtime enforces DOCKER_ENABLE_SECURITY=true, SECURITY_OAUTH2_ENABLED=true, and SECURITY_ENABLELOGIN=true while maintaining active CSRF protection. To prevent processing failures during token exchanges and large file moves, SERVER_TOMCAT_MAX_HTTP_HEADER_SIZE is expanded to 65536 bytes, and the permitted form post size is raised to 10MB.

Ingress traffic routing uses an internal scheme linked via the AWS Load Balancer Controller:

Ingress group settings

alb.ingress.kubernetes.io/load-balancer-name: "stage-shared-alb"
alb.ingress.kubernetes.io/group.name: "stage"

Path rules selectively route public requests to static resources, the core API, and login endpoints, preventing direct discovery of any internal application paths.

Key Engineering Decisions

Wrapping the upstream chart as a dependency isolates lifecycle tracking. Upstream updates are consumed by advancing the dependency version string, separating core application changes from internal platform assets like network policies or volume claims.

Externalizing credentials via AWS Secrets Manager prevents leaking raw keys. Environment parameters ingest the secret values dynamically at deploy time using an ExternalSecret link, removing plaintext values from the repository configuration.

Consolidating services into a shared ALB group limits platform cost overhead. Setting a matching ingress group name allows the controller to attach multi-service routing rules to the stage-shared-alb instead of spinning up standalone, single-tenant load balancers.

Layering environment settings reduces configuration redundancy. A global values.yaml defines base platform baselines, whereas target environment configurations override only the exact parameters required for that specific deployment.

Trade-offs

Optimized for: operational simplicity, compliance alignment, cost efficiency on shared infrastructure.

Sacrificed:

Pod startup speed — conservative initialization periods and high failure thresholds lengthen application rollout times.
Multi-AZ storage resilience — gp3 volumes are EBS-backed and therefore tied to a single availability zone, so the pod can only be rescheduled onto nodes in that zone. ReadWriteOnce access also means a replacement pod cannot attach the volumes until they are detached from the original node, which delays recovery if that node fails before being confirmed dead.
Network security validation speed — NetworkPolicy configurations are disabled in staging to reduce engineering friction during initial runtime testing.

Conclusion

This EKS architecture establishes a secure, self-hosted PDF utility integrated with an upstream Helm dependency model and externalized identity verification. The design limits infrastructure costs via a shared ingress deployment while preserving state across container restarts.

Wrapping upstream Helm charts rather than forking them keeps maintenance overhead low while retaining full control over platform-level behavior.

Need Help?

If you're working on similar infrastructure challenges — self-hosted tooling, EKS platform design, or Helm chart architecture — feel free to reach out at hello@jakops.cloud.

Secure Private EKS Access and SSO-Protected Frontends with Cloudflare Tunnel on EC2

Jakub — Mon, 18 May 2026 17:47:29 +0000

What I Built

The system uses a Cloudflare Tunnel running on a single EC2 instance to replace traditional VPN infrastructure. It provides zero-trust VPC access via WARP for engineers and identity-aware frontend application delivery through a private ALB, exposing services on public subdomains without opening inbound firewall ports.

System Architecture

The runtime infrastructure groups components strictly around a single egress-only tunnel instance that services two access models:

EC2 instance — A t4g.micro instance running RHEL 9 in a private subnet with no public IP, no SSH access, and no inbound ports.

Cloudflare Tunnel — A daemon running as a systemd service on the EC2 instance that opens outbound QUIC connections to Cloudflare's edge network.

Cloudflare Access Applications — Edge configurations mapping public subdomains to the internal load balancer while enforcing identity provider SSO.

Private ALB — An AWS Application Load Balancer with no internet-facing listener, fronting frontend services inside EKS.

AWS Secrets Manager — Secure persistent storage holding the tunnel token retrieved by the instance at boot.

Security group — An AWS network firewall configured with restrictive egress-only rules for the tunnel instance.

IAM role — An execution role scoped strictly to read permissions for AWS Secrets Manager and AWS Systems Manager.

EKS cluster security group rule — A network policy rule allowing internal ingress traffic from the tunnel instance.

All system management occurs via AWS Systems Manager Session Manager.

Core Technical Behavior

At runtime, the EC2 instance retrieves the authentication token from AWS Secrets Manager and starts the cloudflared daemon. The process initiates outbound QUIC connections to Cloudflare's edge network over ports 7844 and 7845.
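The boot-time step can be sketched in Python. The post runs cloudflared under systemd, so this is not the actual unit or bootstrap script: the function names and the secret ID are hypothetical, and the sketch only shows fetching the token and building the command line.

```python
import subprocess


def cloudflared_command(secrets_client, secret_id):
    """Fetch the tunnel token from Secrets Manager and build the cloudflared argv."""
    token = secrets_client.get_secret_value(SecretId=secret_id)["SecretString"]
    return ["cloudflared", "tunnel", "run", "--token", token]


def main():
    # Imported here so the helper above can be exercised without AWS access.
    import boto3

    client = boto3.client("secretsmanager")
    # "cloudflare/tunnel-token" is a placeholder secret name.
    subprocess.run(cloudflared_command(client, "cloudflare/tunnel-token"), check=True)
```

Calling main() on the instance relies on the IAM role described earlier for read access to the secret, so no credentials need to be stored on disk.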

For engineer network routing, users connect via the local Cloudflare WARP client. Traffic destined for the VPC CIDR routes through the tunnel, giving direct network access to the EKS API server, internal RDS databases, and private cluster services.

For web traffic, Cloudflare Access serves as an identity-aware reverse proxy. External web requests to public domains hit the Cloudflare edge, which evaluates user sessions against an identity provider. Authenticated requests pass through the QUIC tunnel to the private ALB, which forwards traffic directly to frontend pods running inside EKS.

Frontend Request Flow

User → app.jakops.cloud (Cloudflare Edge, TLS + SSO)
     → Cloudflare Tunnel (encrypted QUIC)
     → EC2 cloudflared instance (private subnet)
     → Private ALB (VPC)
     → EKS frontend pods

Instance Egress Security Group Rule

egress {
  from_port   = 7844
  to_port     = 7845
  protocol    = "udp"
  cidr_blocks = ["0.0.0.0/0"]
  description = "Allow outbound QUIC for Cloudflare Tunnel"
}

VPC Internal Ingress Security Group Rule

ingress {
  from_port   = 0
  to_port     = 0
  protocol    = "-1"
  cidr_blocks = [var.vpc_cidr_block]
  description = "Allow all traffic from VPC for WARP routing"
}

Key Engineering Decisions

Dual-purpose tunnel consolidates L3/L4 network proxying via WARP and L7 application routing via Cloudflare Access onto a single EC2 footprint.

Public domains with private infrastructure keeps public DNS records pointed to Cloudflare's edge while leaving the AWS footprint completely invisible without public endpoints.

SSO at the edge forces authentication before traffic ever enters the AWS network, removing the requirement for application-level authentication gates on internal frontends.

Private ALB configuration removes internet-facing listeners, eliminating public DDoS surface area, public security group tracking, and AWS-side certificate rotation.

ARM architecture selection leverages t4g.micro instances to save approximately 20% on compute cost compared to t3 variants while running the lightweight cloudflared Go binary.

Hardcoded AMI pinning prevents unintended instance replacement during terraform apply when upstream OS images change.

Single instance deployment without an Auto Scaling Group trades automated sub-minute failover for simplified configuration on staging and developer environments.

IMDSv2 requirement mitigates SSRF-based IAM credential theft from the EC2 instance metadata service endpoint.

Dedicated system user constraints execute the cloudflared binary as a non-login user with no system shell to restrict localized blast radius.

KMS-encrypted EBS enforces protection of the root volume data at rest via a customer-managed key.

Trade-offs

Optimized for: cost, simplicity, zero-trust posture, unified access control, operational minimalism, elimination of public attack surface.

Sacrificed: high availability (single instance), self-healing (no ASG), automated AMI rotation, independent scaling of frontend proxy vs. WARP routing, centralized logging (logs stay on-instance via journald).

Results / Cost Impact

The implementation reduced ongoing infrastructure spend to approximately $7.33 per month:

t4g.micro (on-demand) — ~$6.13
8GB GP3 EBS — ~$0.80
Secrets Manager secret — ~$0.40

This architecture replaced a managed VPN product and a public ALB setup including WAF, certificate validation, and public DNS operations that had cost over $75 per month. Operational overhead for certificate rotation, WAF rule maintenance, and network auditing was removed.
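The arithmetic behind these figures is simple enough to check in Python. The previous cost is stated only as "over $75", so the value used here is a lower bound and the resulting saving is a minimum.

```python
MONTHLY_COSTS_USD = {
    "t4g.micro (on-demand)": 6.13,
    "8GB GP3 EBS": 0.80,
    "Secrets Manager secret": 0.40,
}
PREVIOUS_MONTHLY_USD = 75.00  # lower bound: the text says "over $75"

total = round(sum(MONTHLY_COSTS_USD.values()), 2)
savings_pct = round(100 * (1 - total / PREVIOUS_MONTHLY_USD), 1)
print(total, savings_pct)
```

With these inputs the new setup costs about a tenth of what it replaced.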

Conclusion

A single Cloudflare Tunnel instance provides concurrent infrastructure routing for developers and SSO-gated public domain ingress for external stakeholders. By keeping the target AWS load balancer private, the entire internal network remains closed to inbound public traffic while supporting edge-authenticated web delivery.

Combining WARP routing with Cloudflare Access applications on a single tunnel gives you both L3 infrastructure access and L7 application delivery with SSO on real domains with zero public infrastructure for under $8/month.

Need Help?

If you want to deploy a zero-trust setup including Cloudflare Tunnel on EC2, WARP routing, private ALB ingress for EKS, and Terraform modules with IMDSv2 and KMS encryption, you can find assistance directly at https://jakops.cloud.

Migrating a Terraform Monolith to Terragrunt: State Slicing Without Downtime

Jakub — Fri, 08 May 2026 07:39:10 +0000

What I Built

I decomposed a monolithic Terraform state containing 19 logical AWS infrastructure components into a Terragrunt monorepo. This migration established isolated state files for each component—including VPC, EKS, and RDS—to enable independent locking, reduced blast radius, and faster plan performance without triggering any infrastructure changes or downtime.

System Architecture

Monolith State — A single S3-backed state file containing all 19 infrastructure components under a nested module hierarchy.

Terragrunt Modules — 13 independent module directories, each inheriting root configuration and managing a unique S3 state key.

Dependency Graph — Explicit inter-module wiring using Terragrunt dependency blocks to pass versioned outputs between isolated states.

Core Technical Behavior

The system runtime behavior changed from a single global lock to a per-component locking model. In the monolith, any change to a load balancer rule required a full re-evaluation of the entire stack, including RDS and EKS clusters. By slicing the state, I isolated the execution flow so that Terraform only reconciles the resources relevant to a specific logical component.

The migration process relied on address rewriting to drop the top-level parent prefixes used in the monolith. For example, a resource originally located at module.client_stage.module.database.module.rds.aws_db_instance.this[0] was moved to module.rds.aws_db_instance.this[0] within the new isolated rds module state.
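A minimal Python sketch of that rewrite follows. The function name and signature are hypothetical, and the real script may differ; it also assumes the child module itself carries no count or for_each index, which would need extra handling.

```python
import re


def rewrite_address(address, module_prefix, child_module):
    """Drop the monolith prefix and the direct child module from a resource address."""
    pattern = rf"^{re.escape(module_prefix)}\.module\.{re.escape(child_module)}\."
    new_address, count = re.subn(pattern, "", address, count=1)
    if count == 0:
        # Fail loudly instead of silently moving a resource to the wrong address.
        raise ValueError(f"{address} is not under {module_prefix}.module.{child_module}")
    return new_address


old = "module.client_stage.module.database.module.rds.aws_db_instance.this[0]"
print(rewrite_address(old, "module.client_stage", "database"))
# module.rds.aws_db_instance.this[0]
```

Anchoring the pattern at the start of the address leaves bracket indexes and nested module segments further along untouched.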

Pulling the monolith state to a local file for immutable processing

terraform state pull > monolith.tfstate

Dynamically discovering direct child modules from the state list

DIRECT_MODULES=$(echo "$STATE_LIST" | grep "^${MODULE_PREFIX}\.module\." | \
  sed "s|^${MODULE_PREFIX}\.module\.||" | \
  sed 's/^\([^.[]*\).*/\1/' | sort -u)

Executing the state move from the local monolith source to individual module states

terraform state mv \
  -state="$MONOLITH_STATE" \
  -state-out="$TARGET_STATE" \
  "$resource" "$new_address"

Final runtime verification involved a run-all plan across the dependency graph. This confirmed that downstream modules could successfully read VPC IDs and RDS endpoints from upstream modules via typed outputs stored in the new isolated state files.

Key Engineering Decisions

Script-driven slicing over manual commands was implemented to ensure the move of hundreds of resources across 13 modules remained reproducible and free of manual typos.

Immutable source state management used separate -state and -state-out files to ensure the local monolith snapshot was never modified during the slicing process, allowing for clean retries.

Dynamic module discovery derived module names directly from the state list rather than a hardcoded inventory, preventing the silent omission of existing infrastructure from the migration.

Python-based regex processing was utilized for address rewriting to correctly handle dot-separated and bracket-indexed resource patterns that are not safely handled by standard shell tools.

Local backend validation was performed before migrating to S3 to verify each module against a zero-diff plan, ensuring the state perfectly matched live infrastructure before pushing to remote storage.
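A zero-diff check like this can be automated with the -detailed-exitcode flag of terraform plan, which returns 0 for no changes, 1 for an error, and 2 when changes are pending. The Python below is a sketch under that assumption; the function names are hypothetical and the original validation may have been driven through Terragrunt instead.

```python
import subprocess


def interpret_plan_exit_code(code):
    """Map the exit code of `terraform plan -detailed-exitcode` to a verdict."""
    if code == 0:
        return "clean"  # state matches live infrastructure
    if code == 2:
        return "diff"   # the plan would change something
    return "error"      # the plan itself failed


def plan_is_clean(module_dir):
    """Run a plan in one module directory and report whether it is zero-diff."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=module_dir,
    )
    return interpret_plan_exit_code(result.returncode) == "clean"
```

Running plan_is_clean over every module directory and refusing to proceed unless all return True gives a simple gate before the states are pushed to S3.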

Trade-offs

Optimized for: blast radius reduction, per-module state locking, and faster iteration via targeted plan/apply cycles.

Sacrificed: operational simplicity during the migration window, requiring a change freeze to prevent drift while state existed in both monolithic and sliced forms.

Results / Cost Impact

The platform now operates 13 independent state files in S3, each protected by its own DynamoDB lock.

Parallel workstreams no longer block each other, as a Kubernetes deployment change no longer locks the VPC or database state.

The system enforces explicit ownership boundaries, where changes are restricted to specific infrastructure concerns without the risk of affecting adjacent resources in the same state file.

Conclusion

This migration turned a monolithic bottleneck into a scalable management boundary by performing state surgery instead of infrastructure re-creation. The resulting system maintains zero-drift compared to the original monolith while enabling the team to execute parallel changes with isolated failure modes.

The correctness of a state migration is guaranteed when every isolated module produces a clean plan with zero diff.

Need Help?

If you're working on a similar state decomposition or evaluating Terragrunt adoption for a growing SaaS platform, feel free to reach out at hello@jakops.cloud.

https://jakops.cloud

Athena Cost Kill Switch: Automated IAM Credential Revocation with CloudWatch, EventBridge, and Lambda

Jakub — Wed, 06 May 2026 12:16:38 +0000

How to design an automated kill switch for an Athena data platform that disables service credentials within seconds of a scan threshold breach.

What I Built

This system provides an automated response to excessive AWS Athena scan costs generated by external services. It monitors Athena workgroup metrics and immediately revokes IAM access keys when pre-defined data processing thresholds are exceeded, preventing unmonitored cost spikes without requiring human intervention.

System Architecture

The architecture is composed of four layers operating in sequence to monitor, route, and execute the revocation, supported by a Secrets Manager store that holds the target credential metadata.

Athena Workgroups - Dedicated workgroups for PowerBI and OpenMetadata that enforce a 1 GB per-query scan cutoff and publish CloudWatch metrics.

CloudWatch Alarms - Three independent alarms monitoring the OpenMetadata workgroup for sustained high usage, high failure rates, and rapid consumption spikes.

EventBridge Rule - A routing layer that pattern-matches CloudWatch Alarm State Change events to trigger the execution logic.

Lambda Kill Switch - A Python-based function that retrieves service credentials from Secrets Manager and executes the IAM revocation call.

Secrets Manager - A KMS-encrypted store for the OpenMetadata IAM username and access key ID, keeping the execution logic stateless.
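
To make the routing layer concrete, here is a minimal sketch of an EventBridge event pattern that matches alarms entering the ALARM state. The `source`, `detail-type`, and `detail.state.value` fields follow the shape of CloudWatch Alarm State Change events; the alarm names and the helper function are hypothetical and only illustrate the idea.

```python
import json


def build_alarm_event_pattern(alarm_names: list[str]) -> dict:
    """EventBridge pattern matching the named alarms when they enter ALARM."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": alarm_names,
            "state": {"value": ["ALARM"]},
        },
    }


# Hypothetical alarm names, one per failure mode.
pattern = build_alarm_event_pattern(
    [
        "openmetadata-high-usage",
        "openmetadata-high-failure-rate",
        "openmetadata-spike",
    ]
)
print(json.dumps(pattern, indent=2))
```

Restricting the pattern to named alarms keeps unrelated alarms in the same account from triggering the kill switch.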

Core Technical Behaviour

The system remains passive until a threshold is breached. CloudWatch tracks ProcessedBytes and query failure counts at the workgroup level. When a metric crosses a threshold, the alarm transitions to ALARM state.

EventBridge detects this state change and triggers the Lambda function. The Lambda performs two primary operations: it fetches the target IAM metadata from Secrets Manager and calls the IAM API to set the specific access key status to Inactive.
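Python

# Disable the IAM access key via Boto3.
# username and access_key_id are the values fetched from Secrets Manager.
import boto3

iam_client = boto3.client("iam")
iam_client.update_access_key(
    UserName=username,
    AccessKeyId=access_key_id,
    Status="Inactive",
)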

The execution flow is asynchronous. While the Lambda disables the credential, SNS simultaneously sends email notifications to the engineering team. Once the key is inactive, all subsequent Athena queries from the external service fail with authentication errors until manual rotation or reactivation occurs.
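
The two Lambda operations described above can be sketched end to end. This is an illustrative outline rather than the original function: the secret's JSON field names (`username`, `access_key_id`) and the `KILL_SWITCH_SECRET_ID` environment variable are assumptions, and the clients are passed in so the core logic can be exercised without AWS access.

```python
import json
import os


def disable_access_key(secrets_client, iam_client, secret_id: str) -> str:
    """Read the target key from Secrets Manager and set it to Inactive."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    # Field names inside the secret are assumptions for this sketch.
    iam_client.update_access_key(
        UserName=secret["username"],
        AccessKeyId=secret["access_key_id"],
        Status="Inactive",
    )
    return secret["access_key_id"]


def handler(event, context):
    """Lambda entry point for a CloudWatch Alarm State Change event."""
    state = event.get("detail", {}).get("state", {}).get("value")
    if state != "ALARM":
        # Ignore transitions back to OK or INSUFFICIENT_DATA.
        return {"disabled": None}

    import boto3  # imported late so the helper above is testable without AWS

    key_id = disable_access_key(
        boto3.client("secretsmanager"),
        boto3.client("iam"),
        os.environ["KILL_SWITCH_SECRET_ID"],  # hypothetical variable name
    )
    return {"disabled": key_id}
```

Checking the state inside the handler is a defensive choice in this sketch; if the EventBridge rule already filters on ALARM, the check is redundant but harmless.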

Key Engineering Decisions

An IAM user with static credentials was used because OpenMetadata does not support IAM role assumption. Disabling the access key provides the fastest possible revocation without modifying IAM policies or workgroup configurations.

Storing the access key ID and IAM username in Secrets Manager keeps the Lambda stateless. Credential rotation can then happen within the security layer without code changes or redeployments of the Lambda infrastructure.

Three independent alarms were chosen over a composite alarm so that any single failure mode (sustained volume, high failure rates, or sudden spikes) triggers the switch immediately. A composite alarm built on an AND rule would have required multiple conditions to be met simultaneously.

Direct EventBridge-to-Lambda integration was selected over Step Functions for this path. While the platform's S3-triggered Glue pipeline uses Step Functions for stateful orchestration, the kill switch is a single, stateless API call where added orchestration would only increase latency.

Configurable Terraform variables for thresholds allow environment-specific tuning. This enables tighter cost controls in staging and more relaxed limits in production without modifying the underlying logic.
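
As a rough illustration of independent alarms with per-environment thresholds, the sketch below builds keyword arguments for CloudWatch's `put_metric_alarm` call. The threshold values, alarm names, sustained-usage window, and the single `WorkGroup` dimension are all assumptions made for the example; only the `ProcessedBytes` metric and the 60-second spike period come from the description in this article. The failure-rate alarm is left out because its metric is not specified here.

```python
GB = 1024 ** 3

# Hypothetical per-environment thresholds, standing in for Terraform variables.
THRESHOLDS = {
    "staging": {"sustained_gb": 5, "spike_gb": 2},
    "production": {"sustained_gb": 50, "spike_gb": 10},
}


def processed_bytes_alarm(name, workgroup, threshold_bytes, period_seconds, evaluation_periods):
    """Keyword arguments for put_metric_alarm on Athena ProcessedBytes."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/Athena",
        "MetricName": "ProcessedBytes",
        # Assumed dimension set; confirm which combinations the workgroup publishes.
        "Dimensions": [{"Name": "WorkGroup", "Value": workgroup}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def alarms_for(environment, workgroup="openmetadata"):
    """Build the sustained-usage and spike alarms for one environment."""
    limits = THRESHOLDS[environment]
    prefix = f"{workgroup}-{environment}"
    return [
        # Sustained usage: three consecutive 5-minute windows (assumed values).
        processed_bytes_alarm(f"{prefix}-high-usage", workgroup, limits["sustained_gb"] * GB, 300, 3),
        # Spike: a single 60-second window.
        processed_bytes_alarm(f"{prefix}-spike", workgroup, limits["spike_gb"] * GB, 60, 1),
    ]
```

Each dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Because the alarms are separate, either one entering ALARM is enough to emit a state change event on its own.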

Trade-offs

Optimized for: speed of response and operational simplicity. The system executes in seconds with a minimal codebase and no external dependencies beyond native AWS APIs.

Sacrificed: self-healing. The system requires a deliberate manual action by a platform engineer to investigate the root cause and re-enable or rotate credentials.

The system lacks a dead-letter queue on the Lambda invocation. If the IAM API call fails or Secrets Manager is throttled, the system relies on standard Lambda async retries without a secondary alerting path for the kill switch's own failure.

The spike alarm uses a fixed 60-second period. This fixed window cannot be adjusted via Terraform variables, meaning a legitimate high-volume schema discovery scan could trigger a false positive that requires a code change to tune.

The high-usage alarm does not have a direct Lambda action assigned in its configuration. It relies entirely on the EventBridge rule pattern match for routing, which differs from how the other alarms utilize direct actions.

Results / Cost Impact

The system introduces negligible ongoing costs as it is entirely event-driven. It eliminates the response window between a threshold breach and human intervention, which is critical for Athena where billing is processed per byte scanned. The platform team receives immediate SNS notifications while the automated response ensures that financial exposure is capped within seconds of a breach.

Conclusion

This architecture uses CloudWatch, EventBridge, and Lambda to create a production-grade cost control mechanism for managed query services. By targeting the IAM credential layer, the system provides a reversible but immediate response to misbehaving external connectors.

Automated cost control is most effective when it targets well-defined service boundaries with fast event routing and manual recovery.

Need Help?

If you're working on similar infrastructure challenges around AWS cost control, data platform access governance, or IAM-level automation, feel free to reach out at hello@jakops.cloud.
