The Problem We Were Actually Solving
We needed a staging environment that could tolerate the stupidity of humans, not a toy cluster that looked like production but couldn't survive a mistyped curl flag. The real requirement was: if an engineer turns staging into a dumpster fire at 3 a.m., nothing outside staging should notice. We also needed to ship a new subscription checkout flow for creators in countries where PayPal blocks transactions. The flow had to store settlement schedules, retry failed charges, and emit events to Kafka so analytics could bill creators in USD without touching the blocked jurisdiction. The first cut used DynamoDB with on-demand capacity and TTLs, but the finance team vetoed it because the eventual-consistency model could under-charge a creator in Kazakhstan by 0.03 USD and we wouldn't know for 12 hours.
What We Tried First (And Why It Failed)
We started with a Terraform module for RDS Postgres 14, with the instance class set to db.t3.medium, publicly_accessible = true, and storage_encrypted = false. It passed the linting stage because the linter only checked for AWS tags. We deployed staging with terraform apply -auto-approve -var environment=staging. Two weeks later an intern ran a chaos experiment that killed the primary node. Prometheus screamed about 503s on /health, but the auto-scaling policy had cooldown = 300 seconds, and the replacement node took 7 minutes to come up because the init script downloaded 1.2 GB of fonts for a demo dashboard. The payment service also didn't retry DNS lookups, so the first 480 requests failed.
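A lint step that reads the plan rather than the tags would have caught both settings. The following is a minimal sketch, assuming the plan is exported with terraform show -json and parsed into a dictionary. The function name, the sample plan, and the resource address are hypothetical, and treating a missing storage_encrypted as a failure is a deliberately strict assumption, not something the original linter did.

```python
import json


def find_risky_rds(plan):
    """Return a list of findings for planned aws_db_instance resources.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    """
    findings = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_db_instance":
            continue
        # "after" holds the attribute values the resource would have once
        # applied; it is null for resources that are being destroyed.
        after = (rc.get("change") or {}).get("after")
        if after is None:
            continue
        if after.get("publicly_accessible") is True:
            findings.append(rc["address"] + ": publicly_accessible is true")
        # Assumption: a missing or null value counts as unencrypted.
        if after.get("storage_encrypted") is not True:
            findings.append(rc["address"] + ": storage_encrypted is not true")
    return findings


# Hypothetical, heavily trimmed plan; the resource address is made up.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {
      "address": "module.staging_db.aws_db_instance.this",
      "type": "aws_db_instance",
      "change": {
        "actions": ["create"],
        "after": {
          "instance_class": "db.t3.medium",
          "publicly_accessible": true,
          "storage_encrypted": false
        }
      }
    }
  ]
}
""")

if __name__ == "__main__":
    for finding in find_risky_rds(SAMPLE_PLAN):
        print(finding)
```

In a pipeline, a non-empty result would fail the job before terraform apply runs.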
We also tried a separate staging Kafka cluster (kafka.t3.small, three brokers) to test exactly-once semantics. The topic was auto-created with retention.ms=604800000 (seven days), which meant a single misconfigured producer could retain a week of test messages and run the cluster out of disk in six hours. Kafka Manager showed the disk usage in red, but the on-call rotation had no threshold rule for broker disk < 30 GB. The staging alert router pointed to a Slack webhook that posted to #staging-alerts; the channel had 477 muted threads, and the message scrolled out of view before anyone noticed.
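The arithmetic behind that failure is easy to check. The sketch below converts the retention setting to days and estimates how long a steady producer takes to fill the brokers when nothing ages out in the meantime. The 30 GB threshold and the retention value come from the text; the free-space figure, the ingest rate, the replication factor of 3, and the reading of the threshold as free space are all assumptions chosen for illustration.

```python
RETENTION_MS = 604_800_000   # retention.ms quoted in the text
DISK_ALERT_GB = 30.0         # threshold the on-call rules were missing


def retention_days(retention_ms):
    """Convert a Kafka retention.ms value to days."""
    return retention_ms / (1000 * 60 * 60 * 24)


def hours_until_full(free_bytes, ingest_bytes_per_sec, replication_factor=3):
    """Hours until the cluster runs out of disk.

    Assumes no segment is deleted during the window, which holds whenever
    the window is shorter than the retention period.
    """
    if ingest_bytes_per_sec <= 0:
        return float("inf")
    # Every produced byte is written once per replica across the cluster.
    cluster_write_rate = ingest_bytes_per_sec * replication_factor
    return free_bytes / cluster_write_rate / 3600


def disk_alert(free_gb, threshold_gb=DISK_ALERT_GB):
    """True when a broker's free disk is below the alert threshold."""
    return free_gb < threshold_gb


if __name__ == "__main__":
    print(retention_days(RETENTION_MS))                 # days of retention
    # Hypothetical numbers: 324 GB free in total, producer at 5 MB/s.
    print(hours_until_full(324e9, 5e6))                 # hours until full
    print(disk_alert(25.0))                             # would the rule fire?
```

With seven days of retention and a fill time measured in hours, retention never gets a chance to free space, so only a disk alert or a per-topic retention.bytes cap would help.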
The Architecture Decision
We rebuilt staging to be disposable and observable.
First, we replaced the Terraform RDS module with an ephemeral Postgres in Kubernetes using the zalando/postgres-operator. The operator runs a primary pod with a sidecar pgbouncer, a standby set, and a volume snapshot every hour. The snapshot lands in an S3 bucket with SSE-KMS, so even if the cluster melts, a simple kubectl apply -f restore.yaml brings it back in five minutes. We turned public accessibility off and added an AWS PrivateLink endpoint so the Kubernetes cluster could reach the DB without an internet gateway. We now run terraform apply with -target=module.vpc -target=module.eks first to prevent route table race conditions.
Second, we moved the checkout retry logic from HTTP to a durable queue. Instead of the legacy PHP helper that multiplied null by a million, we wrote a Go service that consumes from an SQS FIFO queue named checkout-payments.fifo with the message group ID set to the creator's UUID. The queue has a visibility timeout of 300 seconds and a maximum receive count of 3. The service writes to Postgres in a transaction, then publishes a Kafka event only after commit. We disabled TTL on the queue because we never want a subscription schedule to evaporate during an outage. The service exports a Prometheus metric checkout_service_retry_count{queue="fifo"}, and an alert fires if the count exceeds 5 per minute.
Third, we built a nightly chaos pipeline that runs in staging: it randomly kills the Postgres primary, injects 500 ms latency on the PrivateLink, and rewrites a checkout row to status=failed_then_retry=true. The pipeline is scheduled by GitHub Actions at 02:00 UTC and posts results to a dedicated Slack channel. If the primary doesn't recover within five minutes, the pipeline pages the on-call rotation. The chaos job uses terraform destroy -target=module.staging_db followed by terraform apply to ensure