The Problem We Were Actually Solving
Our creators in Lahore, Karachi, and Islamabad needed to receive payouts. Stripe didnt support Pakistani creators. PayPal wouldnt touch them with a ten-foot pole. Local banks had 24-hour settlement windows and 3% cross-border fees. We had to move money within minutes, not days, and we had to prove every cent was where it belonged before the next ad break.
The real engineering challenge wasnt the money—it was the data. Every payout triggered a cascade: ledger entries, fraud review, tax calculations, creator dashboard update, and an email receipt. If the pipeline lagged by more than 30 minutes, creators flooded support. If the numbers didnt match between our ledger and JazzCashs, the finance team locked the spreadsheet and emailed me at 2 a.m.
We measured success with one metric: latency from checkout to creator payout confirmation. Target: under 15 minutes end-to-end. Failure meant cancellations and refunds that cost us more than the transaction itself.
What We Tried First (And Why It Failed)
We bolted Stripe webhooks onto our Snowflake warehouse. Each webhook fired a JSON blob that landed in S3. An Airflow DAG read the blob, applied a CASE statement to map JazzCash transaction IDs, and wrote the row to the payout table. Expected latency: 2 minutes. Reality: 47 minutes.
The problem wasnt the warehouse—it was the staging layer. Airflows S3 sensor polled every 5 minutes. That introduced a 150-second blind spot. Worse, if S3 object creation timestamps showed up out of order (because of eventual consistency or JazzCashs retries), Airflow treated them as late and skipped them. We lost $12,400 in payouts before we noticed the silent failures.
I added a Kafka mirror of the webhook stream. Kafka read every message within 200 ms. Then I wrote a Python UDF in Snowflake that consumed the topic via a Kafka connector. Freshness improved to 8 minutes. Still not enough. The UDF blocked warehouse compute for two seconds per message, and our finance team started throttling the warehouse because report dashboards were timing out.
We tried Flink next. It achieved 90-second latency on a good day. But Flinks state backend was RocksDB on local disk. When the pod restarted, we replayed two hours of events. That violated our 15-minute SLA. Our SRE team refused to let Flink run in production until we added a Kafka exactly-once checkpointing layer—another three weeks of yak shaving.
The Architecture Decision
We ripped out the warehouse as a processing engine.
The new stack:
- Ingestion: HTTP endpoints writing directly to Kafka topics sharded by country.
- Stream processing: ksqldb in EC2 instances with 10-minute retention, exactly-once semantics, and disk snapshots every hour.
- Storage: Iceberg tables on S3 with 5-minute compaction cadence.
- Serving: Postgres read replicas for creator dashboards, updated via Debezium CDC from Iceberg tables.
- Cost control: Spot instances for ksqldb, kept only during peak hours (6 p.m. to 11 p.m. Pakistan time).
Latency after go-live:
- Checkout to Kafka: 200 ms
- Kafka to ksqldb aggregation: 90 seconds
- ksqldb to Iceberg: 2 seconds
- Iceberg to Postgres: 15 seconds Total payout confirmation: 107 seconds.
We also instrumented a new metric: ledger drift. Every ksqldb query includes a checksum of the last 10,000 transaction rows. If the checksum changes between two consecutive runs, we trigger a PagerDuty alert and freeze the payout queue until teams reconcile. This caught a JazzCash duplicate event last month and saved us $8,200 in duplicate refunds.
Cost per transaction dropped from $0.22 on Snowflake/Airflow to $0.05 on spot Kafka + ksqldb + Iceberg. The finance team stopped complaining about warehouse bills.
What The Numbers Said After
Latency distribution over 30 days:
- P50: 87 seconds
- P95: 146 seconds
- P99: 228 seconds
- Breaches: 3 (all caused by JazzCash API brownouts)
Cost per million transactions:
- Dec 2025: $18,100 (warehouse + Airflow)
- Jan 2026: $4,500 (Kafka + ksqldb + Iceberg)
- Feb 2026: $4,900 (added redundancy for spot failures)
Freshness SLA compliance:
- Jan 2026: 98.7%
- Feb 2026: 99.4%
- Mar 2026: 99.9% (after adding a Redis cache for last-known balances)
These numbers arent abstract. When a creator in Faisalabad streams live and the dashboard says their payout arrived, its not a cache—its the ksqldb state file pinned in memory on ksqldb-03 at 9:15 p.m. that matches the JazzCash API response we received at 9:13 p.m.
What I Would Do Differently
I would not have tried to make Snowflake the center of the universe for a real-time ledger. Warehouses are for analytics, not for 100 transactions per second.
I would have started with a topic-per-country design from day one instead of sharding later. Our first Kafka cluster had a single topic with 50 partitions. Adding a country caused partition rebalances that dropped throughput 40% for 20 minutes. Now we have topic patterns like payments-pakistan-live. No more rebalances.
I would skip Flink entirely unless I needed stateful joins longer than 24 hours. ksqldbs windowed aggregations are simpler and keep the state in memory instead of RocksDB, so pod restarts only lose the last 30 seconds of
Top comments (0)