At large-scale consumer platforms, product reuse can quietly turn into technical debt.
In this case, a non-monetized engagement points feature shared a backend with a wallet-based monetization service.
The engagement system handled over a million transactions per second, far beyond what the transactional payments backend was built for.
What began as a pragmatic shortcut led to rising latency, ballooning storage costs, and operational strain on a system optimized for financial correctness rather than lightweight interactivity.
The only sustainable fix was to decouple them completely.
Problem Context
The monetization system prioritized:
- Durable writes
- Idempotency
- Strong transactional guarantees
The engagement system prioritized:
- Speed
- Throughput
- Low operational cost
Because the two shared a backend, the engagement workload forced the payments infrastructure to scale inefficiently, adding expensive transactional overhead to traffic that didn’t need it.
Scaling further wasn’t the answer. Isolation was.
Designing the New Backend
The decoupling required two major components:
- A dedicated API and data model built for high-TPS, low-latency operations.
- A live migration pipeline capable of achieving full data parity with zero downtime.
Infrastructure Overview
| Component | Purpose |
|---|---|
| Go microservice | Core logic and API layer |
| Protobuf + gRPC | Internal RPC communication |
| AWS DynamoDB | Primary datastore: high throughput, flexible schema |
| AWS Kinesis Streams | Real-time change-data capture |
| AWS Lambda functions | Stream processors handling event ordering and writes |
| Redis cache | Idempotency layer to prevent duplicate writes during dual-write and stream replay |
| Terraform | Infrastructure-as-code provisioning |
| CloudWatch metrics | Observability for throughput, lag, and latency |
| Feature flag service | Safe rollout and traffic control |
The new backend used a lightweight schema aligned to engagement interactions: simpler, cheaper, and better suited to massive write volume.
Architecture Diagrams
Diagram 1: System Overview
flowchart TD
A[Engagement API] --> B[Old DynamoDB Table]
B --> C[AWS Kinesis Stream]
C --> D[AWS Lambda Functions]
D --> E[New DynamoDB Table]
D --> F["Redis Cache<br/>(idempotency)"]
F --> E
E --> G[Feature-flag Dual-Write]
Diagram 2: Migration Lifecycle
flowchart TD
A[1. Export snapshot → S3 → Import new table] --> B[2. Sync updates via Kinesis + Lambda]
B --> C[3. Redis ensures idempotency<br/>for dual-writes & replayed events]
C --> D[4. Feature flag directs dual-writes]
D --> E[5. Cutover & validation]
Migration Strategy
Building the API was straightforward.
Migrating live data at 1M+ TPS without downtime was the challenge.
1. Snapshot Bootstrap
AWS provides a built-in mechanism to export a DynamoDB table snapshot to S3, which can then be imported into a new table.
This seeded the new database with a point-in-time baseline, with no long-running scans or Glue jobs required.
2. Real-Time Sync via Kinesis + Lambda
Once the snapshot was imported, Kinesis Streams captured every subsequent change (insert, update, delete) from the source table.
Each event was processed by an AWS Lambda consumer that replayed the change into the new DynamoDB table.
Maintaining transaction order was critical: out-of-sequence events could cause corruption or lost updates.
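One common way to guard against out-of-sequence events is to apply a change only when it is newer than what the destination already holds. The sketch below illustrates that idea with an in-memory map standing in for the new DynamoDB table. The `ChangeEvent` shape, the per-key `Sequence` field, and the `Table` type are all assumptions made for illustration; the original pipeline's event format and ordering mechanism are not described here. Against DynamoDB itself, the same check would typically be expressed as a conditional write.

```go
package main

import "fmt"

// ChangeEvent is a hypothetical shape for one replicated change.
type ChangeEvent struct {
	Key      string // item key in the source table
	Sequence int64  // assumed to increase with every change to Key
	Points   int64  // value carried by the change
	Deleted  bool   // true when the change is a delete
}

// Table is an in-memory stand-in for the destination table.
type Table struct {
	items map[string]ChangeEvent
}

func NewTable() *Table {
	return &Table{items: make(map[string]ChangeEvent)}
}

// Apply stores ev only when it is newer than the stored change for the same
// key, and reports whether the event was applied. Deletes are kept as
// tombstones so a late, older update cannot resurrect the item.
func (t *Table) Apply(ev ChangeEvent) bool {
	if cur, ok := t.items[ev.Key]; ok && ev.Sequence <= cur.Sequence {
		return false // stale or duplicate: drop it
	}
	t.items[ev.Key] = ev
	return true
}

// Get returns the current points for key, or false if absent or deleted.
func (t *Table) Get(key string) (int64, bool) {
	cur, ok := t.items[key]
	if !ok || cur.Deleted {
		return 0, false
	}
	return cur.Points, true
}

func main() {
	t := NewTable()
	t.Apply(ChangeEvent{Key: "user-1", Sequence: 2, Points: 50})
	applied := t.Apply(ChangeEvent{Key: "user-1", Sequence: 1, Points: 10}) // arrives late
	points, _ := t.Get("user-1")
	fmt.Println(applied, points) // prints "false 50"
}
```

With a guard like this, replaying the same batch twice or receiving events in the wrong order leaves the destination unchanged, which also makes the snapshot-plus-stream handoff more forgiving.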
To handle retries and potential duplicate delivery, I introduced a Redis-based idempotency layer.
Each event carried a unique transaction ID. Before processing, Lambda performed a fast Redis lookup to check whether that ID had already been written.
If found, the event was skipped, eliminating double writes both from Kinesis replays and from the feature-flagged dual-write traffic hitting the same endpoint.
This lightweight Redis layer made the migration safe, giving effectively exactly-once writes without compromising throughput.
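A minimal sketch of this kind of idempotency check follows. It uses an in-memory map as a stand-in for Redis so that it runs without any external service. Note one deliberate difference from the lookup described above: instead of a separate read followed by a write, it claims the transaction ID atomically, mirroring the semantics of Redis `SET key value NX EX seconds`, which is one common way to avoid two consumers passing the check at the same time. The names `IdempotencyStore`, `Claim`, `Release`, `ProcessOnce`, and the 24-hour `DefaultTTL` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// DefaultTTL is an assumed retention window for processed transaction IDs.
const DefaultTTL = 24 * time.Hour

// ErrTransient is a hypothetical write failure, useful when exercising retries.
var ErrTransient = errors.New("transient write failure")

// IdempotencyStore is an in-memory stand-in for Redis.
type IdempotencyStore struct {
	mu      sync.Mutex
	expires map[string]time.Time
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{expires: make(map[string]time.Time)}
}

// Claim marks id as handled and reports whether this caller won the claim.
// Check and set happen under one lock, and the entry expires after ttl,
// mimicking Redis "SET id 1 NX EX <ttl>".
func (s *IdempotencyStore) Claim(id string, ttl time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	if exp, ok := s.expires[id]; ok && now.Before(exp) {
		return false
	}
	s.expires[id] = now.Add(ttl)
	return true
}

// Release drops a claim so that a failed write can be retried.
func (s *IdempotencyStore) Release(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.expires, id)
}

// ProcessOnce runs write at most once per transaction ID within the TTL.
// It reports whether write ran successfully in this call.
func ProcessOnce(s *IdempotencyStore, id string, ttl time.Duration, write func() error) (bool, error) {
	if !s.Claim(id, ttl) {
		return false, nil // already handled, or in flight elsewhere
	}
	if err := write(); err != nil {
		s.Release(id) // let a retry try again
		return false, err
	}
	return true, nil
}

func main() {
	store := NewIdempotencyStore()
	writes := 0
	write := func() error { writes++; return nil }
	for i := 0; i < 3; i++ { // the same event delivered three times
		ProcessOnce(store, "txn-123", DefaultTTL, write)
	}
	fmt.Println(writes) // prints "1"
}
```

One trade-off of claiming before writing: if the process dies between the claim and the write, the ID stays claimed until the TTL expires. Pairing the claim with a destination write that is itself safe to repeat narrows that window.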
Monitoring IteratorAge and Duration metrics in CloudWatch remained critical.
If IteratorAge rose, the stream was falling behind, which meant the consumers needed either smaller batches or more concurrency.
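The tuning decision can be sketched as a small heuristic over recent metric samples. This is an illustration only: the `Recommend` function, the strictly-rising trend test, and the 80% of timeout threshold are assumptions, not values taken from the original pipeline. In practice the two levers would map to settings such as the batch size and parallelization factor on the Lambda event source mapping.

```go
package main

import "fmt"

// Action is a hypothetical tuning recommendation.
type Action string

const (
	Hold           Action = "hold"
	ShrinkBatch    Action = "reduce batch size"
	AddConcurrency Action = "add concurrency"
)

// rising reports whether every sample is larger than the one before it.
func rising(samples []float64) bool {
	if len(samples) < 2 {
		return false
	}
	for i := 1; i < len(samples); i++ {
		if samples[i] <= samples[i-1] {
			return false
		}
	}
	return true
}

// Recommend looks at recent IteratorAge samples (oldest first, milliseconds)
// plus the average invocation Duration and the function timeout.
//   - IteratorAge not rising: consumers are keeping up, so hold.
//   - Rising and Duration near the timeout: each batch is too heavy.
//   - Rising otherwise: more parallel consumers are needed.
// The 0.8 factor is an arbitrary illustrative threshold.
func Recommend(iteratorAgeMs []float64, avgDurationMs, timeoutMs float64) Action {
	if !rising(iteratorAgeMs) {
		return Hold
	}
	if avgDurationMs > 0.8*timeoutMs {
		return ShrinkBatch
	}
	return AddConcurrency
}

func main() {
	fmt.Println(Recommend([]float64{120, 450, 900}, 300, 60000)) // prints "add concurrency"
}
```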
With tuning and caching in place, the pipeline kept pace with over a million updates per second.
The full migration completed within hours, not days.
Cutover with Feature Flags
After the real-time sync stabilized, I rolled out the new backend via a feature-flagged dual-write:
- Dual-write requests to both APIs.
- Use Redis for idempotency checks to prevent duplicate writes.
- Validate data parity.
- Monitor Kinesis lag until it reaches zero.
- Cut traffic to the old API.
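The rollout steps above can be sketched as a small router that reads a flag and decides which backend receives each write. Everything named here is hypothetical: the `Store` interface, the three `Mode` stages, the in-memory `Flags` and `MemStore` stand-ins, and the choice to keep the old backend authoritative during dual-write (errors from the new backend are counted rather than returned). The real service's API and flag client are not shown in this article.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Store is a hypothetical write interface implemented by both backends.
type Store interface {
	AddPoints(userID string, delta int64) error
}

// Mode is the rollout stage read from the feature flag service.
type Mode int32

const (
	OldOnly   Mode = iota // all writes go to the legacy backend
	DualWrite             // writes go to both; legacy stays authoritative
	NewOnly               // cutover complete
)

// Flags is an in-memory stand-in for a feature flag client.
type Flags struct{ mode int32 }

func (f *Flags) Set(m Mode) { atomic.StoreInt32(&f.mode, int32(m)) }
func (f *Flags) Mode() Mode { return Mode(atomic.LoadInt32(&f.mode)) }

// MemStore is an in-memory Store used for the demo.
type MemStore struct {
	mu     sync.Mutex
	points map[string]int64
}

func NewMemStore() *MemStore { return &MemStore{points: make(map[string]int64)} }

func (s *MemStore) AddPoints(userID string, delta int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.points[userID] += delta
	return nil
}

func (s *MemStore) Points(userID string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.points[userID]
}

// Router sends each write to the backend(s) selected by the flag.
type Router struct {
	shadowErrs int64 // failed writes to the new backend during dual-write
	flags      *Flags
	old, next  Store
}

func NewRouter(flags *Flags, old, next Store) *Router {
	return &Router{flags: flags, old: old, next: next}
}

func (r *Router) AddPoints(userID string, delta int64) error {
	switch r.flags.Mode() {
	case NewOnly:
		return r.next.AddPoints(userID, delta)
	case DualWrite:
		if err := r.old.AddPoints(userID, delta); err != nil {
			return err
		}
		if err := r.next.AddPoints(userID, delta); err != nil {
			atomic.AddInt64(&r.shadowErrs, 1) // record, but do not fail the request
		}
		return nil
	default:
		return r.old.AddPoints(userID, delta)
	}
}

// ShadowErrors returns how many dual-writes failed on the new backend.
func (r *Router) ShadowErrors() int64 { return atomic.LoadInt64(&r.shadowErrs) }

func main() {
	flags := &Flags{}
	old, next := NewMemStore(), NewMemStore()
	r := NewRouter(flags, old, next)

	r.AddPoints("user-1", 5) // OldOnly
	flags.Set(DualWrite)
	r.AddPoints("user-1", 5) // both backends
	flags.Set(NewOnly)
	r.AddPoints("user-1", 5) // new backend only

	fmt.Println(old.Points("user-1"), next.Points("user-1")) // prints "10 10"
}
```

The demo shows routing only. In the migration described here, the new table's baseline came from the snapshot and the stream replay, and the idempotency layer kept the dual-write and replay paths from double-applying the same transaction.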
Once validation passed, the engagement service ran entirely on its new infrastructure.
The monetization system was finally free of the extra load, and both systems could scale independently.
Safety and Verification
- Extended Kinesis retention to 24+ hours during migration.
- Kept the source table intact until post-cutover validation completed.
- Used Redis TTLs to automatically expire processed transaction keys, keeping cache cost minimal.
- Continuously compared record counts and hash digests between tables during dual-write.
These guardrails ensured recovery options, consistency, and full traceability throughout the migration.
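The count-and-digest comparison can be sketched as follows. This is one possible approach under stated assumptions, not the original tooling: each table is represented as a simple map of key to points, and the digest is the XOR of per-record SHA-256 hashes, which makes it independent of scan order. The `Summary`, `Summarize`, and `InParity` names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Summary is a compact fingerprint of a table's contents.
type Summary struct {
	Count  int
	Digest [sha256.Size]byte
}

// Summarize folds every record into a count and an order-independent digest.
// XOR-combining per-record hashes means two scans that visit records in
// different orders still produce the same Summary.
func Summarize(items map[string]int64) Summary {
	var s Summary
	for key, points := range items {
		// Length-prefix the key so that different key/value pairs cannot
		// serialize to the same byte string.
		h := sha256.Sum256([]byte(fmt.Sprintf("%d:%s=%d", len(key), key, points)))
		for i := range s.Digest {
			s.Digest[i] ^= h[i]
		}
		s.Count++
	}
	return s
}

// InParity reports whether two summaries describe identical contents.
func InParity(a, b Summary) bool { return a == b }

func main() {
	source := map[string]int64{"user-1": 10, "user-2": 25}
	target := map[string]int64{"user-2": 25, "user-1": 10}
	fmt.Println(InParity(Summarize(source), Summarize(target))) // prints "true"

	target["user-2"] = 24 // simulate a lost update
	fmt.Println(InParity(Summarize(source), Summarize(target))) // prints "false"
}
```

A digest like this says that a mismatch exists, not where it is. Computing it per key range or per partition narrows down the affected records at the cost of more comparisons.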
Results and Takeaways
- Kinesis Streams + Lambda enable live, high-TPS migrations when tuned for throughput and ordering.
- Redis caching ensures idempotency and prevents double writes under dual-write conditions.
- Feature-flag rollouts provide control and observability for safe cutovers.
- Decoupling mismatched systems is often cheaper and safer than scaling them together.
“Scaling isn’t always about adding resources; sometimes it’s about removing coupling.”
After the final cutover, metrics flatlined exactly where they should, and I finally took that long-delayed vacation.
Technologies Referenced
AWS Lambda — Serverless compute service for running backend logic without managing servers.
Amazon DynamoDB — Fully managed NoSQL database optimized for high-throughput workloads.
Amazon Kinesis Data Streams — Real-time event streaming service used for data replication and ingestion.
AWS CloudWatch — Metrics and observability platform for monitoring throughput, latency, and iterator age.
Amazon S3 — Object storage service used for snapshot exports and imports.
Redis — In-memory cache used here for idempotency checks during dual-writes and stream replay.
Terraform — Infrastructure-as-code tool for provisioning and managing AWS resources.
gRPC — High-performance RPC framework for service-to-service communication.
Protocol Buffers (Protobuf) — Serialization format used with gRPC to define and enforce API contracts.
Feature Toggles / Flags — Technique for gradual rollouts and safe cutovers.
Go Concurrency: Goroutines — Lightweight thread mechanism for concurrent workloads.
Go Channels — Synchronization and communication primitive used for concurrent fan-out/fan-in patterns.
Written by Stephanie Dover, Software Engineer 10+ YOE, ex GitHub, Twitch, Microsoft. Creator of Klaussy.