Stephanie Dover

Posted on Jun 23

Decoupling a High-Throughput Engagement Service from a Monetization System

#architecture #database #distributedsystems #systemdesign

At large-scale consumer platforms, product reuse can quietly turn into technical debt.

In this case, a non-monetized engagement points feature shared a backend with a wallet-based monetization service.

The engagement system handled over a million transactions per second, far beyond what the transactional payments backend was built for.

What began as a pragmatic shortcut led to rising latency, ballooning storage costs, and operational strain on a system optimized for financial correctness rather than lightweight interactivity.

The only sustainable fix was to decouple them completely.

Problem Context

The monetization system prioritized:

Durable writes
Idempotency
Strong transactional guarantees

The engagement system prioritized:

Speed
Throughput
Low operational cost

Because the two shared a backend, the engagement workload forced the payments infrastructure to scale inefficiently, adding expensive transactional overhead to a workload that didn’t need it.

Scaling further wasn’t the answer. Isolation was.

Designing the New Backend

The decoupling required two major components:

A dedicated API and data model built for high-TPS, low-latency operations.
A live migration pipeline capable of achieving full data parity with zero downtime.

Infrastructure Overview

Component	Purpose
Go microservice	Core logic and API layer
Protobuf + gRPC	Internal RPC communication
Amazon DynamoDB	Primary datastore: high throughput, flexible schema
Amazon Kinesis Data Streams	Real-time change-data capture
AWS Lambda functions	Stream processors handling event ordering and writes
Redis cache	Idempotency layer to prevent duplicate writes during dual-write and stream replay
Terraform	Infrastructure-as-code provisioning
CloudWatch metrics	Observability for throughput, lag, and latency
Feature flag service	Safe rollout and traffic control

The new backend used a lightweight schema aligned to engagement interactions: simpler, cheaper, and better suited to massive write volume.

Architecture Diagrams

Diagram 1: System Overview

flowchart TD
    A[Engagement API] --> B[Old DynamoDB Table]
    B --> C[AWS Kinesis Stream]
    C --> D[AWS Lambda Functions]
    D --> E[New DynamoDB Table]
    D --> F[Redis Cache<br/>(idempotency)]
    F --> E
    E --> G[Feature-flag Dual-Write]

Diagram 2: Migration Lifecycle

flowchart TD
    A[1. Export snapshot → S3 → Import new table] --> B[2. Sync updates via Kinesis + Lambda]
    B --> C[3. Redis ensures idempotency<br/>for dual-writes & replayed events]
    C --> D[4. Feature flag directs dual-writes]
    D --> E[5. Cutover & validation]

Migration Strategy

Building the API was straightforward.

Migrating live data at 1M+ TPS without downtime was the challenge.

1. Snapshot Bootstrap

AWS provides a built-in mechanism to export a DynamoDB table snapshot to S3, which can then be imported into a new table.

This seeded the new database with a point-in-time baseline, with no long-running scans or Glue jobs required.

2. Real-Time Sync via Kinesis + Lambda

Once the snapshot was imported, Kinesis Streams captured every subsequent change (insert, update, delete) from the source table.

Each event was processed by an AWS Lambda consumer that replayed the change into the new DynamoDB table.

Maintaining transaction order was critical: out-of-sequence events could cause corruption or lost updates.

To handle retries and potential duplicate delivery, I introduced a Redis-based idempotency layer.

Each event carried a unique transaction ID. Before processing, Lambda performed a fast Redis lookup to check whether that ID had already been written.

If found, the event was skipped, eliminating double writes both from Kinesis replays and from the feature-flagged dual-write traffic hitting the same endpoint.

This lightweight Redis layer made the migration safe, giving effectively exactly-once writes without compromising throughput.

Monitoring the IteratorAge and Duration metrics in CloudWatch was critical throughout.

If IteratorAge rose, the consumers were falling behind the stream, which called for either smaller batches or more concurrency.

With tuning and caching in place, the pipeline kept pace with over a million updates per second.
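For a sense of scale, a back-of-the-envelope sizing helps. A provisioned-mode Kinesis shard accepts up to 1,000 records per second and 1 MB per second on writes, so the shard count needed for a given rate is bounded below by both limits. The sketch assumes one stream record per update and uses made-up average record sizes; neither figure comes from this migration, and the result is only the ingest floor, not a consumer-side tuning target.

```go
package main

import "fmt"

// Per-shard write limits for a provisioned-mode Kinesis data stream.
const (
	shardRecordsPerSec = 1_000
	shardBytesPerSec   = 1_000_000 // 1 MB
)

// ceilDiv divides a by b and rounds up. Both values must be positive.
func ceilDiv(a, b int64) int64 { return (a + b - 1) / b }

// MinShards returns the smallest shard count that satisfies both the
// record-count limit and the byte-rate limit.
func MinShards(recordsPerSec, avgRecordBytes int64) int64 {
	byCount := ceilDiv(recordsPerSec, shardRecordsPerSec)
	byBytes := ceilDiv(recordsPerSec*avgRecordBytes, shardBytesPerSec)
	if byBytes > byCount {
		return byBytes
	}
	return byCount
}

func main() {
	// Assumed record sizes, for illustration only.
	fmt.Println(MinShards(1_000_000, 200))  // record-count limit dominates: 1000
	fmt.Println(MinShards(1_000_000, 2048)) // byte-rate limit dominates: 2048
}
```

With small records the record-count limit decides the answer; once the average record grows past 1,000 bytes, the byte-rate limit takes over.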

The full migration completed within hours, not days.

Cutover with Feature Flags

After the real-time sync stabilized, I rolled out the new backend via a feature-flagged dual-write:

Dual-write requests to both APIs.
Use Redis for idempotency checks to prevent duplicate writes.
Validate data parity.
Monitor Kinesis lag until zero.
Cut traffic to the old API.
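The dual-write step in the list above might look roughly like the sketch below. All names (`Writer`, `Flags`, `DualWriter`, `recorder`) are hypothetical, and the percentage-based rollout is an assumption about how the flag was used. The old backend stays authoritative: a failure on the mirrored write is reported but does not fail the request.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// Writer is a hypothetical interface over the old and new backends.
type Writer interface {
	Write(txID string, delta int64) error
}

// Flags holds rollout state; in practice it would come from the flag service.
type Flags struct {
	DualWritePercent uint32 // 0..100 share of transactions mirrored to the new backend
}

// inRollout buckets a transaction ID deterministically, so retries of the
// same transaction always take the same path.
func inRollout(txID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(txID))
	return h.Sum32()%100 < percent
}

// DualWriter keeps the old backend authoritative and mirrors to the new one.
type DualWriter struct {
	Old, New      Writer
	Flags         *Flags
	OnMirrorError func(txID string, err error) // optional hook for metrics or logs
}

func (d *DualWriter) Write(txID string, delta int64) error {
	if err := d.Old.Write(txID, delta); err != nil {
		return err
	}
	if inRollout(txID, d.Flags.DualWritePercent) {
		if err := d.New.Write(txID, delta); err != nil && d.OnMirrorError != nil {
			d.OnMirrorError(txID, err)
		}
	}
	return nil
}

// recorder is a test double that remembers the IDs it was asked to write.
type recorder struct {
	ids  []string
	fail bool
}

func (r *recorder) Write(txID string, _ int64) error {
	if r.fail {
		return errors.New("backend unavailable")
	}
	r.ids = append(r.ids, txID)
	return nil
}

func main() {
	oldB, newB := &recorder{}, &recorder{}
	flags := &Flags{DualWritePercent: 0}
	dw := &DualWriter{Old: oldB, New: newB, Flags: flags}

	_ = dw.Write("tx-1", 5) // flag off: old backend only
	flags.DualWritePercent = 100
	_ = dw.Write("tx-2", 5) // flag fully on: both backends

	fmt.Println(len(oldB.ids), len(newB.ids))
}
```

Hashing the transaction ID rather than sampling randomly keeps the routing stable across retries, which matters when the idempotency layer sits behind the same endpoint.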

Once validation passed, the engagement service ran entirely on its new infrastructure.

The monetization system was finally free of the extra load, and both systems could scale independently.

Safety and Verification

Extended Kinesis retention to 24+ hours during migration.
Kept the source table intact until post-cutover validation completed.
Used Redis TTLs to automatically expire processed transaction keys, keeping cache cost minimal.
Continuously compared record counts and hash digests between tables during dual-write.

These guardrails ensured recovery options, consistency, and full traceability throughout the migration.
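The count-and-digest comparison can be sketched as follows. This is one possible approach, not necessarily the one used here: each record is hashed, and the hashes are combined with XOR so that two scans returning the same rows in different orders produce the same summary. The `Record` and `Summary` types are hypothetical, and the sketch assumes keys are unique within a table.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Record is a hypothetical flattened row from either table.
type Record struct {
	Key    string
	Points int64
}

// Summary is a record count plus an order-independent digest.
type Summary struct {
	Count  int
	Digest [sha256.Size]byte
}

// Summarize hashes each record and XORs the hashes together. XOR is
// commutative, so scan order does not affect the result.
func Summarize(records []Record) Summary {
	var s Summary
	for _, r := range records {
		// Length-prefix the key so field boundaries are unambiguous.
		h := sha256.Sum256([]byte(fmt.Sprintf("%d:%s=%d", len(r.Key), r.Key, r.Points)))
		for i := range s.Digest {
			s.Digest[i] ^= h[i]
		}
		s.Count++
	}
	return s
}

func main() {
	a := []Record{{"user#1", 10}, {"user#2", 25}}
	b := []Record{{"user#2", 25}, {"user#1", 10}} // same rows, different order
	c := []Record{{"user#1", 10}, {"user#2", 26}} // one value differs

	fmt.Println(Summarize(a) == Summarize(b))
	fmt.Println(Summarize(a) == Summarize(c))
}
```

Because identical records cancel under XOR, the count is compared alongside the digest. Summaries can also be computed per key range, which narrows down where a mismatch lives without rescanning both tables.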

Results and Takeaways

Kinesis Streams + Lambda enable live, high-TPS migrations when tuned for throughput and ordering.
Redis caching ensures idempotency and prevents double writes under dual-write conditions.
Feature-flag rollouts provide control and observability for safe cutovers.
Decoupling mismatched systems is often cheaper and safer than scaling them together.

“Scaling isn’t always about adding resources; sometimes it’s about removing coupling.”

After the final cutover, metrics flatlined exactly where they should, and I finally took that long-delayed vacation.

Technologies Referenced

AWS Lambda — Serverless compute service for running backend logic without managing servers.

Amazon DynamoDB — Fully managed NoSQL database optimized for high-throughput workloads.

Amazon Kinesis Data Streams — Real-time event streaming service used for data replication and ingestion.

AWS CloudWatch — Metrics and observability platform for monitoring throughput, latency, and iterator age.

Amazon S3 — Object storage service used for snapshot exports and imports.

Redis — In-memory cache used here for idempotency checks during dual-writes and stream replay.

Terraform — Infrastructure-as-code tool for provisioning and managing AWS resources.

gRPC — High-performance RPC framework for service-to-service communication.

Protocol Buffers (Protobuf) — Serialization format used with gRPC to define and enforce API contracts.

Feature Toggles / Flags — Technique for gradual rollouts and safe cutovers.

Go Concurrency: Goroutines — Lightweight thread mechanism for concurrent workloads.

Go Channels — Synchronization and communication primitive used for concurrent fan-out/fan-in patterns.

Written by Stephanie Dover, Software Engineer 10+ YOE, ex GitHub, Twitch, Microsoft. Creator of Klaussy.


DEV Community