Mizbauddin Mohammad

Posted on Jun 23 • Originally published at Medium

Strangle the Monolith, Don't Rewrite It: Modernizing a Mission-Critical Payments Core Without a Big-Bang

A system of record is not rewritten. It is starved — one endpoint, one reconciliation, one reversible increment at a time.

The most expensive sentence in enterprise software is spoken with great confidence in a conference room: "We'll freeze new features for two quarters, rewrite the core, and cut over at the end."

I have watched versions of that plan consume years and budgets and careers. The pattern is always the same. The freeze slips. The business cannot actually stop, so a shadow backlog of "just this one change" reopens the old system you swore not to touch. The cutover weekend arrives, the rollback plan is a paragraph nobody has rehearsed, and the go/no-go call becomes a negotiation between exhaustion and fear.

The lesson I have internalized over twenty years of running platforms that are not allowed to fail (payments and ledgers at 50,000+ transactions per second, 99.99% availability, hundreds of consuming applications) is this: legacy modernization is not an engineering problem. It is a risk-management problem that happens to be solved with engineering. Once you accept that framing, the whole strategy inverts. You stop optimizing for "how fast can we replace it?" and start optimizing for "how small and reversible can each step be?"

This article is the architecture I use to make that real. I've also published it as a small, runnable reference implementation (Java / Spring, docker compose up) — link at the end — so this is not theory.

The dial, not the switch

A big-bang cutover is a light switch: off, then on, with a terrifying moment in between. Everything I design replaces that switch with a dial — a traffic weight you can turn from 0% to 10% to 100% and, critically, back to 0% in minutes with no data loss and a clean audit trail.

That single property — reversible at every step — is the north star. Every architectural decision below exists to protect it.

Modernization fails on the seams, not the services. Four decisions hold the seams together.

1. A Strangler Fig facade — so traffic is a dial, not a destiny

Put a routing facade in front of the legacy core and migrate endpoint by endpoint, shifting traffic by weight. New capabilities live as independent services behind the facade; you delete a legacy route only after the new path has proven itself in production, at a percentage you chose.

The senior point isn't the pattern — everyone has read the Fowler essay. It's the discipline the pattern buys you: continuous production validation instead of a single bet. You are never more than one config change away from the last known-good state. The facade becomes a critical, must-be-highly-available component — that is a deliberate, accepted trade, and you engineer it accordingly.
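To make the dial concrete, here is a minimal sketch of weighted routing inside a facade. The class name TrafficDial, the Target enum, and the choice to hash a routing key into 100 buckets are all assumptions made for illustration; they are not taken from the reference implementation, and a production facade would more likely get this from its gateway's weighted-route support.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a per-route traffic dial for a strangler facade. */
final class TrafficDial {
    enum Target { LEGACY, MODERN }

    // Percentage (0..100) of traffic sent to the new service; safe to change at runtime.
    private final AtomicInteger modernPercent = new AtomicInteger(0);

    void setModernPercent(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        modernPercent.set(percent);
    }

    /** Maps a routing key to a stable bucket in 0..99, so one account stays on one side at a given weight. */
    static int bucketOf(String routingKey) {
        return Math.floorMod(routingKey.hashCode(), 100);
    }

    /** Buckets below the current weight go to the new service; everything else stays on legacy. */
    Target route(String routingKey) {
        return bucketOf(routingKey) < modernPercent.get() ? Target.MODERN : Target.LEGACY;
    }

    public static void main(String[] args) {
        TrafficDial dial = new TrafficDial();
        dial.setModernPercent(10);
        System.out.println("acct-42 at 10% -> " + dial.route("acct-42"));
        dial.setModernPercent(0); // turning the dial back is a single write
        System.out.println("acct-42 at 0%  -> " + dial.route("acct-42"));
    }
}
```

Hashing the key rather than rolling a random number per request is a deliberate choice in this sketch: the same account keeps hitting the same system while the weight is unchanged, which makes discrepancies far easier to trace.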

2. An Anti-Corruption Layer — so you don't inherit thirty years of accidental complexity

Every legacy core carries archaeology: columns like ACCT_NO, STAT_CD, ROW_VERS; status codes whose meanings live only in a retired engineer's memory. The single most common way modernization quietly fails is by letting those shapes leak into the new system. The moment they do, your "new" model is coupled to the old one and you've built a more expensive version of what you had.

So legacy state crosses the border only as events, through a translator that maps cryptic legacy fields into a clean domain — explicit Account, Money, real lifecycle states. No new service is permitted to read the legacy database. Decades of quirks are quarantined to one component you can test against real edge cases and throw away at the end. The border is the product.
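A translator of this kind might look roughly like the sketch below. ACCT_NO, STAT_CD and ROW_VERS come from the text above; the BAL_AMT and CCY_CD fields, the status codes "A", "F" and "C", and every class name are hypothetical and exist only to make the example complete.

```java
import java.math.BigDecimal;
import java.util.Currency;

/** Hypothetical anti-corruption layer: the only place that knows legacy field names. */
final class LegacyAccountTranslator {
    /** Raw legacy shape. BAL_AMT (minor units) and CCY_CD are assumed fields, not from the article. */
    record LegacyAccountRow(String ACCT_NO, String STAT_CD, long ROW_VERS, long BAL_AMT, String CCY_CD) {}

    enum AccountStatus { ACTIVE, FROZEN, CLOSED }
    record Money(BigDecimal amount, Currency currency) {}
    record Account(String accountId, AccountStatus status, Money balance, long version) {}

    /** The code-to-state mapping is invented for illustration; unknown codes fail loudly instead of leaking through. */
    static AccountStatus translateStatus(String statCd) {
        switch (statCd.trim()) {
            case "A": return AccountStatus.ACTIVE;
            case "F": return AccountStatus.FROZEN;
            case "C": return AccountStatus.CLOSED;
            default: throw new IllegalArgumentException("Unmapped legacy STAT_CD: " + statCd);
        }
    }

    static Account translate(LegacyAccountRow row) {
        Currency currency = Currency.getInstance(row.CCY_CD());
        // Assumes BAL_AMT is stored in minor units (cents for USD).
        BigDecimal amount = BigDecimal.valueOf(row.BAL_AMT(), currency.getDefaultFractionDigits());
        return new Account(row.ACCT_NO().trim(), translateStatus(row.STAT_CD()),
                new Money(amount, currency), row.ROW_VERS());
    }

    public static void main(String[] args) {
        LegacyAccountRow row = new LegacyAccountRow("0001234  ", "A", 7L, 12345L, "USD");
        System.out.println(translate(row));
    }
}
```

Rejecting an unmapped status code, rather than defaulting it, is the design choice worth copying: an unknown quirk becomes a failing test at the border instead of a silent value inside the new domain.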

3. A transactional outbox for CDC — so truth is never lost in the gap

During coexistence, legacy changes must reach the new platform reliably. The naive approach — write the database, then publish to the event bus — has a silent failure mode: crash between the two and the event is gone forever. In a ledger, a lost event is a lost fact, and lost facts are how you end up explaining a discrepancy to a regulator.

The fix is unglamorous and non-negotiable: write the business change and an outbox row in the same transaction, then relay the outbox to the event bus. No distributed transaction, no dual-write race, no lost truth. Delivery becomes at-least-once, which means every consumer must be idempotent — a constraint, not an afterthought, and one you design for from the first line.
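The mechanics can be sketched without a database. In the example below an in-memory Store stands in for a single relational database, and one synchronized method stands in for one transaction; that substitution, along with every name here, is an assumption for illustration. What the sketch does show faithfully is the ordering: the relay publishes first and marks second, so a crash between the two produces a duplicate that an idempotent consumer absorbs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical in-memory model of the transactional outbox pattern. */
final class OutboxSketch {
    record OutboxRow(long id, String accountId, long newBalanceCents) {}

    /** Stand-in for one database: the business table and the outbox table live behind one lock. */
    static final class Store {
        private final Map<String, Long> balances = new HashMap<>();
        private final List<OutboxRow> outbox = new ArrayList<>();
        private final Set<Long> published = new HashSet<>();
        private long nextId = 1;

        /** Models a single transaction: the balance change and its outbox row commit together. */
        synchronized void updateBalance(String accountId, long newBalanceCents) {
            balances.put(accountId, newBalanceCents);
            outbox.add(new OutboxRow(nextId++, accountId, newBalanceCents));
        }

        synchronized List<OutboxRow> unpublished() {
            List<OutboxRow> rows = new ArrayList<>();
            for (OutboxRow row : outbox) {
                if (!published.contains(row.id())) rows.add(row);
            }
            return rows;
        }

        synchronized void markPublished(long id) { published.add(id); }
    }

    /** Publish, then mark. The flag simulates a crash between the two steps. */
    static void relayOnce(Store store, Consumer<OutboxRow> bus, boolean crashBeforeMark) {
        for (OutboxRow row : store.unpublished()) {
            bus.accept(row);
            if (crashBeforeMark) return; // the row stays unpublished and is re-sent next time
            store.markPublished(row.id());
        }
    }

    /** Idempotent consumer: a row id that was already applied is ignored. */
    static final class IdempotentConsumer implements Consumer<OutboxRow> {
        final Set<Long> seen = new HashSet<>();
        final Map<String, Long> view = new HashMap<>();
        int applied = 0;

        @Override public void accept(OutboxRow row) {
            if (!seen.add(row.id())) return; // duplicate delivery, no effect
            view.put(row.accountId(), row.newBalanceCents());
            applied++;
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        IdempotentConsumer consumer = new IdempotentConsumer();
        store.updateBalance("acct-1", 10_000);
        relayOnce(store, consumer, true);  // delivered, then "crash"
        relayOnce(store, consumer, false); // delivered again, then marked
        System.out.println("applied=" + consumer.applied + " view=" + consumer.view);
    }
}
```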

4. An event-sourced, CQRS ledger — so the new core is auditable by construction

The replacement core does not store balances as mutable rows. It stores an append-only log of events as the system of record, and serves reads from a separate projection that is rebuildable by replaying that log. Per-aggregate sequence numbers give optimistic concurrency; an idempotency key on the payment makes a retry a no-op instead of a double-charge.

This is more moving parts than CRUD, and I will not pretend otherwise. What you buy is decisive for finance: a complete audit trail, point-in-time reconstruction, independent read scaling, and the most underrated operational superpower in the catalog — recovery by replay. When a projection is corrupted, you do not restore a backup and pray; you rebuild the read model from the truth. The log is the truth; everything else is a cache.
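The core ideas fit in a few dozen lines. The sketch below is an in-memory illustration only: the class LedgerSketch, its method names, and the convention that positive amounts credit and negative amounts debit are assumptions, and a real ledger would persist the log durably. It shows the three properties named above: a per-account expected sequence for optimistic concurrency, an idempotency key that turns a retry into a no-op, and a projection that can be rebuilt by replay.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory event-sourced ledger with a rebuildable balance projection. */
final class LedgerSketch {
    /** One immutable fact. Positive amounts credit, negative amounts debit (an assumed convention). */
    record LedgerEvent(String accountId, long sequence, long amountCents, String idempotencyKey) {}

    private final List<LedgerEvent> log = new ArrayList<>();         // append-only system of record
    private final Map<String, Long> lastSequence = new HashMap<>();  // per-aggregate sequence
    private final Set<String> seenKeys = new HashSet<>();
    private final Map<String, Long> balances = new HashMap<>();      // read projection, a cache

    /** Returns true if an event was appended, false if the idempotency key was already applied. */
    synchronized boolean append(String accountId, long expectedSequence, long amountCents, String idempotencyKey) {
        if (seenKeys.contains(idempotencyKey)) return false; // retry becomes a no-op
        long current = lastSequence.getOrDefault(accountId, 0L);
        if (current != expectedSequence) {
            throw new IllegalStateException("Concurrent write on " + accountId
                    + ": expected sequence " + expectedSequence + " but found " + current);
        }
        LedgerEvent event = new LedgerEvent(accountId, current + 1, amountCents, idempotencyKey);
        log.add(event);
        lastSequence.put(accountId, event.sequence());
        seenKeys.add(idempotencyKey);
        apply(balances, event);
        return true;
    }

    private static void apply(Map<String, Long> projection, LedgerEvent event) {
        projection.merge(event.accountId(), event.amountCents(), Long::sum);
    }

    synchronized long balance(String accountId) {
        return balances.getOrDefault(accountId, 0L);
    }

    /** Recovery by replay: discard the projection and rebuild it from the log. */
    synchronized void rebuildProjection() {
        balances.clear();
        for (LedgerEvent event : log) apply(balances, event);
    }

    /** Demo hook that simulates a corrupted read model. */
    synchronized void corruptProjection(String accountId, long wrongValue) {
        balances.put(accountId, wrongValue);
    }

    public static void main(String[] args) {
        LedgerSketch ledger = new LedgerSketch();
        ledger.append("acct-1", 0, 1_000, "pay-001");
        ledger.append("acct-1", 0, 1_000, "pay-001"); // client retry, ignored
        ledger.append("acct-1", 1, -250, "pay-002");
        ledger.corruptProjection("acct-1", -1);
        ledger.rebuildProjection();
        System.out.println(ledger.balance("acct-1")); // prints 750
    }
}
```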

Earning the right to turn the dial

Here is the part most architecture diagrams omit, and the part that actually de-risks the program: parallel-run reconciliation.

While both systems are live, you continuously compare the legacy balances (arriving as CDC events) against the new ledger's projection, account by account, and you treat any discrepancy beyond tolerance as a release-blocking defect. Reconciliation is the gate. You do not widen the traffic dial because a sprint ended; you widen it because the books agree. This converts "do we trust the new system?" from an opinion in a meeting into a number on a dashboard.
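As a rough illustration of the gate, the sketch below compares two balance snapshots account by account. The Reconciler class, the Discrepancy record, the use of integer cents, and the rule that an account missing on either side counts as a discrepancy are all assumptions for the example; the article does not specify how the reference implementation does it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical parallel-run reconciler: legacy balances versus the new ledger's projection. */
final class Reconciler {
    /** A null side means the account was missing from that system. */
    record Discrepancy(String accountId, Long legacyCents, Long ledgerCents) {}

    static List<Discrepancy> compare(Map<String, Long> legacy, Map<String, Long> ledger, long toleranceCents) {
        Set<String> accountIds = new TreeSet<>(legacy.keySet());
        accountIds.addAll(ledger.keySet());

        List<Discrepancy> discrepancies = new ArrayList<>();
        for (String id : accountIds) {
            Long legacyCents = legacy.get(id);
            Long ledgerCents = ledger.get(id);
            boolean missing = legacyCents == null || ledgerCents == null;
            if (missing || Math.abs(legacyCents - ledgerCents) > toleranceCents) {
                discrepancies.add(new Discrepancy(id, legacyCents, ledgerCents));
            }
        }
        return discrepancies;
    }

    /** The gate: the dial may widen only when the books agree. */
    static boolean mayWidenDial(List<Discrepancy> discrepancies) {
        return discrepancies.isEmpty();
    }

    public static void main(String[] args) {
        Map<String, Long> legacy = Map.of("acct-1", 10_000L, "acct-2", 5_000L);
        Map<String, Long> ledger = Map.of("acct-1", 10_000L, "acct-2", 4_900L);
        List<Discrepancy> result = compare(legacy, ledger, 0);
        System.out.println(result + " widen=" + mayWidenDial(result));
    }
}
```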

What happens at 3 a.m.

Distributed settlement has no global ACID transaction. You reserve funds, post to the ledger, notify an external rail, and confirm, all across boundaries that can each fail independently. So you plan for partial failure explicitly with an orchestration-based saga: a single coordinator drives the steps, persists its state for crash recovery, and runs compensations in reverse order when a later step fails. A rail rejection after the ledger has posted triggers a ledger reversal.

Two principles I hold firm here. First, in finance you compensate, you never delete — the correction is a new, auditable entry, because erasing history is itself the incident. Second, choose orchestration when the workflow is non-trivial and you want one place to reason about state, timeouts, and compensation — while staying vigilant that the orchestrator never metastasizes into a god-service.
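A coordinator of that shape can be sketched as follows. The names SettlementSaga, Step and Action are hypothetical, the journal list stands in for the durable state a real orchestrator would persist, and timeouts and failing compensations are left out to keep the example short. Note that the ledger compensation appends a reversal entry rather than removing the original one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of an orchestration-based saga with reverse-order compensation. */
final class SettlementSaga {
    /** A forward action that may fail with an exception. */
    interface Action { void run() throws Exception; }

    /** One saga step: the forward action plus the compensation that offsets it. */
    record Step(String name, Action action, Runnable compensation) {}

    enum Outcome { COMPLETED, COMPENSATED }

    /** Runs steps in order; on failure, compensates the completed steps newest first. */
    static Outcome run(List<Step> steps, List<String> journal) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step);
                journal.add("done:" + step.name());
            } catch (Exception failure) {
                journal.add("failed:" + step.name());
                while (!completed.isEmpty()) {
                    Step undo = completed.pop(); // most recently completed step first
                    undo.compensation().run();
                    journal.add("compensated:" + undo.name());
                }
                return Outcome.COMPENSATED;
            }
        }
        return Outcome.COMPLETED;
    }

    public static void main(String[] args) {
        List<String> ledger = new ArrayList<>();
        List<String> journal = new ArrayList<>();
        List<Step> steps = List.of(
            new Step("reserve-funds", () -> ledger.add("HOLD 100"), () -> ledger.add("RELEASE 100")),
            // Compensation is a new entry: the original DEBIT stays in the history.
            new Step("post-ledger", () -> ledger.add("DEBIT 100"), () -> ledger.add("REVERSAL 100")),
            new Step("notify-rail", () -> { throw new IllegalStateException("rail rejected"); }, () -> {})
        );
        System.out.println(run(steps, journal)); // prints COMPENSATED
        System.out.println(ledger);              // prints [HOLD 100, DEBIT 100, REVERSAL 100, RELEASE 100]
    }
}
```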

Shipping change without holding your breath

Reversibility applies to deployments too. New versions roll out as canaries with automated analysis against your service-level objectives; a breach aborts and rolls back without a human in the loop. When your availability target is 99.99%, your error budget is roughly 52 minutes a year, and you cannot afford to discover a regression from a support ticket. You discover it from a metric, and the system reacts before you do.
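The budget arithmetic and the shape of an automated gate can be sketched in a few lines. This is a deliberately simplified illustration: the CanaryGate class, the HOLD state, and the minimum sample size are assumptions, and real canary analysis usually compares the canary against a baseline across several metrics rather than a single error rate.

```java
import java.util.Locale;

/** Hypothetical sketch of an SLO-driven canary gate; names and thresholds are assumptions. */
final class CanaryGate {
    enum Decision { PROMOTE, HOLD, ROLLBACK }

    /** Allowed downtime in minutes per 365-day year for an availability target such as 0.9999. */
    static double errorBudgetMinutesPerYear(double availabilityTarget) {
        return (1.0 - availabilityTarget) * 365 * 24 * 60;
    }

    /** Rolls back when the canary's observed error rate exceeds what the target allows. */
    static Decision analyze(long requests, long errors, double availabilityTarget, long minRequests) {
        if (requests < minRequests) return Decision.HOLD; // not enough traffic to judge yet
        double observedErrorRate = (double) errors / requests;
        double allowedErrorRate = 1.0 - availabilityTarget;
        return observedErrorRate > allowedErrorRate ? Decision.ROLLBACK : Decision.PROMOTE;
    }

    public static void main(String[] args) {
        // prints "budget: 52.56 min/year"
        System.out.printf(Locale.ROOT, "budget: %.2f min/year%n", errorBudgetMinutesPerYear(0.9999));
        // 20 errors in 100,000 requests is 0.02%, above the 0.01% allowance: prints ROLLBACK
        System.out.println(analyze(100_000, 20, 0.9999, 10_000));
    }
}
```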

The rigor underneath (because someone senior will ask)

Strategy without capacity math is a wish. At 50,000 TPS with ~500-byte events you are writing ~25 MB/s, or about 2.2 TB/day of raw log. That forces real decisions: partition the stream for ordering and parallelism; sub-key hot accounts (a clearing account will try to become a single bottleneck) while keeping the canonical stream ordered; tier storage so hot data stays in the primary and cold history offloads to the lakehouse that also serves analytics. The point is not the specific numbers; it's that the numbers exist before the decisions do.
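The arithmetic is worth writing down so it can be rerun when the inputs change. The sketch below uses decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and the figures from the paragraph above. The subKey helper is one hypothetical way to spread a hot account across shards by payment id; the shard count and key format are assumptions, and ordering then holds only within each sub-key.

```java
import java.util.Locale;

/** Hypothetical capacity worksheet for the event log, in decimal units. */
final class CapacityMath {
    static long bytesPerSecond(long transactionsPerSecond, long eventBytes) {
        return transactionsPerSecond * eventBytes;
    }

    static long bytesPerDay(long transactionsPerSecond, long eventBytes) {
        return bytesPerSecond(transactionsPerSecond, eventBytes) * 86_400L; // seconds per day
    }

    /** Assumed sub-keying scheme: a hot account's writes fan out over a fixed number of shards. */
    static String subKey(String accountId, String paymentId, int shards) {
        return accountId + "#" + Math.floorMod(paymentId.hashCode(), shards);
    }

    public static void main(String[] args) {
        long perSecond = bytesPerSecond(50_000, 500);
        long perDay = bytesPerDay(50_000, 500);
        System.out.printf(Locale.ROOT, "%.1f MB/s%n", perSecond / 1e6);  // prints 25.0 MB/s
        System.out.printf(Locale.ROOT, "%.2f TB/day%n", perDay / 1e12);  // prints 2.16 TB/day
        System.out.println(subKey("clearing-001", "pay-123", 16));
    }
}
```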

What I'd actually tell the steering committee

The hardest part of this is not the architecture. It is convincing leadership that slower-looking is safer and ultimately faster, and then holding the line when a stakeholder asks why you aren't "just done." The honest pitch is a portfolio of risk, not a Gantt chart:

Value is incremental, not deferred to a cutover that may never safely arrive.
Risk is bounded at each step to the percentage of traffic you chose to move.
The legacy bill goes away on your schedule — you decommission when reconciliation says it's safe, not on a terrifying weekend.
Auditability improves on day one, which in a regulated domain is itself a deliverable.

When not to do this

Seniority is knowing when the expensive pattern is the wrong one. If the system is small, low-risk, or genuinely greenfield, the coexistence machinery here is overhead you don't need — rewrite it and move on. If the domain is being retired anyway, don't modernize it; sunset it. The Strangler Fig earns its complexity only when the system is mission-critical, long-lived, and cannot stop. That's exactly when most teams reach for the big-bang — which is exactly why they get hurt.

Principles, distilled

Make every step reversible. A dial, never a switch.

The event log is the truth; everything else is a cache.

Never let the legacy schema cross the border.

Earn each traffic increase with reconciliation, not optimism.

In finance, compensate — never delete.

I built a complete, runnable reference implementation of everything above (the strangler facade, the anti-corruption layer, the transactional outbox, the event-sourced CQRS ledger, the orchestration saga, and the canary rollout) in Java / Spring, with architecture decision records and a one-command local run.

Clone it and run docker compose up: https://github.com/mizbamd/payments-modernization-platform

If this resonates, it's one of five reference implementations in an open Enterprise Platform Reference Architecture covering modernization, production RAG, governed AI agents, MACH pricing, and a streaming lakehouse. I write about building platforms that are not allowed to fail — follow along.

