In my early days as a software engineer, I often wondered: how do teams migrate between databases without bringing everything down?
It’s one of those invisible feats of engineering — moving millions of records, decomposing APIs, redefining schemas — all while users continue using the system like nothing happened.
I recently had the opportunity to lead one of these migrations myself: a journey that pushed me to balance complexity and pragmatism, architecture and delivery speed, and "fancy" versus "effective" design decisions. It also pushed me to grow as an engineer and helped my team navigate a difficult impending blocker, working through ambiguity, managing moving parts across systems, and learning to balance clean design with real-world delivery constraints.
While there are famous war stories from brilliant engineers at tech behemoths, like Discord's migration to ScyllaDB or Shopify's scale-up journey with Vitess, this experience gave me a more grounded appreciation of what goes into these transitions: from dual writes, to batch migration strategies, to worker-based synchronization that keeps two databases consistent in production.
We didn’t reinvent the wheel; we just made sure it rolled smoothly. The real value was in designing something that others could reuse, learn from, and trust.
The Problem: A Monolith Slowing Down
Disclaimer: All service names, model names, and technical identifiers mentioned in this post have been changed or generalized for confidentiality. The architectural principles and migration strategies, however, remain true to the original implementation.
Our Orders Service had grown into a large, unwieldy monolith.
It housed all the API endpoints — Orders, Media Assets, Transactions, Handler Details, and more. Over time, this came at a cost:
Migrations were long and painful.
Responses were slow: our heaviest API (Orders) was taking 4+ seconds at the 50th percentile, and growing as data volume increased, since each order serialized multiple nested objects.
Nested joins and transformations increased complexity.
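To make that coupling concrete, the V1 response was likely assembled from something like the serializer below. The class and association names are hypothetical, but the shape of the problem is the same: every nested association adds queries and serialization work to a single Orders request.

```ruby
# Hypothetical V1 serializer: one Orders request fans out into several
# nested associations, each adding joins and serialization time.
class OrderSerializer < ActiveModel::Serializer
  attributes :id, :status, :created_at

  has_many :media_assets      # 6-8 records per order
  has_many :transactions
  has_one  :handler_details
end
```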
It was clear that we needed to move toward lightweight, domain-focused APIs — a modular approach that could evolve independently.
Two candidates emerged: Media Assets and Transactions.
We started with Media Assets, since they were growing rapidly and heavily coupled with the Orders API.
Each order had 6–8 related assets, and this interlinking was dragging down performance.
We recognized that an alternative approach would have been to split the APIs within the same service into dedicated endpoints. That would have let us keep the same database while still reducing response times. However, we chose to migrate to a full Media Assets service to accelerate growth in that domain. Additional benefits included simpler test suites, streamlined business logic, and a single source of truth for media assets that other domains (analytics, for example) could eventually leverage.
The Approach: From Monolith to Dedicated Service
The goal was simple — bring the response times under 300ms, and create a foundation that scaled cleanly as we grew.
The Plan:
Spin off Media Assets into an independent microservice.
Keep V1 and V2 running in parallel — zero downtime.
Use dual-writes during the migration period.
Migrate 5M+ asset records per day until all 40M+ assets were moved. A point worth noting: we didn't open the floodgates on day one. We started with ~10k records and progressively ramped up write throughput as the nightly write jobs kept succeeding and performance held, so we could catch any unforeseen issues early. A rough sketch of such a backfill job follows below.
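As that sketch (not the actual implementation), a nightly backfill job along these lines would read unmigrated rows from the replica in configurable batches and push them to V2. `LegacyAsset`, `MediaAssetsV2Client`, and `to_v2_payload` are assumed names; the `v2_complete` flag is the one mentioned later in the challenges table.

```ruby
# Nightly backfill sketch: batch size is the throttle we ramped up over time
# (~10k records at first, eventually 5M+ per day).
class MediaAssetBackfillJob
  include Sidekiq::Job

  def perform(batch_size = 10_000)
    loop do
      # Pull the next batch of unmigrated assets from the read replica,
      # so the backfill never competes with production traffic.
      batch = ActiveRecord::Base.connected_to(role: :reading) do
        LegacyAsset.where(v2_complete: false).order(:id).limit(batch_size).to_a
      end
      break if batch.empty?

      # Push each record to the new Media Assets service (hypothetical client).
      batch.each { |asset| MediaAssetsV2Client.upsert(asset.to_v2_payload) }

      # Flag the batch on the primary so it isn't picked up again.
      LegacyAsset.where(id: batch.map(&:id)).update_all(v2_complete: true)
    end
  end
end
```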
The New Architecture
Before:
The Orders Service managed all the nested models (let's call them Assets, Event Details, Reviewer Details, and so on).
After:
A dedicated Media Asset microservice with its own schema and database:
- DynamoDB, chosen for predictable access patterns and low maintenance.
  - `OrderId` as the partition key.
  - `AssetId#AssetType` as the sort key.
  - Additional GSIs for fast lookups (e.g., by `AssetType` or `CreationDate`).
- APIs written in Rails, deployed to ECS Fargate.
- Asynchronous Sidekiq tasks to perform non-blocking V2 writes. A dedicated message queue could have handled the load more robustly, but it came with added cost, maintenance, integration logic, and scaling work; since our service instances were more than capable of handling a sidecar queue without memory issues, we kept it as a simple sidecar.
- CloudWatch and New Relic for observability and tracing. CloudWatch logs gave us enough visibility through the querying interface, and New Relic covered metrics. We didn't invest much effort beyond that: CloudWatch has matured enough for engineers to drill down into issues, provided good logging practices and tracers are in place.
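As a simplified illustration of that key design, the table definition would look roughly like the following. The table name, attribute names, index name, and billing mode are assumptions for the sketch, not the production configuration.

```ruby
require "aws-sdk-dynamodb"

client = Aws::DynamoDB::Client.new

client.create_table(
  table_name: "media_assets",
  attribute_definitions: [
    { attribute_name: "OrderId",      attribute_type: "S" }, # partition key
    { attribute_name: "AssetKey",     attribute_type: "S" }, # composite "AssetId#AssetType" sort key
    { attribute_name: "AssetType",    attribute_type: "S" }, # GSI hash key
    { attribute_name: "CreationDate", attribute_type: "S" }  # GSI range key
  ],
  key_schema: [
    { attribute_name: "OrderId",  key_type: "HASH"  },
    { attribute_name: "AssetKey", key_type: "RANGE" }
  ],
  global_secondary_indexes: [
    {
      index_name: "by_asset_type_and_creation_date",
      key_schema: [
        { attribute_name: "AssetType",    key_type: "HASH"  },
        { attribute_name: "CreationDate", key_type: "RANGE" }
      ],
      projection: { projection_type: "ALL" }
    }
  ],
  billing_mode: "PAY_PER_REQUEST" # on-demand billing keeps the sketch free of capacity planning
)
```

With `OrderId` as the partition key, fetching all assets for an order is a single `query` call, which is what makes these lookups so cheap compared to the old nested joins.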
The Migration Journey
We designed the migration to be zero-downtime and reversible:
Read replica of RDBMS for backfill jobs.
Dual-write mode: both V1 (old) and V2 (new) APIs handled traffic.
Synthetic data tests in QA before go-live.
Batch backfill for historical data before the switchover date.
Feature toggles and rollback plan to instantly revert in case of issues.
Rollback Plan:
If V1 was impacted — disable the feature toggle, roll back to the stable release, and resume migration after fixing the issue.
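To make the toggle and rollback concrete, here is a minimal sketch of how the dual-write path could be gated behind a feature flag. Flipper, `AssetWriter`, `LegacyAsset`, `MediaAssetV2SyncJob`, and `MediaAssetsV2Client` are illustrative names, not the real implementation; the point is that turning the flag off stops V2 writes immediately, with no deploy required.

```ruby
# Hypothetical V1 write path: persist to the legacy store as before, then
# enqueue a non-blocking V2 write only while the feature toggle is enabled.
class AssetWriter
  def create(attrs)
    asset = LegacyAsset.create!(attrs) # V1 remains the source of truth during migration

    if Flipper.enabled?(:media_assets_v2_writes)
      MediaAssetV2SyncJob.perform_async(asset.id) # fire-and-forget, V1 latency unaffected
    end

    asset
  end
end

class MediaAssetV2SyncJob
  include Sidekiq::Job

  def perform(asset_id)
    asset = LegacyAsset.find(asset_id)
    MediaAssetsV2Client.upsert!(asset.to_v2_payload) # raises on non-2xx so Sidekiq retries
  end
end
```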
| Challenge | Solution |
|---|---|
| Data Consistency between V1 and V2 during dual-writes | Implemented reconciliation jobs to verify that each record existed and matched in both systems. |
| Race Conditions — upload completed before V2 API confirmation | Defensive retry logic: workers re-sent messages until V2 returned 200 OK. This was especially easy to do with SQS visibility timeout for retrying events later. |
| Intermittent Failures in background tasks | Exponential backoff with 4 retries (sketched after this table). Failed records were marked `v2_complete = false` for later reconciliation. |
| Copying S3 Files to the new directory structure | Used SQS-triggered worker pipelines to handle copy + compression in the background. |
| Retry Storms / API Clumping under load | Introduced random jitter to stagger retries. |
| Orphaned Uploads / Missed Syncs | Daily reconciliation jobs scanned an outbox table (DLQ) to capture missing create/update/delete events. |
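For the retry behaviour in the table (four retries, exponential backoff, random jitter, and flagging failures for reconciliation), the hypothetical sync worker from the earlier sketch could be configured roughly like this; the exact backoff curve is an assumption.

```ruby
class MediaAssetV2SyncJob
  include Sidekiq::Job

  sidekiq_options retry: 4

  # Exponential backoff plus random jitter so retries don't clump under load.
  sidekiq_retry_in do |count, _exception|
    (30 * 2**count) + rand(30) # ~30s, 60s, 120s, 240s, each with up to 30s of jitter
  end

  # Out of retries: flag the record so the daily reconciliation job picks it up.
  sidekiq_retries_exhausted do |job, _exception|
    LegacyAsset.where(id: job["args"].first).update_all(v2_complete: false)
  end

  def perform(asset_id)
    asset = LegacyAsset.find(asset_id)
    MediaAssetsV2Client.upsert!(asset.to_v2_payload) # raises on failure, triggering a retry
  end
end
```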
The Results
The outcomes were immediate and measurable:
Latency dropped from 4+ seconds to ~180ms. We expected this to be an easy win: the new database serializes far fewer nested objects transitively and allows very fast lookups. The raw queries are faster still; the application adds some data validation and massaging on top, and average response times stayed well under our 300ms target.
Deployment times decreased: smaller test suites and faster CI/CD (~15 minutes total). This was a pure process gain rather than a code optimization, and a reminder that not every improvement lives in code; some come from the setup, the approach, and the right architecture.
Independent scaling unlocked — the Media Asset service could evolve on its own.
New UI components could call V2 APIs directly without waiting on the Order API.
Standardized JSON API contracts for enterprise integrations.
Lessons & Next Steps
Even with strong results, there were areas we wanted to evolve further:
Long-polling upload status could be replaced with Server-Sent Events (SSE); a rough sketch of that idea follows this list.
Offload background work entirely to dedicated SQS queues for better isolation.
Explore a stable ORM for developer experience.
Consider rewriting worker scripts in a more performant language long-term.
Monitor DynamoDB cost trade-offs as access patterns evolve.
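For the first item above, a Rails controller using `ActionController::Live` could stream status updates instead of being long-polled. This is purely illustrative (we did not build it); `UploadStatus.fetch` and the status shape are assumptions.

```ruby
# Illustrative SSE endpoint: streams upload status until it reaches a terminal state.
class UploadStatusController < ApplicationController
  include ActionController::Live

  def show
    response.headers["Content-Type"] = "text/event-stream"
    sse = SSE.new(response.stream, event: "status")

    loop do
      status = UploadStatus.fetch(params[:id]) # hypothetical status lookup
      sse.write(status)                        # e.g. { state: "processing", progress: 40 }
      break if %w[completed failed].include?(status[:state])
      sleep 2
    end
  ensure
    sse&.close
  end
end
```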
Final Thoughts
This migration wasn’t about reinventing the wheel — it was about balancing impact with reliability.
We didn't aim for the flashiest architecture, but for the most sustainable one: something that could handle 2M+ orders and 40M+ assets with confidence.
And that’s the real beauty of engineering migrations: making the complex look effortless.
