The "Complexity Tax" of Polyglot Persistence
Scaling is exciting; it's at its most exciting before the first month in production sets in, the thousands of customers you scaled for actually come knocking, and you see the real numbers. Our polyglot persistence setup was not unique, nor was it unreasonable for an application pushing hundreds of thousands of documents through extraction pipelines. But like most startups, we ended up scaling our costs long before we scaled our user base.
Early on, we designed the system to store extracted paragraphs in a MongoDB instance while keeping the core relational data in Postgres. This delivered the expected flexibility of fast writes and schema-less retrieval, right up until the day we had to scrutinize the bill. We were paying a "Complexity Tax": the maintenance overhead of running two databases, the cognitive load of context switching, and the realization that we were leaving Postgres's best features (joins, foreign keys, and optimized includes) on the table.
Postgres 18 and the releases leading up to it have introduced the most significant storage optimizations since TOAST debuted decades ago. The option to switch from the slower PGLZ to LZ4 compression for TOASTed text columns, hardware acceleration (SIMD), and asynchronous I/O for sequential scans have changed the calculus. These updates promised speeds comparable to MongoDB, lower costs, and zero network latency between our data and our metadata.
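For example, opting a column into LZ4 is an ordinary migration. A minimal sketch against our paragraphs.content column (Rails 7.1 migration syntax assumed; requires Postgres 14+ built with LZ4 support):

class SwitchParagraphContentToLz4 < ActiveRecord::Migration[7.1]
  def up
    # Only newly written values use LZ4; existing rows keep their old
    # compression until they are rewritten (e.g. by an UPDATE).
    execute "ALTER TABLE paragraphs ALTER COLUMN content SET COMPRESSION lz4"
  end

  def down
    execute "ALTER TABLE paragraphs ALTER COLUMN content SET COMPRESSION pglz"
  end
end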
We also had early expectations that users would be constantly editing paragraph contents, leading to massive write spikes. This turned out to be false. After extraction, a paragraph is touched less than three times over its lifecycle. Our write-throughput bottleneck had disappeared, and so had our justification for the overhead of running two databases.
Designing the Migration Architecture
Moving data is easy to judge: is it there or is it not? But getting it there without downtime is a different beast. We needed to account for potential rollbacks and the reality that our extraction pipelines would continue pouring new data into MongoDB while we were trying to empty it. This demanded a "Dual Write" pattern combined with a Hot Swap approach to ensure we kept exactly the same table names, models and relationships.
The Dual Write Sequence
The application needed to write to the new Postgres table without breaking existing code. The strategy involved creating a shadow Active Record model, RailsParagraph, pointing to the standard paragraphs table.
class RailsParagraph < ApplicationRecord
  self.table_name = 'paragraphs'
end
We mirrored the MongoDB fields exactly but made one crucial decision: the mongo_id column would carry a unique index. This was the anchor. It allowed migration scripts to be replayed any number of times without creating duplicates.
We had more indexes to add, but each B-tree index adds an extra write on every insert. Indexes on page_id, section_id, and contains_images could wait; they would be added concurrently during the hot swap after the data migration, which would also involve renaming the ActiveRecord models and removing the Mongoid models.
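For reference, the shadow table migration looked roughly like this (a sketch; only mongo_id and content appear in the code elsewhere in this post, so the remaining columns and their types are illustrative):

class CreateParagraphs < ActiveRecord::Migration[7.1]
  def change
    create_table :paragraphs do |t|
      t.string  :mongo_id, null: false   # the replay anchor
      t.text    :content
      t.bigint  :page_id                 # indexed later, concurrently
      t.bigint  :section_id              # indexed later, concurrently
      t.boolean :contains_images, default: false
      t.timestamps
    end

    # The only index we pay the write cost for during the backfill.
    add_index :paragraphs, :mongo_id, unique: true
  end
end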
To capture live data, we hooked into the Mongoid lifecycle. Using after_save callbacks, every new or updated document was synced to Postgres immediately.
# Inside the Mongo model
after_save :sync_to_postgres

def sync_to_postgres
  # We use upsert to safely handle replays
  RailsParagraph.upsert(
    { mongo_id: self.id.to_s, content: self.content },
    unique_by: :mongo_id
  )
end
Idempotency is King
For the backfill script, simple inserts were insufficient. If the script crashed halfway through, it needed to be restartable without manual cleanup. We used upsert_all with the unique_by: :mongo_id parameter, so any record already synced by the Dual Write callback was simply updated in place rather than raising a duplicate-key error, keeping the backfill and the live Dual Write path from tripping over each other.
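Concretely, each batch boiled down to one statement; rows here is the array of attribute hashes built from a Mongo batch (shown in full in the job below):

# Generates INSERT ... ON CONFLICT (mongo_id) DO UPDATE,
# so a replayed batch updates rows instead of raising.
RailsParagraph.upsert_all(rows, unique_by: :mongo_id)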
The Poison Pill (Sanitization)
Migration scripts are excellent at finding data corruption you didn't know you had. MongoDB, being schema-less, had happily accepted strings containing Null Bytes (\u0000). PostgreSQL is stricter; a single null byte in a text column is a violation that aborts the entire transaction.
A sanitization boundary was required, stripping these characters before they ever touched the SQL payload.
# The Sanitization Boundary
text: doc['text']&.delete("\u0000")
The "Naive" Implementation & The Crash
A separate PR would handle the cleanup: remove the mongo_id index once it was no longer needed, add the remaining indexes with the DDL transaction disabled (which lets Postgres build them concurrently without locking the table), and finally insert the relationships between paragraphs, pages, and sections.
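That follow-up migration would look roughly like this (a sketch; the model renames happen in the same PR but are omitted here):

class SwapParagraphIndexes < ActiveRecord::Migration[7.1]
  # Required: CREATE/DROP INDEX CONCURRENTLY cannot run inside a transaction.
  disable_ddl_transaction!

  def change
    remove_index :paragraphs, :mongo_id, algorithm: :concurrently

    add_index :paragraphs, :page_id,         algorithm: :concurrently
    add_index :paragraphs, :section_id,      algorithm: :concurrently
    add_index :paragraphs, :contains_images, algorithm: :concurrently
  end
end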
With the architecture set, the code was written—clean, idiomatic Ruby. A Sidekiq job was built to process 15 million historical records in batches of 2,000, spawning subsequent jobs in a neat daisy-chain. It looked perfect in the Pull Request.
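The naive job looked roughly like this (a reconstruction sketch; the job name, the Mongoid model name Paragraph, and the _id-based pagination are assumptions, not the original code):

class BackfillParagraphsJob
  include Sidekiq::Job

  BATCH_SIZE = 2_000

  def perform(last_id = nil)
    batch_docs = next_batch(last_id)
    return if batch_docs.empty?

    migrate_batch!(batch_docs)

    # Daisy-chain: enqueue the next batch, keyed off the last processed _id.
    self.class.perform_async(batch_docs.last.id.to_s)
  end

  private

  # Fetch the next slice of Mongo documents after the given _id.
  def next_batch(last_id)
    scope = Paragraph.all
    scope = scope.where(:_id.gt => BSON::ObjectId.from_string(last_id)) if last_id
    scope.order_by(_id: :asc).limit(BATCH_SIZE).to_a
  end

  # Map Mongo documents to rows and upsert them idempotently.
  def migrate_batch!(batch_docs)
    rows = batch_docs.map do |doc|
      {
        mongo_id:   doc.id.to_s,
        content:    doc.content&.delete("\u0000"), # the sanitization boundary
        created_at: Time.current,
        updated_at: Time.current
      }
    end
    RailsParagraph.upsert_all(rows, unique_by: :mongo_id)
  end
end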
Then it was deployed.
The jobs ran for a few minutes, processed a few batches, and then the worker would silently die. No error logs, just a vanished process. On Heroku, the graphs revealed the dreaded Error R14 (Memory Quota Exceeded).
It wasn't a flat line; it was a "staircase." Every batch pushed the memory baseline higher. Ruby was releasing the objects, but the operating system wasn't reclaiming the RAM. We weren't fighting a bug in the code; we were fighting Memory Fragmentation. The heap looked like Swiss Cheese—full of holes too small to reuse.
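One way to see that gap from inside the process is to compare Ruby's own heap accounting with the RSS the operating system reports (a diagnostic sketch, Linux-only because of the /proc read):

# Ruby's view: live objects and the heap pages the VM manages.
stats = GC.stat
puts "heap_live_slots:      #{stats[:heap_live_slots]}"
puts "heap_allocated_pages: #{stats[:heap_allocated_pages]}"

# The OS's view: resident set size of the whole process.
rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
puts "VmRSS: #{rss_kb / 1024} MB"

When the Ruby numbers stay flat while VmRSS keeps climbing batch after batch, fragmentation (or allocator hoarding) is the prime suspect.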
Debugging Physics
At this point, the standard advice is "Just install jemalloc": the easy button that swaps the memory allocator and masks the problem. Rather than mask it, we chose to work with the physics of the container.
Taming the Arenas
The standard Linux allocator creates multiple memory pools ("arenas") to optimize for multi-threaded speed. In a constrained 512MB container, these arenas compete for RAM and hoard pages. We forced glibc to be frugal by strictly limiting the number of arenas via the environment config.
# Force the allocator to reuse memory chunks
heroku config:set MALLOC_ARENA_MAX=2
The Hack: Manual Garbage Collection
High-churn string processing is the enemy of Ruby's lazy Garbage Collector. By the time the GC wakes up, the OS has already allocated new pages, increasing RSS (Resident Set Size). We manually managed memory inside the loop by forcing the GC to sweep immediately after releasing the batch reference.
# 1. Clear the reference explicitly
batch_docs = nil
# 2. Force Ruby to clean up NOW
GC.start(full_mark: true, immediate_sweep: true)
The Defragmenter
Just cleaning up isn't enough if the heap is fragmented. We introduced compaction to physically move objects in memory, smashing the "Swiss Cheese" holes closed every few thousand records.
if (processed % 10_000).zero? && GC.respond_to?(:compact)
  GC.compact
end
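Putting the pieces together, the memory-conscious perform looked roughly like this (a sketch building on the job above; next_batch and migrate_batch! are the same helpers, and processed is carried along between jobs):

def perform(last_id = nil, processed = 0)
  batch_docs = next_batch(last_id)
  return if batch_docs.empty?

  migrate_batch!(batch_docs)
  processed += batch_docs.size
  next_id = batch_docs.last.id.to_s

  # 1. Drop the only strong reference to the batch.
  batch_docs = nil

  # 2. Sweep immediately so RSS doesn't creep up between batches.
  GC.start(full_mark: true, immediate_sweep: true)

  # 3. Periodically close the fragmentation holes.
  GC.compact if (processed % 10_000).zero? && GC.respond_to?(:compact)

  self.class.perform_async(next_id, processed)
end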
The "Crowded Door" (Connection Limits)
With memory stabilized, ambition took over. We scaled the worker count to speed up the migration. That’s when the second wall appeared: Redis Connection Limits.
Running on a Redis "Mini" plan meant a hard limit of 20 connections. The workers crashed immediately on boot with ERR max number of clients reached. Worse, attempting to debug via rails console grabbed 10 connections instantly, effectively performing a "Murder by Console" on the production workers.
The Fix: Connection Diet
The application had to go on a strict diet. We capped RAILS_MAX_THREADS at 2 and SIDEKIQ_CONCURRENCY at 1. We traded theoretical capacity for actual stability.
# Force Rails and Sidekiq to fit in the box
heroku config:set RAILS_MAX_THREADS=2
heroku config:set SIDEKIQ_CONCURRENCY=1
Optimizing Optimizations
With the system stable, we hunted for bottlenecks. The biggest one wasn't the database; it was the logs.
Writing "INSERT INTO..." 15 million times is a massive I/O operation. The CPU was spending half its cycles waiting for the disk to acknowledge the log write. We silenced ActiveRecord for the migration block, and throughput instantly increased by 40%.
ActiveRecord::Base.logger.silence do
  # Pure signal, no noise.
  migrate_batch!
end
Counter-intuitively, setting SIDEKIQ_CONCURRENCY=1 also made the migration faster. With three threads, there was heavy context-switching overhead and constant contention for database locks. With one thread, the worker entered a "flow state": reading from Mongo and writing to Postgres in a tight, uninterrupted loop without swapping to disk.
Conclusion
We successfully migrated 15 million records, consolidated our database stack, and reduced our monthly infrastructure spend. But the real lesson wasn't about PostgreSQL or MongoDB.
It was about Physics.
We could have installed jemalloc and called it a day. But by choosing the hard path—tuning the arenas, manually compacting the heap, and mathematically sizing our batches—we actually learned how our runtime interacts with the hardware. We moved from a fragile, complexity-taxed system to a boring, standard stack. And in infrastructure, "boring" is the highest compliment you can get.