Overview
In this article, we're going to discuss the possible strategies for migrating, generating, and backfilling data in a Rails application. We'll implement them, improve them, consider their pros and cons, and discuss which ones are better to use in different scenarios. By the end of the article, we will have a full picture of the different ways to solve data migration problems.
Introduction
In simple terms, data migration is the process of adding, updating, or transferring some data inside your application. The most popular cases for data migration are as follows:
- Backfilling column data
- Moving column data from one table to another
- Generating new database records
- Updating corrupted or invalid data with correct values
- Removing unused data
We'll consider 3 different ways to do it:
- Direct Data Manipulation
- Rake Task
- Data Migration Gem
Direct Data Manipulation
The first option is the simplest one: we'll just add the missing data via rails c or through a direct database connection in production.
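For example, using the Animal model introduced in the next chapter, the whole "migration" could be a single statement typed into a production console (a minimal sketch of the approach):

# Inside `rails c` on the production server
Animal.where(status: nil).where('created_at < ?', Time.zone.today).update_all(status: 'reserved')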
Advantages:
- Easy
- No need to implement anything new
- It's fast; the change can be applied in minutes.
Problems:
- Too risky; changes may not end up as intended
- Possible access and security problems
- There are no tests or code reviews, so we can't be sure of the quality
- Lack of control; you don't know who ran the migration or why they ran it.
Rake Task
The second option is a rake task. In this chapter, we will look at how to add rake tasks properly, ensure that they work correctly, learn their pros and cons, and explore how they can be used for data migration. We will start by adding the simplest rake task, then improve its structure, cover the logic with tests, and apply best practices for writing data migrations with rake tasks.
Let's imagine that we have an Animal model with the following fields:
- id
- kind
- status
- created_at
- updated_at
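For reference, a minimal table definition matching these fields could look like this (a sketch; the actual schema in your application may differ):

# db/migrate/XXXXXXXXXXXXXX_create_animals.rb (hypothetical migration for the Animal model)
class CreateAnimals < ActiveRecord::Migration[7.0]
  def change
    create_table :animals do |t|
      t.string :kind
      t.string :status

      t.timestamps
    end
  end
end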
And we need to change the status value from nil to reserved for all animals created before today. How can we do it? Let's start by adding a simple rake task template.
# frozen_string_literal: true

# lib/tasks/animals/backfill_statuses.rake
namespace :animals do
  desc "Update animal status to 'reserved' for animals created before today"

  task backfill_statuses: [:environment] do
    puts 'Updating animal status...'
  end
end
And check that it works:
rake animals:backfill_statuses
# => Updating animal status...
The task has been executed and works as expected. Now, let's add the actual code with the database update. It will look like the following:
# frozen_string_literal: true

# lib/tasks/animals/backfill_statuses.rake
namespace :animals do
  desc "Update animal status to 'reserved' for animals created before today"

  task backfill_statuses: [:environment] do
    Animal.where(status: nil).where('created_at < ?', Time.zone.today).update_all(status: 'reserved')
  end
end
Now, let's check if it works:
rake animals:backfill_statuses
The rake task has been executed, and the database values are updated accordingly. That's it. Our main scenario works as expected, but there's still some room for improvement. Let's take a look at what we can do to make our task more reliable.
Improvements
There are 5 areas that we can potentially improve:
- Display results in the console for visibility
- Ensure data consistency with transactions
- Optimize DB requests
- Isolate the rake task code
- Add tests
Display results in the console for visibility
As you may have noticed, the rake task above doesn't print any output. That can be a real problem: you don't know whether it ran successfully, and you may spend a lot of time verifying the result yourself against production data. Let's fix this problem:
# frozen_string_literal: true

# lib/tasks/animals/backfill_statuses.rake
namespace :animals do
  desc "Update animal status to 'reserved' for animals created before today"

  task backfill_statuses: [:environment] do
    puts "Before running the rake task, there were #{Animal.where(status: 'reserved').count} animals in the 'reserved' state."

    Animal.where(status: nil).where('created_at < ?', Time.zone.today).update_all(status: 'reserved')

    puts "After running the rake task, there are now #{Animal.where(status: 'reserved').count} animals in the 'reserved' state."
  end
end
Now, let's run the rake task:
rake animals:backfill_statuses
# => Before running the rake task, there were 0 animals in the 'reserved' state.
# => After running the rake task, there are now 101 animals in the 'reserved' state.
With the updated code, we display the number of animals in the 'reserved' state both before and after running the rake task, providing better visibility and confirming that the task executed successfully.
Ensure data consistency with transactions
What happens if an unexpected error appears in the middle of the data migration? Right now, we don't handle that case. Even if it's not critical for this particular example, in general we should not forget to wrap this kind of data manipulation in a transaction to keep the data in a consistent state.
# frozen_string_literal: true

# lib/tasks/animals/backfill_statuses.rake
namespace :animals do
  desc "Update animal status to 'reserved' for animals created before today"

  task backfill_statuses: [:environment] do
    ActiveRecord::Base.transaction do
      Animal.where(status: nil).where('created_at < ?', Time.zone.today).update_all(status: 'reserved')
    end
  end
end
By adding the ActiveRecord::Base.transaction block, we ensure that all the updates are executed as a single atomic operation. If any error occurs during the data migration, the transaction will be rolled back and the data will remain unchanged, maintaining data consistency and integrity.
P.S. If you have very large tables, keep in mind that a single UPDATE across all rows will lock the whole table against writes. That is why the update should be done in batches, and the migration should not run within a transaction, for the same reason.
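For large tables, a batched variant could look like the sketch below (shown for illustration only; it is not part of the task we build in this article). Note that there is no wrapping transaction, so each batch commits independently and locks stay short-lived:

# lib/tasks/animals/backfill_statuses.rake (batched variant; a sketch)
namespace :animals do
  desc "Update animal status to 'reserved' for animals created before today"

  task backfill_statuses: [:environment] do
    scope = Animal.where(status: nil).where('created_at < ?', Time.zone.today)

    # One UPDATE per batch of 1,000 rows instead of one statement over the whole table
    scope.in_batches(of: 1_000) do |batch|
      batch.update_all(status: 'reserved')
    end
  end
end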
Optimize DB requests
We have already handled this problem in our rake task, but it's worth stressing that we should use the most efficient database operation available. For example, someone could write our task like this:
# frozen_string_literal: true

# lib/tasks/animals/backfill_statuses.rake
namespace :animals do
  desc "Update animal status to 'reserved' for animals created before today"

  task backfill_statuses: [:environment] do
    Animal.where(status: nil).where('created_at < ?', Time.zone.today).each do |animal|
      animal.update(status: 'reserved')
    end
  end
end
The code above triggers a separate SQL UPDATE request for every animal in the list, making it inefficient:
D, [2023-07-21T09:50:58.346040 #67787] DEBUG -- : Animal Load (1.6ms) SELECT `animals`.* FROM `animals` WHERE `animals`.`status` IS NULL AND (created_at < '2023-07-21')
D, [2023-07-21T09:50:58.346735 #67787] DEBUG -- : ↳ lib/tasks/animals/backfill_statuses.rake:10:in `block (2 levels) in <main>'
D, [2023-07-21T09:50:58.371908 #67787] DEBUG -- : TRANSACTION (2.2ms) BEGIN
D, [2023-07-21T09:50:58.372737 #67787] DEBUG -- : ↳ lib/tasks/animals/backfill_statuses.rake:11:in `block (3 levels) in <main>'
D, [2023-07-21T09:50:58.375091 #67787] DEBUG -- : Animal Update (2.2ms) UPDATE `animals` SET `animals`.`status` = 'reserved', `animals`.`updated_at` = '2023-07-21 07:50:58.368697' WHERE `animals`.`id` = 1
D, [2023-07-21T09:50:58.375713 #67787] DEBUG -- : ↳ lib/tasks/animals/backfill_statuses.rake:11:in `block (3 levels) in <main>'
D, [2023-07-21T09:50:58.381169 #67787] DEBUG -- : TRANSACTION (5.0ms) COMMIT
D, [2023-07-21T09:50:58.381524 #67787] DEBUG -- : ↳ lib/tasks/animals/backfill_statuses.rake:11:in `block (3 levels) in <main>'
D, [2023-07-21T09:50:58.383624 #67787] DEBUG -- : TRANSACTION (1.3ms) BEGIN
D, [2023-07-21T09:50:58.384250 #67787] DEBUG -- : ↳ lib/tasks/animals/backfill_statuses.rake:11:in `block (3 levels) in <main>'
D, [2023-07-21T09:50:58.385792 #67787] DEBUG -- : Animal Update (1.4ms) UPDATE `animals` SET `animals`.`status` = 'reserved', `animals`.`updated_at` = '2023-07-21 07:50:58.381901' WHERE `animals`.`id` = 2
...
That's why it always makes sense to try to find a way to do it in a single DB operation, as we did with the following code:
# frozen_string_literal: true

# lib/tasks/animals/backfill_statuses.rake
namespace :animals do
  desc "Update animal status to 'reserved' for animals created before today"

  task backfill_statuses: [:environment] do
    Animal.where(status: nil).where('created_at < ?', Time.zone.today).update_all(status: 'reserved')
  end
end
This code will trigger only one DB request:
Animal Update All (6.8ms) UPDATE `animals` SET `animals`.`status` = 'reserved' WHERE `animals`.`status` IS NULL AND (created_at < '2023-07-21')
If there's no way to do the update in a single DB request, you should at least consider processing the records in batches:
# frozen_string_literal: true

# lib/tasks/animals/backfill_statuses.rake
namespace :animals do
  desc "Update animal status to 'reserved' for animals created before today"

  task backfill_statuses: [:environment] do
    Animal.where(status: nil).where('created_at < ?', Time.zone.today).find_each do |animal|
      animal.update(status: 'reserved')
    end
  end
end
P.S. To see SQL logs from the rake task, you can add the following code inside:
# frozen_string_literal: true

# lib/tasks/animals/backfill_statuses.rake
namespace :animals do
  desc "Update animal status to 'reserved' for animals created before today"

  task backfill_statuses: [:environment] do
    ActiveRecord::Base.logger = Logger.new(STDOUT)
    # ...
  end
end
Isolate the rake task code
One not-so-obvious problem that can occur when you have many rake tasks is a lack of encapsulation. Let's take a look at the following two rake tasks and try to guess what could go wrong here:
- rake animals:task1

# frozen_string_literal: true

# lib/tasks/animals/task1.rake
namespace :animals do
  task task1: [:environment] do
    puts message
  end

  def message
    'Hello world from Task 1!'
  end
end
- rake animals:task2

# frozen_string_literal: true

# lib/tasks/animals/task2.rake
namespace :animals do
  task task2: [:environment] do
    puts message
  end

  def message
    'Hello world from Task 2!'
  end
end
Now let's run both of them:
rake animals:task1
# => Hello world from Task 2!
rake animals:task2
# => Hello world from Task 2!
Have you noticed? That's not what we expected! The second rake task overrode the method defined in the first one. This happens because methods defined in .rake files end up at the top level (on Object), so the definition loaded last wins. And that's quite dangerous and unexpected if you were to use something like this:
- rake animals:task1

# frozen_string_literal: true

# lib/tasks/animals/task1.rake
namespace :animals do
  task task1: [:environment] do
    query.destroy_all
  end

  def query
    Animal.where(status: nil)
  end
end
- rake animals:task2

# frozen_string_literal: true

# lib/tasks/animals/task2.rake
namespace :animals do
  task task2: [:environment] do
    query.destroy_all
  end

  def query
    Animal.all
  end
end
And run:
rake animals:task1
You would remove all your records instead of the desired subset!
How can we fix it?
We can wrap each rake task in its own class that includes the Rake::DSL module, like this:
- rake animals:task1

# frozen_string_literal: true

# lib/tasks/animals/task1.rake
module Tasks
  module Animals
    class Task1
      include Rake::DSL

      def initialize
        namespace :animals do
          task task1: [:environment] do
            puts message
          end
        end
      end

      private

      def message
        'Hello world from Task 1!'
      end
    end
  end
end

Tasks::Animals::Task1.new
- rake animals:task2

# frozen_string_literal: true

# lib/tasks/animals/task2.rake
module Tasks
  module Animals
    class Task2
      include Rake::DSL

      def initialize
        namespace :animals do
          task task2: [:environment] do
            puts message
          end
        end
      end

      private

      def message
        'Hello world from Task 2!'
      end
    end
  end
end

Tasks::Animals::Task2.new
And let's execute:
rake animals:task1
# => Hello world from Task 1!
rake animals:task2
# => Hello world from Task 2!
Now everything works as expected. Let's apply the same isolation to our backfill_statuses rake task.
# frozen_string_literal: true

# lib/tasks/animals/backfill_statuses.rake
module Tasks
  module Animals
    class BackfillStatuses
      include Rake::DSL

      def initialize
        namespace :animals do
          desc "Update animal status to 'reserved' for animals created before today"

          task backfill_statuses: [:environment] do
            # Keep the console output from the earlier improvement so the task stays observable (and testable below).
            puts "Before running the rake task, there were #{Animal.where(status: 'reserved').count} animals in the 'reserved' state."

            Animal.where(status: nil).where('created_at < ?', Time.zone.today).update_all(status: 'reserved')

            puts "After running the rake task, there are now #{Animal.where(status: 'reserved').count} animals in the 'reserved' state."
          end
        end
      end
    end
  end
end

Tasks::Animals::BackfillStatuses.new
That's it.
Add tests
The last thing that we're going to do to ensure quality is to add tests. Let's see how we can test the rake tasks.
First of all, we need to define some code to load our tasks:
# spec/support/tasks.rb

# frozen_string_literal: true

RSpec.configure do |_config|
  Rails.application.load_tasks
end
And include it in spec/rails_helper.rb:
# spec/rails_helper.rb
require 'support/tasks'
Then let's add our test:
# spec/tasks/animals/backfill_statuses_spec.rb

# frozen_string_literal: true

require 'rails_helper'

RSpec.describe 'rake animals:backfill_statuses', type: :task do
  subject { Rake::Task['animals:backfill_statuses'].execute }

  let(:expected_output) do
    <<~TEXT
      Before running the rake task, there were 1 animals in the 'reserved' state.
      After running the rake task, there are now 2 animals in the 'reserved' state.
    TEXT
  end

  let(:animal_1) { create(:animal, created_at: 10.days.ago, status: nil) }
  let(:animal_2) { create(:animal, created_at: 10.days.ago, status: 'reserved') }
  let(:animal_3) { create(:animal, created_at: 10.days.ago, status: 'another_status') }
  let(:animal_4) { create(:animal, created_at: 10.days.from_now, status: nil) }

  before do
    animal_1
    animal_2
    animal_3
    animal_4
  end

  it "updates animal status to 'reserved' for animals created before today" do
    expect { subject }.to change { animal_1.reload.status }.from(nil).to('reserved')
      .and output(expected_output).to_stdout
    expect(animal_2.reload.status).to eq('reserved')
    expect(animal_3.reload.status).to eq('another_status')
    expect(animal_4.reload.status).to be_nil
  end
end
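The test above assumes an animal factory exists. A minimal version could look like this (a sketch, assuming FactoryBot is already set up in the project):

# spec/factories/animals.rb

# frozen_string_literal: true

FactoryBot.define do
  factory :animal do
    kind { 'cat' } # any placeholder value; the test only cares about status and created_at
    status { nil }
  end
end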
That's it.
Data Migration Gem
The third option is to use the data-migrate gem.
Let's add this gem to our project:
# Gemfile
gem 'data_migrate'
And execute:
bundle install
Now you can generate a data migration as you would generate a schema migration:
rails g data_migration backfill_animal_statuses
# => db/data/20230721111716_backfill_animal_statuses.rb
Let's add some code to the generated file to check if it actually works:
# frozen_string_literal: true

class BackfillAnimalStatuses < ActiveRecord::Migration[7.0]
  def up
    puts 'Test Data Migration'
  end

  def down
    # do nothing
  end
end
To run the migration, we need to use the following command:
rake data:migrate
# or
rake db:migrate:with_data
And we get the following output:
== 20230721111716 BackfillAnimalStatuses: migrating ===========================
Test Data Migration
== 20230721111716 BackfillAnimalStatuses: migrated (0.0000s) ==================
This migration can be run only once, so let's remove it, generate another one, and add the real code inside:
rails g data_migration backfill_animal_statuses
# => db/data/20230721112534_backfill_animal_statuses.rb
Here's what we get after adding our business logic code:
# db/data/20230721112534_backfill_animal_statuses.rb

# frozen_string_literal: true

class BackfillAnimalStatuses < ActiveRecord::Migration[7.0]
  def up
    Animal.where(status: nil).where('created_at < ?', Time.zone.today).update_all(status: 'reserved')
  end

  def down
    # do nothing
  end
end
And let's run:
rake data:migrate
Here's what we get:
== Data =======================================================================
== 20230721112534 BackfillAnimalStatuses: migrating ===========================
== 20230721112534 BackfillAnimalStatuses: migrated (0.0221s) ==================
Basically, data migrations work with the same logic as schema migrations, but instead of saving the last run migration version into the schema_migrations table, a data migration saves its version into another table called data_migrations.
It's important to mention that data migrations are irreversible in most cases, but we don't want to raise an explicit error, as that would prevent rollbacks of schema structure changes. Instead, we just leave the down method empty. For the same reason, it's better to design the migration to be idempotent, so that it can safely be run several times; ours already is, since it only touches rows whose status is nil.
The data migration gem doesn't provide any additional advantages beyond what we discussed above. Therefore, we still need to address the same problems that we solved for the rake task, such as displaying output, adding transactions, and optimizing DB requests.
Comparison of Rake Task and Data Migration Gem
Let's compare these two solutions and decide which one we should use and under what conditions:
When does a rake task fit better?
- When you want to choose the exact time and day to run it.
- When you want to choose the environment to run it on (for example, staging or production).
- When you want to run the same rake task several times.
When does the data migration gem fit better?
- When you want to make sure that data will be added automatically and no one forgets to run something.
- When schema migration order is important for you (for example, schema migration adds a new column and data migration backfills this column).
- When you want to run it on all environments without additional effort.
- When you need to run the migration only once.
So, in general, a rake task is much more flexible and testable and can cover the same cases as a data migration, but it may require more effort. Data migrations are stricter, but they provide some automation and a guaranteed execution order relative to schema changes.
For example, the data migration gem fits very well when you need to support a database schema restructuring and you don't want to lose data during the change. For instance, if you need to rename a column by copying the values from the old column to a new one, then adding a NOT NULL constraint to the new column, and finally removing the old one, the data migration gem makes this process much easier than a rake task would.
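As a rough sketch of that sequence (using a hypothetical rename of kind to species on our animals table; the exact steps depend on your schema), the migrations could look like this and would run in timestamp order via rake db:migrate:with_data:

# 1) db/migrate/..._add_species_to_animals.rb (schema migration)
class AddSpeciesToAnimals < ActiveRecord::Migration[7.0]
  def change
    add_column :animals, :species, :string
  end
end

# 2) db/data/..._copy_kind_to_species.rb (data migration)
class CopyKindToSpecies < ActiveRecord::Migration[7.0]
  def up
    # Make sure the model sees the column added by the previous schema migration
    Animal.reset_column_information
    Animal.where(species: nil).update_all('species = kind')
  end

  def down
    # do nothing
  end
end

# 3) db/migrate/..._finalize_species_on_animals.rb (schema migration)
class FinalizeSpeciesOnAnimals < ActiveRecord::Migration[7.0]
  def change
    change_column_null :animals, :species, false
    remove_column :animals, :kind, :string
  end
end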
On the other hand, if the task is unrelated to database schema changes, a rake task might be a more suitable choice. It offers greater flexibility and testability, making it easier to manage tasks that are not directly tied to schema alterations.
Conclusion
Throughout this article, we have explored various strategies for data migration, generation, and backfilling in a Rails application. We have implemented and improved these strategies, carefully considering their advantages and disadvantages. Additionally, we have compared two primary solutions, the rake task and data migration gem, and examined their suitability in different scenarios.