Rodrigo Barreto for VΓ­deos de Ti

Data Import in Rails: From Less Recommended to Optimised πŸš€πŸ‡¬πŸ‡§

This post was 100% created with my examples, code, and real experiences, but formatted with AI assistance for better organization. AI can help us format and structure content, but it doesn't replace the knowledge and practical experience that we developers bring!

Importing data from JSON or Excel files is a common task in Rails development, but many people still do it inefficiently. Today I'll show you three approaches with very different results: in a local test with just 10,000 records, the time dropped from ~40 seconds to ~5 seconds - roughly an 8x improvement!

What this post does NOT cover πŸ“

  • How to efficiently read very large files (that's for another post)
  • Advanced chunked/batched imports (to keep the post concise)
  • Parallelization strategies with Sidekiq/ActiveJob

πŸ’‘ Golden tip: For PostgreSQL users, the book High Performance PostgreSQL for Rails is a must-read!

Setting Up Our Scenario 🎬

First, let's create mock data to test our implementations:

# file_generator.rb (at the root of your project)
require 'json'
require 'faker'

base_users = [
  { name: "John Smith", email: "john@example.com", bio: "Ruby on Rails dev..." },
  { name: "Sarah Johnson", email: "sarah@example.com", bio: "Tech Lead..." },
  { name: "Mike Wilson", email: "mike@example.com", bio: "Full-stack dev..." },
  { name: "Emma Davis", email: "emma@example.com", bio: "DevOps engineer..." },
  { name: "Alex Rodriguez", email: "alex@example.com", bio: "Senior dev..." }
]

base_titles = [
  "Complete Guide to Active Record Queries",
  "Avoiding the N+1 Problem in Rails",
  "TDD with RSpec: From Basic to Advanced"
]

base_contents = [
  "In this article, we explore...",
  "Let's dive deep into...",
  "This tutorial covers..."
]

base_categories = ["Ruby on Rails", "Performance", "Database", "DevOps", "Architecture"]

posts = []

10_000.times do |i|
  user = base_users.sample
  post = {
    title: "#{base_titles.sample} ##{i}",
    content: "#{base_contents.sample} #{Faker::Lorem.paragraph(sentence_count: 5)}",
    published: [true, false].sample,
    user: user,
    categories: base_categories.sample(rand(1..3))
  }
  posts << post
end

File.write("blog_json_data_10k.json", JSON.pretty_generate(posts))

How to run πŸƒβ€β™‚οΈ

  1. Generate the JSON file with test data:
# At the root of your Rails project, run:
ruby file_generator.rb

This will create a blog_json_data_10k.json file with 10,000 mock posts.

  2. To test the different implementations in the Rails console:
# Open Rails console
rails console

# To test the less recommended implementation (grab some coffee!)
Importer::BadImporter.import!

# To test the reasonable implementation
Importer::BlogDataImporter.import!

# To test the optimized implementation (it will fly!)
Importer::BlogDataImporterWithActiveRecordImport.import!

πŸ’‘ Note: You can change 10_000.times to generate more or fewer records. Even 1,000 records are enough to show the difference clearly!

The JsonImporter Class (Helper) πŸ“„

All of the importers below use Importer::JsonImporter. It's a simple helper class that reads the JSON file:

# app/services/importer/json_importer.rb
module Importer
  class JsonImporter
    attr_reader :file_name, :file_path

    def initialize(file_name:)
      @file_name = file_name
      @file_path = Rails.root.join(file_name)
    end

    def import!
      unless File.exist?(@file_path)
        raise "File not found: #{@file_path}"
      end

      puts "πŸ“ Reading JSON file: #{@file_name}"
      file_content = File.read(@file_path)
      JSON.parse(file_content)
    rescue JSON::ParserError => e
      raise "Invalid JSON format: #{e.message}"
    rescue StandardError => e
      raise "Error reading file: #{e.message}"
    end
  end
end

This is a basic implementation that loads the entire file into memory. There are MUCH more performant and efficient ways to read large files (streaming, chunking, etc.), but that's for another post! The focus here is on importing data to the database.

Example 1: The Less Recommended Way ❌

# app/services/importer/bad_importer.rb
module Importer
  class BadImporter
    def self.import!
      start_time = Time.current

      blog_json = Importer::JsonImporter.new(file_name: "blog_json_data_10k.json").import!

      # Process each post individually
      blog_json.each do |post_data|

        # Create user for every post (checks for duplicates every time!)
        user = User.find_or_create_by(email: post_data["user"]["email"]) do |u|
          u.name = post_data["user"]["name"]
          u.bio = post_data["user"]["bio"]
        end

        # Create post
        post = Post.find_or_create_by(title: post_data["title"]) do |p|
          p.content = post_data["content"]
          p.published = post_data["published"]
          p.user = user
        end

        # Create categories for every post (more checks!)
        post_data["categories"].each do |category_name|
          category = Category.find_or_create_by(name: category_name)

          # Create association
          PostCategory.find_or_create_by(post: post, category: category)
        end
      end

      elapsed_time = Time.current - start_time
      puts "Total time: #{elapsed_time.round(2)} seconds"
    end
  end
end

Why is this code not recommended? πŸ€”

  1. Query explosion: For each post, we make multiple find_or_create_by queries (see the sketch below)
  2. No transaction: If something fails in the middle, you'll have partial data in the database
  3. Poor performance: In my test, it took ~40 seconds for just 10,000 records!
  4. Unnecessary resource usage: Checks for duplicates on every iteration
  5. No failure protection: An error on any line can leave garbage in the database

⚠️ Reality: I've seen code exactly like this in production, processing large volumes of records. Imagine the time it takes!
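
To make the "query explosion" concrete, here is roughly what a single loop iteration sends to the database on a first run. This is a stripped-down sketch - the values are illustrative and the exact SQL depends on your schema and adapter:

# Roughly what ONE iteration of BadImporter costs on a first run
user = User.find_or_create_by(email: "john@example.com")          # SELECT + INSERT
post = Post.find_or_create_by(title: "Some title #42")            # SELECT + INSERT
["Ruby on Rails", "Performance"].each do |name|
  category = Category.find_or_create_by(name: name)               # SELECT (+ INSERT the first time)
  PostCategory.find_or_create_by(post: post, category: category)  # SELECT + INSERT
end
# That's easily 8+ queries per post. Multiplied by 10,000 posts, you're looking at
# on the order of 100,000 round trips to the database - hence the ~40 seconds.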

Example 2: The Reasonable Way βœ…

# app/services/importer/blog_data_importer.rb
module Importer
  class BlogDataImporter
    def self.import!
      start_time = Time.current

      ActiveRecord::Base.transaction do
        blog_json = Importer::JsonImporter.new(file_name: "blog_json_data_10k.json").import!

        # Extract unique data BEFORE inserting
        categories = blog_json.map{ |post| post["categories"] }.flatten.uniq
        users = blog_json.map{ |post| post["user"] }.uniq { |user| user["email"] }

        # Insert categories and users all at once
        Category.insert_all(categories.map { |category| {name: category} })
        User.insert_all(users.map { |user| {name: user["name"], email: user["email"], bio: user["bio"]} })

        # Create lookup hashes (KEY OPTIMIZATION!)
        categories_hash = Category.all.pluck(:name, :id).to_h
        users_hash = User.all.pluck(:email, :id).to_h

        # Import posts
        blog_json.each do |post|
          result = {
            title: post["title"],
            content: post["content"],
            published: post["published"],
            user_id: users_hash[post["user"]["email"]]
          }
          post_data = Post.create!(result)

          # Create post-category associations in batch
          PostCategory.insert_all(
            post["categories"].map { |category| 
              {post_id: post_data.id, category_id: categories_hash[category]} 
            }
          )
        end

      rescue StandardError => e
        error = "Error: #{e.message}"
        puts error
        Rails.logger.error error
        # Re-raise so the whole transaction rolls back instead of committing partial data
        raise ActiveRecord::Rollback
      end

      elapsed_time = Time.current - start_time
      puts "Total time: #{elapsed_time.round(2)} seconds"
    end
  end
end

Why is this code much better? πŸ‘

  1. Uses transaction: Ensures atomicity - either imports everything or nothing
  2. insert_all: Drastically reduces the number of queries (see the sketch after this list)
  3. Lookup hashes: Replaces one database query per record with an O(1) in-memory lookup - genius!
  4. Process duplicates only once: Much more efficient
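
A side note on insert_all: it skips validations and callbacks entirely, and the plain call in the example will raise (or create duplicates, depending on your indexes) if the script is run twice. Here is a minimal sketch of making it re-run-safe, reusing the users and categories arrays from the example and assuming Rails 6+, PostgreSQL, and unique indexes on users.email and categories.name (the index names below are hypothetical):

# unique_by skips rows that would violate the named unique index instead of raising,
# which makes the import safe to re-run (the index itself must already exist).
User.insert_all(
  users.map { |user| { name: user["name"], email: user["email"], bio: user["bio"] } },
  unique_by: :index_users_on_email
)

# On PostgreSQL, returning hands back the inserted rows, which can seed the lookup hash
# directly. Caveat: rows skipped as duplicates are not returned, so on a re-run you
# would still fall back to the pluck shown in the example.
result = Category.insert_all(
  categories.map { |name| { name: name } },
  unique_by: :index_categories_on_name,
  returning: %w[id name]
)
categories_hash = result.rows.to_h { |id, name| [name, id] }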

The Magic of Lookup Hashes 🎩

categories_hash = Category.all.pluck(:name, :id).to_h
# Result: {"Ruby on Rails" => 1, "Performance" => 2, ...}

Instead of searching for the category in the database for each post (10,000 searches!), we make one query and access via hash. This is pure gold for performance!
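
To see the difference in practice, compare the two ways of resolving a category id while building each post (category_name here stands for a name coming out of the parsed JSON):

# Without the hash: one query per lookup, repeated inside the loop for every post.
category_id = Category.find_by(name: category_name).id

# With the hash: a single query up front, then pure in-memory access inside the loop.
categories_hash = Category.all.pluck(:name, :id).to_h
category_id = categories_hash[category_name]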

Example 3: The OPTIMIZED Way with activerecord-import πŸš€

First, add to your Gemfile:

gem 'activerecord-import'
# If using PostgreSQL (as in the repository example)
gem 'pg', '~> 1.1'

πŸ’‘ Note: This post works with any database (SQLite, MySQL, PostgreSQL). In the example code I've made available on GitHub, I used PostgreSQL, but you can use whatever database you prefer!

Now watch the magic happen:

# app/services/importer/blog_data_importer_with_active_record_import.rb
module Importer
  class BlogDataImporterWithActiveRecordImport
    def self.import!
      start_time = Time.current

      ActiveRecord::Base.transaction do
        blog_json = Importer::JsonImporter.new(file_name: "blog_json_data_10k.json").import!

        # Prepare unique data
        categories = blog_json.map{ |post| post["categories"] }.flatten.uniq
        users = blog_json.map{ |post| post["user"] }.uniq { |user| user["email"] }

        # Import categories and users in batch
        category_objects = categories.map { |name| Category.new(name: name) }
        Category.import category_objects, on_duplicate_key_ignore: true, validate: false

        user_objects = users.map { |user| 
          User.new(name: user["name"], email: user["email"], bio: user["bio"]) 
        }
        User.import user_objects, on_duplicate_key_ignore: true, validate: false

        # Create lookup hashes
        categories_hash = Category.all.pluck(:name, :id).to_h
        users_hash = User.all.pluck(:email, :id).to_h

        # Import posts in batches to save memory
        blog_json.in_groups_of(1000, false) do |post_batch|
          posts_to_import = post_batch.map do |post|
            Post.new(
              title: post["title"],
              content: post["content"],
              published: post["published"],
              user_id: users_hash[post["user"]["email"]]
            )
          end

          # Import this batch of posts (without validations for maximum performance!)
          Post.import posts_to_import, validate: false
        end

        # Prepare and import associations in batches
        Post.where(title: blog_json.map { |p| p["title"] }).find_in_batches(batch_size: 1000) do |post_batch|
          post_categories = []

          post_batch.each do |post|
            post_data = blog_json.find { |p| p["title"] == post.title }
            post_data["categories"].each do |category_name|
              post_categories << PostCategory.new(
                post_id: post.id,
                category_id: categories_hash[category_name]
              )
            end
          end

          # Import associations for this batch (without validations!)
          PostCategory.import post_categories, validate: false if post_categories.any?
        end

      rescue StandardError => e
        error = "Error: #{e.message}"
        puts error
        Rails.logger.error error
        # Re-raise so the whole transaction rolls back instead of committing partial data
        raise ActiveRecord::Rollback
      end

      elapsed_time = Time.current - start_time
      puts "Total time: #{elapsed_time.round(2)} seconds"
    end
  end
end

Why is activerecord-import AMAZING? 🎯

  1. Optimized SQL: Generates a single multi-row INSERT per batch instead of one statement per record
  2. Memory management: Processes in batches of 1,000 records
  3. Total flexibility: Optional validations, upserts, callbacks
  4. Blazing performance: ~5 seconds in my test!

The gem offers powerful options (there's a quick sketch after this list):

  • on_duplicate_key_ignore: Silently ignores duplicates
  • on_duplicate_key_update: Updates existing records
  • validate: false: Skips Rails validations (extra performance gain!)
  • batch_size: Controls memory usage
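
For example, here is a minimal upsert sketch with on_duplicate_key_update, reusing the users array from Example 3 and assuming PostgreSQL with a unique index on users.email (on MySQL the option takes a plain array of columns instead of the conflict_target hash):

user_objects = users.map { |user| User.new(name: user["name"], email: user["email"], bio: user["bio"]) }

# If an email already exists, update its name and bio instead of skipping or raising.
# Each batch still compiles to a single multi-row INSERT ... ON CONFLICT statement.
User.import user_objects,
            validate: false,
            batch_size: 1_000,
            on_duplicate_key_update: { conflict_target: [:email], columns: [:name, :bio] }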

Real Performance Comparison πŸ“Š

Local test with 10,000 records on an Apple M3 Pro with 18GB RAM:

  • BadImporter: ~40 seconds 😱
  • BlogDataImporter: ~15 seconds 😊
  • BlogDataImporterWithActiveRecordImport: ~5 seconds πŸš€

The optimized version was 8x faster than the less recommended one!

⚠️ Important about timing: These tests were done on a powerful machine (M3 Pro, 18GB RAM). On a basic VPS (1GB RAM, 1 vCPU), these times could be 3-4x higher - meaning BadImporter could take 2-3 minutes! But the improvement ratio remains: the optimized version will always be much faster, regardless of hardware.

With larger volumes (100,000 or 1,000,000 records), you'd need even more advanced strategies like parallel processing, queues, or specialized tools - but even this version is infinitely better than the traditional approach!

Conclusion and Best Practices πŸŽ‰

The difference between a poorly done import and an optimized one can mean:

  • Your script running in seconds instead of hours
  • Less database load
  • Lower memory usage
  • Automatic rollback in case of error

Always use:

  1. Transactions to ensure consistency
  2. Batch insertions (insert_all or activerecord-import)
  3. Lookup hashes to avoid unnecessary queries
  4. Process unique data before insertion
  5. Batches to control memory usage

activerecord-import is your best friend for bulk imports!


Liked this post? Leave a ❀️ and share with other Rails devs!

Want more content about performance and databases?

Comment below what you'd like to see:

  • Advanced strategies with the activerecord-import gem
  • Migrations and column changes in existing databases with millions of records
  • Real cases with Kafka or SQS for asynchronous processing
  • Other database optimization strategies

Just ask and we'll prepare the content! πŸš€

#rails #ruby #performance #database #postgresql #activerecord
