This post was 100% created with my examples, code, and real experiences, but formatted with AI assistance for better organization. AI can help us format and structure content, but it doesn't replace the knowledge and practical experience that we developers bring!
Importing data from JSON or Excel files is a common task in Rails development, but many people still do it inefficiently. Today I'll show you three approaches with impressive results: in a local test with just 10,000 records, the difference was from ~40 seconds to ~5 seconds - an improvement of roughly 8x!
What this post does NOT cover
- How to efficiently read very large files (that's for another post)
- Advanced chunked/batched imports (to keep the post concise)
- Parallelization strategies with Sidekiq/ActiveJob
Golden tip: For PostgreSQL users, the book High Performance PostgreSQL for Rails is a must-read!
Setting Up Our Scenario
First, let's create mock data to test our implementations:
# file_generator.rb (at the root of your project)
require 'json'
require 'faker'

base_users = [
  { name: "John Smith", email: "john@example.com", bio: "Ruby on Rails dev..." },
  { name: "Sarah Johnson", email: "sarah@example.com", bio: "Tech Lead..." },
  { name: "Mike Wilson", email: "mike@example.com", bio: "Full-stack dev..." },
  { name: "Emma Davis", email: "emma@example.com", bio: "DevOps engineer..." },
  { name: "Alex Rodriguez", email: "alex@example.com", bio: "Senior dev..." }
]

base_titles = [
  "Complete Guide to Active Record Queries",
  "Avoiding the N+1 Problem in Rails",
  "TDD with RSpec: From Basic to Advanced"
]

base_contents = [
  "In this article, we explore...",
  "Let's dive deep into...",
  "This tutorial covers..."
]

base_categories = ["Ruby on Rails", "Performance", "Database", "DevOps", "Architecture"]

posts = []
10_000.times do |i|
  user = base_users.sample
  post = {
    title: "#{base_titles.sample} ##{i}",
    content: "#{base_contents.sample} #{Faker::Lorem.paragraph(sentence_count: 5)}",
    published: [true, false].sample,
    user: user,
    categories: base_categories.sample(rand(1..3))
  }
  posts << post
end

File.write("blog_json_data_10k.json", JSON.pretty_generate(posts))
How to run
- Generate the JSON file with test data:
# At the root of your Rails project, run:
ruby file_generator.rb
This will create a blog_json_data_10k.json file with 10,000 mock posts.
- To test the different implementations in Rails console:
# Open Rails console
rails console
# To test the less recommended implementation (grab some coffee!)
Importer::BadImporter.import!
# To test the reasonable implementation
Importer::BlogDataImporter.import!
# To test the optimized implementation (it will fly!)
Importer::BlogDataImporterWithActiveRecordImport.import!
Note: You can change 10_000.times to generate more or less data. To start, 1,000 records already show the difference well!
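For reference, once parsed, each element of the generated file comes back as a plain Ruby hash roughly like this (illustrative values; yours will differ because of Faker):
# One parsed element of blog_json_data_10k.json (illustrative, values will vary)
{
  "title" => "Avoiding the N+1 Problem in Rails #42",
  "content" => "Let's dive deep into... Lorem ipsum dolor sit amet...",
  "published" => true,
  "user" => {
    "name" => "Sarah Johnson",
    "email" => "sarah@example.com",
    "bio" => "Tech Lead..."
  },
  "categories" => ["Ruby on Rails", "Performance"]
}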
The JsonImporter Class (Helper)
You may have noticed we're using Importer::JsonImporter in all examples. It's a simple helper class to read the JSON file:
# app/services/importer/json_importer.rb
module Importer
  class JsonImporter
    attr_reader :file_name, :file_path

    def initialize(file_name:)
      @file_name = file_name
      @file_path = Rails.root.join(file_name)
    end

    def import!
      unless File.exist?(@file_path)
        raise "File not found: #{@file_path}"
      end

      puts "Reading JSON file: #{@file_name}"
      file_content = File.read(@file_path)
      JSON.parse(file_content)
    rescue JSON::ParserError => e
      raise "Invalid JSON format: #{e.message}"
    rescue StandardError => e
      raise "Error reading file: #{e.message}"
    end
  end
end
This is a basic implementation that loads the entire file into memory. There are MUCH more performant and efficient ways to read large files (streaming, chunking, etc.), but that's for another post! The focus here is on importing data to the database.
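One thing the importers below take for granted is the schema. Here's a rough sketch of the tables I'm assuming for User, Post, Category, and PostCategory (my own reconstruction, not taken from the post's repository - adapt names and types to your app). The unique indexes on users.email and categories.name are what make the duplicate handling in Examples 2 and 3 behave sensibly:
# db/migrate/20240101000000_create_blog_tables.rb (hypothetical migration, adjust to your app)
class CreateBlogTables < ActiveRecord::Migration[7.1]
  def change
    create_table :users do |t|
      t.string :name
      t.string :email, null: false
      t.text   :bio
      t.timestamps
    end
    add_index :users, :email, unique: true

    create_table :categories do |t|
      t.string :name, null: false
      t.timestamps
    end
    add_index :categories, :name, unique: true

    create_table :posts do |t|
      t.string     :title
      t.text       :content
      t.boolean    :published, default: false
      t.references :user, foreign_key: true
      t.timestamps
    end

    create_table :post_categories do |t|
      t.references :post, foreign_key: true
      t.references :category, foreign_key: true
      t.timestamps
    end
  end
end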
Example 1: The Less Recommended Way
# app/services/importer/bad_importer.rb
module Importer
  class BadImporter
    def self.import!
      start_time = Time.current
      blog_json = Importer::JsonImporter.new(file_name: "blog_json_data_10k.json").import!

      # Process each post individually
      blog_json.each do |post_data|
        # Create user for every post (checks for duplicates every time!)
        user = User.find_or_create_by(email: post_data["user"]["email"]) do |u|
          u.name = post_data["user"]["name"]
          u.bio = post_data["user"]["bio"]
        end

        # Create post
        post = Post.find_or_create_by(title: post_data["title"]) do |p|
          p.content = post_data["content"]
          p.published = post_data["published"]
          p.user = user
        end

        # Create categories for every post (more checks!)
        post_data["categories"].each do |category_name|
          category = Category.find_or_create_by(name: category_name)
          # Create association
          PostCategory.find_or_create_by(post: post, category: category)
        end
      end

      elapsed_time = Time.current - start_time
      puts "Total time: #{elapsed_time.round(2)} seconds"
    end
  end
end
Why is this code not recommended?
- Query explosion: For each post, we make multiple queries (find_or_create_by) - see the measurement sketch below
- No transaction: If something fails in the middle, you'll have partial data in the database
- Poor performance: In my test, it took ~40 seconds for just 10,000 records!
- Unnecessary resource usage: Checks for duplicates on every iteration
- No failure protection: An error on any line can leave garbage in the database
Reality: I've seen code like this in production processing a lot of records. Imagine the time!
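If you want to see the query explosion with your own eyes, here's a minimal sketch (my addition, not part of the original importer) that counts the SQL statements fired while the bad import runs, using ActiveSupport::Notifications. Every find_or_create_by issues at least a SELECT, plus an INSERT when the record is missing, so expect a number in the tens of thousands:
# Run in rails console: count SQL statements emitted during the import
query_count = 0
subscriber = ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, _start, _finish, _id, payload|
  query_count += 1 unless payload[:name] == "SCHEMA" # skip schema housekeeping queries
end

Importer::BadImporter.import!
puts "SQL statements executed: #{query_count}"

ActiveSupport::Notifications.unsubscribe(subscriber)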
Example 2: The Reasonable Way
# app/services/importer/blog_data_importer.rb
module Importer
  class BlogDataImporter
    def self.import!
      start_time = Time.current

      ActiveRecord::Base.transaction do
        blog_json = Importer::JsonImporter.new(file_name: "blog_json_data_10k.json").import!

        # Extract unique data BEFORE inserting
        categories = blog_json.map { |post| post["categories"] }.flatten.uniq
        users = blog_json.map { |post| post["user"] }.uniq { |user| user["email"] }

        # Insert categories and users all at once
        Category.insert_all(categories.map { |category| { name: category } })
        User.insert_all(users.map { |user| { name: user["name"], email: user["email"], bio: user["bio"] } })

        # Create lookup hashes (KEY OPTIMIZATION!)
        categories_hash = Category.all.pluck(:name, :id).to_h
        users_hash = User.all.pluck(:email, :id).to_h

        # Import posts
        blog_json.each do |post|
          result = {
            title: post["title"],
            content: post["content"],
            published: post["published"],
            user_id: users_hash[post["user"]["email"]]
          }
          post_data = Post.create!(result)

          # Create post-category associations in batch
          PostCategory.insert_all(
            post["categories"].map { |category|
              { post_id: post_data.id, category_id: categories_hash[category] }
            }
          )
        end
      rescue StandardError => e
        error = "Error: #{e.message}"
        puts error
        Rails.logger.error error
        # Re-raise as a rollback so the transaction doesn't commit partial data
        raise ActiveRecord::Rollback
      end

      elapsed_time = Time.current - start_time
      puts "Total time: #{elapsed_time.round(2)} seconds"
    end
  end
end
Why is this code much better?
- Uses transaction: Ensures atomicity - either imports everything or nothing
- insert_all: Drastically reduces the number of queries
- Lookup hashes: Turns thousands of per-record database lookups into O(1) in-memory hash accesses - genius!
- Process duplicates only once: Much more efficient
The Magic of Lookup Hashes
categories_hash = Category.all.pluck(:name, :id).to_h
# Result: {"Ruby on Rails" => 1, "Performance" => 2, ...}
Instead of searching for the category in the database for each post (10,000 searches!), we make one query and access via hash. This is pure gold for performance!
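To make the contrast concrete, here is the same lookup written both ways (illustrative snippet, not part of the importer code):
# Inside the import loop, without a lookup hash: one SELECT per category, per post
category_id = Category.find_by(name: category_name)&.id

# With a lookup hash: one query before the loop, then plain in-memory access
categories_hash = Category.all.pluck(:name, :id).to_h
category_id = categories_hash[category_name] # no database hit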
Example 3: The OPTIMIZED Way with activerecord-import
First, add to your Gemfile:
gem 'activerecord-import'
# If using PostgreSQL (as in the repository example)
gem 'pg', '~> 1.1'
Note: This post works with any database (SQLite, MySQL, PostgreSQL). In the example code I've made available on GitHub, I used PostgreSQL, but you can use whatever database you prefer!
Now watch the magic happen:
# app/services/importer/blog_data_importer_with_active_record_import.rb
module Importer
  class BlogDataImporterWithActiveRecordImport
    def self.import!
      start_time = Time.current

      ActiveRecord::Base.transaction do
        blog_json = Importer::JsonImporter.new(file_name: "blog_json_data_10k.json").import!

        # Prepare unique data
        categories = blog_json.map { |post| post["categories"] }.flatten.uniq
        users = blog_json.map { |post| post["user"] }.uniq { |user| user["email"] }

        # Import categories and users in batch
        category_objects = categories.map { |name| Category.new(name: name) }
        Category.import category_objects, on_duplicate_key_ignore: true, validate: false

        user_objects = users.map { |user|
          User.new(name: user["name"], email: user["email"], bio: user["bio"])
        }
        User.import user_objects, on_duplicate_key_ignore: true, validate: false

        # Create lookup hashes
        categories_hash = Category.all.pluck(:name, :id).to_h
        users_hash = User.all.pluck(:email, :id).to_h

        # Import posts in batches to save memory
        blog_json.in_groups_of(1000, false) do |post_batch|
          posts_to_import = post_batch.map do |post|
            Post.new(
              title: post["title"],
              content: post["content"],
              published: post["published"],
              user_id: users_hash[post["user"]["email"]]
            )
          end

          # Import this batch of posts (without validations for maximum performance!)
          Post.import posts_to_import, validate: false
        end

        # Prepare and import associations in batches
        Post.where(title: blog_json.map { |p| p["title"] }).find_in_batches(batch_size: 1000) do |post_batch|
          post_categories = []

          post_batch.each do |post|
            post_data = blog_json.find { |p| p["title"] == post.title }
            post_data["categories"].each do |category_name|
              post_categories << PostCategory.new(
                post_id: post.id,
                category_id: categories_hash[category_name]
              )
            end
          end

          # Import associations for this batch (without validations!)
          PostCategory.import post_categories, validate: false if post_categories.any?
        end
      rescue StandardError => e
        error = "Error: #{e.message}"
        puts error
        Rails.logger.error error
        # Re-raise as a rollback so the transaction doesn't commit partial data
        raise ActiveRecord::Rollback
      end

      elapsed_time = Time.current - start_time
      puts "Total time: #{elapsed_time.round(2)} seconds"
    end
  end
end
Why is activerecord-import AMAZING?
- Optimized SQL: Generates a single INSERT with multiple VALUES (you can verify this with the sketch below)
- Memory management: Processes in batches of 1000 records
- Total flexibility: Optional validations, upserts, callbacks
- Brutal performance: in my test, ~5 seconds!
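A quick way to verify what actually ran is the result object that .import returns (a small sketch based on the gem's documented return value, not part of the original code):
# .import returns a result object you can inspect after each call
result = Post.import posts_to_import, validate: false

puts result.num_inserts            # number of INSERT statements actually executed
puts result.ids.size               # primary keys of the inserted rows (on adapters that support RETURNING, like PostgreSQL)
puts result.failed_instances.size  # records rejected by validations (only populated when validate: true)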
The gem offers powerful options:
- on_duplicate_key_ignore: Silently ignores duplicates
- on_duplicate_key_update: Updates existing records (example right after this list)
- validate: false: Skips Rails validations (extra performance gain!)
- batch_size: Controls memory usage
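Here's roughly how those options look in practice (a sketch using the unique index on users.email assumed earlier; the conflict_target form is PostgreSQL-specific):
# Upsert users: when an email already exists, refresh name and bio instead of skipping the row
User.import user_objects,
            on_duplicate_key_update: { conflict_target: [:email], columns: [:name, :bio] },
            validate: false

# Cap memory usage by letting the gem split the INSERT into batches itself
Post.import posts_to_import, batch_size: 1_000, validate: false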
Real Performance Comparison
Local test with 10,000 records on an Apple M3 Pro with 18GB RAM:
- BadImporter: ~40 seconds
- BlogDataImporter: ~15 seconds
- BlogDataImporterWithActiveRecordImport: ~5 seconds
The optimized version was 8x faster than the less recommended one!
Important about timing: These tests were done on a powerful machine (M3 Pro, 18GB RAM). On a basic VPS (1GB RAM, 1 vCPU), these times could be 3-4x higher - meaning BadImporter could take 2-3 minutes! But the improvement ratio remains: the optimized version will always be much faster, regardless of hardware.
With larger volumes (100,000 or 1,000,000 records), you'd need even more advanced strategies like parallel processing, queues, or specialized tools - but even this version is infinitely better than the traditional approach!
Conclusion and Best Practices
The difference between a poorly done import and an optimized one can mean:
- Your script running in seconds instead of hours
- Less database load
- Lower memory usage
- Automatic rollback in case of error
Always:
- Use transactions to ensure consistency
- Use batch insertions (insert_all or activerecord-import)
- Use lookup hashes to avoid unnecessary queries
- Process unique data before insertion
- Work in batches to control memory usage
activerecord-import is your best friend for bulk imports!
Liked this post? Leave a ❤️ and share with other Rails devs!
Want more content about performance and databases?
Comment below what you'd like to see:
- Advanced strategies with the activerecord-import gem
- Migrations and column changes in existing databases with millions of records
- Real cases with Kafka or SQS for asynchronous processing
- Other database optimization strategies
Just ask and we'll prepare the content!