In a world where database performance and efficiency are critical factors in application development, the swift import of a large number of records has become a necessity.
In a previous article, we extensively examined various methods for effectively creating 100,000 records with Ruby on Rails. As a quick recap, we achieved a record insertion time of less than 3 seconds using the insert_all
method.
However, I received feedback mentioning an even more efficient method than insert_all
. This method is called ActiveRecord-Copy. Based on the feedback I received, this method is expected to revolutionize my perspective on fast insertion.
Today, we are going to shift our focus to a larger scale and find the fastest method for inserting 1 million records into a database!
DataSet
For today's benchmark, we will start with the PostgreSQL database used in my previous article:
# db/schema.rb
create_table "accounts", force: :cascade do |t|
t.string "first_name"
t.string "last_name"
t.string "phone"
t.string "email"
t.string "role"
end
As we aim to load 1 million records, using FactoryBot to load data for each benchmark execution would be too time-consuming. Therefore, I preloaded one million records into a file named accounts.csv
to load the data at the beginning of the benchmark as follows:
accounts = []
CSV.foreach(Rails.root.join('accounts.csv'), headers: true) do |row|
next if row['role'] == 'role'
accounts << row.to_h
end
Now, we can feed these users
to our various methods to kickstart the insertion process!
Introduction to ActiveRecord-Copy
ActiveRecord-Copy is a gem that enables the use of the PostgreSQL COPY command. Typically, the COPY command is used to import data from a .sql file into a local database using PostgreSQL directly.
ActiveRecord-Copy facilitates interfacing this command directly within Ruby. The use of COPY minimizes the overhead associated with creating and executing multiple SQL statements for each record, resulting in remarkable speed and efficiency.
To install it in your project:
bundle add activerecord-copy
Here is how I used ActiveRecord-Copy:
columns = users.first.keys + %w[created_at updated_at]
time = Time.now.getutc
Account.copy_from_client(columns) do |copy|
accounts.each do |account|
copy << (account.values + [time, time])
end
end
When using ActiveRecord-Copy, you must specify the columns you will provide. You also need to provide the created_at
and updated_at
timestamps, which are not automatically set during the COPY process.
Benchmark
In this benchmark, we will compare insert_all
, activerecord-import
, and our new contender, activerecord-copy
.
You can find the benchmark here.
Now, let's compare our three methods for creating 1 million records:
Performance Benchmark
For the pure performance benchmark, we will run the benchmark script ten times, and here are the results:
Method | Average | Min | Max | Sum |
---|---|---|---|---|
ActiveRecord-Import | 73.42 | 68.37 | 79.30 | 734.26 |
InsertAll | 48.71 | 46.09 | 52.14 | 487.11 |
ActiveRecord-Copy | 14.81 | 11.42 | 17.46 | 148.17 |
Here is the distribution of execution times over the ten trials:
We have an intriguing ranking!
🥇 ActiveRecord Copy (~5 times faster than ActiveRecord-Import)
🥈 InsertAll (~1.5 times faster than ActiveRecord-Import)
🥉 ActiveRecord-Import
ActiveRecord-Copy demonstrates significantly better performance than InsertAll and ActiveRecord-Import.
Error Handling Comparison
Beyond raw performance, error handling is a crucial aspect when inserting a large number of records. Let's compare how the three methods behave in terms of error handling.
For these tests, we will add two validations:
- A database validation
# db/migrate/20231101115631_add_unique_on_accounts_email.rb
class AddUniqueOnAccountsEmail < ActiveRecord::Migration[7.0]
def change
add_index :accounts, :email, unique: true
end
end
- A validation in our ActiveRecord model
# app/models/account.rb
class Account < ApplicationRecord
[ ... ]
validates :first_name, presence: true, length: { maximum: 50 }
[ ... ]
end
We will then create a script to introduce errors for each of the methods:
accounts = FactoryBot.attributes_for_list(:account, 100)
# We add an account with a first_name that is too long
accounts << FactoryBot.attributes_for(:account, first_name: 'a' * 51)
# We add a duplicate email in the array
accounts << accounts.first
Here are the behaviors observed when trying to insert records with the different methods:
Method | Database Error | Validation Error |
---|---|---|
InsertAll | No error, but the record is not created in the database | Unhandled. The record is created with the excessively long first name |
ActiveRecord-Import | Raises a PG::UniqueViolation error. Can be bypassed with on_duplicate_key_ignore | No error, but the record is not created in the database |
ActiveRecord-Copy | Raises a PG::UniqueViolation error. Cannot be bypassed. | Unhandled. The record is created with the excessively long first name |
ActiveRecord-Import is the only method that effectively handles model validations. This difference can partially explain why its processing time is significantly longer than the other methods.
I am also surprised that InsertAll does not create a record when the error originates from the database. Given that the method does not raise any errors, the expected behavior would have been database insertion.
Interpretation
After analyzing the benchmark results, we can draw the following conclusions regarding the performance and error handling of the three insertion methods:
InsertAll: The InsertAll method is highly efficient, taking around 48.71 seconds to insert 1 million records. It handles database errors but ignores model validation errors.
ActiveRecord-Import: This method offers reasonable performance, taking around 73.42 seconds to insert 1 million records. Importantly, this method actively handles database errors (PG::UniqueViolation) and model validation errors. This approach results in slower insertion but allows for comprehensive validation handling.
ActiveRecord-Copy: ActiveRecord-Copy stands out in terms of performance, inserting 1 million records in just 14.81 seconds. However, concerning error handling, it behaves similarly to InsertAll, as it handles database errors (triggering a PG::UniqueViolation in our tests) but does not address model validation errors.
Recommendations
Based on the benchmark results and your specific project priorities, here are my recommendations:
-
Best Performance -
ActiveRecord-Copy
: If you seek the best performance for rapid insertion of large amounts of data, ActiveRecord-Copy is the top choice. It is approximately five times faster than ActiveRecord-Import. The only drawback is that it is exclusively available on PostgreSQL. -
Model Validation -
ActiveRecord-Import
: If handling model validation errors is essential for your project, ActiveRecord-Import is a viable option. It is important to note that this method offers extensive customization in its usage. I encourage you to consult the documentation, as it allows for advanced features like recursive model creation and skipping model validations for faster insertion. - In the worst case -
InsertAll
: If you are not using PostgreSQL and do not require model validations,InsertAll
is a solid choice. It provides good performance while handling database errors, even though it ignores model validation errors.
Conclusion
In the demanding world of application development, database performance plays a crucial role. Rapid data insertion is a common challenge, and our benchmark has highlighted significant differences between the three methods examined.
ActiveRecord-Copy stands out as the most efficient and memory-friendly method. However, ActiveRecord-Import is the only method that actively handles model validation errors.
Ultimately, the choice of the method depends on your priorities and project requirements. Each of these methods has its advantages and disadvantages, and the decision should be made based on your specific needs in terms of performance, memory usage, and error handling.
Personally, I am thrilled to have discovered ActiveRecord-Copy and look forward to applying it to my projects!
Learn More
- Repository link used in this article: https://github.com/just-the-v/fastest-way-to-insert-in-rails/tree/activerecord-copy
- PostgreSQL COPY documentation: https://docs.postgresql.fr/12/sql-copy.html
- ActiveRecord-Import documentation: https://github.com/zdennis/activerecord-import
- Video demonstration of COPY in PostgreSQL: https://www.scalingpostgres.com/tutorials/fast-postgresql-data-loading-using-ruby/
- ActiveRecord-Copy documentation: https://github.com/pganalyze/activerecord-copy
- Link to my article on how to efficiently insert 100k records: https://dev.to/justthev/performance-guide-to-create-100k-records-in-less-than-3s-using-ruby-on-rails-3k07
Top comments (3)
Great tips, thanks ❤️
Just to notice that 1 billions records will not take 30 seconds to insert (insert_all). It was 3 minutes on my machine
Hi Marwan !
What is your computer ? I may need to specify in my article that I'm running all my benchmark on my M1 Pro so maybe if you have better specs than mine ofc the benchmark will be faster.
I will add this detail to the next articles, thanks for feedback!
I don't exactly remember what is my hardware, but it's nearly an M2.
Btw, it is more about saying that we can't just multiply those figures (100K = 3s, 1M = 30s ...).
PS : Sorry it was 1 million, not 1 billion