Hey there! Are you curious about why the SQL generated by ActiveRecord's insert_all
in Ruby on Rails for MySQL includes ON DUPLICATE KEY UPDATE foo = foo? Today, we'll delve into this interesting topic and uncover the reasons behind it. Let's dive in! 😄
TL;DR
- The SQL generated by Ruby on Rails'
insert_all
includesON DUPLICATE KEY UPDATE foo = foo
at the end.- It seems that foo is the first column specified as the destination for
INSERT INTO
. - It feels strange that this specification is present in
insert_all
rather thanupsert_all
ifON DUPLICATE KEY UPDATE
is specified.
- It seems that foo is the first column specified as the destination for
- The relevant implementation can be found here:
- The commit for the implementation is:
- https://github.com/rails/rails/commit/91ed21b304c468db8ce9fd830312c151432935d0
- As you can see from the implementation, foo is set as the first column specified as the destination for
INSERT INTO
.
- As mentioned in the commit message, it is done to "skip duplicate records."
- When using
insert_all
with the minimum arguments, skip_duplicates? becomes true. - By using
insert_all!
, it is possible to raise an error and fail all the INSERTs in case of duplicates.
The SQL generated by ActiveRecord's insert_all
in Ruby on Rails for MySQL includes ON DUPLICATE KEY UPDATE foo = foo
For example, let's say there is a table called products:
class CreateProducts < ActiveRecord::Migration[7.0]
def change
create_table :products do |t|
t.string :name
t.text :description
t.timestamps
end
end
end
If you do something like this with insert_all
:
# Since Rails 7 automatically adds timestamps, created_at and updated_at are not necessary
# Here, I'm intentionally writing it in the pre-7.0 syntax to make it correspond with the generated query
now = Time.zone.now
products = [
{name: 'foo', description: 'foo_description', created_at: now, updated_at: now},
{name: 'bar', description: 'bar_description', created_at: now, updated_at: now},
{name: 'baz', description: 'baz_description', created_at: now, updated_at: now},
]
Product.insert_all(products)
The SQL issued to MySQL will look like this:
-- The SQL is issued without line breaks, but I added them here for readability
INSERT INTO
`products` (`name`, `description`,`created_at`,`updated_at`)
VALUES
('foo', 'foo_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
,('bar', 'bar_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
,('baz', 'baz_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
ON DUPLICATE KEY UPDATE `name`=`name`;
The ON DUPLICATE KEY UPDATE
statement is used to update a row instead of inserting it when a value duplicates an existing UNIQUE INDEX or PRIMARY KEY during insertion.
If you specify an ON DUPLICATE KEY UPDATE clause and a row to be inserted would cause a duplicate value in a UNIQUE index or PRIMARY KEY, an UPDATE of the old row occurs.
- 13.2.6.2 INSERT ... ON DUPLICATE KEY UPDATE Statement | MySQL 8.0 Reference Manual
ActiveRecord also has a method called upsert_all
, and it is understandable that the ON DUPLICATE KEY UPDATE
statement is specified when using it, but why is this statement issued even when using insert_all
...?
ON DUPLICATE KEY UPDATE
Statement in MySQL
As mentioned earlier, the ON DUPLICATE KEY UPDATE
statement in MySQL is used to update a row instead of inserting it when a value duplicates an existing UNIQUE INDEX or PRIMARY KEY.
It is typically used like this:
-- For example, when there is a UNIQUE INDEX constraint on the name column
-- In the case of a single row insertion
INSERT INTO
`products` (`name`, `description`,`created_at`,`updated_at`)
VALUES
('foo', 'foo_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
ON DUPLICATE KEY UPDATE `description`='foo_description';
-- When bulk inserting multiple rows, the VALUES(col_name) function can be used
-- VALUES(col_name) function can also be used for single row insertion
INSERT INTO
`products` (`name`, `description`,`created_at`,`updated_at`)
VALUES
('foo', 'foo_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
,('bar', 'bar_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
,('baz', 'baz_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
ON DUPLICATE KEY UPDATE `description`=VALUES(`description`);
In this way, when inserting rows using INSERT INTO
, if a value duplicates an existing UNIQUE INDEX or PRIMARY KEY, instead of inserting the row, it updates the existing name column value, effectively not updating anything. Why is such a specification being used?
Reading the Implementation of ActiveRecord
I couldn't find the exact description in the reference documentation after a quick search, so let's read the implementation. Open-source software is great!
While reading through it, I found the relevant implementation.
Here's a snippet:
rails/activerecord/lib/active_record/connection_adapters/abstract_mysql_adapter.rb
def build_insert_sql(insert) # :nodoc:
sql = +"INSERT #{insert.into} #{insert.values_list}"
if insert.skip_duplicates?
no_op_column = quote_column_name(insert.keys.first)
sql << " ON DUPLICATE KEY UPDATE #{no_op_column}=#{no_op_column}"
elsif insert.update_duplicates?
sql << " ON DUPLICATE KEY UPDATE "
if insert.raw_update_sql?
sql << insert.raw_update_sql
else
sql << insert.touch_model_timestamps_unless { |column| "#{column}<=>VALUES(#{column})" }
sql << insert.updatable_columns.map { |column| "#{column}=VALUES(#{column})" }.join(",")
end
end
sql
end
It appends ON DUPLICATE KEY UPDATE #{no_op_column}=#{no_op_column}
to the SQL if insert.skip_duplicates?
is true.
Since no_op_column = quote_column_name(insert.keys.first)
, it sets the first column specified as the destination for INSERT INTO as #{no_op_column}=#{no_op_column}
.
Why is this implementation like that? Let's look at the commit that was made when this feature was implemented.
https://github.com/rails/rails/commit/91ed21b304c468db8ce9fd830312c151432935d0
Adds a method to ActiveRecord allowing records to be inserted in bulk without instantiating ActiveRecord models. This method supports options for handling uniqueness violations by skipping duplicate records or overwriting them in an UPSERT operation.
ActiveRecord already supports bulk-update and bulk-destroy actions that execute SQL UPDATE and DELETE commands directly. It also supports bulk-read actions through
pluck
. It makes sense for it also to support bulk-creation.
It mentions that it supports options for handling uniqueness violations by skipping duplicate records or overwriting them in an UPSERT operation.
The corresponding pull request is this one:
https://github.com/rails/rails/pull/35077
In the discussions within this pull request:
I just pushed a commit that:
Extracts a command object
Splits the API into six methods
insert! / insert_all!
insert / insert_all (skip duplicates)
upsert / upsert_all
It is mentioned that insert_all
skips duplicate records, while insert_all!
raises an error if duplicates are encountered.
Let's follow the code a bit more. insert.skip_duplicates?
is defined here:
rails/activerecord/lib/active_record/insert_all.rb
def skip_duplicates?
on_duplicate == :skip
end
The implementations of the insert_all
and insert_all!
methods are here.
There are many comments on the behavior of the methods and examples of runtime behavior that are worth reading.
activerecord/lib/active_record/persistence.rb
# (comments and code excerpted by the author of this article)
# Rows are considered to be unique by every unique index on the table. Any
# duplicate rows are skipped.
def insert_all(attributes, returning: nil, unique_by: nil, record_timestamps: nil)
InsertAll.new(self, attributes, on_duplicate: :skip, returning: returning, unique_by: unique_by, record_timestamps: record_timestamps).execute
end
# Raises ActiveRecord::RecordNotUnique if any rows violate a
# unique index on the table. In that case, no rows are inserted.
def insert_all!(attributes, returning: nil, record_timestamps: nil)
InsertAll.new(self, attributes, on_duplicate: :raise, returning: returning, record_timestamps: record_timestamps).execute
end
Findings from the Investigation
Regarding the main question of this article, "Why does insert_all generate ON DUPLICATE KEY UPDATE foo = foo?", the reason is that it is implemented in such a way to skip duplicate rows without raising an error when a duplicate error occurs.
Additional findings include the fact that using insert_all!
allows you to raise an exception when a duplicate error occurs.
In Ruby on Rails, ActiveRecord follows a convention where method names ending with !
will raise an exception when there is an error, while methods without the !
suffix will not raise an exception. This convention is also followed in this implementation. For example, Foo.create!
will raise an exception on error, while Foo.create
will not raise an exception.
After conducting this investigation and revisiting the references, I found that this information is indeed documented:
- insert_all | api.rubyonrails.org
- insert_all! | api.rubyonrails.org
That's all for this investigation. See you later!
Top comments (0)