Junya Kitayama

Posted on Jul 2, 2023

Why is ON DUPLICATE KEY UPDATE included in the SQL issued by insert_all in ActiveRecord when using MySQL?

#rails #ruby

Hey there! Are you curious about why the SQL generated by ActiveRecord's insert_all in Ruby on Rails for MySQL includes ON DUPLICATE KEY UPDATE foo = foo? Today, we'll delve into this interesting topic and uncover the reasons behind it. Let's dive in! 😄

TL;DR

The SQL generated by Ruby on Rails' insert_all includes ON DUPLICATE KEY UPDATE foo = foo at the end.
- It seems that foo is the first column specified as the destination for INSERT INTO.
- It feels strange that this specification is present in insert_all rather than upsert_all if ON DUPLICATE KEY UPDATE is specified.
The relevant implementation can be found here:
- https://github.com/rails/rails/blob/5ed37b35d666b833aeccb14a4cacd2926251232d/activerecord/lib/active_record/connection_adapters/abstract_mysql_adapter.rb#L624
The commit for the implementation is:
- https://github.com/rails/rails/commit/91ed21b304c468db8ce9fd830312c151432935d0
- As you can see from the implementation, foo is set as the first column specified as the destination for INSERT INTO.
As mentioned in the commit message, it is done to "skip duplicate records."
When using insert_all with the minimum arguments, skip_duplicates? becomes true.
- https://github.com/rails/rails/blob/5ed37b35d666b833aeccb14a4cacd2926251232d/activerecord/lib/active_record/insert_all.rb#L61
- https://github.com/rails/rails/blob/5ed37b35d666b833aeccb14a4cacd2926251232d/activerecord/lib/active_record/persistence.rb#L176
By using insert_all!, it is possible to raise an error and fail all the INSERTs in case of duplicates.
- https://api.rubyonrails.org/classes/ActiveRecord/Persistence/ClassMethods.html#method-i-insert_all-21

The SQL generated by ActiveRecord's `insert_all` in Ruby on Rails for MySQL includes `ON DUPLICATE KEY UPDATE foo = foo`

For example, let's say there is a table called products:

class CreateProducts < ActiveRecord::Migration[7.0]
  def change
    create_table :products do |t|
      t.string :name
      t.text :description

      t.timestamps
    end
  end
end

If you do something like this with insert_all:

# Since Rails 7 automatically adds timestamps, created_at and updated_at are not necessary
# Here, I'm intentionally writing it in the pre-7.0 syntax to make it correspond with the generated query
now = Time.zone.now
products = [
  {name: 'foo', description: 'foo_description', created_at: now, updated_at: now},
  {name: 'bar', description: 'bar_description', created_at: now, updated_at: now},
  {name: 'baz', description: 'baz_description', created_at: now, updated_at: now},
]
Product.insert_all(products)

The SQL issued to MySQL will look like this:

-- The SQL is issued without line breaks, but I added them here for readability
INSERT INTO
 `products` (`name`, `description`,`created_at`,`updated_at`)
VALUES
 ('foo', 'foo_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
,('bar', 'bar_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
,('baz', 'baz_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
 ON DUPLICATE KEY UPDATE `name`=`name`;

The ON DUPLICATE KEY UPDATE statement is used to update a row instead of inserting it when a value duplicates an existing UNIQUE INDEX or PRIMARY KEY during insertion.

If you specify an ON DUPLICATE KEY UPDATE clause and a row to be inserted would cause a duplicate value in a UNIQUE index or PRIMARY KEY, an UPDATE of the old row occurs.

13.2.6.2 INSERT ... ON DUPLICATE KEY UPDATE Statement | MySQL 8.0 Reference Manual
- https://dev.mysql.com/doc/refman/8.0/en/insert-on-duplicate.html

ActiveRecord also has a method called upsert_all, and it is understandable that the ON DUPLICATE KEY UPDATE statement is specified when using it, but why is this statement issued even when using insert_all...?

`ON DUPLICATE KEY UPDATE` Statement in MySQL

As mentioned earlier, the ON DUPLICATE KEY UPDATE statement in MySQL is used to update a row instead of inserting it when a value duplicates an existing UNIQUE INDEX or PRIMARY KEY.

It is typically used like this:

-- For example, when there is a UNIQUE INDEX constraint on the name column
-- In the case of a single row insertion
INSERT INTO
 `products` (`name`, `description`,`created_at`,`updated_at`)
VALUES
 ('foo', 'foo_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
 ON DUPLICATE KEY UPDATE `description`='foo_description';

 -- When bulk inserting multiple rows, the VALUES(col_name) function can be used
 -- VALUES(col_name) function can also be used for single row insertion
INSERT INTO
 `products` (`name`, `description`,`created_at`,`updated_at`)
VALUES
 ('foo', 'foo_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
,('bar', 'bar_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
,('baz', 'baz_description', '2023-07-01 00:00:00.000000', '2023-07-01 00:00:00.000000')
 ON DUPLICATE KEY UPDATE `description`=VALUES(`description`);

In this way, when inserting rows using INSERT INTO, if a value duplicates an existing UNIQUE INDEX or PRIMARY KEY, instead of inserting the row, it updates the existing name column value, effectively not updating anything. Why is such a specification being used?

Reading the Implementation of ActiveRecord

I couldn't find the exact description in the reference documentation after a quick search, so let's read the implementation. Open-source software is great!

While reading through it, I found the relevant implementation.

https://github.com/rails/rails/blob/5ed37b35d666b833aeccb14a4cacd2926251232d/activerecord/lib/active_record/connection_adapters/abstract_mysql_adapter.rb#L621

Here's a snippet:
rails/activerecord/lib/active_record/connection_adapters/abstract_mysql_adapter.rb

def build_insert_sql(insert) # :nodoc:
  sql = +"INSERT #{insert.into} #{insert.values_list}"

  if insert.skip_duplicates?
    no_op_column = quote_column_name(insert.keys.first)
    sql << " ON DUPLICATE KEY UPDATE #{no_op_column}=#{no_op_column}"
  elsif insert.update_duplicates?
    sql << " ON DUPLICATE KEY UPDATE "
    if insert.raw_update_sql?
      sql << insert.raw_update_sql
    else
      sql << insert.touch_model_timestamps_unless { |column| "#{column}<=>VALUES(#{column})" }
      sql << insert.updatable_columns.map { |column| "#{column}=VALUES(#{column})" }.join(",")
    end
  end

  sql
end

It appends ON DUPLICATE KEY UPDATE #{no_op_column}=#{no_op_column} to the SQL if insert.skip_duplicates? is true.

Since no_op_column = quote_column_name(insert.keys.first), it sets the first column specified as the destination for INSERT INTO as #{no_op_column}=#{no_op_column}.

Why is this implementation like that? Let's look at the commit that was made when this feature was implemented.

https://github.com/rails/rails/commit/91ed21b304c468db8ce9fd830312c151432935d0

Adds a method to ActiveRecord allowing records to be inserted in bulk without instantiating ActiveRecord models. This method supports options for handling uniqueness violations by skipping duplicate records or overwriting them in an UPSERT operation.

ActiveRecord already supports bulk-update and bulk-destroy actions that execute SQL UPDATE and DELETE commands directly. It also supports bulk-read actions through pluck. It makes sense for it also to support bulk-creation.

It mentions that it supports options for handling uniqueness violations by skipping duplicate records or overwriting them in an UPSERT operation.

The corresponding pull request is this one:
https://github.com/rails/rails/pull/35077

In the discussions within this pull request:

I just pushed a commit that:

Extracts a command object
Splits the API into six methods
insert! / insert_all!
insert / insert_all (skip duplicates)
upsert / upsert_all

It is mentioned that insert_all skips duplicate records, while insert_all! raises an error if duplicates are encountered.

Let's follow the code a bit more. insert.skip_duplicates? is defined here:
rails/activerecord/lib/active_record/insert_all.rb

https://github.com/rails/rails/blob/5ed37b35d666b833aeccb14a4cacd2926251232d/activerecord/lib/active_record/insert_all.rb#L61

def skip_duplicates?
  on_duplicate == :skip
end

The implementations of the insert_all and insert_all! methods are here.
There are many comments on the behavior of the methods and examples of runtime behavior that are worth reading.

https://github.com/rails/rails/blob/5ed37b35d666b833aeccb14a4cacd2926251232d/activerecord/lib/active_record/persistence.rb#L175

activerecord/lib/active_record/persistence.rb

# (comments and code excerpted by the author of this article)
# Rows are considered to be unique by every unique index on the table. Any
# duplicate rows are skipped.
def insert_all(attributes, returning: nil, unique_by: nil, record_timestamps: nil)
  InsertAll.new(self, attributes, on_duplicate: :skip, returning: returning, unique_by: unique_by, record_timestamps: record_timestamps).execute
end

# Raises ActiveRecord::RecordNotUnique if any rows violate a
# unique index on the table. In that case, no rows are inserted.
def insert_all!(attributes, returning: nil, record_timestamps: nil)
  InsertAll.new(self, attributes, on_duplicate: :raise, returning: returning, record_timestamps: record_timestamps).execute
end

Findings from the Investigation

Regarding the main question of this article, "Why does insert_all generate ON DUPLICATE KEY UPDATE foo = foo?", the reason is that it is implemented in such a way to skip duplicate rows without raising an error when a duplicate error occurs.

Additional findings include the fact that using insert_all! allows you to raise an exception when a duplicate error occurs.

In Ruby on Rails, ActiveRecord follows a convention where method names ending with ! will raise an exception when there is an error, while methods without the ! suffix will not raise an exception. This convention is also followed in this implementation. For example, Foo.create! will raise an exception on error, while Foo.create will not raise an exception.

After conducting this investigation and revisiting the references, I found that this information is indeed documented:

insert_all | api.rubyonrails.org
- https://api.rubyonrails.org/classes/ActiveRecord/Persistence/ClassMethods.html#method-i-insert_all
insert_all! | api.rubyonrails.org
- https://api.rubyonrails.org/classes/ActiveRecord/Persistence/ClassMethods.html#method-i-insert_all-21

That's all for this investigation. See you later!

Django MongoDB Backend Quickstart! A Step-by-Step Tutorial

Get up and running with the new Django MongoDB Backend Python library! This tutorial covers creating a Django application, connecting it to MongoDB Atlas, performing CRUD operations, and configuring the Django admin for MongoDB.

Watch full video →