Franck Pachot for YugabyteDB

YugabyteDB Transactional Load with Non-transactional COPY

By default, COPY in YugabyteDB performs intermediate commits every 20,000 rows:

yugabyte=# show yb_default_copy_from_rows_per_transaction;

 yb_default_copy_from_rows_per_transaction
-------------------------------------------
 20000
(1 row)

yugabyte=# show yb_disable_transactional_writes;

 yb_disable_transactional_writes
---------------------------------
 on
(1 row)

Let's take an example with the following table:

yugabyte=# create table loaded_data ( id bigserial, data text );
CREATE TABLE

I set statement_timeout to 5 seconds to simulate a failure before the end, and load some rows:

yugabyte=# set statement_timeout=5000;
SET

yugabyte=# copy loaded_data(data) from program 'base64 -w 100 /dev/urandom | head -c $(( 1024 * 1024 * 1024 ))';
ERROR:  57014: canceling statement due to statement timeout

yugabyte=# select count(*) from loaded_data;
 count
--------
 120000
(1 row)


The statement has failed but, because of intermediate commits, some rows (a multiple of 20,000) have been loaded. They were visible during the load and have to be cleaned up manually:

yugabyte=# delete from loaded_data;
DELETE 120000

To be transactional, you can disable intermediate commits:

yugabyte=# set yb_default_copy_from_rows_per_transaction=0;
SET

yugabyte=# copy loaded_data(data) from program 'base64 -w 100 /dev/urandom | head -c $(( 1024 * 1024 * 1024 ))';
ERROR:  57014: canceling statement due to statement timeout

yugabyte=# select count(*) from loaded_data;
 count
-------
     0
(1 row)

However, this is less efficient: it allocates more memory and writes large batches of provisional records into IntentsDB.

Another solution is to keep intermediate commits but add a column that sets the visibility at the end, when the table is queried through a view. It will use a user-defined batch number:

yugabyte=# set app.batchid=1;
SET
yugabyte=# alter table loaded_data add batch bigint default current_setting('app.batchid')::bigint;
ALTER TABLE

A table will store the visible batches, and a view will filter on it:

yugabyte=# create table loaded_batches(batch bigint primary key default current_setting('app.batchid')::bigint );
CREATE TABLE

yugabyte=# create view loaded_view as select * from loaded_data where batch in (select batch from loaded_batches);
CREATE VIEW
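As a quick sanity check (a sketch, not part of the original load), a manually inserted row should pick up the current app.batchid through the column default, while staying invisible in the view until its batch is published:

```sql
-- set a known batch id for this session
select set_config('app.batchid', '42', false);

-- the batch column defaults to current_setting('app.batchid')::bigint
insert into loaded_data(data) values ('sanity check');

-- the row carries batch = 42, but does not appear in loaded_view
-- until 42 is inserted into loaded_batches
select data, batch from loaded_data where data = 'sanity check';
```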

I set the batch ID:

yugabyte=# select set_config('app.batchid',extract (epoch from now())::bigint::text,false);

 set_config
------------
 1710884655
(1 row)

When I COPY, the batch column will default to this value. I can use intermediate commits, or even non-transactional inserts, which bypass the provisional records and make each row visible as it is inserted:

yugabyte=# set yb_disable_transactional_writes=on;
SET

yugabyte=# copy loaded_data(data) from program 'base64 -w 100 /dev/urandom | head -c $(( 1024 * 1024 * 1024 ))';
ERROR:  57014: canceling statement due to statement timeout


The rows are visible from the table but not from the view:

yugabyte=# select count(*) from loaded_data;

 count
-------
 27648
(1 row)

yugabyte=# set yb_bnl_batch_size=1024;
SET

yugabyte=# select count(*) from loaded_view;

 count
-------
     0
(1 row)

With yb_bnl_batch_size set, the execution plan is efficient thanks to Batched Nested Loops.
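To confirm the join strategy on your cluster (plans vary by version and settings, so treat this as a sketch), display the plan for the view query:

```sql
set yb_bnl_batch_size = 1024;
explain (costs off) select count(*) from loaded_view;
-- the plan should include a Batched Nested Loop join node
```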

In case of a failure like the one above, the rows can be cleaned up later:

yugabyte=# delete from loaded_data where batch not in (select batch from loaded_batches);
DELETE 27648

When the load is successful, the batch ID is inserted at the end to make it visible:

yugabyte=# select set_config('app.batchid',extract (epoch from now())::bigint::text,false);
 set_config
------------
 1710885534
(1 row)

yugabyte=# set statement_timeout=0;
SET

yugabyte=# copy loaded_data(data) from program 'base64 -w 100 /dev/urandom | head -c $(( 1024 * 1024 * 1024 ))';
COPY 10631108

yugabyte=# insert into loaded_batches select;
INSERT 0 1

All new rows have been made atomically visible through the view:

yugabyte=# select count(*) from loaded_data;
  count
----------
 10631108
(1 row)

yugabyte=# select count(*) from loaded_view;
  count
----------
 10631108
(1 row)

This is a trade-off: a faster load with non-transactional inserts, at the cost of filtering on the batch column at query time.
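Putting it all together, a load with this pattern can be scripted as follows (a sketch assuming the loaded_data, loaded_batches, and loaded_view objects created above):

```sql
-- pick a new batch id for this session; rows loaded now default to it
select set_config('app.batchid', extract(epoch from now())::bigint::text, false);

-- fast path: non-transactional writes bypass the provisional records
set yb_disable_transactional_writes = on;
copy loaded_data(data) from program 'base64 -w 100 /dev/urandom | head -c $(( 1024 * 1024 * 1024 ))';

-- publish: this single-row insert atomically makes the batch visible in loaded_view
insert into loaded_batches select;

-- if the COPY failed instead, skip the insert above and clean up at leisure:
-- delete from loaded_data where batch not in (select batch from loaded_batches);
```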
