Inserting and deleting rows continuously has always been challenging for MVCC (Multi-Version Concurrency Control) databases because they have to keep deleted rows for a while. This pattern is frequently encountered in OLTP applications, especially microservices, where events are queued or messages are stored in an outbox. The outbox pattern inserts the payload into the database as part of the business transaction. Another process dequeues these items, pulling them in batches to propagate them to another system, which may be transactional but is decoupled from the first one.
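For illustration, the producer side of the pattern may look like the following sketch, where the orders table is hypothetical and the outbox table is the one created below:
begin transaction;
--- business change (hypothetical orders table, for illustration only)
insert into orders(customer_id, amount) values (42, 99.90);
--- queue the event in the same ACID transaction
insert into outbox(message) values ('order 42 created');
commit;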
YugabyteDB stores its tables in an LSM-Tree based on RocksDB. This provides fast ingest and limits the space and read amplification with compaction (YugabyteDB uses only level 0, with tiered compaction). Each row or index entry is a key-value pair that includes the commit timestamp, and deletes are recorded as a new key-value tombstone.
I have created the following table:
create table outbox (
primary key (ts asc, id asc)
, ts timestamptz default now()
, id bigserial
, message text
);
insert into outbox(message) select 'Hello World'
from generate_series(1,1000);
The ASC option in the primary key is YugabyteDB-specific syntax to choose range sharding (the default is hash), as I'll dequeue the rows First-In-First-Out. You can remove ASC to test in PostgreSQL, where indexes are ascending by default.
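For example, a PostgreSQL-compatible version of the same table is simply the following sketch, with the ASC options dropped:
create table outbox (
primary key (ts, id)
, ts timestamptz default now()
, id bigserial
, message text
);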
I will continuously insert new rows, delete them, and watch how the table grows. As this will run in a loop, it is better to limit the number of rows processed in each iteration, with LIMIT 1000 (or FETCH FIRST 1000 ROWS ONLY), to avoid long transactions. I add ORDER BY ts ASC to dequeue the oldest messages first. As the old rows have to stay for a while (MVCC retention for consistent reads and point-in-time recovery defaults to 15 minutes) and I don't want to scan all that history in each iteration, I use the timestamp as a starting point with where ts > timestamp :'lower_ts'.
I initialized it with the minimum timestamp in the table and also enabled Batched Nested Loop joins for the session, so that the plan doesn't depend on the table statistics (which can change quickly for a queuing table):
set yb_bnl_batch_size=1024;
set enable_mergejoin=off;
set enable_hashjoin=off;
select min(ts) as lower_ts from outbox \gset
With \gset, the lower_ts psql variable is set from the result of the query.
Each deletion will update this variable to the latest value that was dequeued, minus a margin of 60 seconds in case some ongoing transactions take time to commit. The margin should cover the maximum time between the insert, where the timestamp is set by the producer, and the commit, when the row becomes visible to the consumer.
Here is what I run in a loop to test this pattern:
--- insert 1000 rows
insert into outbox(message) select 'Hello World'
from generate_series(1,1000);
--- delete 1000 rows
with deleted as (
delete from outbox where id in (
select id from outbox
where ts> timestamp :'lower_ts'
order by ts asc limit 1000
) returning ts
)
select coalesce(max(ts),date '1970-01-01')
- interval '60 seconds' as lower_ts
from deleted
\gset
The last timestamp read sets the lower_ts variable for the next iteration of the loop. You can do the same from the application.
Depending on your application, you may optimize this with multiple buckets to distribute the queuing and dequeuing, and with FOR UPDATE SKIP LOCKED to run multiple consumers in parallel. In this blog post, I'm interested in the insert-and-delete pattern and the table size. Additional optimizations depend on the desired queuing and dequeuing throughput.
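For example, a multi-consumer variant of the dequeue statement might add the locking clause in the subquery, as in this sketch (assuming the same lower_ts variable, and that your YugabyteDB version supports SKIP LOCKED):
--- dequeue up to 1000 rows, skipping rows locked by other consumers
with deleted as (
delete from outbox where id in (
select id from outbox
where ts> timestamp :'lower_ts'
order by ts asc limit 1000
for update skip locked
) returning ts
)
select coalesce(max(ts),date '1970-01-01')
- interval '60 seconds' as lower_ts
from deleted
\gset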
I have run this for a few hours and read the statistics from the Grafana dashboard of my YugabyteDB Managed cluster. It is important to understand how the workload sustains over time.
YSQL Operations/Sec shows the INSERT and DELETE rate (the deletion is encapsulated in a SELECT command to return the last timestamp). Each operation is a batch of 1000 rows: between 500 and 800 rows are inserted and deleted per second:
The delete latency quickly settles between 1 and 2 seconds per batch, about a millisecond per row:
The inserts and deletes are batched into YB-TServer read and write operations, which show the same seesaw pattern every 15 minutes:
In DocDB, those rows read or written translate to seek operations in the RocksDB LSM-Tree and next operations to scan:
Because of the deletions, there are many next operations to go through the delete tombstones. They are fast operations, reading the adjacent key in the RocksDB blocks, and the throughput is constant.
The 15-minute fluctuations in these operations are due to the number of next operations increasing as more delete tombstones accumulate and decreasing when compaction removes them.
The RocksDB MemTable is flushed to disk regularly:
This increases the number of SST files, and the read amplification, as a key can be in any of those files in addition to the MemTable. This is why latency increases when the workload starts and after each 15-minute drop.
Those drops are due to compactions reducing the number of SST files, by merging them, to limit the read amplification:
The background compactions stop the increasing latency every 15 minutes:
While merging files, compaction also removes the delete tombstones that are past their MVCC retention, to reduce the space amplification:
The size seems to have increased slightly, but I took this screenshot from the node's storage metrics. The RocksDB metric would have been better, or simply pg_table_size(), but I dropped my table before adding more explanations about space amplification to this blog post.
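For reference, checking the logical size from YSQL with pg_table_size() looks like this:
select pg_size_pretty(pg_table_size('outbox')) as outbox_size;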
The table size is easy to understand because the flush and the compaction are triggered on defined thresholds. The MemTable is flushed when it reaches --memstore_size_mb=128, creating 128MB SST files. When they reach the read amplification threshold of --rocksdb_level0_file_num_compaction_trigger=5, they are merged into one. The size on disk stays between 128MB and 512MB.
You may be surprised by this volume for a table with 2000 rows (the 1000 I initially inserted, plus the 1000 being inserted and deleted in the loop).
Remember that it stores 15 minutes of changes for MVCC snapshots (--timestamp_history_retention_interval_sec=900), at a rate of about one row per millisecond, to provide consistent non-blocking reads and point-in-time recovery: that is roughly 900,000 recent row versions, plus their delete tombstones, at any point in time. You implemented the outbox pattern because you wanted it to be transactional, ACID, and resilient to failures.
This math, confirmed by the test and the metrics observations, shows that size and performance are predictable.
Here is more info about compaction in YugabyteDB:
- background compaction, by John Meehan https://www.yugabyte.com/blog/background-data-compaction/
- the technique used to limit how many tombstones to scan, by Prem Thangamani: https://www.yugabyte.com/blog/keep-tombstones-data-markers-slowing-scan-performance/
Compaction works on immutable SST Files and doesn't depend on other write activity, locks, or long-running transactions in YugabyteDB.
I added a condition to start the index scan after the last timestamp that was read, to avoid reading too many tombstones. However, given that the compaction works well, this optimization may not be necessary, and a simple delete statement could provide the required performance. You can verify it with the execution plan and its distributed storage statistics:
explain (analyze, dist, debug, costs off, summary off)
delete from outbox where id in (
select id from outbox
order by ts asc limit 1000
) returning ts
;
This workload, which follows the outbox pattern, can run forever, and its performance and space remain consistent, predictable, and manageable without manual operations such as vacuuming the table or shrinking/rebuilding the indexes. If you need to increase the throughput, you can add a bucket number in front of the key to distribute it to multiple tablets.
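A bucketed version of the table might look like the following sketch, where the bucket column and the number of buckets (8 here) are arbitrary choices; each consumer, or each iteration of a loop over the buckets, then adds where bucket = :bucket to the dequeue query:
create table outbox (
bucket smallint default (floor(random()*8))::smallint
, ts timestamptz default now()
, id bigserial
, message text
, primary key ((bucket) hash, ts asc, id asc)
);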