Sequences are special in a SQL database, distributed or not. To be scalable, they are not transactional. As a consequence, the numbers they return may show gaps and unpredictable ordering.
Non-transactional sequences
In PostgreSQL, I create a table where the id comes from a sequence:
create sequence my_sequence;
create table demo (
primary key(id)
, id bigint default nextval('my_sequence')
, value text
);
I insert two rows, but with a failed attempt that was rolled back:
begin transaction;
insert into demo(value) values('one');
commit;
begin transaction;
insert into demo(value) values('two');
rollback;
begin transaction;
insert into demo(value) values('two');
commit;
The result has a gap:
postgres=# select * from demo order by id;
id | value
----+-------
1 | one
3 | two
(2 rows)
This has nothing to do with ordering or caching: the sequence increment is not transactional, so it was not rolled back. Generally, it doesn't matter, because a sequence is only there to generate unique numbers, and it still does.
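You can see this directly on a sequence, without going through a table. Here is a quick check on a throwaway sequence (scratch_seq is just a name for this illustration, not part of the demo schema):
create sequence scratch_seq;
begin transaction;
select nextval('scratch_seq');   -- returns 1
rollback;
select nextval('scratch_seq');   -- returns 2: the rolled-back increment is not reused
The rollback undoes nothing for the sequence: the increment survives, which is exactly what leaves the gap above.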
Cached sequences
There's another reason for seeing gaps even when no transactions were rolled back. To avoid incrementing a single row each time we get the next value, a range of values is cached:
alter sequence my_sequence cache 100;
PostgreSQL caches it per connection, and the range of values is lost if we reconnect:
insert into demo(value) values('three');
\connect
insert into demo(value) values('four');
insert into demo(value) values('five');
Here is the result:
postgres=# select * from demo order by id;
id | value
-----+-------
1 | one
3 | two
4 | three
104 | four
105 | five
(5 rows)
In YugabyteDB, the sequence can be cached on the tablet server, and you may not see this gap when re-connecting. However, YugabyteDB is elastic and resilient, allowing nodes to be added or removed, which may also introduce a gap. Non-cached sequences are not scalable, and that's why, by default, YugabyteDB sets a minimum cache (--ysql_sequence_cache_minval=100) that forces a minimum even if CREATE SEQUENCE sets it lower.
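If you want to see which cache size is declared for a sequence, one place to look is the pg_sequences view, available in PostgreSQL and YugabyteDB. Note that it shows the declared value; YugabyteDB may still enforce its server-side minimum at runtime:
select sequencename, cache_size, last_value
from pg_sequences
where sequencename = 'my_sequence';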
This should not be a problem. SQL sequences are provided to generate unique values in a scalable way, and doing so by incrementing a number is an implementation detail. You should not expect anything other than a unique number from a sequence. If you do, you don't need a dedicated database object: a simple transactional update my_counter set my_value=my_value+1 returning my_value on a row is enough.
Re-numbering after commit
If you generate a number from a sequence and need no-gap numbering, you can add it after committing, with another column:
alter table demo add column num bigint;
create unique index demo_num on demo(num asc);
alter table demo add unique using index demo_num;
I run a job that adds the no-gap numbering for the rows that don't have a number yet in this column:
update demo
set num=new.num
from (
select id
, coalesce( (select max(num) from demo) ,0)
+ row_number()over(order by id)
as num
from demo where num is null order by id
) new where demo.id=new.id
;
You can check the execution plan to ensure it uses an Index Scan to get the max(num); a quick check is sketched after the result below. The result is:
postgres=# select * from demo order by num;
id | value | num
-----+-------+-----
1 | one | 1
3 | two | 2
4 | three | 3
104 | four | 4
105 | five | 5
(5 rows)
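One quick way to check that plan is an EXPLAIN on the max(num) subquery. This is a sketch, and the exact output depends on the version and statistics, but you want to see an Index Only Scan (Backward) on demo_num rather than a Seq Scan:
explain (costs off)
select max(num) from demo;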
This can be sufficient if you want a no-gap number before a specific point in time, like some end-of-month accounting. Be aware of the cost of this in PostgreSQL, where updating a single column copies the whole row, with the consequences of WAL volume, bloat, and the need to vacuum. YugabyteDB doesn't have those problems.
Transaction ordering (external consistency)
The database determines the order in which transactions are committed at commit time. Two transactions may be committed in a different order than the nextval() executions inside them. To stay scalable, retrieving a value from a sequence doesn't take a transactional lock. Without explicit locking, it's impossible to determine which transaction will commit first. The linear order of transactions is unknown before commit, when the database marks them with an internal clock (such as the SCN in Oracle, the LSN in PostgreSQL, the HLC in YugabyteDB).
While this method allows database sequences to scale, it also means that the visible order of your transactions may differ from the order of the sequence numbers they retrieved. Ignoring this can cause problems in production, even if it isn't immediately apparent in unit tests. Any application that relies on the sequence order has a bug that will break your business logic on race conditions.
If sequence order is a requirement, a single counter must be used: a single-row table updated with an exclusive lock held until commit. A sequence is not needed for this. However, this affects scalability, which is why SQL databases provide a dedicated object, called a 'sequence', with non-transactional behavior for generating unique numbers.
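Here is a minimal sketch of that single-row counter, with table and column names chosen only for this illustration:
create table my_counter (id int primary key, my_value bigint not null default 0);
insert into my_counter values (1, 0);
begin transaction;
-- the row lock taken here is held until commit, serializing all sessions
update my_counter set my_value = my_value + 1 where id = 1
returning my_value;
-- do the work that must follow the counter order, then:
commit;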
Here is an example of an application that uses a sequence to check for new rows incrementally and relies on the last sequence number processed to get only the new rows since the last execution.
I delete previous rows and insert a single one. It has id=204:
postgres=# alter sequence my_sequence cache 1;
ALTER SEQUENCE
postgres=# delete from demo;
DELETE 5
postgres=# insert into demo(value) values('six');
INSERT 0 1
postgres=# \! psql -c "select * from demo"
id | value | num
-----+-------+-----
204 | six |
(1 row)
The job that checks for new rows remembers this number and will get new rows with:
postgres=# \! psql -c "select * from demo where id > 204"
id | value | num
----+-------+-----
(0 rows)
There is no new row yet.
A transaction starts and inserts a new row. It has id=205:
postgres=# begin transaction;
BEGIN
postgres=*# insert into demo(value) values('seven') returning *;
id | value | num
-----+-------+-----
205 | seven |
(1 row)
INSERT 0 1
In the meantime, another transaction will do the same. As there's no cache, the number follows with no gap, id=206:
postgres=*# \! psql -c "insert into demo(value) values('eight') returning *"
id | value | num
-----+-------+-----
206 | eight |
(1 row)
INSERT 0 1
So, the users have seen the insertion of id=205 and id=206, with an impression of ordering.
My job queries for new rows since the last execution and finds only the committed one, as the other is not visible yet:
postgres=*# -- read only new records (last check was 204)
postgres=*# \! psql -c "select * from demo where id > 204"
id | value | num
-----+-------+-----
206 | eight |
(1 row)
This job remembers id=206 as the last value processed, to start the next incremental round from there.
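A minimal sketch of how such an incremental job could persist its watermark (the job_watermark table and the 'demo_reader' name are hypothetical, just to illustrate the pattern):
create table job_watermark (job text primary key, last_id bigint);
-- remember the highest id seen in this round
insert into job_watermark values ('demo_reader', 206)
on conflict (job) do update set last_id = excluded.last_id;
-- the next round reads only rows above the stored watermark
select demo.*
from demo, job_watermark
where job_watermark.job = 'demo_reader'
and demo.id > job_watermark.last_id;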
I used begin transaction to reproduce the overlapping transactions, but the same happens with auto-commit, just within a shorter window. Now the transaction commits, and the job runs again:
postgres=*# commit;
COMMIT
postgres=# -- read only new records (last check was 206)
postgres=# \! psql -c "select * from demo where id > 206"
id | value | num
----+-------+-----
(0 rows)
The row with id=205 is lost by this incremental job and will never be processed, because the job thinks it has seen all rows up to id=206, but this is wrong:
postgres=# select * from demo;
id | value | num
-----+-------+-----
204 | six |
205 | seven |
206 | eight |
(3 rows)
The problem here is that id=205 was generated from the sequence before id=206, but because of the ordering of the concurrent transactions, it became visible after it. In the actual order of the SQL transactions, id=206 happened before id=205. You can't rely on sequences (or even timestamps) to know in which order the transactions happened unless you control the order with transactional DML and pessimistic locking.
No-gap ordered sequence from a transactional SQL table
If you want an ordered, no-gap sequence, you must implement it yourself from a transactional table. With write locks, the commit order is forced. This is easy, but you must accept that multiple sessions asking for the next value of one counter will queue to get their number, which reaches scalability limits.
A simple way is to implement those in a table with a function to get the next value. Here is an example where I want a no-gap sequence, starting at one each year, to generate invoice numbers:
create table next_invoice_number(
year int primary key, num int default 1
);
create function next_invoice_number(year int) returns int as $$
insert into next_invoice_number as tab (year)
values(next_invoice_number.year)
on conflict (year) do
update set num=tab.num+1
returning tab.num
;
$$ language sql;
The function increments the value and returns it. If there is no record for the year yet, a new one is created.
This, with a serializable isolation level, will generate no-gap numbers:
create table invoices (
primary key(year, num)
, year int, num int
);
insert into invoices
select 2023, next_invoice_number(2023)
from generate_series(1,10)
returning num
;
The result:
postgres=# select * from invoices order by num;
year | num
------+-----
2023 | 1
2023 | 2
2023 | 3
2023 | 4
2023 | 5
2023 | 6
2023 | 7
2023 | 8
2023 | 9
2023 | 10
(10 rows)
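To make the isolation level explicit when running the batch above, one way (an assumption about how you run it, not part of the original example) is to wrap it in a serializable transaction; concurrent sessions may then have to retry on serialization failures:
begin transaction isolation level serializable;
insert into invoices
select 2023, next_invoice_number(2023)
from generate_series(1,10);
commit;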
Note that for scalability, it might be better to try the update first and insert only when no row is found, especially when multiple sessions run in repeatable read, to avoid retries. The following PL/pgSQL function may be better than the previous SQL function with its on-conflict insert:
create table next_invoice_number(
year int primary key, num int default 1
);
create function next_invoice_number(year int) returns int as $$
declare
mynum bigint := null;
begin
update next_invoice_number as tab set num=num+1
where tab.year=next_invoice_number.year
returning num into mynum;
if mynum is null then
insert into next_invoice_number as tab (year)
values (next_invoice_number.year)
-- still catch conflict in case it was inserted after the update attempt
on conflict (year) do update set num=tab.num+1
returning num into mynum;
end if;
return mynum;
end;
$$ language plpgsql;
No-Gap and Scalability
With pgbench, I test the throughput when generating numbers from one session only:
\! pgbench -t 1000 -nf <(echo '
insert into invoices values (2023, next_invoice_number(2023) );
') -c 1
pgbench (16beta2, server 15.1 (Debian 15.1-1.pgdg110+1))
transaction type: /dev/fd/63
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 0.981 ms
initial connection time = 5.319 ms
tps = 1019.719332 (without initial connection time)
Now I can compare these 1019 transactions per second, at 1 millisecond latency, with a run with ten concurrent clients:
\! pgbench -t 1000 -nf <(echo '
insert into invoices values (2023, next_invoice_number(2023) );
') -c 10
pgbench (16beta2, server 15.1 (Debian 15.1-1.pgdg110+1))
transaction type: /dev/fd/63
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
number of failed transactions: 0 (0.000%)
latency average = 14.528 ms
initial connection time = 53.172 ms
tps = 688.320969 (without initial connection time)
There is a trade-off here. If you need no-gap numbering, the only solution is to generate the numbers using a single point of truth, locked exclusively until the end of the transaction.
Note that all those examples work the same in PostgreSQL and YugabyteDB. I've created the unique constraint on num by creating the unique index beforehand so that I can specify ASC for range sharding. The default would have been HASH in YugabyteDB, but there's a good chance that if you want a no-gap sequence, you will also run range queries on it.
Top comments (2)
Hello Mr. Franck Pachot,
Thank you for this useful write-up.
May I ask, is this lock per row or per table? I assume one may yet increase throughput by paginating the no-gap number and spreading the locks across multiple rows, if the former is the case.
Hi, yes, it is per row. You can spread the lock across multiple rows if short-term gaps are acceptable.