Sequences are special in a SQL database, distributed or not. To be scalable, they are not transactional and, consequently, the result may show some gaps and unpredictable ordering.
Non-transactional sequences
In PostgreSQL, I create a table where the id comes from a sequence:
create sequence my_sequence;
create table demo (
primary key(id)
, id bigint default nextval('my_sequence')
, value text
);
I insert two rows, but with a failed attempt that was rolled back:
begin transaction;
insert into demo(value) values('one');
commit;
begin transaction;
insert into demo(value) values('two');
rollback;
begin transaction;
insert into demo(value) values('two');
commit;
The result has a gap:
postgres=# select * from demo order by id;
id | value
----+-------
1 | one
3 | two
(2 rows)
This has nothing to do with ordering or caching: the sequence increment is not transactional and was not rolled back. Generally, it doesn't matter, because sequences are used to generate a unique number, and that works.
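You can see it by querying the sequence itself (a quick check; the exact columns depend on the PostgreSQL version):
select last_value, is_called from my_sequence;
-- expected here: last_value = 3, because the nextval() consumed by the
-- rolled-back insert (which would have been id=2) was not given back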
Cached sequences
There's another reason for seeing gaps even when no transactions were rolled back. To avoid incrementing a single row each time we get the next value, a range of values is cached:
alter sequence my_sequence cache 100;
PostgreSQL caches it per connection, and the range of values is lost if we reconnect:
insert into demo(value) values('three');
\connect
insert into demo(value) values('four');
insert into demo(value) values('five');
Here is the result:
postgres=# select * from demo order by id;
id | value
-----+-------
1 | one
3 | two
4 | three
104 | four
105 | five
(5 rows)
In YugabyteDB, the sequence can be cached on the tablet server, and you may not see this gap when re-connecting. However, YugabyteDB is elastic and resilient, allowing nodes to be added or removed, and this may also introduce a gap. Non-cached sequences are not scalable, and that's why YugabyteDB sets a minimum cache by default (--ysql_sequence_cache_minval=100) that forces a minimum even if the CREATE SEQUENCE sets it lower.
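For illustration, here is a hedged sketch of what you might observe in YugabyteDB (the sequence name is mine, and the exact values depend on the cache method and flags of your cluster):
-- ask for a small cache...
create sequence small_cache_seq cache 1;
-- ...each connection still reserves at least ysql_sequence_cache_minval
-- values (100 by default), so two connections typically see:
select nextval('small_cache_seq');   -- returns 1 in the first connection
-- select nextval('small_cache_seq');   -- returns 101 in another connection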
This should not be a problem. SQL sequences are provided to generate unique values in a scalable way, and doing so by incrementing a number is an implementation detail. You should not expect anything else than a unique number from a sequence. If you do, you don't need a dedicated database object: a simple transactional update my_counter set my_value=my_value+1 returning my_value on a row will do.
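For reference, a minimal sketch of such a transactional counter (the table definition is my assumption; only the update statement appears in the text above):
create table my_counter (my_value bigint not null);
insert into my_counter values (0);

begin transaction;
update my_counter set my_value = my_value + 1 returning my_value;
-- the row stays locked until commit: concurrent sessions queue here,
-- and a rollback undoes the increment, so there are no gaps
commit;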
Re-numbering after commit
If you generate numbers from a sequence and need no-gap numbering, you can do that after commit, with another column:
alter table demo add column num bigint;
create unique index demo_num on demo(num asc);
alter table demo add unique using index demo_num;
I run a job that adds the no-gap numbering for the rows that don't have a number yet in this column:
update demo
set num=new.num
from (
select id
, coalesce( (select max(num) from demo) ,0)
+ row_number()over(order by id)
as num
from demo where num is null order by id
) new where demo.id=new.id
;
You can check the execution plan to be sure that it uses an Index Scan to get the max(num) (see the EXPLAIN sketch after the result below). The result is:
postgres=# select * from demo order by num;
id | value | num
-----+-------+-----
1 | one | 1
3 | two | 2
4 | three | 3
104 | four | 4
105 | five | 5
(5 rows)
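A minimal way to check the plan for the max(num) part, assuming the demo_num index created above (the output varies by version and database):
explain (costs off)
select max(num) from demo;
-- look for an index scan on demo_num (in PostgreSQL, typically an
-- Index Only Scan Backward behind a Limit) rather than a Seq Scan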
This can be sufficient if you want a no-gap number before a specific point in time, like some end-of-month accounting. Be aware of the cost of this in PostgreSQL, where updating a single column copies the whole row, with WAL, bloat, and vacuum consequences. YugabyteDB doesn't have those problems.
Transaction ordering (external consistency)
The order in which transactions are committed is determined at commit time by the database. Two transactions may be committed in a different order than the nextval() execution inside them. To be scalable, the sequence value retrieval doesn't request a transaction lock. Without explicit locking, it's impossible to determine which transaction will commit first. The linear sequence of transactions is unknown before commit, where databases mark them with an internal clock (such as SCN in Oracle, LSN in PostgreSQL, HLC in YugabyteDB).
While this method allows database sequences to scale, it also means that the visible order of your transactions may differ from the order of the sequence numbers they retrieved. Ignoring this can cause problems in production, even if it isn't immediately apparent in unit tests. Any application that relies on the sequence order has a bug that will break your business logic on race conditions.
If sequence order is a requirement, a single counter must be used, and a single row table updated with an exclusive lock acquired until commit. A sequence is not needed for this. However, this will affect scalability, which is why SQL databases provide a special object with a non-transactional behavior for the specific use-case of generating unique numbers.
Here is an example where an application uses a sequence to check for new rows, incrementally, and relies on the last sequence number processed to get only the new rows since the last execution.
I delete previous rows and insert a single one. It has id=204:
postgres=# alter sequence my_sequence cache 1;
ALTER SEQUENCE
postgres=# delete from demo;
DELETE 5
postgres=# insert into demo(value) values('six');
INSERT 0 1
postgres=# \! psql -c "select * from demo"
id | value | num
-----+-------+-----
204 | six |
(1 row)
The job that checks for new rows remembers this number and will get new rows with:
postgres=# \! psql -c "select * from demo where id > 204"
id | value | num
----+-------+-----
(0 rows)
There is no new row yet.
A transaction starts and inserts a new row. It has id=205:
postgres=# begin transaction;
BEGIN
postgres=*# insert into demo(value) values('seven') returning *;
id | value | num
-----+-------+-----
205 | seven |
(1 row)
INSERT 0 1
In the meantime another transaction does the same.
As there's no cache, the number follows with no gap, id=206:
postgres=*# \! psql -c "insert into demo(value) values('eight') returning *"
id | value | num
-----+-------+-----
206 | eight |
(1 row)
INSERT 0 1
So, the users have seen the insertion of id=205 and id=206 with an impression of ordering.
My job queries for new rows since the last execution and finds only the committed one, as the other is not visible yet:
postgres=*# -- read only new records (last check was 204)
postgres=*# \! psql -c "select * from demo where id > 204"
id | value | num
-----+-------+-----
206 | eight |
(1 row)
This job remembers id=206 as the last value processed, to start the next incremental round from there.
I used begin transaction to reproduce the overlapping transactions, but the same happens with autocommit in a shorter time. Now the transaction commits and the job runs again:
postgres=*# commit;
COMMIT
postgres=# -- read only new records (last check was 206)
postgres=# \! psql -c "select * from demo where id > 206"
id | value | num
----+-------+-----
(0 rows)
The row with id=205 is lost by this incremental job and will never be processed, because the job thinks that it has seen all rows before id=206, but this is wrong:
postgres=# select * from demo;
id | value | num
-----+-------+-----
204 | six |
205 | seven |
206 | eight |
(3 rows)
The problem here is that id=205 was generated from the sequence before id=206 but, because of the ordering of concurrent transactions, it was visible after it. In the real order of the SQL transactions, id=206 happened before id=205. You can't rely on sequences (or even timestamps) to know in which order the transactions happened, unless you control the order with transactional DML and pessimistic locking.
No-gap ordered sequence from a transactional SQL table
If you want an ordered no-gap sequence, you need to implement it yourself from a transactional table, where write locks force the commit order. This is quite easy, but you must accept that multiple sessions asking for the next value of one sequence will have to queue to get their number, and this will hit some scalability limits.
A simple way is to implement it in a table, with a function to get the next value. Here is an example where I want a no-gap sequence starting at one each year, to generate invoice numbers:
create table next_invoice_number(
year int primary key, num int default 1
);
create function next_invoice_number(year int) returns int as $$
insert into next_invoice_number as tab (year)
values(next_invoice_number.year)
on conflict (year) do
update set num=tab.num+1
returning tab.num
;
$$ language sql;
The function will increment the value and return it. If there's no record yet for the year, a new one is created.
This, with serializable isolation level, will generate no-gap numbers:
create table invoices (
primary key(year, num)
, year int, num int
);
insert into invoices
select 2023, next_invoice_number(2023)
from generate_series(1,10)
returning num
;
The result:
postgres=# select * from invoices order by num;
year | num
------+-----
2023 | 1
2023 | 2
2023 | 3
2023 | 4
2023 | 5
2023 | 6
2023 | 7
2023 | 8
2023 | 9
2023 | 10
(10 rows)
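Before measuring the scalability impact, here is a sketch (my illustration, at the default isolation level) of why concurrent sessions queue on this function: the on conflict ... do update takes a row lock on the year's counter row and holds it until the end of the transaction.
-- session A: takes the row lock on the 2023 counter and holds it
begin transaction;
select next_invoice_number(2023);
-- session B, in another connection, now blocks on the same row lock:
-- select next_invoice_number(2023);
-- it only gets its number once session A commits or rolls back
commit;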
No-Gap and Scalability
With pgbench, I test the throughput when generating numbers from one session only:
\! pgbench -t 1000 -nf <(echo '
insert into invoices values (2023, next_invoice_number(2023) );
') -c 1
pgbench (16beta2, server 15.1 (Debian 15.1-1.pgdg110+1))
transaction type: /dev/fd/63
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 0.981 ms
initial connection time = 5.319 ms
tps = 1019.719332 (without initial connection time)
Now I can compare this 1019 transactions per second, at 1 millisecond latency, with a run with 10 concurrent clients:
\! pgbench -t 1000 -nf <(echo '
insert into invoices values (2023, next_invoice_number(2023) );
') -c 10
pgbench (16beta2, server 15.1 (Debian 15.1-1.pgdg110+1))
transaction type: /dev/fd/63
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
number of failed transactions: 0 (0.000%)
latency average = 14.528 ms
initial connection time = 53.172 ms
tps = 688.320969 (without initial connection time)
There is a trade-off here. If you need no-gap numbering, there's no other solution than having a single point of truth, locked exclusively until the end of the transaction, to generate the numbers.
Note that all those examples work the same in PostgreSQL and in YugabyteDB. I've created the unique constraint on num by creating the unique index beforehand so that I can specify ASC for range sharding. The default would have been HASH in YugabyteDB, but there's a good chance that if you want a no-gap sequence you will also do some range queries on it.
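For example, an illustrative range query that benefits from the ASC ordering on num:
select id, value, num from demo
where num between 2 and 4
order by num;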
Comments
Hello Mr. Franck Pachot,
Thank you for this useful write-up. May I ask, is this lock per row or per table? I assume one may yet increase throughput by paginating the no-gap number and spreading the locks across multiple rows, if the former is the case.
Hi, yes, it is per row. You can spread the lock across multiple rows if short-term gaps are acceptable.
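To illustrate that reply, here is a hypothetical sketch (table, function, and sharding scheme are mine, not from the article): spreading the counter over several rows per year keeps the numbers unique and divides the lock contention by the number of shards, but numbers can be handed out slightly out of order and with short-term gaps.
create table next_invoice_shard(
  year int, shard int, counter int default 0,
  primary key(year, shard)
);
create function next_invoice_number_sharded(year int, n_shards int) returns int as $$
  -- each session uses one shard row, picked from its backend pid
  insert into next_invoice_shard as tab (year, shard)
  values(next_invoice_number_sharded.year, pg_backend_pid() % n_shards)
  on conflict (year, shard) do
  update set counter=tab.counter+1
  -- shard s hands out s, s+n_shards, s+2*n_shards, ... (starting at s)
  returning tab.counter * n_shards + tab.shard
  ;
$$ language sql;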