DEV Community

Scalable bounded COUNT DISTINCT in YugabyteDB

The following question was implicitly for Oracle Database:
invoices have a status (paid/unpaid) and client (client_id) and we want to count, quickly, how many clients have unpaid invoices, stopping the count at 99 when there are more:

To be efficient, we don't want to index all paid invoices, but only the unpaid ones. And we don't want to scan all unpaid invoices to count the distinct clients.

Oracle Database lacks a few features to do that easily and efficiently: no partial indexes, no index skip scan (there is a skip scan, but used only for a different case). A solution can be using a procedural approach to do the skip scan, either from PL/SQL code or SQL recursive WITH clause. And simulate partial index with a virtual column returning null for the unindexed subset.

PostgreSQL provides an easy declaration for partial index:

create index invoice_unpaid on invoice ( client_id asc )
 where status='unpaid';

Enter fullscreen mode Exit fullscreen mode

However, the PostgreSQL community recommends the WITH RECURSIVE for an efficient index scan: https://wiki.postgresql.org/wiki/Loose_indexscan

With YugabyteDB you can write the question as a simple SQL that describes the business rule:

select count(*) from (
  select distinct client_id 
  from invoice 
  where status='unpaid'
  limit 99
) as clients_unpaid
Enter fullscreen mode Exit fullscreen mode

This, with the index above, is sufficient to get a high-performance and scalable response to our query.

Testing on 100 million invoices

Here is how I create the table, and index, to test it:

create extension if not exists pgcrypto;
create table invoice (
 primary key (invoice_uid)
 , invoice_uid uuid default gen_random_uuid()
 , client_id bigint check ( client_id>0 )
 , status text check ( status in ('paid','unpaid')  ) default 'unpaid'
);

insert into invoice (client_id, status)
select client_id , case when n>10 then 'paid' else 'unpaid' end as status
 from generate_series(1,1000) as client_id
    , generate_series(1,10000) as n
;

create index invoice_unpaid on invoice ( client_id asc ) 
 where status='unpaid';

Enter fullscreen mode Exit fullscreen mode

Note that in real life I may have defined the primary key as (client_id, invoice_uid) as it could serve fast access pattern per client without the need of a secondary index. Here I want to show how it works with a secondary index. Anyway, there's no overhead when inserting rows. In a multi-AZ region, maintaining the index takes at most two milliseconds to add the index entry for it. The beauty of SQL, as opposed to NoSQL databases: you can add new access patterns without changing the existing schema and code.

I've also added more invoices by running the inserts 9 more times. I did that on a small YugabyteDB Managed 3 nodes cluster:

Image description

The 43 write ops/s are batches of rows sent from the SQL processing layer (YSQL) to the distributed transactional storage (DocDB). This was about 27000 inserts per second on this 3x4vCPU cluster with replication factor RF=3 (application continuity guaranteed even if one Availability Zone is down):

Image description

I have 100 million invoices from 1000 distinct clients and 100 unpaid invoice per client:

yugabyte=> \timing on
Timing is on.

yugabyte=> select count(*) from invoice;

   count
-----------
 100000000
(1 row)

Time: 37560.658 ms (00:37.561)

yugabyte=> select count(*) from invoice where status='unpaid';

 count
--------
 100000
(1 row)

Time: 469.696 ms

yugabyte=> select count(distinct client_id)
           from invoice where status='unpaid' ;

 count
-------
  1000
(1 row)

Time: 470.734 ms
Enter fullscreen mode Exit fullscreen mode

Counting all rows took less than one minute. Thanks to the partial index, counting the unpaid ones takes less than 500 milliseconds and the same when getting the distinct client_id from it.

That's already good and even better when limiting to 99 distinct values (with the PostgreSQL limit or fetch first 99 rows only):

yugabyte=> select count(*) from (
            select distinct client_id 
            from invoice 
            where status='unpaid' 
            limit 99
           ) as clients_unpaid;

 count
-------
    99
(1 row)

Time: 81.335 ms

yugabyte=> select count(*) from (
            select distinct client_id 
            from invoice
            where status='unpaid' 
            fetch first 99 rows only
           ) as clients_unpaid;

 count
-------
    99
(1 row)

Time: 74.347 ms
Enter fullscreen mode Exit fullscreen mode

Given that I'm connecting remotely (Switzerland to Ireland is about 30 millisecond), this is about 50 milliseconds execution time. I can check it with EXPLAIN ANALYZE.

The pushdown of DISTINCT to the hybrid scan doesn't happen yet as I can see from the execution plan:

yugabyte=> explain (costs off, analyze on, dist on)
            select count(*) from (
            select distinct client_id 
            from invoice
            where status='unpaid' 
            fetch first 99 rows only
           ) as clients_unpaid;

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=47.162..47.162 rows=1 loops=1)
   ->  Limit (actual time=1.541..47.147 rows=99 loops=1)
         ->  Unique (actual time=1.540..47.137 rows=99 loops=1)
               ->  Index Only Scan using invoice_unpaid on invoice (actual time=1.539..46.530 rows=9801 loops=1)
                     Heap Fetches: 0
                     Storage Index Read Requests: 11
                     Storage Index Execution Time: 43.999 ms
 Planning Time: 0.086 ms
 Execution Time: 47.201 ms
 Storage Read Requests: 11
 Storage Write Requests: 0
 Storage Execution Time: 43.999 ms
 Peak Memory Usage: 24 kB
(13 rows)

Time: 77.360 ms
Enter fullscreen mode Exit fullscreen mode

We should add a note about the pushdown, but the numbers tell it: rows=9801 have been returned from the Index Only Scan. If the DISTINCT had been done during the scan, I should see 99 rows from here because the LIMIT is also pushed down. In addition to that, 99 rows would have been returned in one Read Requests only but here I see Storage Index Read Requests: 11.

Currently (I'm testing this in YugabyteDB 2.17.2 and have opened #16771 for this), for the optimization to kick-in, I need to add an explicit start for the range, which is easy as I know the lower bound of client_id:

yugabyte=> select count(*) from (
            select distinct client_id 
            from invoice 
            where status='unpaid' 
            and client_id>0 -- guaranteed by check constraint
            limit 99
           ) as clients_unpaid;
 count
-------
    99
(1 row)

Time: 31.505 ms

yugabyte=> select count(*) from (
            select distinct client_id 
            from invoice
            where status='unpaid' 
            and client_id>0 -- guaranteed by check constraint
            fetch first 99 rows only
           ) as clients_unpaid;

 count
-------
    99
(1 row)

Time: 30.983 ms
Enter fullscreen mode Exit fullscreen mode

I can check from the execution plan:

yugabyte=> explain (costs off, analyze on, dist on)
            select count(*) from (
            select distinct client_id 
            from invoice
            where status='unpaid'
            and client_id>0  -- guaranteed by check constraint
            fetch first 99 rows only
           ) as clients_unpaid;

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2.025..2.025 rows=1 loops=1)
   ->  Limit (actual time=1.971..2.014 rows=99 loops=1)
         ->  Unique (actual time=1.969..2.005 rows=99 loops=1)
               ->  Index Only Scan using invoice_unpaid on invoice (actual time=1.967..1.987 rows=99 loops=1)
                     Index Cond: (client_id > 0)
                     Heap Fetches: 0
                     Storage Index Read Requests: 1
                     Storage Index Execution Time: 2.000 ms
 Planning Time: 0.096 ms
 Execution Time: 2.068 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 2.000 ms
 Peak Memory Usage: 24 kB
(14 rows)

Time: 31.029 ms
Enter fullscreen mode Exit fullscreen mode

Now, only rows=99 have been returned by the Index Only Scan, proving that both the DISTINCT and LIMIT have been pushed down into one index scan (Storage Index Read Requests: 1). I get a response time of 31 millisecond because I'm connected remotely. from Switzerland and my database is in Ireland (AWS eu-west-1) but the Execution Time is 2 milliseconds which is the response time from when the application runs in the same region.

The single-digit response time is great, but more important is the scalability of it. Here the execution time does not depend on the total number of rows in the table (100 million), nor the total number of unpaid invoices (100 thousand), nor the number of clients (1000), but only on what is asked by the query: the 99 clients with unpaid invoices. Having a response time that is proportional to the result is the best way to provide predictable performance to the user. You don't need a NoSQL database for that. You don't even have to change your data model. Just add the right index for your use case, with the right SQL database of course. YugabyteDB is distributed SQL, PostgreSQL-compatible, and open source.

Top comments (0)