Tariq Abughofa

Originally published at rabbitoncode.com

General and Unconventional Pagination Techniques in Postgres

I talked in a previous article about how pagination can be done in SQL databases; head there to read more about the general techniques, their advantages, disadvantages, and use cases.


In this article, I show how to implement these approaches in the Postgres database engine, in addition to three unconventional pagination methods specific to Postgres.

limit / offset

limit / offset is pretty much standard SQL and works as-is in Postgres:

SELECT *
  FROM products
 WHERE sold = TRUE
 ORDER BY sale_date DESC
 LIMIT 10 OFFSET 30;
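
As a hedged sketch of how a page number maps onto this query (assuming 1-based page numbers and a page size of 10, neither of which is specified above), the offset is simply (page - 1) * page_size:

-- Page 4 with a page size of 10 maps to OFFSET 30, as in the query above
SELECT *
  FROM products
 WHERE sold = TRUE
 ORDER BY sale_date DESC
 LIMIT 10 OFFSET ((4 - 1) * 10);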

Cursors

Cursors are also pretty straightforward. You first declare the cursor with the query that it will execute; at the SQL level, DECLARE both defines and opens the cursor inside the current transaction. You then fetch a page of rows with each FETCH statement, and close the cursor at the end to release its resources.

BEGIN;

-- Declare the cursor inside a transaction; DECLARE also opens it
DECLARE cur CURSOR FOR
    SELECT *
    FROM products
    WHERE production_year = 2000;

-- Fetch the first page of 10 rows
FETCH 10 FROM cur;
-- Fetch the next page; an empty result means there are no more rows
FETCH 10 FROM cur;

-- Close the cursor to release its resources
CLOSE cur;
COMMIT;

Key-based Pagination

Pretty much SQL standard as well:

SELECT *
FROM products
WHERE id > 1000
ORDER BY id ASC
LIMIT 1000;

Now let's start with the unconventional pagination approaches:

Paginating over xmin

xmin is one of several system columns that Postgres adds implicitly to each table. This column in particular holds the identifier of the database transaction that inserted or last updated the corresponding row. Because of that, it serves as a good way to identify the changes that appeared on a table after a certain point and to sort the rows by the time they were last touched by a transaction.

To use the xmin column for pagination, we can apply the same key-based pagination approach. The following query paginates with a 1,000-row limit, with the data sorted in ascending order of the last transaction that touched each row.

SELECT *
FROM users
WHERE xmin::text::int > 5000
ORDER BY xmin::text::int ASC
LIMIT 1000;

The method has the same disadvantage as key-based pagination: you can't jump to an arbitrary page directly; instead, you scroll through the pages until you reach the one you need.

One thing to be aware of when using xmin is that, since it represents the transaction id, rows inserted or updated in the same transaction have the same xmin and thus no specific order among themselves.
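
If a stable order matters in that case, a hedged workaround (not part of the approach above) is to add a unique column such as id as a tie-breaker; the values 5000 and 1234 below are hypothetical xmin and id values from the last row of the previous page:

SELECT *
FROM users
-- (5000, 1234) = (xmin, id) of the last row seen on the previous page (hypothetical)
WHERE (xmin::text::int, id) > (5000, 1234)
ORDER BY xmin::text::int ASC, id ASC
LIMIT 1000;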

Paginating over ctid

ctid is another system column in Postgres. It holds the physical location of the row, so when data is sorted on this column it comes back in what is, logically speaking, a random order, since the ordering reflects storage location rather than any column value. It also means the rows are fetched very quickly, since there is no extra disk access to move to the next row. That's why indexes internally use ctids, rather than the primary key, to point to rows.

The ctid value consists of a tuple (p, n) where p represents the page number and n represents the row number within the page.
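
You can see these values directly by selecting the column; the users table below is just the example table used elsewhere in this article:

-- Show the (page, row) location of a few rows; the exact values depend on your data
SELECT ctid, id FROM users LIMIT 5;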

This is good for a scenario in which:

  • we need extremely fast deep page access.
  • filtering is not needed.
  • we don't care about the row orders or we want random row order.
  • we don't require all the pages to have the same number of rows (since deleted rows leave holes in pages).

To get page number p, we need to do some calculations. Each page contains "block size" bytes of data (8192 bytes, or 8 kB, by default). Rows are referenced by a 32-bit pointer, so there are at most block_size / 4 rows per page. The following query generates all the possible ctid values in page p (substitute the actual page number for p):

SELECT ('(' || p || ',' || s.n || ')')::tid
 FROM generate_series(0,current_setting('block_size')::int / 4) AS s(n);

To get rows in page p:

SELECT * FROM users WHERE ctid = ANY (ARRAY
  (SELECT ('(' || p || ',' || s.n || ')')::tid
     FROM generate_series(0,current_setting('block_size')::int / 4) AS s(n)
  )
);
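
As a concrete sketch, here is the same query with the page number 0 (the first heap page) substituted for p:

SELECT * FROM users WHERE ctid = ANY (ARRAY
  (SELECT ('(' || 0 || ',' || s.n || ')')::tid   -- 0 is the page number, substituted for p
     FROM generate_series(0,current_setting('block_size')::int / 4) AS s(n)
  )
);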

Pagination using row statistics

Postgres records statistics about its tables in the pg_statistic catalog and provides an interface to access that information through the pg_stats view. One of these statistics is called histogram_bounds. It holds per-column statistics describing the distribution of values, divided into buckets. Querying this field gives something like this:

SELECT histogram_bounds FROM pg_stats
WHERE tablename='users' AND attname='id';

                   histogram_bounds
------------------------------------------------------
 {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}

We notice that in this example the values of the id column go from 0 to 9995. The values are divided into buckets of around 1,000 values each: the first bucket goes from id 0 to 993, the second from 993 to 1997, and so on. As you can see, there is an opportunity here to use these buckets to paginate over id. If we assume the bucket size is b, the page size is n, and the page we want is p, a simple calculation shows that the bucket containing our page has a lower bound at array index n * p / b + 1 (integer division) and an upper bound at index n * p / b + 2. Once we have the histogram bounds for our bucket, the query is pretty easy to write:

WITH bucket AS (
    SELECT (histogram_bounds::text::int[])[n * p / b + 1] AS lower_bound,
           (histogram_bounds::text::int[])[n * p / b + 2] AS upper_bound
    FROM pg_stats
    WHERE tablename = 'users'
    AND attname = 'id'
    LIMIT 1
  )
SELECT *
FROM users
WHERE id >= (select lower_bound from bucket)
AND id < (select upper_bound from bucket)
ORDER BY id ASC
LIMIT n
OFFSET (n * p % b);
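
To make the arithmetic concrete, here is a sketch of the same query with hypothetical values n = 100, p = 3, and b = 1000: the lower bound index is 300 / 1000 + 1 = 1, the upper bound index is 2, and the offset within the bucket is 300 % 1000 = 300:

WITH bucket AS (
    -- bounds 1 and 2 delimit the histogram bucket that contains rows 300..399
    SELECT (histogram_bounds::text::int[])[1] AS lower_bound,
           (histogram_bounds::text::int[])[2] AS upper_bound
    FROM pg_stats
    WHERE tablename = 'users'
    AND attname = 'id'
    LIMIT 1
  )
SELECT *
FROM users
WHERE id >= (select lower_bound from bucket)
AND id < (select upper_bound from bucket)
ORDER BY id ASC
LIMIT 100
OFFSET 300;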

Notice that we still use OFFSET in the query, but it is only applied within the bucket rather than the whole table. This means that in the worst case, when the requested page is the last one in its bucket, we read about b rows instead of scanning the whole table.

The results can be approximate, since the table statistics might not be up to date with the table itself. Still, the method is blazing fast for deep pagination with random access, as long as approximate results are acceptable and no filtering is required.

Stay tuned for more posts on this subject with NoSQL Databases :).
