<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mircea Cadariu</title>
    <description>The latest articles on DEV Community by Mircea Cadariu (@mcadariu).</description>
    <link>https://dev.to/mcadariu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1224431%2F76874bab-571a-437b-abfa-864a9e1e2349.png</url>
      <title>DEV Community: Mircea Cadariu</title>
      <link>https://dev.to/mcadariu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mcadariu"/>
    <language>en</language>
    <item>
      <title>Postgres tuning using feedback loops</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Sat, 15 Nov 2025 11:50:36 +0000</pubDate>
      <link>https://dev.to/mcadariu/postgres-tuning-using-feedback-loops-2hmp</link>
      <guid>https://dev.to/mcadariu/postgres-tuning-using-feedback-loops-2hmp</guid>
      <description>&lt;p&gt;I wanted to gather in one place a selection of resources that have helped me learn how to monitor and tune Postgres effectively. For optimal database performance, there's a high chance we'll have to get our hands dirty with this topic, because the default settings are on the conservative side (it has to start even on a Raspberry Pi), but with proper tuning it is impressive how far can Postgres take us. &lt;/p&gt;

&lt;p&gt;Given the number of individual settings you can configure, the task might seem daunting at first sight, but luckily there are great guides out there, and I've found the settings can be grouped into a few categories. It also helps a lot to learn a bit about how the internal Postgres components work - I suggest reading the (free) &lt;a href="https://postgrespro.com/community/books/internals" rel="noopener noreferrer"&gt;Postgres internals&lt;/a&gt; book for this purpose. &lt;/p&gt;

&lt;p&gt;Our workloads can change, so we have to continuously track internal health over time and adjust when needed, using &lt;strong&gt;feedback loops&lt;/strong&gt; that tell us whether we need to take action. For this purpose, a custom Grafana dashboard based on Postgres internal metrics has provided me with everything I need to do a good job. For best practices on setting up actionable dashboards, I recommend &lt;a href="https://www.amazon.co.uk/Information-Dashboard-Design-Effective-Communication/dp/0596100167" rel="noopener noreferrer"&gt;this book&lt;/a&gt;. There are many readily available dashboards as well; I've gathered some options below. Lastly, there are AI agents that check the metrics themselves and take action. &lt;/p&gt;
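
&lt;p&gt;As a minimal sketch of what such a feedback-loop metric can look like (assuming only the standard &lt;code&gt;pg_stat_database&lt;/code&gt; statistics view; the threshold mentioned in the comment is illustrative), here is a query you could chart over time in Grafana:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Shared buffer cache hit ratio across all databases.
-- A sustained drop below roughly 0.99 on an OLTP workload is a common
-- signal to revisit shared_buffers or the working set size.
SELECT sum(blks_hit)::float
       / NULLIF(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;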

&lt;h2&gt;
  
  
  Postgres wiki
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://wiki.postgresql.org/wiki/Monitoring" rel="noopener noreferrer"&gt;Monitoring&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Monitoring tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitlab.com/postgres-ai/postgres_ai" rel="noopener noreferrer"&gt;postgres_ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pgwat.ch/latest/" rel="noopener noreferrer"&gt;pgwatch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/percona/pg_stat_monitor" rel="noopener noreferrer"&gt;pg_stat_monitor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/percona/pg_stat_monitor" rel="noopener noreferrer"&gt;pg_statviz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pganalyze.com/" rel="noopener noreferrer"&gt;pganalyze&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dalibo/temboard" rel="noopener noreferrer"&gt;temboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/NikolayS/pg_ash" rel="noopener noreferrer"&gt;pg_ash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Postgres FM podcast episode
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://postgres.fm/episodes/monitoring-checklist" rel="noopener noreferrer"&gt;Monitoring checklist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Agents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://xata.io/database-agent" rel="noopener noreferrer"&gt;Xata Agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Blog post selection
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.craigkerstiens.com/2012/10/01/understanding-postgres-performance/" rel="noopener noreferrer"&gt;cache hit ratio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;checkpoints

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.enterprisedb.com/blog/basics-tuning-checkpoints" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.percona.com/blog/importance-of-tuning-checkpoint-in-postgresql/" rel="noopener noreferrer"&gt;2&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;a href="https://legacy.tembo.io/blog/optimizing-memory-usage" rel="noopener noreferrer"&gt;work_mem&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/database/a-case-study-of-tuning-autovacuum-in-amazon-rds-for-postgresql/" rel="noopener noreferrer"&gt;autovacuum&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://dataegret.de/2017/03/deep-dive-into-postgres-stats-pg_stat_bgwriter/" rel="noopener noreferrer"&gt;bgwriter&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://vondra.me/posts/tuning-aio-in-postgresql-18/" rel="noopener noreferrer"&gt;AIO&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://www.craigkerstiens.com/2017/09/18/postgres-connection-management/" rel="noopener noreferrer"&gt;connections&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://www.enterprisedb.com/blog/general-configuration-and-tuning-recommendations-edb-postgres-advanced-server-and-postgresql" rel="noopener noreferrer"&gt;initial settings recommendations&lt;br&gt;
&lt;/a&gt;&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>programming</category>
      <category>resources</category>
    </item>
    <item>
      <title>Postgres Column Tetris</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Tue, 14 Oct 2025 19:42:22 +0000</pubDate>
      <link>https://dev.to/mcadariu/postgres-column-tetris-neatly-packing-your-tables-for-fun-and-profit-1j6g</link>
      <guid>https://dev.to/mcadariu/postgres-column-tetris-neatly-packing-your-tables-for-fun-and-profit-1j6g</guid>
      <description>&lt;p&gt;If you've been working with Postgres for a while, you have probably already learned how to write queries and tune them for performance. Today, I'd like to show you a lesser known optimization technique called &lt;code&gt;Column Tetris&lt;/code&gt;: the practice of ordering your columns to minimize storage overhead due to CPU alignment requirements. &lt;/p&gt;

&lt;p&gt;As far as I can tell, the name of this technique was coined by Erwin Brandstetter &lt;a href="https://stackoverflow.com/a/7431468" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Important&lt;/em&gt;: The best time to apply this is when you initially create a table, since there's no data to migrate. But as they say, better late than never! &lt;/p&gt;

&lt;h2&gt;
  
  
  Why bother?
&lt;/h2&gt;

&lt;p&gt;Column Tetris delivers tangible benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More rows per page&lt;/strong&gt;: Postgres' 8kb pages can fit more rows, reducing I/O operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better cache utilization&lt;/strong&gt;: denser pages mean more data fits in RAM, reducing slow disk reads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster sequential scans&lt;/strong&gt;: Less data to read means faster table scans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower backup/restore times&lt;/strong&gt;: Smaller tables are faster to back up and restore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convinced? Let's dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does column order matter?
&lt;/h2&gt;

&lt;p&gt;In Postgres, the order in which you define columns in your CREATE TABLE statement affects how much disk space your table consumes. This is because CPUs prefer to read data at memory addresses that are multiples of the data type's size. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 4-byte integer wants to start at an address divisible by 4&lt;/li&gt;
&lt;li&gt;An 8-byte timestamp wants to start at an address divisible by 8&lt;/li&gt;
&lt;li&gt;etc&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is called "alignment". Not &lt;a href="https://en.wikipedia.org/wiki/AI_alignment" rel="noopener noreferrer"&gt;&lt;em&gt;that&lt;/em&gt;&lt;/a&gt; alignment, but &lt;a href="https://en.wikipedia.org/wiki/Data_structure_alignment#:~:text=A%20memory%20access%20is%20said,memory%20accesses%20are%20always%20aligned." rel="noopener noreferrer"&gt;this one&lt;/a&gt;. &lt;br&gt;
When data isn't naturally aligned, Postgres inserts padding bytes to maintain proper alignment. These padding bytes waste space and bloat your storage. &lt;/p&gt;
&lt;h2&gt;
  
  
  A concrete example
&lt;/h2&gt;

&lt;p&gt;Let's track user logins with a simple schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;logins&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;    &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;-- 4 bytes&lt;/span&gt;
    &lt;span class="n"&gt;is_success&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;-- 1 byte (+ 3 bytes padding)&lt;/span&gt;
    &lt;span class="n"&gt;login_time&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;-- 8 bytes&lt;/span&gt;
    &lt;span class="n"&gt;is_mobile&lt;/span&gt;  &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;         &lt;span class="c1"&gt;-- 1 byte&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem? The login_time TIMESTAMP needs to start at an address divisible by 8. Since is_success (1 byte) comes right before it, PostgreSQL must insert 3 bytes of padding to align the TIMESTAMP properly. &lt;br&gt;
Let's verify this. We know the row header is &lt;a href="https://www.postgresql.org/docs/current/storage-page-layout.html" rel="noopener noreferrer"&gt;24 bytes&lt;/a&gt;. Add our data (4 + 1 + 3 padding + 8 + 1 = 17 bytes), and we should get 41 bytes total.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;logins&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_column_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logins&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;row_size_bytes&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;logins&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; row_size_bytes 
----------------
             41
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirmed! Now let's optimize by reordering the columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;logins_optimized&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;login_time&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;-- 8 bytes&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;    &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;-- 4 bytes&lt;/span&gt;
    &lt;span class="n"&gt;is_success&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;-- 1 byte&lt;/span&gt;
    &lt;span class="n"&gt;is_mobile&lt;/span&gt;  &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;         &lt;span class="c1"&gt;-- 1 byte&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By placing the TIMESTAMP first, it's naturally aligned at the start. The INTEGER fits perfectly after it, and both BOOLEANs last.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;logins_optimized&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_column_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logins_optimized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;row_size_bytes&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;logins_optimized&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; row_size_bytes 
----------------
             38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Savings: 3 bytes per row just from reordering!&lt;/p&gt;

&lt;h2&gt;
  
  
  Padding between rows
&lt;/h2&gt;

&lt;p&gt;There's more to the story. PostgreSQL also aligns entire tuples to 8-byte boundaries (MAXALIGN on 64-bit systems). This means padding is inserted not just within rows, but also between them.&lt;/p&gt;

&lt;p&gt;Let's see this effect at scale by inserting 1 million rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;TRUNCATE&lt;/span&gt; &lt;span class="n"&gt;logins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logins_optimized&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert 1 million rows&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;logins&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_mobile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'365 days'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;logins_optimized&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;login_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_mobile&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;logins&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's analyze the storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logins'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logins'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_pages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logins'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rows_per_page_original&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logins_optimized'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;optimized_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logins_optimized'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;optimized_pages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logins_optimized'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rows_per_page_optimized&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;logins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logins_optimized&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; original_bytes | original_pages | rows_per_page_original  | optimized_bytes | optimized_pages | rows_per_page_optimized 
-------------+-------------+----------------------+-----------------+-----------------+-------------------------
    52183040 |        6370 | 156.9858712715855573 |        44285952 |            5406 |    184.9796522382537921
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking this down:&lt;/p&gt;

&lt;p&gt;Original:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000,000 rows in 6,370 pages&lt;/li&gt;
&lt;li&gt;~157 rows per page&lt;/li&gt;
&lt;li&gt;52,183,040 bytes / 1,000,000 rows = 52.18 bytes per row (including all overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Optimized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000,000 rows in 5,406 pages&lt;/li&gt;
&lt;li&gt;~185 rows per page&lt;/li&gt;
&lt;li&gt;44,285,952 bytes / 1,000,000 rows = 44.29 bytes per row (including all overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Difference: 52.18 - 44.29 = &lt;code&gt;7.89 bytes per row&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Notice that the difference is larger than the 3 bytes we saved from column alignment alone. The extra savings come from inter-row padding: each tuple is padded up to an 8-byte boundary, and poorly ordered columns require more of this padding.&lt;/p&gt;
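
&lt;p&gt;We can sanity-check that figure with per-tuple arithmetic. A heap tuple is rounded up to the next multiple of 8, and each tuple also needs a 4-byte line pointer in the page's item array:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;original:  24 (header) + 17 (data) = 41, MAXALIGN to 48, + 4 (line pointer) = 52
optimized: 24 (header) + 14 (data) = 38, MAXALIGN to 40, + 4 (line pointer) = 44
difference per row: 52 - 44 = 8 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The measured 52.18 and 44.29 bytes per row sit slightly above 52 and 44 because each page's fixed header (and any leftover free space) is also amortized over its rows.&lt;/p&gt;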

&lt;h2&gt;
  
  
  Total savings
&lt;/h2&gt;

&lt;p&gt;Let's see the overall impact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check the table sizes&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logins'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logins_optimized'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;optimized_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logins) - 
        pg_relation_size('&lt;/span&gt;&lt;span class="n"&gt;logins_optimized&lt;/span&gt;&lt;span class="s1"&gt;')
    ) as savings;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; original_size | optimized_size | savings 
------------+----------------+---------
 50 MB      | 42 MB          | 7712 kB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a 14% reduction in storage from simply reordering four columns. Not bad, I'll take it! &lt;/p&gt;

&lt;h2&gt;
  
  
  Guideline
&lt;/h2&gt;

&lt;p&gt;To minimize the padding I showed you in the sections above, order your columns by alignment requirements, largest to smallest:&lt;/p&gt;

&lt;p&gt;8-byte types: &lt;code&gt;BIGINT&lt;/code&gt;, &lt;code&gt;DOUBLE PRECISION&lt;/code&gt;, &lt;code&gt;TIMESTAMP&lt;/code&gt;, &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;br&gt;
4-byte types: &lt;code&gt;INTEGER&lt;/code&gt;, &lt;code&gt;REAL&lt;/code&gt;, &lt;code&gt;DATE&lt;/code&gt;&lt;br&gt;
2-byte types: &lt;code&gt;SMALLINT&lt;/code&gt;&lt;br&gt;
1-byte types: &lt;code&gt;BOOLEAN&lt;/code&gt;&lt;br&gt;
Variable-width types last: &lt;code&gt;TEXT&lt;/code&gt;, &lt;code&gt;VARCHAR&lt;/code&gt;, &lt;code&gt;BYTEA&lt;/code&gt;&lt;/p&gt;
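
&lt;p&gt;To check how an existing table lines up with this guideline, you can ask the system catalogs directly. The query below is a small sketch using the standard &lt;code&gt;pg_attribute&lt;/code&gt; and &lt;code&gt;pg_type&lt;/code&gt; catalogs (&lt;code&gt;logins&lt;/code&gt; is the example table from earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- typlen:   column size in bytes (-1 means variable width)
-- typalign: 'd' = 8-byte, 'i' = 4-byte, 's' = 2-byte, 'c' = 1-byte alignment
SELECT a.attname, t.typname, t.typlen, t.typalign
FROM pg_attribute a
JOIN pg_type t ON t.oid = a.atttypid
WHERE a.attrelid = 'logins'::regclass
  AND a.attnum &amp;gt; 0
ORDER BY a.attnum;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Ordering fixed-width columns so that the &lt;code&gt;'d'&lt;/code&gt; types come first, then &lt;code&gt;'i'&lt;/code&gt;, &lt;code&gt;'s'&lt;/code&gt;, and &lt;code&gt;'c'&lt;/code&gt;, reproduces the largest-to-smallest guideline above.&lt;/p&gt;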

&lt;h2&gt;
  
  
  When does this matter the most?
&lt;/h2&gt;

&lt;p&gt;Most impactful when you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tables with millions or billions of rows&lt;/li&gt;
&lt;li&gt;Tables with many small fixed-width columns&lt;/li&gt;
&lt;li&gt;Tables that are frequently scanned in full&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Low-impact scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small lookup tables (&amp;lt; 1000 rows)&lt;/li&gt;
&lt;li&gt;Tables with only a few columns&lt;/li&gt;
&lt;li&gt;Tables dominated by large TEXT/VARCHAR fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a table with 100 million rows, saving 8 bytes per row translates to ~800 MB less storage, faster scans, better cache utilization, and lower I/O costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Column Tetris&lt;/code&gt; is a simple technique that costs nothing at design time but can yield significant storage savings. Think of it like organizing your closet: arranging items thoughtfully takes the same effort as tossing them in randomly, but the results are much better.&lt;br&gt;
So next time you write a CREATE TABLE statement, take a moment to play Column Tetris. Your database will thank you.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Until next time! &lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Postgres Range Types</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Tue, 07 Oct 2025 13:10:01 +0000</pubDate>
      <link>https://dev.to/mcadariu/the-unreasonable-effectiveness-of-postgres-range-types-1ine</link>
      <guid>https://dev.to/mcadariu/the-unreasonable-effectiveness-of-postgres-range-types-1ine</guid>
      <description>&lt;p&gt;When developing applications that track measurements over time, you'll often encounter scenarios where values remain constant across multiple readings. Consider a temperature monitoring system that takes daily measurements: if the temperature stays at 20.5°C for an entire month, storing 30 identical rows is wasteful and degrades query performance.  &lt;/p&gt;

&lt;p&gt;Postgres' &lt;a href="https://www.postgresql.org/docs/current/rangetypes.html" rel="noopener noreferrer"&gt;range types&lt;/a&gt; offer an elegant solution to this problem, potentially reducing storage requirements by orders of magnitude while maintaining data integrity. In this post, I'll demonstrate how range types work and show you the dramatic space savings they can deliver.&lt;/p&gt;

&lt;h2&gt;
  
  
  A straightforward approach: one row per day
&lt;/h2&gt;

&lt;p&gt;This is how you would store one reading per day.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;temperature_readings_daily&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reading_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reading_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Example data: same temperature for 30 days&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;temperature_readings_daily&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-01-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2025-01-30'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach creates 30 rows to represent a single temperature value that remained constant throughout January. &lt;/p&gt;

&lt;h2&gt;
  
  
  Date ranges
&lt;/h2&gt;

&lt;p&gt;To use date ranges, we will have to rename our column and define it as having type &lt;code&gt;daterange&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;temperature_readings_range&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;valid_period&lt;/span&gt; &lt;span class="n"&gt;DATERANGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Same data: one row covers 30 days&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;temperature_readings_range&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[2025-01-01,2025-01-31)'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;daterange&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The daterange type uses interval notation: the square bracket marks an inclusive bound and the parenthesis an exclusive one, so &lt;code&gt;[2025-01-01,2025-01-31)&lt;/code&gt; covers January 1st through January 30th (January 31st is excluded).&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring the impact
&lt;/h2&gt;

&lt;p&gt;Let's have a look at what savings we can expect if we start working with range types.&lt;/p&gt;

&lt;p&gt;To quantify the space savings, let's run an experiment with realistic data. We'll simulate 100 sensors tracking temperatures over seven months (June through December), with values changing twice per month (on the 1st and 15th). You can find the SQL to generate this data at the end of this post [1].&lt;/p&gt;

&lt;p&gt;Alright, how much did we gain? Here's the query we'll use to compare the table sizes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_total_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'temperature_readings_daily'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;daily_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_total_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'temperature_readings_range'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;range_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_total_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'temperature_readings_range'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; 
               &lt;span class="n"&gt;pg_total_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'temperature_readings_daily'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;space_savings_percent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  daily_size | range_size | space_savings_percent 
------------+------------+-----------------------
 1448 kB    | 128 kB     |                 91.16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see a difference of one order of magnitude. Not bad! &lt;/p&gt;

&lt;p&gt;The space savings depend entirely on your data distribution though. If values change every day, you'll see minimal benefit. But if values remain constant for weeks or months at a time, the gains can be dramatic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying
&lt;/h2&gt;

&lt;p&gt;Here's how you'd write queries to retrieve results of interest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find the temperature on a specific date&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;temperature_readings_range&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;valid_period&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2025-06-15'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;span class="c1"&gt;-- Get temperature changes in a date range&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_period&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;temperature_readings_range&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;valid_period&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s1"&gt;'[2025-11-01,2025-12-31)'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;daterange&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Alternative: start and end columns
&lt;/h2&gt;

&lt;p&gt;You might wonder: why use range types at all? Why not just add start_date and end_date columns?&lt;/p&gt;

&lt;p&gt;This is a valid approach and achieves similar storage savings. So what are the tradeoffs?&lt;/p&gt;

&lt;h3&gt;
  
  
  Start/end columns:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;More familiar and intuitive for most developers&lt;/li&gt;
&lt;li&gt;Works with any database system, not just PostgreSQL&lt;/li&gt;
&lt;li&gt;Easier to understand in query results&lt;/li&gt;
&lt;li&gt;No need to learn range-specific operators&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Range types:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data integrity: range types enforce that the period is valid (start before end) at the type level&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Specialized operators: @&amp;gt; (contains), &amp;amp;&amp;amp; (overlaps), &amp;lt;@ (contained by) make queries more expressive&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GiST indexing: PostgreSQL can build efficient indexes specifically designed for range queries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NULL handling: With start/end columns, you need to handle the case where end_date is NULL (for ongoing periods). Range types handle open-ended ranges naturally with the [2025-01-01,) notation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cleaner semantics: A single valid_period column is conceptually clearer than two related columns&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Beyond date ranges: other Postgres range types
&lt;/h2&gt;

&lt;p&gt;While we've focused on daterange for this example, Postgres provides several built-in range types for different use cases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;int4range&lt;/code&gt; and &lt;code&gt;int8range&lt;/code&gt; – Integer ranges, useful for ID ranges, version numbers, or inventory levels&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;numrange&lt;/code&gt; – Numeric ranges for decimal values like prices or measurements&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tsrange&lt;/code&gt; and &lt;code&gt;tstzrange&lt;/code&gt; – Timestamp ranges (with and without timezone) for precise event tracking&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;daterange&lt;/code&gt; – Date ranges as we've used in this post&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Postgres's range types are an example of why it's such a feature-rich database. By representing continuous periods with a single row instead of many, you can achieve substantial storage savings and cleaner data models. For applications dealing with time-series data that changes infrequently, range types are well worth considering in your design.&lt;/p&gt;

&lt;p&gt;[1]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Generate readings that change twice per month (1st and 15th)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;temp_changes&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;change_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; 
    &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-06-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2025-12-31'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; 
    &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Then, fill in all days with the temperature from the most recent change&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;temperature_readings_daily&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; 
    &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-06-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2025-12-31'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
    &lt;span class="k"&gt;CROSS&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;temp_changes&lt;/span&gt; &lt;span class="n"&gt;tc2&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tc2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensor_id&lt;/span&gt;
          &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;tc2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;change_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;
        &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tc2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;change_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Populate the range table too&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;temperature_readings_range&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;daterange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;change_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;change_date&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="s1"&gt;'[)'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;valid_period&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;temp_changes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>postgres</category>
      <category>database</category>
      <category>data</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Links instead of repetition</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Thu, 11 Sep 2025 18:46:47 +0000</pubDate>
      <link>https://dev.to/mcadariu/links-instead-of-repetition-4pc7</link>
      <guid>https://dev.to/mcadariu/links-instead-of-repetition-4pc7</guid>
      <description>&lt;p&gt;Today I'm sharing with you a principle that I keep encountering in several systems I'm researching, appearing in different forms but always serving the same underlying goal: &lt;strong&gt;removing redundancy through indirection&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I've structured this post into four separate patterns; as I walk you through them, you'll see how they all share the same substrate of favoring links instead of repetition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dictionary Encoding
&lt;/h2&gt;

&lt;p&gt;Dictionary encoding is perhaps the clearest expression of this principle. Instead of storing repeated values directly, we create a dictionary (lookup table) and store only references to entries in that dictionary.&lt;/p&gt;

&lt;p&gt;Let's consider this array:&lt;br&gt;
Fruits: ["apple", "banana", "apple", "cherry", "banana", "apple"]&lt;/p&gt;

&lt;p&gt;We can transform it to:&lt;br&gt;
Dictionary: {0: "apple", 1: "banana", 2: "cherry"}&lt;br&gt;
Fruits: [0, 1, 0, 2, 1, 0]&lt;/p&gt;

&lt;p&gt;The same information is stored, but this representation immediately reduces storage space, and we've also established a single source of truth for each unique value. Hmm, where have you heard this before?&lt;/p&gt;
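&lt;p&gt;As a sketch of the idea (plain Python, not tied to any particular library; the function names are hypothetical), an encode/decode pair could look like this:&lt;/p&gt;

```python
def dict_encode(values):
    """Replace repeated values with indices into a dictionary of unique values."""
    dictionary = []   # unique values, in order of first appearance
    index_of = {}     # value -> its index in the dictionary
    codes = []        # the compressed sequence of links
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    """Reconstruct the original sequence by following the links."""
    return [dictionary[c] for c in codes]

fruits = ["apple", "banana", "apple", "cherry", "banana", "apple"]
dictionary, codes = dict_encode(fruits)
print(dictionary)  # ['apple', 'banana', 'cherry']
print(codes)       # [0, 1, 0, 2, 1, 0]
assert dict_decode(dictionary, codes) == fruits
```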

&lt;h2&gt;
  
  
  Database Normalization
&lt;/h2&gt;

&lt;p&gt;Database normalization takes this same principle but applies it to relational data structures. &lt;/p&gt;

&lt;p&gt;For example, instead of repeating customer information across every order record, we create separate tables and link them through foreign keys.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Denormalized:&lt;/em&gt;&lt;br&gt;
Orders: [order_id, customer_name, customer_email, product_name, quantity]&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Normalized:&lt;/em&gt; &lt;br&gt;
Customers: [customer_id, name, email]&lt;br&gt;
Products: [product_id, name, price]&lt;br&gt;
Orders: [order_id, customer_id, product_id, quantity]&lt;/p&gt;

&lt;p&gt;This isn't just about storage efficiency, but also about data integrity. When customer information changes, there's only one place to update it. We've eliminated the possibility of inconsistent data.&lt;/p&gt;

&lt;h2&gt;
  
  
  String Interning
&lt;/h2&gt;

&lt;p&gt;String interning is a programming language runtime feature that ensures identical strings share the same memory location. Instead of creating multiple string objects with the same content, the runtime maintains a pool of unique strings and returns references to existing instances. &lt;/p&gt;

&lt;p&gt;As a case study, see &lt;a href="https://docs.oracle.com/javase/specs/jls/se24/html/jls-3.html#jls-3.10.5" rel="noopener noreferrer"&gt;this&lt;/a&gt; entry from the Java Language Specification describing how it works in Java. We find this concept in &lt;a href="https://docs.python.org/3.2/library/sys.html?highlight=sys.intern#sys.intern" rel="noopener noreferrer"&gt;Python&lt;/a&gt; as well. &lt;/p&gt;

&lt;p&gt;Outside of the programming languages domain, &lt;a href="https://victoriametrics.com/blog/tsdb-performance-techniques-strings-interning/" rel="noopener noreferrer"&gt;this&lt;/a&gt; is a post from Victoria Metrics on how they applied it in their solution.&lt;/p&gt;
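&lt;p&gt;A quick way to see interning in action is Python's &lt;code&gt;sys.intern&lt;/code&gt;, mentioned above:&lt;/p&gt;

```python
import sys

# Two equal strings built at runtime are normally distinct objects...
a = "".join(["post", "gres"])
b = "".join(["post", "gres"])
print(a == b)  # True: same content
print(a is b)  # typically False in CPython: two separate objects in memory

# ...but interning maps equal contents to one canonical pooled object.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)  # True: both are references to the pooled string
```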

&lt;h2&gt;
  
  
  German Strings
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cedardb.com/blog/german_strings/" rel="noopener noreferrer"&gt;"German strings"&lt;/a&gt; is a clever string optimization technique to know about when you're building a database. The approach stores a 4-character prefix directly within the string header, avoiding pointer dereferences for common string operations. The key insight is that most string operations only need to examine the beginning of a string.&lt;/p&gt;

&lt;p&gt;Let's consider this full string: "PostgreSQL is awesome".&lt;/p&gt;

&lt;p&gt;This is the German string structure:&lt;br&gt;
[length][prefix: "Post"][pointer] -&amp;gt; "PostgreSQL is awesome"&lt;/p&gt;

&lt;p&gt;This creates an indirection pattern: the prefix (a copy of the string's first four characters) enables fast comparisons and filtering without dereferencing the pointer to the full string, since most mismatches can be detected by comparing just those first few characters.&lt;/p&gt;
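&lt;p&gt;To make the fast path concrete, here is a toy Python model of the idea (a sketch only; the real representation is a compact 16-byte struct, and the "pointer" here is just an object reference):&lt;/p&gt;

```python
class GermanString:
    """Toy model: length, a 4-char prefix, and a reference to the full string."""

    def __init__(self, s):
        self.length = len(s)
        self.prefix = s[:4]   # copy of the first four characters
        self.full = s         # stands in for the pointer to the full payload

    def __eq__(self, other):
        # Fast path: different lengths or different prefixes can never be equal,
        # so most mismatches are decided without touching the full string.
        if self.length != other.length or self.prefix != other.prefix:
            return False
        # Slow path: follow the "pointer" and compare the full contents.
        return self.full == other.full

a = GermanString("PostgreSQL is awesome")
b = GermanString("MySQL is awesome too!")
print(a == b)  # False, decided from the prefixes alone
```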

&lt;p&gt;However, German strings aren't always optimal. The overhead per string can be problematic for certain workloads. As the team at &lt;a href="https://www.polarsignals.com/blog/posts/2025/08/26/das-problem-mit-german-strings" rel="noopener noreferrer"&gt;Polar Signals&lt;/a&gt; described, for low-cardinality string columns (like airport codes or status enums), &lt;strong&gt;simple dictionary encoding provides a 75% memory reduction&lt;/strong&gt; compared to German strings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The pattern of creating links to canonical sources is prevalent because it addresses fundamental challenges: storage efficiency, data consistency, and maintainability.&lt;/p&gt;

&lt;p&gt;The next time you encounter repeated data in your systems, ask yourself: &lt;em&gt;"Could I consider a link here instead?"&lt;/em&gt; The answer might lead you to refactor towards more elegant, efficient, and maintainable solutions.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Until next time!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>database</category>
      <category>webdev</category>
      <category>backend</category>
    </item>
    <item>
      <title>Loading One-to-Many relationships efficiently using Spring Data JPA and Postgres</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Sun, 24 Aug 2025 16:45:48 +0000</pubDate>
      <link>https://dev.to/mcadariu/loading-one-to-many-relationships-efficiently-with-spring-data-jpa-and-postgres-1ik9</link>
      <guid>https://dev.to/mcadariu/loading-one-to-many-relationships-efficiently-with-spring-data-jpa-and-postgres-1ik9</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/One-to-many_(data_model)" rel="noopener noreferrer"&gt;One-to-Many relationship&lt;/a&gt;, or parent-child, is a common occurrence in application development. Off the cuff, we can name numerous instances: a sports team and its players, blog posts and their comments, and so on. A natural solution for this use-case is to use a relational database and create foreign key constraints to enforce data integrity. &lt;/p&gt;

&lt;p&gt;This post focuses on the following task: how to efficiently load the &lt;strong&gt;list of parents, and all their corresponding children&lt;/strong&gt; in one go, with &lt;code&gt;Spring Data JPA&lt;/code&gt; and &lt;code&gt;Postgres&lt;/code&gt;. We'll start with the slowest approach (hello, N+1!) and show how to make it faster through successive refinements. The code is available in &lt;a href="https://github.com/mcadariu/one-to-many-relationships-demo" rel="noopener noreferrer"&gt;this repo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authors and books
&lt;/h2&gt;

&lt;p&gt;Let's use a familiar scenario: &lt;code&gt;authors&lt;/code&gt; and their &lt;code&gt;books&lt;/code&gt;. This is how we'll create the tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;             &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="k"&gt;generated&lt;/span&gt; &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;           &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bio&lt;/span&gt;            &lt;span class="nb"&gt;text&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;             &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="k"&gt;generated&lt;/span&gt; &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;          &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;isbn&lt;/span&gt;           &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;published_year&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;author_id&lt;/span&gt;      &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;constraint&lt;/span&gt; &lt;span class="n"&gt;fk_books_author&lt;/span&gt; &lt;span class="k"&gt;foreign&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;author_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;references&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;idx_books_author_id&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;author_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Populating the tables
&lt;/h2&gt;

&lt;p&gt;Let's insert some data to work with. We'll generate &lt;code&gt;1000&lt;/code&gt; authors, each of whom wrote &lt;code&gt;30&lt;/code&gt; books.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="s1"&gt;'Author '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'Bio for author '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;author_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;isbn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;published_year&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;author_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'Book '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;isbn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;FLOOR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;))::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;published_year&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Entities
&lt;/h2&gt;

&lt;p&gt;We'll create two entity classes, &lt;code&gt;Author&lt;/code&gt; and &lt;code&gt;Book&lt;/code&gt;. To learn how to map them correctly with JPA/Hibernate, you can read &lt;a href="https://vladmihalcea.com/the-best-way-to-map-a-onetomany-association-with-jpa-and-hibernate/" rel="noopener noreferrer"&gt;this post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the Book class, we'll reference Author like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt; &lt;span class="nd"&gt;@ManyToOne&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;LAZY&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="nd"&gt;@JoinColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"author_id"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Author&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accordingly, in the Author class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@OneToMany&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mappedBy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"author"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;LAZY&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Book&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, we have the tables, the test data and the entities. So far so good! We're ready to look at ways we can query the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying
&lt;/h2&gt;

&lt;p&gt;We always want to keep a close eye on the queries Hibernate generates for us under the hood, to avoid surprises. We can enable SQL logging with the following setting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@DynamicPropertySource&lt;/span&gt; 
&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;registerPgProperties&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DynamicPropertyRegistry&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spring.jpa.show_sql"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Iteration 1
&lt;/h3&gt;

&lt;p&gt;We'll start with a pure Java approach, which actually looks quite elegant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;authorRepository&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findAll&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;author&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
               &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Book&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBooks&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;AuthorWithBooksDto&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                       &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                       &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                       &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBio&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                       &lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BookDto&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                                  &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTitle&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                                  &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getIsbn&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                                  &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPublishedYear&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                    &lt;span class="o"&gt;);&lt;/span&gt;
                &lt;span class="o"&gt;})&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frh014uow7tac1mrp1ieo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frh014uow7tac1mrp1ieo.png" alt="Nplus1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But when we run it, the console immediately fills up with queries. You've just witnessed the &lt;a href="https://planetscale.com/blog/what-is-n-1-query-problem-and-how-to-solve-it" rel="noopener noreferrer"&gt;N+1 problem&lt;/a&gt;: one query to load the authors, plus one additional query per author to fetch their books. You want to avoid this if you want a fast application.&lt;/p&gt;
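The post doesn't show the DTOs used in the snippet above; a minimal sketch using Java records might look like this (the names `AuthorWithBooksDto` and `BookDto` and their fields are inferred from the getters used above, so treat them as assumptions):

```java
import java.util.List;

public class Dtos {
    // Hypothetical book projection, inferred from getTitle/getIsbn/getPublishedYear
    record BookDto(String title, String isbn, int publishedYear) {}

    // Hypothetical author projection carrying its books
    record AuthorWithBooksDto(long id, String name, String bio, List<BookDto> books) {}

    public static void main(String[] args) {
        AuthorWithBooksDto dto = new AuthorWithBooksDto(
                1L, "Author 1", "Bio for author 1",
                List.of(new BookDto("Book 1 of author 1", "0", 2013)));
        System.out.println(dto.books().size());
    }
}
```

Records keep these projections immutable and free of boilerplate, which fits their role as read-only query results.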

&lt;h3&gt;
  
  
  Iteration 2
&lt;/h3&gt;

&lt;p&gt;Alright, let's make this better. This is what we'll add in the repository class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt; &lt;span class="nd"&gt;@Query&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT a FROM Author a JOIN FETCH a.books"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Author&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;findAllWithBooks&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great stuff! It turns out this cuts the time roughly in half, because Hibernate now generates only one query. As a rule of thumb, try to load all the data you need with a single query. Here it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;a1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;a1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;b1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;author_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;b1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;b1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isbn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;b1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;published_year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;b1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;a1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt; &lt;span class="n"&gt;a1_0&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="n"&gt;b1_0&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;a1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;b1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;author_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's have a look at the &lt;a href="https://www.pgmustard.com/blog/2018/09/21/reading-postgres-query-plans-for-beginners" rel="noopener noreferrer"&gt;explain plan&lt;/a&gt; to learn the steps the database took to retrieve our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Hash Join  (cost=31.50..631.58 rows=30000 width=65) (actual time=0.476..8.912 rows=30000 loops=1)
   Hash Cond: (b1_0.author_id = a1_0.id)
   Buffers: shared hit=230
   -&amp;gt;  Seq Scan on books b1_0  (cost=0.00..521.00 rows=30000 width=29) (actual time=0.009..2.466 rows=30000 loops=1)
         Buffers: shared hit=221
   -&amp;gt;  Hash  (cost=19.00..19.00 rows=1000 width=36) (actual time=0.431..0.432 rows=1000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 77kB
         Buffers: shared hit=9
         -&amp;gt;  Seq Scan on authors a1_0  (cost=0.00..19.00 rows=1000 width=36) (actual time=0.010..0.169 rows=1000 loops=1)
               Buffers: shared hit=9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing surprising: it joined the two tables, authors and books, using the hash join algorithm. Note, though, the &lt;code&gt;rows=30000&lt;/code&gt; on the first line of the explain plan. This tells us that our final result set consists of 30000 rows. Let's also have a look at the layout of these rows. Below are the first 10 rows of the result set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnikrh8oj90wutbbzi2js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnikrh8oj90wutbbzi2js.png" alt="duplication" width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see where I'm going with this. Because of the join, we are fetching into our application a result set that is larger than necessary: the author columns are repeated for every one of that author's books. In other words, it's quite wasteful. &lt;/p&gt;
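To put a number on the duplication: with 1000 authors and 30 books each, every (author, book) pair becomes one row, so the author columns travel over the wire 30 times per author. A quick back-of-the-envelope check:

```java
public class JoinRowCount {
    // every (author, book) pair becomes one row in the join result
    static long joinedRows(long authors, long booksPerAuthor) {
        return authors * booksPerAuthor;
    }

    public static void main(String[] args) {
        long rows = joinedRows(1000, 30);
        System.out.println(rows);        // matches rows=30000 in the explain plan
        System.out.println(rows / 1000); // author columns repeated 30x each
    }
}
```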

&lt;h3&gt;
  
  
  Iteration 3
&lt;/h3&gt;

&lt;p&gt;Let's try something else. We will construct the expected shape of the response fully database-side, using Postgres features. For this, we'll have to write a native query like the one below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Query&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""
            SELECT
                a.id,
                a.name,
                a.bio,
                jsonb_agg(
                    jsonb_build_object(
                        'id', b.id,
                        'title', b.title,
                        'isbn', b.isbn,
                        'publishedYear', b.published_year
                    )
                ) AS books
            FROM authors a
            JOIN books b ON b.author_id = a.id
            GROUP BY a.id;
        """&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nativeQuery&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;[]&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;findAllWithBooksAsJson&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
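Since the native query returns `List&lt;Object[]&gt;`, each row still has to be mapped in Java. A minimal sketch is below; the column order follows the SELECT above, but the row type `AuthorRow` and the choice to keep the aggregated books as a raw JSON string are my assumptions, not something shown in the post:

```java
public class RowMapper {
    // Hypothetical holder for one row of the native query result
    record AuthorRow(long id, String name, String bio, String booksJson) {}

    // Columns in SELECT order: id, name, bio, books (the jsonb value rendered as text)
    static AuthorRow mapRow(Object[] row) {
        return new AuthorRow(
                ((Number) row[0]).longValue(),
                (String) row[1],
                (String) row[2],
                String.valueOf(row[3]));
    }

    public static void main(String[] args) {
        Object[] fake = {652L, "Author 652", "Bio for author 652", "[{\"id\": 652}]"};
        System.out.println(mapRow(fake).name());
    }
}
```

From here, a JSON library such as Jackson could turn `booksJson` into typed book objects before returning the response.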



&lt;p&gt;This is what the result looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 652 | Author 652  | Bio for author 652  | [{"id": 652, "isbn": "0", "title": "Book 1 of author 652", "publishedYear": 2013}, {"id": 1652, "isbn": "1", "title": "Book 2 of author 652", "publishedYear": 2010}, {"id": 2652, "isbn": "1", "title": "Book 3 of author 652", "publishedYear": 2000}, {"id": 3652, "isbn": "0", "title": "Book 4 of author 652", "publishedYear": 2002}, {"id": 4652, "isbn": "1", "title": "Book 5 of author 652", "publishedYear": 2019}, {"id": 5652, "isbn": "1", "title": "Book 6 of author 652", "publishedYear": 2006}, {"id": 6652, "isbn": "1", "title": "Book 7 of author 652", "publishedYear": 2020}, {"id": 7652, "isbn": "1", "title": "Book 8 of author 652", "publishedYear": 2004}, {"id": 8652, "isbn": "1", "title": "Book 9 of author 652", "publishedYear": 2010}, {"id": 9652, "isbn": "1", "title": "Book 10 of author 652", "publishedYear": 2022}, {"id": 10652, "isbn": "1", "title": "Book 11 of author 652", "publishedYear": 2001}, {"id": 11652, "isbn": "1", "title": "Book 12 of author 652", "publishedYear": 2010}, {"id": 12652, "isbn": "1", "title": "Book 13 of author 652", "publishedYear": 2024}, {"id": 13652, "isbn": "1", "title": "Book 14 of author 652", "publishedYear": 2021}, {"id": 14652, "isbn": "0", "title": "Book 15 of author 652", "publishedYear": 2004}, {"id": 15652, "isbn": "1", "title": "Book 16 of author 652", "publishedYear": 2001}, {"id": 16652, "isbn": "1", "title": "Book 17 of author 652", "publishedYear": 2001}, {"id": 17652, "isbn": "0", "title": "Book 18 of author 652", "publishedYear": 2020}, {"id": 18652, "isbn": "0", "title": "Book 19 of author 652", "publishedYear": 2009}, {"id": 19652, "isbn": "1", "title": "Book 20 of author 652", "publishedYear": 2000}, {"id": 20652, "isbn": "1", "title": "Book 21 of author 652", "publishedYear": 2000}, {"id": 21652, "isbn": "1", "title": "Book 22 of author 652", "publishedYear": 2013}, {"id": 22652, "isbn": "1", "title": "Book 23 of author 
652", "publishedYear": 2012}, {"id": 23652, "isbn": "0", "title": "Book 24 of author 652", "publishedYear": 2014}, {"id": 24652, "isbn": "0", "title": "Book 25 of author 652", "publishedYear": 2001}, {"id": 25652, "isbn": "1", "title": "Book 26 of author 652", "publishedYear": 2016}, {"id": 26652, "isbn": "1", "title": "Book 27 of author 652", "publishedYear": 2014}, {"id": 27652, "isbn": "0", "title": "Book 28 of author 652", "publishedYear": 2024}, {"id": 28652, "isbn": "0", "title": "Book 29 of author 652", "publishedYear": 2016}, {"id": 29652, "isbn": "0", "title": "Book 30 of author 652", "publishedYear": 2012}]
 273 | Author 273  | Bio for author 273  | [{"id": 273, "isbn": "1", "title": "Book 1 of author 273", "publishedYear": 2007}, {"id": 1273, "isbn": "1", "title": "Book 2 of author 273", "publishedYear": 2010}, {"id": 2273, "isbn": "1", "title": "Book 3 of author 273", "publishedYear": 2023}, {"id": 3273, "isbn": "0", "title": "Book 4 of author 273", "publishedYear": 2010}, {"id": 4273, "isbn": "1", "title": "Book 5 of author 273", "publishedYear": 2020}, {"id": 5273, "isbn": "0", "title": "Book 6 of author 273", "publishedYear": 2013}, {"id": 6273, "isbn": "0", "title": "Book 7 of author 273", "publishedYear": 2008}, {"id": 7273, "isbn": "1", "title": "Book 8 of author 273", "publishedYear": 2012}, {"id": 8273, "isbn": "1", "title": "Book 9 of author 273", "publishedYear": 2001}, {"id": 9273, "isbn": "0", "title": "Book 10 of author 273", "publishedYear": 2011}, {"id": 10273, "isbn": "0", "title": "Book 11 of author 273", "publishedYear": 2005}, {"id": 11273, "isbn": "1", "title": "Book 12 of author 273", "publishedYear": 2012}, {"id": 12273, "isbn": "1", "title": "Book 13 of author 273", "publishedYear": 2010}, {"id": 13273, "isbn": "1", "title": "Book 14 of author 273", "publishedYear": 2013}, {"id": 14273, "isbn": "1", "title": "Book 15 of author 273", "publishedYear": 2019}, {"id": 15273, "isbn": "1", "title": "Book 16 of author 273", "publishedYear": 2004}, {"id": 16273, "isbn": "1", "title": "Book 17 of author 273", "publishedYear": 2022}, {"id": 17273, "isbn": "0", "title": "Book 18 of author 273", "publishedYear": 2021}, {"id": 18273, "isbn": "0", "title": "Book 19 of author 273", "publishedYear": 2004}, {"id": 19273, "isbn": "1", "title": "Book 20 of author 273", "publishedYear": 2022}, {"id": 20273, "isbn": "1", "title": "Book 21 of author 273", "publishedYear": 2021}, {"id": 21273, "isbn": "1", "title": "Book 22 of author 273", "publishedYear": 2016}, {"id": 22273, "isbn": "1", "title": "Book 23 of author 273", "publishedYear": 2002}, {"id": 23273, 
"isbn": "0", "title": "Book 24 of author 273", "publishedYear": 2015}, {"id": 24273, "isbn": "1", "title": "Book 25 of author 273", "publishedYear": 2010}, {"id": 25273, "isbn": "1", "title": "Book 26 of author 273", "publishedYear": 2021}, {"id": 26273, "isbn": "1", "title": "Book 27 of author 273", "publishedYear": 2016}, {"id": 27273, "isbn": "0", "title": "Book 28 of author 273", "publishedYear": 2017}, {"id": 28273, "isbn": "0", "title": "Book 29 of author 273", "publishedYear": 2024}, {"id": 29273, "isbn": "1", "title": "Book 30 of author 273", "publishedYear": 2007}]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's have a look at the explain plan as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; HashAggregate  (cost=856.58..869.08 rows=1000 width=68) (actual time=68.398..74.411 rows=1000 loops=1)
   Group Key: a.id
   Batches: 1  Memory Usage: 23361kB
   Buffers: shared hit=230
   -&amp;gt;  Hash Join  (cost=31.50..631.58 rows=30000 width=57) (actual time=0.385..7.345 rows=30000 loops=1)
         Hash Cond: (b.author_id = a.id)
         Buffers: shared hit=230
         -&amp;gt;  Seq Scan on books b  (cost=0.00..521.00 rows=30000 width=29) (actual time=0.012..1.821 rows=30000 loops=1)
               Buffers: shared hit=221
         -&amp;gt;  Hash  (cost=19.00..19.00 rows=1000 width=36) (actual time=0.356..0.357 rows=1000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 77kB
               Buffers: shared hit=9
               -&amp;gt;  Seq Scan on authors a  (cost=0.00..19.00 rows=1000 width=36) (actual time=0.005..0.109 rows=1000 loops=1)
                     Buffers: shared hit=9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same hash join you saw before, plus the operations needed to produce the in-line JSON list of books for every author. Note that we're now returning &lt;strong&gt;only 1000 rows&lt;/strong&gt;, 30 times fewer than before. On my laptop, it ran only slightly faster than Option 2 above. However, in a typical three-tier architecture, where the database is separated from the application by a network, this approach can make all the difference, because much less data has to be transported over the network for every user request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have seen three ways to retrieve the data we needed and looked closely at what makes one approach more performant than the others. Option 3 is the fastest, but also the least portable, because it relies on a native query that leverages Postgres-specific functionality. &lt;/p&gt;

&lt;p&gt;I hope you enjoyed reading this, and perhaps you can even apply these techniques to make your own application faster. &lt;/p&gt;

&lt;p&gt;Thanks for reading! Until next time! &lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>spring</category>
      <category>java</category>
    </item>
    <item>
      <title>How to add rate limiting to your API using TigerBeetle</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Fri, 30 May 2025 19:03:22 +0000</pubDate>
      <link>https://dev.to/mcadariu/how-to-add-rate-limiting-to-a-spring-boot-app-using-tigerbeetle-16mm</link>
      <guid>https://dev.to/mcadariu/how-to-add-rate-limiting-to-a-spring-boot-app-using-tigerbeetle-16mm</guid>
      <description>&lt;p&gt;You should always consider having explicit limits in place when building software. For online services this ensures fair use and also prevents operational headaches. You witnessed the concept in the "real world" as well - in some more busy restaurants, you have only a limited time slot in which to enjoy being seated at a table. &lt;/p&gt;

&lt;p&gt;In this post, I'll show you in detail one solution for adding rate limiting to a Spring Boot API application. For the book-keeping required to make this work, we will be using &lt;a href="https://tigerbeetle.com" rel="noopener noreferrer"&gt;TigerBeetle&lt;/a&gt;, a financial transactions OLTP database that recently caught my attention and that I wanted to try out. As a bonus, I'll show you how to capture and visualise your app's rate limiting capability using Prometheus and Grafana, a common open-source stack for application observability. This &lt;a href="https://github.com/mcadariu/tigerbeetle-spring-boot-ratelimiter" rel="noopener noreferrer"&gt;repo&lt;/a&gt; contains the code I'm about to show you, if you'd like to check it out. Onwards!&lt;/p&gt;

&lt;h2&gt;
  
  
  TigerBeetle
&lt;/h2&gt;

&lt;p&gt;TigerBeetle is a financial transactions database that appeared a couple of years ago. Its ambition is to provide a highly performant and reliable OLTP database for customers operating at massive scale. Reading about its &lt;a href="https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md" rel="noopener noreferrer"&gt;design decisions&lt;/a&gt; is rather captivating and in some ways reminds me of the &lt;a href="https://martinfowler.com/articles/lmax.html" rel="noopener noreferrer"&gt;LMAX&lt;/a&gt; architecture. The schema is very simple, by design. The main concept is &lt;a href="https://docs.tigerbeetle.com/concepts/debit-credit/" rel="noopener noreferrer"&gt;debit / credit&lt;/a&gt;: a very flexible abstraction that can be applied to many use cases, even outside the financial domain. After all, the idea of a "transaction" is pretty universal. On their website, you can find several recipes that can serve as starting points for working with it. In the next sections, I will be applying the &lt;a href="https://docs.tigerbeetle.com/coding/recipes/rate-limiting/" rel="noopener noreferrer"&gt;rate limiting&lt;/a&gt; recipe; upon reading it, it's really clear what to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives
&lt;/h2&gt;

&lt;p&gt;When doing Spring Boot application development, I expect you will most frequently encounter Redis as a backing data store for rate limiting. The existing integrations make it easy to start using it. You have the option to include &lt;a href="https://docs.spring.io/spring-cloud-gateway/reference/spring-cloud-gateway-server-webflux/gatewayfilter-factories/requestratelimiter-factory.html" rel="noopener noreferrer"&gt;Spring Cloud Gateway&lt;/a&gt; as a dependency and you're off to the races after you configure some things. If you already have experience with Redis, that's a totally fine route to take as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;We start our work, as usual with Spring Boot development, by going to &lt;a href="https://start.spring.io/" rel="noopener noreferrer"&gt;start.spring.io&lt;/a&gt; and selecting &lt;code&gt;Spring Web&lt;/code&gt; as a dependency. We'll develop this initial empty shell into a little web application with a single API endpoint. Let's add an initial class which determines what we do when we receive web requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@RestController&lt;/span&gt;
&lt;span class="nd"&gt;@RequiredArgsConstructor&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GreetingController&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@GetMapping&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/greeting"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;greeting&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Intercepting requests
&lt;/h2&gt;

&lt;p&gt;Now, we want to add rate limiting to this endpoint. This means that we have to hook into the Spring request handling mechanism and inject our rate limiting logic between the point where the request is received and when it's handed over to the &lt;code&gt;GreetingController&lt;/code&gt;. We do this by creating a class which implements the &lt;code&gt;HandlerInterceptor&lt;/code&gt; interface and then providing it to the &lt;code&gt;InterceptorRegistry&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addInterceptor&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rateLimitInterceptor&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When constructing the interceptor, we have to provide the TigerBeetle &lt;a href="https://docs.tigerbeetle.com/coding/clients/java" rel="noopener noreferrer"&gt;client&lt;/a&gt; and the observation registry as collaborating services for the rate limiting. At this point, you might want an introduction to the observation registry and the other related topics; I recommend &lt;a href="https://spring.io/blog/2022/10/12/observability-with-spring-boot-3" rel="noopener noreferrer"&gt;this post&lt;/a&gt; from the Spring blog to get familiar with how the integration between Spring Boot and the observability stack works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Bean&lt;/span&gt;
&lt;span class="nd"&gt;@RequestScope&lt;/span&gt;
&lt;span class="nc"&gt;HandlerInterceptor&lt;/span&gt; &lt;span class="nf"&gt;rateLimitInterceptor&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RateLimitInterceptor&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observationRegistry&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logic for performing the rate limiting will be in the implementation of the &lt;code&gt;preHandle&lt;/code&gt; method which is part of the &lt;code&gt;HandlerInterceptor&lt;/code&gt; interface. &lt;/p&gt;

&lt;p&gt;Note that this means &lt;em&gt;all&lt;/em&gt; your endpoints will be subject to rate limiting. If you want to, you can define a list of exceptions, or create custom annotations which you apply to specific endpoints for more fine-grained control. But for this post, we're keeping it simple. &lt;/p&gt;

&lt;h2&gt;
  
  
  Every request means a debit
&lt;/h2&gt;

&lt;p&gt;Let us now define two accounts: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the operator &lt;/li&gt;
&lt;li&gt;the user &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operator is responsible for initialising each user account with a finite amount of credits, from which our application deducts when handling every request from that particular user. In addition, the user account has the following important restriction: the debits must not exceed the credits. For every request, we make a transfer from the user to the operator, but once the limit is reached, we short-circuit the request and return a &lt;code&gt;429&lt;/code&gt; ("Too Many Requests") response code right away. &lt;/p&gt;
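The invariant behind the recipe can be sketched in plain Java. This is a simplification of what TigerBeetle enforces server-side via the debits-must-not-exceed-credits flag; the class and method names here are made up for illustration:

```java
public class RateLimitAccount {
    private final long credits; // amount the operator pre-funded for this user
    private long debits;        // amount consumed by requests so far

    RateLimitAccount(long credits) { this.credits = credits; }

    // Returns true if the debit is accepted; false means "reply with 429"
    boolean tryDebit(long amount) {
        if (debits + amount > credits) {
            return false; // invariant: debits must not exceed credits
        }
        debits += amount;
        return true;
    }

    public static void main(String[] args) {
        RateLimitAccount account = new RateLimitAccount(3); // 3 requests allowed
        for (int i = 0; i < 5; i++) {
            System.out.println(account.tryDebit(1)); // true, true, true, false, false
        }
    }
}
```

In the real setup, this check is a transfer that TigerBeetle accepts or rejects atomically, so it stays correct even with many application instances hitting the same account.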

&lt;p&gt;Worth mentioning is that an account can represent any kind of resource we are interested in rate limiting, such as an IP address, a customer, etc. &lt;/p&gt;

&lt;p&gt;Here is how the creation of the user account looks. The &lt;code&gt;USER_ID&lt;/code&gt; is just a randomly generated integer, but you can imagine that in a real system it's retrieved from something like an authentication system. In the reference &lt;a href="https://docs.tigerbeetle.com/coding/system-architecture/" rel="noopener noreferrer"&gt;system architecture&lt;/a&gt;, this would be what is depicted as the OLGP database (e.g. Postgres).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;   &lt;span class="nc"&gt;AccountBatch&lt;/span&gt; &lt;span class="n"&gt;accountBatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AccountBatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;accountBatch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="n"&gt;accountBatch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;USER_ID&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;accountBatch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setLedger&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;accountBatch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;accountBatch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setFlags&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;DEBITS_MUST_NOT_EXCEED_CREDITS&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

   &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createAccounts&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accountBatch&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the interface is modelled around batching. This comes back to performance being a first-class principle in TigerBeetle: with batching, we amortise the fixed per-request overhead across many operations. Given the use case we're tackling here, our batch is limited to one account, but normally you would have more.&lt;/p&gt;
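&lt;p&gt;To see why batching matters, here's a back-of-the-envelope model with made-up numbers (not TigerBeetle measurements): a fixed per-round-trip overhead gets divided across all operations in the batch, so the effective cost per operation shrinks as the batch grows.&lt;/p&gt;

```java
// Illustrative amortisation model: each round trip pays a fixed overhead,
// shared across every operation included in the batch.
public class BatchAmortisation {
    static double costPerOp(double fixedOverheadMicros, double perOpMicros, int batchSize) {
        return fixedOverheadMicros / batchSize + perOpMicros;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 100µs round trip, 1µs of work per operation.
        System.out.println(costPerOp(100, 1, 1));    // a batch of one pays the full overhead
        System.out.println(costPerOp(100, 1, 1000)); // a large batch makes it nearly invisible
    }
}
```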

&lt;p&gt;Onto the method we use to perform a transfer. It is invoked on every web request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;CreateTransferResultBatch&lt;/span&gt; &lt;span class="nf"&gt;makeTransfer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;debitAcct&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;creditAcct&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;TransferBatch&lt;/span&gt; &lt;span class="n"&gt;transfer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TransferBatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;nextInt&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDebitAccountId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debitAcct&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCreditAccountId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;creditAcct&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setLedger&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setAmount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setFlags&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTimeout&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createTransfers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;flag&lt;/code&gt; and &lt;code&gt;timeout&lt;/code&gt; parameters are needed because for every user request, we create a "pending" transfer (one of the transfer flag types), which expires after &lt;code&gt;timeout&lt;/code&gt; seconds. This way, the allowance replenishes after a configurable period, which is exactly what we want. &lt;/p&gt;
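&lt;p&gt;Conceptually, the replenishment works like the following plain-Java sketch (hypothetical names; in reality TigerBeetle expires pending transfers internally, so the application does not track this itself):&lt;/p&gt;

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: each deduction is a pending debit that expires after a timeout,
// at which point the user's allowance is restored.
class PendingWindow {
    record Pending(long amount, long expiresAtMillis) {}

    private final Deque<Pending> pending = new ArrayDeque<>();
    private long pendingTotal = 0;

    void debit(long amount, long nowMillis, long timeoutMillis) {
        pending.addLast(new Pending(amount, nowMillis + timeoutMillis));
        pendingTotal += amount;
    }

    // Drop expired pending debits before computing the remaining allowance,
    // mirroring how a pending transfer lapses when its timeout elapses.
    long available(long limit, long nowMillis) {
        while (!pending.isEmpty() && pending.peekFirst().expiresAtMillis() <= nowMillis) {
            pendingTotal -= pending.removeFirst().amount();
        }
        return limit - pendingTotal;
    }
}
```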

&lt;p&gt;On the first request by a user, we have to initialise the account by doing a transfer from the operator to the user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt; &lt;span class="n"&gt;makeTransfer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
   &lt;span class="no"&gt;USER_CREDIT_INITIAL_AMOUNT&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="no"&gt;OPERATOR_ID&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="no"&gt;USER_ID&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="mi"&gt;0&lt;/span&gt;
 &lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For every intercepted request, we perform a deduction from the user's account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt; &lt;span class="nc"&gt;CreateTransferResultBatch&lt;/span&gt; &lt;span class="n"&gt;transferErrors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
   &lt;span class="n"&gt;makeTransfer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
     &lt;span class="no"&gt;PER_REQUEST_DEDUCTION&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
     &lt;span class="no"&gt;USER_ID&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
     &lt;span class="no"&gt;OPERATOR_ID&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
     &lt;span class="no"&gt;TIMEOUT_IN_SECONDS&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
     &lt;span class="no"&gt;PENDING&lt;/span&gt;
   &lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the above operation returns an error of type &lt;code&gt;ExceedsCredits&lt;/code&gt; (one of the values of the &lt;code&gt;CreateTransferResult&lt;/code&gt; enum), we will not let the request proceed. We will send an observation to our observability stack, set an attribute on the current tracing span, and set the response code to &lt;code&gt;429&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;  &lt;span class="nc"&gt;Observation&lt;/span&gt; &lt;span class="n"&gt;observation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ratelimit"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observationRegistry&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"limited"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
  &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;highCardinalityKeyValue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valueOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;USER_ID&lt;/span&gt;&lt;span class="o"&gt;)));&lt;/span&gt;
  &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stop&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

  &lt;span class="nc"&gt;Span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;current&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valueOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;USER_ID&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

  &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStatus&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;TOO_MANY_REQUESTS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;So far so good. Let's write a Spring Boot test in which we assert that what I've described above actually happens as we expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;com.example.tigerbeetle_ratelimiter&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;micrometer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;observation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tck&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;TestObservationRegistryAssert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;assertThat&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;static&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;assertj&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;core&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Assertions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;assertThat&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;@SpringBootTest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;webEnvironment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SpringBootTest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;WebEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RANDOM_PORT&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@Testcontainers&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RatelimiterApplicationTests&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;ENDPOINT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/greeting"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Container&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;DockerComposeContainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DockerComposeContainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"docker-compose.yml"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="nd"&gt;@Autowired&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;TestRestTemplate&lt;/span&gt; &lt;span class="n"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Autowired&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;TestObservationRegistry&lt;/span&gt; &lt;span class="n"&gt;observationRegistry&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Test&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;contextLoads&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Test&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;shouldRejectRequestsBeyondRateLimit&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;USER_CREDIT_INITIAL_AMOUNT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="no"&gt;PER_REQUEST_DEDUCTION&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getForEntity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// The next request should be rate limited&lt;/span&gt;
        &lt;span class="nc"&gt;ResponseEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getForEntity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getStatusCode&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;isEqualTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HttpStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;TOO_MANY_REQUESTS&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observationRegistry&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasObservationWithNameEqualTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ratelimit"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;that&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasBeenStarted&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasBeenStopped&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@TestConfiguration&lt;/span&gt;
    &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ObservationTestConfiguration&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="nd"&gt;@Bean&lt;/span&gt;
        &lt;span class="nc"&gt;TestObservationRegistry&lt;/span&gt; &lt;span class="nf"&gt;observationRegistry&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TestObservationRegistry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Time to show what happens when we run it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Showtime
&lt;/h2&gt;

&lt;p&gt;In production environments, TigerBeetle is normally deployed as a cluster of multiple replicas. However, given that we're just experimenting with it locally, we'll start a single instance, fully accepting that it is not set up in a highly available fashion and that we would not do this in production. &lt;/p&gt;

&lt;p&gt;Let's format the data file first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--security-opt&lt;/span&gt; &lt;span class="nv"&gt;seccomp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;unconfined &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/data:/data ghcr.io/tigerbeetle/tigerbeetle &lt;span class="se"&gt;\&lt;/span&gt;
    format &lt;span class="nt"&gt;--cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;--replica&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;--replica-count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 /data/0_0.tigerbeetle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result, a folder called &lt;code&gt;data&lt;/code&gt; was created, containing a file called &lt;code&gt;0_0.tigerbeetle&lt;/code&gt;. This single file is where the TigerBeetle replica will store our rate-limiting book-keeping data.&lt;/p&gt;

&lt;p&gt;We're now ready to start our docker-compose setup where everything is wired up and ready to go. &lt;/p&gt;

&lt;p&gt;We will first install the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./mvnw &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this, we are ready to start our full environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If all services started correctly, we're in business! &lt;/p&gt;

&lt;h2&gt;
  
  
  Load testing
&lt;/h2&gt;

&lt;p&gt;As a next step, let's set up some requests that will hit the endpoint. &lt;a href="https://k6.io/" rel="noopener noreferrer"&gt;k6&lt;/a&gt; is a load-testing tool which is very handy for these situations. It's easy to work with: you write JavaScript code to describe the load you want to generate, and when you run it, k6 executes it against your target.&lt;/p&gt;

&lt;p&gt;This is the content of the k6 script. We'll issue roughly 3000 requests over a span of 30 seconds: 100 virtual users, each making about one request per second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;k6/http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;check&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;k6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;vus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;30s&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://host.docker.internal:8080/greeting&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;status is 200&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll now run the script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; grafana/k6 run - &amp;lt;script.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After 30 seconds, we get the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; █ TOTAL RESULTS 

    checks_total.......................: 3000   97.661337/s
    checks_succeeded...................: 16.80% 504 out of 3000
    checks_failed......................: 83.20% 2496 out of 3000

    ✗ status is 200
      ↳  16% — ✓ 504 / ✗ 2496

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, more requests were rate limited than succeeded. We were not kidding: we applied quite a high deduction per request, but in a real app we might want to take our foot off the brakes!&lt;/p&gt;
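&lt;p&gt;The split between successes and rejections follows directly from the allowance arithmetic: within one timeout window, the number of requests that succeed is the initial credit divided by the per-request deduction; everything beyond that gets a &lt;code&gt;429&lt;/code&gt; until pending transfers expire and the allowance replenishes. The values below are hypothetical, standing in for &lt;code&gt;USER_CREDIT_INITIAL_AMOUNT&lt;/code&gt; and &lt;code&gt;PER_REQUEST_DEDUCTION&lt;/code&gt;.&lt;/p&gt;

```java
public class AllowanceMath {
    // Requests that succeed within one timeout window, before replenishment.
    static long allowedPerWindow(long initialCredit, long perRequestDeduction) {
        return initialCredit / perRequestDeduction;
    }

    public static void main(String[] args) {
        // Hypothetical: a credit of 100 with a deduction of 10 per request
        // lets 10 requests through per window.
        System.out.println(allowedPerWindow(100, 10));
    }
}
```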

&lt;h2&gt;
  
  
  Visualising rate limiting
&lt;/h2&gt;

&lt;p&gt;Moving over to Grafana, I've prepared a pre-configured dashboard for your convenience, which we'll now open up and have a look at. Go to &lt;code&gt;localhost:3000&lt;/code&gt;, fill in &lt;code&gt;admin&lt;/code&gt;/&lt;code&gt;admin&lt;/code&gt; as credentials, and click &lt;code&gt;Skip&lt;/code&gt; when asked about changing the password. Then, on the left side of the screen, click on &lt;code&gt;Dashboards&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhr04jgfdq3o0lx8iw4i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhr04jgfdq3o0lx8iw4i.png" alt="dashboard" width="618" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You'll then see our preconfigured dashboard called &lt;code&gt;Rate limiting&lt;/code&gt;. Click on it and you will see the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl4tv5rs55nqqwq0cd4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl4tv5rs55nqqwq0cd4d.png" alt="dash" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alright, time to have a look at the request traces. These show you the "path" taken by the request through our code. This is where you can find them in the menu. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek5jt065ty0ukpq42bc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek5jt065ty0ukpq42bc4.png" alt="traces" width="600" height="1082"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the lower part of the next screen you will see some outstanding green dots. Those are so-called &lt;a href="https://grafana.com/docs/grafana/latest/fundamentals/exemplars/" rel="noopener noreferrer"&gt;exemplars&lt;/a&gt;. Metrics give you an aggregated perspective of what you're tracking, but exemplars let you drill down into particular single instances. Here's what one looks like. I have highlighted the span attribute representing the user ID, which we set in the Java code you've seen earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fta6lomjv3zn1sj9or3ap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fta6lomjv3zn1sj9or3ap.png" alt="exemplars" width="800" height="614"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The End
&lt;/h2&gt;

&lt;p&gt;Like I've mentioned before, having limits in place for everything is a good thing. Same goes for this post! 😀 &lt;br&gt;
So - that's all I have for you today, hope you enjoyed it and thanks for reading.&lt;/p&gt;

&lt;p&gt;Time to clean up by tearing down our setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose down -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thanks - until next time! &lt;/p&gt;

&lt;p&gt;Cover Photo by &lt;a href="https://unsplash.com/@spencer_demera?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Spencer DeMera&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-speed-limit-sign-sitting-on-the-side-of-a-road-uP_dMJpq2WA?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>springboot</category>
      <category>database</category>
      <category>java</category>
      <category>grafana</category>
    </item>
    <item>
      <title>High speed data loading into Postgres</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Tue, 04 Feb 2025 21:25:03 +0000</pubDate>
      <link>https://dev.to/mcadariu/high-speed-data-loading-in-postgres-3635</link>
      <guid>https://dev.to/mcadariu/high-speed-data-loading-in-postgres-3635</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;In this post I'll show you how to speed up data loading in Postgres. Using a worked example, we'll start at half an hour runtime and end up with a version which is done in half a &lt;em&gt;minute&lt;/em&gt;. Step by step, we'll get it &lt;code&gt;~60x&lt;/code&gt; faster. This is so you wait less and can jump straight to querying your data right away.&lt;/p&gt;

&lt;p&gt;In the sections below we'll apply a series of optimisations in order, and understand &lt;em&gt;why&lt;/em&gt; each one speeds up the data loading.&lt;/p&gt;

&lt;p&gt;I'm using Postgres on my laptop (Apple MacBook Pro M1 and 32GB RAM).  &lt;/p&gt;

&lt;h2&gt;
  
  
  Tables
&lt;/h2&gt;

&lt;p&gt;If you've checked out my other posts, you'll recognise familiar tables. They're the same ones I used for &lt;a href="https://dev.to/mcadariu/retrieving-the-latest-row-per-group-in-postgresql-247d"&gt;this&lt;/a&gt; post. We have &lt;code&gt;meters&lt;/code&gt; and their &lt;code&gt;readings&lt;/code&gt; stored in their respective tables; here's what they look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;  &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="n"&gt;id&lt;/span&gt;           &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meter_id&lt;/span&gt;     &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rating&lt;/span&gt;       &lt;span class="nb"&gt;double&lt;/span&gt; &lt;span class="nb"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;date&lt;/span&gt;         &lt;span class="nb"&gt;date&lt;/span&gt;

    &lt;span class="k"&gt;constraint&lt;/span&gt; &lt;span class="n"&gt;fk__readings_meters&lt;/span&gt; &lt;span class="k"&gt;foreign&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;references&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll be inserting 15000 meters, each with one reading per day, for 5 years.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; 
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;uuidv4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;uuidv4&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2019-02-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm letting this run, and after a (rather long) while, it's done in &lt;code&gt;2098.31s&lt;/code&gt;, or roughly 35 minutes.&lt;/p&gt;

&lt;p&gt;It's just our starting point, we'll get it much faster. Buckle up!&lt;/p&gt;

&lt;h2&gt;
  
  
  UUID v7
&lt;/h2&gt;

&lt;p&gt;Let's start by focusing on the primary key. UUID v4 is not the best when it comes to insert performance. I elaborated on why that is in &lt;a href="https://dev.to/mcadariu/using-uuids-as-primary-keys-3e7a"&gt;another post&lt;/a&gt;, but the gist is that its randomness causes a lot of page modifications: the database has to do a lot of work to keep the B-tree balanced after every tuple inserted. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=78c5e141e9c139fc2ff36a220334e4aa25e1b0eb" rel="noopener noreferrer"&gt;Recently&lt;/a&gt;, Postgres got support for &lt;a href="https://uuid7.com/" rel="noopener noreferrer"&gt;UUID v7&lt;/a&gt;! It will be available in version 18, however we can already use it if we work with the source code directly. These are time-sortable identifiers, which means insertions will "affect" only a isolated and specific part of the B-tree instead of everything. This means much less work for the database to do. Let's give this a try.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; 
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;uuidv7&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;uuidv7&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2019-02-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check this out - with this, we've cut the time by more than half! It finished in &lt;code&gt;821.50s&lt;/code&gt;. So far so good.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numeric IDs
&lt;/h2&gt;

&lt;p&gt;Let's try something else. The UUIDs themselves have to be generated for every row (the call to &lt;code&gt;uuidv7()&lt;/code&gt;). Let's replace the UUID primary key with a numeric one, whose values are simply read off a sequence. In addition, the corresponding data type (&lt;code&gt;bigint&lt;/code&gt;) is, at 8 bytes, half the size of a UUID. Sounds good, let's see where this brings us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; 
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="n"&gt;id&lt;/span&gt;           &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="k"&gt;generated&lt;/span&gt; &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meter_id&lt;/span&gt;     &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reading&lt;/span&gt;      &lt;span class="nb"&gt;double&lt;/span&gt; &lt;span class="nb"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;date&lt;/span&gt;         &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="k"&gt;constraint&lt;/span&gt; &lt;span class="n"&gt;fk__readings_meters&lt;/span&gt; &lt;span class="k"&gt;foreign&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;references&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the updated script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; 
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2019-02-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gets us a little further indeed! We're at &lt;code&gt;646.73s&lt;/code&gt;. So, a bit over 10 minutes. This is great, but we've still got work to do - remember, we're eventually going to get it about 20 times faster than this.&lt;/p&gt;

&lt;p&gt;Let's move on to doing some configuration tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared buffers
&lt;/h2&gt;

&lt;p&gt;Postgres uses its shared memory buffers to make reads and writes more efficient. They are a set of 8kB pages kept in memory so that Postgres can avoid going to the slower disk for every operation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0daet0wtmnw6mmmv8cm4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0daet0wtmnw6mmmv8cm4.jpg" alt="shared-buffers" width="588" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we don't size the shared buffers correctly, we can expect a lot of &lt;a href="https://www.dbi-services.com/blog/postgresql-16-playing-with-pg_stat_io-2-evictions/" rel="noopener noreferrer"&gt;evictions&lt;/a&gt; during our data loading, slowing it down. The default setting of &lt;code&gt;128 MB&lt;/code&gt; is low compared to how much data we're inserting (~2GB), so I'll increase the shared buffers accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;shared_buffers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2GB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
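&lt;p&gt;Note that, unlike most settings, &lt;code&gt;shared_buffers&lt;/code&gt; only takes effect after a restart. Once Postgres is back up, we can confirm the active value with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;show shared_buffers;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;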



&lt;p&gt;Time to run it again. As expected, this brought us closer to our goal. We're now at &lt;code&gt;595.33s&lt;/code&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Full page writes
&lt;/h2&gt;

&lt;p&gt;As mentioned above, Postgres works with 8kB pages, but the OS and the disk do not (e.g. in Linux a memory page is 4kB, and a disk sector is 512 bytes). In the event of a power failure, this mismatch can leave pages only partially written. That would prevent Postgres from doing its crash recovery, because the recovery protocol relies on pages not being corrupted in any way. The solution is that after every &lt;a href="https://www.postgresql.org/docs/current/sql-checkpoint.html" rel="noopener noreferrer"&gt;checkpoint&lt;/a&gt;, at the first update of a page, the full page is written to the WAL instead of only the changes, as is the common case.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqxhn3vwkvek4ok6ax41.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqxhn3vwkvek4ok6ax41.jpg" alt="full-page-writes" width="330" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the sake of experimentation let's turn it off. However, I do not recommend doing this in production, except temporarily, strictly for the duration of the data loading.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;full_page_writes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hmm, well, it got us to &lt;code&gt;590.01s&lt;/code&gt;. It's not that much, but we'll take it! &lt;/p&gt;
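&lt;p&gt;When the load is done, remember to turn it back on. &lt;code&gt;full_page_writes&lt;/code&gt; can be changed with just a configuration reload, no restart needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;alter system set full_page_writes = on;
select pg_reload_conf();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;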

&lt;h2&gt;
  
  
  Constraints
&lt;/h2&gt;

&lt;p&gt;Next up, we'll remove the table constraints. From the script above, I'll drop the &lt;code&gt;fk__readings_meters&lt;/code&gt; foreign key constraint. The database then has less work to do, because it no longer checks the constraint for every inserted row.&lt;/p&gt;
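&lt;p&gt;For reference, dropping the constraint, and recreating it once the load has finished (at which point Postgres validates all existing rows in one go), looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;alter table readings drop constraint fk__readings_meters;

-- after the data load:
alter table readings
    add constraint fk__readings_meters foreign key (meter_id) references meters (id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;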

&lt;p&gt;Quite a difference this made. We're now at &lt;code&gt;150.71s&lt;/code&gt;. This is the biggest gain so far. &lt;/p&gt;

&lt;h2&gt;
  
  
  Indexes
&lt;/h2&gt;

&lt;p&gt;We're onto something. I'll now remove the indexes as well, which means no more updating an index after every tuple inserted. By the way, we're dropping constraints and indexes only temporarily. You can always recreate them after the data loading has finished successfully.  &lt;/p&gt;
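&lt;p&gt;In our schema, the only indexes are the ones backing the primary keys, so dropping those constraints (assuming the default names Postgres generates) removes the indexes as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;alter table meters drop constraint meters_pkey;
alter table readings drop constraint readings_pkey;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;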

&lt;p&gt;These are my tables now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; 
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="n"&gt;id&lt;/span&gt;           &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meter_id&lt;/span&gt;     &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reading&lt;/span&gt;      &lt;span class="nb"&gt;double&lt;/span&gt; &lt;span class="nb"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;date&lt;/span&gt;         &lt;span class="nb"&gt;date&lt;/span&gt;

&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've run the same import script as above, and now we're at &lt;code&gt;109.51s&lt;/code&gt;. Great stuff! Can we get it under 100s?&lt;/p&gt;

&lt;h2&gt;
  
  
  Unlogged tables
&lt;/h2&gt;

&lt;p&gt;Sure thing! But we'll have to make some more concessions. For the rest of the experiment I'll be using &lt;a href="https://www.crunchydata.com/blog/postgresl-unlogged-tables" rel="noopener noreferrer"&gt;unlogged&lt;/a&gt; tables. Again, not a setting to keep on in production beyond strictly the data loading procedure: writes to unlogged tables bypass the write-ahead log, so the database no longer guarantees their durability, and they are truncated after a crash.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;unlogged&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; 
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;unlogged&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="n"&gt;id&lt;/span&gt;           &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meter_id&lt;/span&gt;     &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reading&lt;/span&gt;      &lt;span class="nb"&gt;double&lt;/span&gt; &lt;span class="nb"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;date&lt;/span&gt;         &lt;span class="nb"&gt;date&lt;/span&gt;

&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm now at &lt;code&gt;40.65s&lt;/code&gt;. Believe it or not, we're not done yet here.&lt;/p&gt;
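&lt;p&gt;Once the data is in, the tables can be converted back to regular ones. Keep in mind that this writes the whole table to the WAL, so it takes a while on big tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;alter table meters set logged;
alter table readings set logged;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;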

&lt;h2&gt;
  
  
  Copy
&lt;/h2&gt;

&lt;p&gt;The copy command is &lt;a href="https://wiki.postgresql.org/wiki/COPY" rel="noopener noreferrer"&gt;the Postgres method for bulk data loading&lt;/a&gt;. It streams rows from a file in a single operation, avoiding much of the per-statement overhead. Let's give it a go.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;copy&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;path_to_file&amp;gt;/readings.csv'&lt;/span&gt; &lt;span class="k"&gt;delimiter&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
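&lt;p&gt;This assumes the data is already available as a CSV file. One way to produce such a file, for example from a database populated in a previous run, is the export counterpart of the same command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;\copy readings to '&amp;lt;path_to_file&amp;gt;/readings.csv' delimiter ',';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;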



&lt;p&gt;This finishes in &lt;code&gt;35.41s&lt;/code&gt;. Amazing! &lt;/p&gt;

&lt;p&gt;Here are all our results in one view:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o3b830k3fox3j8gkhr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o3b830k3fox3j8gkhr5.png" alt="chart" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's quite a difference from when we started out. As expected, using COPY led to the shortest time. But what was interesting to see is the difference dropping the constraints made, compared with the other changes. &lt;/p&gt;

&lt;p&gt;I want to add that I've experimented with &lt;a href="https://www.enterprisedb.com/blog/basics-tuning-checkpoints" rel="noopener noreferrer"&gt;checkpoint tuning&lt;/a&gt; as well. It didn't yield any notable improvements for this experiment. Nonetheless, you might want to keep it in mind as it can affect performance if misconfigured.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;Cover Photo by &lt;a href="https://unsplash.com/@flo_stk?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Florian Steciuk&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/time-lapse-photography-of-highway-F7Rl02ir0Gg?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>performance</category>
      <category>sql</category>
    </item>
    <item>
      <title>Hierarchical data with Postgres and Spring Data JPA</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Thu, 31 Oct 2024 16:08:50 +0000</pubDate>
      <link>https://dev.to/mcadariu/hierarchical-data-with-postgresql-and-spring-data-jpa-3b42</link>
      <guid>https://dev.to/mcadariu/hierarchical-data-with-postgresql-and-spring-data-jpa-3b42</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;He who plants a tree,&lt;br&gt;
     Plants a hope.&lt;br&gt;
           &lt;em&gt;Plant a tree&lt;/em&gt; by Lucy Larcom 🌳 &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;In this post I'm going to show you a couple of options for managing hierarchical data represented as a &lt;strong&gt;tree&lt;/strong&gt; structure. This applies when you need to implement things like: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file system paths &lt;/li&gt;
&lt;li&gt;org charts &lt;/li&gt;
&lt;li&gt;discussion forum comments&lt;/li&gt;
&lt;li&gt;a more contemporary topic: small2big retrieval for RAG applications &lt;a href="https://towardsdatascience.com/advanced-rag-01-small-to-big-retrieval-172181b396d4" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;&lt;a href="https://blog.lancedb.com/modified-rag-parent-document-bigger-chunk-retriever-62b3d1e79bc6/" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, if you know what a graph is already, a tree is basically a connected graph &lt;strong&gt;without any cycles&lt;/strong&gt;. Visually, it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tk1sk9i743nrz8s04m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tk1sk9i743nrz8s04m9.png" alt="tree" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are multiple alternatives for storing trees in relational databases. In the sections below, I'll show you three ways of doing it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;adjacency list&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;materialized paths&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;nested sets&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There will be two parts to this blog post. In this first one I show you how to load and store data using the above approaches - the basics. Having that out of the way, in the second part, the focus is more on their comparison and trade-offs, for example I want to look at what happens at increased data volumes and what are the appropriate indexing strategies. &lt;/p&gt;

&lt;p&gt;All the code you'll see in the sections below can be found &lt;a href="https://github.com/mcadariu/hierarchical-trees-demo" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The running use-case I picked will be &lt;strong&gt;employees&lt;/strong&gt; and their &lt;strong&gt;managers&lt;/strong&gt;, and the IDs for each will be exactly the ones you saw in the tree visualisation I showed above. &lt;/p&gt;

&lt;h2&gt;
  
  
  Local environment
&lt;/h2&gt;

&lt;p&gt;I'm using the recently released &lt;strong&gt;Postgres 17&lt;/strong&gt; with &lt;a href="https://testcontainers.com/" rel="noopener noreferrer"&gt;Testcontainers&lt;/a&gt;. This gives me a repeatable setup to work with on my laptop. For example, we can provide initialisation SQL scripts; I use this to automate the creation of a Postgres database with the necessary tables, populated with test data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@TestConfiguration&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxyBeanMethods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestcontainersConfiguration&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;POSTGRES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"postgres"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;
    &lt;span class="nd"&gt;@ServiceConnection&lt;/span&gt;
    &lt;span class="nc"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;postgresContainer&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;DockerImageName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"postgres:latest"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withUsername&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;POSTGRES&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withPassword&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;POSTGRES&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withDatabaseName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;POSTGRES&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withInitScript&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"init-script.sql"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's jump in and have a look at the first approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The adjacency list model
&lt;/h2&gt;

&lt;p&gt;This was the first solution for managing hierarchical data, so we can expect it to still be widely present in codebases - chances are, you'll encounter it at some point. The idea is that we store the manager's ID, or more generically, the "parent ID", in the same row as the child. &lt;/p&gt;

&lt;h3&gt;
  
  
  Schema
&lt;/h3&gt;

&lt;p&gt;Let's have a look at the table structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;           &lt;span class="n"&gt;bigserial&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;manager_id&lt;/span&gt;   &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;references&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;         &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I omitted them here, but in order to ensure data integrity, we should also write constraint checks that ensure at least the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there is a single parent for every node&lt;/li&gt;
&lt;li&gt;no cycles&lt;/li&gt;
&lt;/ul&gt;
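&lt;p&gt;As a minimal sketch (with a constraint name chosen here just for illustration), a check constraint can at least rule out the simplest cycle, an employee managing themselves; detecting longer cycles needs a trigger or a recursive query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;alter table employees
    add constraint chk_no_self_management check (manager_id is distinct from id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;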

&lt;h3&gt;
  
  
  Generating test data
&lt;/h3&gt;

&lt;p&gt;Especially for Part 2 of this post, we need a way to generate as much data as we want for populating the tables. Let's first do it step by step for clarity, and afterwards recursively.&lt;/p&gt;

&lt;h4&gt;
  
  
  Iteration 1 - step by step
&lt;/h4&gt;

&lt;p&gt;We start simple by explicitly inserting three levels of employees in the hierarchy. &lt;/p&gt;

&lt;p&gt;Now, you might know already about &lt;a href="https://www.postgresql.org/docs/current/queries-with.html" rel="noopener noreferrer"&gt;CTEs&lt;/a&gt; in Postgres - they are auxiliary queries executed within the context of a main query. Below, you can see how I construct each level on the basis of the level before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; 
    &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;select&lt;/span&gt; 
        &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="s1"&gt;'root'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
      &lt;span class="k"&gt;from&lt;/span&gt;  
        &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;
      &lt;span class="n"&gt;returning&lt;/span&gt; 
        &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;first_level&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; 
      &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; 
          &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="s1"&gt;'first_level'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;from&lt;/span&gt; 
          &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="n"&gt;root&lt;/span&gt;
        &lt;span class="n"&gt;returning&lt;/span&gt; 
          &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;second_level&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; 
      &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; 
          &lt;span class="n"&gt;first_level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="s1"&gt;'second_level'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;from&lt;/span&gt; 
          &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="n"&gt;first_level&lt;/span&gt;
        &lt;span class="n"&gt;returning&lt;/span&gt; 
          &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;second_level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="s1"&gt;'third_level'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;second_level&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cool. Let's now verify that it works as expected. We do a count to see how many elements have been inserted. You can compare it with the number of nodes in the tree visualisation I showed at the beginning of this post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="k"&gt;count&lt;/span&gt; 
&lt;span class="c1"&gt;-------&lt;/span&gt;
 &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks alright! A root plus three levels beneath it: 1 + 2 + 4 + 8 = 15 nodes in total.&lt;/p&gt;

&lt;p&gt;Time to move on to the recursive approach. This is needed for Part 2 of this post, where we want to generate a much larger volume of data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Iteration 2 - recursive
&lt;/h4&gt;

&lt;p&gt;Writing recursive queries follows a standard pattern, similar to recursion in regular programming. We define a &lt;strong&gt;base step&lt;/strong&gt; and a &lt;strong&gt;recursive step&lt;/strong&gt;, then "connect" them using &lt;code&gt;union all&lt;/code&gt;. At runtime, Postgres follows this recipe to generate all our results. Have a look.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;temporary&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="n"&gt;employees_id_seq&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;recursive&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; 
    &lt;span class="n"&gt;nextval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'employees_id_seq'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s1"&gt;'root'&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;

    &lt;span class="k"&gt;union&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt;

    &lt;span class="k"&gt;select&lt;/span&gt; 
      &lt;span class="n"&gt;nextval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'employees_id_seq'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="s1"&gt;'level'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'-'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;from&lt;/span&gt; 
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt; 
      &lt;span class="k"&gt;level&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;name&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;drop&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="n"&gt;employees_id_seq&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running it, let's do a count again to see if the same number of elements are inserted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="k"&gt;count&lt;/span&gt; 
&lt;span class="c1"&gt;-------&lt;/span&gt;
 &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cool! We're in business. We can now populate the schema with however many levels and elements we want, and thus completely control the inserted volume. No worries if recursive queries still look a bit hard to grasp - we'll revisit them shortly when writing the queries that retrieve the data.&lt;/p&gt;
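&lt;p&gt;If it helps, here's how the evaluation of a recursive query can be pictured in plain Java (my own sketch, not the post's code): Postgres keeps a working table that starts with the base step's rows, then repeatedly applies the recursive step to the previous iteration's rows until no new rows appear.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

public class RecursiveCteSketch {
    record Row(long id, Long parentId, int level) {}

    static List<Row> generate(int maxLevel, int childrenPerNode) {
        long nextId = 1;
        List<Row> result = new ArrayList<>();
        // Base step: one root row at level 1.
        List<Row> workingTable = new ArrayList<>(List.of(new Row(nextId++, null, 1)));
        result.addAll(workingTable);
        // Recursive step: expand the previous iteration's rows,
        // like "union all ... where level < maxLevel" in the SQL version.
        while (!workingTable.isEmpty()) {
            List<Row> next = new ArrayList<>();
            for (Row parent : workingTable) {
                if (parent.level() < maxLevel) {
                    for (int i = 0; i < childrenPerNode; i++) {
                        next.add(new Row(nextId++, parent.id(), parent.level() + 1));
                    }
                }
            }
            result.addAll(next);
            workingTable = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // 4 levels, 2 children per node: 1 + 2 + 4 + 8 = 15 rows, matching the SQL above
        System.out.println(generate(4, 2).size()); // prints 15
    }
}
```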

&lt;p&gt;For now, let's have a look at the Hibernate entity that maps our table to a Java class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Entity&lt;/span&gt;
&lt;span class="nd"&gt;@Table&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"employees"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@Getter&lt;/span&gt;
&lt;span class="nd"&gt;@Setter&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Employee&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Id&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@ManyToOne&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FetchType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LAZY&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nd"&gt;@JoinColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"manager_id"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Employee&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@OneToMany&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;mappedBy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"parent"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cascade&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CascadeType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ALL&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;orphanRemoval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing special - you probably saw this coming - just a one-to-many relationship between managers and their reports. Let's start querying! &lt;/p&gt;

&lt;h2&gt;
  
  
  Descendants (top-down)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;All subordinates of a manager&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For retrieving all &lt;code&gt;employees&lt;/code&gt; who are the subordinates of a specific manager, referenced by her ID, we'll write a recursive query again. You'll again see a &lt;strong&gt;base step&lt;/strong&gt; and a &lt;strong&gt;recursive step&lt;/strong&gt; linked to the base step. Postgres will then repeat this and retrieve all the rows needed to satisfy the query. Let's take the employee with ID = 2 as an example. Here is a visual representation that helped me understand how it works. I haven't included all the output rows you'd get, just the first few, to show the principle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuv77vqdrm9vrd38ivjyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuv77vqdrm9vrd38ivjyj.png" alt="tree" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the JPQL query for querying descendants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entityManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""
 with employeeRoot as (
  select
    employee.employees employee
  from
    Employee employee
  where
    employee.id = :employeeId

  union all

  select
    employee.employees employee
  from
    Employee employee
  join
    employeeRoot root ON employee = root.employee
  order by
    employee.id
  )
  select 
    new Employee(
     root.employee.id
   )
  from 
  employeeRoot root
 """&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setParameter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"employeeId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employeeId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResultList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To keep the queries cleaner - avoiding the fully qualified name of the record we write the results into - we can use the &lt;a href="https://github.com/vladmihalcea/hypersistence-utils" rel="noopener noreferrer"&gt;hypersistence-utils&lt;/a&gt; library and write a &lt;code&gt;ClassImportIntegratorProvider&lt;/code&gt;, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ClassImportIntegratorProvider&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;IntegratorProvider&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Integrator&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getIntegrators&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ClassImportIntegrator&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                                &lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
                        &lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
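&lt;p&gt;One detail worth noting: the provider also has to be registered so Hibernate picks it up. With Spring Boot this is typically done through a property along the following lines (the exact key is documented by hypersistence-utils, and the package name here is a placeholder):&lt;/p&gt;

```properties
spring.jpa.properties.hibernate.integrator_provider=com.example.ClassImportIntegratorProvider
```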



&lt;h3&gt;
  
  
  Important: reviewing the generated queries
&lt;/h3&gt;

&lt;p&gt;It works, but let's have a deeper look at what Hibernate generated. It's always good to understand what's happening under the hood - otherwise we might incur inefficiencies on every user request, and that adds up.&lt;/p&gt;

&lt;p&gt;For this, we'll start the Spring Boot app with the following setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@DynamicPropertySource&lt;/span&gt;
&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;registerPgProperties&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DynamicPropertyRegistry&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spring.jpa.show_sql"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alright, let's have a look. Here's the query for the descendants generated by Hibernate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;recursive&lt;/span&gt; &lt;span class="n"&gt;employeeRoot&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;e1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;eal1_0&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e1_0&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;eal1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;eal1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=?&lt;/span&gt;

&lt;span class="k"&gt;union&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;e2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;eal2_0&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; 
  &lt;span class="n"&gt;employeeRoot&lt;/span&gt; &lt;span class="n"&gt;root1_0&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;eal2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;root1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e2_0&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;eal2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; 
  &lt;span class="n"&gt;eal2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;root2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;employeeRoot&lt;/span&gt; &lt;span class="n"&gt;root2_0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hmm - looks like there are some extra steps in here! Let's see if we can simplify it a bit, keeping in mind the earlier picture of the base step and the recursive step linked to it. We shouldn't need more than that. See what you think of the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;recursive&lt;/span&gt; &lt;span class="n"&gt;employeeRoot&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;e1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e1_0&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; 
  &lt;span class="n"&gt;e1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;
&lt;span class="k"&gt;union&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;e2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e2_0&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; 
  &lt;span class="n"&gt;employeeRoot&lt;/span&gt; &lt;span class="n"&gt;root1_0&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;e2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;root1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; 
  &lt;span class="n"&gt;e2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
 &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;root2_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;employeeRoot&lt;/span&gt; &lt;span class="n"&gt;root2_0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Much better! We removed some unnecessary joins, which should make the query faster because there is less work to do. &lt;/p&gt;

&lt;h4&gt;
  
  
  Final result
&lt;/h4&gt;

&lt;p&gt;As a final step, let's clean up the query above and replace the aliases Hibernate generates with more human-readable names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;recursive&lt;/span&gt; &lt;span class="n"&gt;employee_root&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; 
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;
&lt;span class="k"&gt;union&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; 
  &lt;span class="n"&gt;employee_root&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;employee_root&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; 
  &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
 &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;employee_root&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; 
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alright, time to see how we go "up" the tree.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ancestors (bottom-up)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;All managers up the chain&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's first try to write down the conceptual steps for getting the managers of the employee with ID = 14.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fzt7mk52i9c4u6ufnue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fzt7mk52i9c4u6ufnue.png" alt="tree" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks very much like the one for the descendants you saw above; only the connection between the base step and the recursive step is inverted.&lt;/p&gt;
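&lt;p&gt;Outside the database, the same bottom-up traversal is just a loop that follows &lt;code&gt;manager_id&lt;/code&gt; pointers until it reaches the root. A minimal Python sketch of the conceptual steps (the tree shape here is made up for illustration, not the article's exact data set):&lt;/p&gt;

```python
# Adjacency list as a dict: employee id -> manager id (None for the root).
managers = {1: None, 2: 1, 6: 2, 14: 6}

def ancestors(employee_id):
    """Collect all managers up the chain, nearest manager first."""
    chain = []
    current = managers[employee_id]
    while current is not None:
        chain.append(current)
        current = managers[current]
    return chain

print(ancestors(14))  # [6, 2, 1]
```

The recursive CTE does exactly this, except the "loop" is the recursive step joining back on the working table.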

&lt;p&gt;We can write the JPQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entityManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""
   with employeeRoot as (
     select
       employee.id           as employeeId,
       employee.manager.id   as manager_id
     from
       Employee employee
     where
       employee.id = :employeeId

     union all

     select
       employee.id          as pid,
       employee.manager.id  as manager_id
     from
       Employee employee
     join
       employeeRoot root on employee.id = root.manager_id
     order by
       employee.id
      )
    select 
      new Employee(root.employeeId)
    from 
      employeeRoot root
   """&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
   &lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
 &lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setParameter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"employeeId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employeeId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResultList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! I looked at the generated SQL query but could not find any extra commands to shave off like before. Time to move on to approach 2.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Materialized paths
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ltree&lt;/code&gt; is a Postgres extension for working with hierarchical tree structures as materialized paths, recorded from the top of the tree down. For example, the chain of nodes 1, 2, 4, 8 starting from the root is stored as &lt;code&gt;1.2.4.8&lt;/code&gt;. It &lt;a href="https://www.postgresql.org/docs/current/ltree.html#LTREE-OPS-FUNCS" rel="noopener noreferrer"&gt;comes with&lt;/a&gt; several useful functions and operators. We can use it as a table column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;employees_ltree&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="n"&gt;bigserial&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;      &lt;span class="n"&gt;ltree&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To populate this table with test data, I migrated the generated data from the adjacency-list table you saw before, using the following SQL command. It's again a recursive query, which collects elements into an accumulator at every step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;recursive&lt;/span&gt; &lt;span class="n"&gt;leafnodes&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt;
        &lt;span class="n"&gt;array_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;leaves&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt;
        &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt;
        &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt;
            &lt;span class="n"&gt;manager_id&lt;/span&gt;
        &lt;span class="k"&gt;from&lt;/span&gt;
            &lt;span class="n"&gt;employees&lt;/span&gt;
        &lt;span class="k"&gt;where&lt;/span&gt;
            &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;chain&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt;
        &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[]::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;descendants&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt;
        &lt;span class="n"&gt;employees&lt;/span&gt;

    &lt;span class="k"&gt;union&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt;

    &lt;span class="k"&gt;select&lt;/span&gt;
        &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;descendants&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt;
        &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;chain&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt;
        &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;employees_ltree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
       &lt;span class="n"&gt;array_to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;descendants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'.'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;ltree&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;leafnodes&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt;
    &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt;
    &lt;span class="n"&gt;leaves&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;descendants&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
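&lt;p&gt;Conceptually, the migration does in SQL what a short script would do in memory: starting from each leaf, walk up to the root and write out the full path. A rough Python sketch of that idea, over a made-up mini tree:&lt;/p&gt;

```python
# Adjacency list: employee id -> manager id (None for the root).
managers = {1: None, 2: 1, 3: 1, 4: 2, 8: 4}
leaves = [8, 3]  # nodes that manage nobody

def path_to_root(node):
    """Build the root-to-leaf path, dot-separated, like an ltree value."""
    labels = []
    while node is not None:
        labels.append(str(node))
        node = managers[node]
    labels.reverse()
    return ".".join(labels)

print([path_to_root(leaf) for leaf in leaves])  # ['1.2.4.8', '1.3']
```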



&lt;p&gt;Here are the entries that the above command generated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; postgres=# select * from employees_ltree;
 id |   path   
----+----------
  1 | 1.2.4.8
  2 | 1.3.5.9
  3 | 1.2.6.10
  4 | 1.3.7.11
  5 | 1.2.4.12
  6 | 1.3.5.13
  7 | 1.2.6.14
  8 | 1.3.7.15
(8 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
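&lt;p&gt;With the paths materialized like this, the two classic tree queries become string operations: descendants share a common prefix, and ancestors are the prefixes of a path. A small Python sketch of the idea (this is the intuition, not how Postgres implements &lt;code&gt;ltree&lt;/code&gt; internally):&lt;/p&gt;

```python
# Materialized paths as dot-separated strings, like the ltree values above.
paths = ["1", "1.2", "1.2.4", "1.2.4.8", "1.3", "1.3.5"]

def descendants(root_path):
    """Everything under root_path: its path is a proper prefix."""
    prefix = root_path + "."
    return [p for p in paths if p.startswith(prefix)]

def ancestor_paths(path):
    """Every proper prefix of a path is an ancestor."""
    labels = path.split(".")
    return [".".join(labels[:i]) for i in range(1, len(labels))]

print(descendants("1.2"))       # ['1.2.4', '1.2.4.8']
print(ancestor_paths("1.2.4.8"))  # ['1', '1.2', '1.2.4']
```

Note that both directions are answered without recursion, which is the main selling point of this approach over adjacency lists.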



&lt;p&gt;We have our table ready, so we can proceed to write the Hibernate entity. To map columns of type &lt;code&gt;ltree&lt;/code&gt;, I implemented a &lt;a href="https://www.baeldung.com/hibernate-custom-types#2-implementingusertype" rel="noopener noreferrer"&gt;UserType&lt;/a&gt;; the &lt;code&gt;path&lt;/code&gt; field can then be mapped with &lt;code&gt;@Type(LTreeType.class)&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Entity&lt;/span&gt;
&lt;span class="nd"&gt;@Table&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"employees_ltree"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@Getter&lt;/span&gt;
&lt;span class="nd"&gt;@Setter&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EmployeeLtree&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Id&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Column&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"path"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columnDefinition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ltree"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nd"&gt;@Type&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LTreeType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're ready to write some queries. In native SQL, it would look like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
  &lt;span class="n"&gt;employees_ltree&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt;
  &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;'*.2.*'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
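&lt;p&gt;The &lt;code&gt;~&lt;/code&gt; operator matches a path against an lquery pattern, where &lt;code&gt;*&lt;/code&gt; stands for zero or more labels; &lt;code&gt;*.2.*&lt;/code&gt; therefore selects every path in which the label &lt;code&gt;2&lt;/code&gt; appears anywhere. A Python sketch of just this one pattern shape (general lquery syntax is richer than this):&lt;/p&gt;

```python
def matches_star_label_star(path, label):
    """Rough equivalent of the lquery '*.label.*': since '*' also
    matches zero labels, this holds whenever the label appears
    anywhere in the path, endpoints included."""
    return label in path.split(".")

paths = ["1.2.4.8", "1.3.5.9", "1.2.6.10"]
print([p for p in paths if matches_star_label_star(p, "2")])
# ['1.2.4.8', '1.2.6.10']
```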



&lt;p&gt;Now, we could always write a native query in Spring Data JPA and call it a day. But let's push the envelope a bit and write our queries in JPQL. Because we're using a native Postgres feature, it's not supported out of the box, so we'll have to implement a couple of things ourselves. We'll first write a custom &lt;code&gt;StandardSQLFunction&lt;/code&gt;, which lets us define a JPQL substitution for the native Postgres operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LtreePathContainsSQLFunction&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;StandardSQLFunction&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;BasicTypeReference&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;RETURN_TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BasicTypeReference&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"boolean"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SqlTypes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BOOLEAN&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;LtreePathContainsSQLFunction&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;super&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;RETURN_TYPE&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;render&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SqlAppender&lt;/span&gt; &lt;span class="n"&gt;appender&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;SqlAstNode&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ReturnableType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;returnType&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SqlAstTranslator&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;walker&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirst&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;walker&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;appender&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;append&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"~"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;appender&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;append&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"("&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;walker&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;appender&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;append&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;")::lquery"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then have to register it as a &lt;code&gt;FunctionContributor&lt;/code&gt;, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomFunctionsContributor&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;FunctionContributor&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;contributeFunctions&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;FunctionContributions&lt;/span&gt; &lt;span class="n"&gt;functionContributions&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;functionName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ltree_contains"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;functionContributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFunctionRegistry&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;register&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functionName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LtreePathContainsSQLFunction&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functionName&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last step is to create a resource file in the &lt;code&gt;META-INF/services&lt;/code&gt; folder called &lt;code&gt;org.hibernate.boot.model.FunctionContributor&lt;/code&gt; where we will add a single line with the fully qualified name of the class above.&lt;/p&gt;

&lt;p&gt;Okay, cool! We can now write our JPQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Repository&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;EmployeeLtreeRepository&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;JpaRepository&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EmployeeLtree&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Query&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""
        select
          employee
        from
          EmployeeLtree employee
        where
          ltree_contains(path, :path)
        """&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EmployeeLtree&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;findAllByPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@Param&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"path"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This now allows us to pass an lquery pattern as the argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;employeeLtreeRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findAllByPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"*.2.*"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Postgres offers a wide set of functions for working with ltrees; you can find them on the official &lt;a href="https://www.postgresql.org/docs/current/ltree.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt; page. There's also a useful &lt;a href="https://gist.github.com/oscarychen/2be59f931c05fb19ed1f414c4f485b79" rel="noopener noreferrer"&gt;cheatsheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As with adjacency lists, it's important to add constraints to our schema to ensure data consistency; here's a good &lt;a href="https://medium.com/@msmer/ensuring-data-consistency-when-using-postgresql-ltree-extension-c05cfc4afada" rel="noopener noreferrer"&gt;resource&lt;/a&gt; I found on this topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Nested sets
&lt;/h2&gt;

&lt;p&gt;This approach is easiest to understand with an image showing the intuition behind it. Besides its ID, every node of the tree stores an extra "left" and "right" value. The rule is that all children have their left and right values strictly between their parent's left and right values. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03ql1yi5q40mbgjlx58s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03ql1yi5q40mbgjlx58s.png" alt="tree" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;
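&lt;p&gt;In code, the containment rule translates directly into interval checks. A minimal Python sketch with illustrative left/right numbering (not the article's exact tree):&lt;/p&gt;

```python
# Nested sets: each node stores (lft, rgt); a node is a descendant of
# another exactly when its interval sits strictly inside the other's.
nodes = {"A": (1, 8), "B": (2, 5), "C": (3, 4), "D": (6, 7)}

def is_descendant(child, parent):
    child_lft, child_rgt = nodes[child]
    parent_lft, parent_rgt = nodes[parent]
    # strict containment, expressed via range membership
    return (child_lft in range(parent_lft + 1, parent_rgt)
            and child_rgt in range(parent_lft + 1, parent_rgt))

print(is_descendant("C", "A"))  # True
print(is_descendant("D", "B"))  # False
```

A nice side effect of the numbering: a node with values (lft, rgt) has exactly (rgt - lft - 1) / 2 descendants, without touching any other row.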

&lt;p&gt;Here's the table structure to represent the tree above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;employees_nested_sets&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;         &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lft&lt;/span&gt;        &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rgt&lt;/span&gt;        &lt;span class="nb"&gt;integer&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to populate the table, I have converted the script from Joe Celko's "&lt;a href="https://www.amazon.co.uk/Joe-Celkos-SQL-Smarties-Programming/dp/0128007613" rel="noopener noreferrer"&gt;SQL for smarties&lt;/a&gt;" book into Postgres syntax. It migrates the data from the table used in the adjacency list section to this new table structure. Here it is in all its glory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;employees_copy&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;migrate_to_nested_sets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="n"&gt;void&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;declare&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;max_counter&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;current_top&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;max_counter&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees_copy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;current_top&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;employees_nested_sets&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_counter&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees_copy&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;parent_id&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees_copy&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;parent_id&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;while&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_counter&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
            &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees_nested_sets&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employees_copy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_top&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;then&lt;/span&gt;
                &lt;span class="k"&gt;begin&lt;/span&gt;
                    &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;employees_nested_sets&lt;/span&gt;
                    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_top&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees_nested_sets&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employees_copy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;
                    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt;
                      &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_top&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

                    &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees_copy&lt;/span&gt;
                    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees_nested_sets&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;stack_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_top&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;current_top&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_top&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

                &lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;
                &lt;span class="k"&gt;begin&lt;/span&gt;
                    &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;employees_nested_sets&lt;/span&gt;
                    &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;rgt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;stack_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;stack_top&lt;/span&gt;
                    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;stack_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_top&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

                    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;current_top&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_top&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;language&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alright, I'm ready to do some queries. Here's how to retrieve the ancestors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt; &lt;span class="nd"&gt;@Query&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""
   select new Employee(
     manager.id
   )
   from
     EmployeeNestedSets employee,
     EmployeeNestedSets manager
    where
     employee.lft between manager.lft and manager.rgt and
     employee.id = :id
   """&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getAncestorsOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the descendants, it looks a bit different: we first have to retrieve the node's left and right values, after which we can use the query below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Query&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""" 
   select new Employee(
     employee.id
   )
   from
     EmployeeNestedSets employee
   where
     employee.lft &amp;gt; :lft and
     employee.rgt &amp;lt; :rgt
    """&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getDescendantsUsing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;lft&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;rgt&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
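&lt;p&gt;For the first step, retrieving the left and right values of the starting node, a small query along these lines can be used (a sketch, reusing the table and columns from the examples above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;select lft, rgt
from employees_nested_sets
where id = :id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;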



&lt;p&gt;And that's it! You've seen how to go up and down the tree with all three approaches. I hope you enjoyed the journey and find it useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Postgres vs. document/graph databases
&lt;/h2&gt;

&lt;p&gt;The database we've used for the examples above is &lt;strong&gt;Postgres&lt;/strong&gt;. It is not the only option; you might wonder why not choose a document database like MongoDB or a graph database like Neo4j, since they were built with exactly this type of workload in mind. &lt;/p&gt;

&lt;p&gt;Chances are, you already have your source-of-truth data in Postgres, in a relational model with transactional guarantees. In that case, it's worth first checking how well Postgres itself handles your auxiliary use-cases, in order to keep everything in one place. This way you avoid the extra cost and operational complexity of spinning up and maintaining/upgrading a separate specialised data store, as well as having to get familiar with it. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There are several interesting options for modelling hierarchical data in your database applications. In this post I've shown you three ways to do it. Stay tuned for Part 2, where we will compare them as well as see what happens with larger volumes of data. &lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Before writing this post I looked at various existing ones on the topic, and I am grateful to the authors for taking the time to write them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/yugabyte/learn-how-to-write-sql-recursive-cte-in-5-steps-3n88"&gt;https://dev.to/yugabyte/learn-how-to-write-sql-recursive-cte-in-5-steps-3n88&lt;/a&gt;&lt;br&gt;
&lt;a href="https://vladmihalcea.com/hibernate-with-recursive-query/" rel="noopener noreferrer"&gt;https://vladmihalcea.com/hibernate-with-recursive-query/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://vladmihalcea.com/dto-projection-jpa-query/" rel="noopener noreferrer"&gt;https://vladmihalcea.com/dto-projection-jpa-query/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://tudborg.com/posts/2022-02-04-postgres-hierarchical-data-with-ltree/" rel="noopener noreferrer"&gt;https://tudborg.com/posts/2022-02-04-postgres-hierarchical-data-with-ltree/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aregall.tech/hibernate-6-custom-functions#heading-implementing-a-custom-function" rel="noopener noreferrer"&gt;https://aregall.tech/hibernate-6-custom-functions#heading-implementing-a-custom-function&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.amazon.co.uk/Joe-Celkos-SQL-Smarties-Programming/dp/0128007613" rel="noopener noreferrer"&gt;https://www.amazon.co.uk/Joe-Celkos-SQL-Smarties-Programming/dp/0128007613&lt;/a&gt; &lt;br&gt;
&lt;a href="https://madecurious.com/curiosities/trees-in-postgresql/" rel="noopener noreferrer"&gt;https://madecurious.com/curiosities/trees-in-postgresql/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://schinckel.net/2014/11/27/postgres-tree-shootout-part-2%3A-adjacency-list-using-ctes/" rel="noopener noreferrer"&gt;https://schinckel.net/2014/11/27/postgres-tree-shootout-part-2%3A-adjacency-list-using-ctes/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>java</category>
      <category>spring</category>
    </item>
    <item>
      <title>Faster table joins</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Mon, 02 Sep 2024 19:35:14 +0000</pubDate>
      <link>https://dev.to/mcadariu/fixing-a-slow-join-in-postgres-2b5</link>
      <guid>https://dev.to/mcadariu/fixing-a-slow-join-in-postgres-2b5</guid>
      <description>&lt;p&gt;We can keep our database-backed applications performing pretty well already by following a couple of simple rules, for example:   &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no &lt;a href="https://planetscale.com/blog/what-is-n-1-query-problem-and-how-to-solve-it" rel="noopener noreferrer"&gt;N+1 queries&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;adding adequate &lt;a href="https://use-the-index-luke.com" rel="noopener noreferrer"&gt;indexes&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;keeping the output "narrow" (retrieving only the required columns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scenario I'm about to show you is a bit different. It was already down to just one query, and it had the adequate indexes, but it was still taking &lt;code&gt;~30 seconds&lt;/code&gt; to run, so it had to be improved. Everything was in place for it to be fast, but &lt;em&gt;somehow&lt;/em&gt;, it wasn't! I'll show you what I did to make it run in &lt;code&gt;&amp;lt;1s&lt;/code&gt; and explain the reasoning behind every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tables
&lt;/h2&gt;

&lt;p&gt;What I had is a many-to-many relationship, involving two entity classes, let's call them &lt;code&gt;Foo&lt;/code&gt; and &lt;code&gt;Bar&lt;/code&gt;. They were mapped with the &lt;code&gt;@ManyToMany&lt;/code&gt; JPA annotation in the Java code. At the database level, besides the corresponding tables for storing the entities, there was a link table called &lt;code&gt;foo_bar&lt;/code&gt; containing only two columns (foreign keys). &lt;/p&gt;

&lt;p&gt;The tables were all pretty large, especially the link table, which totalled more than a hundred million rows. I should note that the link table did have standard B-tree indexes on both columns.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table&lt;/th&gt;
&lt;th&gt;Nr. of rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;foo&lt;/td&gt;
&lt;td&gt;2819724&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bar&lt;/td&gt;
&lt;td&gt;21109691&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;foo_bar&lt;/td&gt;
&lt;td&gt;126167975&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The query
&lt;/h2&gt;

&lt;p&gt;The query was the following, written in &lt;code&gt;JPQL&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;delete&lt;/span&gt; 
 &lt;span class="k"&gt;from&lt;/span&gt; 
 &lt;span class="n"&gt;Foo&lt;/span&gt; &lt;span class="n"&gt;foo&lt;/span&gt; 
&lt;span class="k"&gt;where&lt;/span&gt; 
 &lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columnA&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; 
 &lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columnB&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on the above, Hibernate generated the following SQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;delete&lt;/span&gt;
 &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;foo_bar&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; 
 &lt;span class="n"&gt;foo_bar&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
              &lt;span class="n"&gt;f1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;                   
             &lt;span class="k"&gt;from&lt;/span&gt; 
              &lt;span class="n"&gt;foo&lt;/span&gt; &lt;span class="n"&gt;f1_0&lt;/span&gt; 
             &lt;span class="k"&gt;where&lt;/span&gt; 
              &lt;span class="n"&gt;f1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columnA&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; 
              &lt;span class="n"&gt;f1_0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columnB&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
             &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, it's actually a &lt;code&gt;delete&lt;/code&gt;; however, as you will see below, the bottleneck was a join operation in the execution plan.&lt;/p&gt;

&lt;p&gt;In hindsight, the query could have been written as a native query, and using the &lt;code&gt;CASCADE&lt;/code&gt; feature would have allowed solving it more declaratively. However, for the rest of this post I'll continue with the query that Hibernate generated, as I still think it's a good vehicle for showing you some of the instruments you have when a query is slow. &lt;/p&gt;
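&lt;p&gt;For reference, the &lt;code&gt;CASCADE&lt;/code&gt; approach would look roughly like the sketch below: the link table's foreign key is declared with &lt;code&gt;on delete cascade&lt;/code&gt;, so deleting the matching &lt;code&gt;foo&lt;/code&gt; rows removes the corresponding &lt;code&gt;foo_bar&lt;/code&gt; rows automatically (the constraint name here is made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;alter table foo_bar
  add constraint foo_bar_foo_id_fkey
  foreign key (foo_id) references foo (id)
  on delete cascade;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;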

&lt;p&gt;I've extracted the set of parameters from one representative invocation in order to reproduce the problem. I've then wrapped the query in a transaction block, so that I can roll back and no actual deletes happen.&lt;/p&gt;
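&lt;p&gt;The wrapper looks like this; everything between &lt;code&gt;begin&lt;/code&gt; and &lt;code&gt;rollback&lt;/code&gt; is undone at the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;begin;

delete from foo_bar where ...;

rollback;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;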

&lt;h2&gt;
  
  
  The explain plan
&lt;/h2&gt;

&lt;p&gt;To understand what exactly a query is doing to retrieve our data, we consult so-called &lt;a href="https://www.postgresql.org/docs/current/using-explain.html" rel="noopener noreferrer"&gt;explain plans&lt;/a&gt;. Let's have a look.&lt;/p&gt;
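&lt;p&gt;The plans in this post were obtained with &lt;code&gt;explain analyse&lt;/code&gt;, which actually executes the query and reports what happened at each step; run inside a transaction block, the delete can then be rolled back afterwards. The invocation looks along these lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;explain (analyse, buffers)
delete from foo_bar where ...;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;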

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsibimrm2885cjdfi3nl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsibimrm2885cjdfi3nl.png" alt="s3" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whilst it's clear what's slowing it down (the sequential scan reading the entire contents of the large link table - the thick line on the left side of the image above), it's unexpected, at least for me. Given the presence of the indexes, and the fact that the sub-select by itself only returns about 800 rows, I had expected that we could avoid reading the entire large table: at this data volume, a full scan will always be slow. &lt;/p&gt;

&lt;p&gt;No problem, let's see what our options are. &lt;/p&gt;

&lt;h2&gt;
  
  
  Looking for inspiration
&lt;/h2&gt;

&lt;p&gt;Let's first get some inspiration by disabling some of the operations the database can use in planning; for example, let's prevent it from going for hash joins. It will then have to use one of the alternatives (the other two join strategies are nested loop and merge join).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;enable_hashjoin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Oh - take a look at that! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvv0tz3b43xi2yfbspr4b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvv0tz3b43xi2yfbspr4b.png" alt="s3" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No more sequential scan of a hundred million rows. By disabling the hash join option, Postgres went for a much more efficient nested loop that fully utilises the indexes we've defined on the link table. &lt;/p&gt;

&lt;p&gt;So far so good. We now know what we're after; the next question is how to get there, because disabling hash joins like this is meant only for experimentation.&lt;/p&gt;
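&lt;p&gt;When we're done experimenting, we can restore the default behaviour for our session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;reset enable_hashjoin;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;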

&lt;h2&gt;
  
  
  Cost-based planning
&lt;/h2&gt;

&lt;p&gt;The question to ask is - what's &lt;em&gt;preventing&lt;/em&gt; Postgres from employing the more efficient nested loop alternative? Let's take a step back and reflect on how Postgres decides how to retrieve our data from disk. &lt;/p&gt;

&lt;p&gt;Postgres uses a cost-based optimiser that computes the cost for the various alternatives, and then selects the best one. The costs are calculated based on various factors, including table statistics of the data and configuration properties. &lt;/p&gt;

&lt;p&gt;From the Postgres codebase we learn that it collects a sample of 300 multiplied by the so-called &lt;em&gt;default statistics target&lt;/em&gt;, a configuration option we can control. If you're wondering, like me, why 300: it's &lt;em&gt;not&lt;/em&gt; due to the 300 Spartans confronting the Persians at Thermopylae. For the real reason, have a look &lt;a href="https://github.com/postgres/postgres/blob/master/src/backend/commands/analyze.c#L1912" rel="noopener noreferrer"&gt;here&lt;/a&gt;, where, as usual, the code has a helpful comment indicating even the research paper on which the choice is based.&lt;/p&gt;
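&lt;p&gt;With the default statistics target of 100, that works out as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;show default_statistics_target;  -- 100 by default

-- sample size = 300 * 100 = 30000 rows per analyse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;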

&lt;p&gt;For our use-case, the intuition is that due to the size of the table (hundreds of millions of rows), the sample might be too small to be an accurate representation of the data, which can lead to less efficient planning. But let's try to confirm that with some numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Checking the statistics
&lt;/h2&gt;

&lt;p&gt;One of the statistics Postgres collects is &lt;code&gt;n_distinct&lt;/code&gt;, an estimate of the number of distinct values in a column. Let's see it for the &lt;code&gt;foo_id&lt;/code&gt; column of table &lt;code&gt;foo_bar&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
 &lt;span class="n"&gt;n_distinct&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; 
 &lt;span class="n"&gt;pg_stats&lt;/span&gt; 
&lt;span class="k"&gt;where&lt;/span&gt; 
 &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'foo_bar'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; 
 &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'foo_id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 

&lt;span class="mi"&gt;28001&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now compute the actual number of distinct values and compare with the estimate. For this, I'm using the following SQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
 &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;foo_bar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;foo_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; 
 &lt;span class="n"&gt;foo_bar&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="mi"&gt;910322&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There we go! The entry in &lt;code&gt;pg_stats&lt;/code&gt; is &lt;code&gt;~30&lt;/code&gt; times smaller than the actual number of distinct values. &lt;/p&gt;

&lt;p&gt;This is a problem, because it impacts the calculation of the &lt;code&gt;selectivity&lt;/code&gt; value used by the planner. To calculate it, Postgres uses the frequencies of the most common values (MCV) in a table. Note that how many of these we collect is also bounded by the default statistics target value we talked about earlier. If the values we're looking for are not in this list, Postgres falls back to the following &lt;a href="https://www.postgresql.org/docs/current/row-estimation-examples.html" rel="noopener noreferrer"&gt;formula&lt;/a&gt; for selectivity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if the values are not in the MCV list (because we collected too few), and the result of this formula is off because of the wrong estimate of distinct values, then indeed the planner will not choose the best plan: it may pick a hash join where a nested loop would be cheaper, or the other way around.&lt;/p&gt;
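&lt;p&gt;To get an intuition for the impact, we can plug our numbers into the fallback formula, ignoring the MCV terms for simplicity (so selectivity is roughly 1 divided by the number of distinct values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estimated rows per foo_id value = total rows * selectivity

with n_distinct =  28001 (estimate): 126167975 / 28001  = about 4506 rows
with n_distinct = 910322 (actual):   126167975 / 910322 = about 139 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the bad estimate, the planner expects roughly 30 times more matching rows per value than there really are, which makes the hash join with its sequential scan look like the cheaper option.&lt;/p&gt;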

&lt;h2&gt;
  
  
  Increasing the amount of statistics
&lt;/h2&gt;

&lt;p&gt;Let's try to improve the situation by allowing Postgres to use a larger sample size in order to get better statistics. This means it will store more values in the MCV list, as well as look at more rows when determining &lt;code&gt;n_distinct&lt;/code&gt;. We have to keep in mind, however, that it will then take longer to create plans ("Planning time" in the explain analyse output).&lt;/p&gt;

&lt;p&gt;Let's increase the statistics target from its default of 100 to a value, say, 10 times bigger, but only for one column, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;foo_bar&lt;/span&gt; &lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="n"&gt;foo_id&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;statistics&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we query for &lt;code&gt;n_distinct&lt;/code&gt; now, we'll get the same value as before; it will only update after running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;analyse&lt;/span&gt; &lt;span class="n"&gt;foo_bar&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will take a while. Remember, it has more work to do.&lt;/p&gt;

&lt;p&gt;After a couple of minutes, it finished. Let's check whether the estimate is closer to reality now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
 &lt;span class="n"&gt;n_distinct&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; 
 &lt;span class="n"&gt;pg_stats&lt;/span&gt; 
&lt;span class="k"&gt;where&lt;/span&gt; 
 &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'foo_bar'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; 
 &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'foo_id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="mi"&gt;121894&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Progress! Still about 7.5 times smaller than the actual number, but let's run explain analyse now and see if the nested loop gets chosen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bingo
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4p7xt6alkbidq9lvgel5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4p7xt6alkbidq9lvgel5.png" alt="s3" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This executes in under a second, which is a big improvement over where we started. Nice!&lt;/p&gt;

&lt;p&gt;Altering the amount of statistics collected was enough to significantly improve this example. However, in case you need even more control, you can manually set &lt;code&gt;n_distinct&lt;/code&gt; to whatever value you want with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tool_analysis_finding&lt;/span&gt; &lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="n"&gt;tool_analysis_id&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_distinct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, I would advise against going directly for this approach, because it means we no longer benefit from the automatic statistics collection that the database does for us in the background, in an unattended fashion.&lt;/p&gt;
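&lt;p&gt;If you do set it manually and later change your mind, the override can be removed again, after which the next analyse re-estimates the value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;alter table foo_bar alter column foo_id reset (n_distinct);
analyse foo_bar;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;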

&lt;h2&gt;
  
  
  The &lt;code&gt;random_page_cost&lt;/code&gt; setting
&lt;/h2&gt;

&lt;p&gt;Let's look at something else. From the Postgres code we can look up other variables that come into the picture when choosing an index scan over a sequential scan. For example, there's &lt;code&gt;random_page_cost&lt;/code&gt;, which can be found &lt;a href="https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/costsize.c#L741" rel="noopener noreferrer"&gt;here&lt;/a&gt; in the file &lt;code&gt;costsize.c&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;For a description of this setting, have a look &lt;a href="https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-RANDOM-PAGE-COST" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Basically, it's the planner's estimate of the cost of retrieving a page non-sequentially from disk. With modern hardware like SSDs, there isn't such a big difference between sequential and random retrieval, so the default of &lt;code&gt;4&lt;/code&gt; is often too high. For example, Crunchy Bridge has &lt;a href="https://docs.crunchybridge.com/changelog#postgres_random_page_cost_1_1" rel="noopener noreferrer"&gt;changed&lt;/a&gt; this value to &lt;code&gt;1.1&lt;/code&gt; for all new databases on their platform. &lt;/p&gt;

&lt;p&gt;Let's try adjusting this to &lt;code&gt;1.1&lt;/code&gt; and see what happens.&lt;/p&gt;
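&lt;p&gt;For the experiment, a session-level change is enough; to make the setting permanent, &lt;code&gt;alter system&lt;/code&gt; can be used instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;set random_page_cost = 1.1;  -- current session only

-- or, to persist it across sessions:
-- alter system set random_page_cost = 1.1;
-- select pg_reload_conf();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;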

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2u7umehfdg5fcng6vki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2u7umehfdg5fcng6vki.png" alt="s3" width="800" height="511"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;It worked! We got the nested loop and sub-second execution time once more - great stuff. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Postgres features a very clever query planner that does an excellent job of finding an efficient way to retrieve our data in most cases. In uncommon situations like very large tables, however, it needs some guidance from us: by giving it more details about the context, or letting it use more storage or time for its internal operations, we help it keep delivering results to our queries as fast as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bohanzhang.me/assets/blogs/run_analyze/run_analyze.html" rel="noopener noreferrer"&gt;Run ANALYZE. Run ANALYZE. Run ANALYZE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.shayon.dev/post/2024/55/100x-faster-query-in-aurora-postgres-with-a-lower-random_page_cost/" rel="noopener noreferrer"&gt;100x Faster Query in Aurora Postgres with a lower random_page_cost&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>performance</category>
      <category>sql</category>
      <category>database</category>
    </item>
    <item>
      <title>Spring AI, Llama 3 and pgvector: bRAGging rights!</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Sat, 15 Jun 2024 14:35:35 +0000</pubDate>
      <link>https://dev.to/mcadariu/springai-llama3-and-pgvector-bragging-rights-2n8o</link>
      <guid>https://dev.to/mcadariu/springai-llama3-and-pgvector-bragging-rights-2n8o</guid>
      <description>&lt;p&gt;In the very beginning at least, Python reigned supreme in terms of tooling for AI development. However, recently came the answer from the Spring community, and it's called &lt;a href="https://spring.io/projects/spring-ai" rel="noopener noreferrer"&gt;Spring AI&lt;/a&gt;! This means that if you're a Spring developer with working knowledge of concepts such as beans, auto-configurations and starters, you're covered, and you can write your AI apps following the standard patterns you're already familiar with. &lt;/p&gt;

&lt;p&gt;In this post, I want to share with you an exploration that started with the goal to take Spring AI for a little spin and try out the capabilities of open-source LLMs (large language models). I got curious along the way, and ended up also looking at some pretty low-level details about data storage for AI applications as well.&lt;/p&gt;

&lt;p&gt;To support the exploration, I've developed a simple &lt;a href="https://www.promptingguide.ai/techniques/rag" rel="noopener noreferrer"&gt;Retrieval Augmented Generation (RAG)&lt;/a&gt; workflow that works completely locally on the laptop for free. If you're interested, you can find the code itself &lt;a href="https://github.com/mcadariu/spring-ai-rag-demo/blob/main/src/test/java/com/example/braggingrights/demo/RAGDemoApplicationTests.java" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Basically, I've used &lt;a href="https://testcontainers.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Testcontainers&lt;/code&gt;&lt;/a&gt; to create a Postgres database container with the &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector&lt;/a&gt; extension to store text &lt;a href="https://vickiboykis.com/what_are_embeddings/" rel="noopener noreferrer"&gt;embeddings&lt;/a&gt;, and an open-source LLM to which I send requests: Meta's &lt;a href="https://ai.meta.com/blog/meta-llama-3/" rel="noopener noreferrer"&gt;&lt;code&gt;llama3&lt;/code&gt;&lt;/a&gt; (8B) through &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;&lt;code&gt;ollama&lt;/code&gt;&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Alright, let's have a look at what I'm building with this setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fun with sayings and essays
&lt;/h2&gt;

&lt;p&gt;The RAG workflow works like this. I first ask it to generate a couple of &lt;strong&gt;sayings&lt;/strong&gt;; for example, here is one of the ones I got back: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A bad excuse is better than none.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then, I ask it to "elaborate" and generate an &lt;strong&gt;essay&lt;/strong&gt; (of under 200 words) about the saying it provided me earlier. Here is one such essay generated by &lt;code&gt;llama3&lt;/code&gt; for the saying above:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The saying suggests that while an inadequate justification may not be satisfactory, it's still preferable to having no explanation at all. This phrase highlights the importance of accountability and acknowledging responsibility for one's actions or decisions.&lt;/p&gt;

&lt;p&gt;In essence, a bad excuse implies that some effort has been made to justify or rationalize one's behavior, even if the reasoning is flawed or unconvincing. On the other hand, not offering any excuse whatsoever can come across as evasive or dishonest, leading to mistrust and resentment.&lt;/p&gt;

&lt;p&gt;This saying encourages people to take ownership of their mistakes and apologize for any wrongdoing, rather than remaining silent and avoiding accountability. It also acknowledges that everyone makes errors and that a bad excuse is better than none because it shows a willingness to acknowledge and learn from those mistakes.&lt;/p&gt;

&lt;p&gt;Ultimately, the saying promotes honesty, responsibility, and personal growth by emphasizing the value of taking ownership of one's actions, no matter how imperfect the explanation may be.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then, I will take these essays and create &lt;a href="https://vickiboykis.com/what_are_embeddings/" rel="noopener noreferrer"&gt;embeddings&lt;/a&gt; from them, which I will store in Postgres using the &lt;code&gt;pgvector&lt;/code&gt; extension, in columns of the &lt;code&gt;vector&lt;/code&gt; data type. All of this happens through Spring AI abstractions, with a minimal amount of custom code. &lt;/p&gt;
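&lt;p&gt;As a rough sketch (not the full demo code: the &lt;code&gt;essays&lt;/code&gt; list and the injected &lt;code&gt;vectorStore&lt;/code&gt; bean are assumed to be in scope, and the API follows the Spring AI milestone releases), storing the essays looks something like this:&lt;/p&gt;

```java
// Wrap each generated essay in a Spring AI Document and hand the batch
// to the VectorStore; it calls the configured embedding model
// (nomic-embed-text here) and inserts the vectors into vector_store.
var documents = essays.stream()
        .map(essay -> new Document(essay))
        .toList();
vectorStore.add(documents);
```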

&lt;p&gt;I will skip the part of this process called "chunking". When you are dealing with very large documents, or want to isolate sections in your data (like in e-mails, where you have a subject, a sender, etc.), you might look into doing that.&lt;/p&gt;

&lt;p&gt;So far so good. At this point, we have stored the data we need in the next steps.&lt;/p&gt;

&lt;p&gt;I will then take each saying and do a &lt;strong&gt;similarity search&lt;/strong&gt; on the embeddings to retrieve the corresponding essay for each saying. Lastly, I will supply the retrieved essays back to the LLM, and now ask it to &lt;strong&gt;guess the original saying&lt;/strong&gt; from which each essay was generated. Finally, I will check how many it got right.&lt;/p&gt;
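&lt;p&gt;Sketched in code (again assuming the Spring AI milestone APIs; &lt;code&gt;vectorStore&lt;/code&gt; and &lt;code&gt;chatClient&lt;/code&gt; are injected beans, and the prompt wording is illustrative), the retrieval and guessing step looks roughly like this:&lt;/p&gt;

```java
// Retrieve the single most similar essay for a saying, then ask the
// LLM to reconstruct the saying from that essay alone.
var results = vectorStore.similaritySearch(
        SearchRequest.query(saying).withTopK(1));
var essay = results.get(0).getContent();
var guess = chatClient.call(
        "Guess the original saying this essay elaborates on: " + essay);
```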

&lt;p&gt;What do you think, will it manage to correctly guess the saying from just the essay? After all, it has generated the essays from those sayings itself in the first place. A human would have no problem doing this.&lt;/p&gt;

&lt;p&gt;But let's first have a look at how the program is set up from a technical perspective. We will look at the results and find out how capable the LLM is a bit later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM and the vector store in Testcontainers
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Testcontainers&lt;/code&gt; makes it very easy to integrate services that each play a specific role in use-cases like this. All that is required to set up the database and the LLM is the couple of lines below, and you're good to go!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@TestConfiguration&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxyBeanMethods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RagDemoApplicationConfiguration&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;POSTGRES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"postgres"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;
    &lt;span class="nd"&gt;@ServiceConnection&lt;/span&gt;
    &lt;span class="nc"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;postgreSQLContainer&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"pgvector/pgvector:pg16"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withUsername&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;POSTGRES&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withPassword&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;POSTGRES&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withDatabaseName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;POSTGRES&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withInitScript&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"init-script.sql"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;
    &lt;span class="nd"&gt;@ServiceConnection&lt;/span&gt;
    &lt;span class="nc"&gt;OllamaContainer&lt;/span&gt; &lt;span class="nf"&gt;ollamaContainer&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;OllamaContainer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ollama/ollama:latest"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've used the &lt;code&gt;@ServiceConnection&lt;/code&gt; annotation that allows me to write &lt;a href="https://spring.io/blog/2023/06/23/improved-testcontainers-support-in-spring-boot-3-1" rel="noopener noreferrer"&gt;less configuration code&lt;/a&gt;. I can do this for the &lt;code&gt;ollama&lt;/code&gt; container too only since recently, thanks to this &lt;a href="https://github.com/spring-projects/spring-ai/pull/453" rel="noopener noreferrer"&gt;contribution&lt;/a&gt; from Eddú Meléndez. &lt;/p&gt;

&lt;p&gt;You might have noted there's an init script there. It's only a single line, and its purpose is to install a Postgres extension called &lt;a href="https://www.postgresql.org/docs/current/pgbuffercache.html" rel="noopener noreferrer"&gt;pg_buffercache&lt;/a&gt;, which lets me inspect the contents of the Postgres &lt;a href="https://minervadb.xyz/understanding-shared-buffers-implementation-in-postgresql/" rel="noopener noreferrer"&gt;shared buffers&lt;/a&gt; in RAM. I'm interested in having a look at this in order to better understand the operational characteristics of working with vectors. In other words, what are the memory demands?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;extension&lt;/span&gt; &lt;span class="n"&gt;pg_buffercache&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
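&lt;p&gt;Once the extension is installed, a query along these lines (adapted from the Postgres documentation for &lt;code&gt;pg_buffercache&lt;/code&gt;) shows which relations currently occupy the most shared buffers:&lt;/p&gt;

```sql
-- Top 10 relations by number of pages currently cached in shared buffers
select c.relname, count(*) as buffers
from pg_buffercache b
join pg_class c
  on b.relfilenode = pg_relation_filenode(c.oid)
 and b.reldatabase in (0, (select oid from pg_database
                           where datname = current_database()))
group by c.relname
order by buffers desc
limit 10;
```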



&lt;p&gt;Now, to fully initialise our LLM container so that it's ready to actually handle our requests for sayings and essays, we need to pull the models we want to work with, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execInContainer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ollama"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"pull"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"llama3"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execInContainer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ollama"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"pull"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"nomic-embed-text"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you rerun the program, you will see that it pulls the models again. To avoid this, you can have a look at &lt;a href="https://github.com/ThomasVitale/llm-images" rel="noopener noreferrer"&gt;this&lt;/a&gt; repo and consider using the baked images that already have the models within them.&lt;/p&gt;

&lt;p&gt;You will notice that besides &lt;code&gt;llama3&lt;/code&gt;, which I mentioned before and which will take care of generating text, I am also pulling a so-called embedding model: &lt;code&gt;nomic-embed-text&lt;/code&gt;. This one converts text into embeddings so that they can be stored. &lt;/p&gt;

&lt;p&gt;The ones I'm using are not the only options. New LLM bindings and embedding models are added all the time in both Spring AI and ollama, so refer to the &lt;a href="https://spring.io/projects/spring-ai" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for the up-to-date list, as well as the &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;ollama&lt;/a&gt; website. &lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration properties
&lt;/h2&gt;

&lt;p&gt;Let's now have a look at the vector store configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@DynamicPropertySource&lt;/span&gt;
&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;pgVectorProperties&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DynamicPropertyRegistry&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spring.ai.vectorstore.pgvector.index-type"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"HNSW"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spring.ai.vectorstore.pgvector.distance-type"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"COSINE_DISTANCE"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;     
  &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spring.ai.vectorstore.pgvector.dimensions"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first one is called &lt;code&gt;index-type&lt;/code&gt;. This means that we are creating an index in our vector store. We don't necessarily always need an index - it's a trade-off. With indexing, the idea is that we gain speed (and sometimes other things, like uniqueness) at the expense of storage space. With indexing vectors, however, the trade-off also includes the &lt;strong&gt;relevance&lt;/strong&gt; aspect. Without an index, the similarity search is based on the kNN algorithm (k-nearest neighbours), which checks all vectors in the table. With an index, it will perform an aNN search (&lt;em&gt;approximate&lt;/em&gt; nearest neighbours), which is faster but might miss some results. Indeed, it's quite the balancing act.   &lt;/p&gt;
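&lt;p&gt;With an &lt;code&gt;HNSW&lt;/code&gt; index, pgvector exposes this speed-versus-recall trade-off at query time through the &lt;code&gt;hnsw.ef_search&lt;/code&gt; parameter (default 40); the value below is just an example:&lt;/p&gt;

```sql
-- Scan a larger candidate list during approximate search:
-- better recall, slower queries. Applies to the current session only.
set hnsw.ef_search = 100;
```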

&lt;p&gt;Let's have a look at the other configuration options for indexing, which I extracted from the Spring AI code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="no"&gt;NONE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="no"&gt;IVFFLAT&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="no"&gt;HNSW&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the beginning, there was only one option for indexing in pgvector, namely &lt;code&gt;ivfflat&lt;/code&gt;. More recently, &lt;code&gt;HNSW&lt;/code&gt; (Hierarchical Navigable Small Worlds) was added, which is based on different construction principles, is more performant, and keeps getting better. The general recommendation as of now is to go for &lt;code&gt;HNSW&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The next configuration option is the &lt;code&gt;distance-type&lt;/code&gt; which is the procedure it uses to compare vectors in order to determine similarity. Here are our options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt; &lt;span class="no"&gt;EUCLIDEAN_DISTANCE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
 &lt;span class="no"&gt;NEGATIVE_INNER_PRODUCT&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
 &lt;span class="no"&gt;COSINE_DISTANCE&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'll go with the cosine distance, but it's worth having a look at their &lt;a href="https://cmry.github.io/notes/euclidean-v-cosine" rel="noopener noreferrer"&gt;properties&lt;/a&gt;, because the choice can make a difference for your use-case.&lt;/p&gt;
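&lt;p&gt;To build some intuition for the difference, here is a small self-contained Java sketch (not part of the demo code) comparing cosine and Euclidean distance on two vectors that point in the same direction but differ in magnitude:&lt;/p&gt;

```java
import java.util.stream.IntStream;

// Toy comparison of two of the distance types on small vectors.
public class Distances {
    static double dot(double[] a, double[] b) {
        return IntStream.range(0, a.length).mapToDouble(i -> a[i] * b[i]).sum();
    }

    static double norm(double[] a) {
        return Math.sqrt(dot(a, a));
    }

    // Cosine distance: 1 minus the cosine of the angle between the vectors.
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (norm(a) * norm(b));
    }

    // Euclidean (L2) distance: straight-line distance between the points.
    static double euclideanDistance(double[] a, double[] b) {
        return Math.sqrt(IntStream.range(0, a.length)
                .mapToDouble(i -> { double d = a[i] - b[i]; return d * d; })
                .sum());
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0};
        double[] b = {2.0, 4.0}; // same direction, twice the magnitude
        System.out.println(cosineDistance(a, b));    // ~0.0
        System.out.println(euclideanDistance(a, b)); // ~2.236
    }
}
```

&lt;p&gt;Cosine distance ignores magnitude (it is 0 for these two vectors), while Euclidean distance does not - one reason cosine distance is a common default for comparing text embeddings.&lt;/p&gt;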

&lt;p&gt;The last configuration property is called &lt;code&gt;dimensions&lt;/code&gt;, which represents the number of components (float values) each embedding consists of. This number has to match the output dimensionality of the embedding model. In our example, &lt;code&gt;nomic-embed-text&lt;/code&gt; has 768, but others have more, or fewer. If the model returns embeddings with more dimensions than the table was set up with, it won't work. Now you might wonder: should you strive for as high a number of dimensions as possible? Apparently not - this blog from Supabase shows that &lt;a href="https://supabase.com/blog/fewer-dimensions-are-better-pgvector" rel="noopener noreferrer"&gt;fewer dimensions are better&lt;/a&gt;.   &lt;/p&gt;
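&lt;p&gt;If you want to double-check the dimensionality of what actually got stored, pgvector ships a &lt;code&gt;vector_dims&lt;/code&gt; function (this assumes at least one row has already been inserted):&lt;/p&gt;

```sql
-- Should report 768 for embeddings produced by nomic-embed-text
select vector_dims(embedding) from vector_store limit 1;
```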

&lt;h2&gt;
  
  
  Under the hood - what's created in Postgres?
&lt;/h2&gt;

&lt;p&gt;Let's explore what Spring AI has created for us in Postgres with this configuration. In a production application, however, you might want to take full control and drive the schema through SQL files managed by migration tools such as Flyway. We didn't do that here, for simplicity. &lt;/p&gt;

&lt;p&gt;First, we find that it created a table called &lt;code&gt;vector_store&lt;/code&gt; with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                     &lt;span class="k"&gt;Table&lt;/span&gt; &lt;span class="nv"&gt;"public.vector_store"&lt;/span&gt;
  &lt;span class="k"&gt;Column&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="k"&gt;Type&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Collation&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Nullable&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="k"&gt;Default&lt;/span&gt;       
&lt;span class="c1"&gt;-----------+-------------+-----------+----------+--------------------&lt;/span&gt;
 &lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;uuid_generate_v4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
 &lt;span class="n"&gt;content&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; 
 &lt;span class="n"&gt;metadata&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; 
 &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; 
&lt;span class="n"&gt;Indexes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nv"&gt;"vector_store_pkey"&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;btree&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;"spring_ai_vector_index"&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing surprising here. It's in line with the configuration in the Java code shown earlier. For example, we notice the &lt;code&gt;embedding&lt;/code&gt; column of type &lt;code&gt;vector&lt;/code&gt;, with 768 dimensions. We also notice the index - &lt;code&gt;spring_ai_vector_index&lt;/code&gt; with the &lt;code&gt;vector_cosine_ops&lt;/code&gt; operator class, which we expected given the &lt;code&gt;distance-type&lt;/code&gt; setting earlier. The other index, namely &lt;code&gt;vector_store_pkey&lt;/code&gt;, is created automatically by Postgres, which does this for every primary key.&lt;/p&gt;

&lt;p&gt;The command that Spring AI used to create our index is the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an index with the default configuration. It might be good to know that you have a couple of &lt;a href="https://github.com/pgvector/pgvector?tab=readme-ov-file#index-options" rel="noopener noreferrer"&gt;options&lt;/a&gt; if you'd like to tweak the index configuration for potentially better results (depends on use-case):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;m&lt;/em&gt; - the max number of connections per layer &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;ef_construction&lt;/em&gt; - the size of the dynamic candidate list for constructing the graph &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the boundaries you can pick from for these settings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;default&lt;/th&gt;
&lt;th&gt;min&lt;/th&gt;
&lt;th&gt;max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;m&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;ef_construction&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In order to understand the internals of this index and what effect changing the above options might have, here is a &lt;a href="https://arxiv.org/pdf/1603.09320" rel="noopener noreferrer"&gt;link to the original paper&lt;/a&gt;. See also this &lt;a href="https://jkatz05.com/post/postgres/pgvector-hnsw-performance/" rel="noopener noreferrer"&gt;post&lt;/a&gt; by J. Katz in which he presents results of experimenting with various combinations of the above settings.&lt;/p&gt;

&lt;p&gt;When you know what values you want to set for these settings you can create the index like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ef_construction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In case you get an error when constructing an index, it's worth checking whether Postgres has enough memory to perform this operation. You can adjust the memory available for it through the &lt;code&gt;maintenance_work_mem&lt;/code&gt; setting.&lt;/p&gt;
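&lt;p&gt;For example (the value here is illustrative - size it to your machine), you can raise it for the current session just before building the index:&lt;/p&gt;

```sql
-- More memory for index builds; applies to this session only
set maintenance_work_mem = '1GB';

create index on vector_store
using hnsw (embedding vector_cosine_ops);
```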

&lt;p&gt;Let's now check how our &lt;code&gt;embedding&lt;/code&gt; column is actually stored on disk. We use the following &lt;a href="https://stackoverflow.com/a/49947950" rel="noopener noreferrer"&gt;query&lt;/a&gt;, which shows the storage strategy Postgres uses for the column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; 
             &lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="k"&gt;case&lt;/span&gt; 
             &lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attstorage&lt;/span&gt;
               &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="s1"&gt;'p'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'plain'&lt;/span&gt;
               &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="s1"&gt;'m'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'main'&lt;/span&gt;
               &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="s1"&gt;'e'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'external'&lt;/span&gt;
               &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="s1"&gt;'x'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'extended'&lt;/span&gt;
            &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;attstorage&lt;/span&gt;
           &lt;span class="k"&gt;from&lt;/span&gt; 
            &lt;span class="n"&gt;pg_attribute&lt;/span&gt; &lt;span class="n"&gt;att&lt;/span&gt;  
           &lt;span class="k"&gt;join&lt;/span&gt; 
            &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrelid&lt;/span&gt;   
           &lt;span class="k"&gt;join&lt;/span&gt; 
            &lt;span class="n"&gt;pg_namespace&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relnamespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt;   
           &lt;span class="k"&gt;where&lt;/span&gt; 
            &lt;span class="n"&gt;tbl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'vector_store'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; 
            &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nspname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt;   
            &lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'embedding'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;RECORD&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="c1"&gt;---------&lt;/span&gt;
&lt;span class="n"&gt;attname&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="n"&gt;attstorage&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;external&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alright, so it uses the &lt;code&gt;external&lt;/code&gt; storage type. This means that it will store this column in a separate, so-called &lt;a href="https://www.postgresql.org/docs/current/storage-toast.html" rel="noopener noreferrer"&gt;TOAST&lt;/a&gt; table. Postgres does this when column values are so large that it can't fit at least 4 rows in a &lt;a href="https://www.postgresql.org/docs/16/storage-page-layout.html" rel="noopener noreferrer"&gt;page&lt;/a&gt;. Interestingly, it will &lt;em&gt;not&lt;/em&gt; attempt to also compress the values to shrink them even more. For compressed columns, the result above would have said &lt;code&gt;extended&lt;/code&gt; instead of &lt;code&gt;external&lt;/code&gt;. &lt;/p&gt;
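&lt;p&gt;You can override this behaviour if you want. For vectors that still fit comfortably in a page (our 768-dimensional ones take roughly 3 kB each), standard Postgres lets you keep them inline in the main table by switching the column to &lt;code&gt;plain&lt;/code&gt; storage; note this has to be done before data is inserted to take effect for all rows:&lt;/p&gt;

```sql
-- Store vectors inline in the main table instead of a TOAST table
alter table vector_store alter column embedding set storage plain;
```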

&lt;p&gt;Normally, when you update one or multiple columns of a row, Postgres will, instead of overwriting, make a copy of the entire row (it's an &lt;a href="https://www.postgresql.org/docs/current/mvcc-intro.html" rel="noopener noreferrer"&gt;MVCC&lt;/a&gt; database). But if there are any large TOASTed columns, then during an update it will copy only the other columns. It will copy the TOASTed column only when that is updated. This makes it more efficient by minimising the amount of copying around of large values.  &lt;/p&gt;

&lt;p&gt;Where are these separate tables though? We haven't created them ourselves, they are managed by Postgres. Let's try to locate this separate TOAST table using this &lt;a href="https://medium.com/quadcode-life/toast-tables-in-postgresql-99e3403ed29b" rel="noopener noreferrer"&gt;query&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; 
             &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
             &lt;span class="n"&gt;oid&lt;/span&gt;
           &lt;span class="k"&gt;from&lt;/span&gt; 
             &lt;span class="n"&gt;pg_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
              &lt;span class="n"&gt;reltoastrelid&lt;/span&gt; 
            &lt;span class="k"&gt;from&lt;/span&gt; 
              &lt;span class="n"&gt;pg_class&lt;/span&gt;
            &lt;span class="k"&gt;where&lt;/span&gt; 
              &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'vector_store'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt; 
            &lt;span class="k"&gt;where&lt;/span&gt; 
              &lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reltoastrelid&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; 
              &lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
                      &lt;span class="n"&gt;indexrelid&lt;/span&gt; 
                     &lt;span class="k"&gt;from&lt;/span&gt; 
                      &lt;span class="n"&gt;pg_index&lt;/span&gt;
                     &lt;span class="k"&gt;where&lt;/span&gt; 
                      &lt;span class="n"&gt;indrelid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reltoastrelid&lt;/span&gt;
                      &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;       &lt;span class="n"&gt;relname&lt;/span&gt;        &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="n"&gt;oid&lt;/span&gt;  
&lt;span class="c1"&gt;----------------------+-------&lt;/span&gt;
 &lt;span class="n"&gt;pg_toast_16630&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;16634&lt;/span&gt;
 &lt;span class="n"&gt;pg_toast_16630_index&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;16635&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So far so good. We now have the TOAST table's OID. Let's use it to have a look at the structure of the TOAST table. For example, what columns does it have? Note that these tables live in the &lt;code&gt;pg_toast&lt;/code&gt; schema, so to reach them we first have to set the &lt;code&gt;search_path&lt;/code&gt; accordingly, like below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;search_path&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;pg_toast&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt;
&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;pg_toast_16630&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;TOAST&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="nv"&gt;"pg_toast.pg_toast_16630"&lt;/span&gt;
   &lt;span class="k"&gt;Column&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;Type&lt;/span&gt;   
&lt;span class="c1"&gt;------------+---------&lt;/span&gt;
 &lt;span class="n"&gt;chunk_id&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;oid&lt;/span&gt;
 &lt;span class="n"&gt;chunk_seq&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;
 &lt;span class="n"&gt;chunk_data&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;bytea&lt;/span&gt;
&lt;span class="n"&gt;Owning&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"public.vector_store"&lt;/span&gt;
&lt;span class="n"&gt;Indexes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nv"&gt;"pg_toast_16630_index"&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;btree&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_seq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can learn a couple of things from this. As expected, the large column values in the main table that have to be "TOASTed" are split into chunks; each chunk is identified by a sequence number, and chunks are always retrieved via the index on &lt;code&gt;(chunk_id, chunk_seq)&lt;/code&gt;.&lt;/p&gt;
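
&lt;p&gt;We can verify this chunking first-hand by querying the TOAST table directly. Here is a sketch using the &lt;code&gt;pg_toast_16630&lt;/code&gt; name from this example (the OID will differ in your database), counting the chunks and bytes stored per TOASTed value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- one row per TOASTed value: how many chunks it was split into,
-- and how many bytes those chunks add up to
select
  chunk_id,
  count(*) as chunks,
  sum(length(chunk_data)) as total_bytes
from
  pg_toast.pg_toast_16630
group by
  chunk_id
order by
  total_bytes desc;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;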

&lt;p&gt;Postgres has a mechanism to avoid "blasting" the entire shared buffer cache when it needs to do large reads, such as sequential scans of a big table. In those cases it uses a small ring buffer (32 pages, i.e. 256 kB) so that it doesn't evict everybody else's data from the cache. However, this mechanism does not kick in for TOAST tables, which are read through their index, so vector-heavy workloads run without this form of protection.&lt;/p&gt;
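
&lt;p&gt;To observe the buffer traffic that such reads cause, &lt;code&gt;EXPLAIN&lt;/code&gt; with the &lt;code&gt;BUFFERS&lt;/code&gt; option reports the shared buffer hits and reads of a query. A sketch (the counts will of course vary with your data):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- length() forces detoasting of every stored essay; the "Buffers:"
-- lines in the output show the shared buffer hits/reads incurred
explain (analyze, buffers)
select length(content) from vector_store;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;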

&lt;p&gt;Okay! We had a very good look at the database part. Let's now "resurface" for a moment and have a look at other topics pertaining to the high level workflow of interacting with the LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Template-based prompts
&lt;/h2&gt;

&lt;p&gt;Initially, I constructed the prompts for the LLM requests in the same class where I was using them. However, I found a different approach in the Spring AI repository itself and adopted it, because it's indeed cleaner: the prompts live in externalised resource files, which are injected like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Value&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"classpath:/generate-essay.st"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;protected&lt;/span&gt; &lt;span class="nc"&gt;Resource&lt;/span&gt; &lt;span class="n"&gt;generateEssay&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;@Value&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"classpath:/generate-saying.st"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;protected&lt;/span&gt; &lt;span class="nc"&gt;Resource&lt;/span&gt; &lt;span class="n"&gt;generateSaying&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;@Value&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"classpath:/guess-saying.st"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;protected&lt;/span&gt; &lt;span class="nc"&gt;Resource&lt;/span&gt; &lt;span class="n"&gt;guessSaying&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how one of them looks inside.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write a short essay under 200 words explaining the 
meaning of the following saying: {saying}.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, I have not applied any sophisticated &lt;a href="https://www.promptingguide.ai/" rel="noopener noreferrer"&gt;prompt engineering&lt;/a&gt; whatsoever, and kept it simple and direct for now. &lt;/p&gt;

&lt;h2&gt;
  
  
  Calling the LLM
&lt;/h2&gt;

&lt;p&gt;Alright, the pieces are starting to fit together! The next thing I'd like to show you is how to call the LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;chatModel&lt;/span&gt;
 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withModel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;createPromptFrom&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;promptTemplate&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;promptTemplateValues&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResult&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOutput&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getContent&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I am using the so-called &lt;a href="https://docs.spring.io/spring-ai/reference/api/chatmodel.html" rel="noopener noreferrer"&gt;&lt;code&gt;Chat Model API&lt;/code&gt;&lt;/a&gt;, a powerful abstraction over AI models. This design allows us to switch between models with minimal code changes: if you want to work with a different model, you just change the runtime configuration. It's a nice example of the Dependency Inversion Principle: high-level modules do not depend on low-level modules; both depend on abstractions. &lt;/p&gt;

&lt;h2&gt;
  
  
  Storing the embeddings
&lt;/h2&gt;

&lt;p&gt;To store the embeddings, I must say that I found it a pretty complicated procedure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;vectorStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just kidding, that's it! &lt;/p&gt;

&lt;p&gt;This single call does several things. It first converts the documents (our essays) to embeddings with the help of the embedding model, then runs the following batched insert statement to get the embeddings into our &lt;code&gt;vector_store&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that, if a row with that ID is already present in the database, it updates the &lt;code&gt;content&lt;/code&gt;, &lt;code&gt;metadata&lt;/code&gt; and &lt;code&gt;embedding&lt;/code&gt; columns instead (taken care of by the &lt;code&gt;ON CONFLICT&lt;/code&gt; clause in the query). &lt;/p&gt;

&lt;h2&gt;
  
  
  Similarity searches
&lt;/h2&gt;

&lt;p&gt;To do a similarity search on the stored vectors with Spring AI, it's just a matter of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;vectorStore&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;similaritySearch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SearchRequest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;saying&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirst&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getContent&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, Spring AI gets a couple of things done for you. It takes the parameter you supply (the saying, in our case), first creates its embedding using the embedding model we talked about before, then uses that embedding to retrieve the most similar results, from which we pick only the first one.&lt;/p&gt;

&lt;p&gt;With this configuration (cosine similarity), the SQL query that it will run for you is the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;nativeFilterExpression&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonpath&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It selects all the columns of the table plus a computed &lt;code&gt;distance&lt;/code&gt; column. The results are ordered by this distance, and you can additionally specify a similarity threshold and a native filter expression using Postgres' jsonpath functionality. &lt;/p&gt;
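
&lt;p&gt;Stripped of the placeholders, a hand-written version of this search could look like the sketch below, where &lt;code&gt;:query_embedding&lt;/code&gt; is a stand-in for the embedding of the search text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- :query_embedding stands for the embedding of the search text,
-- e.g. a literal like '[0.011, -0.024, ...]'::vector
select
  id,
  content,
  embedding &amp;lt;=&amp;gt; :query_embedding as distance
from
  vector_store
order by
  distance
limit 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;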

&lt;p&gt;One thing to note is that if you write the query yourself, without letting Spring AI create it for you, you can &lt;a href="https://github.com/pgvector/pgvector?tab=readme-ov-file#query-options" rel="noopener noreferrer"&gt;customise&lt;/a&gt; it by supplying different values for the &lt;code&gt;ef_search&lt;/code&gt; parameter (default: 40, min: 1, max: 1000), like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ef_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With it, you can influence the number of neighbours that are considered during the search: the more that are checked, the better the recall, but at the expense of query speed.&lt;/p&gt;
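
&lt;p&gt;To see what a higher setting costs in practice, we could compare the timings reported by &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; at two settings. A sketch, with &lt;code&gt;:query_embedding&lt;/code&gt; again standing in for a real query vector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- recall-oriented setting: larger candidate list, slower queries
set hnsw.ef_search = 400;

explain (analyze, buffers)
select id from vector_store
order by embedding &amp;lt;=&amp;gt; :query_embedding
limit 5;

-- back to the default of 40
reset hnsw.ef_search;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;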

&lt;p&gt;Now that we know how to perform similarity searches to retrieve semantically close data, let's make a short incursion into how Postgres uses memory (shared buffers) when performing these retrievals.&lt;/p&gt;

&lt;h2&gt;
  
  
  How much of the shared buffers got filled up?
&lt;/h2&gt;

&lt;p&gt;Let's now increase the number of essays we're working with to &lt;code&gt;100&lt;/code&gt;, and have a look at what's in the Postgres shared buffers after we run the program. We'll use the &lt;code&gt;pg_buffercache&lt;/code&gt; extension mentioned before, which was installed in the init script. &lt;/p&gt;

&lt;p&gt;But first, let's look at the sizes of the table and the index, just to get some perspective.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                       &lt;span class="n"&gt;List&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;relations&lt;/span&gt;
 &lt;span class="k"&gt;Schema&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="n"&gt;Name&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Type&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;Owner&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Persistence&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Access&lt;/span&gt; &lt;span class="k"&gt;method&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;Size&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Description&lt;/span&gt; 
&lt;span class="c1"&gt;--------+--------------+-------+----------+-------------+---------------+--------+-------------&lt;/span&gt;
 &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;permanent&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;heap&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;584&lt;/span&gt; &lt;span class="n"&gt;kB&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;di&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;spring_ai_vector_index&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                                   &lt;span class="n"&gt;List&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;relations&lt;/span&gt;
 &lt;span class="k"&gt;Schema&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;          &lt;span class="n"&gt;Name&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Type&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;Owner&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="k"&gt;Table&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Persistence&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Access&lt;/span&gt; &lt;span class="k"&gt;method&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;Size&lt;/span&gt;  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Description&lt;/span&gt; 
&lt;span class="c1"&gt;--------+------------------------+-------+----------+--------------+-------------+---------------+--------+-------------&lt;/span&gt;
 &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;spring_ai_vector_index&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;permanent&lt;/span&gt;   &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;408&lt;/span&gt; &lt;span class="n"&gt;kB&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, so the table is &lt;code&gt;584 kB&lt;/code&gt; and the index is &lt;code&gt;408 kB&lt;/code&gt;. The index gets pretty big, approaching the size of the table itself. At such a small scale we don't mind much, but if this proportion holds at large scale too, we will have to take it more seriously. &lt;/p&gt;
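
&lt;p&gt;The same sizes can also be retrieved programmatically, which is handy if we want to track the table-to-index ratio over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- pg_table_size includes the TOAST table, matching what \dt+ reports
select
  pg_size_pretty(pg_table_size('vector_store')) as table_size,
  pg_size_pretty(pg_relation_size('spring_ai_vector_index')) as index_size;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;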

&lt;p&gt;To contrast with how other index types behave, I checked a table we have at work that amounts to &lt;code&gt;40 GB&lt;/code&gt;. The corresponding B-tree primary key index is &lt;code&gt;10 GB&lt;/code&gt;, while other indexes of the same type on other columns are just &lt;code&gt;3 GB&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I'm using the following &lt;a href="https://tomasz-gintowt.medium.com/postgresql-extensions-pg-buffercache-b38b0dc08000" rel="noopener noreferrer"&gt;query&lt;/a&gt; to get an overview of what's in the shared buffers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
  &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;buffers&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
  &lt;span class="n"&gt;pg_buffercache&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; 
&lt;span class="k"&gt;inner&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; 
  &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relfilenode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pg_relation_filenode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt;
                &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reldatabase&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
                                        &lt;span class="n"&gt;oid&lt;/span&gt; 
                                      &lt;span class="k"&gt;from&lt;/span&gt; 
                                        &lt;span class="n"&gt;pg_database&lt;/span&gt;
                                      &lt;span class="k"&gt;where&lt;/span&gt; 
                                        &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_database&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                      &lt;span class="p"&gt;)&lt;/span&gt;
                  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; 
  &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; 
  &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;span class="k"&gt;limit&lt;/span&gt; 
   &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;            &lt;span class="n"&gt;relname&lt;/span&gt;             &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;buffers&lt;/span&gt; 
&lt;span class="c1"&gt;--------------------------------+---------&lt;/span&gt;
 &lt;span class="n"&gt;pg_proc&lt;/span&gt;                        &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;61&lt;/span&gt;
 &lt;span class="n"&gt;pg_toast_16630&lt;/span&gt;                 &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;53&lt;/span&gt;
 &lt;span class="n"&gt;spring_ai_vector_index&lt;/span&gt;         &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;51&lt;/span&gt;
 &lt;span class="n"&gt;pg_attribute&lt;/span&gt;                   &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;35&lt;/span&gt;
 &lt;span class="n"&gt;pg_proc_proname_args_nsp_index&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;30&lt;/span&gt;
 &lt;span class="n"&gt;pg_depend&lt;/span&gt;                      &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;23&lt;/span&gt;
 &lt;span class="n"&gt;pg_operator&lt;/span&gt;                    &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;19&lt;/span&gt;
 &lt;span class="n"&gt;pg_statistic&lt;/span&gt;                   &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;19&lt;/span&gt;
 &lt;span class="n"&gt;pg_class&lt;/span&gt;                       &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;18&lt;/span&gt;
 &lt;span class="n"&gt;vector_store&lt;/span&gt;                   &lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mi"&gt;18&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that the index is in there in its entirety. We can deduce this because the index is &lt;code&gt;408 kB&lt;/code&gt;, as we saw before, and dividing that by &lt;code&gt;8 kB&lt;/code&gt;, the size of a Postgres page, gives exactly the &lt;code&gt;51&lt;/code&gt; buffers reported in the third row above. &lt;/p&gt;
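
&lt;p&gt;We can double-check that arithmetic directly in SQL, using the server's reported block size instead of assuming &lt;code&gt;8 kB&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- number of pages the index occupies on disk
select
  pg_relation_size('spring_ai_vector_index')
    / current_setting('block_size')::int as pages;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;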

&lt;p&gt;We can draw a conclusion from this: working with vectors in Postgres is going to be pretty demanding in terms of memory. As a reference, vectors with 1536 dimensions (probably the most common case) occupy about &lt;code&gt;6 kB&lt;/code&gt; each. One million of them already gets us to &lt;code&gt;6 GB&lt;/code&gt;. If we run other workloads next to the vectors, they might be affected: we may start seeing cache evictions because there are no free buffers. If we notice performance going downhill, we might even need to consider moving the vectors into a separate database in order to isolate the workloads. &lt;/p&gt;
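
&lt;p&gt;The back-of-the-envelope figure can be reproduced in SQL: pgvector stores roughly 4 bytes per dimension plus a small per-value header, so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- ~4 bytes per float4 dimension plus a small header per value
select
  pg_column_size('[1,2,3]'::vector) as three_dim_vector_bytes,
  pg_size_pretty((8 + 4 * 1536)::bigint) as per_1536_dim_vector,
  pg_size_pretty((8 + 4 * 1536)::bigint * 1000000) as one_million_vectors;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;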

&lt;h2&gt;
  
  
  The &lt;code&gt;@ParameterizedTest&lt;/code&gt; JUnit annotation
&lt;/h2&gt;

&lt;p&gt;Alright, a last remark I want to make about this program: it's set up so you can experiment with other open-source LLMs. The entrypoint method I'm using to run the workflow is a JUnit parameterized test, where the arguments for each run are the names of LLM models distributed with Ollama. This is how you set it up to run multiple times, with a different LLM for every execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@ParameterizedTest&lt;/span&gt;
&lt;span class="nd"&gt;@ValueSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"llama3"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"llama2"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gemma"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"mistral"&lt;/span&gt;&lt;span class="o"&gt;})&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;rag_workflow&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Outputs
&lt;/h2&gt;

&lt;p&gt;Finally, it's time to review how well the LLM managed to guess the sayings. With no other help except the initial essays provided in the prompt, it guessed the saying perfectly a grand total of... once! &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Saying&lt;/th&gt;
&lt;th&gt;LLM Guess&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Your most powerful moments are born from the ashes of your greatest fears.&lt;/td&gt;
&lt;td&gt;What doesn't kill you...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Every sunrise holds the promise of a new masterpiece.&lt;/td&gt;
&lt;td&gt;What lies within is far more important than what lies without.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Every step forward is a declaration of your willingness to grow.&lt;/td&gt;
&lt;td&gt;Any Step Forward&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Your most beautiful moments are waiting just beyond your comfort zone.&lt;/td&gt;
&lt;td&gt;What lies within...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Light reveals itself in the darkness it creates.&lt;/td&gt;
&lt;td&gt;The darkness is not the absence of light but the presence of a different kind&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Courage is not the absence of fear, but the willingness to take the next step anyway.&lt;/td&gt;
&lt;td&gt;Be brave.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Small sparks can ignite entire galaxies.&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Small sparks can ignite entire galaxies.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Believe in yourself, take the leap and watch the universe conspire to make your dreams come true.&lt;/td&gt;
&lt;td&gt;Take the leap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Life begins at the edge of what you're willing to let go.&lt;/td&gt;
&lt;td&gt;Take the leap.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Some responses are quite amusing, like when it tries to be "mysterious" or more conversational by not completing the sentence and ending it in three dots ("What doesn't kill you..."), or when it reaches for extreme succinctness ("Take the leap.", "Be brave."). &lt;/p&gt;

&lt;p&gt;Let's give it some help now. In the prompt, this time I'll provide all the sayings it initially generated as a list of options to pick from. Will it manage to pick the correct one from the bunch this way?&lt;/p&gt;

&lt;p&gt;It turns out that, when given options to pick from, it picked the right one every time. Quite the difference between with and without RAG!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Spring AI is a well-designed application framework that helps you achieve a lot with little code. You can see it as the "linchpin" that helps you set up and easily evolve your AI use-cases, in stand-alone new applications or integrated with your existing Spring ones. It already has many integrations with specialised AI services, and the list keeps growing.&lt;/p&gt;

&lt;p&gt;The open-source LLMs I tried have not risen to the occasion, and haven't passed my "challenge" of guessing the initial sayings they themselves generated from their (also own) essays. They seem not yet ready for use-cases that require this kind of precise, correctly "synthesised" answer, but I will keep trying new models as they become available. &lt;/p&gt;

&lt;p&gt;However, they are still useful if you know what to expect from them - they are very good at storing and retrieving many loosely connected facts, a clear value-add when you need to brainstorm, for example. &lt;/p&gt;

&lt;p&gt;When I gave it the options, the difference was night and day: it picked the right answer every time, flawlessly. &lt;/p&gt;

&lt;p&gt;We also looked at how embeddings are stored internally with the &lt;code&gt;pgvector&lt;/code&gt; extension, and how "memory-hungry" this is - we should account for this and make some arrangements at the beginning of the project in order to have smooth operation when the scale grows.&lt;/p&gt;
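&lt;p&gt;As a minimal sketch of such an arrangement (the table name and the dimension of &lt;code&gt;768&lt;/code&gt; are illustrative assumptions; the dimension depends on the embedding model used):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- illustrative sketch: a table holding one embedding per stored text
create extension if not exists vector;

create table documents
(
    id        bigint primary key generated always as identity,
    content   text,
    embedding vector(768)  -- dimension must match the embedding model
);

-- an approximate-nearest-neighbour index (HNSW) speeds up similarity
-- search, at the cost of the extra memory discussed above
create index on documents using hnsw (embedding vector_cosine_ops);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;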

&lt;p&gt;Thanks for reading! &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Query optimisation guided by explain plans</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Wed, 03 Apr 2024 14:33:30 +0000</pubDate>
      <link>https://dev.to/mcadariu/retrieving-the-latest-row-per-group-in-postgresql-247d</link>
      <guid>https://dev.to/mcadariu/retrieving-the-latest-row-per-group-in-postgresql-247d</guid>
      <description>&lt;p&gt;&lt;u&gt;&lt;strong&gt;Use-case&lt;/strong&gt;&lt;/u&gt;: we have a set of &lt;code&gt;meters&lt;/code&gt; (e.g. gas, electricity, etc) that regularly record &lt;code&gt;readings&lt;/code&gt;. Our task is to retrieve the latest reading of every meter (or more generically phrased: &lt;code&gt;retrieving the latest row per group&lt;/code&gt;) using SQL.&lt;/p&gt;

&lt;p&gt;SQL is a declarative language. Therefore, we achieve our goal by expressing only &lt;em&gt;what&lt;/em&gt; we want in the form of queries, instead of providing precise instructions about &lt;em&gt;how&lt;/em&gt; to retrieve the data. The database component called the planner then determines the best way to do that, based on several factors such as table statistics. However, as we shall see, the first approach that comes to mind when writing queries might not yield the best performance. So ideally we should know as much as possible about what the database does internally to retrieve our data, and what our options are for steering it towards the optimal access path for our use-cases. &lt;/p&gt;

&lt;p&gt;In the sections below, to explain exactly what makes the alternatives I show you faster, I use visualisations of Postgres &lt;a href="https://www.postgresql.org/docs/current/using-explain.html" rel="noopener noreferrer"&gt;explain plans&lt;/a&gt; as support, showing how the database processes our queries internally. As you will see, some adjustments make quite a difference - we will go from &lt;code&gt;hundreds&lt;/code&gt; of ms to about &lt;code&gt;5&lt;/code&gt;. Along the way I've added some side notes to the main story, marked appropriately. &lt;/p&gt;

&lt;p&gt;I'm using &lt;code&gt;Postgres&lt;/code&gt; version &lt;code&gt;16.2&lt;/code&gt; (with its default configuration) running on my laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tables
&lt;/h2&gt;

&lt;p&gt;First, we create the table of &lt;code&gt;meters&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;  &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="k"&gt;generated&lt;/span&gt; &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we create the table of &lt;code&gt;readings&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;           &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="k"&gt;generated&lt;/span&gt; &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meter_id&lt;/span&gt;     &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;date&lt;/span&gt;         &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;reading&lt;/span&gt;      &lt;span class="nb"&gt;double&lt;/span&gt; &lt;span class="nb"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="k"&gt;constraint&lt;/span&gt; &lt;span class="n"&gt;fk__readings_meters&lt;/span&gt; &lt;span class="k"&gt;foreign&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;references&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For enforcing integrity, the tables are linked together with a foreign key constraint. I am also &lt;a href="https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_serial" rel="noopener noreferrer"&gt;not using serial&lt;/a&gt; when setting up the primary keys.&lt;/p&gt;
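&lt;p&gt;For comparison, here is the discouraged &lt;code&gt;serial&lt;/code&gt; style next to the SQL-standard &lt;code&gt;identity&lt;/code&gt; one used above (the table names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- discouraged: serial is a Postgres-specific shorthand
create table example_old (id serial primary key);

-- preferred: SQL-standard identity column
create table example_new (id bigint primary key generated always as identity);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;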

&lt;h2&gt;
  
  
  Generating test data
&lt;/h2&gt;

&lt;p&gt;Let's now populate our tables with some rows: we add &lt;code&gt;500&lt;/code&gt; meters, all having one reading every day for a year. For generating test data for situations like this, the &lt;a href="https://www.postgresql.org/docs/current/functions-srf.html#FUNCTIONS-SRF" rel="noopener noreferrer"&gt;generate_series&lt;/a&gt; function is invaluable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2024-02-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2025-02-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
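&lt;p&gt;A quick sanity check on the generated volume: &lt;code&gt;500&lt;/code&gt; meters times &lt;code&gt;367&lt;/code&gt; daily readings (the date range is inclusive, and 2024 is a leap year) should give &lt;code&gt;183500&lt;/code&gt; rows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;select count(*) from readings;
-- 183500 (500 meters x 367 days)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;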



&lt;h2&gt;
  
  
  Querying
&lt;/h2&gt;

&lt;p&gt;Our tables are now populated and ready to be queried. Let's get to it!&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Window functions: &lt;code&gt;250ms&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;We first express that we want to partition our readings per meter ID, and then use the &lt;a href="https://www.postgresql.org/docs/current/functions-window.html" rel="noopener noreferrer"&gt;&lt;code&gt;row_number&lt;/code&gt;&lt;/a&gt; function to assign an "order number" to every reading based on its age among its peers within an individual partition. We then use this to filter the result set and return only the rows at the top (with row number equal to 1) of every partition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;analyse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;readings_with_rownums&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;                                                                                                                                                                                                                                                                       
    &lt;span class="k"&gt;select&lt;/span&gt;                                                                                                                                                                                                                                                                                                
        &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                                                                                                                                                                                                
        &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rownum&lt;/span&gt;                                                                                                                                                                                                              
    &lt;span class="k"&gt;from&lt;/span&gt;                                                                                                                                                                                                                                                                                                  
        &lt;span class="n"&gt;readings&lt;/span&gt;                                                                                                                                                                                                                                                                                      
    &lt;span class="k"&gt;join&lt;/span&gt;                                                                                                                                                                                                                                                                                                  
         &lt;span class="n"&gt;meters&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reading&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
    &lt;span class="n"&gt;readings_with_rownums&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt;
    &lt;span class="n"&gt;readings_with_rownums&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rownum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs in about &lt;code&gt;250ms&lt;/code&gt;. Actually, not that bad of a starting point. However, this timing does mean that the users will not perceive the response as being &lt;a href="https://www.nngroup.com/articles/response-times-3-important-limits/" rel="noopener noreferrer"&gt;instantaneous&lt;/a&gt; (~100ms). Let's try to do better!&lt;/p&gt;

&lt;p&gt;But how? Time to have a first look at the &lt;a href="https://www.postgresql.org/docs/current/using-explain.html" rel="noopener noreferrer"&gt;explain plan&lt;/a&gt; and try to get some clues. &lt;/p&gt;

&lt;p&gt;It's more pleasant to look at explain plans using visualisation tools instead of in text form (the way we get them from Postgres itself by default). There are several great alternatives; here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.pgmustard.com/" rel="noopener noreferrer"&gt;pgMustard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pganalyze.com/" rel="noopener noreferrer"&gt;pganalyze&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://explain.dalibo.com/" rel="noopener noreferrer"&gt;Dalibo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://explain.depesz.com/" rel="noopener noreferrer"&gt;Depesz explain&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the rest of this blog I'm going to use the one from Dalibo. Here's our first one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frky2tam6nnbvhk0cvajl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frky2tam6nnbvhk0cvajl.png" alt="explain-analyze" width="800" height="1094"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A word on how to interpret them. The "flow" of data is from the bottom upwards, as the arrow in the lower left part indicates. In other words, the plan should be read from bottom to top. I've also opened the relevant explain plan nodes I want to explore further, and highlighted the row counts, which are the first thing that jumped out at me.  &lt;/p&gt;

&lt;p&gt;Looking at the row counts, we can conclude the query efficiency is low. It reads, and then discards, &lt;code&gt;99%&lt;/code&gt; of the rows before returning the final results. You can see that in the &lt;code&gt;Sort&lt;/code&gt; node, where &lt;code&gt;183500&lt;/code&gt; rows arrive as input from the nodes below, but only &lt;code&gt;500&lt;/code&gt; are actually returned in the final result set. This way of getting our results has low &lt;a href="https://use-the-index-luke.com/sql/testing-scalability/data-volume" rel="noopener noreferrer"&gt;scalability&lt;/a&gt;, being very sensitive to growth in the dataset. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Side note&lt;/u&gt;&lt;/strong&gt;: Let's also consider the overall resource utilisation for a moment. When the database performs sorts like this over and over again (as in an application used by several users concurrently) on potentially very large tables, you will most probably see a spike in CPU utilisation, as I once did. Where possible, we should save CPU for situations where there really is no alternative. One reason is that some cloud provider databases don't let you get more CPU without a proportional increase in RAM or other resources. &lt;/p&gt;

&lt;p&gt;Alright, time to move on. Let's proceed and look at other options.  &lt;/p&gt;

&lt;h3&gt;
  
  
  2. DISTINCT ON: &lt;code&gt;250ms&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;I found this approach rather elegant actually, and I was rooting for it to perform better. To understand how it works, have a look at the &lt;a href="https://www.postgresql.org/docs/current/sql-select.html" rel="noopener noreferrer"&gt;DISTINCT clause&lt;/a&gt; section in the Postgres docs. Here's how it looks when we apply it to retrieve our readings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;explain&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;analyse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; 
    &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; 
    &lt;span class="n"&gt;meters&lt;/span&gt; 
&lt;span class="k"&gt;join&lt;/span&gt; 
    &lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; 
    &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I didn't get a noticeable difference with regards to how fast it runs compared with the previous approach with the window functions. The explain plan looks quite similar to the one we have seen before:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6l6hwe5yge1dm3se49z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6l6hwe5yge1dm3se49z.png" alt="explain-analyze" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can observe a difference though: it no longer contains the &lt;code&gt;WindowAgg&lt;/code&gt; node you've seen before. In general it is good to have fewer operations, but this didn't get us very far in our pursuit of reducing the query time. The query is still inefficient, reading a lot of rows and discarding the majority before returning the final results. &lt;/p&gt;

&lt;p&gt;One thing I noticed in the explain plan is the following detail which is definitely worth taking a closer look at when you see it in your explain plans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sort Method: external merge  Disk: 3968kB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that the sort operation is being slowed down because it is forced to use the disk. This happens when the &lt;a href="https://www.citusdata.com/blog/2018/06/12/configuring-work-mem-on-postgres" rel="noopener noreferrer"&gt;&lt;code&gt;work_mem&lt;/code&gt;&lt;/a&gt; setting is too low given the size of the dataset. The sort can't be done fully in memory, so it has no choice but to "spill" the operation to disk. The default setting for &lt;code&gt;work_mem&lt;/code&gt; in Postgres is &lt;code&gt;4MB&lt;/code&gt;, which in this case proves to be too little.&lt;/p&gt;

&lt;p&gt;Let's increase this setting to make sure it's sufficient to do the sort in memory. But note that giving it too much can be problematic: if the load increases unexpectedly, you can run out of memory - it's a balancing act.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set work_mem='16MB';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I don't have to restart Postgres for this setting to take effect; for some other settings, we would have to. &lt;/p&gt;
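&lt;p&gt;If you're wondering which settings require a restart, the &lt;code&gt;context&lt;/code&gt; column of the &lt;code&gt;pg_settings&lt;/code&gt; view tells you; for example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;select name, context from pg_settings where name in ('work_mem', 'shared_buffers');
-- work_mem       | user        (can be changed per session)
-- shared_buffers | postmaster  (requires a restart)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;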

&lt;p&gt;We retry the query above, and we can confirm that the sort now happens in memory; the proof is that we now see the following in the explain plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sort Method: quicksort  Memory: 14951kB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Did this make a difference though? Not really - we're still at ~&lt;code&gt;250ms&lt;/code&gt; response time. Ah well, no problem, we've still got other options. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Side note&lt;/u&gt;&lt;/strong&gt;: This experiment is on my laptop, where the CPU / memory / storage are all on one machine. But when using Amazon RDS, for example, the storage relies on another service - EBS - which is separated from the database by a network. In those scenarios, avoiding disk "spills" like the one I showed you above makes &lt;em&gt;more&lt;/em&gt; of a difference, because the data has to "travel" further between memory and disk. You might also like to know that AWS recently introduced the &lt;a href="https://aws.amazon.com/blogs/database/introducing-optimized-reads-for-amazon-rds-for-postgresql/" rel="noopener noreferrer"&gt;Optimised Reads&lt;/a&gt; option, where for these operations the database instance can use a fast local NVMe SSD instead of the network-attached EBS volume, making disk spills less impactful. &lt;/p&gt;

&lt;h3&gt;
  
  
  3. MAX(DATE): &lt;code&gt;115ms&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Time to switch gears again. What do you think of this approach, this time using SQL aggregate functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;analyse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;latest_reads_per_meter&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt;
        &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;   &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reading_date&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt;
        &lt;span class="n"&gt;readings&lt;/span&gt;
    &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt;
        &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="n"&gt;meters&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt;
    &lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt;
    &lt;span class="n"&gt;latest_reads_per_meter&lt;/span&gt; &lt;span class="n"&gt;lrpm&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;lrpm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lrpm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reading_date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As usual, the explain plan:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y7goyeo0hb5526xy4sy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y7goyeo0hb5526xy4sy.png" alt="explain-analyze" width="800" height="917"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This looks a bit different from what we have seen before. In a good way! Let's understand why. The key is that we're now doing the sort much &lt;code&gt;earlier&lt;/code&gt;, and consequently discarding the irrelevant rows earlier. The plan doesn't "carry them over" all the way through the execution process. This means less memory consumed, because the intermediate results are smaller. But does it speed up our query? &lt;/p&gt;

&lt;p&gt;Indeed it does! Have a look at this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Execution Time: 114.942 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, some solid progress! We cut the runtime we started with in half. But can we do better? You bet - we can even reduce it by an order of magnitude. &lt;/p&gt;

&lt;h3&gt;
  
  
  4. Loose index scans: &lt;code&gt;13ms&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Let's have a look at how the loose index scan works, since by now you're probably already wondering, when are you going to "bring in" the indexes?!&lt;/p&gt;

&lt;p&gt;When creating indexes, the order of columns matters. The columns have to be defined exactly in the order I'm about to show you: the "grouping" column first, then the column used for determining the "latest" row within a group.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
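&lt;p&gt;To illustrate why the order matters (the index name below is hypothetical): with the columns reversed, the entries for a single meter are scattered throughout the index, so the "latest reading of one meter" lookup can no longer be answered with a short scan:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- reversed column order: sorted by date first, then meter_id,
-- so the newest entry of a given meter cannot be reached directly
create index idx_reversed on readings(date, meter_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;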



&lt;p&gt;We can then write the query like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;explain&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;analyse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="n"&gt;meters&lt;/span&gt;
&lt;span class="k"&gt;cross&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="k"&gt;lateral&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt;
        &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;reading&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt;
        &lt;span class="n"&gt;readings&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt;
        &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt;
        &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
    &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Execution Time: 13.814 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx17j3wnef91e6ca87v8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx17j3wnef91e6ca87v8b.png" alt="explain-analyze" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alright, now we're talking! This is much faster, but let's find out why. For starters - you don't see the &lt;code&gt;183500&lt;/code&gt; rows anywhere at all in the explain plan! We are also not sorting anything anymore, because the index keeps our data sorted already. &lt;/p&gt;

&lt;p&gt;But, let's push the envelope to see how far we can take this. Let's open the &lt;code&gt;IO &amp;amp; Buffers&lt;/code&gt; tab of the index scan node in the above diagram and have a look in there. You don’t get buffers information by default, so make sure to use the &lt;a href="https://postgres.ai/blog/20220106-explain-analyze-needs-buffers-to-improve-the-postgres-query-optimization-process" rel="noopener noreferrer"&gt;buffers&lt;/a&gt; option. Here it is: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugq601b5q93fiyjfufet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugq601b5q93fiyjfufet.png" alt="explain-plans" width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's take a step back and consider how index-based retrieval works at a high level. There are essentially two steps. First, the B-tree index is traversed to find the rows that match the query predicate. After that, Postgres still has to go and fetch those rows from the table (the heap). &lt;/p&gt;

&lt;p&gt;As we can see from the explain plan, performing the two steps described above amounts to &lt;code&gt;2000&lt;/code&gt; blocks read, so about &lt;code&gt;16MiB&lt;/code&gt; (2000 * 8 kibibytes). A block (or page) is the fundamental unit by which Postgres stores data, have a look &lt;a href="https://www.postgresql.org/docs/current/storage-page-layout.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; for details. I should also add that when you see &lt;code&gt;Nested loop&lt;/code&gt; nodes in explain plans, be careful not to conclude that the buffer count displayed is a count of &lt;em&gt;distinct&lt;/em&gt; buffers - if Postgres reads the same block several times, each read is simply added to the total.&lt;/p&gt;

&lt;p&gt;Let's try to put the &lt;code&gt;2000&lt;/code&gt; blocks in context and try to interpret it a bit. For example we can ask ourselves: how many blocks does the table &lt;code&gt;readings&lt;/code&gt; have in total?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT relpages FROM pg_class WHERE relname = 'readings';
 relpages 
----------
     1350
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Answer: 1350. So the index-based approach reads &lt;em&gt;more&lt;/em&gt; blocks than there are in the entire table! If we did a sequential scan instead and simply read the whole table, sorted, and discarded, like in the approaches before this one, we'd read only 1350 blocks. What you're witnessing is a trade-off to watch out for and factor into your design: the index makes the query faster (no sorts), but it ends up touching more blocks, i.e. doing more I/O. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Side note&lt;/u&gt;&lt;/strong&gt;: it is good to know that on Amazon Aurora and other cloud databases, you pay for every block that has to be retrieved from storage because it couldn't be found in memory. A nice cautionary tale about keeping I/O under control on Aurora is &lt;a href="https://gridium.com/migrating-to-aurora-easy-except-the-bill/" rel="noopener noreferrer"&gt;this one&lt;/a&gt;. That said, the situation in my generated dataset is a byproduct of how "narrow" my tables are (a small number of columns) - it's an artificial setup after all. If there were more columns, the table would have many more pages, so traversing the B-tree would account for comparatively far fewer blocks than there are in the table itself. &lt;/p&gt;

&lt;p&gt;Let's see if we can reduce the number of blocks. You might wonder, can we avoid the extra step (reading from the table after reading from the index)? Exactly! &lt;/p&gt;

&lt;h3&gt;
  
  
  5. Loose index-only scans: &lt;code&gt;5ms&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;We can get a so-called &lt;code&gt;index-only scan&lt;/code&gt;. To make it possible, we use the &lt;code&gt;include&lt;/code&gt; option when creating the index, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;idx_with_including&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now retry the query and look at the relevant tabs again. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi9lp4arzimbjhk66qqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi9lp4arzimbjhk66qqd.png" alt="explain-plan" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmipzxnczaodepezopmku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmipzxnczaodepezopmku.png" alt="explain-plans" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two important things to observe here. First, notice the &lt;code&gt;Heap Fetches: 0&lt;/code&gt;, which indicates that Postgres does not go to the heap to get the rows, because they are already in the index. Secondly, the total number of blocks is now &lt;code&gt;1501&lt;/code&gt;, about 500 fewer than before. Another confirmation that it indeed doesn't read anything from the heap.  &lt;/p&gt;

&lt;p&gt;If you try this out and don't see &lt;code&gt;Heap Fetches: 0&lt;/code&gt; in your setup, you might wonder why. This happens when the &lt;a href="https://www.postgresql.org/docs/current/storage-vm.html" rel="noopener noreferrer"&gt;visibility map&lt;/a&gt; is not up to date, and Postgres has no option but to go to the heap to check the visibility information for each row. As the visibility map is kept up to date by the &lt;a href="https://www.postgresql.org/docs/current/routine-vacuuming.html" rel="noopener noreferrer"&gt;&lt;code&gt;autovacuum&lt;/code&gt;&lt;/a&gt; process, it is important to revisit its configuration settings regularly to make sure it is able to keep up.  &lt;/p&gt;
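&lt;p&gt;To check how up to date the visibility map is for our table, we can compare &lt;code&gt;relallvisible&lt;/code&gt; with &lt;code&gt;relpages&lt;/code&gt; in &lt;code&gt;pg_class&lt;/code&gt;. A quick sketch (a manual &lt;code&gt;vacuum&lt;/code&gt; brings the map up to date):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vacuum readings;
-- if relallvisible is far below relpages, expect heap fetches
select relallvisible, relpages from pg_class where relname = 'readings';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;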

&lt;p&gt;Let's look at the timing. How fast did we get it?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Execution Time: 5.448 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great stuff! Let's now get a better understanding of how exactly the B-tree index is traversed to retrieve the data. In Postgres, a B-tree has three types of nodes: the root node, intermediate nodes (used only for traversal) and leaf nodes (these contain what we’re interested in: pointers to tuples on the heap, or included data). Here's a diagram showing what happens, where each level contains the type of nodes I just mentioned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9jb0nvhnklv22m6m742.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9jb0nvhnklv22m6m742.png" alt="b-tree-vis" width="800" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each row returned in the final result, it performs three index page reads: first the root page, then an intermediate B-tree page, and lastly the leaf page, from which it collects the included reading. In the diagram above, I have marked these steps with &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;3&lt;/code&gt;; they are repeated 500 times.&lt;/p&gt;
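&lt;p&gt;If you'd like to verify the depth of the tree yourself, the &lt;a href="https://www.postgresql.org/docs/current/pageinspect.html" rel="noopener noreferrer"&gt;pageinspect&lt;/a&gt; extension (assuming you're able to install extensions on your instance) exposes the B-tree metadata page:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create extension if not exists pageinspect;
-- "level" is the height of the tree above the leaves;
-- level 2 means root, intermediate, leaf: three page reads per descent
select root, level from bt_metap('idx_with_including');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;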

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Side note&lt;/u&gt;&lt;/strong&gt;: including columns in the index does increase its overall size, which consequently increases the time needed to traverse it. In some cases, this might make the difference, and it will influence the overall retrieval timing in such a way that it is not beneficial anymore.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Quite the journey we've been on! After seeing all these querying techniques, we've finally arrived at our destination - you've seen a loose index-only scan in action, which gave us our results in the shortest time. Note, though, that as the saying goes, there is no free lunch: every index gives the database more work to do on every insert, so you will have to decide on a case-by-case basis whether it's worth it, using measurements. The speedup also depends on the data distribution: in the scenario described above, we have many readings for every meter. Your mileage may vary. But it’s always worth a try! &lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://stackoverflow.com/a/7630564" rel="noopener noreferrer"&gt;Stack Overflow&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/yugabyte/loose-index-scan-aka-skip-scan-in-postgresql-1jfo"&gt;Loose Index Scan aka Skip Scan in PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.timescale.com/blog/select-the-most-recent-record-of-many-items-with-postgresql/" rel="noopener noreferrer"&gt;Select the Most Recent Record (of Many Items) With PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cybertec-postgresql.com/en/performance-tuning/" rel="noopener noreferrer"&gt;PERFORMANCE TUNING: MAX AND GROUP BY&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wiki.postgresql.%20org/wiki/Loose_indexscan" rel="noopener noreferrer"&gt;Loose index scan Postgres wiki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/message-id/c5c5c974714a47f1b226c857699e8680@opammb0561.comp.optiver.com" rel="noopener noreferrer"&gt;RE: Index Skip Scan&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading! &lt;/p&gt;

</description>
      <category>sql</category>
      <category>performance</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>To UUID, or not to UUID, that is the primary key question</title>
      <dc:creator>Mircea Cadariu</dc:creator>
      <pubDate>Fri, 02 Feb 2024 20:28:01 +0000</pubDate>
      <link>https://dev.to/mcadariu/using-uuids-as-primary-keys-3e7a</link>
      <guid>https://dev.to/mcadariu/using-uuids-as-primary-keys-3e7a</guid>
<description>&lt;p&gt;For most scenarios, it's beneficial to set up a primary key for your tables. However, you might be in doubt about what to go for. There are several options - in this post you'll learn why you should consider UUIDs. Like many other technical decisions it's a trade-off, but my goal is that you'll make an informed one and know what to expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are UUIDs
&lt;/h2&gt;

&lt;p&gt;The abbreviation stands for &lt;strong&gt;U&lt;/strong&gt;niversally &lt;strong&gt;U&lt;/strong&gt;nique &lt;strong&gt;ID&lt;/strong&gt;entifier. UUIDs are sequences of 128 bits, invented to uniquely identify data in computer systems. In Postgres, there is a dedicated UUID &lt;a href="https://www.postgresql.org/docs/current/datatype-uuid.html" rel="noopener noreferrer"&gt;data type&lt;/a&gt; you can use for columns. This is what they look like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Note that what you see above is the hex-encoded string representation, it is not how they are stored internally. &lt;/p&gt;
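&lt;p&gt;You can generate one yourself; since Postgres 13, &lt;code&gt;gen_random_uuid()&lt;/code&gt; is built in and produces a random (version 4) UUID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select gen_random_uuid();
-- returns a random version 4 UUID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;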

&lt;h2&gt;
  
  
  Primary key options
&lt;/h2&gt;

&lt;p&gt;When creating a database table, one of the first decisions you make is whether to use a &lt;a href="https://www.cockroachlabs.com/blog/how-to-choose-a-primary-key/" rel="noopener noreferrer"&gt;natural or synthetic&lt;/a&gt; primary key. If you go for a synthetic one, UUIDs are an option. They are not the &lt;em&gt;only&lt;/em&gt; option. Like many other software-related decisions, this is not clear-cut, and the trade-offs involved can be a source of long debates. For contrast, let's have a look at probably the most popular alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-incrementing integers
&lt;/h3&gt;

&lt;p&gt;You can use auto-incrementing integers (1, 2, 3... etc.) generated by the database. Postgres creates these IDs using an object called a &lt;a href="https://www.postgresql.org/docs/current/sql-createsequence.html" rel="noopener noreferrer"&gt;&lt;code&gt;sequence&lt;/code&gt;&lt;/a&gt;. It's essentially a single-row table that keeps track of the current number and can hand out the next one. &lt;/p&gt;
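&lt;p&gt;A minimal illustration of a sequence at work (the name is made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create sequence user_id_seq;
select nextval('user_id_seq'); -- 1
select nextval('user_id_seq'); -- 2
-- serial, bigserial and identity columns use such a sequence behind the scenes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;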

&lt;p&gt;Advantages: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;easier to work with (you can remember them)&lt;/li&gt;
&lt;li&gt;predictable &lt;/li&gt;
&lt;li&gt;occupy a smaller size on disk &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disadvantages: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;can expose information, like user count&lt;/li&gt;
&lt;li&gt;potential bottleneck in distributed systems (centralisation)&lt;/li&gt;
&lt;li&gt;need to sync them when upgrading using replication&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  UUIDs
&lt;/h3&gt;

&lt;p&gt;Let's have a look at UUIDs now. &lt;/p&gt;

&lt;p&gt;Advantages: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more secure - you can't guess them&lt;/li&gt;
&lt;li&gt;great fit for &lt;a href="https://questdb.io/blog/uuid-coordination-free-unique-keys" rel="noopener noreferrer"&gt;use-cases&lt;/a&gt; where centralised generation of IDs is not feasible &lt;/li&gt;
&lt;li&gt;no need to reset sequences in upgrades / migrations &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disadvantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;occupy larger size on disk&lt;/li&gt;
&lt;li&gt;harder to work with, you can't remember them&lt;/li&gt;
&lt;li&gt;randomness impacts internal operations (B-Tree operations)&lt;/li&gt;
&lt;li&gt;WAL amplification &lt;a href="https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/" rel="noopener noreferrer"&gt;1&lt;/a&gt; &lt;a href="https://www.2ndquadrant.com/en/blog/sequential-uuid-generators-ssd/" rel="noopener noreferrer"&gt;2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;can't be used directly to build pagination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's say your use-case is better served by UUIDs as a primary key. What are you getting into? To understand the trade-off, let's start by looking at the data structure that Postgres creates for every primary key you create. &lt;/p&gt;

&lt;h2&gt;
  
  
  The structure of B-tree indexes
&lt;/h2&gt;

&lt;p&gt;For every primary key you define, Postgres automatically creates an index. This index is backed by a B-tree data structure. In order for B-trees to support operations like primary key lookups and range queries very fast, the index pages are kept balanced and sorted at all times. Here is what one looks like, from the paper by &lt;a href="https://dl.acm.org/doi/pdf/10.1145/319628.319663" rel="noopener noreferrer"&gt;P. Lehman and S. Yao&lt;/a&gt; (on which the Postgres implementation is based).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sgdo515m5p0innoitkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sgdo515m5p0innoitkl.png" alt="tree" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The level-based structure makes reads faster because the number of steps to find and return a specific entry (in the picture, the "associated information" node) is significantly reduced. &lt;/p&gt;

&lt;h2&gt;
  
  
  Random vs. Time-ordered UUIDs
&lt;/h2&gt;

&lt;p&gt;Given how B-trees work, random UUIDs are not ideal, because a lot of "work" (page modifications) has to happen to keep the tree pages balanced and sorted with every new element we store. &lt;/p&gt;

&lt;p&gt;Can this be alleviated? Let's have a look at &lt;a href="https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html#name-uuid-version-7" rel="noopener noreferrer"&gt;UUID version 7&lt;/a&gt;. An important difference with other versions is that it has a timestamp-based component at the beginning, which means they manifest a natural ordering. This is good news! The database will do less work to keep the B-tree balanced, as all new elements will always go on the "right side", and a lot of the rest of the tree will remain untouched.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://fillumina.wordpress.com/2023/02/06/the-primary-key-dilemma-id-vs-uuid-and-some-practical-solutions/" rel="noopener noreferrer"&gt;TSIDs&lt;/a&gt; are another variant of UUIDs. Vlad Mihalcea considers this &lt;a href="https://vladmihalcea.com/uuid-database-primary-key" rel="noopener noreferrer"&gt;the best option&lt;/a&gt; of UUIDs for primary key. While also being time-ordered like the UUIDv7, they can be stored in a 8 bytes &lt;code&gt;bigint&lt;/code&gt; data type, instead of 16 like the v7. &lt;/p&gt;

&lt;p&gt;Let's conduct a small experiment with these variants mentioned so far (random UUIDs, UUIDv7 and TSIDs). For this, I'm inserting all at once about &lt;code&gt;40k&lt;/code&gt; rows in a simple table which has a UUID as primary key. In the case of the TSID, I use &lt;code&gt;bigint&lt;/code&gt;.&lt;/p&gt;
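&lt;p&gt;The tables can be as simple as this (hypothetical names; one table per id variant):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table experiment_uuid (id uuid primary key);   -- random UUIDs and UUIDv7
create table experiment_tsid (id bigint primary key); -- TSIDs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;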

&lt;p&gt;You can't yet generate all these UUID types in Postgres directly. In Java, there are several libraries you can use to generate them. One option for UUID v7 is &lt;a href="https://github.com/cowtowncoder/java-uuid-generator" rel="noopener noreferrer"&gt;java-uuid-generator&lt;/a&gt;, and for TSIDs: &lt;a href="https://github.com/vladmihalcea/hypersistence-tsid" rel="noopener noreferrer"&gt;hypersistence-tsid&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment: inserting 40k rows
&lt;/h2&gt;

&lt;p&gt;I ran the same insert of the 40k rows, for 5 test runs per UUID type. When moving on to the next UUID type, I reset the statistics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results: timing
&lt;/h3&gt;

&lt;p&gt;It took roughly the same time for each. Despite having a lot more work to do to keep the B-tree balanced for the random ones, Postgres processed all the insertions very fast anyway. &lt;/p&gt;

&lt;p&gt;Looking only at timing is inconclusive. It's better to also look at the number of pages read and written (in other words, the amount of I/O) for each, as timing is more &lt;a href="https://postgres.ai/blog/20220106-explain-analyze-needs-buffers-to-improve-the-postgres-query-optimization-process" rel="noopener noreferrer"&gt;volatile&lt;/a&gt;. For example, in production we might or might not have the whole dataset present in the shared buffers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results: I/O
&lt;/h3&gt;

&lt;p&gt;I queried &lt;a href="https://www.postgresql.org/docs/current/monitoring-stats.html" rel="noopener noreferrer"&gt;pg_statio_all_indexes&lt;/a&gt; after each of the 5 runs per UUID type, and looked at the difference in the number of blocks for the primary key index. &lt;/p&gt;

&lt;p&gt;I noted down the following columns after every run: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;idx_blks_read&lt;/code&gt; (number of disk blocks read)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;idx_blks_hit&lt;/code&gt; (number of shared buffer hits)&lt;/li&gt;
&lt;/ul&gt;
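&lt;p&gt;The query I used looks roughly like this (the index name is whatever Postgres generated for your primary key):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select idx_blks_read, idx_blks_hit
from pg_statio_all_indexes
where indexrelname = 'experiment_uuid_pkey';

-- before switching to the next UUID type, reset the counters
select pg_stat_reset();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;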

&lt;p&gt;Let's have a look at the results!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TSID&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;idx_blks_read&lt;/th&gt;
&lt;th&gt;idx_blks_hit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;td&gt;88573&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;245&lt;/td&gt;
&lt;td&gt;177552&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;367&lt;/td&gt;
&lt;td&gt;266532&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;490&lt;/td&gt;
&lt;td&gt;327560&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;611&lt;/td&gt;
&lt;td&gt;372717&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Index size: &lt;code&gt;4888 kB&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UUID V7&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;idx_blks_read&lt;/th&gt;
&lt;th&gt;idx_blks_hit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;254&lt;/td&gt;
&lt;td&gt;88943&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;513&lt;/td&gt;
&lt;td&gt;163091&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;769&lt;/td&gt;
&lt;td&gt;233844&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1012&lt;/td&gt;
&lt;td&gt;310091&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1245&lt;/td&gt;
&lt;td&gt;387387&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Index size: &lt;code&gt;9960 kB&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random UUID&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;idx_blks_read&lt;/th&gt;
&lt;th&gt;idx_blks_hit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;234&lt;/td&gt;
&lt;td&gt;89027&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;459&lt;/td&gt;
&lt;td&gt;207315&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;628&lt;/td&gt;
&lt;td&gt;340988&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;920&lt;/td&gt;
&lt;td&gt;474911&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1149&lt;/td&gt;
&lt;td&gt;608634&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Index size: &lt;code&gt;8968 kB&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We can see that for the UUIDv7 and TSID runs, the numbers grow at a different rate than for the random ones. Zooming in on the &lt;code&gt;idx_blks_read&lt;/code&gt; column, the blocks read from disk, we see that TSIDs accumulated only about half as many as the others. Expected, due to the data type storage difference (8 vs 16 bytes). We see this proportion reflected in the total size of the index as well. &lt;/p&gt;

&lt;p&gt;I made a simple graph with the number of blocks added up (both from memory and from disk) after each run where it is visible that the growth is indeed faster for random ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w5wvjc6i55q1naxdnsc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w5wvjc6i55q1naxdnsc.png" alt="Image" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One thing that I found initially surprising looking at the results, is the number of index blocks from shared buffers appearing in the statistics (&lt;code&gt;idx_blks_hit&lt;/code&gt;). When inserting ~44k rows with TSIDs as primary keys, we see ~88k in the &lt;code&gt;idx_blks_hit&lt;/code&gt;. This looks like roughly 2 page hits for every index entry that we have inserted. What exactly are these 2 buffer hits per element? I was kind of expecting just 1 - the rightmost page of the index. Let's explore the Postgres source code in order to understand why.&lt;/p&gt;

&lt;h3&gt;
  
  
  Postgres B-tree index fastpath optimisation
&lt;/h3&gt;

&lt;p&gt;The answer can be found in &lt;a href="https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtinsert.c#L282" rel="noopener noreferrer"&gt;this comment&lt;/a&gt; in &lt;code&gt;nbtinsert.c&lt;/code&gt;. One hit is for the root page; the other is for the page where the new entry will be inserted, which we reach from the root page on every insert. &lt;/p&gt;

&lt;p&gt;But note a &lt;code&gt;fastpath optimization&lt;/code&gt; is mentioned (caching the rightmost leaf page). When the conditions are met to apply this, it doesn't read the root page anymore for every insert. Why is it not happening for my experiment though?&lt;/p&gt;

&lt;p&gt;The answer to this question is &lt;a href="https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtinsert.c#L1422" rel="noopener noreferrer"&gt;here&lt;/a&gt;, again in &lt;code&gt;nbtinsert.c&lt;/code&gt;. This optimisation &lt;a href="https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/nbtinsert.c#L1422" rel="noopener noreferrer"&gt;does not get applied&lt;/a&gt; for small indexes like the one I created with 44k entries. Fair enough. Let me try out a larger one, say 30 million entries, to see this optimisation in action.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;experiment&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;bigserial&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;experiment&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can now query &lt;code&gt;pg_statio_all_indexes&lt;/code&gt; and expect to &lt;em&gt;not&lt;/em&gt; see 30 million x2 in the &lt;code&gt;idx_blks_hit&lt;/code&gt; column. And indeed! I get just &lt;code&gt;30639018&lt;/code&gt; and not ~&lt;code&gt;60000000&lt;/code&gt;. This is the effect of the fastpath optimisation: the rightmost leaf is cached, so it's accessed directly and not through the root.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitioning and UUIDs
&lt;/h2&gt;

&lt;p&gt;A bonus for the UUIDv7 or TSIDs, you can set up &lt;a href="https://elixirforum.com/t/partitioning-postgres-tables-by-timestamp-based-uuids/60916" rel="noopener noreferrer"&gt;partitioning based on the timestamp component&lt;/a&gt;. Why does this matter?&lt;/p&gt;

&lt;p&gt;Partitioning is very useful because it helps with predictability of operations. It makes reads faster because, while the table might grow indefinitely, the database will not scan it in its entirety, but only a smaller part of it (the relevant partition). Also, if you need to insert a lot of data at once, partitioning enables constant &lt;a href="https://aws.amazon.com/blogs/database/designing-high-performance-time-series-data-tables-on-amazon-rds-for-postgresql" rel="noopener noreferrer"&gt;ingestion rates&lt;/a&gt;. &lt;/p&gt;
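&lt;p&gt;A sketch of what this could look like (hypothetical table and partition names); here the timestamp lives in its own column, kept in sync with the time component embedded in the id:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table events (
    id         uuid        not null,  -- UUIDv7
    created_at timestamptz not null,
    primary key (id, created_at)      -- the partition key must be part of the PK
) partition by range (created_at);

create table events_2024_02 partition of events
    for values from ('2024-02-01') to ('2024-03-01');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;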

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Timestamp-based UUIDs soften the negative trade-offs involved in choosing between UUIDs and auto-incrementing numeric primary keys. It still depends on your specific use-case though. I suggest looking at TSIDs, because they generated the least I/O in my experiments and can also be stored in 8 bytes instead of 16, which is a welcome bonus. &lt;/p&gt;

&lt;p&gt;Especially with cloud databases like Aurora from AWS being adopted more and more, we're working with a different pricing model: pay per I/O consumed (the requests to read from storage when a block can't be found in the shared buffers). Applying strategies like going for modern UUIDs, which the database can process more efficiently, helps keep the costs low. Even if you're not looking to reduce the bill, being aware of, and reducing, the amount of work the database has to do in the background gets you more overall throughput and better resource utilisation. If Postgres spends less time keeping the B-tree balanced, it can handle more of your business use-cases instead. &lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;Photo by James Wheeler: &lt;a href="https://www.pexels.com/photo/photo-of-pathway-surrounded-by-fir-trees-1578750/" rel="noopener noreferrer"&gt;https://www.pexels.com/photo/photo-of-pathway-surrounded-by-fir-trees-1578750/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>sql</category>
      <category>database</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
