Hello there!
Since 2019, I have been developing an open-source tool called pg-index-health, which analyzes database structures and identifies potential issues. In one of my previous articles, I shared the story of how this tool came to life.
Over the years, pg-index-health has evolved and improved. In 2024, with the support of several contributors, I managed to address most of its remaining "growing pains" and bring the project to a state where it is ready for large-scale expansion.
The Growth of Databases with the Rise of Microservices
I've been working with PostgreSQL since 2015, and this fascinating journey began at the Yaroslavl-based company Tensor.
Back in 2015, it was still the era of monoliths with massive databases and a large number of tables. Typically, any changes to the structure of such databases required mandatory approval from an architect or a development lead, who served as the key knowledge holder. While this safeguarded against most errors, it slowed down the process of making changes and was entirely unscalable.
Gradually, people began transitioning to microservices.
The number of databases grew significantly, but the number of tables within each database, conversely, decreased. Now, each team started managing the structure of its own database independently. The centralized source of expertise disappeared, and database design errors began to multiply and propagate from one service to another.
The Testing Pyramid and Its Shapes
Most of you have probably heard of the testing pyramid. For monoliths, it has a fairly characteristic shape with a broad base of unit tests. For more details, I recommend Martin Fowler's article.
Microservices have changed not only the approach to development but also the appearance of the testing pyramid. This shift was largely driven by the rise of containerization technologies (Docker, Testcontainers). Today the testing pyramid is no longer a pyramid at all. It can have a very bizarre shape. The most well-known examples are the Honeycomb and the Testing Trophy.
The modern trend is to write as few unit tests (which focus on implementation details) as possible and to prioritize component and integration tests that validate the actual functionality the service provides.
My personal favorite is the Testing Trophy. At its base lies static code analysis, designed to guard against common errors.
The Importance of Static Code Analysis
Static analysis for Java and Kotlin code is now common practice. For Kotlin services, the tool of choice is typically detekt. For Java applications, the range of available tools (often referred to as linters) is broader. The main tools include Checkstyle, PMD, SpotBugs, and Error Prone. You can read more about them in my previous article.
Notably, both detekt and Checkstyle also handle code formatting, effectively functioning as formatters.
Static Analysis for Database Migrations
Modern microservices often include database migrations for creating and updating the database structure alongside application code.
In the Java ecosystem, the main tools for managing migrations are Liquibase and Flyway. Any changes to the database structure must always be documented in migrations. Even if changes are made manually during an incident in production, a migration must later be created to apply those changes across all environments.
Writing migrations in plain SQL is the best practice as it provides maximum flexibility and saves time compared to learning the XML dialects of tools like Liquibase. I touched on this in my article, "Six Tips for Using PostgreSQL in Functional Tests".
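For example, with Flyway a plain SQL migration is just a file that follows the V<version>__<description>.sql naming convention. Here is a tiny sketch; the file name, table, and column are made up purely for illustration:

-- V002__add_warehouse_id_to_order_item.sql (illustrative file name)
alter table if exists demo.order_item
    add column if not exists warehouse_id int;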
Verifying SQL Migration Code
To verify the SQL code in migrations, I recommend using SQLFluff, which is essentially a Checkstyle equivalent for SQL. This linter supports multiple databases and dialects (including PostgreSQL) and can be integrated into your CI pipeline. It offers over 60 customizable rules, enabling you to manage table and column aliases, SQL command casing, indentation, column ordering in queries, and much more.
Compare the query with and without formatting:
-- well-formatted SQL
select
    pc.oid::regclass::text as table_name,
    pg_table_size(pc.oid) as table_size
from
    pg_catalog.pg_class pc
    inner join pg_catalog.pg_namespace nsp on nsp.oid = pc.relnamespace
where
    pc.relkind = 'r' and
    pc.oid not in (
        select c.conrelid as table_oid
        from pg_catalog.pg_constraint c
        where c.contype = 'p'
    ) and
    nsp.nspname = :schema_name_param::text
order by table_name;
-- poorly formatted SQL
SELECT pc.oid::regclass::text AS table_name, pg_table_size(pc.oid) AS table_size
FROM pg_catalog.pg_class pc
JOIN pg_catalog.pg_namespace AS nsp
ON nsp.oid = pc.relnamespace
WHERE pc.relkind = 'r'
and pc.oid NOT in (
select c.conrelid as table_oid
from pg_catalog.pg_constraint c
where c.contype = 'p'
)
and nsp.nspname = :schema_name_param::text
ORDER BY table_name;
Well-formatted SQL code is much easier to read and understand. Most importantly, code reviews will no longer be bogged down by discussions about formatting preferences. SQLFluff enforces a consistent style, saving time.
SQLFluff in Action
This is what it looks like in a real pull request:
Here SQLFluff found a formatting problem with the return value in the select statement: when only one column is returned, we do not put it on a separate line. The second issue is the incorrect column order in the select list: we return plain columns first and only then the computed expressions. And the third is the incorrect casing of and in the join condition: I prefer to write all queries in lowercase.
For more examples of SQLFluff in use, check out my open-source projects: one, two.
Using Metadata to Analyze Database Structure
The structure of the database itself can also be checked. However, working with migrations directly is extremely inconvenient: there can be many of them, a newer migration may fix errors introduced by an earlier one, and so on. As a rule, we are more interested in the final structure of the database than in its intermediate states.
Leveraging Information Schema
PostgreSQL (like many other relational databases) stores metadata about all objects and relationships between them and provides it externally in the form of information_schema. We can use queries to information_schema to identify any deviations, problems, or common errors (this is exactly what SchemaCrawler does).
Since we work only with PostgreSQL, instead of information_schema we can use system catalogs (the pg_catalog schema), which provide much more information about the internal structure of a specific database.
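As a small illustration of the approach (a simplified sketch of my own, not the exact query shipped with pg-index-health), the system catalogs can reveal, for example, tables without a description in a given schema:

-- simplified sketch: tables in the 'demo' schema that have no comment
select pc.oid::regclass::text as table_name
from pg_catalog.pg_class pc
    inner join pg_catalog.pg_namespace nsp on nsp.oid = pc.relnamespace
where
    pc.relkind = 'r'
    and nsp.nspname = 'demo'
    and obj_description(pc.oid, 'pg_class') is null
order by table_name;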
Cumulative Statistics System
In addition to metadata, PostgreSQL collects information about the operation of each database: what queries are executed, how they are executed, what access methods are used, etc. The Cumulative Statistics System is responsible for collecting this data.
By querying these statistics through system views and combining them with data from the system catalogs, we can:
- Identify unused indexes;
- Detect tables that lack adequate indexing.
Statistics can be reset manually. The date and time of the last reset are recorded in the system; it is important to take this into account to understand whether the statistics can be trusted. For example, if some business logic runs only once a month, quarter, or half a year, the statistics need to cover at least that interval.
If a database cluster is used, then statistics are collected independently on each host and are not replicated within the cluster.
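As an illustration (again, a simplified sketch rather than the exact queries used by the tool), indexes that have never been scanned since the last reset can be found via pg_stat_all_indexes, and the reset timestamp can be read from pg_stat_database:

-- simplified sketch: indexes with zero scans since the last statistics reset
select
    psai.schemaname,
    psai.relname as table_name,
    psai.indexrelname as index_name,
    pg_relation_size(psai.indexrelid) as index_size_bytes
from pg_catalog.pg_stat_all_indexes psai
where psai.schemaname = 'demo' and psai.idx_scan = 0
order by index_size_bytes desc;

-- when the statistics were last reset for the current database
select stats_reset from pg_catalog.pg_stat_database where datname = current_database();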
pg-index-health and Its Structure
I have implemented the idea described above (analyzing the database structure based on the metadata stored inside the database itself) in a tool called pg-index-health.
My solution includes the following components:
- A set of checks in the form of SQL queries, placed in a separate repository (currently consisting of 25 checks). The queries are decoupled from the Java codebase and can be reused in projects written in other programming languages.
- A domain model — a minimal set of classes that represent the results of the checks as objects.
- The HighAvailabilityPgConnection abstraction for connecting to a database cluster consisting of multiple hosts.
- Utilities for executing SQL queries and serializing results into domain model objects.
- A Spring Boot starter for convenient and quick integration of checks into unit/component/integration tests.
- A migration generator that can create corrective SQL migrations for identified issues.
Types of Checks
All checks (also referred to as diagnostics) are divided into two groups:
- Runtime checks (require statistics).
- Static checks (do not require statistics).
Runtime checks
Runtime checks are meaningful only when executed on a live database instance in production. These checks require accumulated statistics and aggregate this data from all hosts in the cluster.
Let’s consider a database cluster consisting of three hosts: primary, secondary, and async replica. Some services use clusters with similar topologies and execute heavy read queries only on the asynchronous replica to balance the load. Such queries are usually not executed on the primary host as they create additional load and negatively impact the latency of other queries.
As previously mentioned, in PostgreSQL, statistics are collected separately on each host and are not replicated within the cluster. Thus, you can easily encounter a situation where certain indexes are used and necessary only on the asynchronous replica. To make a reliable determination about whether an index is needed or not, it is necessary to run the check on each host in the cluster and aggregate the results.
Static checks
Static checks do not require accumulated statistics and can be executed on the primary host immediately after applying migrations. Of course, they can also be used on a production database to obtain data in real time. However, most checks are static, and they are particularly useful in tests, as they help catch and prevent common errors during the development phase.
How to Use pg-index-health
The primary use case for pg-index-health is adding tests to verify the database structure in your testing pipeline.
For Spring Boot applications, you need to add the starter to your test dependencies:
dependencies {
    testImplementation("io.github.mfvanek:pg-index-health-test-starter:0.14.4")
}
Then add a standard test:
import io.github.mfvanek.pg.core.checks.common.DatabaseCheckOnHost;
import io.github.mfvanek.pg.core.checks.common.Diagnostic;
import io.github.mfvanek.pg.model.dbobject.DbObject;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.ActiveProfiles;
import java.util.List;
import static org.assertj.core.api.Assertions.assertThat;
@SpringBootTest
@ActiveProfiles("test")
class DatabaseStructureStaticAnalysisTest {

    @Autowired
    private List<DatabaseCheckOnHost<? extends DbObject>> checks;

    @Test
    void checksShouldWork() {
        assertThat(checks)
            .hasSameSizeAs(Diagnostic.values());

        checks.stream()
            .filter(DatabaseCheckOnHost::isStatic)
            .forEach(c -> assertThat(c.check())
                .as(c.getDiagnostic().name())
                .isEmpty());
    }
}
In this test, all available checks are injected as a list. Then, only static checks are filtered and executed on the real database deployed in a container with applied migrations.
Ideally, each check should return an empty list. If the next migration introduces any deviations, the test will fail. The developer will have to pay attention to this and resolve the problem one way or another: either fix it in their migration or explicitly ignore the finding.
False Positives and Adding Exclusions
It is important to understand that pg-index-health, like any other static analyzer, may generate false positives. Additionally, some checks may not be relevant to your project. For example, it is considered good practice to document the database structure. PostgreSQL allows adding comments to almost all database objects. In a migration, this might look like the following:
create table if not exists demo.warehouse
(
    id bigint primary key generated always as identity,
    name varchar(255) not null
);
comment on table demo.warehouse is 'Information about the warehouses';
comment on column demo.warehouse.id is 'Unique identifier of the warehouse';
comment on column demo.warehouse.name is 'Human readable name of the warehouse';
Within your team, you might agree not to do this. In that case, the results of corresponding checks (TABLES_WITHOUT_DESCRIPTION, COLUMNS_WITHOUT_DESCRIPTION, FUNCTIONS_WITHOUT_DESCRIPTION) become irrelevant for you.
You can either completely exclude these checks:
@Test
void checksShouldWork() {
    assertThat(checks)
        .hasSameSizeAs(Diagnostic.values());

    checks.stream()
        .filter(DatabaseCheckOnHost::isStatic)
        .filter(c -> c.getDiagnostic() != Diagnostic.TABLES_WITHOUT_DESCRIPTION &&
            c.getDiagnostic() != Diagnostic.COLUMNS_WITHOUT_DESCRIPTION)
        .forEach(c -> assertThat(c.check())
            .as(c.getDiagnostic().name())
            .isEmpty());
}
Or simply ignore their results:
@Test
void checksShouldWork() {
    assertThat(checks)
        .hasSameSizeAs(Diagnostic.values());

    checks.stream()
        .filter(DatabaseCheckOnHost::isStatic)
        .forEach(c -> {
            final ListAssert<? extends DbObject> listAssert = assertThat(c.check())
                .as(c.getDiagnostic().name());
            switch (c.getDiagnostic()) {
                case TABLES_WITHOUT_DESCRIPTION, COLUMNS_WITHOUT_DESCRIPTION -> listAssert.hasSizeGreaterThanOrEqualTo(0); // ignored
                default -> listAssert.isEmpty();
            }
        });
}
When introducing pg-index-health, you may often encounter situations where the database structure already has some deviations, but you do not want to address them immediately. At the same time, the check is relevant, and disabling it is not an option. In such cases, it is best to record (pin) the existing deviations directly in the test code:
@Test
void checksShouldWorkForAdditionalSchema() {
    final PgContext ctx = PgContext.of("additional_schema");

    checks.stream()
        .filter(DatabaseCheckOnHost::isStatic)
        .forEach(c -> {
            final ListAssert<? extends DbObject> listAssert = assertThat(c.check(ctx))
                .as(c.getDiagnostic().name());
            switch (c.getDiagnostic()) {
                case TABLES_WITHOUT_DESCRIPTION, TABLES_NOT_LINKED_TO_OTHERS ->
                    listAssert.hasSize(1)
                        .asInstanceOf(list(Table.class))
                        .containsExactly(
                            Table.of(ctx, "additional_table")
                        );
                default -> listAssert.isEmpty();
            }
        });
}
Now, I would like to focus in more detail on the most frequently encountered issues.
Tables Without Primary Keys
Due to the specifics of the MVCC mechanism in PostgreSQL, situations like bloat can occur, where the size of your table (or index) rapidly grows due to a large number of dead tuples. This can happen, for example, as a result of long-running transactions or a one-time update of a large number of rows.
Garbage collection within the database is handled by the autovacuum process, but it does not free up the physical disk space occupied. The only way to effectively reduce the physical size of a table is by using the VACUUM FULL command, which requires an exclusive lock for the duration of the operation. For large tables, this can take several hours, making full vacuuming impractical for most modern services.
To address the issue of table bloat without downtime, third-party extensions like pg_repack are often used. One of the mandatory requirements of pg_repack is the presence of a primary key or some other uniqueness constraint on the target table. The TABLES_WITHOUT_PRIMARY_KEY diagnostic helps detect tables without primary keys and prevents maintenance problems in the future.
Below is an example of a table without a primary key. If bloat occurs in this table, pg_repack will be unable to process it and will return an error.
create table if not exists demo.payment
(
    id bigint not null, -- column is not marked as primary key
    order_id bigint references demo.orders (id),
    status int not null,
    created_at timestamp not null,
    payment_total decimal(22, 2) not null
);
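One possible corrective migration is sketched below; the index and constraint names are illustrative, and it assumes the existing id values are already unique:

-- build the unique index without blocking writes, then promote it to a primary key
create unique index concurrently if not exists idx_payment_pk on demo.payment (id);
alter table demo.payment
    add constraint payment_pkey primary key using index idx_payment_pk;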
Duplicated Indexes
Our databases operate on hosts with limited resources, and disk space is one of them. When using Database-as-a-Service solutions, there is often a physical limitation on the maximum database size that cannot be changed.
Each index in a table is a separate entity on the disk. It occupies space and requires resources for maintenance, which slows down data insertion and updates. We create indexes to speed up searches or ensure the uniqueness of certain values. However, improper use of indexes can lead to a situation where their combined size exceeds the size of the useful data in the table itself. Thus, the number of indexes in a table should be minimal yet sufficient for its functionality.
I have encountered numerous cases where unnecessary indexes are created in migrations. For example, an index for a primary key is created automatically. While it is technically possible to manually index the id column, doing so is completely pointless.
create table if not exists demo.buyer
(
    id bigint primary key default nextval('demo.buyer_seq'), -- PK
    first_name varchar(255) not null,
    last_name varchar(255) not null,
    middle_name varchar(255),
    phone varchar(20) not null,
    email varchar(50) not null,
    ip_address varchar(100)
);

create unique index if not exists idx_buyer_pk on demo.buyer (id); -- possible but pointless
A similar situation arises with unique constraints. When you mark a column (or a group of columns) with the unique keyword, PostgreSQL automatically creates a unique index for that column (or group of columns). Manually creating additional indexes is unnecessary; doing so results in duplicated indexes. Such redundant indexes can and should be removed, and the DUPLICATED_INDEXES diagnostic can help identify them.
create table if not exists demo.order_item
(
    id bigint primary key default nextval('demo.order_item_seq'),
    order_id bigint not null references demo.orders (id),
    price decimal(22, 2) not null default 0,
    amount int not null default 0,
    sku varchar(255) not null,
    warehouse_id int not null,
    unique (sku, order_id) -- unique constraint
);

create unique index if not exists i_order_item_sku_order_id_unique on demo.order_item (sku, order_id); -- possible, but pointless
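The fix here (a sketch) is simply to drop the redundant index; the index created by the unique constraint remains and continues to enforce uniqueness:

-- note: drop index concurrently cannot be run inside a transaction block
drop index concurrently if exists demo.i_order_item_sku_order_id_unique;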
Overlapping (Intersecting) Indexes
Most indexes are created for a single column. When query optimization begins, more complex indexes may be added, involving multiple columns. This leads to scenarios where indexes are created for columns like A, A+B, and A+B+C. The first two indexes in this series can often be discarded since they are prefixes of the third (I recommend watching this video). Removing these redundant indexes can save significant disk space, and the INTERSECTED_INDEXES diagnostic is designed to detect such cases.
create table if not exists demo.buyer
(
    id bigint primary key default nextval('demo.buyer_seq'),
    first_name varchar(255) not null,
    last_name varchar(255) not null,
    middle_name varchar(255),
    phone varchar(20) not null
);

create index if not exists i_buyer_first_name
    on demo.buyer (first_name); -- can be removed
create index if not exists i_buyer_names
    on demo.buyer (first_name, last_name, middle_name);
create index if not exists i_buyer_id_phone
    on demo.buyer (id, phone); -- including the primary key in an index is almost always a bad idea
Foreign Keys Without Indexes
PostgreSQL allows creating foreign key constraints without a supporting index: the referencing table neither requires nor automatically gets an index on the foreign key column. In some cases, this might not be a problem and may not manifest at all. However, sometimes it can lead to incidents in production.
Let's look at a small example (I’m using PostgreSQL 16.6):
create schema if not exists demo;

create table if not exists demo.orders
(
    id bigint primary key generated always as identity,
    user_id bigint not null,
    shop_id bigint not null,
    status int not null,
    created_at timestamptz not null default current_timestamp
);

create table if not exists demo.order_item
(
    id bigint primary key generated always as identity,
    order_id bigint not null references demo.orders (id),
    price decimal(22, 2) not null default 0,
    amount int not null default 0,
    sku varchar(255) not null,
    warehouse_id int
);
We have an orders table and an order_item table. They are linked via a foreign key on the order_id column. The foreign key should always reference either the primary key or some unique constraint, which is satisfied in our case.
Let's populate the tables with data and gather statistics. We'll add 100,000 orders, with half having two items and the others having one.
insert into demo.orders (user_id, shop_id, status)
select
    (ids.id % 10) + 1 as user_id,
    (ids.id % 4) + 1 as shop_id,
    1 as status -- new order
from generate_series(1, 100000) ids (id);

insert into demo.order_item (order_id, price, amount, sku)
select
    id as order_id,
    (random() + 1) * 1000.0 as price,
    (random() * 10) + 1 as amount,
    md5(random()::text) as sku
from demo.orders;

insert into demo.order_item (order_id, price, amount, sku)
select
    id as order_id,
    (random() + 1) * 2000.0 as price,
    (random() * 5) + 1 as amount,
    md5((random() + 1)::text) as sku
from demo.orders where id % 2 = 0;

-- gather statistics
vacuum analyze demo.orders, demo.order_item;
If we attempt to retrieve the items for the order with ID=100, we should successfully return 2 rows. Since there is an index on the id column in the orders table, it might seem that this query should be fast.
select oi.id, oi.order_id, oi.price, oi.amount, oi.sku, o.user_id, o.shop_id, o.status
from demo.orders o
join demo.order_item oi on oi.order_id = o.id
where o.id = 100 -- indexed on this column
order by oi.id;
However, if we profile this query, we will see a sequential scan of the table in the execution plan. We should also be concerned about the large number of pages that need to be read (the Buffers parameter).
explain (analyze, buffers)
select oi.id, oi.order_id, oi.price, oi.amount, oi.sku, o.user_id, o.shop_id, o.status
from demo.orders o
join demo.order_item oi on oi.order_id = o.id
where o.id = 100 -- indexed on this column
order by oi.id;
Sort  (cost=3588.34..3588.35 rows=2 width=79) (actual time=25.244..25.248 rows=2 loops=1)
  Sort Key: oi.id
  Sort Method: quicksort  Memory: 25kB
  Buffers: shared hit=1711
  ->  Nested Loop  (cost=0.29..3588.33 rows=2 width=79) (actual time=0.083..25.212 rows=2 loops=1)
        Buffers: shared hit=1711
        ->  Index Scan using orders_pkey on orders o  (cost=0.29..8.31 rows=1 width=28) (actual time=0.040..0.052 rows=1 loops=1)
              Index Cond: (id = 100)
              Buffers: shared hit=6
        ->  Seq Scan on order_item oi  (cost=0.00..3580.00 rows=2 width=59) (actual time=0.036..25.145 rows=2 loops=1)
              Filter: (order_id = 100)
              Rows Removed by Filter: 149998
              Buffers: shared hit=1705
Planning:
  Buffers: shared hit=54
Planning Time: 0.793 ms
Execution Time: 25.314 ms
If we create an index for a column with a foreign key, the situation will return to normal:
create index concurrently if not exists idx_order_item_order_id
on demo.order_item (order_id);
The sequential scan will disappear from the query plan, and the number of pages read will be significantly reduced:
Sort  (cost=19.54..19.55 rows=2 width=79) (actual time=0.110..0.113 rows=2 loops=1)
  Sort Key: oi.id
  Sort Method: quicksort  Memory: 25kB
  Buffers: shared hit=8
  ->  Nested Loop  (cost=0.71..19.53 rows=2 width=79) (actual time=0.076..0.086 rows=2 loops=1)
        Buffers: shared hit=8
        ->  Index Scan using orders_pkey on orders o  (cost=0.29..8.31 rows=1 width=28) (actual time=0.029..0.030 rows=1 loops=1)
              Index Cond: (id = 100)
              Buffers: shared hit=3
        ->  Index Scan using idx_order_item_order_id on order_item oi  (cost=0.42..11.20 rows=2 width=59) (actual time=0.033..0.040 rows=2 loops=1)
              Index Cond: (order_id = 100)
              Buffers: shared hit=5
Planning:
  Buffers: shared hit=15 read=1
Planning Time: 1.065 ms
Execution Time: 0.178 ms
The FOREIGN_KEYS_WITHOUT_INDEX diagnostic will allow you to catch such cases early during development, preventing performance issues.
It's important to remember the issue of false positives: not all foreign key columns need to be indexed. Try to estimate the approximate table size in production, and check your code for filtering, searching, or joining on the foreign key column. If you are 100% sure that you don't need the index, you can simply add it to the exclusions. If you're unsure, it's better to create the index (it can always be removed later). I've often encountered incidents where the database "slowed down" due to the absence of an index on a foreign key, but I haven't seen any incidents caused by the presence of such indexes. Therefore, I disagree with the point made in the Percona blog article that foreign key indexes should not be created from the start. That is a DBA's approach. Do you have a dedicated DBA on your team?
Null Values in Indexes
By default, PostgreSQL includes null values in btree indexes, but they are usually not needed there. All null values are unique, and you cannot simply retrieve a record where the column value is null. Most of the time, it is better to exclude nulls from the index by creating a partial index on the nullable column, such as where <A> is not null. The INDEXES_WITH_NULL_VALUES diagnostic helps detect such cases.
Let's consider an example with orders and order_items. The order_item table has a nullable column warehouse_id, which represents the warehouse ID.
create table if not exists demo.order_item
(
    id bigint primary key generated always as identity,
    order_id bigint not null references demo.orders (id),
    price decimal(22, 2) not null default 0,
    amount int not null default 0,
    sku varchar(255) not null,
    warehouse_id int
);
Suppose we have several warehouses. After the order is paid, we begin to assemble it. We will update the status of some orders and mark them as paid.
-- index for efficient searching
create index concurrently if not exists idx_orders_creates_at_status
    on demo.orders (created_at, status);

-- update the status for several orders
update demo.orders
set status = 2 -- paid order
where
    status = 1 -- new order
    and id in (
        select id from demo.orders where id % 4 = 0 order by id limit 100);
Individual items in an order may be shipped from different warehouses according to an internal algorithm, considering logistics, stock, warehouse load, etc. After assigning the warehouse and updating the stock, we update the warehouse_id field for each item in the order (which was initially null).
update demo.order_item
set warehouse_id = case when order_id % 8 = 0 then 1 else 2 end
where
    warehouse_id is null
    and order_id in (
        select id from demo.orders
        where
            status = 2
            and created_at >= current_timestamp - interval '1 day');
We will need to search by a specific warehouse ID to know which items need to be completed and shipped. We take only paid orders for a certain time range.
select oi.id, oi.order_id, oi.price, oi.amount, oi.sku, o.user_id, o.shop_id, o.status
from demo.orders o
join demo.order_item oi on oi.order_id = o.id
where
    oi.warehouse_id = 2
    and o.status = 2
    and o.created_at >= current_timestamp - interval '1 day'
order by oi.id;
The first solution would likely be a regular index on the warehouse_id column:
create index concurrently if not exists idx_order_item_warehouse_id
on demo.order_item (warehouse_id);
If we create such an index, it will be used without issues when searching for items for a specific warehouse. It might also seem that this index should allow us to efficiently find all items where the warehouse is not yet assigned, filtering records with the condition warehouse_id is null.
explain (analyze, buffers)
select * from demo.order_item where warehouse_id is null;
However, if we look at the query execution plan, we will see sequential access there — the index is not used.
Seq Scan on order_item  (cost=0.00..3208.00 rows=149755 width=63) (actual time=0.214..24.909 rows=149800 loops=1)
  Filter: (warehouse_id IS NULL)
  Rows Removed by Filter: 200
  Buffers: shared hit=1708
Planning Time: 0.291 ms
Execution Time: 33.720 ms
Of course, this is related to the specific distribution of data in the test database. The warehouse_id column has low cardinality, meaning the number of unique values in it is low, so an index on this column has low selectivity. Index selectivity is the ratio of the number of distinct indexed values (i.e., the cardinality) to the total number of rows in the table: count(distinct <column>) / count(*). For example, a unique index has a selectivity of one.
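To get a feel for the numbers, selectivity can be estimated with a quick ad-hoc query (not part of the tool):

-- rough selectivity estimate for the warehouse_id column
select count(distinct warehouse_id)::numeric / count(*) as selectivity
from demo.order_item;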
We can increase the selectivity of the index by removing null values and creating a partial index on the warehouse_id column.
create index concurrently if not exists idx_order_item_warehouse_id_without_nulls
on demo.order_item (warehouse_id) where warehouse_id is not null;
We will immediately see this index in the query plan:
Sort  (cost=860.61..860.61 rows=1 width=79) (actual time=0.628..0.640 rows=100 loops=1)
  Sort Key: oi.id
  Sort Method: quicksort  Memory: 35kB
  Buffers: shared hit=417
  ->  Nested Loop  (cost=0.44..860.60 rows=1 width=79) (actual time=0.052..0.546 rows=100 loops=1)
        Buffers: shared hit=417
        ->  Index Scan using idx_order_item_warehouse_id_without_nulls on order_item oi  (cost=0.14..246.83 rows=80 width=59) (actual time=0.023..0.069 rows=100 loops=1)
              Index Cond: (warehouse_id = 2)
              Buffers: shared hit=5
        ->  Index Scan using orders_pkey on orders o  (cost=0.29..7.67 rows=1 width=28) (actual time=0.003..0.003 rows=1 loops=100)
              Index Cond: (id = oi.order_id)
              Filter: ((status = 2) AND (created_at >= (CURRENT_TIMESTAMP - '1 day'::interval)))
              Buffers: shared hit=356
Planning:
  Buffers: shared hit=17
Planning Time: 0.835 ms
Execution Time: 0.702 ms
If we compare the sizes of the indexes, we will see a significant difference. The partial index is much smaller and will be updated less frequently. Using this index, we save disk space and improve performance.
Query to get the size of the indexes
select
    x.indrelid::regclass::text as table_name,
    x.indexrelid::regclass::text as index_name,
    pg_relation_size(x.indexrelid) as index_size_bytes
from pg_index x
join pg_stat_all_indexes psai on x.indexrelid = psai.indexrelid
where psai.schemaname = 'demo'::text;
| table_name | index_name | index_size_bytes |
|---|---|---|
| demo.order_item | demo.idx_order_item_warehouse_id | 1056768 |
| demo.order_item | demo.idx_order_item_warehouse_id_without_nulls | 16384 |
Plans for the Future
These are far from all the issues that pg-index-health can detect. The full list of diagnostics is available in the README of the project on GitHub, and it is regularly expanded.
Integrating pg-index-health into a Spring Boot application is quite simple. The overhead for running the checks is minimal. As a result, you will gain protection from common errors and issues. I encourage you to try implementing it!
In the near future, I plan to add full support for partitioned tables in all checks. Currently, this is implemented only for 11 out of 25 checks. I also want to expand the number of checks: there are already tickets for implementing at least 5 new checks. Additionally, in 2025, I plan to switch to Java 17 and Spring Boot 3.