<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sadeq Dousti</title>
    <description>The latest articles on DEV Community by Sadeq Dousti (@msdousti).</description>
    <link>https://dev.to/msdousti</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1228061%2Fa4e73d99-ad3f-4b36-946e-838e11745a2c.jpg</url>
      <title>DEV Community: Sadeq Dousti</title>
      <link>https://dev.to/msdousti</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/msdousti"/>
    <language>en</language>
    <item>
      <title>PostgreSQL + Outbox Pattern Revamped — Part 2</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Sun, 29 Jun 2025 23:01:26 +0000</pubDate>
      <link>https://dev.to/msdousti/postgresql-outbox-pattern-revamped-part-2-1cbf</link>
      <guid>https://dev.to/msdousti/postgresql-outbox-pattern-revamped-part-2-1cbf</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/msdousti/postgresql-outbox-pattern-revamped-part-1-3lai/"&gt;Part 1 of this series&lt;/a&gt; introduced the outbox pattern as a reliable approach for message delivery in distributed systems, while highlighting several implementation pitfalls that can lead to failures. These include sorting by &lt;code&gt;created_at&lt;/code&gt;, incorrect data types, suboptimal indexing, overly large batch selections, poison pill messages, and inefficient index or visibility checks. A key insight was the performance impact of stale indexes, especially in high-throughput systems. To address this, the post proposed a revamped design using partitioned tables, which simplifies cleanup and mitigates visibility issues. Below is a summary of the traditional vs. revamped design, along with the DDL for both. Throughout this article, we will use the red color to indicate the traditional design and the blue color for the revamped design. Notice the update to the &lt;code&gt;published_at&lt;/code&gt; column of the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition causes the record to be deleted from it and inserted into the &lt;code&gt;outbox_published&lt;/code&gt; partition, which is a key feature of the revamped design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89bifb5jjc8ejbeaqwvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89bifb5jjc8ejbeaqwvf.png" alt="Traditional vs. revamped outbox" width="605" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traditional (non-partitioned) outbox table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT8&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished_id_idx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Revamped partitioned outbox tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT8&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;LIST&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt;
        &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_published&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt;
        &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;outbox_new_unpublished_id_idx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
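
&lt;p&gt;To see the row movement in action, here is a small sketch (the payload value is arbitrary): a freshly inserted row lands in &lt;code&gt;outbox_unpublished&lt;/code&gt;, and setting its &lt;code&gt;published_at&lt;/code&gt; moves it to &lt;code&gt;outbox_published&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO outbox_new (payload) VALUES ('{"event": "demo"}');

-- tableoid::regclass reveals which partition physically holds the row;
-- at this point it is outbox_unpublished.
SELECT id, tableoid::regclass AS partition FROM outbox_new;

UPDATE outbox_new SET published_at = NOW();

-- The same row is now stored in outbox_published.
SELECT id, tableoid::regclass AS partition FROM outbox_new;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;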



&lt;p&gt;Although this approach is easily applicable to new systems, applying it to existing databases is less straightforward due to PostgreSQL's limitations around retrofitting table partitioning.&lt;/p&gt;

&lt;p&gt;In this second part, we first explore migration patterns from a "traditional" outbox table to a partitioned one, covering five scenarios and introducing a migration pattern for each. Next, we turn to the &lt;code&gt;autovacuum&lt;/code&gt; settings for the outbox tables and how to tune them for optimal performance. Finally, we discuss index maintenance for the outbox tables, which is crucial for high-throughput systems.&lt;/p&gt;

&lt;h1&gt;Table of Contents&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Outbox migration patterns
  * COP: Cold Outbox Partitioning
  * COPRA: Cold Outbox Partitioning w/ Rapid Attachment
  * HOP: Hot Outbox Partitioning
  * HOPER: Hot Outbox Partitioning w/ Eventual Replacement
  * HOPIA: Hot Outbox Partitioning w/ Immediate Access
Tuning AUTOVACUUM for outbox tables
Outbox table index maintenance
Conclusion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;Outbox migration patterns&lt;/h1&gt;

&lt;p&gt;Consider a scenario where we have an existing service with a traditional outbox table and want to migrate to a partitioned outbox table. We will use schematics to illustrate the migration process, and assume a rolling update approach where the new version of the service is deployed alongside the old one and the two co-exist for a while. This is the most common scenario in production systems, where we cannot afford downtime or data loss. Depending on the requirements, we may have to deploy several versions of the service. The services are shown in green, labeled with a version number like &lt;code&gt;V1&lt;/code&gt;, &lt;code&gt;V2&lt;/code&gt;, etc.&lt;/p&gt;

&lt;p&gt;Below, I describe five patterns for migrating the outbox table, in increasing order of complexity. If your requirements allow for a simpler approach, choose one of the simpler patterns; if you need more flexibility or performance, opt for a more complex one. Each pattern is given a mnemonic acronym to help you remember it, and there is a flowchart to help you decide which pattern to use. The mnemonics are meaningful names, though some of them are not well-known English words. So, below the flowchart, I used a symbol to indicate the meaning of each mnemonic. 😉&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;COP: Cold Outbox Partitioning&lt;/strong&gt;. This is the simplest pattern, where you can stop the publication of messages for a short while (e.g., 10 minutes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COPRA: Cold Outbox Partitioning w/ Rapid Attachment&lt;/strong&gt;. This pattern is similar to COP, except that the old outbox has to be rapidly attached to the new partitioned outbox table as the &lt;code&gt;outbox_published&lt;/code&gt; partition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HOP: Hot Outbox Partitioning&lt;/strong&gt;. In this pattern, your service cannot afford to stop the publication of messages, but you do not need to attach the old outbox table to the new partitioned outbox table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HOPER: Hot Outbox Partitioning w/ Eventual Replacement&lt;/strong&gt;. This pattern is similar to HOP, except that the old outbox table eventually replaces the &lt;code&gt;outbox_published&lt;/code&gt; partition, and newly published messages are copied over. No immediate access to the old outbox table is required, but it can be accessed later if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HOPIA: Hot Outbox Partitioning w/ Immediate Access&lt;/strong&gt;. This is by far the most complex pattern, with all requirements of HOP, and your service needs uninterrupted access to all the published messages.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The decision flowchart below can help you choose the right pattern for your requirements. The next sections will describe each pattern in detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszqe0cnkuqucizjh0ptd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszqe0cnkuqucizjh0ptd.png" alt="Deciding what pattern to use" width="634" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;COP: Cold Outbox Partitioning&lt;/h2&gt;

&lt;p&gt;COP is the simplest pattern for migration from a traditional outbox table to a partitioned one. The requirements are as relaxed as possible:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Paused publication:&lt;/strong&gt; You can stop the publication of messages for a short while (e.g., 10 minutes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent lifecycle:&lt;/strong&gt; The existing data in the old outbox table does not need to be available in the new partitioned setup. You can leave it as a separate table, or drop it to save space.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An example is a service that uses the outbox only periodically. For instance, in our company, we have an "interest service" that uses the outbox during the nightly accrual or the monthly interest payout. At all other times, the outbox is rarely used, so we can afford to pause the publication of messages for a short while during the migration.&lt;/p&gt;

&lt;p&gt;Given these requirements, the migration process is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;V1&lt;/code&gt; is using the traditional outbox table.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V2&lt;/code&gt; is deployed with the new partitioned outbox table:

&lt;ul&gt;
&lt;li&gt;Insertions are now made into the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;li&gt;Publication continues from the &lt;strong&gt;old&lt;/strong&gt; outbox table, until there are no more messages to publish.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V3&lt;/code&gt; is deployed where both insertions and publication are now performed over the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Between the &lt;code&gt;V2&lt;/code&gt; and &lt;code&gt;V3&lt;/code&gt; deployments, there is a short period during which unpublished messages accumulate in the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition, so publication is effectively paused.&lt;/p&gt;

&lt;p&gt;Let's depict this process in a schematic. Notice that due to the rolling deployment, &lt;code&gt;V1&lt;/code&gt; and &lt;code&gt;V2&lt;/code&gt; co-exist for a while, and the same applies to &lt;code&gt;V2&lt;/code&gt; and &lt;code&gt;V3&lt;/code&gt;. However, &lt;code&gt;V1&lt;/code&gt; and &lt;code&gt;V3&lt;/code&gt; do not co-exist. Also, &lt;code&gt;V3&lt;/code&gt; is only deployed after all messages in the old &lt;code&gt;outbox&lt;/code&gt; table are published. To prevent name clashes, let's assume that the new partitioned outbox table is named &lt;code&gt;outbox_new&lt;/code&gt;.&lt;/p&gt;
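
&lt;p&gt;Before deploying &lt;code&gt;V3&lt;/code&gt;, confirm that the old &lt;code&gt;outbox&lt;/code&gt; table is fully drained. A minimal check, assuming the table names from the DDL above, is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Returns false once all messages in the old outbox are published,
-- i.e., once it is safe to deploy V3.
SELECT EXISTS (
    SELECT FROM outbox
    WHERE published_at IS NULL
) AS has_unpublished;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;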

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmome9g6zmixpzfru069.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmome9g6zmixpzfru069.png" alt="The COP migration pattern" width="518" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;COPRA: Cold Outbox Partitioning w/ Rapid Attachment&lt;/h2&gt;

&lt;p&gt;COPRA is the second-simplest pattern, after COP, for migrating from a traditional outbox table to a partitioned one. The requirement on paused publication is the same as before, but the existing data in the old outbox table must remain available in the new partitioned setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Paused publication:&lt;/strong&gt; You can stop the publication of messages for a short while (e.g., 10 minutes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid attachment:&lt;/strong&gt; The existing data in the old outbox table has to be available in the new partitioned setup.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An example is when the application runs analytical queries on the outbox table, or when a logical replication setup replicates the outbox table to another database. In these cases, we cannot afford to lose the existing data in the old outbox table, and we need to attach it to the new partitioned outbox table.&lt;/p&gt;

&lt;p&gt;The DDL to create the new partitioned outbox table is a bit different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It won't create the &lt;code&gt;outbox_published&lt;/code&gt; partition, as we will attach the old outbox table to it later.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;id&lt;/code&gt; column has to be generated with a higher starting value to avoid clashes with the existing &lt;code&gt;id&lt;/code&gt; values in the old outbox table.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;-- The start value of the identity column must be greater than &lt;/span&gt;
    &lt;span class="c1"&gt;-- the greatest id of outbox to avoid clashes.&lt;/span&gt;
    &lt;span class="c1"&gt;-- 1_000_000 is used for illustration only.&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT8&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;START&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;LIST&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt;
        &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;outbox_new_unpublished_id_idx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
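
&lt;p&gt;As the comment in the DDL notes, &lt;code&gt;1_000_000&lt;/code&gt; is for illustration only. One way to pick a safe start value is to read the current maximum &lt;code&gt;id&lt;/code&gt; from the old table and add a buffer for ids already handed out to in-flight transactions (a sketch; the buffer size of 100000 is an arbitrary assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Pick a start value safely above the current maximum id.
-- The buffer accounts for ids assigned by in-flight transactions.
SELECT COALESCE(MAX(id), 0) + 100000 AS safe_start
FROM outbox;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;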



&lt;p&gt;The old outbox table can then be attached to the new partitioned outbox table with the following DDL statements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The new partition may not contain an identity column.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 

&lt;span class="c1"&gt;-- This will lock the table, see below.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt;
    &lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
        &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may also want to change or drop the indexes on the old outbox table (e.g., the primary key). A crucial point here, according to the &lt;a href="https://www.postgresql.org/docs/current/ddl-partitioning.html" rel="noopener noreferrer"&gt;PostgreSQL documentation&lt;/a&gt;, is as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note that when running the &lt;code&gt;ATTACH PARTITION&lt;/code&gt; command, the table will be scanned to validate the partition constraint while holding an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock on that partition.&lt;/p&gt;

&lt;p&gt;It is recommended to avoid this scan by creating a &lt;code&gt;CHECK&lt;/code&gt; constraint matching the expected partition constraint on the table prior to attaching it. Once the &lt;code&gt;ATTACH PARTITION&lt;/code&gt; is complete, it is recommended to drop the now-redundant &lt;code&gt;CHECK&lt;/code&gt; constraint.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the above example, the &lt;code&gt;outbox&lt;/code&gt; table is locked exclusively while PostgreSQL verifies that it satisfies the constraint of the &lt;code&gt;DEFAULT&lt;/code&gt; partition of &lt;code&gt;outbox_new&lt;/code&gt;: since the only other partition holds rows with &lt;code&gt;published_at IS NULL&lt;/code&gt;, every row must satisfy &lt;code&gt;published_at IS NOT NULL&lt;/code&gt;. If the outbox table is large, this scan can take a considerable amount of time. That can be acceptable if:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;outbox&lt;/code&gt; table does not have to be accessed during the migration, and&lt;/li&gt;
&lt;li&gt;The publication can remain paused for a while.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If #1 does not hold, you can use the &lt;code&gt;CHECK&lt;/code&gt; constraint approach, as the documentation suggests and as elaborated below. If #2 does not hold, you can use the &lt;code&gt;HOP&lt;/code&gt; pattern instead, which does not require pausing publication.&lt;/p&gt;

&lt;p&gt;To use the &lt;code&gt;CHECK&lt;/code&gt; constraint approach, follow the steps below. Note that each step must run in a separate transaction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a &lt;code&gt;CHECK&lt;/code&gt; constraint to the &lt;code&gt;outbox&lt;/code&gt; table in &lt;code&gt;NOT VALID&lt;/code&gt; mode. The constraint is then enforced only for new rows (and we expect none, since no new insertions are made into the &lt;code&gt;outbox&lt;/code&gt; table at this point).&lt;/li&gt;
&lt;li&gt;Validate the constraint. This checks all existing rows in the &lt;code&gt;outbox&lt;/code&gt; table, but does not hold an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock during the scan (at most it does so very briefly at the end, to mark the constraint as valid).&lt;/li&gt;
&lt;li&gt;Attach the &lt;code&gt;outbox&lt;/code&gt; table to &lt;code&gt;outbox_new&lt;/code&gt; as the &lt;code&gt;DEFAULT&lt;/code&gt; partition.&lt;/li&gt;
&lt;li&gt;Drop the now-redundant &lt;code&gt;CHECK&lt;/code&gt; constraint.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Add a constraint with NOT VALID option to avoid locking the table.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;outbox_published_check&lt;/span&gt;
        &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;VALID&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- In a separate transaction, validate the constraint.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="n"&gt;VALIDATE&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;outbox_published_check&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;span class="c1"&gt;-- Optional: In a separate transaction, set the published_at column to NOT NULL.&lt;/span&gt;
&lt;span class="c1"&gt;-- This uses the CHECK constraint on the table, so it will be fast.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
   &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- In a separate transaction, attach the partition.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt;
    &lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
        &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- In a separate transaction, drop the now-redundant constraint.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;outbox_published_check&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
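
&lt;p&gt;After the attachment, you can verify that both partitions are in place by querying the system catalogs (a sketch using &lt;code&gt;pg_inherits&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List the partitions of outbox_new; expect outbox_unpublished
-- plus the newly attached outbox table.
SELECT inhrelid::regclass AS partition
FROM pg_inherits
WHERE inhparent = 'outbox_new'::regclass;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;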



&lt;p&gt;Putting the above together, the migration process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;V1&lt;/code&gt; is using the traditional outbox table.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V2&lt;/code&gt; is deployed with &lt;code&gt;outbox_new&lt;/code&gt; table, having &lt;code&gt;outbox_unpublished&lt;/code&gt; as the only partition:

&lt;ul&gt;
&lt;li&gt;Insertions are now made into the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;li&gt;Publication continues from the &lt;strong&gt;old&lt;/strong&gt; outbox table, until there are no more messages to publish.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V3&lt;/code&gt; is deployed which attaches the existing &lt;code&gt;outbox&lt;/code&gt; table to the new partitioned outbox table as the &lt;code&gt;DEFAULT&lt;/code&gt; partition.

&lt;ul&gt;
&lt;li&gt;This can be done with or without using the &lt;code&gt;CHECK&lt;/code&gt; constraint approach, depending on whether an exclusive lock on the &lt;code&gt;outbox&lt;/code&gt; table is acceptable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V4&lt;/code&gt; is deployed where both insertions and publication are now performed over the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The schematics below illustrate the COPRA migration pattern.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkwz0igcogxy3xpprfqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkwz0igcogxy3xpprfqn.png" alt="The COPRA migration pattern" width="800" height="1275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;HOP: Hot Outbox Partitioning&lt;/h2&gt;

&lt;p&gt;HOP is the first pattern for migrating from a traditional outbox table to a partitioned one in which we cannot afford to pause the publication of messages. The requirements are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Immediate publication:&lt;/strong&gt; You cannot stop the publication of messages, even for a short while.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent lifecycle:&lt;/strong&gt; The existing data in the old outbox table does not need to be available in the new partitioned setup. You can leave it as a separate table, or drop it to save space.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An example is a service that continuously uses the outbox, such as a payment service that processes transactions in real time. In this case, we cannot stop the publication of messages, since doing so would disrupt other services that depend on timely message delivery.&lt;/p&gt;

&lt;p&gt;Despite the stringent publication requirement, the migration process is fairly straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;V1&lt;/code&gt; is using the traditional outbox table.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V2&lt;/code&gt; is deployed with the new partitioned outbox table:

&lt;ul&gt;
&lt;li&gt;Insertions are now made into the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;li&gt;Publication continues from the &lt;strong&gt;old&lt;/strong&gt; outbox table, until there are no more messages to publish.&lt;/li&gt;
&lt;li&gt;Then, publication continues from the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V3&lt;/code&gt; is deployed where both insertions and publication are now performed over the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following PL/pgSQL code shows how to implement &lt;code&gt;V2&lt;/code&gt; by keeping a variable, &lt;code&gt;outbox_table&lt;/code&gt;, that points to the outbox table currently used for publication. Initially it points to the old outbox table, and it switches to the new partitioned outbox table once all messages in the old table are published. This way, the service does not have to check the source table at each iteration, which improves performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt;
    &lt;span class="n"&gt;outbox_table&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'outbox'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;count_updated&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="n"&gt;LOOP&lt;/span&gt;
        &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;
            &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
                &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;
                &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
                &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
                &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;
            &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;outbox_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;outbox_table&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;GET&lt;/span&gt; &lt;span class="k"&gt;DIAGNOSTICS&lt;/span&gt; &lt;span class="n"&gt;count_updated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ROW_COUNT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="n"&gt;count_updated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;outbox_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'outbox'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
            &lt;span class="c1"&gt;-- No more unpublished messages in `outbox`&lt;/span&gt;
            &lt;span class="c1"&gt;-- Switch to the new partitioned outbox table.&lt;/span&gt;
            &lt;span class="n"&gt;outbox_table&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'outbox_new'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schematics below illustrate the HOP migration pattern.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uq0ak2ewwmfltenfbyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uq0ak2ewwmfltenfbyc.png" alt="The HOP migration pattern" width="800" height="1275"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  HOPER: Hot Outbox Partitioning w/ Eventual Replacement
&lt;/h2&gt;

&lt;p&gt;HOPER is a migration pattern for the following requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Immediate publication:&lt;/strong&gt; You cannot stop the publication of messages, even for a short while.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eventual replacement:&lt;/strong&gt; The existing data in the old outbox table needs to be available in the new partitioned setup, but it does not have to be immediately accessible. The eventual goal is to have all published messages in a single partition.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The steps are a combination of the HOP and COPRA patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;V1&lt;/code&gt; is using the traditional outbox table.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V2&lt;/code&gt; is deployed with the new partitioned outbox table:

&lt;ul&gt;
&lt;li&gt;The starting value of the &lt;code&gt;id&lt;/code&gt; column is set to a higher value to avoid clashes with the existing &lt;code&gt;id&lt;/code&gt; values in the old outbox table, as explained in the COPRA section above.&lt;/li&gt;
&lt;li&gt;Insertions are now made into the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;li&gt;Publication continues from the &lt;strong&gt;old&lt;/strong&gt; outbox table, until there are no more messages to publish.&lt;/li&gt;
&lt;li&gt;Then, publication continues from the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V3&lt;/code&gt; is deployed where both insertions and publication are now performed over the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V4&lt;/code&gt; is deployed which:

&lt;ul&gt;
&lt;li&gt;Drops &lt;code&gt;IDENTITY&lt;/code&gt; from the &lt;code&gt;outbox-&amp;gt;id&lt;/code&gt; column.&lt;/li&gt;
&lt;li&gt;First applies a &lt;code&gt;NOT VALID&lt;/code&gt; &lt;code&gt;CHECK&lt;/code&gt; constraint to the &lt;code&gt;outbox&lt;/code&gt; table, and then validates it in a separate transaction. This is fully explained in the COPRA section above.&lt;/li&gt;
&lt;li&gt;In a single transaction: Detaches the &lt;code&gt;outbox_published&lt;/code&gt; &lt;code&gt;DEFAULT&lt;/code&gt; partition from &lt;code&gt;outbox_new&lt;/code&gt;, and instead attaches the existing &lt;code&gt;outbox&lt;/code&gt; table as the &lt;code&gt;DEFAULT&lt;/code&gt; partition.&lt;/li&gt;
&lt;li&gt;Drops the now-redundant &lt;code&gt;CHECK&lt;/code&gt; constraint from the &lt;code&gt;outbox&lt;/code&gt; table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A job is deployed which gradually copies all the rows from the &lt;code&gt;outbox_published&lt;/code&gt; partition to the &lt;code&gt;outbox&lt;/code&gt; table, and then drops the &lt;code&gt;outbox_published&lt;/code&gt; partition.&lt;/li&gt;
&lt;/ol&gt;
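
&lt;p&gt;As a rough sketch, the &lt;code&gt;V4&lt;/code&gt; steps above could translate to DDL like the following. This is a hedged illustration based on the schema used throughout this series: the constraint name &lt;code&gt;outbox_check&lt;/code&gt; is made up for the example, and the &lt;code&gt;CHECK&lt;/code&gt; condition assumes the partitioning scheme of the revamped design, so adapt both to your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Step 1: drop IDENTITY so outbox can later accept rows with explicit ids.
ALTER TABLE outbox ALTER COLUMN id DROP IDENTITY;

-- Step 2: add the constraint as NOT VALID (fast), then validate it in a
-- separate transaction (scans the table without blocking writes).
ALTER TABLE outbox ADD CONSTRAINT outbox_check
    CHECK (published_at IS NOT NULL) NOT VALID;
ALTER TABLE outbox VALIDATE CONSTRAINT outbox_check;

-- Step 3: in a single transaction, swap the DEFAULT partition.
-- The validated constraint lets ATTACH skip a full table scan.
BEGIN;
ALTER TABLE outbox_new DETACH PARTITION outbox_published;
ALTER TABLE outbox_new ATTACH PARTITION outbox DEFAULT;
COMMIT;

-- Step 4: drop the now-redundant CHECK constraint.
ALTER TABLE outbox DROP CONSTRAINT outbox_check;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;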

&lt;p&gt;The schematics below illustrate the HOPER migration pattern.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmsdousti%2Foutbox2%2Frefs%2Fheads%2Fmain%2Fhoper.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmsdousti%2Foutbox2%2Frefs%2Fheads%2Fmain%2Fhoper.png" alt="The HOPER migration pattern" width="800" height="2276"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  HOPIA: Hot Outbox Partitioning w/ Immediate Access
&lt;/h2&gt;

&lt;p&gt;HOPIA is the migration pattern for the strictest requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Immediate publication:&lt;/strong&gt; You cannot stop the publication of messages, even for a short while.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immediate access:&lt;/strong&gt; All the published data has to be immediately accessible. There is no time window to copy the data from one table to another.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While this seems very complex, there is actually a simple solution: Just follow the HOPER pattern, but add an AFTER-INSERT trigger to the &lt;code&gt;outbox_published&lt;/code&gt; partition that copies the row to the &lt;code&gt;outbox&lt;/code&gt; table. This way, all the published messages are immediately available in the &lt;code&gt;outbox&lt;/code&gt; table, and you can access them at any time: There's no need to copy them later.&lt;/p&gt;

&lt;p&gt;The trigger can be implemented as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The new partition may not contain an identity column.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;outbox_published_insert_trigger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;outbox_published_insert&lt;/span&gt;
&lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;outbox_published&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
&lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;outbox_published_insert_trigger&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The steps are similar to the HOPER pattern, but with the addition of the trigger:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;V1&lt;/code&gt; is using the traditional outbox table.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V2&lt;/code&gt; is deployed with the new partitioned outbox table:

&lt;ul&gt;
&lt;li&gt;The starting value of the &lt;code&gt;id&lt;/code&gt; column is set to a higher value to avoid clashes with the existing &lt;code&gt;id&lt;/code&gt; values in the old outbox table, as explained in the COPRA section above.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;above trigger&lt;/strong&gt; is created on the &lt;code&gt;outbox_published&lt;/code&gt; partition.&lt;/li&gt;
&lt;li&gt;Insertions are now made into the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;li&gt;Publication continues from the &lt;strong&gt;old&lt;/strong&gt; outbox table, until there are no more messages to publish.&lt;/li&gt;
&lt;li&gt;Then, publication continues from the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V3&lt;/code&gt; is deployed where both insertions and publication are now performed over the &lt;strong&gt;new&lt;/strong&gt; partitioned outbox table.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V4&lt;/code&gt; is deployed which:

&lt;ul&gt;
&lt;li&gt;Drops &lt;code&gt;IDENTITY&lt;/code&gt; from the &lt;code&gt;outbox-&amp;gt;id&lt;/code&gt; column.&lt;/li&gt;
&lt;li&gt;First applies a &lt;code&gt;NOT VALID&lt;/code&gt; &lt;code&gt;CHECK&lt;/code&gt; constraint to the &lt;code&gt;outbox&lt;/code&gt; table, and then validates it in a separate transaction. This is fully explained in the COPRA section above.&lt;/li&gt;
&lt;li&gt;In a single transaction: Detaches the &lt;code&gt;outbox_published&lt;/code&gt; &lt;code&gt;DEFAULT&lt;/code&gt; partition from &lt;code&gt;outbox_new&lt;/code&gt;, and instead attaches the existing &lt;code&gt;outbox&lt;/code&gt; table as the &lt;code&gt;DEFAULT&lt;/code&gt; partition.&lt;/li&gt;
&lt;li&gt;Drops the now-redundant &lt;code&gt;CHECK&lt;/code&gt; constraint from the &lt;code&gt;outbox&lt;/code&gt; table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If needed, you can also drop the &lt;code&gt;outbox_published&lt;/code&gt; partition after the migration is complete, since all the published messages are now available in the &lt;code&gt;outbox&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;The schematics are very similar to the HOPER pattern, with the addition of the trigger in &lt;code&gt;V2&lt;/code&gt; and the removal of the copy job. For brevity, I will not repeat the schematics here, but you can refer to the HOPER section above.&lt;/p&gt;

&lt;h1&gt;
  
  
  Tuning AUTOVACUUM for outbox tables
&lt;/h1&gt;

&lt;p&gt;Let's take a detour from the migration patterns and discuss how to tune the &lt;code&gt;autovacuum&lt;/code&gt; settings for the outbox tables. This is important to ensure that the outbox tables are maintained properly, especially in high-throughput systems where many messages are inserted and published frequently. This section assumes familiarity with the &lt;code&gt;autovacuum&lt;/code&gt; feature in PostgreSQL. A good, in-depth article is Laurenz Albe's &lt;a href="https://www.cybertec-postgresql.com/en/tuning-autovacuum-postgresql/" rel="noopener noreferrer"&gt;Tuning PostgreSQL autovacuum&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By default, PostgreSQL's &lt;code&gt;autovacuum&lt;/code&gt; settings are designed to work well on small and medium-sized tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# \dconfig *autovacuum_*
         List of configuration parameters
               Parameter               |   Value
---------------------------------------+-----------
 autovacuum_analyze_scale_factor       | 0.1
 autovacuum_analyze_threshold          | 50
 autovacuum_freeze_max_age             | 200000000
 autovacuum_max_workers                | 3
 autovacuum_multixact_freeze_max_age   | 400000000
 autovacuum_naptime                    | 1min
 autovacuum_vacuum_cost_delay          | 2ms
 autovacuum_vacuum_cost_limit          | -1
 autovacuum_vacuum_insert_scale_factor | 0.2
 autovacuum_vacuum_insert_threshold    | 1000
 autovacuum_vacuum_scale_factor        | 0.2
 autovacuum_vacuum_threshold           | 50
 autovacuum_work_mem                   | -1
 log_autovacuum_min_duration           | 10min
(14 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following table summarizes the most important settings and their default values:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AUTOVACUUM task&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vacuum run on update/delete&lt;/td&gt;
&lt;td&gt;At least 20% of the table, plus 50 rows, is updated or deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vacuum run on insert&lt;/td&gt;
&lt;td&gt;At least 20% of the table, plus 1000 rows, is inserted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analyze run&lt;/td&gt;
&lt;td&gt;At least 10% of the table, plus 50 rows, is changed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
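
&lt;p&gt;For example, with the defaults above, a table holding one million rows gets vacuumed after roughly 0.2 * 1,000,000 + 50 = 200,050 rows are updated or deleted. You can estimate the current per-table trigger point with a query along these lines (a sketch; note that &lt;code&gt;reltuples&lt;/code&gt; is itself an estimate maintained by &lt;code&gt;ANALYZE&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Estimate when autovacuum (update/delete path) triggers, using the defaults.
SELECT c.relname,
       s.n_dead_tup,
       50 + 0.2 * c.reltuples AS vacuum_trigger_estimate
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE c.relname IN ('outbox_unpublished', 'outbox_published');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;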

&lt;p&gt;In our revamped outbox design, there's an imbalance between the partitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;outbox_unpublished&lt;/code&gt; partition is frequently inserted to and updated (resulting in delete), but its size is usually small.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;outbox_published&lt;/code&gt; partition is only inserted to, but it can grow indefinitely large.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since PostgreSQL's default &lt;code&gt;autovacuum&lt;/code&gt; settings are designed for small and medium-sized tables, we usually do not have to tune them for the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition. However, keep an eye on how long &lt;code&gt;autovacuum&lt;/code&gt; takes to run on this partition, and on whether the publication query against the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition is efficient. For both purposes, you can monitor the server logs. For the latter, you would need &lt;code&gt;log_min_duration_statement&lt;/code&gt; set to a low value, such as 100ms, or use the &lt;code&gt;auto_explain&lt;/code&gt; extension. Another possibility is the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension, which aggregates execution statistics and can help you identify slow queries.&lt;/p&gt;
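
&lt;p&gt;For instance, assuming the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension is installed (it requires an entry in &lt;code&gt;shared_preload_libraries&lt;/code&gt;), a query along the following lines can surface the slowest statements touching the partition; the 10-row limit is arbitrary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Requires: CREATE EXTENSION pg_stat_statements;
SELECT query, calls, mean_exec_time, rows
FROM pg_stat_statements
WHERE query ILIKE '%outbox_unpublished%'
ORDER BY mean_exec_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;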

&lt;p&gt;For the insert-only table &lt;code&gt;outbox_published&lt;/code&gt;, vacuuming is not needed except for preventing transaction ID wraparound.&lt;br&gt;
If it is not used in the application code, it does not have to be analyzed either. However, since disabling auto-analyze is not possible without disabling &lt;code&gt;autovacuum&lt;/code&gt; entirely, we can tune the parameters to make it less frequent. Here's an example that runs both vacuum and analyze on the &lt;code&gt;outbox_published&lt;/code&gt; partition every 100K rows inserted, regardless of the table size (note that underscore digit separators such as &lt;code&gt;100_000&lt;/code&gt; require PostgreSQL 16 or later). This gives you consistent behavior, but it is suitable only when the application code does not query the &lt;code&gt;outbox_published&lt;/code&gt; partition directly. There is no one-size-fits-all solution, so you may need to adjust the values based on your workload and performance requirements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_published&lt;/span&gt;
   &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;autovacuum_vacuum_insert_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;autovacuum_vacuum_insert_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;autovacuum_analyze_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;autovacuum_analyze_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;
      &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Outbox table index maintenance
&lt;/h1&gt;

&lt;p&gt;In the revamped outbox design, we have a single index on the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition, and no indexes on the &lt;code&gt;outbox_published&lt;/code&gt; partition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;outbox_new_unpublished_id_idx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;outbox_unpublished&lt;/code&gt; partition is frequently inserted to and updated (resulting in deletes), so its index is expected to become severely bloated over time. We'll use &lt;a href="https://raw.githubusercontent.com/ioguix/pgsql-bloat-estimation/refs/heads/master/btree/btree_bloat-superuser.sql" rel="noopener noreferrer"&gt;this nice script&lt;/a&gt; by ioguix to estimate the bloat of the B-Tree index. The script requires superuser privileges to run, since it accesses the &lt;code&gt;pg_statistic&lt;/code&gt; system catalog.&lt;/p&gt;

&lt;p&gt;Let's create a view to simplify the usage of the script. In order not to clutter the blog, I will not paste the entire 90+ line script here, but you can find it in the link above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;bloat&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="c1"&gt;-- put btree_bloat-superuser.sql here&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a freshly created &lt;code&gt;outbox_unpublished&lt;/code&gt; partition, the following query shows that the index is empty, thus not bloated at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Update the statistics of the outbox_unpublished partition.&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bloat&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idxname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'outbox_new_unpublished_id_idx'&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;gx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-[ RECORD 1 ]----+------------------------------
current_database | postgres
schemaname       | public
tblname          | outbox_unpublished
idxname          | outbox_new_unpublished_id_idx
real_size        | 8192
extra_size       | 0
extra_pct        | 0
fillfactor       | 90
bloat_size       | 0
bloat_pct        | 0
is_na            | f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Column &lt;code&gt;bloat_pct&lt;/code&gt; is the bloat percentage of the index, which is 0% in this case, meaning that the index is not bloated at all. &lt;code&gt;is_na&lt;/code&gt; indicates whether the estimation is "Not Applicable": if it is true, do not trust the stats. In this case, it is false, meaning that the stats are valid and applicable.&lt;/p&gt;

&lt;p&gt;Let's insert 100K rows into the &lt;code&gt;outbox_new&lt;/code&gt; table, and see what the index looks like afterwards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bloat&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idxname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'outbox_new_unpublished_id_idx'&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;gx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-[ RECORD 1 ]----+------------------------------
current_database | postgres
schemaname       | public
tblname          | outbox_unpublished
idxname          | outbox_new_unpublished_id_idx
real_size        | 2260992
extra_size       | 237568
extra_pct        | 10.507246376811594
fillfactor       | 90
bloat_size       | 8192
bloat_pct        | 0.36231884057971014
is_na            | f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Column &lt;code&gt;bloat_pct&lt;/code&gt; is less than 0.4%, meaning that the index is still essentially not bloated.&lt;/p&gt;

&lt;p&gt;Let's update the &lt;code&gt;published_at&lt;/code&gt; column of 10K rows, and see what the index looks like afterwards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bloat&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idxname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'outbox_new_unpublished_id_idx'&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;gx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-[ RECORD 1 ]----+------------------------------
current_database | postgres
schemaname       | public
tblname          | outbox_unpublished
idxname          | outbox_new_unpublished_id_idx
real_size        | 2260992
extra_size       | 434176
extra_pct        | 19.202898550724637
fillfactor       | 90
bloat_size       | 237568
bloat_pct        | 10.507246376811594
is_na            | f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the &lt;code&gt;bloat_pct&lt;/code&gt; is 10.5%: with 10K out of 100K rows updated (and thus moved out of the partition), the index is bloated by roughly the same percentage. Let's update the rest of the rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bloat&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idxname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'outbox_new_unpublished_id_idx'&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;gx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-[ RECORD 1 ]----+------------------------------
current_database | postgres
schemaname       | public
tblname          | outbox_unpublished
idxname          | outbox_new_unpublished_id_idx
real_size        | 2260992
extra_size       | 2252800
extra_pct        | 99.6376811594203
fillfactor       | 90
bloat_size       | 2252800
bloat_pct        | 99.6376811594203
is_na            | f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wow, almost 100% bloat! Can &lt;code&gt;VACUUM (INDEX_CLEANUP on) outbox_new;&lt;/code&gt; fix it? Try it, and you'll find the answer is no. What if we insert 10K new rows?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox_new&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bloat&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idxname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'outbox_new_unpublished_id_idx'&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;gx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-[ RECORD 1 ]----+------------------------------
current_database | postgres
schemaname       | public
tblname          | outbox_unpublished
idxname          | outbox_new_unpublished_id_idx
real_size        | 2482176
extra_size       | 2269184
extra_pct        | 91.41914191419141
fillfactor       | 90
bloat_size       | 2244608
bloat_pct        | 90.42904290429043
is_na            | f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the &lt;code&gt;bloat_pct&lt;/code&gt; is 90.4%. The bloat is only partially masked: the ratio drops mainly because the index grew with fresh, densely packed entries, and it remains very high.&lt;/p&gt;
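&lt;p&gt;As a sanity check, the percentages in the record above follow directly from the reported sizes (&lt;code&gt;extra_pct = extra_size / real_size&lt;/code&gt;, and likewise for &lt;code&gt;bloat_pct&lt;/code&gt;). A quick shell sketch using the numbers from my run:&lt;/p&gt;

```shell
# Recompute extra_pct and bloat_pct from the sizes reported by the bloat view.
real_size=2482176
extra_size=2269184
bloat_size=2244608

awk -v r="$real_size" -v e="$extra_size" -v b="$bloat_size" \
  'BEGIN { printf "extra_pct=%.1f bloat_pct=%.1f\n", e/r*100, b/r*100 }'
# prints: extra_pct=91.4 bloat_pct=90.4
```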

&lt;p&gt;Fortunately, it's easy to remove the index bloat by running a &lt;code&gt;REINDEX&lt;/code&gt; command on the index or the table. To avoid locking the table against writes, use the &lt;code&gt;CONCURRENTLY&lt;/code&gt; option: it takes longer, but does not block other operations on the table. Since the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition is supposed to stay small, the &lt;code&gt;REINDEX&lt;/code&gt; command should not take long:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;REINDEX&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;outbox_new_unpublished_id_idx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, to rebuild all indexes on the partition at once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;REINDEX&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Index maintenance is not a one-time operation, but a periodic task that should run regularly to keep the index in good shape. How often depends on the workload and the size of the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition. You can use the &lt;code&gt;pg_cron&lt;/code&gt; extension to schedule the task, or a job scheduler like &lt;code&gt;cron&lt;/code&gt; or K8s CronJobs to run the &lt;code&gt;REINDEX&lt;/code&gt; command periodically. There's also a nice bash script by Vitaliy Kukharik, called &lt;a href="https://github.com/vitabaks/pg_auto_reindexer" rel="noopener noreferrer"&gt;pg_auto_reindexer&lt;/a&gt;, which runs during a configured maintenance window and reindexes all indexes bloated beyond a given threshold, automating the whole process.&lt;/p&gt;
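&lt;p&gt;For example, with &lt;code&gt;pg_cron&lt;/code&gt; installed, a nightly rebuild could be scheduled like this (a sketch; the job name and schedule are arbitrary choices, and &lt;code&gt;REINDEX CONCURRENTLY&lt;/code&gt; works here because pg_cron runs each command as its own statement, outside an explicit transaction block):&lt;/p&gt;

```sql
-- Rebuild the hot index every night at 03:30 (job name and time are arbitrary).
SELECT cron.schedule(
    'reindex-outbox-unpublished',
    '30 3 * * *',
    'REINDEX INDEX CONCURRENTLY outbox_new_unpublished_id_idx'
);
```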

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this article, we explored various migration patterns for transitioning from a traditional outbox table to a partitioned one in PostgreSQL. We covered five patterns—COP, COPRA, HOP, HOPER, and HOPIA—each designed for different requirements regarding publication pauses and data accessibility. The choice of pattern depends on your specific needs, with simpler patterns like COP being suitable for systems that can tolerate brief publication pauses, while more complex patterns like HOPIA address scenarios requiring continuous publication and immediate data access.&lt;/p&gt;

&lt;p&gt;We also discussed important operational aspects, including tuning autovacuum settings for outbox tables and maintaining indexes to prevent performance degradation. The partitioned outbox design offers significant advantages in terms of performance and maintenance, particularly for high-throughput systems.&lt;/p&gt;

&lt;p&gt;By implementing these patterns and following the recommended maintenance practices, you can achieve a more efficient and reliable outbox implementation that scales well with your system's growth while minimizing the common pitfalls associated with traditional outbox tables.&lt;/p&gt;

</description>
      <category>outbox</category>
      <category>postgres</category>
      <category>performance</category>
      <category>incidents</category>
    </item>
    <item>
      <title>Connecting from EKS pods to MSK: Setting up IAM, service accounts, trust relationships, and security groups</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Wed, 25 Jun 2025 23:17:22 +0000</pubDate>
      <link>https://dev.to/msdousti/connecting-from-eks-pods-to-msk-setting-up-iam-service-accounts-trust-relationships-and-2ed1</link>
      <guid>https://dev.to/msdousti/connecting-from-eks-pods-to-msk-setting-up-iam-service-accounts-trust-relationships-and-2ed1</guid>
      <description>&lt;p&gt;Recently, we have been working on a project that requires connecting from a pod running in Elastic Kubernetes Service (EKS) to a Managed Streaming for Apache Kafka (MSK) cluster. I've done this before, but I've never set up the whole thing from scratch. I thought it would be a good idea to document the steps I took to get everything working. This includes the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up the MSK cluster&lt;/li&gt;
&lt;li&gt;Setting up the EKS cluster&lt;/li&gt;
&lt;li&gt;Adjusting the security groups&lt;/li&gt;
&lt;li&gt;Installing Kafka tools on the EKS pod&lt;/li&gt;
&lt;li&gt;Enabling identity providers for the EKS cluster&lt;/li&gt;
&lt;li&gt;Creating an IAM role for the EKS service account&lt;/li&gt;
&lt;li&gt;Creating a Kubernetes service account&lt;/li&gt;
&lt;li&gt;Adjusting the trust relationship for the IAM role&lt;/li&gt;
&lt;li&gt;Creating a Kafka topic, producing messages, and consuming messages&lt;/li&gt;
&lt;li&gt;Cleaning up the resources after testing to save costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That seems like a lot, but I'll keep the guide full of screenshots and explanations to make it easy to follow along. Let's get started!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I intentionally refrained from explaining the "behind the scenes" of how this machinery works, to keep the guide focused on the practical steps. Google &lt;code&gt;Understanding AWS IRSA&lt;/code&gt; to find resources that explain the concepts in detail, if you're interested in the underlying mechanisms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;An AWS account:&lt;/strong&gt; You'll need an AWS account to create the resources. This will cost you a bit, though the free tier covers some services; without it, the whole setup cost me around 1 USD (estimated). YMMV.&lt;/li&gt;
&lt;li&gt;You'll need some knowledge of Kubernetes and Kafka, as well as a general understanding of AWS services.&lt;/li&gt;
&lt;li&gt;I'll use AWS Web Console (including CloudShell) for all the steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Setting up the MSK cluster
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open the AWS Management Console&lt;/strong&gt; and navigate to the &lt;strong&gt;MSK service&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click on &lt;strong&gt;Create cluster&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose the &lt;strong&gt;Quick create&lt;/strong&gt; option.&lt;/li&gt;
&lt;li&gt;Name your cluster &lt;code&gt;demo-msk-cluster&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Set cluster type to &lt;strong&gt;Serverless&lt;/strong&gt;. (So you pay as you go, and it scales automatically.)&lt;/li&gt;
&lt;li&gt;Leave all the other settings as default.&lt;/li&gt;
&lt;li&gt;Click on &lt;strong&gt;Create cluster&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It will take a few minutes to create the cluster. Once it's created, you can see it in the list of clusters, with a status of &lt;strong&gt;Active&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwcmddfzblr6chbawgo1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwcmddfzblr6chbawgo1.png" alt="demo-msk-cluster" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;code&gt;View client information&lt;/code&gt; to see the connection details. You'll need this information later when connecting from the EKS pod.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3t9q5xzvzh82pjia0ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3t9q5xzvzh82pjia0ui.png" alt="client information" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see above, the &lt;code&gt;Authentication type&lt;/code&gt; is IAM, and the Bootstrap server endpoint is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;boot-6nbanb1s.c3.kafka-serverless.us-east-1.amazonaws.com:9098
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your cluster will have a different endpoint, so make sure to copy it down and use it in the next steps.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setting up the EKS cluster
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open the AWS Management Console&lt;/strong&gt; and navigate to the &lt;strong&gt;Elastic Kubernetes Service&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;Clusters&lt;/strong&gt; section and click on &lt;strong&gt;Create cluster&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select &lt;code&gt;Quick configuration (with EKS Auto Mode)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Name your cluster &lt;code&gt;dev&lt;/code&gt;, leave the Kubernetes version, VPC, and subnets as default.&lt;/li&gt;
&lt;li&gt;Click on the &lt;code&gt;Create recommended role&lt;/code&gt; button for &lt;code&gt;Cluster IAM role&lt;/code&gt; and &lt;code&gt;Node IAM role&lt;/code&gt;, as denoted below:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxaawzzxyte7q07f5981.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxaawzzxyte7q07f5981.png" alt="create EKS cluster" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow the default settings for &lt;code&gt;Create recommended role&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster IAM role&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trusted entity type:&lt;/strong&gt; AWS service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case:&lt;/strong&gt; EKS 👉 EKS - Auto Cluster&lt;/li&gt;
&lt;li&gt;Add the default permissions, click next, and then create the role.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Node IAM role&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trusted entity type:&lt;/strong&gt; AWS service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case:&lt;/strong&gt; EKS 👉 EKS - Auto Node&lt;/li&gt;
&lt;li&gt;Add the default permissions, click next, and then create the role.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Go back to the EKS cluster creation page, refresh the role list, and then select the appropriate roles for the cluster and node IAM roles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn7kt0jzykgds4ohhfid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn7kt0jzykgds4ohhfid.png" alt="Assign roles" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;strong&gt;Create&lt;/strong&gt; to create the EKS cluster. It will take a few minutes. Once it's created, you can see it in the list of clusters, with a status of &lt;strong&gt;Active&lt;/strong&gt;. (In my experience, creating an EKS cluster takes somewhat longer than an MSK cluster, so be patient.)&lt;/p&gt;

&lt;h1&gt;
  
  
  Adjusting the security groups
&lt;/h1&gt;

&lt;p&gt;Let's create our first pod in the EKS cluster to connect to the MSK cluster. Open CloudShell in the AWS Management Console, and run the following command to configure kubectl to use the EKS cluster (use the region where you created the EKS cluster):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws eks update-kubeconfig &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="nt"&gt;--name&lt;/span&gt; dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To test, let's list the nodes in the EKS cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should return a single node, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0n55i10spvh2vha8rkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0n55i10spvh2vha8rkq.png" alt="configure kubectl" width="695" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's create an Ubuntu pod. Here's the pod manifest that we'll use to create the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu:24.04&lt;/span&gt;
      &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/bash"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;infinity"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# keep the container running&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In CloudShell, save the above manifest to a file named &lt;code&gt;ubuntu.yaml&lt;/code&gt;, and then run the following command to create the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ubuntu.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you immediately run &lt;code&gt;kubectl describe pods&lt;/code&gt;, you'll see a &lt;code&gt;FailedScheduling&lt;/code&gt; event with the error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0/1 nodes are available: 1 node(s) had untolerated taint {CriticalAddonsOnly: }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is because the single node in the EKS cluster has a taint that prevents pods from being scheduled on it unless they tolerate that taint. Since the cluster runs in EKS Auto Mode and scales elastically, it automatically adds a node after some time and schedules the pod on it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1p5s5np2mlu71sgoidx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1p5s5np2mlu71sgoidx6.png" alt="New node created, pod scheduled" width="474" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll now get a shell into the pod to start our work. Run the following command to get a shell into the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; ubuntu &lt;span class="nt"&gt;--&lt;/span&gt; bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your prompt should change to something like &lt;code&gt;root@ubuntu:/#&lt;/code&gt;. Let's update the package list and install some necessary network tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; net-tools netcat-traditional iputils-ping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's check if the pod can resolve the MSK cluster's bootstrap server endpoint name, and access its port. Run the following command to check if the pod can resolve the endpoint (replace the endpoint with your MSK cluster's bootstrap server endpoint):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ping boot-6nbanb1s.c3.kafka-serverless.us-east-1.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, it resolves the endpoint to &lt;code&gt;172.31.37.226&lt;/code&gt; (in my case), but with 100% packet loss, since the cluster's security group does not yet allow the traffic through. To check the port, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nc &lt;span class="nt"&gt;-zv&lt;/span&gt; boot-6nbanb1s.c3.kafka-serverless.us-east-1.amazonaws.com 9098
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No, the port is not accessible! We need to adjust the security groups to allow access from the EKS cluster to the MSK cluster. Let's find the IP of the pod first, so we can add it to the security group of the MSK cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ifconfig
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows the IP address of the pod, which in my case is &lt;code&gt;172.31.30.160&lt;/code&gt;. But I want to allow any pod that will be created in the EKS cluster, so I need to check the CIDR blocks of the subnets in the VPC where the EKS cluster lives. There are actually several subnets, and the VPC's &lt;code&gt;172.31.0.0/16&lt;/code&gt; CIDR block is generic enough to cover them all.&lt;/p&gt;
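&lt;p&gt;To convince yourself that the pod's address is covered by that CIDR block: a &lt;code&gt;/16&lt;/code&gt; mask covers exactly the first two octets, so membership reduces to comparing them. A quick sketch with the addresses from this walkthrough:&lt;/p&gt;

```shell
# Check that the pod IP falls inside the VPC CIDR used for the security-group rule.
# A /16 mask covers exactly the first two octets, so comparing them is enough.
pod_ip=172.31.30.160
vpc_cidr=172.31.0.0/16

if [ "$(echo "$pod_ip" | cut -d. -f1-2)" = "$(echo "${vpc_cidr%/*}" | cut -d. -f1-2)" ]; then
  echo "pod IP is inside $vpc_cidr"
fi
```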

&lt;p&gt;Go to the MSK cluster &lt;code&gt;demo-msk-cluster&lt;/code&gt; 👉 Properties, and scroll down to the &lt;strong&gt;Security groups applied&lt;/strong&gt; section. Click on the security group link to open it in a new tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfgiac2rln85k7yfgt1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfgiac2rln85k7yfgt1c.png" alt="MSK security group" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on &lt;strong&gt;Edit inbound rules&lt;/strong&gt;, and then click on &lt;strong&gt;Add rule&lt;/strong&gt;. Select &lt;code&gt;All Traffic&lt;/code&gt; as the type, and &lt;code&gt;172.31.0.0/16&lt;/code&gt; as the custom source. Click on &lt;strong&gt;Save rules&lt;/strong&gt; to save the changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4mq5i3fcdbp8kvvwz6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4mq5i3fcdbp8kvvwz6f.png" alt="Inbound rule" width="800" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, if you try Netcat again (inside the pod), it should work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nc &lt;span class="nt"&gt;-zv&lt;/span&gt; boot-6nbanb1s.c3.kafka-serverless.us-east-1.amazonaws.com 9098
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vpce-07ab4e3e8f8e8a7d3-4nhymqra.vpce-svc-015711cabcd5af80a.us-east-1.vpce.amazonaws.com [172.31.26.34] 9098 (?) open
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installing Kafka tools on the EKS pod
&lt;/h1&gt;

&lt;p&gt;OK, so we now have the basic network connectivity between the EKS pod and the MSK cluster. But to send requests to Kafka, we need to authenticate the requests. Since our MSK cluster is configured to use IAM authentication, it won't work out of the box.&lt;/p&gt;

&lt;p&gt;Let's try connecting to MSK without any authentication first, to see what happens. The following steps will be used to set up the Kafka client in the pod. First, we need to install OpenJDK and wget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; openjdk-21-jre-headless wget
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, download and install Kafka tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://dlcdn.apache.org/kafka/3.9.1/kafka_2.13-3.9.1.tgz
&lt;span class="nb"&gt;tar &lt;/span&gt;xvf kafka_2.13-3.9.1.tgz
&lt;span class="nb"&gt;cd &lt;/span&gt;kafka_2.13-3.9.1

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BOOTSTRAP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;boot-6nbanb1s.c3.kafka-serverless.us-east-1.amazonaws.com:9098
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's list the topics in the MSK cluster using the &lt;code&gt;kafka-topics.sh&lt;/code&gt; script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-topics.sh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; &lt;span class="nv"&gt;$BOOTSTRAP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--list&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It fails, as expected. The misleading &lt;code&gt;OutOfMemoryError&lt;/code&gt; is a classic symptom of a plaintext client hitting a TLS-only port: the client reads bytes of the TLS handshake as a message-length prefix and tries to allocate a gigantic buffer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error while executing topic command : The AdminClient thread has exited. Call: listTopics
[2025-06-25 21:08:47,814] ERROR org.apache.kafka.common.errors.TimeoutException: The AdminClient thread has exited. Call: listTopics
 (org.apache.kafka.tools.TopicCommand)
[2025-06-25 21:08:47,822] ERROR Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1': (org.apache.kafka.common.utils.KafkaThread)
java.lang.OutOfMemoryError: Java heap space
        at java.base/java.nio.HeapByteBuffer.&amp;lt;init&amp;gt;(HeapByteBuffer.java:71)
        at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:391)
        at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30)
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:103)
        at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:462)
        at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:412)
        at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:694)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:596)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:501)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:596)
        at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1542)
        at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1473)
        at java.base/java.lang.Thread.runWith(Thread.java:1596)
        at java.base/java.lang.Thread.run(Thread.java:1583)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Enabling identity providers for the EKS cluster
&lt;/h1&gt;

&lt;p&gt;To connect to the MSK cluster using IAM authentication, we need to enable identity providers for the EKS cluster. This allows the EKS cluster to use IAM roles for service accounts (IRSA) to authenticate with AWS services.&lt;/p&gt;

&lt;p&gt;Exit the pod shell by typing &lt;code&gt;exit&lt;/code&gt;, and then run the following commands in CloudShell to install &lt;code&gt;eksctl&lt;/code&gt;, following the official guide for &lt;a href="https://eksctl.io/installation/#for-unix" rel="noopener noreferrer"&gt;eksctl installation&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# for ARM systems, set ARCH to: `arm64`, `armv6` or `armv7`&lt;/span&gt;
&lt;span class="nv"&gt;ARCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;amd64
&lt;span class="nv"&gt;PLATFORM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;_&lt;span class="nv"&gt;$ARCH&lt;/span&gt;

curl &lt;span class="nt"&gt;-sLO&lt;/span&gt; &lt;span class="s2"&gt;"https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_&lt;/span&gt;&lt;span class="nv"&gt;$PLATFORM&lt;/span&gt;&lt;span class="s2"&gt;.tar.gz"&lt;/span&gt;

&lt;span class="c"&gt;# (Optional) Verify checksum&lt;/span&gt;
curl &lt;span class="nt"&gt;-sL&lt;/span&gt; &lt;span class="s2"&gt;"https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nv"&gt;$PLATFORM&lt;/span&gt; | &lt;span class="nb"&gt;sha256sum&lt;/span&gt; &lt;span class="nt"&gt;--check&lt;/span&gt;

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; eksctl_&lt;span class="nv"&gt;$PLATFORM&lt;/span&gt;.tar.gz &lt;span class="nt"&gt;-C&lt;/span&gt; /tmp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm &lt;/span&gt;eksctl_&lt;span class="nv"&gt;$PLATFORM&lt;/span&gt;.tar.gz

&lt;span class="nb"&gt;sudo mv&lt;/span&gt; /tmp/eksctl /usr/local/bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, enable the identity providers for the EKS cluster by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;eksctl utils associate-iam-oidc-provider &lt;span class="nt"&gt;--cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dev &lt;span class="nt"&gt;--approve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2025-06-25 21:16:35 [ℹ]  will create IAM Open ID Connect provider for cluster "dev" in "us-east-1"
2025-06-25 21:16:36 [✔]  created IAM Open ID Connect provider for cluster "dev" in "us-east-1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see the OIDC issuer URL for the EKS cluster, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws eks describe-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"cluster.identity.oidc.issuer"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For me, it returned the following URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://oidc.eks.us-east-1.amazonaws.com/id/72C2EA41B53D0AF8CA083756BF9D0CD8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
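&lt;p&gt;The last path segment of this URL is the OIDC provider ID, which (along with the issuer host) shows up again when adjusting the IAM role's trust relationship. A quick way to extract it from the issuer URL (a sketch, using my URL as an example):&lt;/p&gt;

```shell
# Strip everything up to the last slash to get the OIDC provider ID.
issuer=https://oidc.eks.us-east-1.amazonaws.com/id/72C2EA41B53D0AF8CA083756BF9D0CD8
echo "${issuer##*/}"
# prints: 72C2EA41B53D0AF8CA083756BF9D0CD8
```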



&lt;h1&gt;
  
  
  Creating a permission policy for the IAM role
&lt;/h1&gt;

&lt;p&gt;Use &lt;code&gt;IAM&lt;/code&gt; 👉 &lt;code&gt;Policies&lt;/code&gt;, and create a "permit-all" policy for the MSK cluster. (Obviously, this is not recommended for production, but it's fine for testing purposes.) Add the following JSON policy in the policy editor, and save it with the name &lt;code&gt;MSK-permit-all&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kafka-cluster:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Creating an IAM role for the EKS service account
&lt;/h1&gt;

&lt;p&gt;Let's create an IAM role that the EKS service account can assume to access the MSK cluster. Go to &lt;code&gt;IAM&lt;/code&gt; 👉 &lt;code&gt;Roles&lt;/code&gt; 👉 &lt;code&gt;Create role&lt;/code&gt;, select &lt;code&gt;Web identity&lt;/code&gt; as the trusted entity type, and then select the EKS cluster's OIDC provider that we just created. For &lt;strong&gt;audience&lt;/strong&gt;, select &lt;code&gt;sts.amazonaws.com&lt;/code&gt;, and then click on &lt;strong&gt;Next&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjoe68k488p42uoxt4kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjoe68k488p42uoxt4kw.png" alt="Creating an IAM role" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;Permissions policy&lt;/strong&gt;, search for &lt;code&gt;MSK-permit-all&lt;/code&gt; (which we created in the previous section) and select it. This policy allows full access to the MSK cluster. Click &lt;strong&gt;Next&lt;/strong&gt;. Name the role &lt;code&gt;EKS-MSK-Access&lt;/code&gt;, take note of the &lt;code&gt;Trust policy&lt;/code&gt; (screenshot below; we'll edit it in the next section), and then click on &lt;strong&gt;Create role&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uv9d8zvs97cp9z9xrqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uv9d8zvs97cp9z9xrqp.png" alt="Trust policy" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Find the role &lt;code&gt;EKS-MSK-Access&lt;/code&gt; in the list of roles, and click on it to view its details. We need the ARN of the role for the next section, so copy it down. It should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arn:aws:iam::533267425521:role/EKS-MSK-Access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4xdmoarojvpoteok1fw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4xdmoarojvpoteok1fw.png" alt="Role ARN" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Creating a Kubernetes service account
&lt;/h1&gt;

&lt;p&gt;Each pod runs under a Kubernetes "service account," which can be connected to an IAM role. This allows the pod to assume the IAM role and use its permissions to access AWS resources, including the MSK cluster. Kubernetes mounts the service account token into the pod, and this token is used to authenticate requests. To see the token, run the following command inside the pod (using &lt;code&gt;echo&lt;/code&gt; + &lt;code&gt;cat&lt;/code&gt;, as the file does not end with a newline):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /var/run/secrets/kubernetes.io/serviceaccount/token&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can paste the output into &lt;a href="https://jwt.io/" rel="noopener noreferrer"&gt;jwt.io&lt;/a&gt; to decode it. It should look something like this, showing that the pod is associated with the &lt;code&gt;default&lt;/code&gt; service account in the &lt;code&gt;default&lt;/code&gt; namespace (see &lt;code&gt;default:default&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpnhjqnt6du37mlh0kxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpnhjqnt6du37mlh0kxn.png" alt="Default service account" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;
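&lt;p&gt;If you'd rather not paste a credential into a website, the token payload can be decoded locally instead. The helper below is a sketch that assumes &lt;code&gt;python3&lt;/code&gt; is available in the pod:&lt;/p&gt;

```shell
# Decode the payload (the second dot-separated part) of a JWT locally,
# as an alternative to pasting the token into jwt.io.
decode_jwt_payload() {
  python3 -c 'import base64, json, sys
part = sys.argv[1].split(".")[1]
part += "=" * (-len(part) % 4)  # restore stripped base64url padding
print(json.dumps(json.loads(base64.urlsafe_b64decode(part)), indent=2))' "$1"
}

# Inside the pod:
# decode_jwt_payload "$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
```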

&lt;p&gt;Now, let's create a Kubernetes service account that is associated with the IAM role we created earlier. In CloudShell, create a file named &lt;code&gt;sa.yaml&lt;/code&gt; with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysa&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::533267425521:role/EKS-MSK-Access&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
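&lt;p&gt;One way to create this file from the CloudShell prompt is a heredoc; the role ARN below is the one from this walkthrough, so substitute your own:&lt;/p&gt;

```shell
# Write the service account manifest; replace the role ARN with the
# ARN of your own EKS-MSK-Access role.
cat > sa.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mysa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::533267425521:role/EKS-MSK-Access
EOF
```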



&lt;p&gt;Then, apply the manifest to create the service account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; sa.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the service account is created successfully. Next, edit the trust relationship of the IAM role &lt;code&gt;EKS-MSK-Access&lt;/code&gt; so that the EKS service account can assume the role. In the IAM console, open the role &lt;code&gt;EKS-MSK-Access&lt;/code&gt;, switch to the &lt;strong&gt;Trust relationships&lt;/strong&gt; tab, and click &lt;strong&gt;Edit trust relationship&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa74tq4ix7uz0rgssdt2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa74tq4ix7uz0rgssdt2p.png" alt="Edit IAM trust relationship" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The important part is to replace the &lt;code&gt;aud&lt;/code&gt; condition key with a &lt;code&gt;sub&lt;/code&gt; key whose value names the service account's namespace and name. The updated trust policy should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Federated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::533267425521:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/72C2EA41B53D0AF8CA083756BF9D0CD8"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"oidc.eks.us-east-1.amazonaws.com/id/72C2EA41B53D0AF8CA083756BF9D0CD8:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system:serviceaccount:default:mysa"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
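&lt;p&gt;If you prefer the CLI over the console, the same trust policy can be applied with &lt;code&gt;aws iam update-assume-role-policy&lt;/code&gt;. The sketch below writes the policy to a file first; the account ID and OIDC provider ID are the ones from this walkthrough, so substitute your own:&lt;/p&gt;

```shell
# Write the updated trust policy to a file (values are from this walkthrough;
# replace the account ID and OIDC provider ID with your own).
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::533267425521:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/72C2EA41B53D0AF8CA083756BF9D0CD8"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/72C2EA41B53D0AF8CA083756BF9D0CD8:sub": "system:serviceaccount:default:mysa"
        }
      }
    }
  ]
}
EOF

# Then apply it to the role (requires AWS credentials with IAM permissions):
# aws iam update-assume-role-policy \
#   --role-name EKS-MSK-Access \
#   --policy-document file://trust.json
```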



&lt;p&gt;We now need to update the pod manifest to use the service account we just created. Let's edit the &lt;code&gt;ubuntu.yaml&lt;/code&gt; file we created earlier, and add the &lt;code&gt;serviceAccountName&lt;/code&gt; field under &lt;code&gt;spec&lt;/code&gt;. The updated manifest should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysa&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu:24.04&lt;/span&gt;
      &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/bash"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;infinity"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you try to apply the manifest again, it will fail: most fields under a pod's &lt;code&gt;spec&lt;/code&gt; are immutable once the pod is created. To fix this, we need to delete the existing pod and create a new one from the updated manifest. Run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pod ubuntu
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ubuntu.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get the token again using the below command, and parse it using &lt;a href="https://jwt.io/" rel="noopener noreferrer"&gt;jwt.io&lt;/a&gt;, you'll see that the &lt;code&gt;sub&lt;/code&gt; field is &lt;code&gt;"system:serviceaccount:default:mysa"&lt;/code&gt; as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /var/run/secrets/kubernetes.io/serviceaccount/token&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make sure that the pod assumes the IAM role, we can use the AWS CLI inside the pod, as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt &lt;span class="nb"&gt;install &lt;/span&gt;unzip less &lt;span class="nt"&gt;-y&lt;/span&gt;
wget &lt;span class="s2"&gt;"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"awscliv2.zip"&lt;/span&gt;
unzip awscliv2.zip
./aws/install
aws sts get-caller-identity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should return the IAM role ARN that we created earlier, something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"UserId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AROAXYKJXFDY26KQHFDH7:botocore-session-1750889575"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Account"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"533267425521"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Arn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:sts::533267425521:assumed-role/EKS-MSK-Access/botocore-session-1750889575"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Unfortunately, we now need to install the Kafka tools again, as the pod was recreated. Exec into the pod again and repeat the steps in the section "Installing Kafka tools on the EKS pod". Don't try to list the topics yet, as we need more configuration to authenticate with the MSK cluster.&lt;/p&gt;

&lt;h1&gt;
  
  
  Installing the AWS MSK IAM Authentication JAR
&lt;/h1&gt;

&lt;p&gt;While inside the Kafka tools directory, we need to download the &lt;code&gt;aws-msk-iam-auth&lt;/code&gt; JAR file, which is used to authenticate with the MSK cluster using IAM. Run the following command to download the JAR file and save it in the &lt;code&gt;libs&lt;/code&gt; directory. We also add it to the &lt;code&gt;CLASSPATH&lt;/code&gt; environment variable so that the Kafka tools can use it for authentication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://github.com/aws/aws-msk-iam-auth/releases/download/v2.3.2/aws-msk-iam-auth-2.3.2-all.jar &lt;span class="nt"&gt;-P&lt;/span&gt; libs/
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLASSPATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;libs/aws-msk-iam-auth-2.3.2-all.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the following to &lt;code&gt;client.properties&lt;/code&gt; file in the Kafka tools directory. This file contains the configuration for the Kafka client to connect to the MSK cluster using IAM authentication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;sasl.mechanism&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;AWS_MSK_IAM&lt;/span&gt;
&lt;span class="py"&gt;security.protocol&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;SASL_SSL&lt;/span&gt;
&lt;span class="py"&gt;sasl.jaas.config&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;software.amazon.msk.auth.iam.IAMLoginModule required;&lt;/span&gt;
&lt;span class="py"&gt;sasl.client.callback.handler.class&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;software.amazon.msk.auth.iam.IAMClientCallbackHandler&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you run the following command now, it returns no errors (but also no topics, as we haven't created any yet):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-topics.sh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command-config&lt;/span&gt; client.properties &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; &lt;span class="nv"&gt;$BOOTSTRAP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--list&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Creating a Kafka topic, producing messages, and consuming messages
&lt;/h1&gt;

&lt;p&gt;Use the following command to create a Kafka topic named &lt;code&gt;topic1&lt;/code&gt; with 3 partitions and a replication factor of 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-topics.sh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command-config&lt;/span&gt; client.properties &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; &lt;span class="nv"&gt;$BOOTSTRAP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--topic&lt;/span&gt; topic1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--partitions&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should return something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Created topic topic1.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you now list the topics as in the previous section, you should see &lt;code&gt;topic1&lt;/code&gt; in the list.&lt;/p&gt;

&lt;p&gt;To produce messages to the topic, we can use the &lt;code&gt;kafka-console-producer.sh&lt;/code&gt; script. Run the following command to start the producer. Each message occupies its own line, so pressing Enter sends the current line to the topic. You can type multiple messages, and then press Ctrl+C to exit the producer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-producer.sh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--producer&lt;/span&gt;.config client.properties &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; &lt;span class="nv"&gt;$BOOTSTRAP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--topic&lt;/span&gt; topic1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below, I sent messages &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;hello&lt;/code&gt;, and &lt;code&gt;bye&lt;/code&gt;, and then exited the producer with Ctrl+C:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63rpbr165o4eurikap1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63rpbr165o4eurikap1p.png" alt="example producer" width="431" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's consume the messages we just produced. We can use the &lt;code&gt;kafka-console-consumer.sh&lt;/code&gt; script to consume messages from the topic. Run the following command to start the consumer, which will read messages from the beginning of the topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-consumer.sh &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--consumer&lt;/span&gt;.config client.properties &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; &lt;span class="nv"&gt;$BOOTSTRAP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--topic&lt;/span&gt; topic1 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--from-beginning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consumer will show all the messages produced to the topic, starting from the beginning. It then waits for new messages to be produced, until you exit it with Ctrl+C. Below, you can see the messages &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;hello&lt;/code&gt;, and &lt;code&gt;bye&lt;/code&gt; that we produced earlier:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4kzzd9zxkb97kuzl0ov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4kzzd9zxkb97kuzl0ov.png" alt="example consumer" width="431" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Cleaning up the resources after testing to save costs
&lt;/h1&gt;

&lt;p&gt;Phew! That was a lot of work, but we successfully connected from an EKS pod to an MSK cluster, produced messages, and consumed them. Now, let's clean up the resources we created to avoid any unnecessary costs.&lt;/p&gt;

&lt;p&gt;This is easy: Delete the EKS cluster and the MSK cluster, and you should be good to go.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this guide, we've walked through the complete process of connecting an EKS pod to an MSK cluster using IAM authentication. We covered everything from setting up both services, configuring security groups and IAM roles, to producing and consuming Kafka messages.&lt;/p&gt;

&lt;p&gt;The key components we implemented include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating and configuring MSK and EKS clusters&lt;/li&gt;
&lt;li&gt;Setting up network connectivity between the clusters&lt;/li&gt;
&lt;li&gt;Configuring IAM roles and service accounts for authentication&lt;/li&gt;
&lt;li&gt;Installing and configuring Kafka tools with AWS MSK IAM authentication&lt;/li&gt;
&lt;li&gt;Testing the connection by creating topics and exchanging messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup provides a secure way to connect Kubernetes applications to Kafka using AWS's managed services, leveraging IAM for authentication rather than managing credentials manually. Keep in mind, though, that the IAM policy we created was deliberately permissive, and the security group accepted traffic from all VPCs. In production, follow the principle of least privilege: restrict the security group to only allow traffic from the EKS cluster's VPC and subnets, and use an IAM policy that allows only the necessary actions on the MSK cluster.&lt;/p&gt;

&lt;p&gt;The approach shown here can be extended to more complex scenarios, such as multi-cluster setups or integrating with other AWS services like Lambda or Kinesis.&lt;/p&gt;

</description>
      <category>eks</category>
      <category>msk</category>
      <category>iam</category>
      <category>aws</category>
    </item>
    <item>
      <title>PostgreSQL + Outbox Pattern Revamped — Part 1</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Sun, 08 Jun 2025 13:49:09 +0000</pubDate>
      <link>https://dev.to/msdousti/postgresql-outbox-pattern-revamped-part-1-3lai</link>
      <guid>https://dev.to/msdousti/postgresql-outbox-pattern-revamped-part-1-3lai</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a two-part series; for part 2, see &lt;a href="https://dev.to/msdousti/postgresql-outbox-pattern-revamped-part-2-1cbf"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The outbox pattern is a well-known design pattern used in distributed systems to ensure messages are sent reliably, by first storing messages in a local database before sending them to a message broker. The &lt;code&gt;outbox&lt;/code&gt; table often contains fields like &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;payload&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;, and &lt;code&gt;published_at&lt;/code&gt;. There's a scheduled process that picks up N messages from the outbox, sends them to the message broker, and marks them as published.&lt;/p&gt;

&lt;p&gt;While this seems pretty straightforward, I've seen quite a number of incidents related to the outbox pattern. In this blog post, I'll go over some of these incidents and discuss how to improve your setup to avoid them. With the last example, I'll demonstrate an issue that is often overlooked, and suggest a "revamped" solution that can significantly improve performance and reliability. This solution is based on partitioning the outbox table.&lt;/p&gt;

&lt;p&gt;I focus exclusively on PostgreSQL as my database of choice, but I think the principles can be applied to many other databases as well. All the experiments are done on PostgreSQL 17.5, on a MacBook Air M3 with 24 GB of RAM.&lt;/p&gt;

&lt;p&gt;Part 1 of this blog post gives a somewhat complete answer on how to avoid almost all outbox-pattern issues, yet it does not cover the problem of converting a regular outbox table to a partitioned table. &lt;a href="https://dev.to/msdousti/postgresql-outbox-pattern-revamped-part-2-1cbf"&gt;Part 2&lt;/a&gt; will tackle this problem, as well as give a more in-depth explanation of how to remove index bloat and improve performance further by adjusting the settings of the &lt;code&gt;autovacuum&lt;/code&gt; daemon for the partitioned outbox table.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Some people may favor &lt;a href="https://debezium.io/" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; or similar tools for the outbox pattern. With such tools, you only need to insert into the outbox table, and the tool handles the rest by publishing the messages to the message broker. It does this by processing the database change logs, called WAL (Write-Ahead Logging) in PostgreSQL. I'm a proponent of Debezium myself, but there are situations in which teams prefer to implement the outbox pattern manually. Examples include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;avoiding the additional complexity of using Debezium, &lt;/li&gt;
&lt;li&gt;publishing to something other than Kafka (Debezium supports other sinks but at additional complexity), and&lt;/li&gt;
&lt;li&gt;avoiding logical replication in PostgreSQL (to simplify upgrades and avoid potential issues with logical replication slots).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As such, this blog post is not about Debezium, but rather about the outbox pattern itself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Table of Contents
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incident 1: Sorting by created_at
Incident 2: Using the wrong data types
Incident 3: Incorrect indexing
  * Bad alternative 1
  * Bad alternative 2
Incident 4: Picking too many messages at each iteration
Incident 5: Poison pill messages
Incident 6: Inefficient index and visibility checking
  * Showcasing the issue
    - Setting up the outbox table and testing the index
    - Looking at the query plan
    - Simulating outbox message insertion and processing
    - Observing the query plan changes
  * How to fix the issue
    - Step 1: Create the partitioned outbox table
    - Step 2: Test the setup
    - Step 3: Test outbox message-processing performance
Conclusion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Incident 1: Sorting by created_at
&lt;/h1&gt;

&lt;p&gt;In distributed systems, deriving a total ordering of events from local timestamps is a fatal mistake. The first incident occurred when each pod of the microservice labeled its outbox messages with its own &lt;code&gt;created_at&lt;/code&gt; timestamp. The message publisher fetched messages from the outbox and sorted them by &lt;code&gt;created_at&lt;/code&gt; before sending them to the message broker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This led to a situation where messages were sent in an order that was not consistent with the total ordering of events across the system, since the pods could have slightly different clocks. In our case, the difference was only a fraction of a millisecond, but it was enough to cause issues in a highly concurrent system.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;created_at&lt;/code&gt; using a single source of truth, such as the database server's clock, can help mitigate this issue. However, experience shows that relying on time for ordering is not a good idea in distributed systems. Instead, consider using a database-supplied sequence number.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, define the &lt;code&gt;id&lt;/code&gt; column as &lt;code&gt;GENERATED ALWAYS AS IDENTITY&lt;/code&gt;, and then use it to order the messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Incident 2: Using the wrong data types
&lt;/h1&gt;

&lt;p&gt;Continuing with the previous tip, always use the &lt;code&gt;int8&lt;/code&gt; (= &lt;code&gt;bigint&lt;/code&gt;) data type for the &lt;code&gt;id&lt;/code&gt; column, rather than &lt;code&gt;int4&lt;/code&gt; (= &lt;code&gt;int&lt;/code&gt; or &lt;code&gt;integer&lt;/code&gt;). The latter allows for only about two billion entries (assuming signed integers), and I've seen incidents where the DB ran out of sequence numbers (though not in the context of the outbox).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tips:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;a href="https://github.com/postgres-ai/postgres-howtos/blob/main/0080_how_to_find_int4_pks_with_out_of_range_risks.md" rel="noopener noreferrer"&gt;this script&lt;/a&gt; to find &lt;code&gt;int4&lt;/code&gt; primary keys with out-of-range risks in a large database.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://github.com/postgres-ai/postgres-howtos/blob/main/0033_how_to_redefine_a_PK_without_downtime.md" rel="noopener noreferrer"&gt;this technique&lt;/a&gt; to redefine an &lt;code&gt;int4&lt;/code&gt; primary key to &lt;code&gt;int8&lt;/code&gt; without downtime.&lt;/li&gt;
&lt;/ol&gt;
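
&lt;p&gt;To spot this risk early, you can monitor how much headroom your sequences have left. The following query is a sketch, assuming PostgreSQL 10+ (where the &lt;code&gt;pg_sequences&lt;/code&gt; view is available); the 80% threshold is an arbitrary example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- approximate consumption of each sequence relative to the int4 maximum
SELECT schemaname, sequencename, last_value,
       round(100.0 * last_value / 2147483647, 2) AS pct_of_int4_max
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY pct_of_int4_max DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Any sequence above, say, 80% deserves a migration plan before it overflows.&lt;/p&gt;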

&lt;p&gt;Also, for timestamps, always use the variant that accounts for time zones, such as &lt;code&gt;timestamptz&lt;/code&gt; in PostgreSQL.&lt;/p&gt;
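
&lt;p&gt;A quick illustration of the difference (a sketch; the exact output depends on your session's &lt;code&gt;timezone&lt;/code&gt; setting): casting to &lt;code&gt;timestamp&lt;/code&gt; silently drops the UTC offset, while &lt;code&gt;timestamptz&lt;/code&gt; honors and normalizes it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SET timezone = 'UTC';

-- the +02 offset is silently discarded
SELECT '2025-01-01 10:00:00+02'::timestamp;    -- 2025-01-01 10:00:00

-- the offset is applied and the value normalized to the session time zone
SELECT '2025-01-01 10:00:00+02'::timestamptz;  -- 2025-01-01 08:00:00+00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;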

&lt;p&gt;For the &lt;code&gt;payload&lt;/code&gt; column, you can use &lt;code&gt;text&lt;/code&gt;, or the &lt;code&gt;json/jsonb&lt;/code&gt; types in PostgreSQL. The JSON types are better if you want to store a serialized version of your data and possibly index it. However, they can be slower to store and retrieve, or bulkier. Also, if your JSON serializer may output the character &lt;code&gt;\x00&lt;/code&gt;, you can't use &lt;code&gt;jsonb&lt;/code&gt;, since it doesn't allow null characters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;o_jsonb&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;o_jsonb&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'{"resp":"&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="s1"&gt;00"}'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR:  invalid input syntax for type json
LINE 1: insert into o_jsonb values ('{"resp":"\x00"}');
                                    ^
DETAIL:  Escape sequence "\x" is invalid.
CONTEXT:  JSON data, line 1: {"resp":"\x...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More info: &lt;a href="https://vladimir.varank.in/notes/2021/01/you-dont-insert-unicode-null-character-as-postgres-jsonb/" rel="noopener noreferrer"&gt;(You don't) Insert unicode NULL character as Postgres jsonb&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All in all, the outbox table should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT8&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Incident 3: Incorrect indexing
&lt;/h1&gt;

&lt;p&gt;This is the most common incident I've seen. For the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The correct index is a partial index as follows. Note the column being indexed is &lt;code&gt;id&lt;/code&gt;, and the condition is &lt;code&gt;published_at IS NULL&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx1&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's insert some data into the outbox table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- insert data with non-null published_at&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- insert data with null published_at&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the query plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COSTS&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which uses the index &lt;code&gt;idx1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              QUERY PLAN
---------------------------------------
 Limit
   -&amp;gt;  Index Scan using idx1 on outbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bad alternative 1
&lt;/h2&gt;

&lt;p&gt;If you drop the index (&lt;code&gt;DROP INDEX idx1;&lt;/code&gt;), the planner will use the primary key index instead, but since that index is not partial, it will be much bigger and less efficient; note the &lt;code&gt;Filter&lt;/code&gt; condition in the query plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  QUERY PLAN
----------------------------------------------
 Limit
   -&amp;gt;  Index Scan using outbox_pkey on outbox
         Filter: (published_at IS NULL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the query sorted by &lt;code&gt;created_at&lt;/code&gt;, which is not indexed, instead of &lt;code&gt;id&lt;/code&gt;, Postgres would perform a sequential scan and sort, which is even worse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  QUERY PLAN
----------------------------------------------
 Limit
   -&amp;gt;  Sort
         Sort Key: created_at
         -&amp;gt;  Seq Scan on outbox
               Filter: (published_at IS NULL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bad alternative 2
&lt;/h2&gt;

&lt;p&gt;I've seen several cases where the index is defined on the &lt;code&gt;published_at&lt;/code&gt; column instead of &lt;code&gt;id&lt;/code&gt;. This is totally useless: the B-Tree index contains nothing but &lt;code&gt;NULL&lt;/code&gt; values (the partial condition guarantees it), and there's no way to use it for both filtering and sorting by &lt;code&gt;id&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ❌ Don't do this ❌&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx2&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Incident 4: Picking too many messages at each iteration
&lt;/h1&gt;

&lt;p&gt;PostgreSQL hates long-running transactions, and the outbox pattern is no exception. If you pick too many messages at once, you may end up with a long-running transaction that blocks other operations on the outbox table (such as autovacuum), leading to performance degradation. If N is too large, PostgreSQL may even prefer not to use the index at all!&lt;/p&gt;

&lt;p&gt;Notice that the outbox pattern does the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a transaction&lt;/li&gt;
&lt;li&gt;Fetch N messages from the outbox&lt;/li&gt;
&lt;li&gt;Publish the messages to the message broker&lt;/li&gt;
&lt;li&gt;Commit the transaction&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 3 entails network I/O, which can take a long time, especially if the message broker is slow or the network is congested. If you pick too many messages at once, the transaction will take longer to complete. Also, any failure will result in the entire transaction being rolled back, which means all the messages will be reprocessed.&lt;/p&gt;

&lt;p&gt;Do some performance testing to find the optimal number of messages to pick at each iteration. In my experience, picking ~100 messages at a time is a good starting point. Do not pick more than 1000 messages at once!&lt;/p&gt;

&lt;p&gt;Also, if possible, send the messages to the message broker in batches, rather than one by one. The batches should honor the message order: group messages by some criterion (e.g., user ID, account ID, etc.), and send the messages in each group in order, while allowing different groups to be processed in parallel. We use Kotlin coroutines rather than threads for maximum throughput.&lt;/p&gt;

&lt;h1&gt;
  
  
  Incident 5: Poison pill messages
&lt;/h1&gt;

&lt;p&gt;A message that cannot be processed due to some error is called a "poison pill" message. If such a message is picked up by the outbox publisher, it will cause the entire transaction to fail, and the message will be reprocessed indefinitely. Grouping messages, as mentioned in the previous incident, can help limit the impact of poison pill messages: while the DB transaction still spans all the groups, each group can be processed independently, and exceptions in one group are caught so they do not affect the others.&lt;/p&gt;

&lt;p&gt;However, this is not a perfect solution, and alerting should be in place to notify the team when a poison pill message is detected. The team should then investigate the root cause and fix it. (Example: The Kafka topic to which the message is being published has a new schema in the Schema Registry that is incompatible with the message payload.)&lt;/p&gt;
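
&lt;p&gt;One common mitigation is to track failed publish attempts per message and skip messages that have failed too often, leaving them for manual inspection. This is a sketch, not part of the schema above; the &lt;code&gt;retry_count&lt;/code&gt; column, the &lt;code&gt;:failed_id&lt;/code&gt; placeholder, and the threshold of 5 are all hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- hypothetical column counting failed publish attempts
ALTER TABLE outbox ADD COLUMN retry_count INT NOT NULL DEFAULT 0;

-- the publisher increments the counter when publishing a message fails
UPDATE outbox SET retry_count = retry_count + 1 WHERE id = :failed_id;

-- the fetch query skips suspected poison pills
SELECT * FROM outbox
WHERE published_at IS NULL AND retry_count &lt; 5
ORDER BY id
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The skipped messages are exactly what the alerting mentioned above should surface.&lt;/p&gt;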

&lt;h1&gt;
  
  
  Incident 6: Inefficient index and visibility checking
&lt;/h1&gt;

&lt;p&gt;This is by far the most complex issue, and understanding it requires a bit of background on how PostgreSQL handles visibility and indexing. It is also the main reason I wrote this blog post, as the solution we came up with was really rewarding in terms of performance and reliability. It essentially revamps the outbox pattern: we use partitioned tables without additional indexes, which allows old messages to be removed in the blink of an eye!&lt;/p&gt;

&lt;p&gt;The issue arises when the outbox table is being inserted into frequently, and at the same time the messages are being published to the message broker at a fast rate. This causes any index on the outbox table to become "inefficient" (accumulating dead tuples and becoming bloated), which in turn leads to performance degradation. The inefficient index takes up more space and requires more time to scan. Besides, the visibility checking mechanism in PostgreSQL can cause additional overhead, as it needs to check the visibility of each row in the index, and possibly fetch many rows from the heap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Showcasing the issue
&lt;/h2&gt;

&lt;p&gt;Let's first demonstrate the issue with a simple example. It is important to note that this kind of issue is most pronounced in high-throughput systems, where the outbox table is inserted into and processed at a fast rate. So, if you have a low-throughput system, you may not see this issue in practice, but it is still good to be aware of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up the outbox table and testing the index
&lt;/h3&gt;

&lt;p&gt;This can be demonstrated with a simple example. We first set up the outbox table and its partial index as described earlier. I'm using an &lt;code&gt;UNLOGGED&lt;/code&gt; table to speed up the tests, but you should use a regular table in production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;UNLOGGED&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT8&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx1&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we insert a large number of messages into the outbox table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Looking at the query plan
&lt;/h3&gt;

&lt;p&gt;Next, let's check the query plan for fetching messages from the outbox. Notice that I'm only fetching the &lt;code&gt;id&lt;/code&gt; column, to enforce the so-called "index-only scan." I'm also fetching a large number of messages (1000), in contrast to what was suggested earlier (see Incident 4). Both choices are for demonstration purposes only, to make a point that will become clear shortly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TIMING&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COSTS&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query plan shows that Postgres is using the index &lt;code&gt;idx1&lt;/code&gt; to fetch the messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                              QUERY PLAN
-----------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   -&amp;gt;  Index Only Scan using idx1 on outbox (actual rows=1000 loops=1)
         Heap Fetches: 0
 Planning Time: 0.137 ms
 Execution Time: 0.133 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, the index &lt;code&gt;idx1&lt;/code&gt; is used, the planner uses an "index-only scan," and there are no heap fetches, meaning that all the required data is available in the index itself. In other words, after reading the data from the index, Postgres does not need to access the heap (the actual table) to fetch any extra information. The query is fast, and it only takes a fraction of a millisecond to execute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simulating outbox message insertion and processing
&lt;/h3&gt;

&lt;p&gt;Now, let's run two additional parallel sessions: one that inserts new messages into the outbox table, and another that processes messages from the outbox. While both parallel sessions are running, we'll check the query plan again and again to see how it changes over time. Here, I opted for a simple loop rather than a &lt;code&gt;pgbench&lt;/code&gt; script, as I felt it's more beginner-friendly and easier to understand. However, you can use &lt;code&gt;pgbench&lt;/code&gt; or any other tool to simulate the load in a more realistic way, and also gather statistics on the performance of each parallel session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel Session 1: Insert messages into the outbox table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="n"&gt;LOOP&lt;/span&gt;
        &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;-- Commit the transaction to avoid long-running transactions&lt;/span&gt;
        &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parallel Session 2: Process messages from the outbox table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="n"&gt;LOOP&lt;/span&gt;
        &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
            &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
            &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
            &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
            &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
        &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;-- Commit the transaction to avoid long-running transactions&lt;/span&gt;
        &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Observing the query plan changes
&lt;/h3&gt;

&lt;p&gt;While the two parallel sessions are running, let's check the query plan again and again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TIMING&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COSTS&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After just a few seconds, I got this awful query plan. Note that it fetches close to 100 million rows from the heap, and takes over 18.5 seconds to execute. This is a slowdown of &lt;strong&gt;5 orders of magnitude&lt;/strong&gt; compared to the initial query plan, and can easily lead to an incident.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                              QUERY PLAN
-----------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   -&amp;gt;  Index Only Scan using idx1 on outbox (actual rows=1000 loops=1)
         Heap Fetches: 96633220
 Planning Time: 0.067 ms
 Execution Time: 18553.061 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I should point out that not all plans were this bad. I executed it several times, and the results I got were as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Heap Fetches&lt;/th&gt;
&lt;th&gt;Execution Time (ms)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;870094&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;96633220&lt;/td&gt;
&lt;td&gt;18553&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;11000&lt;/td&gt;
&lt;td&gt;2844&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;1791&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;11000&lt;/td&gt;
&lt;td&gt;2708&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reason for this behavior is that autovacuum kicks in, cleans up the table and the index, and updates the visibility map. As such, the number of heap fetches varies over time, but in most cases the query was still 4 orders of magnitude slower!&lt;/p&gt;
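
&lt;p&gt;You can watch this churn while the two sessions are running, by monitoring dead tuples and autovacuum activity for the table (a sketch using the standard &lt;code&gt;pg_stat_user_tables&lt;/code&gt; view):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- live vs. dead tuples, and when autovacuum last ran on the outbox table
SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'outbox';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running this repeatedly shows &lt;code&gt;n_dead_tup&lt;/code&gt; climbing between autovacuum runs and dropping after each one, which mirrors the fluctuating heap fetches above.&lt;/p&gt;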

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; While vacuum can clean up the index, it only marks dead rows in the index as invalid, and does not actually remove them from the index (because doing so would require rebuilding the index structure, which would slow down vacuum). Over time, this causes the index to become bloated. We'll discuss this in &lt;a href="https://dev.to/msdousti/postgresql-outbox-pattern-revamped-part-2-1cbf"&gt;Part 2&lt;/a&gt; of this blog post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to fix the issue
&lt;/h2&gt;

&lt;p&gt;Fixing the issue is actually quite simple: Just use a partitioned outbox table. More specifically, the outbox table is partitioned by whether &lt;code&gt;published_at&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt; or not. This way, one partition will always contain the messages that are not yet published, and the other partition will contain the messages that are already published. There will be no need for fetching tons of rows from the heap, since any row in the "unpublished" partition is guaranteed to have &lt;code&gt;published_at IS NULL&lt;/code&gt;. Furthermore, it is straightforward to remove all messages from the "published" partition: You can simply &lt;code&gt;TRUNCATE&lt;/code&gt; the partition regularly, which is a rapid operation in PostgreSQL. Finally, the operations are more "vacuum-friendly," as the "published" partition is only inserted to.&lt;/p&gt;
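&lt;p&gt;As a sketch of the cleanup just described, referring to the &lt;code&gt;outbox_published&lt;/code&gt; partition we create below: the job name and schedule here are hypothetical, and &lt;code&gt;pg_cron&lt;/code&gt; is just one option — any external scheduler works equally well.&lt;/p&gt;

```sql
-- Hypothetical nightly cleanup of already-published messages.
-- TRUNCATE only touches the "published" partition, so unpublished
-- messages are never at risk. Scheduling via pg_cron is an assumption;
-- the extension must be installed for cron.schedule to exist.
SELECT cron.schedule(
    'purge-outbox',                      -- hypothetical job name
    '0 3 * * *',                         -- every day at 03:00
    $$TRUNCATE ONLY outbox_published$$
);
```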

&lt;p&gt;Let's dive in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create the partitioned outbox table
&lt;/h3&gt;

&lt;p&gt;We create the &lt;code&gt;outbox&lt;/code&gt; table as before (note that &lt;code&gt;UNLOGGED&lt;/code&gt; is for testing purposes; use a regular table in production), with two differences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The table is partitioned by the &lt;code&gt;published_at&lt;/code&gt; column.&lt;/li&gt;
&lt;li&gt;The table has no primary key: on a partitioned table, a primary key must include the partition key, and &lt;code&gt;published_at&lt;/code&gt; cannot be part of one because it is nullable. Instead, we rely on &lt;code&gt;GENERATED ALWAYS AS IDENTITY&lt;/code&gt; to generate unique IDs. Though in some cases (e.g., logical replication) the lack of a primary key can cause issues, this is an acceptable trade-off for an outbox table in most use cases.
&lt;/li&gt;
&lt;/ol&gt;
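&lt;p&gt;To see the constraint in action, you can try adding the primary key yourself. PostgreSQL rejects it, since a unique constraint on a partitioned table must cover the partition key (the error text below is from recent Postgres versions and may differ slightly in yours):&lt;/p&gt;

```sql
-- Attempting a primary key on id alone fails on the partitioned table:
ALTER TABLE outbox ADD PRIMARY KEY (id);
-- ERROR:  unique constraint on partitioned table must include all partitioning columns

-- Including published_at in the key is not an option either:
-- primary-key columns must be NOT NULL, while published_at is NULL
-- by design for every unpublished message.
```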

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;UNLOGGED&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;INT8&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;LIST&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we create two partitions: one for the messages that are not yet published, and another for the messages that are already published. The first partition is called &lt;code&gt;outbox_unpublished&lt;/code&gt;, and the second partition is called &lt;code&gt;outbox_published&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;UNLOGGED&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt; 
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;UNLOGGED&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox_published&lt;/span&gt; 
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we create the index on the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition, as it is the one that will be queried for fetching messages to publish. The other partition will not be queried, so it does not need an index. Also, the index should no longer be a partial index, as the partition itself is already filtered by &lt;code&gt;published_at IS NULL&lt;/code&gt;. However, for demonstration purposes and showing heap fetches, I opt to create a partial index anyway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx1&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;outbox_unpublished&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- This is unnecessary, as the partition is already filtered by published_at IS NULL&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Test the setup
&lt;/h3&gt;

&lt;p&gt;Let's first insert some messages into the &lt;code&gt;outbox&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's check the stats of the number of tuples inserted, updated, and deleted in each partition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_tup_ins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_tup_upd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_tup_del&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'outbox_%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is as expected: All 1 million tuples are inserted into the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      relname       | n_tup_ins | n_tup_upd | n_tup_del
--------------------+-----------+-----------+-----------
 outbox_unpublished |   1000000 |         0 |         0
 outbox_published   |         0 |         0 |         0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if we update the &lt;code&gt;published_at&lt;/code&gt; column for a single row? The wrong way to do this, which is an easy way to create an incident, is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ❌ Don't do this ❌&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why on earth is this a bad idea? Because the outbox table has no index on the &lt;code&gt;id&lt;/code&gt; column, remember? Only the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition has an index on the &lt;code&gt;id&lt;/code&gt; column, so the above query will perform a sequential scan on the &lt;code&gt;outbox_published&lt;/code&gt; partition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COSTS&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SUMMARY&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here's the query plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                     QUERY PLAN
-----------------------------------------------------
 Update on outbox
   Update on outbox_unpublished outbox_1
   Update on outbox_published outbox_2
   -&amp;gt;  Append
         -&amp;gt;  Seq Scan on outbox_unpublished outbox_1
               Filter: (id = 1)
         -&amp;gt;  Seq Scan on outbox_published outbox_2
               Filter: (id = 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You see &lt;code&gt;Seq Scan&lt;/code&gt; on both partitions as I opted for a partial index on the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition, but as I pointed out earlier, this is just for demonstration purposes. &lt;/p&gt;

&lt;p&gt;Regardless, the point is that the &lt;code&gt;outbox_published&lt;/code&gt; partition will be sequentially scanned. In the beginning, when &lt;code&gt;outbox_published&lt;/code&gt; is empty, this &lt;code&gt;Seq Scan&lt;/code&gt; is fast. But as soon as the partition starts to accumulate messages, the query becomes slower and slower, to the point where it can cause an incident.&lt;/p&gt;

&lt;p&gt;With partitioned tables, it's always best to include the partition key in the &lt;code&gt;WHERE&lt;/code&gt; clause. In our case, we can do this by using the &lt;code&gt;published_at IS NULL&lt;/code&gt; condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COSTS&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SUMMARY&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- This is the correct way to do it&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above will only scan the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition, and will be fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                         QUERY PLAN
------------------------------------------------------------
 Update on outbox
   Update on outbox_unpublished outbox_1
   -&amp;gt;  Index Scan using idx1 on outbox_unpublished outbox_1
         Index Cond: (id = 1)
         Filter: (published_at IS NULL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now actually run the update query, and check the stats again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Update the published_at column for a single row&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Check the stats again&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_tup_ins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_tup_upd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_tup_del&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'outbox_%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The stats now show that one tuple was deleted from the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition and inserted into the &lt;code&gt;outbox_published&lt;/code&gt; partition; no tuples were updated in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      relname       | n_tup_ins | n_tup_upd | n_tup_del
--------------------+-----------+-----------+-----------
 outbox_unpublished |   1000000 |         0 |         1
 outbox_published   |         1 |         0 |         0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
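&lt;p&gt;As a quick sanity check (not strictly necessary), you can ask which partition the moved row now lives in via the &lt;code&gt;tableoid&lt;/code&gt; system column:&lt;/p&gt;

```sql
-- tableoid identifies the physical partition holding each row;
-- casting it to regclass prints the partition's name.
-- (Fine as a one-off check, even though it scans both partitions.)
SELECT tableoid::regclass AS partition, id, published_at
FROM outbox
WHERE id = 1;
```

&lt;p&gt;The row should now be reported under &lt;code&gt;outbox_published&lt;/code&gt;, with a non-&lt;code&gt;NULL&lt;/code&gt; &lt;code&gt;published_at&lt;/code&gt;.&lt;/p&gt;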



&lt;h3&gt;
  
  
  Step 3: Test outbox message-processing performance
&lt;/h3&gt;

&lt;p&gt;Before running the message-processing query, let's check the query plan for fetching messages from the outbox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TIMING&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COSTS&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   -&amp;gt;  Index Only Scan using idx1 on outbox_unpublished outbox (actual rows=1000 loops=1)
         Heap Fetches: 156
 Planning Time: 0.170 ms
 Execution Time: 0.165 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, what?! There are heap fetches? Yes, because we updated the &lt;code&gt;published_at&lt;/code&gt; column for a single row, and the change is not significant enough to trigger an autovacuum. Do a manual &lt;code&gt;VACUUM outbox&lt;/code&gt;, and check the query plan again. You'll be surprised to see that there are still 156 heap fetches!&lt;/p&gt;

&lt;p&gt;The reason is that, starting from Postgres 14, VACUUM may skip index cleanup when the number of removable index entries is insignificant. This behavior is controlled by an option called &lt;code&gt;INDEX_CLEANUP&lt;/code&gt;, which is set to &lt;code&gt;AUTO&lt;/code&gt; by default. To force the index cleanup, set it to &lt;code&gt;ON&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INDEX_CLEANUP&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TIMING&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COSTS&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   -&amp;gt;  Index Only Scan using idx1 on outbox_unpublished outbox (actual rows=1000 loops=1)
         Heap Fetches: 0
 Planning Time: 0.166 ms
 Execution Time: 0.135 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'll reiterate that index cleanup is different from index bloat removal. This is a topic for the next part of this blog post, where we'll discuss how to remove index bloat and improve performance even further.&lt;/p&gt;
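&lt;p&gt;If you want to observe the bloat yourself in the meantime, the &lt;code&gt;pgstattuple&lt;/code&gt; contrib extension (assuming it is available on your instance) can report how densely packed the index's leaf pages are:&lt;/p&gt;

```sql
-- pgstattuple ships with PostgreSQL contrib, but must be enabled:
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- avg_leaf_density drops as the B-tree accumulates dead or unused space;
-- a freshly built B-tree index sits around 90% (the default fillfactor).
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('idx1');
```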

&lt;p&gt;Okay, let's now run the two parallel sessions again. The code for session 1 is exactly the same as before, but for session 2, we need to change the query so that the &lt;code&gt;UPDATE&lt;/code&gt; statement uses the additional condition &lt;code&gt;published_at IS NULL&lt;/code&gt; to ensure that it only updates the &lt;code&gt;outbox_unpublished&lt;/code&gt; partition:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel Session 1: Insert messages into the outbox table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="n"&gt;LOOP&lt;/span&gt;
        &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parallel Session 2: Process messages from the outbox table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="n"&gt;LOOP&lt;/span&gt;
        &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
            &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
            &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
            &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
            &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
        &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;-- Ensure we only update the unpublished partition&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interestingly, the fetch query is pretty fast, and seemingly unaffected by the parallel sessions. I ran it many times over, and it gave me a consistent result of 1000 heap fetches and an execution time of 1–3 ms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TIMING&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COSTS&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   -&amp;gt;  Index Only Scan using idx1 on outbox_unpublished outbox (actual rows=1000 loops=1)
         Heap Fetches: 1000
 Planning Time: 0.042 ms
 Execution Time: 2.543 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The outbox pattern is a powerful tool for ensuring reliable message delivery in distributed systems. However, it can lead to incidents if not implemented correctly. In this blog post, we discussed several common pitfalls and how to avoid them, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sorting by &lt;code&gt;created_at&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Using the wrong data types&lt;/li&gt;
&lt;li&gt;Incorrect indexing&lt;/li&gt;
&lt;li&gt;Picking too many messages at each iteration&lt;/li&gt;
&lt;li&gt;Poison pill messages&lt;/li&gt;
&lt;li&gt;Inefficient index and visibility checking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is particularly important, as it can lead to significant performance degradation in high-throughput systems. We showed how to revamp the outbox pattern by using partitioned tables, which allows for easy removal of old messages and avoids the issues with stale indexes and visibility checking.&lt;/p&gt;

&lt;p&gt;While this revamped solution is straightforward to implement on a greenfield project, it can be more challenging to apply to an existing system, because PostgreSQL does not allow converting a regular table to a partitioned table in place. In &lt;a href="https://dev.to/msdousti/postgresql-outbox-pattern-revamped-part-2-1cbf"&gt;Part 2&lt;/a&gt; of this blog post, we'll tackle this problem, and also discuss how to remove index bloat and improve performance even further by adjusting the settings of the &lt;code&gt;autovacuum&lt;/code&gt; daemon for the partitioned outbox table. Stay tuned, and we wish you an incident-free outbox pattern implementation!&lt;/p&gt;

</description>
      <category>outbox</category>
      <category>postgres</category>
      <category>performance</category>
      <category>incidents</category>
    </item>
    <item>
      <title>Investigating PostgreSQL's `ANALYZE` Runtime Behavior: Sampling, Table Size, and Cache Effects</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Sun, 06 Apr 2025 16:47:06 +0000</pubDate>
      <link>https://dev.to/msdousti/investigating-postgresqls-analyze-runtime-behavior-sampling-table-size-and-cache-effects-1eln</link>
      <guid>https://dev.to/msdousti/investigating-postgresqls-analyze-runtime-behavior-sampling-table-size-and-cache-effects-1eln</guid>
<description>&lt;p&gt;PostgreSQL's &lt;code&gt;ANALYZE&lt;/code&gt; command is designed to collect statistics for the query planner by sampling a fixed number of pages from a table. As per the documentation, when &lt;code&gt;default_statistics_target&lt;/code&gt; is 100, the sample size is 30,000 pages (or less, if the table is small and does not have 30K pages). This leads to a reasonable assumption: &lt;strong&gt;the runtime of &lt;code&gt;ANALYZE&lt;/code&gt; should be independent of the table size&lt;/strong&gt;, given that the number of pages sampled is constant.&lt;/p&gt;
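&lt;p&gt;As a back-of-the-envelope check, the target sample size is a factor of 300 times the statistics target (the multiplier comes from PostgreSQL's internals), so it can be computed from the current setting:&lt;/p&gt;

```sql
-- ANALYZE samples about 300 * default_statistics_target rows
-- (reading at most that many pages), so with the default target of 100
-- the sample is 30,000.
SELECT current_setting('default_statistics_target')::int * 300
    AS target_sample_size;
```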

&lt;p&gt;However, experimental evidence suggests otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment Setup
&lt;/h2&gt;

&lt;p&gt;We created five PostgreSQL tables of increasing size, from 100,000 to 400 million rows, using the following SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autovacuum_enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each table, we recorded (on a MacBook Air M3 with 24 GB RAM):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table size (in MB and number of pages)&lt;/li&gt;
&lt;li&gt;Analyze time under three conditions:

&lt;ul&gt;
&lt;li&gt;First time after table creation&lt;/li&gt;
&lt;li&gt;Subsequent runs with data in cache&lt;/li&gt;
&lt;li&gt;After purging OS and PostgreSQL cache using:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew services stop postgresql@17 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;purge &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew services start postgresql@17
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Linux, use the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop postgresql &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo 3 &amp;gt; /proc/sys/vm/drop_caches'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rows&lt;/th&gt;
&lt;th&gt;Size (MB)&lt;/th&gt;
&lt;th&gt;Pages&lt;/th&gt;
&lt;th&gt;First Analyze (ms)&lt;/th&gt;
&lt;th&gt;Subsequent Analyze (ms)&lt;/th&gt;
&lt;th&gt;After Cache Purge (ms)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000,000&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;4,480&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000,000&lt;/td&gt;
&lt;td&gt;346&lt;/td&gt;
&lt;td&gt;44,288&lt;/td&gt;
&lt;td&gt;299&lt;/td&gt;
&lt;td&gt;209&lt;/td&gt;
&lt;td&gt;1,461&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000,000&lt;/td&gt;
&lt;td&gt;3,457&lt;/td&gt;
&lt;td&gt;442,496&lt;/td&gt;
&lt;td&gt;459&lt;/td&gt;
&lt;td&gt;454&lt;/td&gt;
&lt;td&gt;7,161&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;400,000,000&lt;/td&gt;
&lt;td&gt;13,827&lt;/td&gt;
&lt;td&gt;1,769,920&lt;/td&gt;
&lt;td&gt;1,426&lt;/td&gt;
&lt;td&gt;1,459&lt;/td&gt;
&lt;td&gt;8,630&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Visualization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwtq9lzj26r01sirqexi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwtq9lzj26r01sirqexi.png" alt="Analyze Time vs Table Size" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above chart shows that even though PostgreSQL samples a fixed number of pages, the &lt;code&gt;ANALYZE&lt;/code&gt; runtime &lt;strong&gt;increases with table size&lt;/strong&gt;, especially after purging caches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Possible Explanation: Reservoir Sampling Complexity
&lt;/h2&gt;

&lt;p&gt;One explanation lies in how PostgreSQL implements sampling. PostgreSQL uses &lt;a href="https://en.wikipedia.org/wiki/Reservoir_sampling" rel="noopener noreferrer"&gt;reservoir sampling&lt;/a&gt;, a technique that allows uniform random sampling from a stream (i.e., the full table). While the sample size is fixed (e.g., 30,000 pages), the &lt;strong&gt;time complexity of the algorithm is not constant&lt;/strong&gt;. The naive implementation is linear in the stream length; the optimized versions bring the number of random draws down to roughly logarithmic, though the items in between still have to be skipped over.&lt;/p&gt;
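&lt;p&gt;As a minimal illustration (a naive Python sketch, not PostgreSQL's implementation), the basic "Algorithm R" form of reservoir sampling keeps the reservoir at a fixed size, yet still visits every item in the stream once:&lt;/p&gt;

```python
import random

def reservoir_sample(stream, k):
    """Naive reservoir sampling (Algorithm R): uniform sample of k items.

    The reservoir never grows past k items, but every element of the
    stream is still visited once, so a pass is linear in the stream length.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if len(reservoir) != k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # keep the new item with probability k / (i + 1)
            j = random.randrange(i + 1)
            if j in range(k):
                reservoir[j] = item
    return reservoir

# e.g. sample 30,000 "pages" out of a 1,000,000-page table
sample = reservoir_sample(range(1_000_000), 30_000)
```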

&lt;p&gt;PostgreSQL &lt;a href="https://github.com/postgres/postgres/blob/2e4ccf1b4508cc337bb4d0afff1e32a049d549fc/src/backend/utils/misc/sampling.c#L118-L131" rel="noopener noreferrer"&gt;uses&lt;/a&gt; Vitter's Reservoir Sampling Algorithm Z:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * These two routines embody Algorithm Z from "Random sampling with a
 * reservoir" by Jeffrey S. Vitter, in ACM Trans. Math. Softw. 11, 1
 * (Mar. 1985), Pages 37-57.  Vitter describes his algorithm in terms
 * of the count S of records to skip before processing another record.
 * It is computed primarily based on t, the number of records already read.
 * The only extra state needed between calls is W, a random state variable.
 *
 * reservoir_init_selection_state computes the initial W value.
 *
 * Given that we've already read t records (t &amp;gt;= n), reservoir_get_next_S
 * determines the number of records to skip before the next record is
 * processed.
 */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also &lt;a href="https://github.com/postgres/postgres/blob/5e1915439085014140314979c4dd5e23bd677cac/src/backend/utils/misc/sampling.c#L32C1-L34C14" rel="noopener noreferrer"&gt;uses&lt;/a&gt; Knuth's simpler Algorithm S, which applies when the total number of items to be sampled from is known in advance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * Since we know the total number of blocks in advance, we can use the
 * straightforward Algorithm S from Knuth 3.4.2, rather than Vitter's
 * algorithm.
 */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
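&lt;p&gt;For illustration, here is a minimal Python sketch of Knuth's selection-sampling idea (not PostgreSQL's C implementation): each index is visited once and kept with probability (still needed) / (still remaining), which yields exactly &lt;code&gt;n&lt;/code&gt; uniformly chosen items while the scan remains linear in the total count:&lt;/p&gt;

```python
import random

def algorithm_s(total, n):
    """Knuth's selection sampling (Algorithm S, TAOCP 3.4.2).

    Works when the total number of items is known in advance: each index
    is visited once and kept with probability needed / remaining, which
    yields exactly n uniformly chosen indices in sorted order.
    """
    selected = []
    for t in range(total):
        remaining = total - t
        needed = n - len(selected)
        # keep index t with probability needed / remaining
        if random.randrange(remaining) in range(needed):
            selected.append(t)
    return selected

# e.g. choose 30,000 block numbers out of 442,496
blocks = algorithm_s(442_496, 30_000)
```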



&lt;p&gt;The source code actually mentions that the implemented algorithm &lt;a href="https://github.com/postgres/postgres/blob/5e1915439085014140314979c4dd5e23bd677cac/src/backend/utils/misc/sampling.c#L81" rel="noopener noreferrer"&gt;may deviate&lt;/a&gt; from Algorithm S:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * It is not obvious that this code matches Knuth's Algorithm S.
 * Knuth says to skip the current block with probability 1 - k/K.
 * If we are to skip, we should advance t (hence decrease K), and
 * repeat the same probabilistic test for the next block.
 */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means PostgreSQL must touch all pages of the table to decide whether to include them in the sample, even if it eventually keeps only 30,000 of them. This explains why larger tables take longer to &lt;code&gt;ANALYZE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But, all in all, it still seems that &lt;code&gt;analyze&lt;/code&gt; runtime depends considerably (perhaps even linearly) on the table size! In &lt;a href="https://www.postgresql.org/message-id/2r6r60h049h0lg4s9ve3qe1h38ubprpo30%40email.aon.at" rel="noopener noreferrer"&gt;an email to Postgres Hackers back in 2004&lt;/a&gt;, Manfred Koizar mentioned this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Playing around with some real numbers (for 100 tuples/page and a sample&lt;br&gt;
size of 3000) I got:&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; rel  | page
 size | reads
------+-------------
   30 |    30
  300 |   300    expectation is something like 299.9995 
  500 |   499
   1K |   990
   3K |  2.6K
  30K |    8K
 100K |   12K
   1M |   19K
  10M |   26K
 100M |   33K
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This growth rate is steeper than O(log(nblocks)).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I have an idea how this could be done with O(1) page reads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What I have in mind is a kind of "Double Vitter" algorithm.  Whatever we&lt;br&gt;
do to get our sample of rows, in the end the sampled rows come from no&lt;br&gt;
more than sample_size different blocks.  So my idea is to first create a&lt;br&gt;
random sample of sample_size block numbers, and then to sample the rows&lt;br&gt;
out of this pool of blocks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The idea is implemented in Postgres, as the source code &lt;a href="https://github.com/postgres/postgres/blob/2e4ccf1b4508cc337bb4d0afff1e32a049d549fc/src/backend/utils/misc/sampling.c#L27" rel="noopener noreferrer"&gt;refers to it&lt;/a&gt;. However, based on my experiments, the running time of &lt;code&gt;analyze&lt;/code&gt; looks closer to O(nblocks) than to O(1) or O(log(nblocks)). Of course, asymptotic behavior cannot be established from numeric observations alone!&lt;/p&gt;
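&lt;p&gt;A hypothetical sketch of the two-stage idea from the mailing-list post (names and structure are illustrative, not Postgres internals): sample a fixed set of block numbers first, then draw rows only from those blocks, so page reads are bounded by the sample size rather than the table size:&lt;/p&gt;

```python
import random

def two_stage_sample(n_blocks, rows_per_block, sample_size):
    """Sketch of the "sample blocks first" idea (illustrative, not Postgres).

    Stage 1 picks sample_size distinct block numbers; stage 2 draws one
    row from each chosen block. Page reads are bounded by sample_size,
    independent of n_blocks.
    """
    chosen_blocks = random.sample(range(n_blocks), min(sample_size, n_blocks))
    rows = [(b, random.randrange(rows_per_block)) for b in chosen_blocks]
    return chosen_blocks, rows

# e.g. 30,000 pages out of a 1,769,920-page table, ~226 rows per page
chosen, rows = two_stage_sample(1_769_920, 226, 30_000)
```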
&lt;h2&gt;
  
  
  Effect of Cache Purging
&lt;/h2&gt;

&lt;p&gt;The drastic increase in analyze time after purging caches highlights the importance of PostgreSQL shared buffers and OS-level caching. Without the cache:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each page access becomes a disk I/O operation.&lt;/li&gt;
&lt;li&gt;Disk reads (even from SSDs) are significantly slower than memory access.&lt;/li&gt;
&lt;li&gt;This especially affects large tables where random disk reads dominate the runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why, for the 400 million row table, analyze time jumped from ~1.4 seconds to over 8 seconds after cache purging.&lt;/p&gt;

&lt;p&gt;Interestingly, the analyze time did not decrease on subsequent runs until I ran vacuum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;timing&lt;/span&gt;
&lt;span class="n"&gt;Timing&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8630&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;107&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;640&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8706&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;080&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;706&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;vacuum&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;43634&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;979&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;43&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;635&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1548&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;534&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;549&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1434&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;224&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;434&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason is probably the sampling: each &lt;code&gt;analyze&lt;/code&gt; loads a different random sample into RAM, leading to a lot of disk I/O when the data is not cached. When &lt;code&gt;vacuum&lt;/code&gt; is run for the first time, on the other hand, it reads all the disk pages, warming up the caches.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vacuum freeze&lt;/code&gt; is even more aggressive, leading to a huge reduction in &lt;code&gt;analyze&lt;/code&gt; runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;vacuum&lt;/span&gt; &lt;span class="k"&gt;freeze&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;70350&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;355&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;350&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;658&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;919&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;

&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;538&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;907&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;

&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;528&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;144&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;

&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t_400_000_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;499&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;968&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Still, after running &lt;code&gt;vacuum freeze&lt;/code&gt; on the 100M-row and 400M-row tables, the latter takes about 3x as long to analyze.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>statistics</category>
      <category>sampling</category>
      <category>cache</category>
    </item>
    <item>
      <title>The effect of frequent values on histogram_bounds in PostgreSQL</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Sun, 06 Apr 2025 12:45:47 +0000</pubDate>
      <link>https://dev.to/msdousti/understanding-histogram-bounds-in-postgresql-54d1</link>
      <guid>https://dev.to/msdousti/understanding-histogram-bounds-in-postgresql-54d1</guid>
      <description>&lt;p&gt;In a &lt;a href="https://substack.com/home/post/p-157149678" rel="noopener noreferrer"&gt;recent blog post&lt;/a&gt;, I explored &lt;strong&gt;statistics in PostgreSQL&lt;/strong&gt;, with a focus on how the database engine builds &lt;strong&gt;histogram bounds&lt;/strong&gt; during the &lt;code&gt;ANALYZE&lt;/code&gt; process. These histograms are critical for query planning and optimization, as they give the planner a way to estimate data distributions across columns.&lt;/p&gt;

&lt;p&gt;One part of the post demonstrated how to generate a skewed distribution and inspect the resulting histogram bounds using the &lt;code&gt;pg_stats&lt;/code&gt; view. Here's the original setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autovacuum_enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;off&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a table with 500,500 rows, where 1 is repeated 1000 times, 2 is repeated 999 times, and so on, until 1000, which is repeated only once. This distribution is shown in the following diagram:&lt;/p&gt;
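&lt;p&gt;As a quick sanity check of the 500,500 figure: value &lt;code&gt;v&lt;/code&gt; is generated once for each &lt;code&gt;i&lt;/code&gt; from &lt;code&gt;v&lt;/code&gt; to 1000, so it appears 1001 - v times, and the counts sum to 1000 * 1001 / 2:&lt;/p&gt;

```python
# Each value v (1..1000) appears 1001 - v times in the generated table.
counts = {v: 1001 - v for v in range(1, 1001)}

total = sum(counts.values())
print(total)  # 500500 = 1000 * 1001 / 2
```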

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7brvpt9gae01goqsv1oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7brvpt9gae01goqsv1oy.png" alt="Distribution of inserted data" width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The blog post also demonstrates the concept of histogram bounds in Postgres, which can be observed after analyzing the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="n"&gt;histogram_bounds&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;histogram_bounds&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is based on a random sampling of the table, so your mileage may vary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5, 18, 32, 40, 54, 68, 80, 100, 110, 119, 125, 136, 144, 149, 
154, 160, 165, 171, 178, 182, 189, 195, 202, 211, 216, 222, 
227, 234, 239, 245, 250, 256, 261, 268, 273, 278, 283, 289, 
295, 301, 307, 314, 320, 325, 332, 339, 346, 352, 358, 364, 
370, 376, 383, 389, 396, 402, 409, 415, 422, 429, 436, 443, 
450, 458, 466, 473, 481, 489, 496, 503, 511, 519, 527, 536, 
545, 555, 564, 573, 583, 593, 603, 613, 623, 633, 645, 657, 
668, 680, 693, 706, 721, 734, 749, 765, 783, 802, 824, 846, 
873, 910, 994
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expectation is that the histogram would reflect this skewed distribution, assigning &lt;strong&gt;more histogram buckets&lt;/strong&gt; to regions with &lt;strong&gt;higher data density&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reader Observation: Histogram Does Not Match Intuition
&lt;/h2&gt;

&lt;p&gt;Chris Jones, a reader of the blog, &lt;a href="https://traderepublic.substack.com/p/statistics-how-postgresql-counts-55d/comment/99729446" rel="noopener noreferrer"&gt;made a sharp observation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In your output of &lt;code&gt;histogram_bounds&lt;/code&gt; for table t above, it shows that there are 7 buckets for values 0-100, and 17 buckets for 200-300, even though we know that there are more records with values in the 0-100 range. I got similar bounds when I ran the same example. I tried with lots of different statistics_target values and always got the same. It seems like this histogram implies a distribution that is not correct.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's demonstrate his point by drawing the bucket distribution for the histogram bounds (using &lt;a href="https://gist.github.com/msdousti/f94ebbb9f69204f4de1ba64391b0f6c5" rel="noopener noreferrer"&gt;this Python script&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0rqx5jigpl414e7t0ku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0rqx5jigpl414e7t0ku.png" alt="Bucket distribution for the histogram_bounds" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an important point. If histogram bounds are used for cardinality estimation, shouldn't denser regions have &lt;strong&gt;finer granularity&lt;/strong&gt;?&lt;/p&gt;




&lt;h2&gt;
  
  
  Digging Deeper: Why Is the Histogram Counterintuitive?
&lt;/h2&gt;

&lt;p&gt;To understand this, I looked at the PostgreSQL source code. Specifically, this &lt;a href="https://github.com/postgres/postgres/blob/bd65cb3cd48a7a5ce48b26f8031ad3968efed87e/src/backend/commands/analyze.c#L2734-L2738" rel="noopener noreferrer"&gt;comment in &lt;code&gt;analyze.c&lt;/code&gt;&lt;/a&gt; explains part of the logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * Generate a histogram slot entry if there are at least two distinct
 * values not accounted for in the MCV list. (This ensures the
 * histogram won't collapse to empty or a singleton.)
 */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This suggests that histogram bounds are only created when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The bucket contains at least two distinct elements;&lt;/li&gt;
&lt;li&gt;The values are not in the MCV (most common values) list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the above distribution, high-frequency values like 1 and 2 have a higher chance of being in the MCV list than low-frequency values like 900.&lt;/p&gt;
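&lt;p&gt;To make the comparison concrete, here is a small sketch of the relative frequencies (hypothetical arithmetic to illustrate the point, not PostgreSQL's actual MCV cutoff):&lt;/p&gt;

```python
# Relative frequency of value v in the skewed table: v appears 1001 - v times.
TOTAL_ROWS = 500_500

def frequency(v):
    return (1001 - v) / TOTAL_ROWS

# Value 1 is roughly 10x more frequent than value 900, making it a much
# stronger candidate for the most-common-values (MCV) list.
print(f"freq(1) = {frequency(1):.4%}, freq(900) = {frequency(900):.4%}")
```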




&lt;h2&gt;
  
  
  Verifying the Hypothesis
&lt;/h2&gt;

&lt;p&gt;To test this explanation, I modified the data by &lt;strong&gt;adding a small random offset&lt;/strong&gt; to each value. This ensures that every value is unique (or nearly unique), avoiding the issue of MCV elimination during histogram generation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;drop&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autovacuum_enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;off&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;analyze&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="n"&gt;histogram_bounds&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;histogram_bounds&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this change, the histogram bounds now behave as expected: the 0–100 range contains &lt;strong&gt;many more histogram boundaries&lt;/strong&gt; than the 100–200 and 200–300 ranges. This better reflects the actual density of the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcp3unu0c7x3bp8t7ibl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcp3unu0c7x3bp8t7ibl.png" alt="Bucket distribution for the histogram_bounds with additive random noise" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's histogram statistics do account for &lt;strong&gt;data distribution&lt;/strong&gt;, but the internal logic excludes &lt;strong&gt;MCV values&lt;/strong&gt; when selecting histogram boundaries. As Robert Haas put it, "histogram is intended to be a histogram of non-MCVs".&lt;/p&gt;

&lt;p&gt;Thanks to Chris Jones for the sharp observation that triggered a deeper dive into how PostgreSQL builds histograms, and to Robert Haas for sharing the above insight on the &lt;a href="https://discord.com/invite/yMbds24D8n" rel="noopener noreferrer"&gt;PostgreSQL Hacking Discord.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>statistics</category>
      <category>database</category>
      <category>histograms</category>
    </item>
    <item>
      <title>PostgreSQL partitioning with desired index names</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Sat, 01 Feb 2025 10:31:05 +0000</pubDate>
      <link>https://dev.to/msdousti/postgresql-partitioning-with-desired-index-names-1gcd</link>
      <guid>https://dev.to/msdousti/postgresql-partitioning-with-desired-index-names-1gcd</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;A while back, Hetti wrote an article titled &lt;a href="https://hdombrovskaya.wordpress.com/2025/01/26/indexing-partitioned-table-disaster/" rel="noopener noreferrer"&gt;Indexing Partitioned Table Disaster&lt;/a&gt;. There, she explained why the lack of control over index names in PostgreSQL table partitions can lead to some issues. We discussed a bit over &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7289344003035086848?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7289344003035086848%2C7289406302823284736%29&amp;amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287289406302823284736%2Curn%3Ali%3Aactivity%3A7289344003035086848%29" rel="noopener noreferrer"&gt;LinkedIn comments&lt;/a&gt;, and this article is an elaboration on the discussion.&lt;/p&gt;

&lt;p&gt;To keep this article self-contained, I will show "the lack of control over index names in PostgreSQL table partitions", but not how it can lead to issues. For that, go read the great article above!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;We first need to create a base table with some indexes. Here, I'm creating two indexes on the same set of columns, but one of them is a partial index. This is used later to prove a point; in real-world applications, I wouldn't allow the partition key (&lt;code&gt;created_at&lt;/code&gt;) to be nullable, nor would I create two indexes on the same list of columns (unless there's a good reason to do so, though I can't think of one right now!)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;drop&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;generated&lt;/span&gt; &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;identity&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t__id__created_at__idx&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t__id__created_at_nn__idx&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating a partition
&lt;/h2&gt;

&lt;p&gt;Next, let's create a single partition for the base table. This will serve for data in February 2025:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt;
  &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-02-01'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-03-01'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Querying indexes
&lt;/h2&gt;

&lt;p&gt;I came up with a rather complex recursive CTE query that shows all indexes defined on table &lt;code&gt;t&lt;/code&gt; and its partitions, along with whether each index is valid. We will run this query several times in this article:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;recursive&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="c1"&gt;-- select the base partitioned table t&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;child_oid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'t'&lt;/span&gt;

  &lt;span class="k"&gt;union&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt;

  &lt;span class="c1"&gt;-- recursively select all partitions that inherit from their parent&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt;
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pg_inherits&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
  &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inhparent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;child_oid&lt;/span&gt;
  &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inhrelid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;case&lt;/span&gt;
         &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indisvalid&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'VALID'&lt;/span&gt;
         &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s1"&gt;'INVALID'&lt;/span&gt;
       &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;pg_index&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indrelid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;child_oid&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the result of running the query. It clearly shows that the partition &lt;code&gt;t_2025_02&lt;/code&gt; "inherited" the indexes from the base table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; table_name |          index_name          | status
------------+------------------------------+--------
 t          | t__id__created_at__idx       | VALID
 t          | t__id__created_at_nn__idx    | VALID
 t_2025_02  | t_2025_02_id_created_at_idx  | VALID
 t_2025_02  | t_2025_02_id_created_at_idx1 | VALID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a caveat, though: we had no control over how the indexes were named:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;t_2025_02_id_created_at_idx&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;t_2025_02_id_created_at_idx1&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL uses the partition name and column names to form the index name, and if this results in a duplicate index name (as above), it appends the suffixes 1, 2, 3, ...&lt;/p&gt;

&lt;p&gt;Can we have control over the index names? The answer is yes, though it brings a bit of complexity to the table. We'll consider two cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Case 1:&lt;/strong&gt; Partition does not exist yet. We want to create it and name the indexes ourselves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Case 2:&lt;/strong&gt; Partition exists and is attached to the base table. We want to add a new index to the base table, and have control over the index name on the partition.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Case 1: Controlling index name for newly created partitions
&lt;/h1&gt;

&lt;p&gt;Let's drop the above table, to start from a clean slate where the partition does not exist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;drop&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating a partition
&lt;/h2&gt;

&lt;p&gt;We create the partition &lt;code&gt;LIKE&lt;/code&gt; the parent table, but exclude &lt;code&gt;indexes&lt;/code&gt;. We should also exclude other properties, like &lt;code&gt;identity&lt;/code&gt;, as Postgres forbids them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;including&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;excluding&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="k"&gt;excluding&lt;/span&gt; &lt;span class="k"&gt;identity&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's create indexes with the names we desire:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t_2025_02__id__created_at__idx&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t_2025_02__id__created_at_nn__idx&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, attach the partition to the base table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
  &lt;span class="n"&gt;attach&lt;/span&gt; &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-02-01'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-03-01'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the query to show indexes (see section "Querying indexes" above for the query):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; table_name |            index_name             | status
------------+-----------------------------------+--------
 t          | t__id__created_at__idx            | VALID
 t          | t__id__created_at_nn__idx         | VALID
 t_2025_02  | t_2025_02__id__created_at__idx    | VALID
 t_2025_02  | t_2025_02__id__created_at_nn__idx | VALID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL intelligently finds the corresponding index on the parent table and "attaches" the child index to it. We can use the psql &lt;code&gt;\d+&lt;/code&gt; meta-command to check that each parent index has the right child: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faljyc9jifk5h7j8m78ia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faljyc9jifk5h7j8m78ia.png" alt="Description of index t__id__created_at__idx" width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvgnvuhcc4w3ji59qi29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvgnvuhcc4w3ji59qi29.png" alt="Description of index t__id__created_at__idx" width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, if the child table does not have the corresponding index, Postgres will create it using the default naming convention.&lt;/p&gt;
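&lt;p&gt;As a scriptable alternative to &lt;code&gt;\d+&lt;/code&gt;, note that the parent-child relationship between indexes is recorded in &lt;code&gt;pg_inherits&lt;/code&gt;, just like the one between tables. A query along these lines (a sketch, not part of the setup above) lists which partition index is attached to which parent index:&lt;/p&gt;

```sql
-- Partitioned indexes have relkind 'I'; their attached partition
-- indexes appear as children in pg_inherits.
select parent.relname as parent_index,
       child.relname  as attached_index
from pg_inherits i
join pg_class parent on parent.oid = i.inhparent
join pg_class child  on child.oid  = i.inhrelid
where parent.relkind = 'I'
order by 1, 2;
```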

&lt;h1&gt;
  
  
  Case 2: Controlling index name for existing partitions
&lt;/h1&gt;

&lt;p&gt;In this section, we consider a parent table with an existing partition. We'd like to create a new index on the parent, while controlling the name of the index on the partition. Let's first drop the parent table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;drop&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then create the parent and one partition for it, without any indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;generated&lt;/span&gt; &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;identity&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt;
  &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-02-01'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-03-01'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating indexes on parent table only
&lt;/h2&gt;

&lt;p&gt;We'll first create an index on the parent, and not on any of its partitions. The trick is to use the Postgres syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="k"&gt;ONLY&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From PostgreSQL &lt;a href="https://www.postgresql.org/docs/current/sql-createindex.html" rel="noopener noreferrer"&gt;create index documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When &lt;code&gt;CREATE INDEX&lt;/code&gt; is invoked on a partitioned table, the default behavior is to recurse to all partitions to ensure they all have matching indexes. Each partition is first checked to determine whether an equivalent index already exists, and if so, that index will become attached as a partition index to the index being created, which will become its parent index. If no matching index exists, a new index will be created and automatically attached; the name of the new index in each partition will be determined as if no index name had been specified in the command. If the ONLY option is specified, no recursion is done, and the index is marked invalid. (&lt;code&gt;ALTER INDEX ... ATTACH PARTITION&lt;/code&gt; marks the index valid, once all partitions acquire matching indexes.) Note, however, that any partition that is created in the future using &lt;code&gt;CREATE TABLE ... PARTITION OF&lt;/code&gt; will automatically have a matching index, regardless of whether &lt;code&gt;ONLY&lt;/code&gt; is specified.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t__id__created_at__idx&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="k"&gt;ONLY&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t__id__created_at_nn__idx&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="k"&gt;ONLY&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the query from the section "Querying indexes" above, and you'll see the indexes are created in the &lt;code&gt;INVALID&lt;/code&gt; state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; table_name |        index_name         | status
------------+---------------------------+---------
 t          | t__id__created_at__idx    | INVALID
 t          | t__id__created_at_nn__idx | INVALID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating indexes on the partition
&lt;/h2&gt;

&lt;p&gt;Here, we proceed as normal and create indexes directly on the partition, with whatever names we desire. Note the use of &lt;code&gt;concurrently&lt;/code&gt; to avoid taking exclusive locks on the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;concurrently&lt;/span&gt; &lt;span class="n"&gt;t_2025_02__id__created_at__idx&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;concurrently&lt;/span&gt; &lt;span class="n"&gt;t_2025_02__id__created_at_nn__idx&lt;/span&gt;
  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Querying the indexes, we see that the partition indexes are created in the &lt;code&gt;VALID&lt;/code&gt; state, but the parent indexes are still invalid. We'll fix that in the next section.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; table_name |            index_name             | status
------------+-----------------------------------+---------
 t          | t__id__created_at__idx            | INVALID
 t          | t__id__created_at_nn__idx         | INVALID
 t_2025_02  | t_2025_02__id__created_at__idx    | VALID
 t_2025_02  | t_2025_02__id__created_at_nn__idx | VALID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Attaching partition indexes to the parent
&lt;/h2&gt;

&lt;p&gt;From PostgreSQL &lt;a href="https://www.postgresql.org/docs/current/sql-alterindex.html" rel="noopener noreferrer"&gt;alter index documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ATTACH&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Causes the named index (possibly schema-qualified) to become attached to the altered index. The named index must be on a partition of the table containing the index being altered, and have an equivalent definition. An attached index cannot be dropped by itself, and will automatically be dropped if its parent index is dropped.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's attach the indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t__id__created_at__idx&lt;/span&gt;
  &lt;span class="n"&gt;attach&lt;/span&gt; &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="n"&gt;t_2025_02__id__created_at__idx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t__id__created_at_nn__idx&lt;/span&gt;
  &lt;span class="n"&gt;attach&lt;/span&gt; &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="n"&gt;t_2025_02__id__created_at_nn__idx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, query indexes, and voilà! We have all the desired indexes in the valid state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table_name |            index_name             | status
------------+-----------------------------------+--------
t          | t__id__created_at__idx            | VALID
t          | t__id__created_at_nn__idx         | VALID
t_2025_02  | t_2025_02__id__created_at__idx    | VALID
t_2025_02  | t_2025_02__id__created_at_nn__idx | VALID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
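&lt;p&gt;As the documentation quoted above warns, an attached index can no longer be dropped by itself. A quick sketch (the exact error wording may differ between versions):&lt;/p&gt;

```sql
-- Fails: the child index is now required by its parent index.
drop index t_2025_02__id__created_at__idx;
-- ERROR:  cannot drop index t_2025_02__id__created_at__idx because
--         index t__id__created_at__idx requires it

-- Works: dropping the parent also drops the attached child index.
drop index t__id__created_at__idx;
```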



&lt;h1&gt;
  
  
  Automation
&lt;/h1&gt;

&lt;p&gt;Everyone loves automation! Here's a function that automatically creates all of the parent's indexes on the child, making the child ready to be attached to the parent. It excludes the index backing the primary key, because that one is created anyway during attachment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;replace&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;create_indexes_on_partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_table&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part_table&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="n"&gt;void&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;declare&lt;/span&gt;
    &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;sql&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;index_oid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;
        &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pg_index&lt;/span&gt; &lt;span class="n"&gt;ind&lt;/span&gt;
        &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;ind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt;
        &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indrelid&lt;/span&gt;
        &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_table&lt;/span&gt;
          &lt;span class="c1"&gt;-- all indexes, except primary keys&lt;/span&gt;
          &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indisprimary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
    &lt;span class="n"&gt;loop&lt;/span&gt;
        &lt;span class="k"&gt;sql&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s1"&gt;'create index %I_%I on %I %s;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;part_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;part_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="c1"&gt;-- index method and column list&lt;/span&gt;
            &lt;span class="k"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_get_indexdef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index_oid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="s1"&gt;'USING.*'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;notice&lt;/span&gt; &lt;span class="s1"&gt;'executing: %'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;language&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Test
&lt;/h2&gt;

&lt;p&gt;Do all the steps in the "Setup" section above, and create a partition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;including&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="k"&gt;excluding&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="k"&gt;excluding&lt;/span&gt; &lt;span class="k"&gt;identity&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;create_indexes_on_partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'t_2025_02'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NOTICE:  Executing: CREATE INDEX t_2025_02___id__created_at__idx ON t_2025_02 USING btree (id, created_at);
NOTICE:  Executing: CREATE INDEX t_2025_02___id__created_at_nn__idx ON t_2025_02 USING btree (id, created_at) WHERE (created_at IS NOT NULL);
 create_indexes_on_partition
-----------------------------
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, attach the partition to the base table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;alter&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
  &lt;span class="n"&gt;attach&lt;/span&gt; &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="n"&gt;t_2025_02&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-02-01'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-03-01'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the query to show indexes (see section "Querying indexes" above for the query):&lt;br&gt;
&lt;/p&gt;
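&lt;p&gt;In case the earlier query isn't handy, something along these lines works. This is a sketch, not necessarily the exact query from the "Querying indexes" section (the &lt;code&gt;like 't%'&lt;/code&gt; filter and the column aliases are my assumptions):&lt;/p&gt;

```sql
-- List non-primary-key indexes on t and its partitions, with validity.
select c.relname as table_name,
       i.relname as index_name,
       case when x.indisvalid then 'VALID' else 'INVALID' end as status
from pg_index x
join pg_class i on i.oid = x.indexrelid
join pg_class c on c.oid = x.indrelid
where c.relname like 't%'
  and not x.indisprimary
order by 1, 2;
```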

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; table_name |             index_name             | status
------------+------------------------------------+--------
 t          | t__id__created_at__idx             | VALID
 t          | t__id__created_at_nn__idx          | VALID
 t_2025_02  | t_2025_02___id__created_at__idx    | VALID
 t_2025_02  | t_2025_02___id__created_at_nn__idx | VALID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can schedule the task of creating partitions, adding indexes, and then attaching them to the base table using &lt;a href="https://github.com/citusdata/pg_cron" rel="noopener noreferrer"&gt;pg_cron&lt;/a&gt; or any other scheduling mechanism you like!&lt;/p&gt;
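&lt;p&gt;For example, a monthly pg_cron job might look like the following sketch. It assumes pg_cron is installed; in practice you would compute the partition name and bounds dynamically instead of hardcoding them as done here:&lt;/p&gt;

```sql
-- Run at midnight on the 25th: create, index, and attach next month's partition.
select cron.schedule(
    'prepare-next-partition',
    '0 0 25 * *',
    $$
        create table t_2025_03 (
            like t including all excluding indexes excluding identity
        );
        select create_indexes_on_partition('t', 't_2025_03');
        alter table t
            attach partition t_2025_03
                for values from ('2025-03-01') to ('2025-04-01');
    $$
);
```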

</description>
      <category>postgres</category>
      <category>index</category>
      <category>partitioning</category>
    </item>
    <item>
      <title>Database: Indexing issues with random UUIDs</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Sun, 07 Jul 2024 15:00:28 +0000</pubDate>
      <link>https://dev.to/msdousti/database-indexing-issues-with-uuids-306g</link>
      <guid>https://dev.to/msdousti/database-indexing-issues-with-uuids-306g</guid>
      <description>&lt;p&gt;UUIDs are very popular among developers as the datatype for the identifier of a table. Database administrators, on the other hand, despise UUIDs, as they are not very database friendly: Most UUID types are random, meaning they are all over the place. As a result, the DBMS has to put extra effort to generate, process, store, and retrieve them.&lt;/p&gt;

&lt;p&gt;In my opinion, the use of UUIDs should generally be confined to cases where "universally unique" identifiers are needed. For an identifier that is unique within a table or a certain domain, a database sequence number is usually sufficient.&lt;/p&gt;
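&lt;p&gt;As a minimal sketch of that sequence-based alternative (the table and column names here are illustrative, not from any example below), a &lt;code&gt;bigint&lt;/code&gt; identity column gives a compact, monotonically increasing key:&lt;/p&gt;

```sql
-- Sequence-backed identifier: unique within the table, index-friendly,
-- and naturally ordered (so no random B-Tree insertions).
create table orders (
    id      bigint generated always as identity primary key,
    payload text
);
```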

&lt;blockquote&gt;
&lt;p&gt;There are certain cases where this general statement is debatable. See the &lt;strong&gt;Discussion&lt;/strong&gt; section at the end of this article.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Also, there are time-sorted versions of UUID, which are preferable to random ones.&lt;/p&gt;

&lt;p&gt;In this post, we will see a few issues with indexing UUID columns. While PostgreSQL is used, I can imagine other DBMSs have similar issues with UUIDs. We also use some "sorted" versions of UUIDs, and compare them with the random ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  TLDR (from the summary section):
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The index on a randomly generated UUID column is more than 20% larger.&lt;/li&gt;
&lt;li&gt;The time required to insert randomly generated UUID values is more than twice the time required to insert ordered UUID values, when the column has an index. In particular, index creation is 3-4 times slower.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The database schema consists of two tables with identical schemas to store random and ordered UUIDs, respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_rnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also use temporary tables to store intermediate data quickly. Note that these tables have no primary keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;temp&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tmp_rnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;temp&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tmp_ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'll use &lt;code&gt;psql&lt;/code&gt; as my database client, and in particular its meta-commands to check timing (&lt;code&gt;\timing&lt;/code&gt;) and the size of tables and indexes (&lt;code&gt;\dt+&lt;/code&gt; and &lt;code&gt;\di+&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The PostgreSQL server is version 16, which supports digit grouping. This makes the code more readable: 10 million is written as &lt;code&gt;10_000_000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;All timings are reported while running both &lt;code&gt;psql&lt;/code&gt; and the PostgreSQL server on a MacBook M1 Pro.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating and storing random UUIDs
&lt;/h2&gt;

&lt;p&gt;Since version 13, PostgreSQL is equipped with the function &lt;code&gt;gen_random_uuid()&lt;/code&gt;. This function returns a version 4 (random) UUID.&lt;/p&gt;

&lt;p&gt;Next, we generate 10 million UUIDv4 values and store them in the &lt;code&gt;t_rnd&lt;/code&gt; table. Timings are reported because I enabled &lt;code&gt;\timing&lt;/code&gt; in &lt;code&gt;psql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_rnd&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took around 56 seconds.&lt;/p&gt;

&lt;p&gt;Look at the table and index information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# \dt+ t_rnd

List of relations
┌─[ RECORD 1 ]──┬───────────┐
│ Schema        │ public    │
│ Name          │ t_rnd     │
│ Type          │ table     │
│ Owner         │ postgres  │
│ Persistence   │ permanent │
│ Access method │ heap      │
│ Size          │ 422 MB    │
│ Description   │           │
└───────────────┴───────────┘

postgres=# \di+ t_rnd_pkey

List of relations
┌─[ RECORD 1 ]──┬────────────┐
│ Schema        │ public     │
│ Name          │ t_rnd_pkey │
│ Type          │ index      │
│ Owner         │ postgres   │
│ Table         │ t_rnd      │
│ Persistence   │ permanent  │
│ Access method │ btree      │
│ Size          │ 383 MB     │
│ Description   │            │
└───────────────┴────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In particular, note the index size: &lt;code&gt;383 MB&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now let's turn our attention to an ordered UUID case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating and storing ordered UUIDs
&lt;/h2&gt;

&lt;p&gt;There are several proposals for ordered UUIDs, such as UUID Version 7 (UUIDv7). But since PostgreSQL does not yet support them natively, let's just "cast" consecutive integers to UUIDs. We will later use an application to generate UUIDv7s for us.&lt;/p&gt;

&lt;p&gt;To cast an integer &lt;code&gt;i&lt;/code&gt; to UUID, we can use the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;lpad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_hex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we generate 10 million consecutive UUIDs, and store them in the &lt;code&gt;t_ord&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_ord&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;lpad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_hex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took around 24 seconds (as opposed to 56 seconds with random UUIDs).&lt;/p&gt;

&lt;p&gt;Look at the table and index information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# \dt+ t_ord

List of relations
┌─[ RECORD 1 ]──┬───────────┐
│ Schema        │ public    │
│ Name          │ t_ord     │
│ Type          │ table     │
│ Owner         │ postgres  │
│ Persistence   │ permanent │
│ Access method │ heap      │
│ Size          │ 422 MB    │
│ Description   │           │
└───────────────┴───────────┘

postgres=# \di+ t_ord_pkey

List of relations
┌─[ RECORD 1 ]──┬────────────┐
│ Schema        │ public     │
│ Name          │ t_ord_pkey │
│ Type          │ index      │
│ Owner         │ postgres   │
│ Table         │ t_ord      │
│ Persistence   │ permanent  │
│ Access method │ btree      │
│ Size          │ 301 MB     │
│ Description   │            │
└───────────────┴────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index size is &lt;code&gt;301 MB&lt;/code&gt;, more than 20% smaller than the random-UUID index.&lt;/p&gt;

&lt;h2&gt;
  
  
  Was the timing comparison fair?
&lt;/h2&gt;

&lt;p&gt;The generation and storage of UUIDs took:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;56 seconds for random UUIDs&lt;/li&gt;
&lt;li&gt;24 seconds for ordered UUIDs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in this post, we are mostly interested in the storage time. So let's generate the UUIDs into the temp tables first, and then observe the time it takes to copy them into the main tables. To be completely fair, we recreate the tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;drop&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_rnd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t_ord&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_rnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, generate and insert random UUIDs into the temp table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;tmp_rnd&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Guess what? It took less than 7 seconds. Now copy the rows into the indexed table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_rnd&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tmp_rnd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took around 51 seconds. The table and index sizes of &lt;code&gt;t_rnd&lt;/code&gt; are the same as before.&lt;/p&gt;

&lt;p&gt;Now let's do the same with ordered UUIDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;tmp_ord&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;lpad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_hex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This also took less than 7 seconds. Again, copy the rows into the indexed table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_ord&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tmp_ord&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took around 21 seconds. The table and index sizes of &lt;code&gt;t_ord&lt;/code&gt; are the same as before.&lt;/p&gt;

&lt;p&gt;Before we wrap up this section, let's check what happens if there is no index to begin with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_rnd_no_pk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_ord_no_pk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_rnd_no_pk&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tmp_rnd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_ord_no_pk&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tmp_ord&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both took around 10 seconds! So the majority of the slowdown comes from maintaining the index. The following table summarizes everything we discussed so far (times in seconds):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Random UUID&lt;/th&gt;
&lt;th&gt;Ordered UUID&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to generate &amp;amp; save to tmp table&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to save to table without PK&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to save to table with PK&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table size (MB)&lt;/td&gt;
&lt;td&gt;422&lt;/td&gt;
&lt;td&gt;422&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index size (MB)&lt;/td&gt;
&lt;td&gt;383&lt;/td&gt;
&lt;td&gt;301&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Side note:&lt;/strong&gt; The extra 3 seconds (10 - 7) are spent writing the data to the WAL, which is not required for temp tables.&lt;/p&gt;
&lt;/blockquote&gt;
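&lt;p&gt;A way to observe the WAL cost in isolation (my own side experiment, not part of the timings above) is an unlogged table, which also skips WAL but, unlike a temp table, is visible to other sessions. Note that it is not crash-safe:&lt;/p&gt;

```sql
-- Unlogged tables skip WAL writes; they are truncated after a crash.
create unlogged table t_rnd_unlogged (id uuid);
insert into t_rnd_unlogged select * from tmp_rnd;
```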

&lt;p&gt;It follows that the time to create the index is &lt;code&gt;51-10 = 41&lt;/code&gt; seconds with random UUIDs, and &lt;code&gt;21-10 = 11&lt;/code&gt; seconds with ordered UUIDs. That is almost a 4x speedup! But why?&lt;/p&gt;

&lt;h2&gt;
  
  
  B-Tree indexes
&lt;/h2&gt;

&lt;p&gt;The reasons why random-UUID indexes are bigger and slower to build are explained in &lt;a href="https://vladmihalcea.com/uuid-database-primary-key/" rel="noopener noreferrer"&gt;Vlad Mihalcea's excellent blog&lt;/a&gt;, which I quote here verbatim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Indexing random values using B+Tree causes a lot of problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index pages will have a very low fill factor because the values come randomly. So, a page of 8kB will end up storing just a few elements, therefore wasting a lot of space, both on the disk and in the database memory, as index pages could be cached in the Buffer Pool.&lt;/li&gt;
&lt;li&gt;Because the B+Tree index needs to rebalance itself in order to maintain its equidistant tree structure, the random key values will cause more index page splits and merges as there is no predetermined order of filling the tree structure.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The low "fill factor" is the reason why they are larger, and the more index page splits and merges are why they are built slower.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fill factor
&lt;/h3&gt;

&lt;p&gt;To showcase the fill factor issue, simply install the &lt;code&gt;pgstattuple&lt;/code&gt; extension, and then query both indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;extension&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;pgstattuple&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_leaf_density&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rnd_ff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_leaf_density&lt;/span&gt; &lt;span class="n"&gt;ord_ff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;leaf_fragmentation&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rnd_frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;leaf_fragmentation&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ord_frag&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pgstatindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'t_rnd_pkey'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
     &lt;span class="n"&gt;pgstatindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'t_ord_pkey'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────┬────────┬──────────┬──────────┐
│ rnd_ff │ ord_ff │ rnd_frag │ ord_frag │
├────────┼────────┼──────────┼──────────┤
│  71.19 │  90.04 │    49.77 │        0 │
└────────┴────────┴──────────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The random index has a 20% lower fill factor, and almost 50% fragmentation [1]. &lt;/p&gt;

&lt;p&gt;👉 The default PostgreSQL B-Tree fill factor is 90%, which is achieved by the ordered index.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Side Note:&lt;/strong&gt; If you &lt;code&gt;reindex&lt;/code&gt; the &lt;code&gt;t_rnd_pkey&lt;/code&gt; index, it will also achieve the 90% fill factor, as all the data required to build the index is readily available. You can also try creating a new index with &lt;/p&gt;


&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t_rnd_idx&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t_rnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;In my case, it took only 6 seconds, and achieved the 90% fill factor without fragmentation.&lt;br&gt;
Also, try the following to achieve a higher fill factor, though higher is not necessarily better!&lt;/p&gt;


&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;t_rnd_idx_2&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;t_rnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fillfactor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;
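&lt;p&gt;The reindex mentioned in the side note is a one-liner; on a live system, the concurrent variant (available since PostgreSQL 12) avoids blocking writes while the index is rebuilt:&lt;/p&gt;

```sql
-- Rebuild the bloated index without taking a write lock on t_rnd.
reindex index concurrently t_rnd_pkey;
```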

&lt;h3&gt;
  
  
  Index page splits and merges
&lt;/h3&gt;

&lt;p&gt;It's rather hard to demonstrate B-Tree node splits and merges, as PostgreSQL does not readily make these stats available. However, we can track the number of index pages over time, after each insert into the table. The following code inserts 100K random and ordered UUIDs into separate tables, and records the &lt;code&gt;leaf_pages&lt;/code&gt; count of each index in a tracking table. It then exports the tracked data to &lt;code&gt;/tmp/data.csv&lt;/code&gt; for further analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;demo_rnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;demo_ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;demo_idx&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;leaf_pages&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rnd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;leaf_pages&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ord&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;
    &lt;span class="n"&gt;pgstatindex&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'demo_rnd_pkey'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pgstatindex&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'demo_ord_pkey'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;
&lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
        &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;demo_rnd&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;demo_ord&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;lpad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_hex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;demo_idx&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt;
            &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;leaf_pages&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rnd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;leaf_pages&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ord&lt;/span&gt;
        &lt;span class="k"&gt;from&lt;/span&gt;
            &lt;span class="n"&gt;pgstatindex&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'demo_rnd_pkey'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;pgstatindex&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'demo_ord_pkey'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;demo_idx&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;copy&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;demo_idx&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following Python 3 code plots the number of &lt;code&gt;leaf_pages&lt;/code&gt; after each insert for both random and ordered UUIDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Read the CSV file
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/data.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Plot the data
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rnd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rnd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ord&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ord&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;leaf_pages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Plot of rnd and ord leaf_pages against i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvz6sv7vubqpr3t1pxrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvz6sv7vubqpr3t1pxrp.png" alt="Plot of rnd and ord leaf_pages against i" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a zoomed-in version of the above plot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38a44p8cus58ws6p4os9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38a44p8cus58ws6p4os9.png" alt="Zoomed-in version of the above plot" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we only inserted into the tables, no page merges happened in the B-Tree. However, we can easily spot page splits by watching the index page count grow: the random UUID index (blue line) splits into many pages at an &lt;strong&gt;accelerated&lt;/strong&gt; pace and then slowly fills them in, whereas the ordered UUID index (orange line) splits pages and fills them &lt;strong&gt;steadily&lt;/strong&gt;. The slope of the latter line corresponds to the 90% fill factor.&lt;/p&gt;
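&lt;p&gt;The two growth patterns can be reproduced with a toy simulation (this is an illustrative model, not PostgreSQL's actual nbtree code; the page capacity and the 50/50 split rule are simplifying assumptions):&lt;/p&gt;

```python
import bisect
import random

CAPACITY = 100  # illustrative number of entries per leaf page

def leaf_pages_random(n):
    """Uniform random keys: a full page splits 50/50 at its midpoint,
    so average density stays well below the fill factor."""
    lows, highs, counts = [0.0], [1.0], [0]
    for _ in range(n):
        key = random.random()
        i = bisect.bisect_right(lows, key) - 1
        if counts[i] == CAPACITY:
            mid = (lows[i] + highs[i]) / 2
            lows.insert(i + 1, mid)
            highs.insert(i, mid)
            counts[i] = CAPACITY // 2
            counts.insert(i + 1, CAPACITY - CAPACITY // 2)
            if key >= mid:
                i += 1
        counts[i] += 1
    return len(counts)

def leaf_pages_ordered(n, fillfactor=0.9):
    """Strictly increasing keys: only the rightmost page splits, leaving
    pages ~fillfactor full, so page count grows linearly."""
    pages, current, limit = 1, 0, int(CAPACITY * fillfactor)
    for _ in range(n):
        if current >= limit:
            pages += 1
            current = 0
        current += 1
    return pages

print(leaf_pages_ordered(100_000))  # 1112 pages: ceil(100_000 / 90)
print(leaf_pages_random(100_000))   # noticeably more pages
```

Under this model, random inserts settle at roughly 70% average density, while ordered inserts stay near the fill factor, matching the shapes of the two curves above.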

&lt;h2&gt;
  
  
  UUIDv7
&lt;/h2&gt;

&lt;p&gt;To generate UUIDv7 values in PostgreSQL, we can use the third-party extension &lt;code&gt;pg_uuidv7&lt;/code&gt; [2]. Pre-compiled binaries are released for x86_64 Linux. For macOS, I just ran &lt;code&gt;make &amp;amp;&amp;amp; make install&lt;/code&gt;, and then created the extension by running&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;extension&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;pg_uuidv7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;uuid_generate_v7&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="err"&gt;┌──────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;           &lt;span class="n"&gt;uuid_generate_v7&lt;/span&gt;           &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;e8d&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;a5f7&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;74&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="n"&gt;da5&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ca7a4b6580fc&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────────────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The timings are quite similar to those of the ordered UUIDs we concocted before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;temp&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_7_no_pk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;uuid_generate_v7&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Time: 7132.475 ms (00:07.132)&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_7_no_pk&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Time: 9651.792 ms (00:09.652)&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_7&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Time: 24832.326 ms (00:24.832)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table size is identical to before (&lt;code&gt;422 MB&lt;/code&gt;), but the index is very large (&lt;code&gt;402 MB&lt;/code&gt;), even larger than for the random UUIDs! Looking at the stats, we see an average leaf density of only 67%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=#&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pgstatindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'t_7_pkey'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="err"&gt;┌─&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;RECORD&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="err"&gt;───────┬───────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="k"&gt;version&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;tree_level&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;421830656&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;root_block_no&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;295&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;internal_pages&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;252&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;leaf_pages&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;51240&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;empty_pages&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;deleted_pages&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;avg_leaf_density&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;67&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;leaf_fragmentation&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;06&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└────────────────────┴───────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
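&lt;p&gt;These stats are mutually consistent, as a quick back-of-the-envelope check shows. Using approximate B-Tree figures (8 kB pages minus a 24-byte page header and 16-byte special space, and roughly 28 bytes per entry for a 16-byte uuid plus index tuple header and line pointer), 67% density over 51240 leaf pages accounts for roughly the 10 million rows inserted:&lt;/p&gt;

```python
PAGE_USABLE = 8192 - 24 - 16  # page size minus page header and B-Tree special space (approx.)
ENTRY_BYTES = 28              # 16-byte uuid + 8-byte tuple header + 4-byte line pointer (approx.)
leaf_pages = 51240            # from pgstatindex
density = 0.6742              # avg_leaf_density / 100

entries = PAGE_USABLE * density / ENTRY_BYTES * leaf_pages
print(round(entries))  # within about 1% of the 10_000_000 rows inserted
```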



&lt;p&gt;This came as a surprise to me: the index is both faster to populate and larger than the index on the random UUIDs! Digging deeper, I found that the extension does NOT generate time-sorted UUIDs (see also [3]):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uuid_generate_v7&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;┌────┬──────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;           &lt;span class="n"&gt;uuid_generate_v7&lt;/span&gt;           &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────┼──────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7029&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="n"&gt;ad0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="n"&gt;d12ba763d90&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;715&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;a88e&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e36973ae964f&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7357&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;aaa8&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ba06f7232f2d&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;757&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b47d&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;40156&lt;/span&gt;&lt;span class="n"&gt;da5f2f7&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;9&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7595&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="n"&gt;a37&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9024&lt;/span&gt;&lt;span class="n"&gt;f862217d&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b9f1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b031f29263c5&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7607&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9003&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d001115fb89b&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;a43&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b726&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e1eca4f15698&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;ea6&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;a572&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;b6be63c8b7a&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;01908&lt;/span&gt;&lt;span class="n"&gt;eee&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c083&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;fa0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b565&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="n"&gt;e91d905fb1d&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└────┴──────────────────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other UUIDv7 implementations that I've seen, such as the Java library uuid-creator [4], guarantee time-sorted output by using a (randomly initialized) counter.&lt;/p&gt;
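&lt;p&gt;The counter trick can be sketched in a few lines of Python (a simplified illustration of the idea, not uuid-creator's actual implementation; placing a 12-bit counter in the rand_a field is one common choice):&lt;/p&gt;

```python
import os
import time
import uuid

_last_ms = 0
_counter = 0

def uuid7_monotonic():
    """UUIDv7-like value: 48-bit millisecond timestamp, then a counter in
    the rand_a bits, so IDs generated within the same millisecond still
    sort in generation order."""
    global _last_ms, _counter
    ms = time.time_ns() // 1_000_000
    if ms == _last_ms:
        _counter += 1  # same millisecond: bump the counter
    else:
        _last_ms = ms
        _counter = os.urandom(1)[0]  # randomly initialized per millisecond
    value = (ms & ((1 << 48) - 1)) << 80  # unix_ts_ms
    value |= 0x7 << 76                    # version 7
    value |= (_counter & 0xFFF) << 64     # counter in rand_a
    value |= 0b10 << 62                   # variant
    value |= int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # rand_b
    return uuid.UUID(int=value)

ids = [uuid7_monotonic() for _ in range(1_000)]
assert ids == sorted(ids)  # time-sorted, unlike plain pg_uuidv7
```

Since the counter occupies higher-order bits than the random tail, IDs generated within one millisecond remain strictly increasing.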

&lt;p&gt;To simulate a time-sorted sequence, let's first cluster the table &lt;code&gt;tmp_7&lt;/code&gt; on the id column. Here's the full SQL code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;temp&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_7_no_pk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;uuid_generate_v7&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Time: 6945.381 ms (00:06.945)&lt;/span&gt;

&lt;span class="c1"&gt;--------------------------------------------&lt;/span&gt;
&lt;span class="c1"&gt;-- Clustering tmp_7 ensures proper ordering&lt;/span&gt;
&lt;span class="c1"&gt;--------------------------------------------&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Time: 3633.888 ms (00:03.634)&lt;/span&gt;
&lt;span class="k"&gt;cluster&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;tmp_7_id_idx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Time: 4858.199 ms (00:04.858)&lt;/span&gt;


&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_7_no_pk&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Time: 9012.780 ms (00:09.013)&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;t_7&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tmp_7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Time: 27749.267 ms (00:27.749)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now see that the index size is &lt;code&gt;301 MB&lt;/code&gt;, as expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this analytical post, we found out two important facts about indexing UUID columns in PostgreSQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The index on a randomly ordered UUID column is more than 20% larger than on a sequentially ordered one.&lt;/li&gt;
&lt;li&gt;When the column has an index, inserting randomly generated UUID values takes more than twice as long as inserting ordered UUID values. In particular, index creation takes 3-4 times longer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;At the beginning of this article, I stated a personal opinion:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In my opinion, use of UUIDs must generally be confined to the cases where "universally unique" identifiers are needed. For an identifier that is unique within a table or a certain domain, a database sequence number would usually be OK.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some of my colleagues reached out and pointed out counter-arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sequential identifiers are not an option when they are exposed to the outside, and the business rules require them to be "non-guessable".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is a certain value in the ability to assign an ID to an entity BEFORE going to the database. You can batch generate them, use any algorithm you like (snowflake, ULID, non-standard UUIDs, etc.). This works great with non-relational databases (some of them don't support auto-increment IDs at all).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easier correlation in the context of a mid-size or large organization. Auto-increment IDs will inevitably coincide across different entities (e.g., you can have an order with ID = 10, as well as a trade, a booking, and a reservation with the same ID). Using UUIDs protects you from accidentally looking at the wrong entity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all valid arguments, but sometimes there are workarounds that you might be interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Regarding batch ID generation: database sequence values can also be fetched in batches [5], but one DB query is still necessary to obtain those values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Regarding distinctness from all other entities: in this case, we can use the numbers internally, but convert them to a unique string format when communicating with external systems. For instance, the trade ID is 10 internally, but whenever we share it in events, REST responses, or logs, we use &lt;code&gt;Trade-0000000010&lt;/code&gt; as the ID. Conversely, whenever we receive this string format in a request, we map it back to the numeric value.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
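&lt;p&gt;The mapping in the last bullet is straightforward to implement. Here is a minimal sketch (the &lt;code&gt;Trade&lt;/code&gt; prefix and 10-digit zero-padding follow the example above and are arbitrary conventions):&lt;/p&gt;

```python
PAD = 10  # zero-padding width; an arbitrary convention

def to_external(entity: str, internal_id: int) -> str:
    """Render an internal numeric ID as a namespaced external identifier."""
    return f"{entity}-{internal_id:0{PAD}d}"

def to_internal(external_id: str, entity: str) -> int:
    """Parse the external form back to the numeric ID, validating the prefix."""
    prefix = f"{entity}-"
    if not external_id.startswith(prefix):
        raise ValueError(f"expected a {entity} identifier, got {external_id!r}")
    return int(external_id[len(prefix):])

print(to_external("Trade", 10))                  # Trade-0000000010
print(to_internal("Trade-0000000010", "Trade"))  # 10
```

The prefix check also gives the type-safety benefit mentioned above: passing a booking identifier where a trade is expected fails loudly instead of silently resolving to the wrong row.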

&lt;h2&gt;
  
  
  Footnotes
&lt;/h2&gt;

&lt;p&gt;[1] &lt;code&gt;leaf_fragmentation&lt;/code&gt; is the percentage of leaf pages where the following leaf page has a lower block number, and might not be a good indication for anything. See &lt;a href="https://dba.stackexchange.com/a/331067/1583" rel="noopener noreferrer"&gt;Laurenz Albe's answer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[2] &lt;a href="https://github.com/fboulnois/pg_uuidv7" rel="noopener noreferrer"&gt;UUIDv7 extension for PostgreSQL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] &lt;a href="https://github.com/fboulnois/pg_uuidv7/pull/15#issuecomment-2058272680" rel="noopener noreferrer"&gt;Comment on UUIDv7 extension&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] &lt;a href="https://github.com/f4b6a3/uuid-creator/wiki/1.7.-UUIDv7" rel="noopener noreferrer"&gt;uuid-creator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] &lt;a href="https://villim.github.io/postgresql-sequence-cache" rel="noopener noreferrer"&gt;PostgreSQL Sequence Cache&lt;/a&gt;&lt;/p&gt;

</description>
      <category>uuid</category>
      <category>index</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Postgres partitioning performance: Hash vs. List</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Wed, 01 May 2024 21:28:17 +0000</pubDate>
      <link>https://dev.to/msdousti/postgres-partitioning-performance-hash-vs-list-2ppn</link>
      <guid>https://dev.to/msdousti/postgres-partitioning-performance-hash-vs-list-2ppn</guid>
      <description>&lt;p&gt;In our design, we came up with a would-be large PostgreSQL table that just stores IDs of incoming (Kafka) events, for the purpose of de-duplication. The IDs are of type UUID.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After some consideration, we decided to partition this table into 16 partitions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before we continue, let's make it clear that this article is about a very specific case and workload. However, it should provide you with enough insight to customize it based on your own needs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Hash partitioning
&lt;/h1&gt;

&lt;p&gt;The initial idea was to use hash partitioning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;drop&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_00&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_01&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_02&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_03&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_04&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_05&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_06&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_07&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_08&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_09&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_10&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_11&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_12&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_13&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_14&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_15&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modulus&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
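The 16 nearly identical &lt;code&gt;create table&lt;/code&gt; statements above can also be generated programmatically instead of written out by hand. A small sketch (a convenience script of my own, not part of the original setup; a server-side &lt;code&gt;DO&lt;/code&gt; block with &lt;code&gt;format()&lt;/code&gt; would work just as well):

```python
# Generate the repetitive hash-partition DDL for a given modulus,
# matching the hand-written statements above (t_00 .. t_15).

def hash_partition_ddl(parent: str = "t", modulus: int = 16) -> str:
    stmts = []
    for remainder in range(modulus):
        stmts.append(
            f"create table {parent}_{remainder:02d}\n"
            f"    partition of {parent}\n"
            f"    for values with (modulus {modulus}, remainder {remainder});"
        )
    return "\n\n".join(stmts)

print(hash_partition_ddl())
```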



&lt;p&gt;A competing idea was to ditch &lt;code&gt;hash&lt;/code&gt; and use the first character of the &lt;code&gt;id&lt;/code&gt; (cast to text) as the partitioning key. I'll discuss this idea in the next section, but let's first benchmark the hash partitioning approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
pgbench &lt;span class="nt"&gt;-c10&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; 900 &lt;span class="nt"&gt;-j30&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I chose to run 10 client connections to the DB, each sending 900 queries, and requested 30 worker threads (pgbench caps the thread count at the number of clients, hence the 10 threads reported in the output). Each query inserts 1000 UUIDs into the table.&lt;/p&gt;

&lt;p&gt;Why these numbers? Just for fun. They would have to conform to the real traffic to be indicative of anything. But let's just see how this turns out on my personal laptop (an old 2017 model!):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 13.776 ms
initial connection time = 6.491 ms
tps = 725.901931 (without initial connection time)

real    0m12.438s
user    0m0.162s
sys     0m0.396s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It took 12.4 seconds to insert 9,000,000 rows (9,000 transactions of 1,000 rows each). The average TPS (transactions per second) is 725.9.&lt;/p&gt;
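As a quick sanity check on the benchmark arithmetic (a throwaway calculation, not from the original run):

```python
# Verify the row count and approximate TPS implied by the pgbench output.
clients = 10
transactions_per_client = 900
rows_per_transaction = 1000

total_transactions = clients * transactions_per_client   # 9,000
total_rows = total_transactions * rows_per_transaction   # 9,000,000
print(total_rows)

# pgbench's reported tps (725.9) excludes initial connection time, so it
# is slightly above the naive wall-clock figure:
print(round(total_transactions / 12.438, 1))  # 723.6
```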

&lt;p&gt;Using &lt;code&gt;psql&lt;/code&gt; metacommands, we can see the table/index sizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;\dt+&lt;/code&gt; to see table sizes (some columns are removed for brevity):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────┬───────────────────┬─────────┐
│ Name │       Type        │  Size   │
├──────┼───────────────────┼─────────┤
│ t    │ partitioned table │ 0 bytes │
│ t_00 │ table             │ 24 MB   │
│ t_01 │ table             │ 24 MB   │
│ t_02 │ table             │ 24 MB   │
│ t_03 │ table             │ 24 MB   │
│ t_04 │ table             │ 24 MB   │
│ t_05 │ table             │ 24 MB   │
│ t_06 │ table             │ 24 MB   │
│ t_07 │ table             │ 24 MB   │
│ t_08 │ table             │ 24 MB   │
│ t_09 │ table             │ 24 MB   │
│ t_10 │ table             │ 24 MB   │
│ t_11 │ table             │ 24 MB   │
│ t_12 │ table             │ 24 MB   │
│ t_13 │ table             │ 24 MB   │
│ t_14 │ table             │ 24 MB   │
│ t_15 │ table             │ 24 MB   │
└──────┴───────────────────┴─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;\di+&lt;/code&gt; to see index sizes (some columns are removed for brevity):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌───────────┬───────────────────┬─────────┐
│   Name    │       Type        │  Size   │
├───────────┼───────────────────┼─────────┤
│ t_pkey    │ partitioned index │ 0 bytes │
│ t_00_pkey │ index             │ 21 MB   │
│ t_01_pkey │ index             │ 21 MB   │
│ t_02_pkey │ index             │ 22 MB   │
│ t_03_pkey │ index             │ 20 MB   │
│ t_04_pkey │ index             │ 21 MB   │
│ t_05_pkey │ index             │ 21 MB   │
│ t_06_pkey │ index             │ 21 MB   │
│ t_07_pkey │ index             │ 20 MB   │
│ t_08_pkey │ index             │ 20 MB   │
│ t_09_pkey │ index             │ 21 MB   │
│ t_10_pkey │ index             │ 21 MB   │
│ t_11_pkey │ index             │ 21 MB   │
│ t_12_pkey │ index             │ 21 MB   │
│ t_13_pkey │ index             │ 21 MB   │
│ t_14_pkey │ index             │ 21 MB   │
│ t_15_pkey │ index             │ 21 MB   │
└───────────┴───────────────────┴─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the indexes are almost as large as the tables themselves. Also, while the data is equally distributed among partitions (24 MB each), the index sizes range from 20 to 22 MB. The total size of the indexes is &lt;code&gt;334 MB&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  List partitioning
&lt;/h1&gt;

&lt;p&gt;If we want to use the first character of the &lt;code&gt;id&lt;/code&gt; as the partitioning key, the primary key cannot be declared on the parent table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;results in error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR:  unsupported PRIMARY KEY constraint with partition key definition
DETAIL:  PRIMARY KEY constraints cannot be used when partition keys include expressions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, we decided to add the primary key to each individual partition. This still effectively enforces uniqueness across all data, because identical IDs share the same first character and therefore always land in the same partition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;drop&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_00&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_01&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_02&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_03&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'3'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_04&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_05&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'5'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_06&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'6'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_07&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'7'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_08&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'8'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_09&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'9'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_10&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_11&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_12&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'c'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_13&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'d'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_14&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'e'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_15&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'f'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
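To see why the partition-local primary keys are enough for global uniqueness, note that a UUID's first hex character fully determines its partition. A small sketch of the routing implied by the DDL above (&lt;code&gt;'0'&lt;/code&gt;–&lt;code&gt;'9'&lt;/code&gt; map to &lt;code&gt;t_00&lt;/code&gt;–&lt;code&gt;t_09&lt;/code&gt;, &lt;code&gt;'a'&lt;/code&gt;–&lt;code&gt;'f'&lt;/code&gt; to &lt;code&gt;t_10&lt;/code&gt;–&lt;code&gt;t_15&lt;/code&gt;):

```python
import uuid

def partition_for(id_: uuid.UUID) -> str:
    """Return the list partition a UUID is routed to, mirroring
    lower(left(id::text, 1)) in the DDL above."""
    first = str(id_)[0]  # Python renders UUIDs as lowercase hex
    return f"t_{int(first, 16):02d}"

u = uuid.UUID("a1b2c3d4-0000-0000-0000-000000000000")
print(partition_for(u))  # t_10

# Equal UUIDs always map to the same partition, so the partition-local
# primary key suffices to enforce uniqueness across the whole table.
assert partition_for(u) == partition_for(uuid.UUID(str(u)))
```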



&lt;p&gt;Now let's benchmark again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
pgbench &lt;span class="nt"&gt;-c10&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; 900 &lt;span class="nt"&gt;-j30&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 15.123 ms
initial connection time = 8.810 ms
tps = 661.264382 (without initial connection time)

real    0m13.654s
user    0m0.150s
sys     0m0.409s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;strong&gt;slower&lt;/strong&gt; than the hash partition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; &lt;code&gt;13.654s&lt;/code&gt; instead of &lt;code&gt;12.438s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPS:&lt;/strong&gt; &lt;code&gt;661.264382&lt;/code&gt; instead of &lt;code&gt;725.901931&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, we lose the primary key on the partitioned table itself, and it's even slower! Hash partitioning is a clear winner here.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;\dt+&lt;/code&gt; and &lt;code&gt;\di+&lt;/code&gt; yields almost identical results to before, so I won't repeat them here.&lt;/p&gt;
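&lt;p&gt;To see how the rows were spread across the partitions, &lt;code&gt;tableoid&lt;/code&gt; can be queried directly. This is a quick sketch; the exact counts will of course vary per run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Count rows per partition; tableoid::regclass resolves to the partition name
select tableoid::regclass as partition, count(*)
from t
group by 1
order by 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;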

&lt;h2&gt;
  
  
  Using hash indexes instead of b-tree indexes
&lt;/h2&gt;

&lt;p&gt;Another suggested approach is to enforce uniqueness using hash indexes instead of b-tree indexes. The benefit is that hash indexes are often smaller and faster than b-tree indexes when equality checks are the only operation that matters.&lt;/p&gt;

&lt;p&gt;Postgres primary keys do not yet support hash indexes, but we can approximate uniqueness with a hack: an exclusion constraint backed by a hash index, i.e. &lt;code&gt;exclude using hash (id with =)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;drop&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_00&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_01&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_02&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_03&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'3'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_04&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_05&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'5'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_06&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'6'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_07&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'7'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_08&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'8'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_09&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'9'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_10&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_11&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_12&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'c'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_13&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'d'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_14&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'e'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;t_15&lt;/span&gt;
&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'f'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's benchmark this as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
pgbench &lt;span class="nt"&gt;-c10&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; 900 &lt;span class="nt"&gt;-j30&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 16.686 ms
initial connection time = 7.089 ms
tps = 599.314265 (without initial connection time)

real    0m15.067s
user    0m0.127s
sys     0m0.468s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Well, I didn't expect that. It's even slower now. Looking at the table sizes (&lt;code&gt;\dt+&lt;/code&gt;), they are the same as before (24MB).&lt;/p&gt;

&lt;p&gt;However, index sizes (&lt;code&gt;\di+&lt;/code&gt;) are a tiny bit smaller:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┬───────┐
│     Name     │ Size  │
├──────────────┼───────┤
│ t_00_id_excl │ 20 MB │
│ t_01_id_excl │ 20 MB │
│ t_02_id_excl │ 20 MB │
│ t_03_id_excl │ 20 MB │
│ t_04_id_excl │ 20 MB │
│ t_05_id_excl │ 20 MB │
│ t_06_id_excl │ 20 MB │
│ t_07_id_excl │ 20 MB │
│ t_08_id_excl │ 20 MB │
│ t_09_id_excl │ 20 MB │
│ t_10_id_excl │ 20 MB │
│ t_11_id_excl │ 20 MB │
│ t_12_id_excl │ 20 MB │
│ t_13_id_excl │ 20 MB │
│ t_14_id_excl │ 20 MB │
│ t_15_id_excl │ 20 MB │
└──────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, in total, the index size was reduced from &lt;code&gt;334 MB&lt;/code&gt; to &lt;code&gt;320 MB&lt;/code&gt;, a reduction of roughly 4%.&lt;/p&gt;
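&lt;p&gt;Rather than summing the &lt;code&gt;\di+&lt;/code&gt; output by hand, the total index size can be computed in one query. A sketch, assuming the partitions follow the &lt;code&gt;t_*&lt;/code&gt; naming used above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sum the size of all indexes attached to the t_* partitions
select pg_size_pretty(sum(pg_relation_size(indexrelid)))
from pg_index
where indrelid::regclass::text like 't\_%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;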

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Hash partitioning outperforms list partitioning in the above example (note: this is not always the case)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Hash partitioning has the added benefit that all tables have primary keys (again, specific to the above example). This is important when using logical replication. For instance, to use &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/blue-green-deployments-creating.html" rel="noopener noreferrer"&gt;AWS RDS blue/green deployment&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Make sure that all tables in the DB instance have a primary key. PostgreSQL logical replication doesn't allow &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; operations on tables that don't have a primary key.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using hash indexes instead of b-tree indexes did not improve performance, but it did reduce the total index size by less than 5%.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
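&lt;p&gt;Since logical replication disallows &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; on tables without a primary key, it is worth checking for offending tables up front. Here is one way to sketch such a check against the system catalogs (the schema name is an assumption; adjust as needed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Ordinary tables in schema 'public' that lack a primary key
select c.relname
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.relkind = 'r'
  and n.nspname = 'public'
  and not exists (
    select 1
    from pg_constraint
    where conrelid = c.oid
      and contype = 'p'  -- 'p' = primary key
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;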

&lt;h1&gt;
  
  
  Edits
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Edit (2024-05-03)
&lt;/h2&gt;

&lt;p&gt;A colleague of mine, who asked to remain anonymous, explained why list partitioning was slower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In your example, to compute the partition key of the list-based approach, you use a cast (cast &lt;code&gt;UUID&lt;/code&gt; to &lt;code&gt;text&lt;/code&gt;), then two functions are applied (&lt;code&gt;LEFT&lt;/code&gt; and &lt;code&gt;LOWER&lt;/code&gt;). The functions should be pretty quick, but the cast is slow. That’s why the combined effect is slower than the &lt;code&gt;hash&lt;/code&gt; function, which is implemented in C and is quite fast.&lt;/p&gt;
&lt;/blockquote&gt;
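&lt;p&gt;One way to get a feeling for this overhead is to time the two expressions in isolation. A rough sketch (run in psql with &lt;code&gt;\timing&lt;/code&gt; on; &lt;code&gt;uuid_hash&lt;/code&gt; stands in here for the C-implemented hash support function of the &lt;code&gt;uuid&lt;/code&gt; type, and absolute numbers are machine-dependent):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The cast-based expression used as the list partition key
select count(left(id::text, 1))
from (select gen_random_uuid() as id from generate_series(1, 1000000)) s;

-- The C-implemented hash function that hash partitioning relies on
select count(uuid_hash(id))
from (select gen_random_uuid() as id from generate_series(1, 1000000)) s;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;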

&lt;p&gt;Another colleague, Tim, gave a nice summary:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;So, if I got it right, in essence it says “Don’t try to be fancy, just do it in a boring way and PostgreSQL will deal with it in an optimized way.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This reminds me of the time we implemented a variant of the &lt;code&gt;strlen()&lt;/code&gt; function and observed it was slower than the GLIBC implementation by a factor of 300! I should write a post on that too 🙂&lt;/p&gt;

&lt;h1&gt;
  
  
  Further reading
&lt;/h1&gt;

&lt;p&gt;For more info on hash indexes, see &lt;a href="https://twitter.com/MSDousti/status/1783657400383225955" rel="noopener noreferrer"&gt;this Twitter thread&lt;/a&gt; and the links therein.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://tinyurl.com/sp4hp7np" rel="noopener noreferrer"&gt;Post by Erwin Brandstetter&lt;/a&gt; to pgsql-general mailing list regarding &lt;code&gt;EXCLUDE USING hash(i WITH =)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://hakibenita.com/postgresql-hash-index" rel="noopener noreferrer"&gt;Re-Introducing Hash Indexes in PostgreSQL&lt;/a&gt;. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>performance</category>
      <category>hash</category>
      <category>index</category>
    </item>
    <item>
      <title>Ask not what PostgreSQL can do for AI, ask what AI can do for PostgreSQL</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Thu, 14 Dec 2023 19:52:28 +0000</pubDate>
      <link>https://dev.to/msdousti/ask-not-what-postgresql-can-do-for-ai-ask-what-ai-can-do-for-postgresql-26n7</link>
      <guid>https://dev.to/msdousti/ask-not-what-postgresql-can-do-for-ai-ask-what-ai-can-do-for-postgresql-26n7</guid>
      <description>&lt;p&gt;Today at &lt;a href="https://2023.pgconf.eu/" rel="noopener noreferrer"&gt;PostgreSQL Conference Europe 2023&lt;/a&gt;, some key people addressed the future of PostgreSQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.postgresql.eu/events/pgconfeu2023/schedule/session/5140-building-for-the-future-and-betting-big-on-postgres/" rel="noopener noreferrer"&gt;Building for the Future and Betting Big on Postgres&lt;/a&gt; by &lt;strong&gt;Kevin Dallas&lt;/strong&gt;, CEO of EDB.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.postgresql.eu/events/pgconfeu2023/schedule/session/5146-all-the-postgres-things-at-microsoft/" rel="noopener noreferrer"&gt;All the Postgres Things at Microsoft&lt;/a&gt; by &lt;strong&gt;Charles Feddersen&lt;/strong&gt;, Director of Product Management - Postgres on Azure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They both talked about how influential AI is, and how PostgreSQL can help. One major hot topic is natively storing and processing vector data, which AI heavily depends on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65d21cpnceis7uikq63c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65d21cpnceis7uikq63c.png" alt="azure_ai extension" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's nothing wrong with this approach, and in fact it's quite a welcome change. However, I see a gap here. While all the talk is about how PostgreSQL can help AI, the other direction is equally important: how AI can help PostgreSQL.&lt;/p&gt;

&lt;p&gt;The rest of the post entertains this idea. I give one example that is so basic it can hardly be deemed AI, but I hope you get the idea and develop it further!&lt;/p&gt;

&lt;h2&gt;
  
  
  Interlude: The highest-scored question on StackOverflow
&lt;/h2&gt;

&lt;p&gt;Many years ago, I thought of looking at the &lt;a href="https://stackoverflow.com/questions?tab=Votes" rel="noopener noreferrer"&gt;highest-scored question on StackOverflow&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclg4cqhqzn24rxgclr17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclg4cqhqzn24rxgclr17.png" alt="highest-scored question on StackOverflow" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.com/q/11227809/459391" rel="noopener noreferrer"&gt;This question&lt;/a&gt; was, and still is (as of this writing), at the top of the list. I read through the Q&amp;amp;A and learned so much from it. I'll summarize it below, but make sure to check it out.&lt;/p&gt;

&lt;p&gt;Consider this C snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;sumConditionally&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; 
                               &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;arraySize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;arraySize&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It traverses the whole &lt;code&gt;data&lt;/code&gt; array, and sums up entries that are 128 or greater.&lt;/p&gt;

&lt;p&gt;Does it matter if I shuffle &lt;code&gt;data&lt;/code&gt;? From the functionality standpoint, no: Addition is commutative, so order does not matter. What about the performance standpoint? We expect no change in performance either: Regardless of the order, the loop repeats the same number of times, the comparison is performed once per loop, and the number of entries for which the comparison holds is independent of the order of data. So, the performance must be the same, right?&lt;/p&gt;

&lt;p&gt;Wrong! The original poster observed a 6x speedup when the data was sorted. How on earth is that possible? Well, now you get why it's the highest-scored question.&lt;/p&gt;

&lt;p&gt;The reasoning is quite simple: CPUs have a lot of optimizations in place. One such optimization is called "branch prediction", which essentially boils down to predicting which branch of an &lt;code&gt;if&lt;/code&gt; statement is going to be taken. When the array is sorted, the CPU predictions turn out to be correct most of the time: Initially, the CPU makes a random guess. If the guess is correct, it continues guessing the same. Otherwise, it changes its guess. As the image below depicts, for a sorted array, the guess is correct most of the time, and the CPU has to change it only once. (Of course, I'm oversimplifying!)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq47b7krie48wcgbfmcwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq47b7krie48wcgbfmcwn.png" alt="A sample array for branch prediction" width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is reinforcement learning in its most basic form (so basic, in fact, that some may argue it is not an AI technique at all, since the learning algorithm is prescribed).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Define reinforcement learning like I'm 5.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT:&lt;/strong&gt; Reinforcement learning is like teaching a computer how to do something by giving it rewards when it does things right and punishments when it does things wrong. Just like how you learn to ride a bike better by getting a treat when you balance well and a small scold when you fall. Over time, the computer gets better at the task because it wants to get more rewards and avoid punishments. It's like a game where the computer tries to figure out the best way to do something by trial and error, learning from its mistakes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What does this have to do with PostgreSQL
&lt;/h2&gt;

&lt;p&gt;The PostgreSQL optimizer uses a cost-based model to determine the least costly plan for executing a query. To do this, it relies on basic statistics about the data. If the statistics are inaccurate or missing, PostgreSQL often makes a bad guess and thus picks a suboptimal plan.&lt;/p&gt;

&lt;p&gt;The issue, IMHO, is that PostgreSQL never corrects its guess. Run the same query 1000 times (assuming the underlying data and statistics do not change), and it always produces the same plan. For instance, it may estimate that the number of returned rows is 6 and therefore choose a NESTED LOOP JOIN, while during execution it actually fetches 1200 rows, for which the prescribed plan is awful.&lt;/p&gt;
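&lt;p&gt;&lt;code&gt;EXPLAIN (ANALYZE)&lt;/code&gt; makes the mismatch visible, since it prints the estimated row count next to the actual one; the point is that the actual count is then thrown away rather than fed back into future planning. A sketch with hypothetical table names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical tables. In the output, "rows=6" would be the planner's
-- estimate, while "actual ... rows=1200" is what really happened.
explain (analyze, buffers)
select *
from orders o
join order_items i on i.order_id = o.id
where o.status = 'pending';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;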

&lt;p&gt;To see what I mean, read &lt;strong&gt;Franck Pachot&lt;/strong&gt;'s post, where he details the issue and suggests &lt;a href="https://dev.to/yugabyte/extended-statistics-and-pghintplan-rows--j4k"&gt;using extended statistics or a PostgreSQL extension to give it hints&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That kept me thinking for a while, and I guess a more futuristic approach is to equip the PostgreSQL planner with AI. In the most basic scenario, it could use something similar to a CPU's branch-prediction algorithm: record the guesses, compare them against the actual rows fetched, and improve the guesses over time. There are many advanced machine-learning algorithms that could be used here; it's just a matter of finding one efficient enough that the planning time does not skyrocket and the implementation does not bloat the PostgreSQL core. Alternatively, it could be shipped as a PostgreSQL extension.&lt;/p&gt;
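&lt;p&gt;To make the branch-prediction analogy concrete, here is a toy sketch in Python (not PostgreSQL code; all names are made up): keep a per-query correction factor learned from the ratio of actual to estimated rows, and apply it to future estimates.&lt;/p&gt;

```python
class RowEstimateCorrector:
    """Toy feedback loop: learn a correction factor per query fingerprint."""

    def __init__(self):
        self.factor = {}  # query fingerprint -> learned correction factor

    def corrected_estimate(self, query_id, planner_estimate):
        # Apply whatever has been learned so far (1.0 = trust the planner).
        return planner_estimate * self.factor.get(query_id, 1.0)

    def feedback(self, query_id, planner_estimate, actual_rows):
        # Exponential moving average of the actual/estimate ratio, so a
        # single outlier execution does not swing the correction too hard.
        ratio = actual_rows / max(planner_estimate, 1.0)
        old = self.factor.get(query_id, 1.0)
        self.factor[query_id] = 0.8 * old + 0.2 * ratio


corrector = RowEstimateCorrector()
# The planner keeps guessing 6 rows while executions fetch 1200.
for _ in range(20):
    corrector.feedback("q1", 6, 1200)
# After a few executions the corrected estimate approaches reality.
print(round(corrector.corrected_estimate("q1", 6)))
```

&lt;p&gt;Of course, a real implementation would have to invalidate learned factors when statistics change, but the point is that the feedback loop itself is cheap.&lt;/p&gt;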

&lt;p&gt;I'm sure using AI inside PostgreSQL is not a new suggestion, and if I had searched, I would have found many others detailing it. However, I felt like writing about it myself 🙂&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>ai</category>
      <category>pgconfeu</category>
    </item>
    <item>
      <title>How Gradle disagreed with our Maven project</title>
      <dc:creator>Sadeq Dousti</dc:creator>
      <pubDate>Thu, 07 Dec 2023 10:59:30 +0000</pubDate>
      <link>https://dev.to/msdousti/how-gradle-disagreed-with-our-maven-project-3n4a</link>
      <guid>https://dev.to/msdousti/how-gradle-disagreed-with-our-maven-project-3n4a</guid>
      <description>&lt;h2&gt;
  
  
  Preamble
&lt;/h2&gt;

&lt;p&gt;I'm a maintainer of a popular open-source project called &lt;a href="https://github.com/zalando/logbook/" rel="noopener noreferrer"&gt;Logbook&lt;/a&gt;. It is "an extensible Java library for HTTP request and response logging". The project uses Maven for build automation.&lt;/p&gt;

&lt;p&gt;I have recently &lt;a href="https://github.com/zalando/logbook/commit/3004477032652125ee4b302c1d5b4e69ec4ac198#diff-97593056ecbf49ece87cde9da9d7bf29153431ca92e89a86e7f190edd7e5c621R21" rel="noopener noreferrer"&gt;made a change&lt;/a&gt; in the &lt;em&gt;dependency management&lt;/em&gt; section of one of the modules, that resulted in an unwanted effect: &lt;a href="https://github.com/zalando/logbook/issues/1705" rel="noopener noreferrer"&gt;Lombok became a transitive dependency of our project&lt;/a&gt;, but only if you use Gradle in your project! If you use Maven, you won't face this issue.&lt;/p&gt;

&lt;p&gt;But why?!&lt;/p&gt;

&lt;p&gt;I'm going to share the result of my investigations in this post. The structure of the article is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Toolchain used, plus an intro to dependency management&lt;/li&gt;
&lt;li&gt;Example: Creating a library (with a dependency management section) and using it in a Maven and a Gradle project&lt;/li&gt;
&lt;li&gt;Showing disagreement: Adding submodules to the library, overriding the dependency management, and seeing how Maven and Gradle projects interpret it differently&lt;/li&gt;
&lt;li&gt;Conclusion and personal opinion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Toolchain
&lt;/h2&gt;

&lt;p&gt;I used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenJDK Temurin 17.0.9+9 &lt;/li&gt;
&lt;li&gt;Apache Maven 3.9.6&lt;/li&gt;
&lt;li&gt;Gradle 8.5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the &lt;a href="https://github.com/msdousti/maven-vs-gradle" rel="noopener noreferrer"&gt;source code developed below&lt;/a&gt; on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependency management
&lt;/h2&gt;

&lt;p&gt;Dependency management is a way to tell Maven a few key pieces of information about the dependencies of the project. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dependency version&lt;/li&gt;
&lt;li&gt;Dependency scope&lt;/li&gt;
&lt;li&gt;Excluded dependencies&lt;/li&gt;
&lt;li&gt;Optional dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a refresher, see &lt;a href="https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html" rel="noopener noreferrer"&gt;Introduction to the Dependency Mechanism&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Consider a simple project with the following &lt;code&gt;pom.xml&lt;/code&gt; file (here, I'm using &lt;code&gt;jcip-annotations&lt;/code&gt; as a no-fluff dependency).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;project&lt;/span&gt; &lt;span class="na"&gt;xmlns=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0"&lt;/span&gt;
         &lt;span class="na"&gt;xmlns:xsi=&lt;/span&gt;&lt;span class="s"&gt;"http://www.w3.org/2001/XMLSchema-instance"&lt;/span&gt;
         &lt;span class="na"&gt;xsi:schemaLocation=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;modelVersion&amp;gt;&lt;/span&gt;4.0.0&lt;span class="nt"&gt;&amp;lt;/modelVersion&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.msdousti&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;maven-library&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0-SNAPSHOT&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;properties&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;maven.compiler.source&amp;gt;&lt;/span&gt;17&lt;span class="nt"&gt;&amp;lt;/maven.compiler.source&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;maven.compiler.target&amp;gt;&lt;/span&gt;17&lt;span class="nt"&gt;&amp;lt;/maven.compiler.target&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;project.build.sourceEncoding&amp;gt;&lt;/span&gt;UTF-8&lt;span class="nt"&gt;&amp;lt;/project.build.sourceEncoding&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/properties&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;net.jcip&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jcip-annotations&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;provided&lt;span class="nt"&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dependencyManagement&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;net.jcip&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jcip-annotations&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/project&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that the &lt;code&gt;dependencyManagement&lt;/code&gt; section specifies both version and scope for the dependency, so we can simply include the dependency in the &lt;code&gt;dependencies&lt;/code&gt; section without specifying those.&lt;/p&gt;

&lt;p&gt;If you run &lt;code&gt;mvn dependency:tree&lt;/code&gt; in the project root (I assume you have installed Maven and Java, and they are properly configured), you'll see&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INFO] --- dependency:3.6.1:tree (default-cli) @ maven-library ---
[INFO] io.msdousti:maven-library:jar:1.0-SNAPSHOT
[INFO] \- net.jcip:jcip-annotations:jar:1.0:provided
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This clearly shows that the JAR file of &lt;code&gt;net.jcip:jcip-annotations&lt;/code&gt; is imported with the proper version (&lt;code&gt;1.0&lt;/code&gt;) and scope (&lt;code&gt;provided&lt;/code&gt;). Cool!&lt;/p&gt;

&lt;p&gt;You can also use your IDE to see this info graphically. For instance, in IntelliJ IDEA, you can open the Maven tool window:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqbty9gjgm9uwfrsgtma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqbty9gjgm9uwfrsgtma.png" alt="Viewing dependencies in Maven tool window of IntelliJ IDEA" width="582" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a side note, IntelliJ IDEA also provides this nifty feature called "Analyze Dependencies", which is very handy in seeing which dependencies conflict and which version is ultimately chosen by Maven:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbvgnx6pt0cldnasszjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbvgnx6pt0cldnasszjn.png" alt="Analyze Dependencies feature of IntelliJ IDEA" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Including the project as a dependency in another project
&lt;/h2&gt;

&lt;p&gt;To include the project as a dependency in another project, we first need to build and install it in the local Maven repository. Run this in the root of the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mvn &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then check the local Maven repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; ~/.m2/repository/io/msdousti/maven-library/1.0-SNAPSHOT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should contain the project files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_remote.repositories  maven-library-1.0-SNAPSHOT.jar  maven-library-1.0-SNAPSHOT.pom  maven-metadata-local.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's include it as a dependency in a Maven and a Gradle project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maven Project
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;pom.xml&lt;/code&gt; would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;project&lt;/span&gt; &lt;span class="na"&gt;xmlns=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0"&lt;/span&gt;
         &lt;span class="na"&gt;xmlns:xsi=&lt;/span&gt;&lt;span class="s"&gt;"http://www.w3.org/2001/XMLSchema-instance"&lt;/span&gt;
         &lt;span class="na"&gt;xsi:schemaLocation=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;modelVersion&amp;gt;&lt;/span&gt;4.0.0&lt;span class="nt"&gt;&amp;lt;/modelVersion&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.msdousti&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;test-maven&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0-SNAPSHOT&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;properties&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;maven.compiler.source&amp;gt;&lt;/span&gt;17&lt;span class="nt"&gt;&amp;lt;/maven.compiler.source&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;maven.compiler.target&amp;gt;&lt;/span&gt;17&lt;span class="nt"&gt;&amp;lt;/maven.compiler.target&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;project.build.sourceEncoding&amp;gt;&lt;/span&gt;UTF-8&lt;span class="nt"&gt;&amp;lt;/project.build.sourceEncoding&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/properties&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.msdousti&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;maven-library&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0-SNAPSHOT&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/project&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you check the dependency tree by running &lt;code&gt;mvn dependency:tree&lt;/code&gt;, you will see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INFO] --- dependency:3.6.1:tree (default-cli) @ test-maven ---
[INFO] io.msdousti:test-maven:jar:1.0-SNAPSHOT
[INFO] \- io.msdousti:maven-library:jar:1.0-SNAPSHOT:compile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;code&gt;jcip-annotations&lt;/code&gt; is not included. This is because it has the scope &lt;code&gt;provided&lt;/code&gt; in the &lt;code&gt;io.msdousti:maven-library&lt;/code&gt; dependency, so it won't be transitively included.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gradle
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;build.gradle&lt;/code&gt; looks like this (notice that I added &lt;code&gt;mavenLocal()&lt;/code&gt; to the list of &lt;code&gt;repositories&lt;/code&gt;, as I want to include the dependency &lt;code&gt;maven-library&lt;/code&gt; from the local &lt;code&gt;.m2&lt;/code&gt; repository):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="n"&gt;plugins&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="s1"&gt;'java'&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'io.msdousti'&lt;/span&gt;
&lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1.0-SNAPSHOT'&lt;/span&gt;

&lt;span class="n"&gt;repositories&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mavenCentral&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mavenLocal&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dependencies&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;implementation&lt;/span&gt; &lt;span class="s1"&gt;'io.msdousti:maven-library:1.0-SNAPSHOT'&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;gradle :dependencies&lt;/code&gt;, we can view the dependencies, but as the output is bulky, I'll just show the graphical version from my IDE:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh1kvwf1oiylh5raab0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh1kvwf1oiylh5raab0s.png" alt="Dependencies of the Gradle project" width="599" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see that &lt;code&gt;maven-library&lt;/code&gt; is included in all four Gradle "class paths". But again, &lt;code&gt;jcip-annotations&lt;/code&gt; is not included, which is in agreement with the Maven project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disagreement begins
&lt;/h2&gt;

&lt;p&gt;Maven and Gradle start to diverge when our &lt;code&gt;maven-library&lt;/code&gt; has a sub-module with an overriding &lt;code&gt;dependencyManagement&lt;/code&gt; section.&lt;/p&gt;

&lt;p&gt;What does that mean?! Let's find out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating sub-modules
&lt;/h3&gt;

&lt;p&gt;Let's add two sub-modules to &lt;code&gt;maven-library&lt;/code&gt;. The directory structure will be like this (For simplicity, I ignored directories like &lt;code&gt;src&lt;/code&gt; that are not relevant here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
|
├─ pom.xml
├─ my-bom
|   └─ pom.xml
└─ my-module
    └─ pom.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 BOM stands for &lt;strong&gt;Bill of Materials&lt;/strong&gt;. It is typically a POM-only artifact that provides the &lt;code&gt;dependencyManagement&lt;/code&gt; section for a project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the parent &lt;code&gt;pom.xml&lt;/code&gt;, we have to specify a packaging of &lt;code&gt;pom&lt;/code&gt;, plus a &lt;code&gt;modules&lt;/code&gt; element listing the two sub-modules (add them just below the project coordinates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;groupId&amp;gt;io.msdousti&amp;lt;/groupId&amp;gt;
&amp;lt;artifactId&amp;gt;maven-library&amp;lt;/artifactId&amp;gt;
&amp;lt;version&amp;gt;1.0-SNAPSHOT&amp;lt;/version&amp;gt;

&amp;lt;!-- add below --&amp;gt;
&amp;lt;packaging&amp;gt;pom&amp;lt;/packaging&amp;gt;
&amp;lt;modules&amp;gt;
    &amp;lt;module&amp;gt;my-bom&amp;lt;/module&amp;gt;
    &amp;lt;module&amp;gt;my-module&amp;lt;/module&amp;gt;
&amp;lt;/modules&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;pom.xml&lt;/code&gt; file of the first child (&lt;code&gt;my-bom&lt;/code&gt;) will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;project&lt;/span&gt; &lt;span class="na"&gt;xmlns=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0"&lt;/span&gt;
         &lt;span class="na"&gt;xmlns:xsi=&lt;/span&gt;&lt;span class="s"&gt;"http://www.w3.org/2001/XMLSchema-instance"&lt;/span&gt;
         &lt;span class="na"&gt;xsi:schemaLocation=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;modelVersion&amp;gt;&lt;/span&gt;4.0.0&lt;span class="nt"&gt;&amp;lt;/modelVersion&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;parent&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.msdousti&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;maven-library&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0-SNAPSHOT&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/parent&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;my-bom&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;packaging&amp;gt;&lt;/span&gt;pom&lt;span class="nt"&gt;&amp;lt;/packaging&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;properties&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;maven.compiler.source&amp;gt;&lt;/span&gt;17&lt;span class="nt"&gt;&amp;lt;/maven.compiler.source&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;maven.compiler.target&amp;gt;&lt;/span&gt;17&lt;span class="nt"&gt;&amp;lt;/maven.compiler.target&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;project.build.sourceEncoding&amp;gt;&lt;/span&gt;UTF-8&lt;span class="nt"&gt;&amp;lt;/project.build.sourceEncoding&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/properties&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;net.jcip&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jcip-annotations&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;compile&lt;span class="nt"&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dependencyManagement&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/project&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that it has a &lt;code&gt;dependencyManagement&lt;/code&gt; section that specifies the &lt;code&gt;compile&lt;/code&gt; scope for &lt;code&gt;jcip-annotations&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pom.xml&lt;/code&gt; file of the other child (&lt;code&gt;my-module&lt;/code&gt;) will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;project&lt;/span&gt; &lt;span class="na"&gt;xmlns=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0"&lt;/span&gt;
         &lt;span class="na"&gt;xmlns:xsi=&lt;/span&gt;&lt;span class="s"&gt;"http://www.w3.org/2001/XMLSchema-instance"&lt;/span&gt;
         &lt;span class="na"&gt;xsi:schemaLocation=&lt;/span&gt;&lt;span class="s"&gt;"http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;modelVersion&amp;gt;&lt;/span&gt;4.0.0&lt;span class="nt"&gt;&amp;lt;/modelVersion&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;parent&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.msdousti&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;maven-library&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0-SNAPSHOT&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/parent&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;my-module&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.msdousti&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;my-bom&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.0-SNAPSHOT&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;import&lt;span class="nt"&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;type&amp;gt;&lt;/span&gt;pom&lt;span class="nt"&gt;&amp;lt;/type&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dependencyManagement&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/project&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It includes &lt;code&gt;my-bom&lt;/code&gt; in the &lt;code&gt;dependencyManagement&lt;/code&gt; section, with scope &lt;code&gt;import&lt;/code&gt; and type &lt;code&gt;pom&lt;/code&gt;. If you look at the dependencies of this module, you see that it honors the scope specified by the parent project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc16x7exdpgmmofp21a8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc16x7exdpgmmofp21a8v.png" alt="Dependencies of module my-module" width="505" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Do a &lt;code&gt;mvn install&lt;/code&gt; on the parent project, and let's find out how our Maven and Gradle projects that use the library will look.&lt;/p&gt;
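&lt;p&gt;If you want to see what Maven itself resolves the imported BOM to, the &lt;code&gt;help:effective-pom&lt;/code&gt; goal prints the fully resolved model of a module (run from the parent project; &lt;code&gt;-pl&lt;/code&gt; selects the sub-module):&lt;/p&gt;

```shell
# Print the fully resolved POM of my-module, including the
# dependencyManagement entries merged from the parent and the imported BOM.
mvn help:effective-pom -pl my-module
```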

&lt;h3&gt;
  
  
  Maven project
&lt;/h3&gt;

&lt;p&gt;Refresh the Maven project &lt;code&gt;test-maven&lt;/code&gt; in your IDE (or run the Maven command &lt;code&gt;mvn dependency:tree&lt;/code&gt; at the root of the project).&lt;br&gt;
The dependencies are still the same:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvm46t8eblezork6j7wuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvm46t8eblezork6j7wuk.png" alt="dependencies of the Maven project  raw `test-maven` endraw " width="541" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Gradle project
&lt;/h3&gt;

&lt;p&gt;Refresh the Gradle project &lt;code&gt;test-gradle&lt;/code&gt; in your IDE (or run the Gradle command &lt;code&gt;gradle :dependencies&lt;/code&gt; at the root of the project).&lt;br&gt;
You'll see that &lt;code&gt;net.jcip:jcip-annotations:1.0&lt;/code&gt; will appear as a transitive dependency:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv8ivukys72am1w2uvbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv8ivukys72am1w2uvbf.png" alt="dependencies of the Gradle project  raw `test-gradle` endraw " width="586" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Looking at the &lt;a href="https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html" rel="noopener noreferrer"&gt;article&lt;/a&gt; I shared at the beginning of this post, we stumble upon this section about the &lt;code&gt;import&lt;/code&gt; scope (emphasis mine):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This scope is only supported on a dependency of type pom in the &lt;code&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/code&gt; section. It indicates the dependency is to be &lt;strong&gt;replaced&lt;/strong&gt; with the effective list of dependencies in the specified POM's &lt;code&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/code&gt; section. Since they are replaced, dependencies with a scope of import do not actually participate in limiting the transitivity of a dependency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, the expectation is that the dependencies are &lt;strong&gt;replaced&lt;/strong&gt;. This is exactly what Gradle does; Maven, on the other hand, does not seem to honor this part of the documentation.&lt;/p&gt;
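&lt;p&gt;A practical consequence: Gradle users of such a library who do not want the leaked transitive dependency can exclude it on the consumer side. A hedged sketch for the &lt;code&gt;build.gradle&lt;/code&gt; of the example project above (using the coordinates from this post):&lt;/p&gt;

```groovy
dependencies {
    // Exclude the dependency that leaks in via the imported BOM's
    // dependencyManagement section when resolved by Gradle.
    implementation('io.msdousti:maven-library:1.0-SNAPSHOT') {
        exclude group: 'net.jcip', module: 'jcip-annotations'
    }
}
```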

&lt;p&gt;At least this is what I understand! If you have further insights, please feel free to share them in the comments section.&lt;/p&gt;

</description>
      <category>java</category>
      <category>maven</category>
      <category>gradle</category>
    </item>
  </channel>
</rss>
