Underrated Postgres: Extended Statistics for Better Execution Plans

#postgres #database #architecture #webdev

One of the features of Postgres which is very-little known and even less used (at least I rarely see it in production) is extended statistics to cover and encode the dependencies, relations, potential combinations of multi-column values.

The query planner typically creates execution plans based on single-column statistics. But meaning and relationship encoded across columns is very common in real world applications.

I personally would've loved to know about it earlier. Usually, I'd add another index in the hope that it improves access patterns and yields better execution plans. But the cost of all the indexes quickly reaches or even extends the origin table. Extended statistics are here to solve the issue. Not always but with high tendency to most often.

Not saying indexes are useless but they aren't the best way to solve bad execution plans!

The Problem with Single-Column Statistics

PostgreSQL's query planner relies heavily on statistics to estimate the cost of different execution plans. By default, it collects statistics on individual columns. This works well when columns are independent. However, in many real-world scenarios, columns are correlated.

Consider a table of users with columns city and country. If you query for users in city = 'Paris' and country = 'France', the planner might underestimate the number of rows returned because it assumes the two conditions are independent. It multiplies the probability of city = 'Paris' by the probability of country = 'France'. But we know that if the city is Paris, the country is almost certainly France.

This underestimation can lead the planner to choose a suboptimal execution plan, such as a nested loop join instead of a hash join, resulting in poor performance.

Enter Extended Statistics

Extended statistics allow you to tell PostgreSQL about the relationships between columns. You can create statistics on multiple columns, enabling the planner to make more accurate estimates.

Creating Extended Statistics

You can create extended statistics using the CREATE STATISTICS command. Here's an example:

CREATE STATISTICS city_country_stats (dependencies) ON city, country FROM users;

This command creates statistics that capture the functional dependencies between the city and country columns in the users table.

Types of Extended Statistics

PostgreSQL supports several types of extended statistics:

Dependencies: Captures functional dependencies between columns (e.g., city determines country). Useful for queries with multiple equality conditions.
N-Distinct: Estimates the number of distinct values for a combination of columns. Useful for GROUP BY queries.
MCV (Most Common Values): Stores the most common combinations of values across multiple columns. Useful for queries with complex conditions (e.g., inequalities, OR clauses).

You can specify multiple types when creating statistics:

CREATE STATISTICS complex_stats (dependencies, mcv) ON col1, col2, col3 FROM my_table;

When to Use Extended Statistics

Extended statistics are particularly useful when:

You have queries with multiple conditions on correlated columns.
The query planner is consistently underestimating or overestimating row counts.
You are considering adding a multi-column index solely to improve query planning (extended statistics are often a lighter-weight alternative).

Common Pitfalls

Overuse: Don't create extended statistics on every combination of columns. They consume storage and add overhead to ANALYZE operations. Only create them when you identify a specific query planning issue.
Forgetting to ANALYZE: After creating extended statistics, you must run ANALYZE on the table for the planner to start using them.
Complex Expressions: Extended statistics currently only work on simple column references, not expressions (e.g., LOWER(col)).

Conclusion

Extended statistics are a powerful but underutilized feature in PostgreSQL. By providing the query planner with information about column relationships, you can significantly improve the performance of complex queries without the overhead of unnecessary indexes.

Use CREATE STATISTICS to capture dependencies, n-distinct values, or MCVs.
Target specific queries where the planner is making poor estimates.
Remember to ANALYZE your tables after creating statistics.

What's your experience with extended statistics in PostgreSQL? Have you used them to solve performance issues in production? Drop your thoughts in the comments.

About the Author: Ameer Hamza is a Top-Rated Full-Stack Developer with 7+ years of experience building SaaS platforms, eCommerce solutions, and AI-powered applications. He specializes in Laravel, Vue.js, React, Next.js, and AI integrations — with 50+ projects shipped and a 100% job success rate. Check out his portfolio at ameer.pk to see his latest work, or reach out for your next development project.