DEV Community

Cover image for Efficient Percentile Ranking in PostgreSQL
JetThoughts Dev for JetThoughts

Posted on

1 1 1 1 1

Efficient Percentile Ranking in PostgreSQL

When analyzing data, percentiles help us understand the relative ranking of records within a dataset. In PostgreSQL, we can compute percentiles efficiently using the PERCENT_RANK() OVER window function.

Understanding Percentiles

The PERCENT_RANK() function calculates the relative rank of a row within a partitioned dataset. The formula used internally is:

percentile formula

This results in values ranging from 0.0 (for the lowest-ranked row) to 1.0 (for the highest-ranked row), providing a useful measure of distribution.

Example Data Structure

Let's assume we have a simple posts table:

CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL
);
Enter fullscreen mode Exit fullscreen mode

And for each post we can have multiple comments:

CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    post_id INT NOT NULL,
    comment_text TEXT NOT NULL
);
Enter fullscreen mode Exit fullscreen mode

And we have visits table, where we keep track of each post visits:

CREATE TABLE visits (
    id SERIAL PRIMARY KEY,
    post_id INT NOT NULL,
    visit_time TIMESTAMP DEFAULT now()
);
Enter fullscreen mode Exit fullscreen mode

The goal is to find out each post performance based on the number of visits/comments compared to all other posts.

Querying Percentiles

For that we can calculate two separate percentiles for each post:

  1. Based on the number of comments
  2. Based on the number of visits

Visits Percentile

First, we need to calculate number of visits for each post:

SELECT post_id, COUNT(*) AS visit_count
    FROM visits
    GROUP BY post_id
Enter fullscreen mode Exit fullscreen mode

Then we join it with the posts:

SELECT 
    p.id, 
    p.title,
    COALESCE(vc.visit_count, 0) AS visit_count,
    PERCENT_RANK() OVER (ORDER BY COALESCE(vc.visit_count, 0)) AS visit_percentile
FROM posts p
LEFT JOIN (
SELECT post_id, COUNT(*) AS visit_count
    FROM visits
    GROUP BY post_id
) vc ON p.id = vc.post_id
Enter fullscreen mode Exit fullscreen mode

A few things to note here:

  1. We use LEFT JOIN because some posts might not have visits yet, but we want them to be present in our performance report.
  2. For the same reason we use COALESCE(vc.visit_count, 0) instead of vc.visit_count - we want zeros instead of NULL there.

The line which calculates the percentile:

PERCENT_RANK() OVER (ORDER BY COALESCE(vc.visit_count, 0)) AS visit_percentile
Enter fullscreen mode Exit fullscreen mode

Comments Percentile

To calculate percentile based on the number of comments follows the same pattern: calculate number of comments for each post, join them with posts, and use window function PERCENT_RANK() to get the percentiles.

SELECT post_id, COUNT(*) AS comment_count
    FROM comments
    GROUP BY post_id
Enter fullscreen mode Exit fullscreen mode

We'll omit the intermediate query and show the final version which returns posts with both percentiles:

SELECT 
    p.id, 
    p.title,
    COALESCE(vc.visit_count, 0) AS visit_count,
    COALESCE(cc.comment_count, 0) AS comment_count,
    PERCENT_RANK() OVER (ORDER BY COALESCE(vc.visit_count, 0)) AS visit_percentile,
    PERCENT_RANK() OVER (ORDER BY COALESCE(cc.comment_count, 0)) AS comment_percentile,
FROM posts p
LEFT JOIN (
SELECT post_id, COUNT(*) AS visit_count
    FROM visits
    GROUP BY post_id
) vc ON p.id = vc.post_id
LEFT JOIN (
SELECT post_id, COUNT(*) AS comment_count
    FROM comments
    GROUP BY post_id
) cc ON p.id = cc.post_id
Enter fullscreen mode Exit fullscreen mode

Conclusion

By leveraging PERCENT_RANK(), we can easily determine how each post ranks relative to others based on different metrics. This approach is highly efficient and useful for analytics in PostgreSQL.

API Trace View

How I Cut 22.3 Seconds Off an API Call with Sentry

Struggling with slow API calls? Dan Mindru walks through how he used Sentry's new Trace View feature to shave off 22.3 seconds from an API call.

Get a practical walkthrough of how to identify bottlenecks, split tasks into multiple parallel tasks, identify slow AI model calls, and more.

Read more →

Top comments (0)

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay