When analyzing data, percentiles help us understand the relative ranking of records within a dataset. In PostgreSQL, we can compute percentiles efficiently using the PERCENT_RANK() OVER
window function.
Understanding Percentiles
The PERCENT_RANK()
function calculates the relative rank of a row within a partitioned dataset. The formula used internally is:
This results in values ranging from 0.0 (for the lowest-ranked row) to 1.0 (for the highest-ranked row), providing a useful measure of distribution.
Example Data Structure
Let's assume we have a simple posts table:
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
body TEXT NOT NULL
);
And for each post we can have multiple comments:
CREATE TABLE comments (
id SERIAL PRIMARY KEY,
post_id INT NOT NULL,
comment_text TEXT NOT NULL
);
And we have visits table, where we keep track of each post visits:
CREATE TABLE visits (
id SERIAL PRIMARY KEY,
post_id INT NOT NULL,
visit_time TIMESTAMP DEFAULT now()
);
The goal is to find out each post performance based on the number of visits/comments compared to all other posts.
Querying Percentiles
For that we can calculate two separate percentiles for each post:
- Based on the number of comments
- Based on the number of visits
Visits Percentile
First, we need to calculate number of visits for each post:
SELECT post_id, COUNT(*) AS visit_count
FROM visits
GROUP BY post_id
Then we join it with the posts:
SELECT
p.id,
p.title,
COALESCE(vc.visit_count, 0) AS visit_count,
PERCENT_RANK() OVER (ORDER BY COALESCE(vc.visit_count, 0)) AS visit_percentile
FROM posts p
LEFT JOIN (
SELECT post_id, COUNT(*) AS visit_count
FROM visits
GROUP BY post_id
) vc ON p.id = vc.post_id
A few things to note here:
- We use
LEFT JOIN
because some posts might not have visits yet, but we want them to be present in our performance report. - For the same reason we use
COALESCE(vc.visit_count, 0)
instead ofvc.visit_count
- we want zeros instead ofNULL
there.
The line which calculates the percentile:
PERCENT_RANK() OVER (ORDER BY COALESCE(vc.visit_count, 0)) AS visit_percentile
Comments Percentile
To calculate percentile based on the number of comments follows the same pattern: calculate number of comments for each post, join them with posts, and use window function PERCENT_RANK()
to get the percentiles.
SELECT post_id, COUNT(*) AS comment_count
FROM comments
GROUP BY post_id
We'll omit the intermediate query and show the final version which returns posts with both percentiles:
SELECT
p.id,
p.title,
COALESCE(vc.visit_count, 0) AS visit_count,
COALESCE(cc.comment_count, 0) AS comment_count,
PERCENT_RANK() OVER (ORDER BY COALESCE(vc.visit_count, 0)) AS visit_percentile,
PERCENT_RANK() OVER (ORDER BY COALESCE(cc.comment_count, 0)) AS comment_percentile,
FROM posts p
LEFT JOIN (
SELECT post_id, COUNT(*) AS visit_count
FROM visits
GROUP BY post_id
) vc ON p.id = vc.post_id
LEFT JOIN (
SELECT post_id, COUNT(*) AS comment_count
FROM comments
GROUP BY post_id
) cc ON p.id = cc.post_id
Conclusion
By leveraging PERCENT_RANK()
, we can easily determine how each post ranks relative to others based on different metrics. This approach is highly efficient and useful for analytics in PostgreSQL.
Top comments (0)