Suppose we want to divide our data into several segments based on the value of one or more columns (e.g., to assign customers or products to different groups for marketing purposes).
The solution is to use the ntile()
function over an SQL window ordered by target columns.
Example
Let's divide the employees
into three groups according to their salary:
- high-paid,
- medium-paid,
- low-paid.
select
ntile(3) over w as tile,
name, salary
from employees
window w as (order by salary desc)
order by salary desc, id;
┌──────┬───────┬────────┐
│ tile │ name │ salary │
├──────┼───────┼────────┤
│ 1 │ Frank │ 120 │
│ 1 │ Henry │ 104 │
│ 1 │ Irene │ 104 │
│ 1 │ Alice │ 100 │
│ 2 │ Cindy │ 96 │
│ 2 │ Dave │ 96 │
│ 2 │ Grace │ 90 │
│ 3 │ Emma │ 84 │
│ 3 │ Bob │ 78 │
│ 3 │ Diane │ 70 │
└──────┴───────┴────────┘
The ntile(n)
function splits all records into n
groups and returns the group number for each record. If the total number of records (10 in our case) is not divisible by the group size (3), then the former groups will be larger than the latter.
Alternatives
ntile()
always tries to split the data so that the groups are of the same size. So records with the same value may end up in different (adjacent) groups:
select
ntile(2) over w as tile,
name, salary
from employees
window w as (order by salary desc, id)
order by salary desc, tile;
┌──────┬───────┬────────┐
│ tile │ name │ salary │
├──────┼───────┼────────┤
│ 1 │ Frank │ 120 │
│ 1 │ Henry │ 104 │
│ 1 │ Irene │ 104 │
│ 1 │ Alice │ 100 │
│ 1 │ Cindy │ 96 │
│ 2 │ Dave │ 96 │
│ 2 │ Grace │ 90 │
│ 2 │ Emma │ 84 │
│ 2 │ Bob │ 78 │
│ 2 │ Diane │ 70 │
└──────┴───────┴────────┘
To avoid this, we can use the following (much more complicated) formula instead of ntile(n)
:
1 + ((rank() over w) - 1) * N / count(*) over () as tile
For n = 2
:
select
1 + ((rank() over w) - 1) * 2 / count(*) over () as tile,
name, salary
from employees
window w as (order by salary desc)
order by salary desc, id;
┌──────┬───────┬────────┐
│ tile │ name │ salary │
├──────┼───────┼────────┤
│ 1 │ Frank │ 120 │
│ 1 │ Henry │ 104 │
│ 1 │ Irene │ 104 │
│ 1 │ Alice │ 100 │
│ 1 │ Cindy │ 96 │
│ 1 │ Dave │ 96 │
│ 2 │ Grace │ 90 │
│ 2 │ Emma │ 84 │
│ 2 │ Bob │ 78 │
│ 2 │ Diane │ 70 │
└──────┴───────┴────────┘
Compatibility
All major vendors support the ntile()
window function. Some of them, such as MS SQL and Oracle, do not support the window
clause. In these cases, we can inline the window definition:
select
ntile(3) over (
order by salary desc
) as tile,
name, salary
from employees
order by salary desc, id;
We can also rewrite the query without window functions:
select
ceil(
(select count(*) from employees as e2 where e2.salary > e1.salary) * 3 /
(select count(*) from employees)
) + 1 as tile,
name, salary
from employees as e1
order by salary desc, id;
Want to learn more about window functions? Read my book — SQL Window Functions Explained
Follow @ohmypy on Twitter to keep up with new posts
Top comments (0)