DEV Community

Cover image for SQL Recipe: Segmenting Data
Anton Zhiyanov
Anton Zhiyanov

Posted on • Originally published at antonz.org

SQL Recipe: Segmenting Data

Suppose we want to divide our data into several segments based on the value of one or more columns (e.g., to assign customers or products to different groups for marketing purposes).

The solution is to use the ntile() function over an SQL window ordered by target columns.

Example

Let's divide the employees into three groups according to their salary:

  • high-paid,
  • medium-paid,
  • low-paid.
select
  ntile(3) over w as tile,
  name, salary
from employees
window w as (order by salary desc)
order by salary desc, id;
Enter fullscreen mode Exit fullscreen mode
┌──────┬───────┬────────┐
│ tile │ name  │ salary │
├──────┼───────┼────────┤
│ 1    │ Frank │ 120    │
│ 1    │ Henry │ 104    │
│ 1    │ Irene │ 104    │
│ 1    │ Alice │ 100    │
│ 2    │ Cindy │ 96     │
│ 2    │ Dave  │ 96     │
│ 2    │ Grace │ 90     │
│ 3    │ Emma  │ 84     │
│ 3    │ Bob   │ 78     │
│ 3    │ Diane │ 70     │
└──────┴───────┴────────┘
Enter fullscreen mode Exit fullscreen mode

The ntile(n) function splits all records into n groups and returns the group number for each record. If the total number of records (10 in our case) is not divisible by the group size (3), then the former groups will be larger than the latter.

Alternatives

ntile() always tries to split the data so that the groups are of the same size. So records with the same value may end up in different (adjacent) groups:

select
  ntile(2) over w as tile,
  name, salary
from employees
window w as (order by salary desc, id)
order by salary desc, tile;
Enter fullscreen mode Exit fullscreen mode
┌──────┬───────┬────────┐
│ tile │ name  │ salary │
├──────┼───────┼────────┤
│ 1    │ Frank │ 120    │
│ 1    │ Henry │ 104    │
│ 1    │ Irene │ 104    │
│ 1    │ Alice │ 100    │
│ 1    │ Cindy │ 96     │
│ 2    │ Dave  │ 96     │
│ 2    │ Grace │ 90     │
│ 2    │ Emma  │ 84     │
│ 2    │ Bob   │ 78     │
│ 2    │ Diane │ 70     │
└──────┴───────┴────────┘
Enter fullscreen mode Exit fullscreen mode

To avoid this, we can use the following (much more complicated) formula instead of ntile(n):

1 + ((rank() over w) - 1) * N / count(*) over () as tile
Enter fullscreen mode Exit fullscreen mode

For n = 2:

select
  1 + ((rank() over w) - 1) * 2 / count(*) over () as tile,
  name, salary
from employees
window w as (order by salary desc)
order by salary desc, id;
Enter fullscreen mode Exit fullscreen mode
┌──────┬───────┬────────┐
│ tile │ name  │ salary │
├──────┼───────┼────────┤
│ 1    │ Frank │ 120    │
│ 1    │ Henry │ 104    │
│ 1    │ Irene │ 104    │
│ 1    │ Alice │ 100    │
│ 1    │ Cindy │ 96     │
│ 1    │ Dave  │ 96     │
│ 2    │ Grace │ 90     │
│ 2    │ Emma  │ 84     │
│ 2    │ Bob   │ 78     │
│ 2    │ Diane │ 70     │
└──────┴───────┴────────┘
Enter fullscreen mode Exit fullscreen mode

Compatibility

All major vendors support the ntile() window function. Some of them, such as MS SQL and Oracle, do not support the window clause. In these cases, we can inline the window definition:

select
  ntile(3) over (
    order by salary desc
  ) as tile,
  name, salary
from employees
order by salary desc, id;
Enter fullscreen mode Exit fullscreen mode

We can also rewrite the query without window functions:

select
  ceil(
    (select count(*) from employees as e2 where e2.salary > e1.salary) * 3 /
    (select count(*) from employees)
  ) + 1 as tile,
  name, salary
from employees as e1
order by salary desc, id;
Enter fullscreen mode Exit fullscreen mode

Want to learn more about window functions? Read my book — SQL Window Functions Explained

Follow @ohmypy on Twitter to keep up with new posts

Top comments (0)