DEV Community

Tim
Tim

Posted on

T-SQL: How To Use PARTITION BY and Why

If we think about functions in mathematics, we'll recall that one function is the division function. If we have 10 apples and we want to divide them equally into 2 containers, there will be 5 apples in each container. If we have 10 apples and 10 containers and we want to divide them equally into the containers, there will be 1 apple per container. In this week's video, we're looking at the PARTITION BY functionality in SQL. Fundamentally, PARTITION BY operates like division - we're dividing a row's values into partitions. In the video, SQL Basics: How To Use PARTITION BY and Why, we see an actual demo of PARTITION BY and where we may use this to solve problems we face.

Some questions that are answered in the video:

  • In the examples, we'll notice that we're using a table with mixed values - some completely new in one column and another column with duplicates. These demos are live, so we are using live data.
  • What is the difference between partitioning by the column ID and the column Letter?
  • What happens when we insert a new record that is duplicate in both ID, Letter, and Val? How does the PARTITION BY with ROW_NUMBER() function here?
  • While people tend to think of PARTITION BY as helpful to remove duplicates, how would this functionality possibly help us with scaling horizontally and why?

I recommend that people duplicate the code with the same or different data and experiment with results.

There are numerous applications where we can apply this - provided that we remember that PARTITION BY allows us to slice data in identical categories based on what we specify. The reason we use it with duplicates easily is because when we order duplicates by ROW_NUMBER() and PARTITION the duplicate column (or columns), we'll end up with numbers that are greater than 1, if there is a duplicate. But keep in mind, this is extremely useful in batching data when horizontally scaling ETL operations: we can use PARTITION BY to order a column or set of columns to ensure that we process batches of data over different or identical periods of time.

Top comments (0)