Databricks SQL Essentials - CTE

#beginners #dataengineering #sql #tutorial

🔖 This post is part of my series Databricks SQL Essentials

Why use CTEs

In this post, I want to focus on CTEs, which can significantly simplify SQL queries by making complex logic
easier to reason about and maintain. A CTE (Common Table Expression) is a temporary, named result set defined
using the WITH clause. It exists only for the duration of a single query and provides a clean way to organize multi-step transformations.

CTEs are especially well suited for a divide-and-conquer approach to SQL.

In this context, divide and conquer means:

Breaking a complex query into smaller, logical steps
Solving each step independently
Combining the results in a final query

Each step is expressed as its own CTE, resulting in clearer, more readable, and easier-to-debug SQL.

Regarding querying data, CTEs can improve your SQL queries in two main ways:

CTE definition — define a named, reusable result set at the beginning of your query using the WITH clause.
Subquery replacement — use a CTE instead of an inline subquery to improve readability and clarity.

Next, we will explore each type of CTE in detail and show how to use them effectively.

For more details and the official syntax, see the Databricks SQL CTE documentation.

CTE Definition at the Beginning of a Query

A Common Table Expression (CTE) defines a named, reusable result set at the start of a query using the WITH clause.

This approach is ideal for organizing multi-step logic and breaking complex queries into manageable parts.

This example demonstrates how CTEs simplify data processing with the Wanderbricks dataset:

countries — selects country, continent, and country code information.
users — selects user details including their country.
user_countries — combines users with country info, filling any missing continent or country code with "UNKNOWN".

The final query aggregates total booking amounts by country, continent, and booking status, making it easy to analyze bookings across different regions while handling missing data.

-- CTE for countries info
with countries as (
    select c.country, c.continent, c.country_code 
    from samples.wanderbricks.countries c
), 
-- CTE for users info
users as (
    select u.user_id, u.name, u.user_type, u.country 
    from samples.wanderbricks.users u
), 
-- Combine users with their country info
user_countries as (
    select 
        u.user_id, 
        u.name, 
        u.user_type, 
        u.country,
        coalesce(c.continent, "UNKOWN") as continent, -- fill missing continent
        coalesce(c.country_code, "UNKOWN") as country_code -- fill missing country code
    from users u 
    left join countries c on u.country = c.country
)
-- Final aggregation
select 
    u.country,
    u.continent,
    b.status,
    round(sum(b.total_amount), 2) as total_amount -- sum of bookings per group
from user_countries u 
join samples.wanderbricks.bookings b
  on u.user_id = b.user_id 
group by all -- group by country, continent, and status
order by country, status

This will give us the follwoing output.

Subquery Replacement with a CTE

Use a CTE instead of an inline subquery to improve readability. By giving the subquery a name, the purpose of the logic becomes clearer, and the overall query is easier to understand and maintain.

We can rewrite the previous query by defining the CTE in place of the subquery:

-- Aggregate total booking amounts per country, continent, and booking status
-- using the prepared user-country subquery
select 
    u.country,
    u.continent,
    b.status,
    round(sum(b.total_amount), 2) as total_amount -- sum of bookings per group
from (
  -- subquery with CTE
  with countries as (
      select c.country, c.continent, c.country_code 
      from samples.wanderbricks.countries c
  ), 
  -- CTE for users info
  users as (
      select u.user_id, u.name, u.user_type, u.country 
      from samples.wanderbricks.users u
  )
  -- Combine users with their country info
  select 
    u.user_id, 
    u.name, 
    u.user_type, 
    u.country,
    coalesce(c.continent, "UNKOWN") as continent, -- fill missing continent
    coalesce(c.country_code, "UNKOWN") as country_code -- fill missing country code
  from users u 
  left join countries c on u.country = c.country
) u 
join samples.wanderbricks.bookings b
  on u.user_id = b.user_id 
group by all -- group by country, continent, and status
order by country, status

Final Thoughts

Using CTEs in Databricks SQL is a powerful way to break complex queries into clear, manageable steps. They help you structure your logic, improve readability, and make debugging much easier.

Overall, CTEs support a divide-and-conquer approach, making your SQL both cleaner and more maintainable.

From my experience, learning to use CTEs transformed the way I write SQL. What used to be long, nested queries now feels organized and much easier to follow.

My advice: don’t hesitate to break your queries into smaller, named steps—it not only helps you understand your logic but also makes it easier for others to read and maintain your code.

DEV Community

Databricks SQL Essentials - CTE

Why use CTEs

CTE Definition at the Beginning of a Query

Subquery Replacement with a CTE

Final Thoughts

Top comments (0)