
Matt Rueedlinger

Originally published at rueedlinger.ch

Databricks SQL Essentials - Array Data Type

πŸ”– This post is part of my series Databricks SQL Essentials

Why Work with Array Types

In this post, I want to focus on array types in Databricks SQLβ€”what they are, why they matter, and how you can use them effectively. Arrays are powerful because they allow you to store multiple values in a single column, which can be incredibly useful when working with semi-structured data like JSON, logs, or event streams.

We will look at two common scenarios:

  • From Row to Array – combining multiple rows into a single array for easier aggregation.
  • From Array to Row – exploding an array into separate rows to analyze individual elements.

These techniques help you move smoothly between structured and semi-structured data for more flexible analysis.

In this blog post, we will use the WanderBricks dataset in Databricks and show how to handle both cases: converting from rows to arrays and from arrays to rows.

Array vs Set

First, let’s start with a bit of theory about arrays. In this context, we will use the terms array and list interchangeably. While some programming languages distinguish between these two concepts, in Databricks SQL they are represented by a single data type.

An array (ARRAY<elementType>) in Databricks SQL is a data type that holds a collection of elements of another supported data type.

An array is an ordered collection of values stored in a single column, which makes it a natural fit for semi-structured data. All elements of a given array share one declared type, but that type can be any supported type, including numbers, strings, or even other arrays.

  • Arrays can contain duplicates: SELECT array(10, 20, 10) --> [10,20,10]
  • Elements are indexed starting at 0, so you can access individual items: SELECT array(10, 20, 30)[0] AS first_element --> 10
  • Useful for storing repeating or nested data in a compact way.
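The points above can be sketched in a single standalone query; the literal values are illustrative only, no table is needed:

```sql
-- Build an array literal, index into it (0-based), and measure its length.
SELECT
  array(10, 20, 10)       AS arr,           -- [10, 20, 10], duplicates are kept
  array(10, 20, 30)[0]    AS first_element, -- 10, indexing starts at 0
  size(array(10, 20, 10)) AS element_count  -- 3
```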

A set is a collection of unique values stored in a single column. Unlike arrays, sets cannot contain duplicates, and the elements have no guaranteed order.

Sets are useful for deduplicating data while keeping all distinct values together.

In Databricks SQL, there is no native SET data type, but you can achieve similar behavior using arrays with deduplication operations (e.g., array_distinct, collect_set) or other SQL functions to work with unique elements.

SELECT array_distinct(array(10, 20, 10)) --> [10,20]

Comparing the two concepts feature by feature:

  • Definition: an array is an ordered list of values; a set is a collection of unique values.
  • Databricks SQL: arrays are fully supported with the ARRAY type; there is no native SET type, but array_distinct or collect_set can emulate uniqueness.
  • Use case: use an array when you need multiple values in a column, including duplicates; use a set when you need unique values only.

You might wonder what the difference is between array_distinct() and collect_set(). The collect_set() function is an aggregate function that collects unique values from multiple rows into an array, while array_distinct() is a non-aggregate function that removes duplicates from an existing array. In short:

  • Need to collect unique values from multiple rows β†’ use collect_set()
  • Already have an array and want to remove duplicates β†’ use array_distinct()
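A minimal standalone comparison of the two functions, using literal values rather than a real table:

```sql
-- collect_set() aggregates unique values across rows (an aggregate function);
-- array_distinct() de-duplicates an array you already have (a scalar function).
SELECT collect_set(x) AS from_rows  -- e.g. [10, 20]; element order is not guaranteed
FROM VALUES (10), (20), (10) AS t(x);

SELECT array_distinct(array(10, 20, 10)) AS from_array  -- [10, 20]
```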

From Row to Array

Sometimes, you might want to combine multiple rows into a single array. This is especially useful when you want to aggregate related data points for analysis. In Databricks SQL, you can use the following functions to achieve this:

  • collect_list() – collects multiple rows into a list (array) of values, including duplicates.
  • collect_set() – collects multiple rows into a list (array) of unique values, removing duplicates.

As you can see, both are aggregate functions; they differ only in how they handle duplicates. collect_list() includes all values, while collect_set() keeps only unique values.
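The difference is easiest to see side by side. A small sketch with literal values standing in for real rows:

```sql
-- Aggregating the same rows with both functions.
-- Note: element order within the resulting arrays is not guaranteed.
SELECT
  collect_list(status) AS all_statuses,    -- e.g. ["confirmed", "cancelled", "confirmed"]
  collect_set(status)  AS unique_statuses  -- e.g. ["confirmed", "cancelled"]
FROM VALUES ('confirmed'), ('cancelled'), ('confirmed') AS t(status)
```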

Example: Simple Array

In this example, we create a simple array to summarize user bookings. Here we want to combine their data and summarize it per user:

  • We join the tables on a common key (user_id) so that each booking is matched with its corresponding user.
  • After joining, there are usually multiple rows per user. So we need to aggregate by user_id to get one row per user. Then we can use collect_list() or collect_set() to gather all the booking statuses into an array for each user.
  • With sort_array() we can sort the array of booking statuses for better readability, and size() to count the number of distinct and total statuses.
  • This results in a single row per user, with all their booking details stored in one column as an array.
select
  u.user_id as user_id,
  sort_array(collect_set(b.status)) as status,
  size(collect_set(b.status)) as status_count_distinct,
  size(collect_list(b.status)) as status_count
from samples.wanderbricks.users u
join samples.wanderbricks.bookings b
  on u.user_id = b.user_id
group by u.user_id 
order by status_count desc

The output looks like this:

Example output with collect_list() and group by

Example: Array of Structs

Sometimes, you want one row per user that contains all their bookings as structured data. Here we use the STRUCT data type, which groups multiple related fields into a single column. By using collect_list() together with struct(), you can create an array where each element holds detailed booking information:

  • Each element in the bookings array is a struct containing user_id, booking_id, total_amount, status, and created_at.
  • Instead of having one row per booking, you now have one row per user, with all their bookings neatly packed into a single column.

This approach makes it easy to analyze or export nested booking data while keeping the data organized and compact.

select
  u.user_id as user_id,
  collect_list(struct(
    u.user_id as user_id,
    b.booking_id as booking_id,
    b.total_amount as total_amount,
    b.status as status,
    b.created_at as created_at
  )) as bookings
from samples.wanderbricks.users u
join samples.wanderbricks.bookings b
  on u.user_id = b.user_id
group by u.user_id 

This produces the following output:

Example output with collect_list()

From Array to Row

There might be situations where you need to expand an array into multiple rows to analyze each element individually. Databricks SQL provides the explode() function for this purpose.

The explode() function splits an array into separate rows. In the example below, we use the table samples.wanderbricks.customer_support_logs. The messages column contains an array of structs with the following schema:

{
  "message": "I'm writing to express my outrage and disappointment....",
  "sender": "user",
  "sentiment": "angry",
  "timestamp": "2025-04-08T17:23:13"
}

Example: Expanding Nested Arrays with explode()

This query demonstrates how to use the explode() function to transform nested array data into individual rows. By applying LATERAL VIEW explode(messages), the query expands each element in the messages array into its own row. This allows direct access to the nested fields inside each message struct.

This approach is useful when working with semi-structured or nested data, enabling easier filtering, aggregation, and analysis at the individual message level.

SELECT
  cl.ticket_id,
  m.message as message,
  m.sender as sender,
  m.sentiment as sentiment,
  m.timestamp as timestamp
FROM samples.wanderbricks.customer_support_logs cl
LATERAL VIEW explode(cl.messages) AS m

This will give us the following output:

Example output with explode()

Example: Safely Extracting Fields from a Struct Using JSON Functions

There may be cases where an attribute does not exist in a struct, or the struct has a deeply nested hierarchy. In such situations, one alternative approach is to safely extract fields by converting the struct to JSON using to_json(), and then retrieving the value with get_json_object() using JSON path notation (get_json_object(to_json(m), '$.field')).

This method allows you to reference attributes that may not exist without causing the query to fail.

SELECT
  cl.ticket_id,
  get_json_object(to_json(m), '$.message')   AS message,
  get_json_object(to_json(m), '$.sender')    AS sender,
  get_json_object(to_json(m), '$.sentiment') AS sentiment,
  get_json_object(to_json(m), '$.timestamp') AS timestamp,
  get_json_object(to_json(m), '$.foo') AS foo -- does not exist in struct
FROM samples.wanderbricks.customer_support_logs cl
LATERAL VIEW explode(cl.messages) t AS m

This approach returns NULL for attributes that are missing instead of raising a schema error.
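Databricks SQL also offers the colon (:) operator for JSON path access on strings, which can shorten this pattern. A sketch against the same table, assuming the struct has been serialized with to_json():

```sql
-- The ':' operator extracts a JSON path from a STRING column and
-- returns NULL for attributes that are missing, just like get_json_object().
SELECT
  cl.ticket_id,
  to_json(m):sentiment AS sentiment,
  to_json(m):foo       AS foo  -- does not exist in struct, returns NULL
FROM samples.wanderbricks.customer_support_logs cl
LATERAL VIEW explode(cl.messages) t AS m
```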

Final Thoughts

Arrays in Databricks SQL are a powerful tool for handling multiple values within a single column. You can aggregate rows into arrays for easier summarization or explode arrays into rows to analyze each element individually, making your queries more flexible and concise.

By leveraging arrays effectively, you can simplify complex data transformations and gain deeper insights, making your SQL workflows faster, cleaner, and more efficient.
