Kanishga Subramani

Posted on Jul 1

Day 51: ClickHouse® Materialized Views: Automating Data Transformations and Aggregations

#clickhouse #devops #database #dataengineering

Handling NULLs and Missing Data in ClickHouse®: A Complete Guide

Introduction

In any analytical system, data is rarely perfect. Customer profiles may be incomplete, IoT devices may fail to report sensor readings, application logs may omit optional fields, and financial transactions may lack promotional or referral information. These gaps are a natural part of working with real-world data.

How you represent and process missing information has a direct impact on query accuracy, storage efficiency, and analytical performance.

ClickHouse®, a high-performance columnar database designed for OLAP workloads, provides multiple ways to deal with missing data. Besides supporting nullable columns, it also offers default values and several built-in functions that make handling incomplete data both efficient and flexible.

Understanding when to use NULL values, when to rely on defaults, and how various functions interact with nullable data can help you build cleaner schemas, write more reliable queries, and avoid unnecessary performance overhead.

In this guide, we'll explore how ClickHouse handles missing data, the difference between NULL and default values, useful functions for working with nullable columns, performance considerations, and best practices for designing production-ready analytical databases.

Understanding NULL Values

A common misconception is that NULL simply means "empty."

In reality, NULL represents an unknown or unavailable value.

For example, consider a customer database.

Customer	Email
Alice	alice@example.com
Bob	NULL

Bob's email isn't an empty string.

It means the email address is currently unknown or hasn't been provided.

This distinction becomes extremely important during filtering, aggregation, joins, and reporting.

NULL should never be interpreted as zero, an empty string, or false.

It simply indicates the absence of a known value.

NULL vs Missing Values vs Default Values

Although these concepts are closely related, they represent different situations.

NULL

A NULL value explicitly indicates that information is unknown or unavailable.

Examples include:

Customer has not provided a phone number.
Sensor reading failed.
Referral code is unavailable.

Missing Values

A missing value occurs when no value is supplied during insertion.

Depending on the table definition, ClickHouse may:

Insert the column's default value
Insert NULL (for nullable columns)
Reject the insert

Default Values

A default value is automatically inserted whenever data isn't provided.

Example:

status String DEFAULT 'Pending'

If no status is supplied,

ClickHouse stores:

Pending

instead of NULL.

Default values are often preferable when they accurately represent the business meaning.

When Should You Use Nullable Columns?

Not every column should allow NULL values.

Nullable columns are appropriate when missing information has genuine business significance.

Examples include:

Secondary email addresses
Optional phone numbers
Discount codes
Referral IDs
GPS coordinates
Medical measurements
IoT sensor readings
Optional application metadata

If the absence of information carries meaning, Nullable() is usually the correct choice.

Creating Nullable Columns

ClickHouse does not allow NULL values unless explicitly specified.

To enable them, wrap the data type using Nullable().

Example:

CREATE TABLE employees
(
    id UInt32,
    name String,
    email Nullable(String),
    phone Nullable(String)
)
ENGINE = MergeTree
ORDER BY id;

Now NULL values can be stored safely.

INSERT INTO employees VALUES
(1,'Alice','alice@example.com','9876543210'),
(2,'Bob',NULL,NULL);

Querying NULL Values

Unlike ordinary values, NULL cannot be compared using = or !=.

Incorrect:

SELECT *
FROM employees
WHERE email = NULL;

Correct:

SELECT *
FROM employees
WHERE email IS NULL;

Finding rows containing valid values:

SELECT *
FROM employees
WHERE email IS NOT NULL;

These operators are specifically designed for nullable columns.

Useful Functions for Handling NULL Values

ClickHouse includes several built-in functions that simplify working with missing data.

ifNull()

Returns an alternative value whenever the first expression is NULL.

Example:

SELECT
name,
ifNull(email,'Not Available') AS email
FROM employees;

Output:

Name	Email
Alice	alice@example.com
Bob	Not Available

This is ideal for reports and dashboards.

coalesce()

Returns the first non-NULL value from multiple expressions.

Example:

SELECT
customer_name,
coalesce(work_email,
personal_email,
'No Contact') AS contact
FROM customers;

If work_email is NULL,

ClickHouse checks personal_email.

If both are NULL,

"No Contact" is returned.

This greatly simplifies fallback logic.

nullIf()

Converts a value into NULL when it matches another value.

Example:

SELECT
nullIf(score,0)
FROM results;

If score equals zero,

NULL is returned.

This is useful when imported datasets use placeholder values.

isNull()

Checks whether a value is NULL.

Example:

SELECT
isNull(email)
FROM employees;

Returns:

for NULL

and

otherwise.

assumeNotNull()

Sometimes you already know a nullable column contains valid values.

Instead of repeatedly performing nullable checks,

ClickHouse allows you to remove nullable processing.

SELECT
assumeNotNull(email)
FROM employees;

This can improve execution efficiency.

However,

calling assumeNotNull() on actual NULL values produces undefined results.

Only use it when you're certain the data contains no NULL values.

Working with Aggregate Functions

Most aggregate functions ignore NULL automatically.

Example:

SELECT
avg(score)
FROM exams;

Rows containing NULL scores are excluded from the calculation.

The same applies to:

sum(score)

min(score)

max(score)

count(score)

However,

count(*)

counts every row regardless of NULL values.

Understanding this difference prevents misleading reports.

Sorting Nullable Columns

Nullable values participate in sorting.

Example:

SELECT *
FROM employees
ORDER BY salary;

Depending on the query settings,

NULL values may appear before or after normal values.

Being aware of this behavior helps maintain predictable report ordering.

Nullable Columns and Storage

Every Nullable column requires ClickHouse to maintain an additional bitmap indicating whether each row contains NULL.

This bitmap consumes additional storage and introduces small processing overhead during query execution.

While the overhead is relatively small,

using Nullable() unnecessarily across dozens of columns can noticeably affect storage and performance.

For high-volume analytical workloads,

keeping schemas lean is often beneficial.

Choosing Between NULL and Default Values

Sometimes a default value communicates the business meaning more effectively.

Instead of storing:

NULL

consider:

Unknown

Pending

Not Assigned

N/A

only if these values accurately represent reality.

Remember:

NULL means "unknown."

A default value means "known."

The distinction matters.

Practical Data Cleaning Workflow

Suppose an organization wants every customer report to include contact information.

A simple workflow might look like this:

Check whether an email exists.
If email is missing, use the phone number.
If both are unavailable, display a default label.

Example:

SELECT
customer_name,
coalesce(email,
phone,
'No Contact Available') AS contact
FROM customers;

Every row now contains meaningful output without excluding incomplete records.

Common Mistakes

Many beginners misuse NULL values.

Common mistakes include:

Comparing NULL using =
Making every column Nullable()
Using Nullable columns as primary keys unnecessarily
Ignoring NULL behavior in aggregate functions
Replacing unknown values with incorrect defaults
Assuming NULL equals an empty string

Avoiding these issues results in cleaner data models and faster analytical queries.

Real-World Applications

Nullable columns appear in almost every analytical workload.

Examples include:

Customer Analytics

Optional phone numbers, secondary email addresses, referral codes.

IoT Platforms

Missing temperature, humidity, or pressure readings caused by sensor failures.

Financial Systems

Transactions without coupons, promotions, or affiliate IDs.

Application Monitoring

Optional request metadata, browser information, or debugging fields.

Healthcare

Unavailable laboratory measurements or incomplete patient records.

Each scenario benefits from representing unknown information correctly.

Best Practices

When working with NULL values in ClickHouse, consider the following guidelines:

Use Nullable() only when missing information has genuine business meaning.
Prefer default values whenever they accurately represent the data.
Use IS NULL and IS NOT NULL for filtering.
Avoid comparing NULL using standard comparison operators.
Use ifNull() for simple replacements.
Use coalesce() when multiple fallback values are available.
Keep nullable columns out of primary and sorting keys whenever possible.
Clean and validate data before ingestion.
Document how NULL values are interpreted across your datasets.
Review schema design periodically to eliminate unnecessary nullable columns.

Following these practices leads to more efficient storage, faster queries, and more reliable analytical results.

Conclusion

Handling missing data is far more than a technical detail—it is a fundamental aspect of designing reliable analytical systems.

ClickHouse® provides powerful support for managing incomplete information through nullable columns, default values, and a rich set of built-in functions such as ifNull(), coalesce(), nullIf(), isNull(), and assumeNotNull().

Choosing between NULL values and defaults requires understanding the business meaning behind the data. Used appropriately, these features help maintain data integrity while preserving ClickHouse's exceptional query performance.

By designing schemas carefully, minimizing unnecessary nullable columns, and applying the right functions during query execution, you can build analytical applications that remain both accurate and highly performant as your datasets continue to grow.

DEV Community

Day 51: ClickHouse® Materialized Views: Automating Data Transformations and Aggregations

Handling NULLs and Missing Data in ClickHouse®: A Complete Guide

Introduction

Understanding NULL Values

NULL vs Missing Values vs Default Values

NULL

Missing Values

Default Values

When Should You Use Nullable Columns?

Creating Nullable Columns

Querying NULL Values

Useful Functions for Handling NULL Values

ifNull()

coalesce()

nullIf()

isNull()

assumeNotNull()

Working with Aggregate Functions

Sorting Nullable Columns

Nullable Columns and Storage

Choosing Between NULL and Default Values

Practical Data Cleaning Workflow

Common Mistakes

Real-World Applications

Best Practices

Conclusion

Top comments (0)