Handling NULLs and Missing Data in ClickHouse®: A Complete Guide
Introduction
In any analytical system, data is rarely perfect. Customer profiles may be incomplete, IoT devices may fail to report sensor readings, application logs may omit optional fields, and financial transactions may lack promotional or referral information. These gaps are a natural part of working with real-world data.
How you represent and process missing information has a direct impact on query accuracy, storage efficiency, and analytical performance.
ClickHouse®, a high-performance columnar database designed for OLAP workloads, provides multiple ways to deal with missing data. Besides supporting nullable columns, it also offers default values and several built-in functions that make handling incomplete data both efficient and flexible.
Understanding when to use NULL values, when to rely on defaults, and how various functions interact with nullable data can help you build cleaner schemas, write more reliable queries, and avoid unnecessary performance overhead.
In this guide, we'll explore how ClickHouse handles missing data, the difference between NULL and default values, useful functions for working with nullable columns, performance considerations, and best practices for designing production-ready analytical databases.
Understanding NULL Values
A common misconception is that NULL simply means "empty."
In reality, NULL represents an unknown or unavailable value.
For example, consider a customer database.
| Customer | |
|---|---|
| Alice | alice@example.com |
| Bob | NULL |
Bob's email isn't an empty string.
It means the email address is currently unknown or hasn't been provided.
This distinction becomes extremely important during filtering, aggregation, joins, and reporting.
NULL should never be interpreted as zero, an empty string, or false.
It simply indicates the absence of a known value.
NULL vs Missing Values vs Default Values
Although these concepts are closely related, they represent different situations.
NULL
A NULL value explicitly indicates that information is unknown or unavailable.
Examples include:
- Customer has not provided a phone number.
- Sensor reading failed.
- Referral code is unavailable.
Missing Values
A missing value occurs when no value is supplied during insertion.
Depending on the table definition, ClickHouse may:
- Insert the column's default value
- Insert NULL (for nullable columns)
- Reject the insert
Default Values
A default value is automatically inserted whenever data isn't provided.
Example:
status String DEFAULT 'Pending'
If no status is supplied,
ClickHouse stores:
Pending
instead of NULL.
Default values are often preferable when they accurately represent the business meaning.
When Should You Use Nullable Columns?
Not every column should allow NULL values.
Nullable columns are appropriate when missing information has genuine business significance.
Examples include:
- Secondary email addresses
- Optional phone numbers
- Discount codes
- Referral IDs
- GPS coordinates
- Medical measurements
- IoT sensor readings
- Optional application metadata
If the absence of information carries meaning, Nullable() is usually the correct choice.
Creating Nullable Columns
ClickHouse does not allow NULL values unless explicitly specified.
To enable them, wrap the data type using Nullable().
Example:
CREATE TABLE employees
(
id UInt32,
name String,
email Nullable(String),
phone Nullable(String)
)
ENGINE = MergeTree
ORDER BY id;
Now NULL values can be stored safely.
INSERT INTO employees VALUES
(1,'Alice','alice@example.com','9876543210'),
(2,'Bob',NULL,NULL);
Querying NULL Values
Unlike ordinary values, NULL cannot be compared using = or !=.
Incorrect:
SELECT *
FROM employees
WHERE email = NULL;
Correct:
SELECT *
FROM employees
WHERE email IS NULL;
Finding rows containing valid values:
SELECT *
FROM employees
WHERE email IS NOT NULL;
These operators are specifically designed for nullable columns.
Useful Functions for Handling NULL Values
ClickHouse includes several built-in functions that simplify working with missing data.
ifNull()
Returns an alternative value whenever the first expression is NULL.
Example:
SELECT
name,
ifNull(email,'Not Available') AS email
FROM employees;
Output:
| Name | |
|---|---|
| Alice | alice@example.com |
| Bob | Not Available |
This is ideal for reports and dashboards.
coalesce()
Returns the first non-NULL value from multiple expressions.
Example:
SELECT
customer_name,
coalesce(work_email,
personal_email,
'No Contact') AS contact
FROM customers;
If work_email is NULL,
ClickHouse checks personal_email.
If both are NULL,
"No Contact" is returned.
This greatly simplifies fallback logic.
nullIf()
Converts a value into NULL when it matches another value.
Example:
SELECT
nullIf(score,0)
FROM results;
If score equals zero,
NULL is returned.
This is useful when imported datasets use placeholder values.
isNull()
Checks whether a value is NULL.
Example:
SELECT
isNull(email)
FROM employees;
Returns:
1
for NULL
and
0
otherwise.
assumeNotNull()
Sometimes you already know a nullable column contains valid values.
Instead of repeatedly performing nullable checks,
ClickHouse allows you to remove nullable processing.
SELECT
assumeNotNull(email)
FROM employees;
This can improve execution efficiency.
However,
calling assumeNotNull() on actual NULL values produces undefined results.
Only use it when you're certain the data contains no NULL values.
Working with Aggregate Functions
Most aggregate functions ignore NULL automatically.
Example:
SELECT
avg(score)
FROM exams;
Rows containing NULL scores are excluded from the calculation.
The same applies to:
sum(score)
min(score)
max(score)
count(score)
However,
count(*)
counts every row regardless of NULL values.
Understanding this difference prevents misleading reports.
Sorting Nullable Columns
Nullable values participate in sorting.
Example:
SELECT *
FROM employees
ORDER BY salary;
Depending on the query settings,
NULL values may appear before or after normal values.
Being aware of this behavior helps maintain predictable report ordering.
Nullable Columns and Storage
Every Nullable column requires ClickHouse to maintain an additional bitmap indicating whether each row contains NULL.
This bitmap consumes additional storage and introduces small processing overhead during query execution.
While the overhead is relatively small,
using Nullable() unnecessarily across dozens of columns can noticeably affect storage and performance.
For high-volume analytical workloads,
keeping schemas lean is often beneficial.
Choosing Between NULL and Default Values
Sometimes a default value communicates the business meaning more effectively.
Instead of storing:
NULL
consider:
Unknown
Pending
Not Assigned
N/A
only if these values accurately represent reality.
Remember:
NULL means "unknown."
A default value means "known."
The distinction matters.
Practical Data Cleaning Workflow
Suppose an organization wants every customer report to include contact information.
A simple workflow might look like this:
Check whether an email exists.
If email is missing, use the phone number.
If both are unavailable, display a default label.
Example:
SELECT
customer_name,
coalesce(email,
phone,
'No Contact Available') AS contact
FROM customers;
Every row now contains meaningful output without excluding incomplete records.
Common Mistakes
Many beginners misuse NULL values.
Common mistakes include:
- Comparing NULL using =
- Making every column Nullable()
- Using Nullable columns as primary keys unnecessarily
- Ignoring NULL behavior in aggregate functions
- Replacing unknown values with incorrect defaults
- Assuming NULL equals an empty string
Avoiding these issues results in cleaner data models and faster analytical queries.
Real-World Applications
Nullable columns appear in almost every analytical workload.
Examples include:
Customer Analytics
Optional phone numbers, secondary email addresses, referral codes.
IoT Platforms
Missing temperature, humidity, or pressure readings caused by sensor failures.
Financial Systems
Transactions without coupons, promotions, or affiliate IDs.
Application Monitoring
Optional request metadata, browser information, or debugging fields.
Healthcare
Unavailable laboratory measurements or incomplete patient records.
Each scenario benefits from representing unknown information correctly.
Best Practices
When working with NULL values in ClickHouse, consider the following guidelines:
- Use Nullable() only when missing information has genuine business meaning.
- Prefer default values whenever they accurately represent the data.
- Use IS NULL and IS NOT NULL for filtering.
- Avoid comparing NULL using standard comparison operators.
- Use ifNull() for simple replacements.
- Use coalesce() when multiple fallback values are available.
- Keep nullable columns out of primary and sorting keys whenever possible.
- Clean and validate data before ingestion.
- Document how NULL values are interpreted across your datasets.
- Review schema design periodically to eliminate unnecessary nullable columns.
Following these practices leads to more efficient storage, faster queries, and more reliable analytical results.
Conclusion
Handling missing data is far more than a technical detail—it is a fundamental aspect of designing reliable analytical systems.
ClickHouse® provides powerful support for managing incomplete information through nullable columns, default values, and a rich set of built-in functions such as ifNull(), coalesce(), nullIf(), isNull(), and assumeNotNull().
Choosing between NULL values and defaults requires understanding the business meaning behind the data. Used appropriately, these features help maintain data integrity while preserving ClickHouse's exceptional query performance.
By designing schemas carefully, minimizing unnecessary nullable columns, and applying the right functions during query execution, you can build analytical applications that remain both accurate and highly performant as your datasets continue to grow.
Top comments (0)