Yu-Chuan Hung

[Apache Iceberg] Iceberg Performance: The Hidden Cost of NULLS FIRST

Introduction

Apache Iceberg is a widely used table format in Data Lakehouse architectures. It provides flexibility in how data is written, with two key optimizations: partitioning, which splits data into segments, and sorting, which reorders data within those segments. These optimizations can significantly reduce the amount of data scanned by query engines, ultimately boosting query performance.

When querying data with high-cardinality columns (e.g., IDs or serial numbers), quickly filtering out unnecessary values is crucial. Sorting becomes particularly valuable in these scenarios. The rationale is simple: if data is written in order, query engines can rapidly locate the needed data rather than performing a full table scan and discarding irrelevant rows.

When configuring an Iceberg table's sort order, engineers can specify whether sorting follows ascending or descending order (ascending is the default), as well as where null values are placed (NULLS FIRST or NULLS LAST). While reading about this configuration, a question came to mind: is there any performance difference between these two null placements? If so, which one performs better, and why? To answer these questions, I designed an experiment.
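
As a concrete reference point, here is a minimal PySpark sketch of how two such tables could be declared. The session configuration, the category partition column, and the warehouse path are assumptions on my part (the actual setup lives in the repo linked in the next section); Iceberg's Spark SQL extensions expose sort order, including null placement, through ALTER TABLE ... WRITE ORDERED BY:

from pyspark.sql import SparkSession

# Sketch only: a local Hadoop catalog named "local", matching the table identifiers seen in
# the query plans later in this post. The Iceberg Spark runtime jar must be on the classpath
# (e.g., via --packages); the version and warehouse path are assumptions.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS local.db")

for table, null_order in [("test_nulls_first", "NULLS FIRST"), ("test_nulls_last", "NULLS LAST")]:
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS local.db.{table} (
            category STRING,   -- assumed partition column (partitions appear as cat_0..cat_9)
            value    BIGINT    -- the column being sorted and filtered in the experiment
        ) USING iceberg
        PARTITIONED BY (category)
    """)
    # Sort order and null placement are declared as the table's write order.
    spark.sql(f"ALTER TABLE local.db.{table} WRITE ORDERED BY value ASC {null_order}")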

Experiment

Detailed code and performance analysis can be found in my repo: https://github.com/CuteChuanChuan/Dive-Into-Iceberg

Testing Materials

  • Generated 1,000,000 rows with 30% null values

  • Created two identically configured Iceberg tables with different null sorting orders (i.e., NULLS FIRST vs. NULLS LAST)
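
A sketch of how such a dataset could be generated and written follows; the exact generation logic (null rate per category, value distribution) is an assumption here, and the real code is in the repo:

from pyspark.sql import functions as F

N_ROWS = 1_000_000

# Hypothetical data shape: a cat_0..cat_9 category column and a numeric value column
# that is NULL for roughly 30% of rows.
df = (
    spark.range(N_ROWS)
    .withColumn("category", F.concat(F.lit("cat_"), (F.col("id") % 10).cast("string")))
    .withColumn("value", F.when(F.rand(seed=42) < 0.30, F.lit(None)).otherwise(F.col("id")))
    .drop("id")
)

# Appending through the DataFrameWriterV2 API lets Iceberg request each table's declared
# write order (ASC NULLS FIRST vs. ASC NULLS LAST) for the incoming data.
for table in ("test_nulls_first", "test_nulls_last"):
    df.writeTo(f"local.db.{table}").append()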

Queries Executed to Evaluate Performance

  • select count(*) from table where value is not null

  • select sum(value) from table where value is not null

  • select avg(value) from table where value is not null

  • select count(*) from table where value is null

  • select count(*) from table

Performance Evaluation Metrics

  • Query plan: Whether different sorting orders generate different execution plans

  • Execution time with statistical analysis: Overall query time comparison

  • CPU profiling: Detailed CPU usage analysis

Findings

To obtain a complete picture, I planned three types of analysis. First, I compared query plans to see whether different null placements generate different plans, which might influence query performance. Second, I ran a statistical analysis on execution times for a rigorous comparison. Since execution time differences are only the observable outcome, the root cause still needs to be identified when a significant difference exists; therefore, if statistical significance was found, CPU profiling would be conducted as the final phase.
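
The physical plans shown below can be captured with Spark's explain output. A rough sketch of such a harness (the query texts mirror the list above; the harness itself is my assumption, not necessarily the repo's exact code):

queries = {
    "count_not_null": "SELECT count(*) FROM {t} WHERE value IS NOT NULL",
    "sum_not_null":   "SELECT sum(value) FROM {t} WHERE value IS NOT NULL",
    "avg_not_null":   "SELECT avg(value) FROM {t} WHERE value IS NOT NULL",
    "count_null":     "SELECT count(*) FROM {t} WHERE value IS NULL",
    "count_all":      "SELECT count(*) FROM {t}",
}

for name, template in queries.items():
    for table in ("local.db.test_nulls_first", "local.db.test_nulls_last"):
        print(f"--- {name} on {table} ---")
        spark.sql(template.format(t=table)).explain()   # prints the physical plan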

Query Plan

Details

select count(*) from table where value is not null

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1557]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnotnull(value#508)
               +- BatchScan local.db.test_nulls_first[value#508] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1574]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnotnull(value#521)
               +- BatchScan local.db.test_nulls_last[value#521] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

select sum(value) from table where value is not null

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#886)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3045]
      +- HashAggregate(keys=[], functions=[partial_sum(value#886)])
         +- Filter isnotnull(value#886)
            +- BatchScan local.db.test_nulls_first[value#886] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#899)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3064]
      +- HashAggregate(keys=[], functions=[partial_sum(value#899)])
         +- Filter isnotnull(value#899)
            +- BatchScan local.db.test_nulls_last[value#899] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

select avg(value) from table where value is not null

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1264)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4535]
      +- HashAggregate(keys=[], functions=[partial_avg(value#1264)])
         +- Filter isnotnull(value#1264)
            +- BatchScan local.db.test_nulls_first[value#1264] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1279)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4554]
      +- HashAggregate(keys=[], functions=[partial_avg(value#1279)])
         +- Filter isnotnull(value#1279)
            +- BatchScan local.db.test_nulls_last[value#1279] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []

select count(*) from table where value is null

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6023]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnull(value#1646)
               +- BatchScan local.db.test_nulls_first[value#1646] local.db.test_nulls_first (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6040]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- Project
            +- Filter isnull(value#1659)
               +- BatchScan local.db.test_nulls_last[value#1659] local.db.test_nulls_last (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []

select count(*) from table

# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1895L)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7045]
      +- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1895L)])
         +- Project [count(*)#1896L AS agg_func_0#1895L]
            +- LocalTableScan [count(*)#1896L]

# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1904L)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7060]
      +- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1904L)])
         +- Project [count(*)#1905L AS agg_func_0#1904L]
            +- LocalTableScan [count(*)#1905L]

Conclusion

For both tables, the execution plans for all queries are identical.

File-Level Statistics Analysis

Although the query plans are the same, a deeper look at the Parquet file statistics reveals important differences in how data is physically organized.

Partition Statistics Comparison

Below are the min/max statistics for each partition in both configurations:

Partition   NULLS FIRST           NULLS LAST            Min Value Difference
cat_0-2     All nulls             All nulls             N/A
cat_3       min=103, max=993      min=103, max=993      Same
cat_4       min=4, max=994        min=4, max=994        Same
cat_5       min=405, max=995      min=355, max=995      -50
cat_6       min=106, max=996      min=6, max=996        -100
cat_7       min=517, max=997      min=487, max=997      -30
cat_8       min=228, max=998      min=208, max=998      -20
cat_9       min=619, max=999      min=609, max=999      -10
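
Statistics like these can be pulled from Iceberg's metadata tables. A minimal sketch is below; note that lower_bounds and upper_bounds are keyed by column id and stored in a binary encoding, so some decoding (or the readable_metrics column available in recent Iceberg releases) is needed to get values like the ones above:

for table in ("local.db.test_nulls_first", "local.db.test_nulls_last"):
    print(f"=== {table} ===")
    spark.sql(f"""
        SELECT `partition`, record_count, null_value_counts, lower_bounds, upper_bounds
        FROM {table}.files
        ORDER BY `partition`
    """).show(truncate=False)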

Why Are Statistics Different?

The different min/max values reveal that physical data layout differs between the two configurations:

  1. Different File Boundaries: When sorting with NULLS FIRST vs. NULLS LAST, Spark writes data in different orders, causing file splits to occur at different points. Even though both tables contain identical data, the way rows are distributed across files differs.

  2. File Organization Pattern:

  • NULLS FIRST: Files begin with null values, followed by non-null values. The minimum non-null value appears after skipping nulls within each file.

  • NULLS LAST: Files begin with non-null values immediately. The minimum value is at or near the start of the file.

  3. Metadata Quality: NULLS LAST produces "better" statistics for non-null queries:

  • In NULLS FIRST (e.g., cat_6): min=106 means the file starts with nulls, and 106 is the first non-null value encountered.

  • In NULLS LAST (e.g., cat_6): min=6 means the file immediately starts with value 6, providing more accurate bounds.

Impact on Query Execution

For queries with WHERE value IS NOT NULL:

NULLS FIRST:

  • Files contain nulls at the beginning, causing mixed value distribution

  • Query engine must scan through null values before reaching non-null data

  • Statistics indicate the presence of non-null values, but they're not immediately accessible

NULLS LAST:

  • Files with non-null data have those values at the beginning

  • Query engine can immediately start processing valid values

  • Better sequential access pattern for counting non-null values

This file-level organization difference, combined with CPU microarchitecture optimizations, explains why NULLS LAST performs better for counting non-null values even though logical query plans are identical.

Execution Time Analysis

Data Collection

  • 5 queries, each executed 100 times
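
A hypothetical timing harness for this step, reusing the queries dictionary from the query-plan sketch earlier (the repo may measure this differently; wall-clock timing of a collect per run is my assumption):

import time

def time_query(sql: str, runs: int = 100) -> list[float]:
    """Run a query `runs` times and return per-run wall-clock times in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        spark.sql(sql).collect()   # force full execution of the query
        timings.append((time.perf_counter() - start) * 1000)
    return timings

results = {
    (name, table): time_query(template.format(t=f"local.db.{table}"))
    for name, template in queries.items()
    for table in ("test_nulls_first", "test_nulls_last")
}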

Statistical Methods

  • Paired t-test: test whether the mean execution times of the two configurations differ significantly

  • Cohen's d: quantify the effect size of the null ordering setting
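
A sketch of both statistics with SciPy, assuming the paired results structure from the timing sketch above; for paired samples, Cohen's d is computed here as the mean difference over the standard deviation of the differences, which is consistent with the figures reported below:

import numpy as np
from scipy import stats

def compare(first_ms: list[float], last_ms: list[float]) -> None:
    a, b = np.asarray(first_ms), np.asarray(last_ms)
    t_stat, p_value = stats.ttest_rel(a, b)      # paired t-test
    diff = a - b
    d = diff.mean() / diff.std(ddof=1)           # paired-samples Cohen's d
    print(f"t = {t_stat:.4f}, p = {p_value:.6f}, d = {d:.4f}, "
          f"mean diff = {diff.mean():.2f} ms ({100 * diff.mean() / a.mean():.2f}%)")

compare(results[("count_not_null", "test_nulls_first")],
        results[("count_not_null", "test_nulls_last")])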

Details

select count(*) from table where value is not null: NULLS LAST performs better

Descriptive Statistics:
  NULLS FIRST: mean=41.46ms, sd=8.38ms
  NULLS LAST:  mean=31.55ms, sd=2.40ms

Paired t-test:
  t-statistic = 11.9367 
  p-value = 0.000000 
  95% CI: [8.26, 11.55] ms
  Result: *** HIGHLY SIGNIFICANT (p < 0.001)

Effect Size (Cohen's d):
  d = 1.1937 
  Interpretation: Large 

Summary:
  Mean difference: 9.91 ms
  Percentage difference: 23.90 %
  Winner: NULLS LAST

select sum(value) from table where value is not null: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=34.14ms, sd=5.12ms
  NULLS LAST:  mean=33.40ms, sd=6.43ms

Paired t-test:
  t-statistic = 0.8759 
  p-value = 0.383195 
  95% CI: [-0.94, 2.43] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

select avg(value) from table where value is not null: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=28.84ms, sd=3.42ms
  NULLS LAST:  mean=27.95ms, sd=3.26ms

Paired t-test:
  t-statistic = 1.9654 
  p-value = 0.052165 
  95% CI: [-0.01, 1.80] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

select count(*) from table where value is null: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=24.00ms, sd=4.64ms
  NULLS LAST:  mean=23.16ms, sd=3.43ms

Paired t-test:
  t-statistic = 1.3804 
  p-value = 0.170582 
  95% CI: [-0.37, 2.05] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

select count(*) from table: Not significantly different

Descriptive Statistics:
  NULLS FIRST: mean=14.95ms, sd=2.41ms
  NULLS LAST:  mean=14.39ms, sd=2.45ms

Paired t-test:
  t-statistic = 1.6356 
  p-value = 0.105090 
  95% CI: [-0.12, 1.25] ms
  Result: NOT SIGNIFICANT (p >= 0.05)

Conclusion

NULLS LAST is significantly faster than NULLS FIRST when counting non-null values; the other four queries show no statistically significant difference.

CPU Profiling: Analyzing Count Non-Null Values Query

Details

Please refer to the flame graphs in my repo.

The performance difference observed in execution time analysis can be attributed to both file-level organization and CPU microarchitecture optimizations:

  1. File-Level Organization Impact: As shown in the file statistics analysis, NULLS LAST creates files where non-null values are positioned at the beginning. This layout means when the query engine scans data with WHERE value IS NOT NULL, it immediately encounters a continuous block of valid values rather than having to skip over nulls first. This reduces unnecessary I/O operations and deserialization overhead.

  2. CPU Microarchitecture Optimizations:

    1. SIMD (Single Instruction, Multiple Data): Modern CPUs can process multiple data elements simultaneously using SIMD instructions. When counting non-null values with NULLS LAST, the query engine encounters a continuous block of non-null values at the start of each file. This layout allows SIMD instructions to efficiently process multiple valid values in parallel. For example, when checking isnotnull(value) on 8 consecutive values that are all non-null, a single SIMD instruction can validate and count them in one operation.
    2. Branch Prediction: Modern CPUs use branch predictors to anticipate the outcome of conditional statements (like if (value != null)). With NULLS LAST, the query engine scans data following a highly predictable pattern: a long sequence of non-null values followed by nulls. This consistency allows the branch predictor to achieve high accuracy, keeping the CPU pipeline running smoothly. In contrast, NULLS FIRST presents a less predictable pattern at file boundaries where nulls transition to non-nulls, potentially causing pipeline stalls.

The CPU profiling data supports these optimizations: NULLS LAST (2,238 samples) uses approximately 11.7% less CPU time than NULLS FIRST (2,536 samples). This reduction results from the combined effects of better file organization, improved SIMD vectorization, and enhanced branch prediction accuracy.

Conclusion

NULLS LAST occupies less CPU time due to a combination of better file-level data organization and CPU microarchitecture optimizations.

Conclusion and Future Exploration

This exploration reveals that while different null value placements do not produce different query plans, they can still significantly affect the performance of certain queries through physical data organization.

Key Findings:

  1. File-Level Statistics Matter: NULLS LAST produces better min/max statistics, with non-null values positioned at file beginnings. This creates more favorable data layouts for queries filtering on non-null values.

  2. CPU Microarchitecture Synergy: The continuous blocks of non-null values in NULLS LAST enable CPU optimizations including SIMD vectorization and improved branch prediction, resulting in ~11.7% less CPU time.

  3. Significant Performance Impact: For SELECT COUNT(*) WHERE value IS NOT NULL, NULLS LAST achieves 23.90% faster execution time—a substantial improvement for such a common OLAP operation.

Practical Recommendations:

If counting non-null values is a frequent operation in your workload—which is common in OLAP scenarios—configuring Iceberg tables with NULLS LAST can provide measurable performance improvements. The benefits stem from both better file organization and CPU-level optimizations working in tandem.

Future Exploration:

This experiment tested 5 queries on a 1-million-row dataset with 30% null values. Future investigations could explore:

  • Various query patterns frequently used in OLAP scenarios (e.g., window functions like LAG, complex aggregations)

  • Larger datasets with multiple files per partition to amplify metadata pruning effects

  • Different null percentage distributions (10%, 50%, 70%) to understand the threshold where NULLS LAST benefits diminish

  • Impact on different data types (strings, decimals) and column cardinalities

  • Performance with Iceberg's metadata-based filtering in more complex predicates

These investigations would provide a more complete understanding of optimal Iceberg table sorting configurations across diverse workloads.
