Introduction
Apache Iceberg is a widely used table format in data lakehouse architectures. It provides flexibility in how data is written, with two key optimizations: partitioning, which splits data into segments, and sorting, which reorders data within those segments. These optimizations can significantly reduce the amount of data scanned by query engines, ultimately boosting query performance.
When querying data with high-cardinality columns (e.g., IDs or serial numbers), quickly filtering out unnecessary values is crucial. Sorting becomes particularly valuable in these scenarios. The rationale is simple: if data is written in order, query engines can rapidly locate the needed data rather than performing a full table scan and discarding irrelevant rows.
When configuring Iceberg table sort properties, engineers can specify whether sorting follows ascending or descending order—with ascending as the default. While reading about this configuration, a question came to mind: Is there any performance difference between these two ordering approaches? If so, which one performs better, and why? To answer these questions, I designed an experiment to find out.
Experiment
Detailed code and performance analysis can be found in my repo: https://github.com/CuteChuanChuan/Dive-Into-Iceberg
Testing Materials
Generated 1,000,000 rows with 30% null values
Created two identically configured Iceberg tables with different null sorting orders (i.e., NULLS FIRST vs. NULLS LAST)
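The test data can be sketched in pure Python as follows. This is illustrative only: the actual experiment in the repo writes the rows into Iceberg tables via Spark, and the column names (`category`, `value`) and value range here are assumptions, not taken from the repo.

```python
import random

def generate_rows(n_rows: int, null_ratio: float, seed: int = 42):
    """Generate (category, value) rows where `value` is null at the given ratio."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        category = f"cat_{i % 10}"  # low-cardinality column, as in the partition table below
        value = None if rng.random() < null_ratio else rng.randint(0, 999)
        rows.append((category, value))
    return rows

rows = generate_rows(1_000_000, null_ratio=0.30)
null_count = sum(1 for _, v in rows if v is None)
print(f"{len(rows)} rows, {null_count / len(rows):.1%} nulls")
```

The same row set would then be written twice, once into a table sorted with NULLS FIRST and once with NULLS LAST.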
Queries Executed to Evaluate Performance
select count(*) from table where value is not null
select sum(value) from table where value is not null
select avg(value) from table where value is not null
select count(*) from table where value is null
select count(*) from table
Performance Evaluation Metrics
Query plan: Whether different sorting orders generate different execution plans
Execution time with statistical analysis: Overall query time comparison
CPU profiling: Detailed CPU usage analysis
Findings
To obtain a complete picture, I planned three types of analysis. First, I compared query plans to see whether different null placements generate different plans, which might influence query performance. Second, I ran a statistical analysis of execution times for a rigorous examination. Since query time differences are only the observable outcome, the root cause still needs to be identified when significant differences exist; therefore, CPU profiling would be conducted in the final phase if statistical significance was found.
Query Plan
Details
select count(*) from table where value is not null
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1557]
+- HashAggregate(keys=[], functions=[partial_count(1)])
+- Project
+- Filter isnotnull(value#508)
+- BatchScan local.db.test_nulls_first[value#508] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=1574]
+- HashAggregate(keys=[], functions=[partial_count(1)])
+- Project
+- Filter isnotnull(value#521)
+- BatchScan local.db.test_nulls_last[value#521] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
select sum(value) from table where value is not null
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#886)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3045]
+- HashAggregate(keys=[], functions=[partial_sum(value#886)])
+- Filter isnotnull(value#886)
+- BatchScan local.db.test_nulls_first[value#886] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(value#899)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=3064]
+- HashAggregate(keys=[], functions=[partial_sum(value#899)])
+- Filter isnotnull(value#899)
+- BatchScan local.db.test_nulls_last[value#899] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
select avg(value) from table where value is not null
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1264)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4535]
+- HashAggregate(keys=[], functions=[partial_avg(value#1264)])
+- Filter isnotnull(value#1264)
+- BatchScan local.db.test_nulls_first[value#1264] local.db.test_nulls_first (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(value#1279)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=4554]
+- HashAggregate(keys=[], functions=[partial_avg(value#1279)])
+- Filter isnotnull(value#1279)
+- BatchScan local.db.test_nulls_last[value#1279] local.db.test_nulls_last (branch=null) [filters=value IS NOT NULL, groupedBy=] RuntimeFilters: []
select count(*) from table where value is null
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6023]
+- HashAggregate(keys=[], functions=[partial_count(1)])
+- Project
+- Filter isnull(value#1646)
+- BatchScan local.db.test_nulls_first[value#1646] local.db.test_nulls_first (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=6040]
+- HashAggregate(keys=[], functions=[partial_count(1)])
+- Project
+- Filter isnull(value#1659)
+- BatchScan local.db.test_nulls_last[value#1659] local.db.test_nulls_last (branch=null) [filters=value IS NULL, groupedBy=] RuntimeFilters: []
select count(*) from table
# Null First
Query Plan (NULLS First):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1895L)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7045]
+- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1895L)])
+- Project [count(*)#1896L AS agg_func_0#1895L]
+- LocalTableScan [count(*)#1896L]
# Null Last
Query Plan (NULLS Last):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(agg_func_0#1904L)])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=7060]
+- HashAggregate(keys=[], functions=[partial_sum(agg_func_0#1904L)])
+- Project [count(*)#1905L AS agg_func_0#1904L]
+- LocalTableScan [count(*)#1905L]
Conclusion
For both tables, the execution plans for all queries are identical.
File-Level Statistics Analysis
Although the query plans are the same, a deeper look at the Parquet file statistics reveals important differences in how data is physically organized.
Partition Statistics Comparison
Below are the min/max statistics for each partition in both configurations:
| Partition | NULLS FIRST | NULLS LAST | Min Value Difference |
|---|---|---|---|
| cat_0-2 | All nulls | All nulls | N/A |
| cat_3 | min=103, max=993 | min=103, max=993 | Same |
| cat_4 | min=4, max=994 | min=4, max=994 | Same |
| cat_5 | min=405, max=995 | min=355, max=995 | -50 |
| cat_6 | min=106, max=996 | min=6, max=996 | -100 |
| cat_7 | min=517, max=997 | min=487, max=997 | -30 |
| cat_8 | min=228, max=998 | min=208, max=998 | -20 |
| cat_9 | min=619, max=999 | min=609, max=999 | -10 |
Why Are Statistics Different?
The different min/max values reveal that physical data layout differs between the two configurations:
Different File Boundaries: When sorting with NULLS FIRST vs. NULLS LAST, Spark writes data in different orders, causing file splits to occur at different points. Even though both tables contain identical data, the way rows are distributed across files differs.
File Organization Pattern:
NULLS FIRST: Files begin with null values, followed by non-null values. The minimum non-null value appears after skipping nulls within each file.
NULLS LAST: Files begin with non-null values immediately. The minimum value is at or near the start of the file.
Metadata Quality: NULLS LAST produces "better" statistics for non-null queries:
In NULLS FIRST (e.g., cat_6): min=106 means the file starts with nulls, and 106 is the first non-null value encountered.
In NULLS LAST (e.g., cat_6): min=6 means the file immediately starts with value 6, providing more accurate bounds.
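The boundary effect can be made concrete with a small sketch: sort the same values with nulls first vs. nulls last, split them into fixed-size "files", and compute each file's min/max over its non-null values, as a simplified stand-in for Parquet footer statistics. The file size and data here are made up for illustration.

```python
def sort_with_nulls(values, nulls_first: bool):
    """Sort values, placing nulls before or after the non-null values."""
    nulls = [v for v in values if v is None]
    non_nulls = sorted(v for v in values if v is not None)
    return nulls + non_nulls if nulls_first else non_nulls + nulls

def per_file_min_max(sorted_values, file_size: int):
    """Split into fixed-size files and report (min, max) of non-null values per file."""
    stats = []
    for start in range(0, len(sorted_values), file_size):
        chunk = [v for v in sorted_values[start:start + file_size] if v is not None]
        stats.append((min(chunk), max(chunk)) if chunk else None)  # None = all-null file
    return stats

values = [None] * 6 + list(range(1, 15))  # 6 nulls plus values 1..14
first = per_file_min_max(sort_with_nulls(values, True), file_size=5)
last = per_file_min_max(sort_with_nulls(values, False), file_size=5)
print("NULLS FIRST:", first)  # [None, (1, 4), (5, 9), (10, 14)]
print("NULLS LAST :", last)   # [(1, 5), (6, 10), (11, 14), None]
```

Identical data, but the file boundaries land in different places, so the per-file min/max bounds differ, which is exactly what the partition statistics table above shows.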
Impact on Query Execution
For queries with WHERE value IS NOT NULL:
NULLS FIRST:
Files contain nulls at the beginning, causing mixed value distribution
Query engine must scan through null values before reaching non-null data
Statistics indicate the presence of non-null values, but they're not immediately accessible
NULLS LAST:
Files with non-null data have those values at the beginning
Query engine can immediately start processing valid values
Better sequential access pattern for counting non-null values
This file-level organization difference, combined with CPU microarchitecture optimizations, explains why NULLS LAST performs better for counting non-null values even though logical query plans are identical.
Execution Time Analysis
Data Collection
5 queries, each executed 100 times
Statistical Methods
T-test: Compare whether query times are statistically different
Cohen's d: Calculate the effect size of null ordering settings
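Both statistics follow from the per-run timing differences. Below is a minimal stdlib sketch of the formulas with made-up sample timings; the real analysis in the repo uses 100 paired runs per query.

```python
import math

def paired_t_and_cohens_d(a, b):
    """Paired t-statistic and Cohen's d computed on the pairwise differences."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    sd_d = math.sqrt(var_d)
    t_stat = mean_d / (sd_d / math.sqrt(n))  # paired t-test statistic
    cohens_d = mean_d / sd_d                 # effect size of the mean difference
    return t_stat, cohens_d

nulls_first = [41.0, 43.5, 40.2, 42.8, 39.9, 44.1]  # hypothetical ms timings
nulls_last = [31.2, 32.0, 30.8, 31.9, 31.1, 32.4]
t, d = paired_t_and_cohens_d(nulls_first, nulls_last)
print(f"t = {t:.2f}, Cohen's d = {d:.2f}")
```

Note that for paired samples t = d * sqrt(n), which is consistent with the reported results below (d = 1.1937 with n = 100 gives t = 11.937).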
Details
select count(*) from table where value is not null: Null Last performs better
Descriptive Statistics:
NULLS FIRST: mean=41.46ms, sd=8.38ms
NULLS LAST: mean=31.55ms, sd=2.40ms
Paired t-test:
t-statistic = 11.9367
p-value = 0.000000
95% CI: [8.26, 11.55] ms
Result: *** HIGHLY SIGNIFICANT (p < 0.001)
Effect Size (Cohen's d):
d = 1.1937
Interpretation: Large
Summary:
Mean difference: 9.91 ms
Percentage difference: 23.90 %
Winner: NULLS LAST
select sum(value) from table where value is not null: Not significantly different
Descriptive Statistics:
NULLS FIRST: mean=34.14ms, sd=5.12ms
NULLS LAST: mean=33.40ms, sd=6.43ms
Paired t-test:
t-statistic = 0.8759
p-value = 0.383195
95% CI: [-0.94, 2.43] ms
Result: NOT SIGNIFICANT (p >= 0.05)
select avg(value) from table where value is not null: Not significantly different
Descriptive Statistics:
NULLS FIRST: mean=28.84ms, sd=3.42ms
NULLS LAST: mean=27.95ms, sd=3.26ms
Paired t-test:
t-statistic = 1.9654
p-value = 0.052165
95% CI: [-0.01, 1.80] ms
Result: NOT SIGNIFICANT (p >= 0.05)
select count(*) from table where value is null: Not significantly different
Descriptive Statistics:
NULLS FIRST: mean=24.00ms, sd=4.64ms
NULLS LAST: mean=23.16ms, sd=3.43ms
Paired t-test:
t-statistic = 1.3804
p-value = 0.170582
95% CI: [-0.37, 2.05] ms
Result: NOT SIGNIFICANT (p >= 0.05)
select count(*) from table: Not significantly different
Descriptive Statistics:
NULLS FIRST: mean=14.95ms, sd=2.41ms
NULLS LAST: mean=14.39ms, sd=2.45ms
Paired t-test:
t-statistic = 1.6356
p-value = 0.105090
95% CI: [-0.12, 1.25] ms
Result: NOT SIGNIFICANT (p >= 0.05)
Conclusion
NULLS LAST is significantly faster than NULLS FIRST when counting non-null values.
CPU Profiling: Analyzing Count Non-Null Values Query
Details
Please refer to the flame graphs in my repo.
The performance difference observed in execution time analysis can be attributed to both file-level organization and CPU microarchitecture optimizations:
File-Level Organization Impact: As shown in the file statistics analysis, NULLS LAST creates files where non-null values are positioned at the beginning. When the query engine scans data with WHERE value IS NOT NULL, it immediately encounters a continuous block of valid values rather than having to skip over nulls first. This reduces unnecessary I/O operations and deserialization overhead.
CPU Microarchitecture Optimizations:
SIMD (Single Instruction, Multiple Data): Modern CPUs can process multiple data elements simultaneously using SIMD instructions. When counting non-null values with NULLS LAST, the query engine encounters a continuous block of non-null values at the start of each file. This layout allows SIMD instructions to process multiple valid values in parallel. For example, when checking isnotnull(value) on 8 consecutive values that are all non-null, a single SIMD instruction can validate and count them in one operation.
Branch Prediction: Modern CPUs use branch predictors to anticipate the outcome of conditional statements (like if (value != null)). With NULLS LAST, the query engine scans data in a highly predictable pattern: a long sequence of non-null values followed by nulls. This consistency allows the branch predictor to achieve high accuracy, keeping the CPU pipeline running smoothly. In contrast, NULLS FIRST presents a less predictable pattern at file boundaries where nulls transition to non-nulls, potentially causing pipeline stalls.
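Python cannot expose SIMD or the branch predictor directly, but the layout argument can be made concrete by counting how many null-to-non-null (and back) switches a sequential scan encounters: fewer switches means a more predictable branch pattern. This sketch uses made-up data rather than the experiment's actual files.

```python
import random
from itertools import groupby

def branch_transitions(values):
    """Number of null <-> non-null switches a sequential scan would branch on."""
    runs = [is_null for is_null, _ in groupby(values, key=lambda v: v is None)]
    return len(runs) - 1

rng = random.Random(0)
data = [None if rng.random() < 0.3 else rng.randint(0, 999) for _ in range(10_000)]

interleaved_t = branch_transitions(data)  # nulls scattered throughout
nulls_last_t = branch_transitions(sorted(data, key=lambda v: v is None))  # non-nulls, then nulls
print(f"interleaved: {interleaved_t} transitions, nulls-last: {nulls_last_t} transition(s)")
```

With all nulls grouped at one end there is exactly one transition per file, whereas an interleaved layout forces thousands of hard-to-predict branches.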
The CPU profiling data supports these optimizations: NULLS LAST (2,238 samples) uses approximately 11.7% less CPU time than NULLS FIRST (2,536 samples). This reduction results from the combined effects of better file organization, improved SIMD vectorization, and enhanced branch prediction accuracy.
Conclusion
NULLS LAST occupies less CPU time due to a combination of better file-level data organization and CPU microarchitecture optimizations.
Conclusion and Future Exploration
This exploration reveals that while different null value placements do not create different query plans, they significantly impact query performance through physical data organization.
Key Findings:
File-Level Statistics Matter: NULLS LAST produces better min/max statistics, with non-null values positioned at file beginnings. This creates more favorable data layouts for queries filtering on non-null values.
CPU Microarchitecture Synergy: The continuous blocks of non-null values in NULLS LAST enable CPU optimizations including SIMD vectorization and improved branch prediction, resulting in ~11.7% less CPU time.
Significant Performance Impact: For SELECT COUNT(*) WHERE value IS NOT NULL, NULLS LAST achieves 23.90% faster execution time, a substantial improvement for such a common OLAP operation.
Practical Recommendations:
If counting non-null values is a frequent operation in your workload—which is common in OLAP scenarios—configuring Iceberg tables with NULLS LAST can provide measurable performance improvements. The benefits stem from both better file organization and CPU-level optimizations working in tandem.
Future Exploration:
This experiment tested 5 queries on a 1-million-row dataset with 30% null values. Future investigations could explore:
Various query patterns frequently used in OLAP scenarios (e.g., window functions like LAG, complex aggregations)
Larger datasets with multiple files per partition to amplify metadata pruning effects
Different null percentage distributions (10%, 50%, 70%) to understand the threshold where NULLS LAST benefits diminish
Impact on different data types (strings, decimals) and column cardinalities
Performance with Iceberg's metadata-based filtering in more complex predicates
These investigations would provide a more complete understanding of optimal Iceberg table sorting configurations across diverse workloads.