delta lake at small scale forgives almost every design sin: append-only writes, no partitioning, no clustering, no compaction. The bill arrives the day the table crosses 10 TB — that is when junior teams discover that a SELECT * FROM events WHERE country = 'DE' AND customer_id = 17 reads forty thousand files instead of four, and that an upstream Change Data Capture pipeline rewrites a third of the table on every nightly merge.
This guide is the senior-engineer tuning playbook for the two highest-leverage Delta features in 2026 — delta lake cdf (Change Data Feed) for row-state propagation and Z-Ordering for multi-dimensional file skipping — wired together with the OPTIMIZE bin-packer, the VACUUM retention timeline, deletion vectors, and the new liquid clustering primitive. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the optimization practice library →, stack the streaming and Δ-throughput drills →, and rehearse on Databricks-flavoured interview problems →.
On this page
- Why CDF + Z-Order — when each wins
- CDF mechanics — _change_type, version, timestamp
- Z-Ordering deep dive — multi-dim clustering, file skipping
- OPTIMIZE + Auto Compaction — bin-packing, optimal write target
- Production tuning — VACUUM, deletion vectors, liquid clustering
- Cheat sheet — Delta Lake tuning recipes
- Frequently asked questions
- Practice on PipeCode
1. Why CDF + Z-Order — when each wins
Two axes of Delta tuning — row-state propagation (CDF) and file-skip locality (Z-Order)
The one-sentence invariant: Delta Lake performance at scale is two orthogonal problems — propagating row-state cheaply (CDF) and pruning files on read cheaply (Z-Order) — and you tune each axis with a different set of knobs. Once you internalise that "CDF is for downstream sync, Z-Order is for filtered reads," every Delta tuning interview question becomes a placement exercise on this 2x2 matrix.
The 2x2 placement matrix.
-
Write-heavy + downstream consumers —
delta lake cdfwins. You pay write-amp once and downstream materialised views consume tiny delta streams instead of re-reading the whole table. -
Read-heavy + multi-dim filters —
z-orderingwins. You pay an OPTIMIZE pass once per cycle and every subsequent query skips 80–98% of files via min/max pruning. - Write-heavy + read-heavy — both, with OPTIMIZE on a separate hot-set, and CDF only on the dimensions that drive downstream MVs.
- Append-only + single-column filter — neither. Partitioning on the filter column plus auto compaction is cheaper than Z-Order's full rewrite.
Three places naive Delta tables bleed.
-
Small-file tax. Streaming writes create thousands of tiny parquet files per hour. A
SELECT *scans every file's footer to read min/max stats — the metadata cost alone can dwarf the data cost. Fix: OPTIMIZE bin-packs into 1 GB targets; auto compaction handles it at write-time. -
Full-table merges. A nightly
MERGE INTO target USING source ON target.key = source.keyagainst a non-clustered table rewrites every matched file even when only 0.1% of rows changed. Fix: enable deletion vectors for row-level deletes; cluster the merge key with Z-Order or liquid clustering so the file-touch set is small. -
History blindspots. Downstream materialised views and SCD-2 dimensions re-read the whole table because they have no incremental contract. Fix: enable CDF and consume
table_changes()between versions — read only the Δ rows.
What interviewers listen for.
- Do you say "CDF is for propagation, Z-Order is for pruning" when asked which to use? — senior signal.
- Do you mention that Z-Order is a full rewrite and OPTIMIZE is idempotent? — required answer.
- Do you bring up liquid clustering as the 2024+ successor to Z-Order on greenfield tables? — senior signal.
- Do you separate the VACUUM retention floor from time-travel coverage? — required answer.
The 2026 reality.
-
CDF is GA across Databricks Runtime ≥ 8.4 and OSS Delta ≥ 2.0; it emits
_change_type,_commit_version,_commit_timestampcolumns out of the read API. - Z-Order is still the workhorse for existing tables; greenfield tables on Databricks Runtime ≥ 14 should default to liquid clustering.
-
Deletion vectors are GA in Delta 3.0+ and on Databricks; they convert
DELETE/UPDATEfrom "rewrite affected files" to "mark deleted rows in a sidecar bitmap." -
The 168-hour VACUUM floor is enforced by
spark.databricks.delta.retentionDurationCheck.enabled. Lowering it disables time-travel and breaks audit guarantees.
Detailed explanation — the four-knob tuning model
Detailed explanation. Senior data engineers carry a mental model with four knobs: file count (OPTIMIZE / auto compaction), file layout (Z-Order / liquid clustering), row-state propagation (CDF), and storage hygiene (VACUUM / deletion vectors). Every Delta performance question maps to one of those four knobs, and an interviewer who hears all four named gets confidence the candidate has tuned a table at scale.
Question. Walk through a 50 TB Delta Bronze table with 5-minute streaming writes from Kafka and a nightly downstream MERGE that updates 0.5% of rows. Which Delta features do you enable, and in what order?
Input.
| Workload trait | Value |
|---|---|
| Table size | 50 TB |
| Write cadence | 5-minute Structured Streaming micro-batches |
| Write volume | ~ 800 GB/day |
| Read cadence | nightly MERGE + ad-hoc analyst queries |
| Filter columns |
country, customer_id, event_ts
|
Code.
-- 1) Enable optimized writes + auto compaction (handles small-file tax)
ALTER TABLE bronze.events SET TBLPROPERTIES (
delta.autoOptimize.optimizeWrite = true,
delta.autoOptimize.autoCompact = true
);
-- 2) Enable CDF so downstream MVs consume only the delta
ALTER TABLE bronze.events SET TBLPROPERTIES (
delta.enableChangeDataFeed = true
);
-- 3) Partition by event_date (low-card) + Z-Order by (country, customer_id)
ALTER TABLE bronze.events PARTITIONED BY (event_date);
OPTIMIZE bronze.events ZORDER BY (country, customer_id);
-- 4) Enable deletion vectors for row-level deletes without file rewrites
ALTER TABLE bronze.events SET TBLPROPERTIES (
delta.enableDeletionVectors = true
);
-- 5) Keep VACUUM at the 168h floor for time-travel + audit safety
VACUUM bronze.events RETAIN 168 HOURS;
Step-by-step explanation.
- Auto compaction + optimized writes go on first — they pay for themselves immediately by collapsing the small-file tax that streaming writes inflict. Optimized writes does an adaptive shuffle at write-time targeting 128 MB writes; auto compaction post-batch bin-packs those into 1 GB files.
-
CDF is the cheapest "future-proofing" knob — it costs only metadata until the day a downstream consumer asks for
table_changes(0, latest). Enable it before the table grows; enabling on a 50 TB table later requires a full rewrite of the snapshot. -
Partitioning + Z-Order is the layout choice.
event_dateis low-cardinality and a natural time partition (one partition per day).countryandcustomer_idare high-cardinality and appear in every WHERE — perfect Z-Order targets. Z-Order inside each partition keeps the file-skip ratio above 95%. - Deletion vectors flip the merge cost equation: an UPDATE of 0.5% of rows would normally rewrite 0.5% of files (worst case 100% of files when the changed rows are scattered). With DVs, the engine writes a tiny bitmap that masks deleted rows from reads — no file rewrite until a future OPTIMIZE.
- VACUUM stays at 168 hours — the default and the contractual floor for time-travel. Lowering it to "save storage" breaks the audit story and is the single most common rollback after a senior reviewer catches it.
Output.
| Knob | Effect | Cost |
|---|---|---|
| Optimized writes + auto compaction | small-file tax → near-zero | +5-15% write CPU |
| CDF | downstream MV refresh from Δ instead of full scan | +metadata only until read |
| Partition + Z-Order | 95%+ file skipping on WHERE filters | one OPTIMIZE pass per cycle |
| Deletion vectors | UPDATE/DELETE without file rewrite | small bitmap per modified file |
| VACUUM 168h | time-travel + audit preserved | 168h of tombstone storage |
Rule of thumb. Order matters — enable the cheap futures (CDF, DVs, auto compaction) before the table grows. Layout decisions (partition + Z-Order or liquid clustering) should be made on the first write, because retro-fitting them on a 50 TB table is a 12-hour OPTIMIZE job that scales linearly with table size.
Detailed explanation — when partitioning beats Z-Order
Detailed explanation. Z-Order is not always the answer. Low-cardinality columns — country with 200 values, event_date with 365 values per year — are textbook partitioning columns, not Z-Order columns. Partitioning lets the planner skip entire directories before any min/max stats are read; Z-Order is for the inside of a partition.
Question. Given a table with two filter columns — event_date (cardinality 365 over the table's lifetime) and customer_id (cardinality 50M) — show the right layout and the wrong layout, and why the right one is 6× cheaper on a typical analyst query.
Input.
| Filter | Cardinality | Selectivity per WHERE |
|---|---|---|
event_date |
365 | 1 day = 1/365 |
customer_id |
50M | 1 cust = 1/50M |
Code.
-- WRONG — Z-Order on the low-cardinality column
ALTER TABLE events PARTITIONED BY (); -- no partition
OPTIMIZE events ZORDER BY (event_date, customer_id);
-- RIGHT — partition by date, Z-Order by customer
ALTER TABLE events PARTITIONED BY (event_date);
OPTIMIZE events ZORDER BY (customer_id);
Step-by-step explanation.
- With Z-Order alone on
(event_date, customer_id), the planner must read min/max stats for every file in the table to find which ones touch the target date. On a 50 TB table that is ~50,000 files; reading 50K parquet footers is roughly 30-60 seconds of metadata work before the actual data scan begins. - With
event_dateas a partition column, the planner skips entire directory partitions on the date filter —WHERE event_date = '2026-06-13'reads exactly one partition's directory listing, ~137 files for a 50 TB / 365 day table. - Inside that one partition, Z-Order on
customer_idclusters records with similar customer IDs into the same parquet file. A query likeWHERE event_date = '2026-06-13' AND customer_id = 17then reads ~1 file, not 137. - The net query touches 1 file vs the Z-Order-only design's 50,000 file metadata reads + ~137 file scans. A real measured speedup on this shape is 6×–40× depending on warm cache.
Output.
| Design | Files touched (metadata) | Files scanned | Approx latency |
|---|---|---|---|
Z-Order only on (date, customer)
|
50,000 | ~137 | 38 s |
| Partition by date + Z-Order on customer | 137 | 1 | 4 s |
Rule of thumb. Cardinality determines the role. Low-card (< 1000 distinct, time-correlated) → partition column. High-card (> 10K distinct, appears in WHERE) → Z-Order column. Z-Ordering a low-card column wastes the rewrite budget; partitioning a high-card column creates a directory-explosion disaster (50M sub-directories is unmanageable on object storage).
Detailed explanation — the cost of NOT enabling CDF
Detailed explanation. Some teams skip CDF "because we don't need it yet." The cost arrives the day a downstream consumer needs incremental refresh and discovers the only option is a full-table diff against a snapshot — which on a 50 TB table is a 4-hour, 50 TB read, every night.
Question. Given a Bronze events table feeding three Silver materialised views with daily refresh, compare the storage + compute cost of running with CDF vs without.
Input.
| Trait | Value |
|---|---|
| Bronze size | 50 TB |
| Daily delta (changed rows) | 1% = 500 GB |
| Silver MVs | 3 dependent views |
Code.
-- Without CDF — every MV refresh re-reads the whole Bronze table
INSERT OVERWRITE silver.events_by_country
SELECT country, COUNT(*) AS event_count
FROM bronze.events
GROUP BY country;
-- Reads 50 TB nightly per MV → 3 MVs × 50 TB = 150 TB scanned
-- With CDF enabled — Silver MV reads only the delta
INSERT INTO silver.events_by_country
SELECT country, _change_type, _commit_version, COUNT(*) AS event_count
FROM table_changes('bronze.events', :last_version, :current_version)
GROUP BY country, _change_type, _commit_version;
-- Reads ~500 GB nightly per MV → 3 MVs × 500 GB = 1.5 TB scanned
Step-by-step explanation.
- Without CDF the only correct incremental contract is "diff today's snapshot against yesterday's." Most teams skip this and full-scan instead — 50 TB read × 3 MVs = 150 TB scanned nightly, ~$300/day at typical object-storage egress + compute prices.
- With CDF the read API exposes only the rows that changed between two commit versions. The job reads ~500 GB (1% delta × 3 MVs) instead of 150 TB — a 100× reduction in scan volume.
- The CDF cost is paid on the write side: roughly 3-7% extra storage for the change-data sidecar files and ~5% write CPU. Two days of CDF storage costs less than one MV full-scan.
- The crossover point — where CDF pays for itself — is usually one downstream consumer. By the third or fourth dependent MV, the cost difference is roughly 20-30×.
Output.
| Refresh mode | Bytes scanned per night | Compute cost (relative) |
|---|---|---|
| Without CDF (3 full scans) | 150 TB | 100x |
| With CDF (3 Δ reads) | 1.5 TB | 1x |
Rule of thumb. Enable CDF on every table that has — or might have — more than one downstream consumer. The metadata-only cost when nobody reads is negligible; the rewrite cost of enabling it on a multi-TB table later is brutal.
Senior interview question on Delta Lake tuning strategy
A senior interviewer often opens with: "You inherit a 50 TB Delta Bronze table with 200K small files, no CDF, no partitioning, and a 24-hour SLA on a downstream Silver refresh that currently takes 6 hours. Where do you start?" The probe blends file-count remediation, layout strategy, and CDF enablement into a single decision sequence.
Solution Using a four-step Delta Lake remediation playbook
-- Step 1 — measure the damage
DESCRIBE DETAIL bronze.events;
-- numFiles, sizeInBytes, partitionColumns, properties
-- Step 2 — emergency OPTIMIZE (bin-pack the small files)
OPTIMIZE bronze.events;
-- 200K small files → ~50 files of ~1 GB each
-- Step 3 — enable CDF + auto compaction so the problem doesn't return
ALTER TABLE bronze.events SET TBLPROPERTIES (
delta.enableChangeDataFeed = true,
delta.autoOptimize.optimizeWrite = true,
delta.autoOptimize.autoCompact = true
);
-- Step 4 — layout for the next 50 TB
ALTER TABLE bronze.events PARTITIONED BY (event_date);
OPTIMIZE bronze.events ZORDER BY (country, customer_id);
Step-by-step trace.
| Step | Action | Files | Storage | Read cost |
|---|---|---|---|---|
| Baseline | inherited table | 200,000 | 50 TB | 6h refresh |
| After OPTIMIZE | bin-packed | ~50,000 | 50 TB | 90 min |
| After CDF enable | future writes carry Δ | ~50,000 | 50 TB + 3% | 90 min |
| After partition + Z-Order | clustered, partitioned | ~50,000 | 50 TB | 18 min |
The trace shows the cost compounds — Step 2 alone cuts read latency 4×; Step 4 (the layout fix) is what gets the refresh inside the 24-hour SLA with margin to spare.
Output:
| Metric | Before | After |
|---|---|---|
| numFiles | 200,000 | ~50,000 |
| Silver refresh latency | 6h | 18min |
| File-skip ratio | ~12% | ~96% |
| Downstream Δ read | full scan | CDF Δ only |
Why this works — concept by concept:
-
Measure before mutate —
DESCRIBE DETAILis the single most important command in Delta tuning. It returns file count, partition columns, table properties, and the latest commit version — the four numbers you need to size the remediation. - OPTIMIZE is idempotent — running it twice is a no-op when files already fit the target size. Safe to schedule nightly.
- CDF before growth — enabling CDF on a small table is free metadata. Enabling on a 100 TB table is a rewrite. The "future-proofing" cost is asymmetric.
- Partition + Z-Order is the layout — partition handles the time axis (low-card); Z-Order handles the multi-dim filter axis (high-card). They are complements, not competitors.
- File-skip ratio is the KPI — the number to track is "files read / files scanned-for-stats." A well-tuned table reads < 5% of its files for a typical query. Below 50% means the layout is wrong.
- Cost — Step 2 (OPTIMIZE) is O(table_size) — pay once. Steps 3 and 4 are O(metadata) plus one OPTIMIZE pass. Recurring cost: O(daily_delta) for auto compaction.
Data Engineering
Topic — optimization
Optimization problems (Data Engineering)
2. CDF mechanics — _change_type, version, timestamp
Change Data Feed turns every insert, update, and delete into a ledger row — read it with table_changes()
The mental model in one line: CDF adds three hidden columns to a Delta table — _change_type, _commit_version, _commit_timestamp — that downstream consumers query via table_changes('db.table', start, end) to read only the rows that actually changed between two commit versions. Once you say "CDF is a versioned ledger of row-state transitions," every downstream sync, SCD-2 build, and audit story collapses into a single read pattern.
The four _change_type values.
-
insert— a brand-new row landed in this commit. The row data appears as-is in the CDF output. -
update_preimage— the BEFORE image of an updated row. Carries the values the row had before the UPDATE. -
update_postimage— the AFTER image of the same updated row, in the same commit. Carries the values the row has after the UPDATE. -
delete— the row that was removed in this commit. Carries the row's final values before deletion.
The two hidden time/version columns.
-
_commit_version— the Delta transaction log version number where the change was committed. Monotonically increasing; perfect for "consume since last seen" cursors. -
_commit_timestamp— the wall-clock time of the commit. Useful for time-window CDF reads ("what changed in the last 6 hours?") but not monotonic across concurrent writers; always prefer version for ordering.
Enabling CDF.
-
New table.
CREATE TABLE … TBLPROPERTIES (delta.enableChangeDataFeed = true). Costs nothing extra at creation. -
Existing table.
ALTER TABLE … SET TBLPROPERTIES (delta.enableChangeDataFeed = true). CDF starts capturing from the next commit; does not back-fill history. Versions prior to the enable point return "CDF not available" if read. -
Downstream consumer contract. Once enabled, every commit writes both the table data and a sidecar
_change_datadirectory carrying the CDF rows for that commit. Storage overhead is roughly 3-7% on average tables; higher for update-heavy workloads.
Read patterns.
-
By version range —
SELECT * FROM table_changes('db.events', 100, 105). Read the deltas between commit version 100 (exclusive) and 105 (inclusive). -
By timestamp range —
SELECT * FROM table_changes('db.events', '2026-06-13', '2026-06-14'). Read everything that committed in the window. -
Streaming.
spark.readStream.option('readChangeFeed', 'true').table('db.events')— Spark Structured Streaming reads CDF as an unbounded source; checkpoints handle the cursor.
Use cases.
- Downstream materialised view sync. Silver/Gold MVs refresh from CDF deltas instead of full scans. Cuts MV refresh cost from O(table_size) to O(daily_delta).
-
SCD-2 dimension build.
update_preimagebecomes the row's "valid_to" timestamp;update_postimagebecomes the new row's "valid_from." No expensiveWINDOW … LEADover the dimension. - Audit trail. Every change carries the committing user, the version, and the timestamp. Regulators love the deterministic ledger.
- Debezium replacement. For lakehouse-native CDC, CDF removes the need for a Kafka Connect Debezium pipeline that mirrors database row state to a Kafka topic — Delta does it natively.
Detailed explanation — building an SCD-2 from CDF in one SQL
Detailed explanation. Slowly-Changing Dimension Type 2 (SCD-2) is the classic dimensional modelling pattern where each historical row version is preserved with valid_from/valid_to timestamps. Without CDF, building an SCD-2 from a Delta source requires a LEAD() window over the source plus an anti-join — expensive on large dimensions. With CDF, every update_preimage/update_postimage pair is already a "version closes / version opens" event.
Question. Given a customers table with CDF enabled, build the SCD-2 history table dim_customers_history directly from the CDF stream. Show how update_preimage becomes a closed row and update_postimage becomes the new open row.
Input — table_changes('customers', 0, 5) output.
| customer_id | name | tier | _change_type | _commit_version | _commit_timestamp |
|---|---|---|---|---|---|
| 1 | Alice | silver | insert | 2 | 2026-06-13 08:00 |
| 1 | Alice | silver | update_preimage | 4 | 2026-06-13 09:30 |
| 1 | Alice | gold | update_postimage | 4 | 2026-06-13 09:30 |
| 2 | Bob | bronze | insert | 3 | 2026-06-13 08:15 |
| 2 | Bob | bronze | delete | 5 | 2026-06-13 10:00 |
Code.
-- Build SCD-2 history rows from the CDF stream.
MERGE INTO dim_customers_history AS tgt
USING (
SELECT
customer_id,
name,
tier,
_change_type,
_commit_timestamp AS event_ts,
-- valid_from / valid_to derived from the change type
CASE
WHEN _change_type = 'insert' THEN _commit_timestamp
WHEN _change_type = 'update_postimage' THEN _commit_timestamp
ELSE NULL
END AS valid_from,
CASE
WHEN _change_type = 'update_preimage' THEN _commit_timestamp
WHEN _change_type = 'delete' THEN _commit_timestamp
ELSE TIMESTAMP '9999-12-31 23:59:59'
END AS valid_to
FROM table_changes('db.customers', :last_version, :current_version)
WHERE _change_type IN ('insert', 'update_preimage', 'update_postimage', 'delete')
) AS src
ON tgt.customer_id = src.customer_id
AND tgt.valid_to = TIMESTAMP '9999-12-31 23:59:59'
AND src._change_type IN ('update_preimage', 'delete')
WHEN MATCHED THEN
UPDATE SET tgt.valid_to = src.valid_to
WHEN NOT MATCHED AND src._change_type IN ('insert', 'update_postimage') THEN
INSERT (customer_id, name, tier, valid_from, valid_to)
VALUES (src.customer_id, src.name, src.tier, src.valid_from, src.valid_to);
Step-by-step explanation.
- The CDF read returns one row per state transition; the SQL needs to convert these into the SCD-2 "close old, open new" contract.
- For
insert, the row opens withvalid_from = commit_tsand stays open withvalid_to = '9999-12-31'. - For
update_preimage, the existing open row is closed by setting itsvalid_toto the commit timestamp — the MATCHED branch handles this. - For
update_postimage, the new state is opened by inserting a new row withvalid_from = commit_ts— the NOT MATCHED branch handles this. - For
delete, the existing open row is closed by settingvalid_to = commit_ts— same MATCHED branch as preimage. - The two changes for the same commit (preimage + postimage) share the same
_commit_versionand_commit_timestamp, so the closed-row'svalid_toequals the new-row'svalid_from— the contiguous-history contract holds.
Output — dim_customers_history.
| customer_id | name | tier | valid_from | valid_to |
|---|---|---|---|---|
| 1 | Alice | silver | 2026-06-13 08:00 | 2026-06-13 09:30 |
| 1 | Alice | gold | 2026-06-13 09:30 | 9999-12-31 23:59:59 |
| 2 | Bob | bronze | 2026-06-13 08:15 | 2026-06-13 10:00 |
Rule of thumb. When the source table has CDF enabled, never build SCD-2 with a window function over the dimension; consume CDF directly. The CDF read is bounded by O(daily_delta), whereas a window-over-dimension is bounded by O(dimension_size × cardinality).
Detailed explanation — incremental MV refresh from CDF
Detailed explanation. A materialised view that aggregates by country (e.g. "event count per country") normally re-aggregates the whole Bronze table on each refresh. With CDF, the MV can be incrementally maintained — apply only the Δ contributions from the current batch's CDF rows.
Question. Given bronze.events with CDF enabled, refresh silver.events_by_country incrementally — add inserts, subtract deletes, and apply the net effect of updates.
Input — table_changes('bronze.events', 100, 101).
| event_id | country | _change_type | _commit_version |
|---|---|---|---|
| 1 | DE | insert | 101 |
| 2 | FR | insert | 101 |
| 3 | DE | delete | 101 |
| 4 | US | update_preimage | 101 |
| 4 | DE | update_postimage | 101 |
Code.
-- Net Δ per country from the CDF batch.
WITH delta AS (
SELECT
country,
CASE _change_type
WHEN 'insert' THEN +1
WHEN 'update_postimage' THEN +1
WHEN 'delete' THEN -1
WHEN 'update_preimage' THEN -1
ELSE 0
END AS delta_count
FROM table_changes('bronze.events', :last_version, :current_version)
)
MERGE INTO silver.events_by_country AS tgt
USING (
SELECT country, SUM(delta_count) AS delta_total
FROM delta
GROUP BY country
) AS src
ON tgt.country = src.country
WHEN MATCHED THEN
UPDATE SET tgt.event_count = tgt.event_count + src.delta_total
WHEN NOT MATCHED THEN
INSERT (country, event_count) VALUES (src.country, src.delta_total);
Step-by-step explanation.
- Each CDF row carries a signed contribution: insert = +1, delete = -1. An UPDATE is two rows — preimage (-1 from old country) + postimage (+1 to new country) — the net effect is "moved one event from US to DE."
- The CTE assigns the signed delta to each row, then a GROUP BY aggregates per country.
- The MERGE applies the per-country delta to the existing MV:
event_count += delta. New countries get an INSERT. - The result is identical to a from-scratch re-aggregation, but reads only the CDF delta instead of the entire Bronze table.
Output — net delta per country.
| country | delta_total | resulting event_count |
|---|---|---|
| DE | +1 (insert) -1 (delete) +1 (postimage) = +1 | old + 1 |
| FR | +1 (insert) | old + 1 |
| US | -1 (preimage) | old - 1 |
Rule of thumb. Every MV whose aggregate is a commutative function (SUM, COUNT, MIN, MAX with retract) can be incrementally maintained from CDF. The MV cost goes from O(table_size × MVs) per refresh to O(daily_delta × MVs) — typically a 100× reduction on a 1% daily delta.
Detailed explanation — the write-amp cost of CDF
Detailed explanation. Enabling CDF is not free. Each commit now writes both the new data files and a sidecar _change_data directory carrying the CDF rows. Update-heavy tables pay the most: each updated row produces two CDF rows (preimage + postimage). For most workloads the cost is 3-7% of the data write volume; for tables where most commits are UPDATEs over wide rows it can climb to 20%.
Question. Estimate the write-amp overhead for a Delta table where 60% of commits are inserts, 30% are updates (touching one column of a 100-column row), and 10% are deletes. Show why update-heavy tables pay more.
Input.
| Operation | Share | Rows per commit |
|---|---|---|
| insert | 60% | 1M |
| update | 30% | 1M |
| delete | 10% | 1M |
Code.
-- Inspect CDF file count vs main data file count after a representative batch.
DESCRIBE DETAIL db.events; -- numFiles, sizeInBytes
DESCRIBE HISTORY db.events LIMIT 3; -- last 3 commits w/ operationMetrics
-- Operational read of CDF storage overhead.
SELECT
SUM(CASE WHEN _change_type = 'insert' THEN 1 ELSE 0 END) AS inserts,
SUM(CASE WHEN _change_type IN ('update_preimage','update_postimage') THEN 1 ELSE 0 END) AS update_rows,
SUM(CASE WHEN _change_type = 'delete' THEN 1 ELSE 0 END) AS deletes
FROM table_changes('db.events', :start_version, :end_version);
Step-by-step explanation.
- Inserts emit 1 CDF row per inserted row — overhead = 1× the row size.
- Updates emit 2 CDF rows per updated row (preimage + postimage) — overhead = 2× the row size. Even if the UPDATE only changes one column, both full row images are written.
- Deletes emit 1 CDF row per deleted row — overhead = 1× the row size.
- With the workload mix above: total CDF row count = (60% × 1M) + (30% × 1M × 2) + (10% × 1M) = 0.6M + 0.6M + 0.1M = 1.3M CDF rows per 1M data row commits. That is 30% overhead on this update-heavy mix.
- On a 60% insert / 5% update / 35% delete workload, the math is 0.6M + 0.1M + 0.35M = 1.05M CDF rows per 1M — 5% overhead. Tunable by the workload's update share.
Output.
| Workload mix | CDF rows / data rows | Storage overhead |
|---|---|---|
| 60% insert, 30% update, 10% delete | 1.3× | 30% |
| 80% insert, 5% update, 15% delete | 1.05× | 5% |
| 100% insert (append-only) | 1.0× | 0% — but no benefit either |
Rule of thumb. CDF storage overhead = 1 + (update_share). If your workload is insert-dominant, CDF is essentially free. If it is update-dominant and you don't need downstream propagation, leave CDF off and rely on time-travel for the rare audit query.
Senior interview question on backfilling SCD-2 when CDF was enabled late
A senior interviewer often frames this as: "Your team enables CDF on a 2-year-old dimension table on Monday. On Tuesday a downstream team needs an SCD-2 history that goes back to the table's birth. How do you backfill?" The probe blends CDF's "no historical capture" gotcha with time-travel and a manual full-diff fallback.
Solution Using time-travel + full-table diff to backfill, then CDF for forward state
-- Step 1 — snapshot the dimension at its earliest accessible version.
-- (Time-travel only goes back as far as VACUUM retention allows.)
CREATE OR REPLACE TABLE _scratch.dim_baseline AS
SELECT *, TIMESTAMP '2026-01-01 00:00:00' AS valid_from
FROM db.customers VERSION AS OF 0; -- or the earliest available version
-- Step 2 — full-diff each subsequent version up to the CDF enable point,
-- emitting synthetic _change_type rows.
CREATE OR REPLACE TABLE _scratch.synthetic_cdf AS
SELECT
COALESCE(curr.customer_id, prev.customer_id) AS customer_id,
CASE
WHEN prev.customer_id IS NULL THEN 'insert'
WHEN curr.customer_id IS NULL THEN 'delete'
WHEN curr.name <> prev.name OR curr.tier <> prev.tier THEN 'update_postimage'
END AS _change_type,
curr.name, curr.tier,
:commit_version AS _commit_version,
:commit_ts AS _commit_timestamp
FROM db.customers VERSION AS OF :curr_version curr
FULL OUTER JOIN db.customers VERSION AS OF :prev_version prev
ON curr.customer_id = prev.customer_id
WHERE curr.customer_id IS NULL
OR prev.customer_id IS NULL
OR curr.name <> prev.name
OR curr.tier <> prev.tier;
-- Step 3 — union synthetic CDF with the real CDF stream and feed the SCD-2 build.
INSERT INTO dim_customers_history
SELECT * FROM _scratch.synthetic_cdf
UNION ALL
SELECT * FROM table_changes('db.customers', :cdf_enable_version, :now);
Step-by-step trace.
| Phase | Source | Versions covered | Cost |
|---|---|---|---|
| Baseline snapshot | VERSION AS OF 0 |
one snapshot | one full scan |
| Synthetic CDF | pairwise FULL OUTER JOIN | every version before CDF | O(versions × table_size) |
| Real CDF | table_changes() |
every version after CDF enable | O(daily_delta) |
The trace highlights the asymmetry: synthetic backfill is brutal — every pairwise version diff is a full-table scan — but it only runs once. After the backfill, the cheap CDF read takes over forever.
Output:
| Phase | Bytes scanned | Output rows |
|---|---|---|
| Baseline (v=0) | 1× table_size | dimension_size |
| Synthetic CDF (v=1 to v=cdf_enable) | N × table_size | ≈ daily_delta × N |
| Real CDF (v=cdf_enable to now) | O(daily_delta) | ≈ daily_delta × M |
Why this works — concept by concept:
- VERSION AS OF — Delta time-travel exposes any prior version that has not been VACUUM-reclaimed. The earliest accessible version is bounded by the retention floor (default 168h).
- Pairwise full-diff is the only fallback — without CDF, the only way to derive "what changed between v=k and v=k+1" is a full FULL OUTER JOIN. Expensive but deterministic.
- UNION ALL bridges the regime change — synthetic CDF covers history, real CDF covers forward time. The downstream MERGE doesn't need to distinguish.
- Cost asymmetry — one-time backfill cost vs ongoing CDF read cost. The breakeven is "any single downstream consumer for any single day."
- Time-travel + retention coupling — if VACUUM retention is set too low, the earliest accessible version may already be reclaimed and the synthetic backfill fails. Run VACUUM check before the backfill plan.
-
Cost — synthetic backfill is O(versions × table_size); real CDF is O(daily_delta). The asymptotic ratio is
daily_delta / table_size— typically 0.01-0.1.
Data Engineering
Topic — streaming
Streaming Δ processing problems
3. Z-Ordering deep dive — multi-dim clustering, file skipping
Z-Order is a space-filling curve — interleaving the bits of two columns clusters records that are close in 2D into the same parquet file
The mental model in one line: Z-Order interleaves the bit representations of the chosen columns and sorts by the interleaved key, producing a layout where records close in multi-dimensional space end up in adjacent rows of the sorted file — which means in the same parquet file with similar min/max stats — which means file skipping at read time. Once you visualise "the Z-curve weaves through the cells of a 2D grid and one file holds one curve segment," every Z-Order interview question becomes a question about file-skip ratio.
The space-filling curve insight.
- A sort on column A clusters records by A alone — perfect file-skip on
WHERE a = ?, useless onWHERE b = ?. - A sort on
(A, B)clusters by A then B — perfect onWHERE a = ?, partial onWHERE a = ? AND b = ?, useless onWHERE b = ?. - A Z-Order on
(A, B)interleaves the bits of A and B before sorting — every prefix of the interleaved key carries information about both columns — partial file-skip onWHERE a = ?, partial onWHERE b = ?, strong onWHERE a = ? AND b = ?. - The "Z" shape comes from drawing the curve in 2D space — it visits the four cells of every 2x2 block in a "Z" pattern before jumping to the next 2x2.
File skipping = the actual currency.
- Delta stores per-file min/max for each Z-Order column in the transaction log.
- A
WHEREpredicate's accepted range is intersected against each file's min/max — files whose ranges do not overlap are skipped entirely (no parquet read). - Z-Order shrinks the per-file ranges of every Z-Order column simultaneously, so file-skip ratios above 90% are common on tuned tables.
When Z-Order helps.
- High-cardinality columns (more than 10K distinct values). Low-card columns share ranges with too many other files; Z-Order does not help.
- Many WHERE filters across two to four columns. The whole point is multi-dim pruning; one-column filters do not need Z-Order.
-
Point lookups and small-range scans. A
WHERE customer_id = 17against a 50 TB Z-Order table reads ~1 file instead of ~50,000.
When Z-Order hurts.
-
Low-cardinality columns. Use partitioning instead. Z-Order on
country(200 values) wastes the rewrite budget. -
Single-column filters always. A simple
ORDER BYclustered write is cheaper than Z-Order. - Tables with churning data. Every UPDATE invalidates the Z-Order clustering on the touched files; for high-churn tables, prefer liquid clustering which maintains clustering incrementally.
The 4-column ceiling.
- Bit-interleaving 2 columns: each output bit carries information about both inputs — strong pruning on both.
- 4 columns: each output bit carries information about 1/4 of the input — pruning is diluted.
- 5+ columns: pruning degenerates to roughly the same as random ordering. The Delta docs cap recommended Z-Order at 4 columns; pick the most-filtered four.
Diminishing returns intuition.
- Z-Order spreads the pruning budget across columns. With 2 columns, each gets half the bits — pruning works well on both.
- With 4 columns, each gets a quarter — pruning is weaker but still useful.
- With 8 columns, each gets an eighth — pruning approaches no-op. Better to partition one column and Z-Order the others.
Detailed explanation — tracing the 2D Z-curve by hand
Detailed explanation. The fastest way to internalise Z-Order is to trace the 2D curve on a 4x4 grid by hand. The curve visits cells in the order of their bit-interleaved row-column key — and after the trace, the "records close in 2D land in the same file" claim becomes visible.
Question. On a 4x4 grid with rows 0-3 and columns 0-3, list the cells in Z-Order traversal order. Show that adjacent cells in the traversal are close in 2D space.
Input. A 4x4 grid. Each cell is identified as (row, col) with row and col in 0-3 (2 bits each).
Code.
# Bit-interleave the 2-bit row with the 2-bit col → 4-bit Z-key.
def z_order_key(row, col, bits=2):
key = 0
for i in range(bits):
key |= ((row >> i) & 1) << (2 * i + 1) # row bit at odd position
key |= ((col >> i) & 1) << (2 * i) # col bit at even position
return key
cells = [(r, c) for r in range(4) for c in range(4)]
ordered = sorted(cells, key=lambda rc: z_order_key(*rc))
for rc in ordered:
print(rc, z_order_key(*rc))
Step-by-step explanation.
- Each row and col is encoded as 2 bits. Bit-interleaving alternates row and col bits to produce a 4-bit Z-key.
-
(0, 0)→ row=00, col=00 → interleaved=0000 → Z-key=0. -
(0, 1)→ row=00, col=01 → interleaved=0001 → Z-key=1. -
(1, 0)→ row=01, col=00 → interleaved=0010 → Z-key=2. -
(1, 1)→ row=01, col=01 → interleaved=0011 → Z-key=3. - The first four cells of the curve form the upper-left 2x2 block — visited in a "Z" shape. The next four cells form the upper-right 2x2 block, and so on. The whole curve is a recursive Z of Zs.
- After sorting by Z-key, the first row of the sorted file holds 2x2 block (0-1, 0-1); the second row holds 2x2 block (0-1, 2-3). One parquet file's min/max for both
rowandcolis tightly bounded — file-skip works on both axes.
Output — first eight cells in Z-order.
| traversal order | (row, col) | z-key |
|---|---|---|
| 1 | (0, 0) | 0 |
| 2 | (0, 1) | 1 |
| 3 | (1, 0) | 2 |
| 4 | (1, 1) | 3 |
| 5 | (0, 2) | 4 |
| 6 | (0, 3) | 5 |
| 7 | (1, 2) | 6 |
| 8 | (1, 3) | 7 |
Rule of thumb. Pick Z-Order columns whose values you can imagine on a 2D / 3D grid where business queries cluster. Geographic columns (country + region), key columns (customer_id + product_id), and time + entity (event_ts + user_id) are the canonical good pairs.
Detailed explanation — measuring the file-skip ratio
Detailed explanation. The single most important Z-Order metric is the file-skip ratio: (files_pruned / files_in_table) × 100%. Below 50% the layout is wrong; above 90% the layout is tuned. Delta exposes the numbers in the Spark UI under "DataReading" and in the delta.scan operator stats.
Question. Given a 50 TB Delta events table partitioned by event_date and Z-Ordered on (country, customer_id), run a EXPLAIN FORMATTED for a typical filtered query and read the file-skip ratio.
Input. A table with numFiles = 50,000, partitionColumns = [event_date], zOrderBy = (country, customer_id).
Code.
-- Run the query plan.
EXPLAIN FORMATTED
SELECT *
FROM events
WHERE event_date = '2026-06-13'
AND country = 'DE'
AND customer_id = 17;
-- Inspect the actual scan stats after the query.
SELECT *
FROM events
WHERE event_date = '2026-06-13'
AND country = 'DE'
AND customer_id = 17;
-- Spark UI → SQL → metrics: "number of files read" + "files skipped"
Step-by-step explanation.
- The planner reads the
event_datepartition first — 1 partition out of 365. - Within that partition (~137 files), the planner reads per-file min/max stats for
countryandcustomer_id. - The intersection of the
country = 'DE'andcustomer_id = 17ranges with each file's min/max yields 1 candidate file out of 137. - The scan reads 1 file; the file-skip ratio is
(50,000 - 1) / 50,000 = 99.998%. - The Spark UI shows
files read = 1andfiles pruned = 136(within-partition pruning);partitions pruned = 364(cross-partition pruning).
Output — measured numbers.
| Phase | Value |
|---|---|
| Partitions pruned | 364 / 365 |
| Files pruned within partition | 136 / 137 |
| Files actually read | 1 |
| File-skip ratio | 99.998% |
| Bytes scanned | ~1 GB (1 file) instead of 50 TB |
Rule of thumb. Every Z-Order tuning session ends with a EXPLAIN FORMATTED + Spark UI run on the representative query. If the file-skip ratio is below 80%, the Z-Order columns are wrong — pick the four most-filtered high-cardinality columns and re-OPTIMIZE.
Detailed explanation — choosing Z-Order vs partitioning vs liquid clustering
Detailed explanation. The three layout primitives — partition, Z-Order, liquid clustering — are not interchangeable. Each is the right answer for a different shape of workload, and using the wrong one is the most common reason Delta tables underperform.
Question. For each of the four workload patterns below, choose between partitioning, Z-Order, and liquid clustering — and justify the choice.
Input.
| Workload | Filter columns | Cardinality | Churn rate |
|---|---|---|---|
| A — daily log analytics | event_date |
365 | append-only |
| B — point lookups by customer + product |
customer_id, product_id
|
50M, 1M | low |
| C — high-velocity SCD-1 customer table |
customer_id, country
|
50M, 200 | UPDATE-heavy |
| D — greenfield 100 TB events table | TBD multi-dim | high | mixed |
Code.
-- A — partition by event_date (low-card, time-series)
ALTER TABLE logs PARTITIONED BY (event_date);
-- B — partition by event_date + Z-Order on customer_id, product_id
ALTER TABLE point_lookups PARTITIONED BY (event_date);
OPTIMIZE point_lookups ZORDER BY (customer_id, product_id);
-- C — liquid clustering (Z-Order would re-fragment on every UPDATE)
ALTER TABLE customers CLUSTER BY (customer_id, country);
-- D — greenfield → liquid clustering from day one
CREATE TABLE events (...) CLUSTER BY (customer_id, event_ts);
Step-by-step explanation.
-
A —
event_dateis low-cardinality and time-correlated. Partitioning lets the planner skip whole directories on date filters. Z-Order on a low-card column wastes the rewrite budget. -
B —
event_datepartitions the time axis; Z-Order clusters the multi-dim key axis. Append-only with low churn means the clustering stays valid for a long time before re-OPTIMIZE. - C — UPDATE-heavy means Z-Order's full rewrite needs to run frequently to maintain clustering. Liquid clustering maintains clustering incrementally on each write, so the table stays clustered without scheduled OPTIMIZE.
- D — greenfield in 2026 should default to liquid clustering. Z-Order is now legacy for new tables; existing tables migrate when convenient.
Output — decision table.
| Workload | Layout | Rationale |
|---|---|---|
| A | partition by event_date
|
low-card time column |
| B | partition + Z-Order | append-only multi-dim |
| C | liquid clustering | UPDATE-heavy needs incremental clustering |
| D | liquid clustering | greenfield 2026+ default |
Rule of thumb. Partition for the time axis. Z-Order or liquid cluster for the entity axis. Choose liquid clustering if (a) the table is new in 2024+, (b) the workload is UPDATE-heavy, or (c) the filter set evolves over time. Otherwise Z-Order is fine.
Senior interview question on picking Z-Order columns
A senior interviewer often frames this as: "You have a Delta table with 12 columns appearing in WHERE clauses across various queries. The Delta docs say cap Z-Order at 4 columns. Which 4 do you pick, and how do you decide?"
Solution Using a filter-frequency + cardinality + cost-weighted ranking
# Pseudocode — score each candidate column by Z-Order suitability.
def z_order_score(col, query_log, table_stats):
f = query_log.filter_frequency[col] # share of queries where col appears in WHERE
c = table_stats.distinct_values[col] # cardinality
s = query_log.selectivity[col] # 1 / N for point lookups
j = query_log.is_join_key[col] # is col a typical join key?
return f * log(c) * s * (1.5 if j else 1.0)
candidates = ['country','customer_id','product_id','event_ts','region','tier',
'campaign_id','session_id','device_type','os','channel','status']
scored = sorted(
[(c, z_order_score(c, query_log, table_stats)) for c in candidates],
key=lambda x: -x[1],
)
zorder_cols = [c for c, _ in scored[:4]]
print('Z-Order on:', zorder_cols)
Step-by-step trace.
| col | filter_freq | cardinality | selectivity | score |
|---|---|---|---|---|
customer_id |
0.80 | 50M | 1/50M | 14.2 |
event_ts |
0.70 | 1B | 1/86400 | 12.1 |
product_id |
0.55 | 1M | 1/1M | 7.6 |
country |
0.40 | 200 | 1/200 | 2.1 |
region |
0.30 | 1000 | 1/1000 | 2.1 |
tier |
0.20 | 5 | 1/5 | 0.32 |
The trace shows the top four naturally separate from the rest: customer_id, event_ts, product_id, country carry the heavy filter load.
Output:
| Final Z-Order columns | Reason |
|---|---|
customer_id |
highest score, point-lookup dominant |
event_ts |
high cardinality + frequent range filter |
product_id |
second join-key surface |
country |
popular filter, multi-dim with customer_id |
Why this works — concept by concept:
- Filter frequency dominates — Z-Order pays off only on columns that appear in WHERE clauses. A column that does not get filtered earns zero from the layout.
-
Cardinality matters non-linearly —
log(cardinality)weights high-card columns higher because the per-file range shrinks proportional to log-distinct-values when bit-interleaving. -
Selectivity captures point-lookups —
1/Nselectivity is the marginal scan reduction from a Z-Order hit on that column. - Join-key boost — columns that double as join keys earn a 1.5× boost because they drive both filter pushdown and partition-aware joins.
- The 4-column ceiling is hard — past 4 columns the per-column pruning is too diluted to repay the OPTIMIZE cost. Pick the top 4 and re-evaluate quarterly.
- Cost — scoring is O(candidate_columns); the score itself is a heuristic, not an oracle. Always validate with a representative query's file-skip ratio after the first OPTIMIZE pass.
Data Engineering
Topic — bucketing
Bucketing & file-layout problems
4. OPTIMIZE + Auto Compaction — bin-packing, optimal write target
OPTIMIZE bin-packs many small parquet files into a few large ones; Auto Compaction does the same at write-time
The mental model in one line: OPTIMIZE is the post-hoc bin-packer that coalesces small parquet files into target-size (default 1 GB) files; Auto Compaction is the at-write-time version that runs after each commit when the just-written files are below the threshold. Once you say "bin-pack to 1 GB targets," every small-file interview gotcha — slow scans, metadata overhead, executor memory explosions — becomes a single tuning lever.
Bin-packing semantics.
- OPTIMIZE picks small files (under
delta.targetFileSize, default 1 GB) and combines them. Large files are left alone — they are already optimal. - The combining is a rewrite: read N small files, write 1 large file, replace the manifest entries atomically in the transaction log.
- Idempotent: running OPTIMIZE on a tuned table is a no-op (no small files to combine).
- Can be scoped:
OPTIMIZE table WHERE partition_col = '2026-06-13'runs only on that partition.
OPTIMIZE + ZORDER BY together.
-
OPTIMIZE table ZORDER BY (col1, col2)does both passes in one command — first the Z-Order sort, then the bin-pack into target-size files. More efficient than separate calls. - After Z-Order, the resulting files have tight per-column min/max stats — file-skip ratios > 90% on the Z-Order columns.
Auto Compaction (write-time) vs OPTIMIZE (post-hoc).
-
Auto Compaction.
delta.autoOptimize.autoCompact = true. After each successful commit, Delta checks the new files; if they are small, it runs a synchronous compaction step before returning success. Adds ~10-30% to commit latency in exchange for "no small files ever exist." - OPTIMIZE. Scheduled command, typically nightly. Runs independently of write traffic. Better for very-large tables where synchronous compaction would add unacceptable commit latency.
- Combined. Use Auto Compaction for the streaming/operational layer (Bronze); use scheduled OPTIMIZE for the analytical layer (Silver/Gold) where Z-Order is also needed.
Target file size.
- Default: 1 GB. Configurable via
delta.targetFileSize(e.g.,'256m','1g','2g'). - Smaller targets (256 MB) help on tables with many small filtered queries — more files = finer pruning granularity.
- Larger targets (2-4 GB) help on tables with large analytical scans — fewer files = less per-file overhead.
-
Auto Tuning. On Databricks,
delta.tuneFileSizesForRewrites = truechooses the target adaptively based on observed query patterns.
Optimized Writes.
-
delta.autoOptimize.optimizeWrite = true. Adds an adaptive shuffle before the write so that the written files are closer to target size. Typically writes 128 MB files instead of the small per-task shards that a naive shuffle produces. - Combined with Auto Compaction, the small-file tax is eliminated end-to-end.
Detailed explanation — the small-file disaster, in numbers
Detailed explanation. A streaming Bronze table writes one parquet file per partition per micro-batch. With 5-minute micro-batches and 12 partitions, that is 144 files per hour, ~3,500 per day, ~25,000 per week. Each file is ~5 MB. A typical analytical query then scans 25,000 5 MB files — reading 25,000 parquet footers alone takes 30-60 seconds before the data scan starts.
Question. Given the workload above, show the cost of running with no compaction vs scheduled nightly OPTIMIZE vs Auto Compaction. Quantify the read latency, write CPU, and file count.
Input.
| Trait | Value |
|---|---|
| Streaming micro-batch | 5 min |
| Partitions per batch | 12 |
| Avg file size at write | 5 MB |
| Daily file count (raw) | ~3,500 |
| Weekly file count (raw) | ~25,000 |
Code.
-- Scenario 1 — no compaction.
-- Nothing — the streaming writer creates 5 MB files every 5 minutes.
-- Scenario 2 — scheduled nightly OPTIMIZE.
OPTIMIZE bronze.events;
-- Runs nightly; daily file count → ~50 large files
-- Scenario 3 — Auto Compaction.
ALTER TABLE bronze.events SET TBLPROPERTIES (
delta.autoOptimize.optimizeWrite = true,
delta.autoOptimize.autoCompact = true
);
-- Each commit ends with a synchronous compaction step.
-- Steady-state file count: ~50-100 at any time.
Step-by-step explanation.
- No compaction. File count grows linearly. At week 1, 25,000 files; at week 4, 100,000. A simple analytical query reads 100,000 parquet footers ≈ 2-4 minutes of metadata I/O before data scan.
- Nightly OPTIMIZE. Daily file count returns to ~50 each morning, climbs back to 3,500 through the day. Read latency on the same query is ~40 seconds in the morning, ~90 seconds by end-of-day. OPTIMIZE costs ~10-30 minutes of compute nightly.
- Auto Compaction. Steady-state file count stays at ~50-100 all day. Read latency is ~30-40 seconds consistently. Write CPU increases ~15-20% because every commit pays for the synchronous compaction.
- Combined Auto Compaction + scheduled OPTIMIZE ZORDER. Compaction keeps file count low all day; nightly OPTIMIZE ZORDER re-clusters for filter performance. Best of both worlds.
Output.
| Strategy | Steady-state files | Read latency | Write CPU |
|---|---|---|---|
| No compaction | 25K+ (grows) | 2-4 min | baseline |
| Nightly OPTIMIZE | 50 → 3,500 daily | 40 s → 90 s | +30 min/night |
| Auto Compaction | 50-100 all day | 30-40 s steady | +15-20% |
| Auto Compaction + nightly OPTIMIZE ZORDER | 50-100 + clustered | 4 s steady | +15-20% +30 min/night |
Rule of thumb. On streaming Bronze tables, default to Auto Compaction. On large analytical Silver/Gold tables, default to scheduled OPTIMIZE ZORDER. On the very largest tables (> 100 TB), do both.
Detailed explanation — incremental OPTIMIZE on a hot partition
Detailed explanation. Running full-table OPTIMIZE every night is wasteful when only one or two partitions actually changed. The OPTIMIZE … WHERE form restricts the operation to a partition or partition range — pays only for the hot set.
Question. Given a Delta table partitioned by event_date, schedule a daily OPTIMIZE that only re-bin-packs yesterday's partition (the only one that received new writes).
Input. A table with 365 daily partitions, only one partition (yesterday) is "hot."
Code.
-- Restrict OPTIMIZE to one partition — pays only for yesterday's files.
OPTIMIZE bronze.events
WHERE event_date = current_date() - INTERVAL '1' DAY;
-- With Z-Order:
OPTIMIZE bronze.events
WHERE event_date = current_date() - INTERVAL '1' DAY
ZORDER BY (country, customer_id);
Step-by-step explanation.
- Without the WHERE clause, OPTIMIZE scans every file in the table to find small files. On a 50 TB table with 50,000 files, that metadata scan alone takes ~2-3 minutes.
- With the WHERE clause, OPTIMIZE scans only the partition's directory. On a 137 MB / 137 file partition, the metadata scan is sub-second.
- The rewrite is bounded by the partition size. Yesterday's partition is ~137 GB; the rewrite is ~3-5 minutes vs a full-table rewrite of ~30-60 minutes.
- The Z-Order pass also runs only on the partition's files — the clustering applies to the partition.
Output.
| Strategy | Files scanned | Files rewritten | Time |
|---|---|---|---|
| Full-table OPTIMIZE | 50,000 | thousands | 30-60 min |
| Partition-scoped OPTIMIZE | 137 | 50-100 | 3-5 min |
Rule of thumb. Always scope OPTIMIZE to the partitions that actually received writes. On time-partitioned tables, this means "OPTIMIZE yesterday's partition" — never "OPTIMIZE the whole table." The 12× cost reduction is structural, not incidental.
Detailed explanation — choosing between Auto Compaction and scheduled OPTIMIZE
Detailed explanation. Auto Compaction is synchronous: it adds latency to every commit. Scheduled OPTIMIZE is asynchronous: it runs on a separate schedule and does not block writes. The choice depends on whether the write SLA can absorb 10-30% latency tax in exchange for "no small files ever."
Question. For each of three workloads — sub-second streaming, hourly micro-batch, daily batch — pick between Auto Compaction and scheduled OPTIMIZE.
Input.
| Workload | Write cadence | Write SLA |
|---|---|---|
| A — sub-second streaming | continuous | < 1 s p99 commit |
| B — hourly micro-batch | hourly | < 30 s p99 commit |
| C — daily batch | nightly | < 60 min total |
Code.
-- A — sub-second streaming: cannot tolerate sync compaction
ALTER TABLE A SET TBLPROPERTIES (
delta.autoOptimize.optimizeWrite = true,
delta.autoOptimize.autoCompact = false -- async OPTIMIZE only
);
-- + scheduled hourly OPTIMIZE WHERE on the hot partition
-- B — hourly micro-batch: can absorb sync compaction
ALTER TABLE B SET TBLPROPERTIES (
delta.autoOptimize.optimizeWrite = true,
delta.autoOptimize.autoCompact = true
);
-- C — daily batch: scheduled OPTIMIZE + Z-Order in the batch window
-- (Auto compaction redundant for one large commit)
OPTIMIZE C ZORDER BY (customer_id, product_id);
Step-by-step explanation.
- A — sub-second p99 forbids Auto Compaction's synchronous step. Use Optimized Writes (still cheap) but schedule OPTIMIZE WHERE on each completed hour-partition asynchronously.
- B — hourly commits with 30 s SLA can absorb the ~5-10 s synchronous compaction. Use both Optimized Writes and Auto Compaction.
- C — nightly batch writes one large commit. Auto Compaction adds nothing; the OPTIMIZE pass in the batch window handles everything plus Z-Order.
Output.
| Workload | Strategy | Reason |
|---|---|---|
| A streaming | OW + async OPTIMIZE | sync compaction breaks p99 |
| B hourly | OW + Auto Compaction | SLA can absorb sync compaction |
| C daily | Scheduled OPTIMIZE ZORDER | one commit, no compaction needed |
Rule of thumb. The write SLA determines the answer. Sub-second writes → asynchronous OPTIMIZE only. Multi-second writes → Auto Compaction. Hours-long batch → scheduled OPTIMIZE ZORDER. Mix and match per layer of the medallion architecture.
Senior interview question on OPTIMIZE cadence for a streaming Bronze table
A senior interviewer often frames this as: "Sketch the OPTIMIZE schedule for a streaming Bronze Delta table with 5-minute Structured Streaming micro-batches, downstream MERGE-INTO consumers, and a 24-hour analyst-query SLA. Justify each cadence."
Solution Using a layered OPTIMIZE cadence with partition scope and Z-Order
# Pseudocode for the orchestrator (Databricks Workflows / Airflow).
# Layer 1 — continuous: Optimized Writes + Auto Compaction (table property)
# Layer 2 — every 6 hours: partition-scoped OPTIMIZE on yesterday's partition
def hourly_partition_optimize():
spark.sql("""
OPTIMIZE bronze.events
WHERE event_date >= current_date() - INTERVAL '1' DAY
""")
# Layer 3 — nightly: Z-Order on yesterday's partition for filter performance
def nightly_zorder():
spark.sql("""
OPTIMIZE bronze.events
WHERE event_date = current_date() - INTERVAL '1' DAY
ZORDER BY (country, customer_id)
""")
# Layer 4 — weekly: full-table OPTIMIZE catch-up + retention prune
def weekly_full_optimize():
spark.sql("OPTIMIZE bronze.events")
spark.sql("VACUUM bronze.events RETAIN 168 HOURS")
Step-by-step trace.
| Layer | Cadence | Scope | Cost | Benefit |
|---|---|---|---|---|
| 1 | every commit | Auto Compaction | +15-20% write CPU | no small files ever |
| 2 | every 6 hours | hot partition | ~30s | re-bin-pack any straggler files |
| 3 | nightly | hot partition + Z-Order | ~5 min | filter performance for analysts |
| 4 | weekly | full table + VACUUM | ~30 min | catch-up + storage hygiene |
The trace shows how the layered cadence keeps the file count low at every timescale — sub-second through weekly — without paying full-table OPTIMIZE cost every day.
Output:
| Time of day | Action | Effect |
|---|---|---|
| 00:00 | nightly OPTIMIZE ZORDER on yesterday | analyst queries fast in morning |
| 06:00 | partition-scoped OPTIMIZE | drag from late commits cleared |
| 12:00 | partition-scoped OPTIMIZE | clears the morning's micro-batches |
| 18:00 | partition-scoped OPTIMIZE | clears the afternoon's micro-batches |
| 02:00 Sun | weekly full OPTIMIZE + VACUUM | catch-up + retention prune |
Why this works — concept by concept:
- Layered cadence — each layer addresses a different timescale of file accumulation. Auto Compaction handles seconds; partition OPTIMIZE handles hours; Z-Order handles days; full OPTIMIZE handles weeks.
-
Partition scope —
WHERE event_date = …keeps every OPTIMIZE cost-bounded to the hot set. Critical for tables larger than 10 TB. - Z-Order separated from compaction — Z-Order is the expensive rewrite; running it less often (nightly) is fine because the analyst-query SLA is daily, not hourly.
- VACUUM coupled to OPTIMIZE — running VACUUM in the same maintenance window as full OPTIMIZE keeps both rebalanced together. Mismatched cadences cause time-travel surprises.
- Auto Compaction is the spine — the other layers are optimisations on top. Without Auto Compaction, the steady-state file count explodes between scheduled passes.
- Cost — Layer 1 is amortised across writes; Layers 2-3 are partition-bounded (constant); Layer 4 is the only full-table cost (weekly).
Data Engineering
Topic — optimization
File-layout optimization drills
5. Production tuning — VACUUM, deletion vectors, liquid clustering
VACUUM reclaims tombstoned files after retention; deletion vectors mask deleted rows without rewrites; liquid clustering replaces Z-Order on greenfield tables
The mental model in one line: VACUUM is the only command that physically deletes files from object storage — it can only delete files older than the retention threshold (default 168 hours), and dropping below that floor breaks time-travel and the audit story. Deletion vectors and liquid clustering are the two 2024+ primitives that change how Delta handles row-level deletes and incremental clustering — both should be enabled on every new table.
The VACUUM file lifecycle.
- ACTIVE — referenced by the latest snapshot. Readable by every query. Cannot be deleted.
-
TOMBSTONED — removed by a commit (DELETE, UPDATE, OPTIMIZE, MERGE), but still physically present on disk. Reachable via time-travel
VERSION AS OFuntil VACUUM reclaims. - RECLAIMED — physically deleted by VACUUM. Unrecoverable. Time-travel to versions referencing the file fails.
Default 168-hour retention.
-
delta.deletedFileRetentionDurationcontrols how long tombstoned files persist before VACUUM may reclaim them. Default 168 hours (7 days). - VACUUM refuses to reclaim files newer than this threshold — even if you ask politely.
- To override (dangerous):
SET spark.databricks.delta.retentionDurationCheck.enabled = false; VACUUM table RETAIN 24 HOURS;. Don't.
Why the floor exists.
- Time-travel
VERSION AS OF vrequires every file referenced at versionvto still be on disk. Once VACUUM reclaims those files, time-travel to versionvfails. - Audit / regulator requirements often demand at least 7 days of "as-of" history. Setting retention below 7 days violates the audit contract.
- Concurrent readers may still be reading a file that a commit just tombstoned. The 7-day floor gives them safe buffer.
Deletion vectors.
- A deletion vector is a per-file bitmap that marks rows as deleted without rewriting the file.
- Enable:
ALTER TABLE … SET TBLPROPERTIES (delta.enableDeletionVectors = true). - A
DELETE WHERE id = 17updates the deletion vector for the affected file — milliseconds instead of minutes of rewrite. - Reads apply the deletion vector at scan time — the row is hidden but the parquet file is untouched.
- An OPTIMIZE pass later consolidates by rewriting files with significant deletion vectors into clean files without DVs (configurable threshold).
Liquid clustering.
- The 2024+ successor to Z-Order. Maintains clustering incrementally on each write — no full rewrite required.
- Greenfield:
CREATE TABLE … CLUSTER BY (col1, col2). - Existing table:
ALTER TABLE … CLUSTER BY (col1, col2). The next OPTIMIZE pass rewrites files into the new layout. - Liquid clustering does not need partition columns — it replaces both partitioning and Z-Order with a single clustering key. Cleaner mental model.
- Up to 4 clustering columns; same ceiling rationale as Z-Order.
When to use what — 2026 cheat sheet.
- New table in 2026? Liquid clustering. Always.
-
Existing Z-Ordered table? Z-Order until a migration window opens. Then
ALTER TABLE … CLUSTER BY (…)and one final OPTIMIZE pass. - UPDATE-heavy table? Enable deletion vectors. Then add liquid clustering on top for clustering maintenance.
- Audit-regulated table? Keep VACUUM at 168h or higher. Never lower.
Detailed explanation — choosing retention for a regulated table
Detailed explanation. Regulated industries (finance, healthcare) commonly require 30-day or 90-day "as-of" history. The Delta default of 7 days is too short. Override upward, never downward.
Question. For a regulated payments table that must support SELECT … VERSION AS OF :v for any version in the last 30 days, configure the retention and document the storage trade-off.
Input.
| Trait | Value |
|---|---|
| Daily delta size | 100 GB |
| Required time-travel window | 30 days |
| Audit retention | 90 days |
Code.
-- Configure retention to 90 days (covers both time-travel + audit).
ALTER TABLE payments SET TBLPROPERTIES (
delta.deletedFileRetentionDuration = 'interval 90 days',
delta.logRetentionDuration = 'interval 90 days'
);
-- VACUUM still uses the table property; safe to schedule daily.
VACUUM payments; -- reclaims files older than 90 days
Step-by-step explanation.
-
delta.deletedFileRetentionDurationcontrols when VACUUM may reclaim a tombstoned file. Setting to 90 days means VACUUM keeps every tombstoned file for 90 days. -
delta.logRetentionDurationcontrols how long the JSON transaction log files persist. For time-travel to work back to day -30, the log must reach back at least that far. - The storage cost is
daily_delta × retention_days. With 100 GB/day × 90 days = 9 TB of tombstoned-file storage — a real but manageable cost for the audit guarantee. - VACUUM still runs daily; it just reclaims files older than 90 days each pass. After steady state, the tombstoned set is ~9 TB.
Output — storage breakdown after steady state.
| Layer | Size | Purpose |
|---|---|---|
| Active data | varies | latest snapshot |
| Tombstoned (within 90d) | ~9 TB | time-travel + audit |
| Log retention (90d JSON files) | ~50 MB | snapshot reconstruction |
Rule of thumb. Set retention to at least the longest of: regulatory minimum, longest time-travel use case, and the lag of your slowest downstream consumer. Going higher is cheap; going lower than 168h breaks time-travel silently.
Detailed explanation — turning on deletion vectors
Detailed explanation. Deletion vectors flip the cost model of DELETE and UPDATE from "rewrite affected files" to "mark deleted rows in a bitmap." On a table where MERGE INTO touches 0.1% of rows scattered across 80% of files, the cost difference is 800×.
Question. Given a Delta table where a daily MERGE updates 0.5% of rows scattered across 70% of files, show the cost difference with and without deletion vectors. Walk through what changes on the read path.
Input.
| Trait | Value |
|---|---|
| Table size | 10 TB |
| Files | 10,000 |
| Daily MERGE rows | 0.5% = 50M rows |
| Files touched by MERGE | 70% = 7,000 |
Code.
-- Enable deletion vectors.
ALTER TABLE silver.events SET TBLPROPERTIES (
delta.enableDeletionVectors = true
);
-- The MERGE call doesn't change.
MERGE INTO silver.events AS tgt
USING staging.delta AS src
ON tgt.event_id = src.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Step-by-step explanation.
- Without DV. The MERGE rewrites every touched file: 7,000 files × ~1 GB = 7 TB of writes. Time: ~30-45 min.
- With DV. The MERGE writes a deletion vector per touched file (typically a few KB each) plus a small parquet file with the new rows for the matched rows. Touched-file rewrites: ~0. Time: ~2-3 min.
- On the read path. Each scan applies the deletion vector to mask the deleted rows. A row that exists in the parquet file but is marked deleted in the DV is skipped silently. Read latency increases ~5-10% per file with a DV.
- Periodic OPTIMIZE. When the DV coverage of a file exceeds a threshold (~20% of rows masked), OPTIMIZE rewrites the file cleanly. The DV cost is amortised over many MERGE batches.
Output.
| Operation | Without DV | With DV |
|---|---|---|
| MERGE rewrite bytes | 7 TB | ~50 MB (DVs + new rows) |
| MERGE time | 30-45 min | 2-3 min |
| Read latency overhead | 0% | +5-10% |
| OPTIMIZE consolidation | every MERGE | when DV > 20% |
Rule of thumb. Enable deletion vectors on every Delta table that experiences UPDATE / DELETE traffic. The MERGE / DELETE cost reduction is typically 10-100× depending on row scatter; the read overhead is small and consolidated by periodic OPTIMIZE.
Detailed explanation — migrating Z-Order to liquid clustering
Detailed explanation. Existing Z-Ordered tables can migrate to liquid clustering with a single ALTER TABLE + OPTIMIZE. The migration is a one-way trip — once the table is liquid-clustered, the partition + Z-Order metadata is gone.
Question. Given an existing Delta table partitioned by event_date and Z-Ordered on (country, customer_id), migrate to liquid clustering with cluster key (customer_id, country). Walk through the steps.
Input. Table events with 50 TB, 365 partitions, Z-Order on (country, customer_id).
Code.
-- Step 1 — declare the liquid cluster key.
ALTER TABLE events CLUSTER BY (customer_id, country);
-- Step 2 — drop the partition column (optional; can keep partition for date-range queries)
-- ALTER TABLE events DROP PARTITION FIELD event_date;
-- In practice, keep event_date partition + liquid clustering on the entity axis.
-- Step 3 — run OPTIMIZE; this rewrites files into the liquid layout.
OPTIMIZE events;
-- Step 4 — verify with DESCRIBE DETAIL.
DESCRIBE DETAIL events;
-- properties: { clusteringColumns: 'customer_id,country', ... }
Step-by-step explanation.
-
ALTER TABLE … CLUSTER BYdeclares the clustering intent in the table metadata. The next OPTIMIZE pass rewrites files in liquid layout. - Liquid clustering does not require partition columns, but keeping
event_dateas a partition is still useful for date-range pruning. Most production tables run with both partition + liquid clustering on different axes. - OPTIMIZE rewrites the files. On a 50 TB table this is a ~3-6 hour job; plan it for a maintenance window.
- After OPTIMIZE, subsequent writes are automatically liquid-clustered — no scheduled OPTIMIZE ZORDER needed. Liquid clustering is incrementally maintained per commit.
- Reads on
WHERE customer_id = 17enjoy the same file-skip as Z-Order; reads onWHERE country = 'DE'also benefit because both columns are in the cluster key.
Output.
| Phase | Effect |
|---|---|
| ALTER TABLE … CLUSTER BY | metadata only; no data movement |
| First OPTIMIZE | rewrites all files in liquid layout |
| Subsequent writes | automatically liquid-clustered |
| Subsequent reads | file-skip equivalent to or better than Z-Order |
Rule of thumb. Migrate Z-Order to liquid clustering when the table enters its next major maintenance window. Greenfield tables in 2026+ should start in liquid clustering. Mixed mode (partition + liquid) is the production sweet spot — partition for time, liquid for entity.
Detailed explanation — the CDF + deletion vector interaction
Detailed explanation. Enabling deletion vectors on a CDF-enabled table is safe — CDF correctly captures DV-driven changes — but a few subtle behaviours surprise teams the first time.
Question. Given a CDF + DV-enabled table, walk through what the CDF stream emits when a MERGE INTO writes via deletion vectors.
Input. Table with both delta.enableChangeDataFeed = true and delta.enableDeletionVectors = true.
Code.
-- A MERGE that updates one row.
MERGE INTO events AS t
USING (SELECT 17 AS id, 'DE' AS country) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET country = s.country;
-- Read CDF for the commit.
SELECT * FROM table_changes('events', :v, :v);
Step-by-step explanation.
- The MERGE commit writes a deletion vector marking the row's old version as deleted in its parquet file, plus a tiny new parquet file with the updated row.
- The CDF output emits two rows for the commit:
update_preimage(old country) andupdate_postimage(new country). Same as without DV. - CDF consumers see the same logical row-state change. They do not see — and do not need to see — the DV mechanics.
- The DV-bitmap file is a Delta internal artefact; CDF readers only see the change-data sidecar files, which carry per-row preimage/postimage data.
Output.
| Mechanism | What's written |
|---|---|
| DV bitmap | per-file bitmap masking the old row |
| Updated parquet file | tiny file with the new row |
| CDF sidecar | one update_preimage row + one update_postimage row |
Rule of thumb. DV and CDF compose cleanly. Enable both on every table where MERGE / UPDATE / DELETE traffic is non-trivial. The CDF consumer contract is unchanged; the DV win is purely on the write side.
Senior interview question on VACUUM + OPTIMIZE + CDF cadence
A senior interviewer often frames this as: "Design the maintenance cadence for a 50 TB Bronze Delta table with CDF on, deletion vectors on, 5-minute streaming writes, and a 7-day downstream CDF consumer SLA. What does the orchestrator run and when?"
Solution Using a layered maintenance schedule that respects every constraint
# Orchestrator schedule (Databricks Workflows / Airflow).
# Continuous — Auto Compaction + Optimized Writes (table properties).
# Already set:
# delta.autoOptimize.optimizeWrite = true
# delta.autoOptimize.autoCompact = true
# delta.enableChangeDataFeed = true
# delta.enableDeletionVectors = true
# Every 6 hours — partition-scoped OPTIMIZE (catch any DV consolidation needs).
@every("6h")
def hourly_partition_optimize():
spark.sql("""
OPTIMIZE bronze.events
WHERE event_date >= current_date() - INTERVAL '1' DAY
""")
# Nightly at 01:00 — partition-scoped OPTIMIZE + Z-Order / liquid clustering pass.
@daily(at="01:00")
def nightly_cluster_pass():
spark.sql("""
OPTIMIZE bronze.events
WHERE event_date = current_date() - INTERVAL '1' DAY
""")
# Weekly Sunday 03:00 — VACUUM with 168h retention (the 7-day CDF SLA floor).
@weekly(on="sunday", at="03:00")
def weekly_vacuum():
spark.sql("VACUUM bronze.events RETAIN 168 HOURS")
Step-by-step trace.
| Schedule | Action | Constraint respected |
|---|---|---|
| continuous | Optimized Writes + Auto Compaction | no small-file tax at write time |
| every 6h | partition OPTIMIZE | DV consolidation; hot partition |
| nightly | cluster pass on yesterday | analyst-query SLA in morning |
| weekly | VACUUM 168h | 7-day downstream CDF consumer SLA |
The trace shows the cadences are interlocking — each runs at the highest frequency that respects the slowest constraint. VACUUM does not run more often than weekly because every CDF consumer needs at least 7 days to consume.
Output:
| Window | Maintenance load |
|---|---|
| streaming day-time | continuous compaction, +15-20% write CPU |
| early morning | nightly cluster pass, ~5 min |
| Sunday 03:00 | weekly VACUUM, ~10 min |
Why this works — concept by concept:
- Auto Compaction at writes — handles the small-file tax in real time. Without it, the partition-scoped OPTIMIZE would not be enough.
- Partition scope — every OPTIMIZE is bounded to yesterday's partition. Full-table OPTIMIZE on a 50 TB table is not in the cadence — the layered approach makes it unnecessary.
- Cluster pass nightly — clustering is the expensive rewrite; running daily is sufficient because the analyst SLA is daily.
- VACUUM weekly with 168h floor — respects the downstream CDF SLA. Running VACUUM daily would tombstone files that downstream consumers still need to read.
- DV + CDF compose — both are enabled and run silently in the background. The orchestrator does not need to be aware of either.
- No full-table OPTIMIZE — at 50 TB the full-table cost is hours. The layered cadence keeps the file count low enough that full-table OPTIMIZE is unnecessary.
- Cost — write CPU +15-20%; per-cadence costs are partition-bounded; storage cost is daily_delta × 7 days ≈ 5-7 TB of tombstoned files. All within typical operational budgets.
Data Engineering
Topic — aggregation
Aggregation pipeline tuning problems
Cheat sheet — Delta Lake tuning recipes
-
Enable CDF.
ALTER TABLE db.t SET TBLPROPERTIES (delta.enableChangeDataFeed = true)— costs metadata only until a consumer reads. -
Read CDF.
SELECT * FROM table_changes('db.t', startVer, endVer)for a version range orSELECT * FROM table_changes('db.t', startTs, endTs)for a timestamp range. -
Z-Order.
OPTIMIZE db.t ZORDER BY (col1, col2, col3, col4)— cap at 4 columns, prefer high-cardinality WHERE columns. - Partition column choice. Low-cardinality, time-correlated → partition. High-cardinality, multi-dim → Z-Order or liquid clustering.
-
Target file size. Default 1 GB. Override with
ALTER TABLE db.t SET TBLPROPERTIES (delta.targetFileSize = '1g'). On Databricks, preferdelta.tuneFileSizesForRewrites = true. -
Auto Compaction.
delta.autoOptimize.autoCompact = true— synchronous bin-pack after each commit; adds 15-20% write CPU. -
Optimized Writes.
delta.autoOptimize.optimizeWrite = true— adaptive shuffle so writes target ~128 MB files; pairs with Auto Compaction. -
Partition-scoped OPTIMIZE.
OPTIMIZE db.t WHERE event_date = current_date() - INTERVAL '1' DAY— keep OPTIMIZE cost bounded. -
Deletion Vectors.
ALTER TABLE db.t SET TBLPROPERTIES (delta.enableDeletionVectors = true)— converts DELETE/UPDATE from file-rewrite to bitmap-mask. -
Liquid Clustering (greenfield).
CREATE TABLE db.t (...) CLUSTER BY (col1, col2)— incremental clustering; no scheduled re-OPTIMIZE needed. -
Liquid Clustering (migrate).
ALTER TABLE db.t CLUSTER BY (col1, col2); OPTIMIZE db.t;. -
VACUUM.
VACUUM db.t RETAIN 168 HOURS— the 7-day floor. Never lower. -
Custom retention.
ALTER TABLE db.t SET TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 30 days', delta.logRetentionDuration = 'interval 30 days'). -
Verify layout.
DESCRIBE DETAIL db.treturnsnumFiles,sizeInBytes,partitionColumns,properties— the four numbers to sanity-check. -
History.
DESCRIBE HISTORY db.t LIMIT 20returns the recent commits withoperationMetrics— diagnose write-amp and metric drift. -
Time-travel.
SELECT * FROM db.t VERSION AS OF :vorTIMESTAMP AS OF '2026-06-12 00:00:00'— bounded by retention.
Frequently asked questions
Does enabling CDF rewrite history?
No. ALTER TABLE … SET TBLPROPERTIES (delta.enableChangeDataFeed = true) is a metadata-only change. CDF starts capturing the row-state ledger from the next commit onward; reads against commits prior to the enable point return "CDF not available." To backfill history into a downstream SCD-2 or audit trail, use time-travel VERSION AS OF to snapshot earlier versions and pairwise full-diff between them — a one-time, expensive operation that hands off to cheap CDF reads after the enable point.
What is the difference between Z-Order and liquid clustering?
Z-Order is a scheduled rewrite — OPTIMIZE table ZORDER BY (cols) rewrites every file in the table (or partition) to interleave the chosen columns' bits. Liquid clustering is incrementally maintained — each commit clusters the newly-written rows in place, so a scheduled re-cluster is unnecessary. Liquid clustering also does not require a partition column (it subsumes both partition and Z-Order into one cluster key), works better for UPDATE-heavy tables, and is the default for greenfield tables on Databricks Runtime ≥ 14 / Delta ≥ 3.0. For new tables in 2026, prefer liquid clustering. For existing Z-Ordered tables, migrate during the next maintenance window with ALTER TABLE … CLUSTER BY (…); OPTIMIZE ….
Can I run OPTIMIZE on a streaming Delta table?
Yes — but use partition scope (OPTIMIZE table WHERE partition_col = '...') and avoid the time window when downstream consumers are reading. OPTIMIZE creates new files and tombstones old ones; downstream readers that pin a snapshot version are unaffected because Delta's transaction log preserves the old layout until VACUUM reclaims it. The recommended pattern for streaming Bronze tables is Auto Compaction + Optimized Writes (continuous), partition-scoped OPTIMIZE every 6 hours, and nightly OPTIMIZE ZORDER on yesterday's partition — no full-table OPTIMIZE ever needed on tables larger than 10 TB.
Why does my VACUUM not free storage?
Three common causes. First, you are below the 168-hour retention floor — VACUUM refuses to reclaim files newer than the threshold; the run completes but reclaims nothing. Second, the files you expect to vacuum are still referenced by the latest snapshot — they are ACTIVE, not TOMBSTONED, so VACUUM has no authority over them. Third, deletion vectors are masking rows without actually rewriting files — the parquet files are still ACTIVE; only a future OPTIMIZE consolidation makes the old files tombstone-eligible. Diagnose with DESCRIBE DETAIL (numFiles and sizeInBytes) and DESCRIBE HISTORY (recent operations) to see what is and is not tombstoned.
Does Z-Order on more than 4 columns help?
Almost never. Z-Order interleaves the bits of the chosen columns; with N columns, each output bit carries 1/N-th of each input column's information. At N=2 each column gets 50% of the bits — strong pruning on both. At N=4 each column gets 25% — pruning is weaker but still useful. At N=8 each column gets 12.5% — pruning approaches no-op and OPTIMIZE pays for ordering that doesn't help. The Delta docs cap recommended Z-Order at 4 columns. If you have 8+ filter columns, pick the top 4 by filter-frequency × log(cardinality) × selectivity and leave the others alone; or migrate to liquid clustering, which also caps at 4 cluster columns but maintains them incrementally.
How do deletion vectors affect CDF output?
Not at all — at the logical level. CDF emits the same update_preimage / update_postimage / delete row pairs whether the underlying mechanism is a file rewrite or a deletion vector. Downstream consumers reading table_changes() see identical output. Physically, DV-driven changes write a tiny bitmap plus a small parquet file with new rows instead of rewriting the entire affected files; the CDF sidecar files are produced the same way. CDF + DV compose cleanly and should both be enabled on tables with significant MERGE / UPDATE / DELETE traffic — DV cuts the write cost 10-100×, CDF makes the changes cheaply consumable downstream.
Practice on PipeCode
- Drill the optimization practice library → for the file-layout family of probes — Z-Order, bin-packing, partition design.
- Rehearse on streaming Δ-throughput problems → for CDF-style incremental consumers, micro-batch sizing, and watermark-vs-version mental models.
- Layer the aggregation practice library → for the materialised-view refresh family of questions.
- Stack the joins practice library → for the MERGE-INTO / SCD-2 build patterns.
- Add bucketing drills → for the Z-Order column-picking interview surface.
- Rehearse on the Databricks company practice library → for Delta-flavoured interview probes.
- For the broader surface, read top data engineering interview questions →.
- Take the Apache Spark internals course → for the execution-plan + shuffle layer underneath Delta.
- Take the PySpark fundamentals course → for the read/write surface where OPTIMIZE and Z-Order plug in.
- Take the ETL system design course → for the lakehouse layering, medallion architecture, and the CDF-as-propagation pattern.
Pipecode.ai is Leetcode for Data Engineering — every Delta tuning recipe above ships with hands-on practice rooms where you sketch the CDF SCD-2 build, pick the right Z-Order columns by hand, schedule the OPTIMIZE cadence, and design the VACUUM retention for a regulated table against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you walk into a senior Databricks or Delta-shop interview with the lakehouse tuning vocabulary already in muscle memory.
Practice Delta Lake tuning now →
Databricks-flavoured drills →





Top comments (0)