Gowtham Potureddi

Posted on Jun 13

Delta Lake Change Data Feed (CDF) & Z-Ordering: Performance Tuning

#sql #python #interview #dataengineering

delta lake at small scale forgives almost every design sin: append-only writes, no partitioning, no clustering, no compaction. The bill arrives the day the table crosses 10 TB — that is when junior teams discover that a SELECT * FROM events WHERE country = 'DE' AND customer_id = 17 reads forty thousand files instead of four, and that an upstream Change Data Capture pipeline rewrites a third of the table on every nightly merge.

This guide is the senior-engineer tuning playbook for the two highest-leverage Delta features in 2026 — delta lake cdf (Change Data Feed) for row-state propagation and Z-Ordering for multi-dimensional file skipping — wired together with the OPTIMIZE bin-packer, the VACUUM retention timeline, deletion vectors, and the new liquid clustering primitive. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the optimization practice library →, stack the streaming and Δ-throughput drills →, and rehearse on Databricks-flavoured interview problems →.

On this page

Why CDF + Z-Order — when each wins
CDF mechanics — _change_type, version, timestamp
Z-Ordering deep dive — multi-dim clustering, file skipping
OPTIMIZE + Auto Compaction — bin-packing, optimal write target
Production tuning — VACUUM, deletion vectors, liquid clustering
Cheat sheet — Delta Lake tuning recipes
Frequently asked questions
Practice on PipeCode

1. Why CDF + Z-Order — when each wins

Two axes of Delta tuning — row-state propagation (CDF) and file-skip locality (Z-Order)

The one-sentence invariant: Delta Lake performance at scale is two orthogonal problems — propagating row-state cheaply (CDF) and pruning files on read cheaply (Z-Order) — and you tune each axis with a different set of knobs. Once you internalise that "CDF is for downstream sync, Z-Order is for filtered reads," every Delta tuning interview question becomes a placement exercise on this 2x2 matrix.

The 2x2 placement matrix.

Write-heavy + downstream consumers — delta lake cdf wins. You pay write-amp once and downstream materialised views consume tiny delta streams instead of re-reading the whole table.
Read-heavy + multi-dim filters — z-ordering wins. You pay an OPTIMIZE pass once per cycle and every subsequent query skips 80–98% of files via min/max pruning.
Write-heavy + read-heavy — both, with OPTIMIZE on a separate hot-set, and CDF only on the dimensions that drive downstream MVs.
Append-only + single-column filter — neither. Partitioning on the filter column plus auto compaction is cheaper than Z-Order's full rewrite.

Three places naive Delta tables bleed.

Small-file tax. Streaming writes create thousands of tiny parquet files per hour. A SELECT * scans every file's footer to read min/max stats — the metadata cost alone can dwarf the data cost. Fix: OPTIMIZE bin-packs into 1 GB targets; auto compaction handles it at write-time.
Full-table merges. A nightly MERGE INTO target USING source ON target.key = source.key against a non-clustered table rewrites every matched file even when only 0.1% of rows changed. Fix: enable deletion vectors for row-level deletes; cluster the merge key with Z-Order or liquid clustering so the file-touch set is small.
History blindspots. Downstream materialised views and SCD-2 dimensions re-read the whole table because they have no incremental contract. Fix: enable CDF and consume table_changes() between versions — read only the Δ rows.

What interviewers listen for.

Do you say "CDF is for propagation, Z-Order is for pruning" when asked which to use? — senior signal.
Do you mention that Z-Order is a full rewrite and OPTIMIZE is idempotent? — required answer.
Do you bring up liquid clustering as the 2024+ successor to Z-Order on greenfield tables? — senior signal.
Do you separate the VACUUM retention floor from time-travel coverage? — required answer.

The 2026 reality.

CDF is GA across Databricks Runtime ≥ 8.4 and OSS Delta ≥ 2.0; it emits _change_type, _commit_version, _commit_timestamp columns out of the read API.
Z-Order is still the workhorse for existing tables; greenfield tables on Databricks Runtime ≥ 14 should default to liquid clustering.
Deletion vectors are GA in Delta 3.0+ and on Databricks; they convert DELETE/UPDATE from "rewrite affected files" to "mark deleted rows in a sidecar bitmap."
The 168-hour VACUUM floor is enforced by spark.databricks.delta.retentionDurationCheck.enabled. Lowering it disables time-travel and breaks audit guarantees.

Detailed explanation — the four-knob tuning model

Detailed explanation. Senior data engineers carry a mental model with four knobs: file count (OPTIMIZE / auto compaction), file layout (Z-Order / liquid clustering), row-state propagation (CDF), and storage hygiene (VACUUM / deletion vectors). Every Delta performance question maps to one of those four knobs, and an interviewer who hears all four named gets confidence the candidate has tuned a table at scale.

Question. Walk through a 50 TB Delta Bronze table with 5-minute streaming writes from Kafka and a nightly downstream MERGE that updates 0.5% of rows. Which Delta features do you enable, and in what order?

Input.

Workload trait	Value
Table size	50 TB
Write cadence	5-minute Structured Streaming micro-batches
Write volume	~ 800 GB/day
Read cadence	nightly MERGE + ad-hoc analyst queries
Filter columns	`country`, `customer_id`, `event_ts`

Code.

-- 1) Enable optimized writes + auto compaction (handles small-file tax)
ALTER TABLE bronze.events SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact   = true
);

-- 2) Enable CDF so downstream MVs consume only the delta
ALTER TABLE bronze.events SET TBLPROPERTIES (
  delta.enableChangeDataFeed = true
);

-- 3) Partition by event_date (low-card) + Z-Order by (country, customer_id)
ALTER TABLE bronze.events PARTITIONED BY (event_date);
OPTIMIZE bronze.events ZORDER BY (country, customer_id);

-- 4) Enable deletion vectors for row-level deletes without file rewrites
ALTER TABLE bronze.events SET TBLPROPERTIES (
  delta.enableDeletionVectors = true
);

-- 5) Keep VACUUM at the 168h floor for time-travel + audit safety
VACUUM bronze.events RETAIN 168 HOURS;

Step-by-step explanation.

Auto compaction + optimized writes go on first — they pay for themselves immediately by collapsing the small-file tax that streaming writes inflict. Optimized writes does an adaptive shuffle at write-time targeting 128 MB writes; auto compaction post-batch bin-packs those into 1 GB files.
CDF is the cheapest "future-proofing" knob — it costs only metadata until the day a downstream consumer asks for table_changes(0, latest). Enable it before the table grows; enabling on a 50 TB table later requires a full rewrite of the snapshot.
Partitioning + Z-Order is the layout choice. event_date is low-cardinality and a natural time partition (one partition per day). country and customer_id are high-cardinality and appear in every WHERE — perfect Z-Order targets. Z-Order inside each partition keeps the file-skip ratio above 95%.
Deletion vectors flip the merge cost equation: an UPDATE of 0.5% of rows would normally rewrite 0.5% of files (worst case 100% of files when the changed rows are scattered). With DVs, the engine writes a tiny bitmap that masks deleted rows from reads — no file rewrite until a future OPTIMIZE.
VACUUM stays at 168 hours — the default and the contractual floor for time-travel. Lowering it to "save storage" breaks the audit story and is the single most common rollback after a senior reviewer catches it.

Output.

Knob	Effect	Cost
Optimized writes + auto compaction	small-file tax → near-zero	+5-15% write CPU
CDF	downstream MV refresh from Δ instead of full scan	+metadata only until read
Partition + Z-Order	95%+ file skipping on WHERE filters	one OPTIMIZE pass per cycle
Deletion vectors	UPDATE/DELETE without file rewrite	small bitmap per modified file
VACUUM 168h	time-travel + audit preserved	168h of tombstone storage

Rule of thumb. Order matters — enable the cheap futures (CDF, DVs, auto compaction) before the table grows. Layout decisions (partition + Z-Order or liquid clustering) should be made on the first write, because retro-fitting them on a 50 TB table is a 12-hour OPTIMIZE job that scales linearly with table size.

Detailed explanation — when partitioning beats Z-Order

Detailed explanation. Z-Order is not always the answer. Low-cardinality columns — country with 200 values, event_date with 365 values per year — are textbook partitioning columns, not Z-Order columns. Partitioning lets the planner skip entire directories before any min/max stats are read; Z-Order is for the inside of a partition.

Question. Given a table with two filter columns — event_date (cardinality 365 over the table's lifetime) and customer_id (cardinality 50M) — show the right layout and the wrong layout, and why the right one is 6× cheaper on a typical analyst query.

Input.

Filter	Cardinality	Selectivity per WHERE
`event_date`	365	1 day = 1/365
`customer_id`	50M	1 cust = 1/50M

Code.

-- WRONG — Z-Order on the low-cardinality column
ALTER TABLE events PARTITIONED BY ();   -- no partition
OPTIMIZE events ZORDER BY (event_date, customer_id);

-- RIGHT — partition by date, Z-Order by customer
ALTER TABLE events PARTITIONED BY (event_date);
OPTIMIZE events ZORDER BY (customer_id);

Step-by-step explanation.

With Z-Order alone on (event_date, customer_id), the planner must read min/max stats for every file in the table to find which ones touch the target date. On a 50 TB table that is ~50,000 files; reading 50K parquet footers is roughly 30-60 seconds of metadata work before the actual data scan begins.
With event_date as a partition column, the planner skips entire directory partitions on the date filter — WHERE event_date = '2026-06-13' reads exactly one partition's directory listing, ~137 files for a 50 TB / 365 day table.
Inside that one partition, Z-Order on customer_id clusters records with similar customer IDs into the same parquet file. A query like WHERE event_date = '2026-06-13' AND customer_id = 17 then reads ~1 file, not 137.
The net query touches 1 file vs the Z-Order-only design's 50,000 file metadata reads + ~137 file scans. A real measured speedup on this shape is 6×–40× depending on warm cache.

Output.

Design	Files touched (metadata)	Files scanned	Approx latency
Z-Order only on `(date, customer)`	50,000	~137	38 s
Partition by date + Z-Order on customer	137	1	4 s

Rule of thumb. Cardinality determines the role. Low-card (< 1000 distinct, time-correlated) → partition column. High-card (> 10K distinct, appears in WHERE) → Z-Order column. Z-Ordering a low-card column wastes the rewrite budget; partitioning a high-card column creates a directory-explosion disaster (50M sub-directories is unmanageable on object storage).

Detailed explanation — the cost of NOT enabling CDF

Detailed explanation. Some teams skip CDF "because we don't need it yet." The cost arrives the day a downstream consumer needs incremental refresh and discovers the only option is a full-table diff against a snapshot — which on a 50 TB table is a 4-hour, 50 TB read, every night.

Question. Given a Bronze events table feeding three Silver materialised views with daily refresh, compare the storage + compute cost of running with CDF vs without.

Input.

Trait	Value
Bronze size	50 TB
Daily delta (changed rows)	1% = 500 GB
Silver MVs	3 dependent views

Code.

-- Without CDF — every MV refresh re-reads the whole Bronze table
INSERT OVERWRITE silver.events_by_country
SELECT country, COUNT(*) AS event_count
FROM bronze.events
GROUP BY country;
-- Reads 50 TB nightly per MV → 3 MVs × 50 TB = 150 TB scanned

-- With CDF enabled — Silver MV reads only the delta
INSERT INTO silver.events_by_country
SELECT country, _change_type, _commit_version, COUNT(*) AS event_count
FROM table_changes('bronze.events', :last_version, :current_version)
GROUP BY country, _change_type, _commit_version;
-- Reads ~500 GB nightly per MV → 3 MVs × 500 GB = 1.5 TB scanned

Step-by-step explanation.

Without CDF the only correct incremental contract is "diff today's snapshot against yesterday's." Most teams skip this and full-scan instead — 50 TB read × 3 MVs = 150 TB scanned nightly, ~$300/day at typical object-storage egress + compute prices.
With CDF the read API exposes only the rows that changed between two commit versions. The job reads ~500 GB (1% delta × 3 MVs) instead of 150 TB — a 100× reduction in scan volume.
The CDF cost is paid on the write side: roughly 3-7% extra storage for the change-data sidecar files and ~5% write CPU. Two days of CDF storage costs less than one MV full-scan.
The crossover point — where CDF pays for itself — is usually one downstream consumer. By the third or fourth dependent MV, the cost difference is roughly 20-30×.

Output.

Refresh mode	Bytes scanned per night	Compute cost (relative)
Without CDF (3 full scans)	150 TB	100x
With CDF (3 Δ reads)	1.5 TB	1x

Rule of thumb. Enable CDF on every table that has — or might have — more than one downstream consumer. The metadata-only cost when nobody reads is negligible; the rewrite cost of enabling it on a multi-TB table later is brutal.

Senior interview question on Delta Lake tuning strategy

A senior interviewer often opens with: "You inherit a 50 TB Delta Bronze table with 200K small files, no CDF, no partitioning, and a 24-hour SLA on a downstream Silver refresh that currently takes 6 hours. Where do you start?" The probe blends file-count remediation, layout strategy, and CDF enablement into a single decision sequence.

Solution Using a four-step Delta Lake remediation playbook

-- Step 1 — measure the damage
DESCRIBE DETAIL bronze.events;
-- numFiles, sizeInBytes, partitionColumns, properties

-- Step 2 — emergency OPTIMIZE (bin-pack the small files)
OPTIMIZE bronze.events;
-- 200K small files → ~50 files of ~1 GB each

-- Step 3 — enable CDF + auto compaction so the problem doesn't return
ALTER TABLE bronze.events SET TBLPROPERTIES (
  delta.enableChangeDataFeed       = true,
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact   = true
);

-- Step 4 — layout for the next 50 TB
ALTER TABLE bronze.events PARTITIONED BY (event_date);
OPTIMIZE bronze.events ZORDER BY (country, customer_id);

Step-by-step trace.

Step	Action	Files	Storage	Read cost
Baseline	inherited table	200,000	50 TB	6h refresh
After OPTIMIZE	bin-packed	~50,000	50 TB	90 min
After CDF enable	future writes carry Δ	~50,000	50 TB + 3%	90 min
After partition + Z-Order	clustered, partitioned	~50,000	50 TB	18 min

The trace shows the cost compounds — Step 2 alone cuts read latency 4×; Step 4 (the layout fix) is what gets the refresh inside the 24-hour SLA with margin to spare.

Output:

Metric	Before	After
numFiles	200,000	~50,000
Silver refresh latency	6h	18min
File-skip ratio	~12%	~96%
Downstream Δ read	full scan	CDF Δ only

Why this works — concept by concept:

Measure before mutate — DESCRIBE DETAIL is the single most important command in Delta tuning. It returns file count, partition columns, table properties, and the latest commit version — the four numbers you need to size the remediation.
OPTIMIZE is idempotent — running it twice is a no-op when files already fit the target size. Safe to schedule nightly.
CDF before growth — enabling CDF on a small table is free metadata. Enabling on a 100 TB table is a rewrite. The "future-proofing" cost is asymmetric.
Partition + Z-Order is the layout — partition handles the time axis (low-card); Z-Order handles the multi-dim filter axis (high-card). They are complements, not competitors.
File-skip ratio is the KPI — the number to track is "files read / files scanned-for-stats." A well-tuned table reads < 5% of its files for a typical query. Below 50% means the layout is wrong.
Cost — Step 2 (OPTIMIZE) is O(table_size) — pay once. Steps 3 and 4 are O(metadata) plus one OPTIMIZE pass. Recurring cost: O(daily_delta) for auto compaction.

Data Engineering
Topic — optimization
Optimization problems (Data Engineering)

Practice →

2. CDF mechanics — `_change_type`, version, timestamp

Change Data Feed turns every insert, update, and delete into a ledger row — read it with `table_changes()`

The mental model in one line: CDF adds three hidden columns to a Delta table — _change_type, _commit_version, _commit_timestamp — that downstream consumers query via table_changes('db.table', start, end) to read only the rows that actually changed between two commit versions. Once you say "CDF is a versioned ledger of row-state transitions," every downstream sync, SCD-2 build, and audit story collapses into a single read pattern.

The four _change_type values.

insert — a brand-new row landed in this commit. The row data appears as-is in the CDF output.
update_preimage — the BEFORE image of an updated row. Carries the values the row had before the UPDATE.
update_postimage — the AFTER image of the same updated row, in the same commit. Carries the values the row has after the UPDATE.
delete — the row that was removed in this commit. Carries the row's final values before deletion.

The two hidden time/version columns.

_commit_version — the Delta transaction log version number where the change was committed. Monotonically increasing; perfect for "consume since last seen" cursors.
_commit_timestamp — the wall-clock time of the commit. Useful for time-window CDF reads ("what changed in the last 6 hours?") but not monotonic across concurrent writers; always prefer version for ordering.

Enabling CDF.

New table. CREATE TABLE … TBLPROPERTIES (delta.enableChangeDataFeed = true). Costs nothing extra at creation.
Existing table. ALTER TABLE … SET TBLPROPERTIES (delta.enableChangeDataFeed = true). CDF starts capturing from the next commit; does not back-fill history. Versions prior to the enable point return "CDF not available" if read.
Downstream consumer contract. Once enabled, every commit writes both the table data and a sidecar _change_data directory carrying the CDF rows for that commit. Storage overhead is roughly 3-7% on average tables; higher for update-heavy workloads.

Read patterns.

By version range — SELECT * FROM table_changes('db.events', 100, 105). Read the deltas between commit version 100 (exclusive) and 105 (inclusive).
By timestamp range — SELECT * FROM table_changes('db.events', '2026-06-13', '2026-06-14'). Read everything that committed in the window.
Streaming. spark.readStream.option('readChangeFeed', 'true').table('db.events') — Spark Structured Streaming reads CDF as an unbounded source; checkpoints handle the cursor.

Use cases.

Downstream materialised view sync. Silver/Gold MVs refresh from CDF deltas instead of full scans. Cuts MV refresh cost from O(table_size) to O(daily_delta).
SCD-2 dimension build. update_preimage becomes the row's "valid_to" timestamp; update_postimage becomes the new row's "valid_from." No expensive WINDOW … LEAD over the dimension.
Audit trail. Every change carries the committing user, the version, and the timestamp. Regulators love the deterministic ledger.
Debezium replacement. For lakehouse-native CDC, CDF removes the need for a Kafka Connect Debezium pipeline that mirrors database row state to a Kafka topic — Delta does it natively.

Detailed explanation — building an SCD-2 from CDF in one SQL

Detailed explanation. Slowly-Changing Dimension Type 2 (SCD-2) is the classic dimensional modelling pattern where each historical row version is preserved with valid_from/valid_to timestamps. Without CDF, building an SCD-2 from a Delta source requires a LEAD() window over the source plus an anti-join — expensive on large dimensions. With CDF, every update_preimage/update_postimage pair is already a "version closes / version opens" event.

Question. Given a customers table with CDF enabled, build the SCD-2 history table dim_customers_history directly from the CDF stream. Show how update_preimage becomes a closed row and update_postimage becomes the new open row.

Input — table_changes('customers', 0, 5) output.

customer_id	name	tier	_change_type	_commit_version	_commit_timestamp
1	Alice	silver	insert	2	2026-06-13 08:00
1	Alice	silver	update_preimage	4	2026-06-13 09:30
1	Alice	gold	update_postimage	4	2026-06-13 09:30
2	Bob	bronze	insert	3	2026-06-13 08:15
2	Bob	bronze	delete	5	2026-06-13 10:00

Code.

-- Build SCD-2 history rows from the CDF stream.
MERGE INTO dim_customers_history AS tgt
USING (
  SELECT
    customer_id,
    name,
    tier,
    _change_type,
    _commit_timestamp                                    AS event_ts,
    -- valid_from / valid_to derived from the change type
    CASE
      WHEN _change_type = 'insert'             THEN _commit_timestamp
      WHEN _change_type = 'update_postimage'   THEN _commit_timestamp
      ELSE NULL
    END AS valid_from,
    CASE
      WHEN _change_type = 'update_preimage'    THEN _commit_timestamp
      WHEN _change_type = 'delete'             THEN _commit_timestamp
      ELSE TIMESTAMP '9999-12-31 23:59:59'
    END AS valid_to
  FROM table_changes('db.customers', :last_version, :current_version)
  WHERE _change_type IN ('insert', 'update_preimage', 'update_postimage', 'delete')
) AS src
ON tgt.customer_id = src.customer_id
   AND tgt.valid_to = TIMESTAMP '9999-12-31 23:59:59'
   AND src._change_type IN ('update_preimage', 'delete')
WHEN MATCHED THEN
  UPDATE SET tgt.valid_to = src.valid_to
WHEN NOT MATCHED AND src._change_type IN ('insert', 'update_postimage') THEN
  INSERT (customer_id, name, tier, valid_from, valid_to)
  VALUES (src.customer_id, src.name, src.tier, src.valid_from, src.valid_to);

Step-by-step explanation.

The CDF read returns one row per state transition; the SQL needs to convert these into the SCD-2 "close old, open new" contract.
For insert, the row opens with valid_from = commit_ts and stays open with valid_to = '9999-12-31'.
For update_preimage, the existing open row is closed by setting its valid_to to the commit timestamp — the MATCHED branch handles this.
For update_postimage, the new state is opened by inserting a new row with valid_from = commit_ts — the NOT MATCHED branch handles this.
For delete, the existing open row is closed by setting valid_to = commit_ts — same MATCHED branch as preimage.
The two changes for the same commit (preimage + postimage) share the same _commit_version and _commit_timestamp, so the closed-row's valid_to equals the new-row's valid_from — the contiguous-history contract holds.

Output — dim_customers_history.

customer_id	name	tier	valid_from	valid_to
1	Alice	silver	2026-06-13 08:00	2026-06-13 09:30
1	Alice	gold	2026-06-13 09:30	9999-12-31 23:59:59
2	Bob	bronze	2026-06-13 08:15	2026-06-13 10:00

Rule of thumb. When the source table has CDF enabled, never build SCD-2 with a window function over the dimension; consume CDF directly. The CDF read is bounded by O(daily_delta), whereas a window-over-dimension is bounded by O(dimension_size × cardinality).

Detailed explanation — incremental MV refresh from CDF

Detailed explanation. A materialised view that aggregates by country (e.g. "event count per country") normally re-aggregates the whole Bronze table on each refresh. With CDF, the MV can be incrementally maintained — apply only the Δ contributions from the current batch's CDF rows.

Question. Given bronze.events with CDF enabled, refresh silver.events_by_country incrementally — add inserts, subtract deletes, and apply the net effect of updates.

Input — table_changes('bronze.events', 100, 101).

event_id	country	_change_type	_commit_version
1	DE	insert	101
2	FR	insert	101
3	DE	delete	101
4	US	update_preimage	101
4	DE	update_postimage	101

Code.

-- Net Δ per country from the CDF batch.
WITH delta AS (
  SELECT
    country,
    CASE _change_type
      WHEN 'insert'           THEN +1
      WHEN 'update_postimage' THEN +1
      WHEN 'delete'           THEN -1
      WHEN 'update_preimage'  THEN -1
      ELSE 0
    END AS delta_count
  FROM table_changes('bronze.events', :last_version, :current_version)
)
MERGE INTO silver.events_by_country AS tgt
USING (
  SELECT country, SUM(delta_count) AS delta_total
  FROM delta
  GROUP BY country
) AS src
ON tgt.country = src.country
WHEN MATCHED THEN
  UPDATE SET tgt.event_count = tgt.event_count + src.delta_total
WHEN NOT MATCHED THEN
  INSERT (country, event_count) VALUES (src.country, src.delta_total);

Step-by-step explanation.

Each CDF row carries a signed contribution: insert = +1, delete = -1. An UPDATE is two rows — preimage (-1 from old country) + postimage (+1 to new country) — the net effect is "moved one event from US to DE."
The CTE assigns the signed delta to each row, then a GROUP BY aggregates per country.
The MERGE applies the per-country delta to the existing MV: event_count += delta. New countries get an INSERT.
The result is identical to a from-scratch re-aggregation, but reads only the CDF delta instead of the entire Bronze table.

Output — net delta per country.

country	delta_total	resulting event_count
DE	+1 (insert) -1 (delete) +1 (postimage) = +1	old + 1
FR	+1 (insert)	old + 1
US	-1 (preimage)	old - 1

Rule of thumb. Every MV whose aggregate is a commutative function (SUM, COUNT, MIN, MAX with retract) can be incrementally maintained from CDF. The MV cost goes from O(table_size × MVs) per refresh to O(daily_delta × MVs) — typically a 100× reduction on a 1% daily delta.

Detailed explanation — the write-amp cost of CDF

Detailed explanation. Enabling CDF is not free. Each commit now writes both the new data files and a sidecar _change_data directory carrying the CDF rows. Update-heavy tables pay the most: each updated row produces two CDF rows (preimage + postimage). For most workloads the cost is 3-7% of the data write volume; for tables where most commits are UPDATEs over wide rows it can climb to 20%.

Question. Estimate the write-amp overhead for a Delta table where 60% of commits are inserts, 30% are updates (touching one column of a 100-column row), and 10% are deletes. Show why update-heavy tables pay more.

Input.

Operation	Share	Rows per commit
insert	60%	1M
update	30%	1M
delete	10%	1M

Code.

-- Inspect CDF file count vs main data file count after a representative batch.
DESCRIBE DETAIL db.events;          -- numFiles, sizeInBytes
DESCRIBE HISTORY db.events LIMIT 3; -- last 3 commits w/ operationMetrics

-- Operational read of CDF storage overhead.
SELECT
  SUM(CASE WHEN _change_type = 'insert' THEN 1 ELSE 0 END)              AS inserts,
  SUM(CASE WHEN _change_type IN ('update_preimage','update_postimage') THEN 1 ELSE 0 END) AS update_rows,
  SUM(CASE WHEN _change_type = 'delete' THEN 1 ELSE 0 END)              AS deletes
FROM table_changes('db.events', :start_version, :end_version);

Step-by-step explanation.

Inserts emit 1 CDF row per inserted row — overhead = 1× the row size.
Updates emit 2 CDF rows per updated row (preimage + postimage) — overhead = 2× the row size. Even if the UPDATE only changes one column, both full row images are written.
Deletes emit 1 CDF row per deleted row — overhead = 1× the row size.
With the workload mix above: total CDF row count = (60% × 1M) + (30% × 1M × 2) + (10% × 1M) = 0.6M + 0.6M + 0.1M = 1.3M CDF rows per 1M data row commits. That is 30% overhead on this update-heavy mix.
On a 60% insert / 5% update / 35% delete workload, the math is 0.6M + 0.1M + 0.35M = 1.05M CDF rows per 1M — 5% overhead. Tunable by the workload's update share.

Output.

Workload mix	CDF rows / data rows	Storage overhead
60% insert, 30% update, 10% delete	1.3×	30%
80% insert, 5% update, 15% delete	1.05×	5%
100% insert (append-only)	1.0×	0% — but no benefit either

Rule of thumb. CDF storage overhead = 1 + (update_share). If your workload is insert-dominant, CDF is essentially free. If it is update-dominant and you don't need downstream propagation, leave CDF off and rely on time-travel for the rare audit query.

Senior interview question on backfilling SCD-2 when CDF was enabled late

A senior interviewer often frames this as: "Your team enables CDF on a 2-year-old dimension table on Monday. On Tuesday a downstream team needs an SCD-2 history that goes back to the table's birth. How do you backfill?" The probe blends CDF's "no historical capture" gotcha with time-travel and a manual full-diff fallback.

Solution Using time-travel + full-table diff to backfill, then CDF for forward state

-- Step 1 — snapshot the dimension at its earliest accessible version.
-- (Time-travel only goes back as far as VACUUM retention allows.)
CREATE OR REPLACE TABLE _scratch.dim_baseline AS
SELECT *, TIMESTAMP '2026-01-01 00:00:00' AS valid_from
FROM db.customers VERSION AS OF 0;   -- or the earliest available version

-- Step 2 — full-diff each subsequent version up to the CDF enable point,
-- emitting synthetic _change_type rows.
CREATE OR REPLACE TABLE _scratch.synthetic_cdf AS
SELECT
  COALESCE(curr.customer_id, prev.customer_id) AS customer_id,
  CASE
    WHEN prev.customer_id IS NULL                                 THEN 'insert'
    WHEN curr.customer_id IS NULL                                 THEN 'delete'
    WHEN curr.name <> prev.name OR curr.tier <> prev.tier         THEN 'update_postimage'
  END AS _change_type,
  curr.name, curr.tier,
  :commit_version AS _commit_version,
  :commit_ts      AS _commit_timestamp
FROM db.customers VERSION AS OF :curr_version curr
FULL OUTER JOIN db.customers VERSION AS OF :prev_version prev
  ON curr.customer_id = prev.customer_id
WHERE curr.customer_id IS NULL
   OR prev.customer_id IS NULL
   OR curr.name <> prev.name
   OR curr.tier <> prev.tier;

-- Step 3 — union synthetic CDF with the real CDF stream and feed the SCD-2 build.
INSERT INTO dim_customers_history
SELECT * FROM _scratch.synthetic_cdf
UNION ALL
SELECT * FROM table_changes('db.customers', :cdf_enable_version, :now);

Step-by-step trace.

Phase	Source	Versions covered	Cost
Baseline snapshot	`VERSION AS OF 0`	one snapshot	one full scan
Synthetic CDF	pairwise FULL OUTER JOIN	every version before CDF	O(versions × table_size)
Real CDF	`table_changes()`	every version after CDF enable	O(daily_delta)

The trace highlights the asymmetry: synthetic backfill is brutal — every pairwise version diff is a full-table scan — but it only runs once. After the backfill, the cheap CDF read takes over forever.

Output:

Phase	Bytes scanned	Output rows
Baseline (v=0)	1× table_size	dimension_size
Synthetic CDF (v=1 to v=cdf_enable)	N × table_size	≈ daily_delta × N
Real CDF (v=cdf_enable to now)	O(daily_delta)	≈ daily_delta × M

Why this works — concept by concept:

VERSION AS OF — Delta time-travel exposes any prior version that has not been VACUUM-reclaimed. The earliest accessible version is bounded by the retention floor (default 168h).
Pairwise full-diff is the only fallback — without CDF, the only way to derive "what changed between v=k and v=k+1" is a full FULL OUTER JOIN. Expensive but deterministic.
UNION ALL bridges the regime change — synthetic CDF covers history, real CDF covers forward time. The downstream MERGE doesn't need to distinguish.
Cost asymmetry — one-time backfill cost vs ongoing CDF read cost. The breakeven is "any single downstream consumer for any single day."
Time-travel + retention coupling — if VACUUM retention is set too low, the earliest accessible version may already be reclaimed and the synthetic backfill fails. Run VACUUM check before the backfill plan.
Cost — synthetic backfill is O(versions × table_size); real CDF is O(daily_delta). The asymptotic ratio is daily_delta / table_size — typically 0.01-0.1.

Data Engineering
Topic — streaming
Streaming Δ processing problems

Practice →

3. Z-Ordering deep dive — multi-dim clustering, file skipping

Z-Order is a space-filling curve — interleaving the bits of two columns clusters records that are close in 2D into the same parquet file

The mental model in one line: Z-Order interleaves the bit representations of the chosen columns and sorts by the interleaved key, producing a layout where records close in multi-dimensional space end up in adjacent rows of the sorted file — which means in the same parquet file with similar min/max stats — which means file skipping at read time. Once you visualise "the Z-curve weaves through the cells of a 2D grid and one file holds one curve segment," every Z-Order interview question becomes a question about file-skip ratio.

The space-filling curve insight.

A sort on column A clusters records by A alone — perfect file-skip on WHERE a = ?, useless on WHERE b = ?.
A sort on (A, B) clusters by A then B — perfect on WHERE a = ?, partial on WHERE a = ? AND b = ?, useless on WHERE b = ?.
A Z-Order on (A, B) interleaves the bits of A and B before sorting — every prefix of the interleaved key carries information about both columns — partial file-skip on WHERE a = ?, partial on WHERE b = ?, strong on WHERE a = ? AND b = ?.
The "Z" shape comes from drawing the curve in 2D space — it visits the four cells of every 2x2 block in a "Z" pattern before jumping to the next 2x2.

File skipping = the actual currency.

Delta stores per-file min/max for each Z-Order column in the transaction log.
A WHERE predicate's accepted range is intersected against each file's min/max — files whose ranges do not overlap are skipped entirely (no parquet read).
Z-Order shrinks the per-file ranges of every Z-Order column simultaneously, so file-skip ratios above 90% are common on tuned tables.

When Z-Order helps.

High-cardinality columns (more than 10K distinct values). Low-card columns share ranges with too many other files; Z-Order does not help.
Many WHERE filters across two to four columns. The whole point is multi-dim pruning; one-column filters do not need Z-Order.
Point lookups and small-range scans. A WHERE customer_id = 17 against a 50 TB Z-Order table reads ~1 file instead of ~50,000.

When Z-Order hurts.

Low-cardinality columns. Use partitioning instead. Z-Order on country (200 values) wastes the rewrite budget.
Single-column filters always. A simple ORDER BY clustered write is cheaper than Z-Order.
Tables with churning data. Every UPDATE invalidates the Z-Order clustering on the touched files; for high-churn tables, prefer liquid clustering which maintains clustering incrementally.

The 4-column ceiling.

Bit-interleaving 2 columns: each output bit carries information about both inputs — strong pruning on both.
4 columns: each output bit carries information about 1/4 of the input — pruning is diluted.
5+ columns: pruning degenerates to roughly the same as random ordering. The Delta docs cap recommended Z-Order at 4 columns; pick the most-filtered four.

Diminishing returns intuition.

Z-Order spreads the pruning budget across columns. With 2 columns, each gets half the bits — pruning works well on both.
With 4 columns, each gets a quarter — pruning is weaker but still useful.
With 8 columns, each gets an eighth — pruning approaches no-op. Better to partition one column and Z-Order the others.

Detailed explanation — tracing the 2D Z-curve by hand

Detailed explanation. The fastest way to internalise Z-Order is to trace the 2D curve on a 4x4 grid by hand. The curve visits cells in the order of their bit-interleaved row-column key — and after the trace, the "records close in 2D land in the same file" claim becomes visible.

Question. On a 4x4 grid with rows 0-3 and columns 0-3, list the cells in Z-Order traversal order. Show that adjacent cells in the traversal are close in 2D space.

Input. A 4x4 grid. Each cell is identified as (row, col) with row and col in 0-3 (2 bits each).

Code.

# Bit-interleave the 2-bit row with the 2-bit col → 4-bit Z-key.
def z_order_key(row, col, bits=2):
    key = 0
    for i in range(bits):
        key |= ((row >> i) & 1) << (2 * i + 1)   # row bit at odd position
        key |= ((col >> i) & 1) << (2 * i)       # col bit at even position
    return key

cells = [(r, c) for r in range(4) for c in range(4)]
ordered = sorted(cells, key=lambda rc: z_order_key(*rc))
for rc in ordered:
    print(rc, z_order_key(*rc))

Step-by-step explanation.

Each row and col is encoded as 2 bits. Bit-interleaving alternates row and col bits to produce a 4-bit Z-key.
(0, 0) → row=00, col=00 → interleaved=0000 → Z-key=0.
(0, 1) → row=00, col=01 → interleaved=0001 → Z-key=1.
(1, 0) → row=01, col=00 → interleaved=0010 → Z-key=2.
(1, 1) → row=01, col=01 → interleaved=0011 → Z-key=3.
The first four cells of the curve form the upper-left 2x2 block — visited in a "Z" shape. The next four cells form the upper-right 2x2 block, and so on. The whole curve is a recursive Z of Zs.
After sorting by Z-key, the first row of the sorted file holds 2x2 block (0-1, 0-1); the second row holds 2x2 block (0-1, 2-3). One parquet file's min/max for both row and col is tightly bounded — file-skip works on both axes.

Output — first eight cells in Z-order.

traversal order	(row, col)	z-key
1	(0, 0)	0
2	(0, 1)	1
3	(1, 0)	2
4	(1, 1)	3
5	(0, 2)	4
6	(0, 3)	5
7	(1, 2)	6
8	(1, 3)	7

Rule of thumb. Pick Z-Order columns whose values you can imagine on a 2D / 3D grid where business queries cluster. Geographic columns (country + region), key columns (customer_id + product_id), and time + entity (event_ts + user_id) are the canonical good pairs.

Detailed explanation — measuring the file-skip ratio

Detailed explanation. The single most important Z-Order metric is the file-skip ratio: (files_pruned / files_in_table) × 100%. Below 50% the layout is wrong; above 90% the layout is tuned. Delta exposes the numbers in the Spark UI under "DataReading" and in the delta.scan operator stats.

Question. Given a 50 TB Delta events table partitioned by event_date and Z-Ordered on (country, customer_id), run a EXPLAIN FORMATTED for a typical filtered query and read the file-skip ratio.

Input. A table with numFiles = 50,000, partitionColumns = [event_date], zOrderBy = (country, customer_id).

Code.

-- Run the query plan.
EXPLAIN FORMATTED
SELECT *
FROM events
WHERE event_date = '2026-06-13'
  AND country    = 'DE'
  AND customer_id = 17;

-- Inspect the actual scan stats after the query.
SELECT *
FROM events
WHERE event_date = '2026-06-13'
  AND country    = 'DE'
  AND customer_id = 17;
-- Spark UI → SQL → metrics: "number of files read" + "files skipped"

Step-by-step explanation.

The planner reads the event_date partition first — 1 partition out of 365.
Within that partition (~137 files), the planner reads per-file min/max stats for country and customer_id.
The intersection of the country = 'DE' and customer_id = 17 ranges with each file's min/max yields 1 candidate file out of 137.
The scan reads 1 file; the file-skip ratio is (50,000 - 1) / 50,000 = 99.998%.
The Spark UI shows files read = 1 and files pruned = 136 (within-partition pruning); partitions pruned = 364 (cross-partition pruning).

Output — measured numbers.

Phase	Value
Partitions pruned	364 / 365
Files pruned within partition	136 / 137
Files actually read	1
File-skip ratio	99.998%
Bytes scanned	~1 GB (1 file) instead of 50 TB

Rule of thumb. Every Z-Order tuning session ends with a EXPLAIN FORMATTED + Spark UI run on the representative query. If the file-skip ratio is below 80%, the Z-Order columns are wrong — pick the four most-filtered high-cardinality columns and re-OPTIMIZE.

Detailed explanation — choosing Z-Order vs partitioning vs liquid clustering

Detailed explanation. The three layout primitives — partition, Z-Order, liquid clustering — are not interchangeable. Each is the right answer for a different shape of workload, and using the wrong one is the most common reason Delta tables underperform.

Question. For each of the four workload patterns below, choose between partitioning, Z-Order, and liquid clustering — and justify the choice.

Input.

Workload	Filter columns	Cardinality	Churn rate
A — daily log analytics	`event_date`	365	append-only
B — point lookups by customer + product	`customer_id`, `product_id`	50M, 1M	low
C — high-velocity SCD-1 customer table	`customer_id`, `country`	50M, 200	UPDATE-heavy
D — greenfield 100 TB events table	TBD multi-dim	high	mixed

Code.

-- A — partition by event_date (low-card, time-series)
ALTER TABLE logs PARTITIONED BY (event_date);

-- B — partition by event_date + Z-Order on customer_id, product_id
ALTER TABLE point_lookups PARTITIONED BY (event_date);
OPTIMIZE point_lookups ZORDER BY (customer_id, product_id);

-- C — liquid clustering (Z-Order would re-fragment on every UPDATE)
ALTER TABLE customers CLUSTER BY (customer_id, country);

-- D — greenfield → liquid clustering from day one
CREATE TABLE events (...) CLUSTER BY (customer_id, event_ts);

Step-by-step explanation.

A — event_date is low-cardinality and time-correlated. Partitioning lets the planner skip whole directories on date filters. Z-Order on a low-card column wastes the rewrite budget.
B — event_date partitions the time axis; Z-Order clusters the multi-dim key axis. Append-only with low churn means the clustering stays valid for a long time before re-OPTIMIZE.
C — UPDATE-heavy means Z-Order's full rewrite needs to run frequently to maintain clustering. Liquid clustering maintains clustering incrementally on each write, so the table stays clustered without scheduled OPTIMIZE.
D — greenfield in 2026 should default to liquid clustering. Z-Order is now legacy for new tables; existing tables migrate when convenient.

Output — decision table.

Workload	Layout	Rationale
A	partition by `event_date`	low-card time column
B	partition + Z-Order	append-only multi-dim
C	liquid clustering	UPDATE-heavy needs incremental clustering
D	liquid clustering	greenfield 2026+ default

Rule of thumb. Partition for the time axis. Z-Order or liquid cluster for the entity axis. Choose liquid clustering if (a) the table is new in 2024+, (b) the workload is UPDATE-heavy, or (c) the filter set evolves over time. Otherwise Z-Order is fine.

Senior interview question on picking Z-Order columns

A senior interviewer often frames this as: "You have a Delta table with 12 columns appearing in WHERE clauses across various queries. The Delta docs say cap Z-Order at 4 columns. Which 4 do you pick, and how do you decide?"

Solution Using a filter-frequency + cardinality + cost-weighted ranking

# Pseudocode — score each candidate column by Z-Order suitability.
def z_order_score(col, query_log, table_stats):
    f = query_log.filter_frequency[col]      # share of queries where col appears in WHERE
    c = table_stats.distinct_values[col]     # cardinality
    s = query_log.selectivity[col]           # 1 / N for point lookups
    j = query_log.is_join_key[col]           # is col a typical join key?
    return f * log(c) * s * (1.5 if j else 1.0)

candidates = ['country','customer_id','product_id','event_ts','region','tier',
              'campaign_id','session_id','device_type','os','channel','status']

scored = sorted(
    [(c, z_order_score(c, query_log, table_stats)) for c in candidates],
    key=lambda x: -x[1],
)
zorder_cols = [c for c, _ in scored[:4]]
print('Z-Order on:', zorder_cols)

Step-by-step trace.

col	filter_freq	cardinality	selectivity	score
`customer_id`	0.80	50M	1/50M	14.2
`event_ts`	0.70	1B	1/86400	12.1
`product_id`	0.55	1M	1/1M	7.6
`country`	0.40	200	1/200	2.1
`region`	0.30	1000	1/1000	2.1
`tier`	0.20	5	1/5	0.32

The trace shows the top four naturally separate from the rest: customer_id, event_ts, product_id, country carry the heavy filter load.

Output:

Final Z-Order columns	Reason
`customer_id`	highest score, point-lookup dominant
`event_ts`	high cardinality + frequent range filter
`product_id`	second join-key surface
`country`	popular filter, multi-dim with customer_id

Why this works — concept by concept:

Filter frequency dominates — Z-Order pays off only on columns that appear in WHERE clauses. A column that does not get filtered earns zero from the layout.
Cardinality matters non-linearly — log(cardinality) weights high-card columns higher because the per-file range shrinks proportional to log-distinct-values when bit-interleaving.
Selectivity captures point-lookups — 1/N selectivity is the marginal scan reduction from a Z-Order hit on that column.
Join-key boost — columns that double as join keys earn a 1.5× boost because they drive both filter pushdown and partition-aware joins.
The 4-column ceiling is hard — past 4 columns the per-column pruning is too diluted to repay the OPTIMIZE cost. Pick the top 4 and re-evaluate quarterly.
Cost — scoring is O(candidate_columns); the score itself is a heuristic, not an oracle. Always validate with a representative query's file-skip ratio after the first OPTIMIZE pass.

Data Engineering
Topic — bucketing
Bucketing & file-layout problems

Practice →

4. OPTIMIZE + Auto Compaction — bin-packing, optimal write target

`OPTIMIZE` bin-packs many small parquet files into a few large ones; Auto Compaction does the same at write-time

The mental model in one line: OPTIMIZE is the post-hoc bin-packer that coalesces small parquet files into target-size (default 1 GB) files; Auto Compaction is the at-write-time version that runs after each commit when the just-written files are below the threshold. Once you say "bin-pack to 1 GB targets," every small-file interview gotcha — slow scans, metadata overhead, executor memory explosions — becomes a single tuning lever.

Bin-packing semantics.

OPTIMIZE picks small files (under delta.targetFileSize, default 1 GB) and combines them. Large files are left alone — they are already optimal.
The combining is a rewrite: read N small files, write 1 large file, replace the manifest entries atomically in the transaction log.
Idempotent: running OPTIMIZE on a tuned table is a no-op (no small files to combine).
Can be scoped: OPTIMIZE table WHERE partition_col = '2026-06-13' runs only on that partition.

OPTIMIZE + ZORDER BY together.

OPTIMIZE table ZORDER BY (col1, col2) does both passes in one command — first the Z-Order sort, then the bin-pack into target-size files. More efficient than separate calls.
After Z-Order, the resulting files have tight per-column min/max stats — file-skip ratios > 90% on the Z-Order columns.

Auto Compaction (write-time) vs OPTIMIZE (post-hoc).

Auto Compaction. delta.autoOptimize.autoCompact = true. After each successful commit, Delta checks the new files; if they are small, it runs a synchronous compaction step before returning success. Adds ~10-30% to commit latency in exchange for "no small files ever exist."
OPTIMIZE. Scheduled command, typically nightly. Runs independently of write traffic. Better for very-large tables where synchronous compaction would add unacceptable commit latency.
Combined. Use Auto Compaction for the streaming/operational layer (Bronze); use scheduled OPTIMIZE for the analytical layer (Silver/Gold) where Z-Order is also needed.

Target file size.

Default: 1 GB. Configurable via delta.targetFileSize (e.g., '256m', '1g', '2g').
Smaller targets (256 MB) help on tables with many small filtered queries — more files = finer pruning granularity.
Larger targets (2-4 GB) help on tables with large analytical scans — fewer files = less per-file overhead.
Auto Tuning. On Databricks, delta.tuneFileSizesForRewrites = true chooses the target adaptively based on observed query patterns.

Optimized Writes.

delta.autoOptimize.optimizeWrite = true. Adds an adaptive shuffle before the write so that the written files are closer to target size. Typically writes 128 MB files instead of the small per-task shards that a naive shuffle produces.
Combined with Auto Compaction, the small-file tax is eliminated end-to-end.

Detailed explanation — the small-file disaster, in numbers

Detailed explanation. A streaming Bronze table writes one parquet file per partition per micro-batch. With 5-minute micro-batches and 12 partitions, that is 144 files per hour, ~3,500 per day, ~25,000 per week. Each file is ~5 MB. A typical analytical query then scans 25,000 5 MB files — reading 25,000 parquet footers alone takes 30-60 seconds before the data scan starts.

Question. Given the workload above, show the cost of running with no compaction vs scheduled nightly OPTIMIZE vs Auto Compaction. Quantify the read latency, write CPU, and file count.

Input.

Trait	Value
Streaming micro-batch	5 min
Partitions per batch	12
Avg file size at write	5 MB
Daily file count (raw)	~3,500
Weekly file count (raw)	~25,000

Code.

-- Scenario 1 — no compaction.
-- Nothing — the streaming writer creates 5 MB files every 5 minutes.

-- Scenario 2 — scheduled nightly OPTIMIZE.
OPTIMIZE bronze.events;
-- Runs nightly; daily file count → ~50 large files

-- Scenario 3 — Auto Compaction.
ALTER TABLE bronze.events SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact   = true
);
-- Each commit ends with a synchronous compaction step.
-- Steady-state file count: ~50-100 at any time.

Step-by-step explanation.

No compaction. File count grows linearly. At week 1, 25,000 files; at week 4, 100,000. A simple analytical query reads 100,000 parquet footers ≈ 2-4 minutes of metadata I/O before data scan.
Nightly OPTIMIZE. Daily file count returns to ~50 each morning, climbs back to 3,500 through the day. Read latency on the same query is ~40 seconds in the morning, ~90 seconds by end-of-day. OPTIMIZE costs ~10-30 minutes of compute nightly.
Auto Compaction. Steady-state file count stays at ~50-100 all day. Read latency is ~30-40 seconds consistently. Write CPU increases ~15-20% because every commit pays for the synchronous compaction.
Combined Auto Compaction + scheduled OPTIMIZE ZORDER. Compaction keeps file count low all day; nightly OPTIMIZE ZORDER re-clusters for filter performance. Best of both worlds.

Output.

Strategy	Steady-state files	Read latency	Write CPU
No compaction	25K+ (grows)	2-4 min	baseline
Nightly OPTIMIZE	50 → 3,500 daily	40 s → 90 s	+30 min/night
Auto Compaction	50-100 all day	30-40 s steady	+15-20%
Auto Compaction + nightly OPTIMIZE ZORDER	50-100 + clustered	4 s steady	+15-20% +30 min/night

Rule of thumb. On streaming Bronze tables, default to Auto Compaction. On large analytical Silver/Gold tables, default to scheduled OPTIMIZE ZORDER. On the very largest tables (> 100 TB), do both.

Detailed explanation — incremental OPTIMIZE on a hot partition

Detailed explanation. Running full-table OPTIMIZE every night is wasteful when only one or two partitions actually changed. The OPTIMIZE … WHERE form restricts the operation to a partition or partition range — pays only for the hot set.

Question. Given a Delta table partitioned by event_date, schedule a daily OPTIMIZE that only re-bin-packs yesterday's partition (the only one that received new writes).

Input. A table with 365 daily partitions, only one partition (yesterday) is "hot."

Code.

-- Restrict OPTIMIZE to one partition — pays only for yesterday's files.
OPTIMIZE bronze.events
WHERE event_date = current_date() - INTERVAL '1' DAY;

-- With Z-Order:
OPTIMIZE bronze.events
WHERE event_date = current_date() - INTERVAL '1' DAY
ZORDER BY (country, customer_id);

Step-by-step explanation.

Without the WHERE clause, OPTIMIZE scans every file in the table to find small files. On a 50 TB table with 50,000 files, that metadata scan alone takes ~2-3 minutes.
With the WHERE clause, OPTIMIZE scans only the partition's directory. On a 137 MB / 137 file partition, the metadata scan is sub-second.
The rewrite is bounded by the partition size. Yesterday's partition is ~137 GB; the rewrite is ~3-5 minutes vs a full-table rewrite of ~30-60 minutes.
The Z-Order pass also runs only on the partition's files — the clustering applies to the partition.

Output.

Strategy	Files scanned	Files rewritten	Time
Full-table OPTIMIZE	50,000	thousands	30-60 min
Partition-scoped OPTIMIZE	137	50-100	3-5 min

Rule of thumb. Always scope OPTIMIZE to the partitions that actually received writes. On time-partitioned tables, this means "OPTIMIZE yesterday's partition" — never "OPTIMIZE the whole table." The 12× cost reduction is structural, not incidental.

Detailed explanation — choosing between Auto Compaction and scheduled OPTIMIZE

Detailed explanation. Auto Compaction is synchronous: it adds latency to every commit. Scheduled OPTIMIZE is asynchronous: it runs on a separate schedule and does not block writes. The choice depends on whether the write SLA can absorb 10-30% latency tax in exchange for "no small files ever."

Question. For each of three workloads — sub-second streaming, hourly micro-batch, daily batch — pick between Auto Compaction and scheduled OPTIMIZE.

Input.

Workload	Write cadence	Write SLA
A — sub-second streaming	continuous	< 1 s p99 commit
B — hourly micro-batch	hourly	< 30 s p99 commit
C — daily batch	nightly	< 60 min total

Code.

-- A — sub-second streaming: cannot tolerate sync compaction
ALTER TABLE A SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact   = false   -- async OPTIMIZE only
);
-- + scheduled hourly OPTIMIZE WHERE on the hot partition

-- B — hourly micro-batch: can absorb sync compaction
ALTER TABLE B SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact   = true
);

-- C — daily batch: scheduled OPTIMIZE + Z-Order in the batch window
-- (Auto compaction redundant for one large commit)
OPTIMIZE C ZORDER BY (customer_id, product_id);

Step-by-step explanation.

A — sub-second p99 forbids Auto Compaction's synchronous step. Use Optimized Writes (still cheap) but schedule OPTIMIZE WHERE on each completed hour-partition asynchronously.
B — hourly commits with 30 s SLA can absorb the ~5-10 s synchronous compaction. Use both Optimized Writes and Auto Compaction.
C — nightly batch writes one large commit. Auto Compaction adds nothing; the OPTIMIZE pass in the batch window handles everything plus Z-Order.

Output.

Workload	Strategy	Reason
A streaming	OW + async OPTIMIZE	sync compaction breaks p99
B hourly	OW + Auto Compaction	SLA can absorb sync compaction
C daily	Scheduled OPTIMIZE ZORDER	one commit, no compaction needed

Rule of thumb. The write SLA determines the answer. Sub-second writes → asynchronous OPTIMIZE only. Multi-second writes → Auto Compaction. Hours-long batch → scheduled OPTIMIZE ZORDER. Mix and match per layer of the medallion architecture.

Senior interview question on OPTIMIZE cadence for a streaming Bronze table

A senior interviewer often frames this as: "Sketch the OPTIMIZE schedule for a streaming Bronze Delta table with 5-minute Structured Streaming micro-batches, downstream MERGE-INTO consumers, and a 24-hour analyst-query SLA. Justify each cadence."

Solution Using a layered OPTIMIZE cadence with partition scope and Z-Order

# Pseudocode for the orchestrator (Databricks Workflows / Airflow).

# Layer 1 — continuous: Optimized Writes + Auto Compaction (table property)
# Layer 2 — every 6 hours: partition-scoped OPTIMIZE on yesterday's partition
def hourly_partition_optimize():
    spark.sql("""
        OPTIMIZE bronze.events
        WHERE event_date >= current_date() - INTERVAL '1' DAY
    """)

# Layer 3 — nightly: Z-Order on yesterday's partition for filter performance
def nightly_zorder():
    spark.sql("""
        OPTIMIZE bronze.events
        WHERE event_date = current_date() - INTERVAL '1' DAY
        ZORDER BY (country, customer_id)
    """)

# Layer 4 — weekly: full-table OPTIMIZE catch-up + retention prune
def weekly_full_optimize():
    spark.sql("OPTIMIZE bronze.events")
    spark.sql("VACUUM bronze.events RETAIN 168 HOURS")

Step-by-step trace.

Layer	Cadence	Scope	Cost	Benefit
1	every commit	Auto Compaction	+15-20% write CPU	no small files ever
2	every 6 hours	hot partition	~30s	re-bin-pack any straggler files
3	nightly	hot partition + Z-Order	~5 min	filter performance for analysts
4	weekly	full table + VACUUM	~30 min	catch-up + storage hygiene

The trace shows how the layered cadence keeps the file count low at every timescale — sub-second through weekly — without paying full-table OPTIMIZE cost every day.

Output:

Time of day	Action	Effect
00:00	nightly OPTIMIZE ZORDER on yesterday	analyst queries fast in morning
06:00	partition-scoped OPTIMIZE	drag from late commits cleared
12:00	partition-scoped OPTIMIZE	clears the morning's micro-batches
18:00	partition-scoped OPTIMIZE	clears the afternoon's micro-batches
02:00 Sun	weekly full OPTIMIZE + VACUUM	catch-up + retention prune

Why this works — concept by concept:

Layered cadence — each layer addresses a different timescale of file accumulation. Auto Compaction handles seconds; partition OPTIMIZE handles hours; Z-Order handles days; full OPTIMIZE handles weeks.
Partition scope — WHERE event_date = … keeps every OPTIMIZE cost-bounded to the hot set. Critical for tables larger than 10 TB.
Z-Order separated from compaction — Z-Order is the expensive rewrite; running it less often (nightly) is fine because the analyst-query SLA is daily, not hourly.
VACUUM coupled to OPTIMIZE — running VACUUM in the same maintenance window as full OPTIMIZE keeps both rebalanced together. Mismatched cadences cause time-travel surprises.
Auto Compaction is the spine — the other layers are optimisations on top. Without Auto Compaction, the steady-state file count explodes between scheduled passes.
Cost — Layer 1 is amortised across writes; Layers 2-3 are partition-bounded (constant); Layer 4 is the only full-table cost (weekly).

Data Engineering
Topic — optimization
File-layout optimization drills

Practice →

5. Production tuning — VACUUM, deletion vectors, liquid clustering

`VACUUM` reclaims tombstoned files after retention; deletion vectors mask deleted rows without rewrites; liquid clustering replaces Z-Order on greenfield tables

The mental model in one line: VACUUM is the only command that physically deletes files from object storage — it can only delete files older than the retention threshold (default 168 hours), and dropping below that floor breaks time-travel and the audit story. Deletion vectors and liquid clustering are the two 2024+ primitives that change how Delta handles row-level deletes and incremental clustering — both should be enabled on every new table.

The VACUUM file lifecycle.

ACTIVE — referenced by the latest snapshot. Readable by every query. Cannot be deleted.
TOMBSTONED — removed by a commit (DELETE, UPDATE, OPTIMIZE, MERGE), but still physically present on disk. Reachable via time-travel VERSION AS OF until VACUUM reclaims.
RECLAIMED — physically deleted by VACUUM. Unrecoverable. Time-travel to versions referencing the file fails.

Default 168-hour retention.

delta.deletedFileRetentionDuration controls how long tombstoned files persist before VACUUM may reclaim them. Default 168 hours (7 days).
VACUUM refuses to reclaim files newer than this threshold — even if you ask politely.
To override (dangerous): SET spark.databricks.delta.retentionDurationCheck.enabled = false; VACUUM table RETAIN 24 HOURS;. Don't.

Why the floor exists.

Time-travel VERSION AS OF v requires every file referenced at version v to still be on disk. Once VACUUM reclaims those files, time-travel to version v fails.
Audit / regulator requirements often demand at least 7 days of "as-of" history. Setting retention below 7 days violates the audit contract.
Concurrent readers may still be reading a file that a commit just tombstoned. The 7-day floor gives them safe buffer.

Deletion vectors.

A deletion vector is a per-file bitmap that marks rows as deleted without rewriting the file.
Enable: ALTER TABLE … SET TBLPROPERTIES (delta.enableDeletionVectors = true).
A DELETE WHERE id = 17 updates the deletion vector for the affected file — milliseconds instead of minutes of rewrite.
Reads apply the deletion vector at scan time — the row is hidden but the parquet file is untouched.
An OPTIMIZE pass later consolidates by rewriting files with significant deletion vectors into clean files without DVs (configurable threshold).

Liquid clustering.

The 2024+ successor to Z-Order. Maintains clustering incrementally on each write — no full rewrite required.
Greenfield: CREATE TABLE … CLUSTER BY (col1, col2).
Existing table: ALTER TABLE … CLUSTER BY (col1, col2). The next OPTIMIZE pass rewrites files into the new layout.
Liquid clustering does not need partition columns — it replaces both partitioning and Z-Order with a single clustering key. Cleaner mental model.
Up to 4 clustering columns; same ceiling rationale as Z-Order.

When to use what — 2026 cheat sheet.

New table in 2026? Liquid clustering. Always.
Existing Z-Ordered table? Z-Order until a migration window opens. Then ALTER TABLE … CLUSTER BY (…) and one final OPTIMIZE pass.
UPDATE-heavy table? Enable deletion vectors. Then add liquid clustering on top for clustering maintenance.
Audit-regulated table? Keep VACUUM at 168h or higher. Never lower.

Detailed explanation — choosing retention for a regulated table

Detailed explanation. Regulated industries (finance, healthcare) commonly require 30-day or 90-day "as-of" history. The Delta default of 7 days is too short. Override upward, never downward.

Question. For a regulated payments table that must support SELECT … VERSION AS OF :v for any version in the last 30 days, configure the retention and document the storage trade-off.

Input.

Trait	Value
Daily delta size	100 GB
Required time-travel window	30 days
Audit retention	90 days

Code.

-- Configure retention to 90 days (covers both time-travel + audit).
ALTER TABLE payments SET TBLPROPERTIES (
  delta.deletedFileRetentionDuration = 'interval 90 days',
  delta.logRetentionDuration         = 'interval 90 days'
);

-- VACUUM still uses the table property; safe to schedule daily.
VACUUM payments;   -- reclaims files older than 90 days

Step-by-step explanation.

delta.deletedFileRetentionDuration controls when VACUUM may reclaim a tombstoned file. Setting to 90 days means VACUUM keeps every tombstoned file for 90 days.
delta.logRetentionDuration controls how long the JSON transaction log files persist. For time-travel to work back to day -30, the log must reach back at least that far.
The storage cost is daily_delta × retention_days. With 100 GB/day × 90 days = 9 TB of tombstoned-file storage — a real but manageable cost for the audit guarantee.
VACUUM still runs daily; it just reclaims files older than 90 days each pass. After steady state, the tombstoned set is ~9 TB.

Output — storage breakdown after steady state.

Layer	Size	Purpose
Active data	varies	latest snapshot
Tombstoned (within 90d)	~9 TB	time-travel + audit
Log retention (90d JSON files)	~50 MB	snapshot reconstruction

Rule of thumb. Set retention to at least the longest of: regulatory minimum, longest time-travel use case, and the lag of your slowest downstream consumer. Going higher is cheap; going lower than 168h breaks time-travel silently.

Detailed explanation — turning on deletion vectors

Detailed explanation. Deletion vectors flip the cost model of DELETE and UPDATE from "rewrite affected files" to "mark deleted rows in a bitmap." On a table where MERGE INTO touches 0.1% of rows scattered across 80% of files, the cost difference is 800×.

Question. Given a Delta table where a daily MERGE updates 0.5% of rows scattered across 70% of files, show the cost difference with and without deletion vectors. Walk through what changes on the read path.

Input.

Trait	Value
Table size	10 TB
Files	10,000
Daily MERGE rows	0.5% = 50M rows
Files touched by MERGE	70% = 7,000

Code.

-- Enable deletion vectors.
ALTER TABLE silver.events SET TBLPROPERTIES (
  delta.enableDeletionVectors = true
);

-- The MERGE call doesn't change.
MERGE INTO silver.events AS tgt
USING staging.delta AS src
   ON tgt.event_id = src.event_id
 WHEN MATCHED THEN UPDATE SET *
 WHEN NOT MATCHED THEN INSERT *;

Step-by-step explanation.

Without DV. The MERGE rewrites every touched file: 7,000 files × ~1 GB = 7 TB of writes. Time: ~30-45 min.
With DV. The MERGE writes a deletion vector per touched file (typically a few KB each) plus a small parquet file with the new rows for the matched rows. Touched-file rewrites: ~0. Time: ~2-3 min.
On the read path. Each scan applies the deletion vector to mask the deleted rows. A row that exists in the parquet file but is marked deleted in the DV is skipped silently. Read latency increases ~5-10% per file with a DV.
Periodic OPTIMIZE. When the DV coverage of a file exceeds a threshold (~20% of rows masked), OPTIMIZE rewrites the file cleanly. The DV cost is amortised over many MERGE batches.

Output.

Operation	Without DV	With DV
MERGE rewrite bytes	7 TB	~50 MB (DVs + new rows)
MERGE time	30-45 min	2-3 min
Read latency overhead	0%	+5-10%
OPTIMIZE consolidation	every MERGE	when DV > 20%

Rule of thumb. Enable deletion vectors on every Delta table that experiences UPDATE / DELETE traffic. The MERGE / DELETE cost reduction is typically 10-100× depending on row scatter; the read overhead is small and consolidated by periodic OPTIMIZE.

Detailed explanation — migrating Z-Order to liquid clustering

Detailed explanation. Existing Z-Ordered tables can migrate to liquid clustering with a single ALTER TABLE + OPTIMIZE. The migration is a one-way trip — once the table is liquid-clustered, the partition + Z-Order metadata is gone.

Question. Given an existing Delta table partitioned by event_date and Z-Ordered on (country, customer_id), migrate to liquid clustering with cluster key (customer_id, country). Walk through the steps.

Input. Table events with 50 TB, 365 partitions, Z-Order on (country, customer_id).

Code.

-- Step 1 — declare the liquid cluster key.
ALTER TABLE events CLUSTER BY (customer_id, country);

-- Step 2 — drop the partition column (optional; can keep partition for date-range queries)
-- ALTER TABLE events DROP PARTITION FIELD event_date;
-- In practice, keep event_date partition + liquid clustering on the entity axis.

-- Step 3 — run OPTIMIZE; this rewrites files into the liquid layout.
OPTIMIZE events;

-- Step 4 — verify with DESCRIBE DETAIL.
DESCRIBE DETAIL events;
-- properties: { clusteringColumns: 'customer_id,country', ... }

Step-by-step explanation.

ALTER TABLE … CLUSTER BY declares the clustering intent in the table metadata. The next OPTIMIZE pass rewrites files in liquid layout.
Liquid clustering does not require partition columns, but keeping event_date as a partition is still useful for date-range pruning. Most production tables run with both partition + liquid clustering on different axes.
OPTIMIZE rewrites the files. On a 50 TB table this is a ~3-6 hour job; plan it for a maintenance window.
After OPTIMIZE, subsequent writes are automatically liquid-clustered — no scheduled OPTIMIZE ZORDER needed. Liquid clustering is incrementally maintained per commit.
Reads on WHERE customer_id = 17 enjoy the same file-skip as Z-Order; reads on WHERE country = 'DE' also benefit because both columns are in the cluster key.

Output.

Phase	Effect
ALTER TABLE … CLUSTER BY	metadata only; no data movement
First OPTIMIZE	rewrites all files in liquid layout
Subsequent writes	automatically liquid-clustered
Subsequent reads	file-skip equivalent to or better than Z-Order

Rule of thumb. Migrate Z-Order to liquid clustering when the table enters its next major maintenance window. Greenfield tables in 2026+ should start in liquid clustering. Mixed mode (partition + liquid) is the production sweet spot — partition for time, liquid for entity.

Detailed explanation — the CDF + deletion vector interaction

Detailed explanation. Enabling deletion vectors on a CDF-enabled table is safe — CDF correctly captures DV-driven changes — but a few subtle behaviours surprise teams the first time.

Question. Given a CDF + DV-enabled table, walk through what the CDF stream emits when a MERGE INTO writes via deletion vectors.

Input. Table with both delta.enableChangeDataFeed = true and delta.enableDeletionVectors = true.

Code.

-- A MERGE that updates one row.
MERGE INTO events AS t
USING (SELECT 17 AS id, 'DE' AS country) AS s
   ON t.id = s.id
 WHEN MATCHED THEN UPDATE SET country = s.country;

-- Read CDF for the commit.
SELECT * FROM table_changes('events', :v, :v);

Step-by-step explanation.

The MERGE commit writes a deletion vector marking the row's old version as deleted in its parquet file, plus a tiny new parquet file with the updated row.
The CDF output emits two rows for the commit: update_preimage (old country) and update_postimage (new country). Same as without DV.
CDF consumers see the same logical row-state change. They do not see — and do not need to see — the DV mechanics.
The DV-bitmap file is a Delta internal artefact; CDF readers only see the change-data sidecar files, which carry per-row preimage/postimage data.

Output.

Mechanism	What's written
DV bitmap	per-file bitmap masking the old row
Updated parquet file	tiny file with the new row
CDF sidecar	one update_preimage row + one update_postimage row

Rule of thumb. DV and CDF compose cleanly. Enable both on every table where MERGE / UPDATE / DELETE traffic is non-trivial. The CDF consumer contract is unchanged; the DV win is purely on the write side.

Senior interview question on VACUUM + OPTIMIZE + CDF cadence

A senior interviewer often frames this as: "Design the maintenance cadence for a 50 TB Bronze Delta table with CDF on, deletion vectors on, 5-minute streaming writes, and a 7-day downstream CDF consumer SLA. What does the orchestrator run and when?"

Solution Using a layered maintenance schedule that respects every constraint

# Orchestrator schedule (Databricks Workflows / Airflow).

# Continuous — Auto Compaction + Optimized Writes (table properties).
# Already set:
#   delta.autoOptimize.optimizeWrite = true
#   delta.autoOptimize.autoCompact   = true
#   delta.enableChangeDataFeed       = true
#   delta.enableDeletionVectors      = true

# Every 6 hours — partition-scoped OPTIMIZE (catch any DV consolidation needs).
@every("6h")
def hourly_partition_optimize():
    spark.sql("""
        OPTIMIZE bronze.events
        WHERE event_date >= current_date() - INTERVAL '1' DAY
    """)

# Nightly at 01:00 — partition-scoped OPTIMIZE + Z-Order / liquid clustering pass.
@daily(at="01:00")
def nightly_cluster_pass():
    spark.sql("""
        OPTIMIZE bronze.events
        WHERE event_date = current_date() - INTERVAL '1' DAY
    """)

# Weekly Sunday 03:00 — VACUUM with 168h retention (the 7-day CDF SLA floor).
@weekly(on="sunday", at="03:00")
def weekly_vacuum():
    spark.sql("VACUUM bronze.events RETAIN 168 HOURS")

Step-by-step trace.

Schedule	Action	Constraint respected
continuous	Optimized Writes + Auto Compaction	no small-file tax at write time
every 6h	partition OPTIMIZE	DV consolidation; hot partition
nightly	cluster pass on yesterday	analyst-query SLA in morning
weekly	VACUUM 168h	7-day downstream CDF consumer SLA

The trace shows the cadences are interlocking — each runs at the highest frequency that respects the slowest constraint. VACUUM does not run more often than weekly because every CDF consumer needs at least 7 days to consume.

Output:

Window	Maintenance load
streaming day-time	continuous compaction, +15-20% write CPU
early morning	nightly cluster pass, ~5 min
Sunday 03:00	weekly VACUUM, ~10 min

Why this works — concept by concept:

Auto Compaction at writes — handles the small-file tax in real time. Without it, the partition-scoped OPTIMIZE would not be enough.
Partition scope — every OPTIMIZE is bounded to yesterday's partition. Full-table OPTIMIZE on a 50 TB table is not in the cadence — the layered approach makes it unnecessary.
Cluster pass nightly — clustering is the expensive rewrite; running daily is sufficient because the analyst SLA is daily.
VACUUM weekly with 168h floor — respects the downstream CDF SLA. Running VACUUM daily would tombstone files that downstream consumers still need to read.
DV + CDF compose — both are enabled and run silently in the background. The orchestrator does not need to be aware of either.
No full-table OPTIMIZE — at 50 TB the full-table cost is hours. The layered cadence keeps the file count low enough that full-table OPTIMIZE is unnecessary.
Cost — write CPU +15-20%; per-cadence costs are partition-bounded; storage cost is daily_delta × 7 days ≈ 5-7 TB of tombstoned files. All within typical operational budgets.

Data Engineering
Topic — aggregation
Aggregation pipeline tuning problems

Practice →

Cheat sheet — Delta Lake tuning recipes

Enable CDF. ALTER TABLE db.t SET TBLPROPERTIES (delta.enableChangeDataFeed = true) — costs metadata only until a consumer reads.
Read CDF. SELECT * FROM table_changes('db.t', startVer, endVer) for a version range or SELECT * FROM table_changes('db.t', startTs, endTs) for a timestamp range.
Z-Order. OPTIMIZE db.t ZORDER BY (col1, col2, col3, col4) — cap at 4 columns, prefer high-cardinality WHERE columns.
Partition column choice. Low-cardinality, time-correlated → partition. High-cardinality, multi-dim → Z-Order or liquid clustering.
Target file size. Default 1 GB. Override with ALTER TABLE db.t SET TBLPROPERTIES (delta.targetFileSize = '1g'). On Databricks, prefer delta.tuneFileSizesForRewrites = true.
Auto Compaction. delta.autoOptimize.autoCompact = true — synchronous bin-pack after each commit; adds 15-20% write CPU.
Optimized Writes. delta.autoOptimize.optimizeWrite = true — adaptive shuffle so writes target ~128 MB files; pairs with Auto Compaction.
Partition-scoped OPTIMIZE. OPTIMIZE db.t WHERE event_date = current_date() - INTERVAL '1' DAY — keep OPTIMIZE cost bounded.
Deletion Vectors. ALTER TABLE db.t SET TBLPROPERTIES (delta.enableDeletionVectors = true) — converts DELETE/UPDATE from file-rewrite to bitmap-mask.
Liquid Clustering (greenfield). CREATE TABLE db.t (...) CLUSTER BY (col1, col2) — incremental clustering; no scheduled re-OPTIMIZE needed.
Liquid Clustering (migrate). ALTER TABLE db.t CLUSTER BY (col1, col2); OPTIMIZE db.t;.
VACUUM. VACUUM db.t RETAIN 168 HOURS — the 7-day floor. Never lower.
Custom retention. ALTER TABLE db.t SET TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 30 days', delta.logRetentionDuration = 'interval 30 days').
Verify layout. DESCRIBE DETAIL db.t returns numFiles, sizeInBytes, partitionColumns, properties — the four numbers to sanity-check.
History. DESCRIBE HISTORY db.t LIMIT 20 returns the recent commits with operationMetrics — diagnose write-amp and metric drift.
Time-travel. SELECT * FROM db.t VERSION AS OF :v or TIMESTAMP AS OF '2026-06-12 00:00:00' — bounded by retention.

Frequently asked questions

Does enabling CDF rewrite history?

No. ALTER TABLE … SET TBLPROPERTIES (delta.enableChangeDataFeed = true) is a metadata-only change. CDF starts capturing the row-state ledger from the next commit onward; reads against commits prior to the enable point return "CDF not available." To backfill history into a downstream SCD-2 or audit trail, use time-travel VERSION AS OF to snapshot earlier versions and pairwise full-diff between them — a one-time, expensive operation that hands off to cheap CDF reads after the enable point.

What is the difference between Z-Order and liquid clustering?

Z-Order is a scheduled rewrite — OPTIMIZE table ZORDER BY (cols) rewrites every file in the table (or partition) to interleave the chosen columns' bits. Liquid clustering is incrementally maintained — each commit clusters the newly-written rows in place, so a scheduled re-cluster is unnecessary. Liquid clustering also does not require a partition column (it subsumes both partition and Z-Order into one cluster key), works better for UPDATE-heavy tables, and is the default for greenfield tables on Databricks Runtime ≥ 14 / Delta ≥ 3.0. For new tables in 2026, prefer liquid clustering. For existing Z-Ordered tables, migrate during the next maintenance window with ALTER TABLE … CLUSTER BY (…); OPTIMIZE ….

Can I run OPTIMIZE on a streaming Delta table?

Yes — but use partition scope (OPTIMIZE table WHERE partition_col = '...') and avoid the time window when downstream consumers are reading. OPTIMIZE creates new files and tombstones old ones; downstream readers that pin a snapshot version are unaffected because Delta's transaction log preserves the old layout until VACUUM reclaims it. The recommended pattern for streaming Bronze tables is Auto Compaction + Optimized Writes (continuous), partition-scoped OPTIMIZE every 6 hours, and nightly OPTIMIZE ZORDER on yesterday's partition — no full-table OPTIMIZE ever needed on tables larger than 10 TB.

Why does my VACUUM not free storage?

Three common causes. First, you are below the 168-hour retention floor — VACUUM refuses to reclaim files newer than the threshold; the run completes but reclaims nothing. Second, the files you expect to vacuum are still referenced by the latest snapshot — they are ACTIVE, not TOMBSTONED, so VACUUM has no authority over them. Third, deletion vectors are masking rows without actually rewriting files — the parquet files are still ACTIVE; only a future OPTIMIZE consolidation makes the old files tombstone-eligible. Diagnose with DESCRIBE DETAIL (numFiles and sizeInBytes) and DESCRIBE HISTORY (recent operations) to see what is and is not tombstoned.

Does Z-Order on more than 4 columns help?

Almost never. Z-Order interleaves the bits of the chosen columns; with N columns, each output bit carries 1/N-th of each input column's information. At N=2 each column gets 50% of the bits — strong pruning on both. At N=4 each column gets 25% — pruning is weaker but still useful. At N=8 each column gets 12.5% — pruning approaches no-op and OPTIMIZE pays for ordering that doesn't help. The Delta docs cap recommended Z-Order at 4 columns. If you have 8+ filter columns, pick the top 4 by filter-frequency × log(cardinality) × selectivity and leave the others alone; or migrate to liquid clustering, which also caps at 4 cluster columns but maintains them incrementally.

How do deletion vectors affect CDF output?

Not at all — at the logical level. CDF emits the same update_preimage / update_postimage / delete row pairs whether the underlying mechanism is a file rewrite or a deletion vector. Downstream consumers reading table_changes() see identical output. Physically, DV-driven changes write a tiny bitmap plus a small parquet file with new rows instead of rewriting the entire affected files; the CDF sidecar files are produced the same way. CDF + DV compose cleanly and should both be enabled on tables with significant MERGE / UPDATE / DELETE traffic — DV cuts the write cost 10-100×, CDF makes the changes cheaply consumable downstream.

Practice on PipeCode

Drill the optimization practice library → for the file-layout family of probes — Z-Order, bin-packing, partition design.
Rehearse on streaming Δ-throughput problems → for CDF-style incremental consumers, micro-batch sizing, and watermark-vs-version mental models.
Layer the aggregation practice library → for the materialised-view refresh family of questions.
Stack the joins practice library → for the MERGE-INTO / SCD-2 build patterns.
Add bucketing drills → for the Z-Order column-picking interview surface.
Rehearse on the Databricks company practice library → for Delta-flavoured interview probes.
For the broader surface, read top data engineering interview questions →.
Take the Apache Spark internals course → for the execution-plan + shuffle layer underneath Delta.
Take the PySpark fundamentals course → for the read/write surface where OPTIMIZE and Z-Order plug in.
Take the ETL system design course → for the lakehouse layering, medallion architecture, and the CDF-as-propagation pattern.

Pipecode.ai is Leetcode for Data Engineering — every Delta tuning recipe above ships with hands-on practice rooms where you sketch the CDF SCD-2 build, pick the right Z-Order columns by hand, schedule the OPTIMIZE cadence, and design the VACUUM retention for a regulated table against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you walk into a senior Databricks or Delta-shop interview with the lakehouse tuning vocabulary already in muscle memory.

Practice Delta Lake tuning now →
Databricks-flavoured drills →

1. Why CDF + Z-Order — when each wins

Two axes of Delta tuning — row-state propagation (CDF) and file-skip locality (Z-Order)

Detailed explanation — the four-knob tuning model

Detailed explanation — when partitioning beats Z-Order

Detailed explanation — the cost of NOT enabling CDF

Senior interview question on Delta Lake tuning strategy

Solution Using a four-step Delta Lake remediation playbook

2. CDF mechanics — _change_type, version, timestamp

Change Data Feed turns every insert, update, and delete into a ledger row — read it with table_changes()

Detailed explanation — building an SCD-2 from CDF in one SQL

Detailed explanation — incremental MV refresh from CDF

Detailed explanation — the write-amp cost of CDF

Senior interview question on backfilling SCD-2 when CDF was enabled late

Solution Using time-travel + full-table diff to backfill, then CDF for forward state

3. Z-Ordering deep dive — multi-dim clustering, file skipping

Z-Order is a space-filling curve — interleaving the bits of two columns clusters records that are close in 2D into the same parquet file

Detailed explanation — tracing the 2D Z-curve by hand

Detailed explanation — measuring the file-skip ratio

Detailed explanation — choosing Z-Order vs partitioning vs liquid clustering

Senior interview question on picking Z-Order columns

Solution Using a filter-frequency + cardinality + cost-weighted ranking

4. OPTIMIZE + Auto Compaction — bin-packing, optimal write target

OPTIMIZE bin-packs many small parquet files into a few large ones; Auto Compaction does the same at write-time

Detailed explanation — the small-file disaster, in numbers

Detailed explanation — incremental OPTIMIZE on a hot partition

Detailed explanation — choosing between Auto Compaction and scheduled OPTIMIZE

Senior interview question on OPTIMIZE cadence for a streaming Bronze table

Solution Using a layered OPTIMIZE cadence with partition scope and Z-Order

5. Production tuning — VACUUM, deletion vectors, liquid clustering

VACUUM reclaims tombstoned files after retention; deletion vectors mask deleted rows without rewrites; liquid clustering replaces Z-Order on greenfield tables

Detailed explanation — choosing retention for a regulated table

Detailed explanation — turning on deletion vectors

Detailed explanation — migrating Z-Order to liquid clustering

Detailed explanation — the CDF + deletion vector interaction

Senior interview question on VACUUM + OPTIMIZE + CDF cadence

Solution Using a layered maintenance schedule that respects every constraint

Cheat sheet — Delta Lake tuning recipes

Frequently asked questions

Does enabling CDF rewrite history?

What is the difference between Z-Order and liquid clustering?

Can I run OPTIMIZE on a streaming Delta table?

Why does my VACUUM not free storage?

Does Z-Order on more than 4 columns help?

How do deletion vectors affect CDF output?

Practice on PipeCode

2. CDF mechanics — `_change_type`, version, timestamp

Change Data Feed turns every insert, update, and delete into a ledger row — read it with `table_changes()`

`OPTIMIZE` bin-packs many small parquet files into a few large ones; Auto Compaction does the same at write-time

`VACUUM` reclaims tombstoned files after retention; deletion vectors mask deleted rows without rewrites; liquid clustering replaces Z-Order on greenfield tables