Delta Lake Best Practices
A practical guide to Delta Lake patterns, optimizations, and operational strategies for Databricks.
1. Merge Optimization
Choose the Right Merge Strategy
| Pattern | Use When | Performance |
|---|---|---|
| SCD1 (Overwrite) | You only need the latest value | Fast — simple update + insert |
| SCD2 (History) | Full audit trail required | Moderate — requires hash comparison |
| Upsert | General insert-or-update | Fast — standard MERGE |
| Delete+Insert | Full partition refresh | Very fast — avoids row-level matching |
| Conditional | Only update if source is newer | Moderate — adds condition evaluation |
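The Upsert pattern from the table can be written as a standard SQL `MERGE INTO` (table and column names here are illustrative):

```sql
-- Upsert: update rows that match on the key, insert the rest
MERGE INTO silver.customers AS t
USING updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```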
Merge Performance Tips
- Filter source data before merge — only bring records that might match
- Use partition pruning — add partition columns to merge condition
- Avoid wide merges — only include columns you actually need to update
- Pre-sort source data on merge keys when possible
- Monitor small files — frequent merges create many small files
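Several of the tips above can be combined in one statement: a pre-filtered source, a partition column in the `ON` clause for pruning, and a conditional update clause (all names are illustrative):

```sql
-- Pre-filter the source and pin the partition column in the ON clause
-- so Delta can skip untouched partitions
MERGE INTO silver.events AS t
USING (SELECT * FROM staging.events WHERE event_date >= '2024-01-01') AS s
  ON  t.event_id   = s.event_id
  AND t.event_date = s.event_date      -- partition column enables pruning
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET t.payload = s.payload, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT *;
```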
Merge with Delta Table API vs SQL
The DeltaTable Python API and SQL MERGE INTO produce identical results. Use SQL for simple operations and the Python API when building dynamic merge conditions.
2. Small File Problem
What Causes It
- Frequent appends (streaming or micro-batch)
- Frequent MERGE operations
- Low-volume updates to large tables
- Write partitioning that creates too many partitions
Solutions
| Approach | How | When |
|---|---|---|
| `autoOptimize.optimizeWrite` | Coalesces small writes | Always enable on Bronze |
| `autoOptimize.autoCompact` | Background compaction | Good for append-heavy tables |
| Manual `OPTIMIZE` | Explicit file compaction | For maximum control |
| `OPTIMIZE ... WHERE` | Compact specific partitions | When recent data has most small files |
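The first two approaches are table properties; the last two are on-demand commands. A sketch, assuming a `bronze.raw_events` table partitioned by `event_date`:

```sql
-- Enable write-time coalescing and background compaction
ALTER TABLE bronze.raw_events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);

-- Or compact manually, limited to recent partitions
OPTIMIZE bronze.raw_events WHERE event_date >= date_sub(current_date(), 7);
```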
Recommended File Sizes
- Bronze: 128 MB (many writes, benefit from compaction)
- Silver: 256 MB (balanced read/write)
- Gold: 64-128 MB (smaller tables, fast reads)
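On Databricks these targets can be hinted per table via the `delta.targetFileSize` property (table name illustrative):

```sql
-- Ask OPTIMIZE and auto-compaction to aim for ~256 MB files on Silver
ALTER TABLE silver.orders SET TBLPROPERTIES ('delta.targetFileSize' = '256mb');
```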
3. Vacuum Strategy
Retention Period Guidelines
- Minimum: 7 days — the default `delta.deletedFileRetentionDuration`; Delta's retention check blocks shorter windows unless explicitly overridden
- Bronze: 7 days — raw data is reproducible from source
- Silver (SCD2): 14 days — allow time for time travel queries on dimension history
- Gold: 3 days — aggregated data is recomputable
- Staging: 1 day — ephemeral by nature
Safety Rules
- Never set retention below `delta.deletedFileRetentionDuration`
- Run VACUUM after OPTIMIZE, not before
- Use DRY RUN first when testing retention changes
- Coordinate vacuum with downstream consumers who may use time travel
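A sketch of the safe workflow, assuming the Silver SCD2 dimension from the guidelines above (table name illustrative):

```sql
-- Preview what VACUUM would delete before committing to a retention change
VACUUM silver.dim_customer RETAIN 336 HOURS DRY RUN;   -- 336 h = 14 days

-- Persist the retention so time travel and VACUUM agree
ALTER TABLE silver.dim_customer SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 14 days'
);
```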
Vacuum and Time Travel
After vacuum runs, versions older than the retention period become unreadable. Plan your retention to match your time travel needs:
```
Retention = max(
    downstream_consumer_lag,
    time_travel_query_window,
    restore_recovery_window
)
```
4. Change Data Feed vs Watermark
When to Use CDF
- Downstream consumers need to know what changed (inserts, updates, deletes)
- Incremental ETL where you process only changed rows
- Audit trail requirements
- CDC replication to external systems
When to Use Watermark
- Append-only processing (no updates/deletes)
- Streaming with event-time windowing
- When CDF overhead is not justified
CDF Operational Notes
- CDF adds a small storage overhead per write (~1-5%)
- Enable CDF early — it only captures changes made after enabling
- Use `_commit_version` for ordering, not `_commit_timestamp`
- Checkpoint your CDF consumer position for exactly-once processing
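The notes above can be sketched in SQL — enable the feed, then read changes from a checkpointed version onward (table name and version are illustrative):

```sql
-- Enable CDF; it captures changes only from this point forward
ALTER TABLE silver.orders SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read all changes since the last checkpointed version, ordered by commit version
SELECT * FROM table_changes('silver.orders', 42)
ORDER BY _commit_version;
```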
5. Liquid Clustering Migration
Why Migrate from ZORDER
| Feature | ZORDER | Liquid Clustering |
|---|---|---|
| Applied at | OPTIMIZE time (manual) | Automatically during writes |
| Incremental | No — rewrites all data | Yes — only new/changed data |
| Partition requirement | Works with partitions | Cannot use with partitions |
| Column changes | Requires full re-OPTIMIZE | Alter + next write applies it |
Migration Steps
- Verify no partitioning — Liquid Clustering is incompatible with partitioned tables
- Choose cluster columns — typically the same as your ZORDER columns
- Enable: `ALTER TABLE t CLUSTER BY (col1, col2)`
- Run initial OPTIMIZE to apply clustering to existing data
- Remove ZORDER from maintenance scripts — no longer needed
- Monitor with `DESCRIBE DETAIL` to verify clustering is active
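The core of the migration is two statements, run on an unpartitioned table (names illustrative):

```sql
-- Switch the table from ZORDER maintenance to Liquid Clustering
ALTER TABLE silver.events CLUSTER BY (customer_id, event_date);

-- One-time pass to cluster existing data; subsequent writes cluster incrementally
OPTIMIZE silver.events;
```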
Choosing Cluster Columns
- High-cardinality columns used in WHERE/JOIN (e.g., `customer_id`)
- Date/timestamp columns used for range queries
- Maximum 4 columns recommended
- Order matters less than with ZORDER (Liquid adapts to query patterns)
6. Table Maintenance Schedule
Recommended Cadence
| Layer | OPTIMIZE | VACUUM | ANALYZE |
|---|---|---|---|
| Bronze | Daily | Weekly | Never |
| Silver | Daily | Weekly | Daily |
| Gold | Daily | Daily | Weekly |
| Staging | Hourly | Daily | Never |
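A daily Silver maintenance job following the cadence above might look like this (table name illustrative; VACUUM moves to a weekly job):

```sql
-- Daily Silver maintenance
OPTIMIZE silver.orders;
ANALYZE TABLE silver.orders COMPUTE STATISTICS;

-- Weekly, in a separate job, after OPTIMIZE has run:
VACUUM silver.orders;   -- uses the table's configured retention
```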
Monitoring
Track these metrics after maintenance:
- `numFilesAdded` / `numFilesRemoved` from OPTIMIZE
- File count delta from VACUUM
- Total table size trend over time
- Average file size (target: 128-256 MB for large tables)
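The OPTIMIZE metrics above are recorded in the table history under `operationMetrics`, which you can inspect after each maintenance run (table name illustrative):

```sql
-- Recent commits, including OPTIMIZE and VACUUM entries with their metrics
DESCRIBE HISTORY silver.orders LIMIT 10;
```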
By Datanest Digital | delta-lake-patterns v1.0.0
This is 1 of 11 resources in the Data Pipeline Pro toolkit. Get the complete [Delta Lake Patterns Library] with all files, templates, and documentation for $49.
Or grab the entire Data Pipeline Pro bundle (11 products) for $169 — save 30%.