If you have ever run DESCRIBE HISTORY on a Delta table that receives streaming data every 60 seconds and watched it either hang for hours or crash with an OutOfMemoryError, you are not alone and you are not doing anything wrong. The problem is architectural, and once you understand the internals, the fix becomes a lot clearer.
Here is what I learned after digging into why this happens and what you can actually do about it.
How the Delta transaction log works
Every write to a Delta table (INSERT, UPDATE, DELETE, MERGE, schema change) gets recorded as a JSON file in a directory called _delta_log at the root of the table. Files are named with zero-padded twenty-digit integers:
_delta_log/
├── 00000000000000000000.json
├── 00000000000000000001.json
├── 00000000000000000002.json
...
├── 00000000000000000010.parquet (checkpoint)
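The naming scheme maps directly from the table version number. A minimal sketch (the helper name is mine, not part of Delta):

```python
def commit_filename(version: int) -> str:
    """Build the commit file name for a table version:
    zero-padded to 20 digits, matching the listing above."""
    return f"{version:020d}.json"

print(commit_filename(0))   # 00000000000000000000.json
print(commit_filename(10))  # 00000000000000000010.json
```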
Each JSON file contains an array of "actions":
{
  "commitInfo": {
    "timestamp": 1714915200000,
    "operation": "STREAMING UPDATE",
    "operationMetrics": {
      "numOutputRows": "1240",
      "scanTimeMs": "320"
    },
    "isolationLevel": "WriteSerializable",
    "isBlindAppend": true
  }
}
{
  "add": {
    "path": "part-00001-abc123.snappy.parquet",
    "partitionValues": {},
    "size": 1048576,
    "stats": "{\"numRecords\":1240,\"minValues\":{...},\"maxValues\":{...}}"
  }
}
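In the log itself each action sits on a single line of JSON (the example above is pretty-printed), so extracting the history record from one commit file boils down to a line scan. A minimal sketch of that per-file work, assuming newline-delimited actions:

```python
import json

def extract_commit_info(commit_text: str):
    """Scan one commit file's actions and return its commitInfo block, if any."""
    for line in commit_text.splitlines():
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "commitInfo" in action:
            return action["commitInfo"]
    return None

# A two-action commit, one JSON object per line:
sample = "\n".join([
    json.dumps({"commitInfo": {"operation": "STREAMING UPDATE",
                               "timestamp": 1714915200000}}),
    json.dumps({"add": {"path": "part-00001-abc123.snappy.parquet",
                        "size": 1048576}}),
])
info = extract_commit_info(sample)
print(info["operation"])  # STREAMING UPDATE
```

DESCRIBE HISTORY has to repeat this for every commit file in _delta_log, which is exactly where the trouble starts.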
Every 10 commits, Delta generates a Parquet checkpoint file that captures the entire active table state as a compressed, columnar snapshot. When you run a normal query, Spark reads the latest checkpoint and applies only the small delta of JSON commits after it, which is why standard queries stay fast.
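To see why state reconstruction is cheap, here is a toy model of what a reader does: take the checkpoint's file set, then replay only the add/remove actions committed after it (names illustrative; real snapshots also track partitions, stats, and protocol actions):

```python
def reconstruct_active_files(checkpoint_files, later_commits):
    """Toy snapshot reconstruction: checkpoint state plus the
    small tail of commits written after the checkpoint."""
    active = set(checkpoint_files)
    for actions in later_commits:
        for action in actions:
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return active

# Checkpoint knows files a and b; the next commit swaps a for c.
snapshot = reconstruct_active_files(
    ["a.parquet", "b.parquet"],
    [[{"remove": {"path": "a.parquet"}}, {"add": {"path": "c.parquet"}}]],
)
```

No matter how old the table is, a reader only ever replays the handful of commits since the last checkpoint.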
Why DESCRIBE HISTORY cannot use checkpoints
This is the core issue. The Delta protocol explicitly drops commitInfo when writing checkpoints. Checkpoints are optimized for state reconstruction, not provenance. So when you run:
DESCRIBE HISTORY my_streaming_table;
or in Python:
deltaTable.history().show()
Spark gets zero benefit from checkpoints. It has to parse every JSON file in _delta_log from scratch to extract the commitInfo blocks.
A pipeline that triggers every 60 seconds generates 1,440 commits per day. After a year, that is over half a million JSON files Spark has to list and parse for a single DESCRIBE HISTORY call.
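The arithmetic behind those numbers:

```python
# One micro-batch per minute, around the clock:
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400
trigger_interval_s = 60

commits_per_day = SECONDS_PER_DAY // trigger_interval_s
commits_per_year = commits_per_day * 365
print(commits_per_day, commits_per_year)  # 1440 525600
```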
The three things that actually make it slow
- Cloud storage listing overhead
AWS S3, Azure ADLS, and GCS do not have real directory structures. Listing _delta_log requires paginated API calls. S3's ListObjectsV2 returns at most 1,000 keys per request, so listing one million JSON files means 1,000 sequential HTTP requests before a single read task is scheduled. This is a pure I/O bottleneck. Adding more workers does not help here.
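The request count follows directly from the page size cap (the helper is mine, just to make the arithmetic explicit):

```python
import math

def listing_requests(num_keys: int, page_size: int = 1000) -> int:
    """Sequential paginated list calls needed to enumerate num_keys objects
    when each response returns at most page_size keys (S3's cap is 1,000)."""
    return math.ceil(num_keys / page_size)

print(listing_requests(1_000_000))  # 1000
```

Because each page carries the continuation token for the next one, these calls cannot be parallelized, no matter how big the cluster is.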
- Small file JSON parsing
JSON is row-based text. Each two-kilobyte file requires a separate GET request to open, a full text parse to find the nested commitInfo struct, and type casting on every field. Multiply that by millions of files and executor CPU gets overwhelmed.
- Driver OOM when collecting results
After executor nodes parse the JSON files, the resulting commitInfo structs are collected back to the driver for aggregation and sorting. The driver's JVM heap has to hold all of them at once. When millions of records with nested maps like operationMetrics and operationParameters land on the driver simultaneously, you get:
java.lang.OutOfMemoryError: GC overhead limit exceeded
And the query dies.
What you can do about it
Reduce log retention (immediate impact)
ALTER TABLE my_streaming_table
SET TBLPROPERTIES (
'delta.logRetentionDuration' = 'interval 7 days',
'delta.deletedFileRetentionDuration' = 'interval 7 days'
);
This tells Delta to purge old JSON commit files during checkpointing. DESCRIBE HISTORY will now only parse 7 days of history instead of 30. One constraint to know: starting with Databricks Runtime 18.0, logRetentionDuration must be greater than or equal to deletedFileRetentionDuration, otherwise you get a validation error.
Enable Minor Log Compaction (Delta 3.0+)
Delta 3.0 introduced Minor Log Compaction, which combines multiple sequential JSON commits into a single consolidated file:
_delta_log/00000000000000000100.00000000000000000200.compact.json
This dramatically reduces the file count DESCRIBE HISTORY has to work through. It is enabled by default in modern runtimes, but you can explicitly control it with:
spark.conf.set(
"spark.databricks.delta.deltaLog.minorCompaction.useForReads", "true"
)
Use Unity Catalog system tables instead
For systematic auditing, querying system.access.audit is significantly faster than DESCRIBE HISTORY because it is a pre-optimized Delta table, not a raw JSON parse:
SELECT
event_time,
user_identity.email,
action_name,
request_params
FROM system.access.audit
WHERE request_params['table_full_name'] = 'my_catalog.my_schema.my_table'
ORDER BY event_time DESC;
Similarly, system.query.history gives you execution metrics and durations for writes without ever touching the transaction log.
Upgrade driver memory
When you cannot avoid querying large histories, switch to a memory-optimized driver instance. On AWS, migrating from m5.xlarge to r5.4xlarge gives the JVM enough heap to aggregate millions of records without hitting OOM.
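On self-managed Spark, the equivalent lever is the driver memory settings in spark-defaults.conf. The values below are illustrative sizing for a memory-optimized node, not a recommendation:

```text
spark.driver.memory          100g
spark.driver.maxResultSize   8g
```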
Medallion Architecture for high-frequency pipelines
If your pipeline runs MERGE operations frequently against a table that also gets queried, the pattern that works is to ingest raw streaming data as append-only writes into a Bronze table, then run a scheduled bulk MERGE on an hourly cadence into Silver or Gold. This keeps downstream tables clean while the Bronze table handles the commit volume.
Also worth looking at: Deletion Vectors (available in modern Databricks runtimes), which mark rows as logically deleted via compressed bitmap files instead of rewriting the entire Parquet file on every UPDATE or MERGE. This cuts AddFile and RemoveFile churn in the JSON commits significantly.
What I would do differently
If I were designing a high-frequency Kafka-to-Delta pipeline today, I would set a 7-day log retention from day one, enable Minor Log Compaction, route all compliance auditing to Unity Catalog system tables rather than DESCRIBE HISTORY, and extend the streaming trigger to at least 5 minutes unless the downstream business process genuinely needs sub-minute freshness. The transaction log bloat problem is much easier to prevent than to fix after the fact.