<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abhishek Ambare</title>
    <description>The latest articles on DEV Community by Abhishek Ambare (@immortalspace003).</description>
    <link>https://dev.to/immortalspace003</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3910809%2F48f7bb76-e58b-49f1-a6eb-b250c18b1d34.png</url>
      <title>DEV Community: Abhishek Ambare</title>
      <link>https://dev.to/immortalspace003</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/immortalspace003"/>
    <language>en</language>
    <item>
      <title>How I debugged a Delta Lake DESCRIBE HISTORY timeout (and what's actually causing it)</title>
      <dc:creator>Abhishek Ambare</dc:creator>
      <pubDate>Mon, 04 May 2026 06:13:16 +0000</pubDate>
      <link>https://dev.to/immortalspace003/how-i-debugged-a-delta-lake-describe-history-timeout-and-whats-actually-causing-it-4ad2</link>
      <guid>https://dev.to/immortalspace003/how-i-debugged-a-delta-lake-describe-history-timeout-and-whats-actually-causing-it-4ad2</guid>
      <description>&lt;p&gt;If you have ever run &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; on a Delta table that receives streaming data every 60 seconds and watched it either hang for hours or crash with an &lt;strong&gt;OutOfMemoryError&lt;/strong&gt;, you are not alone and you are not doing anything wrong. The problem is architectural, and once you understand the internals, the fix becomes a lot clearer.&lt;/p&gt;

&lt;p&gt;Here is what I learned after digging into why this happens and what you can actually do about it.&lt;/p&gt;

&lt;p&gt;How the Delta transaction log works&lt;/p&gt;

&lt;p&gt;Every write to a Delta table (an &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;MERGE&lt;/code&gt;, or a schema change) gets recorded as a JSON file in a directory called _delta_log at the root of the table. Files are named with zero-padded twenty-digit integers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_delta_log/
├── 00000000000000000000.json
├── 00000000000000000001.json
├── 00000000000000000002.json
...
├── 00000000000000000010.checkpoint.parquet  (checkpoint)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each JSON file contains an array of "actions":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"commitInfo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1714915200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STREAMING UPDATE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"operationMetrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"numOutputRows"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1240"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"scanTimeMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"320"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"isolationLevel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WriteSerializable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"isBlindAppend"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"add"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"part-00001-abc123.snappy.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"partitionValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1048576&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stats"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;numRecords&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:1240,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;minValues&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:{...},&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;maxValues&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:{...}}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every 10 commits, Delta generates a Parquet checkpoint file that captures the entire active table state as a compressed, columnar snapshot. When you run a normal query, Spark reads the latest checkpoint and applies only the small delta of JSON commits after it, which is why standard queries stay fast.&lt;/p&gt;
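
&lt;p&gt;You can see where a normal snapshot read starts by peeking at the _last_checkpoint pointer Delta keeps in _delta_log. A minimal PySpark sketch, assuming an active spark session and a hypothetical table path:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;table_path = "s3://my-bucket/tables/my_streaming_table"  # hypothetical location

# _last_checkpoint is a tiny JSON file recording the newest checkpoint version.
last_cp = spark.read.json(f"{table_path}/_delta_log/_last_checkpoint").first()
print(last_cp["version"], last_cp["size"])

# A normal snapshot read loads &lt;version&gt;.checkpoint.parquet plus only the JSON
# commits written after that version, which is why regular queries stay fast.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;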

&lt;p&gt;Why &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; cannot use checkpoints&lt;/p&gt;

&lt;p&gt;This is the core issue. The Delta protocol explicitly drops commitInfo when writing checkpoints. Checkpoints are optimized for state reconstruction, not provenance. So when you run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;HISTORY&lt;/span&gt; &lt;span class="n"&gt;my_streaming_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;deltaTable&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;history&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spark gets zero benefit from checkpoints. It has to parse every JSON file in _delta_log from scratch to extract the commitInfo blocks.&lt;/p&gt;

&lt;p&gt;A pipeline that triggers every 60 seconds generates 1,440 commits per day. After a year, that is over half a million JSON files Spark has to read sequentially for a single &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; call.&lt;/p&gt;
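
&lt;p&gt;Before pointing the command at a table like this, it is worth a rough count of how many commit files you are about to ask Spark to parse. A small sketch for a Databricks notebook (the path is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough gauge of the problem before running DESCRIBE HISTORY. Note that even
# this listing is slow on a big log, which is exactly the point.
table_path = "s3://my-bucket/tables/my_streaming_table"

log_files = dbutils.fs.ls(f"{table_path}/_delta_log")
json_commits = [f for f in log_files if f.name.endswith(".json")]
print(f"{len(json_commits)} JSON commit files in _delta_log")

# Back-of-the-envelope: a 60-second trigger writes 60 * 24 = 1,440 commits
# per day, roughly 525,600 per year.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;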

&lt;p&gt;The three things that actually make it slow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud storage listing overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS S3, Azure ADLS, and GCS do not have real directory structures. Listing _delta_log requires paginated API calls. S3's ListObjectsV2 returns at most 1,000 keys per request, so listing one million JSON files means 1,000 sequential HTTP requests before a single read task is scheduled. This is a pure I/O bottleneck. Adding more workers does not help here.&lt;/p&gt;
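
&lt;p&gt;A few lines of boto3 make the pagination cost concrete (the bucket and prefix here are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

# Illustrative only: each page below is one sequential HTTP round trip
# capped at 1,000 keys.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

pages, keys = 0, 0
for page in paginator.paginate(Bucket="my-bucket",
                               Prefix="tables/my_streaming_table/_delta_log/"):
    pages += 1
    keys += page.get("KeyCount", 0)

print(f"{keys} objects listed in {pages} sequential ListObjectsV2 calls")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;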

&lt;ul&gt;
&lt;li&gt;Small file JSON parsing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;JSON is row-based text. Each two-kilobyte file requires its own GET request (often a fresh connection) to open, a full text parse to find the nested commitInfo struct, and type casting on every field. Multiply that by millions of files and executor CPU gets overwhelmed.&lt;/p&gt;
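
&lt;p&gt;Per file the work is trivial; the multiplication is what hurts. A toy sketch of the open-parse-extract cycle that gets repeated for every commit (the file name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Each line in a commit file is a single JSON action; only the commitInfo
# line matters for history.
with open("00000000000000000042.json") as f:
    actions = [json.loads(line) for line in f if line.strip()]

commit_info = next(a["commitInfo"] for a in actions if "commitInfo" in a)
print(commit_info["operation"], commit_info.get("operationMetrics"))

# Cheap for one 2 KB file; repeated for hundreds of thousands of files, each
# behind its own storage request, it saturates executor CPU and I/O.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;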

&lt;ul&gt;
&lt;li&gt;Driver OOM on shuffle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After executor nodes parse the JSON files, the commitInfo structs are collected back to the driver for aggregation and sorting. The driver's JVM heap has to hold all of this at once. When millions of records with nested maps like operationMetrics and operationParameters hit the driver simultaneously, you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java.lang.OutOfMemoryError: GC overhead limit exceeded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the query dies.&lt;/p&gt;

&lt;p&gt;What you can do about it&lt;/p&gt;

&lt;p&gt;Reduce log retention (immediate impact)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_streaming_table&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'delta.logRetentionDuration'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'interval 7 days'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'delta.deletedFileRetentionDuration'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'interval 7 days'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Delta to purge old JSON commit files during checkpointing. &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; will now only parse 7 days of history instead of the default 30. One constraint to know: starting with Databricks Runtime 18.0, logRetentionDuration must be greater than or equal to deletedFileRetentionDuration, otherwise you get a validation error.&lt;/p&gt;
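
&lt;p&gt;After the ALTER TABLE, a quick PySpark check confirms both properties are set (the table name is the same placeholder as above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Verify the retention settings took effect.
props = spark.sql("SHOW TBLPROPERTIES my_streaming_table")
props.filter("key LIKE 'delta.%RetentionDuration'").show(truncate=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;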

&lt;p&gt;Enable Minor Log Compaction (Delta 3.0+)&lt;/p&gt;

&lt;p&gt;Delta 3.0 introduced Minor Log Compaction, which combines multiple sequential JSON commits into a single consolidated file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_delta_log/00000100.00000200.compact.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dramatically reduces the file count &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; has to work through. It is enabled by default in modern runtimes, but you can explicitly control it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="s"&gt;"spark.databricks.delta.deltaLog.minorCompaction.useForReads"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use Unity Catalog system tables instead&lt;/p&gt;

&lt;p&gt;For systematic auditing, querying system.access.audit is significantly faster than &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; because it is a pre-optimized Delta table, not a raw JSON parse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;action_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;request_params&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;access&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audit&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;request_params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'table_full_name'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'my_catalog.my_schema.my_table'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, system.query.history gives you execution metrics and durations for writes without ever touching the transaction log.&lt;/p&gt;

&lt;p&gt;Upgrade driver memory&lt;/p&gt;

&lt;p&gt;When you cannot avoid querying large histories, switch to a memory-optimized driver instance. On AWS, migrating from m5.xlarge to r5.4xlarge gives the JVM enough heap to aggregate millions of records without hitting OOM.&lt;/p&gt;
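
&lt;p&gt;If you run your own Spark rather than a managed runtime, the equivalent knobs are the driver heap and the result-size cap, both of which have to be set before the driver starts (a sketch with illustrative values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession

# For self-managed Spark (on Databricks this is a cluster setting, not code):
# driver heap and the collect cap must be sized when the driver launches.
spark = (
    SparkSession.builder
    .appName("delta-history-audit")
    .config("spark.driver.memory", "64g")          # heap for the collected commitInfo
    .config("spark.driver.maxResultSize", "16g")   # cap on results pulled to the driver
    .getOrCreate()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;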

&lt;p&gt;Medallion Architecture for high-frequency pipelines&lt;/p&gt;

&lt;p&gt;If your pipeline runs MERGE operations frequently against a table that also gets queried, the pattern that works is to ingest raw streaming data as append-only writes into a Bronze table, then run a scheduled bulk MERGE on an hourly cadence into Silver or Gold. This keeps downstream tables clean while the Bronze table handles the commit volume.&lt;/p&gt;
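
&lt;p&gt;Roughly what that split looks like in PySpark. This is a sketch under stated assumptions: the Kafka source, table names, and the event_id join key are placeholders, and the incremental filter on Bronze is omitted for brevity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from delta.tables import DeltaTable

# 1. Bronze: blind appends only. Cheap commits, no file rewrites.
(
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/bronze_events")
    .trigger(processingTime="60 seconds")
    .toTable("bronze_events")
)

# 2. Silver: a scheduled (hourly) job runs one bulk MERGE instead of 60 tiny ones.
updates = spark.read.table("bronze_events")  # in practice, filter to the new window
(
    DeltaTable.forName(spark, "silver_events").alias("s")
    .merge(updates.alias("b"), "s.event_id = b.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;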

&lt;p&gt;Also worth looking at: Deletion Vectors (available in modern Databricks runtimes), which mark rows as logically deleted via compressed bitmap files instead of rewriting the entire Parquet file on every UPDATE or MERGE. This cuts AddFile and RemoveFile churn in the JSON commits significantly.&lt;/p&gt;
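
&lt;p&gt;If your runtime supports them, Deletion Vectors are opted into per table via a table property (shown here through spark.sql; the table name is the same placeholder as above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Opting an existing table into Deletion Vectors.
# Requires a runtime / Delta version that supports the feature.
spark.sql("""
    ALTER TABLE my_streaming_table
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;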

&lt;p&gt;What I would do differently&lt;/p&gt;

&lt;p&gt;If I were designing a high-frequency Kafka-to-Delta pipeline today, I would set a 7-day log retention from day one, enable Minor Log Compaction, route all compliance auditing to Unity Catalog system tables rather than &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt;, and extend the streaming trigger to at least 5 minutes unless the downstream business process genuinely needs sub-minute freshness. The transaction log bloat problem is much easier to prevent than to fix after the fact.&lt;/p&gt;

</description>
      <category>dataengin</category>
      <category>databricks</category>
      <category>apachespark</category>
      <category>deltalake</category>
    </item>
  </channel>
</rss>
