Subhasis Das

DAY 1 – Delta Conversion & Optimization

As part of Day 1 of Phase 1: Better Data Engineering in the Databricks 14 Days AI Challenge – 2 (Advanced), I worked on Delta Conversion and Optimization.

A few weeks ago, during the previous challenge, I struggled with the Public DBFS Root limitation in the free workspace. This time, instead of trying to bypass the limitation, I worked with Volumes and created a managed table properly inside Databricks.

Key Pointers

I started by reading multiple monthly CSV files. Spark handled the logical merge. I then converted the raw data into Delta format using .format("delta") and created a managed table with saveAsTable().
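The flow above can be sketched in PySpark. This is a minimal sketch, not the exact notebook from the challenge: the Volume path and table name (`/Volumes/main/default/raw_sales`, `sales_delta`) are hypothetical placeholders, and it assumes a Databricks notebook where `spark` is already provided by the runtime.

```python
# Sketch only -- paths and table names are placeholders.
# Assumes a Databricks notebook, where `spark` is provided by the runtime.

# Read all monthly CSVs in one pass; Spark merges them logically
# into a single DataFrame.
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/Volumes/main/default/raw_sales/2024_*.csv")
)

# Convert to Delta and register a managed table in one step.
raw_df.write.format("delta").saveAsTable("sales_delta")
```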

To simulate incremental ingestion, I used append mode. Following the official guide, I also simulated the small-file problem and analyzed the results with DESCRIBE DETAIL.
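Incremental ingestion and the metadata check might look like the following sketch (same placeholder names as above, again assuming a Databricks notebook). Each append-mode write adds new data files, which is exactly how the small-file problem accumulates:

```python
# Append one month's data to the existing managed table (placeholder names).
new_month_df = (
    spark.read
    .option("header", "true")
    .csv("/Volumes/main/default/raw_sales/2024_07.csv")
)
new_month_df.write.format("delta").mode("append").saveAsTable("sales_delta")

# Inspect table metadata: numFiles and sizeInBytes are the two columns
# needed to spot a small-file problem.
spark.sql("DESCRIBE DETAIL sales_delta").select("numFiles", "sizeInBytes").show()
```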

The metadata showed 109 files and approximately 12.3 GB of data: dividing 12.3 GB by 109 files gives an average file size of about 113 MB. This helped me judge whether optimization was actually required.
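The same sanity check can be done directly from the DESCRIBE DETAIL numbers; a minimal Python sketch, plugging in the figures from this run:

```python
def avg_file_size_mb(size_in_bytes: float, num_files: int) -> float:
    """Average Delta file size in MB (decimal units), from DESCRIBE DETAIL."""
    return size_in_bytes / num_files / 1_000_000

# Figures from this run: ~12.3 GB spread across 109 files.
avg_mb = avg_file_size_mb(12.3e9, 109)
print(round(avg_mb, 1))  # prints 112.8
```

At roughly 113 MB per file, the table is already close to a healthy file size, which is why the numbers mattered before reaching for OPTIMIZE.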

After applying OPTIMIZE, I focused on understanding how Delta Lake reorganizes file layout rather than reducing logical data size.
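To make that reorganization concrete, here is a toy model of compaction: it greedily bin-packs many small files into fewer files near a target size, leaving the total logical data size unchanged. This is an illustration of the idea, not Delta Lake's actual algorithm, and the 128 MB target is just a toy parameter:

```python
def compact(file_sizes_mb, target_mb=128):
    """Toy bin-packing compaction: greedily merge files up to target_mb.

    Total size is preserved; only the number and size of files change --
    the same property OPTIMIZE has on a Delta table.
    """
    compacted, current = [], 0
    for size in sorted(file_sizes_mb):
        if current > 0 and current + size > target_mb:
            compacted.append(current)
            current = 0
        current += size
    if current > 0:
        compacted.append(current)
    return compacted

# 100 small 4 MB files, e.g. accumulated from repeated append-mode writes.
small_files = [4] * 100
big_files = compact(small_files)
print(len(big_files), sum(big_files))  # prints: 4 400
```

One hundred 4 MB files collapse into four files totalling the same 400 MB, which is the intuition behind "reorganizing file layout rather than reducing logical data size".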

Code

This exercise helped me shift from executing commands to reasoning about storage architecture.

Outputs

Activity Log
