Day 12 focused on cost optimization fundamentals in Spark-based data workflows.
The objective was to analyze job runtime behavior and identify common patterns that increase compute cost in distributed processing systems.
The first experiment measured runtime consistency for a heavy analytical query. The initial execution took approximately 39.87 seconds, while the second execution completed in about 2.35 seconds. This gap illustrates cold versus warm execution: the first run pays for query planning and a full data scan, while later runs benefit from data already cached closer to the compute.
Next, the impact of unnecessary actions was explored. Executing `.show()`, `.count()`, and `.collect()` on the same DataFrame triggered three separate Spark jobs, each scanning approximately 1.08 GB of data. By reducing execution to a single action or storing results for reuse, runtime was reduced to around 1.22 seconds.
Additional experiments highlighted query optimization techniques. Simplifying a complex aggregation query reduced runtime from 7.48 seconds to 1.66 seconds. Avoiding `SELECT *` and selecting only required columns further reduced execution time from 2.85 seconds to 1.44 seconds.
Throughout the analysis, ChatGPT supported interpretation of runtime results and identification of practical cost-saving strategies within Databricks.
These observations illustrate how query design directly influences compute cost in distributed data processing systems.