Subhasis Das
DAY 12 – Cost Optimization Basics

Day 12 focused on cost optimization fundamentals in Spark-based data workflows.

The objective was to analyze job runtime behavior and identify common patterns that increase compute cost in distributed processing systems.

The first experiment measured runtime consistency for a heavy analytical query. The initial execution took approximately 39.87 seconds, while the second execution completed in about 2.35 seconds, demonstrating the difference between cold and warm query execution.

Next, the impact of unnecessary actions was explored. Executing .show(), .count(), and .collect() on the same DataFrame triggered three separate Spark jobs, each scanning approximately 1.08 GB of data. By reducing execution to a single action or storing results for reuse, runtime was reduced to around 1.22 seconds.

Additional experiments highlighted query optimization techniques. Simplifying a complex aggregation query reduced runtime from 7.48 seconds to 1.66 seconds. Avoiding SELECT * and selecting only the required columns further reduced execution time from 2.85 seconds to 1.44 seconds.

Throughout the analysis, ChatGPT supported interpretation of runtime results and identification of practical cost-saving strategies within Databricks.

These observations illustrate how query design directly influences compute cost in distributed data processing systems.
