Day 10 of Phase 2 focused on query optimization and execution analysis in Spark.
The objective was to run a heavy analytical query on the event dataset, inspect its execution plan, and analyze how query design affects performance. A purchase aggregation query was executed to identify the most active buyers in the dataset.
Using Spark's EXPLAIN functionality, the parsed, analyzed, optimized, and physical execution plans were examined. The physical plan revealed operators such as Photon scans, hash aggregation, shuffle exchanges, and sorting operations.
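For readers who want to reproduce this step, here is a minimal sketch of how all four plans can be printed from a notebook. The table name `events` and the columns `user_id` and `event_type` are assumptions, since the post does not show the real schema, and `show_extended_plan` is a hypothetical helper name. `spark` is the session object that Databricks notebooks provide.

```python
# Hypothetical table and column names: the real schema is not shown in the post.
PURCHASE_AGG_SQL = """
SELECT user_id, COUNT(*) AS purchase_count
FROM events
WHERE event_type = 'purchase'
GROUP BY user_id
ORDER BY purchase_count DESC
LIMIT 10
"""


def show_extended_plan(spark, sql=PURCHASE_AGG_SQL):
    """Print the parsed, analyzed, optimized, and physical plans for a query.

    `spark` is an existing SparkSession (predefined in Databricks notebooks).
    Nothing is executed here; explain() only prints the plans.
    """
    df = spark.sql(sql)
    # mode="extended" is available in PySpark 3.0+; df.explain(True) is the
    # older equivalent.
    df.explain(mode="extended")
    return df
```

Calling `show_extended_plan(spark)` in a notebook cell would print the plans without running the query. In a query shaped like this one, the GROUP BY and ORDER BY clauses are what typically show up as aggregation, exchange, and sort operators in the physical plan.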
Execution timing demonstrated the effect of query complexity. The aggregation query executed in approximately 2.20 seconds. A simplified projection query that removed aggregation and sorting reduced execution time to approximately 1.41 seconds, most likely because a plain projection needs no shuffle exchange or sort.
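Timings like these can be collected with a small helper. The sketch below is one way to do it, not necessarily how the numbers above were measured, and `time_action` is a hypothetical helper name. Because Spark evaluates lazily, the callable has to trigger an action such as `collect()` or `count()`, otherwise only plan construction is timed.

```python
import time


def time_action(action, runs=3):
    """Return the best wall-clock time in seconds over `runs` calls.

    `action` is a zero-argument callable that must trigger a Spark action
    (for example collect() or count()), since transformations alone are lazy.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        action()
        best = min(best, time.perf_counter() - start)
    return best


# Example use in a notebook where `spark` exists (the queries are hypothetical):
# agg_seconds = time_action(lambda: spark.sql(
#     "SELECT user_id, COUNT(*) AS n FROM events "
#     "GROUP BY user_id ORDER BY n DESC").collect())
# proj_seconds = time_action(lambda: spark.sql(
#     "SELECT user_id, event_type FROM events LIMIT 1000").collect())
```

Taking the best of several runs reduces noise from warm-up and scheduling. Keep in mind that the two queries return different results, so the gap shows what aggregation and sorting cost rather than a speed-up of the same query.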
Caching was attempted as part of the optimization workflow, but serverless compute restrictions prevented persistence operations. As a result, optimization was demonstrated through query simplification and explain-plan interpretation instead.
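One way to keep a notebook running when caching is unavailable is to treat it as optional. This is a minimal sketch with a hypothetical helper name, `cache_if_supported`. The exact exception that serverless compute raises is not shown in the post, so the code catches a broad `Exception` as an assumption.

```python
def cache_if_supported(df):
    """Try to cache a DataFrame and return (df, cached_flag).

    Falls back to the uncached DataFrame when the environment rejects
    persistence, as the post reports for serverless compute.
    """
    try:
        return df.cache(), True
    except Exception as exc:  # assumption: the exact error type is not known
        print(f"Caching not available, continuing uncached: {exc}")
        return df, False
```

On a cluster that supports persistence, the flag would be `True` and later actions on the returned DataFrame could reuse the cached data. On serverless, the query simply runs uncached.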
During the process, ChatGPT assisted with explain-plan interpretation and query optimization reasoning within Databricks.





