<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Subhasis Das</title>
    <description>The latest articles on DEV Community by Subhasis Das (@nexoperose).</description>
    <link>https://dev.to/nexoperose</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3644509%2Fd0501791-e06c-43dc-a043-28916bd85c48.png</url>
      <title>DEV Community: Subhasis Das</title>
      <link>https://dev.to/nexoperose</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nexoperose"/>
    <language>en</language>
    <item>
      <title>DAY 14 - Final Production-Ready System</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Sat, 14 Mar 2026 09:39:29 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-14-final-production-ready-system-822</link>
      <guid>https://dev.to/nexoperose/day-14-final-production-ready-system-822</guid>
      <description>&lt;p&gt;Day 14 marked the final stage of the Databricks 14 Days AI Challenge – 2 (Advanced), bringing together the various components developed throughout the challenge into a complete production-ready system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijyj48ijuhfyly3ft6zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijyj48ijuhfyly3ft6zb.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The primary objective was to integrate the data engineering pipeline with the machine learning workflow into a single operational process. Throughout the earlier phases, individual components such as data ingestion, feature engineering, model training, experiment tracking, and inference pipelines were developed separately. Day 14 focused on combining these pieces into an end-to-end architecture capable of generating predictions from raw data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xzcp4ylh29jhutjvja1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xzcp4ylh29jhutjvja1.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline begins with loading data from the Delta table that stores the processed e-commerce event dataset. Feature engineering is applied to transform event-level interactions into user-level behavioral features. These features include metrics such as total user activity, number of purchases, total spending, and average transaction value.&lt;/p&gt;
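
&lt;p&gt;A minimal sketch of this aggregation step is shown below; the table and column names (&lt;strong&gt;ecom_events&lt;/strong&gt;, &lt;strong&gt;user_id&lt;/strong&gt;, &lt;strong&gt;event_type&lt;/strong&gt;, &lt;strong&gt;price&lt;/strong&gt;) are assumptions for illustration, not necessarily the exact names used in the notebook.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import functions as F

# Assumed table and column names for illustration.
events = spark.read.table("ecom_events")

# Collapse event-level rows into one behavioral feature row per user.
user_features = events.groupBy("user_id").agg(
    F.count("*").alias("total_events"),
    F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("purchase_count"),
    F.sum(F.when(F.col("event_type") == "purchase", F.col("price")).otherwise(0.0)).alias("total_spent"),
    F.avg(F.when(F.col("event_type") == "purchase", F.col("price"))).alias("avg_transaction_value"),
)
&lt;/code&gt;&lt;/pre&gt;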

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj7gyu6w3bz3rte8r9zf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj7gyu6w3bz3rte8r9zf.png" alt="Notebook" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, a purchase label is generated to identify whether each user has made a purchase. The feature dataset and label dataset are then joined to produce the final training dataset.&lt;/p&gt;
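
&lt;p&gt;In outline, the labeling and join step could look like the following, reusing the assumed names from the sketch above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Label each user 1 if at least one purchase event exists, else 0.
labels = events.groupBy("user_id").agg(
    F.max(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("label")
)

# Join features and labels into the final training dataset.
training_df = user_features.join(labels, on="user_id", how="inner")
&lt;/code&gt;&lt;/pre&gt;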

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal7caftci2rrg5ieq67j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal7caftci2rrg5ieq67j.png" alt="Notebook" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using this dataset, a Logistic Regression model is trained to predict the probability that a user will make a purchase. The dataset is split into training and testing subsets, and model performance is evaluated using the Area Under the ROC Curve (AUC). The evaluation confirmed that the model could effectively distinguish between purchasing and non-purchasing users based on their interaction patterns.&lt;/p&gt;
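
&lt;p&gt;A condensed version of this training step, under the same assumed column names; the split ratio and seed here are placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

feature_cols = ["total_events", "purchase_count", "total_spent", "avg_transaction_value"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features", handleInvalid="skip")
train, test = assembler.transform(training_df).randomSplit([0.8, 0.2], seed=42)

lr_model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
print("AUC:", evaluator.evaluate(lr_model.transform(test)))
&lt;/code&gt;&lt;/pre&gt;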

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeddm55y46waalkgunso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeddm55y46waalkgunso.png" alt="Notebook" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After training, the final model is saved to a Unity Catalog volume and logged using MLflow. Because the system was executed on a serverless cluster, MLflow required a Unity Catalog temporary directory for model serialization. Adjusting the MLflow configuration allowed the model to be successfully logged and stored for future use.&lt;/p&gt;
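
&lt;p&gt;The post does not show the exact configuration, but a plausible version of the fix uses the &lt;strong&gt;dfs_tmpdir&lt;/strong&gt; argument of &lt;strong&gt;mlflow.spark.log_model&lt;/strong&gt;; the volume path below is a placeholder.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mlflow

# On serverless compute, Spark ML serialization needs a Unity Catalog
# Volume as scratch space; the path below is an assumed placeholder.
with mlflow.start_run(run_name="day14_final_pipeline"):
    mlflow.spark.log_model(
        lr_model,
        "purchase_model",
        dfs_tmpdir="/Volumes/main/default/mlflow_tmp",
    )
&lt;/code&gt;&lt;/pre&gt;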

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz9myu444q7p0htgwkc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz9myu444q7p0htgwkc6.png" alt="Notebook" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the model was persisted, batch inference was performed on the dataset to generate purchase probability predictions for each user. The predictions include the user identifier, predicted probability of purchase, and binary prediction label. These results were written to a Gold Delta table, making them accessible for downstream analytics and decision-making processes.&lt;/p&gt;
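
&lt;p&gt;A sketch of the scoring and write-out step; the Gold table name is an assumption, and &lt;strong&gt;vector_to_array&lt;/strong&gt; is used to pull the positive-class probability out of Spark's vector column.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

scored = lr_model.transform(assembler.transform(training_df))

predictions = scored.select(
    "user_id",
    vector_to_array(F.col("probability"))[1].alias("purchase_probability"),
    F.col("prediction").alias("predicted_label"),
)

# Persist to a Gold Delta table (assumed name) for downstream analytics.
predictions.write.format("delta").mode("overwrite").saveAsTable(
    "gold_user_purchase_predictions"
)
&lt;/code&gt;&lt;/pre&gt;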

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgtascxd0igjmgjpybqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgtascxd0igjmgjpybqc.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output dataset revealed a strong separation between users predicted to purchase and those predicted not to purchase, indicating that the engineered features provided meaningful signals for the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5zypd6h0rvio4u72riy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5zypd6h0rvio4u72riy.png" alt="Notebook" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the implementation process, ChatGPT assisted with debugging MLflow configuration issues, refining the pipeline logic, and validating the final prediction extraction steps within the Databricks environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ozz0gutcqs3c8lhyvyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ozz0gutcqs3c8lhyvyj.png" alt="Codes" width="800" height="2613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Completing Day 14 effectively demonstrates how individual data engineering and machine learning tasks can be assembled into a unified system capable of supporting production-style predictive analytics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>DAY 13 - End-to-End Architecture Design</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Fri, 13 Mar 2026 17:25:14 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-13-end-to-end-architecture-design-48m4</link>
      <guid>https://dev.to/nexoperose/day-13-end-to-end-architecture-design-48m4</guid>
      <description>&lt;p&gt;Day 13 of Phase 3: Performance &amp;amp; Production Thinking in the Databricks 14 Days AI Challenge – 2 (Advanced) focused on designing and documenting the end-to-end architecture of the system developed throughout the challenge.&lt;/p&gt;

&lt;p&gt;The first task involved creating an architecture diagram that represents the complete data and machine learning workflow. The architecture illustrates how raw e-commerce event data flows through a layered lakehouse design. Raw CSV data is ingested into the Bronze layer, where it is stored as Delta tables. From there, feature engineering transforms event-level data into curated user-level features within the Silver layer, and these features are used to construct the training dataset for machine learning models.&lt;/p&gt;

&lt;p&gt;Logistic Regression and Random Forest models are trained and evaluated, with experiments tracked using MLflow. The trained model is then used within a batch inference pipeline to score users and generate predictions that are stored in the Gold layer. In parallel, a collaborative filtering recommendation system using ALS generates product recommendations based on user interaction data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjwr8fnsqs5d5wmv79ib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjwr8fnsqs5d5wmv79ib.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second task required documenting the pipeline flow. This step connected the individual components implemented across earlier phases of the challenge. The pipeline begins with data ingestion and Delta table creation, followed by feature engineering and dataset preparation. Model training and evaluation occur after the training dataset is generated, with experiment tracking handled through MLflow. The inference stage then produces prediction outputs for downstream analysis. Supporting layers such as job orchestration, streaming ingestion capability, performance monitoring, and cost optimization were incorporated to reflect how such a pipeline would operate in a real production environment.&lt;/p&gt;

&lt;p&gt;The third task focused on defining a retraining strategy. A production-ready system must continuously adapt to evolving data patterns, so retraining can be triggered through scheduled jobs or changes in data distribution. The retraining workflow rebuilds the training dataset from updated Delta tables, retrains the models, evaluates performance metrics, and logs experiments through MLflow. The best-performing model is then deployed back into the inference pipeline.&lt;/p&gt;
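
&lt;p&gt;As a rough sketch, the retraining loop described here could be orchestrated along these lines; every helper named below is hypothetical and stands in for the corresponding pipeline stage.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mlflow

def retrain_and_promote():
    # All helpers below are hypothetical placeholders for pipeline stages.
    training_df = rebuild_training_dataset()      # from updated Delta tables

    with mlflow.start_run(run_name="scheduled_retraining"):
        model, auc = train_and_evaluate(training_df)
        mlflow.log_metric("roc_auc", auc)

        # Promote only if the candidate beats the current production model.
        if auc &amp;gt;= current_production_auc():
            deploy_to_inference(model)
&lt;/code&gt;&lt;/pre&gt;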

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g6vvziuf2o1wc1h5igh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g6vvziuf2o1wc1h5igh.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the design and documentation process, ChatGPT assisted with structuring the architecture, organizing the pipeline flow, and refining the retraining strategy within the environment provided by Databricks.&lt;/p&gt;

&lt;p&gt;This exercise highlighted how individual data engineering and machine learning components can be integrated into a cohesive and scalable system architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F849nwhsweqvfxtrg9o7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F849nwhsweqvfxtrg9o7l.png" alt="Diagram generated by ChatGPT" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 12 – Cost Optimization Basics</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Thu, 12 Mar 2026 10:05:33 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-12-cost-optimization-basics-3e2j</link>
      <guid>https://dev.to/nexoperose/day-12-cost-optimization-basics-3e2j</guid>
      <description>&lt;p&gt;Day 12 focused on cost optimization fundamentals in Spark-based data workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwg3kb89dmgs08fkqe1wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwg3kb89dmgs08fkqe1wt.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The objective was to analyze job runtime behavior and identify common patterns that increase compute cost in distributed processing systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmjhtp6043a2vdfzv76n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmjhtp6043a2vdfzv76n.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first experiment measured runtime consistency for a heavy analytical query. The initial execution took approximately 39.87 seconds, while the second execution completed in about 2.35 seconds, demonstrating the difference between cold and warm query execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bd9zxhybc9wukn3uysi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bd9zxhybc9wukn3uysi.png" alt="Notebook" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, the impact of unnecessary actions was explored. Executing &lt;strong&gt;.show()&lt;/strong&gt;, &lt;strong&gt;.count()&lt;/strong&gt;, and &lt;strong&gt;.collect()&lt;/strong&gt; on the same DataFrame triggered three separate Spark jobs, each scanning approximately 1.08 GB of data. Consolidating to a single action, or persisting the result for reuse, brought runtime down to around 1.22 seconds.&lt;/p&gt;
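
&lt;p&gt;The pattern and its fix, in schematic form (&lt;strong&gt;df&lt;/strong&gt; stands for the event DataFrame; note that persistence may be restricted on serverless compute, as Day 10 found):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Anti-pattern: three actions, three full scans of the same data.
df.show()
df.count()
df.collect()

# Better: materialize once, then reuse (where the runtime allows caching).
df.cache()
df.count()   # first action populates the cache
df.show()    # served from the cache, no second scan
&lt;/code&gt;&lt;/pre&gt;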

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzq6v2ljz3k5zl8233pmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzq6v2ljz3k5zl8233pmi.png" alt="Notebook" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additional experiments highlighted query optimization techniques. Simplifying a complex aggregation query reduced runtime from 7.48 seconds to 1.66 seconds. Avoiding &lt;strong&gt;SELECT *&lt;/strong&gt; and selecting only required columns further cut execution time from 2.85 seconds to 1.44 seconds.&lt;/p&gt;
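
&lt;p&gt;Schematically, with assumed table and column names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Reads every column of the table.
wide = spark.sql("SELECT * FROM ecom_events WHERE event_type = 'purchase'")

# Reads only the columns the analysis needs, letting the columnar
# reader skip the rest (column names are assumptions).
narrow = spark.sql(
    "SELECT user_id, price FROM ecom_events WHERE event_type = 'purchase'"
)
&lt;/code&gt;&lt;/pre&gt;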

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji24aq47p5uh2hh3466x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji24aq47p5uh2hh3466x.png" alt="Notebook" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Throughout the analysis, ChatGPT supported interpretation of runtime results and identification of practical cost-saving strategies within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1y3akg66mcixdqa0pl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1y3akg66mcixdqa0pl3.png" alt="Notebook" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These observations illustrate how query design directly influences compute cost in distributed data processing systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr1pne3yqey1871ku71p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr1pne3yqey1871ku71p.png" alt="Codes" width="800" height="2680"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 11 – Time Travel &amp; Data Recovery</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:07:36 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-11-time-travel-data-recovery-4044</link>
      <guid>https://dev.to/nexoperose/day-11-time-travel-data-recovery-4044</guid>
      <description>&lt;p&gt;Day 11 focused on Delta Lake’s time travel functionality and how historical data versions can be accessed in production data systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yekn6zzgzdvnawvyxsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yekn6zzgzdvnawvyxsb.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two test records were appended to the &lt;strong&gt;ecom_orders&lt;/strong&gt; Delta table to simulate a new ingestion event. Using &lt;strong&gt;DESCRIBE HISTORY&lt;/strong&gt;, the table version history was examined to identify the newly created version. The dataset was then queried using &lt;strong&gt;VERSION AS OF&lt;/strong&gt; to retrieve the table state before the append operation.&lt;/p&gt;
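
&lt;p&gt;In notebook form, the two queries look roughly like this (the version number follows the post):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Inspect the table's version history.
display(spark.sql("DESCRIBE HISTORY ecom_orders"))

# Read the snapshot as it existed before the append operation.
before_append = spark.sql("SELECT * FROM ecom_orders VERSION AS OF 6")
display(before_append)
&lt;/code&gt;&lt;/pre&gt;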

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faufor13jarosykopzudm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faufor13jarosykopzudm.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkf275htdxmquiu4xy45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkf275htdxmquiu4xy45.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row counts were compared between Version 6 and Version 7 to validate the append operation. The dataset size increased from 312,456,680 rows to 312,456,682 rows, confirming that two new records were successfully added.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc320oz5z6fry34ynf6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc320oz5z6fry34ynf6j.png" alt="Notebook" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7expsdmadzs0gmrrtvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7expsdmadzs0gmrrtvg.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additional filtering queries isolated the newly inserted rows using high user IDs. Timestamp-based time travel was also demonstrated to retrieve the table snapshot immediately before the append occurred.&lt;/p&gt;
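
&lt;p&gt;The timestamp-based variant takes the same shape; the timestamp below is a placeholder, not the actual value used:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Snapshot immediately before the append; placeholder timestamp.
spark.sql(
    "SELECT COUNT(*) FROM ecom_orders TIMESTAMP AS OF '2026-03-11 15:00:00'"
).show()
&lt;/code&gt;&lt;/pre&gt;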

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81vdtw4pku3ix91ub3u9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81vdtw4pku3ix91ub3u9.png" alt="Notebook" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An earlier attempt to query the initial table version failed due to Delta retention policies and a prior VACUUM operation, highlighting an important production consideration when relying on historical table versions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvedndfoqmazl8t92n8it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvedndfoqmazl8t92n8it.png" alt="Notebook" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the implementation process, ChatGPT helped diagnose schema mismatches during append operations and guided the correct use of Delta time travel queries within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7tx5mqmvnpxhkqf1l7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7tx5mqmvnpxhkqf1l7k.png" alt="Codes" width="800" height="3098"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 10 - Query Optimization &amp; Explain Plans</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:52:28 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-10-query-optimization-explain-plans-1dbp</link>
      <guid>https://dev.to/nexoperose/day-10-query-optimization-explain-plans-1dbp</guid>
      <description>&lt;p&gt;Day 10 of Phase 2 focused on Query Optimization &amp;amp; Execution Analysis in Spark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqggq8ipw14koimnoze9n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqggq8ipw14koimnoze9n.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The objective was to run a heavy analytical query on the event dataset, inspect its execution plan, and analyze how query design affects performance. A purchase aggregation query was executed to identify the most active buyers in the dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc31v84ms9zu9560tbzrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc31v84ms9zu9560tbzrn.png" alt="Notebook" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using Spark’s &lt;strong&gt;EXPLAIN&lt;/strong&gt; functionality, the parsed, analyzed, optimized, and physical execution plans were examined. The physical plan revealed stages such as Photon scans, hash aggregation, shuffle exchanges, and sorting operations.&lt;/p&gt;
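
&lt;p&gt;A representative version of the query and plan inspection, with assumed table and column names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;top_buyers = spark.sql("""
    SELECT user_id, COUNT(*) AS purchases
    FROM ecom_events
    WHERE event_type = 'purchase'
    GROUP BY user_id
    ORDER BY purchases DESC
    LIMIT 20
""")

# True prints all four plans: parsed, analyzed, optimized, physical.
top_buyers.explain(True)
&lt;/code&gt;&lt;/pre&gt;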

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t9vctqazi4ccpsmb8i1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t9vctqazi4ccpsmb8i1.png" alt="Notebook" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Execution timing demonstrated the effect of query complexity. The aggregation query executed in approximately 2.20 seconds, while a simplified projection query that removed aggregation and sorting completed in approximately 1.41 seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7zq5x3ro134f4goup4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7zq5x3ro134f4goup4f.png" alt="Notebook" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Caching was attempted as part of the optimization workflow, but serverless compute restrictions prevented persistence operations. Optimization was therefore demonstrated through query simplification and explain-plan interpretation instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqddc76klr99d6tz4byq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqddc76klr99d6tz4byq.png" alt="Notebook" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the process, ChatGPT assisted with explain-plan interpretation and query optimization reasoning within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ekkyf4nhhkxmzys3kpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ekkyf4nhhkxmzys3kpl.png" alt="Codes" width="761" height="1947"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 9 - Recommendation System</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Mon, 09 Mar 2026 11:18:05 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-9-recommendation-system-4md7</link>
      <guid>https://dev.to/nexoperose/day-9-recommendation-system-4md7</guid>
      <description>&lt;p&gt;Day 9 of Phase 2: AI System Building focused on implementing a collaborative filtering Recommendation System using ALS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F327o53owioqtycba39gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F327o53owioqtycba39gi.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;User interactions were mapped into rating values (purchase = 3, cart = 2, view = 1) to simulate implicit feedback strength. An ALS model was trained on a controlled subset of users to prevent memory overflow in a shared/serverless environment.&lt;/p&gt;
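
&lt;p&gt;A compact sketch of the setup described here; column names are assumptions, and a row-level sample stands in for the post's controlled user subset:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import functions as F
from pyspark.ml.recommendation import ALS

# Map interaction types to implicit-feedback ratings (purchase=3, cart=2, view=1).
ratings = events.select(
    F.col("user_id").cast("int").alias("user"),
    F.col("product_id").cast("int").alias("item"),
    F.when(F.col("event_type") == "purchase", 3)
     .when(F.col("event_type") == "cart", 2)
     .otherwise(1)
     .alias("rating"),
)

# Train on a small sample to stay within serverless memory limits.
als = ALS(userCol="user", itemCol="item", ratingCol="rating",
          rank=10, maxIter=5, coldStartStrategy="drop", seed=42)
model = als.fit(ratings.sample(fraction=0.05, seed=42))
&lt;/code&gt;&lt;/pre&gt;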

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5bp8icwm7pwtactj7hr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5bp8icwm7pwtactj7hr.png" alt="Notebook" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1ol09n8hx1esiyun8cn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1ol09n8hx1esiyun8cn.png" alt="Notebook" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Initial attempts using StringIndexer caused model size overflow due to high cardinality. Numeric casting of user and product IDs resolved this issue. Training on the full dataset resulted in heap memory errors, so user sampling and product pool limitation were applied to stabilize computation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt7y1v7losn28zc1l0uo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt7y1v7losn28zc1l0uo.png" alt="Notebook" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28zbtznlxiuw8qz34id5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28zbtznlxiuw8qz34id5.png" alt="Notebook" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because Unity Catalog restricts nested array rendering, manual candidate scoring and window-based ranking were implemented to generate Top-5 recommendations per user. Historical interactions were removed to ensure novelty in recommendations, which reduced counts for some users due to limited candidate coverage.&lt;/p&gt;
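
&lt;p&gt;The manual ranking step could be sketched as follows; &lt;strong&gt;candidates&lt;/strong&gt; is an assumed DataFrame of user-product pairs to score:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import Window

scored = model.transform(candidates)

# Drop items the user has already interacted with, to keep novelty.
fresh = scored.join(ratings.select("user", "item"), ["user", "item"], "left_anti")

# Rank candidates per user and keep the top 5.
w = Window.partitionBy("user").orderBy(F.desc("prediction"))
top5 = fresh.withColumn("rank", F.row_number().over(w)).filter("rank &amp;lt;= 5")
&lt;/code&gt;&lt;/pre&gt;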

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhwy0dskvassmy8hz7j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhwy0dskvassmy8hz7j5.png" alt="Notebook" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko0apa0eljz4y8uziaf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko0apa0eljz4y8uziaf9.png" alt="Notebook" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Throughout implementation, ChatGPT supported architectural decisions, memory optimization, and troubleshooting within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17x27j9ysfpz00upcn7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17x27j9ysfpz00upcn7f.png" alt="Notebook" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ihn5521o0vo9jvqg1c4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ihn5521o0vo9jvqg1c4.png" alt="Codes" width="761" height="1947"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 8 - Batch Inference Pipeline</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Sun, 08 Mar 2026 12:03:10 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-8-batch-inference-pipeline-1n0o</link>
      <guid>https://dev.to/nexoperose/day-8-batch-inference-pipeline-1n0o</guid>
      <description>&lt;p&gt;Day 8 of Phase 2: AI System Building focused on implementing a batch inference pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynobbg84q8e519wwz7vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynobbg84q8e519wwz7vg.png" alt="Concept Visual" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the engineered Silver feature table, feature vectors were assembled and applied to the trained Random Forest model to score over 5.3 million users. The model generated prediction probabilities and class outputs, which were then persisted into a managed Gold Delta table to simulate a production-style scoring layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo6ir92fbyu6vqm1xhj1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo6ir92fbyu6vqm1xhj1.png" alt="Notebook" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtgo5qgvoc41ji1hrwjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtgo5qgvoc41ji1hrwjg.png" alt="Notebook" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, Spark ML probability outputs were stored as VectorUDT types, requiring explicit conversion before extracting class probabilities. Additionally, notebook schema rendering messages initially appeared as errors but were confirmed to be display-related rather than pipeline failures. These debugging steps reinforced the importance of understanding Spark’s internal data types during inference workflows.&lt;/p&gt;
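
&lt;p&gt;The conversion step in sketch form; &lt;strong&gt;rf_model&lt;/strong&gt; and &lt;strong&gt;feature_df&lt;/strong&gt; are assumed names for the trained model and assembled feature table:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

scored = rf_model.transform(feature_df)

# probability is a VectorUDT column; convert it to an array before
# extracting the positive-class probability.
results = scored.select(
    "user_id",
    vector_to_array(F.col("probability"))[1].alias("purchase_probability"),
    "prediction",
)
&lt;/code&gt;&lt;/pre&gt;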

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks5ajkktnudd57tm02j3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks5ajkktnudd57tm02j3.png" alt="Notebook" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu1fi5mn99f8nah8ngb0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu1fi5mn99f8nah8ngb0.png" alt="Notebook" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The highest-ranked users displayed probabilities close to 1.0, consistent with earlier model evaluation outcomes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzs566nrmy446mtltoef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzs566nrmy446mtltoef.png" alt="Notebook" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Throughout the process, ChatGPT assisted in resolving vector extraction issues and validating inference pipeline logic within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfhae5alne72k1jzlogi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfhae5alne72k1jzlogi.png" alt="Codes" width="800" height="2863"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This exercise completed the transition from experimentation to operational batch scoring in the AI system workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>sql</category>
      <category>database</category>
    </item>
    <item>
      <title>DAY 7 - MLflow Tracking</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Sat, 07 Mar 2026 12:44:06 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-7-mlflow-tracking-33bb</link>
      <guid>https://dev.to/nexoperose/day-7-mlflow-tracking-33bb</guid>
      <description>&lt;p&gt;Day 7 of Phase 2: AI System Building focused on experiment tracking using MLflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fw53fo27l1g8d4soz7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fw53fo27l1g8d4soz7c.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The objective was to log trained model runs, record parameters and evaluation metrics, and store model artifacts for reproducibility and comparison. Both Logistic Regression and Random Forest models were logged along with ROC-AUC scores, which were observed to be close to 1.0.&lt;/p&gt;
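
&lt;p&gt;In outline, a run of this kind logs parameters, the ROC-AUC metric, and the model artifact together; names and values below are illustrative, with &lt;strong&gt;auc&lt;/strong&gt; assumed to be computed earlier in the notebook:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mlflow

with mlflow.start_run(run_name="rf_purchase_model"):
    mlflow.log_param("numTrees", 100)
    mlflow.log_metric("roc_auc", auc)
    # dfs_tmpdir points Spark ML serialization at a UC Volume, which the
    # shared/serverless workspace required (path is a placeholder).
    mlflow.spark.log_model(rf_model, "model",
                           dfs_tmpdir="/Volumes/main/default/mlflow_tmp")
&lt;/code&gt;&lt;/pre&gt;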

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwmcf8wqkk9z1ooxmjup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwmcf8wqkk9z1ooxmjup.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8snwiyyinv8k6x8qvgrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8snwiyyinv8k6x8qvgrs.png" alt="Notebook" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, environment constraints in the shared/serverless workspace required specifying a Unity Catalog Volume path for temporary storage when logging Spark ML models. This highlighted how ML lifecycle management depends on infrastructure configuration, not just modeling logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2s9t0zqb2xkctp8i2iq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2s9t0zqb2xkctp8i2iq.png" alt="Notebook" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The exercise reinforced the importance of experiment traceability, artifact storage, and reproducibility in scalable AI workflows. It also clarified the difference between logging a model and registering it within a model registry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5f63byiyb83ktgvy19h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5f63byiyb83ktgvy19h.png" alt="Notebook &amp;amp; MLflow UI" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During troubleshooting and configuration, ChatGPT supported validation of MLflow setup and interpretation of lifecycle concepts within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka6fdktmw7q340z5hov2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka6fdktmw7q340z5hov2.png" alt="Codes" width="800" height="2817"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 6 - Model Training &amp; Tuning</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Fri, 06 Mar 2026 14:15:05 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-6-model-training-tuning-1l67</link>
      <guid>https://dev.to/nexoperose/day-6-model-training-tuning-1l67</guid>
      <description>&lt;p&gt;As part of Day 6 of Phase 2: AI System Building in the Databricks 14 Days AI Challenge – 2 (Advanced), I focused on model training, tuning, and evaluation using the supervised dataset prepared earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52rtv41vq8ueji4tmkpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52rtv41vq8ueji4tmkpj.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feature vectors were assembled from engineered user-level metrics, and both Logistic Regression and Random Forest classifiers were trained using an 80/20 train-test split. Model performance was evaluated using ROC-AUC to ensure threshold-independent comparison.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzhc51b7ub6dogha1x71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzhc51b7ub6dogha1x71.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev70fficeylvlsynihbn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev70fficeylvlsynihbn.png" alt="Notebook" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Due to workspace limitations in the shared/serverless environment, CrossValidator-based tuning was not supported because of temporary storage configuration restrictions. As a result, hyperparameter tuning for Random Forest was performed manually by iterating over different tree counts and depths.&lt;/p&gt;
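
&lt;p&gt;The manual loop amounts to a small hand-rolled grid search; grid values are illustrative, and &lt;strong&gt;train&lt;/strong&gt;, &lt;strong&gt;test&lt;/strong&gt;, and &lt;strong&gt;evaluator&lt;/strong&gt; are assumed from earlier cells:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.ml.classification import RandomForestClassifier

results = []
for num_trees in [20, 50, 100]:
    for depth in [5, 10]:
        rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                                    numTrees=num_trees, maxDepth=depth, seed=42)
        auc = evaluator.evaluate(rf.fit(train).transform(test))
        results.append((num_trees, depth, auc))

best = max(results, key=lambda r: r[2])
print("best (numTrees, maxDepth, AUC):", best)
&lt;/code&gt;&lt;/pre&gt;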

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2xltis6ra8j2twestz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2xltis6ra8j2twestz5.png" alt="Notebook" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfg8dq1yi801y0fl371d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfg8dq1yi801y0fl371d.png" alt="Notebook" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The observed AUC values were extremely high (≈0.999999 for Logistic Regression and 1.0 for Random Forest). Because the engineered features include purchase counts and spending, the purchase label is almost fully determined by the features themselves, so near-perfect scores point to information leakage rather than genuine predictive power. This highlighted the need to carefully assess feature-label relationships in supervised learning workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz52i5b8ekai9otq2bka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz52i5b8ekai9otq2bka.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, ChatGPT supported validation of model configuration, evaluation logic, environment troubleshooting, and interpretation of performance metrics within scalable AI system design practices inside Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh8mmpavmh8lr5sn8lgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh8mmpavmh8lr5sn8lgz.png" alt="Codes" width="800" height="3619"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>pyspark</category>
    </item>
    <item>
      <title>DAY 5 - Production-Grade Feature Engineering</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Thu, 05 Mar 2026 19:04:30 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-5-production-grade-feature-engineering-1n63</link>
      <guid>https://dev.to/nexoperose/day-5-production-grade-feature-engineering-1n63</guid>
      <description>&lt;p&gt;As part of Day 5 of Phase 2: AI System Building in the Databricks 14 Days AI Challenge – 2 (Advanced), I focused on preparing a production-ready supervised learning dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdjfkkec2qi378zyvbvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdjfkkec2qi378zyvbvh.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process began by creating a binary purchase label at the user level using event-level data. A user was labeled as 1 if at least one purchase event existed, otherwise 0. This label dataset was then joined with the previously engineered Silver feature table to create a consolidated training dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt5ka0w1n9ebx7xuvix6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt5ka0w1n9ebx7xuvix6.png" alt="Notebook" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An 80/20 train-test split was applied with a fixed seed to ensure reproducibility. Class distributions were then validated on the full dataset and on both splits; the observed class ratio remained stable across all three partitions, confirming that the split preserved the label balance.&lt;/p&gt;
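
&lt;p&gt;Continuing the sketch above with the same hypothetical names, the split and the distribution check reduce to a few lines:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
# Reproducible 80/20 split with a fixed seed.
train_df, test_df = training.randomSplit([0.8, 0.2], seed=42)

# Confirm that class proportions stay consistent across partitions.
for name, df in [("full", training), ("train", train_df), ("test", test_df)]:
    total = df.count()
    positives = df.filter(F.col("label") == 1).count()
    print(name, total, round(positives / total, 4))
&lt;/code&gt;&lt;/pre&gt;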

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6vrk344qdhci23ubux2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6vrk344qdhci23ubux2.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, ChatGPT was used as a technical reference to validate the aggregation logic, review join consistency, and confirm that the class distribution calculations aligned with scalable data engineering workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcglthifohen7uwzyuu5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcglthifohen7uwzyuu5m.png" alt="Codes" width="722" height="1217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>DAY 4 – Structured Streaming (Basic Simulation)</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Wed, 04 Mar 2026 17:12:28 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-4-structured-streaming-basic-simulation-3pl1</link>
      <guid>https://dev.to/nexoperose/day-4-structured-streaming-basic-simulation-3pl1</guid>
      <description>&lt;p&gt;As part of Day 4 of Phase 1: Better Data Engineering in the Databricks 14 Days AI Challenge – 2 (Advanced), I explored the basics of Structured Streaming through a folder-based simulation approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2mpce77l0f0dx2lxlk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2mpce77l0f0dx2lxlk5.png" alt="Day-4 in Short" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The objective was to simulate incremental data ingestion by monitoring a folder for incoming files and writing processed results into Delta format. Streaming input and checkpoint directories were prepared within Volume storage, and a predefined schema was used to configure streaming reads from curated data.&lt;/p&gt;
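
&lt;p&gt;A minimal sketch of that setup. The Volume paths and the schema fields are hypothetical placeholders chosen for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical Volume paths for the folder-based simulation.
input_path = "/Volumes/main/ecommerce/stream_input"
checkpoint_path = "/Volumes/main/ecommerce/stream_checkpoint"
output_path = "/Volumes/main/ecommerce/stream_output"

# A predefined schema lets the stream start without schema inference.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("price", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Monitor the input folder for newly arriving files.
stream = spark.readStream.schema(schema).format("csv").load(input_path)
&lt;/code&gt;&lt;/pre&gt;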

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k0v9zgoi9c9lbpnoyml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k0v9zgoi9c9lbpnoyml.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, several practical challenges were encountered: Volume paths had to be validated, input folders prepared, and workspace limitations ruled out continuous streaming triggers. The workflow was therefore adapted to an alternative trigger suited to controlled, run-to-completion execution. Checkpoint behavior also showed that files detected in earlier runs are skipped on subsequent runs, which is precisely how incremental ingestion is maintained.&lt;/p&gt;
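
&lt;p&gt;A trigger suited to this kind of controlled, run-to-completion execution is availableNow, which processes every pending file and then stops; the sketch below assumes it in place of a continuous trigger, building on the reader defined above:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
# availableNow processes everything currently pending, commits progress to
# the checkpoint, then stops; files already recorded in the checkpoint are
# skipped on the next run.
query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .start(output_path)
)
query.awaitTermination()
&lt;/code&gt;&lt;/pre&gt;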

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wshdst6yhjf3bfy7x4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wshdst6yhjf3bfy7x4r.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although the streaming output could not be consistently demonstrated under the environment's constraints, the exercise provided valuable insight into how storage configuration, checkpoints, and execution environments affect streaming pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrwhkrmrm79ekm92lg7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrwhkrmrm79ekm92lg7x.png" alt="Codes" width="800" height="3026"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 3 - Job Orchestration Basics</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Tue, 03 Mar 2026 10:15:17 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-3-job-orchestration-basics-1hin</link>
      <guid>https://dev.to/nexoperose/day-3-job-orchestration-basics-1hin</guid>
      <description>&lt;p&gt;As part of Day 3 of Phase 1: Better Data Engineering in the Databricks 14 Days AI Challenge – 2 (Advanced), the focus moved toward understanding job orchestration and preparing notebooks for automated execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7b62f5nebokucc98ccn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7b62f5nebokucc98ccn.png" alt="An Overview" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The notebook was first enhanced by introducing widget parameters to support runtime configuration. This allowed the workflow to remain flexible and reusable instead of relying on hardcoded execution logic.&lt;/p&gt;
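
&lt;p&gt;In Databricks this takes only a few lines of dbutils; the widget names and defaults below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
# Define widget parameters with defaults, then read them at runtime.
dbutils.widgets.text("catalog", "main")
dbutils.widgets.text("source_table", "ecommerce.events")

catalog = dbutils.widgets.get("catalog")
source_table = dbutils.widgets.get("source_table")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When the notebook later runs as a Job task, values supplied through the task configuration override these defaults, which is what keeps the workflow free of hardcoded logic.&lt;/p&gt;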

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d5s60bg7jqz1xexvrgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d5s60bg7jqz1xexvrgb.png" alt="The Notebook" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The feature engineering logic developed earlier was then modularized into a function. Organizing transformations this way improved readability and made the notebook better suited for pipeline-based execution.&lt;/p&gt;
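
&lt;p&gt;A simplified sketch of that structure, reusing the hypothetical widget value from above and illustrative column names:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
from pyspark.sql import functions as F

def build_user_features(events_df):
    # Aggregate event-level rows into user-level behavioral features.
    return events_df.groupBy("user_id").agg(
        F.count("*").alias("total_events"),
        F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("purchases"),
    )

# The source table comes from the widget parameter rather than a hardcoded name.
features_df = build_user_features(spark.table(source_table))
&lt;/code&gt;&lt;/pre&gt;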

&lt;p&gt;Following this, a Job was created using the workflow interface in Databricks. The notebook was added as a task, parameters were passed through configuration, and a daily schedule was defined to automate execution.&lt;/p&gt;
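
&lt;p&gt;The Job itself was assembled in the UI, but the same configuration can also be expressed in code against the Jobs 2.1 REST API. Every name, path, URL, and schedule value below is a hypothetical placeholder:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
import requests

# Placeholder workspace URL and personal access token.
host = "https://adb-1234567890.12.azuredatabricks.net"
token = "dapi..."

payload = {
    "name": "daily-feature-engineering",
    "tasks": [{
        "task_key": "feature_engineering",
        "notebook_task": {
            "notebook_path": "/Workspace/Users/me/feature_engineering",
            "base_parameters": {"catalog": "main", "source_table": "ecommerce.events"},
        },
    }],
    # Quartz cron: run daily at 06:00 UTC.
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    host + "/api/2.1/jobs/create",
    headers={"Authorization": "Bearer " + token},
    json=payload,
)
print(resp.json())
&lt;/code&gt;&lt;/pre&gt;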

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrmbl6slvxhi816jh0ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrmbl6slvxhi816jh0ui.png" alt="Steps in Job Creation" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclgaot1wr3hok7ezar95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclgaot1wr3hok7ezar95.png" alt="Steps in Job Creation" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, ChatGPT supported the process as a technical reference for validating orchestration concepts and notebook structuring decisions.&lt;/p&gt;

&lt;p&gt;This exercise helped demonstrate how data workflows evolve from manual notebook runs into repeatable and scheduled data engineering pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6044h6c3e5lmqk4tp5v7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6044h6c3e5lmqk4tp5v7.png" alt="The Codes" width="683" height="852"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
