Subhasis Das

DAY 14 - Final Production-Ready System

Day 14 marked the final stage of the Databricks 14 Days AI Challenge – 2 (Advanced), bringing together the various components developed throughout the challenge into a complete production-ready system.

The primary objective was to integrate the data engineering pipeline with the machine learning workflow into a single operational process. Throughout the earlier phases, individual components such as data ingestion, feature engineering, model training, experiment tracking, and inference pipelines were developed separately. Day 14 focused on combining these pieces into an end-to-end architecture capable of generating predictions from raw data.

The pipeline begins with loading data from the Delta table that stores the processed e-commerce event dataset. Feature engineering is applied to transform event-level interactions into user-level behavioral features. These features include metrics such as total user activity, number of purchases, total spending, and average transaction value.
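As a sketch of this aggregation step — shown here with pandas for portability, since the actual pipeline runs as PySpark over the Delta table, and all column names below are assumptions:

```python
import pandas as pd

# Hypothetical event-level sample; the real pipeline reads these rows
# from the processed e-commerce Delta table.
events = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3],
    "event_type": ["view", "cart", "purchase", "view", "view", "purchase"],
    "price":      [0.0, 0.0, 25.0, 0.0, 0.0, 40.0],
})

purchases = events[events["event_type"] == "purchase"]

# Roll event-level interactions up to one row of behavioral features per user.
features = (
    events.groupby("user_id")
    .agg(total_events=("event_type", "size"))
    .join(purchases.groupby("user_id")
          .agg(num_purchases=("event_type", "size"),
               total_spent=("price", "sum"),
               avg_transaction=("price", "mean")))
    .fillna(0)           # users with no purchases get zeros
    .reset_index()
)
print(features)
```

The same roll-up translates directly to a PySpark `groupBy(...).agg(...)` over the Delta table.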

Next, a purchase label is generated to identify whether each user has made a purchase. The feature dataset and label dataset are then joined to produce the final training dataset.
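A minimal sketch of the labeling and join, again in pandas with hypothetical inputs and column names:

```python
import pandas as pd

# Assumed inputs: event-level rows plus the user-level features
# produced in the previous step.
events = pd.DataFrame({
    "user_id":    [1, 1, 2, 3],
    "event_type": ["view", "purchase", "view", "purchase"],
})
features = pd.DataFrame({
    "user_id":      [1, 2, 3],
    "total_events": [2, 1, 1],
})

# Label each user 1 if they have at least one purchase event, else 0.
labels = (
    events.assign(is_purchase=events["event_type"].eq("purchase"))
    .groupby("user_id")["is_purchase"].max()
    .astype(int)
    .rename("purchase_label")
    .reset_index()
)

# Join features and labels to form the final training dataset.
training_df = features.merge(labels, on="user_id", how="inner")
print(training_df)
```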

Using this dataset, a Logistic Regression model is trained to predict the probability that a user will make a purchase. The dataset is split into training and testing subsets, and model performance is evaluated using the Area Under the ROC Curve (AUC). The evaluation confirmed that the model could effectively distinguish between purchasing and non-purchasing users based on their interaction patterns.
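The training and evaluation step looks roughly like the following — sketched with scikit-learn on synthetic data rather than the actual Spark feature dataset, with the three columns standing in for the engineered user-level features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the user-level feature matrix
# (columns playing the role of total_events, total_spent, avg_transaction).
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))
# Label loosely driven by the first feature, so there is signal to learn.
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)

# Evaluate on the held-out split using AUC, as in the pipeline.
probs = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
print(f"AUC: {auc:.3f}")
```

An AUC well above 0.5 is what indicates the model separates purchasing from non-purchasing users better than chance.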

After training, the final model is saved to a Unity Catalog volume and logged using MLflow. Because the system was executed on a serverless cluster, MLflow required a Unity Catalog temporary directory for model serialization. Adjusting the MLflow configuration allowed the model to be successfully logged and stored for future use.
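If the trained model is a Spark ML model, one knob that matches this description is the `dfs_tmpdir` argument of `mlflow.spark.log_model`, which can be pointed at a Unity Catalog volume. A sketch, where the run name and volume path are entirely hypothetical:

```python
import mlflow
import mlflow.spark

# Hypothetical Unity Catalog volume used as MLflow's serialization
# temp directory; replace catalog/schema/volume with your own.
UC_TMPDIR = "/Volumes/main/ecommerce/ml_tmp"

with mlflow.start_run(run_name="purchase_propensity"):
    mlflow.spark.log_model(
        model,                  # the trained Spark ML model
        artifact_path="model",
        dfs_tmpdir=UC_TMPDIR,   # temp dir adjustment needed on the serverless cluster
    )
```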

Once the model was persisted, batch inference was performed on the dataset to generate purchase probability predictions for each user. The predictions include the user identifier, predicted probability of purchase, and binary prediction label. These results were written to a Gold Delta table, making them accessible for downstream analytics and decision-making processes.
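A local sketch of the scoring step, using a scikit-learn stand-in for the persisted model; on Databricks the equivalent Spark DataFrame is what gets written to the Gold Delta table, and every name below is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; in the real pipeline this is the persisted model
# loaded back from MLflow / the Unity Catalog volume.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# Batch of user-level feature rows to score.
users = pd.DataFrame(rng.normal(size=(10, 3)),
                     columns=["total_events", "total_spent", "avg_transaction"])

# One row per user: identifier, purchase probability, binary label.
predictions = pd.DataFrame({
    "user_id": range(len(users)),
    "purchase_probability": model.predict_proba(users.to_numpy())[:, 1],
    "predicted_label": model.predict(users.to_numpy()),
})
print(predictions.head())

# On Databricks, the Spark version of `predictions` would be persisted, e.g.:
#   spark_df.write.format("delta").mode("overwrite") \
#           .saveAsTable("gold.user_purchase_predictions")
```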

The output dataset revealed a strong separation between users predicted to purchase and those predicted not to purchase, indicating that the engineered features provided meaningful signals for the model.

During the implementation process, ChatGPT assisted with debugging MLflow configuration issues, refining the pipeline logic, and validating the final prediction extraction steps within the Databricks environment.

Completing Day 14 demonstrated how individual data engineering and machine learning tasks can be assembled into a unified system that supports production-style predictive analytics.
