As part of Day 6 of Phase 2: AI System Building in the Databricks 14 Days AI Challenge – 2 (Advanced), I focused on model training, tuning, and evaluation using the supervised dataset prepared earlier.
Feature vectors were assembled from engineered user-level metrics, and both Logistic Regression and Random Forest classifiers were trained using an 80/20 train-test split. Model performance was evaluated using ROC-AUC to ensure threshold-independent comparison.
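The workflow above can be sketched as follows. The post uses Spark ML inside Databricks, but the steps map one-to-one onto a scikit-learn analogue, which is shown here so it runs without a Spark session; the synthetic dataset is purely illustrative and not the engineered user-level metrics from the challenge:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the assembled feature vectors
X, y = make_classification(n_samples=2000, n_features=8, random_state=42)

# 80/20 train-test split (the Spark equivalent is df.randomSplit([0.8, 0.2]))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for name, model in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)),
]:
    model.fit(X_train, y_train)
    # ROC-AUC scores predicted probabilities, so no decision threshold is needed
    results[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {results[name]:.4f}")
```

In Spark ML the same comparison uses `VectorAssembler` to build the feature column and `BinaryClassificationEvaluator(metricName="areaUnderROC")` for the metric.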
CrossValidator-based tuning was not supported in the shared/serverless workspace because of temporary-storage configuration restrictions. As a result, hyperparameter tuning for Random Forest was performed manually by iterating over different tree counts and depths.
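A manual grid loop of this kind can be sketched as below, again as a scikit-learn analogue since no Spark session is assumed; the grid values and the held-out validation split are illustrative choices, not the ones used in the challenge. Scoring on a validation split carved out of the training data (rather than the final test set) keeps the test set untouched for the last evaluation:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative synthetic data standing in for the supervised dataset
X, y = make_classification(n_samples=2000, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

best_auc, best_params = -1.0, None
# Manual replacement for CrossValidator: loop over tree counts and depths
for n_trees, depth in product([50, 100, 200], [4, 8, 12]):
    model = RandomForestClassifier(
        n_estimators=n_trees, max_depth=depth, random_state=42
    )
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_auc, best_params = auc, (n_trees, depth)

print(f"Best (n_trees, depth): {best_params}, validation AUC: {best_auc:.4f}")
```

The Spark ML version is the same loop with `RandomForestClassifier(numTrees=..., maxDepth=...)` refit on the training DataFrame each iteration.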
The observed AUC values were extremely high (≈0.999999 for Logistic Regression and 1.0 for Random Forest). Near-perfect scores like these are usually a red flag rather than a success: they suggest that one or more features encode the label (information leakage), so the feature-label relationships need to be audited before trusting the model in a supervised learning workflow.
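One common leakage diagnostic (not necessarily the one used in the challenge) is to score each feature on its own with ROC-AUC: a single column that alone achieves near-perfect AUC is almost certainly a proxy for the label. The sketch below injects a deliberately leaky column into synthetic data to show what the check surfaces:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Simulate a leaky engineered metric: essentially the label plus tiny noise
leaky = y + rng.normal(0, 0.01, size=len(y))
X_all = np.column_stack([X, leaky])

single_feature_aucs = []
for i in range(X_all.shape[1]):
    auc = roc_auc_score(y, X_all[:, i])
    # AUC is symmetric around 0.5; max(auc, 1 - auc) also catches inverted features
    single_feature_aucs.append(max(auc, 1 - auc))

suspicious = [i for i, a in enumerate(single_feature_aucs) if a > 0.99]
print("Features with near-perfect single-feature AUC:", suspicious)
```

Here the injected column (index 5) is flagged, while ordinary informative features score well below the 0.99 cutoff.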
During implementation, ChatGPT supported validation of the model configuration and evaluation logic, environment troubleshooting, and interpretation of the performance metrics, in line with scalable AI system design practices in Databricks.