DEV Community

Subhasis Das

DAY 5 - Production-Grade Feature Engineering

As part of Day 5 of Phase 2: AI System Building in the Databricks 14 Days AI Challenge – 2 (Advanced), I focused on preparing a production-ready supervised learning dataset.

Visual Concept

The process began by creating a binary purchase label at the user level from event-level data. A user was labeled 1 if at least one purchase event existed for that user, and 0 otherwise. This label dataset was then joined with the previously engineered Silver feature table to create a consolidated training dataset.
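The labeling and join logic described above can be sketched in plain Python, independent of Spark. This is a minimal illustration, not the notebook's code: the event rows, the `total_events` feature, the `"purchase"` event name, and the helper names `build_labels` and `join_features` are all assumptions made for the example, and the inner join is an assumed join type.

```python
from collections import defaultdict

# Hypothetical event-level rows: (user_id, event_type).
events = [
    ("u1", "view"),
    ("u1", "purchase"),
    ("u2", "view"),
    ("u2", "cart"),
    ("u3", "purchase"),
    ("u3", "purchase"),
]

# Hypothetical user-level feature rows, standing in for the Silver feature table.
features = {
    "u1": {"total_events": 2},
    "u2": {"total_events": 2},
    "u3": {"total_events": 2},
}


def build_labels(event_rows):
    """Return {user_id: 1 if the user has at least one purchase event, else 0}."""
    labels = defaultdict(int)
    for user_id, event_type in event_rows:
        # max() keeps the label at 1 once any purchase has been seen.
        labels[user_id] = max(labels[user_id], int(event_type == "purchase"))
    return dict(labels)


def join_features(feature_rows, labels):
    """Inner join on user_id (assumed join type): keep users present on both sides."""
    return {
        user_id: {**feats, "label": labels[user_id]}
        for user_id, feats in feature_rows.items()
        if user_id in labels
    }


labels = build_labels(events)
training = join_features(features, labels)
print(training)
```

The result has one row per user, so repeated purchases by the same user (as for `u3`) do not duplicate rows in the training data.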

Notebook

An 80/20 train-test split was applied using a fixed seed to ensure reproducibility. The class distribution was then checked on the full dataset and on the train and test splits to confirm that class proportions remained consistent. The observed class ratio remained stable across partitions, which indicates that the random split did not skew the label distribution.
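A seeded split followed by a class-ratio check can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: the 100 synthetic users, the seed value `42`, and the helper names `split_rows` and `positive_rate` are not from the original notebook.

```python
import random

# Hypothetical labeled rows: (user_id, label), with 20 positives out of 100.
rows = [(f"u{i}", 1 if i % 5 == 0 else 0) for i in range(100)]


def split_rows(all_rows, train_fraction=0.8, seed=42):
    """Shuffle a copy with a fixed seed, then cut it into train and test parts."""
    shuffled = list(all_rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def positive_rate(part):
    """Share of rows with label 1; 0.0 for an empty partition."""
    if not part:
        return 0.0
    return sum(label for _, label in part) / len(part)


train, test = split_rows(rows)

# Compare the positive-class share across the full data and both partitions.
for name, part in [("full", rows), ("train", train), ("test", test)]:
    print(f"{name}: n={len(part)}, positive rate={positive_rate(part):.3f}")
```

A plain random split does not guarantee identical class ratios in each partition, which is why comparing the three rates is a useful check. If the ratios diverge noticeably, a stratified split is one common alternative.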

Notebook

During implementation, ChatGPT was used as a technical reference to validate the aggregation logic, review join consistency, and confirm that the class distribution calculations were consistent with scalable data engineering workflows.

Codes

Activity Log
