As part of Day 2 of Phase 1: Better Data Engineering in the Databricks 14 Days AI Challenge – 2 (Advanced), I focused on building a user-level Feature Table using a Silver Layer approach.
The workflow started by loading the previously created merged Delta table rather than reprocessing the raw datasets. The objective was to transform event-level records into structured, user-level features that can be reused for analytics or downstream machine learning tasks.
Using PySpark aggregations, I generated features such as total events, number of purchases, total spending, and average price across interactions. Non-purchase events were intentionally included to capture overall engagement patterns rather than restricting analysis only to completed transactions.
To ensure reliability, duplicate user records were removed explicitly and feature-quality checks were run. Null validation confirmed there were no missing user identifiers, while descriptive statistics helped review behavior across more than 5.3 million users.
During implementation, ChatGPT supported reasoning around aggregation logic and validation checks aligned with Silver layer practices.
The final dataset was persisted as a Delta table in Databricks, reinforcing structured and reusable data engineering practices.