Step 1: Data Preparation
The first step was cleaning the dataset and handling missing values.
- Missing values in categorical features were filled with `'NA'`
- Missing values in numerical features were filled with `0.0`
This ensured the dataset was consistent and ready for analysis.
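Here is a minimal sketch of that imputation step, assuming the data is loaded into a pandas DataFrame (the file name `leads.csv` is a placeholder):

```python
import pandas as pd

df = pd.read_csv('leads.csv')  # hypothetical file name

# Identify column types, then fill missing values:
# 'NA' for categorical columns, 0.0 for numerical ones
categorical = df.select_dtypes(include='object').columns
numerical = df.select_dtypes(include='number').columns

df[categorical] = df[categorical].fillna('NA')
df[numerical] = df[numerical].fillna(0.0)
```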
Step 2: Exploring the Data
I examined the dataset’s key patterns and relationships.
- The most frequent industry among leads turned out to be the mode of the `industry` column.
- I generated a correlation matrix to identify the strongest relationships among numerical features, a crucial step before modeling.
This helped highlight features that might have overlapping or dependent influences on the target variable.
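Both checks are short in pandas; this sketch reuses the `df` from the previous step:

```python
# Most frequent industry = mode of the `industry` column
print(df['industry'].mode()[0])

# Correlation matrix over the numerical features only
print(df.select_dtypes(include='number').corr())
```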
Step 3: Feature Engineering & Splitting
To evaluate model performance fairly, I split the dataset into train (60%), validation (20%), and test (20%) sets, ensuring reproducibility with a fixed random seed.
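One way to get a 60/20/20 split is two chained calls to scikit-learn's `train_test_split` (the seed value 42 is an assumption, matching the one used for the model below):

```python
from sklearn.model_selection import train_test_split

# First carve off the 20% test set, then split the remaining 80%
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% = 20% of the original data
```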
Step 4: Understanding Feature Relationships
Using mutual information, I explored which categorical features had the strongest relationship with the target (`converted`). This revealed how factors like industry, employment status, and lead source contribute to conversion likelihood.
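A sketch of that check with scikit-learn's `mutual_info_score`, computed on the training set (the exact column names are assumptions based on the features mentioned above):

```python
from sklearn.metrics import mutual_info_score

# Mutual information between each categorical feature and the target
for col in ['industry', 'employment_status', 'lead_source']:
    mi = mutual_info_score(df_train[col], df_train['converted'])
    print(f'{col}: {mi:.3f}')
```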
Step 5: Logistic Regression Model
After encoding all categorical features with one-hot encoding, I trained a logistic regression model with these parameters:

```python
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
```
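In context, the training step looks roughly like this; `DictVectorizer` is one common way to do the one-hot encoding, though not necessarily the exact encoder used here:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# One-hot encode the features (strings become dummy columns,
# numbers pass through unchanged)
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.drop(columns='converted').to_dict(orient='records'))
X_val = dv.transform(df_val.drop(columns='converted').to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, df_train['converted'])

print(accuracy_score(df_val['converted'], model.predict(X_val)))
```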
The model achieved a validation accuracy of about 0.68; the closest option on the grading scale was 0.64.
While not a standout score, it provided valuable insights into which features were most predictive and where improvements could be made.
Step 6: Feature Importance & Regularization
I then ran feature elimination experiments, dropping one feature at a time (like `industry`, `lead_score`, and `employment_status`) to see which had the smallest impact on accuracy.
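A sketch of that loop, built on the training pipeline above (feature names as assumed earlier):

```python
baseline = accuracy_score(df_val['converted'], model.predict(X_val))

for feature in ['industry', 'lead_score', 'employment_status']:
    cols = [c for c in df_train.columns if c not in ('converted', feature)]
    dv_f = DictVectorizer(sparse=False)
    X_tr = dv_f.fit_transform(df_train[cols].to_dict(orient='records'))
    X_va = dv_f.transform(df_val[cols].to_dict(orient='records'))
    m = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    m.fit(X_tr, df_train['converted'])
    acc = accuracy_score(df_val['converted'], m.predict(X_va))
    # The smaller the drop from baseline, the less the feature matters
    print(f'without {feature}: {acc:.3f} (diff {baseline - acc:+.3f})')
```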
Finally, I tuned the model’s regularization strength (C) to find the best-performing setup.
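The tuning itself is a small grid search over C on the validation set (this particular grid of values is an assumption):

```python
# Smaller C = stronger regularization; pick the best validation accuracy
for C in [0.01, 0.1, 1, 10, 100]:
    m = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    m.fit(X_train, df_train['converted'])
    print(f'C={C}: {accuracy_score(df_val["converted"], m.predict(X_val)):.3f}')
```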
Key Takeaways
This project deepened my understanding of:
- Data preprocessing and imputation
- Feature correlation and mutual information
- Model validation and tuning
- The balance between model complexity and generalization
Final Thoughts
This assignment reinforced a key lesson — predictive modeling isn’t just about achieving high accuracy; it’s about building interpretable, actionable insights. Every model is a story told in data, and each iteration gets you closer to understanding your audience, customers, or users better.