Step 1: Data Preparation
The first step was cleaning the dataset and handling missing values.
- Missing values in categorical features were filled with `'NA'`
- Missing values in numerical features were filled with `0.0`
This ensured the dataset was consistent and ready for analysis.
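Here is a minimal sketch of that imputation step, assuming the data is loaded into a pandas DataFrame (the file name `leads.csv` is a placeholder):

```python
import pandas as pd

df = pd.read_csv('leads.csv')  # hypothetical file name

# Identify column types, then fill missing values:
# 'NA' for categorical columns, 0.0 for numerical ones
categorical = df.select_dtypes(include='object').columns
numerical = df.select_dtypes(include='number').columns

df[categorical] = df[categorical].fillna('NA')
df[numerical] = df[numerical].fillna(0.0)
```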
Step 2: Exploring the Data
I examined the dataset’s key patterns and relationships.
- The most frequent industry among leads turned out to be the mode of the `industry` column.
- I generated a correlation matrix to identify the strongest relationships among numerical features, a crucial step before modeling.
This helped highlight features that might have overlapping or dependent influences on the target variable.
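Both checks are short in pandas; this sketch reuses the `df` from the previous step:

```python
# Most frequent industry = mode of the `industry` column
print(df['industry'].mode()[0])

# Correlation matrix over the numerical features only
print(df.select_dtypes(include='number').corr())
```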
Step 3: Feature Engineering & Splitting
To evaluate model performance fairly, I split the dataset into train (60%), validation (20%), and test (20%) sets, ensuring reproducibility with a fixed random seed.
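One way to get a 60/20/20 split is two chained calls to scikit-learn's `train_test_split` (the seed value 42 is an assumption, matching the one used for the model below):

```python
from sklearn.model_selection import train_test_split

# First carve off the 20% test set, then split the remaining 80%
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% = 20% of the original data
```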
Step 4: Understanding Feature Relationships
Using mutual information, I explored which categorical features had the strongest relationship with the target (`converted`). This revealed how factors like industry, employment status, and lead source contribute to conversion likelihood.
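A sketch of that check with scikit-learn's `mutual_info_score`, computed on the training set (the exact column names are assumptions based on the features mentioned above):

```python
from sklearn.metrics import mutual_info_score

# Mutual information between each categorical feature and the target
for col in ['industry', 'employment_status', 'lead_source']:
    mi = mutual_info_score(df_train[col], df_train['converted'])
    print(f'{col}: {mi:.3f}')
```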
Step 5: Logistic Regression Model
After encoding all categorical features with one-hot encoding, I trained a logistic regression model with these parameters:

```python
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
```
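In context, the training step looks roughly like this; `DictVectorizer` is one common way to do the one-hot encoding, though not necessarily the exact encoder used here:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# One-hot encode the features (strings become dummy columns,
# numbers pass through unchanged)
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.drop(columns='converted').to_dict(orient='records'))
X_val = dv.transform(df_val.drop(columns='converted').to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, df_train['converted'])

print(accuracy_score(df_val['converted'], model.predict(X_val)))
```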
The model achieved a validation accuracy of about 0.68; the closest option on the grading scale was 0.64.
While not a standout score, it provided valuable insights into which features were most predictive and where improvements could be made.
Step 6: Feature Importance & Regularization
I then ran feature elimination experiments, dropping one feature at a time (like `industry`, `lead_score`, and `employment_status`) to see which had the smallest impact on accuracy.
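A sketch of that loop, built on the training pipeline above (feature names as assumed earlier):

```python
baseline = accuracy_score(df_val['converted'], model.predict(X_val))

for feature in ['industry', 'lead_score', 'employment_status']:
    cols = [c for c in df_train.columns if c not in ('converted', feature)]
    dv_f = DictVectorizer(sparse=False)
    X_tr = dv_f.fit_transform(df_train[cols].to_dict(orient='records'))
    X_va = dv_f.transform(df_val[cols].to_dict(orient='records'))
    m = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    m.fit(X_tr, df_train['converted'])
    acc = accuracy_score(df_val['converted'], m.predict(X_va))
    # The smaller the drop from baseline, the less the feature matters
    print(f'without {feature}: {acc:.3f} (diff {baseline - acc:+.3f})')
```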
Finally, I tuned the model’s regularization strength (C) to find the best-performing setup.
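The tuning itself is a small grid search over C on the validation set (this particular grid of values is an assumption):

```python
# Smaller C = stronger regularization; pick the best validation accuracy
for C in [0.01, 0.1, 1, 10, 100]:
    m = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    m.fit(X_train, df_train['converted'])
    print(f'C={C}: {accuracy_score(df_val["converted"], m.predict(X_val)):.3f}')
```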
Key Takeaways
This project deepened my understanding of:
- Data preprocessing and imputation
- Feature correlation and mutual information
- Model validation and tuning
- The balance between model complexity and generalization
Final Thoughts
This assignment reinforced a key lesson — predictive modeling isn’t just about achieving high accuracy; it’s about building interpretable, actionable insights. Every model is a story told in data, and each iteration gets you closer to understanding your audience, customers, or users better.