In my last article, I broke down the Data Science Workflow for beginners. It’s a great starting point for understanding the key steps in any data science project.
In this follow-up post, I am putting that workflow into action by sharing my first machine learning project:
predicting whether someone is likely to seek treatment for mental health issues based on demographic and workplace data.
This hands-on project covers the full process, from understanding the problem and exploring the dataset to training a model and evaluating its performance.
📌 Key Takeaways
- I built this machine learning model using a public mental health survey dataset from Kaggle.
- The goal: Predict whether someone is likely to seek mental health treatment.
- Best model achieved ~82% accuracy.
- Key predictors: workplace support, family history, and how much mental health interferes with work.
- Full code: 👉 GitHub Notebook
📌 Why This Project Matters
Mental health is deeply personal, but the decision to seek treatment is often influenced by external conditions like work culture, stigma, or lack of access. By modeling treatment-seeking behavior:
- ✅ We identify at-risk individuals early.
- ✅ We encourage empathetic policy-making in the workplace.
- ✅ We normalize seeking help through data storytelling.
📌 Framing the Challenge
- Problem: Given demographic, personal, and workplace mental health history, can we predict whether someone is likely to seek treatment?
- Machine Learning Problem Type: Supervised Learning – Binary Classification
- Success Criteria (Initial Evaluation): A model with ≥ 85% accuracy and ≥ 80% recall for the positive class (that is, people who seek treatment) will be considered successful.
- Data Source: Structured, static dataset from Kaggle
📌 Dataset Overview
- 1 numerical feature (age)
- 26 categorical features (gender, self_employment, benefits, etc.)
- Target column: treatment (Yes/No)
After cleaning, we had 1,300+ usable responses from tech professionals.
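To give a sense of this step, here's a minimal sketch of loading the dataset with pandas and checking the target balance. The file name `survey.csv` is an assumption, so point it at wherever you saved the Kaggle export.

```python
import pandas as pd

# Load the Kaggle survey export (file name is assumed; use your local path)
df = pd.read_csv("survey.csv")

print(df.shape)                         # number of responses x number of columns
print(df["treatment"].value_counts())   # Yes/No balance of the target
```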
📌 Cleaning the Data: A Quick Summary
I made these key cleaning decisions:
- Dropped irrelevant or sparse columns like timestamp and comments
- Normalized gender values from wild responses like "guy (-ish)" or "femail" into "Male", "Female", "Other"
- Handled missing values with strategic imputation (e.g., replacing "self_employed" nulls with "No")
- Filtered out outliers in the age column (we kept ages 18–74)
Want to see exactly how? Check out the notebook here.
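If you just want the flavor of those decisions, here's a hedged sketch (not the exact notebook code), continuing from the loading snippet above. The column names Timestamp, comments, Gender, Age, and self_employed follow the Kaggle survey schema, and the gender mapping shown is only illustrative.

```python
# Drop columns that carry little signal
df = df.drop(columns=["Timestamp", "comments"], errors="ignore")

# Collapse free-text gender entries into three buckets (illustrative mapping)
def normalize_gender(value):
    value = str(value).strip().lower()
    if value in {"male", "m", "man", "guy (-ish)"}:
        return "Male"
    if value in {"female", "f", "woman", "femail"}:
        return "Female"
    return "Other"

df["Gender"] = df["Gender"].apply(normalize_gender)

# Impute missing self_employed values with the most common answer
df["self_employed"] = df["self_employed"].fillna("No")

# Keep plausible working ages only
df = df[df["Age"].between(18, 74)]
```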
📌 Exploratory Data Analysis (EDA)
- Treatment Distribution
Over half the respondents reported seeking treatment. This gives us a relatively balanced target, which is great for modeling!
- Gender vs Treatment
Visualizing treatment-seeking behavior by gender revealed that:
- Women were slightly more likely to seek help.
- The "Other" gender group had smaller numbers but still sought support at similar rates.
- Age Distribution
Most respondents were aged 25–44, typical for tech jobs 😆. We also created age groups like "18–24", "25–34", etc., to identify behavioral patterns.
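The charts behind these observations take only a few lines; here's a rough sketch (continuing from the cleaning snippet above), not the exact plots from the notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Target balance: how many respondents sought treatment?
df["treatment"].value_counts().plot(kind="bar", title="Treatment distribution")
plt.show()

# Treatment counts broken down by the cleaned Gender column
pd.crosstab(df["Gender"], df["treatment"]).plot(kind="bar", title="Treatment by gender")
plt.show()
```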
📌 Feature Engineering Highlights
To make the data model-ready, I:
- Grouped continuous ages into categories
- Ordinal-encoded ordered features (e.g., company size, perceived difficulty of taking leave)
- Binary-encoded yes/no columns
- One-hot encoded select categorical columns (like `benefits`, `anonymity`, `wellness_program`)
These steps helped reduce noise and preserve meaning in the data.
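Here's roughly what that looks like in pandas. The bin edges, the category ordering for `leave`, and the specific columns used below are assumptions based on the survey schema, not the exact notebook code.

```python
import pandas as pd

# Group continuous ages into the categories mentioned above
df["age_group"] = pd.cut(df["Age"], bins=[17, 24, 34, 44, 54, 74],
                         labels=["18-24", "25-34", "35-44", "45-54", "55-74"])

# Ordinal-encode an ordered feature (category order is an assumption)
leave_order = {"Very difficult": 0, "Somewhat difficult": 1, "Don't know": 2,
               "Somewhat easy": 3, "Very easy": 4}
df["leave_encoded"] = df["leave"].map(leave_order)

# Binary-encode a yes/no column
df["family_history"] = df["family_history"].map({"No": 0, "Yes": 1})

# One-hot encode selected nominal columns
df = pd.get_dummies(df, columns=["benefits", "anonymity", "wellness_program"])
```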
📌 Model Building with Random Forest
I tried 4 modeling approaches using RandomForestClassifier:
- Default model
- Manual hyperparameter tuning
- RandomizedSearchCV tuning
- GridSearchCV tuning
| Model | Accuracy |
| --- | --- |
| Default RF | 82.1% |
| Manually Tuned | 82.4% 🔥 |
| RandomizedSearchCV | 81.2% |
| GridSearchCV | 81.6% |
All models performed well, but manual tuning surprisingly gave the best result.
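For readers who want to reproduce the comparison, here's a condensed sketch of the default model plus a GridSearchCV run. The hyperparameter grid, split sizes, and random seed are my own illustrative choices, and the snippet assumes the feature matrix is fully numeric after the encoding step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Assumes df is fully encoded: X = features, y = binary target
X = df.drop(columns=["treatment"])
y = df["treatment"].map({"No": 0, "Yes": 1})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Baseline: default hyperparameters
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("Default accuracy:", rf.score(X_test, y_test))

# Grid search over a small, illustrative grid
param_grid = {"n_estimators": [100, 300],
              "max_depth": [None, 10, 20],
              "min_samples_split": [2, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Tuned accuracy:", grid.score(X_test, y_test))
```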
📌 Evaluation Metrics
Besides accuracy, I measured:
- Precision: How many predicted "Yes" are truly "Yes"
- Recall: How many actual "Yes" were correctly identified
- F1 Score: Balance between precision and recall
- Confusion Matrix: Breakdown of prediction results
- ROC AUC: Model’s overall ability to distinguish between classes
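Using the tuned model from the sketch above, all of these metrics come straight from scikit-learn; this is just one way to wire it up.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_pred = grid.predict(X_test)
y_proba = grid.predict_proba(X_test)[:, 1]  # probability of the "Yes" class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```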
📌 Key Insights
- People with family history or poor workplace support were more likely to seek treatment.
- The `work_interfere` feature (i.e., how much work affects mental health) was highly predictive.
- The Random Forest model was interpretable and gave consistently strong performance.
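One simple way to surface predictors like `work_interfere` and `family_history` is to rank the forest's impurity-based feature importances. This continues from the tuning sketch above and is only one of several ways to inspect the model.

```python
import pandas as pd

# Rank features by the tuned forest's impurity-based importances
best_rf = grid.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```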
📌 Tools Used
- `pandas` for data manipulation
- `numpy` for numerical computation
- `matplotlib` for visualization
- `scikit-learn` for modeling
- Jupyter Notebook in a Miniconda environment
📌 Final Thoughts
This project was more than just a machine learning experiment; it was a reminder of how data can support empathy, and of how technical skills can be used to explore meaningful questions.
- ✅ I practiced EDA, preprocessing, encoding, and model tuning.
- ✅ I built a working Machine Learning model that could be useful for HR or wellness platforms.
- ✅ Most importantly, I felt connected to a topic that truly matters.
Mental health is not just personal; it’s societal. Let’s keep talking about it, and maybe… let’s keep coding about it too 😜.
Happy coding!!!