In my last article, I broke down the Data Science Workflow for beginners. It’s a great starting point for understanding the key steps in any data science project.
In this follow-up post, I am putting that workflow into action by sharing my first machine learning project:
predicting whether someone is likely to seek treatment for mental health issues based on demographic and workplace data.
This hands-on project covers the full process, from understanding the problem and exploring the dataset to training a model and evaluating its performance.
📌 Key Takeaways
- I built this machine learning model using a public mental health survey dataset from Kaggle.
- The goal: Predict whether someone is likely to seek mental health treatment.
- Best model achieved ~82% accuracy.
- Key predictors: workplace support, family history, and how much mental health interferes with work.
- Full code: 👉 GitHub Notebook
📌 Why This Project Matters
Mental health is deeply personal, but the decision to seek treatment is often influenced by external conditions like work culture, stigma, or lack of access. By modeling treatment-seeking behavior:
- ✅ We identify at-risk individuals early.
- ✅ We encourage empathetic policy-making in the workplace.
- ✅ We normalize seeking help through data storytelling.
📌 Framing the Challenge
- Problem: Given demographic, personal, and workplace mental health history, can we predict whether someone is likely to seek treatment?
- Machine Learning Problem Type: Supervised Learning – Binary Classification
- Success Criteria (Initial Evaluation): A model with ≥ 85% accuracy and ≥ 80% recall for the positive class (that is, people who seek treatment) will be considered successful.
- Data Source: Structured, static dataset from Kaggle
📌 Dataset Overview
- 1 numerical feature (age)
- 26 categorical features (gender, self_employment, benefits, etc.)
- Target column: treatment (Yes/No)
After cleaning, we had 1,300+ usable responses from tech professionals.
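To give a sense of this step, here's a minimal sketch of loading the dataset with pandas and checking the target balance. The file name `survey.csv` is an assumption, so point it at wherever you saved the Kaggle export.

```python
import pandas as pd

# Load the Kaggle survey export (file name is assumed; use your local path)
df = pd.read_csv("survey.csv")

print(df.shape)                         # number of responses x number of columns
print(df["treatment"].value_counts())   # Yes/No balance of the target
```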
📌 Cleaning the Data: A Quick Summary
I made these key cleaning decisions:
- Dropped irrelevant or sparse columns like timestamp and comments
- Normalized gender values from wild responses like "guy (-ish)" or "femail" into "Male", "Female", "Other"
- Handled missing values with strategic imputation (e.g., replacing "self_employed" nulls with "No")
- Filtered out outliers in the age column (we kept ages 18–74)
Want to see exactly how? Check out the notebook here.
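If you just want the flavor of those decisions, here's a hedged sketch (not the exact notebook code), continuing from the loading snippet above. The column names Timestamp, comments, Gender, Age, and self_employed follow the Kaggle survey schema, and the gender mapping shown is only illustrative.

```python
# Drop columns that carry little signal
df = df.drop(columns=["Timestamp", "comments"], errors="ignore")

# Collapse free-text gender entries into three buckets (illustrative mapping)
def normalize_gender(value):
    value = str(value).strip().lower()
    if value in {"male", "m", "man", "guy (-ish)"}:
        return "Male"
    if value in {"female", "f", "woman", "femail"}:
        return "Female"
    return "Other"

df["Gender"] = df["Gender"].apply(normalize_gender)

# Impute missing self_employed values with the most common answer
df["self_employed"] = df["self_employed"].fillna("No")

# Keep plausible working ages only
df = df[df["Age"].between(18, 74)]
```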
📌 Exploratory Data Analysis (EDA)
- Treatment Distribution
Over half the respondents reported seeking treatment. This gives us a relatively balanced target, which is great for modeling!
- Gender vs Treatment
Visualizing treatment-seeking behavior by gender revealed that:
- Women were slightly more likely to seek help.
- The "Other" gender group had smaller numbers but still sought support at similar rates.
- Age Distribution
Most respondents were aged 25–44, typical for tech jobs 😆. We also created age groups like "18–24", "25–34", etc., to identify behavioral patterns.
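The charts behind these observations take only a few lines; here's a rough sketch (continuing from the cleaning snippet above), not the exact plots from the notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Target balance: how many respondents sought treatment?
df["treatment"].value_counts().plot(kind="bar", title="Treatment distribution")
plt.show()

# Treatment counts broken down by the cleaned Gender column
pd.crosstab(df["Gender"], df["treatment"]).plot(kind="bar", title="Treatment by gender")
plt.show()
```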
📌 Feature Engineering Highlights
To make the data model-ready, I:
- Grouped continuous ages into categories
- Ordinal-encoded ordered features (e.g., company size, perceived difficulty of taking leave)
- Binary-encoded yes/no columns
- One-hot encoded select categorical columns (like `benefits`, `anonymity`, `wellness_program`)
These steps helped reduce noise and preserve meaning in the data.
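Here's roughly what that looks like in pandas. The bin edges, the category ordering for `leave`, and the specific columns used below are assumptions based on the survey schema, not the exact notebook code.

```python
import pandas as pd

# Group continuous ages into the categories mentioned above
df["age_group"] = pd.cut(df["Age"], bins=[17, 24, 34, 44, 54, 74],
                         labels=["18-24", "25-34", "35-44", "45-54", "55-74"])

# Ordinal-encode an ordered feature (category order is an assumption)
leave_order = {"Very difficult": 0, "Somewhat difficult": 1, "Don't know": 2,
               "Somewhat easy": 3, "Very easy": 4}
df["leave_encoded"] = df["leave"].map(leave_order)

# Binary-encode a yes/no column
df["family_history"] = df["family_history"].map({"No": 0, "Yes": 1})

# One-hot encode selected nominal columns
df = pd.get_dummies(df, columns=["benefits", "anonymity", "wellness_program"])
```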
📌 Model Building with Random Forest
I tried 4 modeling approaches using RandomForestClassifier:
- Default model
- Manual hyperparameter tuning
- RandomizedSearchCV tuning
- GridSearchCV tuning
| Model | Accuracy |
| --- | --- |
| Default RF | 82.1% |
| Manually Tuned | 82.4% 🔥 |
| RandomizedSearchCV | 81.2% |
| GridSearchCV | 81.6% |
All models performed well, but manual tuning surprisingly gave the best result.
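For readers who want to reproduce the comparison, here's a condensed sketch of the default model plus a GridSearchCV run. The hyperparameter grid, split sizes, and random seed are my own illustrative choices, and the snippet assumes the feature matrix is fully numeric after the encoding step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Assumes df is fully encoded: X = features, y = binary target
X = df.drop(columns=["treatment"])
y = df["treatment"].map({"No": 0, "Yes": 1})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Baseline: default hyperparameters
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("Default accuracy:", rf.score(X_test, y_test))

# Grid search over a small, illustrative grid
param_grid = {"n_estimators": [100, 300],
              "max_depth": [None, 10, 20],
              "min_samples_split": [2, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Tuned accuracy:", grid.score(X_test, y_test))
```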
📌 Evaluation Metrics
Besides accuracy, I measured:
- Precision: How many predicted "Yes" are truly "Yes"
- Recall: How many actual "Yes" were correctly identified
- F1 Score: Balance between precision and recall
- Confusion Matrix: Breakdown of prediction results
- ROC AUC: Model’s overall ability to distinguish between classes
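Using the tuned model from the sketch above, all of these metrics come straight from scikit-learn; this is just one way to wire it up.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_pred = grid.predict(X_test)
y_proba = grid.predict_proba(X_test)[:, 1]  # probability of the "Yes" class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```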
📌 Key Insights
- People with family history or poor workplace support were more likely to seek treatment.
- The `work_interfere` feature (i.e., how much work affects mental health) was highly predictive.
- The Random Forest model was interpretable and gave consistently strong performance.
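One simple way to surface predictors like `work_interfere` and `family_history` is to rank the forest's impurity-based feature importances. This continues from the tuning sketch above and is only one of several ways to inspect the model.

```python
import pandas as pd

# Rank features by the tuned forest's impurity-based importances
best_rf = grid.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```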
📌 Tools Used
- `pandas` for data manipulation
- `numpy` for numerical computation
- `matplotlib` for visualization
- `scikit-learn` for modeling
- Jupyter Notebook in a Miniconda environment
📌 Final Thoughts
This project was more than just a machine learning experiment; it was a reminder of how data can support empathy, and of how technical skills can be used to explore meaningful questions.
- ✅ I practiced EDA, preprocessing, encoding, and model tuning.
- ✅ I built a working Machine Learning model that could be useful for HR or wellness platforms.
- ✅ Most importantly, I felt connected to a topic that truly matters.
Mental health is not just personal; it’s societal. Let’s keep talking about it, and maybe… let’s keep coding about it too 😜.
Happy coding!!!