DEV Community

Yoen

Challenges with SMOTE Oversampling for Bankruptcy Prediction

Hello members,

I'm working on a supervised machine learning project that aims to predict company bankruptcy based on a dataset I have. I'd like to share the context and the challenges I am facing, hoping for your guidance.

**Dataset & Initial Approach:**

1/ The dataset is heavily imbalanced: 96% of the companies are not bankrupt, whereas only 4% are.

2/ In my initial approach, I tackled this imbalance through undersampling: I randomly removed non-bankrupt companies until the two classes were in a 50-50 proportion.

3/ Using this balanced dataset, I trained several models: logistic regression, random forest, and decision trees. The results were encouraging, with good predictive capacity.
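For illustration, the undersampling in step 2 can be sketched as follows. This is a minimal pure-Python sketch, not the project's actual code: the function name `undersample`, the label encoding (1 = bankrupt, 0 = not bankrupt), and the toy data are all assumptions made for the example.

```python
import random

def undersample(rows, labels, minority_label=1, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Keep as many majority rows as there are minority rows (50-50 split).
    kept_majority = rng.sample(majority, k=len(minority))
    keep = sorted(minority + kept_majority)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data mirroring the 96/4 split: 96 non-bankrupt (0), 4 bankrupt (1).
labels = [0] * 96 + [1] * 4
rows = [[float(i)] for i in range(100)]  # one made-up feature per company

X_bal, y_bal = undersample(rows, labels)
print(len(y_bal), sum(y_bal))
```

Note that this discards most of the majority class, so the balanced set is only twice the size of the minority class.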

**New Requirement & Issue:**

Now, for a class project, I need to use oversampling instead of undersampling. The technique we've been guided to use is SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority-class samples by interpolating between a minority instance and one of its k nearest neighbors within the minority class.
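To make the mechanism concrete, here is a rough sketch of the interpolation idea behind SMOTE. It is a simplified pure-Python illustration under stated assumptions, not the imbalanced-learn implementation: the function name `smote_sketch`, the parameter names, and the toy points are hypothetical, and it assumes purely numeric features and at least two minority samples.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of base (excluding base itself),
        # ranked by Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Four made-up "bankrupt" companies described by two numeric features.
bankrupt = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
new_points = smote_sketch(bankrupt, n_new=10, k=3)
print(len(new_points))
```

Every synthetic point lies on a line segment between two existing minority points, so the result depends on distances between features; this is why feature scaling and categorical columns matter when SMOTE is applied.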

However, after applying SMOTE and then running my models (logistic regression, random forest, decision trees), the predictive performance dropped drastically. The models, which performed well with undersampling, now seem to have little to no predictive capacity after SMOTE.

I've reviewed my Python code multiple times and am confident about the implementation. This leads me to question if SMOTE or oversampling, in general, might not be the right approach for this specific dataset. Considering it was a recommended method from our professor, I'm a bit puzzled.

**Seeking Guidance:**

Has anyone else faced challenges with SMOTE specifically in similar contexts?

Is there a fundamental reason why SMOTE might not be suitable for certain datasets or problems?

Are there any tips or adjustments you would recommend when using SMOTE?

Last question: Is SGD (Stochastic Gradient Descent) a type of logistic regression?

Any insights or suggestions would be greatly appreciated. Thanks in advance for your time and expertise.
