The Overlooked Problem in Classification
Everyone talks about class imbalance. But there's a more insidious problem lurking in your data: class overlap.
Santos et al. (2023) argue that class overlap is a more significant impediment to classifier performance than imbalance alone. Yet most practitioners lack the tools to diagnose or address it.
During my research on overlap-handling techniques, I investigated how different methods affect global structural complexity. The findings led me to build FairSample—a package specifically designed for the class overlap problem.
What Is Class Overlap?
Class overlap occurs when instances from different classes share similar feature values. Your classifier sees:
Instance A: [feature1=5.2, feature2=3.1, feature3=1.4] → Class 0
Instance B: [feature1=5.1, feature2=3.2, feature3=1.3] → Class 1
These look almost identical, but belong to different classes. This confuses classifiers and degrades performance—even when your classes are perfectly balanced.
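To put a number on "almost identical", the short sketch below measures how far apart the two instances above are. Euclidean distance is an assumption here; it is only one common way to express similar feature values.

```python
import math

# The two instances from the example above, paired with their class labels.
instance_a = ([5.2, 3.1, 1.4], 0)
instance_b = ([5.1, 3.2, 1.3], 1)

# Euclidean distance between the two feature vectors.
distance = math.dist(instance_a[0], instance_b[0])

print(f"Distance between A and B: {distance:.3f}")
print(f"Same class? {instance_a[1] == instance_b[1]}")
```

Each feature differs by only 0.1, so the instances sit very close together while carrying different labels.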
Why Overlap Is Harder Than Imbalance
Imbalance: You have 100 instances of Class A and 10 of Class B
- Solution: Resample to balance the ratio
- Outcome: The classifier sees both classes equally often
Overlap: Instances from different classes occupy the same region of feature space
- Solution: Not straightforward, because removing or altering instances changes the structure of the data
- Outcome: Depends on how you handle it
This is why overlap requires more sophisticated analysis.
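For contrast, the imbalance fix really is mechanical. Here is a minimal sketch of random oversampling with NumPy, using made-up data in the 100-to-10 ratio from above. The arrays, the seed, and the variable names are illustrative assumptions, not part of FairSample.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 100 instances of class 0 and 10 instances of class 1.
X = rng.normal(size=(110, 2))
y = np.array([0] * 100 + [1] * 10)

# Random oversampling: draw minority rows with replacement until the counts match.
minority_idx = np.flatnonzero(y == 1)
n_extra = int(np.sum(y == 0)) - minority_idx.size
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])

print("Class counts after balancing:", np.bincount(y_balanced))
```

Duplicating rows changes the class counts, but it does nothing to separate classes that occupy the same region, which is where overlap-specific methods come in.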
FairSample's Approach
1. Quantify Overlap First
Before fixing anything, measure the problem:
from fairsample.complexity import ComplexityMeasures
cm = ComplexityMeasures(X, y)
# Get comprehensive overlap analysis
all_measures = cm.get_all_complexity_measures(measures='all')
# Focus on instance overlap
instance_overlap = cm.get_all_complexity_measures(
    measures=['N3', 'N4', 'kDN', 'CM']
)
# Structural complexity
structural = cm.get_all_complexity_measures(
    measures=['T1', 'LSC', 'DBC']
)
Why this matters: Different overlap patterns require different solutions.
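For intuition about what one of these measures captures, here is a minimal NumPy sketch of N3, which Lorena et al. (2019) define as the leave-one-out error rate of a 1-nearest-neighbour classifier. It is an illustration written for this post, not FairSample's implementation. The function name `n3_error_rate` and the toy data are assumptions, and it uses plain Euclidean distance without any feature normalisation.

```python
import numpy as np

def n3_error_rate(X, y):
    """Leave-one-out error rate of a 1-NN classifier (squared Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared distances between all instances.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diffs, diffs)
    # An instance may not be its own nearest neighbour.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    # Fraction of instances whose nearest neighbour has a different label.
    return float(np.mean(y[nearest] != y))

# Toy data: two well-separated clusters, then two heavily overlapping ones.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X_separated = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
X_overlapping = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(0.5, 1, (50, 2))])

print(f"N3, separated:   {n3_error_rate(X_separated, y):.2f}")
print(f"N3, overlapping: {n3_error_rate(X_overlapping, y):.2f}")
```

A value near zero means nearest neighbours almost always share a label; values approaching 0.5 on balanced two-class data suggest the classes are heavily interleaved.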
2. 14+ Overlap-Handling Techniques
FairSample implements algorithms specifically designed for overlap:
- EHSO - Evolutionary Hybrid Sampling in Overlap
- RFCL - Repetitive Forward Class Learning
- NBUS - Neighbourhood-Based Undersampling
- URNS - Undersampling by Removing Noisy Samples
- SVDDWSMOTE - Support Vector Data Description-based oversampling
- OSM - Overlap-based Sampling Method
- And more...
All from peer-reviewed research (2014-2024).
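To give a feel for what overlap-driven resampling does, here is a deliberately simplified sketch of the general neighbourhood idea: drop majority-class instances that have a minority-class instance among their k nearest neighbours. This is not the published NBUS algorithm and not FairSample's code. The function name `drop_majority_near_minority`, the choice of k, and the toy data are all assumptions made for illustration, and the sketch assumes binary labels.

```python
import numpy as np

def drop_majority_near_minority(X, y, majority=0, k=5):
    """Remove majority instances that have a minority instance among their k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbourhood
    neighbours = np.argsort(dist, axis=1)[:, :k]
    has_minority_neighbour = (y[neighbours] != majority).any(axis=1)
    # Keep every minority instance; keep majority instances only outside the overlap.
    keep = ~((y == majority) & has_minority_neighbour)
    return X[keep], y[keep]

# Toy data: 200 majority and 40 minority instances with partially overlapping clusters.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (40, 2))])
y = np.array([0] * 200 + [1] * 40)

X_res, y_res = drop_majority_near_minority(X, y)
print("Before:", np.bincount(y), "After:", np.bincount(y_res))
```

Unlike ratio-based undersampling, the number of removed instances here depends on where the classes meet, not on a target class ratio.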
3. Multi-Dimensional Evaluation
Evaluate how techniques affect your overlap:
from fairsample.utils import compare_techniques
# Compare multiple overlap-handling techniques
results = compare_techniques(
    X, y,
    techniques=['RFCL', 'EHSO', 'NBUS'],
    complexity_measures='basic'
)
# See impact on overlap metrics
print(results[['technique', 'N3', 'T1', 'training_time']])
4. Before/After Validation
Verify that overlap actually reduced:
from fairsample import EHSO
from fairsample.complexity import compare_pre_post_overlap
# Apply overlap-handling technique
sampler = EHSO(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X, y)
# Measure structural changes
comparison = compare_pre_post_overlap(X, y, X_resampled, y_resampled)
print("Overlap Reduction:")
print(comparison['improvements'])
Research Insights Applied
My research investigated how overlap-handling techniques affect data structure. Key insights:
Insight 1: Reducing overlap doesn't always improve classification
- Some techniques reduce overlap but fragment class structure
- Always validate with classification metrics
Insight 2: Different techniques, different structural effects
- Some improve instance overlap but worsen structural complexity
- Others balance both
- Trade-offs vary by dataset
Insight 3: Context matters
- No universal solution exists
- Measure your specific overlap profile first
These insights shaped FairSample's design—diagnostic tools before treatment.
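Insight 1 translates into a simple rule: resample only the training split, then score on untouched test data. The sketch below shows that pattern with scikit-learn. The helper `score_with_sampler` and the pass-through `IdentitySampler` are hypothetical stand-ins so the example runs without FairSample; any object exposing `fit_resample(X, y)`, as the samplers in this post do, could be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

class IdentitySampler:
    """Hypothetical placeholder sampler: returns the data unchanged."""
    def fit_resample(self, X, y):
        return X, y

def score_with_sampler(sampler, X_train, y_train, X_test, y_test):
    """Resample the training split only, then score on the untouched test split."""
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_res, y_res)
    return f1_score(y_test, clf.predict(X_test), average="macro")

# Synthetic data with deliberately low class separation to mimic overlap.
X, y = make_classification(n_samples=600, n_features=6, class_sep=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

baseline = score_with_sampler(IdentitySampler(), X_train, y_train, X_test, y_test)
print(f"Macro F1 without resampling: {baseline:.3f}")
```

Running the same helper with an overlap-handling sampler gives a score that can be compared directly against this baseline, since both are measured on the same untouched test split.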
Practical Example: Medical Diagnosis
from fairsample import RFCL
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Dataset with overlapping symptoms: different diseases, similar presentations.
# X_train, y_train, X_test and y_test are assumed to come from an earlier train/test split.
# Step 1: Quantify overlap
from fairsample.complexity import ComplexityMeasures
cm = ComplexityMeasures(X_train, y_train)
print(f"Instance Overlap (N3): {cm.analyze_overlap()['N3']:.4f}")
# Step 2: Handle overlap
sampler = RFCL(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
# Step 3: Train classifier
clf = RandomForestClassifier()
clf.fit(X_resampled, y_resampled)
# Step 4: Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
40+ Complexity Measures
FairSample provides comprehensive overlap quantification:
Feature Overlap:
- F1, F1v, F2, F3, F4 - How well individual features discriminate between classes
Instance Overlap:
- N3, N4, kDN, CM, R-value - How much instances overlap in feature space
Structural Complexity:
- T1, LSC, DBC - How complex the decision boundary needs to be
Multiresolution:
- Purity, MRCA, C1, C2 - Multi-scale overlap analysis
Each reveals different aspects of your overlap problem.
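As an illustration of the feature-overlap family, the sketch below computes the classic two-class Fisher's discriminant ratio per feature, the quantity that F1 is built on. It is a minimal NumPy version written for this post; the function name `fisher_ratio_per_feature` and the toy data are assumptions, and how FairSample scales or aggregates its own F1 value is not shown here.

```python
import numpy as np

def fisher_ratio_per_feature(X, y):
    """Two-class Fisher's discriminant ratio for each feature (labels 0 and 1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    # Squared gap between class means divided by the summed class variances.
    return (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))

rng = np.random.default_rng(1)
y = np.array([0] * 200 + [1] * 200)
# Feature 0 separates the classes; feature 1 overlaps almost completely.
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]),
    np.concatenate([rng.normal(0, 1, 200), rng.normal(0.2, 1, 200)]),
])

ratios = fisher_ratio_per_feature(X, y)
print("Fisher ratio per feature:", np.round(ratios, 2))
```

A large ratio means a feature separates the classes well on its own; a ratio near zero means the class distributions overlap along that feature.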
Installation & Quick Start
pip install fairsample
Complete workflow:
from fairsample import EHSO
from fairsample.complexity import ComplexityMeasures
import pandas as pd
# Load data with class overlap
df = pd.read_csv('overlapping_data.csv')
X = df.drop('target', axis=1)
y = df['target']
# Diagnose overlap
cm = ComplexityMeasures(X, y)
print("Overlap Analysis:")
print(f" Instance Overlap (N3): {cm.analyze_overlap()['N3']:.4f}")
# Handle overlap
sampler = EHSO(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X, y)
# Validate reduction
from fairsample.complexity import compare_pre_post_overlap
comparison = compare_pre_post_overlap(X, y, X_resampled, y_resampled)
print("\nOverlap Reduction:")
print(comparison['improvements'])
# Use in classification
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_resampled, y_resampled)
When Overlap Matters Most
Class overlap is particularly problematic in:
| Domain | Why Overlap Occurs |
|---|---|
| Medical Diagnosis | Overlapping symptoms between diseases |
| Fraud Detection | Fraudsters mimic legitimate behavior |
| Software Defect Prediction | Similar code metrics for faulty/non-faulty modules |
| Network Intrusion | Attacks disguised as normal traffic |
| Image Classification | Visually similar objects in different categories |
In these domains, addressing overlap is crucial for performance.
Overlap vs. Imbalance: A Comparison
# Scenario 1: Only Imbalance (No Overlap)
# Class 0: 1000 instances, features [0-5]
# Class 1: 100 instances, features [10-15]
# Solution: Simple resampling works well ✓
# Scenario 2: Only Overlap (Balanced)
# Class 0: 500 instances, features [0-10]
# Class 1: 500 instances, features [5-15]
# Solution: Need overlap-handling techniques ⚠️
# Scenario 3: Both Imbalance + Overlap
# Class 0: 1000 instances, features [0-10]
# Class 1: 100 instances, features [5-15]
# Solution: FairSample's specialized techniques ✓✓
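The three scenarios can be simulated directly. The sketch below draws one uniform feature per class using the ranges from the comments above and scores a fixed midpoint threshold. The threshold rule, the function name, and the use of balanced accuracy are assumptions chosen to keep the example small.

```python
import random

def balanced_accuracy(n0, range0, n1, range1, seed=0):
    """Score a midpoint-threshold rule on two uniformly distributed classes."""
    rng = random.Random(seed)
    class0 = [rng.uniform(*range0) for _ in range(n0)]
    class1 = [rng.uniform(*range1) for _ in range(n1)]
    # Threshold halfway between the centres of the two class ranges.
    threshold = (sum(range0) / 2 + sum(range1) / 2) / 2
    recall0 = sum(x < threshold for x in class0) / n0
    recall1 = sum(x >= threshold for x in class1) / n1
    return (recall0 + recall1) / 2

scenarios = {
    "imbalance only": (1000, (0, 5), 100, (10, 15)),
    "overlap only": (500, (0, 10), 500, (5, 15)),
    "imbalance + overlap": (1000, (0, 10), 100, (5, 15)),
}
for name, args in scenarios.items():
    print(f"{name}: balanced accuracy = {balanced_accuracy(*args):.2f}")
```

With the overlapping ranges, about a quarter of each class lies on the wrong side of the midpoint threshold whatever the class ratio, while the imbalance-only scenario stays perfectly separable.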
The Research Foundation
FairSample implements techniques from:
- Vuttipittayamongkol & Elyan (2020) - EHSO, NBUS - Information Sciences
- Das et al. (2014) - RFCL - IEEE TKDE
- Santos et al. (2023) - Overlap analysis framework - Artificial Intelligence Review
- Lorena et al. (2019) - Complexity measures - ACM Computing Surveys
Full citations: CITATIONS.md
Resources
- 📖 Documentation: https://mohduwaish59.github.io/fairsample/
- 💻 GitHub: https://github.com/mohdUwaish59/fairsample
- 📝 Issues: Report bugs or request features
- 🔧 PRs: Contribute techniques or improvements
The Bottom Line
Class overlap is often more harmful than class imbalance. Yet most tools focus solely on balancing class ratios.
FairSample provides:
- Diagnostic tools - Quantify overlap with 40+ measures
- Treatment options - 14+ research-backed techniques
- Validation methods - Verify overlap reduction
All specifically designed for the overlap problem.
Have overlap problems in your data? Try FairSample and share your results!
What domains have you encountered severe class overlap? Drop a comment below 👇