Mohd Uwaish

Posted on Feb 10

FairSample: Because Class Overlap Is Harder Than Class Imbalance

#python #machinelearning #datascience #opensource

The Overlooked Problem in Classification

Everyone talks about class imbalance. But there's a more insidious problem lurking in your data: class overlap.

Santos et al. argue that class overlap is a more significant impediment to classifier performance than imbalance alone. Yet most practitioners don't have tools to diagnose or address it.

During my research on overlap-handling techniques, I investigated how different methods affect global structural complexity. The findings led me to build FairSample—a package specifically designed for the class overlap problem.

What Is Class Overlap?

Class overlap occurs when instances from different classes share similar feature values. Your classifier sees:

Instance A: [feature1=5.2, feature2=3.1, feature3=1.4] → Class 0
Instance B: [feature1=5.1, feature2=3.2, feature3=1.3] → Class 1

These look almost identical, but belong to different classes. This confuses classifiers and degrades performance—even when your classes are perfectly balanced.
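To make "almost identical" concrete, here is a minimal sketch that computes the Euclidean distance between the two instances above. The variable names are illustrative, and Euclidean distance is only one possible way to measure similarity.

```python
import numpy as np

# The two instances from the example above; only their labels differ much.
instance_a = np.array([5.2, 3.1, 1.4])  # Class 0
instance_b = np.array([5.1, 3.2, 1.3])  # Class 1

# Euclidean distance between them; each feature differs by only 0.1.
distance = np.linalg.norm(instance_a - instance_b)
print(f"Distance between A and B: {distance:.3f}")
```

A distance of roughly 0.17 on features whose values range from about 1 to 5 leaves a classifier very little room to separate the two instances.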

Why Overlap Is Harder Than Imbalance

Imbalance: You have 100 instances of Class A, 10 of Class B

Solution: Resample to balance the ratio
Outcome: Classifier sees both classes equally

Overlap: Classes occupy the same region of the feature space

Solution: Not straightforward, because adding or removing instances in the overlap region changes the structure of the data
Outcome: Depends on how you handle it

This is why overlap requires more sophisticated analysis.

FairSample's Approach

1. Quantify Overlap First

Before fixing anything, measure the problem:

from fairsample.complexity import ComplexityMeasures

cm = ComplexityMeasures(X, y)

# Get comprehensive overlap analysis
all_measures = cm.get_all_complexity_measures(measures='all')

# Focus on instance overlap
instance_overlap = cm.get_all_complexity_measures(
    measures=['N3', 'N4', 'kDN', 'CM']
)

# Structural complexity
structural = cm.get_all_complexity_measures(
    measures=['T1', 'LSC', 'DBC']
)

Why this matters: Different overlap patterns require different solutions.
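For intuition about what a measure such as N3 captures, here is a minimal standalone sketch that does not use FairSample. In Lorena et al. (2019), N3 is the leave-one-out error rate of a 1-nearest-neighbour classifier. The function name `n3_loo_1nn_error` is hypothetical, and plain Euclidean distance is an assumption made here for simplicity, so values may differ from what the package reports.

```python
import numpy as np

def n3_loo_1nn_error(X, y):
    """Leave-one-out error rate of a 1-NN classifier (illustrative N3)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all instances.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # An instance must not count as its own neighbour.
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)
    # Fraction of instances whose nearest neighbour has a different label.
    return float(np.mean(y[nearest] != y))

# Toy data: two well-separated pairs and one overlapping pair.
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0],   # class 0, far from class 1
    [5.0, 5.0], [5.1, 5.0],   # class 1, far from class 0
    [2.5, 2.5], [2.6, 2.5],   # one instance of each class, side by side
])
y_demo = np.array([0, 0, 1, 1, 0, 1])

n3_value = n3_loo_1nn_error(X_demo, y_demo)
print(f"N3 (illustrative): {n3_value:.3f}")  # 2 of 6 instances have a wrong-class neighbour
```

The full distance matrix keeps the sketch short but needs memory quadratic in the number of instances, so it is only suitable for small datasets.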

2. 14+ Overlap-Handling Techniques

FairSample implements algorithms specifically designed for overlap:

EHSO - Evolutionary Hybrid Sampling in Overlap
RFCL - Repetitive Forward Class Learning
NBUS - Neighbourhood-Based Undersampling
URNS - Undersampling by Removing Noisy Samples
SVDDWSMOTE - Support Vector Data Description-based oversampling
OSM - Overlap-based Sampling Method
And more...

All are drawn from peer-reviewed research (2014-2024).

3. Multi-Dimensional Evaluation

Evaluate how techniques affect your overlap:

from fairsample.utils import compare_techniques

# Compare multiple overlap-handling techniques
results = compare_techniques(
    X, y,
    techniques=['RFCL', 'EHSO', 'NBUS'],
    complexity_measures='basic'
)

# See impact on overlap metrics
print(results[['technique', 'N3', 'T1', 'training_time']])

4. Before/After Validation

Verify that overlap actually reduced:

from fairsample import EHSO
from fairsample.complexity import compare_pre_post_overlap

# Apply overlap-handling technique
sampler = EHSO(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X, y)

# Measure structural changes
comparison = compare_pre_post_overlap(X, y, X_resampled, y_resampled)
print("Overlap Reduction:")
print(comparison['improvements'])

Research Insights Applied

My research investigated how overlap-handling techniques affect data structure. Key insights:

Insight 1: Reducing overlap doesn't always improve classification

Some techniques reduce overlap but fragment class structure
Always validate with classification metrics

Insight 2: Different techniques, different structural effects

Some improve instance overlap but worsen structural complexity
Others balance both
Trade-offs vary by dataset

Insight 3: Context matters

No universal solution exists
Measure your specific overlap profile first

These insights shaped FairSample's design—diagnostic tools before treatment.
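The advice to always validate with classification metrics can be sketched with scikit-learn alone. The helper below is hypothetical (`compare_before_after` is not part of FairSample): it trains the same classifier on the original and on the resampled training data and scores both on one untouched test set. The synthetic dataset and the pass-through resampler are stand-ins chosen so the sketch runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def compare_before_after(resample_fn, X_train, y_train, X_test, y_test, seed=42):
    """Train on original vs. resampled data; score both on the same test set."""
    X_res, y_res = resample_fn(X_train, y_train)
    variants = {"original": (X_train, y_train), "resampled": (X_res, y_res)}
    scores = {}
    for name, (X_fit, y_fit) in variants.items():
        clf = RandomForestClassifier(random_state=seed)
        clf.fit(X_fit, y_fit)
        scores[name] = f1_score(y_test, clf.predict(X_test), average="macro")
    return scores

# Synthetic data with low class separation, standing in for an overlapping dataset.
X, y = make_classification(
    n_samples=600, n_features=6, class_sep=0.5, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

def identity_resampler(X_in, y_in):
    """Placeholder that returns the data unchanged."""
    return X_in, y_in

scores = compare_before_after(identity_resampler, X_train, y_train, X_test, y_test)
print(scores)
```

To evaluate a real technique, the placeholder could be swapped for a function that calls a sampler's `fit_resample`, as used in the examples in this post. Only the training data is resampled, so the test set still reflects the original distribution.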

Practical Example: Medical Diagnosis

from fairsample import RFCL
from fairsample.complexity import ComplexityMeasures
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Dataset with overlapping symptoms:
# different diseases, similar presentations.
# X_train, y_train, X_test, y_test are assumed to come from an earlier train/test split.

# Step 1: Quantify overlap
cm = ComplexityMeasures(X_train, y_train)
print(f"Instance Overlap (N3): {cm.analyze_overlap()['N3']:.4f}")

# Step 2: Handle overlap
sampler = RFCL(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

# Step 3: Train classifier
clf = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
clf.fit(X_resampled, y_resampled)

# Step 4: Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

40+ Complexity Measures

FairSample provides comprehensive overlap quantification:

Feature Overlap:

F1, F1v, F2, F3, F4 - How much features discriminate between classes

Instance Overlap:

N3, N4, kDN, CM, R-value - How much instances overlap in feature space

Structural Complexity:

T1, LSC, DBC - How complex the decision boundary needs to be

Multiresolution:

Purity, MRCA, C1, C2 - Multi-scale overlap analysis

Each reveals different aspects of your overlap problem.
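As one illustration of an instance-level measure, kDN (k-Disagreeing Neighbours) is the fraction of an instance's k nearest neighbours that carry a different label. The sketch below is a standalone approximation built on scikit-learn, not FairSample's implementation; the function name `kdn_scores`, the choice of `k=5`, and the Gaussian toy data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kdn_scores(X, y, k=5):
    """Per-instance fraction of the k nearest neighbours with a different label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Calling kneighbors() without arguments queries the training points
    # themselves and excludes each point from its own neighbour list.
    _, idx = nn.kneighbors()
    neighbour_labels = y[idx]
    return (neighbour_labels != y[:, None]).mean(axis=1)

# Two partially overlapping Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X_blobs = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=1.5, scale=1.0, size=(200, 2)),
])
y_blobs = np.array([0] * 200 + [1] * 200)

scores = kdn_scores(X_blobs, y_blobs, k=5)
print(f"Mean kDN: {scores.mean():.3f}")
print(f"Instances with mostly other-class neighbours: {(scores > 0.5).sum()}")
```

Instances with a high score sit inside or near the other class's region, which is where overlap-handling techniques tend to focus.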

Installation & Quick Start

pip install fairsample

Complete workflow:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

from fairsample import EHSO
from fairsample.complexity import ComplexityMeasures, compare_pre_post_overlap

# Load data with class overlap
df = pd.read_csv('overlapping_data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Diagnose overlap
cm = ComplexityMeasures(X, y)
print("Overlap Analysis:")
print(f"  Instance Overlap (N3): {cm.analyze_overlap()['N3']:.4f}")

# Handle overlap
sampler = EHSO(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X, y)

# Validate reduction
comparison = compare_pre_post_overlap(X, y, X_resampled, y_resampled)
print("\nOverlap Reduction:")
print(comparison['improvements'])

# Use in classification
clf = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
clf.fit(X_resampled, y_resampled)

When Overlap Matters Most

Class overlap is particularly problematic in:

Domain	Why Overlap Occurs
Medical Diagnosis	Overlapping symptoms between diseases
Fraud Detection	Fraudsters mimic legitimate behavior
Software Defect Prediction	Similar code metrics for faulty/non-faulty modules
Network Intrusion	Attacks disguised as normal traffic
Image Classification	Visually similar objects in different categories

In these domains, addressing overlap is crucial for performance.

Overlap vs. Imbalance: A Comparison

# Scenario 1: Only Imbalance (No Overlap)
# Class 0: 1000 instances, features [0-5]
# Class 1: 100 instances, features [10-15]
# Solution: Simple resampling works well ✓

# Scenario 2: Only Overlap (Balanced)
# Class 0: 500 instances, features [0-10]
# Class 1: 500 instances, features [5-15]
# Solution: Need overlap-handling techniques ⚠️

# Scenario 3: Both Imbalance + Overlap
# Class 0: 1000 instances, features [0-10]
# Class 1: 100 instances, features [5-15]
# Solution: FairSample's specialized techniques ✓✓
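The three scenarios can be reproduced with synthetic data. The sketch below draws a single uniform feature per class using the counts and ranges listed above, then scores a 5-nearest-neighbour classifier with stratified cross-validation. The helper `make_scenario`, the classifier, and the balanced-accuracy metric are choices made for this illustration rather than anything prescribed by FairSample.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_scenario(n0, range0, n1, range1):
    """One uniform feature per class, drawn from the given (low, high) ranges."""
    x0 = rng.uniform(*range0, size=n0)
    x1 = rng.uniform(*range1, size=n1)
    X = np.concatenate([x0, x1]).reshape(-1, 1)
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
    return X, y

scenarios = {
    "imbalance only": make_scenario(1000, (0, 5), 100, (10, 15)),
    "overlap only": make_scenario(500, (0, 10), 500, (5, 15)),
    "imbalance + overlap": make_scenario(1000, (0, 10), 100, (5, 15)),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, (X, y) in scenarios.items():
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=5), X, y, cv=cv, scoring="balanced_accuracy"
    )
    results[name] = fold_scores.mean()
    print(f"{name}: balanced accuracy = {results[name]:.3f}")
```

In this toy setup the imbalanced but separable classes are classified perfectly without any resampling, while both overlapping scenarios lose accuracy whether or not the classes are balanced.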

The Research Foundation

FairSample implements techniques from:

Vuttipittayamongkol & Elyan (2020) - EHSO, NBUS - Information Sciences
Das et al. (2014) - RFCL - IEEE TKDE
Santos et al. (2023) - Overlap analysis framework - Artificial Intelligence Review
Lorena et al. (2019) - Complexity measures - ACM Computing Surveys

Full citations: CITATIONS.md

Resources

📖 Documentation: https://mohduwaish59.github.io/fairsample/
💻 GitHub: https://github.com/mohdUwaish59/fairsample
📝 Issues: Report bugs or request features
🔧 PRs: Contribute techniques or improvements

The Bottom Line

Class overlap is often more harmful than class imbalance. Yet most tools focus solely on balancing class ratios.

FairSample provides:

Diagnostic tools - Quantify overlap with 40+ measures
Treatment options - 14+ research-backed techniques
Validation methods - Verify overlap reduction

All specifically designed for the overlap problem.

Have overlap problems in your data? Try FairSample and share your results!

What domains have you encountered severe class overlap? Drop a comment below 👇
