Imbalanced datasets are a common challenge in machine learning, where one class significantly outnumbers the other(s). This can lead to biased models that perform poorly on minority classes. In this blog post, we'll explore various techniques to handle imbalanced data using Python and Scikit-learn.
Let's start with a simple example to illustrate the issue:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

# Create an imbalanced dataset (90% majority class, 10% minority class)
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model
print(classification_report(y_test, model.predict(X_test)))
You'll notice that the model performs poorly on the minority class. Let's explore some techniques to address this issue.
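Before reaching for any of them, it helps to confirm the skew directly. A minimal check of the class counts in the training split (reusing the X_train and y_train created above) might look like this:

from collections import Counter

# Count how many training samples fall into each class
print("Training class distribution:", Counter(y_train))

With the 90/10 weights used above, you should see roughly nine majority samples for every minority sample.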
Synthetic Minority Over-sampling Technique (SMOTE) creates synthetic examples of the minority class by interpolating between existing minority samples and their nearest neighbors:
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training data only
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

model = LogisticRegression()
model.fit(X_train_resampled, y_train_resampled)
print(classification_report(y_test, model.predict(X_test)))
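As a quick sanity check, you can count the resampled labels; after SMOTE the two classes should appear in roughly equal numbers:

from collections import Counter

# SMOTE synthesizes minority samples until the classes are balanced
print("After SMOTE:", Counter(y_train_resampled))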
Random undersampling removes samples from the majority class until the classes are balanced. It is fast, but it discards data the model could otherwise learn from:
from imblearn.under_sampling import RandomUnderSampler

# Drop majority-class samples from the training data until the classes match
rus = RandomUnderSampler(random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

model = LogisticRegression()
model.fit(X_train_resampled, y_train_resampled)
print(classification_report(y_test, model.predict(X_test)))
Many Scikit-learn classifiers accept a class_weight parameter, which scales each class's contribution to the loss so that errors on the minority class are penalized more heavily:
# 'balanced' sets weights inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
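If you want to see what 'balanced' actually translates to, or to pass your own weights, scikit-learn's compute_class_weight utility is handy. The sketch below reuses the X_train/y_train from above; the explicit weight dictionary at the end is purely illustrative, not a recommended setting:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Show the per-class weights that class_weight='balanced' would use
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
print(dict(zip(classes, weights)))

# Explicit weights are also accepted, e.g. weighting minority errors 5x (illustrative only)
model = LogisticRegression(class_weight={0: 1, 1: 5})
model.fit(X_train, y_train)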
Instead of resampling, you can keep the model as-is and adjust the decision threshold (0.5 by default) to favor the minority class:
import numpy as np
from sklearn.metrics import roc_curve

model = LogisticRegression()
model.fit(X_train, y_train)

# Find the threshold that maximizes TPR - FPR (Youden's J statistic)
y_scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

# Make predictions using the optimal threshold
y_pred = (y_scores >= optimal_threshold).astype(int)
print(classification_report(y_test, y_pred))
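The ROC-based choice above maximizes Youden's J statistic (TPR minus FPR). If precision on the minority class matters more, one common alternative, sketched here rather than taken from the example above, is to pick the threshold that maximizes F1 along the precision-recall curve. In practice you would tune the threshold on a validation split rather than the test set:

import numpy as np
from sklearn.metrics import classification_report, precision_recall_curve

# precision/recall have one more entry than thresholds, so drop the last point
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_scores)
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = pr_thresholds[np.argmax(f1_scores)]

y_pred_f1 = (y_scores >= best_threshold).astype(int)
print(classification_report(y_test, y_pred_f1))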
The Balanced Random Forest classifier combines random undersampling with a random forest: each tree is trained on a bootstrap sample in which the majority class has been undersampled to match the minority class:
from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree sees a balanced, undersampled bootstrap sample
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
print(classification_report(y_test, brf.predict(X_test)))
The Easy Ensemble classifier trains AdaBoost learners on balanced bootstrap samples obtained by randomly undersampling the majority class:
from imblearn.ensemble import EasyEnsembleClassifier

# An ensemble of AdaBoost learners, each fit on a balanced subsample
eec = EasyEnsembleClassifier(n_estimators=10, random_state=42)
eec.fit(X_train, y_train)
print(classification_report(y_test, eec.predict(X_test)))
When dealing with imbalanced datasets, accuracy can be misleading: on our 90/10 split, a model that always predicts the majority class would still be about 90% accurate. Consider metrics that focus on the minority class instead:
from sklearn.metrics import roc_auc_score, average_precision_score

# AUC-ROC
print("AUC-ROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# AUC-PR
print("AUC-PR:", average_precision_score(y_test, model.predict_proba(X_test)[:, 1]))
By applying these techniques and carefully evaluating your models, you can significantly improve performance on imbalanced datasets. Remember to experiment with different approaches and combinations to find the best solution for your specific problem; one such combination is sketched below.
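One combination worth trying, sketched here assuming imbalanced-learn is installed, is to wrap SMOTE and the classifier in an imbalanced-learn Pipeline and score it with stratified cross-validation. The pipeline applies resampling only to the training folds, so synthetic samples never leak into the folds used for evaluation:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Resampling inside the pipeline is applied only when fitting on the training folds
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(class_weight='balanced')),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='average_precision')
print("Mean AUC-PR across folds:", scores.mean())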