Imbalanced datasets are a common challenge in machine learning, where one class significantly outnumbers the other(s). This can lead to biased models that perform poorly on minority classes. In this blog post, we'll explore various techniques to handle imbalanced data using Python and Scikit-learn.
Let's start with a simple example to illustrate the issue:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

# Create an imbalanced dataset (90% majority class, 10% minority class)
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model
print(classification_report(y_test, model.predict(X_test)))
You'll notice that the model performs poorly on the minority class. Let's explore some techniques to address this issue.
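Before reaching for any of them, it helps to confirm the skew directly. A minimal check of the class counts in the training split (reusing the X_train and y_train created above) might look like this:

from collections import Counter

# Count how many training samples fall into each class
print("Training class distribution:", Counter(y_train))

With the 90/10 weights used above, you should see roughly nine majority samples for every minority sample.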
Synthetic Minority Over-sampling Technique (SMOTE) creates synthetic examples of the minority class by interpolating between existing minority samples and their nearest neighbors:
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training data only
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

model = LogisticRegression()
model.fit(X_train_resampled, y_train_resampled)
print(classification_report(y_test, model.predict(X_test)))
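As a quick sanity check, you can count the resampled labels; after SMOTE the two classes should appear in roughly equal numbers:

from collections import Counter

# SMOTE synthesizes minority samples until the classes are balanced
print("After SMOTE:", Counter(y_train_resampled))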
Random undersampling removes samples from the majority class until the classes are balanced. It is fast, but it discards data the model could otherwise learn from:
from imblearn.under_sampling import RandomUnderSampler

# Drop majority-class samples from the training data until the classes match
rus = RandomUnderSampler(random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

model = LogisticRegression()
model.fit(X_train_resampled, y_train_resampled)
print(classification_report(y_test, model.predict(X_test)))
Many Scikit-learn classifiers accept a class_weight parameter, which scales each class's contribution to the loss so that errors on the minority class are penalized more heavily:
# 'balanced' sets weights inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
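If you want to see what 'balanced' actually translates to, or to pass your own weights, scikit-learn's compute_class_weight utility is handy. The sketch below reuses the X_train/y_train from above; the explicit weight dictionary at the end is purely illustrative, not a recommended setting:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Show the per-class weights that class_weight='balanced' would use
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
print(dict(zip(classes, weights)))

# Explicit weights are also accepted, e.g. weighting minority errors 5x (illustrative only)
model = LogisticRegression(class_weight={0: 1, 1: 5})
model.fit(X_train, y_train)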
Instead of resampling, you can keep the model as-is and adjust the decision threshold (0.5 by default) to favor the minority class:
import numpy as np
from sklearn.metrics import roc_curve

model = LogisticRegression()
model.fit(X_train, y_train)

# Find the threshold that maximizes TPR - FPR (Youden's J statistic)
y_scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

# Make predictions using the optimal threshold
y_pred = (y_scores >= optimal_threshold).astype(int)
print(classification_report(y_test, y_pred))
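The ROC-based choice above maximizes Youden's J statistic (TPR minus FPR). If precision on the minority class matters more, one common alternative, sketched here rather than taken from the example above, is to pick the threshold that maximizes F1 along the precision-recall curve. In practice you would tune the threshold on a validation split rather than the test set:

import numpy as np
from sklearn.metrics import classification_report, precision_recall_curve

# precision/recall have one more entry than thresholds, so drop the last point
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_scores)
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = pr_thresholds[np.argmax(f1_scores)]

y_pred_f1 = (y_scores >= best_threshold).astype(int)
print(classification_report(y_test, y_pred_f1))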
The Balanced Random Forest classifier combines random undersampling with a random forest: each tree is trained on a bootstrap sample in which the majority class has been undersampled to match the minority class:
from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree sees a balanced, undersampled bootstrap sample
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
print(classification_report(y_test, brf.predict(X_test)))
The Easy Ensemble classifier trains AdaBoost learners on balanced bootstrap samples obtained by randomly undersampling the majority class:
from imblearn.ensemble import EasyEnsembleClassifier

# An ensemble of AdaBoost learners, each fit on a balanced subsample
eec = EasyEnsembleClassifier(n_estimators=10, random_state=42)
eec.fit(X_train, y_train)
print(classification_report(y_test, eec.predict(X_test)))
When dealing with imbalanced datasets, accuracy can be misleading: on our 90/10 split, a model that always predicts the majority class would still be about 90% accurate. Consider metrics that focus on the minority class instead:
from sklearn.metrics import roc_auc_score, average_precision_score

# AUC-ROC
print("AUC-ROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# AUC-PR
print("AUC-PR:", average_precision_score(y_test, model.predict_proba(X_test)[:, 1]))
By applying these techniques and carefully evaluating your models, you can significantly improve performance on imbalanced datasets. Remember to experiment with different approaches and combinations to find the best solution for your specific problem; one such combination is sketched below.
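One combination worth trying, sketched here assuming imbalanced-learn is installed, is to wrap SMOTE and the classifier in an imbalanced-learn Pipeline and score it with stratified cross-validation. The pipeline applies resampling only to the training folds, so synthetic samples never leak into the folds used for evaluation:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Resampling inside the pipeline is applied only when fitting on the training folds
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(class_weight='balanced')),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='average_precision')
print("Mean AUC-PR across folds:", scores.mean())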