How to handle an imbalanced dataset with XGBoost in Python? What strategies and tuning options help?
I'm working on a binary classification problem using XGBoost (version 1.5.0) in Python and have hit a roadblock with an imbalanced dataset. My dataset contains around 10,000 instances, but only about 800 belong to the positive class. After training, accuracy was around 95%, yet precision and recall for the positive class were very low, which isn't acceptable for my use case.

I've tried several strategies to mitigate the imbalance:

- Oversampling the minority class with the `imblearn` library, but this caused overfitting.
- Setting XGBoost's `scale_pos_weight` parameter to the ratio of negative to positive instances.
- Applying `SMOTE`, which gave better results but still not satisfactory ones.

Here's a simplified version of my current code:

```python
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = load_data()  # Assume this loads the features and target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample only the training split so the test set stays untouched
smote = SMOTE(sampling_strategy='minority')
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

model = xgb.XGBClassifier(scale_pos_weight=12.5)
model.fit(X_resampled, y_resampled)

y_pred = model.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Precision: {precision_score(y_test, y_pred)}')
print(f'Recall: {recall_score(y_test, y_pred)}')
```

Despite these attempts, precision hovers around 0.3 and recall is low as well. I also got a warning about too many trees being created when using the `early_stopping_rounds` parameter, which I assumed was overfitting after oversampling. Could anyone suggest effective strategies or best practices for handling imbalanced datasets with XGBoost?
Specifically, are there techniques to improve recall without sacrificing too much precision, or better ways to tune the hyperparameters? Any insights would be greatly appreciated! I'm on Ubuntu 20.04 using a recent Python 3 release. Any pointers in the right direction?