Evaluation of the performance of the Smote, Smote Enn, and Borderline Smote resampling methods based on the number of outlier data with Z Score

arisgunadi gunadi; Dewi Oktofa Rahmawati; Nurfa Risha

doi:10.24843/LKJITI.2025.v16.i02.p05

Authors

arisgunadi gunadi Undiksha
Dewi Oktofa Rahmawati Undiksha
Nurfa Risha Undiksha

Abstract

Handling class imbalance in datasets is a significant challenge in classification. The problem is most disruptive when the minority class plays a crucial role in decision-making. Oversampling is a widely used solution. This study compares the performance of three popular oversampling methods, namely SMOTE (Synthetic Minority Oversampling Technique), SMOTE-ENN (SMOTE with Edited Nearest Neighbor), and Borderline-SMOTE, based on the number of outliers they produce. Outliers are identified using a Z-score-based statistical approach. The three oversampling methods were applied to several datasets. Evaluation consists of counting the number of outliers after resampling and assessing the impact on classification performance using accuracy, precision, recall, and F1-score.
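The Z-score outlier count described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not state the cutoff or whether outliers are counted per row or per cell, so the threshold of 3 and the row-level percentage are assumptions, and the function names are hypothetical.

```python
import numpy as np

def zscore_outlier_mask(X, threshold=3.0):
    """Mark each cell whose absolute Z-score exceeds the threshold.

    threshold=3.0 is an assumed cutoff; the abstract does not specify one.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # A constant feature has std == 0; dividing by 1 gives z = 0 (no outliers).
    safe_std = np.where(std == 0, 1.0, std)
    z = (X - mean) / safe_std
    return np.abs(z) > threshold

def outlier_row_percentage(X, threshold=3.0):
    """Percentage of rows with at least one outlying feature value (assumed definition)."""
    mask = zscore_outlier_mask(X, threshold)
    return 100.0 * mask.any(axis=1).mean()

# Toy data: 20 identical rows plus one extreme row.
X_demo = np.array([[0.0]] * 20 + [[100.0]])
demo_pct = outlier_row_percentage(X_demo)
print(f"{demo_pct:.2f}% of rows contain an outlier")
```

The same function can be applied to a dataset before and after resampling to compare outlier percentages under identical assumptions.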

The results show no significant difference in the number of outliers produced by SMOTE, SMOTE-ENN, and Borderline-SMOTE. For the diabetes.csv dataset, the percentages of outliers in the original data and after resampling with SMOTE, SMOTE-ENN, and Borderline-SMOTE were 7.4%, 6.8%, 6.7%, and 63%, respectively. For the predict_honor.csv dataset, the corresponding figures are 7.1%, 7.3%, 7.6%, and 7%. For the winequality.csv dataset, they are 8%, 7.8%, 6.8%, and 5.8%. For the smoking.csv dataset, they are 7.1%, 7.3%, 7.6%, and 7.0%. However, when each feature in each dataset is examined individually, the three algorithms show more varied behavior in the number of outliers produced. The second finding concerns the performance of the decision tree classification model: feature correlation has a greater influence than perfect class balance in the dataset.
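The per-feature comparison mentioned above can be illustrated with a small sketch. It uses a simplified SMOTE-style interpolation written in NumPy, not the implementation used in the study, and the helper names, the neighbor count k=5, the Z-score threshold of 3, and the synthetic toy data are all assumptions made for illustration.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors (simplified SMOTE idea).
    Requires at least two minority samples."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances; a point is never its own neighbor.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

def outliers_per_feature(X, threshold=3.0):
    """Count, per feature, the values whose absolute Z-score exceeds threshold."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    z = (X - X.mean(axis=0)) / np.where(std == 0, 1.0, std)
    return (np.abs(z) > threshold).sum(axis=0)

# Toy imbalanced data: 90 majority rows, 10 minority rows, 2 features.
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(90, 2))
X_minor = rng.normal(2.0, 1.0, size=(10, 2))
X_synth = smote_like_oversample(X_minor, n_new=80)

X_before = np.vstack([X_major, X_minor])
X_after = np.vstack([X_before, X_synth])
print("outliers per feature before:", outliers_per_feature(X_before))
print("outliers per feature after: ", outliers_per_feature(X_after))
```

Because the synthetic points are interpolations between existing minority samples, they stay inside the minority class's range on every feature; the outlier counts can still change because the added rows shift each feature's mean and standard deviation.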
