ISSN (Online): 2321-3418
Mathematics and Statistics
Open Access

Performance of Random Oversampling, Random Undersampling, and SMOTE-NC Methods in Handling Imbalanced Class in Classification Models

DOI: 10.18535/ijsrm/v12i04.m03 · Pages: 494-501 · Vol. 12, No. 04 (2024) · Published: April 29, 2024

Abstract

A common challenge in classification modeling is class imbalance in the data. If the analysis proceeds without addressing the imbalance, the resulting model is likely to perform poorly when predicting new data. Several approaches exist to correct class imbalance, including random oversampling, random undersampling, and the Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC). Each method uses a different technique to balance the class distribution in the dataset. Previous research has not compared classification performance on imbalanced classes handled by these three methods. This study therefore evaluates three classification models (Gradient Boosting, Random Forest, and Extremely Randomized Trees) on imbalanced class data. The results show that random undersampling, used to balance the class distribution, gives the best performance for two of the classification models (Random Forest and Gradient Boosting).
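
To make the two random resampling ideas concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the procedure used in the study: the helper name `random_resample`, the toy data, and the choice to balance every class to the largest (oversampling) or smallest (undersampling) class size are all hypothetical. SMOTE-NC is not sketched here because it synthesizes new minority rows rather than resampling existing ones.

```python
import random
from collections import Counter, defaultdict


def random_resample(rows, labels, mode, seed=0):
    """Balance classes by random over- or undersampling (hypothetical helper).

    mode="over":  keep every row and add random duplicates of smaller
                  classes until each class matches the largest class.
    mode="under": draw rows without replacement so that each class
                  matches the smallest class.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    sizes = [len(idx) for idx in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    chosen = []
    for idx in by_class.values():
        if mode == "over":
            # Originals first, then duplicates drawn with replacement.
            extra = [rng.choice(idx) for _ in range(target - len(idx))]
            chosen.extend(idx + extra)
        else:
            # A random subset of the class, no duplicates.
            chosen.extend(rng.sample(idx, target))

    rng.shuffle(chosen)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy data (assumed): 8 majority rows (class 0), 2 minority rows (class 1),
# each with one continuous and one nominal feature.
rows = [[float(i), "a" if i % 2 else "b"] for i in range(10)]
labels = [0] * 8 + [1] * 2

x_over, y_over = random_resample(rows, labels, "over")
x_under, y_under = random_resample(rows, labels, "under")
print(Counter(y_over))   # 8 rows per class
print(Counter(y_under))  # 2 rows per class
```

Oversampling keeps all the original information but repeats minority rows, while undersampling discards majority rows. In practice, resampling is usually applied to the training split only, so that the evaluation data keep the original class distribution.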

Keywords

Classification, Imbalanced Class, Random Oversampling, Random Undersampling, SMOTENC

Author details
Andika Putri Ratnasari
Universitas Negeri Yogyakarta, Faculty of Mathematics and Natural Sciences, Colombo Road, Yogyakarta,
✉ Corresponding Author