International Journal of Advanced and Applied Sciences

EISSN: 2313-3724, Print ISSN: 2313-626X

Frequency: 12

 Volume 12, Issue 1 (January 2025), Pages: 112-124

----------------------------------------------

 Original Research Paper

The critical role of evaluation metrics in handling missing data in machine learning

 Author(s): 

 Ibrahim Atoum *

 Affiliation(s):

 Department of Artificial Intelligence, Faculty of Science and Information Technology, Al-Zaytoonah University of Jordan, Amman, Jordan

 * Corresponding Author. 

  Corresponding author's ORCID profile: https://orcid.org/0000-0002-9259-7937

 Digital Object Identifier (DOI)

 https://doi.org/10.21833/ijaas.2025.01.011

 Abstract

The presence of missing data in machine learning (ML) datasets remains a major challenge in building reliable models. This study explores various strategies to handle missing data and provides a framework to evaluate their effectiveness. The research focuses on commonly used techniques such as zero-filling, deletion, and imputation methods, including mean, median, mode, regression, k-nearest neighbors (KNN), and flagging. To assess these methods, a detailed evaluation framework is proposed, considering factors such as data completeness, model performance, stability, bias, variance, robustness to new data, computational efficiency, and domain-specific needs. This comprehensive approach allows for a thorough comparison of methods, helping to identify the most suitable technique for specific datasets and tasks. The findings highlight the importance of considering the unique features of the dataset and the goals of the analysis when choosing a method. While basic techniques like deletion and zero-filling may be effective in some cases, advanced imputation methods often preserve data quality and improve model accuracy. By applying the proposed evaluation criteria, researchers and practitioners can make better decisions on handling missing data, leading to more accurate, reliable, and adaptable ML models.
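As a rough illustration of the simpler strategies named in the abstract (deletion, zero-filling, mean/median/mode imputation, and flagging), the sketch below applies each to a single numeric column. It is not the paper's implementation. The helper names, the use of `None` to encode a missing entry, the sample `ages` column, and the `completeness` ratio are all assumptions made for this example. Regression and KNN imputation are omitted because they need more than one feature.

```python
from statistics import mean, median, mode

MISSING = None  # assumption: missing entries are encoded as None


def observed(column):
    """Return only the values that are present."""
    return [v for v in column if v is not MISSING]


def delete_missing(column):
    """Deletion: drop every missing entry."""
    return observed(column)


def fill_constant(column, value=0):
    """Zero-filling by default; any constant can be passed instead."""
    return [value if v is MISSING else v for v in column]


def impute(column, statistic):
    """Replace missing entries with a statistic of the observed values."""
    return fill_constant(column, statistic(observed(column)))


def flag_missing(column):
    """Flagging: indicator column, 1 where the value was missing."""
    return [1 if v is MISSING else 0 for v in column]


def completeness(column):
    """Share of entries that are observed (one possible completeness measure)."""
    return len(observed(column)) / len(column)


# Hypothetical sample column with two missing entries.
ages = [20, None, 30, 30, None, 60]

for name, result in [
    ("deletion", delete_missing(ages)),
    ("zero-fill", fill_constant(ages)),
    ("mean", impute(ages, mean)),
    ("median", impute(ages, median)),
    ("mode", impute(ages, mode)),
    ("flags", flag_missing(ages)),
]:
    print(f"{name:10s} {result}")
print(f"completeness before handling: {completeness(ages):.2f}")
```

In practice the filled column would then feed a model whose performance, stability, and bias are compared across strategies. That comparison depends on the dataset and task, so it is left out of this sketch.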

 © 2025 The Authors. Published by IASE.

 This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

 Keywords

 Missing data handling, Machine learning models, Imputation techniques, Data completeness, Model performance evaluation

 Article history

 Received 3 September 2024, Received in revised form 25 December 2024, Accepted 5 January 2025

 Acknowledgment

No Acknowledgment.

 Compliance with ethical standards

 Conflict of interest: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

 Citation:

 Atoum I (2025). The critical role of evaluation metrics in handling missing data in machine learning. International Journal of Advanced and Applied Sciences, 12(1): 112-124

 Figures

 No Figure

 Tables

 Table 1 Table 2 Table 3

----------------------------------------------   
