Two-Stage Tuning of Machine Learning Models for Heart Disease Classification on Synthetic Data

Marini; Tri Sugihartono; Chandra Kirana; Benny Wijaya; Hamidah

doi:10.63158/journalisi.v8i3.1599

Authors

Marini, Institut Sains dan Bisnis Atma Luhur, Indonesia
Tri Sugihartono, Institut Sains dan Bisnis Atma Luhur, Indonesia
Chandra Kirana, Institut Sains dan Bisnis Atma Luhur, Indonesia
Benny Wijaya, Institut Sains dan Bisnis Atma Luhur, Indonesia
Hamidah, Institut Sains dan Bisnis Atma Luhur, Indonesia


Keywords:

Heart Disease Risk Classification; Two-Stage Hyperparameter Tuning; Machine Learning; Comparative Analysis; Feature Importance

Abstract

Heart disease remains a leading global cause of mortality, highlighting the need for accurate early risk classification. This study benchmarks Random Forest, XGBoost, and Logistic Regression for heart disease risk classification using a synthetic, perfectly balanced dataset, while addressing performance limitations caused by inadequate hyperparameter configuration. The dataset comprised 70,000 samples with a 50/50 class distribution and 18 clinical and demographic features. Although useful for controlled benchmarking, synthetic balanced data may yield optimistic estimates and may not fully represent real-world clinical variability. Each model was implemented in a scikit-learn Pipeline with median imputation and, where applicable, standard scaling. A two-stage tuning strategy was applied by combining RandomizedSearchCV with GridSearchCV refinement to optimize model configurations systematically. Under these benchmarking conditions, XGBoost achieved the best test performance, with an F1-score of 99.34%, AUC-ROC of 99.97%, and accuracy of 99.34%. Random Forest obtained an F1-score of 99.20% and AUC-ROC of 99.95%, while Logistic Regression achieved an F1-score of 99.12% and AUC-ROC of 99.95%. Age, pain in the arms/jaw/back, and cold sweats/nausea were the most influential predictors. The proposed framework is reproducible, computationally efficient, and suitable for validation on heterogeneous clinical datasets.
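
The two-stage tuning described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the stand-in data from `make_classification`, the single tuned parameter `C`, the search ranges, the multipliers used to build the refinement grid, and the 80/20 split are all assumptions chosen to keep the example small. Only the overall shape (a scikit-learn Pipeline with median imputation and standard scaling, a RandomizedSearchCV pass followed by a GridSearchCV refinement, and F1 and AUC-ROC on a held-out test set) follows the abstract.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 2,000 balanced samples with 18 features.
# This is NOT the study's 70,000-sample dataset.
X, y = make_classification(n_samples=2000, n_features=18, n_informative=8,
                           weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Preprocessing lives inside the pipeline so it is re-fit on each CV fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stage 1: coarse random search over a wide, log-scaled range (assumed range).
coarse = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},
    n_iter=10, scoring="f1", cv=5, random_state=0)
coarse.fit(X_train, y_train)
best_c = coarse.best_params_["clf__C"]

# Stage 2: exhaustive grid in a narrow neighbourhood of the stage-1 winner.
# The multipliers are illustrative; 1.0 keeps the stage-1 value in the grid.
fine = GridSearchCV(
    pipe,
    param_grid={"clf__C": [best_c * m for m in (0.25, 0.5, 1.0, 2.0, 4.0)]},
    scoring="f1", cv=5)
fine.fit(X_train, y_train)

# Held-out evaluation with the refit best estimator.
test_f1 = f1_score(y_test, fine.predict(X_test))
test_auc = roc_auc_score(y_test, fine.predict_proba(X_test)[:, 1])
print(f"stage 1 CV F1: {coarse.best_score_:.4f}")
print(f"stage 2 CV F1: {fine.best_score_:.4f}")
print(f"test F1: {test_f1:.4f}, test AUC-ROC: {test_auc:.4f}")
```

Because the refinement grid contains the stage-1 winner and both searches use the same deterministic folds, the stage-2 cross-validation score cannot fall below the stage-1 score in this sketch. The same pattern would apply to Random Forest or XGBoost by swapping the final pipeline step and its parameter ranges; tree-based models typically do not need the scaling step, which matches the abstract's "where applicable".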


