Two-Stage Tuning of Machine Learning Models for Heart Disease Classification on Synthetic Data
DOI: https://doi.org/10.63158/journalisi.v8i3.1599
Keywords: Heart Disease Risk Classification; Two-Stage Hyperparameter Tuning; Machine Learning; Comparative Analysis; Feature Importance
Abstract
Heart disease remains a leading global cause of mortality, highlighting the need for accurate early risk classification. This study benchmarks Random Forest, XGBoost, and Logistic Regression for heart disease risk classification using a synthetic, perfectly balanced dataset, while addressing performance limitations caused by inadequate hyperparameter configuration. The dataset comprised 70,000 samples with a 50/50 class distribution and 18 clinical and demographic features. Although useful for controlled benchmarking, synthetic balanced data may yield optimistic estimates and may not fully represent real-world clinical variability. Each model was implemented in a scikit-learn Pipeline with median imputation and, where applicable, standard scaling. A two-stage tuning strategy was applied by combining RandomizedSearchCV with GridSearchCV refinement to optimize model configurations systematically. Under these benchmarking conditions, XGBoost achieved the best test performance, with an F1-score of 99.34%, AUC-ROC of 99.97%, and accuracy of 99.34%. Random Forest obtained an F1-score of 99.20% and AUC-ROC of 99.95%, while Logistic Regression achieved an F1-score of 99.12% and AUC-ROC of 99.95%. Age, pain in the arms/jaw/back, and cold sweats/nausea were the most influential predictors. The proposed framework is reproducible, computationally efficient, and suitable for validation on heterogeneous clinical datasets.
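The two-stage strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy dataset from `make_classification`, the choice of Logistic Regression only, the search range for `C`, the refinement factors, the 3-fold cross-validation, and the F1 scoring are all assumptions made for illustration. The only elements taken from the abstract are the scikit-learn Pipeline with median imputation and standard scaling, and the sequence of RandomizedSearchCV followed by GridSearchCV.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a small balanced toy dataset with 18 features stands in
# for the study's 70,000-sample synthetic dataset.
X, y = make_classification(n_samples=600, n_features=18, n_informative=6,
                           weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Preprocessing lives inside the pipeline so it is refit on every CV fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stage 1: coarse random search over a wide, log-scaled range (assumed range).
coarse = RandomizedSearchCV(pipe, {"clf__C": loguniform(1e-3, 1e3)},
                            n_iter=10, scoring="f1", cv=3, random_state=0)
coarse.fit(X_train, y_train)
best_c = coarse.best_params_["clf__C"]

# Stage 2: exhaustive grid centred on the stage-1 winner (assumed factors).
fine_grid = {"clf__C": [best_c * f for f in (0.25, 0.5, 1.0, 2.0, 4.0)]}
fine = GridSearchCV(pipe, fine_grid, scoring="f1", cv=3)
fine.fit(X_train, y_train)

# Evaluate the refit best pipeline once on the held-out test split.
test_f1 = f1_score(y_test, fine.predict(X_test))
test_auc = roc_auc_score(y_test, fine.predict_proba(X_test)[:, 1])
print("stage 1 best C:", best_c, "CV F1:", coarse.best_score_)
print("stage 2 best C:", fine.best_params_["clf__C"], "CV F1:", fine.best_score_)
print("test F1:", test_f1, "test AUC-ROC:", test_auc)
```

Because the stage-2 grid contains the stage-1 winner and both searches use the same deterministic folds, the refined cross-validated score cannot fall below the coarse one. The same pattern would apply to Random Forest or XGBoost by swapping the final pipeline step and the parameter ranges.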
Copyright (c) 2026 Journal of Information Systems and Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors Declaration
- The Authors certify that they have read, understood, and agreed to the Journal of Information Systems and Informatics (JournalISI) submission guidelines, policies, and submission declaration. The submission has been prepared using the provided template.
- The Authors certify that all authors have approved the publication of this manuscript and that there is no conflict of interest.
- The Authors confirm that the manuscript is their original work, has not been previously published, and is not under consideration for publication elsewhere.
- The Authors confirm that all authors listed on the title page have contributed significantly to the work, have read the manuscript, attest to the validity and legitimacy of the data and its interpretation, and agree to its submission.
- The Authors confirm that the manuscript is not copied from or plagiarized from any other published work.
- The Authors declare that the manuscript will not be submitted for publication in any other journal or magazine until a decision is made by the journal editors.
- If the manuscript is finally accepted for publication, the Authors confirm that they will either proceed with publication immediately or withdraw the manuscript in accordance with the journal’s withdrawal policies.
- The Authors agree that, upon publication of the manuscript in this journal, they transfer copyright or assign exclusive rights to the publisher, including commercial rights.














