PCOS Classification Using Random Forest, Recursive Feature Elimination, and Explainable AI

Authors

  • Syifa Ayu Salsabila Putri, Telkom University, Indonesia
    Competing interests: There is no competing interest.

  • Rona Nisa Sofia Amriza, Telkom University, Indonesia
    Competing interests: There is no competing interest.

DOI:

https://doi.org/10.63158/journalisi.v8i3.1603

Keywords:

Clinical Classification, Feature Selection, PCOS Classification, Recursive Feature Elimination, Explainable AI

Abstract

Polycystic Ovary Syndrome (PCOS) is an endocrine condition that predominantly affects women of childbearing age. Diagnosis is often delayed because conventional methods rely on laboratory tests and imaging procedures that are relatively costly and time-consuming. This study develops a PCOS classification model from a clinical dataset of 541 patients with 42 clinical attributes, using the Random Forest (RF) algorithm with Recursive Feature Elimination (RFE) for feature selection and an Explainable AI (XAI) approach. The research pipeline comprised several sequential stages: problem identification, data collection, preprocessing, data splitting, feature selection, model training and testing, evaluation, and SHAP-based explainability analysis. Performance was evaluated using accuracy, precision, recall, and F1-score, and compared between two models, RF+CF and RF+RFE. SHAP (SHapley Additive exPlanations) was applied to identify and explain the contribution of clinical variables to the classification results. The best model, RF+RFE, achieved an accuracy of 92.66%, precision of 93.75%, recall of 83.33%, and F1-score of 88.24%, outperforming RF+CF. Because this study relies on a single dataset, broader validation across multiple centers is recommended before clinical deployment. The model is intended as a screening-support approach and has not been validated as a clinical diagnostic tool. The findings are expected to serve as a foundation for building data-driven early screening tools and clinical decision support systems.
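
The feature selection, training, and evaluation stages described above can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation: the data are synthetic (only the 541 x 42 shape is taken from the abstract), and the class balance, the 80/20 stratified split, the number of retained features (15), and the forest hyperparameters are all assumptions chosen for illustration. The SHAP step is left out so that the sketch depends on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical table: 541 rows, 42 features.
# The class weights are an arbitrary assumption, not taken from the paper.
X, y = make_classification(
    n_samples=541,
    n_features=42,
    n_informative=10,
    n_redundant=5,
    weights=[0.67, 0.33],
    random_state=0,
)

# Assumed 80/20 stratified split; the abstract does not state the ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# RFE refits the forest repeatedly and drops the least important feature
# each round until the requested number of features remains.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=15,  # assumed value for illustration
    step=1,
)
selector.fit(X_train, y_train)

# Train the final forest on the selected features only.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(selector.transform(X_train), y_train)
y_pred = model.predict(selector.transform(X_test))

# The four metrics named in the abstract.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the data are synthetic, the printed scores say nothing about the reported results; the sketch only shows how the selector is fitted on the training split alone and then applied unchanged to the test split, which avoids leaking test information into feature selection.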

Published

2026-06-22

Section

Articles