Explainable AI for Water Quality Classification Using Ensemble Stacking
DOI:
https://doi.org/10.63158/journalisi.v8i3.1601
Keywords:
anti-leakage pipeline, SMOTE, stacking ensemble, minority-class classification, local explainability
Abstract
This study proposes a robust and interpretable machine learning framework for water quality classification, using a publicly available dataset of 7,996 samples and 20 physicochemical features with an imbalanced class distribution (88.59% majority, 11.41% minority). It addresses classification bias toward the majority class, which can lead to risky misclassification of unsafe water. A stacking ensemble combining XGBoost, LightGBM, and CatBoost with a Random Forest meta-learner (passthrough enabled) was developed inside an anti-leakage pipeline that applies RobustScaler and SMOTE under a stratified 80:20 train–test split with cross-validation. Hyperparameters were tuned against the F1-score to improve minority-class performance, and SHAP was applied for global and local explainability. The proposed model achieved a minority-class F1-score of 0.8563 and a ROC-AUC of 0.9846, indicating strong discriminative performance. SHAP analysis identified ammonia as the most influential feature and indicated that False Negative errors were mainly caused by complex feature interactions. The study contributes an integrated framework combining stacking ensemble learning, anti-leakage evaluation, and SHAP-based global–local interpretation to support more reliable and transparent water quality classification; however, the findings are currently limited to a single dataset and require multi-dataset validation.
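The anti-leakage idea in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the function names, the synthetic two-feature data, and the simplified SMOTE-style interpolation are all assumptions made for illustration. In practice one would more likely place scikit-learn's RobustScaler and imbalanced-learn's SMOTE inside a single pipeline. The sketch shows only the ordering that prevents leakage: split first, fit the scaling statistics on the training portion alone, and oversample only the training minority class.

```python
import math
import random
import statistics


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), holding out test_frac of each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def fit_robust_scaler(rows):
    """Per-feature (median, IQR), computed from the given rows only."""
    params = []
    for col in zip(*rows):
        q1, med, q3 = statistics.quantiles(col, n=4, method="inclusive")
        params.append((med, (q3 - q1) or 1.0))  # guard against a zero IQR
    return params


def transform(rows, params):
    """Centre each feature on its median and scale by its IQR."""
    return [[(x - med) / iqr for x, (med, iqr) in zip(row, params)]
            for row in rows]


def smote_like(minority, n_new, k=3, seed=0):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours (illustrative, not the library)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


# Synthetic, imbalanced toy data (assumption: 176 majority, 24 minority).
data_rng = random.Random(42)
X = ([[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(176)]
     + [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(24)])
y = [0] * 176 + [1] * 24

# 1. Split before any preprocessing so the test set stays unseen.
train_idx, test_idx = stratified_split(y, test_frac=0.2)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# 2. Fit scaling statistics on the training rows only, then apply to both.
params = fit_robust_scaler(X_train)
X_train_s, X_test_s = transform(X_train, params), transform(X_test, params)

# 3. Oversample the training minority class only; the test set is untouched.
minority = [x for x, label in zip(X_train_s, y_train) if label == 1]
n_new = y_train.count(0) - len(minority)
X_bal = X_train_s + smote_like(minority, n_new)
y_bal = y_train + [1] * n_new
```

After these steps the balanced training set (`X_bal`, `y_bal`) would be passed to the classifier, while `X_test_s` keeps its original class imbalance so that the reported metrics reflect the real distribution. Under cross-validation the same fit-on-train-only rule would apply within each fold.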
Copyright (c) 2026 Journal of Information Systems and Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.