Machine Learning Approach for Credit Score Predictions

This paper addresses the problem of managing the significant rise in requests for credit products that banking and financial institutions face. The aim is to propose an adaptive, dynamic heterogeneous ensemble credit model that integrates the XGBoost and Support Vector Machine models to improve the accuracy and reliability of credit risk scoring models. The method employs machine learning techniques to recognise patterns and trends in past data to anticipate future occurrences. The proposed approach is compared with existing credit score models to validate its efficacy using five popular evaluation metrics: Accuracy, ROC AUC, Precision, Recall and F1_Score. The paper highlights credit scoring models' challenges, such as class imbalance, verification latency and concept drift. The results show that the proposed approach outperforms the existing models on the evaluation metrics, achieving a balance between predictive accuracy and computational cost. The conclusion emphasises the significance of the proposed approach for the banking and financial sector in developing robust and reliable credit scoring models to evaluate the creditworthiness of their clients.


INTRODUCTION
Credit scoring models have emerged as effective and efficient tools for banks and other financial institutions to identify potential default borrowers and mitigate credit risk. Given such a scenario, a credit scoring model's predictive and discriminatory performance is important for financial institutions and banks to generate profits. Financial institutions use a credit score to determine a client's creditworthiness for a loan. Credit scores are generated by considering personal details such as historical track records on debt responsibilities, profiling, primary place of residence, earnings, job, demographic information, assets like vehicles and real estate, and census data. There has been a swift surge in the number of credit requests that financial institutions receive, and they have to assess the possible hazards associated with granting credit to their clients. The sooner financial institutions can ascertain whether or not to provide credit to their clients, the more advantageous it is. Credit scores are utilised by lenders, retailers, car dealerships, and real estate agents to appraise whether a client is eligible for a loan, credit card, automobile, or a new residence. They also determine the applicable interest rate and credit limit.
Credit scoring is useful for managing credit risk and minimising information asymmetry [1] [2]. Its purpose is to produce a score that can differentiate loan applicants into two categories: those who are creditworthy and likely to repay their loans, and those who are risky and unlikely to do so. This score is linked to the anticipated likelihood of default, and scoring is formulated as a classification task [3]. The creation of a robust, efficient, and adaptable credit scoring model has a significant impact on the profitability of financial institutions [4]. Every credit risk scoring model must comply with stringent regulations, and any violation may result in significant regulatory costs. Therefore, creating credit scoring models that are adaptable, efficient, and robust in accurately predicting loan defaults is crucial. Before the advent of machine learning, statistical models were used for credit scoring. Nevertheless, statistical methods usually rely on strong assumptions such as linear separability and normal distribution of the data [5]. These assumptions can restrict the effectiveness of statistical methods when they are violated or when the methods are applied to large datasets.
Credit scoring is generally computed using different mathematical tools that estimate the probability of default (PD) of the party receiving the loan [6]. While this approach can provide valuable insights into a customer's creditworthiness, the traditional data analysis methods and manual credit scoring methods can be slow and resource intensive. As a result, banks are increasingly turning to machine learning and other automated techniques to speed up the credit evaluation process and make more accurate predictions [7]. These technologies can analyse large volumes of data and identify patterns and trends that may be difficult for humans to detect, enabling banks to make faster, more informed lending decisions.
The study aims to develop a model that delivers precise results even when the data is imbalanced and customer variables are subject to change over time. To address imbalanced data, oversampling is used, which adjusts unequal data classes to generate balanced datasets. The model's effectiveness was evaluated using various real-life credit score datasets. The study also explored imbalanced classification, which involves building prediction models on classification datasets with a notable class imbalance. Dealing with imbalanced datasets can be challenging, as many machine learning techniques tend to neglect the minority class. This can lead to poor performance, even though accurately identifying the minority class is often the most important aspect [8]. In this study, various datasets containing different classes were used to tackle the issue of imbalanced data, with some datasets having more positive samples and others having more negative samples. To address the class imbalance problem, the Synthetic Minority Oversampling Technique (SMOTE) was applied to oversample the minority class. This technique increases the number of minority class samples by synthesising new samples between existing minority samples and their nearest neighbours to balance the class distribution. After SMOTE, Principal Component Analysis (PCA) was conducted to identify the key features that significantly impact the results. PCA is an unsupervised learning method that reduces dimensionality in machine learning. It uses an orthogonal transformation statistical technique to transform the observations of correlated variables into a set of linearly uncorrelated variables [9]. This study's primary achievement is creating an adaptive, dynamic, and novel heterogeneous ensemble credit scoring model. Our proposed model differs from existing models in that it considers changes that occur in customer variables over time and applies dynamic ensemble selection to incorporate both accuracy and diversity.
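The oversampling and dimensionality-reduction steps above can be sketched as follows. This is a minimal illustration on synthetic data: the simplified SMOTE-style interpolation below is a hand-rolled stand-in for a library implementation such as imbalanced-learn's SMOTE, and the dataset sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating each
    sample towards one of its k nearest minority-class neighbours."""
    rng = rng or np.random.default_rng(0)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                    # pick a minority sample
        j = rng.choice(neighbours[i])          # pick one of its neighbours
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy imbalanced data: 90 negatives, 10 positives
rng = np.random.default_rng(42)
X_neg = rng.normal(0.0, 1.0, size=(90, 4))
X_pos = rng.normal(2.0, 1.0, size=(10, 4))
X_pos_new = smote_oversample(X_pos, n_new=80, rng=rng)
X = np.vstack([X_neg, X_pos, X_pos_new])
y = np.array([0] * 90 + [1] * 90)

# PCA after oversampling keeps the dominant directions of variance
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (180, 2)
```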
The data utilised in the experiments were obtained from the publicly available UCI Machine Learning Repository [10] and Kaggle repositories and can be downloaded free of charge. These datasets are widely used in related research, making a comprehensive comparison of predictions with existing studies feasible.

Related Work
Banks and most financial institutions use a quantitative model for credit scoring to distinguish between creditworthy and risky customers. Given the increasing complexity of credit scoring, several approaches to designing efficient and robust scoring models have been proposed. Recently, machine learning ensembles have gained prominence over statistical models due to their recognition and adaptation abilities in developing robust and assertive credit scoring models. Credit scoring models utilise information from loan applications and customer details to accurately predict the likelihood of loan default. Credit scoring is a crucial aspect of the credit risk management system of most financial institutions, helping them cope with the surge in loan applications, and a number of contemporary approaches use ensembles of machine learning methods to establish quantitative credit scoring models that distinguish between two categories of applicants: creditworthy and non-creditworthy. Due to its strong interpretability of results, Zhang et al. [11] applied the logistic regression model to design a novel ensemble called the Balancing and Weighting Effect (BWE). The major drawback of the logistic Balancing and Weighting Effects model is that the balancing operation of training samples enhances the recognition ability for default samples at the expense of the recognition ability for non-default samples. BWE needs to be integrated with other credit scoring models and learning algorithms to increase diversity and further improve its recognition ability.
To accurately handle the class imbalance problem that is inherent in credit scoring, as the misclassification of the minority class is often costly, Jonah Mushava and Michael Murray [12] suggested utilising XGBoost, a dependable and effective classification technique, and included the quantile function of the Generalized Extreme Value (GEV) distribution as a link function to improve the identification of infrequent cases. While XGBoost-based methods are intricate and offer superior outcomes compared to simple imputation methods, these techniques lessen the comprehensibility of the scoring outcomes. In another research paper, the same authors [13] investigated the predictive power of the most popular classification techniques currently used for credit scoring, with special attention to predicting whether a client will pay given different intervals of days in arrears. The approach only works well for a fixed window of 3 to 12 months, but many clients can go beyond even five years with no payment made, and the approach does not consider the occurrence of variable drifts. Developing a reliable and confident credit scoring model takes considerable time, usually between 3 and 18 months. Therefore, it is not uncommon for financial institutions and their credit scoring models to remain unchanged for several years.
The credit scoring task should be considered an ephemeral scenario since variables can drift over time. Yiqiong Wu [14] proposed a credit scoring framework that focuses on uncertainty and incorporates multi-objective feature selection to handle credit classification under uncertain conditions. The multi-objective optimisation problem is addressed using a modified evolutionary algorithm and a binary multi-objective particle swarm optimisation. One of the drawbacks of this approach is that it uses a simple dummy method to encode categorical variables. The experimental results show that a credit scoring model with better AUC and ACC values may not always yield satisfactory FPR or FNR values, and the method for determining the cut-off point is suboptimal. Hongliang He [15] introduced a new ensemble model to tackle the problem of class imbalance in credit datasets. This model can adjust the imbalance ratios to enhance recognition performance. The proposed approach extends the supervised under-sampling approach called BalanceCascade to create adjustable datasets to estimate data imbalance ratios. The method comprises three stages and employs the PSO algorithm to optimise parameters. It adopts a stacking approach that combines RF and XGBoost as base classifiers to form an ensemble. Despite the recommendations for improving the handling of imbalanced data, the approach still has limitations: for instance, it does not consider the impact of redundant samples from positive classes or the performance of an ensemble model with more than three base classifiers. Nevertheless, the models' diversity is key to any ensemble classifier's success.
Wanan Liu [16] proposed two tree-based augmented GBDTs for credit scoring to harness the power of tree-based algorithms for credit scoring models. Diversity is introduced via a stepwise feature augmentation mechanism. The proposed approach was evaluated on four large-scale credit scoring datasets along with several benchmark models, and the performance comparison demonstrated that the proposed approach is effective. However, integrating tree-based stepwise feature augmentation with XGBoost leaves the performance poorly balanced and makes the model more complex and its results harder to interpret. A credit scoring model that incorporates the bagging algorithm with the stacking method was proposed by Yufei Xia et al. [17]. The Bstacking model involves four base learners trained on bagging samples. However, complex models like Bstacking may raise privacy concerns and attract regulatory action; in addition, interpretability should be highlighted to balance a real-world credit scoring model's accuracy, complexity, and interpretability. Credit scoring involves working with large amounts of data, which makes it difficult to perform resampling during model training. As a result, methods such as bagging and boosting, which involve resampling the training data, are typically not used in credit scoring. In another effort to create a credit score model capable of accurately distinguishing loan applicants, Yufei Xia et al. [31] proposed the overfitting-cautious heterogeneous ensemble model (OCHE), a tree-based heterogeneous ensemble model designed to avoid overfitting. This model uses a dynamic ensemble selection strategy and advanced tree-based classifiers as base models, and also considers overfitting in the ensemble selection stage. The proposed model was compared with benchmark models on five publicly available real-world datasets. It outperformed most individual and homogeneous ensemble models in predictive accuracy, as measured by four metrics.
Using a powerful base such as XGBoost and CatBoost, which are complex but generate better results, may also lead to the deterioration of the interpretability of the scoring results. This paper proposes the Adaptive Dynamic Heterogeneous Ensemble (ADHE) that explores the dynamic ensemble selection to formulate an ensemble of accurate and diverse models derived from two base learners. To detect and adapt to changes in the behaviour of applicants, models are updated regularly. To tackle the class imbalance issue, the Synthetic Minority Oversampling Technique (SMOTE) is utilised. The XGBoost algorithm is used for feature processing.

METHODS
Ensemble learning combines the prediction outputs of different classifiers to generate better generalisation than applying a single algorithm. This section proposes the Adaptive and Dynamic Heterogeneous Ensemble (ADHE) for credit scoring. Adaptive and Dynamic Heterogeneous Ensemble is a machine learning technique that combines multiple models to improve prediction accuracy and robustness. It involves creating an ensemble of diverse models that complement each other's strengths and weaknesses. The adaptive and dynamic aspect refers to the ability to adjust the ensemble in response to data and environment changes. In the credit scoring context, the Adaptive and Dynamic Heterogeneous Ensemble approach integrates XGBoost and Support Vector Machine models to create a more accurate and reliable credit scoring model. XGBoost is a gradient-boosting algorithm that can handle large and complex datasets while Support Vector Machines effectively handle high-dimensional data. By combining these models, the ensemble is better equipped to handle the challenges of credit scoring, such as class imbalance, verification latency, and concept drift. The experimental setup is specified from aspects such as Dynamic Ensemble Selection and pool generation, base learners, data pre-processing, hyperparameter tuning, evaluation metrics and credit datasets.
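A minimal sketch of the heterogeneous combination described above, using scikit-learn on synthetic data. GradientBoostingClassifier stands in for XGBoost so the example stays dependency-light, and soft voting is one simple way to combine the two learners; the actual ADHE selection strategy is more elaborate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for a credit dataset; y = 1 marks default.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# SVM needs scaled inputs; probability=True enables soft voting.
svm = make_pipeline(StandardScaler(),
                    SVC(C=1.0, gamma="scale", probability=True, random_state=0))
gbt = GradientBoostingClassifier(random_state=0)  # stand-in for XGBoost

# Soft voting averages the predicted default probabilities of both learners.
ensemble = VotingClassifier([("svm", svm), ("gbt", gbt)], voting="soft")
ensemble.fit(X_tr, y_tr)
print(round(ensemble.score(X_te, y_te), 3))
```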

Dynamic Classifier Selection and Pool Generation
Given a training dataset D = {X, y}, where X is an N × d feature matrix and y ∈ {0,1}^N indicates the labels, a value of 1 in y represents a default application, whereas 0 is an indication of creditworthiness. The dataset generates an initial pool of classifiers from the two base learning algorithms: XGBoost and Support Vector Machines. Our proposed approach employs a heterogeneous ensemble architecture. After the initial pool is generated, classifier ensembles are subsequently selected. The approach for selection is dynamic, and the classifiers considered competent are chosen using a fitness function specific to different groups of test samples, so the ensemble classifier is created dynamically [18]. For an ensemble to accurately distinguish applicants, base models in ensemble learning must be diverse and accurate. This study selects classifiers based on their accuracy on the validation set and their diversity to handle the incremental learning that accounts for changing customer behaviour over time. The credit scoring task is transitory because various variables might alter over time. Therefore, the study uses data stream mining techniques designed for incremental learning and for detecting and adjusting to changes in the data distribution. To select classifiers from the pool that are accurate and diverse, we employ the algorithm called Selection by Accuracy and Diversity (SAD) [19], which proceeds as follows:
1) Train a set of different classifiers.
2) Measure the accuracy of each classifier on a validation set.
3) Choose the top-performing classifiers based on accuracy.
4) Measure the diversity between the chosen classifiers and the remaining ones.
5) Select additional classifiers with high diversity and add them to the ensemble until the desired size is reached.
6) Combine the classifiers in the ensemble.
7) Evaluate the performance of the ensemble.
The Q Statistic [20] diversity measure is used as a diversity measure in this study due to its simplicity and ease of interpretation.
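The SAD procedure and the Q statistic can be sketched as follows. The correctness vectors, classifier names and ensemble sizes below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def q_statistic(c1, c2):
    """Yule's Q between two classifiers, given boolean 'correct' vectors on a
    validation set. Q near 0 indicates diverse errors; near 1, similar errors."""
    n11 = np.sum(c1 & c2)      # both correct
    n00 = np.sum(~c1 & ~c2)    # both wrong
    n10 = np.sum(c1 & ~c2)
    n01 = np.sum(~c1 & c2)
    denom = n11 * n00 + n01 * n10
    return 0.0 if denom == 0 else (n11 * n00 - n01 * n10) / denom

def select_by_accuracy_and_diversity(correct, n_top=2, size=3):
    """SAD sketch: 'correct' maps classifier name -> boolean correctness vector.
    Seed the ensemble with the n_top most accurate classifiers, then repeatedly
    add the remaining classifier with the lowest mean |Q| until 'size' is reached."""
    acc = {name: c.mean() for name, c in correct.items()}
    ranked = sorted(acc, key=acc.get, reverse=True)
    ensemble, rest = ranked[:n_top], ranked[n_top:]
    while len(ensemble) < size and rest:
        best = min(rest, key=lambda r: np.mean(
            [abs(q_statistic(correct[r], correct[e])) for e in ensemble]))
        ensemble.append(best)
        rest.remove(best)
    return ensemble

# toy correctness vectors for four pool members on a 6-sample validation set
correct = {
    "svm_1": np.array([1, 1, 1, 0, 1, 1], dtype=bool),
    "svm_2": np.array([1, 1, 1, 0, 1, 0], dtype=bool),
    "xgb_1": np.array([0, 1, 1, 1, 1, 1], dtype=bool),
    "xgb_2": np.array([1, 0, 1, 1, 0, 1], dtype=bool),
}
print(select_by_accuracy_and_diversity(correct))
```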

Base Learners
In simple terms, the success of a classifier ensemble relies on the diversity of performance of its base classifiers [21]. Our approach uses two base learning algorithms, XGBoost and Support Vector Machines (SVM), to introduce diversity. In addition to introducing diversity, the two base learners have the potential to strike a good balance between accuracy and efficiency. Support Vector Machines have demonstrated tremendous capability for regression and classification problems in static and dynamic domains. They have been extensively used to address the curse of dimensionality in most classification problems. For classification tasks, SVM identifies a hyperplane that separates linear data into two classes while maximising the margin to the nearest training instances.
If the data is non-linear, the SVM kernel function maps it to a higher-dimensional space. In such cases, SVM looks for an optimal hyperplane that can separate the two data classes in the high-dimensional feature space. A Support Vector Machine with an RBF kernel has two hyperparameters: the cost parameter C and the RBF kernel parameter γ. The cost parameter manages both the misclassification penalty and the complexity level, while the RBF kernel parameter regulates the influence of an individual training sample on the hyperplane.
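A small illustration of tuning the two SVM hyperparameters with scikit-learn; the synthetic data and the candidate grids for C and gamma are arbitrary choices for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# C trades the misclassification penalty against margin complexity;
# gamma controls how far the influence of a single training sample reaches.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1.0]},
                    cv=3)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```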
Chen and Guestrin [22] developed eXtreme Gradient Boosting (XGBoost) to address classification problems encountered in real-world scenarios. XGBoost reduces model variance by incorporating regularisation into the loss function and uses a weighted quantile sketch for tree learning to handle sparse data. XGBoost surpasses many other machine learning algorithms in speed and accuracy because of these techniques. It uses a Taylor expansion to approximate the loss function quickly.
The XGBoost learning algorithm has several hyperparameters. The number-of-estimators hyperparameter controls the number of boosting iterations in XGBoost. The maximum depth hyperparameter determines the maximum depth of a single base learner, while the subsampling rate hyperparameter specifies the fraction of samples used to train one base learner. The learning rate hyperparameter reduces the contribution of each base learner. The column sampling rate hyperparameter determines the fraction of features used for training a single base learner. Finally, the gamma hyperparameter determines the minimum loss reduction necessary to create a new partition.
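The hyperparameters above map onto the xgboost scikit-learn API roughly as follows; the values shown are illustrative starting points, not settings from this study.

```python
# Mapping from the hyperparameters described above to the parameter names
# used by the xgboost scikit-learn API (XGBClassifier). The values are
# illustrative starting points, not tuned settings from this study.
xgb_params = {
    "n_estimators": 200,      # number of boosting iterations
    "max_depth": 4,           # maximum depth of a single base learner
    "subsample": 0.8,         # fraction of samples per base learner
    "learning_rate": 0.1,     # shrinks each base learner's contribution
    "colsample_bytree": 0.8,  # fraction of features per base learner
    "gamma": 0.1,             # minimum loss reduction to create a new split
}

# With the xgboost package installed this would be used as:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**xgb_params)
print(sorted(xgb_params))
```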

Parameter Optimisation
The base learners employed in the study, Support Vector Machines and XGBoost, are associated with several parameters that can substantially impact the prediction performance of the credit scoring model. These parameters must be optimised for the model to perform optimally, and several optimisation algorithms exist. Most existing optimisation techniques suffer from the curse of dimensionality: the computational cost tends to increase dramatically with the number of hyperparameters or as the search space is extended. For most applications, the tuning of hyperparameters is subjective and relies on empirical judgement and trial-and-error approaches. To overcome the drawbacks of existing optimisation algorithms, this study employs an adaptive heterogeneous Particle Swarm Optimiser to generate an optimal set of parameters and improve the efficacy of XGBoost and Support Vector Machines for the classification problem. Kennedy and Eberhart [23] created the Particle Swarm Optimization (PSO) algorithm, a popular heuristic algorithm and evolutionary computational technique. PSO is a population-based, iterative, global, and stochastic optimisation technique. It takes inspiration from the social behaviour of birds flocking or fish schooling to conduct an intelligent search for the best possible solution [24]. PSO does not need gradients, as it is not based on differentiability, which makes it useful for problems with nonconvex or discontinuous objective functions. In the current research, the swarm's particles were instantiated individually to introduce diversity within the swarm. This allows for different search behaviours among the particles, as they can randomly choose velocity and position update rules from a pool of possible behaviours.
Combining exploratory and exploitative particles allows the algorithm to balance exploration and exploitation, preventing premature convergence and allowing for a better solution space search.
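A minimal global-best PSO sketch (not the adaptive heterogeneous variant used in the study, whose particles mix different update rules). Here it minimises the sphere function; in the paper's setting the objective would be cross-validated error over the hyperparameter space.

```python
import numpy as np

def pso(objective, bounds, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal global-best PSO. bounds: list of (low, high) per dimension."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    dim = len(bounds)
    pos = rng.uniform(lo, hi, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # velocity update: inertia + pull towards personal and global bests
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Demonstration on the sphere function f(p) = sum(p^2), minimum at the origin.
best, val = pso(lambda p: np.sum(p ** 2), bounds=[(-5, 5)] * 3)
print(best, val)
```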

Data Pre-processing
The datasets utilised in this research are processed through standardisation, scaling to the range 0 to 1, and the imputation of missing values. The class imbalance issue in the credit card fraud datasets is also addressed. To standardise numeric features, the mean is removed and the data is scaled to unit variance. The data is then scaled using the 0-1 normalisation method: if x is a given feature, the normalised feature x′ is calculated as x′ = (x − min(x)) / (max(x) − min(x)), where x′ expresses the normalised value.
Normalising features significantly enhances the precision of classifiers, particularly those that rely on distance or margin computations, making the model more confident and precise. Credit card fraud data is associated with class imbalance: detecting credit card fraud is challenging due to the highly skewed distribution of credit card transaction data, where the proportion of legitimate transactions greatly exceeds that of fraudulent ones. Additionally, a metric called degOver is utilised to address class imbalance with overlap, which considers both the imbalance ratio and the dataset structure [26]. Dynamic Ensemble Selection (DES) is applied to handle different drifting concepts. To handle verification latency, we employ integrated Fraud Detection (FD) [27]. Smooth Clustering based Boosting (SCBoost) is a fraud detection method with noise-resistant boosting. It is combined with the k-Shortest Distance Ratio (k-SDR), which helps to use the labelled dataset effectively and address issues caused by class imbalance. k-SDR's primary function is to classify an instance based on the ratio of its average distance to the k nearest instances in the positive class, preventing any interference caused by class imbalance in the labelled dataset.
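The standardisation and 0-1 normalisation steps can be illustrated with scikit-learn's scalers on a toy feature matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy matrix: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# standardisation: remove the mean and scale to unit variance
X_std = StandardScaler().fit_transform(X)

# 0-1 normalisation: x' = (x - min(x)) / (max(x) - min(x)), per feature
X_01 = MinMaxScaler().fit_transform(X)

print(X_01.min(), X_01.max())  # 0.0 1.0
```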

Feature Selection
Feature selection is carried out using XGBoost. It computes feature importance scores by measuring the average reduction in objective function value achieved using a particular variable for splitting. This evaluation is carried out immediately when variables are selected for splitting. During the tree-building process, variables with higher scores are considered more important. This study employs XGBoost as a joint base learner with Support Vector Machine, and the suggestions proposed by Xia [28] are followed to implement scores derived from feature importance as a guideline in a sequential forward search (SFS) feature selection algorithm. SFS places the relevant features into the subset and iteratively adds the remaining features with the highest scores, thus generating a series of candidate feature subsets. Only the feature subset that maximises the cross-validated accuracy is selected as the optimal feature set suitable for training the model in the subsequent steps.
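A sketch of importance-guided sequential forward search along the lines described above, with scikit-learn's GradientBoostingClassifier standing in for XGBoost; the greedy keep-if-improved rule is a simplification of the full SFS procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# 1) rank features by tree-based importance (stand-in for XGBoost scores)
ranker = GradientBoostingClassifier(random_state=0).fit(X, y)
order = np.argsort(ranker.feature_importances_)[::-1]

# 2) sequential forward search along the importance ranking:
#    keep a feature only if it improves cross-validated accuracy
best_subset, best_score = [], 0.0
for f in order:
    candidate = best_subset + [f]
    score = cross_val_score(GradientBoostingClassifier(random_state=0),
                            X[:, candidate], y, cv=3).mean()
    if score > best_score:
        best_subset, best_score = candidate, score

print(len(best_subset), round(best_score, 3))
```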

Performance Metrics
The study presented in this paper is modelled as a machine learning binary classification task. We selected five popular evaluation metrics to comprehensively evaluate our proposed approach, ADHE, and the benchmarks. The accuracy obtained from the test data is used as the main performance metric. Furthermore, we compute each model's Precision, Recall, F1_Score and Area Under the Curve (AUC). The AUC provides a proper assessment of the classification quality of each model and a measure of the effectiveness of a classifier for a given task. The value of AUC lies within the interval 0 to 1, and an efficient classifier is identified by an AUC value close to 1. The accuracy metric is determined by dividing the number of correct predictions by the total number of predictions made.
Precision, in contrast, is the proportion of predicted positives that are actually positive. A recall metric measures the proportion of positive class values in the test dataset that are correctly predicted. Finally, the F1 score represents the balance between precision and recall. The performance metrics can be expressed mathematically as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1_Score = 2 × Precision × Recall / (Precision + Recall)

where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
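These metrics can be computed with scikit-learn; the labels and scores below are made-up values for illustration only.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]                    # 1 = default
y_pred  = [0, 0, 0, 1, 1, 1, 1, 0]                    # hard predictions
y_score = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.4]    # predicted default probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))   # AUC uses scores, not labels
```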

Data Description
Credit scoring in the era of big data has its challenges. Credit data is big and often nonstationary. Data is constantly evolving as customers' behaviour changes. Real-time processing systems have significant business value because they can react instantly. Machine learning models, in many cases, are built on outdated data that no longer accurately represents the distribution of new data. The main difficulty is promptly detecting and adapting to concept drift and successfully managing model transitions during these changes. Credit data is typically nonlinear and has many points, creating a dense cloud that makes it challenging to observe relationships and determine linearity. Along with concept drift and nonlinearity, credit scoring data also has a class imbalance issue, where some classes have many samples and others only have a few. The overall performance of a machine learning algorithm can be adversely affected when large datasets contain data from classes with different probabilities of occurrence.
Five real-world credit datasets are employed to validate our proposed model's efficacy. Among them, the three most popular ones, Australian, Japanese, and German, are sourced from the UCI Machine Learning Repository [10]. These datasets are frequently used in related literature, enabling a feasible comparison with other state-of-the-art studies. However, the three datasets are too small to permit a detailed analysis of the behaviour of our proposed model, so we added two larger credit datasets for a detailed and accurate comparison with existing studies. The first is a Peer-to-Peer (P2P) consumer lending dataset from PPDai; the PPDai dataset [15] consists of transaction records sourced from an advanced P2P lending platform in China. The second, the GMSC dataset, is obtained from the Kaggle community.

Experimental Results
This section provides a comprehensive comparison of the prediction performance of our proposed approach with the selected benchmark models. The first experiments compare the prediction performance of our proposed approach against individual base learners and homogeneous ensemble models. The proposed and benchmark models are validated on five credit score datasets across five evaluation metrics. The empirical experiments are conducted in Python 2.7 on a PC with a 3.6 GHz Intel i7 CPU, 8GB RAM and the Microsoft Windows 10 operating system. Table 2 provides the average prediction performance of individual classifiers and homogeneous ensembles against our ADHE approach on seven evaluation measures. This section compares results from our proposed ADHE approach with the individual classifiers and homogeneous ensembles on five credit-scoring datasets across seven performance metrics. Table 4 presents the Accuracy scores of the ADHE model and other homogeneous ensemble models used in previous studies. From the results, it can be inferred that the ADHE model generally outperforms the homogeneous ensembles. Additionally, the heterogeneous ensemble is constructed through a feature selection process, which further improves the proposed model's performance. The prediction performance of the proposed ADHE is in general better than that of the homogeneous ensembles, reflecting the ADHE approach's effectiveness. The experimental results of the ADHE heterogeneous ensemble model for all seven evaluation metrics are the best on the Australian, Japanese, German, PPDai and GMSC datasets. In contrast, the prediction performance of single classifiers and homogeneous ensembles is less consistent and stable across the different datasets.

Comparison of the ADHE and Ensemble Benchmarks
The performance of ADHE is compared to the other five state-of-the-art ensemble models. The results, shown in Table 4, cover all the experiments conducted on the five datasets. ADHE performs the best overall on all evaluation metrics. The prediction performance of ADHE is further enhanced by the selection of competent classifiers that are diverse, making it able to adapt to changes in the underlying distribution of the data.
Since the prediction performance of ensemble models is well demonstrated in the literature, we employ the overfitting-cautious heterogeneous ensemble model (OCHE) [31], the bagging algorithm with stacking method (BStack) [17], a group method of data handling (GMDH) based sensitive semi-supervised selection ensemble (GCSSE) model [32], the Generalised Shapley Choquet Integral (GSCI) [33] and the Adaptive Particle Swarm Optimization (APSO-XGBoost) [34]. The prediction performance of the created ensemble is evaluated by comparing it with the individual classifiers. The prediction results for all five datasets are presented, and Table 2 displays the prediction performance of both the individual models and the ensemble, using various indicators. The prediction performance of ADHE relative to the other benchmark models is good, which is largely attributed to the simultaneous consideration of accuracy and diversity of both learners in the combination stage. Table 3 reveals important findings on the behaviour of ADHE in handling changes and class imbalance. Firstly, ADHE outperforms the rest of the benchmark ensemble models and achieves first place on the evaluation metrics for most datasets. Secondly, other ensemble-based approaches also achieve good performance, demonstrating the superiority of heterogeneous ensemble methods in credit scoring. OCHE and Bstacking perform well, showing acceptable results across the five datasets, which partially explains why they are often selected as benchmarks for most new credit scoring approaches. For several datasets, benchmark models show acceptable results. As shown on the other performance metrics, the prediction performance of the benchmarks exhibits different behaviours, which necessitates evaluating benchmark models from various aspects such as label, probability, and discriminatory capability. The results in Table 3 demonstrate the advantages of heterogeneous ensemble approaches in credit scoring.
Ensemble methods built using accuracy and diversity provide promising prediction performance, especially in credit scoring.
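As a hedged illustration of this idea (not the paper's exact implementation), a heterogeneous ensemble combining a boosted-tree learner with an SVM can be sketched with scikit-learn. Here `GradientBoostingClassifier` stands in for XGBoost, and the dataset and hyperparameters are placeholders:

```python
# Minimal sketch of a heterogeneous soft-voting ensemble in the spirit of
# ADHE: a boosted-tree learner plus an SVM, combined on predicted class
# probabilities. GradientBoostingClassifier stands in for XGBoost here.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a credit scoring dataset (label 1 = default).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("gbt", GradientBoostingClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),  # probabilities needed for soft voting
    ],
    voting="soft",  # average the class probabilities of the two base learners
)
ensemble.fit(X_tr, y_tr)
accuracy = ensemble.score(X_te, y_te)
```

Soft voting averages the base learners' probability estimates, so the diverse error patterns of the tree model and the SVM can compensate for each other.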

Comparison of Computational Cost
An effective and robust credit scoring model has to be computationally efficient.
In addition to being computationally efficient, a credit scoring model must provide quick and accurate responses to prospective loan applicants. The model must also account for the fact that the variables differ for each applicant and drift over time, so changes to the trained model must keep pace with the frequent updates of the credit scoring model. Using XGBoost as one of the base learners supports Graphics Processing Unit (GPU) execution of highly parallel independent calculations, significantly reducing computational time. This section compares the computational cost of the benchmark models and our proposed ADHE model. To measure computational cost, we report the single training time, calculated as the total training time of one cross-validation run. Table 4 also shows the single training time for the benchmark models. Comparing our proposed approach with the benchmarks reveals a trade-off between computational cost and model prediction performance.
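The single-training-time measurement described above can be sketched as follows; the classifier and dataset are illustrative placeholders, and in practice the timer would wrap one full cross-validation run of the credit scoring model:

```python
# Sketch: measuring the single training time of a model as the wall-clock
# time of one fit (in the paper, one full cross-validation run).
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=500)  # placeholder for the credit scoring model

start = time.perf_counter()
model.fit(X, y)  # in practice: run the whole cross-validation loop here
single_training_time = time.perf_counter() - start
```

`time.perf_counter()` is used rather than `time.time()` because it is a monotonic, high-resolution clock suited to interval timing.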

Statistical Significance Tests
For classification problems, each performance metric has its own merits and demerits. To evaluate our proposed model against the benchmark models, we use a non-parametric significance test rather than a parametric one, because the assumptions of parametric tests are frequently unmet when comparing credit scoring models. Such a test can establish the statistical significance between the models and assess their performance differences. In their study, Lessmann et al. [35] utilise non-parametric tests to compare classification models, as parametric tests [36] are often not suitable for such comparisons due to the assumptions they make. They employ the non-parametric Friedman test, which ranks the models to assess their differences and computes a statistic based on the following formula:

\chi^2_F = \frac{12D}{K(K+1)} \left[ \sum_{j=1}^{K} R_j^2 - \frac{K(K+1)^2}{4} \right] (7)

where D and K represent the number of datasets and classifiers, respectively, and R_j denotes the rank of classifier j averaged over the datasets. To calculate the average rank of each classifier, we use the corresponding ranks on the evaluation metrics over the datasets, without losing any generality. If the Friedman null hypothesis, which assumes no differences among the models, is rejected, indicating a significant difference in the average ranks of the models for a specific evaluation measure, a post hoc test is conducted to carry out paired comparisons against a control method. Our empirical experiment uses the Nemenyi test, under which two models differ significantly when their average ranks differ by at least a Critical Difference (CD). The CD is calculated as follows:

CD = q_{\alpha,\infty,K} \sqrt{\frac{K(K+1)}{12D}} (8)

where q_{\alpha,\infty,K} is the critical value of the studentized range statistic. The Nemenyi test diagram shows the average ranks of the ADHE and benchmark models at various levels of significance.
In the diagram, the number of datasets D is used together with the Friedman test statistic to calculate the Critical Difference (CD). The diagram's horizontal axis represents the average rank of each benchmark model over the datasets, and a black bar connects the models whose difference in average ranks is lower than the CD value. The proposed ADHE is significantly better than OCHE, BStack, APSO-XGBoost, GCSSE and GSCI.
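The ranking procedure and the two formulas above can be sketched in pure Python; the score matrix is a toy example and the studentized-range critical value q is an assumed table value, not one computed here:

```python
# Sketch of the Friedman statistic and Nemenyi critical difference used to
# compare K classifiers over D datasets. The score matrix and the
# studentized-range critical value q are illustrative placeholders.
import math

def average_ranks(scores_per_dataset):
    """Rank classifiers on each dataset (rank 1 = best, higher score wins),
    assigning mean ranks to ties, then average the ranks over datasets."""
    K = len(scores_per_dataset[0])
    totals = [0.0] * K
    for scores in scores_per_dataset:
        order = sorted(range(K), key=lambda j: -scores[j])
        i = 0
        while i < K:
            j = i
            while j + 1 < K and scores[order[j + 1]] == scores[order[i]]:
                j += 1  # extend the block of tied scores
            mean_rank = (i + j) / 2 + 1
            for t in range(i, j + 1):
                totals[order[t]] += mean_rank
            i = j + 1
    D = len(scores_per_dataset)
    return [t / D for t in totals]

def friedman_statistic(avg_ranks, D):
    """Chi-squared Friedman statistic from the average ranks R_j."""
    K = len(avg_ranks)
    return 12 * D / (K * (K + 1)) * (
        sum(r * r for r in avg_ranks) - K * (K + 1) ** 2 / 4
    )

def nemenyi_cd(q, K, D):
    """Critical difference CD = q * sqrt(K(K+1) / (12 D))."""
    return q * math.sqrt(K * (K + 1) / (12 * D))

# Toy example: accuracies of 3 classifiers on 4 datasets.
scores = [[0.90, 0.85, 0.80],
          [0.88, 0.86, 0.79],
          [0.92, 0.84, 0.83],
          [0.89, 0.87, 0.81]]
ranks = average_ranks(scores)            # [1.0, 2.0, 3.0]
chi2_f = friedman_statistic(ranks, D=4)  # 8.0
cd = nemenyi_cd(q=3.314, K=3, D=4)       # assumed q_{0.05,inf,3} ~ 3.314
```

In the toy example the first classifier wins on every dataset, so the average ranks are exactly 1, 2 and 3; any pair of models whose average ranks differ by more than the CD would be joined by no bar in the Nemenyi diagram.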

CONCLUSION
Machine learning techniques have shown great potential in accurately assessing creditworthiness, making them increasingly common in credit scoring models. In this study, new predictive models were developed to enhance accuracy and the ability to differentiate between creditworthy and non-creditworthy customers, facilitating faster credit decisions by financial institutions. The financial industry has accordingly embraced machine learning algorithms to improve the accuracy of customer categorisation. The proposed adaptive dynamic heterogeneous ensemble model provides a faster and more efficient method for predicting customer credit scores and mitigating financial losses. In addition, the ensemble model is assessed on supplementary metrics to support unbiased decision-making, in line with previous studies that emphasised the importance of multiple metrics in evaluating model performance.
However, the study faced certain limitations, such as the time-consuming nature of processing large datasets with grid search cross-validation, which necessitated the use of randomised cross-validation. The experiments were also conducted on Google Colab, by an author with less than six months of machine learning programming experience, which could have affected the results; these could be improved further by employing more efficient feature selection techniques beyond PCA. The study could be replicated in the future with more imbalanced datasets and more powerful computers to determine whether better results can be obtained.
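The randomised-search workaround mentioned above can be sketched as follows; the pipeline, parameter lists, and dataset are illustrative assumptions rather than the study's exact configuration:

```python
# Sketch: randomised cross-validated search over a PCA + boosted-tree
# pipeline, as a cheaper alternative to exhaustive grid search.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

pipe = Pipeline([
    ("pca", PCA()),                                    # feature reduction step
    ("clf", GradientBoostingClassifier(random_state=0)),
])
param_distributions = {
    "pca__n_components": [4, 6, 8],
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [2, 3],
}
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5, cv=3, random_state=0
)  # samples 5 of the 12 combinations instead of trying them all
search.fit(X, y)
best_score = search.best_score_
```

The key saving is that `n_iter` caps the number of fitted candidates, so the cost grows with the sampling budget rather than with the size of the full parameter grid.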