Securing Against Zero-Day Attacks: A Machine Learning Approach for Classification and Organizations’ Perception of its Impact


INTRODUCTION
A zero-day (0-day) vulnerability is a latent flaw in software, and a zero-day exploit is the means by which a system with such a flaw is attacked [2]. These threats have a higher probability of succeeding because organizations are unlikely to have complete and appropriate measures in place to nip them in the bud. A zero-day attack is so termed because it occurs before the target knows that the vulnerability exists: the malware is released before developers have had any opportunity, or only very little, to patch it [1]. The most dangerous aspect of a zero-day attack is that the vulnerable application may be handling sensitive files [6]. If such an application runs on a server, an attack can pass through it and hijack the whole system. Such attacks have caused considerable losses over the years [8]. Affected organizations have run into problems that include, but are not limited to, revenue loss, legal repercussions, reputational damage, loss or theft of data, decreased production and unauthorized access. Service quality has a major impact on clients' trust in organizations [5], and zero-day attacks can erode that trust. Machine learning, the ability of a system to learn from experience, is indispensable for discovering patterns in data, and these patterns can be applied to classify attacks as zero-day or not.
In recent times, researchers have sought practical solutions that at least remedy the situation immediately after an attack happens [13]. The most dreaded aspect of this kind of attack is that it cannot be discovered until it happens. One of the best ways to prevent it is to purchase software products that have passed through a series of tests and confirmations from trusted software sources, and programmers should ensure that every loophole in an application is sealed so that attackers cannot exploit it [9]. People who use cloud storage and rely on applications or APIs to read and process data should ensure that those cloud applications are tested as thoroughly as possible. Zero-day attacks are quick, and zero-day attackers do not give up easily [6]. Machine learning algorithms help ensure that classification is done correctly and in a timely manner [3]. Zero-day threats (ZDTs) to organizations' networks are costly and require new approaches to identifying malicious behavior [2].
Many studies have been conducted on zero-day attacks. One of them, Zero Day Threat Detection Using Metric Learning Autoencoders, demonstrated and improved upon a previous approach that used a dual autoencoder to identify such threats in network flows [2]. Further studies applied principal component analysis (PCA), truncated singular value decomposition (TruncatedSVD) and a Pareto-based Monte Carlo technique (PB-MCFR) for dimensionality reduction on a dataset for zero-day vulnerability analysis. The results showed that PB-MCFR outperformed PCA and TruncatedSVD, and the authors concluded that, although significant effort has gone into developing robust tools to combat zero-day attacks, these tools fall short on key performance metrics [12]. Machine learning algorithms were also evaluated to see how well they detect zero-day malware, with random forest giving the best accuracy and zero false positive and false negative rates [6]. In an evaluation of AI-based techniques for zero-day attack detection using high-level model abstractions and non-linear transformations, it was observed that, because of the sensitivity of zero-day attacks, accuracy or precision alone was not enough to measure model performance and that, as the problem is ever-evolving, current datasets should be used [10]. An enhanced classification model for detecting and estimating the likelihood of zero-day attacks was also proposed; it was based on deep reinforcement learning, with reward learning, sparse feature generation and an adaptive multi-layered recurrent approach, and it performed better than rule-based ranking in predicting zero-day threats [11]. In another development, a transferred generative adversarial network (tDCGAN) based on deep autoencoders was used for malware detection; it exploited the discriminator's ability to extract meaningful features and achieved an accuracy of 95.74% [7].
None of the reviewed works addressed the negative impact of this attack on organizations or sought organizations' perception of that impact, and the metrics they used for classification were insufficient. There is therefore a need to build a model that helps organizations classify attacks in order to identify and seal such vulnerabilities. Random Forest, a contemporary machine learning algorithm with an excellent ability to handle intricate datasets and still classify with high accuracy, offers one of the best approaches in this direction, hence its adoption in this research.

METHODS
This research seeks to classify zero-day malware attacks using a machine learning approach. The data include both malicious and legitimate files and were sourced from Meraz'18, the annual techno-cultural festival held at IIT Bhilai. The malware portion of the data consists of software intended to interfere with, harm or gain unauthorized entry to a computer's infrastructure, while the legitimate portion consists of programs that are safe for users and devoid of malicious intent [14].
Furthermore, the Algorithm-Based Method (ABM) was adopted in this research. In various fields, notably computer science, engineering and psychology, the application of algorithmic methods is a common strategy for solving problems [15]. Using ABM, we itemize the process of classifying zero-day attacks by developing the algorithm for the method in section 2.1, which outlines the process from model/problem definition through data acquisition, EDA, feature selection and training with the Random Forest classifier to evaluation of the classifier's results.

Model / Problem Definition
A zero-day vulnerability is a software weakness for which no recognized security patch or upgrade has been made available. There is no publicly accessible information about the danger, and the software provider may or may not be aware of the vulnerability. This study aims to classify zero-day malware attacks using the Random Forest classifier, with Principal Component Analysis (PCA) for the feature selection process. First, a dataset consisting of d + 1 dimensions is used. The mean of every dimension in the dataset is then computed, and the covariance matrix is computed in accordance with Eqn 1.

$$\mathrm{COV}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \qquad (1)$$
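The covariance computation referred to as Eqn 1 can be illustrated with a minimal sketch. The study itself was carried out in R; Python is used here purely for illustration. The sample (n − 1) normalization, the function names and the toy data are all assumptions, not details taken from the study.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance of two equal-length sequences (assumed n - 1 denominator)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def covariance_matrix(columns):
    """Covariance matrix of a dataset given as a list of feature columns."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]


# Hypothetical toy data: two features observed on four files.
columns = [[2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 2.0, 6.0]]
cov = covariance_matrix(columns)
```

The diagonal of `cov` holds the variance of each feature and the off-diagonal entries hold the pairwise covariances, which is the matrix whose eigenvectors PCA then computes and sorts.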
Furthermore, the computation and sorting of eigenvectors will be carried out.
Finally, the data are transformed into the new subspace using the resulting eigenvector matrix. After the feature selection process, the machine learning algorithm, Random Forest (RF), is applied. This algorithm was chosen because Random Forest is recognized for its strong predictive accuracy in detecting and classifying potential risks or malicious activities. Being an ensemble learning method, it combines multiple decision trees to generate predictions. This ensemble approach improves the overall efficiency and reliability of the classification model, a critical factor in cybersecurity for accurate threat identification and mitigation. Also, in the realm of cybersecurity, Random Forest is indispensable because it … The random forest algorithm applies bagging to decision trees, but with a crucial enhancement. In training the model, the following steps apply.
1. Take a bootstrap subsample (with replacement) from the data.
2. For the first split, sample p < P variables at random without replacement.
3. For each of the sampled variables X_j(1), …, X_j(p), apply the splitting algorithm:
4. For each candidate split value s_j(k) of X_j(k):
5. Split the data in partition A into two sub-partitions: the records with X_j(k) < s_j(k) and the remaining records with X_j(k) ≥ s_j(k).
6. Measure the homogeneity of classes within each sub-partition of A.
7. Select the value of s_j(k) that gives the maximum within-partition homogeneity of class.
8. Select the variable X_j(k) and the split value s_j(k) that together give the maximum within-partition homogeneity of class.
9. Proceed to the next split and repeat from step 2.
10. Continue with additional splits following the same procedure until the tree is grown.
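The steps above can be sketched for a single split as follows. This is an illustrative Python sketch, not the R implementation used in the study. It assumes that homogeneity is measured with Gini impurity (lower impurity means higher homogeneity), which the text does not specify. The function names and the toy rows are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def best_split(rows, feature_indices):
    """Steps 3-8: return (feature index, split value, weighted impurity) or None."""
    best = None
    for j in feature_indices:
        for s in sorted({features[j] for features, _ in rows}):
            left = [label for features, label in rows if features[j] < s]
            right = [label for features, label in rows if features[j] >= s]
            if not left or not right:
                continue  # a split must produce two non-empty partitions
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (j, s, score)
    return best


def random_forest_split(rows, p, rng):
    """Steps 1-2 followed by the split search on p randomly chosen variables."""
    sample = bootstrap_sample(rows, rng)
    chosen = rng.sample(range(len(rows[0][0])), p)  # without replacement
    return best_split(sample, chosen)


# Hypothetical toy rows: (feature vector, class label), 0 = illegitimate, 1 = legitimate.
rows = [([1.0, 5.0], 0), ([2.0, 3.0], 0), ([3.0, 4.0], 1), ([4.0, 6.0], 1)]
split_on_all = best_split(rows, [0, 1])
split_random = random_forest_split(rows, p=1, rng=random.Random(0))
```

Growing a full tree repeats `random_forest_split` recursively on each resulting partition, and a forest repeats that for many bootstrap samples and combines the trees' votes.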

Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is the crucial process of conducting early investigations on data to find patterns, identify anomalies, test hypotheses and check assumptions using summary statistics and graphical representations [13].
It is good practice to understand the data first and to extract as many insights from it as possible. EDA is about interpreting the data at hand before using it: before data are analyzed and passed through an algorithm, they must be thoroughly understood. Patterns in the data were recognized, and a decision was made on which factors are crucial and which have little bearing on the result. Every machine learning problem involves EDA. In this section, a cross section of variables in the zero-day malware attack dataset obtained from [14] is examined, and visualizations of the statistical measures of the different parameters are presented. Figure 1 shows the workflow of the research, which is a series of macro steps. Because the dataset contains many features, PCA was employed for feature selection. After feature selection, Random Forest was used to classify the reduced-feature dataset, and the classifier was evaluated to determine the accuracy of the model.
Furthermore, Figure 2 shows the major linker version count in the dataset; versions from 1 to 15 have the highest counts. The major linker version identifies the major release of the linker, the program that combines object files into a single executable or library file. The minor linker version in Figure 3 is the minor release number of the same linker. Linker version data are used to track adjustments and enhancements to the linker program and to ensure its compatibility with other software elements. From Figure 3, minor linker versions between 0 and 25 have the highest counts in the dataset.

Figure 3. MajorLinkerVersion vs Count
Furthermore, file alignment describes the procedure of aligning the data or sections of a file to particular positions or addresses. It guarantees that the starting offset of each section's data within the file is a multiple of the alignment value. Figure 4 depicts the file alignment count in the dataset. Similarly, section alignment makes sure that each section begins at a memory address that is a multiple of the alignment value, in order to improve the loading and operation of the executable file. A section alignment of 4000 had the highest count in the dataset, whereas an alignment of 8000 had fewer than 100 counts and section alignments of 5000, 6000 and 7000 did not occur at all. Figure 5 depicts section alignment vs count in the dataset.
In the context of cybersecurity, entropy typically refers to a measure of the randomness or unpredictability of data. Low entropy denotes patterns or repetition, which can be more easily exploited or attacked, while high entropy indicates a higher degree of unpredictability, making the data more challenging to anticipate or analyze.
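To make the notion of entropy concrete, the following Python sketch computes the Shannon entropy of a byte sequence in bits per byte. The text does not state how the entropy features in the dataset were computed, so the use of byte-level Shannon entropy here is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (range 0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


low = shannon_entropy(b"\x00" * 64)        # one repeated byte: fully predictable
high = shannon_entropy(bytes(range(256)))  # every byte value once: maximally spread
```

Under this measure a section filled with a single repeated byte scores at the bottom of the range, while uniformly distributed bytes, as are typical of compressed or encrypted content, score close to 8 bits per byte.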

Feature Selection
Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of choosing a subset of pertinent features (variables, predictors) for use in model building in machine learning and statistics. It is one of the key elements of feature engineering. A predictive model is created after the number of input variables has been reduced.
The key is to measure how important each candidate feature is and to choose the best. In this research, PCA was adopted for feature selection because, in datasets containing numerous attributes such as the one used here, it can resolve issues arising from interdependent features, a common challenge for other feature selection methods. The algorithm for selection using PCA is as follows.
1. Load the numerical parameters in the dataset.
2. Perform PCA after scaling the data to have a mean of 0 and unit variance.
3. Extract the loadings and the variance explained by each principal component (PC).
4. Note that the rotation attribute of the PCA object contains the loadings, i.e. the coefficients that define the PCs, and that the sdev attribute holds the standard deviation of each principal component.
5. Calculate a measure of importance for each feature based on the loadings and the variance explained. Specifically, take the absolute value of the loadings for the first two PCs and multiply them by the corresponding variance to get a weighted measure of importance for each feature, then store this information in a data frame and sort it in descending order of importance.
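Step 5 can be illustrated with a short Python sketch that starts from loadings and standard deviations that have already been computed (in R these would come from the rotation and sdev attributes). The sketch assumes that "variance" means the proportion of variance explained and that the weighted contributions of the first two PCs are summed. Neither detail is stated in the text, and the loading values below are invented for illustration.

```python
def feature_importance(loadings, sdev, n_components=2):
    """Rank features by |loading| weighted by the variance explained of each PC.

    loadings: dict mapping feature name -> list of loadings, one per PC.
    sdev: list of standard deviations, one per PC.
    """
    variances = [s * s for s in sdev]
    total = sum(variances)
    explained = [v / total for v in variances]  # proportion of variance per PC
    scores = {
        name: sum(abs(row[k]) * explained[k] for k in range(n_components))
        for name, row in loadings.items()
    }
    # Sort in descending order of importance, as in step 5.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical loadings for three features on three PCs.
loadings = {
    "MajorLinkerVersion": [0.8, 0.1, 0.2],
    "SectionAlignment": [0.3, -0.9, 0.1],
    "FileAlignment": [-0.1, 0.2, 0.9],
}
sdev = [2.0, 1.0, 1.0]
ranked = feature_importance(loadings, sdev)
```

`ranked` is a list of (feature, importance) pairs, playing the role of the sorted data frame described in step 5.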
The normalized PC features in the dataset, alongside their importance, are presented in Table 1.
… as an attack that could ground their organizations and could potentially lead to job loss, litigation, low productivity, breach of organizational privacy and even outright business closure, among other undesirable and unforeseen consequences, and were interested in implementing cybersecurity measures and prioritizing them henceforth to prevent such attacks, provided the cost is within affordable limits.

RESULTS AND DISCUSSION
To assess the effectiveness, efficiency and impact of this research in line with its aim, an evaluation was carried out in R. All needed variables were preprocessed to obtain a summary of each feature in the dataset in terms of min, 1st quartile, median, mean, 3rd quartile and max. Figure 12 shows these statistical tendencies. From Figure 15, the errors associated with the generated trees must be identified. This is done by plotting the error against the number of trees, as shown in Figure 16. It can be observed that the error keeps decreasing as the number of trees increases, which indicates that using a larger number of trees in the Random Forest model gives better accuracy. Furthermore, the confusion matrix, presented in Table 2, is used to evaluate the classification accuracy of the model. On the test dataset, 99 of the 104 illegitimate files were correctly classified as illegitimate, while 5 were misclassified as legitimate. For the legitimate class, 54 of the 57 legitimate files were correctly classified, while 3 were misclassified as illegitimate. From the confusion matrix t, sum(diag(t)/sum(t)) = (99 + 54)/(99 + 5 + 3 + 54) = 153/161 ≈ 0.95, i.e. an accuracy of 95%.
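The accuracy calculation can be reproduced with a few lines of Python, shown here as an illustrative sketch rather than the R code used in the study. The counts are those reported above. The row and column orientation (rows as actual class, columns as predicted class) is an assumption about how Table 2 is laid out.

```python
def accuracy(matrix):
    """Overall accuracy: correctly classified cases (diagonal) over all cases."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """Fraction of each actual class that was classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


# Rows: actual illegitimate, actual legitimate.
# Columns: predicted illegitimate, predicted legitimate.
confusion = [
    [99, 5],
    [3, 54],
]
acc = accuracy(confusion)
recalls = per_class_recall(confusion)
```

Here `acc` is 153/161, which rounds to the reported 95%, and `recalls` gives the per-class rates of 99/104 for illegitimate files and 54/57 for legitimate files.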

CONCLUSION
Software vendors and organizations need to check continuously for new vulnerabilities in their software and prepare for zero-day attacks. While it is difficult to prevent such attacks completely, several defensive measures can be taken against this menace. Everyday tools such as antivirus and anti-spyware software are not enough to identify these attacks, and previous work, apart from using only a few metrics for classification, did not address organizations' perceptions of their negative consequences. To bridge this gap, this study examined zero-day malware, a perilous form of cyber threat that exploits system vulnerabilities before they are detected and remedied; such attacks pose severe risks to enterprise security and wreak havoc on organizational efficiency as they proliferate undetected. The study focuses on the classification of zero-day attacks and, in particular, on the use of machine learning, a potent asset in cybersecurity, for that classification. An ensemble machine learning approach was employed, and the Random Forest algorithm was chosen for its record of delivering precise outcomes on intricate datasets. Classifiers were trained to detect known classes and adapt to new ones. The algorithm was used to classify attacks as zero-day or not and yielded 95% accuracy and a 3.8% error rate. This research not only advances the understanding of zero-day attacks but also offers pragmatic insights into organizations' perception of their impact and their eagerness to embrace and prioritize any proffered solutions. Looking ahead, it is crucial to acknowledge the ever-evolving nature of zero-day threats. Organization-specific datasets are pivotal, since security configurations and practices can differ widely among organizations, and such diversity should be explored. Comparative analyses of these datasets could reveal disparities in patterns, shedding light on how distinct security measures influence the occurrence and characteristics of these attacks. Overall, the research aim was met.

Figure 8. Resource Min Entropy vs Count

Figure 9. Resource Max Entropy vs Count
Figure 10 and Figure 11 visualize how each of the independent variables contributes to the class label of the factor variable (i.e. illegitimate, 0, or legitimate, 1). From Figure 10, a graph of Resource Mean Entropy vs Legitimate, it can be observed that resource mean entropy is associated more with the illegitimate class (0) than with the legitimate class (1). Figure 11, which shows Section Mean Entropy vs Legitimate, is a box plot of how section mean entropy is distributed across the two class labels; both classes, 0 and 1, representing illegitimate and legitimate files respectively, tend to have similar distributions.

Figure 12. Statistical tendencies for variables
The data underwent preprocessing to remove noisy and redundant data. The data were then split into a 70% training set and a 30% test set for the machine learning model. Figure 13 is a snapshot of the training set.
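A 70:30 split of this kind can be sketched in Python as follows. The study performed the split in R and does not say whether the rows were shuffled or which seed was used, so the random shuffle, the seed and the function name below are assumptions made only for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test portions."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the preprocessed dataset: ten row identifiers.
rows = list(range(10))
train, test = train_test_split(rows)
```

Every row ends up in exactly one of the two sets, so the model is evaluated only on files it did not see during training.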

Figure 13. Cross section of training sets
The training dataset is fed into the Random Forest algorithm in R to create a trained model that can be used to classify new or different datasets. The outcome is as shown in Figure 14.

Figure 14. Outcome of the training set
Furthermore, a diagrammatic representation of a random forest tree of the model built is shown in Figure 15, which presents the nodes of the random forest model as a hierarchical tree indicating the classes (legitimate and non-legitimate zero-day attack, i.e. 1 and 0).

Figure 15. Random Forest for the model

Figure 16. Error on the RF trees
13: Split the normalized data in 11 in the ratio 70:30;
14: Apply the ML model to 70% of the data in 13;
15: Test the model with the remaining 30% of the data in 13;
16: Evaluate the result of 15;
17: End
The weighted measure of importance is illustrated in Figure 12. It can be seen from Table 1 that PC43, PC31, PC18, PC30, PC44 and PC23 respectively were the least significant components in the features of the dataset as compared to other components.