Forecasting Brown Sugar Production Using k-NN Minkowski Distance and Z-Score Normalization

The production of brown sugar often diverges from market demand: in some periods output exceeds demand and leaves unsold goods, while in others demand surpasses production capacity. This paper addresses the challenge faced by many brown sugar businesses in estimating production yields. A further issue, apart from production uncertainty, is a dataset with a large nominal range. The study focuses on a specific brown sugar producing company in Indonesia. To address the production estimation problem, this research proposes k-NN supervised learning as a forecasting method. Rather than relying on k-NN alone, the study employs z-score normalization to handle the dataset's large nominal range. The production data used for analysis spans March 2019 to February 2022, comprising 144 weekly records. The dataset is divided into training and testing sets using an 8:2 split validation ratio. The proposed method consists of several steps: data normalization using the z-score, k-NN processing based on the Minkowski distance, and a concluding de-normalization step. The results demonstrate the successful application of the proposed method in predicting production levels, with an average margin of error of 3.34%, below the 5% threshold. The evaluation shows that k-NN with z-score normalization is effective in forecasting brown sugar production under uncertainty while addressing the challenge of a large nominal range.


INTRODUCTION
The brown sugar industry represents a traditional household trade that has been passed down through generations. The production process involves simple methods and equipment [1]. Sugar is a crucial staple in the Indonesian diet, consumed for its energy, flavor-enhancing properties, and as a raw material in the food and beverage industry [2]. These factors contribute to the extensive market advantage enjoyed by brown sugar. However, the consumption patterns of brown sugar exhibit significant fluctuations, and demand for brown sugar products often fails to align with production levels.
Several studies have been conducted on the prediction of sugar production, including the work presented in [6], which focuses on predicting crop production yields using K-Means and a modified version of the k-NN algorithm. That research primarily revolves around determining categories through clustering and classifying input data into five specific categories using the modified k-NN approach. Another study [7] employs data mining techniques, including random forest, boosting, and support vector machines, to develop a model for estimating sugarcane yield from meteorological and crop-management variables. Notably, papers [6] and [7] concentrate primarily on the prediction methods, with [6] predicting major crop yields and [7] modeling sugarcane yield from meteorological and crop management variables. However, neither study discusses its preprocessing procedures in detail. We therefore identify a gap in applying preprocessing techniques, particularly data normalization, to brown sugar production datasets.
In this research, the proposed method employs the K-Nearest Neighbor (k-NN) machine learning approach due to its proven accuracy, efficiency, and low error ratio [8]. K-Nearest Neighbor is known for its simplicity of implementation [9], fast training, and robustness even without feature selection [10]. Additionally, it performs well on small datasets [11], [12], making it widely used in forecasting techniques for various problems. While the algorithm is considered straightforward, researchers have continuously sought to improve its predictive performance. K-Nearest Neighbor is applicable to both regression and classification tasks and has been successfully implemented in diverse fields [13]–[15]. Furthermore, paper [16] demonstrates the suitability of k-NN for time series forecasting. Considering the reliability of k-NN in forecasting and the small dataset available for our research, we have chosen k-NN as the algorithm for our forecasting purposes.
The k-NN algorithm operates by selecting the k nearest neighbors based on the shortest distance. Various methods exist for approximating distance, including the Minkowski distance approach. While the Euclidean distance is commonly employed by most k-NN implementations, this research introduces an update by utilizing the Minkowski distance. Mailagaha Kumbure and Luukka (2021) have suggested that the Euclidean distance is often suboptimal for practical problems, and better outcomes can be achieved by generalizing it. By employing the Minkowski distance approach, the proposed method can identify more suitable nearest neighbors for the target [17].
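To illustrate how the Minkowski distance generalizes the more familiar metrics, the sketch below shows that p = 2 recovers the Euclidean distance and p = 1 the Manhattan distance. The function name and example vectors are illustrative, not from the paper.

```python
# Minkowski distance between two feature vectors.
# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
def minkowski_distance(x, y, p=2):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

# For the vectors (0, 0) and (3, 4):
print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
```

Varying p changes which training records count as "nearest," which is the degree of freedom the proposed method exploits.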
Another challenge encountered, apart from the uncertainty in sugar production, is the presence of attributes with a relatively large nominal range. Attributes with large value ranges can unjustly dominate the results of the classification process solely due to their numerical magnitude. Hence, normalization is necessary to balance the range of values [18]. In order to address the challenges associated with estimating production levels under uncertainty and dealing with a dataset characterized by a significant nominal range, this paper proposes a technique that combines the use of the Minkowski distance k-NN and z-score normalization to forecast weekly production levels.

Data Collection
The brown sugar production dataset was collected from March 2019 to February 2022, yielding a total of 144 weekly records. The collected dataset is split into training and testing sets using the common 8:2 ratio, i.e., 80% training data and 20% testing data.
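The 8:2 split can be sketched as follows on a stand-in for the 144 weekly records; a chronological (non-shuffled) split is assumed here, which is common for time series data, though the paper does not state it explicitly.

```python
# Illustrative 8:2 split of 144 weekly records (80% train, 20% test).
records = list(range(144))        # stand-in for the weekly production data
split = int(len(records) * 0.8)   # 144 * 0.8 -> 115 (fraction truncated)
train, test = records[:split], records[split:]
print(len(train), len(test))      # 115 29
```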

Proposed Method
The K-Nearest Neighbor method is widely recognized as a popular classification algorithm used for predicting classes based on neighboring records or samples [19]. The proposed method involves a series of steps, as illustrated in Figure 1.
Step 1: Normalize the data to produce a balanced range of values using z-score normalization.
Step 2: Determine the value of k (the number of nearest neighbors).
Step 3: Calculate the Minkowski distance of each object against the given training data. The Minkowski distance (or Minkowski metric) is a metric in a normed vector space that can be thought of as a generalization of the Euclidean distance and the Manhattan distance [20]:

d(x, y) = (Σ_{i=1}^{n} |x_i − y_i|^p)^{1/p}     (1)

where d is the distance between x and y, n is the number of attributes, x_i and y_i are the i-th attribute values of the two data points, and p is the order (power) of the metric.
Step 4: Sort the objects by ascending distance and select the nearest neighbors within the range of k.
Step 5: Calculate the average of the object values within the k nearest neighbors; this average is the predicted value for the query instance. Equation (2) shows the average calculation of object values in the k range:

ŷ = (1/k) Σ_{j=1}^{k} y_j     (2)
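The sorting and averaging steps can be sketched as follows on toy (distance, target) pairs; the values are illustrative and the distances are assumed to have been computed already with the Minkowski metric.

```python
# Given each training target and its distance to the query,
# sort ascending, keep the k = 3 nearest, and average their targets.
dist_target = [(0.9, 140.0), (0.2, 120.0), (1.7, 180.0), (0.5, 130.0)]
k = 3
nearest = sorted(dist_target)[:k]            # smallest distances first
prediction = sum(y for _, y in nearest) / k  # mean of neighbor targets
print(prediction)                            # (120 + 130 + 140) / 3 = 130.0
```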

Evaluation
The predictions in this paper are evaluated using the margin of error, following paper [12].
The Margin Error (ME) measures the difference between the predicted and the actual data.
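A minimal sketch of a percentage-based margin of error is given below. The exact formula from [12] is not reproduced in this paper, so the definition here, absolute error relative to the actual value, is an assumption consistent with the percentages reported later.

```python
# Assumed definition: ME = |actual - predicted| / actual * 100 (percent).
# The paper cites [12] for its formula but does not reproduce it.
def margin_error(actual, predicted):
    return abs(actual - predicted) / actual * 100.0

print(margin_error(200.0, 194.0))  # 3.0 (i.e., a 3% margin of error)
```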

The predicted result is de-normalized back to the original scale using the stored mean μ and standard deviation σ:

Denorm = (predicted result × σ) + μ

The input data then undergoes calculation using the k-NN method to forecast production results. The steps involved in the production forecasting calculation are as follows:

Normalization
Since the data in many attributes exhibit varying ranges, it is necessary to normalize them. To achieve this, the datasets are transformed using the z-score method, which involves calculating the mean and standard deviation of each attribute.
The modified values resulting from this normalization process can be observed in Table 2. Notably, these processed values demonstrate a balanced range. For reference, Table 1 displays the original values prior to any preprocessing steps.
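The z-score transformation and its inverse (the de-normalization step, Denorm = predicted × σ + μ) can be sketched as follows; the sample values are illustrative, not taken from Tables 1 and 2, and the population standard deviation is assumed.

```python
import statistics

# Illustrative weekly production values.
values = [120.0, 135.0, 150.0, 165.0, 180.0]
mu = statistics.mean(values)        # 150.0
sigma = statistics.pstdev(values)   # population standard deviation

# z-score normalization: (value - mean) / std dev.
z = [(v - mu) / sigma for v in values]

# De-normalization recovers the original scale: z * sigma + mu.
restored = [zi * sigma + mu for zi in z]
print(round(z[0], 3), round(restored[0], 6))  # -1.414 120.0
```

After normalization, every attribute has mean 0 and unit variance, which is what balances the ranges shown in Table 2.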

K-NN
The fundamental principle underlying k-NN is to identify the shortest distance between the data under evaluation and its nearest neighbors. Following the data split and normalization steps, the algorithm calculates the distance between the attribute values of the testing data and each training data record using the Minkowski distance, Eq. (1). The results are then sorted in ascending order of Minkowski distance, so that the smallest distances between test and training data come first. This paper uses k = 3. To calculate the mean object value, Eq. (2) is employed over the range k = 3, with neighbor values derived from the testing data; the selected neighbors and their distances (e.g., 0.210973 for the first neighbor) are presented in Table 3. Once the predicted value has been obtained, the subsequent step is to de-normalize the prediction back to its original, real value.
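The full pipeline described above can be sketched end to end on toy data: z-score normalization of the targets, Minkowski-distance k-NN with k = 3, and de-normalization. All names and values here are illustrative, not the paper's actual data.

```python
import statistics

def minkowski(x, y, p=2):
    # Eq. (1); p = 2 reduces to the Euclidean distance.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Toy training data: one input attribute, one production target.
X_train = [[1.0], [2.0], [3.0], [4.0]]
y_train = [100.0, 120.0, 140.0, 160.0]

# Step 1: z-score normalization of the targets.
mu, sigma = statistics.mean(y_train), statistics.pstdev(y_train)
y_norm = [(v - mu) / sigma for v in y_train]

# Steps 2-5: k = 3, distances to the query, sort, average (Eq. 2).
query, k = [2.5], 3
nearest = sorted(zip(X_train, y_norm), key=lambda t: minkowski(query, t[0]))[:k]
pred_norm = sum(y for _, y in nearest) / k

# Final step: de-normalize, Denorm = predicted * sigma + mu.
prediction = pred_norm * sigma + mu
print(prediction)  # ≈ 120.0 for this toy data
```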

Evaluation
During the evaluation stage, a set of test data is classified using the established classification model. The classification process employs the Minkowski distance measurement. It is important to note that the test data utilized in this phase is distinct from the data used for training. A comparison between the evaluation results of the proposed method and the standard k-NN can be observed in Table 4. Referring to the data presented in Table 4, the prediction results indicate that the margin of error falls below 5%. The utilization of z-score normalization as a preprocessing technique yields an average margin of error of 3.34%. Consequently, it can be inferred that the prediction model, combined with z-score normalization, demonstrates a satisfactory level of accuracy.
For the standard k-NN predictions, the margin of error ranges from 1.9% to 13%. Meanwhile, the k-NN with z-score normalization shows a lower margin of error, varying from 0.23% to 8.56%. Comparing the two methods, it is evident that the k-NN with z-score normalization consistently yields lower margin errors across the predictions. This indicates that the application of z-score normalization effectively reduces the variability and improves the accuracy of the production forecasts.
The results demonstrate that the proposed k-NN with z-score normalization approach provides more reliable predictions compared to the standard k-NN. The average margin of error achieved using the z-score normalization technique is 3.34%, indicating a high level of precision in forecasting the production levels of brown sugar.
Overall, these findings support the effectiveness of employing the k-NN algorithm with z-score normalization for estimating production levels in the brown sugar industry. The reduced margin of error implies improved decision-making capabilities and better planning for brown sugar businesses, ultimately leading to more efficient operations and resource allocation.

CONCLUSION
The K-Nearest Neighbor algorithm can predict the amount of brown sugar production, helping businesses avoid and minimize excess product availability and supporting decisions to increase or reduce brown sugar production. In some cases, the k-NN results show weaknesses caused by the small dataset; future work should therefore collect more data and analyze other values of k. The K-Nearest Neighbor algorithm with z-score normalization as data preprocessing outperforms K-Nearest Neighbor without preprocessing, achieving an average margin of error of 3.34%.