Rice Yield Forecasting: A Comparative Analysis of Multiple Machine Learning Algorithms

Agriculture plays a crucial role in Nigeria's economy, serving as a vital source of sustenance and livelihood for numerous Nigerians. With the escalating impact of climate change on crop yields, it becomes imperative to develop models that can effectively study and predict rice output under varying climatic conditions. This study collected rice yield data from Katsina state, spanning the years 1970 to 2017, sourced from the Nigeria Bureau of Statistics. Additionally, climatic data for the same period were obtained from the World Bank Climate Knowledge portal. Logistic Regression (LR), Artificial Neural Network (ANN), Random Forest (RF), Random Trees (RT), and Naïve Bayes (NB) were employed to develop rice yield prediction models utilizing this dataset. The findings reveal that random forest and random trees exhibited superior classification performance for yield prediction. The developed models offer a promising tool for predicting future rice yields, facilitating proactive measures to ensure food security for the people of the state.


INTRODUCTION
Agriculture plays a crucial role in Nigeria's economy, serving as a primary source of sustenance for the population and a significant means of livelihood for numerous Nigerians [1], [2]. Among the diverse forms of agriculture, crop cultivation stands out as the predominant agricultural activity in Nigeria [3]. Consequently, any failures in crop production have profound implications for families and the overall economy. Notably, rice cultivation emerges as one of the prominent agricultural practices in Nigeria, spanning across almost every state in the country.
Rice holds a prominent position as a staple food in Nigeria, being widely consumed by both the impoverished and affluent segments of society [4]. Its significant consumption in terms of tons per year highlights its vital role in the country [4]. Consequently, any failures in rice production would have a substantial played more significant roles compared to other predictors such as solar radiation, air humidity, soil moisture, and wind speed. [31] examined the predictive accuracy of machine learning and regression approaches for crop production prediction using ten agricultural datasets. The M5-Prime and k-nearest neighbor models demonstrated high levels of accuracy among all the methods considered. Specifically, the M5-Prime model achieved an RMSE of 5.14, an RRSE of 79.46%, and an MAE of 18.12%.
Prediction models for multiple crop yields have also been explored. [32] conducted a study using Gradient Boosting, Support Vector Regression (SVR), and k-Nearest Neighbors, along with crop models, to predict yields for various crops in the Netherlands, Germany, and France. The models incorporated weather, remote sensing, and soil data as input features. Similarly, [33] utilized random forest techniques to forecast cotton yield in Maharashtra, India, highlighting the wide usage of soil, climate, and solar parameters in predicting crop yield, among other factors. [34] focused on predicting potato tuber yield, employing four machine learning algorithms: linear regression, elastic net, k-nearest neighbor, and support vector regression. Their results indicated that Support Vector Regression performed the best, with RMSE values of 5.97, 4.62, 6.60, and 6.17 t/ha for different years.
In Rwanda, [35] used the AquaCrop model to predict maize yield under rainfed agriculture in the Eastern province. The study analyzed various climatic parameters, including temperature, rainfall, and evapotranspiration, together with maize yield. Notably, the research revealed no significant impact of rainfall trends on crop yield during the study period. Additionally, [36] developed a Deep Neural Network-based solution to predict and assess the yield of corn hybrids using environmental and genotype data as part of the 2018 Syngenta Crop Challenge. Using predicted weather data, their model achieved an RMSE of 12% of the average yield and 50% of the standard deviation on the validation dataset. Although numerous machine learning predictions have been conducted on rice yield, further research is still needed in this field.

METHODS
The methods adopted in this work follow the regular data mining procedure, which includes data collection, data preprocessing, choosing learning algorithms, and training them to produce a rice yield prediction model. Figure 1 captures the method conceptually.

Data source
The study location is Katsina state in northern Nigeria, a semi-desert area that borders the Republic of Niger. Data on rice yield in Katsina state were collected from the Nigeria Bureau of Statistics (NBS), while some climatic data were downloaded from the World Bank Climate Knowledge portal. The NBS data contain annual records of rice yield in Katsina state from 1970 to 2017 with the following attributes: elevation, maximum temperature, minimum temperature, wind, relative humidity, and yield (metric tons). The World Bank dataset contains precipitation and average temperature.

Feature selection
Feature selection is a crucial aspect of the project. In this work we use all of the features (attributes) introduced earlier. The features focus on climatic and environmental variables that affect yield in the area. These features were also used to study the relationship between climate change and yield in the southern part of Nigeria [4]. At a continental level, [25] and [1] used these variables. Table 1 gives statistical details of the data: the minimum and maximum values of each attribute, together with the mean and standard deviation.

Data preparation
The data were further transformed so that each attribute has a maximum value of 1; this was done by dividing every value by the highest value of that attribute.
For the training of the model, the first set contains only the six attributes that were initially introduced and a class attribute based on the yield data. The class attribute has three categories: low, moderate, and high. Any yield less than 600 metric tons is categorized as low yield, any yield less than 700 metric tons is categorized as moderate yield, and any yield greater than or equal to 800 metric tons is categorized as high yield. The data were divided into 75% for training and 25% for testing.
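The preparation steps described above can be sketched in a few lines of Python. This is only an illustration, not the Weka workflow used in the study: the function names are hypothetical, the shuffled split and its seed are assumptions, and because the stated thresholds leave yields from 700 up to (but not including) 800 metric tons unassigned, the sketch returns no label for that range rather than guessing one.

```python
import random


def scale_by_max(column):
    """Divide every value by the column maximum so the largest value becomes 1."""
    highest = max(column)  # assumes a non-empty column with a positive maximum
    return [value / highest for value in column]


def yield_class(tons):
    """Map a yield in metric tons to the class labels given in the text.

    The text does not assign a class to 700 <= yield < 800, so None is
    returned for that range instead of inventing a rule.
    """
    if tons < 600:
        return "low"
    if tons < 700:
        return "moderate"
    if tons >= 800:
        return "high"
    return None


def split_75_25(records, seed=0):
    """Shuffle a copy of the records and split it 75% / 25%.

    Shuffling and the seed are assumptions; the text only gives the ratio.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    years = list(range(1970, 2018))  # 48 annual records
    train, test = split_75_25(years)
    print(len(train), len(test))
```

With 48 annual records, a 75/25 split of this kind yields 36 training and 12 testing instances.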

Model development environment
Weka (Waikato Environment for Knowledge Analysis) was used to develop the models. This tool is a data mining suite with several machine learning algorithms for classification, clustering, and association tasks. The software also provides data preprocessing and visualization tools, along with other functionality for transforming data into a form suitable for mining.

Prediction Algorithms
A total of five predictors were used in this work: Logistic Regression (LR), Artificial Neural Network (ANN), Random Forest (RF), Random Trees (RT), and Naïve Bayes (NB). Each of these classifiers has its strengths and weaknesses.
LR is an algorithm that can be used to classify an object into label groups and can also be used as a regression technique [1]. It predicts the probability of an event taking place using a logistic function whose value ranges between 0 and 1. If the probability is higher than a certain threshold, the object is assigned to the class; otherwise it is not.
ANN is a computing algorithm with a large number of interconnected artificial neurons. A neural network works analogously to the human brain [37]-[39]. It consists of computational nodes that receive input and a processing layer that sums the inputs to produce the output. There are several neural network architectures; however, all share three basic layers. The network maps input to output to find patterns in the training data and, in this way, generalizes from training inputs that have already been assigned to predefined classes.
RF is a learning algorithm that generates trees based on the attributes of the dataset, where each tree is itself a classification tree [40]. It first draws several random samples from the complete dataset, and a decision tree is trained on each sample. The predictions of all the decision trees are then combined into a single output: if multiple trees are trained and most of them predict that an object belongs to class Y while one does not, the final random forest prediction is class Y.
RT is a tree formulated from a random sample of other trees, with each tree possessing n random features at each node [41]. The general theory is that each tree has a probability of being sampled.
NB is an algorithm that uses a function to map a given input to an output class, where the classes are labelled 1, ..., n. As a classifier, NB predicts the probability that an item belongs to a certain class [42]. It uses Bayes' theorem, by which the probability of Y given X can be expressed as
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)} \tag{1}$$
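As a worked illustration of Bayes' theorem in this setting, the sketch below computes class posteriors from priors and class-conditional likelihoods. Under the "naive" assumption, the likelihood of a whole feature vector is the product of the per-feature likelihoods. All numbers and function names here are hypothetical and are not drawn from the study's data.

```python
import math


def naive_likelihood(per_feature_likelihoods):
    """P(X | Y) as the product of per-feature likelihoods (naive independence)."""
    return math.prod(per_feature_likelihoods)


def posteriors(priors, likelihoods):
    """Return P(Y | X) for every class Y.

    priors:      {class: P(Y)}
    likelihoods: {class: P(X | Y)}
    P(X) is obtained by summing P(X | Y) * P(Y) over all classes.
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


if __name__ == "__main__":
    # Hypothetical values for two classes and two features.
    priors = {"low": 0.5, "high": 0.5}
    likelihoods = {
        "low": naive_likelihood([0.4, 0.5]),
        "high": naive_likelihood([0.6, 1.0]),
    }
    result = posteriors(priors, likelihoods)
    print(max(result, key=result.get))
```

The predicted class is the one with the largest posterior, which is how the sketch picks its output in the usage example.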

RESULT AND DISCUSSION
In this section, we present the outcomes obtained from several rice yield prediction models that we have developed. The results are depicted in Figure 2, illustrating the correlations among the different variables employed in the models.

Prediction Model
Predictive modeling is a powerful mathematical approach employed to forecast future events or outcomes through the analysis of patterns and trends within a given dataset. It involves the application of various algorithms and techniques to identify relationships and dependencies among the input variables, ultimately enabling accurate predictions to be made based on these patterns. By leveraging historical data and statistical methods, predictive modeling provides valuable insights and predictions that can aid decision-making processes. The following is an overview of the steps involved in constructing the Rice Yield Forecasting model.

Figure 2. Relationship among the attributes of rice yield model
Figure 2 illustrates the relationships between rice yield and various climatic variables. The strongest correlation observed is between rice yield and maximum temperature, with a coefficient of 0.17; this correlation is still considered weak. Our findings align with other studies, such as [4], which also reported no relationship between rice yield and rainfall in southeast Nigeria. Similarly, our results indicate a negative correlation between rice yield and precipitation. Although this may seem counterintuitive, similar findings have been documented by [19] in China. Among all the variables considered, temperature and humidity exhibit a positive relationship with rice yield, while wind and precipitation display a negative correlation.
Figure 3 and Figure 4 depict two of the predictive rice yield models developed in this study. Figure 3 provides a visual representation of the rice yield prediction model based on Artificial Neural Networks (ANN). The model architecture includes input neurons representing wind, relative humidity, precipitation, maximum temperature, minimum temperature, and rice yield. The ANN consists of one hidden layer for processing the input data, while the output layer categorizes the yield into the classes low, moderate, or high. Figure 4 depicts the Random Tree rice yield prediction model. The tree structure consists of branches representing wind, relative humidity, precipitation, maximum temperature, and minimum temperature as the input variables. The leaves of the tree represent the output class, which corresponds to the different yield categories. These graphical representations provide a clear overview of the model structures and the relationships between the input variables and the predicted rice yield.
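The correlation coefficients reported for Figure 2 can be reproduced from the raw columns with a short function. The text does not state which coefficient was used; the sketch below assumes the Pearson coefficient, and the example values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes both sequences vary (non-zero variance); otherwise the
    denominator is zero and the coefficient is undefined.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values, not the study's data.
    max_temperature = [33.1, 34.0, 33.5, 34.8, 33.9]
    rice_yield = [610.0, 590.0, 640.0, 620.0, 600.0]
    print(round(pearson_r(max_temperature, rice_yield), 2))
```

A value near 0, such as the 0.17 reported for maximum temperature, indicates only a weak linear relationship, whereas values near 1 or -1 indicate a strong one.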
Eli Adama Jiya, Umar Illiyasu, at all | 793

Model Accuracy Measure
This paper presents the development of five models. A comprehensive overview of these models, along with their respective accuracy measures, can be found in Tables 2 and 3. Table 2 provides class-based accuracy measures for each model, while Table 3 presents the error measures for each model. In Table 2, RF, RT, and NB all attained a ROC Area value of 1, while ANN scored 0.89 and LR achieved 0.95. This indicates that RF, RT, and NB exhibit a higher capability to correctly predict the class of rice yield compared to ANN and LR.
Interpreting the results, ANN and LR were only able to predict the actual rice yield with 75% accuracy, while NB achieved a prediction accuracy of 91.7%. Both tree algorithms, RT and RF, demonstrated perfect accuracy of 100% in predicting rice yield. Based on these metrics, the tree algorithms (RF and RT) are considered the best models for rice prediction using climatic variables. The error-based metrics can be found in Table 3, providing further insights into the models' performance.
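The accuracy figures above are consistent with a test set of 12 instances (25% of the 48 annual records from 1970 to 2017), since 9 of 12 correct gives 75% and 11 of 12 gives about 91.7%; the text does not state the test-set size explicitly, so this is an inference. The sketch below shows how accuracy and a per-class TP rate are computed from actual and predicted labels. The labels are hypothetical and the function names are illustrative.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def tp_rate(actual, predicted, label):
    """Share of instances of class `label` that were predicted as `label`."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a == label]
    return sum(a == p for a, p in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical test set of 12 instances, not the study's data.
    actual = ["low"] * 4 + ["moderate"] * 4 + ["high"] * 4
    predicted = list(actual)
    predicted[0] = "moderate"  # a single misclassified instance

    print(round(100 * accuracy(actual, predicted), 1))  # 91.7
    print(tp_rate(actual, predicted, "low"))            # 0.75
```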

Discussion
This study examines the performance of five distinct models for predicting rice yield based on climatic variables. The models were evaluated using both class-based and error-based metrics to assess their accuracy in predicting class instances and numerical values, respectively. The class-based metrics revealed that Random Forest (RF) and Random Trees (RT) surpassed the other models by achieving a TP rate of 1, indicating their precise predictions of rice yield classes. Naïve Bayes (NB) followed with a TP rate of 0. 19 Interpreting the error measures, both RT and RF achieved perfect accuracy of 100% in predicting rice yield, while ANN and LR reached 75% accuracy. NB demonstrated higher accuracy at 91.7%. Overall, the tree algorithms, specifically RF and RT, displayed exceptional performance in predicting rice yield using climatic variables, with RT being slightly superior due to its lower error.
These findings strongly suggest that the tree algorithms, particularly RT, are the most effective models for accurately predicting rice yield based on climatic variables. They outperformed the other models in both class-based and error-based metrics. This study underscores the significance of utilizing diverse evaluation metrics to comprehensively assess the predictive performance of models.
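For reference, the error-based metrics discussed here (MAE, RMSE, RAE, and RRSE) can be written out as a short sketch using their standard definitions. How Weka derives these values for nominal class predictions is not detailed in the text, so the example below applies the definitions to plain numeric targets as a simplifying assumption; the function name and values are illustrative.

```python
import math


def error_metrics(actual, predicted):
    """Return MAE, RMSE, RAE and RRSE for numeric targets.

    RAE and RRSE compare the model's error with that of a baseline that
    always predicts the mean of the actual values, so they are undefined
    when all actual values are identical.
    """
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    abs_base = sum(abs(a - mean_actual) for a in actual)
    sq_base = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / abs_base,
        "RRSE": math.sqrt(sq_err / sq_base),
    }


if __name__ == "__main__":
    # Hypothetical values: a perfect predictor scores 0 on all four metrics.
    print(error_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

A model that predicts every test instance exactly therefore scores 0 on all four metrics, while a model no better than always predicting the mean scores an RAE and RRSE of 1 (100%).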

CONCLUSION
This paper developed rice yield prediction models for Katsina state in Nigeria using climatic data and rice yield data. The work used five machine learning algorithms: Logistic Regression (LR), Artificial Neural Network (ANN), Random Forest (RF), Random Trees (RT), and Naïve Bayes (NB). The result of the data analysis reveals that precipitation has no significant relationship with rice output, whereas temperature has a closer relationship (Figure 2). Also, of the five models, Random Trees has the highest accuracy in predicting rice yield, with MAE, RMSE, RAE, and RRSE all equal to 0 and 100% accuracy, while the Neural Network performed more poorly than the other models. We therefore recommend that, as future climatic variables for Katsina state are predicted by various agencies within and outside Nigeria, our model be used alongside them to predict rice yield in the state. This will help the government to plan and will also help ensure food security for the people of the state.