Forecasting Diabetes Correlated Non-alcoholic Fatty Liver Disease by Exploiting Naïve Bayes Tree

INTRODUCTION: In recent years, non-alcoholic fatty liver disease (NAFLD) has emerged as a chronic disease of major concern. In persons with NAFLD, fat accumulates in the liver cells. Because diabetes is a common ailment among people of all ages, it is critical to recognize and prevent its adverse effects, of which NAFLD is one. OBJECTIVES: To select a relevant dataset with appropriate features, apply ensemble algorithms to the prediction task, and identify the method with the best performance. METHODS: In addition to the ensemble approaches bagging, random forest, and AdaBoost, the individual classifiers naive Bayes (NB) and C4.5 decision tree were considered. These ML techniques were compared with the proposed NB tree algorithm, a combination of C4.5 and naive Bayes. RESULTS: The following evaluation parameters were computed for each algorithm: accuracy, detection rate, negative predictive value (NPV), false negative rate (FNR), and false positive rate (FPR). The algorithms were then compared on these metrics. The NB tree was found to be the best method, with 97.55% accuracy, 0.4853 detection rate, 0.9615 NPV, 0.0388 FNR, and 0.0099 FPR. CONCLUSION: The NB tree outperformed the individual naive Bayes and C4.5 classifiers as well as the other techniques studied. The developed algorithm could be applied in NAFLD-related research.
Keywords: accuracy, detection rate, NPV, FNR and FPR, diabetes mellitus (DM).


Introduction
Diabetes mellitus has recently emerged as one of the most common chronic diseases. People are becoming more susceptible to it due to modern lifestyle factors such as poor eating habits and a lack of physical activity. When blood glucose levels are not properly maintained, various complications follow. One such complication is non-alcoholic fatty liver disease (NAFLD), which occurs in diabetic people who consume little or no alcohol. Fat accumulates in the liver cells of NAFLD patients, and there is a risk of liver injury and inflammation [1]. A healthy liver contains a limited amount of fat; when the amount exceeds that limit, the condition is called fatty liver. Figure 1 illustrates the difference in appearance between a healthy liver and a fatty liver. The liver performs vital functions in the human body, including albumin and clotting factor production, blood detoxification, nutrient and drug processing, storage of fat, vitamins, and bile, and glucose production. Detecting liver illnesses early is therefore crucial to reducing their negative effects [2]. NAFLD occurs with both type 1 and type 2 diabetes; however, people with type 2 diabetes are more vulnerable to it than people with type 1 diabetes. Alanine aminotransferase (ALT) levels are abnormal in 20% of children with type 2 diabetes [3]. According to a study [4], NAFLD affects 50-70% of type 2 diabetic individuals and 50% of type 1 diabetic patients. The same study found that diabetic patients have a higher risk of developing advanced NAFLD than non-diabetic ones.

Figure 1. Visualization of healthy liver and fatty liver
If NAFLD is not treated appropriately, cirrhosis develops and the liver is damaged. Cirrhosis can result in ascites, hepatic encephalopathy, enlarged esophageal veins, liver malignancy, and end-stage liver failure. In end-stage liver failure, liver function declines or stops. Ascites is a condition in which fluid accumulates in the abdomen. Slurred speech, confusion, and tiredness are some of the symptoms of hepatic encephalopathy [1]. Early identification of NAFLD therefore helps avert disease progression. Symptoms of NAFLD include enlarged blood vessels and spleen, red hands, jaundice, lethargy, and abdominal enlargement. A person who notices these symptoms should see a doctor to avoid the disease's worsening effects. Obesity, high cholesterol, diabetes, high blood pressure, and other features of metabolic syndrome can contribute to NAFLD. Diabetic patients are at particularly high risk of developing the disease [5]. For the diagnosis of NAFLD, scans and blood tests for liver function are commonly used. The liver function test parameters are alkaline phosphatase (ALP), aspartate transaminase (AST), alanine transaminase (ALT), gamma-glutamyl transferase (GGT), bilirubin, and albumin. Liver illness is diagnosed on the basis of the aminotransferases AST and ALT, which aid in detecting hepatocellular damage. Bilirubin is a yellow pigment found in human stool and blood. A high bilirubin level indicates jaundice, a marker of NAFLD or of other liver diseases, including hepatitis. The liver produces almost 10 grams of albumin per day, and abnormal albumin levels suggest liver illness. A patient with NAFLD should maintain a healthy weight, exercise regularly, and live a healthy lifestyle [6].
Because diabetes is such a common chronic condition, accurate prediction of its complication NAFLD is required. Such a forecast helps diabetic patients start therapy at the appropriate moment. In this context, several ML algorithms are examined to construct a suitable predictive model. In this paper, an algorithm called NB tree, which is an ensemble of naive Bayes and the C4.5 decision tree, is developed. The suggested approach is compared with naive Bayes, the C4.5 decision tree, bagging, random forest, and adaptive boosting. The comparison is based on accuracy, detection rate, NPV, FNR, and FPR. The algorithms are implemented in R. After examining the findings and the comparative analysis, the algorithm with the best predictive performance was identified and is recommended for producing more accurate NAFLD predictions.

Literature survey
Reddy et al. [7] predicted the hospital readmission of diabetic patients using a deep belief network, a deep learning technique. The existing algorithms gradient boosting, AdaBoost, logistic regression, decision tree, and random forest were also implemented. The proposed deep belief network outperformed the other techniques in specificity, accuracy, NPV, and precision, with values of 0.6644, 0.6917, 0.7032, and 0.6814, respectively; logistic regression performed better only in terms of F1-score (0.7833). Sarwar et al. [8] implemented the machine learning techniques random forest, naive Bayes, SVM, decision tree, logistic regression, and KNN to predict diabetes. The Pima diabetes dataset was selected and a 70% percentage split was applied; the techniques were trained on the resulting training data. KNN and SVM both obtained the highest accuracy, 77%. Reddy et al. [9] detected diabetes using a voting strategy on the Pima diabetes dataset. They implemented decision tree, SMO, naive Bayes, AdaBoost-M1, and SVM on the training data obtained through k-fold cross-validation and evaluated them on the test dataset. After applying the voting strategy to all the algorithms, an overall accuracy of 95% was observed. Vijayan and Anjali [10] implemented naive Bayes, decision tree, SVM, decision stump, and AdaBoost to predict diabetes, using the diabetes dataset from the UCI repository. Each algorithm other than AdaBoost was used as a base learner to obtain a separate AdaBoost model. AdaBoost with a decision stump obtained an accuracy of 80.72% and was identified as the best-performing algorithm. Reddy et al. [11] predicted single or combined correlated ailments related to diabetes. The ailments selected in their work were retinopathy, cardiovascular disease, and nephropathy.
An RDAD dataset taken from a medical centre was used to predict the disease. The proposed technique was fuzzy logic combined with k-fold cross-validation. Fuzzy logic obtained 97% overall accuracy with a computation time of 80 ms and was the best-performing technique among the schemes compared. Kulkarni et al. [12] considered several machine learning algorithms to predict NAFLD: decision tree, SVM, logistic regression, random forest, ANN, and gradient boosting. The dataset, obtained from the electronic health records of a hospital in Pune, relates to liver disease patients. Data cleaning was performed first, followed by feature selection and then algorithm implementation. After feature selection, diabetes was identified as the most important factor. Random forest performed best, with 85% accuracy and an AUROC of 1.0. Reddy et al. [13] reviewed various data mining techniques used for predicting diabetes and its correlated ailments. The works reviewed compared methods such as C4.5, image Net, I-SVM, fuzzy, and neuro cognitive. With k-fold cross-validation, image Net obtained the best accuracy and was therefore identified as the best of the data mining techniques reviewed. Deo and Panigrahi [14] worked on hepatic steatosis prediction. Hepatic steatosis means fatty liver, which may be alcoholic or non-alcoholic in origin. The NHANES-III dataset was used with a 70% percentage split. SVM with medium and fine Gaussian kernels, bagging, and the boosting techniques gentle boost and AdaBoost were implemented with 10-fold cross-validation. The gentle boosting tree achieved an average accuracy of 79.03%, sensitivity of 75.88%, specificity of 81.86%, and AUC of 0.79, and was recognized as the best. Chen and Zhao [15] proposed a multi-layer random forest (MLRF) to predict fatty liver disease, using a real-world fatty liver disease dataset.
Before the proposed technique was applied, data pre-processing, normalization, and dimensionality reduction were performed. The proposed technique was then compared with SVM, naive Bayes, logistic regression, and a back-propagation neural network. MLRF was found to be the best algorithm, with 98.63% accuracy. Perveen et al. [16] predicted both the risk of NAFLD and the progression of the disease using a decision tree. Their dataset consists of electronic medical records, and C4.5 was the decision tree algorithm employed. It was applied to both balanced and unbalanced datasets and performed better on the unbalanced dataset, with 76.2% accuracy, 66.9% precision, 73.5% recall, 67.6% F-measure, 0.299 MCC, and 73.1% AUROC. Wu et al. [17] used logistic regression, ANN, naive Bayes, and random forest with 3-, 5-, and 10-fold cross-validation to predict fatty liver. A real-world fatty liver dataset from a hospital was used; it contains 577 records in total, of which 377 are positive for FLD. The best accuracy and AUROC, 87.48% and 0.925 respectively, were obtained by random forest with 10-fold cross-validation.
The main aim of the work of Wu et al. [18] is to sense motor imagery effectively from EEG signals for an interface between mind and system. This work is very useful to patients with motor brain problems. It illustrates a technique based on the NB algorithm for analysing brain signals, and the results of the proposed model are better than those of its counterparts. Islam et al. [19] applied several ML techniques to develop a predictive model for fatty liver disease (FLD). They applied SVM, ANN, random forest, and logistic regression with 10-fold cross-validation to a liver patient dataset containing 994 records (533 female and 461 male). Among these techniques, logistic regression performed best, with 76.30% accuracy, 74.10% sensitivity, and 64.90% specificity.
Details of the objectives, dataset, and system architecture of this work are given in Section 3. Section 4 explains all the algorithms used in this work; the proposed NB tree is elaborated and the remaining algorithms are briefly described. The results are analysed and discussed in Section 5, where the best-performing algorithm is identified through a valid comparison. The conclusion of this work is provided in Section 6, followed by the references.

Methodology
The objectives, dataset, and system architecture of the proposed methodology are described in this section.

Objectives of the work
Non-alcoholic fatty liver disease affects a large share of diabetic patients, so accurate prediction of this disease is needed to help a doctor or physician make better decisions about a patient's condition. Machine learning has been widely used to predict various diseases in recent years and is a cost-effective method for prediction. Hence, several machine learning algorithms were chosen to predict the disease. The aims of this work are:
• To obtain a dataset that helps to develop the best predictive model for non-alcoholic fatty liver disease.
• To use an efficient ensemble algorithm for disease prediction.
• To find the algorithm with the best performance.
The dataset used is described in the following subsection. This paper proposes an ensemble method called NB tree and compares it with the base algorithms naive Bayes and C4.5 decision tree and with the bagging, random forest, and AdaBoost techniques. All of these algorithms were programmed in R. Accuracy, detection rate, negative predictive value, false negative rate, and false positive rate are used to analyse and compare the algorithms and determine the best-performing one. In this comparison, the NB tree outperformed the others.

Dataset description
The dataset considered has 18 features and 1022 records. It is a binary classification dataset: the target variable has two classes, NAFLD positive and NAFLD negative. A detailed description of the dataset is given in table 1. The attribute hepatitis B indicates a liver infection. The attributes ALT, AST, GGT, ALP, and albumin are the components of the liver function test; ALT, AST, GGT, and ALP are liver enzymes, and albumin is a protein produced by the liver. Abnormal values of these components indicate fatty liver or another liver disease. Triglyceride (TG) is used to detect cholesterol in a person; higher TG levels indicate high cholesterol, which is one of the risk factors for NAFLD. The attributes total bilirubin, direct bilirubin, and indirect bilirubin are obtained from the bilirubin blood test, and abnormal values indicate liver disease or liver damage. The histogram and density plots are shown in figure 2. The pink bars represent the histogram of each attribute, and the dashed line represents its density plot. The figure visualizes the distribution of the dataset considered in this work.

Table 1. Dataset description
Hepatitis B: Whether the patient tested positive or negative for hepatitis B. 0 = negative, 1 = positive.
ALT: Alanine aminotransferase, in IU/L. Normal range: 0 to 45 IU/L.
AST: Aspartate aminotransferase, in IU/L. Normal range: 0 to 35 IU/L.
GGT: Gamma-glutamyl transferase, in IU/L. Normal range: 0 to 30 IU/L.
ALP: Alkaline phosphatase, in IU/L. Normal range: 30 to 120 IU/L.
TG: Triglycerides, on the basis of which cholesterol is detected, in mmol/L. Normal range: < 1.7 mmol/L.
TBIL: Total bilirubin, in μmol/L. Normal range: 1.71 to 20.5 μmol/L.
DBIL: Direct bilirubin, in μmol/L. Normal range: < 5.1 μmol/L.
IBIL: Indirect bilirubin, calculated as TBIL - DBIL, in μmol/L.
Albumin: In g/L. Normal range: 40 to 60 g/L.
NAFLD: Whether the patient has non-alcoholic fatty liver disease. 1 = positive, 0 = negative.

System architecture
The system architecture in figure 3 shows how the proposed approach develops an effective model for NAFLD. Data pre-processing is performed first, followed by an 80% percentage split that produces the training and test datasets. All six algorithms are implemented in R and trained on the training dataset of 818 instances. The trained model obtained for each algorithm is then evaluated on the test dataset of 204 instances. The results are used to compare the algorithms on the performance metrics accuracy, detection rate, NPV, FNR, and FPR, and the algorithm with the best performance is identified from this analysis. In this work, the NB tree is recognized as the best-performing technique.
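The split sizes quoted above can be reproduced with a short sketch. It is written in Python for illustration, whereas the paper reports using R, and the function name `percentage_split`, the seed, and the choice to round the training size up are all assumptions made here; the paper does not state how the 818/204 partition was drawn.

```python
import math
import random


def percentage_split(n_records, train_fraction=0.8, seed=42):
    """Return (train_indices, test_indices) for a shuffled percentage split."""
    indices = list(range(n_records))
    # Shuffle reproducibly so the split does not depend on record order.
    random.Random(seed).shuffle(indices)
    # Rounding up is an assumption; it reproduces the 818/204 sizes in the text.
    n_train = math.ceil(train_fraction * n_records)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = percentage_split(1022)
print(len(train_idx), len(test_idx))  # 818 204
```

Each model would then be fitted on the records at `train_idx` and scored only on the records at `test_idx`.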

Algorithms used
In this section, the NB tree is explained in detail and the remaining five algorithms are briefly described. The remaining algorithms are naive Bayes, C4.5 decision tree, bagging, random forest, and AdaBoost.

Naive bayes classifier
It is a classification algorithm based on a probabilistic approach. The algorithm calculates the posterior probability of each group (class) of the target variable using formula (1). The term P(G|A) is the posterior probability of group G given the set of predictor variables A = {a1, a2, ..., an}. The terms P(A|G) and P(G) are the likelihood and the prior probability of group G, and P(A) is the evidence, which is the same for every group. Under the naive assumption that the predictors are conditionally independent given the group, P(A|G) is the product of the individual terms P(ai|G). The predicted value is the group with the highest posterior probability [20].
$$P(G \mid A) = \frac{P(A \mid G)\, P(G)}{P(A)} \tag{1}$$
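As an illustration of the posterior calculation, the following Python sketch handles categorical predictors only. The function name, the toy records, and the Laplace smoothing constant `alpha` are assumptions introduced for this example; they are not taken from the paper, which implemented its models in R.

```python
from collections import Counter, defaultdict


def naive_bayes_posteriors(rows, labels, query, alpha=1.0):
    """Posterior P(G | A) for each group G, for categorical predictors.

    rows   : list of tuples of categorical predictor values
    labels : list of group labels, one per row
    query  : tuple of predictor values to classify
    alpha  : Laplace smoothing constant (an assumption, not from the paper)
    """
    group_counts = Counter(labels)
    # value_counts[(feature index, group)] counts the values seen for that pair.
    value_counts = defaultdict(Counter)
    distinct_values = [set() for _ in query]
    for row, group in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, group)][value] += 1
            distinct_values[j].add(value)

    scores = {}
    for group, count in group_counts.items():
        score = count / len(labels)  # prior P(G)
        for j, value in enumerate(query):
            # Smoothed likelihood P(a_j | G); predictors are assumed independent.
            numerator = value_counts[(j, group)][value] + alpha
            denominator = count + alpha * len(distinct_values[j])
            score *= numerator / denominator
        scores[group] = score

    evidence = sum(scores.values())  # P(A), the normalising constant
    return {group: score / evidence for group, score in scores.items()}


# Hypothetical toy records: (ALT level, hepatitis B status) -> NAFLD class.
toy_rows = [("high", "yes"), ("high", "no"), ("normal", "no"), ("normal", "no")]
toy_labels = ["pos", "pos", "neg", "neg"]
posteriors = naive_bayes_posteriors(toy_rows, toy_labels, ("high", "yes"))
predicted = max(posteriors, key=posteriors.get)
print(predicted)  # pos
```

The predicted group is simply the key with the largest posterior, which mirrors the decision rule described above.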

C4.5 decision tree
It is also known as the J48 decision tree algorithm (the name of its implementation in Weka) and is used for classification. The splitting criterion, gain ratio, is applied repeatedly to split the decision tree until the leaf nodes are reached. Gain ratio is a normalization of information gain, the splitting criterion employed in the ID3 decision tree algorithm; the normalization is done using the split information value. The entire process of constructing a C4.5 decision tree is part of constructing the NB tree, so it is explained in detail in the NB tree algorithm. The limitation of the decision tree is overfitting [21].

Bagging
It is an ensemble technique based on weak classifiers. First, several bootstrap datasets are constructed by sampling the training data with replacement. A model trained on one bootstrap dataset is called a weak learner because its individual performance is limited. The weak learners are combined to obtain a final model with better performance. The ensemble uses a voting strategy to predict the target class: the class predicted by the largest number of weak learners is chosen [22].
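The bootstrap-and-vote idea can be sketched in Python as follows. The helper names and the `fit` callable are hypothetical and chosen for illustration; the weak learner used in the usage example is a deliberately trivial majority-class model, not one of the classifiers evaluated in the paper.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) items with replacement (one bootstrap dataset)."""
    return [rng.choice(records) for _ in records]


def majority_vote(predictions):
    """Return the class predicted most often by the weak learners."""
    return Counter(predictions).most_common(1)[0][0]


def bagging_predict(records, fit, query, n_learners=25, seed=0):
    """Train one learner per bootstrap dataset and combine them by voting.

    `fit` is a hypothetical callable: it takes a list of records and
    returns a model, i.e. a function mapping a query to a class label.
    """
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(records, rng)) for _ in range(n_learners)]
    return majority_vote([model(query) for model in models])


def fit_majority_class(sample):
    """Trivial weak learner: always predict the sample's most common label."""
    label = majority_vote([lab for _, lab in sample])
    return lambda query: label


# Hypothetical toy records of the form (features, label).
toy = [((1,), "pos")] * 9 + [((0,), "neg")]
print(bagging_predict(toy, fit_majority_class, query=(1,)))
```

In practice `fit` would train a decision tree or another base classifier on each bootstrap dataset; only the sampling and voting logic is shown here.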

Random forest
It is also an ensemble technique. Bootstrap datasets are constructed in the same way as in bagging, and one decision tree is trained on each bootstrap dataset, with a random subset of the attributes considered at each split. A voting strategy is then applied to the outputs of all the decision trees: the value predicted by the largest number of trees is the predicted target value [23].

AdaBoost
It is an ensemble technique that works similarly to bagging. In AdaBoost, the weak classifiers are decision stumps, which are single-level decision trees. The difference from bagging is that AdaBoost assigns weights to the instances used to train the decision stumps. The weight of an instance is increased if the decision stump trained on it misclassifies it, and the updated weights are used when constructing the next decision stump. Instead of simple voting, AdaBoost uses a weighted combination of the weak classifiers, and the final prediction of the strong classifier is based on it [24].
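The reweighting step can be illustrated with one round of the standard discrete AdaBoost update, sketched in Python. The paper does not give its update formulas, so the function name `adaboost_round` and the exact form of the update are assumptions made for this example.

```python
import math


def adaboost_round(weights, is_correct):
    """One discrete AdaBoost round: return (stump_weight, new_instance_weights).

    weights    : current instance weights, summing to 1
    is_correct : booleans, True where the stump classified the instance correctly
    """
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    # Weight of this stump in the final weighted vote.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Raise the weight of misclassified instances and lower the rest.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Four equally weighted instances; the stump misclassifies only the last one.
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
print(round(alpha, 4), [round(w, 4) for w in new_weights])
# 0.5493 [0.1667, 0.1667, 0.1667, 0.5]
```

After the update, the misclassified instance carries half of the total weight, so the next decision stump is pushed to classify it correctly.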

NB tree
The NB tree method combines naive Bayes and the C4.5 decision tree. The proposed technique uses the C4.5 (J48) algorithm, an extension of the ID3 decision tree, to build the tree. In C4.5, the attribute selection measure used for splitting is gain ratio. Once the decision tree (DT) has been constructed, the standard naive Bayes algorithm is applied at its leaves. Each leaf node stores the probability of every class for a specific instance, and the algorithm's result is the class with the highest probability. This procedure is elaborated in the algorithm given below [25].

Algorithm: NB tree
INPUT: Dataset
OUTPUT: Predictions made for the input data
ASSUMPTIONS: g holds the different categories of the target attribute; h holds one category from g at a time; P_h is the probability that an instance belongs to class h; A is a particular predictor attribute; D is the set of instances; k represents the different categories of attribute A; p holds one category from k at a time; |D_p| is the number of instances with category p of attribute A.
Step 1: Start
Step 2: For each attribute A in the input dataset:
a. Calculate entropy for the target and predictor attributes.
$$\mathrm{Entropy}(D) = -\sum_{h \in g} P_h \log_2 P_h \tag{2}$$
$$\mathrm{Entropy}_A(D) = \sum_{p \in k} \frac{|D_p|}{|D|}\, \mathrm{Entropy}(D_p) \tag{3}$$
b. Calculate information gain.
$$\mathrm{Gain}(A) = \mathrm{Entropy}(D) - \mathrm{Entropy}_A(D) \tag{4}$$
c. Calculate split information.
$$\mathrm{SplitInfo}(D, A) = -\sum_{p \in k} \frac{|D_p|}{|D|} \log_2 \frac{|D_p|}{|D|} \tag{5}$$
d. Calculate gain ratio.
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(D, A)} \tag{6}$$
Step 3: Repeat step 2 until all the predictor attributes are processed, then end the for loop and go to step 4.
Step 4: The predictor attribute with the highest gain ratio is selected as the splitting criterion.
Step 5: After the decision tree is constructed, apply naive Bayes at the leaves of the tree.
Step 6: The target class with the highest probability from naive Bayes is the predicted output for the given input.
Step 7: Stop
Steps 2 and 3 together represent a FOR loop whose purpose is to calculate the gain ratio of each predictor attribute. The gain ratio is a normalization of information gain, the splitting criterion used in ID3.
Step 2a calculates the entropy of the target and predictor attributes using formulas (2) and (3) respectively.
Step 2b calculates the information gain of each predictor attribute using formula (4). Steps 2c and 2d calculate the split information using formula (5) and the gain ratio using formula (6) respectively.
Step 4 identifies the attribute with the highest gain ratio for splitting the tree. In step 5, the naive Bayes algorithm is applied at the leaves of the decision tree, with the probability of each class calculated using formula (1). The target class with the highest probability is the predicted output of the NB tree [26].
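Steps 2a to 2d can be sketched in Python for a single categorical attribute. The sketch assumes the standard C4.5 definitions of entropy, information gain, split information, and gain ratio for the quantities that the steps refer to as formulas (2) to (6); the function names and toy values are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of categories, as in formula (2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of one categorical attribute (steps 2a to 2d)."""
    n = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    # Step 2a, formula (3): target entropy after splitting on the attribute.
    conditional = sum(len(part) / n * entropy(part) for part in partitions.values())
    gain = entropy(labels) - conditional  # step 2b, formula (4)
    split_info = entropy(values)          # step 2c, formula (5)
    # Step 2d, formula (6); an attribute with one category cannot split the data.
    return gain / split_info if split_info > 0 else 0.0


target = ["pos", "pos", "neg", "neg"]
informative = ["high", "high", "normal", "normal"]  # separates the classes
uninformative = ["a", "b", "a", "b"]                # independent of the class
print(gain_ratio(informative, target), gain_ratio(uninformative, target))  # 1.0 0.0
```

Step 4 then amounts to choosing the attribute whose `gain_ratio` value is largest.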

Results analysis & Discussion
This section presents the outcomes of implementing the discussed strategies in R. The computed results are discussed, and all algorithms are compared to determine the best one. The confusion matrix for the NB tree is shown in table 2: the true positive, true negative, false positive, and false negative values are 99, 100, 1, and 4 respectively. These values are used to compute the evaluation parameters shown in the subsection that follows. The remaining algorithms are evaluated in the same way, and a comparison is then made to determine which algorithm performs best.
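The evaluation parameters can be recomputed from these four counts. The Python sketch below assumes the usual confusion-matrix definitions, with detection rate taken as true positives over all instances, which matches the reported value of 0.4853; the function name `evaluation_metrics` is introduced here for illustration.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute the five evaluation parameters from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions over all instances
        "detection_rate": tp / total,    # true positives over all instances
        "npv": tn / (tn + fn),           # correct among negative predictions
        "fnr": fn / (fn + tp),           # missed positives among actual positives
        "fpr": fp / (fp + tn),           # false alarms among actual negatives
    }


nb_tree = evaluation_metrics(tp=99, tn=100, fp=1, fn=4)
for name, value in nb_tree.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.9755
# detection_rate: 0.4853
# npv: 0.9615
# fnr: 0.0388
# fpr: 0.0099
```

The same function can be applied to the confusion matrix of each of the other algorithms to produce a like-for-like comparison.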

Performance metrics
Accuracy
This metric measures the correct classification rate. Accuracy is the ratio of correct predictions to the total number of instances and takes a value between 0 and 1; a value nearer to 1 indicates good performance. Formula (7) gives the accuracy of the prediction model.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
Accuracy for NB tree = (99+100) / (99+100+1+4) = 0.9755, that is 97.55%, which indicates good performance.

Detection rate
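This metric measures the ability of a model to detect the positive group of the target variable. The detection rate is the ratio of correct positive predictions to the total number of instances. Its value lies between 0 and 1 and cannot exceed the proportion of positive instances in the data; a higher value represents better performance.
$$\text{Detection rate} = \frac{TP}{TP + TN + FP + FN}$$
Detection rate for NB tree = 99 / (99+100+1+4) = 0.4853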

NPV
NPV is the ratio of correct negative predictions to the total number of negative predictions. Its value lies between 0 and 1, and a value close to 1 represents good performance.
$$\text{NPV} = \frac{TN}{TN + FN}$$
NPV for NB tree = 100 / (100+4) = 0.9615

FNR or miss rate
FNR, also called the miss rate, is the ratio of positive instances incorrectly predicted as negative to the total number of actual positive instances. Its value lies between 0 and 1, and a value nearer to 0 indicates good performance.
$$\text{FNR} = \frac{FN}{FN + TP}$$
FNR for NB tree = 4 / (4+99) = 0.0388

FPR or fall out
FPR, also called fall-out, is the ratio of negative instances incorrectly predicted as positive to the total number of actual negative instances. Its value lies between 0 and 1, and a value nearer to 0 means good performance.
$$\text{FPR} = \frac{FP}{FP + TN}$$
FPR for NB tree = 1 / (1+100) = 0.0099

Table 4 compares the algorithms examined in this study with algorithms from the related literature. The publications listed in the table used different datasets, and the dataset used in this work differs from them as well. In [12], C4.5 and random forest were used, with random forest proving the most effective. In [14], bagging and AdaBoost were compared with other techniques, but neither achieved the best performance. The NB classifier considered in the proposed study was also used in [15] and [17]. Random forest was found to be the best algorithm in [12] and [17], and it was also applied in [19]. The algorithms employed in the proposed work are drawn from these studies and compared with the proposed NB tree technique.

Table 4. Comparison with related work (excerpt)
Study: Wu et al. [17]
Algorithms: Logistic regression, ANN, naive Bayes, and random forest.
Description: Implemented four ML techniques to predict FLD using 3-, 5-, and 10-fold cross-validation. The results for the three cross-validation settings are compared on the basis of accuracy and AUROC.
Best algorithm: Random forest
Results obtained: Accuracy 87.48% and AUROC 0.925

Study: Islam et al. [19]
Algorithms: SVM, ANN, random forest, and logistic regression.
Description: Developed a predictive model for fatty liver using the algorithms with 10-fold cross-validation. Accuracy, sensitivity, and specificity are used to evaluate and compare the algorithms.
Best algorithm: Logistic regression
Results obtained: Accuracy 76.30%, sensitivity 74.10%, and specificity 64.90%

As none of the works in the literature used the NB tree, comparing existing algorithms with it helps to identify the most effective algorithm for predicting NAFLD. The result analysis showed that the proposed NB tree outperformed its individual component classifiers and the other ensemble techniques taken from the literature. It was therefore the best-performing algorithm in this study, with 97.55% accuracy.
This accuracy is also higher than that reported for most of the existing algorithms in the related literature, although those works used different datasets. The dataset features in this work also play a vital role in prediction, as they correspond to liver function test results and several risk factors of NAFLD. It would therefore be preferable to consider the NB tree over the other three ensemble algorithms (bagging, random forest, and AdaBoost) in further work related to NAFLD.

Conclusion
People with type 1 or type 2 diabetes are more likely to suffer from non-alcoholic fatty liver disease than those without diabetes. Clinicians can use the disease prediction to make quick and efficient treatment decisions and so limit further complications. The NB tree, an ensemble of naive Bayes and the C4.5 decision tree, was the best method for this task among those studied. The model was superior to random forest, bagging, and AdaBoost in terms of accuracy. Naive Bayes and C4.5, the underlying algorithms of the ensemble, were also compared with the NB tree. After a rigorous comparative study, the NB tree was identified as the best-performing algorithm, with accuracy, detection rate, NPV, FNR, and FPR of 97.55%, 0.4853, 0.9615, 0.0388, and 0.0099 respectively. Following a valid and fair review of all outcomes, the NB tree was found superior to the other ensembles in predicting NAFLD. Much more work in the medical field is expected in the future, employing the best mix of data mining and ML algorithms.