A Comprehensive Analysis on Detecting Chronic Kidney Disease by Employing Machine Learning Algorithms
Mirza Muntasir Nishat et al.

INTRODUCTION: Chronic kidney disease refers to the slow, progressive deterioration of kidney function. The impairment is irreversible and remains imperceptible until the disease reaches one of its later stages, so early detection and initiation of treatment are essential to ensure a good prognosis and prolonged life. In this respect, machine learning algorithms have proven promising and point towards the future of disease diagnosis. OBJECTIVES: We aim to apply different machine learning algorithms and to assess and compare their accuracies and other performance parameters for the detection of chronic kidney disease. METHODS: The ‘chronic kidney disease dataset’ from the machine learning repository of the University of California, Irvine, was used, and eight supervised machine learning models were developed in the Python programming language for the detection of the disease. RESULTS: A comparative analysis of the eight machine learning models is presented by evaluating performance parameters such as accuracy, precision, sensitivity, F1 score and ROC-AUC. Among the models, Random Forest displayed the highest accuracy of 99.75%. CONCLUSION: We observed that machine learning algorithms can contribute significantly to the predictive analysis of chronic kidney disease and can assist in developing a robust computer-aided diagnosis system to help healthcare professionals treat patients properly and efficiently.


Introduction
Chronic kidney disease (CKD) refers to a progressive and irreversible decline of the structure and function of the kidneys, especially the deterioration of the glomerular filtration rate, that develops over the course of several months or years. It begins with abnormal biochemical changes which eventually lead to gradual loss of the excretory, endocrine and metabolic functions of the kidneys. These abnormalities manifest as signs and symptoms of renal failure. Although the underlying etiology of the disease remains unknown in a large number of patients, the commonest causes are listed as hypertension, diabetes, interstitial diseases, glomerular diseases, systemic inflammatory disorders, renovascular abnormalities and congenital conditions [1]. The prognosis of the disease is determined by monitoring the glomerular filtration rate (GFR) and the quantity of albumin in urine. Decreased GFR and increased albumin in urine were found to be associated with a higher risk of all-cause mortality, mortality from cardiovascular diseases (CVD), progressive kidney disease and acute kidney injury (AKI) [2]. Atherosclerotic calcification within the vessels followed by cholesterol crystal formation is suspected to create a high risk of developing CKD [3]. If untreated, CKD leads through a spectrum of pathological conditions to end-stage renal disease (ESRD) or end-stage renal failure (ESRF), which is responsible for coma and death in patients [4]. The gradual development of CKD is either asymptomatic or presents with a set of non-specific symptoms such as weight loss, fatigue, poor appetite, edema, headache and muscle cramps, which makes it quite difficult for the patient or the physician to suspect the involvement of the kidneys. Moreover, the symptoms do not show until the 3rd or 4th stage of the disease, by which time the comorbidities have already set in [5]. CKD also manifests with immune dysfunction, haematological abnormalities, endocrine dysfunction, neurological symptoms and electrolyte imbalance [1]. In turn, CKD acts as a risk factor for other conditions such as CVD, as mentioned above, resulting in added mortality and morbidity [6]. As a result, CKD has become a global burden, contributing to a significant portion of deaths due to non-communicable diseases (NCD). It rose from being the 27th leading cause of global death in 1990 to the 18th in 2010 [7]. Approximately 1 million people died from CKD or causes related to it in 2013 [8]. The number of new cases needing renal replacement therapy has increased at a rate of 8% per year for the last decade worldwide [9].
Studies have revealed CKD to be a greater burden in low- and middle-income countries than in high-income ones [10], [11]. The proportion of people diagnosed with CKD in the urban areas of South Asia ranges from 7.2% to 17.2% [12]. The prevalence was reported to be 13% among the general population of Dhaka city aged 15 years or older [13]. Another community-based study suggested that about one-third of the rural people in Bangladesh were at risk of having CKD [14]. Hence, in a developing country like Bangladesh, CKD poses a serious threat not only as a disease but also as a financial burden due to its demand for long-term treatment. The situation calls for a diagnostic or, at the very least, a screening technique for the early and reliable detection of CKD, so that the doctor can provide effective treatment. Machine learning (ML) is currently one of the most notable and successful technologies in the medical industry for diagnosing and forecasting various diseases and their stages [15][16][17][18][19][20][21]. As machine learning is all about the exploration of huge datasets and their patterns, features, modes etc., datasets of various diseases can be fed into these algorithms with a view to developing ML models [22][23][24][25][26][27][28]. This introduction of algorithms to medical databases will greatly assist medical professionals in making informed decisions about illnesses, preventing mistakes, and providing a safe life to the general public [29].

Related Works
In this context, many researchers and data scientists have applied different techniques to attain satisfactory performance from ML models. Engin et al. used a dataset extracted from UCI to apply the K-star, SVM and J48 algorithms and compared them in terms of accuracy, sensitivity and other parameters; the J48 classifier achieved 99% accuracy [30]. Gunarathne et al., on the other hand, tested different algorithms on the same dataset and found that the Multiclass Decision Forest (MDF) algorithm outperformed the other algorithms with 99.1% accuracy [31]. A different approach to preparing the dataset was taken by Nusrat et al., who pre-processed the data using root mean squared error, mean absolute error and the receiver operating characteristic curve. They then implemented the Naïve Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM) and K-Nearest Neighbour (KNN) algorithms. According to their investigation, DT offers the best accuracy, about 98%-99% [32]. Another work was conducted by Huseyin et al., who improved the feature selection of the dataset before applying the algorithm. They applied filter, wrapper and embedded feature selection methods to the dataset and then passed the selected features to the SVM algorithm. According to their work, the filter scheme with subset evaluation achieved the best accuracy of 98.5% [33]. Furthermore, Devika et al. focused on the Naïve Bayes (NB), K-Nearest Neighbour (KNN) and Random Forest (RF) algorithms to predict CKD. Among these classifiers, RF outperformed the other algorithms with an accuracy of 99% [34]. In addition, Merve et al. achieved a better accuracy (99.5%) by deploying the AdaBoost ensemble learning approach. In their work, they evaluated the performance of the ML models using mean absolute error (MAE), root mean squared error (RMSE) and area under the curve (AUC) [35].
In addition, Amanah et al. applied PSO algorithms to optimize their results and obtained an accuracy of 99.5%. After applying AdaBoost combined with the PSO feature selection algorithm, they were able to increase their average accuracy by 36.20% [36]. Chittora et al. carried out six different methods of feature selection and implemented seven machine learning algorithms, where 99.6% accuracy was attained by a deep learning network [37]. Moreover, Sobrinho et al. analyzed how machine learning approaches can help in the early detection of CKD in underdeveloped countries. Their findings indicate that the J48 decision tree is a good machine learning technique for such screening in developing nations due to the ease of comprehension of its classification results, with 95.00% accuracy [38]. In our previous study, boosting algorithms were deployed on the same dataset, where we achieved 99.75% accuracy with the AdaBoost algorithm [39]. Hence, it is evident that machine learning algorithms open windows for identifying chronic kidney disease at an early stage so that better treatment can be ensured for the patient. In this research, we take an investigative approach by applying supervised ML algorithms to the UCI dataset pertaining to CKD and comparing the performance of the ML models comprehensively, so that a clear picture emerges for developing a computer-aided diagnosis system to detect CKD at an early stage and ensure proper treatment for patients [40].

Data Processing
One of the most widely used sources of datasets for implementing machine learning algorithms is the UCI machine learning repository. The CKD dataset used here contains 400 instances and 25 attributes. The description of the attributes with the necessary information is presented in Table 1. In order to apply machine learning algorithms, data must be reliable and well-structured. There are two types of data in this dataset: (i) numerical values and (ii) categorical values. Categorical values were replaced with dummy values for the implementation of the ML algorithms. Since there are many missing values in various attributes of this dataset, four separate data frames were created to apply the algorithms and extract the results. In the first three data frames, the missing values were filled with the mean, median and mode values of the corresponding attributes, respectively. In the fourth, instances with missing values were omitted, which left 158 of the 400 instances.
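The four ways of handling missing values can be illustrated with a minimal sketch in plain Python. The column values and the helper name `impute` are assumptions made for this example and are not taken from the study's code; in the study, dropping is applied to whole instances rather than to a single column.

```python
from statistics import mean, median, mode

# Hypothetical numeric attribute; None marks a missing entry.
column = [1.0, 2.0, 2.0, None, 7.0, None]

def impute(values, strategy):
    """Return a copy of `values` with missing entries handled by `strategy`."""
    observed = [v for v in values if v is not None]
    if strategy == "drop":
        # Keep only the entries that were actually observed.
        return observed
    # Compute the fill value from the observed entries only.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# One version of the column per strategy, mirroring the four data frames.
frames = {s: impute(column, s) for s in ("mean", "median", "mode", "drop")}
```

Keeping one version of the data per strategy makes it possible to compare how sensitive each classifier is to the choice of imputation.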
Each of the data frames was split into two portions: (i) a training set and (ii) a testing set. The training set comprised 60% of the data and the remaining 40% was used for testing. Each data frame was cross-validated and hyperparameter tuning was carried out. Hence, the performance parameters are observed and tabulated for both the 'with tuning' and 'without tuning' cases. The correlation heatmaps for all the data frames (mean, median, mode and no null) are illustrated in Figure 1. Finally, a detailed step-by-step workflow diagram is presented in Figure 2, which provides a clear idea of the overall approach. Jupyter Notebook from Anaconda Navigator was used as the simulation platform for this research. The analysis was executed on a computer with an Intel Core i5 9th generation processor and 16 GB of RAM.
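A 60/40 split of the kind described here can be sketched as follows. The function name, the fixed seed and the use of integer row indices are assumptions for illustration; scikit-learn's `train_test_split` with `test_size=0.4` would be a common equivalent.

```python
import random

def split_60_40(rows, train_fraction=0.6, seed=0):
    """Shuffle `rows` reproducibly and return (train, test) portions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Row indices standing in for the 400 instances of the dataset.
rows = list(range(400))
train, test = split_60_40(rows)
```

With 400 instances this yields 240 training rows and 160 testing rows; every row lands in exactly one of the two portions.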

Logistic Regression (LR)
Logistic Regression is a statistical classification model which estimates the probability that an observation belongs to a certain class [41]. Despite the fact that its name includes the word "regression," logistic regression is a commonly used binary classifier.
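As a rough illustration of how such a probability is produced, the sketch below passes a weighted sum of the features through the logistic function and thresholds the result at 0.5. The weights, bias and threshold are made-up values, not fitted parameters from this study.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Estimated probability that observation `x` belongs to the positive class."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def predict(x, weights, bias, threshold=0.5):
    """Binary decision: 1 if the estimated probability reaches the threshold."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0
```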

EAI Endorsed Transactions on Pervasive Health and Technology
Online First

K-Nearest Neighbours (KNN)
K-Nearest Neighbors is one of the simplest and most used supervised machine learning algorithms [42]. Technically, it does not train a model on the dataset; instead, an observation is predicted to belong to the class that has the largest proportion among its k nearest neighbors. Distance is used as the metric to determine similarity.
For instance, the closest data point to the point under observation can be considered the most similar to it. There is a large variety of distance metrics, such as the Euclidean distance (d) between two points $x$ and $y$ with $n$ features:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
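The neighbor-voting idea can be sketched in a few lines of plain Python, using the Euclidean distance as the similarity measure. The sample points, the labels and the default `k=3` are illustrative assumptions only.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points nearest to `query`."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy two-feature observations with illustrative class labels.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["notckd", "notckd", "notckd", "ckd", "ckd"]
```

Calling `knn_predict(points, labels, (0.2, 0.2))` looks only at stored observations at prediction time, which is why KNN is often described as having no training phase.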

Support Vector Machine (SVM)
Support Vector Machine is one of the most robust algorithms based on the statistical learning framework, and it offers solutions for both regression and classification problems [43]. Using the kernel trick, SVM can classify both linear and non-linear datasets. The classes are separated by an (n-1)-dimensional hyperplane, where every data point is considered to be an n-dimensional vector. For a two-dimensional space, the hyperplane is a line separating the plane into two parts. A support vector classifier can be defined by the following terms:

$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i)$$

Here, $\beta_0$ = bias, $S$ = set of observations, $\alpha$ = model parameters that have to be learned, and $K$ = kernel function comparing the input $x$ with observation $x_i$.
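A kernelized decision function of this kind, a bias plus a weighted sum of kernel evaluations against stored observations, can be sketched as below. The radial basis function kernel, the value of `gamma`, and the two support vectors with their coefficients are assumptions chosen for illustration; they are not parameters learned in this study.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def decision_function(x, support_vectors, alphas, bias, kernel=rbf_kernel):
    """Bias plus the alpha-weighted kernel similarity of `x` to each support vector."""
    return bias + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))

# Hypothetical model: one support vector per class, with signed coefficients.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
alphas = [-1.0, 1.0]
bias = 0.0
```

The sign of `decision_function(x, support_vectors, alphas, bias)` gives the predicted side of the boundary; swapping in a different `kernel` changes the shape of that boundary without changing the rest of the code.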

Decision Tree (DT)
Decision Tree is another supervised learning algorithm whose goal is to train a model to classify a target variable by learning simple chained decision rules from the input variables [44]. The data are split recursively based on an impurity criterion until some stopping criterion is reached.
The decision tree model resembles an upside-down tree, with the first decision rule at the top and subsequent decision rules spread across the tree like branches. Among the many impurity measures, Gini impurity is selected for the model used here:

$$G(t) = 1 - \sum_{i=1}^{c} p_i^2$$

Here, $G(t)$ = Gini impurity at node t, $p_i$ = proportion of observations of class i at node t, and $c$ = number of classes.
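Gini impurity is one minus the sum of the squared class proportions at a node, which the following short sketch computes from a list of class labels. The function name and the example labels are illustrative.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of the class labels that reach a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())
```

A pure node such as `["ckd"] * 4` has impurity 0, while an evenly mixed two-class node has impurity 0.5, so a split is preferred when it moves the child nodes towards 0.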

Random Forest (RF)
Random Forest is an ensemble learning algorithm which operates by creating multiple decision trees at training time and outputting the class selected by the majority of the individual trees [45]. It is applicable to both regression and classification. The model builds a multitude of de-correlated decision trees on bootstrapped samples of the training data; this process is known as bagging. In addition, only a random subset of the feature columns is considered when growing each tree. Bagging decreases the variance of the model without substantially increasing its bias. After training, the prediction for an unseen input $x'$ can be written as:

$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$

where $B$ = optimal number of trees and $f_b$ = the b-th trained tree. The uncertainty ($\sigma$) of the prediction can be written as:

$$\sigma = \sqrt{\frac{\sum_{b=1}^{B} \left(f_b(x') - \hat{f}\right)^2}{B - 1}}$$
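The aggregation step of a bagged ensemble can be sketched as follows: class predictions are combined by majority vote, while numeric per-tree outputs are averaged and their spread is reported as a sample standard deviation over the B trees. The per-tree outputs used here would come from already trained trees; both helper names are assumptions for this illustration.

```python
import math
from collections import Counter

def majority_vote(tree_predictions):
    """Class chosen by the largest number of individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def bagged_mean_and_sigma(tree_outputs):
    """Average of B per-tree outputs and their sample standard deviation."""
    B = len(tree_outputs)
    f_hat = sum(tree_outputs) / B
    sigma = math.sqrt(sum((f - f_hat) ** 2 for f in tree_outputs) / (B - 1))
    return f_hat, sigma
```

For example, `majority_vote(["ckd", "notckd", "ckd"])` returns `"ckd"`, and a small `sigma` indicates that the trees largely agree on the input.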


Naïve Bayes (NB)
Naïve Bayes is a supervised algorithm which assumes independence among the features while classifying data [46]. This model is an effective tool for datasets which have a high number of input features. It considers all the available features, including those that have only a weak effect on the final prediction. The probabilistic model of the Naïve Bayes algorithm is based on Bayes' theorem, which for two events A and B can be written as:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
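A small worked example may help make Bayes' rule concrete: the posterior probability of A given B is the likelihood of B given A times the prior of A, divided by the overall probability of B. All probabilities below are made-up numbers for illustration and are not estimates from the dataset.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """P(A | B) computed from P(A), P(B | A) and P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b

# Hypothetical values: A = patient has CKD, B = a test attribute is abnormal.
p_ckd = 0.10
p_abnormal_given_ckd = 0.90
p_abnormal_given_healthy = 0.20

# Total probability of observing an abnormal value.
p_abnormal = (p_abnormal_given_ckd * p_ckd
              + p_abnormal_given_healthy * (1 - p_ckd))

posterior = bayes_posterior(p_ckd, p_abnormal_given_ckd, p_abnormal)
```

With these numbers the posterior is one third: an abnormal value raises the probability of CKD from 10% to about 33%. Naïve Bayes applies the same rule with the likelihood factored into a product over features.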

Multilayer Perceptron (MLP)
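A multilayer perceptron (MLP) is a feed-forward artificial neural network made up of several layers of perceptrons [47]. It contains at least three layers of nodes: an input layer, a hidden layer and an output layer. The network uses a nonlinear activation function which maps the weighted inputs of each neuron to its output. In this paper, we used sigmoid functions as the activation functions:

$$y(v_i) = \tanh(v_i) \quad \text{and} \quad y(v_i) = \left(1 + e^{-v_i}\right)^{-1}$$

Here, $v_i$ is the weighted sum of the inputs of neuron i. The first function is the hyperbolic tangent, which ranges from -1 to 1, and the second is the logistic function, which ranges from 0 to 1. Learning in the perceptron is carried out by back propagation. The error function ($\varepsilon$) that gradient descent minimizes over the output nodes j can be written, in its standard form, as:

$$\varepsilon(n) = \frac{1}{2} \sum_{j} e_j^2(n)$$

where $e_j(n)$ is the difference between the target value and the output produced at node j for the n-th data point.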
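The two sigmoid-type activation functions discussed in this section, the hyperbolic tangent and the logistic function, can be sketched as below together with the output of a single neuron. The function names and the example weights are assumptions for illustration.

```python
import math

def tanh_activation(v):
    """Hyperbolic tangent: output lies between -1 and 1."""
    return math.tanh(v)

def logistic_activation(v):
    """Logistic function: output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(inputs, weights, bias, activation):
    """Apply `activation` to the weighted sum of the inputs plus the bias."""
    v = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(v)
```

Both functions have the same S shape; they differ only in their output range, which affects how the next layer's inputs are centred.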

Quadratic Discriminant Analysis (QDA)
Quadratic Discriminant Analysis is a statistical classifier which separates two or more classes of data by using a quadratic decision surface [48]. This classifier is used in cases where the covariance matrices of the classes differ.
In this classifier, the mean ($\mu_k$) and the covariance matrix ($\Sigma_k$) are estimated separately for each class k. For a particular input x, the objective function is quadratic in x, so the decision boundaries are the zeros of quadratic functions. In its standard form, the objective function for class k is:

$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$

where $\pi_k$ is the prior probability of class k.
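For a single feature, the quadratic discriminant score reduces to a few scalar terms, which makes the idea easy to sketch: each class gets its own mean, variance and prior, and the class with the highest score wins. The class parameters below are hypothetical and the one-dimensional simplification is an assumption made to keep the example short.

```python
import math

def qda_score(x, mu, var, prior):
    """One-dimensional quadratic discriminant score of `x` for one class."""
    return -0.5 * math.log(var) - 0.5 * (x - mu) ** 2 / var + math.log(prior)

def qda_predict(x, class_params):
    """Return the class whose discriminant score for `x` is largest."""
    return max(class_params, key=lambda k: qda_score(x, *class_params[k]))

# Hypothetical per-class (mean, variance, prior) for one feature.
class_params = {"ckd": (1.0, 0.25, 0.6), "notckd": (3.0, 1.0, 0.4)}
```

Because each class keeps its own variance, the score is quadratic in `x`, which is what distinguishes this classifier from linear discriminant analysis.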

Results
After implementing the different machine learning algorithms, the necessary simulations were performed extensively in Python. The confusion matrices for each ML model are tabulated in Table 2 to Table 9, consecutively. The performances were then enhanced by tuning the hyperparameters with random search cross-validation.

After tuning the hyperparameters, the machine learning models were trained with the aim of keeping bias as low as feasible while limiting overfitting. They were then evaluated via cross-validation to reduce the possibility of data leakage while keeping the variance as low as possible. Hence, the performance parameters such as accuracy, precision, sensitivity, specificity, F1-score and the area under the receiver operating characteristic curve (ROC-AUC) are calculated and evaluated accordingly. The graphical comparison of all the performance metrics is displayed in Figure 3.
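Most of these performance parameters follow directly from the four cells of a binary confusion matrix, as the sketch below shows. The counts passed in the example are made-up numbers and do not reproduce any table in this paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical confusion-matrix counts for a test set of 160 instances.
metrics = classification_metrics(tp=95, fp=2, fn=5, tn=58)
```

ROC-AUC is not included because it depends on the ranking of predicted probabilities rather than on a single confusion matrix.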

Discussions
A comparative analysis of all the performance metrics, both with and without tuning of the hyperparameters, is presented in this section. The results are reported for the four different data frames. Random search cross-validation (RandomizedSearchCV) was used for hyperparameter tuning; it samples random hyperparameter combinations to find the best-performing configuration for the built model. The values of the hyperparameters have a substantial impact on model performance. There is no way to predict the optimum values of the hyperparameters in advance, so many candidate values have to be tried. As doing this manually could take a significant amount of time and resources, RandomizedSearchCV is used to automate hyperparameter tuning. Accuracy is calculated for the different ML algorithms both before and after hyperparameter tuning, and the results are tabulated in Table 10. The other performance parameters are presented in Table 11.
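The idea behind random search can be sketched without any library: sample a fixed number of hyperparameter combinations at random, score each one, and keep the best. The search space, the `toy_score` function (a stand-in for a cross-validated accuracy) and all names below are assumptions for illustration; in practice scikit-learn's `RandomizedSearchCV` performs the sampling and the cross-validation together.

```python
import random

def random_search(score_fn, param_space, n_iter=20, seed=0):
    """Try `n_iter` random combinations and return the best (params, score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one candidate value per hyperparameter.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for a tree ensemble.
param_space = {"n_estimators": [50, 100, 200, 400], "max_depth": [2, 4, 8, 16]}

def toy_score(params):
    """Stand-in for cross-validated accuracy; peaks at 200 trees of depth 8."""
    return (-abs(params["n_estimators"] - 200) / 400
            - abs(params["max_depth"] - 8) / 16)

best_params, best_score = random_search(toy_score, param_space, n_iter=50)
```

Unlike an exhaustive grid search, the cost is fixed by `n_iter` rather than by the size of the search space, at the price of possibly missing the best combination.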
After tuning the hyperparameters, the best accuracy, 99.75%, was shown by the Random Forest algorithm. The other algorithms, such as LR (99.36%), DT (99%), SVM (99.36%) and NB (99%), also showed promising results in terms of accuracy. In terms of precision, all the models scored more than 95% except QDA. All the models showed a sensitivity of more than 95% except MLP and KNN. All the models achieved an F1 score and ROC-AUC of more than 90%. Random Forest outperforms all other models, with the highest value of every performance parameter.

Conclusion
Kidneys are required not only to filter toxic substances from the body but are also vital for maintaining the acid-base balance, electrolyte balance and blood pressure of the body. Malfunction of the kidneys is responsible for mild to fatal diseases as well as dysfunction of other organs. That is why researchers around the world have devoted themselves to seeking ways of precise diagnosis and effective treatment of kidney diseases. As machine learning techniques have become more prevalent in the medical field for diagnosis, chronic kidney disease (CKD) is now on the list of diseases that can be predicted by leveraging machine learning algorithms. The body of research on identifying CKD using ML algorithms has steadily improved the process and the accuracy of the results.
In our work, we have proposed the random forest algorithm (accuracy 99.75%) as the most effective algorithm among those evaluated. In this investigation, the missing values are handled with four different strategies: mean, median and mode imputation, and dropping of null values. Moreover, the study measures the performance of the ML models both with and without tuning of the hyperparameters. Significant improvements in the performance of the ML models are observed and presented graphically. Overall, the study explores the applicability of supervised machine learning algorithms in bioinformatics and presents their suitability for diagnosing fatal diseases like CKD at an early stage. This research will guide future research on predictive analysis of other health conditions with machine learning algorithms where applicable, and help to refine the techniques further. We plan to collect datasets from local health care facilities to estimate the regional parameters in order to develop a diagnostic model in the context of Bangladesh. Furthermore, we will explore deep learning and neural network methods and apply them to hone the process further.

Figure 1. Correlation heatmap of different data frames: (a) mean, (b) median, (c) mode and (d) no null

Figure 3. Comparison of (a) accuracy, (b) precision, (c) F1 score, (d) sensitivity and (e) ROC of different ML algorithms, without and with tuning of hyperparameters

Table 1. Description of Attributes with Information

Table 2. Confusion Matrix for K-Nearest Neighbour

Table 3. Confusion Matrix for Logistic Regression

Table 4. Confusion Matrix for Decision Tree

Table 10. Comparison of Accuracy of different ML algorithms

Table 11. Performance Parameters of different ML algorithms

Finally, the best-performing ML models in terms of the different performance metrics are tabulated in Table 12. Our proposed ML model (Random Forest, accuracy 99.75%) is compared with other related research works in Table 13.

Table 13. Comparison with other research works