Support Vector Machine based eHealth Cloud System for Diabetes Classification

INTRODUCTION: Diabetes is a major health issue because it can leave people with physical disabilities. Therefore, methodologies with a low error rate must be used to diagnose this dangerous disease. Data mining techniques such as Artificial Neural Networks are commonly adopted for the classification of diabetes and are one of the core components of an eHealth system. These techniques aim to provide reliable and timely diagnostic outcomes. OBJECTIVE: The objective of this work is to propose a Support Vector Machine based eHealth Cloud System for Diabetes Classification and thereby improve the diagnostic accuracy of computer-assisted diagnostic systems. METHOD: The proposed methodology is implemented in two phases. In the first phase, the system is trained using different Support Vector Machine (SVM) kernel functions; in the second phase, the effectiveness of the system is tested in terms of classification accuracy and error. SVMs with different kernels are able to diagnose this disease. The PIMA Indian Diabetes Dataset (PIDD) is used in our experiments for training and testing. Kernel functions implement the kernel trick, a method of using a linear classifier to solve a non-linear problem. RESULT: Classification accuracy and classification error are used for performance evaluation. The system achieves an accuracy of 77.50% with Coarse Gaussian SVM under 10-fold cross-validation, whereas Fine Gaussian SVM achieves 98.8% accuracy with no validation. CONCLUSION: This paper introduces the SVM eHealth Cloud System for Diabetes Classification. The system is trained using the PIDD. Such a system can be used as "Application-as-a-Service" in cloud computing. It is therefore believed that the system will enhance the process of clinical decision-making and assist physicians in diagnosing diabetes.
It is worth mentioning that the SVM kernel-based system performed well in comparison with the other systems.


Introduction
Lifestyle diseases are a type of disease that is mainly based on people's daily activities and is the product of an unhealthy relationship between people and their surroundings. These diseases develop gradually, take years to grow, and do not readily lend themselves to cure once they are encountered. Bad food habits, lack of exercise, incorrect posture, and a disturbed biological clock are factors that contribute to lifestyle diseases [1]. The resulting chronic diseases (cardiovascular disease, heart attack, heart disease, obesity, and respiratory problems), which are long-term illnesses with slow progression, can severely affect the earnings of individuals [2]. A study published in [3] reports that lifestyle plays a key role in predisposing people to diseases such as diabetes.
Lifestyle diseases are not limited to adults; they have also begun to affect children. The shift in buying power and the advent of technology have changed the way we live. With less physical activity, more access to resources, and no time to spare, we have fallen prey to diseases that were extremely rare and that our grandparents never heard of back in the 60s and 70s. While transmissible diseases such as malaria, cholera, and polio can be managed with proper medical care, lifestyle diseases can be avoided if a healthy, balanced way of life is followed.

Diabetes
Diabetes is a lifestyle disease characterized by high blood sugar levels over an extended period. Symptoms of high blood sugar include excessive urination, increased thirst, and excessive hunger. If left untreated, diabetes can cause many complications, including cardiovascular disease, stroke, chronic kidney disease, foot ulcers, and eye damage [4]. Diabetes is caused either by the pancreas not producing enough insulin or by the cells of the body not responding properly to the insulin that is produced.

Types of Diabetes:
The three forms of diabetes are [5]: (i) Type 1 diabetes: failure of the pancreas to produce enough insulin due to the loss of beta cells. (ii) Type 2 diabetes: starts with insulin resistance, in which cells fail to respond properly to insulin. A lack of insulin may also occur as the disease progresses. (iii) Gestational diabetes: occurs when women develop high blood sugar levels during pregnancy.

Symptoms of Diabetes:
i. Increased thirst
ii. Excessive urination
iii. Extreme appetite
iv. Weight loss
v. Fatigue
vi. Tiredness
vii. Frequent infections, such as skin infections

Medical Mining
As of 2019, electronic health records are common in healthcare facilities. With increasing access to a large quantity of patient information, healthcare providers now tend to focus on improving their organization's decision-making process with the help of data mining. Medical databases have collected large amounts of information on health and medical conditions. Relationships and trends within this data can provide new medical knowledge. Unfortunately, few methodologies have been developed and applied to discover this hidden knowledge [6]. Data mining has the capability to find interesting facts in a large corpus of data, and the patterns it extracts may play an important role in decision-making. In healthcare, data mining is effective in areas such as predictive medicine, relationship management, fraud detection, and the evaluation of the efficacy of treatments [25,26,27]. Medical mining is a broad area of research in which mining methods are used to solve diagnostic and treatment problems as well as to understand the progression of disease. Medical mining includes learning from hospital records (for diagnostic and treatment decision support), learning from data related to health care, and learning from epidemiological data. In medical mining, classification strategies are widely used to assign data to different classes according to the history of different patients in a domain. Data mining has many applications in various domains such as network security, medical mining, and cloud computing [19][20][21].

Related Work
In [7], the authors presented an adaptive evolutionary RBF network algorithm to improve RBF network accuracy. The quality is assessed using accuracy and is evaluated on three datasets from the UCI database. The results show that the approach is an effective multi-objective RBF network method for medical disease diagnosis.
In [8], the authors presented a method for classifying diabetes using genetic programming. In this research, numerous approaches are used to assess the efficacy of the features of diabetes data in order to enable feature selection. The superiority of the method is demonstrated by comparing the outcome with other approaches.
In [9], the authors presented the use of SVM to diagnose diabetes. In this work, an added interpretation module turns the SVM's "black box" model into a comprehensible representation of the SVM diagnostic decision.
In [10], LDA and ANFIS were used to diagnose diabetes. LDA is used to separate the distinguishing features of healthy and diabetic data, whereas ANFIS is used to classify the results of LDA. The approach delivers its findings with good accuracy.
EAI Endorsed Transactions on Pervasive Health and Technology 01 2020 - 05 2020 | Volume 6 | Issue 22 | e3
The authors of [11] presented a GA and fuzzy logic-based system for diabetes identification. The proposed system uses a fuzzy-based classification scheme to demonstrate improved classification accuracy.
The authors of [12] presented FCS-ANTMINER, obtained by combining ACO with fuzzy logic. The outcomes are based on 10-fold cross-validation. The system attained high accuracy compared with several approaches, which strongly suggests that combining ACO and fuzzy logic helps to identify diabetes accurately.
The goal of [13] is to use ACO to obtain a set of diagnostic rules for diabetes. The program is tested using the PIDD. The outcome demonstrates that the system can detect diabetes with remarkable accuracy and competitiveness.
In [14], a new method for classifying medical database information is presented. To determine the applicability of the proposed method, real-world problems were examined. The system's performance is good.

Support Vector Machine (SVM)
SVM is a supervised data mining algorithm used for classification and regression analysis. Although it can be used for regression, SVM is mostly used for classification. Each data item is plotted as a point in n-dimensional space, where the value of each feature is the value of a particular coordinate. Through the learning process, we find the optimal hyperplane that separates the data instances.

SVM's ideology
SVM is based on the idea of finding the hyperplane that best divides the data points into different classes.

How does SVM work?
Given three hyperplanes A, B, and C, the SVM must identify the correct hyperplane, that is, the one that separates the classes with the best margin. In the scenario shown in Figure 1, hyperplane B does this very well.

Figure 2. Linearly Separable hyperplanes
In such a situation, the right hyperplane is the one that maximizes the distance to the nearest data points of either class. This distance is referred to as the margin and is shown in Figure 3. Here the margin of hyperplane C is larger than that of hyperplanes A and B; therefore, the right hyperplane is C. Robustness is another important reason to choose the hyperplane with the larger margin: if we choose a low-margin hyperplane, there is a high likelihood of misclassification. In Figure 2 and Figure 3, x and y represent the attributes of the 2-dimensional dataset.
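The margin comparison described above can be sketched in a few lines of Python. This is only an illustration: the toy points, the labels, and the three candidate hyperplanes named A, B, and C are hypothetical values chosen for the example and are not taken from the paper's figures.

```python
import math


def signed_score(w, b, point):
    """Value of w . x + b for one point (positive on one side, negative on the other)."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def separates(w, b, points, labels):
    """True if every point lies on the side indicated by its label (+1 or -1)."""
    return all(lab * signed_score(w, b, p) > 0 for p, lab in zip(points, labels))


def margin(w, b, points):
    """Smallest distance from any point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(signed_score(w, b, p)) / norm for p in points)


# Hypothetical toy data: two classes in the (x, y) plane.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]

# Hypothetical candidate hyperplanes, each given as (w, b).
candidates = {
    "A": ((1.0, 0.0), -2.5),  # vertical line x = 2.5
    "B": ((0.0, 1.0), -1.5),  # horizontal line y = 1.5
    "C": ((1.0, 1.0), -5.5),  # diagonal line x + y = 5.5
}

# Keep only hyperplanes that separate the classes, then pick the widest margin.
valid = {
    name: margin(w, b, points)
    for name, (w, b) in candidates.items()
    if separates(w, b, points, labels)
}
best = max(valid, key=valid.get)
print(best, round(valid[best], 3))
```

All three candidates separate the toy classes, but C leaves the largest gap to the nearest point, so it is selected. An actual SVM solver finds the maximum-margin hyperplane by optimization rather than by comparing a fixed list of candidates.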

Advantages of SVM
• Guaranteed optimality: because the underlying problem is a convex optimization problem, the solution is always a global minimum, not a local minimum.
• Ease of implementation: SVM is readily available in both Python and MATLAB.
• SVM can be used for both linearly and non-linearly separable data. Linearly separable data allow a hard margin, whereas non-linearly separable data require a soft margin.
• SVMs are compatible with semi-supervised learning models. They can be used where the data is labeled as well as where it is unlabeled.
• SVM can perform the feature mapping using a simple dot product with the help of the kernel trick.

Disadvantages of SVM
• SVM cannot handle categorical data directly. This leads to loss of sequential information and thus to worse results.
• Kernel selection may be the greatest limitation of the support vector machine.

Kernel Function
A kernel function takes a low-dimensional input space and transforms it into a higher-dimensional space; in other words, it converts a non-separable problem into a separable one. Such functions are called kernels. An SVM kernel projection from 2D to 3D is shown in Figure 4.
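To make the idea of a 2D to 3D projection concrete, the following sketch compares an explicit feature map with the corresponding kernel. The map (x1^2, sqrt(2) x1 x2, x2^2) and the sample vectors are assumptions chosen for illustration; they are one standard example of the kernel trick, not the projection drawn in the paper's figure.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def feature_map(x):
    """Explicit 2D -> 3D map whose dot product equals (x . y) ** 2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def quadratic_kernel(a, b):
    """Same quantity computed directly in the 2D input space."""
    return dot(a, b) ** 2


# Hypothetical sample vectors.
a = (1.0, 2.0)
b = (3.0, 4.0)

explicit = dot(feature_map(a), feature_map(b))  # work done in 3D
implicit = quadratic_kernel(a, b)               # work done in 2D
print(explicit, implicit)
```

Both routes give the same similarity value (up to floating-point rounding), which is why a kernel lets a linear classifier operate in the higher-dimensional space without ever constructing it.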

Linear Kernel
The simplest kernel is the linear kernel. It is the dot product of the two vectors plus a constant c: $K(a, b) = \langle a, b \rangle + c$.

Polynomial kernel
The polynomial kernel of degree d is:

$K(a, b) = (\langle a, b \rangle + c)^d$

where a and b are vectors in the input space, d is the degree, and c ≥ 0 is a free parameter that trades off the influence of higher-order versus lower-order terms in the polynomial. When c = 0, the kernel is called homogeneous. Common polynomial kernels are linear (degree 1), quadratic (degree 2), and cubic (degree 3).
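A minimal sketch of the linear and polynomial kernels follows, assuming the standard forms K(a, b) = <a, b> + c and K(a, b) = (<a, b> + c)^d. The function names and the sample vectors are hypothetical and are used only for illustration.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear_kernel(a, b, c=0.0):
    """Dot product plus a constant c."""
    return dot(a, b) + c


def polynomial_kernel(a, b, d=2, c=1.0):
    """(a . b + c) ** d; homogeneous when c == 0."""
    return (dot(a, b) + c) ** d


# Hypothetical sample vectors.
a = [1.0, 2.0]
b = [3.0, 4.0]

k_linear = linear_kernel(a, b)                    # 11.0
k_quadratic = polynomial_kernel(a, b, d=2, c=1)   # (11 + 1) ** 2 = 144.0
k_cubic_hom = polynomial_kernel(a, b, d=3, c=0)   # 11 ** 3 = 1331.0
print(k_linear, k_quadratic, k_cubic_hom)
```

Setting d = 1 in the polynomial kernel recovers the linear kernel, which is why the linear kernel is often treated as the degree-1 member of the same family.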

Gaussian Kernel
Used when there is no prior knowledge about the data. The formula is the following:

$K(a, b) = \exp\left(-\frac{\|a - b\|^2}{2\sigma^2}\right)$

Furthermore, it can also be written as

$K(a, b) = \exp\left(-\gamma \|a - b\|^2\right)$, with $\gamma = 1/(2\sigma^2)$.

The variable $\sigma$ plays a major role in the kernel's performance. If $\sigma$ is overestimated, the exponential behaves almost linearly and the higher-dimensional projection begins to lose its nonlinearity. For Fine Gaussian SVM, the kernel scale is $\sqrt{P}/4$. For Medium Gaussian SVM, the kernel scale is $\sqrt{P}$. For Coarse Gaussian SVM, the kernel scale is $4\sqrt{P}$. Here $P$ is the number of predictors.
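The effect of the kernel scale can be illustrated with a short sketch. It assumes the common form exp(-||a - b||^2 / (2 sigma^2)) and the fine, medium, and coarse presets sqrt(P)/4, sqrt(P), and 4 sqrt(P), with P = 8 predictors as in the PIDD. Treating the preset scale directly as sigma is an assumption made here for illustration; the exact normalization used by a given toolbox may differ by a constant factor.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) kernel: exp(-||a - b||^2 / (2 * sigma^2))."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def preset_scales(num_predictors):
    """Fine, medium, and coarse kernel scales for P predictors (assumed presets)."""
    root = math.sqrt(num_predictors)
    return {"fine": root / 4.0, "medium": root, "coarse": 4.0 * root}


scales = preset_scales(8)  # the PIDD has 8 predictor attributes

# Hypothetical pair of points that differ by 1 in every coordinate.
a = [0.0] * 8
b = [1.0] * 8

similarity = {name: gaussian_kernel(a, b, s) for name, s in scales.items()}
for name in ("fine", "medium", "coarse"):
    print(name, round(scales[name], 3), round(similarity[name], 4))
```

With a fine scale the kernel value drops off quickly, so only very close points count as similar and the decision boundary can become highly detailed. With a coarse scale, distant points still look similar and the boundary is smoother.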

Kernel scale mode
When the kernel scale mode is set to Auto, the software selects the scale value using a heuristic procedure. The heuristic procedure uses subsampling, so the selected value can vary between runs. To reproduce results, set a random number seed using rng before training the classifier.

Dataset
The PIDD originates from the National Institute of Diabetes and Digestive and Kidney Diseases. The aim is to determine whether or not a person has diabetes based on diagnostic measurements. These instances were selected from a larger database under several constraints. In particular, all patients here are women who are at least 21 years old [21].
The dataset includes data from 768 women with 8 characteristics and 1 class attribute, in particular:
i. Pregnancies
ii. Glucose
iii. BloodPressure
iv. SkinThickness
v. Insulin
vi. BMI
vii. DiabetesPedigreeFunction
viii. Age
ix. Outcome
Figure 5 shows the distribution of the PIDD attributes and Figure 10 shows the visualization of the dataset with each attribute. The input to the system is the diabetes dataset and the final output is whether the person is diabetic (+) or not.

Proposed System and Results
• Diabetes Dataset: The input given to the model. Here the PIDD is the input to the system.
• Training Set: The part of the PIDD used to train the prediction model.
• Testing Set: The part of the PIDD used to test the effectiveness of the prediction model.
• Feature Selection: The attributes considered for training the model. In this study, we considered all the features of the PIDD.
• SVM based Trained Model: The final SVM based eHealth system for Diabetes Classification. It may be used in the cloud environment as "Application-as-a-Service".
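The division into a training set and a testing set can be sketched as a simple hold-out split. The function below is a hypothetical helper written for illustration; the 80-20 and 70-30 ratios are the ones reported in the Discussion, while the seed and the use of row indices in place of real records are assumptions.

```python
import random


def holdout_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows reproducibly and cut them into training and testing parts."""
    rng = random.Random(seed)          # fixed seed so the split can be repeated
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Stand-in for the 768 PIDD records: row indices instead of real data.
rows = list(range(768))

train80, test20 = holdout_split(rows, train_fraction=0.8)
train70, test30 = holdout_split(rows, train_fraction=0.7)
print(len(train80), len(test20), len(train70), len(test30))
```

Every record ends up in exactly one of the two parts, so the testing set measures performance on data the model has not seen during training.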

SVM eHealth cloud System
As the cost of healthcare services increases and healthcare specialists become scarce and hard to find, it is now necessary for healthcare institutions to adopt IT and data mining-based systems. IT and data mining enable health institutions to modernize many of their procedures and to deliver services more efficiently and cost-effectively. New technological developments such as cloud computing provide a cost-effective network and act as an enabler for IT services. This can also be done on a pay-as-you-use "e-Health Cloud" platform that supports the healthcare sector on demand while reducing its expenses. Cloud computing is the technology of using remote servers, rather than the local system, to store and handle data. The trained SVM eHealth system can be used in the form of "Application-as-a-Service" [23,24], as shown in Figure 7.

Figure 7. SVM eHealth Cloud System
Steps:
Step 0: Start
Step 1: Load the PIDD
Step 2: SVM eHealth cloud System Model Development using the Training set
Step 3: Testing Model using the validation set
Step 4: Performance analysis
Step 5: Selection of best Model
Step 6: Stop
The process is as follows:

Hold Out Validation
• Randomly shuffle the set of data.
• Divide the data set into K folds.
• Repeat K times:
  o Take one fold as the testing set.
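The shuffle, split, and repeat procedure listed above can be sketched as follows. Using the remaining K-1 folds as the training set in each round is the usual K-fold convention and is an assumption here, since the list above names only the testing fold. The function name and the seed are hypothetical.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation."""
    rng = random.Random(seed)              # fixed seed for a repeatable shuffle
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Assumption: all remaining folds together form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 768 records (the size of the PIDD) and K = 10.
splits = list(kfold_indices(768, 10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each record is used for testing exactly once and for training in the other K-1 rounds, and the reported accuracy is typically the average over the K rounds.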

Discussion
Our SVM kernel-based expert system for diabetes classification is implemented in MATLAB on a Windows 10 machine with 8 GB RAM and an Intel Core i7-8700 CPU @ 3.20 GHz. The proposed system is trained on the PIDD using different SVM kernel functions. The system is trained with hold-out validation, no validation, and different k-fold validation methods, and its performance is recorded in Tables 3-9. Table 1 shows the comparison of different kernel functions, and Table 2 shows the statistics of the dataset. Tables 4, 6, and 8 show the configuration and complexity, such as prediction speed, training time, kernel scale, and kernel function. Table 3 shows the performance of various kernel functions with hold-out validation (80-20 and 70-30). Table 5 shows the performance of various kernel functions with no validation. Table 7 shows the performance of various kernel functions with K-fold cross-validation (K = 10, 8, 5). With no validation, the best performance is observed for Fine Gaussian SVM, with an accuracy of 98.8% and an error of 1.20%.
With hold-out validation, the best performance is observed for Linear SVM, with an accuracy of 77.40% and an error of 22.60%.
With K-fold validation, the best performance is observed for Coarse Gaussian SVM in 10-fold validation, with an accuracy of 77.50% and an error of 22.50%. The ROC curve of the Fine Gaussian SVM shows an AUC of 1.00 with no validation; in this case the accuracy is very high, but the system may suffer from overfitting.
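The two performance measures used above are complementary: the classification error is 100% minus the accuracy. A minimal sketch, with hypothetical label vectors chosen only to illustrate the computation:

```python
def accuracy_and_error(y_true, y_pred):
    """Return (accuracy, error) in percent for two equal-length label lists."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = 100.0 * correct / len(y_true)
    return accuracy, 100.0 - accuracy


# Hypothetical labels: 1 = diabetic, 0 = not diabetic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy, error = accuracy_and_error(y_true, y_pred)
print(accuracy, error)  # 75.0 25.0
```

The same relation holds for the reported figures, for example 77.50% accuracy corresponds to 22.50% error.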

Conclusion
Advances in ICT have led to the use of data mining techniques in different fields, including the medical sciences. By using data mining techniques, we can design and implement complex medical processes. In this work, we proposed the SVM eHealth Cloud System for diabetes prediction. The system is trained using the PIDD. Such a system can be used as "Application-as-a-Service" in cloud computing; it is also useful in various fields of medical science, for example for diagnosis and for assisting surgeons and doctors. It is worth mentioning that the system achieves an accuracy of 77.50% with Coarse Gaussian SVM in 10-fold validation, while Fine Gaussian SVM achieves 98.8% accuracy with no validation. In the future, we will use optimization techniques such as GA, PSO, and ACO along with the machine learning algorithm. Such a machine learning-based eHealth system may be used for the early prediction of diabetes, and it enables health institutions to modernize many of their procedures and to deliver services more efficiently and cost-effectively.