Risk Assessment of Myocardial Infarction for Diabetics through Multi-Aspects Computing

INTRODUCTION: Cardiovascular disease is a major complication of diabetes. Myocardial infarction (MI) is a type of cardiovascular disease in which an interruption in blood flow damages the heart muscle. The chance of developing MI is high in diabetic patients. OBJECTIVES: To choose a dataset with features related to diabetes, ECG parameters and risk factors of MI for effective prediction; to predict myocardial infarction in both type-1 and type-2 diabetic patients using regression techniques; and to identify the best performing algorithm. METHODS: Multiple linear regression and ridge regression are existing techniques; in addition to these, the proposed technique, lasso regression, is used to develop a prediction model. The trained models are compared to identify the better performing algorithm. Estimation statistics, namely confidence and prediction intervals, are used to show the amount of uncertainty in the predicted values. The statistical measures root mean squared error (RMSE) and r_squared are used to evaluate and compare the algorithms. RESULTS: The proposed algorithm, lasso regression, achieved better values of RMSE and r_squared (0.418 and 0.2278 respectively) than the other techniques. CONCLUSION: The proposed algorithm performed best, and hence using lasso regression to predict myocardial infarction in diabetic patients gives better results.


Introduction
Cardiovascular disease is one of the major complications of diabetes. Myocardial infarction, also known as heart attack, is one type of cardiovascular disease; it is caused by an interruption in blood flow that damages the heart muscle. Diabetic patients are more likely to have a heart attack than non-diabetic patients. A diabetes study at the Oxford centre reported that patients with fatal myocardial infarction had higher HbA1C than other myocardial infarction patients [1]. HbA1C reflects the average blood sugar level over the past two to three months. A heart attack or myocardial infarction can be considered a form of acute coronary syndrome, a condition that occurs when the arteries become blocked. These arteries play a major role in carrying oxygen, blood and nutrients to the heart [2]. There are three types of acute coronary syndrome: STEMI, NSTEMI and CAS or unstable angina.

EAI Endorsed Transactions on Pervasive Health and Technology
Research Article, 09 2020 - 12 2020 | Volume 6 | Issue 24 | e3
S.S. Reddy, S. Nilambar and R. Rajender

• ST segment elevation myocardial infarction (STEMI) is a condition which occurs when a coronary artery is blocked completely, allowing no blood flow through the heart and damaging the muscle.
• Non-ST segment elevation myocardial infarction (NSTEMI) is a condition which occurs when a coronary artery is blocked partially.
• Coronary artery spasm (CAS) or unstable angina is a condition which occurs when the arteries of the heart tighten, reducing or stopping the blood flow [3].
There are several signs one can observe before a heart attack. Chest pain is an early sign which, when treated, can reduce complications. The common symptoms of myocardial infarction include shortness of breath, tiredness, indigestion, heartburn and nausea [4]. More than 68% of diabetic people aged 65 or older are affected by heart disease [5]. Several risk factors increase the chances of heart attack: high blood pressure, high blood cholesterol, obesity, diabetes, age, family history of heart attack, lack of exercise and use of tobacco [4].
To avoid myocardial infarction one should follow a healthier lifestyle and take medication if any of the risk factors is present. A worldwide survey found that 32.2% of people affected by cardiovascular disease have type-2 diabetes, and that type-2 diabetes patients have a 53% chance of developing myocardial infarction [6]. A doctor will diagnose heart attack or myocardial infarction by performing tests such as an electrocardiogram (ECG), blood test, echocardiogram and angiogram. An ECG monitors the electrical activity of the heart and shows the measurements in the form of a graph. A blood test identifies proteins that have leaked from the heart into the blood. An angiogram identifies the areas where arteries are blocked. An echocardiogram provides images of the heart, which are used to monitor the functioning of the valves and to identify blood clots [7].
Section 2 presents the literature survey. Section 3 presents the methodology, explaining the objectives of the work, the dataset used and the system architecture in detail. Section 4 presents the proposed work, in which the MLR and RR algorithms are explained briefly and the lasso regression algorithm is explained in detail. Section 5 analyses the results obtained and gives a detailed explanation of the estimation statistics and statistical measures used in the work. Section 6 concludes the work.

Literature Survey
Xingjin Zhang et al. [8]
Procheta Nag et al. [13] developed a system for predicting myocardial infarction using data mining techniques. They collected a dataset from different hospitals containing 25 attributes related to acute myocardial infarction. The data mining techniques used are the C4.5 decision tree (DT) and random forest. Percentage splits of 70%, 60% and 55% were performed on the dataset using seed values 1 to 4. They considered evaluation metrics such as accuracy, precision, recall and ROC curve. C4.5 obtained its best metric values for all three percentage splits using seed 3. The random forest algorithm was implemented using seed 3 and compared with C4.5. The values of accuracy, precision, recall and ROC curve obtained for random forest were 96, 0.94, 1 and 0.99 for the 70% split; 96, 0.97, 0.97 and 0.99 for the 60% split; and 95, 0.95, 0.97 and 0.99 for the 55% split. Since random forest performed well, they used it to develop an app for predicting myocardial infarction from the data provided.
Polaraju and Durga Prasad [14] used the MLR technique to predict heart disease. They used data collected from patients and implemented the algorithm in C# using the .NET framework. They divided the data into training and test datasets of 70% and 30% respectively. The training dataset consists of 13 attributes and 3000 instances. They concluded that the result of the MLR model is more appropriate for prediction of heart disease.
Neel Adwani [15] presented his work on predicting the probability of a heart attack. He used three attributes from a heart disease dataset on Kaggle, namely age, cholesterol level and target class. He used the machine learning regression algorithm MLR to predict heart attack, implemented in GNU Octave, which is open source software. When age and cholesterol level are given as input, the model predicts the chance of heart attack. He concluded that adding more predictor attributes can increase the accuracy of the model.
Madhubala et al. [16] presented their work on prediction of diabetes using multiple linear regression. They used a diabetes dataset taken from Kaggle. The dataset attributes considered are glucose, BP, insulin, age, BMI and outcome (target variable). They calculated the correlation between each predictor variable and the target variable, and used the two best predictor attributes by correlation, glucose and BMI, to train the MLR model. A visualization of the output is provided. From the analysis of the results they noticed that a glucose level > 100 and a BMI value > 20 indicate the presence of diabetes.
Muthukrishnan and Rohini [17] presented their research on using regression techniques for developing predictive models in machine learning. They implemented OLS regression, RR and LASSO regression on a real-time diabetes dataset. They used these algorithms for feature selection, which provided the coefficients of the predictor attributes. They concluded that the LASSO regression technique performed better, with fewer attributes, than the other two techniques.
Jeena and SukeshKumar [18] used RR for predicting the risk of stroke. They used a clinical dataset collected from a hospital in Trivandrum, containing 14 attributes and 531 instances. They used bootstrap validation to validate the model developed using RR, and calculated a risk score from which they predicted the chance of having a stroke.
Huan Lio and Yahui Liu [19] focused on using RR for developing a prediction model. They used a forest fire prediction dataset. First they performed RR for feature selection to obtain efficient features; those attributes were then used to develop a support vector machine model with a radial basis function kernel. They concluded that combining the regression technique with SVM gives accurate prediction.
Avinash Golande and Pavan Kumar [20] highlighted their work on prediction of heart disease. They used ML techniques such as DT, k-means clustering, Adaboost and k-nearest neighbour on a heart disease dataset, and compared the output of these algorithms with classifiers used in existing research papers.
Alexander Schlemmer et al. [21] predicted cardiac diseases using several ML algorithms. They used data collected from a cardiological study containing information on 261 patients. The algorithms used are support vector machine with linear and RBF kernels, k-nearest neighbours with k values of 1, 3, 5 and 8, and random forest. Each classifier was evaluated using the leave-one-out (LOO) test and the Matthews correlation coefficient. They concluded that linear SVM performed better in terms of the Matthews correlation coefficient, with a value of 0.28.
Santhana Krishnan and Geetha [22] presented their work on prediction of heart disease. The dataset they used is medical data taken from the UCI ML repository. They used decision tree and naive Bayes algorithms for prediction of heart disease. By comparing the results obtained for both algorithms they concluded that decision tree obtained the better accuracy, 91%.
Arash Farbahari et al. [23] focused on using linear, ridge and lasso regression to determine the influential variables that affect fasting sugar levels in type-2 diabetic patients. After determining the influential variables they implemented logistic regression to predict type-2 diabetes. Their dataset was collected from 380 healthy persons and 270 type-2 diabetic patients. The three regression algorithms were compared using mean squared error (MSE). They concluded that, among all the attributes, HbA1C is the most influential predictor attribute for fasting sugar level.

Methodology
This section comprises the objectives of this work, a detailed explanation of the dataset used and the system architecture.

Objectives of the work
In recent days ML algorithms are prominently used in various areas, including medical research (disease prediction). The chance of developing myocardial infarction is high in diabetic patients of both type-1 and type-2. Early and effective diagnosis of MI in diabetic patients will help to avoid further risks. The objectives of this work are:
• To consider dataset attributes related to diabetes, risk factors of MI and parameters from ECG tests.
• To choose ML algorithms for developing a myocardial infarction predictive model.
• To evaluate each algorithm using performance metrics.
• To recognize the best performing one among all the algorithms.
The dataset considered in this work comprises variables or attributes related to diabetes, risk factors of MI and the parameters from the ECG test. The dataset attributes that affect type-1 and type-2 diabetic patients are considered. These attributes are necessary to build an effective predictive model. Three regression algorithms were chosen for developing a predictive model: MLR, RR and lasso regression. These algorithms are implemented in R. Statistical metrics are needed to evaluate the performance of regression algorithms. The estimation statistics, namely confidence interval and prediction interval, are used to show the uncertainty in the predicted values. The statistical metrics RMSE and R squared were chosen for evaluating and comparing the algorithms to recognize the best performing one.

Dataset
The dataset considered to implement the algorithms consists of 22 attributes and 133 instances. Of these attributes, 21 are predictor attributes and one is the target attribute. As the main aim of this work is to predict myocardial infarction in diabetic patients, most of the predictor attributes are related to diabetes and electrocardiogram parameters. The predictor attributes related to diabetes are body mass index (BMI), type of diabetes, duration of diabetes, fasting blood sugar level, HbA1C and type of treatment for diabetes.
The attributes related to the electrocardiogram are results of resting electrocardiograph, maximum heart rate achieved, slope of the ST segment, exercise induced ST depression and thallium heart scan. The attributes results of resting electrocardiograph, slope of the ST segment and thallium heart scan are not in the original graphical form of the ECG; they are represented as different classes, which are described in table 1. Some attributes are related to risk factors of myocardial infarction, such as cholesterol, age, use of tobacco (smoking) and high blood pressure. The attributes related to cholesterol are the parameters of the lipid profile test, namely LDL, HDL and triglyceride, together with statin and dosage of statin. The attributes related to blood pressure are systolic blood pressure and diastolic blood pressure. In addition to these, the gender attribute was also included.
Table 1 (surviving rows):
Sys_bp: Systolic blood pressure is the pressure in the flow of blood during contraction of the heart muscle. The normal range of SBP is less than or equal to 120 mmHg.
15. Dia_bp: Diastolic blood pressure is the pressure in the flow of blood between heart beats. The normal range of DBP is less than or equal to 80 mmHg.
16. Smoking: Whether or not the person has the habit of smoking.
Figure 1 shows the density plot for all 22 attributes in the dataset. The name of the attribute is given at the top of each plot. The N value represents the number of observations in the dataset, which is 133. Kernel smoothing is used in the density plot: each data point is represented as a Gaussian-shaped kernel, and all these Gaussian kernels are combined to obtain the density plot. The bandwidth shown under each attribute's density plot is the standard deviation of the smoothing kernel for that attribute.
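The kernel smoothing described above can be sketched in a few lines. This is a minimal Python illustration, not the authors' R code; the function name `gaussian_kde`, the sample values and the bandwidth of 5.0 are all hypothetical and are not taken from the paper's dataset.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x as the average of
    Gaussian kernels centred on each sample (bandwidth = kernel std dev)."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum the contribution of one Gaussian bump per data point.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical attribute values (for example, ages); not from the paper.
ages = [45.0, 52.0, 58.0, 61.0, 67.0]
density = gaussian_kde(ages, bandwidth=5.0)
print(density(58.0), density(20.0))
```

A larger bandwidth widens each kernel and gives a smoother curve, while a smaller one follows the individual data points more closely.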

System architecture
A good architecture that describes the entire process is necessary to understand the work. Some works only describe the process without presenting a flowchart; here a flowchart is included to describe the process clearly. The system architecture of this work is presented in figure 2. It describes the step-by-step methodology used to obtain the best performing algorithm. First, the myocardial infarction dataset described in the above section is loaded; it has 133 instances and 22 attributes. Then data pre-processing is performed. A percentage split of 80% is applied to the pre-processed data, so the training and test datasets contain 107 and 26 instances respectively. Using the training dataset, the three algorithms multiple linear regression, ridge regression and lasso regression are implemented, giving a trained model for each algorithm. Each trained model is evaluated on the test data, and the results are compared to obtain the best performing model. The entire process is implemented in R. Confidence intervals, prediction intervals, r_squared and RMSE are used for evaluating each algorithm. The confidence and prediction intervals show the uncertainty in the predicted values. r_squared and RMSE are the statistical measures on which the best performing algorithm was recognized: a small value of RMSE and a large value of r_squared are the criteria for the best algorithm.
After comparison, lasso regression was recognized as the best one; it is explained in detail in the result analysis section.
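The 80% split can be illustrated with a short Python sketch. The authors' R splitting code is not shown in the paper, so the function name `train_test_split_indices`, the seed and the rounding rule (rounding the training size up) are assumptions, chosen only because they reproduce the reported 107 and 26 instances.

```python
import math
import random


def train_test_split_indices(n_rows, train_fraction, seed):
    """Shuffle row indices and split them into training and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumed rounding rule: round the training size up (0.8 * 133 = 106.4).
    n_train = math.ceil(train_fraction * n_rows)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(133, 0.8, seed=1)
print(len(train_idx), len(test_idx))  # 107 26
```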

Proposed work
In this section the algorithms multiple linear regression and ridge regression are explained briefly, and the proposed algorithm, lasso regression, is explained in detail. In all three algorithms the value of the target variable is calculated using the formula below [24].
$$\hat{y}_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_P x_{tP}$$
Here $p = 1, 2, 3, \ldots, P$, where $P$ is the number of predictor variables, $x_{tp}$ is the value of predictor variable $p$ in observation $t$, $\beta_0$ is the y-intercept and $\beta_1, \beta_2, \ldots, \beta_P$ are the regression coefficients of the predictor variables. $y_t$ is the actual value of the target variable for observation $t$ and $\hat{y}_t$ is the predicted value of the target variable for observation $t$. In the following formulae $n$ is the number of observations.

Multiple linear regression
MLR is an extension of the linear regression technique. It is used to model the linear relationship between several predictor (independent) variables and one target (dependent) variable. In linear regression the value of the target variable is estimated using only one predictor variable; in multiple linear regression two or more predictor variables are used. The value of the target variable is calculated using the formula given above. After predicting all the values, the error is calculated using a cost function based on the squared differences between the actual and predicted values.
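The MLR cost function itself is not reproduced in this text. A common form, assumed here and consistent with the sum-of-squares description of OLS given later in the paper, is the sum of squared errors over the $n$ observations; the authors may have used a scaled variant (for example a mean), and the name $J_{\mathrm{MLR}}$ is introduced only for illustration.

```latex
J_{\mathrm{MLR}}(\beta) = \sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2
```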

Ridge regression
Ridge regression is a regularization of linear regression. This method reduces the complexity of the model by shrinking the regression coefficients using a tuning parameter called lambda ($\lambda$). It also reduces multicollinearity, that is, correlation between predictor variables. The value of the target variable is predicted in the same way as in MLR, but the cost function used to calculate the error is modified by adding an extra term, called the penalty term, which is the sum of the squares of the regression coefficients scaled by $\lambda$. Here $\beta_k$ represents the regression coefficients for $k = 0, 1, 2, \ldots, P$ and $\lambda$ is a tuning parameter [25].
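The ridge cost function is not legible in this text. A standard form that matches the description above (squared-error term plus a squared-coefficient penalty scaled by $\lambda$) is sketched below as an assumption. The summation range follows the text's $k = 0, 1, \ldots, P$, although many implementations leave the intercept $\beta_0$ unpenalised; the name $J_{\mathrm{RR}}$ is introduced only for illustration.

```latex
J_{\mathrm{RR}}(\beta) = \sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2 + \lambda \sum_{k=0}^{P} \beta_k^2
```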

Algorithm: Lasso regression
Input: Each observation in the training dataset.
Output: Predicted value of the target variable for the given observation.
Assumptions: $t$ is a specific observation, $n$ is the number of observations, $x_p$ is a specific predictor variable where $p = 1, 2, 3, \ldots, P$, $P$ is the number of predictor variables, $y$ is the target variable, $y_t$ is the value of the target variable in observation $t$, $x_{tp}$ is the value of predictor variable $p$ in observation $t$, $\hat{y}_t$ is the predicted value of the target variable for observation $t$, $\beta_k$ represents the regression coefficients for $k = 0, 1, 2, \ldots, P$, and $\lambda$ is a tuning parameter.
Step 1: Start
Step 2: Calculate the mean of each predictor variable $x_p$ and of the target variable $y$.
Step 3: Calculate the regression coefficient of each predictor variable. Here $\beta_0$ is the intercept value, which is obtained by setting all predictor variables equal to 0.
Step 4: For each observation $t$ in the training dataset
Step 4.a: Calculate the predicted value of the target variable: $\hat{y}_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_P x_{tP}$
Step 4.b: Return the value of the target variable obtained in 4.a
Step 5: End for loop in step 4
Step 6: Calculate cost function
Step 7: Stop

Lasso stands for Least Absolute Shrinkage and Selection Operator. It is also a regularization of linear regression. It reduces model complexity and the problem of overfitting by using the magnitude (absolute value) of the regression coefficients when calculating the cost function. The cost function gives the error after predicting the values. With this technique the errors obtained can be reduced compared with RR. The process of predicting the value is the same in all three algorithms; they differ only in the cost function [26].
In step 2 the mean is calculated for each predictor variable and the target variable. These values are used for calculating the regression coefficient of each predictor variable in step 3. The regression coefficients are used to obtain the predicted value of the target variable in step 4, and the value is returned in 4.b. The overall error after predicting all the values is calculated using the cost function in step 6. Thus the trained model is obtained, which is further evaluated on the test dataset.
Figure 3 represents the flowchart of the regression algorithms. All three algorithms work in a similar way; only the formulae change. The formulae used in each algorithm have already been given in this section, so the figure only describes the steps involved in the working of the algorithm.
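The paper's coefficient formulae and lasso cost function are not legible here, so the following Python sketch should be read as an assumption rather than the authors' method (they worked in R). It minimises the sum of squared errors plus $\lambda \sum_{k \ge 1} |\beta_k|$ by cyclic coordinate descent with soft-thresholding, a standard lasso solver, and leaves the intercept unpenalised. The names `soft_threshold`, `lasso_fit` and `predict`, the value of `lam` and the toy data are all hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(rows, targets, lam, n_sweeps=100):
    """Fit intercept and coefficients by cyclic coordinate descent."""
    n, p = len(rows), len(rows[0])
    # Means of each predictor and of the target (step 2 of the listing).
    x_mean = [sum(row[j] for row in rows) / n for j in range(p)]
    y_mean = sum(targets) / n
    xc = [[row[j] - x_mean[j] for j in range(p)] for row in rows]
    yc = [y - y_mean for y in targets]

    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # feature j against the residual that ignores feature j
            z = 0.0    # squared norm of centred feature j
            for x, y in zip(xc, yc):
                partial = sum(beta[k] * x[k] for k in range(p) if k != j)
                rho += x[j] * (y - partial)
                z += x[j] ** 2
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0.0 else 0.0
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


def predict(intercept, beta, row):
    """Linear prediction: intercept plus the weighted sum of predictors."""
    return intercept + sum(b * v for b, v in zip(beta, row))


# Toy data (hypothetical): the target depends only on the first predictor.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0], [5.0, 4.0]]
targets = [3.0, 5.0, 7.0, 9.0, 11.0]
intercept, beta = lasso_fit(rows, targets, lam=1.0)
print(intercept, beta)  # the second coefficient is shrunk exactly to 0.0
```

The toy run illustrates why lasso suits datasets with few significant attributes: the coefficient of the irrelevant predictor is set exactly to zero, whereas a squared penalty would only shrink it.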

Result analysis
In this section the results of the three regression algorithms are provided. The algorithms MLR, RR and lasso regression are compared with each other. The uncertainty in the predicted values of each algorithm is shown using the estimation statistics confidence interval and prediction interval. The three algorithms are compared using the statistical measures RMSE and r_squared.

R_Squared value
R_Squared is used to determine whether the obtained regression line is better than the horizontal line through the mean of the data points. Its value should lie between 0 and 1; it increases as the errors decrease, and vice versa. When comparing models, a higher r_squared value indicates a better model. The calculation of r_squared for the lasso regression model is explained below.
$$R^2 = 1 - \frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{n} (y_t - \bar{y})^2}$$
The numerator $\sum_{t=1}^{n} (y_t - \hat{y}_t)^2$ is called the sum of squared errors (SSE). In this equation $n$ is the number of instances in the test dataset ($n = 26$), $y_t$ is the actual value of test instance $t$ and $\hat{y}_t$ is the predicted value for test instance $t$. The SSE value obtained for the lasso regression model is 4.5436. The denominator $\sum_{t=1}^{n} (y_t - \bar{y})^2$ is called the sum of squared total (SST), where $\bar{y}$ is the mean of the actual values in the test dataset. The SST value obtained for the lasso regression model is 5.8846. R_Squared for lasso regression = 1 - (4.5436 / 5.8846) = 0.2278
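The arithmetic above can be checked with a short Python sketch (the authors worked in R; the function name `r_squared` is hypothetical). Using the rounded SSE and SST reported in the text, the result is about 0.2279, which agrees with the reported 0.2278 to three decimal places; the small difference presumably comes from rounding of the inputs.

```python
def r_squared(actual, predicted):
    """Return 1 - SSE/SST for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    sst = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - sse / sst


# Reported sums for the lasso model on the 26 test instances.
reported = 1.0 - 4.5436 / 5.8846
print(round(reported, 3))  # 0.228
```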

Root Mean Squared Error (RMSE)
RMSE is defined as the square root of the mean of the squared errors:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2}$$
In regression, RMSE is a very important measure of the performance of the developed model. A lower value of RMSE indicates smaller errors and better performance. The numerator is the sum of squared errors (SSE) given in the r_squared section above. For the lasso regression model the value of RMSE is 0.418.
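As a check, the reported RMSE can be recovered from the SSE of 4.5436 and the 26 test instances given earlier. This is a Python sketch rather than the authors' R code, and the function name `rmse` is hypothetical.

```python
import math


def rmse(actual, predicted):
    """Return the root mean squared error of paired values."""
    n = len(actual)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(sse / n)


# SSE and n as reported for the lasso model on the test dataset.
reported_rmse = math.sqrt(4.5436 / 26)
print(round(reported_rmse, 3))  # 0.418
```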

Confidence interval
A confidence interval is an estimation statistic used to give bounds on the estimate of a population parameter such as the mean or standard deviation. In regression, when a 95% confidence interval is used, the upper and lower limits of the mean are obtained. The estimated value of each observation should lie between these limits.
Here $\hat{y}_t$ is the predicted value of observation $t$, $T_{n-2}$ is the value in the 95% column of the t-distribution table and $n$ is the number of observations in the test dataset. $y_t$ is the actual value of observation $t$, $\bar{x}$ is the mean of the predictor attribute and $p$ denotes a specific observation. The denominator $\sum_{p=1}^{n} (x_p - \bar{x})^2$ is the sum of squared total (SST). MSE is the mean squared error, calculated as SSE / (n - q), where q is the number of coefficients in the model. The MSE value for lasso regression, obtained using a built-in function, is 0.1747.
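The confidence interval formula itself is not legible in this text. The symbols described above match the standard interval for the mean response in regression, which is sketched below as an assumption; the dummy index $i$ in the denominator is introduced here only to avoid reusing $p$.

```latex
\hat{y}_t \pm T_{n-2} \sqrt{ \mathrm{MSE} \left( \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right) }
```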

Prediction interval
A prediction interval is an estimation statistic used to give bounds on the estimate of a single observation in the dataset. Like the confidence interval, the 95% prediction interval gives upper and lower limits, and the estimated value of each observation should lie within them. The symbols in the formula are the same as in the confidence interval formula.
PI of the first instance in the test dataset for the lasso regression model
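The worked interval values are not legible in this text. The Python sketch below shows how the two half-widths could be computed under the standard formulae, which are an assumption here: the confidence interval uses $T_{n-2}\sqrt{\mathrm{MSE}\,(1/n + (x_p - \bar{x})^2 / \mathrm{SST})}$ and the prediction interval adds 1 inside the bracket. MSE = 0.1747 and n = 26 come from the text, and 2.064 is the two-sided 95% t-table value for 24 degrees of freedom. The function name `interval_half_widths`, the predictor values and `y_hat` are hypothetical.

```python
import math


def interval_half_widths(t_crit, mse, n, x_value, x_mean, sxx):
    """Return (ci_half_width, pi_half_width) for one observation."""
    # Shared term: 1/n plus the scaled squared distance from the mean.
    leverage = 1.0 / n + (x_value - x_mean) ** 2 / sxx
    ci = t_crit * math.sqrt(mse * leverage)
    # The prediction interval adds the variance of a single new observation.
    pi = t_crit * math.sqrt(mse * (1.0 + leverage))
    return ci, pi


# x_value, x_mean, sxx and y_hat are hypothetical illustration values.
ci, pi = interval_half_widths(2.064, 0.1747, 26, x_value=1.0, x_mean=1.0, sxx=10.0)
y_hat = 0.5
print((y_hat - ci, y_hat + ci))  # 95% confidence interval
print((y_hat - pi, y_hat + pi))  # 95% prediction interval
```

The prediction interval is always wider than the confidence interval for the same observation, because it accounts for the scatter of an individual value as well as the uncertainty in the fitted mean.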

Results obtained
The estimation statistics for the predicted values obtained after implementing the three algorithms are provided below. Figures 4, 5 and 6 correspond to MLR, RR and lasso regression respectively. In these figures the left side shows the 95% confidence intervals and the right side shows the 95% prediction intervals obtained as output for each test instance.
In figure 4 the test instances numbered from 7 to 131 are the instances selected from the original dataset to form the test dataset. In figures 5 and 6 these instances are numbered from 1 to 26, because the confidence and prediction intervals were calculated by implementing the formulae and storing the results in a data frame. The instances are the same for all three algorithms; only the numbering differs. In the case of multiple linear regression (figure 4) the estimation statistics are obtained using a built-in method.
The literature works related to myocardial infarction and heart diseases are summarised in table 4. In some works ([8], [9] and [10]) only ECG data was considered for implementation. To increase the quality of research, attributes related to diabetes, ECG data and risk factors of myocardial infarction are considered in this work. Considering attributes on these criteria increases the scope of prediction. In [11] a clustering technique is used along with a feature selection technique, but no evaluation of the model was provided. In this work statistical evaluation metrics were considered. In some works ([12], [13], [21] and [22]) a few ML classification algorithms are used; these are the most commonly used algorithms. In this work regression algorithms are considered instead of the regularly used algorithms.
Table 4 (surviving row): Decision tree and naive Bayes. Prediction of heart disease is done using two ML algorithms. Accuracy is the metric considered for evaluating and comparing algorithms.
In [14] the MLR algorithm is used for predicting heart disease, but it was not evaluated. In [15] and [16] only two attributes were chosen. In [15] the author concluded that adding a few more predictor attributes would give better results. The works [18] and [19] use RR. In [18] only 14 attributes were used in the dataset and the model was not evaluated. In [19] RR was used for feature selection. In the works [17] and [23] lasso regression was used. The works [24], [25] and [26] were consulted for the theoretical concepts of the three algorithms. In [17] the lasso regression algorithm outperformed OLS and RR. Ordinary Least Squares (OLS) regression minimizes the sum of the squared differences between actual and predicted values. The authors of [17] stated that OLS regression has the disadvantage of overfitting. In [23] the HbA1C attribute was the significant one for predicting type-2 diabetes.
In the present work a dataset with 22 attributes was considered and the model was also evaluated, which distinguishes this work from the related literature. MLR is a type of linear regression that deals with two or more predictor attributes and a target attribute. Linear regression has the drawback of overfitting, so regularization techniques are used to overcome it by penalizing the coefficients. Ridge and lasso regression are the regularization techniques of linear regression. Lasso regression performs better when there are few significant attributes, whereas RR performs better when there are many significant attributes. As there are few significant attributes here, lasso regression obtained better results than the other two. This work has a limitation: the dataset is small, with only 133 instances. Choosing the correct predictor attributes is important for developing an efficient model. Although the number of instances is small, the 22 attributes considered cover the data required for prediction of myocardial infarction. Finally, it is recommended that further work use a larger dataset with more instances to obtain an even more efficient model, and consider lasso regression, as it performed better than the other two algorithms.

Conclusion
Myocardial infarction, also known as heart attack, is a type of cardiovascular disease. It is one of the complications of diabetes and its prediction is of utmost importance. The regression algorithms multiple linear regression, ridge regression and lasso regression are used in this work to predict myocardial infarction in type-1 and type-2 diabetic patients. The algorithms are implemented in R. After analysis of the considered algorithms, it is found that the proposed lasso regression algorithm performed better in predicting myocardial infarction in terms of the statistical measures RMSE and r_squared, obtaining values of 0.418 and 0.2278 respectively. The attribute "Thallium heart scan" is identified as the most important attribute. Based on the statistical measures, lasso regression is suggested for predicting myocardial infarction in diabetic patients.