Heart Disease Diagnosis Based on Deep Radial Basis Function Kernel Machine

Over the years, Radial Basis Function (RBF) kernel machines have been used in many machine learning tasks, but certain flaws prevent their use in some modern applications: for example, the number of learning parameters in some kernel machines grows rapidly when the data to be predicted exhibit a large number of variations. Moreover, kernel machines with a single hidden layer have no mechanism for feature selection in a multidimensional data space, so the learning task becomes intractable as the amount of data available for analysis grows. To address these issues, this paper investigates a "deep learning" framework composed of multiple layers of adaptive non-linear components: the multilayer RBF kernel machine. Specifically, three approaches to feature selection and dimensionality reduction for training RBF networks based on multilayer kernel learning are explored and compared in terms of accuracy, performance, and computational complexity. As opposed to "shallow learning" algorithms, which usually have a single-layer architecture, the results show that the multilayered system performs better on large and highly varied data. In particular, the variants that combine feature selection and dimensionality reduction yield more accurate results. This paper proposes a novel scheme based on deep multilayer RBF kernel machine learning for sleep apnea detection and quantification using statistical features of ECG signals. The results show that the proposed approach provides significant accuracy improvements over state-of-the-art methods. Because of its noninvasive and low-cost nature, the algorithm has the potential for numerous applications in sleep medicine.


Introduction
In recent decades, a substantial number of techniques for different machine learning tasks, including classification, regression, function approximation, clustering, and feature transformation, have been developed with the help of a class of non-linear functions: radial basis functions (RBFs) [1,2]. One interesting idea is the radial basis function network and its generalization, the kernel network. In this work, special emphasis is given to the application of these networks to the problem of data classification. Radial basis functions are a special kind of function whose value monotonically decreases (or increases) as the distance from a central point grows. The center, the distance scale, and the particular shape can vary between models [1]. The most commonly used example is the Gaussian function $\varphi(r) = e^{-(\varepsilon r)^2}$, while, of course, many other variations are possible.
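To make this concrete, here is a minimal sketch of a Gaussian RBF in Python; the shape parameter `epsilon` and the example inputs are illustrative choices, not values from this work:

```python
import numpy as np

def gaussian_rbf(x, center, epsilon=1.0):
    """Gaussian radial basis function phi(r) = exp(-(epsilon * r)^2),
    where r is the Euclidean distance from the center."""
    r = np.linalg.norm(np.asarray(x) - np.asarray(center))
    return np.exp(-(epsilon * r) ** 2)

# The response decays monotonically with distance from the center.
print(gaussian_rbf([0.0, 0.0], [0.0, 0.0]))  # 1.0 at the center
print(gaussian_rbf([1.0, 1.0], [0.0, 0.0]))  # smaller farther away
```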
Another way to think about RBF networks is as kernel machines with a specific type of kernel. Kernel machines are machine learning methods that allow regular techniques developed for learning linear functions to be applied to problems with non-linear dependencies. This goal is achieved via a transformation (mapping) of the input feature space into a Hilbert space.
The first kernel machines were a natural extension of the Support Vector Machine (SVM) proposed by Vapnik for the classification of linearly separable data points. The goal of the algorithm is to find the hyperplane that divides the two datasets and has the maximum distance (margin) between itself and the closest points from the two classes. This hyperplane can be represented as a linear combination of the training samples lying on that margin (the support vectors):

$$f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b.$$

The algorithm finds the optimal values for the parameters $\alpha_i$ and $b$.

The extension to the non-linearly separable case exploits a so-called feature mapping function $g$, with the hyperplane taking the form

$$f(x) = \sum_i \alpha_i y_i \langle g(x_i), g(x) \rangle + b.$$
A function $k(x, x')$ that satisfies the conditions of Mercer's theorem can be represented in the form $k(x, x') = \langle g(x), g(x') \rangle$ in a Hilbert space and is called a kernel. If the kernel function is selected appropriately, the data points can become separable in the new feature space. This method is usually referred to in the literature as the "kernel trick"; once applied, the method for linear SVM training can be used. The Gaussian radial basis function is one such kernel, so support vector learning can be applied as a learning method for a radial basis function network, with the support vectors being the centers of the radial basis functions [7,8].
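As an illustration of the kernel trick in practice, the following is a minimal sketch using scikit-learn's SVC with an RBF kernel; the synthetic dataset and hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Non-linearly separable data: two concentric circles.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# SVM with a Gaussian (RBF) kernel: the "kernel trick" lets the linear
# SVM machinery operate in the implicit feature space of the kernel.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# The support vectors play the role of the RBF centers.
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```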
Paper [9] extends the support vector machine algorithm to the multiclass classification problem by using different weights for different outputs and selecting the class that produces the maximum value. RBF networks and kernel machines in general have proven their effectiveness in different machine learning tasks, and there has been extensive theoretical and algorithmic development in this field since they were first introduced. However, the method has some flaws that prevent its use in some up-to-date applications. Like other methods that rely on data smoothness and locality (meaning that similar points should lie close to each other in the feature space), such as kNN, kernel machines suffer from rapid growth in the number of learning parameters when predicting data with a large number of variations [24]. Another problem is that kernel machines with a single hidden layer have no mechanism for feature selection in a multidimensional data space and rely completely on the user for this part. The optimal selection of features for a particular method becomes more and more complicated as the amount of data available for analysis grows. To solve this problem, which is common to many machine learning algorithms, the paradigm of deep learning has recently emerged. The idea of this approach is based on the assumption that a learning model should not only provide prediction results but also learn the optimal data representation required for the task.
The notion of a good data representation usually includes several points [23]: smoothness and natural clustering, meaning that similar data points should lie close to each other in the learned feature space; expressiveness of explanatory factors, meaning that the learned feature space should be of reasonable size but still able to explain multiple variations of the data; a hierarchical organization of explanatory factors, in which more abstract features are defined in terms of less abstract features located lower in the hierarchy; shared factors across tasks, since the same concepts can often explain different events, so it is useful to use the same features to predict different parameters; sparsity, meaning that only a small number of factors should be relevant for each particular observation; and simplicity, since for many algorithms it is desirable to have simple (in the best case, linear) dependencies between factors. The term "deep" learning was coined in contrast to "shallow" learning algorithms, which have a fixed, usually single-layer architecture. "Deep" learning architectures are compositions of many layers of adaptive non-linear components [27]. By analogy with the mammalian brain, which is capable of storing information at several levels of abstraction, multilayer architectures are expected to improve learning algorithms. However, straightforward training of neural networks with multiple hidden layers has shown improvement only up to a certain number of layers (2 or 3); further increases did not provide significant improvement, and in some cases the results even became worse [28]. The existing algorithms face the problem of local minima, and the generalization of such gradient-based methods is reported to become worse as the number of layers grows. Several papers have also shown that supervised training of each separate layer does not give a significant improvement over regular multilayer learning. Later development has gone in the direction of learning an intermediate feature representation for each new layer.
Deep learning networks and training algorithms using this approach have achieved significant results in multiple real-life applications [23], including computer vision, audio signal processing, and natural language processing. In some fields of study they are still considered among the best available approaches.
Successful examples of deep neural networks for supervised learning mainly exploit two different approaches and their combinations: (i) a special network structure in terms of neuron connections, together with hierarchically organized feature transformations applied to their outputs (e.g., convolutional neural networks), and (ii) multilayer networks in which the feature representation of each layer is learned with an unsupervised technique, followed by tuning of the network parameters with a regular supervised learning technique.

Methodology
This work shows how kernel methods can be extended to hierarchical structures without requiring complicated machinery. Three algorithms using the RBF kernel are explored, and the main differences between them, in terms of how the transformation is defined through the combination of a linear mapping and a non-linear activation function, are studied after training and testing. In this section, a brief account of the methodological steps is provided.
The figure showing the MKMs for the three different transformations presents an overview of the proposed methodology. The individual steps are standard, but their combination is novel, and the promising outcome of the procedure justifies its implementation. We present a detailed discussion of the steps below; a sketch of the layered procedure follows this paragraph. The method centers on kernel PCA for deep learning: we prune features in a single step and then apply the kernel PCA algorithm to produce a result that can be used as input for the next layer. An optional pruning of the selected features may also be applied to the result to further reduce feature redundancy. In the first step, we determine the number of layers N to be used, and in step 6 we loop over the steps: as long as N is not exceeded, the algorithm continues to iterate from the second step. Finally, we compute the result of the algorithm and feed the feature representations to the classifier to make the final decision.
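The following is a minimal sketch of this layered procedure, assuming scikit-learn's KernelPCA and a mutual-information ranking as the pruning criterion; the dataset, layer count, and kernel parameters are illustrative stand-ins rather than the exact configuration used in this work:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import KernelPCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
H_tr, H_te = scaler.transform(X_tr), scaler.transform(X_te)

n_layers = 3  # N: number of layers (illustrative)
for layer in range(n_layers):
    # Prune features with a supervised ranking (mutual information here).
    k = min(max(5, H_tr.shape[1] // 2), H_tr.shape[1])
    sel = SelectKBest(mutual_info_classif, k=k).fit(H_tr, y_tr)
    H_tr, H_te = sel.transform(H_tr), sel.transform(H_te)

    # Kernel PCA with an RBF kernel; its output feeds the next layer.
    kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.1).fit(H_tr)
    H_tr, H_te = kpca.transform(H_tr), kpca.transform(H_te)

# Supervised training happens only at the final layer.
clf = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print("test accuracy:", clf.score(H_te, y_te))
```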

Multilayer RBF kernel machine based on supervised kernel regression
This algorithm extends the first one by applying supervised kernel regression and removing the optional feature selection step, because selection is performed along with the projection. Yger et al. (2011) assume that this gives better computation time. For the latent variable regression, the feature extraction is also incorporated into the regression step, but it is based on the input rather than the output. The authors claim to overcome the drawback of a separate feature selection step by learning each hidden layer using Kernel Partial Least Squares regression (KPLS).
Below is a summary of how we implemented the multilayer RBF machine based on supervised kernel regression (see the MKM overview figure).

1. Let N be the number of layers we would like to use.
2. Select appropriate kernels and kernel parameters (by cross-validation or otherwise; not described in the original work).
3. Apply supervised regression to extract the next feature value and the corresponding eigenvalue.
4. If the eigenvalue is greater than the selected threshold, go to step 3; otherwise, use all the extracted features as input to the next layer.
5. If the number of iterations exceeds N, go to step 6; otherwise, go to step 2.
6. Feed the feature representations to the classifier to make the final decision.

In this algorithm, feature selection methods and unsupervised dimensionality reduction are incorporated into a supervised regression algorithm. In MKMs, the deep learning approach is achieved through repeated iteration of the supervised regression steps listed above. The use of this mechanism is not new in this context; what is new here is how we apply supervised regression with eigenvalues to extract an appropriate feature representation. Relative to the first algorithm, our main modification is replacing kernel PCA with supervised regression, while retaining the selection of appropriate kernels and kernel parameters, as these are already in place in the existing algorithm. Since supervised training only occurs in the last layer of MKMs, the feature selection method is very important. In the first step we determine the number of layers N for the process. As in the first algorithm based on kernel PCA, a ranking method is used to prune features for appropriate selection; we then apply supervised regression to the pruned features. The supervised regression used here draws on MKM-style deep learning architectures; its specific contribution is the extraction of the next feature value and its corresponding eigenvalue.
The eigenvalue is computed at each step of the algorithm, and features are extracted for as long as their eigenvalues exceed the threshold; the extracted features then serve as input for the next layer. The process iterates until the intended number of layers N is reached. Finally, we feed the resulting feature representations to the classifier, completing the multilayer RBF machine based on supervised kernel regression, as sketched below.
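Scikit-learn has no dedicated KPLS estimator; one common formulation applies ordinary PLS regression to the kernel matrix, and the sketch below follows that assumption. The toy data, layer count, and kernel bandwidth are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # toy non-linear target

H = X
for layer in range(2):  # N = 2 layers, chosen arbitrarily
    K = rbf_kernel(H, H, gamma=0.1)   # RBF kernel (Gram) matrix of the layer
    # Kernel PLS as PLS regression on the kernel matrix: the latent scores
    # serve as the supervised feature representation of the layer.
    pls = PLSRegression(n_components=5).fit(K, y)
    H = pls.transform(K)              # scores become the next layer's input

print("final layer representation shape:", H.shape)
```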

Multilayer RBF kernel machine based on unsupervised kernel regression
The proposed algorithm is based on an unsupervised latent space. The motivation is that unsupervised methods work well with regular neural networks: unsupervised learning focuses on the important patterns in the data regardless of their labels, and it reduces the input dimensionality without losing crucial information. In this algorithm, we use the idea of unsupervised methods described in Memisevic's work (2003), which covers kernel parameter selection and dimensionality selection, as shown in Figure 3. Suppose we have N data points in a q-dimensional space, $y \in \mathbb{R}^q$ (the observable space), and we want to find a representation of these data in a d-dimensional space with $d < q$, $x \in \mathbb{R}^d$ (the latent space). The solution depends on the selected kernel bandwidth and can be explained in the following steps (a sketch of the objective follows this list):

1. For the training set Y, find the solution X that optimizes the error in the latent space.
2. For the particular X and Y, solve the optimization problem to find the optimal scaling S of X (X := X * S) that optimizes the error in the observable space.
3. Identify the optimal range for the kernel bandwidth h based on the graph connectivity algorithm.
4. Traverse different values of h to identify the optimal one.
5. Select the appropriate number of parameters d.

The idea of unsupervised kernel dimension reduction has been applied in [52], covering both linear and non-linear variants; the current work considers non-linear unsupervised kernel dimension reduction inspired by [52].
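A minimal sketch of the latent-space error minimization in the spirit of Memisevic's unsupervised kernel regression is given below; the Nadaraya-Watson form of the reconstruction, the optimizer, and all parameter values are assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def ukr_error(x_flat, Y, d, h):
    """Leave-one-out Nadaraya-Watson reconstruction error of Y from latent X."""
    n = Y.shape[0]
    X = x_flat.reshape(n, d)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    K = np.exp(-D2 / (2.0 * h ** 2))                     # Gaussian kernel, bandwidth h
    np.fill_diagonal(K, 0.0)                             # exclude self-reconstruction
    W = K / K.sum(axis=1, keepdims=True)
    return ((Y - W @ Y) ** 2).sum()

rng = np.random.default_rng(0)
Y = rng.normal(size=(60, 5))             # observable data: N = 60 points, q = 5
d, h = 2, 1.0                            # latent dimensionality and bandwidth
x0 = rng.normal(scale=0.1, size=60 * d)  # random latent initialization
res = minimize(ukr_error, x0, args=(Y, d, h),
               method="L-BFGS-B", options={"maxiter": 50})
print("reconstruction error after optimization:", res.fun)
```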
The kernel choice from [29], a multilayer kernel machine (MKM), is adopted for the three algorithms investigated in this work. In particular, the MKM is introduced in the first algorithm to integrate unsupervised dimensionality reduction and supervised feature selection methods into the kernel PCA algorithm. The remaining steps of the unsupervised algorithm are: select the kernel parameters and the optimal dimensionality of the latent space; use the extracted latent variables as input to the next step; if the number of iterations exceeds N, stop iterating, otherwise return to the regression step; finally, feed the feature representations to the classifier to make the final decision. This algorithm combines KPCA with a regression algorithm to achieve more reliable inputs and, consequently, results: an unsupervised regression algorithm is embedded together with supervised feature selection and unsupervised dimensionality reduction. The idea of the multilayer kernel machines (MKMs) implemented in this work is to filter out all but the relevant input features, feed them into the developed unsupervised regression algorithm, and construct an effectively infinite-dimensional representation. Additionally, unsupervised dimensionality reduction is applied in the feature space. This approach is a high-level implementation of the MKM concept through a combination of different machine learning techniques: two are combined to develop a third. Implementing a multilayer RBF machine based on kernel PCA is not new; in this algorithm, however, we have replaced PCA with unsupervised kernel regression, following the suggestion of Memisevic (2003). Based on Memisevic's idea, this work combines feature selection with unsupervised regression. Unlike the supervised regression procedure, the latent variables are extracted using an unsupervised regression method; in this work, we extract these variables by transforming the input to obtain even better input parameters. This is based on three key steps from Memisevic (2003). As usual, we determine the number of layers N we want to use; the extracted latent variables of each layer are used as input to the next layer, and the process continues until N is reached. We then feed the feature representations to the classifier, implementing the multilayer RBF machine based on unsupervised kernel regression. The procedure of the unsupervised regression method is provided below in Figure 1.

Conclusion
In conclusion, the multilayered systems generally show better results on large and highly varied data than "shallow learning" algorithms with their usually single-layer architecture. Moreover, among the multilayer algorithms, supervised regression tends to produce more stable and accurate results, so the use of this particular algorithm should be preferred.

Figure: Multilayer kernel machines (MKMs) for the three different transformations.

2. Optimize the error in the latent space:

$$E(X) = \sum_i \left\| y_i - \sum_j k(x_i, x_j)\, y_j \right\|^2.$$

There is an efficient way to solve this problem via eigenvalue decomposition, and we obtain an explicit representation of x in terms of y.
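A minimal sketch of such an eigendecomposition-based solution is given below, using the standard kernel PCA construction (double-centering the Gram matrix and scaling the top eigenvectors); treat it as an illustration of the idea rather than the exact procedure of the original work:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def latent_from_gram(K, d):
    """Latent coordinates from the eigendecomposition of a centered Gram matrix."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                 # double-centering of the Gram matrix
    w, V = np.linalg.eigh(Kc)      # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:d]  # indices of the d largest eigenvalues
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

Y = np.random.default_rng(0).normal(size=(30, 4))
X_latent = latent_from_gram(rbf_kernel(Y, Y, gamma=0.5), d=2)
print(X_latent.shape)  # (30, 2): explicit latent representation of x in terms of y
```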

6. Write the code to incorporate the method into classification.
7. Run the tests.
8. Add special constraints on the distances between different classes to the optimization problem, to better fit the data for further classification, and adapt the optimization solution.
9. If the results are unsatisfactory, try the optimization in the observable space:
   a. Determine X for the training set Y.
   b. Find a model for computing X for new points Y.
Below is a summary of how we implemented the multilayer RBF machine based on unsupervised kernel regression (see the MKM overview figure).

1. Let N be the number of layers we would like to use.
2. Apply unsupervised regression to extract latent variables that better represent the input parameters (kernel parameter selection and dimensionality selection are embedded in this step), based on the ideas described in Memisevic (2003): learning an optimal latent space representation of the input data, and learning the transformation from the observable to the latent space.

Figure 1: The procedure of unsupervised regression (before improvement).

Unfortunately, after model training and evaluation of the three algorithms, the unsupervised method did not give the expected accuracy. To improve performance, the unsupervised latent regression with projection method is suggested. The classifier based on this method is built by the following steps (a sketch follows this list):

1. The whole training dataset is subdivided into several groups based on the data class labels.
2. For each group individually, we train a separate latent regression model.
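A minimal sketch of the per-group idea follows: one model per class, with a new point assigned to the class whose model reconstructs it best. Kernel PCA with an inverse map is used here as an illustrative stand-in for the latent regression model, since the original model equation is not recoverable from the text:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA

X, y = load_iris(return_X_y=True)

# One unsupervised model per class label (kernel PCA with an inverse
# transform as an illustrative stand-in for the latent regression model).
models = {}
for label in np.unique(y):
    models[label] = KernelPCA(n_components=2, kernel="rbf", gamma=0.1,
                              fit_inverse_transform=True).fit(X[y == label])

def classify(x):
    # Assign the class whose model reconstructs x with the lowest error.
    errors = {c: np.linalg.norm(x - m.inverse_transform(m.transform(x[None])))
              for c, m in models.items()}
    return min(errors, key=errors.get)

preds = np.array([classify(x) for x in X])
print("training accuracy:", (preds == y).mean())
```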

Figure 2 below presents the accuracy values for the four algorithms as the number of layers varies.

Figure 2: Accuracy vs. number of layers when applying the four algorithms to the Apnea dataset.

Figure 3 below presents the MSE values for the four algorithms as the number of layers varies.

Figure 3: MSE vs. number of layers when applying the four algorithms to the Apnea dataset.

Figure 4 below presents the sensitivity values for the four algorithms as the number of layers varies.

Figure 4: Sensitivity vs. number of layers when applying the four algorithms to the Apnea dataset.

Figure 5 below presents the specificity values for the four algorithms as the number of layers varies.

Figure 6: Cohen's Kappa vs. number of layers when applying the four algorithms to the Apnea dataset.

Figure 7 below presents the training time values for the four algorithms as the number of layers varies.

Figure 7: Training time (sec) vs. number of layers when applying the four algorithms to the Apnea dataset.

Figure 8 below presents the validation time values for the four algorithms as the number of layers varies.

Figure 8: Validation time (sec) vs. number of layers when applying the four algorithms to the Apnea dataset.

PCA features are best selected using a ranking method in which redundant features are discarded. The ranking method is used to prune away inappropriate or unwanted features at each layer of the MKM.
Kernel PCA has existed for over ten years and has more recently been given new impetus by the unsupervised pretraining approach of deep belief nets. In MKMs, the kernel PCA features of one layer are used as input features for the next layer; in our case, we select appropriate kernels and kernel parameters and apply them from layer to layer. While a non-linear transform by the arc-cosine kernel can be utilized in kernel PCA within MKMs, an RBF kernel that mimics the projections of a randomly initialized neural network is regarded as an alternative approach, and it is the one used in this work. The idea of Cho (2012) is to implement unsupervised dimensionality reduction and supervised feature selection techniques on top of the multilayer arc-cosine kernels; in this work, unsupervised dimensionality reduction is implemented together with feature selection to exclude unwanted features from the input to the next layer. In kernel PCA, iterative application is employed to realize deep learning in MKMs.
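For reference, a minimal sketch of the degree-1 arc-cosine kernel of Cho and Saul, the alternative mentioned above, is given below; the toy data are illustrative:

```python
import numpy as np

def arc_cosine_kernel(X, Z):
    """Arc-cosine kernel of degree 1 (Cho & Saul): mimics the feature map
    of an infinitely wide one-layer network with ReLU-like activations."""
    nx = np.linalg.norm(X, axis=1, keepdims=True)
    nz = np.linalg.norm(Z, axis=1, keepdims=True)
    cos = np.clip((X @ Z.T) / (nx * nz.T), -1.0, 1.0)
    theta = np.arccos(cos)  # angle between each pair of inputs
    return (nx * nz.T) / np.pi * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

X = np.random.default_rng(0).normal(size=(5, 3))
print(arc_cosine_kernel(X, X).shape)  # (5, 5) Gram matrix
```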