A Comparative Analysis of Feature Extraction Methods for Classifying Colon Cancer Microarray Data

Feature extraction is an effective method for reducing dimensionality in the analysis and prediction of cancer classification. Microarray technology has shown great importance in identifying informative genes that can improve diagnosis. Analysing microarray data is challenging because the datasets are high-dimensional with few samples and contain many noisy or irrelevant genes as well as missing data. This paper presents a comparative study that demonstrates the effectiveness of feature extraction as a dimensionality reduction process and investigates the most efficient approach for enhancing the classification of microarray data. Principal Component Analysis (PCA) as an unsupervised technique and Partial Least Squares (PLS) as a supervised technique are considered, and a Support Vector Machine (SVM) classifier is applied to the dataset. The overall result shows that the PLS algorithm provides an improved performance of about 95.2% accuracy compared to the PCA algorithm.


Introduction
Dimensionality reduction is an important and necessary tool in the analysis of microarray expression datasets. It seeks to summarize a collection of related variables by transforming a high-dimensional dataset into a lower-dimensional dataset that retains the most significant variables underlying the data. This tool has attracted numerous researchers in bioinformatics who work with gene expression datasets [1], [2]. Several methods for dimension reduction exist, but none has been confirmed as the best for all circumstances, because some information is lost during processing [3]. To improve performance on such datasets, feature extraction is considered as a general method of dimension reduction. In feature extraction, the original high-dimensional feature space is projected onto a low-dimensional feature space. In typical microarray data analysis the training sample size is always limited, and classification algorithms may lose efficiency or even fail on high-dimensional data. Dimension reduction is therefore a good alternative to variable selection for overcoming the dimensionality problem; it uses a small number of features to replace a feature subset containing strong correlations in the original data [4].
Many feature extraction algorithms and techniques have been proposed in the literature; one of the most popular and widely used is PCA [1]. PCA is an unsupervised method and an effective tool, but it is not efficient for high-dimensional and complex datasets, because it cannot precisely retrieve the true latent variables of complex, supervised datasets [5]. Data in a very high-dimensional space often lie in a lower-dimensional space, and with this kind of data the intrinsic supervised structure cannot be found through an unsupervised feature extraction technique. Another drawback of PCA is that the size of the covariance matrix grows with the dimensionality of the data points.
To overcome the drawbacks of unsupervised feature extraction on very high-dimensional datasets, several further feature extraction methods have been developed. Local Linear Embedding (LLE) [1] is reported to be efficient and powerful for dimensionality reduction compared with other algorithms [5], [6], [7]. Local Tangent Space Analysis (LTSA) is another nonlinear dimensionality reduction technique that describes local properties of the high-dimensional data using the local tangent space of each data point [8]. These techniques have been successfully applied to microarray data.
In this paper, PLS is proposed as a supervised feature extraction algorithm for dimensionality reduction to handle the curse of dimensionality of microarray data. The experiments show that PLS outperforms PCA in reducing the dimension of supervised structures and in visualization performance. This paper is organized as follows. Section 2 deals with related work on dimensionality reduction for the classification of microarray data. Section 3 describes the dataset used, the methodology and the algorithms. Section 4 presents the results and discussion. Section 5 concludes the work.

Related Works
Jian, Linh, and David [9] presented dimension reduction models using PLS, SIR and PCA. The comparative performance of their classification procedures was similar for PCA and PLS, and the complexity of microarray data analysis was reduced. PLS and SIR were both more expensive for dimension reduction but more effective than PCA, and the results were consistent with the analysis of the methods. In 2012, the authors of [10] carried out normalization to scale all the features in the dataset, followed by dimensionality reduction before clustering. A diabetes dataset containing 768 instances and 8 attributes was used, and the PCA algorithm was applied to reduce the dimensions. Out of 8 features, 4 were selected without loss of information. The WEKA 3.7 tool was used to investigate the diabetes data. After dimensionality reduction, a density-based clustering algorithm was used to find the maximal set of density. Dimensionality reduction was used to increase the accuracy of the clustered data.
In 2008, the authors of [11] compared two different feature extraction algorithms. The dataset consisted of product features drawn from customer reviews. In the first algorithm, candidate features are identified and then pruned. In the second approach, association rule mining is used to find frequent patterns. The customer reviews were collected from websites such as Amazon and CNET and cover five different products (two digital cameras, a DVD player, an MP3 player and a cell phone). The Likelihood Ratio Test was the method used to extract the product features. In 2016, the authors of [12] presented methods for visual data mining in order to mine the data and make it comprehensible. The author applied attribute selection methods, namely the wrapper method and the filter method. Seven datasets (including lung cancer, promoter, sonar, Arrhythmia, Colon Tumour, and Central Nervous System) were used, and the accuracy was calculated before and after reduction. In this reduction framework, the number of attributes was reduced. Data visualization was used to determine the relationships within the data. Algorithms such as LDA, QDA, and KNN were used in this work, and LDA was found to perform efficiently and to reduce the attributes effectively. In 2014, the authors of [16] reviewed numerous software applications that help users implement feature extraction of gene expression data; the paper reviewed software for feature extraction methods such as PCA, ICA, PLA and LLE. The software applications have limitations in terms of computational performance, and there is a need to develop classification methods that improve the performance of these feature extraction methods. In 2015, the authors of [16] compared dimension reduction based on logistic regression models for case-control genome-wide association studies by employing PCA and PLS; limitations in the interactions among the genes of the dataset used affected the goodness of fit and the accuracy of the parameter estimation of PLS, and further investigation was needed.

Datasets Used
The colon cancer dataset was used for this experiment. It contains the expression levels of the 2000 genes with the highest minimal intensity across 62 tissues, derived from 40 tumour and 22 normal colon tissue samples [13]. Gene expression was analyzed with an Affymetrix oligonucleotide array complementary to more than 6,500 human genes. The gene intensity was derived from about 20 feature pairs that correspond to the gene on the DNA microarray chip by using a filtering process. Details of the data collection methods and procedures are described in [13], and the dataset is available from the website http://microarray.princeton.edu/oncology/.

Methods and Algorithms
Dimensionality reduction is an important step for reducing the original data features without loss of information. The main objective of this work is to compare two different feature extraction algorithms, namely PCA and PLS. The architecture is shown in Figure 1.

Principal Component Analysis (PCA)
PCA is one of the best-known dimension reduction techniques; its idea is to reduce the high dimensionality of a given dataset while keeping enough of the variation present in the initial predictor variables. This is attained by transforming the p initial variables X = [x_1, x_2, …, x_p] [16] into a new set of q predictor variables. PCA is a widely used unsupervised feature extraction technique; it works by replacing the original variables in the data with numerical variables called principal components, which capture the most descriptive features [15]. PCA mathematically transforms the data by referring them to a different coordinate system in order to obtain the greatest variance. It converts a number of correlated variables into a smaller number of uncorrelated variables called principal components [10]. PCA identifies patterns of similarities and differences in the data; once these patterns are determined, the data can be compressed by reducing the number of dimensions without much loss of information. To conduct PCA on the input data, the following steps are performed, adopting [16]:
- Create an N x d data matrix X with one row vector per data input.
- Subtract the mean from each row vector in X.
- Calculate the covariance matrix ∑ of X.
- Find the eigenvectors and eigenvalues of ∑.
- Select the eigenvectors with the largest eigenvalues.
The principal components are uncorrelated and explain the largest percentage of the variation in the high-dimensional dataset; 10 components were extracted and considered relevant in the colon cancer dataset used.
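The PCA steps listed above can be sketched in a few lines of Python. This is a minimal illustration and not the MATLAB implementation used in this work: the function names (center, covariance, top_eigenpairs, pca), the power-iteration eigen-solver with deflation, and the toy data are all assumptions introduced here. For a matrix with 2000 gene columns a numerical library would be the practical choice.

```python
import math
import random


def center(X):
    """Subtract the mean row vector (per-feature mean) from every row of X."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def covariance(Xc):
    """Sample covariance matrix (d x d) of already-centered data."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpairs(S, q, iters=500, seed=0):
    """Leading q (eigenvalue, eigenvector) pairs of a symmetric matrix S,
    found by power iteration followed by deflation (an assumed solver)."""
    rng = random.Random(seed)
    d = len(S)
    S = [row[:] for row in S]  # work on a copy
    pairs = []
    for _ in range(q):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = [sum(S[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(S[i][j] * v[j] for j in range(d)) for i in range(d))
        pairs.append((lam, v))
        # Deflation: remove the component just found before the next pass.
        for i in range(d):
            for j in range(d):
                S[i][j] -= lam * v[i] * v[j]
    return pairs


def pca(X, q):
    """Return the q-dimensional scores and the (eigenvalue, eigenvector) pairs."""
    Xc = center(X)
    pairs = top_eigenpairs(covariance(Xc), q)
    d = len(Xc[0])
    scores = [[sum(r[j] * v[j] for j in range(d)) for _, v in pairs] for r in Xc]
    return scores, pairs


# Toy data lying on a line, so a single component carries all the variance.
X_demo = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
scores, pairs = pca(X_demo, q=1)
```

In this toy case the single eigenvalue equals the total variance of the two columns, which is the sense in which the leading components "explain" the variation.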

Partial Least Squares (PLS)
Partial Least Squares (PLS) is a supervised feature extraction technique that is widely used for modelling associations between blocks of observed variables by means of latent variables. It seeks uncorrelated linear transformations (latent components) of the original predictor variables that have high covariance with the response variables [16]. The goal of PLS is to find the linear relationship between the response variable y and the explanatory variables X:
$$X = T P^{T} + E_x, \qquad y = T C^{T} + E_y$$
where T represents the scores (latent variables), P and C are the loadings, and E_x and E_y are the residual matrices obtained from the original X and y variables. Feature extraction using PCA ignores the response variable, whereas PLS integrates the response variable during the dimensionality reduction procedure. PLS outperforms PCA in the case of microarray gene expression data: PLS only requires specifying the number of gene components, whereas PCA requires choosing the essential gene components [17].
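The way PLS uses the response while extracting latent components can be illustrated with a small NIPALS-style sketch for a single response (often called PLS1). This is an assumed textbook variant and not necessarily the routine used in this work; the function name pls1, the 0/1 coding of the class labels and the toy data are hypothetical.

```python
import math


def _center_columns(X):
    """Subtract each column mean from X (list of rows)."""
    n, d = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in X]


def pls1(X, y, q):
    """Extract up to q latent components with a NIPALS-style PLS1 loop.

    Returns (T, P, C, E, f): score vectors T (one list per component),
    X-loadings P, y-loadings C, and the residuals E (for X) and f (for y).
    """
    n, d = len(X), len(X[0])
    E = _center_columns(X)
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]
    T, P, C = [], [], []
    for _ in range(q):
        # Weight vector: direction in X-space with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left in X that co-varies with y
        w = [v / norm for v in w]
        # Scores, then loadings by regressing X and y on the scores.
        t = [sum(E[i][j] * w[j] for j in range(d)) for i in range(n)]
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(d)]
        c = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate: remove what this component explains from X and y.
        E = [[E[i][j] - t[i] * p[j] for j in range(d)] for i in range(n)]
        f = [f[i] - c * t[i] for i in range(n)]
        T.append(t)
        P.append(p)
        C.append(c)
    return T, P, C, E, f


# Hypothetical toy data: 6 samples, 3 features, class labels coded 0/1.
X_demo = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [3.0, 4.0, 0.0],
          [4.0, 3.0, 2.0], [5.0, 6.0, 1.0], [6.0, 5.0, 3.0]]
y_demo = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
T, P, C, E_res, f_res = pls1(X_demo, y_demo, q=2)
```

The score vectors in T are mutually orthogonal by construction, and they would serve as the reduced features passed to a classifier. Unlike the PCA directions, each weight vector here is computed from the class labels.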

Support Vector Machine (SVM)
In this step, the classification results are computed using SVM. SVM is a classification technique proposed by Vapnik that has since been applied to several domains. Here SVM is applied to microarray cancer data comprising many gene expression values, as the final step of an integrated algorithm, to classify the cancer tissues after the preceding analysis steps. SVM is a constructive learning procedure based on statistical learning theory [18]. It is used for classification tasks and uses linear models to implement non-linear class boundaries by transforming the input space into a new space through a non-linear mapping. SVM produces an accurate classifier with less overfitting and is robust to noise.
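The idea of implementing non-linear boundaries with a linear model in a transformed space can be made concrete with a kernel function. The sketch below is illustrative only: the degree-2 polynomial kernel, the explicit feature map phi and the sample points are assumptions chosen for clarity, and the text does not state which kernel was used in the experiments.

```python
import math


def phi(x):
    """Explicit non-linear map of a 2-D point into a 3-D feature space."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel computed in the input space."""
    return dot(x, z) ** 2


x, z = [1.0, 2.0], [3.0, 4.0]
in_input_space = poly_kernel(x, z)        # no explicit mapping needed
in_feature_space = dot(phi(x), phi(z))    # same value via the mapped points
```

Both quantities agree, which is why a linear classifier in the mapped space can be trained without ever forming the mapped vectors.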

MATLAB
The statistical toolbox of MATLAB provides several PLS and PCA functions for multivariate analysis, alongside functions for clustering, dimension reduction, factor analysis, visualization, and others. Most of these functions are used for dimension reduction. All of these functions are implemented in MATLAB.

Results and Discussions
The features extracted from the colon cancer dataset were classified, and the classification results show the capability of the features for classifying the colon status. The average classification accuracies obtained using features from PCA and PLS are recorded in tabular form below. The proposed methodology was applied to the publicly available colon cancer database. In this experiment, PCA is used as a feature extraction method to reduce the high dimensionality, and SVM is used as the classifier. PCA is used to de-correlate the data, and 10 components were obtained. In Fig. 2, the overall accuracy obtained using PCA as the feature extraction method to transform the dataset is reported in a confusion matrix. Table II compares the methods used in terms of several performance measures: accuracy, sensitivity, specificity, precision, error, time and area under the curve. This comparison shows the soundness of the proposed approach with respect to the state of the art. The colon cancer dataset achieved its best result with PLS for feature extraction, which makes this method suitable for practitioners.
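Several of the performance measures named above follow directly from the four cells of a binary confusion matrix. The sketch below shows the usual definitions, treating tumour as the positive class. The function name binary_metrics and the counts are hypothetical: they are chosen only so that the total matches the 62 samples and are not the values reported in this paper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard measures from a 2x2 confusion matrix (tumour = positive)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
        "error": 1.0 - accuracy,
    }


# Illustrative counts only (40 tumour + 22 normal = 62 samples).
m = binary_metrics(tp=35, fn=5, fp=2, tn=20)
```

Time and area under the curve cannot be derived from the confusion matrix alone; they need the run time and the classifier scores respectively.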

Conclusion
In this paper, a widely used colon cancer dataset was used to evaluate the algorithms. The dimension reduction algorithms used to reduce the high-dimensional data were PCA and PLS, with SVM as the classifier, and the approach was successfully implemented in MATLAB. For the purpose of finding the smallest gene subsets for accurate cancer classification, the PLS method is highly effective compared to PCA.
The PLS-based method showed better performance than the PCA-based method, with 95.16% versus 82.30% accuracy. Hence it can be stated that the PLS-based dimensionality reduction scheme is suitable for microarray gene classification, as it extracts relevant and reduced information from the feature selection based technique.
In future studies, PLS can be compared with other feature extraction methods using the aforementioned criteria. Applying the approach to other datasets would also be a good avenue for further research on dimensionality reduction.

Figure 1. Technique Workflow

on Scalable Information Systems 07 2017 - 09 2017 | Volume 4 | Issue 14 | e2

Let {(x_1, y_1), …, (x_n, y_n)} be a training set with x_i ∈ R^d, where y_i is the corresponding target class. SVM can be reformulated as a maximization problem over the Lagrange multipliers α_i, whose solution is a weighted average of the training features. Here, α_i is a Lagrange multiplier of the optimization task and y_i is a class label. The values of α_i are non-zero for all the points lying inside the margin and on the correct side of the classifier. The kernel function is used to solve the problem. The kernel function analyses the relationship among the data and creates complex divisions in the space [19].
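The maximization problem referred to above does not survive in the text. For reference, the standard soft-margin dual form of the SVM is commonly written as below. This is the textbook formulation, given here as an assumption and not as the authors' exact expression; the kernel symbol K and the box constraint C do not appear in the original.

```latex
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C .

% For a linear kernel K(x_i, x_j) = x_i^{T} x_j the weight vector is
w = \sum_{i=1}^{n} \alpha_i \, y_i \, x_i .
```

In this form the weight vector is a weighted combination of the training features, and only the points with non-zero α_i contribute to it.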

Figure 4. Confusion Matrix of Proposed Classification, Using PCA-Based Classification

Table II: Performance Evaluation of Proposed PCA and PLS Methods.