A Gene Expression Data Classification and Selection Method using Hybrid Meta-heuristic technique

Gene selection from expression data is an ill-posed problem. Feature selection techniques are an efficient way to reduce the dimensionality of large gene expression data sets and to guide the selection of relevant genes. In this paper, a hybrid method (MPG) is proposed that selects genes by combining mutual information with Particle Swarm Optimization (PSO) and a Genetic Algorithm (GA). A simulation environment is developed, which shows that the method reduces the dimensionality of gene expression data and significantly removes duplication among the classified gene data sets. The selected gene sets are analysed with different classifiers, which show higher efficiency and accuracy than traditional selection mechanisms.


Introduction
DNA microarrays measure the expression of all genes in a tissue sample, and many mechanisms have been adopted for classifying the resulting gene expressions, including distinguishing normal tissues from cancer tissues [1]. The data sets involved are large: each sample holds expression values for thousands of genes, typically in the range 1000 to 30000, while only a small number of samples, from 100 to 400, is selected for gene classification [2] [3] and analysis. Different parameters were taken using the Support Vector Machine (SVM), which handles statistically oriented calculations easily with the help of data mining techniques [4]. Its classification rule is applied to avoid over-fitting the data, which matters when the sample size is smaller than the number of features and gene variations. The purpose of this classification method is to set a decision function that deals with the collected data for better classification. Note that this technique fails if the test data are incomplete or unavailable, and that execution time increases when the data have a large number of features and selection criteria, which makes execution very difficult. To avoid such hindrances in microarray data analysis, a subset of genes relevant to classification is selected to decrease the data dimensionality. Feature selection is the best approach for selecting relevant subsets, using data mining techniques to reduce the high dimensionality of data sets that contain irrelevant features, since such features degrade clustering performance [5]. Data mining combines artificial intelligence techniques with database management systems and is used in the medical field for the evaluation of gene data sets [6]. Feature selection is an optimization technique that has been evaluated on a number of benchmark data sets; it increases the performance of data or gene classification by reducing the number of selected features [7] [8].

Rachhpal Singh
Identification of tumours is a complex data clustering task over a huge set of gene expressions [9]. Gene expression data is a well-known example of big data. Feature selection decreases the duplication of data and reduces the high dimensionality of the assigned data sets [10]. If the search space is large, a small set of genes is selected for the identification of any type of tumour. Among the available classification and selection criteria, feature selection is the best method for removing duplication in gene expression data and for controlling the dimensionality of the data sets. Applying a feature selection mechanism to gene expression data improves the accuracy and stability of classification in the learning method and speeds up the selection and classification process. The approach proposed here depends upon the maximization of Mutual Information (MI) combined with a hybrid of Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). It is named the MPG algorithm: Mutual information with Particle swarm optimization and Genetic algorithm.
When samples from two gene data sets are correlated using maximum Mutual Information (the level of dependency among the given data sets and sample size), the results are better than those of traditional or conventional approaches. The correlation among all the genes of the same data set or sample data set is computed and interpreted to show the association between them. Genes with maximal mutual information in the given data sets are highly informative, so they describe the gene sets better than genes chosen without a mutual information mechanism [11]. In the past, a number of techniques using GA, PSO and SVM were successfully applied to the analysis of microarray data [12]. Query optimization using GA with PSO is also an effective technique for analysing the performance of evolutionary techniques for gene selection and classification [13].
In 1995, Kennedy and Eberhart developed a well-known population-based stochastic optimization mechanism called PSO. It simulates the social behaviour of organisms, such as the flocking of birds and the schooling of fish. The search is carried out over a large search space in which every candidate solution is considered a particle. Each particle uses its own memory and the knowledge gained by the swarm to search for the best solution. The fitness function evaluates a fitness value for every particle, and this value is optimized. Each particle adjusts its position by changing its velocity according to its own experience and the experience of neighbouring particles, making use of the best positions found so far. Particles move through the problem space by following the current optimum particle positions. The procedure works iteratively and stops either after a fixed number of iterations or when a minimum error is reached. Particles adapt to the characteristics of high-dimensional gene expression data, so their classification results are superior to those of other evolutionary mechanisms. Every particle sets its position according to pbest (personal best) and gbest (global best), which are fitness values in the search space for gene selection and classification. During the search, trapping in a local optimum must be avoided, which can be done by fine-tuning the inertia weight. If gbest is trapped in a local optimum, a search space is created for every particle in the same space to escape the trap and finally obtain superior classification results. This reduces the number of selected genes. As shown in Figure 1(a), all the particles converge near gbest after a certain period; if gbest does not change its value after four iterations, it is considered to be stuck in a local optimum. In this situation, gbest holds the current fitness value. Resetting its value to zero benefits gene feature selection and accurate classification, as shown in Figure 1(b). The local optimum is skipped and the search continues for a superior gene classification. The individual particles converge towards gbest and the value is reset, as shown in Figure 1(c). A new gbest with a smaller number of genes is then found, as shown in Figure 1(d). This operation achieves superior gene classification by reducing the number of genes according to the selection requirement. A change in particle velocity is interpreted as a change in the probability of the particle's state, location or movement.
To further improve gene selection and classification, GA, a popular heuristic technique, is used. It is efficient and useful when the problem is complex, large and has a number of hindrances. GAs are used to optimize queries quickly [14]. GA follows the natural selection process, selecting the best genes and abandoning the worst to reach the best solution [15]. GA initializes genes as a population for solving the search space problem with the help of genetic operators. The generated gene population then follows an iterative procedure in which new gene sets evolve for classification. Selection, crossover and mutation are the basic operations used to produce new generations during the iteration process. An objective function, the fitness function, is computed during every iteration. If the currently generated gene population has better fitness than the existing one, the old fitness is replaced by the new one, and this continues until either n generations have elapsed or the stopping criterion is met [16]. The GA operators are significant in obtaining the optimal solution. To obtain optimized selection and classification solutions, GA has the following components:
Population: A set of gene chromosomes is considered a population and is generated with random variables by using GA's selection operator.
Fitness function: Also called the objective function; it is used in gene selection and classification to find the exact optimal solution.
Selection: It is applied primarily to a gene population with a ranking metric, and is similar to selecting a hockey team from the sportspersons of different countries, who play the role of the gene population.
Crossover: The swapping of gene segments between individuals after the selection process, using either one-point or two-point crossover, for the best selection and classification of genes.
Mutation: A rarely applied operation used to reach the best solution. It preserves genetic diversity from one generation of a population of chromosomes to the next by inverting randomly selected bits according to the mutation probability [17].
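The components listed above can be sketched on binary chromosomes, where bit j set to 1 means gene j is selected. This is a minimal illustration, not the paper's implementation: the function names, the tournament form of ranking-based selection and the toy fitness in the usage note are all assumptions introduced here.

```python
import random

def tournament_select(population, fitness, rng, k=2):
    """Return the fittest of k randomly drawn individuals (assumed selection scheme)."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]

def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chrom, pm, rng):
    """Invert each bit independently with mutation probability pm."""
    return [1 - g if rng.random() < pm else g for g in chrom]

def next_generation(population, fitness_fn, pc, pm, rng):
    """Build a new population of the same size via selection, crossover, mutation."""
    fitness = [fitness_fn(c) for c in population]
    children = []
    while len(children) < len(population):
        p1 = tournament_select(population, fitness, rng)
        p2 = tournament_select(population, fitness, rng)
        if rng.random() < pc:          # crossover applied with probability pc
            c1, c2 = one_point_crossover(p1, p2, rng)
        else:                          # otherwise parents are copied unchanged
            c1, c2 = p1[:], p2[:]
        children += [mutate(c1, pm, rng), mutate(c2, pm, rng)]
    return children[:len(population)]
```

For example, `next_generation(pop, sum, pc=0.8, pm=0.01, rng=random.Random(0))` evolves a population `pop` of bit lists under a toy fitness that simply counts selected genes; in the setting described in the text, the fitness would instead come from a classifier.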
GA's two basic operations, crossover and mutation, each have a probability: Pc is the crossover probability and Pm is the mutation probability used for gene selection. If these probabilities are not chosen accurately, the search over gene expressions and the classification either converge prematurely or fail to converge. The proposed MPG mechanism improves the conventional GA with PSO by adjusting the Pc and Pm values to search for the best global optimized solution.
Further, the maximum Mutual Information (MI) process is combined with the hybrid of PSO and GA to form a new strategy for solving this complex problem.
Here a novel hybrid algorithm is proposed whose gene feature selection mechanism combines MI with PSO and GA to remove duplication in gene data samples and decrease the dimensionality of gene expression data, using SVM. The proposed technique shows better results when its classification accuracy rates are compared with those of some existing gene feature selection approaches. Different classifiers are applied to the selected gene data sets to test the robustness of the proposed algorithm. All classifiers show classification accuracy rates higher than 80% during simulation.
Section 2 reviews the literature. Section 3 discusses the general scheme or methodology of the hybrid approach. Section 4 discusses the simulation results, and Section 5 concludes.

Hybrid Mutual Information Metaheuristic Feature Selection Algorithm (MPG)

A. Maximization of Mutual Information
Mutual information is maximized with the help of random variables [41]. It measures how much information one random sample (p) carries about a second random sample (q). The mutual information for a given gene expression dataset is defined in equation (1), where a(p) denotes the probability density of variable p, a(q) represents the probability density of variable q and a(p, q) is the joint probability density. MI(x, y) denotes the mutual information between gene expression outline x and class y and is illustrated in equation (2). In equation (2), if gene expression outline x is irrelevant to class y, the result is zero, that is, MI(x, y) = 0. The final formula of maximum mutual information is shown in equation (3).
Here m is the number of classes in the dataset, and the main objective of this operation is to obtain genes that have a strong dependency compared with other groups of genes in the same class. Computing the MI in this way acts as a genetic filter.
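One plausible way to write the three equations referred to in this subsection is given below. It is a reconstruction under stated assumptions, not the original typesetting. Equation (1) is the standard definition of mutual information in the symbols a(p), a(q) and a(p, q), written as sums (for continuous variables the sums become integrals). Equation (2) assumes the common count-based estimate, with M the number of genes in the data set, D the count of gene expressions with outline x in class y, E the count with outline x outside class y, and F the count without outline x in class y. Equation (3) assumes the maximum is taken over the m classes; the indexed class symbol y_i is introduced here for illustration.

```latex
\begin{align}
MI(p, q) &= \sum_{p} \sum_{q} a(p, q) \, \log \frac{a(p, q)}{a(p)\, a(q)} \tag{1} \\
MI(x, y) &\approx \log \frac{D \cdot M}{(D + E)(D + F)} \tag{2} \\
MI_{\max}(x) &= \max_{i = 1, \dots, m} MI(x, y_i) \tag{3}
\end{align}
```

Under this form, an outline x that is statistically independent of class y satisfies D · M = (D + E)(D + F), so the logarithm in (2) is zero, which agrees with the statement that MI(x, y) = 0 for irrelevant genes.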

B. Particle Swarm Optimization
PSO was first proposed for optimization by Kennedy and Eberhart in 1995 [13]. In PSO, a swarm consists of a fixed number of particles and is analogous to the population of individuals in Evolutionary Algorithms (EA). At every iteration, all the particles move through the problem space to search for the global optimum. Every particle has a velocity vector and a current position vector that direct its movement. Equation (4) represents the velocity computation and equation (5) the updated position of a given particle i at a defined iteration k. Equation (4) computes a new velocity vel_i for every particle, based on its previous velocity and on the location at which the particle found its best fitness so far, pbest_i. The search also uses the global best fitness, gbest_i, which is computed either from the whole population or from a local neighbourhood that gives a direction towards the global best. The social and individual weights are denoted by ϕ1 and ϕ2, respectively. Finally, rand_1 and rand_2 are two random numbers in the range 0 to 1, and w denotes the inertia weight factor. Equation (5) updates every particle's location x_i in the defined solution space.
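The update described here can be sketched in a few lines. The sketch assumes the textbook PSO form, in which the new velocity is w times the old velocity plus one weighted random pull towards the particle's own best position and one towards the global best, and the new position is the old position plus the new velocity. Which of ϕ1 and ϕ2 multiplies which term, the parameter defaults, and the function names `pso_step` and `minimise` are assumptions made for illustration; the sketch also minimises an objective, whereas a fitness such as classification accuracy would be maximised.

```python
import random

def pso_step(positions, velocities, pbest, gbest, w, phi1, phi2, rng):
    """One in-place velocity and position update for every particle."""
    for i in range(len(positions)):
        for d in range(len(positions[i])):
            r1, r2 = rng.random(), rng.random()   # rand_1, rand_2 in [0, 1)
            velocities[i][d] = (w * velocities[i][d]
                                + phi1 * r1 * (pbest[i][d] - positions[i][d])
                                + phi2 * r2 * (gbest[d] - positions[i][d]))
            positions[i][d] += velocities[i][d]

def minimise(f, dim, n_particles=20, iters=100, w=0.7, phi1=1.5, phi2=1.5, seed=0):
    """Run a basic PSO loop and return the best position and its objective value."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        pso_step(pos, vel, pbest, gbest, w, phi1, phi2, rng)
        for i, p in enumerate(pos):
            val = f(p)
            if val < pbest_val[i]:                # particle improved its own best
                pbest[i], pbest_val[i] = p[:], val
                if val < gbest_val:               # and possibly the swarm's best
                    gbest, gbest_val = p[:], val
    return gbest, gbest_val
```

Calling `minimise(lambda p: sum(v * v for v in p), dim=2)` runs the loop on a simple quadratic bowl; the returned value never exceeds the best value in the initial swarm, because gbest is only replaced by strictly better positions.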

C. Genetic Algorithm (GA)
GA has two critical operations, crossover and mutation, which provide its global and local search respectively [40]. Two probabilities, Pc for crossover and Pm for mutation, determine the convergence of GA when searching for the optimal output. In the standard GA, Pc and Pm are two pre-defined variables that keep fixed values throughout the search. If Pc is too large, the global search becomes too coarse and the optimal output can be missed. If Pc is too small, the search gets stuck in local minima. Similarly, if Pm is too large, the genetic procedure behaves like a random search, and if Pm is too small, it suppresses the exploratory capability of the search. It is therefore more effective to let GA adjust the values of Pc and Pm during the search, in combination with the particle search, which has a defined velocity and position, and with mutual information. This whole procedure yields the MPG algorithm for gene selection and classification. In MPG, the Pc and Pm values are computed as described in equations (6) and (7). In equation (6), FN_max is the maximum fitness value over all individuals during the MPG search operation, FN_avg denotes the average fitness value, FN_1 is the high fitness value [38], and k1, k2, k3 and k4 denote four variables in the range 0 to 1.
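A minimal sketch of fitness-dependent Pc and Pm follows. It assumes one commonly used adaptive-GA form, in which the probability shrinks linearly as an individual's fitness approaches FN_max and a constant is used for below-average individuals; the exact form of equations (6) and (7) may differ. The pairing of k1 and k2 with Pc and of k3 and k4 with Pm, the handling of the degenerate case FN_max = FN_avg, and the function names are assumptions; the default constants are the values quoted later in the text (k1 = 0.8, k2 = 0.9, k3 = 0.2, k4 = 0.002).

```python
def adaptive_pc(fn_1, fn_max, fn_avg, k1=0.8, k2=0.9):
    """Crossover probability for a pair whose higher fitness is fn_1 (assumed form)."""
    if fn_1 < fn_avg or fn_max == fn_avg:
        # Below-average pair, or a fully converged population: use the constant k2.
        return k2
    return k1 * (fn_max - fn_1) / (fn_max - fn_avg)

def adaptive_pm(fn, fn_max, fn_avg, k3=0.2, k4=0.002):
    """Mutation probability for an individual with fitness fn (assumed form)."""
    if fn < fn_avg or fn_max == fn_avg:
        # Below-average individual, or a fully converged population: use k4.
        return k4
    return k3 * (fn_max - fn) / (fn_max - fn_avg)
```

With this form the fittest individual gets Pc = 0 and Pm = 0, so it passes to the next generation unchanged, while individuals closer to the average are recombined and mutated more often.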

D. Maximization of Mutual Information with PSO and GA (MPG)
Combining maximum mutual information with particle swarm optimization and genetic algorithm, a gene selection mechanism is proposed, known as the MPG selection and classification algorithm. SVM works as the classifier in the selection operation, so the fitness value in MPG reflects the SVM classification accuracy. In equations (6) and (7), the values are set to k1 = 0.8, k2 = 0.9, k3 = 0.2 and k4 = 0.002, and the maximum number of iterations used for selecting the best fitness is 800. Let gene expression dataset X have gene samples x1 and x2. The MPG procedure for selection and classification has the following steps:
(1) Compute the mutual information of all the genes in dataset X by applying MI a given number of times. This produces a subset Y of X consisting of the genes selected by MI. The number of genes in subset Y is 200.
(2) Compute the particle velocity and position during the MPG iterations using equations (4) and (5), updating velocity and position for a defined number of iterations or up to a maximum value.
(3) Initialize the population for MPG and compute the fitness value of every individual. The population size is chosen according to the size of the problem space: a larger population makes it easier for MPG to find the best fitness value, but it takes longer and the iterations become larger and more complex. Here the population size, denoted M, is 300. Every individual contains a number of genes from subset Y, and the sample size for every gene is a1.
(4) Apply the coding process to encode the individuals of the population; after coding, every individual corresponds to a chromosome of a defined length.
(5) Compute the fitness values FN_max, FN_avg and FN_1.
(6) Select the individuals with high fitness values by setting a threshold value.
(7) Compute Pc and Pm for the best fitness value after updating the particle velocity and position. This generates a new population.
(8) Test whether the current optimal fitness value satisfies the target or meets the termination criteria. If it does, move to step (9); otherwise move to step (4).
(9) Obtain the optimal subset of genes for selection and classification according to the decoding rules.
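Step (1), the mutual-information filter that reduces X to the subset Y, can be sketched as below. This is an assumption-laden illustration rather than the paper's procedure: it uses the standard plug-in estimate of mutual information on discretised expression values, equal-width binning into three levels, and hypothetical names (`mutual_information`, `discretise`, `top_k_genes`). In the setting of the text, k would be 200.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of MI (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Each term is p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def discretise(values, bins=3):
    """Map real values to equal-width bin indices 0..bins-1 (assumed binning)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def top_k_genes(expression, labels, k):
    """Rank genes by MI with the class labels and return the k best gene indices.

    expression is a list with one entry per gene, each a list of per-sample values.
    """
    scores = [mutual_information(discretise(g), labels) for g in expression]
    ranked = sorted(range(len(expression)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

The indices returned by `top_k_genes` would define the chromosome positions used in steps (3) and (4), so that each bit of an individual refers to one gene of Y.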

Experimental Results and Comparisons
Four gene expression datasets, for Appendix cancer, Cervical cancer, Gallbladder cancer and Kidney cancer, are taken for experimental purposes. The sample size, the number of genes and the class distribution of every data set used in the simulation are shown in Table 1. Particle swarm optimization and genetic algorithm using SVM are implemented in Matlab version 6.1 for MS-Windows 10 on a 7th-generation i5 processor with 8 GB RAM. Gavin Cawley's SVM toolbox was used for the classification of the gene data sets [42]. The parameters used by the proposed technique for the selection and classification of genes are shown in Table 2, and the GA parameters used for the classification of gene data sets in the simulation are shown in Table 3. After applying these parameters in the simulation, it was observed that the accuracy of gene classification does not increase monotonically with the number of genes. Data sets with relatively small numbers of samples gave better results. Gene expression data sets with a minimum number of selected genes were simulated using a mapping procedure, and this mapping procedure for labelling and classifying genes gave a better, optimized solution. Also note that as the number of genes increases, the classification rate may decrease or increase. Results were compared using conventional criteria to assess the accuracy of gene selection and classification.
The comparisons in the table above show the correct classification rate together with the size of the gene set (number of genes used) for the optimized solution of each dataset. In Table 4, entries marked "-" mean that some of the genes did not participate or were not available. Table 5 describes the detailed output of the experiment carried out with the simulator; it was observed that the Appendix dataset reaches a 100% classification rate using at most 50 genes and performs better than the previous mechanisms. Here Ex denotes the executions Ex1, Ex2, Ex3 and Ex4. Interesting output was obtained with the simulator for the Appendix, Cervical, Gallbladder and Kidney cancer datasets.
In this analysis, the proposed method attains the highest classification accuracy rate, or the highest on average. Experimental results show a significant change in biological genes and are much better than those of previous approaches. Table 5 presents the detailed output of 4 independent executions of the MPG algorithm in the simulator using SVM; the performance and output are quite stable.

Conclusion
A hybrid meta-heuristic feature selection algorithm is proposed that combines the maximization of mutual information with PSO and GA, known as the MPG algorithm. The algorithm effectively reduces the dimensionality of gene expression data sets relative to their original format and reduces redundancy in the data set. The MPG selection mechanism decreases the number of genes while keeping classification accuracy high. Classification accuracy on all datasets was also compared in simulation with existing feature selection mechanisms, which demonstrates the effectiveness of the proposed algorithm. Different classifiers were applied to the reduced dataset. The algorithm can further be used for efficiency improvement. In future, a simulator can be developed for a cloud environment using the MPG technique to improve time complexity.

Figure 1. (a) Trapping of gbest in a local optimum, (b) resetting of gbest to zero, (c) resetting gbest and particle movement, (d) convergence of particles towards the updated gbest and improvement in individual position.
In equation (2), M is the number of genes in the data set, D denotes the gene expressions with outline x in the data range that belong to class y, E denotes the gene expressions with outline x that do not belong to class y, and F denotes the gene expressions without outline x that belong to class y.
New individuals are generated globally by the crossover operation. Similarly, new individuals are generated locally by the mutation operation.
Rachhpal Singh EAI Endorsed Transactions on Scalable Information Systems 01 2020 - 03 2020 | Volume 7 | Issue 25 | e4
[38] et al. compared PSO and GA with SVM to classify high-dimensional microarray data and found small sets of informative genes in huge data sets using a binary representation in Hamming space [28]. Li et al. used a hybrid approach combining PSO with GA and SVM for the selection and classification of genes from large data sets; tests on three benchmark gene expression datasets reduced the dimensionality of the datasets and confirmed an informative subset of about 4 genes with improved classification accuracy [29]. Sahu et al. proposed an optimized filtering technique with different feature selection mechanisms to classify microarray data for high-dimensional cancer sets, combining PSO with signal-to-noise ratio and a probabilistic neural network, and using k-means clustering to cluster the data sets and rank every gene of each cluster [30]. Ghamisi et al. proposed a new feature selection method based on GA combined with PSO, using the SVM classifier on validation samples for the best fitness value [31]. Shen et al. presented a tumour classification approach that combines PSO with SVM to select genes and classify microarray data while reducing the high-dimensional data [32]. Shen et al. developed a hybrid PSO and tabu search algorithm, known as HPSOTS, for gene selection and tumour classification on three different microarray data sets, reducing their high dimensionality [33]. Babaoglu et al. studied binary PSO and GA to examine the efficiency of feature selection in determining the existence of coronary artery disease from exercise stress testing data [34]. Chen et al. developed a novel approach that uses PSO combined with a decision tree as the classifier, and compared its performance with well-known benchmark classification methods using statistical analysis on test datasets compatible with SVM [35]. Arora et al. proposed a feature selection technique for solving real-world complex problems [36]. Sharma et al. proposed two nature-inspired computing mechanisms for query optimization in the medical field [37]. Kaur et al. presented the role of big data in healthcare applications for handling problems related to the medical industry [38].

Table 2. Parameters of the proposed technique for the selection of gene data sets

Table 3. Parameters of the proposed technique for the classification of gene data sets

Table 4.

As the number of genes grows, the classification rate may rise or fall due to the complex relationships among genes and gene data sets. The accuracy of classification therefore depends on how well the classifier and the selection algorithm agree, that is, on the correlation between them. In a nutshell, all the classifiers with the selection mechanism have high classification accuracy rates on all the datasets taken. The results can also be used for comparison across the different data sets considered in the simulation.