Prediction of Cross Project Defects using Ensemble based Multinomial Classifier

BACKGROUND: The availability of defect-related data from many different projects has made cross project defect prediction (CPDP) an open research issue. Many studies have focused on analyzing and improving the performance of cross project defect prediction. OBJECTIVE: Multinomial classification has not been much explored in this setting. This paper focuses on multiclass/multinomial classification for cross project defect prediction. METHOD: The ensemble-based statistical models Gradient Boosting and Random Forest are used for classification. An empirical study is carried out to determine the performance of multinomial classification for cross project defect prediction. Depending on the number of defects, class-level information is classified into one of three defined classes: class 0, class 1, and class 2. RESULTS & CONCLUSION: The major outcome of the paper is that multinomial/multiclass classification is applicable to cross project data and yields results comparable to those on within project defect data.


Introduction
Identification of defect-prone classes before actual testing reduces testing cost and effort. It also leads to more focused testing, thereby enhancing the probability of fault-free software [35]. Much research has been carried out on within project data (within project defect prediction, WPDP). The availability of data from many different projects motivates cross project defect prediction (CPDP). In the past decade much research has focused on binary classification for CPDP. Multinomial classification in CPDP is still an open area.
The advantage of treating this problem as multinomial classification rather than regression is that regression does not respect the lower bound of zero. Here the target variable to be predicted is a non-negative count. However, regression-based approaches to defect prediction can give non-integer or negative predictions, which are invalid. Therefore, in this study we substantiate multinomial/multiclass classification for cross projects.
Lipika Goel et al.
Multinomial classification provides information on the severity of defect-prone classes. The higher the number of defects in a class, the more defect-prone that class is. In this work we have classified the class-level defect data into different groups depending on the total number of defects in each class. The three classes are defined below. The reason for taking 5 as the threshold value is given in Section 4 of the paper.
• Class 0: Class with no bugs.
• Class 1: Class with bugs greater than 0 and less than 5.
• Class 2: Class with bugs greater than or equal to 5.
We have taken 15 open source object-oriented projects. The classification is performed for cross project and within project defect prediction. In our previous work [33] (G. Lipika et al., 2018) on binary classification for CPDP and WPDP, we inferred that the Random Forest and Gradient Boosting ensemble algorithms outperformed all the other algorithms considered (LR, NVB, K-NN). Hence, in this paper we use the Gradient Boosting and Random Forest ensemble techniques for modeling. Ensemble learners combine multiple learners into a single predictive model, which typically improves predictive performance.
The main research question addressed is:
RQ: Is multinomial classification of CPDP feasible and comparable to WPDP?
To answer this question we conducted the experiment for multinomial classification on CPDP. Multinomial classification was also performed on within project defect data. To determine the feasibility of multinomial classification for CPDP, we compared the results of multinomial classification for cross projects with those for within projects on the basis of AUC-ROC value, F-measure, precision and recall. Cross validation is also performed to determine the training accuracy of the model.
This paper is divided into the following sections: Section 2 presents a brief literature survey. Section 3 summarizes the datasets, metrics, performance measures and models used. Section 4 states the proposed methodology. Section 5 tabulates the results. The threats to validity are stated in Section 6. Section 7 concludes the paper.

State Of Art
In this section we present a brief summary of the state of the art in cross project defect prediction. Since multinomial classification has not been much explored, the survey below gives a broad picture of the work done in CPDP.
K-NN was used by Turhan et al. in 2009 [3] to reduce the difference in data distribution; the Naïve Bayes classifier was then used for defect prediction. Homogeneous metrics were selected and the Naïve Bayes classifier was applied to 10 different cross projects. The results for the cross projects were compared with those within project. The authors demonstrated the minimum number of data samples required to build an effective predictor.
A Logistic Regression model was used by T. Zimmermann et al. in 2009 [4] for CPDP, thereby giving a new dimension to it. The authors concluded that the selection of dataset characteristics has a significant impact on the performance of the model in defect prediction. They performed a large study on the feasibility of CPDP and inferred that decision trees can improve the performance measures.
S.J. Pan et al. in 2010 [5] focused on categorizing and reviewing the progress of transfer learning in regression, classification and clustering problems. They inferred that transfer learning would be helpful in detecting defects when the training and testing datasets have different distributions. The authors gave a new dimension to defect prediction for cross projects using the transfer learning approach.
Menzies et al. in 2011 [6] used the WHERE algorithm to create a local model through clustering of the training data. The WHICH rule learning algorithm was then used for classification. This novel approach to classification showed better results than some existing classifiers. Ma et al. in 2012 [10] presented a new approach named TNB (Transfer Naïve Bayes) for CPDP. The results showed that TNB performed better than the state of the art in terms of area under the curve. They inferred that even if little local training data is available, knowledge from a different data distribution can also be used for defect prediction.
Zhimin He et al. in 2012 [9] investigated defect prediction in the cross project context, focusing on the selection of the training dataset. The authors concluded that in some cases training data from other projects provides better results than training data from the same project. The authors also proposed a method for automatic selection of training data for projects with no historical data.
TCA (Transfer Component Analysis) was implemented by Jaechang Nam et al. in 2013 [11] for CPDP. TCA makes the feature distributions of the source and target datasets similar. In addition, TCA+ was proposed to make better defect predictions.
Herbold in 2013 [13] proposed NN Filtering and EM Clustering as distance-based strategies for selecting the training data. Different classifiers such as Logistic Regression, Naïve Bayes, Random Forest, ANN and Decision Trees were used for prediction on the cross projects. Proper selection of the training data can lead to better defect prediction. Peters et al. in 2013 [17] proposed a new filter called the Peters filter. The results of the Peters filter were compared with those of the Burak filter, and the Peters filter outperformed it for CPDP. It analyses the structure of the other available projects and selects the final training dataset. Dejaeger et al. in 2013 [15] contributed by studying fifteen different Bayesian networks and comparing them to other machine learning algorithms. The authors also investigated the Markov theory of feature selection.
G. Canfora et al. in 2013 [12] proposed a novel multiobjective model for cross project defect prediction. The multiobjective logistic model used a genetic algorithm. This model allowed engineers to choose predictors achieving a compromise between effectiveness and lines of code to be tested. Panichella et al. in 2014 [18] investigated the equivalence of classifiers, studying whether different classifiers identified the same defect-prone classes. The authors proposed a combined defect predictor which achieved a higher ROC value. They concluded that a combination of classifiers yields better results than a single classifier.
Peng He et al. in 2014 [19] addressed the problem of imbalanced feature sets between the source and the target dataset. Considering this, the authors proposed a framework based on distribution characteristic instance mapping. They validated the results on publicly available datasets and concluded that the proposed model improved the performance of CPDP.
Peters et al. in 2015 [21] proposed an extension of MORPH and CLIFF. LACE2, a novel algorithm, did not add all the data from the products to the defined shared cache; it decides which instances should be added. Kawata et al. in 2015 [22] proposed the use of the DBSCAN clustering algorithm. In 2015 Y. Zhang et al. [24] used the ensemble classifiers Maximum Voting and Average Voting for classification. Amasaki et al. in 2015 [25] proposed a combination of attribute selection and relevance filtering, performed on log-transformed data. Nam and Kim in 2015 [20] proposed CLAMI, a novel fully automated unsupervised approach. CLAMI included clustering and labeling of the training data for prediction.
The solution for heterogeneous cross project defect prediction was provided by X. Y. Jing and Nam in 2015 [27]. Ryu, D. et al. in 2016 [28] investigated the applicability of class imbalance learning in the CPDP scenario. They proposed a novel method, VCB-SVM. The proposed model improved the predictive performance by sampling and class imbalance learning. Duksan

Metrics used and description of Datasets
We have taken the well-defined datasets from http://openscience.us. The original data was collected by Marian Jureczko, Institute of Computer Engineering, Wroclaw University of Technology. From each dataset, the information at the class level is considered. These datasets generally have multiple versions. Table 1 describes the datasets used. In this work we have focused on homogeneous CPDP for multinomial classification. Software metrics are important in the defect prediction process [36]. The CK metrics (Chidamber and Kemerer, 1994) are selected and used for software defect prediction. The CBO [34,37], RFC and LCOM metrics of the CK suite have been associated with the fault proneness of classes: the higher the value of these metrics, the higher the fault proneness of the class. The WMC and RFC metrics of the CK suite have been found to be highly related to the number of defects in a class. A higher value of DIT [38] and a lower value of NOC have also been related to the fault proneness of the class. These studies indicate a strong association of the CK metric suite with the number of defects in a class. The

Ensemble Learning Models
An ensemble combines several base (weak) learners to improve the performance of the model. The prediction accuracy of an ensemble is typically higher than that of its individual base learners. The base learning algorithm can be a neural network, a decision tree or any other machine learning algorithm. Majority voting for classification and weighted averaging for regression are the common strategies for combining base learners. The popular and effective ensemble methods are Bagging, Boosting and Stacking. Figure 1 gives a diagrammatic representation of the ensemble approach. The figure shows that N classifiers are applied to a set of input features for prediction. The results from the N classifiers are combined to give the final outcome.

Figure 1. Ensemble Learning Approach
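The two combination strategies mentioned above, majority voting for classification and weighted averaging for regression, can be illustrated with a minimal sketch. The function names, the votes and the weights are hypothetical and chosen only for illustration; they are not taken from the experiment.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the class labels predicted by several base learners for one sample.

    Ties are resolved in favour of the label encountered first, which is an
    arbitrary choice made for this sketch.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


def weighted_average(predictions, weights):
    """Combine the numeric outputs of several base learners for a regression task."""
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight


# Three hypothetical base learners vote on the class of one sample.
votes = [0, 2, 2]
combined_label = majority_vote(votes)

# The same idea for regression, with hypothetical learner weights.
combined_value = weighted_average([1.0, 3.0], [0.25, 0.75])

print(combined_label, combined_value)
```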
The ensemble learning models used in this experiment are Random Forest and Gradient Boosting. Random Forest is a bagging technique, whereas Gradient Boosting is a boosting approach.
1) Random Forest: Random Forest is applicable to a wide range of data science problems. It is a flexible machine learning model capable of both classification and regression. It is robust to missing values, high-dimensional data and outliers, which contributes to its good results. It is an ensemble of weak models (decision trees). To classify an object, each tree votes and the forest chooses the class with the highest number of votes. For regression, it computes the average of the outputs produced by the trees. A major advantage of Random Forest is that it reduces the risk of overfitting.
The standard deviation of the predictions from the different regression trees gives an estimate of the uncertainty of the prediction.
2) Gradient Boosting: In Gradient Boosting, each subsequent predictor improves its learning from the errors of the existing predictors. It involves three elements:
• A loss function to be optimized.
• A weak learner to make predictions.
• An additive model to add weak learners.
Gradient boosting uses decision trees as weak learners. A tree is parameterized, its parameters are modified, and it is then added to the model to minimize the loss.
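In its standard textbook form, which we assume is the formulation intended here, gradient boosting builds the model stagewise. The sample size $n$ and the pseudo-residuals $r_{im}$, to which the base learner $h_m$ is fitted, are symbols introduced for this sketch:

```latex
\begin{align*}
F_0(x) &= \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) \\
r_{im} &= -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, n \\
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr) \\
F_m(x) &= F_{m-1}(x) + \gamma_m h_m(x), \quad m = 1, \dots, M
\end{align*}
```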
Here $(x_i, y_i)$ denotes the training set, $L(y, F(x))$ is a differentiable loss function, $M$ is the number of iterations, $h_m(x)$ is the base learner and $\gamma_m$ is the multiplier.

Evaluation Measures
Table 3 summarizes the performance measures most commonly used to evaluate classification models.

Data Acquisition and Understanding
Once the datasets are collected, the target variable and the input features are identified. The total number of defects in a class is taken as the target variable. We assume that the higher the number of defects in a class, the more defect-prone the class is and the higher its severity. When two or more different projects have the same set of metrics, such metrics are called homogeneous metrics. In our analysis we have taken different projects with homogeneous metrics, thereby performing homogeneous CPDP.
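The notion of homogeneous metrics can be illustrated by intersecting the metric sets of two projects. This is a minimal sketch; the helper name and the metric lists are hypothetical and do not reproduce the actual dataset columns.

```python
def homogeneous_metrics(*metric_lists):
    """Return the metrics shared by every project, in alphabetical order."""
    common = set(metric_lists[0])
    for metrics in metric_lists[1:]:
        common &= set(metrics)
    return sorted(common)


# Hypothetical metric lists for a source and a target project.
source_metrics = ["wmc", "dit", "noc", "cbo", "rfc", "lcom", "loc"]
target_metrics = ["wmc", "dit", "noc", "cbo", "rfc", "lcom"]

shared = homogeneous_metrics(source_metrics, target_metrics)
print(shared)
```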

Data Preprocessing and Preparation
The data preprocessing steps include data encoding, data normalization and feature selection. For data encoding, a label encoder is used to convert categorical values to numeric values. Z-score normalization is performed to bring the data onto a common scale.
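The encoding and normalization steps can be sketched with the standard library alone. The category names and metric values below are hypothetical, and the use of the population standard deviation in the Z-score is an assumption of this sketch.

```python
from statistics import mean, pstdev


def label_encode(values):
    """Map each distinct categorical value to an integer code (sorted order)."""
    codes = {value: index for index, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values], codes


def z_score(column):
    """Rescale a numeric column to zero mean and unit (population) variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical categorical column and one hypothetical metric column.
encoded, mapping = label_encode(["projA", "projB", "projA", "projC"])
metric_scaled = z_score([2, 4, 4, 4, 5, 5, 7, 9])

print(encoded, mapping)
print(metric_scaled)
```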
We conducted the experiment to classify our cross project and within project data into multiple classes. First, we framed the problem as a 10 class multinomial classification problem. A class with 0 defects has the lowest severity, followed by class 1, class 2 and so on, and the class with 10 defects has the highest severity (the maximum number of defects in a class was not more than 10). On evaluating the predicted and actual values, we inferred that most of the values belong to class 0 or class 5 among the 10 classes defined above. Hence, in order to evaluate the severity in 3 buckets (low, medium and high), we converted the 10 class problem to a 3 class (multinomial) problem with 5 as the threshold value. As stated before, we defined three classes as follows: Class 0: Class with no bugs. Class 1: Class with bugs greater than 0 and less than 5. Class 2: Class with bugs greater than or equal to 5. Cross validation is also performed to evaluate the training accuracy of the model.
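The mapping from a defect count to one of the three classes can be written as a short function. This is a minimal sketch; the function name and the sample counts are hypothetical, while the threshold of 5 follows the definition above.

```python
def severity_class(defect_count, threshold=5):
    """Map the number of defects in a class to class 0, 1 or 2."""
    if defect_count < 0:
        raise ValueError("defect count cannot be negative")
    if defect_count == 0:
        return 0  # no bugs
    if defect_count < threshold:
        return 1  # more than 0 and fewer than 5 bugs
    return 2  # 5 or more bugs


# Hypothetical defect counts for five software classes.
defect_counts = [0, 1, 4, 5, 10]
labels = [severity_class(count) for count in defect_counts]
print(labels)
```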

Modeling
The modeling includes a description on model fitting and its evaluation.

Model Fitting:
Gradient Boosting and Random Forest are used for multinomial classification. Hyperparameter tuning of the classifiers is performed. Hyperparameters are settings of a classifier that are chosen to optimize its performance; they are not learned directly from the data and need to be predefined. In the case of Random Forest, the hyperparameters include the number of decision trees in the forest, the number of features considered by each tree when splitting a node, the maximum depth of a tree, and the minimum number of samples required at a leaf node. We used Grid Search over a range of values for each parameter. The range of values for each parameter of Random Forest in the experiment is:
'n_estimators': [10,15,20]
'max_depth': [5,10,12]
'min_samples_leaf': [2,3]
'min_samples_split': [2,3]
In the case of Gradient Boosting, the hyperparameters include n_estimators and the learning rate. The range of values for each parameter of Gradient Boosting in the experiment is:
'n_estimators': [30,40,50]
'learning_rate': [0.2, 0.1]
EAI Endorsed Transactions on Scalable Information Systems 01 2020 -03 2020 | Volume 7 | Issue 25 | e5
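The grid search over these ranges can be sketched as follows. The parameter names match scikit-learn identifiers, so the sketch assumes that library, although the text does not name it. The synthetic data, the 3-fold cross validation and the macro-averaged F1 scoring are assumptions made for illustration, not settings reported in the experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the metric features (assumption).
X, y = make_classification(
    n_samples=150,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

# Parameter ranges as listed in the text.
rf_grid = {
    "n_estimators": [10, 15, 20],
    "max_depth": [5, 10, 12],
    "min_samples_leaf": [2, 3],
    "min_samples_split": [2, 3],
}
gb_grid = {
    "n_estimators": [30, 40, 50],
    "learning_rate": [0.2, 0.1],
}

# cv=3 and scoring="f1_macro" are illustrative choices, not from the paper.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0), rf_grid, cv=3, scoring="f1_macro"
)
gb_search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), gb_grid, cv=3, scoring="f1_macro"
)

rf_search.fit(X, y)
gb_search.fit(X, y)

print("Random Forest:", rf_search.best_params_)
print("Gradient Boosting:", gb_search.best_params_)
```

Each search refits the best configuration on the full training data by default, so `rf_search.best_estimator_` and `gb_search.best_estimator_` can be used directly for prediction on a target project.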

Model Evaluation:
For evaluation of the models we have used three performance measures: F-measure, AUC-ROC and Mean Average Precision (MAP). The F-measure is the harmonic mean of precision and recall and is widely used in the evaluation of classification models. The confusion matrix is used to compute recall and precision, from which the F-measure is calculated. The AUC-ROC values facilitate the comparison of the models. An AUC value of 1 represents a perfect classifier, whereas a value of 0.5 is expected for a random classifier. MAP is also used to evaluate the performance of the classifiers. It is calculated by computing the average precision for each class and then averaging over the classes. The higher the value of MAP, the better.
Algorithm 1 presents the pseudocode of the experiment conducted. Figure 2 shows the defect prediction process for multinomial classification. Data is collected and preprocessed. As discussed in Section 1 of the paper, the dataset is then divided into three classes: 0, 1 and 2. After the selection of the CK metrics, the training dataset is provided to the model for classification. The ensemble classifiers are used to predict defects under the three defined classes.
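The computation of per-class precision, recall and F-measure from a confusion matrix can be sketched as follows. The function name and the confusion matrix are hypothetical and do not reproduce the results of the experiment.

```python
def per_class_scores(confusion):
    """Return {class: (precision, recall, f_measure)} from a square confusion matrix.

    confusion[i][j] counts the samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    scores = {}
    for k in range(n_classes):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n_classes))
        actual_k = sum(confusion[k])
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denominator = precision + recall
        # F-measure is the harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / denominator if denominator else 0.0
        scores[k] = (precision, recall, f_measure)
    return scores


# Hypothetical confusion matrix for class 0, class 1 and class 2.
confusion = [
    [80, 15, 5],
    [10, 6, 4],
    [2, 3, 5],
]
scores = per_class_scores(confusion)
for label, (precision, recall, f_measure) in scores.items():
    print(label, round(precision, 3), round(recall, 3), round(f_measure, 3))
```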

Results & Discussion
In this section we present the results of the experiment conducted. Tables 4, 5, 6, 7 and 8 tabulate the results. Table 4 gives the values of precision, recall and F-measure for CPDP using Random Forest and Gradient Boosting. Table 5 tabulates the values of precision, recall and F-measure for WPDP using Random Forest and Gradient Boosting. Table 6 gives the average values of F-score, precision and recall using Random Forest and Gradient Boosting for CPDP and WPDP, whereas Table 7 presents the AUC-ROC values using Random Forest and Gradient Boosting for CPDP and WPDP. The accuracy of the predictive performance of the models is specified in Table 8.
The performance evaluation measures used are F1-score, AUC-ROC value, precision and recall. We have compared the results of multinomial classification for CPDP with those for WPDP. The observations are as follows:

RQ: Is multinomial classification of CPDP feasible and comparable to WPDP?
From Table 8 we observe that the accuracy values for CPDP using Gradient Boosting and Random Forest are 0.73 and 0.72 respectively. These values are comparable to the accuracy values for WPDP, 0.79 and 0.81, using Random Forest and Gradient Boosting respectively. Since accuracy is also one of the performance measures of a classifier, we can infer from these observations that multinomial classification for CPDP is feasible and comparable to WPDP.
From Table 6 we observe that when Random Forest is used as the classification model, the average values of F-score in CPDP for class 0, class 1 and class 2 are 0.9, 0.268 and 0.302 respectively, whereas the corresponding values for WPDP are 0.838, 0.274 and 0.291. In the case of CPDP, the value was 6.88% higher for class 0, 2.18% lower for class 1 and 3.64% higher for class 2 when compared with WPDP. When Gradient Boosting is used as the ensemble model for classification, the average values of F-score in CPDP for class 0, class 1 and class 2 are 0.92, 0.247 and 0.334 respectively, whereas the corresponding values for WPDP are 0.846, 0.258 and 0.322. In the case of CPDP, the value was 8.043% higher for class 0, 4.26% lower for class 1 and 3.59% higher for class 2 when compared with WPDP. From these observations we can infer that multinomial classification for CPDP is feasible and comparable to WPDP with respect to F-measure.
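The percentage differences quoted above can be reproduced from the average F-scores. The percentages appear consistent with dividing the difference by the larger of the two values; this reading is an assumption of the sketch and is not stated explicitly in the text. Small discrepancies in the last digit are due to rounding.

```python
def relative_gap(cpdp, wpdp):
    """Percentage gap between two scores, relative to the larger one.

    A positive result means the CPDP score is higher than the WPDP score.
    """
    return 100.0 * (cpdp - wpdp) / max(cpdp, wpdp)


# Average F-scores for class 0, class 1 and class 2 as quoted in the text.
f_scores = {
    "Random Forest": {"CPDP": [0.9, 0.268, 0.302], "WPDP": [0.838, 0.274, 0.291]},
    "Gradient Boosting": {"CPDP": [0.92, 0.247, 0.334], "WPDP": [0.846, 0.258, 0.322]},
}

gaps = {
    model: [relative_gap(c, w) for c, w in zip(values["CPDP"], values["WPDP"])]
    for model, values in f_scores.items()
}

for model, values in gaps.items():
    print(model, [round(gap, 2) for gap in values])
```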

Threats To Validity
Threats to internal validity concern errors in the experiments. We have checked our datasets and experiments, but there may still be errors that we did not notice. An issue that can affect internal validity is the choice of classifiers for modeling. There are many classification algorithms available, and any research work includes only a small subset of them. In our work, we have focused on only two algorithms of the ensemble approach (Random Forest and Gradient Boosting). Besides this, all the datasets used in our experiments are from Jureczko, and there may be some quality issues in the datasets.
Threats to external validity are related to the generalization of the results. All the datasets used in our experiment are from a single source and are open source projects. This threat can be reduced by considering more datasets from different sources for the experiment.
Threats to reliability concern the possibility of replicating this work. All the datasets used in our experiment are publicly available and the pseudocode is available in this paper, so this work can easily be replicated.
The threat to conclusion validity is that we have compared the outcomes of our experiment with within project defect prediction on the basis of F-score and AUC. Since not much work on multinomial classification for CPDP is present in the state of the art, the results obtained are compared with WPDP to demonstrate the feasibility of multinomial classification for CPDP.

Conclusion & Future Scope
Software reliability is directly related to the probability of error-free software. In this paper we analyzed the performance of multinomial classification for cross project and within project defect prediction. The datasets were collected by Jureczko. Cross validation was done to estimate the predictive performance of the models. The CK-based object-oriented metrics were effective for CPDP. We first labeled our class information at three different levels depending on the number of defects in each class. The experiment was conducted by training the models on data from different projects for cross project defect prediction. Ensemble learning approaches were used for classification. The values of F-score, MAP and AUC were used for empirical analysis of the models.
The results indicate that multinomial classification on CPDP is feasible and comparable to WPDP. There are many directions in which the work can be extended. Multinomial classification for heterogeneous CPDP can be explored. Proper training dataset selection and the handling of class imbalance to optimize the performance of defect prediction are still open issues in CPDP.