Experimental Comparison of Classification Methods under Class Imbalance

The class imbalance problem is prevalent in many domains, including medicine, natural language processing, image recognition, economics and geography. We perform a systematic experimental comparison of different imbalanced classification algorithms, ranging from sampling, distance metric learning and cost-sensitive learning to ensemble learning approaches, on several datasets from UCI, KEEL and OpenML. The algorithms include DDAE, MWMOTE, SMOTE, RUSBoost, AdaBoost, cost-sensitive decision tree (csDCT), self-paced Ensemble Classifier, MetaCost, CAdaMEC and Iterative Metric Learning (IML). As the substantial bias potentially caused by imbalanced classification can be harmful for underrepresented classes that carry critical social and economic value, and can even cost lives, the main objective of our study is to understand the impact of the imbalance ratio and the size of the utilized datasets on the performance of the above-mentioned algorithms. Our experiments show that: 1) sampling methods perform the worst and cannot be used directly for imbalanced classification, since they do not consider distance-based neighborhoods; however, some classifiers can be improved once the class distribution is balanced. 2) Cost-sensitive learning models should be utilized when the dataset is less imbalanced, because it is difficult to set an appropriate cost matrix for a specific dataset, which can cause performance fluctuations. 3) IML consistently shows good performance (in terms of F1 and AUCPRC) and is resilient to different imbalance ratios, but it is sensitive to the data distribution of the dataset. 4) Ensemble learning techniques generally perform better than other approaches due to the combined intelligence of multiple base classifiers. 5) In terms of system performance, self-paced Ensemble Classifier performs fairly well with regard to learning time, while IML and DDAE yield the longest learning times; AdaBoost and self-paced Ensemble Classifier are the two algorithms with the lowest memory usage. We also provide empirical recommendations for algorithm selection under different requirements and usage scenarios based on our analysis.


Introduction
Classification is one of the most popular topics in machine learning [1][2][3], and much attention has been paid to binary classification [4][5][6]. Classical classification methods include Naïve Bayes, k-nearest neighbor (kNN) [7], support vector machine (SVM) [8], decision tree [9] and random forest [10]. These classification algorithms typically assume that their datasets are balanced in class distribution. However, in many real-world application domains, such as medical diagnosis [3,[11][12][13][14], streaming and social behavior data analysis [2,15], software development processes [16,17], financial fraud [18][19][20], unsolicited phone calls [21], disaster risks [22,23], recommendation systems [24,25], and text classification [26][27][28], inherently imbalanced datasets are commonly seen. Limitations of the data collection process and the unequal cost of fixing different errors can also lead to imbalance. Under these conditions, standard classifiers are often dominated by the majority class and ignore the minority class, which may result in catastrophic outcomes, for instance a massive waste of resources, time or money, and can even endanger lives [29][30][31]. Additionally, with respect to the percentage of examples available for each class, most real-world datasets processed using non-linear classification strategies are imbalanced, which may cause the algorithm to learn overly complicated models that overfit the data and generalize poorly [32]. This issue is especially critical since it creates a significant barrier to the performance achieved by classical learning methods, which typically assume that the class distribution is balanced [31,32]. To combat losses in critical social and economic values and even lives, it is highly important to understand the impact of class imbalance on different imbalance learning algorithms and to recommend an appropriate algorithm for a given usage scenario.
As demonstrated in previous literature, a poorly performing algorithm for imbalanced classification can lead to substantial losses in critical social and economic values and even lives [12,14,19,20,23,27,28,31]. Hence, it is important to study the impact of different metrics on different algorithms under different situations, so that an appropriate choice can be made when selecting an algorithm for imbalanced classification. Although there are evaluation studies on one specific direction of imbalanced classification or focusing on one specific metric [41], to the best of our knowledge no work has been conducted using a comprehensive set of evaluation metrics to quantify the performance of representative algorithms from these different classification families.
This paper focuses on comparing the performance of these algorithms on multiple datasets from several different domains, including healthcare, card playing, software development projects and hand-written digit recognition. We use these datasets to evaluate ten imbalanced classification algorithms, namely 1) sampling: SMOTE [35] and MWMOTE [42]; 2) cost-sensitive learning: MetaCost [43], CAdaMEC [44] and cost-sensitive decision tree (csDCT) [9]; 3) distance metric learning: Iterative Metric Learning (IML) [36]; 4) ensemble learning and hybrid methods: AdaBoost [45], RUSBoost [39], self-paced Ensemble Classifier [11] and DDAE [40]. Our experiments not only analyze the performance of different models based on a general set of evaluation metrics on the same dataset, but also quantify the impact of key factors related to imbalanced learning, such as the size of the dataset and the imbalance ratio, as well as system performance in terms of learning time and memory usage.
The following sections first review the related work and then present our data sources. Section 4 presents the detailed evaluation results, Section 5 provides additional discussion, and Section 6 concludes this paper.

Related Work
Due to the vital importance of data analysis for human health, lives and the socioeconomic world, many researchers have examined the issue of imbalanced classification. For example, [46] combined feature selection and ensemble classification to solve the problems of imbalanced healthcare data in the diagnosis of brain tumors. [47] introduced a novel voting class weight algorithm based on random forest to sufficiently identify the minority class in medical applications, which can be applied to the detection of diseases. [48] presented an innovative comprehensive ensemble learning paradigm involving multiple SVM diversity structures for the early detection of diseases with imbalanced data.

Sampling Methods
Resampling means creating a new, transformed version of the training set of an imbalanced dataset, offering a set of practical and straightforward approaches to provide a more balanced data distribution [34]. The three main resampling methods are oversampling, undersampling, and hybrid techniques, which combine both [9].
Among these three groups, informed undersampling methods such as EasyEnsemble and BalanceCascade [49], and synthetic sampling methods such as SMOTE [35] and Borderline-SMOTE [50], have been shown to outperform random oversampling and random undersampling. However, undersampling methods can lead to information loss, which in turn degrades classification performance and causes underfitting, while oversampling suffers from overfitting, high computational overhead and long training time [4].

Synthetic Minority Oversampling Technique (SMOTE).
SMOTE [35] addresses the class imbalance problem by generating synthetic samples in feature space (see Fig. 3 for more details). One minority class example s1 is selected at random and its k nearest neighbors within the minority class are identified; a line segment is then formed in feature space between s1 and one of these k neighbors, s2, which is selected at random [35]. SMOTE creates the synthetic sample through a convex combination of s1 and s2 [51]. As described in [52], random undersampling is suggested to first curtail the size of the majority class; SMOTE is then applied to the training set to balance the class distribution. This approach has been shown to outperform plain undersampling [35].
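
As a concrete illustration, the following is a minimal sketch of SMOTE's interpolation step, assuming X_min is a NumPy array holding only the minority-class samples; in practice, a maintained implementation such as the one in the imbalanced-learn package should be preferred.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, seed=0):
    """Sketch of SMOTE: synthesize minority samples by convex combination."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                 # random minority sample s1
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]                  # its k nearest minority neighbors
        j = rng.choice(nn)                           # random neighbor s2
        lam = rng.random()                           # position on the s1-s2 segment
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(out)
```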

Majority Weighted Minority Oversampling Technique (MWMOTE).
MWMOTE [42] identifies the hard-to-learn, informative minority class samples and assigns each an appropriate weight; the weight of each important minority sample is chosen based on its Euclidean distance to the nearest majority class sample. Synthetic samples are then generated from the weighted minority class through a clustering approach, which ensures that the produced samples are located within the minority region [42].

Cost-sensitive learning methods
Cost-sensitive learning methods deal with misclassification errors that incur different penalties, finding the optimal decision based on a cost matrix [29,54] as shown in Table 1. If m stands for the predicted label and n stands for the actual label, then C(m, n) is the cost of predicting a class n sample as class m. Given the cost matrix, the purpose of this type of learning method is to create a model with minimal overall misclassification cost [55,56].
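In standard decision-theoretic terms (our formalization, consistent with [55,56]): given the conditional class probabilities P(n | x) for an input x, the model predicts

$$m^*(x) = \arg\min_{m} \sum_{n} P(n \mid x)\, C(m, n),$$

i.e., the label with the minimal expected misclassification cost.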
While most traditional classifiers assume that misclassification has the same cost for false negatives (FN) and false positives (FP) [29], real-world scenarios are rarely so ideal. Conceptually, in certain situations, the cost of incorrectly labeling a sample should always be higher than that of a correct label [29,30,55]. The supplied cost matrix can strongly impact effectiveness, and a common problem is that the cost of classification errors cannot be described clearly when domain expert knowledge is lacking [9,54].
Cost-sensitive learning techniques can be categorized into two families: meta-learning (covering two aspects: thresholding and sampling) and direct methods [56]. The principle of the former category is to create a "wrapper" that turns existing cost-insensitive classifiers into cost-sensitive ones, while in the latter case, classifiers that are cost-sensitive in themselves are constructed [56]. Weighting [59] is one implementation of the sampling aspect, in which minority examples are assigned high weights according to their proportion. Although efficient due to their ability to account for the importance of different classes, a disadvantage of cost-sensitive algorithms is that it is difficult to define an appropriate cost matrix for each dataset and to generalize the learning algorithm [57].
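For the thresholding aspect in binary problems, a minimal sketch follows, assuming c_fp and c_fn denote the costs of a false positive and a false negative (C(1, 0) and C(0, 1) in the notation above) and zero cost for correct predictions; the classic rule labels an instance positive when p(1 | x) exceeds c_fp / (c_fp + c_fn).

```python
import numpy as np

def cost_threshold(c_fp, c_fn):
    # Predict positive when the expected cost of "positive" is lower:
    # (1 - p) * c_fp <= p * c_fn  <=>  p >= c_fp / (c_fp + c_fn)
    return c_fp / (c_fp + c_fn)

def predict_cost_sensitive(probas_pos, c_fp, c_fn):
    return (np.asarray(probas_pos) >= cost_threshold(c_fp, c_fn)).astype(int)
```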
Cost-sensitive DeCision Tree (csDCT) [9]. Its main idea is to minimize two separate costs: the test cost of features and the cost of misclassifying samples. It accounts for misclassification during pruning and is popularly used as a direct method for detecting card fraud, where the cost of misclassification can vary [56,58].

CAdaMEC. CAdaMEC [44] is built upon AdaMEC [60] (a cost-sensitive algorithm) through an appropriate calibration with Platt scaling.

MetaCost.
MetaCost [43] is another cost-sensitive learning model, which includes a cost-minimizing method independent of the number of classes or of arbitrary cost matrices. It relabels the instances in the training set with the class labels that have the estimated minimal costs; the new, relabeled training set is then fed to an error-based learner. MetaCost can be viewed as a representative tool for the thresholding aspect (see Section 2.5).
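
A compressed sketch of this relabeling idea is shown below (not the authors' implementation; MetaCost additionally controls how bagged votes are aggregated). It assumes a recent scikit-learn (>= 1.2 for the estimator parameter name) and a cost matrix C with C[m, n] being the cost of predicting class m when the truth is n.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def metacost_relabel(X, y, C, base=DecisionTreeClassifier(), n_bags=10):
    """Relabel each training instance with its minimal-expected-cost class,
    then retrain the error-based learner on the relabeled data."""
    bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=n_bags, random_state=0).fit(X, y)
    P = bag.predict_proba(X)                         # columns: estimated P(n | x)
    y_new = np.argmin(P @ np.asarray(C).T, axis=1)   # argmin_m sum_n P(n|x) C(m,n)
    return clone(base).fit(X, y_new)
```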

Distance Metric Learning Methods
Iterative Metric Learning (IML) [36] focuses on exploring a stable neighborhood space for each data sample in the testing set. To achieve this, the procedure iterates a metric learning technique [36] that locates a stable neighborhood for the specific testing sample, using the k-nearest neighbors rule (kNN) [7] as the base classifier. IML comprises the following steps: 1) apply Large Margin Nearest Neighbor (LMNN) learning [61] (a learning method offering much higher accuracy than kNN; see Fig. 4) to improve the data space, separating data samples with different class labels by a large margin while pulling samples with the same class label close to each other; 2) calculate the distance between each sample in the training set and the testing sample; 3) repeat the previous steps over multiple iterations, controlled by a predefined matching ratio. [62] shows that IML yields a classification performance bound but requires the learned matrix to conform to the Positive Semi-Definite (PSD) constraint; its performance for non-linear metrics is still unknown.
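
For illustration, the core LMNN-plus-kNN step could be sketched with the metric-learn package as below. This is a single-round sketch under our assumptions, not the full iterative procedure of [36]; note that the LMNN neighbor parameter is named k in older metric-learn releases and n_neighbors in newer ones.

```python
from metric_learn import LMNN
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Learn a metric that pulls same-class neighbors together and pushes
# differently labeled samples apart by a large margin, then run kNN in
# the transformed space. Full IML repeats this on a shrinking candidate
# neighborhood for each test sample until the neighbor set stabilizes.
lmnn = LMNN(n_neighbors=5).fit(X_tr, y_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(lmnn.transform(X_tr), y_tr)
y_pred = knn.predict(lmnn.transform(X_te))
```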

Ensemble Learning and Hybrid Methods
In 1979, Dasarathy and Sheela [63] presented one of the earliest studies on ensemble learning, which partitions the feature space using two or more classifiers. In 1990, Hansen and Salamon [12] utilized ensembles of artificial neural networks (ANNs) with similar configurations to improve the generalization performance of a single classifier. In 2005, Surowiecki [64] illustrated the basic idea of various ensemble learning methods and showed that, under certain controlled circumstances, the ensemble decisions or predictions of humans often outperform those made by an individual.
The ensemble methodology is used to enhance individual classifiers. The key idea is to train multiple classifiers and then combine them to achieve an overall classification. The ensemble learning approach has been successfully applied to many areas such as medical diagnosis [65,66], cheminformatics [67,68], and bioinformatics [69,70]. Ensemble methods reduce the dispersion of model performance and can deliver reliable predictive performance, but since they are typically based on sampling and/or cost-sensitive methods for their base classifiers, they may enjoy the benefits of these base classifiers but also inherit their disadvantages.
Boosting is one of the most practical techniques of ensemble learning through instance partitioning [71]. Simply put, boosting generates an ensemble classifier by applying resampling methods to the data and later combining the base classifiers through majority voting [64,72]. Boosting can also be utilized to deal with the class imbalance problem; the key principle behind it is to gradually enhance a weak learner into a strong learner [73].

Adaptive Boosting (AdaBoost). AdaBoost [45], introduced by Freund and Schapire in 1997, is one of the most representative boosting methods; it applies an iterative process to simple boosting to improve performance and focuses on the instances that are harder to classify. More specifically, each subsequent classifier in AdaBoost is trained on a reweighted dataset in which higher weights are assigned to instances misclassified by the previous classifier and lower weights to correctly predicted ones [74]. This is implemented by varying, stage by stage, each sample's weight, which indicates its importance in the classifier training process. Fig. 5 shows the process of combining the ultimate classifier.
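
A minimal runnable example with scikit-learn follows; the synthetic 95:5 class split is illustrative, not one of our study datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each boosting round reweights the training set: misclassified instances get
# higher weights and correctly classified ones lower weights, so later base
# learners concentrate on the hard (often minority) examples.
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```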
RUSBoost. RUSBoost [39] is a combination of data sampling and boosting derived from SMOTEBoost [75], which balances the class distribution through SMOTE [35] and then improves classifier performance on the balanced data with the help of AdaBoost. Instead of SMOTE, this hybrid approach utilizes random undersampling (RUS) for performance improvements.

DDAE. DDAE [40] is a novel model addressing the class imbalance problem that combines resampling, distance metric learning, cost-sensitive learning, and ensemble learning. Using kNN as its base classifier, DDAE has four components: 1) Data Block Construction (DBC) divides the training samples (both minority and majority) into a number of data blocks based on a given balance ratio; 2) Data Space Improvement (DSI) applies LMNN (similar to IML) to improve the data space (through relabeling and regrouping the neighbouring data blocks) for the training samples in each data block generated by the DBC component; 3) Adaptive Weight Adjustment (AWA) finds an appropriate overall class weight from the data coming from each data block [89]; 4) Ensemble Learning (EL) leverages ensemble learning with the weights determined via AWA; multiple base classifiers with a majority voting technique make the final decision for each input sample.
Self-paced Ensemble Classifier. Self-paced Ensemble Classifier [11] is an effective ensemble classifier generated by self-paced harmonizing of data hardness through undersampling. [11] shows this method can achieve robust performance even when the classes are highly overlapped and the data distribution is highly skewed.
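
The reference implementation released by the authors of [11] can be used roughly as follows; this is a sketch, and the package and import names below are our assumption based on the self-paced-ensemble PyPI distribution, so they should be checked against the installed version.

```python
# pip install self-paced-ensemble
from self_paced_ensemble import SelfPacedEnsembleClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each of the n_estimators base classifiers is trained on the full minority
# class plus a majority subset selected by self-paced hardness harmonization.
clf = SelfPacedEnsembleClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
```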

Evaluation metrics
Evaluation of classification performance plays a significant role in guiding learning and judging model quality [76]. Hence, in this paper we extensively evaluate these existing classification algorithms in different scenarios to understand their pros and cons.
One popular way of describing evaluation metrics is through the confusion matrix [77], which provides multiple performance results to offer a deeper understanding of predictive model performance, so that the types of error can be more clearly observed. Table 2 shows the structure of a confusion matrix for binary classification. Table 3 shows the measures derived using the confusion matrix from Table 2.
A more systematic way of depicting evaluation metrics was proposed by Ferri et al. [78], which classifies the evaluation metrics into three groups: probability metrics, ranking metrics and threshold metrics. In this paper, we apply two of these groups, threshold and ranking metrics, to evaluate model performance.
Threshold metrics, which quantify the classification prediction error, focus on the generalization ability of the trained classifier, i.e., its quality when predicting unknown examples [79]. The most common threshold metric is classification accuracy, applied in most conventional applications; nevertheless, accuracy is inappropriate for evaluating imbalanced datasets, since it is easy for a classifier that only predicts the majority class to yield a low error [80].
Sensitivity-specificity metrics are practical threshold metrics for imbalanced classification that are applied by several researchers [81,82]. As defined in Table 3, specificity denotes the true negative rate; sensitivity, its counterpart, describes the true positive rate. The geometric mean (G-mean) is calculated by combining sensitivity and specificity [83] and can balance both concerns. Sensitivity, specificity and G-mean are taken into account when both the positive and negative classes are meaningful at the same time [84]. In addition, Precision-Recall metrics, arising from the field of information retrieval, are utilized when the output on the minority (positive) class is more crucial [85]. In this paper, F1 (the Fβ-measure with β = 1) is utilized as one of the most important evaluation metrics.
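
For reference, these quantities can be computed directly from the confusion matrix of Table 2; the small helper below covers the binary case using scikit-learn's confusion_matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred):
    """Threshold metrics from the binary confusion matrix (positive = minority)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)        # true positive rate (recall)
    specificity = tn / (tn + fp)        # true negative rate
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": np.sqrt(sensitivity * specificity),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }
```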
Nonetheless, threshold metrics are not suitable when the class distribution observed in the training dataset does not match the distribution of the test set and the actual data, which can make the measured performance misleading [51].

Figure 6. An example of ROC Curve.
Ranking metrics focus on how effectively the base classifiers rank the examples [78]. The base classifier provides a numeric score for each example, referring to its probability of being classified as positive, which offers a finer level of granularity than a simple prediction. Different thresholds, whose choice affects the trade-off between the two classes' errors, can be utilized to test classifier performance [9]. The Receiver Operating Characteristic (ROC) Curve [86] is the most commonly applied ranking metric that is not based on a specific threshold. The ROC Curve takes the true positive rate (TPR) and false positive rate (FPR) into account, and each point on the ROC Curve corresponds to the performance of a single classifier under a given distribution [54]. The area under the ROC curve (AUCROC) is generally applied to measure and compare classifier performance, summarizing it into a single metric [84]. An example of a ROC Curve can be observed in Fig. 6. Point A (0, 1) represents the best performance of a classifier. Therefore, the closer the ROC curve is to A and the more it deviates from the 45-degree diagonal (representing a random classifier), the more successful the classifier is; this also means that the greater the AUCROC, the better [84,86]. However, [54] argues that even a classifier with a high AUCROC can perform poorly in a particular region of ROC space compared with a low-AUCROC classifier.
If a dataset is highly skewed, the performance of an algorithm observed through a ROC curve may be overly optimistic [86]. The Precision-Recall (PR) Curve, which offers a more informative representation of performance, is utilized to overcome this limitation [54,87]. The PR-Curve plots Recall on the x-axis and Precision on the y-axis [87]; it can capture the performance of a classifier correctly and effectively even when the number of false positives changes drastically, since Precision takes the ratio of TP to TP+FP into account [54]. Due to its effectiveness on highly skewed data, it has been applied to performance evaluation by many researchers, such as [88][89][90]. Unlike the ROC-Curve, whose objective is to be closer to the point (0, 1), the highest-performing classifier is represented by a PR-Curve residing in the top right of PR space (point C(1, 1)) [9]. Similar to AUCROC, the area under the PR Curve (AUCPRC) summarizes the PR-Curve with a single scalar value [9]. An example of a PR-Curve is also provided.
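
Both summary scores can be obtained directly from classifier scores; a short example follows (average_precision_score is scikit-learn's standard summary of the PR curve, closely related to AUCPRC, and the toy labels and scores below are illustrative).

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]                # skewed toy labels
y_score = [.1, .2, .15, .3, .4, .35, .2, .6, .55, .9]   # classifier scores

print("AUCROC:", roc_auc_score(y_true, y_score))
print("AUCPRC:", average_precision_score(y_true, y_score))
```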

Data Sources
Eight datasets were collected from the medical or healthcare sector: Yeast1vs7, Euthyroid Sick, Thyroid Sick, Mammographic (MGC), Wisconsin Diagnosis Breast Cancer (WDBC) and Pima Indian Diabetes (PID) from UCI [91], and two sub-datasets of Protein Homology (PH1 and PH2) from KDD Cup 2004 [92]; these are utilized to test the performance of the models. The details of these datasets are given in Table 4. In addition, eight further datasets are employed: Cm1, Mw1, Pc1, Pc3 and Pc4 from the NASA Metrics Data Program (NASA) dataset [93] on the software development process, the two card-playing datasets Poker89vs6 and Poker8vs6 from KEEL [16], and Optical Recognition of Handwritten Digits (optdigits) from UCI. All these datasets have imbalanced distributions but vary in imbalance ratio (IR), number of instances and number of features. The details of these datasets are given in Table 5.
We apply a few data cleaning techniques introduced by previous works (e.g., [94][95][96][97]). Data entries with duplicated, inconsistent, or missing values are either deleted or corrected by replacing the missing value with an appropriate substitute (for example, a categorical 'sex' attribute with missing entries, as shown in Fig. 9, can be converted into IS_SEX_MALE, IS_SEX_FEMALE and IS_SEX_NA). Outliers with extreme values are identified and cleaned up with the aid of data visualization, e.g., box plots with quantiles. For example, Fig. 10 and Fig. 11 depict the box plots for the attributes 'Blood Pressure' and 'Glucose' from the PID dataset, respectively. The outliers can be observed clearly in these illustrations: there are some instances with an extremely low value of 'Blood Pressure', which is rare in the real world, and an instance with a 'Glucose' value of 0 can also be viewed as an outlier. Such incorrect data, or data that violates common sense, may lead to an ineffective model.
Further data preprocessing techniques include feature encoding [99], which transforms data to make it more acceptable as input for the models. Table 6 shows an example attribute before and after such a transformation: the original attribute is replaced by four new attributes, 'RSrc1', 'RSrc2', 'RSrc3' and 'RSrc4'. The package Scikit-Learn provides a OneHotEncoder class to transform these kinds of categorical values into one-hot vectors [99].
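
A minimal example of such an encoding is given below; the 'RSrc' values are illustrative stand-ins for the attribute in Table 6, and the sparse_output parameter requires scikit-learn >= 1.2 (older versions use sparse).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rsrc = np.array([["RSrc1"], ["RSrc3"], ["RSrc2"], ["RSrc1"], ["RSrc4"]])
enc = OneHotEncoder(sparse_output=False).fit(rsrc)
print(enc.categories_)        # discovered category labels
print(enc.transform(rsrc))    # one 0/1 indicator column per category
```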

Overall Comparison
Tables 7-21 present the results of our performance evaluation of all examined imbalance classification algorithms in terms of G-mean, F1 and AUCPRC for each separate dataset.
As shown by these results, the G-mean of DDAE is close to or better than the average on most of the datasets. For example, the G-mean value obtained on MGC is 0.878, while the average value is only 0.750; the G-mean value yielded on Pc1 is 0.740, while the figure for RUSBoost is only 0.44. However, DDAE's F1 and AUCPRC are not satisfactory; they are lower than the corresponding average values on most of the datasets. In contrast, AdaBoost and self-paced Ensemble Classifier yield high performance in terms of F1, AUCPRC and G-mean under most circumstances; these two ensemble algorithms seem able to classify both classes accurately on most datasets. CAdaMEC, as well as csDCT, performed well on the optdigits and Thyroid Sick datasets, but on the Yeast1vs7 dataset their performance is the worst. SMOTE and MWMOTE are two sampling models with similar performance in most cases, except on the Poker8vs6 dataset.
Moreover, the performance of IML fluctuates above and below the average level; compared with the other algorithms, its advantage is not obvious.
To investigate the ability of these models in positive sample detection, the recall of all models on the medical datasets is presented in Table 22.
It shows that the recall of DDAE is good on all medical datasets. The recall of self-paced Ensemble Classifier and CAdaMEC also maintains a relatively stable level on nearly all the medical datasets.

Impact of Imbalance Ratio
First, to analyze the influence of the IR on the models, nine identically-sized sub-datasets with different imbalance ratios (IR = 10, 20, 30, 40, 50, 60, 70, 80 and 90) were used in the experiment; the x-axis of Figures 12-14 is the IR. Fig. 12 illustrates the trend of recall for all algorithms as the IR increases. This metric stays stable on DDAE most of the time, with DDAE keeping the recall between 0.9 and 0.95. MWMOTE, SMOTE, cost-sensitive Decision Tree, CAdaMEC, self-paced Ensemble Classifier and IML show a downward trend. Among these, the recall of MWMOTE, SMOTE, AdaBoost and self-paced Ensemble Classifier decreased only slightly as the class distribution became more imbalanced, with the highest and lowest values being 0.978 (IR=30) and 0.852 (IR=90) for MWMOTE, 0.967 (IR=30) and 0.793 (IR=80) for SMOTE, 0.865 (IR=10) and 0.517 (IR=80) for AdaBoost, and 0.898 (IR=10) and 0.795 (IR=80) for self-paced Ensemble Classifier. The recall figures of the other "decreasing" algorithms drop significantly: from 0.841 (IR=20) to 0.593 (IR=90) for csDCT, from 0.836 (IR=10) to 0.519 (IR=90) for CAdaMEC, and from 0.767 (IR=10) to 0.517 (IR=90) for IML. In addition, no obvious correlation between recall and the changes in IR can be seen in the curves of RUSBoost and MetaCost; both show slight fluctuations in this process.
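
The fixed-size, varying-IR sub-datasets above can be drawn with a helper along these lines (our reconstruction of the sampling setup, assuming label 1 marks the minority class):

```python
import numpy as np

def make_subset(X, y, n_total, ir, seed=0):
    """Random sub-dataset of n_total instances with IR = #majority / #minority."""
    rng = np.random.default_rng(seed)
    n_min = round(n_total / (ir + 1))
    min_idx = rng.choice(np.flatnonzero(y == 1), n_min, replace=False)
    maj_idx = rng.choice(np.flatnonzero(y == 0), n_total - n_min, replace=False)
    idx = rng.permutation(np.concatenate([min_idx, maj_idx]))
    return X[idx], y[idx]
```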
Notably, the G-mean of almost all the models remains nearly stable except for MetaCost. All of the models still retain a high G-mean value even when the IR is 70 or 80. Fig. 13 shows that some models perform better (or similarly) at IR=70 compared with IR=10, including DDAE, MWMOTE, SMOTE, RUSBoost, AdaBoost, MetaCost, self-paced Ensemble Classifier, IML and csDCT.
F1 is an evaluation metric which takes both recall and precision into account. In other words, if recall stays stable with only a slight decrease or increase, the changes in F1 will be similar to those in precision. Fig. 14 shows the F1 trends of SMOTE, MWMOTE, DDAE and csDCT, which reflect the changes in precision. The F1 value of MetaCost and AdaBoost decreases slightly during the process. In contrast, while the performance of the other models shows degradation to some degree, the figures for IML, RUSBoost and self-paced Ensemble Classifier stay almost stable; IML increases slightly once the IR exceeds 70, while self-paced Ensemble Classifier drops to a small degree. AUCPRC is another evaluation metric that considers recall and precision at the same time. As shown in Fig. 15, the AUCPRCs of DDAE, MWMOTE, SMOTE, CAdaMEC and IML decrease as the IR increases. The figure for MetaCost fluctuates significantly but still shows a decrease. The AUCPRCs of AdaBoost, RUSBoost and self-paced Ensemble Classifier are all on a slightly downward trend. Moreover, csDCT performs differently on AUCPRC from the other models, its figure climbing when the IR varies from 50 to 90.

Impact of the Size of Datasets
Next, in order to analyze whether the size of the dataset affects the classification accuracy of the models, five sub-datasets with the same IR but different numbers of instances (#Instances = 4500, 9000, 18000, 36000 and 72000) were taken from the Protein Homology dataset. The results can be observed in Figures 16-25.
It can be seen that, as the dataset size increases, the values of the five evaluation metrics for the algorithms DDAE, MWMOTE, SMOTE, IML and cost-sensitive Decision Tree (csDCT) show an upward trend, especially precision and F1, which increase by more than 0.2 and 0.1 respectively. This phenomenon shows that the classification results of these models are more accurate when the sample size is large than when it is small. The RUSBoost, AdaBoost, self-paced Ensemble Classifier and CAdaMEC algorithms are not significantly affected by the size of the dataset. Their performance in all five experiments is excellent, with evaluation metric values mostly between 0.9 and 1.0. Although CAdaMEC is slightly weaker when the total sample size is 4500, its recall value is still greater than 0.8, and once the total number of instances exceeds 10,000, this model predicts the labels of the majority and minority classes most accurately among all the algorithms.
It can be noted that only MetaCost does not show an apparent trend.

Evaluation on System Performance
We also evaluate the system performance of these algorithms, with all experiments conducted on the same machine. In terms of learning time, self-paced Ensemble Classifier performs fairly well, requiring the shortest time for learning, while IML and DDAE yield the longest learning times. As shown in Table 24, the memory usage of each algorithm is similar, even when processing different datasets, and there is no obvious relation between memory usage and the size of the dataset. Among all these algorithms, AdaBoost and self-paced Ensemble Classifier have the lowest memory usage.
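
A sketch of how such measurements can be taken in Python is shown below (an illustrative setup, not necessarily our exact instrumentation; note that tracemalloc tracks Python heap allocations only, so it underestimates the footprint of native code).

```python
import time
import tracemalloc

def profile_fit(clf, X, y):
    """Wall-clock learning time (s) and peak traced memory (MiB) of fit()."""
    tracemalloc.start()
    t0 = time.perf_counter()
    clf.fit(X, y)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak / 2**20
```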

Discussion
As shown in our experiments, DDAE yields the highest recall on almost all imbalanced datasets. Since recall focuses only on the minority, a higher recall means that most samples with a class label of 1 can be correctly predicted. Since recall and precision generally cannot be satisfactory at the same time, it is critical for a class imbalance classification model to achieve an appropriate F1. Compared with other algorithms like csDCT, RUSBoost and self-paced Ensemble Classifier, the F1 of DDAE is extremely low on some datasets. This illustrates that DDAE is more sensitive than the others, making it easier to predict a sample with a class label of 0 as 1; a primary cause to consider is that the number of training samples is too small. On several datasets, such as Poker8vs6, DDAE's F1 is extremely poor compared to the other algorithms, although its recall is better. This dataset contains a total of 1,477 samples, but its IR is as high as 85.882, that is, only 17 positive samples are included in this dataset. Compared with Poker8vs6, the F1 performance on PH1 (with 11,274 data samples) and optdigits (with 5,620 data samples) is much better. This can also be confirmed from Fig. 15 and Fig. 16, which show that when the size of the dataset is constant, the F1 and AUCPRC of DDAE on an individual dataset decrease significantly as the IR increases.
Iterative Metric Learning (IML) is a model that also utilizes Large Margin Nearest Neighbor (LMNN) to improve the data space and uses the kNN classifier as its base classifier. Unlike DDAE, IML consistently performs well in F1 and AUCPRC, even when its recall is low. Compared with other models, the evaluation metrics of IML vary across datasets, meaning that IML is more sensitive to the data distribution of the dataset. The results in Section 4.2 show that the performance of IML stays stable under different IRs.
The performance of SMOTE and MWMOTE seems quite similar, but the G-mean, F1 and AUCPRC of SMOTE on some slightly imbalanced datasets, such as WDBC (IR=1.866) and PID (IR=1.684), are better than those of MWMOTE. This can also be observed in Fig. 14, which shows that when the IR is relatively small, SMOTE performs better than MWMOTE. SMOTE randomly selects minority samples to synthesize new samples, regardless of the surrounding samples. This may generate useless samples if the selected minority sample is far away from the decision boundary; the newly generated samples may also overlap with the surrounding majority samples if the selected minority sample resides inside the majority region [52]. MWMOTE also applies the synthetic sample generation technique. However, unlike SMOTE, MWMOTE utilizes a clustering procedure to ensure that all the produced samples are located within the minority region, avoiding false or noisy synthetic samples [42]. Furthermore, MWMOTE employs a more practical approach that selects samples that are difficult to learn and then assigns them appropriate weights [42]. This can explain why MWMOTE performs better than SMOTE when the IR is high.
AdaBoost, RUSBoost and self-paced Ensemble Classifier are three alternative ensemble learning techniques to DDAE. AdaBoost is a basic implementation of the ensemble learning technique. RUSBoost adds a random undersampling technique on the foundation of boosting. Like DDAE, the self-paced Ensemble Classifier also cuts the majority class of the dataset into several bins and uses the resampled balanced training set (including a majority subset and the minority subset) to train each base classifier. Unlike DDAE, the other three ensemble methods do not take the weights of both classes into account, which can also contribute to a lower recall but higher F1 and AUCPRC values. AdaBoost and RUSBoost perform similarly. Compared with RUSBoost, the self-paced Ensemble Classifier and AdaBoost show a more stable performance. Even though AdaBoost and RUSBoost achieve only 0.196 and 0.214 respectively in terms of recall on the Pc3 dataset, the recall of the self-paced Ensemble Classifier reaches 0.786. MetaCost and CAdaMEC, the two cost-sensitive models, depend heavily on the supplied cost matrix. If the set cost matrix is not suitable for the predicted dataset and the data is highly skewed, the performance of the classifier can be extremely unsatisfactory, as on the Poker8vs6 dataset. Considering this, it is better to utilize this kind of classifier when the imbalance ratio is not very high. Our experiments confirmed that as the IR increases, the curves of all evaluation metrics of MetaCost and CAdaMEC fluctuate but still show a downward trend.
The impact of the size of the dataset on the performance of the models is noticeable. Except for MetaCost (which shows irregular fluctuation), the remaining algorithms either consistently perform very well, or all their evaluation metrics rise gradually as the number of samples in the dataset increases. For most algorithms, the more samples in the dataset, the more training samples are available to train the model, which provides a stronger training basis and enables more accurate prediction. Enlarging the scale of the dataset can enhance the completeness of the data, and can also alleviate the over-fitting caused by resampling methods such as MWMOTE and SMOTE.
Although our current experiments focus on binary classification, some of the investigated algorithms, such as AdaBoost, MetaCost and the cost-sensitive decision tree, are also suitable for multi-class classification. SMOTE, MWMOTE and IML focus on the feature space; thus, if these techniques are used for multi-class classification, the classifier combined with them should also be suitable for multi-class classification. Table 25 summarizes the pros and cons of all the techniques investigated in this paper. In consequence, since the algorithms have various characteristics, it is important to choose the appropriate algorithm depending on the actual situation. For instance, for algorithms used for stock prediction, people should pay more attention to precision, whereas in medical diagnosis or earthquake prediction, recall is more significant.
Based on our experimental results in the studied datasets, we recommend the following flowchart for algorithm selection, as shown in Fig. 26.

Conclusion
We conducted a systematic performance comparison of several class imbalance classification models using various datasets from medical and other sectors.
Our experiments showed that sampling methods perform the worst, and that cost-sensitive learning models achieve good performance when the dataset is not very imbalanced. Ensemble learning techniques generally perform better than other approaches due to the combined intelligence of multiple base classifiers. In terms of system performance, distance metric learning requires the longest learning time, while self-paced Ensemble Classifier requires the shortest; AdaBoost and self-paced Ensemble Classifier require the lowest memory usage.
Our analysis implies that the performance of algorithms cannot be judged by individual evaluation metrics alone; the application requirements and usage scenarios should also be taken into account. In particular, we make empirical recommendations for algorithm selection under different requirements and usage scenarios. In addition, the number of base estimators configured in ensemble learning classifiers can also impact their performance.