Over-sampling imbalanced datasets using the Covariance Matrix

INTRODUCTION: Nowadays, many machine learning tasks involve learning from imbalanced datasets, leading to the misclassification of the minority class. One of the state-of-the-art approaches to "solve" this problem at the data level is the Synthetic Minority Over-sampling Technique (SMOTE), which uses the K-Nearest Neighbors (KNN) algorithm to select and generate new instances. OBJECTIVES: This paper presents SMOTE-Cov, a modified SMOTE that uses the Covariance Matrix instead of KNN to balance datasets with continuous attributes and a binary class. METHODS: We implemented two variants: SMOTE-CovI, which generates new values within the interval of each attribute, and SMOTE-CovO, which allows some values to fall outside the interval of the attributes. RESULTS: The results show that our approach performs similarly to the state-of-the-art approaches. CONCLUSION: In this paper, a new algorithm is proposed to generate synthetic instances of the minority class using the Covariance Matrix.


Introduction
The Covariance Matrix can be used in different continuous optimization problems [25], such as the energy resource management (ERM) problem [26]. Moreover, real-world data often present characteristics that affect classification: noise, missing values, inexact or incorrect values, inadequate data size, poor representation in data sampling, etc. The imbalanced dataset problem is a field of interest because it occurs when the number of instances that represent one class is much larger than that of the other classes (rare events) [1], a common problem in areas such as fraud detection, cancer gene expressions, natural disasters, software defects, and risk management [2]. Rare events are difficult to detect because of their infrequency, and their misclassification can often result in heavy costs. For example, in smart computer security threat detection [3], dangerous connection attempts may appear in only a few out of hundreds of thousands of log records, but failing to identify a serious vulnerability breach would cause enormous losses. Other examples are: the classification of imbalanced data using radial-based undersampling [31], the improvement of interpolation-based oversampling to learn from imbalanced data [32], and the analysis of attribute mapping rules for recognition in an imbalanced DNA sequence dataset applying SVM [34]. Further examples of imbalanced datasets are:
- Determining when a High Voltage Circuit Breaker (HVCB) needs maintenance so that its performance remains consistent, an important problem since these components are used over long periods of time [27].
- Forecasting ramp events, whose accuracy tends to be low; this is a class imbalance problem for which data sampling methods are adopted as a remedy [28].
- Detecting anomalies in smart grids, a current topic investigated by many researchers using well-established pattern recognition methods [29].
Some energy forecasting datasets are not balanced, for example: 1) the electric power generation by electricity market module region and source; 2) the alternative fueling station locations; and 3) the electricity consumption from the California Energy Commission, sorted into residential and non-residential, from 2006 to 2009.
In the case of datasets with a binary class, a dataset can be defined as balanced if it has an approximately equal percentage of examples in the concepts to be classified, that is, if the distribution of examples by class is uniform; otherwise, it is imbalanced. To measure the degree of imbalance of a problem, [4] defined the Imbalance Ratio (IR) as:

IR = C+ / C-

where:
C+: number of instances that belong to the majority class
C-: number of instances that belong to the minority class

Therefore, a dataset is imbalanced when it has a marked difference (IR ≥ 1.5) between the examples of the classes. This difference causes low predictive accuracy for the infrequent class, as classifiers try to reduce the global error without taking the distribution of the data into account. In imbalanced sets, the original knowledge is usually labelled as oddities or noise, focusing exclusively on global measurements [5]. The problem with imbalance is not only the disproportion of representatives but also the high overlap between the classes. To face this problem, diverse strategies have been developed, which can be divided into four groups: at the data level [6,7], at the learning algorithm level [8], cost-sensitive learning [9], and based on multi-classifiers [10]. The techniques at the data level are the most used because their use is independent of the classifier that is selected.
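As an illustrative sketch (function names are ours, not from the paper), the IR check above can be written directly from the class counts:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR = C+ / C-: majority-class count over minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def is_imbalanced(labels, threshold=1.5):
    """A dataset is considered imbalanced when IR >= 1.5."""
    return imbalance_ratio(labels) >= threshold
```

For example, a dataset with 90 majority and 30 minority examples has IR = 3.0 and is therefore imbalanced.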
One of the best-known algorithms within data-level techniques is the Synthetic Minority Over-sampling Technique (SMOTE) [7,11] for the generation of synthetic instances. One of SMOTE's shortcomings is that it generalizes the minority area without regard to the majority class, leading to a problem commonly known as overgeneralization; this has been addressed with the use of cleaning methods such as SMOTE-Tomek Links (TL) [6,11], SMOTE-ENN [6,11], Borderline-SMOTE1 [11,12], SPIDER [13], SMOTE-RSB* [33], and ADASYN [6], among others. These algorithms have been designed to operate with both discrete and continuous feature values for two-class imbalanced problems; most of them use KNN to obtain the synthetic instances, and although this method offers good results, it does not take into account the dependency relationships between attributes, which can influence the correct classification of the examples of the minority class.
A way to obtain the dependency relationships between attributes is through Probabilistic Graphical Models (PGM) [14], which represent joint probability distributions where nodes are random variables and arcs represent conditional dependence relationships. Generally, a PGM has four fundamental components: semantics, structure, implementation, and parameters. Among PGMs there are Gaussian Networks, which are graphical interaction models for the multivariate normal distribution [15]; some of them use the Covariance Matrix (CM) to analyze relationships between variables.
This paper is an extension of the conference proceedings article [30], in which an algorithm based on SMOTE and Covariance Matrix estimation is presented to balance datasets with continuous attributes and a binary class, exploiting the dependency relationships between attributes and obtaining AUC [16] values similar to those of the state-of-the-art algorithms.
An experimental study was performed ranking the two SMOTE-Cov variants, SMOTE-CovI (which generates new values within the interval of each attribute) and SMOTE-CovO (which allows some values to fall outside the interval of the attributes), against SMOTE, SMOTE-ENN, SMOTE-Tomek Links, Borderline-SMOTE, ADASYN, SMOTE-RSB* and SPIDER, using 7 datasets from the UCI repository [17] with different imbalance ratios and C4.5 as the classifier. The performance of the classifier was evaluated using AUC, and hypothesis-testing techniques as proposed by [18,19] were applied for the statistical analysis of the results.

Over-sampling based on the Covariance Matrix
This section introduces over-sampling based on the Covariance Matrix. First, we describe the Covariance Matrix, which allows computing variable dependencies. Then, we give an overview of our proposed algorithm. Finally, we describe our experimental setup in four steps: tools, dataset selection, evaluation methodology, and the classifier used.

Covariance Matrix
The covariance matrix contains the covariances between the elements of a vector, where each covariance measures the linear relationship between two variables. If the column vector is X = (X1, ..., Xn)^T, then the covariance matrix Σ is the matrix whose (i, j) entry is the covariance

Σij = cov(Xi, Xj) = E[(Xi - μi)(Xj - μj)]

where the operator E denotes the expected value (mean) of its argument and μi = E[Xi]. The Covariance Matrix allows determining whether there is a dependency relationship between the variables, and it also provides the data necessary to estimate other parameters.
In addition, it is the natural generalization to higher dimensions of the concept of the variance of a scalar random variable [19].
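A minimal numeric sketch of the definition above, using NumPy's `cov` (which, like R's cov() used later in the paper, estimates Σij with the n-1 denominator); the toy data is made up for illustration:

```python
import numpy as np

# Toy sample: rows are instances, columns are attributes.
X = np.array([[2.0, 4.1],
              [2.9, 6.2],
              [4.1, 8.0],
              [5.0, 9.9]])

# Sample covariance matrix: entry (i, j) estimates
# E[(X_i - mu_i)(X_j - mu_j)]; rowvar=False treats columns as variables.
cov = np.cov(X, rowvar=False)

print(cov)        # symmetric 2x2 matrix
print(cov[0, 1])  # positive: the two attributes grow together
```

A clearly positive (or negative) off-diagonal entry indicates a linear dependency between the corresponding pair of attributes.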

SMOTE-Cov
Algorithm 1 shows the steps of SMOTE-Cov to balance datasets [30]. During the loading of the dataset in the first step, the algorithm expects continuous-valued attributes and a binary class. Then, it uses Equation 1 to verify whether the dataset is balanced or not. If it is imbalanced, the algorithm computes the Covariance Matrix, which allows detecting the dependency relationships between attributes. Then, from the estimated covariance matrix, new synthetic instances are generated to balance the minority class. This process stops when an equilibrium between the two classes is reached. The algorithm checks that all the new values generated from the covariance fall within the interval of each attribute; in case some value lies outside the interval, it is set to the minimum or maximum, performing a kind of repair of the value. The computational complexity of SMOTE-Cov in the worst case is O(n²), which is similar to some state-of-the-art approaches such as SMOTE-ENN, SMOTE-RSB* and ADASYN.
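The generation-and-repair step described above can be sketched as follows. This is an illustrative Python sketch, not the paper's R implementation: it assumes the minority class is summarized by its sample mean and covariance matrix and that new instances are drawn from the fitted multivariate normal distribution; all names are ours.

```python
import numpy as np

def generate_minority_instances(X_min, n_new, in_range=True, seed=0):
    """Sketch of SMOTE-Cov-style generation: draw synthetic minority
    instances from a multivariate normal fitted to the minority class.

    If in_range is True (as in the CovI variant), values that fall
    outside each attribute's min-max interval are 'repaired' by
    clipping them to the interval; if False (as in CovO), they are kept.
    """
    rng = np.random.default_rng(seed)
    mu = X_min.mean(axis=0)                  # per-attribute mean
    cov = np.cov(X_min, rowvar=False)        # estimated covariance matrix
    synthetic = rng.multivariate_normal(mu, cov, size=n_new)
    if in_range:
        synthetic = np.clip(synthetic, X_min.min(axis=0), X_min.max(axis=0))
    return synthetic
```

In a full balancing loop, instances would be generated until the minority class matches the majority class in size.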

Tools and experimental setup
The algorithm was developed using the R language because it is designed for statistical processing and has the cov() function for calculating covariance. In order to evaluate the behavior of the proposed algorithm, it was compared against the state-of-the-art oversampling data-balancing algorithms; two variants are taken into account: whether the generated attribute values must stay inside the interval of each attribute or may fall outside it.

Algorithm 1: SMOTE-Cov steps
Input: Dataset X, inRange [Boolean]
Output: Balanced dataset X
Step 1: Load dataset X;
Step 2: Compute the IR of X using Equation 1;
if IR ≥ 1.5 then
  Step 3: Estimate the covariance matrix using Equation 3; this provides a probabilistic distribution of the dataset;
  Step 4: For each attribute, determine a range given by its min-max values;
  while X is not in equilibrium do
    Step 5: Generate a new instance y according to the covariance matrix;
    if the range condition holds then add y to X;
  end
end
return X;

Seven datasets from the UCI repository were chosen with IR ≥ 1.5 (see Table 1), with continuous attributes and a binary class. This experiment uses 5-fold cross-validation, and the data is split into two subsets: a training/calibration set (80%) and a test set (20%). The final result is the mean of the 5 result sets. The partitions were made using KEEL in such a way that the number of instances per class remained uniform; the partitioned datasets are available on the KEEL website [21]. The training datasets are balanced by generating new synthetic instances of the minority class until they match the quantities of the majority class, while the control test sample remains imbalanced and without any modification. The new datasets are generated from the obtained instances using the SMOTE-Cov algorithm, and a classifier is used as a means to measure the performance against the other techniques.
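The stratified partitioning described above (uniform per-class proportions in every fold, as in the KEEL partitions) can be sketched in Python; this is an illustrative reimplementation, not KEEL's own code:

```python
import numpy as np

def stratified_kfold_indices(labels, k=5, seed=0):
    """Split instance indices into k folds, distributing each class's
    instances round-robin so per-class proportions stay uniform."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)                 # randomize within the class
        for i, j in enumerate(idx):
            folds[i % k].append(j)       # round-robin assignment
    return [np.array(sorted(f)) for f in folds]
```

With 5 folds, each iteration uses 4 folds (80%) for training/calibration and 1 fold (20%) for testing, and the reported result is the mean over the 5 test folds.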
The classifier used for the experimental study is C4.5 (implemented in the Weka package as J48) [22], which has been referred to as a statistical classifier, is one of the top 10 algorithms in Data Mining, and is widely used in imbalanced problems [4].
The Area Under the Curve (AUC) (Equation 5) is used to measure the performance of classifiers over imbalanced datasets using the graph of the Receiver Operating Characteristic (ROC) [16]. In these graphics, the trade-off between the benefits (TPrate) and costs (FPrate) can be visualized, representing the fact that no classifier can increase the number of true positives without also increasing the false positives. AUC summarizes the performance of the learning algorithm in a single number,

AUC = (1 + TPrate - FPrate) / 2

where:
TPrate: proportion of positive cases correctly classified as belonging to the positive class
FPrate: proportion of negative cases that were misclassified as positive examples
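For a single classifier operating point, the ROC curve consists of the segments joining (0,0), (FPrate, TPrate) and (1,1), and the AUC is the area under those two trapezoids; a one-line sketch:

```python
def auc_single_point(tp_rate, fp_rate):
    """AUC of the ROC polyline (0,0) -> (FPrate, TPrate) -> (1,1):
    area of the two trapezoids, which simplifies to (1 + TPrate - FPrate) / 2."""
    return (1.0 + tp_rate - fp_rate) / 2.0
```

A perfect classifier (TPrate = 1, FPrate = 0) gets AUC = 1.0, while a random one (TPrate = FPrate) gets 0.5.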

Experimental study
The AUC values are studied on these already balanced datasets. Table 2 shows that the AUC results of the data-balancing algorithm applying the Covariance Matrix, with its CovI and CovO variants, are similar or comparable to the state-of-the-art oversampling algorithms, using C4.5 as the classifier. For the statistical analysis of the results, hypothesis-testing techniques were used [18,19]. In both experiments, the Friedman and Iman-Davenport tests were applied [23] in order to detect statistically significant differences between groups of results. Holm's test was also carried out [24], with the aim of finding significantly superior algorithms. These tests are suggested in the studies presented in [18,19,23], where it is stated that their use is highly recommended for the validation of results in the field of automated learning. Table 3 shows the ranking obtained by the Friedman test for the experiment. Although the algorithm with the best ranking was ADASYN, Holm's test performed below shows to what extent this algorithm is significantly superior to the one proposed in this research. Table 4 summarizes the results of Holm's test taking ADASYN as the control method; all hypotheses with p-value ≤ 0.05 are rejected, showing that ADASYN is significantly superior to the SMOTE-CovI and Borderline-SMOTE algorithms. In the case of SMOTE-CovO, SPIDER, SMOTE-ENN, SMOTE-TL, Original, SMOTE-RSB* and SMOTE, the null hypothesis is accepted; this means that there are no significant differences between ADASYN and them, so it can be concluded that they are equally effective.
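The Friedman ranking reported in Table 3 is based on the average rank of each algorithm across datasets; a tie-unaware sketch (the real test assigns mean ranks to tied scores) with hypothetical AUC values, not the paper's results:

```python
import numpy as np

def friedman_average_ranks(scores):
    """scores: (n_datasets, n_algorithms) matrix of AUC values.
    Rank the algorithms within each dataset (rank 1 = highest AUC)
    and return the average rank of each algorithm.
    Note: ties are not handled; a full Friedman test assigns mean ranks."""
    scores = np.asarray(scores, dtype=float)
    order = (-scores).argsort(axis=1)       # descending order of AUC
    ranks = order.argsort(axis=1) + 1       # double argsort yields ranks
    return ranks.mean(axis=0)
```

The algorithm with the lowest average rank (ADASYN in the paper's experiment) heads the Friedman ranking and is then used as the control method for Holm's post hoc test.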
On the other hand, the results of Nemenyi's post hoc comparisons for α = 0.05 with adjusted p-values show that Nemenyi's procedure rejects those hypotheses that have a p-value ≤ 0.001111.
The Nemenyi test finds differences between groups of data after a statistical test of multiple comparisons has rejected the null hypothesis that the performance of the compared groups is similar. The test performs pair-wise comparisons of performance. As can be observed, ADASYN and SMOTE-CovI have a similar performance, while the remaining pairs perform differently.

Conclusions and future work
In this paper, a new algorithm is proposed to generate synthetic instances of the minority class using the Covariance Matrix. The experimental study carried out shows the effectiveness of the proposed algorithm compared to eight recognized algorithms of the state-of-the-art. SMOTE-Cov showed similar or comparable results, taking into account the AUC results of the C4.5 classifier and using non-parametric tests to demonstrate that there are no significant differences between them, with the exception of ADASYN versus the SMOTE-CovI variant. This may be influenced by the fact that the generated attribute values come from intervals other than the actual attribute intervals within the dataset.
Having results comparable to those of the state-of-the-art for these datasets makes it possible in the future to extend the experimentation to datasets with tens, hundreds or thousands of attributes and with strong dependency relationships. It is also intended to use covariance regularization (shrinkage) to balance data in cases where the number of positive instances is smaller than the number of attributes. A final recommendation is to study the extension of the proposed algorithm to multi-class classification problems.

Table 1 .
Description of the datasets used in the experiments

Table 2 .
AUC of the data balancing algorithms with generation of oversampling classes of the state-of-the-art, CovI and CovO

Table 4 .
Holms test with α = 0.05, taking ADASYN as a control method