Analysis and improvement of evaluation indexes for clustering results

Clustering algorithms are a major research area in collaborative computing for social networks, and how to evaluate clustering results accurately has become a hot topic in clustering research. The commonly used evaluation indexes are SC, DBI and CHI. The calculation of these three indexes has two shortcomings. (1) When the number of clusters and the objects in each cluster are kept unchanged and only the feature vectors are transformed, the three indexes change greatly. (2) When the feature vectors and the number of clusters are kept unchanged and the objects in the clusters change, the three indexes change only slightly. This shows that the three indexes cannot evaluate clustering results well. Therefore, based on the calculation process of the three indexes, the paper proposes three new indexes: NSC, NDBI and NCHI. Tests on standard data sets show that the three new indexes evaluate clustering results better.

Received on 07 October 2019, 18 February 2020, published on 18 February 2020


Introduction
Clustering is an important data mining technique in collaborative computing for social networks. The purpose of clustering is to bring similar objects together and to separate dissimilar objects. Because the objects handled by a clustering algorithm are unlabeled, the cluster to which each object belongs is unknown, so how to evaluate a clustering result well has become one of the research hotspots in unsupervised learning. At present, there are three main indexes for evaluating unlabeled clustering results: the Calinski-Harabasz Index [1], the Davies-Bouldin Index [2] and the Silhouette Coefficient [3]. Each defines its own way of calculating intra-cluster and inter-cluster relations, and evaluates a clustering result by combining the two. Because the three indexes evaluate clustering results intuitively, they can be applied to a wide range of clustering scenarios and can also be used to test the effect of optimizing a clustering algorithm.
A clustering task is divided into three steps. First, objects are mapped to feature vectors by certain rules. Second, the feature vectors are clustered by a clustering algorithm. Finally, CHI (Calinski-Harabasz Index), DBI (Davies-Bouldin Index) and SC (Silhouette Coefficient) are used to evaluate the clustering results. However, there are two problems in evaluating the clustering results. (1) Different researchers propose different rules for generating feature vectors, so for the same set of objects, different generation rules produce different feature vectors. When the number of clusters and the objects in each cluster are kept unchanged and the vector elements change, the calculated values of CHI, DBI and SC change greatly. (2) Different researchers propose different clustering algorithms. When the number of clusters and the elements of each vector are kept unchanged and the objects in each cluster change, the calculated values of CHI, DBI and SC change only slightly. To solve these two problems and evaluate clustering results better, this paper proposes the new indexes NSC (New SC), NDBI (New DBI) and NCHI (New CHI), based on the calculation process of the three indexes. The main contributions are as follows:
• When the number of clusters and the objects in each cluster are kept unchanged and the feature vectors change, the problem that the calculated values of the indexes change greatly is solved to some extent. In calculating the relationship between feature vectors, the cosine of the angle between vectors is used in place of the Euclidean distance.

* Corresponding author. Email: scnuzhonghao@foxmail.com

Related Works
The indexes CHI, DBI and SC have been applied to evaluate clustering results in many scenarios. For example, Hassani et al. [4] argued that when a clustering algorithm is applied to dynamic unlabeled data, labels cannot be used to evaluate the clustering results; CHI, being an index based on the data itself, is suitable for evaluating clustering results on dynamic data. Schkafer et al. [5] analyzed users' dynamic consumption data and static information data, clustered the users with a hybrid fuzzy clustering algorithm, and used CHI, DBI and SC to evaluate the results. To support the division and management of different regions, Arroyo et al. [6] analyzed meteorological data from Spain, clustered the regions with K-Means and other clustering algorithms, and evaluated the results with CHI, DBI and SC. Babichev et al. [7] took the gene expression sequences of cancer patients as features, measured the similarity between features with different criteria, clustered the patients according to that similarity, and used CHI to evaluate the results. To manage power resources better, Damayanti et al. [8] analyzed power consumption in each time period, applied a clustering algorithm in every period, and selected the best clustering result according to DBI. To improve the accuracy of software size estimation, Prokopov et al. [9] proposed a new clustering method based on use case points, compared it with clustering algorithms such as K-Means, and evaluated the results with indexes such as CHI and SC. Umam et al. [10] proposed a hybrid clustering method that combines K-Means and hierarchical clustering, clustered DNA sequences as features, and used DBI to evaluate the results. To manage video data better, Kumar et al. [11] proposed an equal-partition clustering technique that clusters video in real-time applications, and used DBI to evaluate the results. To cluster hospital patients, Thanh et al. [12] constructed, on the basis of an algebraic structure, a neutron recommendation equivalent matrix for the patients and performed λ-cutting on the matrix; the result of the cutting is the patient clustering, which was evaluated with DBI. To place public facilities near the appropriate population, Kisore et al. [13] proposed a clustering algorithm based on generalized density; users were clustered by their requirements, preferences and geographical location, and SC and DBI were used to evaluate the results. To analyze telecommunication service clusters, Tosida et al. [14] proposed a self-organizing mapping algorithm based on an artificial neural network, clustered telecommunication service facilities, and used DBI as the evaluation index. To provide data for the construction of wind power plants, Yesilbudak et al. [15] used the K-Means algorithm to divide regions of Turkey according to wind speed, and selected the best number of clusters according to the change in SC. To recommend different types of movies, Alfarizy et al. [16] extracted the valid information in movie subtitles and clustered the movies on that information; the clustering is taken to be best when SC reaches its maximum. Rani et al. [17] applied clustering to news text, extracting the words in the text as features; hierarchical clustering and K-Means were used to obtain different clustering results, and the optimal result was selected according to the change in SC. Sarasa et al. [18] clustered birds and animals based on audio data and proposed a normalized compression distance to measure the similarity between audio features; hierarchical clustering was used to cluster the audio features, and SC was used to evaluate the results. Mago et al. [19] applied clustering to neonates: hierarchical clustering was run on the characteristic data of the neonates and SC was used to evaluate the results, which can provide data for doctors diagnosing neonates with different conditions.

The indexes CHI, DBI and SC can also be used to evaluate clustering algorithms. For example, Raposo et al. [20] proposed an automatic clustering algorithm based on a genetic algorithm that selects the best number of clusters for a data set, and used CHI as the evaluation index to compare it with K-Means and fuzzy C-means. To cluster hospital patients, Li et al. [21] proposed a multi-objective clustering algorithm based on multi-objective differential evolution, which optimizes multiple objectives at the same time to find the optimal clustering result. Siddiqi et al. [22] proposed a heuristic algorithm for the clustering problem in two parts: the first part uses a greedy algorithm to select data points with higher resolution as cluster centers, and the second part sets CHI as the objective function, so that the optimal clustering result is obtained when the objective function reaches its maximum. To achieve image clustering, Toz et al. [23] combined the fuzzy C-means algorithm with the backtracking search optimization algorithm, improved the local search ability of the algorithm by optimizing the objective function, and used DBI to evaluate the results. Gowri et al. [24] argued that clustering large data requires distributed processing with the MapReduce framework; without changing the clustering result, and judging by the change in DBI, they showed that preprocessing the data with MapReduce shortens the running time of the clustering algorithm. Andryani et al. [25] applied fuzzy C-means to DNA clustering and proposed combining a splitting algorithm with fuzzy C-means, which simplifies the search for the minimum of the objective function; the final clustering results were evaluated with DBI. Halim et al. [26] transformed the data into a probability map representation and proposed a density-based clustering method to cluster the data; the results were evaluated with DBI and SC. Ketsuwan et al. [27] proposed using linear discriminant analysis to reduce the dimension of the feature vectors, which further minimizes intra-cluster dispersion and maximizes inter-cluster separation, thus improving the DBI of the clustering results. Hasanzadeh et al. [28] proposed an automatic learning machine clustering algorithm. In the clustering process, the reinforcement signal is defined by the Euclidean distance between the data points and the cluster centers, and the automatic learning machine decides the cluster of each data point by correcting the reinforcement signal. SC was used to compare the clustering results with K-Means.

EAI Endorsed Transactions on Collaborative Computing 10 2017 -06 2020 | Volume 4 | Issue 13 | e4
Regarding the problems in the calculation of CHI, DBI and SC, Amorim et al. [29] argued that scaling features affects indexes such as CHI and SC. To improve the clustering indexes, data sets were scored with indexes such as CHI and SC in order to find the ideal scaling factors. Fernandez et al. [30] argued that constructing different features for the same object affects CHI. Therefore, based on the Laplace value and CHI, they proposed a feature selection framework that constructs features better suited to clustering tasks and so achieves better clustering results. Cheng et al. [31] considered that SC and DBI cannot be applied to evaluate the clustering results of all types of data. Therefore, based on SC, they proposed a new metric: clustering effectiveness based on local clustering centers. The index selects the local cluster centers with locally maximal density as representative points, which evaluates the difference between cluster centers more accurately.
In the research above, some researchers who applied CHI, DBI and SC to evaluate clustering results pointed out shortcomings in the calculation process. However, they only used CHI, DBI and SC as objective functions, searching for the optimum of the objective function in order to construct features or to select the number of clusters; the calculation process of CHI, DBI and SC itself was not changed. This paper analyzes the problems in the calculation process of CHI, DBI and SC and proposes NCHI, NDBI and NSC as new indexes that can evaluate clustering results better.

Definition of Relevant Symbols
According to the definition of three indexes, the variables and functions used in the calculation of indexes are defined in Table 1.

Calinski-Harabasz Index
When evaluating clustering results, the larger the calculated value of CHI (Calinski-Harabasz Index), the better the clustering. CHI rests on two concepts: group dispersion and within-cluster dispersion. For cluster k, the group dispersion is denoted Grd(k) and the within-cluster dispersion is denoted Wcd(k). Their calculation is shown in formulas (1) and (2).
According to formulas (1) and (2), the calculation of CHI is shown in formula (3).
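As an illustration of the idea behind formulas (1) to (3), the following is a minimal sketch of the textbook Calinski-Harabasz definition: the size-weighted squared distance of each cluster center to the overall mean, divided by the squared distance of each point to its own center, with the usual (k - 1) and (n - k) normalization. It is an assumption that this matches the paper's Grd(k) and Wcd(k) term by term; the function names and the toy data are hypothetical.

```python
def sq_eudis(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def calinski_harabasz(clusters):
    """Textbook CHI for a list of clusters (each a list of vectors).

    Assumes at least two clusters and a non-zero within-cluster dispersion.
    """
    all_points = [p for cluster in clusters for p in cluster]
    n, k = len(all_points), len(clusters)
    overall_center = mean_vector(all_points)
    between = 0.0  # size-weighted squared distance of each center to the overall mean
    within = 0.0   # squared distance of every point to its own cluster center
    for cluster in clusters:
        center = mean_vector(cluster)
        between += len(cluster) * sq_eudis(center, overall_center)
        within += sum(sq_eudis(p, center) for p in cluster)
    return (between * (n - k)) / (within * (k - 1))


# Toy data (hypothetical): two tight squares, clustered well and clustered badly.
good = [[(0, 0), (0, 1), (1, 0), (1, 1)], [(10, 10), (10, 11), (11, 10), (11, 11)]]
mixed = [[(0, 0), (0, 1), (10, 10), (10, 11)], [(1, 0), (1, 1), (11, 10), (11, 11)]]
chi_good = calinski_harabasz(good)
chi_mixed = calinski_harabasz(mixed)
print(chi_good, chi_mixed)
```

On this toy data the well-separated partition scores far higher than the mixed one, in line with the rule that a larger CHI indicates a better clustering.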

Davies-Bouldin Index
When evaluating clustering results, the smaller the calculated value of DBI (Davies-Bouldin Index), the better the clustering. DBI rests on one concept: the average similarity between two clusters. The similarity between cluster i and cluster j is denoted Sim(i,j).
The calculation of Sim(i,j) uses formula (2), as shown in formula (4).
According to formula (4), the similarity between cluster i and every other cluster is calculated, and the similarity list SL(i) of cluster i is constructed. The maximum value in each list is extracted, and the mean of these maxima over all clusters is the DBI, as shown in formula (5).
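The procedure can be sketched with the textbook Davies-Bouldin definition, in which the similarity of two clusters is the sum of their mean point-to-center distances divided by the distance between their centers. Whether the paper's Wcd(k) and Sim(i,j) use exactly these terms is an assumption; the function names and toy data are hypothetical.

```python
from math import sqrt


def eudis(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def davies_bouldin(clusters):
    """Textbook DBI for a list of clusters (each a list of vectors).

    Assumes at least two clusters with distinct centers.
    """
    centers = [mean_vector(c) for c in clusters]
    # Scatter of each cluster: mean distance of its points to the cluster center.
    scatter = [
        sum(eudis(p, center) for p in cluster) / len(cluster)
        for cluster, center in zip(clusters, centers)
    ]
    k = len(clusters)
    maxima = []
    for i in range(k):
        # Similarity list of cluster i against every other cluster (SL(i) in the text).
        similarity_list = [
            (scatter[i] + scatter[j]) / eudis(centers[i], centers[j])
            for j in range(k)
            if j != i
        ]
        maxima.append(max(similarity_list))
    # The index is the mean of the per-cluster maxima.
    return sum(maxima) / k


# Toy data (hypothetical): two tight squares, clustered well and clustered badly.
good = [[(0, 0), (0, 1), (1, 0), (1, 1)], [(10, 10), (10, 11), (11, 10), (11, 11)]]
mixed = [[(0, 0), (0, 1), (10, 10), (10, 11)], [(1, 0), (1, 1), (11, 10), (11, 11)]]
dbi_good = davies_bouldin(good)
dbi_mixed = davies_bouldin(mixed)
print(dbi_good, dbi_mixed)
```

Here the well-separated partition yields the smaller value, consistent with the rule that a smaller DBI indicates a better clustering.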

Silhouette Coefficient
When evaluating clustering results, the larger the calculated value of SC (Silhouette Coefficient), the better the clustering. SC rests on two concepts: the mean distance between a sample and all other points in the same cluster, and the mean distance between a sample and all points in the next nearest cluster. For a sample K_m, the mean distance to all other points in the same cluster is denoted Wcmd(K_m), and the mean distance to all points in another cluster k' is denoted Gmd(K_m,k'). Their calculation is shown in formulas (6) and (7).
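A minimal sketch of the textbook Silhouette Coefficient follows. For each sample it computes the mean distance to the other points of its own cluster (the role played by Wcmd) and the smallest mean distance to the points of any other cluster (the minimum over Gmd), then averages (b - a) / max(a, b) over all samples. Scoring singleton clusters as 0 is a common convention assumed here, and the function names and toy data are hypothetical.

```python
from math import sqrt


def eudis(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_coefficient(clusters):
    """Textbook SC for a list of clusters (each a list of vectors).

    Assumes at least two clusters and that no sample has a == b == 0.
    """
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # assumed convention for singleton clusters
                continue
            # a: mean distance to the other points of the same cluster.
            a = sum(
                eudis(p, q) for qi, q in enumerate(cluster) if qi != pi
            ) / (len(cluster) - 1)
            # b: smallest mean distance to all points of any other cluster.
            b = min(
                sum(eudis(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data (hypothetical): two tight squares, clustered well and clustered badly.
good = [[(0, 0), (0, 1), (1, 0), (1, 1)], [(10, 10), (10, 11), (11, 10), (11, 11)]]
mixed = [[(0, 0), (0, 1), (10, 10), (10, 11)], [(1, 0), (1, 1), (11, 10), (11, 11)]]
sc_good = silhouette_coefficient(good)
sc_mixed = silhouette_coefficient(mixed)
print(sc_good, sc_mixed)
```

The well-separated partition scores close to 1 and the mixed one scores below 0, consistent with the rule that a larger SC indicates a better clustering.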

Problem Description
Keep the number of clusters and the objects in each cluster unchanged; then Len(K), k and n remain unchanged. According to the formula of Eudis(), when the vector elements of K_m vary over the interval [0, +∞), the quantities Grd(K_m), Wcd(K_m), Sim(K_m), Gmd(K_m, k') and Wcmd(K_m), which all depend on Eudis(), also vary over [0, +∞). As a result, the calculated values of CHI, DBI and SC change, as shown in formula (9).
If logarithmic or exponential transformations are applied to the vector elements, the rates of change of Grd(K_m), Wcd(K_m), Sim(K_m), Gmd(K_m, k') and Wcmd(K_m) differ. CHI, DBI and SC therefore attain maximum and minimum values as K_m ranges over [0, +∞), so the clustering results cannot be evaluated reliably from the calculated values of the indexes.
Keep the number of clusters and the elements of the feature vectors unchanged; when the objects in a cluster change, Mean(X), k and n remain unchanged. A clustering algorithm groups similar feature vectors into one cluster. With the number of clusters fixed, assume that the set of objects in cluster k changes, so that a new cluster center C_k' is obtained. The new center is approximately equal to the original center C_k, that is, C_k ≈ C_k'. In particular, CHI, DBI and SC all use average distances to describe the relationships between and within clusters, which further narrows the gap between the inter-cluster and intra-cluster relationships. Therefore, a change of objects in the clustering result is not well reflected in the indexes.

Solutions to problems
When the elements of the feature vectors vary over the interval [0, +∞), if Eudis() is used to represent the relationship between any two feature vectors, the calculated value of Eudis() also varies over [0, +∞). The paper proposes to use Cos() to measure the relationship between vectors. When the vector elements vary over [0, +∞), the calculated value of Cos() stays within the interval [0,1]. Suppose that the relationship between vectors a and b is calculated by Cos(), as shown in formula (10).
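The boundedness argument can be illustrated with the standard cosine of the angle between two vectors, a·b / (‖a‖ ‖b‖), which is assumed here to be what formula (10) expresses. The sketch below compares it with the Euclidean distance while the same pair of non-negative vectors is rescaled; the function names and sample vectors are hypothetical, and uniform rescaling is only one simple kind of feature change.

```python
from math import sqrt


def eudis(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cos_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


a, b = (1.0, 2.0, 3.0), (2.0, 1.0, 0.5)
rows = []
for scale in (1, 10, 1000):
    sa = tuple(scale * x for x in a)
    sb = tuple(scale * x for x in b)
    # The distance grows with the scale; the cosine does not.
    rows.append((scale, eudis(sa, sb), cos_sim(sa, sb)))
    print(rows[-1])
```

For vectors with non-negative elements the dot product is non-negative, so the cosine lies in [0, 1] no matter how large the elements become.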
When the object set of a cluster changes, the number of elements in the set changes accordingly. Therefore, when calculating inter-cluster or intra-cluster relationships, the number of elements in the cluster is used as a coefficient to increase the calculated value of the inter-cluster or intra-cluster relationship, thereby enlarging the change in the calculated values of CHI, DBI and SC. Because the calculated values change by a larger magnitude, they better reflect a change in the object sets of the clusters.

Redefinition of Calinski-Harabasz Index
Following the definition of CHI, every use of Eudis() in Grd(k) and Wcd(k) is replaced by Cos(). Moreover, in the calculation of Grd(k), Len(K) is used as a coefficient to enlarge the calculated value of Grd(k), thereby widening the gap between Grd(k) and Wcd(k). The calculations of Grd'(k) and Wcd'(k) are shown in formulas (11) and (12).
The larger the calculated value of CHI, the better the clustering result. Without changing this principle, NCHI is defined as shown in formula (13).

Redefinition of Davies-Bouldin Index
Following the definition of DBI, every use of Eudis() in Wcd(k) and Sim(i,j) is replaced by Cos(). Moreover, in the calculation of Sim(i,j), Len(I) and Len(J) are used as coefficients to enlarge the calculated values of Wcd'(i) and Wcd'(j), thereby widening the gap between the numerator and the denominator in the formula of DBI. The calculation of Sim'(i,j) is shown in formula (14).
The smaller the calculated value of DBI, the better the clustering result. Without changing this principle, the similarity between cluster i and every other cluster is calculated, and the similarity list SL'(i) of cluster i is constructed: SL'(i)=[Sim'(i,1), Sim'(i,2), Sim'(i,3), ..., Sim'(i,k)]. The maximum value in each list is extracted, and the mean of these maxima over all clusters is taken as NDBI, as shown in formula (15).

Redefinition of Silhouette Coefficient
Following the definition of SC, every use of Eudis() in Gmd(K_m,k') and Wcmd(K_m) is replaced by Cos(). Moreover, Len(K') is used as a coefficient to enlarge the calculated value of Gmd(K_m,k'), and Len(K) is used as a coefficient to enlarge the calculated value of Wcmd(K_m), thereby widening the gap between Gmd(K_m,k') and Wcmd(K_m) in the formula of SC. The calculations of Wcmd'(K_m) and Gmd'(K_m,k') are shown in formulas (16) and (17).
The larger the calculated value of SC, the better the clustering result. Without changing this principle, the average distances between K_m and other clusters

Data Set and Evaluation Standard
Four standard data sets from the sklearn database are selected to test the effectiveness of the proposed method. The data sets are suited to classification tasks. The selected data sets and the way they are invoked are shown in Table 2. After the three indexes are redefined, their dimensions change. Therefore, the paper uses the CV (coefficient of variation) as the evaluation standard for the experimental results. The calculated value of CV indicates the degree of dispersion of the original indexes and of the new indexes: the larger the absolute value of the CV, the more dispersed the calculated values of the index. The calculation method is shown in formula (19), where N represents the number of samples (or experiments) and EI represents the calculated results of an index.
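As a small illustration, the coefficient of variation is conventionally the standard deviation divided by the mean, which makes it independent of the scale of the index. The sketch below assumes the population standard deviation (dividing by N); whether formula (19) divides by N or by N - 1 is not recoverable from the text, so that choice and the function name are assumptions.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean of a sequence of index values.

    Uses the population standard deviation (an assumption).
    """
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / m


# Hypothetical index values from eight experiments: mean 5, standard deviation 2.
cv_example = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
# Rescaling every value leaves the CV unchanged, so indexes on different
# scales can be compared.
cv_scaled = coefficient_of_variation([200, 400, 400, 400, 500, 500, 700, 900])
print(cv_example, cv_scaled)
```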

An Example of Calculating Indexes
In the first experiment, an example is given to demonstrate the problems in the calculation of the indexes. The iris data set is used as test data; it is labeled, and its labels are taken by default as the optimal clustering result. Keeping the clustering result unchanged, a negative exponential transformation, $X_n = e^{-1/X_n}$, is applied to all vector elements of iris (the transformed data is denoted New iris). With the optimal clustering result fixed, CHI, DBI and SC are used to evaluate the original data and the transformed data. The experimental results are shown in Table 3. As Table 3 shows, only the vector elements are transformed, yet the calculated values of CHI, DBI and SC suggest an improvement in the clustering, although the clustering result has not changed. This shows that the calculated values of CHI, DBI and SC are sensitive to the vector elements.
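This kind of check can be sketched with scikit-learn, which ships the iris data and implementations of the three indexes. The sketch assumes that the transformation reads exp(-1/x) applied element-wise (the formula is garbled in the source, so this reading is an assumption) and that the paper's indexes correspond to scikit-learn's `calinski_harabasz_score`, `davies_bouldin_score` and `silhouette_score`. No attempt is made to reproduce the values of Table 3.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# The true labels are used as the fixed clustering result.
X, y = load_iris(return_X_y=True)

# Assumed reading of the transformation; every iris measurement is > 0,
# so the division is safe.
X_new = np.exp(-1.0 / X)


def three_indexes(data, labels):
    """Return CHI, DBI and SC for one labelling of one data matrix."""
    return {
        "CHI": float(calinski_harabasz_score(data, labels)),
        "DBI": float(davies_bouldin_score(data, labels)),
        "SC": float(silhouette_score(data, labels)),
    }


original = three_indexes(X, y)
transformed = three_indexes(X_new, y)
for name in original:
    print(name, original[name], transformed[name])
```

The labels are identical in both calls, so any difference between the two columns comes only from the change of feature values.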
Similarly, the iris data set is used as test data. Because K-Means selects its initial cluster centers randomly, the K-Means algorithm is run on iris twice. The two clustering results have the same number of clusters, but the objects in each cluster differ. The two clustering results are each evaluated with CHI, DBI and SC. The experimental results are shown in Table 4. As Table 4 shows, although the objects in each cluster have changed, the change in the calculated values of the three indexes is very small, so the calculated values of the indexes do not clearly reflect the change in the clustering results.

Clustering Results Testing-Changing Feature Vectors
In the second experiment, the elements of each feature vector are transformed by the negative exponential transformation $X_n = e^{-\alpha/X_n}$, where α is the transformation coefficient. α ranges over the interval [1,1000], and each time α is increased, one cluster analysis is completed. Each clustering result is evaluated with CHI, DBI, SC, NCHI, NDBI and NSC, and the CV of the 1000 clustering results is calculated. The experimental results are shown in Table 5. When the same clustering result is evaluated, the ideal situation is that the calculated values of the indexes do not change. As Table 5 shows, the CVs of CHI, DBI and SC are larger, and the CVs of NCHI, NDBI and NSC are smaller. This indicates that the calculated values of CHI, DBI and SC vary more, and the calculated values of NCHI, NDBI and NSC vary less. The experimental results show that the calculated values of the new indexes are more stable when the features are transformed: the influence of the feature transformation on the calculated values is reduced to some extent, and the clustering result can be evaluated better.

Clustering Results Testing-Changing Objects in Clusters
Because the K-Means algorithm selects its initial cluster centers randomly, clustering the same data with K-Means may give a different result each time. In the third experiment, the number of clusters is set to 8, and each data set is clustered 500 times. Each clustering result is evaluated with CHI, DBI, SC, NCHI, NDBI and NSC, and the CV of the 500 clustering results is calculated. The experimental results are shown in Table 6. When the objects in the clusters change, the ideal situation is that the calculated values of the indexes change significantly. As Table 6 shows, the CVs of CHI, DBI and SC are smaller, and the CVs of NCHI, NDBI and NSC are larger. This indicates that the calculated values of CHI, DBI and SC vary less, and the calculated values of NCHI, NDBI and NSC vary more. The experimental results show that when the objects in the clusters change, the calculated values of the new indexes change more noticeably. They show more clearly that the objects in the clusters have changed, so the clustering result can be evaluated better.
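The protocol for the three original indexes can be sketched with scikit-learn as below. This is only an approximation under stated assumptions: it uses iris alone, 20 runs instead of 500 to keep the run time short, `init="random"` with `n_init=1` so that each run depends on its own random start, and the population standard deviation for the CV. The new indexes NCHI, NDBI and NSC are not included because their formulas are not recoverable from the text.

```python
from statistics import mean, pstdev

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

X, _ = load_iris(return_X_y=True)

runs = 20  # the paper uses 500; reduced here as an assumption for speed
history = {"CHI": [], "DBI": [], "SC": []}
for seed in range(runs):
    # One random initialisation per run, so different seeds can give
    # different partitions of the same data.
    model = KMeans(n_clusters=8, init="random", n_init=1, random_state=seed)
    labels = model.fit_predict(X)
    history["CHI"].append(float(calinski_harabasz_score(X, labels)))
    history["DBI"].append(float(davies_bouldin_score(X, labels)))
    history["SC"].append(float(silhouette_score(X, labels)))

# Coefficient of variation of each index over the repeated runs.
cv = {name: pstdev(values) / mean(values) for name, values in history.items()}
for name, value in cv.items():
    print(name, value)
```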

Conclusion
There are three main indexes for evaluating clustering results. The paper analyzes two problems in the calculation of the three indexes and gives corresponding solutions. The first problem is that when the number of clusters and the objects in each cluster are unchanged, the calculated values of the indexes change greatly if only the feature vectors are changed. The cosine between vectors is used instead of the distance between vectors, which keeps the three indexes stable within a small interval. The second problem is that when the feature vectors and the number of clusters are unchanged and the objects in the clusters change, the calculated values of the indexes change little. The paper uses the number of elements in each cluster as a coefficient to widen the gap between the inter-cluster relationship and the intra-cluster relationship, so that the three indexes show a significant change. With these two improvements, the new indexes evaluate clustering results better, and the effectiveness of the improvements is demonstrated by experiments.
Future research will address two aspects. First, in the ideal situation, the calculated values of the indexes should remain unchanged when the clustering results are evaluated after a feature transformation. After the improvements in this paper, the calculated values of the indexes still fluctuate somewhat, and future work will try to reduce this variation further. Second, in the ideal situation, the calculated values of the indexes should change significantly when the objects in the clusters change. The improvements in this paper only enlarge the range of change relative to the original indexes, and future work will try to enlarge it further.