Clustering the objective interestingness measures based on tendency of variation in statistical implications

In recent years, research on clustering objective interestingness measures has developed rapidly in order to help users choose an appropriate measure for their application. Work in this field mainly follows three directions: clustering based on the properties of the measures, clustering based on the behavior of the measures, and clustering based on the tendency of variation in statistical implications. In this paper we propose a new approach for clustering objective interestingness measures based on their tendency of variation in statistical implications. We build the statistical implication data of 31 objective interestingness measures by examining their partial derivatives with respect to four parameters. From this data, two distance matrices over the measures are established using the Euclidean and Manhattan distances. Similarity trees built from these distance matrices yield clusterings of the 31 measures at two different clustering thresholds.


Introduction
The objective interestingness measures play an important role in evaluating the quality of knowledge in the form of association rules, especially in the post-processing stage of knowledge discovery from databases. Researchers in this field mainly concentrate on two directions: (1) proposing new measures; (2) studying the properties, behaviors and trends of variation of the measures in order to rank, cluster and classify them. Such studies aim to help users select appropriate measures for their particular application. Clustering objective interestingness measures is a research area that concerns many researchers [9] [21]. Clustering measures is the process of searching for and discovering clusters of measures that match each application area [21]. Currently, many techniques can be applied to cluster measures: partition-based clustering, hierarchical clustering [17] [18] and density-based clustering. In general, these techniques are directed at two main goals: the first is to find the most appropriate measure for a specific application [21], and the second is to consider the relationship of a particular measure with the remaining measures in a studied set [10]. In particular, hierarchical clustering [19] mainly focuses on the second goal. Selecting an appropriate measure for an application is what many researchers and users have always wished for. However, the list of proposed objective interestingness measures keeps growing [11] and has surpassed 100, so the selection becomes a significant challenge. Research on the variation of objective interestingness measures based on partial derivatives has opened up new directions, such as classifying measures based on their tendency of variation [15], and examining the variability of the measures with respect to the statistical implication parameters and the
relationships of interdependence between those parameters in every measure [8] [16]. Can the list of objective interestingness measures be reduced, based on partial derivative results, to help users choose a better measure? This paper proposes a new approach using the hierarchical structure of a similarity tree [3] [4] [5] [7] to cluster the objective interestingness measures that satisfy the asymmetry property. In this approach, we use the tendency of variation in statistical implications [15], obtained from the partial derivatives of each measure's formula with respect to each parameter, to build a distance matrix [13] of the measures. The clustering results are demonstrated through the structure of a similarity tree. Each cluster is a group of measures that are close or similar to each other. This gives researchers and users a criterion for choosing an appropriate measure for their applications. This paper is organized into six sections. Section 1 introduces the measure clustering techniques and raises the research issue. Section 2 presents the tendency of variation in statistical implications and builds data from the values of the partial derivatives of the measures. Section 3 outlines the distance methods and the distance matrix of the measures. Section 4 describes the clustering algorithm and the similarity tree of the measures. Section 5 presents experiments. The final section summarizes the important results achieved by the article.

Statistical implications
Statistical implications [8] study the implication relationship between variables or attributes of data, allowing the detection of asymmetrical rules a → b of the form "if a, then nearly b", or "to what extent will b satisfy the implication of a".

Tendency of variation in statistical implications
The tendency of variation in statistical implications is a research direction that examines the stability of the implication intensity by observing small variations of the measures in the neighbourhood of the parameters n, n_a, n_b and n_ab̄ [8].
Identifying the trends of variation in statistical implications of the measures shows some possibilities for application in the study of interestingness measures and in practice: studying the variability of a measure, the dependence relationships between the parameters n, n_a, n_b, n_ab̄ [8], and the classification of the objective interestingness measures [15]. To clarify the tendency of variation in statistical implications, we examine the Implication index measure [8] with respect to the 4 parameters n, n_a, n_b, n_ab̄, defined by the formula

q(a, b̄) = (n_ab̄ − n_a n_b̄ / n) / √(n_a n_b̄ / n),  where n_b̄ = n − n_b.

To observe the variation of q induced by the variability of the parameters n, n_a, n_b, n_ab̄, let us consider them as real numbers satisfying the inequalities 0 ≤ n_ab̄ ≤ inf(n_a, n_b̄) and sup(n_a, n_b) ≤ n. In this case, q can be considered as a continuous, differentiable function: q(n, n_a, n_b, n_ab̄) is a scalar function on the surface represented by the 4 parameters. To observe the variation of q according to the parameters, we calculate the partial derivative with respect to each parameter. In fact, the variation of q is estimated from the component variations Δn, Δn_a, Δn_b and Δn_ab̄:

Δq ≈ (∂q/∂n)Δn + (∂q/∂n_a)Δn_a + (∂q/∂n_b)Δn_b + (∂q/∂n_ab̄)Δn_ab̄.

Taking the partial derivative of q with respect to n, we find that it is negative: if the parameters n_a, n_b, n_ab̄ are held constant, the Implication index decreases as n increases. Taking the partial derivative with respect to n_a, we find that it is also negative: as n_a increases, the Implication index decreases. For example, with n = 100, n_a = 20, n_b = 40, n_ab̄ = 4, then q = −2.309401; when n_a increases from 20 to 30 and the remaining parameters do not change, the value of q decreases. Taking the partial derivatives with respect to n_b and n_ab̄, we find that both are positive; in particular ∂q/∂n_ab̄ = 1/√(n_a n_b̄ / n). Hence, if n_b or n_ab̄ increases, the Implication index increases. For example, with the same values above, when n_ab̄ increases from 4 to 8 and the remaining parameters do not change, q increases to −1.154701 while ∂q/∂n_ab̄ = 0.2886751 does not change.
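The worked example and the sign of variation of q under each parameter can be checked numerically. The following is an illustrative Python sketch (the paper's own experiments use the ARQAT package in R); the implication index formula is the one reconstructed above, q = (n_ab̄ − n_a·n_b̄/n)/√(n_a·n_b̄/n):

```python
import math

def implication_index(n, na, nb, nab):
    """Implication index q(a, b̄): deviation of the number of counter-examples
    n_ab̄ from its expected value n_a * n_b̄ / n, with n_b̄ = n - n_b."""
    nb_bar = n - nb
    expected = na * nb_bar / n
    return (nab - expected) / math.sqrt(expected)

# Worked example from the text: n = 100, n_a = 20, n_b = 40, n_ab̄ = 4
q = round(implication_index(100, 20, 40, 4), 6)
print(q)  # -2.309401

# Estimate the sign of variation per parameter with a small finite difference
base = [100, 20, 40, 4]
signs = {}
for i, name in enumerate(["n", "na", "nb", "nab"]):
    bumped = list(base)
    bumped[i] += 1e-6
    dq = implication_index(*bumped) - implication_index(*base)
    signs[name] = "decreasing" if dq < 0 else "increasing"
print(signs)  # q decreases with n and n_a, increases with n_b and n_ab̄
```

The finite differences reproduce the trends stated in the text: q is decreasing in n and n_a, and increasing in n_b and n_ab̄.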

Building the statistical implication data of measures
From the results of examining the Implication index measure, the tendency of variation in statistical implications, i.e. the partial derivatives with respect to each parameter n, n_a, n_b, n_ab̄, reflects the trend and rate of change of the measures relatively accurately [8] [16] [15]. However, the magnitude of the partial derivative values does not matter for clustering; only the mathematical meaning of the derivative's sign is used. If the partial derivative value is positive, the measure increases; if it is negative, the measure decreases; if it is zero, the measure is independent of the corresponding parameter [15]. Based on the above observations, this paper builds the statistical implication data of the measures from the partial derivative values with respect to the 4 parameters using three principles:
Principle 1: If the partial derivative with respect to a parameter is positive, the property of the measure for that parameter is set to 1 (the measure increases with the corresponding parameter).
Principle 2: If the partial derivative with respect to a parameter is negative, the property of the measure for that parameter is set to -1 (the measure decreases with the corresponding parameter).
Principle 3: If the partial derivative with respect to a parameter is zero, the property of the measure for that parameter is set to 0 (the measure is independent of the corresponding parameter).
With these three principles, each measure is represented as a vector in 4-dimensional space of the form m(v1, v2, v3, v4).
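The three principles amount to taking the sign of each partial derivative. A minimal Python sketch (function names and the sample derivative values are ours, for illustration only; the paper's experiments use ARQAT in R):

```python
def derivative_sign(value, tol=1e-12):
    """Principles 1-3: positive -> 1, negative -> -1, zero -> 0."""
    if value > tol:
        return 1
    if value < -tol:
        return -1
    return 0

def measure_vector(derivatives):
    """Encode a measure as m(v1, v2, v3, v4) from its partial derivatives
    with respect to the four statistical implication parameters."""
    return tuple(derivative_sign(d) for d in derivatives)

# Hypothetical derivative values, chosen to yield a Confidence-like profile
print(measure_vector([0.0, 0.115, 0.0, -0.25]))  # (0, 1, 0, -1)
```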

The distance between two measures
To cluster the measures based on the statistical implication data, calculating the gap between two measures is an important step. There are many ways to calculate the distance between two measures in n-dimensional vector space. For the statistical implication data, we apply two methods: the Euclidean distance and the Manhattan distance [12]. These methods have been widely used in data clustering applications since they are simple and effective. Suppose we need to calculate the distance between two measures in vector form m1(u1, u2, u3, u4) and m2(v1, v2, v3, v4). The distance between two objective interestingness measures based on the statistical implication data is determined as follows.
The Euclidean distance between m1 and m2 is determined by the formula:

d_E(m1, m2) = √((u1 − v1)² + (u2 − v2)² + (u3 − v3)² + (u4 − v4)²)   (1)

Example: the Euclidean distance between Confidence(0, 1, 0, −1) and Zhang(1, 1, −1, −1) is √((0−1)² + (1−1)² + (0−(−1))² + (−1−(−1))²) = √2 ≈ 1.414.
The Manhattan distance between m1 and m2 is determined by the formula:

d_M(m1, m2) = |u1 − v1| + |u2 − v2| + |u3 − v3| + |u4 − v4|   (2)

Example: the Manhattan distance between Coverage(−1, 1, 0, 0) and Laplace(0, 1, 0, −1) is |−1−0| + |1−1| + |0−0| + |0−(−1)| = 2.
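The two distance computations and their examples can be sketched directly; the vectors below are the ones given in the text, and this is an illustrative Python sketch rather than the ARQAT implementation:

```python
import math

def euclidean(m1, m2):
    """Formula (1): Euclidean distance between two measure vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(m1, m2)))

def manhattan(m1, m2):
    """Formula (2): Manhattan distance between two measure vectors."""
    return sum(abs(u - v) for u, v in zip(m1, m2))

confidence = (0, 1, 0, -1)
zhang = (1, 1, -1, -1)
coverage = (-1, 1, 0, 0)
laplace = (0, 1, 0, -1)

print(euclidean(confidence, zhang))  # sqrt(2) ≈ 1.414
print(manhattan(coverage, laplace))  # 2
```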

Clustering algorithms for measures
Hierarchical clustering of measures is a method of cluster analysis that seeks to build a hierarchy of clusters of measures [1] [2] [14]. For the clustering process, we first assign each measure to its own cluster. Then we group the two clusters with the closest distance into one cluster. This process is repeated until all measures are grouped into a single cluster.
The clustering algorithm includes the following steps:
Step 1: Convert the variation properties of the measures into a distance matrix.
Step 2: Put each measure into its own cluster (if we have 5 measures, we will have 5 clusters).
Step 3: Repeat the two following operations until the number of clusters is equal to 1:
- Group the 2 clusters with the closest distance into one cluster.
- Recalculate the distance matrix.
The clustering algorithm is presented in the following diagram:
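Independently of the diagram, the three steps can be sketched in Python. The text does not fix the inter-cluster distance, so the single-linkage criterion (smallest pairwise distance between members) used below is an assumption, and the toy distance matrix is ours; the paper's experiments use the stats package in R:

```python
def hierarchical_cluster(dist):
    """Agglomerative clustering over a symmetric distance matrix (list of lists).
    Returns the merge history as (cluster_a, cluster_b, distance) triples."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the closest distance
        # (single linkage: minimum pairwise distance between members).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Toy 4-measure distance matrix (illustration only)
dist = [[0, 1, 4, 5],
        [1, 0, 3, 6],
        [4, 3, 0, 2],
        [5, 6, 2, 0]]
merges = hierarchical_cluster(dist)
print(merges[0])  # the closest pair, {0} and {1} at distance 1, merges first
```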

Similarity tree of the measures
A similarity tree [4] [5] [7] of the measures is a graphical hierarchical structure used to express the similarity relationships between the measures. In the similarity tree, the ordering of the leaf nodes expresses the similarity of one measure compared to the remaining measures in the tree [22]. Two nearby leaf nodes at the same level represent two similar measures, while the height of the tree expresses the difference between measures: two leaf nodes separated by a larger height represent two measures that differ more. The similarity tree is built from the distance matrix of the measures. The nodes of the tree are arranged by distance value following the principle: the smaller the distance between two measures, the closer together they are represented in the hierarchical structure.

Threshold for clustering of measures
The threshold for clustering is the smallest distance between two clustered measures. On the similarity tree, the clustering threshold is determined by the height of the tree. In the process of creating the similarity tree, the clustering threshold is determined from the distance matrix between the measures. At the start, the threshold equals the smallest gap between measures. This threshold is updated after each step of building the tree and recalculating the distance matrix. The clustering threshold tends to increase and reaches its maximum value when all measures are merged into one cluster. Example: for the distance matrix in the example above, we apply the clustering algorithm and build the similarity tree of the measures. The result is presented in Figure 2.
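The threshold behaviour described above can be sketched by cutting a merge history at a height h: merges below the threshold are applied, and the surviving clusters are the result. This is an illustrative Python sketch; the merge history below is a toy example, not the paper's data:

```python
def cut_at_threshold(n_items, merges, h):
    """merges: (cluster_a, cluster_b, distance) triples in non-decreasing
    distance order, e.g. from an agglomerative clustering pass. Merges whose
    distance is below h are applied; the remaining clusters are returned."""
    clusters = {frozenset([i]) for i in range(n_items)}
    for a, b, d in merges:
        if d >= h:
            break  # all later merges are at least as high as the threshold
        clusters.discard(a)
        clusters.discard(b)
        clusters.add(a | b)
    return clusters

# Toy merge history over 4 items
merges = [
    (frozenset({0}), frozenset({1}), 1.0),
    (frozenset({2}), frozenset({3}), 1.4),
    (frozenset({0, 1}), frozenset({2, 3}), 2.2),
]
print(cut_at_threshold(4, merges, h=2))  # two clusters: {0, 1} and {2, 3}
```

Raising h toward the maximum merge height yields a single cluster, matching the behaviour described in the text.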

Data description
The objective of this research is to cluster the objective interestingness measures based on the tendency of variation in statistical implications. In this experiment we selected 39 objective interestingness measures satisfying the asymmetry property in order to examine their partial derivative values with respect to the 4 parameters, using the three principles defined in Section 2.2 [15]. However, in the process of examining the partial derivative values we found 8 measures whose partial derivatives are not always positive, negative or zero: they change sign according to variations of the parameters n, n_a, n_b, n_ab̄. Thus, the list of measures satisfying the three principles and remaining in this experiment is 31 measures. The results of examining the tendency of variation in statistical implications of the 31 measures with the ARQAT tool are presented in Table 1.

Table 1: The statistical implication data of the measures according to the parameters n, n_a, n_b, n_ab̄.

Implementation tools (ARQAT)
We use the ARQAT package, implemented in the R language, to deploy the experimental clustering of measures. In this package, we have fairly fully updated the objective interestingness measure functions for association rules based on the 4 parameters n, n_a, n_b, n_ab̄, the partial derivative functions of the measures, and the distance matrix calculation functions, and integrated the hierarchical clustering and similarity tree drawing functions of the stats package in R [20].

Experimental results
Based on the statistical implication data of the 31 measures presented in Table 1, we set up the Euclidean and Manhattan distance matrices of the measures with the ARQAT tools. From these distance matrices, we apply the clustering functionality and the similarity tree drawing functions of ARQAT to cluster the measures and draw the similarity tree structure of the 31 measures. The similarity trees corresponding to each distance method are presented in Figure 3.

Clustering results based on Euclidean distance
From the similarity tree presented in Figure 3, if we consider the threshold h=2, the 31 measures are clustered into 6 clusters. This result is presented in Table 2.
The first cluster includes 15 measures with the same characteristics: increasing variation with the two parameters n, n_a and decreasing variation with the two parameters n_b, n_ab̄. The second cluster includes 5 measures that share the same nature: increasing with parameter n_a, decreasing with parameter n_ab̄ and independent of both parameters n and n_b. The third cluster includes three measures (Coverage, Descriptive Confirmed Confidence, Gain) that have similar variability on two of the parameters and different variability on the other two, one of which is n. The fourth and fifth clusters are similar in size (2 measures each) but differ in the distance between their measures; the measures in each of these clusters vary in the same way with the statistical implication parameters. The sixth cluster includes 4 measures: a special cluster created from Putative Causal Dependency and a subcluster of three measures (Kulczynski index, Least contradiction and Recall) having the same variation with the statistical implication parameters.

Clustering results based on Manhattan distance
Based on the structure of the similarity tree built from the Manhattan distance matrix, with the threshold h = 2 the measures are divided into 7 clusters (Table 3). The first cluster consists of 15 measures with the same nature: increasing variation with the two parameters n, n_a and decreasing variation with the two parameters n_b, n_ab̄. The second cluster includes 5 measures with the same characteristics: increasing with parameter n_a, decreasing with parameter n_ab̄ and independent of both parameters n and n_b. The third cluster is a particular one, since it is formed from a subcluster of two measures with the same variation properties (Descriptive Confirmed Confidence, Gain) and one measure whose variability is not of the same nature (Coverage). The fourth cluster is formed by two measures that do not share the same variability on two of the parameters, but whose distance equals the clustering level.
The fifth cluster contains only the IPEE measure. This measure has the special property of being independent of three of the four parameters and decreasing with the remaining one. The sixth cluster includes 4 measures: a special cluster created from Putative Causal Dependency and a subcluster of three measures (Kulczynski index, Least contradiction and Recall) with the same variation under the statistical implication parameters. The seventh cluster also has only one measure (Leverage). This measure decreases with three of the parameters and increases with parameter n.

Comparison of the Clustering results
Based on the similarity trees in Figure 3, overall, the clustering results under the two distance methods are relatively similar. When considering the threshold h=2, both clustering results have similar lists of measures in the majority of clusters, namely cluster 1, cluster 2, cluster 3, cluster 4 and cluster 6. However, with the clustering result based on the Euclidean distance matrix, IPEE and Leverage are classified in the same cluster, while with the Manhattan distance matrix they are classified in different clusters. When the clustering threshold increases to h=4, both similarity trees yield 3 clusters for the two distance methods (Tables 4, 5). In particular, the second cluster of both trees has the same list of measures. The first and third clusters have lists of measures that differ between the two similarity trees. The reason for this deviation is that the Leverage measure is located in two different clusters on the two trees: in the tree according to the Euclidean distance, Leverage is classified into the third cluster, while in the other tree it is classified into the first cluster. The clustering results show that the group of objective interestingness measures satisfying the asymmetry property is mainly divided into three large clusters: Cluster 1, Cluster 2 and Cluster 6. Each cluster consists of measures with the same characteristics following the tendency of variation in statistical implications. For example, Cluster 2 is a group of measures increasing with parameter n_a, decreasing with parameter n_ab̄ and independent of both parameters n and n_b. The results also show the distance between the variation intensities of the measures under the statistical implication parameters. This is a useful basis for further study of the relationships between the measures based on an examination of their partial derivatives.

Conclusion
Clustering the objective interestingness measures has attracted many researchers in the field of data mining. Studies on clustering measures are primarily based on three main techniques: partition-based clustering, hierarchical clustering and density-based clustering. In this article we proposed a method to cluster the objective interestingness measures based on the tendency of variation in statistical implications using hierarchical clustering techniques. From the statistical implication data, distance matrices of the measures are built with two distance calculation methods, Euclidean and Manhattan. After calculating the distance matrices, we use our tools to build similarity trees for clustering the 31 measures. The similarity trees show the measures classified at two clustering thresholds, h=2 and h=4. This result can support researchers and users in choosing an appropriate measure for their specific applications and is also a useful basis for further study of the objective interestingness measures based on the tendency of variation in statistical implications.
With n = 100, n_a = 20, n_b = 40, n_ab̄ = 4 (q = −2.309401): when n_a increases from 20 to 30 and the rest of the parameters do not change, the value of q decreases; when n_b increases from 40 to 50 and the rest of the parameters do not change, the value of q increases; when n_ab̄ increases from 4 to 8 and the rest of the parameters do not change, the value of q increases and ∂q/∂n_ab̄ does not change (q = −1.154701 and ∂q/∂n_ab̄ = 0.2886751).
dij = d(mi, mj) is the value of the distance between the two measures mi and mj, calculated by formulas (1) and (2) in Section 3.1.

Figure 3: Similarity trees based on the statistical implication data.

EAI Endorsed Transactions on Context-aware Systems and Applications 03 -05 2016 | Volume 3 | Issue 9 | e

Table 2: The clustering results of measures based on Euclidean distance (h=2).

Table 3: The clustering results of measures based on Manhattan distance (h=2).

Table 4: The clustering results of measures based on Euclidean distance (h=4).

Table 5: The clustering results of measures based on Manhattan distance (h=4).