Automatic Data Clustering using Dynamic Crow Search Algorithm on Context-aware Systems and Applications

This work proposes Automatic Clustering using the Dynamic Crow Search Algorithm (ACDCSA), which updates its parameters dynamically. Crow Search is a recently proposed algorithm that imitates the behaviour of crows. Clustering is an essential aspect of data analysis, and its significance has grown as technological advances generate enormous volumes of data that need to be analysed in real time. Automatic clustering detects the optimal number of clusters and produces sustainable cluster centroids. ACDCSA uses Cluster Validity using Nearest Neighbour (CVNN) as an internal validity measure that acts as the fitness function for finding the optimal cluster centres. The present work is compared with other well-known meta-heuristic search algorithms (PSO, DE, WOA and GWO) on the automatic clustering task over seven benchmark clustering datasets. Inter-cluster distance, intra-cluster distance and the number of clusters produced are used to assess the performance of ACDCSA.


Introduction
Data clustering is defined as the segregation of data points so that points with the same traits are placed together in a cluster, such that the intra-cluster compactness within a cluster and the inter-cluster separation between different clusters are optimal [1]. It groups together data points having similar traits, and these grouped data differ from the data points in other groups. It minimises the distance between the data points and the cluster centre within a cluster. Clustering is an active area of research because it is widely used in scientific and research work. Application areas such as data mining, bioinformatics, image analysis, satellite data, and real-life applications require unsupervised data clustering before the actual data analysis [2]. In the literature, clustering is broadly grouped into two types: (1) hierarchical clustering and (2) partitional clustering [3].
Hierarchical clustering clusters the data by creating a tree-like structure called a dendrogram, through which it tries to find the cluster number, and groups the data based on similar traits. Various authors have proposed several partition-based algorithms. Among them, the prominent one is the K-means clustering algorithm. Being easy to apply to different data types, it is the most favoured clustering algorithm. However, it has certain imperfections, such as its tendency to converge towards local minima and its requirement that the number of clusters be known a priori [4]. Clustering data whose labels are not known a priori is an unsupervised machine learning task. Most of the data obtained from various fields do not have data labels. These unlabelled data must first be clustered in order to analyse them and find hidden patterns. In such a scenario, automatic clustering plays a vital role. Automatic clustering is the technique used when the actual number of clusters or the labels of the data to be clustered are unknown. Real-life data sets are usually very complex and large, and need to be grouped into small clusters. Given the present scenario, the need for automatic clustering is becoming inevitable. Meta-heuristic search-based algorithms are nature-inspired algorithms that search for optimal solutions while balancing global exploration and intensive local exploitation of the search space [5]. Since automatic clustering simultaneously tries to find the number of clusters and their centroids, it is also an optimisation problem that tries to solve two objectives at a time. Several researchers have proposed and implemented meta-heuristic search-based algorithms for the automatic clustering problem.

* Corresponding author. Email: iiitm.rajesh@gmail.com
Among them, the prominent evolution-based algorithms are the Genetic Algorithm (GA) [6] and Differential Evolution (DE) [7]; the most applied swarm-based algorithms are Particle Swarm Optimization (PSO) [8], Grey-Wolf Optimization (GWO) [9], the Firefly Algorithm (FA) [10], and the Whale Optimization Algorithm (WOA) [11]; and the physics-based algorithms are the Gravitational Search Algorithm (GSA) and the Harmony Search Algorithm (HS) [12]. These nature-inspired algorithms use a partition-based approach to solve the clustering problem, using internal cluster validity indices (CVIs) as the optimisation function for evaluation. The prominent CVIs proposed by different researchers are Dunn's Index, the Davies-Bouldin Index, the CS-Index, the Silhouette Index, the S_Dbw Index [13], and the Cluster Validity Index using Nearest Neighbour (CVNN) [14].
In the present work, a modified Crow Search Algorithm is used as a clustering algorithm to solve the automatic clustering problem. CVNN is used as the optimising function to simultaneously find the optimal number of clusters and their best centroids. The Crow Search Algorithm is a recently proposed nature-inspired search-based algorithm that mimics the behaviour of crows [15]. In this work, an extension of the crow search algorithm that dynamically updates its awareness probability and flight length based on the obtained fitness value is used for the automatic clustering task.
The rest of the paper is organised as follows. A brief literature review of automatic data clustering using metaheuristic search algorithms is presented in section-2. The Crow Search Algorithm and its main drawbacks are discussed in section-3. The suggested improvements to the CSA and the modified Dynamic Crow Search Algorithm are discussed in section-4. Section-5 presents the implementation of automatic clustering using the Dynamic Crow Search Algorithm with a hybrid approach, describes the datasets used for data clustering, reports the results obtained after simulation, and compares them with other algorithms. Finally, section-6 concludes the work and proposes future extensions.

Related works
Several researchers have proposed different heuristic and meta-heuristic algorithms for the automatic clustering problem. Further, many efforts have gone into hybridizing two or more different algorithms for automatic clustering tasks. One study shows that these nature-inspired algorithms have been applied extensively by several researchers to solve automatic clustering tasks [16]. Some of the meta-heuristic search-based algorithms used for solving clustering problems are discussed in this section. Some of the significant works in automatic clustering are carried out using PSO, either by fine-tuning its parameters or by hybridizing it with other nature-inspired algorithms. Merwe et al. combined PSO and K-means for data clustering and hence paved the way for data clustering using swarm-based algorithms. However, their method failed to find the cluster number for unlabelled data [17]. Omran et al. [18] proposed Dynamic Clustering using PSO, used in image segmentation to cluster the image data automatically. Abraham et al. [19] proposed kernel MEPSO for automatic clustering of complex data. Instead of using the standard Euclidean distance measure in the Cluster Validity Index, they introduced a kernel-based similarity measure, and the obtained results were superior to those of the compared work. Recently, Alswaitti et al. [20] proposed DPSO, a dynamic clustering algorithm that addresses the premature convergence of PSO and balances its intensified local search and diverse global search by incorporating a kernel-based density estimation technique. DPSO used Dunn's Index as a CVI to judge the robustness of the obtained result. Gao et al. [21] proposed a hybrid PSO-K-means algorithm, which uses a hybrid initialization technique based on the K-means algorithm and applies a Lévy flight-based position update to avoid getting trapped in local minima. 
Sharma and Chhabra, in 2019, proposed AHPSOM, which adds a mutation operator to a hybrid PSO algorithm for solving automatic data clustering problems. AHPSOM is mainly applied to continuously generated data, usually from different networks, with dynamic and heterogeneous features and unknown cluster numbers [22]. Amol et al. [23] hybridized the grey wolf optimizer with the whale optimization algorithm, each of which has a different hunting style to catch its prey, and applied the hybrid to the data clustering domain. That work uses inter-cluster distance, intra-cluster distance and cluster density-based fitness measures to find the optimal centroids for the automatic clustering task. Ashish, in 2018, proposed MR-EGWO, which applies the grey wolf optimizer in a big-data environment using MapReduce to cluster large-scale data sets; the grey wolf optimizer is hybridized with binomial crossover, and Lévy flight-based searching is applied to improve the searching capability [24]. Ibrahim et al. hybridized GWO with the trajectory-based search algorithm TS for clustering, to improve the effectiveness of GWO and its balance between exploration and exploitation. In this hybridized work, TS is used as an operator for GWO that helps it search the leader's neighbourhood, thus emphasising localized search in regions with a high chance of containing the solution [25]. Kuo et al. [26] proposed iABC, which combines ABC with the K-means algorithm, where K-means helps find better initial centroids and thus directs the bees to better positions during further iterations. Here the K-means algorithm gives the initial centroid, which the onlooker bees use to find an optimized location near the initial centroid. The algorithm is applied to real-life customer segmentation problems. Hussain et al. 
[27] proposed an ABC optimization-based algorithm for clustering large datasets with higher dimensions. The proposed method incorporates aspects of co-clustering into the ABC algorithm. Instead of using Euclidean distance, the authors applied higher-order correlations to find the result. Also, the search space is explored in three different ways to achieve better diversification. Finally, the work showed scalability to parallel architectures in shared-memory and distributed environments. Kumar et al. proposed and implemented the Gravitational Search Algorithm for the automatic clustering problem and further applied it to image segmentation. The proposed work is known as ACGSA. The method used a variable chromosome representation for cluster centroid encoding, and weighted cluster centroids were applied to obtain the best centroids. The authors introduced a new fitness function to achieve better and more stable cluster centroids [? ]. Kazem et al. implemented a variant of the harmony search algorithm (HS), called best-worst-mean harmony search, for data clustering; it employs an enhanced memory consideration method to make efficient use of the insight and experience collected in harmony memory [28]. Tseng and Yang used a Genetic Algorithm for the automatic data clustering problem, in a method known as CLUSTERING. It clustered the data at three levels to obtain the final clustering output, outperforming the other algorithms used for comparison [29]. Vovan et al. proposed automatic clustering for interval data using a Genetic Algorithm. The overlapped distance within data intervals helps determine the optimal clusters; the proposed algorithm has applications such as clustering data with different characteristics and recognizing images [30]. Das et al. proposed ACDE, Automatic Clustering using Differential Evolution, for clustering unlabelled data, which, apart from standard data sets, gave better results for high-dimensional data. 
Further, to demonstrate the effectiveness of that work, two cluster validity indices, the CS Index and the Davies-Bouldin Index, were used to find the appropriate number of clusters and their centroids [7]. Chen et al. [31] implemented an elastic differential evolution algorithm for automatic data clustering by adopting a variable particle encoding scheme in which the population consists of changeable-length parameter vectors, each denoting a different number of clusters. The mutation and crossover operators are designed accordingly.

Crow Search Algorithm
The Crow Search Algorithm is a recently proposed nature-inspired algorithm motivated by the behaviour of the crow, considered one of the most intelligent species. Crows have exceptional memory and searching ability, which allow a crow to remember where it has hidden its food; a crow also follows other crows to plunder the food stored at their hiding places. If a crow is aware that it is being followed, it changes its hideout to a random position to deceive the follower crow.

Standard CSA Algorithm
In standard CSA, a group of N crows searches an n-dimensional search space. $M_{j,iter}$ denotes the hiding location of the j-th crow during iteration 'iter'. It is the best location found by the j-th crow up to iteration 'iter'. In successive iterations, if the best location improves, this memory is updated along with the position. If crow 'j' visits its hideout location without knowing that it is being chased by crow 'i' (i.e., $r_j > ap_{j,iter}$), CASE-A occurs, and the new position of crow 'i' is given by

$$X_{i,iter+1} = X_{i,iter} + r_i \times fl_{i,iter} \times \left(M_{j,iter} - X_{i,iter}\right)$$

where $X_{i,iter}$ is the position of crow 'i' at iteration 'iter', $r_i$ and $r_j$ are random numbers in [0,1], and $fl_{i,iter}$ represents the flight length of crow 'i' during iteration 'iter'. Flight length is the parameter that decides the searching capability of the crow. If the flight length is less than one (i.e., 'fl<1'), it leads to intensified local searching, and if it is greater than one (i.e., 'fl>1'), it leads to global searching, i.e., searching at more distant positions. Awareness probability is the other parameter of CSA; $ap_{j,iter}$ represents the awareness probability of crow 'j' at iteration 'iter'. If crow 'j' is aware that another crow is chasing it, it goes to a random position to misguide crow 'i'. In this scenario, CASE-B occurs, and the new position of crow 'i' is a random position in the search space. This procedure repeats until the maximum iteration $iter_{max}$ is reached.
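The two cases described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `csa_step` and `update_memory`, the use of a single scalar `fl` and `ap` shared by all crows, and the rule that an out-of-bounds candidate is discarded so the crow stays in place are all choices made for this sketch.

```python
import numpy as np

def csa_step(positions, memory, fl, ap, lower, upper, rng):
    """One round of position updates in standard CSA (illustrative sketch)."""
    n_crows, dim = positions.shape
    new_positions = positions.copy()
    for i in range(n_crows):
        j = rng.integers(n_crows)              # crow i picks a crow j to follow
        if rng.random() > ap:                  # CASE-A: crow j is unaware of crow i
            r_i = rng.random()
            candidate = positions[i] + r_i * fl * (memory[j] - positions[i])
            # Assumption: an infeasible candidate is rejected and crow i stays put.
            if np.all((candidate >= lower) & (candidate <= upper)):
                new_positions[i] = candidate
        else:                                  # CASE-B: crow j is aware, crow i is misled
            new_positions[i] = rng.uniform(lower, upper, size=dim)
    return new_positions

def update_memory(positions, memory, fitness):
    """Keep, for each crow, the better (lower-fitness) of its memory and new position."""
    improved = np.array([fitness(p) < fitness(m) for p, m in zip(positions, memory)])
    memory = memory.copy()
    memory[improved] = positions[improved]
    return memory
```

Calling `csa_step` followed by `update_memory` in a loop until the maximum iteration is reached reproduces the overall procedure; with `fl` and `ap` held constant, as here, the sketch corresponds to the standard algorithm rather than the dynamic variant.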

Drawbacks in Standard CSA
Standard CSA has only two parameters, which help it maintain a balance between intensified local search and exploration of new positions for global search. In standard CSA, both of these parameters remain constant, so the solution space is not fully explored for the optimum solution. We addressed these shortcomings in the Dynamic Crow Search Algorithm (DCSA) [32], which dynamically changes the parameters in order to solve complex problems such as data clustering. We have further extended DCSA to solve the automatic clustering problem.

Dynamic Crow Search Algorithm
DCSA updates its flight length and awareness probability according to the results obtained from the fitness function used. A ranking system based on the obtained results helps find the best and worst crow during each iteration. Besides this, memory-based ranking is also performed to find the crow that holds the best memory. These results help to fine-tune the parameters of the original crow search algorithm, updating the awareness probability and flight length according to the rank-based value obtained from the fitness function used for optimizing the clustering task. Also, instead of going to a random position in Case-B, DCSA uses a Lévy flight-based position update. In the improved version of DCSA, the authors have used the hybrid approach to update the crows' positions. The significant changes that the authors have suggested in the Dynamic Crow Search Algorithm are:
• The awareness probability 'ap' used in DCSA depends upon the value obtained from the objective function. In its update rule, c1 and c2 ∈ (0.1, 0.4), $Rank_{i,iter}$ is the rank obtained by the i-th crow during iteration 'iter', and $Rank_{j,memory}$ is the rank of the memory of the j-th crow, which is being chased by the i-th crow.
• The flight length 'fl' used in DCSA is also dynamic, whereas its value is static in the standard CSA.
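The text does not say how the Lévy flight step in Case-B is generated. One common choice is Mantegna's algorithm, sketched below purely as an illustration; the stability index `beta = 1.5`, the step `scale`, and the helper names `mantegna_sigma`, `levy_step` and `levy_random_position` are assumptions of this sketch and are not taken from DCSA.

```python
import math
import numpy as np

def mantegna_sigma(beta):
    """Standard deviation of the numerator variable in Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)

def levy_step(dim, beta=1.5, rng=None):
    """Draw one heavy-tailed step per dimension: u / |v|^(1/beta)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.normal(0.0, mantegna_sigma(beta), size=dim)
    v = rng.normal(0.0, 1.0, size=dim)
    return u / np.abs(v) ** (1 / beta)

def levy_random_position(position, scale=0.01, rng=None):
    """Hypothetical Case-B move: perturb the current position by a scaled Levy step."""
    position = np.asarray(position, dtype=float)
    return position + scale * levy_step(position.size, rng=rng)
```

Compared with a uniformly random relocation, a heavy-tailed step mostly produces small moves with occasional long jumps, which is the usual motivation for using Lévy flights to escape local minima.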

Proposed Work
In the present work, we have extended the Dynamic Crow Search Algorithm by using a two-point hybrid of the best two solutions. In the hybrid encoding scheme, the first one third (i.e., 1/3) of the solution space encoding is taken from the best solution, the next one third (i.e., 1/3) is taken from the 2nd best solution, and the last one third (i.e., 1/3) of the particle's elements is again taken from the best solution. The best position and the 2nd best position keep changing after each iteration to maintain diversification. This hybrid encoding scheme is depicted in Figure-1, where the best solutions are represented by $Particle_X$ and $Particle_Z$, and $Particle_{Hybrid}$ is the hybrid of $Particle_X$ and $Particle_Z$.
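The one-third / one-third / one-third scheme described above can be sketched as follows. The function name `two_point_hybrid` and the use of integer division to place the two cut points (which matters only when the encoding length is not divisible by three) are assumptions of this sketch.

```python
import numpy as np

def two_point_hybrid(best, second_best):
    """Build a hybrid particle: outer thirds from `best`, middle third from `second_best`."""
    best = np.asarray(best)
    second_best = np.asarray(second_best)
    if best.shape != second_best.shape:
        raise ValueError("both particles must have the same encoding length")
    n = best.size
    first_cut, second_cut = n // 3, (2 * n) // 3   # assumed cut points
    hybrid = best.copy()
    hybrid[first_cut:second_cut] = second_best[first_cut:second_cut]
    return hybrid
```

For example, with a best particle of nine zeros and a second-best particle of nine ones, the hybrid carries ones only in positions 3 to 5.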

Automatic Data Clustering
Automatic clustering groups together similar data within a dataset when the data labels or cluster numbers are unknown. It is essentially a multi-objective optimization process that simultaneously detects the number of clusters and their best centroids. Researchers have proposed several internal cluster validity indices for finding the optimal cluster centres; the major ones are the Dunn Index, the Davies-Bouldin Index, the CH-Index, the Xie-Beni Index, and the CS-Index. The present work uses the CVNN index as the cluster validity index. The CVNN index uses the nearest-neighbours concept given by [14]. As suggested by [33], a slightly modified version of CVNN is used in the present work.

[Algorithm listing, final step: return the best solution in terms of optimal memory.]

Solution Space Encoding
Each particle is represented by a string of size $K_{max} + K_{max} \times d$, where $K_{max}$ is the maximum number of clusters (in this case, $K_{max} = 10$) and d is the number of attributes present in a particular data set. The first $K_{max}$ entries of the solution space encoding represent the threshold values, which lie in the range [0,1]. In this work, as most researchers suggest, the threshold value is chosen as 0.5. If the value at the i-th of these first $K_{max}$ positions is greater than 0.5, the corresponding centre is selected for clustering; otherwise, it is rejected. During initialization, each K value in [2,10] gets an equal number of chances to participate in the clustering task. For example, if the number of particles is 27, then three particles are assigned to each K value in [2,10]. For each particle, $K_{max}$ data points are selected from the data set without replacement during initialization. Depending upon the threshold values in the first $K_{max}$ positions of the encoding, these data points are activated and participate in the clustering task. Euclidean distance is used as the distance measure. Initially, for the selected centroids of each particle, the Euclidean distance to every data point in the data set is measured, and each data point is assigned to the centroid from which it has the minimum distance. During this assignment, some clusters may end up with two or fewer elements. Such particles are reinitialized so that every active centroid has more than two elements.
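A minimal sketch of how such a particle could be decoded and used to assign points is given below. The function names, the layout of the centroid block as `K_max` consecutive groups of `d` values, and the `min_size=3` check (meaning "more than two elements") are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np

def decode_particle(particle, k_max, d, threshold=0.5):
    """Return the centroids whose activation value exceeds the threshold."""
    particle = np.asarray(particle, dtype=float)
    active = particle[:k_max] > threshold            # first K_max entries: activation values
    centroids = particle[k_max:].reshape(k_max, d)   # remaining K_max * d entries: centres
    return centroids[active]

def assign_points(data, centroids):
    """Assign each data point to its nearest active centroid (Euclidean distance)."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def has_small_cluster(labels, n_active, min_size=3):
    """True if any active centroid attracted fewer than `min_size` points."""
    counts = np.bincount(labels, minlength=n_active)
    return bool(np.any(counts < min_size))
```

A particle for which `has_small_cluster` returns True would be reinitialized, following the rule stated above.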

Internal Cluster Validity Index
In the domain of automatic clustering, the internal cluster validity measure plays a dominant role. It helps detect the number of clusters that could be present in the given data set. The optimal value (i.e., maximum or minimum) of the chosen internal cluster validity measure over a set of $k \in (K_{min}, K_{max})$ decides the best cluster number for the given dataset. Internal cluster validity mostly depends upon two properties:
a. Compactness. It measures the cohesiveness of the data points present within a cluster. Several distance-based measures estimate the compactness of a given cluster, such as the maximum or average pairwise distance in a cluster or the maximum or average centre-based distance.
b. Separation. It measures how different one cluster is from another. Here again, distance measures play a crucial role in deciding the dissimilarity between two clusters. For example, the pairwise minimum distance between data points in different clusters, or the distance between the centres of two clusters, is widely used to measure separation.

Clustering Validation Index Based on Nearest Neighbours (CVNN)
Liu et al. [14] proposed CVNN, in which a nearest neighbour-based separation is used instead of a directly distance-based separation measurement between the clusters in a given data set. In CVNN, the separation part is added to a straightforward compactness measure based on dissimilarity values. To balance these two measures against each other, the authors propose dividing each of them by its maximum over clusterings with different numbers of clusters $K = \{K_{min}, \ldots, K_{max}\}$. The aim is to find the best number of groups for a given clustering method.
In the separation measure, $q_n(x)$ is the number of data points among the 'n' nearest neighbours of x that are not in the same cluster as x, and $n_j$ is the number of data points in the j-th cluster. The compactness statistic is simply the average within-cluster dissimilarity. The index is then

$$CVNN_n(C_K) = \frac{Sep_n(C_K)}{\max_{C \in K} Sep_n(C)} + \frac{Com(C_K)}{\max_{C \in K} Com(C)} \qquad (10)$$

Lower values of both separation and compactness give better results.
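The following sketch shows how an index of this shape can be computed for several candidate clusterings of the same data. It assumes the commonly cited form of the two components (separation as the largest per-cluster average of $q_n(x)/n$, compactness as the sum over clusters of the mean pairwise within-cluster distance); the slightly modified version used in this work may differ, and all function names here are hypothetical.

```python
import numpy as np

def pairwise_distances(data):
    """Full Euclidean distance matrix of the data set."""
    diff = data[:, None, :] - data[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def cvnn_separation(dist, labels, n_neighbours):
    """Largest per-cluster mean fraction of nearest neighbours lying in another cluster."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)                      # a point is not its own neighbour
    nearest = np.argsort(d, axis=1)[:, :n_neighbours]
    foreign = labels[nearest] != labels[:, None]     # True where a neighbour is in another cluster
    weights = foreign.sum(axis=1) / n_neighbours     # q_n(x) / n for every point
    return float(max(weights[labels == c].mean() for c in np.unique(labels)))

def cvnn_compactness(dist, labels):
    """Sum over clusters of the mean pairwise within-cluster distance."""
    total = 0.0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if idx.size < 2:
            continue                                 # a singleton has no pairs
        sub = dist[np.ix_(idx, idx)]
        total += sub[np.triu_indices(idx.size, k=1)].mean()
    return float(total)

def _normalise(values):
    m = values.max()
    return values / m if m > 0 else np.zeros_like(values)

def cvnn_scores(data, labelings, n_neighbours):
    """CVNN-style score for each candidate labeling; lower is better."""
    dist = pairwise_distances(np.asarray(data, dtype=float))
    labelings = [np.asarray(l) for l in labelings]
    seps = np.array([cvnn_separation(dist, l, n_neighbours) for l in labelings])
    coms = np.array([cvnn_compactness(dist, l) for l in labelings])
    return _normalise(seps) + _normalise(coms)
```

Because both terms are normalised by their maximum over the candidate set, the score of a clustering is only meaningful relative to the other candidates evaluated with it.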

Performance Metrics
The following performance metrics are used in the present work to evaluate the performance of DCSA on the data clustering problem:
a. Objective function: Cluster Validity using Nearest Neighbour (CVNN) is used as the objective function. A lower value of CVNN represents a better clustering. The number of nearest neighbours required by CVNN is taken as ten, except in the case of the Wine dataset, where five nearest neighbours are used.
b. Optimal cluster number: since the essential task in automatic clustering is to find the optimal cluster number, it is also used as a performance metric to assess how accurately the algorithm finds the number of clusters in a particular dataset.
c. Intra-cluster compactness: it measures the cohesion between the data points in the clusters formed by the clustering algorithm. A lower value denotes a better cluster. The sum of squared error (SSE) is a commonly used measure of intra-cluster distance. Mathematically it is given as

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2$$

where k denotes the optimal number of clusters obtained and $c_i$ denotes the centroid of cluster $C_i$.
d. Inter-cluster separation: it measures the separation between different cluster centres. A higher value denotes a better result. Mathematically it is the distance of the cluster centroids from the mean of the dataset used for clustering. In its expression, k represents the optimal number of clusters obtained, $X_{mean}$ denotes the mean of the dataset used for clustering, and $c_i$ represents the centroid of cluster $C_i$.
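The two distance-based metrics can be sketched as below. The SSE follows its standard definition; for the inter-cluster separation, the text only states that it is the distance of the cluster centroids from the dataset mean, so summing the unweighted Euclidean distances over centroids is an assumption of this sketch, as are the function names.

```python
import numpy as np

def intra_cluster_sse(data, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    diffs = data - centroids[labels]
    return float(np.sum(diffs ** 2))

def inter_cluster_separation(data, centroids):
    """Assumed form: sum of Euclidean distances from each centroid to the dataset mean."""
    x_mean = data.mean(axis=0)
    return float(np.linalg.norm(centroids - x_mean, axis=1).sum())
```

For four points on a line at 0, 2, 10 and 12 with centroids at 1 and 11, the SSE is 4 and the assumed separation is 10.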

Dataset Used
To analyse the performance of DCSA on automatic data clustering, the authors experimented with seven standard datasets obtained from the UCI repository. Dataset details are given in Table 1.

Simulation Result
The results obtained from ACDCSA are compared with four other search algorithms, namely GWO, PSO, WOA and DE, that were previously applied to the automatic data clustering problem. CVNN, which is discussed in the methodology section, is used as the fitness function when simulating ACDCSA and the other algorithms used for comparison. The parameters used for automatic clustering by the different metaheuristic search algorithms are given in Table 2. Tables 3, 4 and 5 present the results obtained after simulating ACDCSA and the other four algorithms on the automatic data clustering problem. Table 3 reports the mean and standard deviation of the number of clusters obtained. The results in Table 3 indicate that ACDCSA produced better results than the other algorithms: the number of clusters it produced is nearly equal to the number of clusters present in the respective datasets. Table 4 and Table 5 report the means and standard deviations of the intra-cluster and inter-cluster distances. For intra-cluster distance, a lower value represents better clusters, whereas for inter-cluster distance, a higher value signifies better clusters. Table 4 and Table 5 show that ACDCSA produced better clusters in terms of both intra-cluster and inter-cluster distance. Figure-2 shows the clustering results obtained with ACDCSA using CVNN as the cluster validity index. The figure shows that the algorithm produced robust clusters.

Conclusion and Future Work
In the present work, an automatic clustering task is implemented using the Dynamic Crow Search Algorithm, and the obtained results are compared with other nature-inspired algorithms in terms of inter-cluster distance, intra-cluster distance and the number of clusters obtained. ACDCSA performs better than the algorithms used for comparison. The present work can be extended to automatic document and image clustering, as most of the data produced are in the form of text and images. Further, ACDCSA can be extended to perform automatic feature selection and clustering simultaneously. ACDCSA can also be used to cluster live-stream and spatiotemporal data, and the algorithm can be improved for a distributed environment. In today's digital era, these data are produced extensively, and applying clustering techniques to them gives insight for other data-analysis fields. Apart from the data science domain, the proposed work can be applied to various other disciplines where the main task is optimization.
Further, ACDCSA can be tested and compared using other internal cluster validity indices that researchers have already proposed. ACDCSA thus has scope in the data-science domain as well as in further research and real-life optimization problems.

EAI Endorsed Transactions on Context-aware Systems and Applications, Online First. Rajesh Ranjan and Jitender Kumar Chhabra.