Research Article
Grid Partition and Agglomeration for Bidirectional Hierarchical Clustering
@INPROCEEDINGS{10.1007/978-3-030-21373-2_60, author={Lei Wu and Hechang Chen and Xiangchun Yu and Sun Chao and Zhezhou Yu and RuiTing Dou}, title={Grid Partition and Agglomeration for Bidirectional Hierarchical Clustering}, proceedings={Security and Privacy in New Computing Environments. Second EAI International Conference, SPNCE 2019, Tianjin, China, April 13--14, 2019, Proceedings}, proceedings_a={SPNCE}, year={2019}, month={6}, keywords={Hierarchical clustering Grid-based clustering Top-down Bottom-up}, doi={10.1007/978-3-030-21373-2_60} }
- Lei Wu
Hechang Chen
Xiangchun Yu
Sun Chao
Zhezhou Yu
RuiTing Dou
Year: 2019
Grid Partition and Agglomeration for Bidirectional Hierarchical Clustering
SPNCE
Springer
DOI: 10.1007/978-3-030-21373-2_60
Abstract
Clustering is an important data processing tool, which can be used to reveal the distribution structure of unfamiliar domain data, or as preprocess methods to magnify data object to accelerate subsequent processing or simplify models. However, the distribution of many real-world data in feature space is very complex or uneven. Besides, the similarity/distance is not easy to be properly defined in feature space with different dimensional quantity. Therefore, many existing clustering algorithms are not stable in real datasets, and better performance of different datasets relies on artificial special design, such as scale normalization. In this paper, we propose a bidirectional hierarchical clustering (BHC) algorithm with two phases. In the first phase (Top-down), based on the probability density function of data in different dimensions, the feature space is divided into over-segmented grids to adapt to the complex distribution of data. In the second phase (Bottom-up), based on statistical information, a robust distance instead of geometrical distance is defined to agglomerate the grids into a dendrogram. Compared with the individual data points, grids created in the first phase can carry more statistical information, and the magnified processing objects can accelerate the clustering process. The second phase enhances the algorithm’s ability by the ability of recognize arbitrary shape data clusters. The effectiveness of BHC is compared with 20 popular or recent clustering algorithms on 8 artificial datasets and 6 real-world datasets. And the results show that our algorithm can achieve good results on most datasets. In particular, BHC surpasses all the comparison algorithms involved in the experiment on all real-world datasets. In addition, in order to test the efficiency of the algorithm, we design an experiment which can test the influence of dimension and data size on the operation time.