Security and Privacy in New Computing Environments. Second EAI International Conference, SPNCE 2019, Tianjin, China, April 13–14, 2019, Proceedings

Research Article

Grid Partition and Agglomeration for Bidirectional Hierarchical Clustering

Download
108 downloads
  • @INPROCEEDINGS{10.1007/978-3-030-21373-2_60,
        author={Lei Wu and Hechang Chen and Xiangchun Yu and Sun Chao and Zhezhou Yu and RuiTing Dou},
        title={Grid Partition and Agglomeration for Bidirectional Hierarchical Clustering},
        proceedings={Security and Privacy in New Computing Environments. Second EAI International Conference, SPNCE 2019, Tianjin, China, April 13--14, 2019, Proceedings},
        proceedings_a={SPNCE},
        year={2019},
        month={6},
        keywords={Hierarchical clustering Grid-based clustering Top-down Bottom-up},
        doi={10.1007/978-3-030-21373-2_60}
    }
    
  • Lei Wu
    Hechang Chen
    Xiangchun Yu
    Sun Chao
    Zhezhou Yu
    RuiTing Dou
    Year: 2019
    Grid Partition and Agglomeration for Bidirectional Hierarchical Clustering
    SPNCE
    Springer
    DOI: 10.1007/978-3-030-21373-2_60
Lei Wu1, Hechang Chen, Xiangchun Yu1, Sun Chao1, Zhezhou Yu1,*, RuiTing Dou1
  • 1: Jilin University
*Contact email: yuzz@jlu.edu.cn

Abstract

Clustering is an important data processing tool, which can be used to reveal the distribution structure of unfamiliar domain data, or as preprocess methods to magnify data object to accelerate subsequent processing or simplify models. However, the distribution of many real-world data in feature space is very complex or uneven. Besides, the similarity/distance is not easy to be properly defined in feature space with different dimensional quantity. Therefore, many existing clustering algorithms are not stable in real datasets, and better performance of different datasets relies on artificial special design, such as scale normalization. In this paper, we propose a bidirectional hierarchical clustering (BHC) algorithm with two phases. In the first phase (Top-down), based on the probability density function of data in different dimensions, the feature space is divided into over-segmented grids to adapt to the complex distribution of data. In the second phase (Bottom-up), based on statistical information, a robust distance instead of geometrical distance is defined to agglomerate the grids into a dendrogram. Compared with the individual data points, grids created in the first phase can carry more statistical information, and the magnified processing objects can accelerate the clustering process. The second phase enhances the algorithm’s ability by the ability of recognize arbitrary shape data clusters. The effectiveness of BHC is compared with 20 popular or recent clustering algorithms on 8 artificial datasets and 6 real-world datasets. And the results show that our algorithm can achieve good results on most datasets. In particular, BHC surpasses all the comparison algorithms involved in the experiment on all real-world datasets. In addition, in order to test the efficiency of the algorithm, we design an experiment which can test the influence of dimension and data size on the operation time.