A novel image clustering method based on coupled convolutional and graph convolutional network

Image clustering is a key and challenging task in machine learning and computer vision. Technically, image clustering is the process of grouping images without any supervisory information so that similar images fall within the same cluster. This paper proposes a novel image clustering method based on a coupled convolutional and graph convolutional network. It addresses the problem that deep clustering methods usually focus only on features extracted from the samples themselves and seldom consider the structural information behind the samples. Experimental results show that the proposed algorithm effectively extracts more discriminative deep features, and that the model achieves good clustering performance by combining the attribute information and the structural information of the samples in the GCN.


Introduction
Image clustering is very important in computer vision. To make full use of unlabeled data and study the correlations within them, many clustering algorithms have been proposed and successfully applied in practice, for example in image segmentation [1][2], target detection [3] and image classification [4,5]. Traditional clustering methods, such as the K-means clustering algorithm [6], the spectral clustering (SC) algorithm [7] and the non-negative matrix factorization (NMF) clustering algorithm [8], capture similarity through a notion of distance in the original data space, so they are considered shallow models. Although shallow models have been applied successfully in a variety of scenarios, distance-based measures computed in the raw data space are only suitable for describing local relationships and are limited in expressing latent dependencies between inputs, which makes them insufficient for discovering semantic similarity.
With the rapid development of deep learning, many researchers have shifted their attention to deep unsupervised feature learning and clustering [9][10][11][12]. Thus, a new clustering strategy, called deep clustering, emerged. When dealing with large, semantically rich and high-dimensional data, multi-layer architectures based on unsupervised representation learning with deep neural networks have become the natural choice. In addition, deep clustering combines prior knowledge with clustering to obtain an embedded subspace that is well suited to clustering. Compared with traditional clustering methods, deep clustering methods can effectively model the input distribution and capture the nonlinear characteristics of the input. Therefore, they overcome the limitations of shallow models and are more suitable for practical clustering scenarios.
The deep clustering method integrates the clustering objective with the powerful representation capability of deep learning. Therefore, the quality of the learned feature representation directly determines the quality of clustering. To make the latent representations more discriminative, most existing deep clustering methods attempt to minimize a reconstruction loss. For example, Xie et al. [13] used a clustering loss to help an auto-encoder learn data representations with high clustering cohesion. Bashon et al. [14] used a variational auto-encoder to learn data representations better suited to clustering.
Although deep learning has achieved great success in many important tasks, several problems remain when deep neural networks are used for clustering. First, many authors try to combine mature clustering algorithms with deep learning. For example, network training is combined with a k-means objective [15][16][17]. However, a simple combination of clustering and representation learning methods often leads to degenerate solutions. Secondly, auto-encoders are widely used in deep clustering, but they only consider the reconstructed feature representation and lack discriminative ability. The ideal approach would be to train a discriminator with adversarial networks, but this further increases the difficulty of the task [18]. To learn more discriminative deep features, Chen et al. [19] mined the similarity information contained in image triples, and Hjelm et al. [20] maximized the mutual information between features. Finally, most deep clustering methods focus only on the characteristics of the data themselves and seldom consider the structural information between the data, which can often reveal the underlying similarity between samples and thus provide valuable guidance for representation learning. Abdella et al. [21] connected a stacked auto-encoder with a graph convolutional neural network (GCN) through a transfer operator, and used a self-supervision mechanism to optimize the feature extraction and clustering training process. Although structural information plays an important role in data representation learning, it is seldom used in deep clustering.
To solve these problems, a novel image clustering method based on a coupled convolutional and graph convolutional network (CCGCN) is proposed in this paper. First, a mutual information estimation network is embedded in the convolutional auto-encoder and a prior distribution constraint is minimized. Then, the sample's own attribute information learned by the deep auto-coding network is integrated into the graph convolutional neural network. This realizes collaborative learning of the sample's own attribute information and its structural information and completes the clustering task end-to-end, which effectively improves feature discrimination while retaining more of the available information. Finally, experiments are carried out on three classical image datasets to verify the effectiveness of the proposed algorithm.

Deep clustering based on auto-encoder
The clustering method based on an auto-encoder (AE) performs representation learning and clustering jointly by optimizing a linear combination of two objective functions. The joint optimization objective is:

$$L = L_{res} + \gamma L_{c}$$

where $L_{res}$ is the reconstruction loss, $L_{c}$ is the embedded clustering loss, and $\gamma$ is a hyperparameter that controls the degree of distortion of the embedded space. The general network architecture of the AE-based deep clustering algorithm is shown in figure 1, where $X$ is the input image and $\hat{X}$ is the reconstructed image.
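The joint objective described above, a reconstruction loss plus a weighted clustering loss, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the mean squared error for the reconstruction term, the KL-divergence form of the clustering term, and the names `reconstruction_loss`, `kl_clustering_loss`, `joint_loss` and `gamma` are all assumptions made here for the example.

```python
import math


def reconstruction_loss(x, x_hat):
    """Mean squared error between an input vector and its reconstruction (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_clustering_loss(p, q, eps=1e-12):
    """KL(P || Q) between a target distribution P and a soft assignment Q (assumed form)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def joint_loss(x, x_hat, p, q, gamma=0.1):
    """Linear combination of the two objectives: reconstruction + gamma * clustering."""
    return reconstruction_loss(x, x_hat) + gamma * kl_clustering_loss(p, q)


if __name__ == "__main__":
    x = [0.0, 1.0, 0.5]        # toy input
    x_hat = [0.1, 0.9, 0.5]    # toy reconstruction
    p = [0.9, 0.1]             # toy target cluster distribution
    q = [0.6, 0.4]             # toy soft cluster assignment
    print(joint_loss(x, x_hat, p, q, gamma=0.1))
```

A larger `gamma` gives the clustering term more influence over the embedded space, while a smaller value keeps the embedding closer to what reconstruction alone would learn.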

Deep clustering based on variational auto-encoder
The AE-based deep clustering approach improves significantly on traditional clustering methods. However, such methods are designed specifically for clustering and do not reveal the true underlying structure of the samples. In addition, the assumptions underlying dimensionality reduction techniques are usually independent of those underlying clustering techniques, so there is no theoretical guarantee that the network learns viable representations. The variational auto-encoder (VAE), a deep generative model, can be regarded as a generative variant of the AE. It imposes a prior probability distribution on the latent representation, combining Bayesian methods with the flexibility and scalability of neural networks, and uses a reparameterization of the variational lower bound to obtain a differentiable, unbiased estimator of that bound. The objective function of the deep clustering algorithm based on the variational auto-encoder is this variational lower bound, in which p(z) is the prior distribution over the whole latent feature space.
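As a hedged illustration of the variational objective, the sketch below evaluates the negative of the standard VAE lower bound for a single sample, $-\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{KL}(q(z|x)\,\|\,p(z))$. It assumes a diagonal Gaussian posterior, a standard normal prior p(z), and a unit-variance Gaussian likelihood; VAE-based clustering methods often use a mixture prior instead, so this is the simplest case and not necessarily the form used in the paper. The function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), assuming a standard normal prior."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def negative_elbo(x, x_hat, mu, log_var):
    """Negative lower bound for one sample, up to an additive constant.

    The reconstruction term assumes a unit-variance Gaussian likelihood,
    which reduces to half the squared error between x and x_hat.
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    x = [0.2, 0.8]
    x_hat = [0.25, 0.7]
    mu = [0.3, -0.1]        # posterior mean produced by a hypothetical encoder
    log_var = [-0.2, 0.1]   # posterior log-variance produced by a hypothetical encoder
    print(negative_elbo(x, x_hat, mu, log_var))
```

The KL term is what pulls the latent codes toward the prior p(z); it vanishes only when the posterior equals the prior.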

Deep clustering based on graph convolutional neural network
These deep clustering methods generally focus only on data representations learned from the samples themselves, while another important source of information for representation learning, namely the structural information of the data, is rarely taken into account. To exploit the structural information behind the data, clustering methods based on GCN [22] have been widely used. Kipf et al. [23] presented the graph auto-encoder (GAE) and the variational graph auto-encoder (VGAE), which use graph convolution as an encoder to integrate the graph structure into the node features and learn node embeddings. However, the vast majority of GCN-based clustering methods rely on reconstructing the adjacency matrix, so they can only learn data representations from the graph structure while ignoring the characteristics of the data themselves.

Proposed CCGCN model
The model consists of the following components.

1) Image block hierarchical feature extraction network. This network extracts features from the corresponding image blocks. Its input is two image blocks of the same size but different scales, and its output is a label-related feature vector. To generate label-related features, the network uses seven convolutional layers to extract features from the input image blocks, with batch normalization and a rectified linear unit (ReLU) added after each convolutional layer [24]. To promote feature propagation between layers, avoid vanishing gradients during optimization, and strengthen the feature extraction capability of the network, residual connections and dense connections are used between layers. The last feature map output by the convolutional layers is compressed into a vector by a global average pooling layer, and the features are then further integrated by a fully connected layer. Finally, feature vectors related to the disease labels are output (in this experiment, the feature vector extracted from each image block has 128 dimensions).

2) b-GCN. A graph structure G(V,E) can be constructed from the spatial structure of the feature points mentioned above, where V is the set of nodes and E is the set of edges between nodes. The feature points are defined as the nodes of the graph, and the Euclidean distance between nodes defines the edges of the graph. The $d$-dimensional feature vectors ($d$ = 128) extracted from the image blocks at each of the K feature points (K = 30 in total) are defined as the feature vectors of the corresponding nodes.
Thus, we can construct the node feature matrix, which has one row per feature point and one column per feature dimension. Meanwhile, to describe the edges of the graph structure, that is, the adjacency of the nodes, $d_{i,j}$ is defined as the Euclidean distance between the i-th and j-th feature points, and the adjacency matrix is expressed in terms of these distances. Using the feature matrix and the adjacency matrix of the nodes, we can describe the topological structure of the image through the distribution of feature points in the image and process it with b-GCN. Each graph convolution layer $l$ takes the feature matrix of the previous layer together with the adjacency matrix and outputs an updated feature matrix. The feature matrix output by b-GCN is flattened into a one-dimensional vector, and the predicted image labels are obtained after a fully connected layer.
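The graph construction and one graph convolution step can be sketched as follows. Two points are assumptions rather than facts from the text: the adjacency weights are computed here with a Gaussian kernel on the Euclidean distance (the exact expression used in the paper is not recoverable), and the layer follows the widely used symmetric-normalized propagation rule $H^{(l+1)} = \mathrm{ReLU}(D^{-1/2} A D^{-1/2} H^{(l)} W^{(l)})$. All function names and the bandwidth `sigma` are hypothetical.

```python
import math


def pairwise_distances(points):
    """Matrix of Euclidean distances d_ij between feature point coordinates."""
    return [[math.dist(p, q) for q in points] for p in points]


def gaussian_adjacency(dist, sigma=1.0):
    """Assumed weighting: A_ij = exp(-d_ij^2 / (2 sigma^2)); the diagonal is 1, acting as a self-loop."""
    return [[math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in row] for row in dist]


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj_norm, h, w):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    z = matmul(matmul(adj_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]      # toy feature point coordinates
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy node feature matrix (3 nodes, 2 dims)
    weights = [[0.5, -0.5], [0.25, 0.75]]              # toy layer weights
    a_norm = normalize_adjacency(gaussian_adjacency(pairwise_distances(points)))
    print(gcn_layer(a_norm, features, weights))
```

In the setting described above, the feature matrix would have K = 30 rows and 128 columns; the toy sizes here only keep the example readable.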
The auto-coding network learns useful representations from the samples at each of its layers. The representations learned by the $l$-th layer of the deep auto-coding network are integrated into the corresponding layer of the GCN module for propagation. In the last layer of the GCN module, a multi-class classification layer with the softmax function outputs Y, which predicts the distribution of the samples. Here Y is a probability distribution.
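A minimal sketch of the two steps just described: mixing the auto-encoder representation into the GCN representation of the same layer, then turning the final layer output into a probability distribution with softmax. The mixing rule used here, a convex combination controlled by a coefficient `delta`, is an assumption modeled on the transfer operator mentioned in the experiment settings; the exact formula and the function names are not taken from the paper.

```python
import math


def fuse(z_gcn, h_ae, delta=0.5):
    """Assumed layer-wise fusion: (1 - delta) * Z + delta * H, element by element."""
    return [
        [(1.0 - delta) * z + delta * h for z, h in zip(z_row, h_row)]
        for z_row, h_row in zip(z_gcn, h_ae)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def predict_distribution(scores):
    """Row-wise softmax: each sample gets a probability distribution over clusters."""
    return [softmax(row) for row in scores]


if __name__ == "__main__":
    z = [[0.2, 1.0, -0.5], [1.5, 0.1, 0.3]]   # toy GCN representations for two samples
    h = [[0.4, 0.8, -0.1], [1.1, 0.0, 0.5]]   # toy auto-encoder representations
    y = predict_distribution(fuse(z, h, delta=0.5))
    for row in y:
        # The index of the largest probability is the predicted cluster.
        print(row, max(range(len(row)), key=row.__getitem__))
```

With `delta=0.5` both sources contribute equally; values closer to 0 rely more on the graph structure and values closer to 1 rely more on the sample's own attributes.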

Experimental results and analysis
The experiment is divided into three parts. The proposed algorithm and six other clustering algorithms are first compared on three classic datasets: USPS, MNIST, and Fashion-MNIST. The clustering performance of the CCGCN algorithm is evaluated with two quantitative measures, clustering accuracy (ACC) and normalized mutual information (NMI).

Dataset
To demonstrate that the CCGCN algorithm can handle a variety of dataset types, three classic image datasets (USPS, MNIST, and Fashion-MNIST) are selected for experimentation [26]. Because the clustering task is completely unsupervised, the training samples are merged with the test samples in the experiment. Statistics for these datasets are shown in table 1.

Experiment settings
The experiment uses two standard unsupervised evaluation metrics, ACC and NMI [27], to evaluate the clustering performance of the algorithm. The two metrics capture different aspects of a clustering result, and for both a higher value indicates better clustering performance. The software environment is Ubuntu 16.04 with Python and the deep learning framework PyTorch [28,29], and the hardware environment is an i7-6700 processor and an NVIDIA GeForce GTX 1060 graphics card. In the experiment, the images are randomly shuffled within each batch, and negative samples are selected according to the shuffled order. To reduce the hyperparameter search space, the number of nearest neighbors is set to k=3 and the transfer operator δ is set to 0.5. To reduce random errors, 10 experiments are performed under the same conditions and the 10 results are averaged. The number of channels and the kernel size settings for the auto-encoding network are shown in table 2.
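The two metrics can be illustrated with a small, dependency-free sketch. ACC is taken here as the fraction of samples labeled correctly under the best one-to-one mapping between cluster labels and class labels; the mapping is found by brute force over permutations, which is only practical for a handful of clusters (assignment solvers such as the Hungarian algorithm are the usual choice at scale). NMI is normalized by the arithmetic mean of the two entropies, which is one common convention and an assumption here, since the normalization is not specified in the text. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Best-mapping accuracy; assumes there are no more clusters than classes."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    counts = Counter(zip(y_pred, y_true))
    best = 0
    # Try every injective assignment of cluster labels to class labels.
    for perm in permutations(classes, len(clusters)):
        hits = sum(counts[(c, t)] for c, t in zip(clusters, perm))
        best = max(best, hits)
    return best / len(y_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(y_true, y_pred):
    """NMI = I(U;V) / mean(H(U), H(V)), the arithmetic-mean convention."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    ct, cp = Counter(y_true), Counter(y_pred)
    mi = sum((c / n) * math.log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    denom = 0.5 * (entropy(y_true) + entropy(y_pred))
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    pred = [1, 1, 2, 0, 0, 0]   # cluster ids are arbitrary; one sample is misplaced
    print(clustering_accuracy(truth, pred), normalized_mutual_information(truth, pred))
```

Both metrics are invariant to how the cluster ids are numbered, which is why they are suitable for unsupervised evaluation.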

Experiment results
The ACC and NMI on the three datasets MNIST, USPS, and Fashion-MNIST are shown in Table 3. As Table 3 shows, the proposed CCGCN algorithm achieves the highest ACC and NMI values among the compared algorithms on all three classical image datasets. The proposed method thus improves clustering performance, and it still produces the best results on the more complex Fashion-MNIST dataset. On USPS, MNIST, and Fashion-MNIST, the CCGCN method is 1.97%, 2.16%, and 3.18% higher, respectively, than the second-best clustering method. Deep clustering is generally better than traditional clustering methods such as the K-means algorithm. This is mainly because, compared with shallow clustering methods, a deep neural network can perform dimensionality reduction, effectively model the distribution of the inputs, capture their nonlinear characteristics, and learn good deep features. Therefore, when dealing with high-dimensional nonlinear data, deep clustering models mostly outperform shallow clustering models. Compared with other deep clustering methods, such as AE, DEC, IDEC, and Deep-cluster, the SDCN algorithm achieves better predictions by learning the samples' own attribute information and their structural information collaboratively. Compared with the SDCN algorithm, the proposed algorithm embeds the mutual information estimation network and minimizes the prior distribution constraint in the multi-layer convolutional encoder. This effectively mines more discriminative deep features, makes the coding space more regular, and improves the coding quality of unsupervised feature extraction, which in turn improves clustering performance.
Experimental results show that the new method achieves better clustering results than current advanced algorithms on three classical datasets. Table 4 clearly shows that each training strategy effectively improves clustering performance over the multi-layer convolutional auto-encoder (ConvAE); in particular, adding the mutual information estimation network and fusing the structural information of the samples into ConvAE significantly improve the clustering results. In strategy (2), the mutual information estimation network is embedded in the multi-layer convolutional encoder. It considers both the global mutual information between the input and the latent feature representation and the local mutual information between the mid-layer features and the latent feature representation. The local mutual information in particular is equivalent to treating each small part as a sample, so that one original sample becomes 1+M×M samples. This greatly increases the sample size and improves the coding quality of unsupervised feature extraction. On the USPS dataset, the ACC and NMI of strategy (2) improve by 0.087 and 0.092, respectively, compared with strategy (1). On the MNIST dataset, ACC and NMI improve by 0.059 and 0.057, respectively. On the Fashion-MNIST dataset, ACC and NMI improve by 0.046 and 0.053, respectively. In strategy (3), the features learned by different layers of ConvAE are integrated into the corresponding layers of the GCN module, so that the model learns the attribute information of the samples themselves and the structural information between the samples at the same time. The strategy of combining ConvAE with GCN therefore also produces better results than using ConvAE alone. On the USPS dataset, ACC and NMI increase by 0.084 and 0.136, respectively, compared with strategy (1). On the MNIST dataset, ACC and NMI increase by 0.081 and 0.100, respectively. On the Fashion-MNIST dataset, ACC and NMI increase by 0.038 and 0.060, respectively.
In strategy (4), the above three strategies are combined to jointly optimize the feature extraction and cluster assignment processes end-to-end, and the model finally produces stronger predictions. On the USPS dataset, compared with strategy (2) and strategy (3) respectively, strategy (4) increases ACC by 0.013 and 0.016 and NMI by 0.07 and 0.027. On the MNIST dataset, ACC and NMI increase by 0.094 and 0.051, respectively. On the Fashion-MNIST dataset, ACC increases by 0.025 and 0.033, and NMI increases by 0.022 and 0.014, respectively.

Ablation experiment
Using the t-SNE visualization method, the clustering results of the different training strategies on the MNIST dataset are visualized, as shown in figure 5. Figure 5(a) shows the distribution of data points in ConvAE's latent subspace, and Figure 5

Algorithm parameter evaluation experiment
To study the sensitivity of the CCGCN algorithm to its two parameters, ACC and NMI are used to assess the effect of different parameter values on clustering performance.

Conclusion
To improve the discriminative power of deep features, make full use of the structural information between unlabeled samples, and jointly optimize the feature extraction and clustering of samples, this paper proposes the CCGCN clustering algorithm. The algorithm embeds a mutual information estimation network and minimizes a prior distribution constraint in the convolutional auto-coding network, and it considers both the attribute information of the input samples themselves and the structural information between the samples, which effectively improves feature discrimination while retaining more of the available structural information. On image clustering tasks, the CCGCN algorithm uses the Kullback-Leibler (KL) divergence to constrain the latent feature distribution. Experimental results show that the clustering accuracy of the CCGCN algorithm is significantly improved on three classical image datasets. In particular, on the complex Fashion-MNIST dataset, the accuracy of the proposed method is 3.18% higher than that of the second-best clustering algorithm. However, the effectiveness of the CCGCN algorithm has only been validated on smaller image datasets, and how to improve clustering performance on larger datasets is the focus of future work.