Multichannel attention mechanisms fusion based on gate recurrent unit memory network for fine-grained image classification

Rui Yang and Dahai Li

The attention mechanism is widely used in fine-grained image classification. Most existing methods construct an attention weight map and apply it to the features by simple weighting, which suffers from low efficiency and slow convergence. This paper therefore proposes a multi-channel attention fusion mechanism built on a deep neural network model that can be trained end-to-end. First, the different regions of the object are described by attention maps. Then the corresponding higher-order statistical features are extracted to obtain the image representation. On several standard fine-grained image classification benchmarks, the proposed method achieves the best results among the compared methods.


Introduction
Fine-grained image classification has been an active research topic in computer vision in recent years. Its purpose is to distinguish more detailed subcategories within a coarse-grained category [1,2]. These subcategories usually have small inter-class differences and often have to be distinguished by small local differences. For example, the ring-billed gull and the California gull in the bird data set are very similar and differ mainly in beak shape, so telling them apart is difficult even for humans with the relevant knowledge [3]. Compared with the inter-class differences, the intra-class differences in fine-grained image classification are usually large, including object pose, scale, occlusion and background. In particular, when the amount of data in each category is limited and there is no additional manual annotation of object parts, fine-grained image classification based on weak supervision is a very challenging task. Given these characteristics and difficulties, introducing a visual attention mechanism [4][5][6] to highlight the discriminative key parts of the image has been a common idea in fine-grained image classification research in recent years. For example, Ying et al. [7] proposed the spatial transformer network, which used soft attention to sample on the feature map to obtain morphologically transformed features. Compared with the classical convolution network, it could extract spatial feature information more effectively. The two-level attention model proposed by Abdalla et al. [8] applied object-level and part-level attention. A convolution network was used to obtain object-level information, and a clustering method was then used to obtain key local regions, so as to make more accurate use of multi-level information. SPDA-CNN [9], proposed by Barbier et al., used the part annotations in the CUB bird data set to train a detection network that obtains hard attention for seven different bird parts, and cropped the features at the corresponding positions for image classification. Suh et al. [10] combined visual attention with a recursive structure, fused features and attention weights at each level of the recursive network, and combined key regional features of multiple scales in the model.
The above methods have achieved good results in applying the attention mechanism to fine-grained classification, but the role of attention in them is still limited in two respects. (1) In each attention and feature fusion step, the attention weight map is a feature map with a single channel. Because multi-dimensional attention features are not used, the ability to extract image features whose key regions have a complex distribution is limited. In recent years, the attention mechanism has also been widely used in fields outside computer vision. Among these applications, the multi-head attention mechanism [11] generates multiple attention weight maps in parallel and fuses them with the features at the same time, so that the model can attend to different input positions. This method surpasses earlier methods based on complex models in tasks such as machine translation, which shows that multi-channel attention can provide more effective and comprehensive information.
(2) The way attention weight maps and image features are fused is relatively simple: the attention weight map and the feature map are multiplied element by element at each position. On the one hand, this cannot extract the higher-order information that is more effective for classification; on the other hand, it is difficult to adapt to multi-channel attention features with more complex forms.
Based on the above analysis, this paper proposes a fine-grained image classification model based on multi-channel attention. A neural-network-based multi-channel attention generation method is proposed, which obtains rich spatial attention information by extracting multi-channel attention weight maps. In addition, a new attention and feature fusion method is proposed: by extracting the high-order information of the image features corresponding to the attention, high-level features with more descriptive power are obtained. Together these form a deep neural network model that can be trained end-to-end. In experiments on common fine-grained image classification data sets such as CUB-200-2011, FGVC-Aircraft and Stanford Cars, the classification accuracy of this model is significantly higher than that of mainstream fine-grained image classification methods of recent years.

Attention extraction
The role of attention can be regarded as selecting task-related information from the input information, with the attention weights indexing that information [12][13][14]. The weights are obtained from an attention scoring function, whose model can be selected according to the actual task and situation. For example, the dot-product model can be used, in which the score of an input is its dot product with W, where W is the learnable network parameter.
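As a minimal sketch of the weighting described above, the following Python snippet scores each input vector with a dot product against a parameter vector and normalizes the scores with softmax. The function names, the toy inputs and the values of `w` are illustrative assumptions and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_scores(xs, w):
    """Score each input vector x_i with the dot-product model s(x_i) = w . x_i."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in xs]

# Three 2-dimensional inputs and a hypothetical learned parameter vector w.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -0.5]

# Attention weights: one non-negative weight per input, summing to 1.
weights = softmax(dot_product_scores(xs, w))
```

In this toy case the first input receives the largest weight because it has the highest score under `w`.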

The fusion of attention and features
The way attention weights act on features can be regarded as encoding the input information under an information selection mechanism [15]. For a single-channel soft attention weight map, the most common way to fuse attention and features is to multiply the elements at corresponding positions, in the form of a dot product.
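The single-channel fusion can be illustrated with a short sketch. It assumes that, after each position's feature vector is multiplied by its attention weight, the weighted vectors are summed over positions; the summation step is a common choice but is an assumption here, as are the function name and the toy values.

```python
def fuse_single_channel(xs, weights):
    """Weight each position's feature vector by its attention weight,
    then sum over positions to get one vector of the feature dimension."""
    dim = len(xs[0])
    return [sum(a * x[d] for a, x in zip(weights, xs)) for d in range(dim)]

# Three positions with 2-dimensional features and one attention weight each.
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [0.2, 0.3, 0.5]

fused = fuse_single_channel(xs, weights)
```

With a single weight per position, all feature channels at that position are scaled identically, which is the limitation the multi-channel variant is meant to address.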

Proposed fine-grained image classification
The deep neural network model described in this paper can be divided into four parts: feature extraction, attention weight map generation, attention weight and feature fusion, and the classifier. The feature extraction module transforms the input image into low-level features using a fully convolutional network. The attention weight map generation module takes the image features as input and produces multi-channel attention weights. The fusion module fuses the attention weights with the low-level features of the image to obtain a feature vector that serves as the high-level representation of the image. The classifier transforms the attention-fused feature vector into a probability for each category of the data set, which gives the classification result. These parts constitute an image classification framework that can be trained end-to-end, as shown in Figure 1. The feature extraction part contains several convolution layers, which can be taken from a pre-trained network. For a two-dimensional input image, the output of the final convolution layer has size H×W×D. This D-channel feature map can be regarded as D groups of features, each containing N = H×W values that correspond to the spatial locations, so the low-level features can be expressed as N vectors of dimension D, one for each spatial location.
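The rearrangement of an H×W×D feature map into N = H×W position vectors can be sketched as follows. The helper name and the toy sizes are assumptions chosen for illustration; a real network output would be far larger.

```python
def flatten_spatial(feature_map):
    """Turn an H x W x D nested list into a list of N = H*W vectors of length D,
    ordered row by row."""
    return [pixel for row in feature_map for pixel in row]

# Toy 2 x 3 feature map with D = 4 channels.
# Each value encodes its (row, column, channel) as 100*row + 10*column + channel.
H, W, D = 2, 3, 4
feature_map = [[[100 * h + 10 * w + d for d in range(D)]
                for w in range(W)]
               for h in range(H)]

x = flatten_spatial(feature_map)  # N = 6 vectors, each of length D = 4
```

Position `(h, w)` ends up at index `h * W + w`, so spatial information is preserved as an ordering rather than as a grid.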

Generation of multi-channel attention weight
In the network model, the low-level feature x is as given above. Multi-channel attention with K channels corresponds to K selection processes over the input information and produces an attention weight map of size N×K, where K is the number of attention weight maps.
Multi-channel attention is equivalent to multi-head attention applied to the two-dimensional features output by the convolution layers [16]. When multi-channel attention acts on the model, it corresponds to multiple separate selection processes over the input information that act on the input features in parallel, with $s_k$ denoting the scoring function of the k-th attention channel. To ensure that the attention weights of different channels focus on different spatial positions in the feature map, the softmax function is applied over the attention channel dimension 1:K.
For the score $s_k$ of channel k, the softmax function acting on it is expressed as
$$a_k = \frac{\exp(s_k)}{\sum_{j=1}^{K} \exp(s_j)}.$$
The attention scoring function $s_k(x_i)$ used in this method is based on the dot-product model, the model most commonly used in applications of the attention mechanism, and is normalized according to the characteristics of the input image features. The attention weight generation process above can be realized with common neural network operations such as convolution layers and softmax layers, which ensures that the model as a whole can be trained end-to-end.
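The generation of multi-channel attention weights can be sketched as a per-position linear map (the effect of a 1×1 convolution) followed by softmax. This sketch assumes the softmax runs over the K channels at each position, which is one reading of the text, and it omits any bias term and the input normalization mentioned above. All names and values are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multichannel_attention(xs, W):
    """Return an N x K list of attention weights.

    xs: N feature vectors of length D (one per spatial position).
    W:  K parameter vectors of length D (a 1x1 convolution without bias).
    """
    attention = []
    for x in xs:
        # Dot-product score of this position under each of the K channels.
        scores = [sum(wd * xd for wd, xd in zip(w_k, x)) for w_k in W]
        # Assumption: channels compete at each position via softmax over K.
        attention.append(softmax(scores))
    return attention

# N = 3 positions, D = 2 feature channels, K = 2 attention channels.
xs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = multichannel_attention(xs, W)
```

Here the first attention channel dominates at the first position and the second at the second position, while the third position is shared equally.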

Function of multi-channel attention weight map
For low-level features and attention weights that share the same spatial dimensions and have D and K channels respectively, their interaction can be written according to the attention fusion method shown in equation (8).
With the low-level features written in matrix form as X and the attention weights likewise written as a matrix, the operation yields high-level features of dimension D×K.
Furthermore, a feature mean corresponding to each group of attention weights is introduced into the model. This attention mean is a network parameter that represents the low-level features over all data and corresponds to the mean under the attention of each channel. The operation is similar to the feature extraction process of VLAD in image representation, and VLAD has been shown to be an effective image representation. Introducing the feature mean makes it possible to extract higher-order features that are more closely related to the categories, which improves the expressive power of the fused output and the classification performance.
With the attention mean introduced, the attention fusion method is expressed as equation (11), in which the product is taken separately along each of the K attention channels. The fusion in equation (11) consists mainly of matrix-vector multiplications, so the backward pass is easy to implement.
After the fusion operation including subtracting the mean of attention, the output high-level feature dimension is still D×K.
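The fusion with mean subtraction can be illustrated with a VLAD-style residual aggregation, in which channel k accumulates the attention-weighted differences between each position's feature and that channel's mean. The exact form of equation (11) is not recoverable from the text, so this aggregation rule is an assumption, as are the function name and toy values.

```python
def fuse_with_means(xs, attention, means):
    """Return K fused vectors of length D (D x K values in total).

    xs:        N feature vectors of length D.
    attention: N x K attention weights.
    means:     K learnable mean vectors of length D.
    Assumed rule: f_k = sum_i attention[i][k] * (x_i - mean_k).
    """
    num_positions, dim = len(xs), len(xs[0])
    fused = []
    for k, mean_k in enumerate(means):
        f_k = [sum(attention[i][k] * (xs[i][d] - mean_k[d])
                   for i in range(num_positions))
               for d in range(dim)]
        fused.append(f_k)
    return fused

# N = 2 positions, D = 2 feature channels, K = 2 attention channels.
xs = [[1.0, 2.0], [3.0, 4.0]]
attention = [[1.0, 0.0], [0.5, 0.5]]
means = [[1.0, 1.0], [2.0, 2.0]]

fused = fuse_with_means(xs, attention, means)
flat = [value for f_k in fused for value in f_k]  # long vector of length D * K
```

The output keeps one D-dimensional vector per attention channel, so its total size is D×K regardless of the number of spatial positions.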
The attention weight generation parameter $W = \{W_k\}$ described above is a key parameter of the network model. It can be initialized randomly in the traditional way, but introducing category information that is independent of the image class labels strongly promotes a more descriptive attention weight map. Initializing the parameter in this way also helps accelerate network convergence and reduce training time. Therefore, external category information is introduced by clustering to initialize the parameter W. In this method, the orthogonal matching pursuit (OMP-k) algorithm [17] is used to initialize W as the minimizer of its clustering objective.
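A minimal sketch of dictionary learning with OMP-1 (the k = 1 case) is given below. It alternates between assigning each sample to the unit-norm atom with the largest absolute dot product and re-estimating each atom from its assigned samples. How the learned atoms are mapped onto W, the fixed iteration count and the initial atoms are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [c / n for c in v] if n > 0 else list(v)

def omp1_dictionary(samples, init_atoms, iterations=10):
    """Learn unit-norm atoms with OMP-1 style alternating updates."""
    atoms = [normalize(a) for a in init_atoms]
    dim = len(samples[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in atoms]
        for x in samples:
            # Encoding step: keep only the atom with the largest |dot product|.
            coeffs = [dot(a, x) for a in atoms]
            j = max(range(len(atoms)), key=lambda idx: abs(coeffs[idx]))
            # Accumulate the sample, scaled by its coefficient, for atom j.
            for d, xd in enumerate(x):
                sums[j][d] += coeffs[j] * xd
        # Update step: renormalize; keep the old atom if nothing was assigned.
        atoms = [normalize(s) if any(s) else a for s, a in zip(sums, atoms)]
    return atoms

# Two loose groups of 2-D samples, one near each coordinate axis.
samples = [[1.0, 0.1], [0.9, -0.1], [0.1, 1.0], [-0.1, 0.8]]
atoms = omp1_dictionary(samples, init_atoms=[[1.0, 0.5], [0.5, 1.0]])
```

In this toy example each atom converges towards the direction of one group of samples, which is the kind of structure that could then seed the attention parameters.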

Attention-based gate recurrent unit
The purpose of the memory network is to retrieve the information needed to answer a question from the input visual information and to store the valid information in memory. To improve the understanding of questions and images, especially when the question requires transitive reasoning, the memory network needs to pass over the input multiple times, updating the stored information after each pass. The memory network is composed of an attention mechanism module and a memory update module. In each iteration, the attention mechanism computes the weights of the input vectors to generate a new memory, and the memory update module then updates the stored information. AttnGRU refers to the attention-based GRU model, in which $u_i$ controls the degree to which the state information of the previous moment is carried into the current moment.
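One common formulation of an attention-based GRU replaces the update gate with a scalar attention weight, so that the attention decides how much of the previous state is kept. The sketch below follows that formulation as an assumption, since the equations are not present in the text. The parameter names `Wr`, `Ur`, `W`, `U`, the bias-free form and the toy values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attn_gru_step(x, h_prev, gate, params):
    """One step of an attention-based GRU.

    gate is a scalar attention weight in [0, 1] used in place of the
    update gate: 0 keeps the previous state, 1 takes the candidate state.
    """
    # Reset gate decides how much of the previous state feeds the candidate.
    r = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wr"], x), matvec(params["Ur"], h_prev))]
    uh = matvec(params["U"], h_prev)
    h_tilde = [math.tanh(a + ri * b) for a, ri, b in
               zip(matvec(params["W"], x), r, uh)]
    return [gate * ht + (1.0 - gate) * hp for ht, hp in zip(h_tilde, h_prev)]

def attn_gru(inputs, gates, params, hidden_size):
    """Run the attention-based GRU over a sequence, starting from a zero state."""
    h = [0.0] * hidden_size
    for x, g in zip(inputs, gates):
        h = attn_gru_step(x, h, g, params)
    return h

identity = [[1.0, 0.0], [0.0, 1.0]]
params = {"Wr": identity, "Ur": identity, "W": identity, "U": identity}
# The first input is fully attended (gate 1), the second is ignored (gate 0).
h = attn_gru([[1.0, -1.0], [0.5, 0.5]], gates=[1.0, 0.0],
             params=params, hidden_size=2)
```

Because the second gate is 0, the final state depends only on the first input, which shows how attention weights filter what enters the memory.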

Model setting
The convolution network of the feature extraction part of the model can be obtained from a pre-trained model. In the experiments, the VGG-16 network [18,19] pre-trained on the ImageNet data set is selected as the base network of this part. The pre-trained VGG-16 network can effectively obtain rich convolution features and is used as the basis of many deep neural network models. In this model, the output of the last convolution layer of the pre-trained network, conv5_3, is used as the low-level feature, and its feature dimension is 512. In the experiments, the input image size of the network's convolution layers is 512×512 pixels. Before entering the network, the image goes through several common data augmentation operations, including cropping part of the image at the ratio 224/256, randomly mirroring the image, and subtracting the image mean. As described above, the attention weight map generation part can be composed of a convolution layer with a 1×1 kernel followed by softmax. The parameters of the convolution layer are initialized by the OMP-1 method. The number of channels K of the attention weight map is a key parameter of the network model, and the relationship between its value and the classification accuracy can be determined through experiments. When the attention weight map has K=32 channels and the low-level features output from the convolution layer have 512 channels, the high-level feature output after fusing attention and low-level features is a long vector of dimension 32×512=16384. To enhance the stability of this high-level feature as an image representation, L2 normalization is applied to obtain the final high-level feature.
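The cropping and mirroring steps can be sketched as follows. The sketch assumes that the 224/256 ratio is applied to the side length of the input, which gives a 448-pixel crop for a 512-pixel image; this reading, the helper names and the toy image are assumptions, and mean subtraction is left out for brevity.

```python
import random

def crop_size(input_size, ratio_num=224, ratio_den=256):
    """Side length of the crop when cropping at the ratio 224/256."""
    return input_size * ratio_num // ratio_den

def random_crop_and_mirror(image, size, rng):
    """Take a random size x size crop from an H x W nested list and
    mirror it horizontally with probability 0.5."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    crop = [row[left:left + size] for row in image[top:top + size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop

full_size_crop = crop_size(512)  # side length for a 512 x 512 input

# Toy 8 x 8 single-channel image standing in for a real input.
toy_image = [[r * 8 + c for c in range(8)] for r in range(8)]
toy_crop = random_crop_and_mirror(toy_image, crop_size(8), random.Random(0))
```

A seeded `random.Random` instance is passed in so that the augmentation is reproducible in experiments.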
The high-level features obtained by attention and feature fusion are fed into a fully connected layer whose output dimension equals the number of categories in the data set. After the softmax layer, the probability of each category is obtained. The input dimension of the fully connected layer in this classifier is high. To speed up network training, the high-level feature vectors of the training images can be collected to train a linear SVM classifier, and the parameters of the fully connected layer can then be initialized with the parameters of the SVM model.
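The final stage, L2 normalization followed by a fully connected layer and softmax, can be sketched as below. The weights here are arbitrary toy values; in the setup described above they could instead be copied from a trained linear SVM, which is not shown. The function names and the small `eps` constant are assumptions.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / (norm + eps) for c in v]

def classify(feature, weights, biases):
    """Fully connected layer followed by softmax.

    weights: one row of length len(feature) per category.
    biases:  one value per category.
    """
    logits = [sum(w * f for w, f in zip(row, feature)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy fused feature of dimension 2 and a 3-category classifier.
feature = l2_normalize([3.0, 4.0])
probs = classify(feature,
                 weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                 biases=[0.0, 0.0, 0.0])
predicted = probs.index(max(probs))
```

Normalizing first keeps the logits on a comparable scale across images, which is the stability benefit mentioned in the text.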

Data sets
To comprehensively evaluate the performance of this method for fine-grained image classification, the CUB-200-2011 bird data set, the FGVC-Aircraft data set, the Stanford Cars data set [20] and other data sets commonly used in fine-grained image classification are used.
The Caltech-UCSD Birds-200-2011 fine-grained image data set, referred to as CUB-200-2011, is the most classic and commonly used data set in current fine-grained image classification research. It contains a total of 11788 images of 200 species of North American birds. According to the split provided with the data set, there are 5994 training images and 5794 test images. The data set is characterized by small differences between categories and small differences between images. Birds are challenging subjects, with diverse poses and positions and limited training data.
The FGVC-Aircraft fine-grained image classification data set contains images of 102 different aircraft models. Each model has 100 images, for a total of 10200 images, about one third of which are used for testing. The main objects in the images are aircraft of different types. Because many aircraft types in the data set are finely divided, the similarity between some categories is very high. At the same time, differences in aircraft livery and environment cause large variation within a category, making FGVC-Aircraft a challenging fine-grained image data set.
The Stanford Cars fine-grained image classification data set contains 16185 images of 196 different types of cars, of which 8144 images are used for training and the rest for testing. In the Cars data set, many categories share the same vehicle model and manufacturer, and the viewpoint and paint of vehicles within a category vary greatly, which gives it strong fine-grained characteristics. This paper uses these fine-grained image classification data sets according to the standard training and test splits they provide. There is no duplicate data between the splits, which ensures a valid evaluation of the model and makes comparison with other methods easy.

Experimental results and analysis
In Experiment 1, the model is configured as described above and evaluated on the CUB-200-2011 data set. The fine-grained classification model based on multi-channel attention obtains 87.7% classification accuracy on CUB-200-2011, as shown in Table 1. Some of the comparison methods use additional supervision beyond the image category, including the bounding boxes and part location labels provided with the data set. Among the comparison methods, SPDA-CNN, Mask-CNN, CB-CNN, B-CNN and RA-CNN all use VGG-16 as the base network, as does the method in this paper, which makes it easier to compare how well the models extract effective classification information from the low-level features of the image. According to the experimental results in Table 1, the classification accuracy of this method is significantly higher than that of previous weakly supervised classification methods without additional annotation. At the same time, it reaches the same level of classification accuracy as methods that use annotations such as part labels. This result shows that the model based on multi-channel attention can effectively extract classification-related features and distinguish fine-grained images.

Experiment 2 classification results of FGVC aircraft dataset and cars dataset.
With the above configuration, the fine-grained classification model based on multi-channel attention obtains 88.4% classification accuracy on the FGVC-Aircraft data set and 92.5% on the Cars data set. Table 2 compares the results of different methods on the two data sets. B-CNN [D,D] is a deep-neural-network-based method that uses the VGG-16 network as its base network, the same as the method in this paper; B-CNN [D,M] combines the features extracted by VGG-16 and VGG-M [21]. The results in the table show that the classification accuracy of this method is significantly higher than that of the previous methods. Taking the complexity of the network models into account as well, the multi-channel attention model used in this method extracts features relevant to fine-grained image classification more effectively while using a base model of the same or smaller scale.

Experiment 3 number of attention weight channels.
For the multi-channel attention model described in this paper, the channel dimension K of the multi-channel attention weight map a in equation (11) is a key parameter. When the number of attention weight channels is small, the model may not provide sufficient attention information, which affects the classification results. When there are many attention weight channels, the number of model parameters grows and with it the computational complexity of the model. The dimension of the image representation vector output after attention also grows, which makes it difficult to obtain a compact image representation [22][23][24]. Table 3 shows the classification accuracy obtained by training on the CUB-200-2011 data set with the model configuration described above as the number of channels of the attention weight map increases through 4, 8, 16, 32 and 64. With 4 attention channels, the classification accuracy differs markedly from that with 8 channels, by as much as 7.2%, which indicates that so few attention weight channels do not provide sufficient information and that this strongly affects classification accuracy. When the attention weight map has at least 16 channels, the classification accuracies are close and the model contains sufficient attention information. The experimental results show that setting the number of channels of the attention weight map to 16 or 32 achieves a good balance between classification accuracy and model complexity.
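The growth in representation size with K can be made concrete with a short calculation. It uses D = 512 and the 200 categories of CUB-200-2011 from the text, and counts only the weights of a fully connected classifier on top of the fused feature, ignoring biases. The function names are illustrative, and the weight count is an estimate under these assumptions rather than a figure reported in the paper.

```python
D = 512            # channels of the conv5_3 output (from the text)
NUM_CLASSES = 200  # categories in CUB-200-2011 (from the text)

def representation_dim(k, d=D):
    """Dimension of the fused feature for k attention channels."""
    return k * d

def classifier_weights(k, d=D, classes=NUM_CLASSES):
    """Weights of a fully connected layer on the fused feature, without biases."""
    return representation_dim(k, d) * classes

# Maps each channel count to (representation dimension, classifier weights).
table = {k: (representation_dim(k), classifier_weights(k))
         for k in (4, 8, 16, 32, 64)}

for k, (dim, weights) in table.items():
    print(f"K={k:2d}  representation={dim:6d}  classifier weights={weights:8d}")
```

Both quantities double each time K doubles, which is why the gain in accuracy beyond K = 16 has to be weighed against the cost.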

Experiment 4 image representation features
In the model described in this paper, the high-level feature vector output after the attention fusion in equation (11) can be used as a feature representation of the input image. Taking this layer as the output, the model acts as a feature extractor that produces a high-level vector for the image. The dimension of this vector and the image classification accuracy are the key factors for evaluating the performance of the model. Table 4 compares different image classification models that can extract image feature vectors, using the feature vector dimension and the classification accuracy on the CUB-200-2011 data set as the evaluation results. This method is evaluated in two configurations, with 16 and 32 attention weight channels respectively. Among the comparison methods, CNN-FC uses the 4096-dimensional output of the fc7 layer of VGG-16 as the representation vector, and VGG-16 is also the base network of all methods in the table; CNN-IFV reduces the dimension of the outputs of the fc7 and fc8 layers of the network to obtain a Fisher vector as the image representation vector; B-CNN uses bilinear pooling to fuse the 512-dimensional outputs of two groups of convolutions to obtain a very high-dimensional representation vector; the CB-CNN method improves on FB-CNN and reduces the dimension of the representation vector while maintaining the classification accuracy. The results in the table show that this method achieves better classification results in the fine-grained image classification task while keeping the representation vector low-dimensional. This shows that the attention fusion method used in the model can extract information that is important for classification more effectively.

Conclusion
In this paper, a deep neural network model for fine-grained image classification is proposed and verified. The model applies multi-channel visual attention and, in the process of fusing attention and image features, extracts higher-order information using the mean corresponding to each attention channel. A method for initializing the attention parameters is also proposed, and together these form an image classification framework that can be trained end-to-end. The model can also be used to extract compact image representations. Experiments on a variety of fine-grained image classification data sets such as CUB-200-2011 show that, compared with the traditional attention model and other classical fine-grained image classification frameworks, the fine-grained image classification model based on multi-channel visual attention has significant advantages in classification accuracy.