DSSESKN: A depthwise separable squeeze-and-excitation selective kernel network for art image classification

Image classification is a key technology in content-based image retrieval and has been a focus of image content analysis research in recent years. Its main task is to analyze image content automatically, using image processing and analysis techniques, so that images can be managed and retrieved. Faced with massive collections of digital Chinese art works, effective management and retrieval has become an urgent problem. Traditional image retrieval is mainly based on manual image annotation, which involves a heavy workload and is not sufficiently objective. In this paper, we propose a depthwise separable squeeze-and-excitation selective kernel network (DSSESKN) for art image classification. SKNet (Selective Kernel Network) is used to adaptively adjust the receptive field to extract the global and detailed features of the image. We use SENet (squeeze-and-excitation network) to enhance the channel features. SKNet and SENet are fused to build the DSSESKN. The convolution kernels on the branches of the DSSESKN module extract the global features and local detail features of the input image. The feature maps on the branches are fused, and the fused feature maps are compressed and activated. The processed feature weights are mapped to the feature maps of the different branches, and feature fusion is carried out. Art images are then classified with depthwise separable convolution. Finally, we compare the DSSESKN with other state-of-the-art classification methods; the results show that the DSSESKN achieves better classification performance.


Introduction
Natural art painting embodies people's pursuit of quality of life and spiritual fulfilment. It can express emotions that cannot be put into words, and it includes printmaking, Chinese painting, oil painting, gouache painting, watercolor painting, etc., as shown in figure 1.
Printmaking is a work of art produced by artists through conception, engraving and printing [1]. It expresses its theme and style either through full, dense compositions or through small, simple ones. Chinese painting is drawn on silk or paper with a brush dipped in water, ink or color, and emphasizes spirit, the form of the object and the bone method of the brush. The bone method refers to the artistry of handling the brush, including brush strength, sense of force and structural expression. Chinese painting does not pay attention to rendering the background of the image [2]. Most paintings leave blank space, and ink is used instead of color, so that the ink produces rich and subtle tonal changes, that is, "the ink is divided into five colors". Variations in the speed and length of the brush strokes make the ink techniques ever-changing, and the light and dark tones are rich. Oil painting differs from Chinese painting in that it focuses on realism. The painter creates a sense of light through the contrast between warm and cold tones, the intensity of light and shade, and the thickness of the pigments. Oil painting is opaque, rich in color and highly saturated [3]. Watercolor painting is a painting form in which transparent pigments are mixed with water as the medium. The transparency of the pigments makes the picture clear, while the fluidity of water gives it a natural and free sense of flow. Gouache is a form of painting in which water is mixed with powdery pigments. Its color transparency lies between that of watercolor and oil painting. When wet, gouache has very high color saturation; when dry, the color loses its luster and the saturation decreases. Oil painting, watercolor and gouache can all exhibit rich color, rough texture and distinct color blocks.
However, the interaction of light and dark tones and the use of light and shadow give oil painting a stronger overall sense of space and three-dimensionality. Its details are rich in texture, the lines formed by repeatedly stacking pigments enhance its semantic information, and its transparency is the lowest. Gouache has the next lowest transparency; once dry, the picture shows obvious powdery color blocks, the three-dimensional effect is weak, and the color saturation is lower than that of oil painting. Watercolor painting is highly transparent and can express a fresh and lively texture. Water stains and wet marks on the picture are one of the key features that distinguish it. Chinese painting and watercolor painting are similar in aesthetic expression, intention and charm. On the whole, however, Chinese painting relies on ink and rich lines, describes the shape, bone method, light and dark tones and spirit of the object through points, lines and surfaces, and has a deep overall color. Watercolor painting has a rich, bright and moist overall color with high brightness. Its details do not emphasize line drawing; instead, the interaction between "water" and "color" expresses the mood and charm of the picture.
At present, research on art image management includes classifying paintings by subject matter and technique of expression, classifying painters by creative style, and authenticating art portraits. There are few studies on classifying multi-category art images by their style features. Different painters use different strokes, and the thickness of the lines in a painting can reveal the painter's style. Li [4] proposed using a 2-dimensional multi-resolution hidden Markov hybrid model to analyze most regions of the image, capture the stroke features of key regions, and classify and compare painters.
In traditional Chinese paintings, ink is often used instead of color. Sheng [5] used a gray histogram to extract the overall style features of ink paintings, used the Sobel edge detection method to obtain local detail features with rich stroke styles, and used an information entropy fusion algorithm to realize painter classification. Shen [6] divided the image into 4×4 sub-blocks. For each sub-block, overall features such as color, texture and shape, together with local texture features, were extracted and input to a radial basis function (RBF) neural network to train its parameters; the Hamming distance of the RBF network's output vector was then calculated to classify the painter of the input image. Liang et al. [7] used a Monte Carlo convex hull feature selection model and a support vector machine to describe the features of traditional Chinese paintings. According to the texture, color, shape and other features of Chinese painting, Wang [8] adopted the traditional method to extract supervised heterogeneous sparse features, but the features had only 96 dimensions, which was insufficient to describe the overall features of Chinese painting. Jiang et al. [9] classified traditional Chinese fine brushwork and freehand brushwork based on the differences in expression techniques, making use of low-level image features such as texture, shape, color and edges. Liong used a scale-invariant feature transform (SIFT) feature detector and edge detection to obtain the key regions of the image. By describing the features of the key regions and the differences within their neighborhoods, the differences in expression techniques between fine brushwork and freehand brushwork were analyzed with a cascade classifier. However, the characteristics of art are often combined organically, and it is difficult to generalize how they are combined.
Different types of art images are similar in features such as color, texture and line. Texture, color, line and other features extracted with traditional methods therefore cannot fully distinguish the style characteristics of each type of art image.

EAI Endorsed Transactions on Scalable Information Systems
Convolutional neural networks (CNNs) [11][12][13] have achieved excellent results in extracting global features and local details. The different convolution kernels in the branches of the InceptionV4 module extract the overall and detailed features of the input image, but such modules mostly consider spatial information without considering the dependence between channels, and they do not further enhance the extracted features. The SE (squeeze-and-excitation) module in SENet (squeeze-and-excitation networks) fuses the dependency between spatial dimensions and channels, strengthens the useful extracted features, and suppresses the irrelevant ones. The SK (selective kernel) module of SKNet enables the network to adaptively adjust the size of its receptive field according to multiple scales of input information, and extracts both the details and the overall features of images. The SE and SK modules are shown in figure 2. This paper constructs the DSSESKN module according to the characteristics of the SE and SK modules. In the branches of the DSSESKN module, convolution kernels of different sizes extract the overall and detailed features of the image, which are then fused, compressed and activated. The extracted style features are weighted and mapped to the feature maps on each branch. Finally, the different feature maps are fused to strengthen the feature information extracted by the different convolution kernels. In this paper, the DSSESKN network is used to classify five types of art images.

DSSESKN module
In this paper, the DSSESKN module combines the SE module and the SK module to better enhance the overall style features and local detail features extracted from art images. It consists of the submodules Split, Squeeze, Excitation and Scale, as shown in figure 3. Here, U is the feature map obtained by fusing the branches of the DSSESKN module, and the input feature map is $X \in \mathbb{R}^{H' \times W' \times C'}$, where H', W' and C' represent the height, width and number of channels of X, respectively.
1) Split. The convolution kernels on the branches are applied to X, where c represents the number of channels in the filter and the feature map.
2) Squeeze. After the split operation, two new feature maps $U_1$ and $U_2$ are obtained. By summing the corresponding elements, the feature information of the two feature maps is fused, i.e., $U = U_1 + U_2$. The fused feature map U combines the feature information of $U_1$ and $U_2$, and global average pooling is used to pool the global spatial feature nodes in each feature channel. Global average pooling compresses the spatial information of U into C channel descriptors and generates the statistic $S \in \mathbb{R}^{C}$ describing the feature channel information. The c-th element of S is calculated by squeezing the spatial information of U, that is, $s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)$, where H and W are the height and width of U.
3) Excitation. In order to enhance the style features extracted from each type of art image and suppress feature information with little effect, a 1×1 convolution is used to reduce the channel dimension of the feature map after global average pooling to 1/r of the original number of channels, where r is the dimension reduction ratio of the feature map channels. After BN processing and the ReLU function, the number of channels is restored to the original number by a second 1×1 convolution. Finally, normalized weights in (0,1) are obtained through a Sigmoid gate mechanism, and the style feature information of each type of art image is obtained.
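The fusion, squeeze, excitation and scale steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dsseskn_attention`, the weight matrices `w_down` and `w_up`, and the tensor sizes are hypothetical; BN is omitted for brevity; and the final step assumes that the same channel weights are applied to both branches before they are summed. On a pooled 1×1 feature map, a 1×1 convolution is equivalent to a matrix multiplication, which is what the sketch uses.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsseskn_attention(u1, u2, w_down, w_up):
    """Fuse two branch feature maps and reweight their channels.

    u1, u2 : arrays of shape (H, W, C), the two branch feature maps.
    w_down : array of shape (C, C // r), stands in for the reducing 1x1 conv.
    w_up   : array of shape (C // r, C), stands in for the restoring 1x1 conv.
    """
    u = u1 + u2                      # fuse: element-wise sum of the branches
    s = u.mean(axis=(0, 1))          # squeeze: global average pooling, shape (C,)
    z = np.maximum(s @ w_down, 0.0)  # reduce to C/r channels, ReLU (BN omitted)
    a = sigmoid(z @ w_up)            # restore to C channels, gate in (0, 1)
    v = a * u1 + a * u2              # assumed scale step: weight each branch, then fuse
    return v, a

# Hypothetical sizes: 8x8 feature maps, 16 channels, reduction ratio r = 4.
rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4
u1 = rng.normal(size=(H, W, C))
u2 = rng.normal(size=(H, W, C))
w_down = rng.normal(size=(C, C // r)) * 0.1
w_up = rng.normal(size=(C // r, C)) * 0.1

v, a = dsseskn_attention(u1, u2, w_down, w_up)
print(v.shape, a.shape)
```

A smaller r keeps more channels in the bottleneck (C/r), at the cost of more weights in the two 1×1 convolutions.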

Depthwise separable convolution in DSSESKN
Referring to the network structure of MobileNetV1 [16], depthwise separable convolution is used to build the new network in this paper. The first layer uses dilated convolution to extract features from the original art image. Compared with ordinary convolution, dilated convolution has a larger receptive field and preserves more of the internal data structure and original image information without increasing the amount of computation. Depthwise separable convolution consists of a depthwise convolution followed by a pointwise convolution. Table 1 shows the operations when ordinary 3×3 and 5×5 convolutions are used on the DSSESKN branches. In order to reduce over-fitting during training, L2 regularization is applied in the pointwise convolution, and dropout is applied before GAP processing. In table 1, dilated_conv indicates dilated convolution, and d=2 indicates that the expansion rate is 2. depthwise_conv is the depthwise convolution, and dep_separable_conv indicates the depthwise separable convolution. s1 indicates that the step size is 1, and so on.
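To illustrate why a depthwise convolution followed by a pointwise convolution is lighter than an ordinary convolution, the sketch below counts the weights of both. The layer sizes (a 3×3 kernel, 64 input channels, 128 output channels) are hypothetical and are not taken from table 1; biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of an ordinary k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a k x k depthwise conv plus a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes the channels
    return depthwise + pointwise

# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = sep / std
print(std, sep, round(ratio, 4))
```

The ratio equals 1/c_out + 1/k², so for 3×3 kernels the separable form needs roughly one ninth of the weights once the number of output channels is large.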

Experiment environment
Adam is used as the optimizer in the experiment. The initial learning rate is 0.001, and the network is trained for 120 epochs. During training, the training images are randomly rotated by 0° to 20°, randomly flipped, and randomly shifted by 0 to 10% in the horizontal and vertical directions to enhance the generalization ability of the network. When the training accuracy does not improve for 3 epochs, the learning rate is reduced to 10% of its current value. In the experiment, 80% of the images are randomly selected as the training set and the remaining 20% as the validation set. Experimental hardware: two NVIDIA Tesla P100 12 GB GPUs, two Intel Xeon E5-2620 v4 CPUs, and four 16 GB DDR4 2400 MHz memory modules. The deep learning framework is Keras.
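The learning-rate rule described above (start at 0.001, multiply by 0.1 when accuracy has not improved for 3 epochs) can be sketched in plain Python. This is only an illustration of the rule, not the training code used in the paper; the function name `plateau_schedule` and the accuracy history are hypothetical.

```python
def plateau_schedule(accuracies, initial_lr=0.001, patience=3, factor=0.1):
    """Return the learning rate used in each epoch under a plateau rule.

    accuracies : per-epoch accuracy values observed after each epoch.
    patience   : epochs without improvement before the rate is reduced.
    factor     : multiplier applied to the rate when it is reduced.
    """
    lr = initial_lr
    best = float("-inf")
    wait = 0
    lrs = []
    for acc in accuracies:
        lrs.append(lr)        # rate in effect for this epoch
        if acc > best:
            best = acc        # improvement: reset the counter
            wait = 0
        else:
            wait += 1         # no improvement this epoch
            if wait >= patience:
                lr *= factor  # reduce the rate to 10% of its current value
                wait = 0
    return lrs

# Hypothetical accuracy history: no improvement in epochs 3 to 5.
history = [0.60, 0.70, 0.70, 0.69, 0.70, 0.75]
lrs = plateau_schedule(history)
print(lrs)
```

With this history the rate stays at 0.001 for the first five epochs and drops by a factor of ten for the sixth.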

Data sets
The art images used in this paper are obtained from websites such as Arlib World Art Appreciation Database [17] and https://www.studentart.com.cn/, including 3400 prints, 3500 Chinese paintings, 3400 oil paintings, 3400 gouache paintings and 3400 watercolors. The style information of all art images is distributed evenly. High-resolution images with rich style information are cropped into several 299×299-pixel images, and the crops with rich style information are selected manually to augment the data. After data augmentation, there are 5200 prints, 5200 Chinese paintings, 5120 oil paintings, 5130 watercolors and 5128 gouaches. The network model in

1) Comparison with traditional network models.
The following comparison verifies the effect of the proposed DSSESKN on the classification of art images. As can be seen from table 3, compared with the InceptionV4, ResNet50 and Xception network models, the DSSESKN model has fewer parameters, a shorter training time and high accuracy. However, the DSSESKN module contains parallel convolution operations, so the network in this paper has more parameters than ShuffleNet, MobileNetV1 and MobileNetV2. Compared with Proposed+SE and Proposed+SK (r=16), DSSESKN reaches an accuracy of 86.21%, which is 0.86% and 1.25% higher, respectively. When r=4, the classification accuracy of DSSESKN is 89.54%, which is higher than that when r=16.
2) Comparison with traditional methods.
Traditional global feature and local feature extraction methods are used to classify the data in this paper, and the results are shown in table 4.

Methods          Accuracy/%
Reference [5]    66.79
Reference [6]    61.3
DSSESKN(r=4)     89.54

As can be seen from table 4, the traditional manual features and local features cannot fully distinguish the style features of various art images. However, the DSSESKN can better extract the overall features and local detail features of art images and improve the classification accuracy.
3) Reduction ratio and branch convolution kernel size.
The reduction ratio r and the branch convolution kernel size are important parameters that control the computational cost and accuracy of the DSSESKN module. Different values of r and different branch kernel sizes are tested and analyzed, and the results are shown in table 5. When the kernel sizes on the branches of the DSSESKN module are fixed, the classification accuracy with r=4 is higher than that with r=16. When r is fixed, the DSSESKN module with branch=2 has a shorter training time and fewer parameters than that with branch=3. The highest accuracy is obtained when the convolution kernels on the branches of the DSSESKN module are 1×1 and 5×5, respectively. The results with dilated convolution kernels are shown in table 6, where K1 denotes the ordinary 3×3 convolution kernel, K2 represents a 3×3 convolution kernel with expansion rate=2 and receptive field=5×5, and K3 represents a 3×3 convolution kernel with expansion rate=3 and receptive field=7×7. The experiment results show that dilated convolution has fewer parameters than ordinary convolution with the same receptive field, but its classification accuracy is not as high.
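The receptive fields quoted for K1, K2 and K3 follow from the standard formula for a dilated kernel, k + (k - 1)(d - 1), where k is the kernel size and d the expansion rate. The short sketch below checks these values and compares the number of weights per input/output channel pair with an ordinary kernel of the same receptive field. The helper name `effective_kernel_size` is hypothetical.

```python
def effective_kernel_size(k, d):
    """Receptive field (per side) of a k x k kernel with expansion rate d."""
    return k + (k - 1) * (d - 1)

# K1, K2, K3 are all 3 x 3 kernels with expansion rates 1, 2 and 3.
kernels = {"K1": 1, "K2": 2, "K3": 3}
fields = {name: effective_kernel_size(3, d) for name, d in kernels.items()}

for name, size in fields.items():
    dilated_weights = 3 * 3          # a dilated 3 x 3 kernel still has 9 weights
    ordinary_weights = size * size   # ordinary kernel covering the same field
    print(name, f"{size}x{size}", dilated_weights, ordinary_weights)
```

A dilated 3×3 kernel therefore covers a 5×5 or 7×7 area with 9 weights instead of 25 or 49, which is consistent with the lower parameter count reported for dilated convolution.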

Feature visualization
In order to observe the key region features extracted by the network models, we compare proposed+SE, proposed+SK and DSSESKN. Grad-CAM (gradient-weighted class activation mapping) is used to visualize the feature regions extracted from the art images, as shown in figure 4. Figures 4(b)-(d) are feature heat maps. A deeper warm color means that the category decision depends more strongly on the information in that region. As can be seen from figure 4, the style features of art images are distributed relatively uniformly over the whole image. Compared with proposed+SE and proposed+SK, the DSSESKN extracts the overall features of art images more comprehensively, and the local details are more prominent. To visually demonstrate the classification performance of the proposed method, the images in figure 5 are classified using the DSSESKN network. In figure 5, the blue box indicates an oil painting wrongly classified as a gouache painting, the green box indicates an oil painting wrongly classified as a watercolor painting, and the red box indicates a gouache wrongly classified as an oil painting. As can be seen from figure 5, oil painting, gouache painting and watercolor painting share certain similarities in painting technique, color layout, texture, the use of warm and cold tones, and the treatment of light. The misclassified oil and gouache paintings do not show the style characteristics of their own categories clearly.
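For readers unfamiliar with Grad-CAM, the core computation behind such heat maps is small: the gradients of the class score with respect to the last convolutional feature maps are averaged over space to give one weight per channel, and the weighted sum of the feature maps is passed through ReLU. The sketch below shows this step on random arrays; the function name `grad_cam`, the array sizes and the final rescaling to [0, 1] for display are assumptions, and obtaining the real activations and gradients from a trained network is outside its scope.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map from one layer's outputs.

    activations : array (H, W, K), feature maps of the last conv layer.
    gradients   : array (H, W, K), gradient of the class score w.r.t. them.
    """
    weights = gradients.mean(axis=(0, 1))         # one weight per channel, shape (K,)
    cam = (activations * weights).sum(axis=-1)    # weighted sum over channels, shape (H, W)
    cam = np.maximum(cam, 0.0)                    # ReLU keeps positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                     # rescale to [0, 1] for display
    return cam

# Random stand-ins for real activations and gradients (hypothetical sizes).
rng = np.random.default_rng(0)
acts = rng.normal(size=(10, 10, 32))
grads = rng.normal(size=(10, 10, 32))
heat = grad_cam(acts, grads)
print(heat.shape)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image, with warmer colors assigned to larger values.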

Classification performance comparison
The recall, precision and F-measure values are used to evaluate the performance of DSSESKN, and the results are shown in table 7. The precision and F-measure values of traditional Chinese painting are the highest, and the recall of printmaking is the highest. The unique carving techniques and line characteristics of printmaking are the key features that distinguish it from other kinds of paintings. The unique painting skills and use of ink color in Chinese painting make it different from western paintings.
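Per-class precision, recall and F-measure can be computed from a confusion matrix as in the sketch below. The 3-class matrix is made up for illustration and does not reproduce table 7, and the F-measure is assumed to be the harmonic mean of precision and recall (F1).

```python
import numpy as np

def per_class_metrics(confusion):
    """Precision, recall and F-measure per class.

    confusion : square matrix, rows are true classes, columns are predictions.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)                  # correctly classified images
    precision = tp / confusion.sum(axis=0)   # divide by images predicted as the class
    recall = tp / confusion.sum(axis=1)      # divide by images truly in the class
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical confusion matrix, not taken from the paper.
confusion = [[90, 5, 5],
             [10, 80, 10],
             [0, 15, 85]]
precision, recall, f_measure = per_class_metrics(confusion)
print(precision, recall, f_measure)
```

A class can have high recall but lower precision when many images of other classes are assigned to it, which is why the two measures are reported separately.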

Conclusion
This paper proposes a new DSSESKN model for art image classification. The DSSESKN module uses two-branch convolution kernels to extract the overall features and local detail features of the image, and fuses the extracted features. Two 1×1 convolution operations and a Sigmoid gate mechanism are used to extract the key features of each art image. A depthwise separable convolutional neural network is constructed to classify art images such as printmaking, Chinese painting, oil painting, watercolor painting and gouache painting, which improves the feature extraction and classification of art images. The DSSESKN achieves better art image classification results than existing network models. However, the parallel double convolution operation in the DSSESKN module increases the number of network model parameters significantly. In future work, we will optimize the network model for art image classification, expand the art image data sets, and further improve the classification accuracy.