Convolutional block attention module based on visual mechanism for robot image edge detection

In recent years, the continued development of computer vision and digital image processing has led many researchers to study their application to robot images. Edge detectors based on conventional deep learning models tend to produce messy and blurred edge lines. We therefore present a new edge detection model for robot images built on the convolutional block attention module (CBAM), an attention mechanism inspired by the visual mechanism. CBAM is added to the backbone network, and a down-sampling technique with translation invariance is adopted. Some down-sampling operations in the backbone are removed to retain image detail, and dilated convolution is used to enlarge the model's receptive field. Training is carried out on the BSDS500 and PASCAL VOC Context datasets. During testing, an image pyramid is used to improve edge quality. Experimental results show that the proposed model extracts image contours more clearly than other networks and alleviates the problem of edge blur.


Introduction
Edge detection is a basic problem in image processing and computer vision. The goal is to extract object boundaries and perceptually salient edges from natural images, retaining the main structure of the image while ignoring unimportant details. It is usually regarded as a low-level technique. Various high-level tasks [1,2] have benefited from progress in edge detection, such as object detection [3], object proposals [4] and image segmentation [5].
Early edge detectors, such as the Sobel [6] and Canny [7] operators, detect edges from first-order and second-order gradient information of the image.
These operators run in real time, but they are sensitive to noise and localize edges poorly. They are designed by hand to detect discontinuities in intensity and color. Martin et al. [8] found that converting changes in brightness, color and texture into features and training a classifier to combine them significantly improves performance. Later work explored learning-based edge detection. Dollár et al. [9] used random decision forests to represent the structure of local image patches: given color and gradient features as input, the structured forests output high-quality edges. However, all of the above methods rely on hand-crafted features, which are costly and cumbersome to design and of limited practical use. Moreover, hand-crafted features have limited ability to express the high-level information needed for semantically meaningful edge detection.
In recent years, convolutional neural networks (CNNs) have been widely adopted. Their automatic hierarchical feature learning has greatly improved the performance of edge detection. Ganin et al. [10] combined a CNN with nearest-neighbor search: the CNN computes features for each image patch, similar edge patches are retrieved from a dictionary, and the retrieved edge information is integrated into the final result. Xie et al. [11] proposed HED (Holistically-Nested Edge Detection), the first end-to-end edge detection model. Built on the fully convolutional network (FCN) [12] architecture with a VGG backbone [13], it uses multi-scale and multi-level learning and significantly improves edge detection performance. Liu et al. [14] proposed RCF (Richer Convolutional Features for edge detection) on the basis of HED, using richer convolutional features and a more robust loss function to improve detection performance.
Although classic models such as HED and RCF have made considerable progress in edge detection, they rely on the standard convolutions of VGG16 and cannot extract the global features of the image. They also use too many down-sampling operations, which hurts generalization on the test set. In addition, the feature maps generated in the lower stages of the network are often messy and contain too many irrelevant textures. Although these networks include a fusion layer, simply fusing the five stages with a single 1×1 convolution layer loses some multi-scale information. To solve these problems, this paper proposes a new edge detection model for robot images based on the convolutional block attention module (CBAM) [15]. The model introduces the CBAM attention mechanism into the VGG backbone network [16] and makes the convolutional network shift-invariant to strengthen its feature extraction ability. In the fifth stage of the backbone, dilated convolution [17] is used to enlarge the receptive field of the network and extract more semantic information. The features generated at each stage are fused with a method similar to the feature pyramid [18], so that the lower layers can also attend to the global characteristics of the higher layers. Each stage is supervised, and a final 1×1 convolution layer fuses the multi-scale features to generate the final edge map.

Baseline network
The RCF [19] model optimizes the network structure of HED [20]. RCF uses richer convolutional features and supervises each stage separately, which speeds up convergence. The multi-scale features of the stages are then fused by a 1×1 convolution layer, and the fused output is also supervised. Compared with HED, RCF uses the features extracted from all the convolution layers of VGG16 and proposes a more robust loss function, which greatly improves the detection results.

New model
Based on the RCF network, this paper proposes a new model that integrates multi-scale features across levels, as shown in figure 1. Each module is described in the following sections.

Backbone network
As shown in figure 1, VGG16 is used as the backbone of the new model. The fully connected layers are removed so that the network is fully convolutional. The CBAM attention mechanism is introduced into the backbone, and down-sampling is applied after the first three stages. After the fourth stage [21], a 2×2 max pooling with stride 1 is used, so that the resolution of the fifth-stage feature maps remains unchanged and image detail is retained. The network therefore contains only three down-sampling operations, and the resolution is reduced to 1/8 of the input. To compensate for the receptive field lost by removing a down-sampling operation, dilated convolution with a dilation rate of 2 is used in the fifth stage, which enlarges the receptive field of the model without changing the number of network parameters.
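The arithmetic behind the 1/8 resolution and the enlarged receptive field can be checked with a short sketch. It assumes the 3×3 kernels of VGG16 and the usual formula for the extent of a dilated kernel; the helper names `output_stride` and `effective_kernel` are illustrative and are not taken from the paper's code.

```python
def output_stride(pool_strides):
    """Cumulative down-sampling factor of a stack of pooling layers."""
    stride = 1
    for s in pool_strides:
        stride *= s
    return stride


def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Pooling after stages 1-3 has stride 2; the pooling after stage 4 has stride 1.
print(output_stride([2, 2, 2, 1]))  # 8, i.e. resolution is 1/8 of the input
print(effective_kernel(3, 1))       # 3, ordinary 3x3 convolution
print(effective_kernel(3, 2))       # 5, same 9 weights cover a 5x5 window
```

With a dilation rate of 2, each fifth-stage kernel keeps its nine weights but spans a 5×5 window, which is why the receptive field grows while the parameter count stays the same.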

A backbone network based on CBAM attention
The new model introduces the CBAM attention module into the backbone network. SENet (Squeeze-and-Excitation Networks) learns feature weights from the loss, so that effective feature maps receive high weights and ineffective ones receive low weights, which improves the model. However, SENet only considers the importance of different channels and ignores the importance of different positions [22][23][24][25]. CBAM adds a spatial attention mechanism on top of the channel attention used by SENet, and this spatial attention module learns the importance of different positions in each feature map. The CBAM module structure is shown in figure 2. In the channel attention module, the input feature map $F$ is aggregated over its spatial dimensions by average pooling and max pooling, giving two descriptors that represent the average-pooled and max-pooled features respectively. The two descriptors are then sent through the multi-layer perceptron (MLP) [26]; the MLP outputs are added element-wise and activated by a sigmoid to generate the channel attention map $M_c$. The channel-refined feature map $F'$ is obtained by multiplying $F$ by $M_c$. In the spatial attention module, $F'$ is pooled along the channel axis, again by average pooling and max pooling. The two resulting maps are concatenated, convolved by a standard convolution layer, and activated by a sigmoid to generate the spatial attention map $M_s$.
Finally, $M_s$ is multiplied by $F'$ to obtain the attention-refined feature map $F''$.
To limit the number of network parameters, CBAM is added to only some of the convolution layers, as shown in figure 1. With the CBAM module, the network can learn the importance of different feature maps and of pixels at different positions, which strengthens the feature extraction ability of the model.
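The two attention steps can be sketched in plain Python on a tiny feature map stored as nested lists (channels × height × width). This is a minimal illustration under assumptions, not the paper's implementation: the sizes, the reduction ratio `r`, the 3×3 spatial kernel, the random weights, and the helper names are all chosen for the example, and a real model would use trained PyTorch layers.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp(vec, w1, w2):
    """Shared two-layer perceptron: C -> C/r (ReLU) -> C, biases omitted."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feat, w1, w2):
    """Channel map M_c: one weight per channel from avg- and max-pooled descriptors."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]
    return [sigmoid(a + m) for a, m in zip(mlp(avg, w1, w2), mlp(mx, w1, w2))]


def spatial_attention(feat, kernel):
    """Spatial map M_s: one weight per position from channel-wise avg and max maps."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    avg = [[sum(feat[k][i][j] for k in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(feat[k][i][j] for k in range(c)) for j in range(w)] for i in range(h)]
    pad = len(kernel[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            # Two-plane input (avg, max), single-channel output, zero padding.
            for plane, kern in zip((avg, mx), kernel):
                for di, krow in enumerate(kern):
                    for dj, kval in enumerate(krow):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            acc += kval * plane[y][x]
            out[i][j] = sigmoid(acc)
    return out


def cbam(feat, w1, w2, kernel):
    """F' = M_c * F (per channel), then F'' = M_s * F' (per position)."""
    mc = channel_attention(feat, w1, w2)
    f1 = [[[mc[k] * v for v in row] for row in ch] for k, ch in enumerate(feat)]
    ms = spatial_attention(f1, kernel)
    return [[[ms[i][j] * v for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in f1]


random.seed(0)
C, H, W, r, k = 4, 5, 5, 2, 3  # illustrative sizes only
feat = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
kernel = [[[random.uniform(-1, 1) for _ in range(k)] for _ in range(k)] for _ in range(2)]
out = cbam(feat, w1, w2, kernel)
print(len(out), len(out[0]), len(out[0][0]))  # 4 5 5
```

The output has the same shape as the input, and because both attention maps lie in (0, 1), each value is a scaled-down copy of the corresponding input value.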

Max pooling
Modern convolutional networks are not shift-invariant, because commonly used down-sampling methods, such as max pooling, strided convolution and average pooling, ignore the sampling theorem, so a small shift of the input can cause a dramatic change in the output. To solve this problem, Chintala et al. [27] proposed the BlurPool down-sampling method. As shown in figure 5, max pooling can be viewed as two steps: computing the maximum over each region densely, and then down-sampling. BlurPool inserts an anti-aliasing operation between these two steps, smoothing the signal with a blur kernel so that the result for a shifted input is similar to the result for the unshifted input.  .75], which is relatively smoother than the traditional maximum pooling. Note that reference [28] provides several blur kernels, and one must be selected manually. The new model uses this down-sampling technique after the first three stages of the network to improve the robustness and generalization of the model.
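A one-dimensional sketch makes the effect of the blur step concrete. It assumes a pooling window of 2, a stride of 2, circular padding, and the blur kernel [1, 2, 1]/4; these are illustrative choices, and the input signal is a toy example rather than data from the paper.

```python
def dense_max(signal, k=2):
    """Max over a sliding window of size k, evaluated at every position (stride 1)."""
    n = len(signal)
    return [max(signal[(i + d) % n] for d in range(k)) for i in range(n)]


def max_pool(signal, k=2, stride=2):
    """Ordinary max pooling: dense max followed directly by subsampling."""
    return dense_max(signal, k)[::stride]


def blur_pool(signal, k=2, stride=2, blur=(0.25, 0.5, 0.25)):
    """Anti-aliased max pooling: dense max, low-pass blur, then subsampling."""
    dense = dense_max(signal, k)
    n, half = len(dense), len(blur) // 2
    smooth = [sum(b * dense[(i + d - half) % n] for d, b in enumerate(blur))
              for i in range(n)]
    return smooth[::stride]


x = [0, 0, 1, 1, 0, 0, 1, 1]
shifted = x[1:] + x[:1]          # the same signal shifted by one sample

print(max_pool(x))               # [0, 1, 0, 1]
print(max_pool(shifted))         # [1, 1, 1, 1]
print(blur_pool(x))              # [0.5, 1.0, 0.5, 1.0]
print(blur_pool(shifted))        # [0.75, 0.75, 0.75, 0.75]
```

For this signal, a one-sample shift changes the plain max pooling output by as much as 1, whereas the blurred outputs differ by at most 0.25.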

Multi-scale feature extraction
In deep learning, there are usually two ways for a network to learn multi-scale features. The first works inside the network: as the receptive field grows and sampling layers are added, the features learned at each layer are naturally multi-scale. The second adjusts the size of the input image.
CFF extracts multi-scale features in the same way as the RCF network. The side output of each convolution layer in the backbone is compressed by a 1×1 convolution layer, and the side outputs are summed within each stage. A further 1×1 convolution layer then reduces the dimensionality and outputs a single-channel feature map.

Feature pyramid fusion module
To fully integrate the multi-scale features of each level, the new model adopts the feature pyramid method, which passes the features of the higher levels down to the lower levels so that the lower levels also attend to global features. This fully integrates the multi-scale features and effectively reduces the blurred details in the feature maps generated by the low stages.
The Feature Fusion Module (FFM) in figure 1 first up-samples the features of the higher layer and concatenates them with the features of the lower layer. The concatenated features are then compressed by a 1×1 convolution layer. In this way, the lower level integrates the features of the higher level. In addition, to avoid losing important details of the low-level features, the module borrows the residual structure: the original feature map and the output feature map are added together, and the sum is taken as the final output of FFM. The FFM structure in stage 4 is shown in figure 7; the structure in the other stages is similar. An up-sampling operation is then applied so that each stage outputs an edge map, and the model supervises the edge map output by each stage.
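The data flow through one fusion module can be sketched as follows. The sketch assumes nearest-neighbour up-sampling and that the "original feature map" in the residual sum is the lower-level input; the text does not settle either point, so both are assumptions. The function names and the weights are illustrative, and feature maps are nested lists of shape channels × height × width.

```python
def upsample_nearest(feat, factor):
    """Repeat every pixel `factor` times along both spatial axes."""
    return [[[v for v in row for _ in range(factor)]
             for row in ch for _ in range(factor)] for ch in feat]


def conv1x1(feat, weights):
    """1x1 convolution: weights[o][c] mixes input channels at each pixel."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wt[c] * feat[c][i][j] for c in range(len(feat)))
              for j in range(w)] for i in range(h)] for wt in weights]


def ffm(low, high, weights, factor=2):
    """Up-sample high-level features, concatenate, compress, add the residual."""
    up = upsample_nearest(high, factor)
    fused = conv1x1(low + up, weights)  # list concatenation = channel concatenation
    return [[[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
            for c1, c2 in zip(low, fused)]


low = [[[1.0, 2.0], [3.0, 4.0]]]   # 1 channel, 2x2 (lower level, finer)
high = [[[10.0]]]                  # 1 channel, 1x1 (higher level, coarser)
weights = [[0.5, 0.25]]            # hypothetical 1x1 kernel: 2 inputs -> 1 output
print(ffm(low, high, weights))     # [[[4.0, 5.5], [7.0, 8.5]]]
```

Each output pixel is the low-level value plus a learned mix of the low-level value and the up-sampled high-level value, so fine detail is preserved through the residual path.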

Contour detection
The edge image is obtained by edge detection followed by non-maximum suppression, and thresholding is used to filter out noise and false edges caused by small changes, giving an accurate contour image. In this paper, the Otsu method is used to further segment the robot edge image. The value that maximizes the inter-class variance is taken as the threshold, and the edge detection result is thresholded a second time with it to obtain the final contour, which makes the edge detection adaptive. The inter-class variance is defined by formula (1), where the total average intensity of the edge image is defined by formula (2).
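A minimal sketch of Otsu thresholding on an edge probability map is given here. It assumes the textbook form of the criterion, in which the inter-class variance is $\omega_0(\mu_0-\mu_T)^2+\omega_1(\mu_1-\mu_T)^2$ and the total mean is $\mu_T=\omega_0\mu_0+\omega_1\mu_1$; whether formulas (1) and (2) use exactly this notation is an assumption. The 256-bin histogram and the function name are also illustrative.

```python
def otsu_threshold(values, bins=256):
    """Return the threshold in (0, 1] that maximizes the inter-class variance.

    `values` are edge probabilities in [0, 1]; values below the returned
    threshold form class 0 (background), the rest class 1 (edge).
    """
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_total = sum(i * h for i, h in enumerate(hist))
    mean_total = sum_total / total          # mu_T, in bin units

    best_t, best_var = 0, -1.0
    n0, sum0 = 0, 0                         # pixel count and intensity sum of class 0
    for t in range(bins - 1):
        n0 += hist[t]
        sum0 += t * hist[t]
        n1 = total - n0
        if n0 == 0 or n1 == 0:
            continue                        # one class is empty, variance undefined
        mu0, mu1 = sum0 / n0, (sum_total - sum0) / n1
        var_between = (n0 / total) * (mu0 - mean_total) ** 2 \
            + (n1 / total) * (mu1 - mean_total) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return (best_t + 1) / bins


# Toy edge map: half weak responses, half strong responses.
edge_probs = [0.1] * 50 + [0.9] * 50
thr = otsu_threshold(edge_probs)
contour = [1 if p >= thr else 0 for p in edge_probs]
print(sum(contour))  # 50
```

Because the threshold is derived from the histogram of each edge image, no threshold needs to be fixed by hand, which is what makes the second thresholding step adaptive.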

Multi-scale feature fusion
After the feature pyramid module, the multi-scale features have already been fully fused, so the model uses only a single 1×1 convolution layer to fuse the features of all levels. Its output is the final edge map of the new model and is also supervised.

Loss function
Edge detection datasets are typically labeled by multiple annotators. For each image, the labels of all annotators are averaged to generate an edge probability map with values from 0 to 1, where 0 means that no annotator marked the pixel and 1 means that all annotators marked it. Pixels with edge probability higher than $\eta$ are regarded as positive samples, and pixels with edge probability equal to 0 are regarded as negative samples. The remaining pixels, marked by fewer than a fraction $\eta$ of the annotators, may be semantically disputed edge points. Treating them as either positive or negative samples may confuse the network, so the model ignores pixels in this category.
Because edge detection classifies pixels, this model uses the cross-entropy function as the objective function. As in reference [29], the loss of each pixel with respect to its label is calculated as:
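The labeling rule and the per-pixel loss can be illustrated with a class-balanced cross-entropy of the kind used in RCF-style models. This is a sketch under assumptions: the balancing weights `alpha` and `beta`, the hyper-parameter `lam`, and the default values `eta=0.5` and `lam=1.1` are chosen for illustration and are not confirmed by the text.

```python
import math


def balanced_cross_entropy(probs, labels, eta=0.5, lam=1.1, eps=1e-7):
    """Class-balanced cross-entropy that skips disputed pixels.

    probs:  predicted edge probabilities, one per pixel
    labels: averaged annotator labels in [0, 1], one per pixel
    Pixels with 0 < label <= eta are ignored.
    """
    pos = [i for i, y in enumerate(labels) if y > eta]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = len(pos) + len(neg)
    if n == 0:
        return 0.0
    alpha = lam * len(pos) / n   # weight on negatives (edges are the rare class)
    beta = len(neg) / n          # weight on positives
    loss = 0.0
    for i in neg:
        loss -= alpha * math.log(max(1.0 - probs[i], eps))
    for i in pos:
        loss -= beta * math.log(max(probs[i], eps))
    return loss


labels = [0.0, 0.0, 0.3, 1.0]   # third pixel is disputed (0 < 0.3 <= eta)
probs = [0.1, 0.2, 0.9, 0.8]
print(balanced_cross_entropy(probs, labels))
```

Changing the prediction at the disputed pixel leaves the loss unchanged, which is the intended effect of ignoring that category.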

Setting the weight of loss at different stages
In the network, the edge maps output by the stages differ greatly, the loss magnitudes of the stages may be inconsistent, and the loss of the fusion stage should be dominant. In addition, the experiments show that by the 20th training epoch the feature maps of the first two stages contain almost no detailed texture, which may be a negative effect of integrating low-level features with high-level features. These problems harm the final prediction.
In order to restrain this phenomenon, this paper reduces the proportion of loss in the five stages of the network and increases the proportion of loss in the fusion stage, so as to balance the relationship between loss in each stage and loss in the fusion stage. The loss weight of the five stages of the network is set as

Multi-scale edge detector
To further improve edge quality, an image pyramid is used at test time. Specifically, the image is resized to build an image pyramid, and each scaled image is fed separately into the trained single-scale detector. Bilinear interpolation then resizes all the edge probability maps to the size of the original image, and their weighted average gives the final predicted edge map. The model uses three scales: 0.5, 1.0 and 1.5.
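The test-time procedure can be sketched for a single-channel image stored as a list of rows. The sketch assumes equal weights for the three scales, since the text does not give the weights, and `detector` stands for any function that maps an image to an edge probability map of the same size; both function names are illustrative.

```python
def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of a 2D list, aligning pixel centers."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out


def multiscale_edges(img, detector, scales=(0.5, 1.0, 1.5), weights=None):
    """Run the detector at several scales and average the resized outputs."""
    h, w = len(img), len(img[0])
    weights = weights or [1.0 / len(scales)] * len(scales)  # assumed equal weights
    fused = [[0.0] * w for _ in range(h)]
    for s, wt in zip(scales, weights):
        scaled = resize_bilinear(img, max(1, round(h * s)), max(1, round(w * s)))
        pred = resize_bilinear(detector(scaled), h, w)  # back to original size
        for i in range(h):
            for j in range(w):
                fused[i][j] += wt * pred[i][j]
    return fused


# Stand-in detector that returns its input unchanged, for demonstration only.
image = [[0.5] * 4 for _ in range(4)]
fused = multiscale_edges(image, lambda x: x)
print(len(fused), len(fused[0]))  # 4 4
```

In practice `detector` would be the trained network, and the weights could be tuned on the validation set.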

Model training
The BSDS500 [30] dataset and the PASCAL VOC Context dataset are widely used in edge detection. The BSDS500 dataset consists of 200 training images, 100 validation images and 200 test images, each labeled by 4 to 9 annotators. To prevent over-fitting, the 300 images in the training and validation sets of BSDS500 are augmented by rotation, scaling and cropping. The augmented BSDS500 data are then mixed with the PASCAL VOC Context dataset to form the training data.
The new network is implemented in Python 3 with the PyTorch 1.0.1 deep learning framework and several other libraries. The experiments are carried out on an Ubuntu server with an E5-2678 v3 2.50 GHz CPU and an NVIDIA Tesla K40C GPU with 12 GB of memory. The model is trained for 30 epochs with stochastic gradient descent. The batch size is set to 1, the base learning rate to 1e-6 (with different learning rates specified for different convolutional layers), the momentum to 0.9 and the weight decay to 0.0002. No pre-trained model is used during training, and the network parameters are initialized from a Gaussian distribution [31][32][33][34][35].

Experiments and analysis
Given an edge probability map, a threshold is required to produce a binary edge image, and there are two ways to set it. The first is the optimal dataset scale (ODS), which applies one fixed threshold to all images in the dataset. The second is the optimal image scale (OIS), which selects the best threshold for each image. ODS and OIS are commonly used as evaluation metrics for edge detection models.
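The difference between the two settings can be shown with a small sketch. It assumes the score at each threshold is the F-measure computed from true-positive, false-positive and false-negative counts, which is the usual convention but is not spelled out in the text; the counts are made up for illustration, and real benchmarks obtain them by matching predicted edges to the ground truth with a distance tolerance.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ods(counts):
    """Best F-measure when one threshold is shared by the whole dataset."""
    best = 0.0
    for t in range(len(counts[0])):
        tp = sum(img[t][0] for img in counts)
        fp = sum(img[t][1] for img in counts)
        fn = sum(img[t][2] for img in counts)
        best = max(best, f_measure(tp, fp, fn))
    return best


def ois(counts):
    """F-measure when every image uses its own best threshold."""
    tp = fp = fn = 0
    for img in counts:
        best = max(img, key=lambda c: f_measure(*c))
        tp, fp, fn = tp + best[0], fp + best[1], fn + best[2]
    return f_measure(tp, fp, fn)


# counts[image][threshold] = (tp, fp, fn); hypothetical numbers.
counts = [
    [(8, 2, 2), (5, 0, 5)],   # image A: better at threshold 0
    [(4, 6, 1), (4, 1, 1)],   # image B: better at threshold 1
]
print(round(ods(counts), 2))  # 0.72
print(round(ois(counts), 2))  # 0.8
```

In this example OIS is higher than ODS because each image is allowed its own best threshold.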

Experiments analysis
The non-maximum suppression technique [9] is applied to the edge map output by the model to obtain a thinned edge image, and the Edge Box toolkit is used for evaluation. Figure 8 shows the evaluation results. Compared with traditional methods, the RCF network achieves better edge detection results; the CFF model addresses the shortcomings of the RCF network, and its multi-scale strategy raises the ODS score to 0.818.

Figure 8. Evaluation results on BSDS500
The proposed method is compared with other related algorithms, and the results are shown in table 1. As the indicators in the table show, the ODS and OIS of the proposed model are 0.7% and 0.9% higher than those of the RCF model, respectively. The edge images output by the new method and by the RCF network are compared in figure 9. The comparison shows that some lines in the edge image generated by the RCF model are blurred, while the new model detects the edges in the image clearly and handles blurred details better.
Figure 9. Comparison of the results of the proposed model and RCF: (a) original, (b) ground truth, (c) RCF, (d) proposed.
To further demonstrate the improvements of the proposed model, figure 10 compares the edge images output at each stage by the proposed model and by the RCF network. In the figure, each column shows the edge images generated by stages 1 to 5 from top to bottom. Each stage of the RCF network handles irrelevant details poorly and contains some blurred lines. By integrating features of different levels across layers, the proposed model can attend to global contour information in the lower layers, which helps to fully integrate multi-scale features. As the figure shows, the edge images output by the proposed model contain few irrelevant details compared with RCF, especially in the first and second stages, which are free of excessive messy texture.

Ablation experiments
In this section, the internal structure of the proposed model is analyzed. As shown in table 2, introducing the CBAM attention module and anti-aliased down-sampling into the backbone network increases the ODS and OIS of the model by 0.4% and 0.5%, showing that the features extracted by the backbone of this model are richer and more effective. In addition, both ODS and OIS improve by a further 0.2% when the output features of different stages are fused across layers, indicating that multi-scale features can be fully fused by passing high-level features to low-level features. Finally, setting different loss weights for the stages, which balances the losses of the stages and suppresses the loss of low-level details, increases the ODS and OIS of the model by 0.1% and 0.2% respectively.

Conclusion
This paper proposes a global edge detection network based on the RCF network. In the new model, the CBAM module is added to the VGG16 backbone network and a down-sampling technique with translation invariance is adopted to improve the feature extraction capability of the network. Some of the down-sampling layers are removed to prevent the image resolution from becoming too low and hurting model accuracy. In the fifth stage, dilated convolution is used to enlarge the receptive field of the network. In addition, the model fuses features from deep layers to shallow layers, which makes the network pay more attention to global information, and sets different loss weights for the stages to balance their losses, preventing the model from excessively ignoring the details of the lower layers. Experiments show that the new model generates clearer edge images.