Region proposal network based on context information feature fusion for vehicle detection

Traditional methods extract insufficient feature information for vehicle detection, which leads to low accuracy on small or occluded vehicles. We therefore propose a region proposal network (RPN) based on context information feature fusion for vehicle detection. The RPN obtains fixed-length feature vectors as vehicle target features. A context information fusion network obtains the corresponding context information features from the feature maps of different layers. Finally, the two sets of features are fused. In addition, hard example training is used to address the problem of data imbalance. Experiments on the PASCAL VOC2007 and PASCAL VOC2012 data sets show that the proposed method significantly improves mean average precision (mAP) compared with other methods.


Introduction
Object detection refers to recognizing known objects in a given image and marking their locations with rectangular boxes. The Convolutional Neural Network (CNN) is one of the classic approaches to object detection [1][2][3]. At present, CNN-based target detection algorithms fall into two main categories [4]: 1) two-step methods based on candidate boxes [5], which require an algorithm such as selective search to select the candidates; 2) regression-based one-step methods [6], which use a CNN to directly predict the category and position of the target. Candidate-box methods have high accuracy but weak real-time performance. Regression-based methods need no candidate boxes and have strong real-time performance, but lower detection accuracy.
In real life, changes in imaging conditions and environment significantly affect the appearance of objects. Even for objects of the same category, factors such as imaging time, location, weather conditions, camera, background, lighting and viewing distance may reduce detection accuracy. Context information usually refers to any and all information that may affect the perception of the scene and the objects in it. Studies show that context information can improve the accuracy of target detection algorithms [7][8][9][10].
Current context information is mainly divided into two parts: 1) context information surrounding the target object, and 2) context information about the relationships between objects. Torralba [11] showed that recognition accuracy could be improved by using object shape and context information, and divided context features into three categories: semantic context (probability), spatial context (location) and scale context (size). Divvala et al. [12] evaluated more kinds of contextual information, such as local pixel context, geometric context and geographical environment. Liu et al. [13] proposed the Inside-Outside Net (ION). Outside the Region of Interest (ROI) [14,15], a spatial recurrent neural network was used to integrate the surrounding contextual information; inside, ROI features of different layers were extracted and fused. Liu et al. [16] proposed the Structure Inference Net (SIN), which used scene-level context information and the relationships between instances for target detection and recognition. Shrivastava et al. [17] proposed a Contextual Priming and Feedback network (CPF) to provide top-down contextual information feedback. Mei et al. [18] proposed Attention to Context CNN (AC-CNN), which used local multi-scale CNN features and generated global features through Long Short-Term Memory (LSTM) for target recognition. However, the extracted global information improved the detection accuracy by only 0.6%, and all of the above algorithms extracted the context information features of only a single layer, resulting in insufficient feature information. In addition, Wang et al. [19] proposed combining a front-end network with a multi-layer context module, but focused on the front-end network and applied it only to image segmentation.
In the training process of target detection, the target area in the whole image is smaller than the background area, that is, there are fewer positive samples than negative samples. If the classifier is trained directly on such unbalanced data, it may tend to classify all samples as negative. Shrivastava et al. [20] re-trained on the samples that were difficult to classify correctly and proposed Online Hard Example Mining (OHEM). All candidate boxes were propagated forward and sorted by loss, and the candidate boxes with large loss were retrained as hard examples, which effectively improved the accuracy of target detection.

RPN and anchors
Faster R-CNN is a two-stage target detection method: an input image is passed through convolutional layers to obtain a feature map, which is then sent to a Region Proposal Network (RPN) to obtain candidate boxes. Finally, the candidate boxes and feature maps are sent to the pooling layer and fully connected layers for classification.
Faster R-CNN defines anchors with three scales (8, 16, 32) and three aspect ratios (1:1, 1:2, 2:1), giving nine possible anchor shapes at each sliding-window position when the RPN slides over the feature map. Each point on the feature map therefore generates 9 anchors. In this paper, we define anchors with 6 scales (1, 2, 4, 8, 16, 32) and 3 height-width ratios (1:1, 1:2, 2:1), obtaining 18 kinds of anchors. The added small-scale anchors are conducive to the detection of small vehicle targets, as shown in figure 1.
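The anchor enumeration described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_anchors`, the base size of 16 pixels, and the convention that each anchor keeps the area `(base_size * scale) ** 2` are assumptions borrowed from common Faster R-CNN practice.

```python
from itertools import product
from math import sqrt


def make_anchors(base_size=16, scales=(1, 2, 4, 8, 16, 32), ratios=(1.0, 0.5, 2.0)):
    """Return one (width, height) pair per (scale, ratio) combination.

    ratio is height / width. base_size=16 is an assumed value, not
    taken from the paper. Each anchor keeps the area (base_size * scale) ** 2.
    """
    anchors = []
    for scale, ratio in product(scales, ratios):
        area = float(base_size * scale) ** 2
        width = sqrt(area / ratio)
        height = width * ratio
        anchors.append((width, height))
    return anchors


print(len(make_anchors()))                     # 18
print(len(make_anchors(scales=(8, 16, 32))))   # 9
```

With the three original scales the sketch yields the 9 anchors of Faster R-CNN; adding the scales 1, 2 and 4 gives the 18 anchors used in this paper.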

Network structure
The proposed structure is shown in figure 2. Faster R-CNN is taken as the baseline network, and VGG16 pretrained on ImageNet is used to initialize it. For each input image, feature maps are generated through successive convolutional and pooling layers. The Conv5 feature map is passed through the RPN to generate ROI features [21,22]. In Conv4 and Conv5, context information features of the same size and different scales are extracted for each ROI, and the concatenated context information features are then input into a 1×1 convolutional layer for normalization. Finally, the normalized context information features are fused with the ROI features to generate a fixed-length 512×7×7 feature descriptor for each proposal region. Two Fully Connected (FC) layers process each descriptor and produce two outputs: a K-class object prediction and an adjustment to the bounding box of the proposed region.

Context information extraction
Context information features play an important role in target detection, but Faster R-CNN only extracts ROI features and does not consider context information features. In addition, in the VGG network, a 2×2 max pooling follows each convolutional block from conv1 to conv4. After four max pooling operations, the feature maps are down-sampled to 1/16 of the original size. For example, a 32×32 area is reduced to 2×2 by the time it reaches conv5, and a 16×16 block becomes a single pixel. Therefore, when ROI Pooling samples to 7×7, a lot of information is lost, resulting in poor performance on small objects. In this paper, context information features are extracted at the conv4 and conv5 layers. As convolution and pooling layers are added, the resolution of the feature maps decreases gradually: the conv4 feature map is 28×28 and the conv5 feature map is 14×14. Compared with conv5, conv4 has higher resolution but less semantic information. Therefore, three context information features of 1.5x, 2x and 3x the candidate box size are extracted from conv4, and context information features of 1.5x, 2x and 4x are extracted from conv5. Using fixed multiples of the original candidate box makes the context regions easy to calculate from the original candidate box.
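The down-sampling arithmetic in this paragraph can be checked with a short sketch. The helper name `pooled_size` is hypothetical, and the 224×224 input size is an assumption inferred from the stated 28×28 and 14×14 feature map sizes.

```python
def pooled_size(size, num_pools, kernel=2):
    """Side length after num_pools non-overlapping kernel x kernel max-pooling steps."""
    for _ in range(num_pools):
        size //= kernel
    return size


print(pooled_size(32, 4))    # 2  (a 32x32 area at conv5)
print(pooled_size(16, 4))    # 1  (a 16x16 block becomes one pixel)
print(pooled_size(224, 3))   # 28 (conv4, assuming a 224x224 input)
print(pooled_size(224, 4))   # 14 (conv5, assuming a 224x224 input)
```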
Let $(x, y)$ be the center coordinate of the candidate box, and let $w$ and $h$ be its width and height respectively. The region from which context information features are extracted is $n$ times larger than the original candidate box and shares its center, so the candidate box for the context information features is:
$$(x,\ y,\ n w,\ n h)$$
In conv4, n=1.5,2,3. In conv5, n=1.5,2,4. Figure 3 shows the extracted context information features of Conv5.
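The enlarged context regions can be sketched as below. The function name `context_box` is hypothetical, and clipping the enlarged box to the feature map boundary is an assumption; the paper does not say how regions that extend past the border are handled.

```python
def context_box(x, y, w, h, n, fmap_w, fmap_h):
    """Scale a center-format box (x, y, w, h) by n about its center.

    Returns corner coordinates (x1, y1, x2, y2), clipped to the feature
    map. The clipping step is an assumption, not taken from the paper.
    """
    half_w, half_h = n * w / 2.0, n * h / 2.0
    x1 = max(0.0, x - half_w)
    y1 = max(0.0, y - half_h)
    x2 = min(float(fmap_w), x + half_w)
    y2 = min(float(fmap_h), y + half_h)
    return (x1, y1, x2, y2)


# A 4x2 box centered at (7, 7) on a 14x14 conv5 map, enlarged with n = 2.
print(context_box(7, 7, 4, 2, 2, 14, 14))  # (3.0, 5.0, 11.0, 9.0)

# The three conv5 context regions for the same box.
conv5_regions = [context_box(7, 7, 4, 2, n, 14, 14) for n in (1.5, 2, 4)]
```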


Normalization operation
After the context information features of different layers are extracted, they are fused by a concat layer. Features extracted from different convolutional layers of a neural network have different scales, so if they are simply combined, training will be unstable. Normalization is required to match the order of magnitude of the context information features to that of the ROI features. This paper mainly studies two options: an L2 normalization layer and normalization by a 1×1 convolution layer. In a CNN, the L2 normalization layer is applied to convolutional features, whereas an unrolled Recurrent Neural Network (RNN) is effectively a fully connected layer whose activation and structure differ from those of a CNN. Therefore, the L2 norm is far less effective on a CNN than on an RNN. In contrast, a 1×1 convolution kernel can reduce or raise the channel dimension to normalize the dimensions of different features, which is simpler than the L2 norm. In addition, a 1×1 convolution layer is required to ensure that the feature length before the fully connected layer is consistent with the original method. Therefore, the 1×1 convolution layer conv6 is used in this paper to normalize the extracted context information features.
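To make the role of the 1×1 convolution concrete, the sketch below applies one to a tiny channel-first feature map in plain Python. It only illustrates how a 1×1 kernel mixes channels at each spatial position and changes the channel count. The function name `conv1x1` and the toy weights are hypothetical, and the paper's conv6 layer would learn its weights during training.

```python
def conv1x1(fmap, weights, bias):
    """Apply a 1x1 convolution.

    fmap:    nested list of shape [C_in][H][W]
    weights: nested list of shape [C_out][C_in]
    bias:    list of length C_out
    Returns a nested list of shape [C_out][H][W].
    """
    c_in, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = []
    for w_row, b in zip(weights, bias):
        plane = [
            [b + sum(w_row[k] * fmap[k][i][j] for k in range(c_in))
             for j in range(width)]
            for i in range(height)
        ]
        out.append(plane)
    return out


# Two concatenated channels (1x2 spatial size) reduced to one channel.
concat = [[[1.0, 2.0]], [[3.0, 4.0]]]
print(conv1x1(concat, weights=[[0.5, 0.5]], bias=[0.0]))  # [[[2.0, 3.0]]]
```

Because the kernel covers a single pixel, the spatial size is unchanged and only the number of channels is altered, which is how the concatenated context features can be brought back to the 512 channels of the ROI features.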

Fusion strategy
A new layer (the Eltwise layer) is added to the original network structure. This layer has multiple inputs, one output, and three operations (product, max, sum). product multiplies the corresponding elements. This operation makes the system relatively unstable and vulnerable to the influence of the weaker branch, which makes it difficult for the network to converge. max takes the maximum value of the corresponding elements, which is to some extent equivalent to an ensemble model within the network. When neither branch can detect the object, however, max loses the mutual-support advantage that sum provides [23]. sum adds the corresponding input elements, and the layer defines a coeff parameter that is used to adjust the weight of each input. In this paper, the features are fused by element-wise summation.
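The three element-wise operations can be compared with a small sketch on flattened feature vectors. The function `eltwise` is a hypothetical stand-in written for illustration and is not Caffe's implementation; as assumed here, `coeff` applies only to the sum operation.

```python
from functools import reduce
import operator


def eltwise(inputs, op="sum", coeff=None):
    """Combine equally sized flat feature vectors element by element."""
    if len({len(v) for v in inputs}) != 1:
        raise ValueError("all inputs must have the same length")
    columns = list(zip(*inputs))
    if op == "sum":
        # coeff weights each input; defaults to plain summation.
        coeff = coeff or [1.0] * len(inputs)
        return [sum(c * v for c, v in zip(coeff, col)) for col in columns]
    if op == "prod":
        return [reduce(operator.mul, col) for col in columns]
    if op == "max":
        return [max(col) for col in columns]
    raise ValueError("op must be 'sum', 'prod' or 'max'")


roi_features = [1.0, 0.0, 2.0]
context_features = [0.5, 3.0, 0.0]
print(eltwise([roi_features, context_features], "sum"))   # [1.5, 3.0, 2.0]
print(eltwise([roi_features, context_features], "prod"))  # [0.5, 0.0, 0.0]
print(eltwise([roi_features, context_features], "max"))   # [1.0, 3.0, 2.0]
```

The product output shows why that operation is sensitive to the weaker branch: a zero response in either input suppresses the fused value entirely, whereas sum preserves the contribution of the other branch.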

Negative sample classification
To counter the degradation of classifier performance caused by too many negative samples during target detection, hard examples are automatically selected and added to training as it proceeds, which makes training faster and more effective. The feature maps and all ROIs are propagated forward, followed by classification and regression, and the loss of each ROI is calculated, including the classification (cls) and localization (dec) terms. After non-maximum suppression (NMS), the ROIs are sorted by loss and the B ROIs with the largest loss are selected and fed to the network again (forward and backward passes) for learning and gradient propagation. Here B = 128, which is jointly determined by the number of images per batch during training (N = 1 in this paper) and the training mini-batch size (128 in this paper). During training, the overlap between predicted candidate boxes and annotated candidate boxes, Intersection-over-Union (IoU), is adopted as the evaluation index without human intervention. Foreground ROIs are the candidate regions whose IoU with an annotated box is greater than 0.5; background ROIs are the candidate regions whose IoU lies in the interval [g, 0.5), where g is set to 0.0 in this paper.
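The selection step can be sketched as follows, under simplifying assumptions: boxes are given as corner tuples (x1, y1, x2, y2), the NMS step is omitted, and the per-ROI losses are supplied by the caller. The names `iou`, `label_roi` and `select_hard_examples` are hypothetical, and reading the two IoU ranges as foreground and background labels is an interpretation of the text.

```python
def iou(a, b):
    """Intersection-over-Union of two corner-format boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_roi(roi, gt_boxes, g=0.0, fg_thresh=0.5):
    """Return 1 (foreground), 0 (background) or -1 (ignored) for one ROI."""
    best = max((iou(roi, gt) for gt in gt_boxes), default=0.0)
    if best > fg_thresh:          # "greater than 0.5", as worded in the text
        return 1
    if g <= best < fg_thresh:     # the interval [g, 0.5)
        return 0
    return -1


def select_hard_examples(losses, batch_size=128):
    """Indices of the batch_size ROIs with the largest loss, hardest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:batch_size]


print(select_hard_examples([0.1, 0.9, 0.5, 0.7], batch_size=2))  # [1, 3]
```

With g = 0.0 every ROI below the foreground threshold is eligible as a background example, so the hard example selection, rather than a fixed IoU floor, decides which negatives are trained on.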

Experimental configuration
The experimental environment was as follows: Ubuntu 16.04, 16 GB memory, an Intel i7-9700K 3.60 GHz CPU, a GeForce RTX2070 with 8 GB video memory, and the deep learning platform Caffe combined with CUDA 8.0 and cuDNN v5.1. First, the VGGNet-16 network pretrained on the ImageNet data set is used for initialization, with an initial learning rate of 0.001 and momentum of 0.9, and Stochastic Gradient Descent (SGD) is used for optimization. Compared with Faster R-CNN, the threshold range for negative samples is set differently: candidates whose IoU falls in [0.0, 0.5) are regarded as negative samples, in order to reduce the difference between the numbers of negative and positive samples. After the threshold range was changed, the trained model was more balanced. At test time, however, the threshold range for negative samples is the same as in Faster R-CNN. Both the proposed method and the baseline are implemented in the Caffe framework, and end-to-end training is adopted without the need to train multiple models, which greatly shortens the training time.

EAI Endorsed Transactions on Scalable Information Systems Online First
Zengyong Xu

R E T R A C T E D

Experimental data sets and comparative methods
The data set used in this paper is PASCAL VOC, which contains 20 classes. The PASCAL VOC2007 dataset contains 9963 images and 24,640 objects, of which the training and validation sets together contain 5011 images and the test set contains 4952 images.
There are 11540 images in the training set and validation set on the PASCAL VOC2012 dataset, with a total of 27450 objects [24,25]. Table 1 shows the statistics of the PASCAL VOC2007 dataset, which is divided into two main subsets: Trainval and Test. Trainval data is further divided into training (Train) and validation (Val) sets. By counting the number of images for each subset and class separately, it can be seen that there is a particularly large amount of data for the category person, which in previous datasets was basically a synonym for "pedestrian". However, on VOC data set, there are images of people engaged in various activities, such as horse riding and car riding, which make the detection of Person more complicated and increase the difficulty of target detection. In addition, the images in the PASCAL VOC dataset were not taken specifically for object recognition purposes. An image usually contains multiple categories of objects, which may lead to objects extending to the outside of the image or shielding each other between objects in an image, increasing the difficulty of target detection. Therefore, it is not enough to rely only on the features of the object itself for target detection.
In this paper, mean average precision (mAP) is used as the evaluation index of detection accuracy. The comparison methods are as follows: Fast R-CNN, Fast R-CNN+context, Faster R-CNN, Non-multi Layer Context (Non-MLC), AC-CNN, Single Shot Multibox Detector (SSD300), ION, SIN, Hyper Network (HyperNet), CPF, proposed network.
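As a reminder of how the metric is formed, the sketch below computes average precision for one class with 11-point interpolation and then averages over classes. The paper does not state which AP variant it uses; the 11-point form is assumed here because it is the PASCAL VOC2007 convention. The function names and the input format (one score and one true-positive flag per detection) are hypothetical.

```python
def average_precision_11pt(scores, is_true_positive, num_gt):
    """11-point interpolated AP for one class.

    scores:           confidence of each detection
    is_true_positive: whether each detection matched a ground-truth box
    num_gt:           number of ground-truth boxes of this class
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    ap = 0.0
    for k in range(11):
        threshold = k / 10
        # Best precision achieved at any recall >= threshold (0 if none).
        best = max((p for p, r in zip(precisions, recalls) if r >= threshold),
                   default=0.0)
        ap += best / 11
    return ap


def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)
```

For example, a class whose only ground-truth box is found by the second-ranked of two detections has precision 0.5 at every recall level, so its AP under this sketch is 0.5.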

Normalization method and fusion strategy
The results of testing different normalization approaches on the PASCAL VOC2007 dataset are as follows. The mAP with the L2 normalization layer is 74.5%, and the mAP with the 1×1 convolution layer is 77.4%. The L2 normalization layer does not reduce the loss or improve network performance, so its effect is no different from leaving the L2 normalization layer out. The 1×1 convolutional layer improves mAP by 2.9 percentage points. Therefore, the 1×1 convolution layer is selected for the normalization operation [26][27][28].
The mAP of the three fusion strategies is as follows: sum is 77.4%, prod has no data, and max is 76.5%. The element-by-element summation strategy thus has the best performance, so the context information features and the object's own features are fused by element-wise summation.

Experimental result
The public VGGNet-16 model is selected for initialization. Although a ResNet version of Faster R-CNN can improve detection accuracy, ResNet is slower than VGGNet-16 and requires more training time. In addition, the emphasis of this paper is on verifying that context information features improve the accuracy of target detection. For these reasons, the VGGNet-16 model is selected.
Training is performed on the PASCAL VOC2007 dataset and the presented method is evaluated on the PASCAL VOC2007 dataset. Table 2 compares the test results of the five methods. Fast R-CNN+context adds context information to Fast R-CNN without adopting other optimization strategies and obtains 67.3% mAP, which is 1.4 percentage points higher than Fast R-CNN. Because this paper addresses vehicle detection, we only display the results for vehicles. In the proposed method, anchors with six scales (1, 2, 4, 8, 16, 32) and three aspect ratios (1:1, 1:2, 2:1) are used, giving 18 anchors of different sizes to strengthen the detection of small targets. The whole network is trained for 80,000 iterations. The learning rate is 0.001 for the first 50,000 iterations and is reduced to 0.0001 for the last 30,000 iterations. The momentum parameter is set to 0.9, the weight decay parameter to 0.0005, and the effective batch size to 2. If the 18 anchors of different sizes are adopted in Faster R-CNN without adding context information features, 69.3% mAP is obtained. The proposed method obtains 71.1% mAP, which is 2.6 percentage points higher than Faster R-CNN. It can be seen that context information features are helpful to target detection and play a great role in improving accuracy. Faster R-CNN takes 0.139 s to test an image, while the proposed method takes only 0.117 s. Therefore, the proposed method is more effective in both time and accuracy.
The PASCAL VOC2007 training set and the PASCAL VOC2012 training set were used jointly for training, and 8 methods were evaluated on the PASCAL VOC2007 data set. The test results are shown in Table 3. As the training set grows, the number of training iterations needs to be increased. The learning rate is set to 0.001 for the first 60,000 iterations and to 0.0001 for the last 40,000 iterations. The proposed method achieves 77.4% mAP on the PASCAL VOC2007 data set, which is 4.2 percentage points better than Faster R-CNN without context information, and also better than the other methods that use context information. Among the 20 categories in the dataset, the proposed method improves the results most effectively for categories that are easily occluded or mistaken for background, such as sofas, people, tables and chairs. In addition, the detection accuracy for vehicles with specific backgrounds, buses and other categories is also greatly improved, which shows that the added context information assists the detection of these categories.
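The two learning rate schedules reported in this section follow the same single-step pattern, which can be written as a small helper. The function name `step_learning_rate` and the parameter `gamma` are hypothetical; only the rates and iteration counts come from the text.

```python
def step_learning_rate(iteration, base_lr=0.001, step=50000, gamma=0.1):
    """Learning rate at a given iteration for a single-step decay schedule."""
    return base_lr * (gamma if iteration >= step else 1.0)


# VOC2007 only: 80,000 iterations, rate drops after 50,000.
voc07 = [step_learning_rate(i, step=50000) for i in (0, 49999, 50000, 79999)]

# VOC2007 + VOC2012: 100,000 iterations, rate drops after 60,000.
voc0712 = [step_learning_rate(i, step=60000) for i in (0, 59999, 60000, 99999)]
```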

Conclusion
This paper proposes a vehicle target detection algorithm based on an RPN. Context information plays an important auxiliary role in target detection. To make effective use of it, the proposed method fuses context information features with the features of the object itself. Comparative experiments on different data sets show the effectiveness of the proposed method. Compared with both methods that use no context information and some methods that do, the detection results of the proposed method are clearly improved. In particular, extracting context information from an earlier layer enriches the feature information available for detection, avoids some loss of target information, and improves detection accuracy for small or easily occluded targets. Future work will address the adaptive selection of context information to further improve the accuracy of target detection.