A novel one-stage object detection network for multi-scene vehicle attribute recognition

In recent years, computer vision has been widely used in scientific research and civil applications. Object detection underlies many advanced visual tasks and is therefore important both to computer vision research and to practical applications. Research on object detection based on deep learning has made remarkable progress. However, in adverse conditions such as rain, fog, night and the lack of a visible light source, visual distance and visibility are very poor and the captured images cannot be used normally, which degrades object detection results. To address this problem, this paper proposes a novel one-stage object detection network for multi-scene vehicle attribute recognition, covering mainly vehicle type and color attributes. The one-stage object detection network YOLOv3 is used as the base network, and the GIOU loss function replaces the MSE loss function. Experimental results show that the accuracy of the proposed algorithm improves significantly on public data sets.


Introduction
With the rapid development of the social economy, living standards have risen steadily. Transportation has gradually shifted from walking to various motor vehicles, and traffic problems have become more and more complex. Intelligent Transportation Systems (ITS) are an ideal solution to the traffic problems caused by current economic development. Motor vehicles are important participants in traffic, so effective identification of their attributes plays an important role in ITS [1][2][3][4]. In typical surveillance video, the image is often unclear and the license plate may be blocked, smeared or corroded, so the license plate number cannot be accurately located and identified. In such cases it is particularly important to identify other attribute information of the vehicle quickly and accurately. For example, accurate identification of appearance characteristics such as vehicle type and color can make up for failed license plate identification, supplement license plate identification results and provide more comprehensive vehicle information. It can also improve the reliability and safety of ITS and greatly improve the intelligence of vehicle traffic management. Quickly and accurately identifying the type and color of vehicles, and analyzing the identification results accurately, can effectively serve ITS and can also support a vehicle information database for vehicle information retrieval, which will greatly improve [...] used the support vector machine (SVM) [13] method to classify and recognize vehicle colors. They extracted color features from the same Region of Interest (ROI) of each vehicle, used several different feature combinations and tested each combination. Finally, an average accuracy of 87.52% was achieved.
Xue et al. [14] applied different processing methods to images under different lighting conditions to weaken the influence of lighting on color recognition and improve its accuracy. However, they had to spend a lot of time manually classifying images during image processing. Ruan et al. [15] first used a Faster R-CNN network to detect vehicles, and then modified the GoogLeNet structure by connecting multiple loss layers behind the fully connected layer to recognize vehicle identity, posture and color attributes, achieving 85% accuracy.
In summary, there are many studies and achievements in the field of vehicle attribute recognition, but most are based on single-task learning of a single attribute, and few address more complex multi-attribute learning tasks. It is difficult to locate a specific vehicle through only one of its attributes. The identification process also generally takes a long time and cannot be applied effectively in practice. Moreover, the more external attributes of a vehicle that can be determined, the more they help to locate a specific vehicle. A vehicle has multiple attribute descriptions. For example, by type it can be described as car, SUV, van, etc., and by color it can be described as red, white, black, etc. Based on the above analysis, this paper constructs a vehicle multi-attribute data set from network searches and real captured data, and proposes a vehicle attribute identification method based on a deep neural network. The improved YOLOv3 network is trained on the global image area, the vehicle type and the vehicle color to detect vehicle type and color attributes, which improves the applicability of the model.

YOLOv3 network
At present, deep learning (DL) is widely used in the field of target detection and recognition [16][17][18]. In recent years, DL-based algorithms have made breakthroughs in image classification and object detection. When processing visual tasks, deep learning architectures learn more effective representations from raw images than traditional methods based on hand-crafted features, and they perform better. CNNs first showed their practicality in digit recognition. Krizhevsky et al. [19] first applied a CNN to large-scale image classification and achieved high performance on the ImageNet dataset, which was the largest and most challenging image dataset at the time.
In this paper, the YOLOv3 network is used as the prototype. It uses the skip-layer connections of the residual network (ResNet) [20,21] to increase the network depth while still allowing the network to converge, and it achieves end-to-end target detection and identification. The YOLOv3 neural network predicts the objects in an image and their positions by looking at the image only once. Detection frames are extracted directly from the image, and target objects are detected using the features of the whole image. In this process, the input image is first divided into an S×S grid, and B detection boxes are then predicted for each grid cell. Each detection box contains five predicted values: x, y, w, h and confidence. x and y are the center coordinates of the detection box, and w and h are its width and height. Confidence is the confidence of the category to which the detection box belongs. The loss function of each detection frame contains four parts, as shown in equation (1):
$$loss = xy\_loss + wh\_loss + conf\_loss + cls\_loss \quad (1)$$
where cls_loss is the classification error of a detection frame. The loss function is divided into two parts: the part with an object and the part without an object [22]. A weight coefficient is applied to the loss of the part without an object. Most of an image does not contain the object to be detected, so the no-object part contributes far more terms to the loss than the part with objects, which biases the network toward predicting grid cells without objects. Therefore, the weight coefficient is applied to the no-object part to reduce its contribution to the loss. Its value in this paper is 0.5.
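The grid assignment and the down-weighting of the no-object part can be illustrated with a minimal Python sketch. It assumes a sum-squared confidence error and box centers given in pixels; the function names `responsible_cell` and `confidence_loss` are hypothetical and are not taken from the paper, and only the weight 0.5 comes from the text.

```python
def responsible_cell(cx, cy, img_w, img_h, S):
    """Map a box center (in pixels) to the S x S grid cell that contains it.

    Returns (row, col, dx, dy), where dx and dy are the offsets of the
    center inside the cell, measured in cell units.
    """
    gx = cx / img_w * S
    gy = cy / img_h * S
    col = min(int(gx), S - 1)  # clamp centers lying on the right/bottom edge
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row


def confidence_loss(pred, target, has_object, lambda_noobj=0.5):
    """Sum-squared confidence error with the no-object part down-weighted.

    The sum-squared form is an assumption; lambda_noobj = 0.5 is the
    weight coefficient stated in the text.
    """
    loss_obj = 0.0
    loss_noobj = 0.0
    for p, t, obj in zip(pred, target, has_object):
        err = (p - t) ** 2
        if obj:
            loss_obj += err
        else:
            loss_noobj += err
    return loss_obj + lambda_noobj * loss_noobj


# A center at (208, 104) in a 416 x 416 image with S = 13
print(responsible_cell(208.0, 104.0, 416, 416, 13))  # (3, 6, 0.5, 0.25)

# One cell with an object, two without
print(confidence_loss([0.5, 0.5, 0.5], [1.0, 0.0, 0.0], [True, False, False]))  # 0.5
```

In the second call the two no-object cells together contribute 0.5 before weighting and 0.25 after, so they no longer outweigh the single cell that contains an object.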
The central coordinate loss xy_loss is defined in equation (2). Formula (2) states that when the j-th detection frame of the i-th grid cell is responsible for a real target, the bounding box generated by the detection frame is compared with that of the real target, and the loss of the center coordinates is calculated. The width and height coordinate loss wh_loss is defined in equation (3). Formula (3) states that when the j-th detection frame of the i-th grid cell is responsible for a real target, the bounding box generated by the detection frame is compared with the bounding box of the real target, and the width and height coordinate loss is calculated.
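For orientation, the two coordinate terms can be written in the sum-squared form used in the original YOLO formulation. This is a sketch under assumptions rather than the exact equations (2) and (3) of this paper: the coordinate weight $\lambda_{coord}$, the indicator $\mathbb{1}_{ij}^{obj}$ (equal to 1 when the j-th detection frame of the i-th grid cell is responsible for a real target, and 0 otherwise) and the square roots on width and height are taken from that formulation, and hatted symbols denote the values of the real target.

```latex
xy\_loss = \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj}
           \left[ (x_{ij} - \hat{x}_{ij})^2 + (y_{ij} - \hat{y}_{ij})^2 \right]

wh\_loss = \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj}
           \left[ \left(\sqrt{w_{ij}} - \sqrt{\hat{w}_{ij}}\right)^2
                + \left(\sqrt{h_{ij}} - \sqrt{\hat{h}_{ij}}\right)^2 \right]
```

Both terms vanish for every detection frame that is not responsible for a real target, which matches the condition stated for formulas (2) and (3).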
The confidence loss conf_loss is defined in equation (4). The first term in formula (4) is the confidence error of detection frames that contain an object. The second term is the confidence error of detection frames that contain no object. The classification loss cls_loss is defined in formula (5). Formula (5) indicates that the classification loss is calculated only when the j-th detection frame of the i-th grid cell is responsible for a real target. The proposed network structure is shown in figure 1. Figure 2 shows the spatial pyramid pooling (SPP) module [23,24], and figure 3 is the flow chart of the algorithm in this paper.
In YOLOv3, the mean square error (MSE) is used as the loss function to regress the center point, width and height of the BBox. However, this treats each coordinate of the BBox as an independent variable without considering the integrity of the object frame, and the $\ell_n$-norm is sensitive to the scale of the object. To solve these problems, Yu et al. [25] proposed replacing the traditional MSE loss function with an IOU loss function based on the intersection over union (IOU) between the predicted BBox and the ground truth (GT). However, when the real frame and the prediction frame do not overlap, IOU is 0, so the gradient is 0 and cannot be back-propagated. Reference [26] shows intuitively that detection results with the same $\ell_n$-norm value can differ in quality. Therefore, GIOU and the GIOU loss function are proposed, which are expressed as follows:
$$GIOU = IOU - \frac{A_c - U}{A_c}, \qquad L_{GIOU} = 1 - GIOU$$
where $A_c$ is the area of the smallest box enclosing the BBox and GT, and $U$ is the area of the union of the BBox and GT. GIOU, like IOU, can be regarded as a distance measure, meets the basic requirements of a loss function and is scale invariant.
It also resolves the zero-gradient problem that arises with the IOU loss when the prediction frame and the real frame do not intersect.
According to equation (8), this method can effectively reduce the positioning inaccuracy of YOLOv3.
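The behaviour of GIOU for non-overlapping boxes can be checked with a short Python sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) form with positive area; the function names are illustrative and this is not the training code used in the paper.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection area (zero when the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter  # U: area covered by either box
    iou = inter / union

    # A_c: area of the smallest box enclosing both boxes
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    area_c = enclose_w * enclose_h

    return iou - (area_c - union) / area_c


def giou_loss(box_a, box_b):
    """GIoU loss, 1 - GIoU, which ranges from 0 to 2."""
    return 1.0 - giou(box_a, box_b)


gt = (0.0, 0.0, 1.0, 1.0)
near = (2.0, 0.0, 3.0, 1.0)  # no overlap, one unit away from gt
far = (5.0, 0.0, 6.0, 1.0)   # no overlap, four units away from gt

print(giou_loss(gt, gt))
print(giou_loss(gt, near), giou_loss(gt, far))
```

Both `near` and `far` have an IOU of 0 with `gt`, so an IOU loss cannot tell them apart. The GIOU loss is larger for `far` than for `near`, which gives the optimizer a direction in which to move the predicted box.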
In addition, to further improve the expressive power of the YOLOv3 features, an SPP module based on the idea of spatial pyramid pooling is added to the network. After the feature map originally used for target detection passes through SPP, global and local features are fused, which enriches the expressive ability of the feature maps and enlarges their receptive field. This benefits detection when the sizes of the targets in the image span a relatively large range. If the data are fed directly into YOLOv3 for training, over-fitting easily occurs. However, after preprocessing of the data set reduces the difference between RGB and infrared images, the weights obtained from vehicle target detection on RGB images are used as the initial weights for vehicle target detection, which reduces both the amount of data the network requires and the training time.
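The fusion performed by an SPP block can be sketched in plain Python. The sketch assumes the layout commonly used in YOLOv3-SPP configurations: stride-1 max pooling with kernel sizes 5, 9 and 13, concatenated with the input along the channel axis. The paper does not state its kernel sizes, so these values and the function names are assumptions.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling over a 2-D list, keeping the spatial size.

    k must be odd. Positions outside the map are ignored, which is
    equivalent to padding with negative infinity.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [fmap[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp(channels, kernel_sizes=(5, 9, 13)):
    """Concatenate the input channels with their pooled copies.

    channels is a list of 2-D maps. The output has
    len(channels) * (1 + len(kernel_sizes)) maps of unchanged spatial size.
    """
    out = list(channels)
    for k in kernel_sizes:
        out.extend(max_pool_same(c, k) for c in channels)
    return out


# A 13 x 13 map with a single activation at the center
fmap = [[0.0] * 13 for _ in range(13)]
fmap[6][6] = 1.0
fused = spp([fmap])
print(len(fused), len(fused[0]), len(fused[0][0]))  # 4 13 13
```

Each larger kernel spreads the central activation over a wider neighbourhood, so the concatenated output carries the same feature at several receptive-field sizes while the spatial resolution stays unchanged.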

Vehicle attribute recognition network
A deeper network gives higher accuracy in target detection and identification, but it also takes longer to run. In ordinary static image recognition the time factor is not very prominent, but video applications must run in real time, so the time needed to detect and identify targets in one video frame is an important criterion. This experiment covers two attribute categories, vehicle type and vehicle color. Because color is distributed differently over different vehicle types, for example over the windows and the front hood, the whole image area of the vehicle is used as the ROI for color training, which conflicts with the ROI for vehicle type. Therefore, the vehicle type and vehicle color attributes are trained separately, which allows different network structures to be used and avoids the conflict between the vehicle type ROI and the vehicle color ROI. Using a different network structure for each of the two modules during training and then calling the models together can greatly improve the accuracy and detection time of vehicle attribute recognition. Figure 4 shows the vehicle attribute recognition algorithm of this experiment. In figure 4, the input image is first resized to 416×416. A CNN is then run on the image for feature extraction. Finally, the detection frames are screened by applying a threshold to the confidence output by the model. In this experiment, the YOLOv3 neural network is used to build the vehicle attribute recognition model. Figure 5 is the structure diagram of the vehicle type recognition network, in which the following notation is used.
resn: n is a number (res1, res2, ..., res8), indicating that a res_block contains n res_units.
concat: tensor splicing, which joins the up-sampling of a middle layer with a later layer so that the network learns deep and shallow features at the same time and expresses them better.
up: up-sampling.
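The final screening step of figure 4 can be sketched as a simple filter over the detections produced by the network. The dictionary layout and the threshold value 0.5 are assumptions made for illustration; the paper does not state the threshold it uses.

```python
def filter_detections(detections, conf_threshold=0.5):
    """Keep detections whose confidence reaches the threshold, best first.

    Each detection is a dict with at least a "confidence" key.
    The default threshold is hypothetical.
    """
    kept = [d for d in detections if d["confidence"] >= conf_threshold]
    return sorted(kept, key=lambda d: d["confidence"], reverse=True)


raw = [
    {"label": "car", "confidence": 0.62, "box": (40, 60, 200, 180)},
    {"label": "van", "confidence": 0.18, "box": (35, 55, 210, 190)},
    {"label": "suv", "confidence": 0.91, "box": (220, 80, 400, 260)},
]
for det in filter_detections(raw):
    print(det["label"], det["confidence"])
```

Raising the threshold reduces false detections at the cost of more missed detections, so in practice the value would be tuned on validation data.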
The network extracts features through several DBL components and res residual units, and then learns the features of each layer by using concat to splice and fuse the up-sampling of the middle layer with a subsequent layer. Finally, outputs at three scales are produced, namely [y1, y2, y3] in figure 5. The bottom-layer information contains global features and the middle-layer information contains local features, and splicing them takes both into account. In addition, the idea of Feature Pyramid Networks (FPN) [27,28] is used to detect objects of different sizes at multiple scales. The finer the grid cells, the smaller the objects that can be detected.
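The sizes of the three output scales follow from the input size by simple arithmetic. The sketch below assumes the standard YOLOv3 settings of strides 32, 16 and 8 and three anchor boxes per scale, and takes the six vehicle types listed in the model training section as the classes; the paper does not list these settings for figure 5, so they should be read as assumptions.

```python
def yolo_output_shapes(input_size=416, strides=(32, 16, 8),
                       anchors_per_scale=3, num_classes=6):
    """Return (grid, grid, channels) for each output scale.

    Each anchor predicts x, y, w, h and confidence plus one score per
    class, giving anchors_per_scale * (5 + num_classes) channels.
    """
    channels = anchors_per_scale * (5 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


for name, shape in zip(("y1", "y2", "y3"), yolo_output_shapes()):
    print(name, shape)
# y1 (13, 13, 33)
# y2 (26, 26, 33)
# y3 (52, 52, 33)
```

Under these assumptions the 52×52 output has the finest grid cells and is the one best suited to small targets, while the 13×13 output covers large targets.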
The vehicle type recognition module needs many feature points: the deeper the feature sampling network, the more vehicle type feature points are collected and the higher the classification accuracy. In the vehicle color recognition module, vehicle color is distributed more evenly and there are fewer feature points. Only the color pixels in each ROI need to be computed, so a feature sampling network with a few simple layers is sufficient. Based on the YOLOv3 neural network, some of its convolutional layers are retained, a pooling layer is added after each convolutional layer, and a new 23-layer network structure is then assembled by tensor splicing. Training the vehicle color samples on this network reduces the overall time of vehicle attribute recognition. Figure 6 shows the improved network structure for vehicle color recognition, where DP is the combination of a convolution layer and a pooling layer. In this network, features are extracted through multiple DP components and several convolution layers. Tensor splicing and fusion are then carried out through concat to learn attribute features. Finally, outputs at two scales are produced, namely [y1, y2] in figure 6. The DP combination simplifies the network structure, reduces the network depth and greatly shortens detection and recognition time without affecting recognition accuracy. Combined with the vehicle type recognition model, calling the vehicle color model for color recognition once the network has detected the vehicle type improves the accuracy of vehicle attribute recognition and shortens the overall recognition time.
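The way the two separately trained models are called together can be sketched as a two-stage pipeline. The sketch assumes that the color model is run on the region returned by the type detector; the paper does not specify the exact interface, so `recognize_attributes` and both stub models below are hypothetical stand-ins for the trained networks.

```python
from collections import Counter


def recognize_attributes(image, type_detector, color_classifier):
    """Run the type detector, then the color model on each detected region.

    image is a 2-D list of pixels; boxes are (x1, y1, x2, y2) with
    exclusive upper bounds. Both models are passed in as callables.
    """
    results = []
    for det in type_detector(image):
        x1, y1, x2, y2 = det["box"]
        crop = [row[x1:x2] for row in image[y1:y2]]
        results.append({
            "box": det["box"],
            "type": det["type"],
            "color": color_classifier(crop),
        })
    return results


def stub_type_detector(image):
    """Hypothetical stand-in for the vehicle type network."""
    return [{"box": (1, 1, 3, 3), "type": "car"}]


def stub_color_classifier(crop):
    """Hypothetical stand-in for the color network: most frequent pixel label."""
    pixels = [p for row in crop for p in row]
    return Counter(pixels).most_common(1)[0][0]


image = [
    ["gray", "gray", "gray", "gray"],
    ["gray", "red", "red", "gray"],
    ["gray", "red", "red", "gray"],
    ["gray", "gray", "gray", "gray"],
]
print(recognize_attributes(image, stub_type_detector, stub_color_classifier))
```

Passing the models in as callables keeps the two networks independent, which mirrors the separate training of the type and color modules.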

Model training
Before model training, the vehicle image sample data set required for feature learning must be prepared. In addition to the original image samples, the regions of interest required by network training, namely the label data set of the samples, must also be prepared. In this label data set, the training data are manually annotated with the region of interest and the category name, which improves the accuracy of the region of interest, reduces noise interference and makes feature extraction more effective. In this paper, vehicles are divided by type into bus, car, coach, truck, SUV and van, and by color into black, blue, gray, green, orange, purple, red, white, champagne, yellow and silver gray.
The model is trained on a GTX 1080Ti GPU. Before training, some parameters need to be adjusted. The training data set was divided into 64 batches, and the number of each batch was set to 4, so as to reduce the GPU burden and iterate at the fastest speed. The higher the learning rate is set during training, the higher the recognition accuracy of the resulting model [29][30][31]. However, this parameter cannot be set arbitrarily. A learning rate that is too high biases the network toward fitting only the training data set, so the learning rate in this experiment is set to 0.001. The curves of the training loss and the learning rate during training are shown in figure 7. Figure 7(a) is the average loss curve during training; the lower the average loss falls, the better the model has learned. Figure 7(b) is the learning rate curve during training; it reflects how strongly the model learns attributes during training and is expected to approach the value set before training, namely 0.001. As can be seen from figure 7, the training loss finally decreases to 0.02, which meets the training requirements. However, the learning rate drops sharply after 400000 iterations, indicating that the model is over-fitting at this point. Therefore, combined with the average loss curve, the optimal number of iterations should end at 400. At this point, the average loss is still at its lowest and the learning rate is at its highest. Stopping training at the right time reduces the influence of over-fitting and improves the identification accuracy of the model.
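Choosing where to stop from a logged loss curve can be sketched as follows. The log values are invented purely for illustration and do not reproduce figure 7, and the function name `best_checkpoint` is hypothetical.

```python
def best_checkpoint(loss_log):
    """Return the iteration with the lowest average loss.

    loss_log is a list of (iteration, average_loss) pairs. When several
    iterations share the lowest loss, the earliest one is returned.
    """
    iteration, _ = min(loss_log, key=lambda rec: (rec[1], rec[0]))
    return iteration


# Hypothetical log: the loss flattens and then rises slightly
log = [(100, 0.90), (200, 0.31), (300, 0.05), (400, 0.02), (500, 0.02), (600, 0.03)]
print(best_checkpoint(log))  # 400
```

Preferring the earliest iteration among ties follows the reasoning in the text: once the loss stops improving, further iterations mainly increase the risk of over-fitting.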

Data sets
To identify vehicle type and color attributes, the data set must contain sufficient training data and rich categories. To meet the requirements of the experiment, it should include vehicles in different environments, at different angles and of different types and colors. However, in currently open vehicle data sets the vehicle types are generally older and the colors differ considerably from those of current vehicles, so these data sets cannot meet the requirements of this experiment. Therefore, an AttributesCars data set containing a variety of vehicle attributes was built. It includes vehicles of various types and colors, totaling 20000 vehicle images, which is sufficient for training on vehicle type and color attributes. Among them, 50% are from Stanford Cars, a vehicle data set publicly available on the Internet [32]. The vehicles in this data set do not differ much from current vehicles, but the range of vehicle types is narrow, with car and SUV in the majority. Therefore, the other 50% consist of images of various vehicle types publicly available on the Internet and images of various vehicle types collected manually in various scenarios. To expand the data set and make the trained model more robust, these data are zoomed and blurred. Of the resulting data, 80% are screened one by one and labeled for training, and the remaining 20% are used for testing.
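The 80/20 division described above can be sketched with the Python standard library. The seeded shuffle, the function name and the file names are assumptions made for illustration; the paper does not describe how images were assigned to the two subsets.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle the items reproducibly and split them into train and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Hypothetical file names standing in for the 20000 images
image_names = [f"img_{i:05d}.jpg" for i in range(20000)]
train, test = split_dataset(image_names)
print(len(train), len(test))  # 16000 4000
```

Fixing the seed keeps the test set identical across training runs, so accuracies obtained with different network variants remain comparable.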

Experimental results and analysis
In vehicle attribute recognition, the requirements of real-time video must be met as far as possible while satisfying the recognition accuracy. In this experiment, the color identification module uses a simplified version of the network structure. The trained model significantly reduces detection and identification time without affecting accuracy. Although the vehicle type recognition module takes longer, combining it with the color module compensates for this, and the overall vehicle identification time meets real-time video requirements.
When vehicles are driving, road scenes vary, and vehicle images in surveillance videos vary with them. In a surveillance picture, the distance between the vehicle and the camera directly affects whether vehicle attributes can be recognized correctly and effectively. To evaluate the applicability of the vehicle attribute recognition model, tests are carried out in two different scenarios: close-range monitoring and traffic monitoring. In the close-range monitoring scene, the camera is close to the vehicle, the collected images are relatively clear, and the accuracy of vehicle detection and identification is relatively high. In the traffic monitoring scene, the camera is generally mounted higher and farther from the vehicle, so the pixel deviation of the collected vehicle image is larger than in the close-range scene, the target is smaller, and the accuracy is relatively low. The experimental results show that the model correctly identifies vehicle type and color attributes in different scenarios. To verify the advantage of the method in vehicle attribute recognition, it is compared with other research methods for vehicle type and vehicle color recognition. Because this study involves two attributes, vehicle type and color, which are trained separately, the two attributes are compared separately, in terms of both recognition accuracy and recognition time. The comparative analysis of the recognition results for vehicle type and vehicle color is shown in Table 1 and Table 2 respectively. As can be seen from Table 2, the accuracy of vehicle color recognition based on the SVM method is 87.63%, and the accuracy based on the illumination processing method is 89.76%.
The YOLOv3 method achieves 93.47% accuracy with a recognition time of 38.69 ms, which is significantly shorter than the other methods. The method proposed in this paper achieves 93.19% accuracy with a recognition time of only 3.85 ms. Compared with the YOLOv3 method, the recognition time is greatly shortened at a cost of only 0.28 percentage points of accuracy, which makes the method more practical for vehicle recognition in video. The results in Table 1 and Table 2 show that the proposed vehicle multi-attribute recognition model can effectively recognize vehicle color while recognizing vehicle type, and the vehicle type recognition time is 31.98 ms.
The recognition time for vehicle color is 3.85 ms. The overall vehicle recognition time is 35.84 ms, which meets the requirements of real-time video. In addition, combining vehicle type and color makes the vehicle recognition results more comprehensive, with an average accuracy of 95.63%, so the method can be applied more effectively to the detection and recognition of vehicle attributes in real traffic. Figure 9 shows detection results on some of the vehicle data. Comparing column 9(a) with column 9(b) shows that when the original YOLOv3 network detects infrared targets, its detection ability is poor for large targets at short distance and for edge targets, and false detections and missed detections occur when the target is small. The author of YOLOv3 also pointed out that the YOLOv3 network performs comparatively worse on medium and large targets [33,34]. Comparing column 9(a) with columns 9(c) and 9(d) shows that the modified network detects large short-distance targets and edge targets significantly better, and its positioning accuracy also improves to some extent: the predicted boxes are closer to the ground truth. Comparing columns 9(c) and 9(d) shows that false detections and missed detections are reduced after the SPP module is added. With the original deficiencies of YOLOv3 overcome, the target detection accuracy improves further compared with modifying only the GIOU loss.
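The real-time claim can be checked with simple arithmetic on the reported per-stage times. The sketch assumes the two stages run one after the other for each frame; the variable names are illustrative.

```python
type_ms = 31.98   # reported vehicle type recognition time per frame
color_ms = 3.85   # reported vehicle color recognition time per frame

total_ms = type_ms + color_ms
frames_per_second = 1000.0 / total_ms

print(f"total: {total_ms:.2f} ms per frame")
print(f"throughput: {frames_per_second:.1f} frames per second")
```

The two rounded stage times sum to 35.83 ms, 0.01 ms below the reported total of 35.84 ms, a gap that is presumably due to rounding of the individual values. Either figure corresponds to roughly 28 frames per second, which is above the 25 frames per second typical of surveillance video.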

Conclusion
This paper proposes a vehicle multi-attribute recognition method for multiple scenes based on a deep neural network. First, a vehicle data set containing various types and colors is collected from monitoring images of real road scenarios, then screened, classified and labeled. The improved YOLOv3 neural network is then trained to obtain the vehicle multi-attribute recognition model. This model shortens recognition time while maintaining high recognition accuracy, performs well in testing, and is applicable to monitoring images in real road scenarios. The sample data sets of vehicle type and color attributes are trained separately, which effectively improves the accuracy and real-time performance of attribute classification. The trained multi-attribute recognition model is evaluated on the test data set, and the experimental results show that the proposed approach achieves high precision in different road scenarios, meets the requirements of practical use, effectively identifies vehicle type and color information, and has good applicability and robustness.