Development of the video stream object detection algorithm (VSODA) with tracking

Object tracking is one of the most important tasks in video analysis. Many methods have been proposed, such as TLD (Tracking, Learning, Detection), Meanshift and MIL, but they show good accuracy in laboratory conditions rather than in real ones. Here accuracy is defined as the numerical difference between the computed object coordinates and the real ones. One of the reasons is the lack of information about the tracked object and about changes in the environment. If a method has prior information about the tracked object, it can perform with higher accuracy. Some of the newest object tracking methods, such as GOTURN, use a trained CNN (convolutional neural network) and achieve better accuracy because they know what the tracked object looks like in different situations, such as changes in light intensity and rotations of the object. A classification algorithm (classifier) used alone can find, with high probability, an object that was in its training set. But if the object's appearance changes, the object is lost once the deviation exceeds the trust limit. It is therefore important to have both prior and posterior information about the tracked object. The prior information is provided by the detector (CNN) and the posterior information by the tracking algorithm (TLD). One of the biggest problems of the detector is its high computational complexity in terms of the number of operations, and one solution is to use the classifier in parallel with the tracker. In future work we are going to use other sensors in addition to an RGB camera, such as an RGBD camera, which may improve accuracy thanks to the larger amount of information.


Introduction
Video object tracking is one of the tasks of video analysis. In object tracking, the object area is marked with a bounding box in the first video frame. The algorithm then finds the object in the following frames. Existing tracking algorithms use only posterior information about the object, which is their main disadvantage. The main point of this work is to upgrade one of the best existing tracking algorithms by adding a knowledge base about the tracked object. The information about the object is obtained from a neural network, which is trained to classify objects in video frames and corrects the tracker.
In our research we use the popular open-source computer vision library OpenCV, which provides many important off-the-shelf functions, e.g. very efficient tracking and classification algorithms [1].

Tracking methods
There are two approaches to object tracking: recognition and tracking. In the case of recognition, the program knows what the object looks like and sequentially checks parts of the image to find similar objects. The weakness of this approach is that the object cannot be found if it is partially or fully overlapped (occluded). The same failure occurs if the appearance of the object changes too much.
In the case of tracking, the object is selected manually at the initial time, and the program then tracks it by estimating the optical flow. Optical flow characterizes the relative change in object locations from one video frame to the next. Since the program knows where the object was in previous frames, it knows the speed and direction of the object's motion. These data make it possible to predict the next location of the object with high accuracy. However, if the tracked object leaves the camera's field of view even for a short time, the program is no longer able to track it.
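The prediction step can be illustrated with a minimal sketch. It does not estimate optical flow; instead it assumes a constant velocity between consecutive frames and extrapolates the object centre from its two previous positions. The function name and the sample coordinates are hypothetical.

```python
def predict_next_center(prev_center, curr_center):
    """Extrapolate the next object centre, assuming constant velocity.

    This is a simplified stand-in for optical-flow-based prediction:
    the displacement between the two last known centres is taken as
    the velocity (in pixels per frame) and added once more.
    """
    vx = curr_center[0] - prev_center[0]  # horizontal speed, px/frame
    vy = curr_center[1] - prev_center[1]  # vertical speed, px/frame
    return (curr_center[0] + vx, curr_center[1] + vy)


# Hypothetical centres of the object in two consecutive frames.
predicted = predict_next_center((100, 50), (104, 53))
print(predicted)
```

Such a prediction only narrows the search region for the next frame; it gives no way to recover the object once it has left the field of view.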
An analysis of these shortcomings led us to a solution that combines both methods.
At the initial time, a trained neural network detects the object in the frame and sends the coordinates of its bounding box (top left and bottom right corners) to the object tracking algorithm. After some time, the neural network detects the object again to determine its new location. In this way the tracking algorithm is corrected to avoid losing the object. Furthermore, our method saves computing resources because the neural network runs only intermittently.

Tracking algorithms
Most of the tracking algorithms are implemented in the OpenCV computer vision library, which has a large set of functions for working with images and video streams. For this research, we selected the algorithms with the highest accuracy and speed of operation. The speed of operation is measured as the number of frames processed per second (FPS). Below is a brief description of each algorithm, together with its advantages and disadvantages.
I) MIL (Multiple Instance Learning). MIL uses the current location of the object as a positive example for classifier learning. Several image patches that are equal in size to the object and lie in a small neighborhood of its current location are used as potentially positive examples. Thus, if the current location of the tracked object is not accurate, a positive example with the exact current location may still be in the set of potentially positive examples. In particular, given a training data set $\{(X_1, y_1), \dots, (X_n, y_n)\}$ in the current frame, where $X_i = \{x_{i1}, \dots, x_{im}\}$ is a bag and $y_i \in \{0, 1\}$ is its label, as well as a pool of $M$ candidate weak classifiers $\Phi = \{h_1, h_2, \dots, h_M\}$, MIL sequentially chooses $K$ weak classifiers from the candidate pool based upon the following criterion:
$$h_k = \arg\max_{h \in \Phi} L(H_{k-1} + h), \qquad (1)$$
where
$$L = \sum_{i} \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right) \qquad (2)$$
is the log-likelihood over bags, $p_i$ is the probability that bag $X_i$ is positive, and $H_{k-1} = \sum_{j=1}^{k-1} h_j$ is the strong classifier that consists of the first $k - 1$ weak classifiers [1].
MIL works stably when the object is partially overlapped, but it loses the object when the object is fully overlapped, even for a short time.
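The greedy selection of weak classifiers can be sketched as follows. This is an illustration only, not the OpenCV implementation: it assumes a noisy-OR model for the bag probability, scalar features, and selection without replacement, and all function names and sample data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bag_log_likelihood(strong, bags, labels):
    """Log-likelihood over bags; bag probability uses a noisy-OR model (assumption)."""
    eps = 1e-9
    total = 0.0
    for bag, y in zip(bags, labels):
        prob_all_negative = 1.0
        for x in bag:
            prob_all_negative *= 1.0 - sigmoid(strong(x))
        # Clamp to avoid log(0) for very confident classifiers.
        p = min(max(1.0 - prob_all_negative, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def select_weak_classifiers(pool, bags, labels, k):
    """Greedily pick k weak classifiers, each maximising the bag log-likelihood."""
    if k > len(pool):
        raise ValueError("k must not exceed the size of the candidate pool")
    chosen = []
    for _ in range(k):
        def score(h):
            # Strong classifier = sum of already chosen classifiers plus candidate h.
            return bag_log_likelihood(
                lambda x: sum(g(x) for g in chosen) + h(x), bags, labels)
        remaining = [h for h in pool if h not in chosen]
        chosen.append(max(remaining, key=score))
    return chosen


# Hypothetical weak classifiers acting on a scalar feature.
def h_pos(x):
    return x


def h_neg(x):
    return -x


def h_zero(x):
    return 0.0


bags = [[2.0, -1.0], [-2.0, -3.0]]  # one positive bag, one negative bag
labels = [1, 0]
selected = select_weak_classifiers([h_neg, h_zero, h_pos], bags, labels, k=2)
print([h.__name__ for h in selected])
```

In this toy setting the classifier that agrees with the labels is chosen first, and the second pick is the candidate that degrades the likelihood least.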
II) TLD (Tracking, Learning, Detection). As the name suggests, this algorithm breaks long-term tracking into three components: short-term tracking, learning, and recognition [2]. The first component tracks the object frame by frame. The second component corrects the tracking module if necessary. The third component is the learning module, which accumulates images of the object and seeks to reduce the error. The search space is restricted to a grid of subwindows that is defined by a scale factor based on 1.2, the image width and height, the width $w$ and height $h$ of the initial bounding box, and the margins between two adjacent subwindows, which are set in proportion to the dimensions of the original bounding box [2]. The scheme of the algorithm is shown in Figure 1. Among its advantages, the algorithm works in real time, is stable under long-term overlaps of the tracked object, and tracks stably when the object's scale changes.
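A restricted search space of this kind can be sketched as a scanning-window grid. The sketch below is only an illustration under stated assumptions: the scale exponents, the 10% shift fraction and the minimum window size are hypothetical parameters, not values taken from the text, and the function name is ours.

```python
def scanning_grid(img_w, img_h, box_w, box_h,
                  scale_base=1.2, scale_steps=(-2, -1, 0, 1, 2),
                  shift=0.1, min_size=20):
    """Enumerate subwindows (x, y, w, h) over several scales.

    Assumptions: scales are powers of `scale_base`; the margins between
    adjacent subwindows are a fixed fraction `shift` of the initial
    bounding box dimensions.
    """
    dx = max(1, round(shift * box_w))  # horizontal margin in pixels
    dy = max(1, round(shift * box_h))  # vertical margin in pixels
    windows = []
    for a in scale_steps:
        s = scale_base ** a
        w, h = round(box_w * s), round(box_h * s)
        # Skip windows that are too small or do not fit into the image.
        if w < min_size or h < min_size or w > img_w or h > img_h:
            continue
        for y in range(0, img_h - h + 1, dy):
            for x in range(0, img_w - w + 1, dx):
                windows.append((x, y, w, h))
    return windows


# Hypothetical 100x80 image with a 40x30 initial bounding box, single scale.
grid = scanning_grid(100, 80, 40, 30, scale_steps=(0,))
print(len(grid))
```

Even for a small image a single scale already yields hundreds of candidate windows, which is why the grid has to be restricted.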

Neural nets for classification
Convolutional neural networks are usually used in image and video recognition, and sometimes in natural language processing. This kind of neural network architecture was proposed by the French computer scientist Yann LeCun [3].
The schematic diagram of a typical deep CNN (DCNN) is shown in Figure 3.
The advantages of convolutional neural networks are as follows [9]:
- CNNs have fewer trainable weights than fully connected networks of equal accuracy.
- Computation can be parallelized, including on GPUs, which speeds up both training and operation of the neural network.
- They are resistant to image distortions such as shifts.
- They achieve high accuracy in image classification.
The main disadvantage of CNNs is their high requirements for computational resources. While a CPU is sufficient for classifying images with a CNN, the detection task requires powerful GPUs.

Object detection algorithm
As the detection algorithm for analysis we chose YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories [4]. On a Titan X GPU it processes images at 40-90 FPS and achieves a mAP of 78.6% on VOC 2007 and a mAP of 48.1% on COCO test-dev.
The algorithm applies a single convolutional neural network to the full image. The network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.
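The weighting of bounding boxes by predicted probabilities can be illustrated with a short sketch. It assumes that each predicted box carries an objectness value and a set of class probabilities and that their product serves as the box score; the data layout, threshold and sample values are hypothetical and do not reproduce the Darknet code.

```python
def score_boxes(predictions, threshold=0.5):
    """Keep boxes whose score (objectness x best class probability) passes the threshold.

    `predictions` is a list of (box, objectness, class_probs) tuples, where
    `box` is (x, y, w, h) and `class_probs` maps class names to probabilities.
    """
    kept = []
    for box, objectness, class_probs in predictions:
        label, prob = max(class_probs.items(), key=lambda kv: kv[1])
        score = objectness * prob
        if score >= threshold:
            kept.append((label, score, box))
    # Most confident boxes first.
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical network output for two regions of one frame.
predictions = [
    ((10, 10, 50, 50), 0.9, {"cup": 0.8, "sports ball": 0.2}),
    ((0, 0, 20, 20), 0.3, {"cup": 0.5, "sports ball": 0.5}),
]
detections = score_boxes(predictions)
print(detections)
```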
YOLOv2, the updated version of YOLO on which YOLO9000 is based, offers the best trade-off between speed and detection accuracy [5].
Despite its advantages, YOLO has one serious problem: very high requirements for computational resources. This makes it impossible to use YOLO directly on mobile platforms such as the Raspberry Pi.

Algorithm description
Among the existing tracking algorithms, TLD was chosen as the base algorithm. A convolutional neural network (CNN) is used for object detection. We selected this type of neural network because of the advantages listed in the literature review.
We decided to use the state-of-the-art detector YOLO9000. This detector is based on the Darknet neural network framework, which is written in C and CUDA [6].
The input of the detector is an image with three color channels (RGB). The output of the detector consists of five parameters: the object class, the coordinates of the top left corner, and the width and height of the bounding box.
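The five output parameters can be represented by a small data structure. The sketch below is an assumption about a convenient representation rather than the format used by Darknet; the class and field names are hypothetical. It also shows the conversion to the corner coordinates that are passed to the tracker.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detector output: object class plus bounding box (top left corner, size)."""
    label: str
    x: int       # top left corner, horizontal coordinate in pixels
    y: int       # top left corner, vertical coordinate in pixels
    width: int
    height: int

    def corners(self):
        """Return the (top left, bottom right) corners of the bounding box."""
        return (self.x, self.y), (self.x + self.width, self.y + self.height)


# Hypothetical detection of a cup.
cup = Detection("cup", 10, 20, 30, 40)
print(cup.corners())
```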
Initially we planned to use our own neural network model with the detector, but its accuracy was worse than that of state-of-the-art detectors.
The algorithm works as follows. The initial video frame is processed by the CNN, which classifies the objects and determines their positions in the frame. This information is then passed to the TLD tracker, which tracks the required object. Periodically, the object positions reported by the classifier and by the tracker are compared. If the difference is larger than a threshold, the TLD tracker receives a new position of the tracked object, as determined by the CNN.
The CNN plays the role of an expert, and its information about the tracked object takes priority over that of TLD. When the CNN cannot find the tracked object, the information from TLD is used. The main algorithm of the program is presented in Figure 1.
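The interaction between detector and tracker can be sketched as a simple loop. This is a minimal illustration under several assumptions: the detector is any callable that returns a box `(x, y, w, h)` or `None`, the tracker exposes `init(frame, box)` and `update(frame)` methods, the positions are compared by the distance between box centres, and the detector runs every `detect_every` frames. All names, the threshold and the stub tracker are hypothetical.

```python
import math


def center_distance(a, b):
    """Distance between the centres of two boxes given as (x, y, w, h)."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def track_with_corrections(frames, detect, tracker,
                           detect_every=30, max_distance=20.0):
    """Return one box (or None) per frame; the detector corrects the tracker."""
    results = []
    initialised = False
    for index, frame in enumerate(frames):
        tracked = None
        if initialised:
            ok, box = tracker.update(frame)
            tracked = box if ok else None
        # Run the detector until the first hit, then only periodically.
        run_detector = not initialised or index % detect_every == 0
        detected = detect(frame) if run_detector else None
        if detected is not None:
            # The detector acts as the expert: re-initialise on disagreement.
            if tracked is None or center_distance(detected, tracked) > max_distance:
                tracker.init(frame, detected)
                initialised = True
                tracked = detected
        # If the detector found nothing, the tracker result is used as is.
        results.append(tracked)
    return results


class DriftingTracker:
    """Stub tracker for the demonstration: drifts 10 px to the right per frame."""

    def init(self, frame, box):
        self.box = box

    def update(self, frame):
        x, y, w, h = self.box
        self.box = (x + 10, y, w, h)
        return True, self.box


true_box = (100, 100, 40, 40)
boxes = track_with_corrections(range(6), lambda frame: true_box,
                               DriftingTracker(), detect_every=3)
print(boxes)
```

In the demonstration the stub tracker drifts away from the static object, and the detector pulls it back once the centre distance exceeds the threshold.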

Results
Adding the CNN to the tracking algorithm solves the problem of losing the object when it disappears from the frame. To verify that using the CNN increases the tracking quality, we ran the algorithm with and without the CNN on test videos. The tracking quality is the ratio of the time during which the algorithm works correctly to the length of the video. The results are shown in Table 1. The first column lists the test videos. Case 1 demonstrates a partially overlapped tracked object. Case 2 shows a fully overlapped tracked object. In case 3 the distance between the tracked object and the camera changes.
The second column (TA) corresponds to the use of the tracking algorithm alone. The third column (TA+CNN) corresponds to the combination of the tracking algorithm and the CNN (our algorithm). The fourth column is the relative difference between the TA+CNN and TA methods.
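The two quantities reported in the table can be computed as in the following sketch. It assumes that the relative difference is taken with respect to the TA value; the function names are hypothetical, and the numbers in the usage example are made up for illustration and are not the values of Table 1.

```python
def tracking_quality(correct_seconds, video_seconds):
    """Ratio of the time the algorithm tracks correctly to the video length."""
    if video_seconds <= 0:
        raise ValueError("video length must be positive")
    return correct_seconds / video_seconds


def relative_difference(quality_ta, quality_ta_cnn):
    """Relative gain of TA+CNN over TA (assumed to be normalised by the TA value)."""
    return (quality_ta_cnn - quality_ta) / quality_ta


# Made-up example: a 60 s video, tracked correctly for 30 s (TA) and 45 s (TA+CNN).
q_ta = tracking_quality(30.0, 60.0)
q_ta_cnn = tracking_quality(45.0, 60.0)
print(q_ta, q_ta_cnn, relative_difference(q_ta, q_ta_cnn))
```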
When the background is static and the tracked object is fully visible in the video frame, the simple tracking algorithm can be used. In more difficult situations, however, the CNN helps to find the tracked object when it is partially overlapped.
The program is developed for two operating systems (OS): Windows and Linux. We tested our program on a Windows laptop with an Intel Core i5-7200U CPU, 8 GB of DDR4 RAM and an NVidia GeForce GTX 950M GPU. The initialization of the detector takes about 8 seconds. The tracker operates at a rate of 5-12 FPS. The detector processes one frame in 3-4 seconds. This allows a robot to track static or slowly moving objects. The detector is tuned to recognize two classes of objects: cups and sport balls.
We also tried to test our algorithm on a Raspberry Pi 3 running Raspbian OS; however, this failed because there was not enough RAM for the YOLO detector.
Examples of the work of our program are shown in Figures 4-7. Red bounding boxes are provided by the tracker, green bounding boxes by the detector. When the object is partially overlapped by the black block, the tracker often continues to track, but the detector sometimes cannot find the object. This is due to the specific tuning of the YOLO detector for recognizing many classes of objects. It recognizes the black block as a TV or monitor with a high probability, which is greater than the probability of the target object. We plan to solve this problem in the future through additional tuning of the detector.

Figure 7. Object partially out of frame
When the object is partially out of the frame, the tracker loses it, but the detector can find the object. Thus, our algorithm solves the problem of re-acquiring the target object when it disappears from the frame and comes back.
In the future, we plan to scale the platform and to increase the performance and quality of the algorithm by using an NVidia Jetson.

Conclusions
During this work, an algorithm was developed to track objects of two classes. The program can be used to control a mobile robot using the relative position of the object in the video frame. It can also obtain information from other sensors, such as range sensors or motor encoders, and use it to compute the control signal for the robot.

Figure 3. Scheme of the typical DCNN

Figure 4. Object is not overlapped

Figure 5. Object is not overlapped
When the object is not overlapped, the tracker works correctly and the detector finds the object (Figures 4 and 5). In this case the detector returns bounding boxes that are larger than the object.

Table 1. Comparison of tracking quality with and without CNN