LRSDSFD: low-rank sparse decomposition and symmetrical frame difference method for moving video foreground-background separation

In scenes with dynamic background or measurement noise, the low-rank sparse decomposition background modeling algorithm based on the nuclear norm constraint tends to separate the moving background or noise into the foreground together with the true foreground, and it models complex backgrounds poorly. To solve this problem, this paper proposes a low-rank sparse decomposition and symmetrical frame difference method for moving video foreground-background separation. Firstly, low-rank sparse decomposition is used to constrain the background matrix. Secondly, the moving objects in the region of interest (ROI) are extracted by the symmetrical frame difference method, and the background image is obtained by block background modeling. Numerical experiments show that, compared with five other mainstream algorithms, the proposed algorithm separates moving objects more accurately in scenes with dynamic background.


Introduction
Video foreground and background separation has always been a core issue in the field of computer vision. As the basis of intelligent monitoring technology, it directly affects applications such as target recognition [1], target detection [2] and target tracking [3][4].
At present, the mainstream algorithms for moving video foreground and background separation can be divided into the optical flow method [5], the frame difference method [6] and the background modeling method [7][8]. The low rank sparse decomposition model is one of the background modeling methods. In the framework of compressed sensing theory [9][10][11], Liu et al. [12] proposed robust principal component analysis (RPCA). It treats the whole video sequence as one large observation matrix, in which each video frame corresponds to one column.
Since the background part of the video does not change with time and occupies the same position in every frame, it can be expressed as a low rank matrix. The foreground changes dynamically and can be expressed as a sparse matrix. On this basis, a separation method called principal component pursuit (PCP) was proposed.
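The observation-matrix construction described above can be illustrated with a small synthetic example. The following is a minimal sketch, not the implementation used in this paper: the helper name `frames_to_matrix`, the symbol `D` for the observation matrix and the synthetic video (a fixed background plus a moving 3×3 square) are assumptions made for illustration.

```python
import numpy as np

def frames_to_matrix(frames):
    """Stack equally sized grayscale frames into an observation matrix.

    Each frame is flattened into one column, so n frames of h x w pixels
    become an (h*w) x n matrix.
    """
    return np.stack([f.reshape(-1) for f in frames], axis=1).astype(float)

# Synthetic video (assumption): a fixed random background and a bright
# 3 x 3 square that moves one pixel to the right in every frame.
rng = np.random.default_rng(0)
h, w, n = 12, 16, 10
background = rng.uniform(0.0, 1.0, size=(h, w))

bg_frames, fg_frames = [], []
for t in range(n):
    fg = np.zeros((h, w))
    fg[4:7, t:t + 3] = 1.0
    bg_frames.append(background)
    fg_frames.append(fg)

L_true = frames_to_matrix(bg_frames)   # identical columns: static background
S_true = frames_to_matrix(fg_frames)   # only a few nonzero entries per column
D = L_true + S_true                    # observation matrix

print(D.shape)                                  # (192, 10)
print(np.linalg.matrix_rank(L_true))            # 1
print(np.count_nonzero(S_true) / S_true.size)   # fraction of nonzero entries
```

Because every column of the background matrix is the same image, its rank is 1, while the moving object touches only 9 of the 192 pixels in each frame. This is the low rank plus sparse structure that the decomposition relies on.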
The RPCA algorithm is easily disturbed by dynamic background and noise in foreground-background separation, and the separation effect is poor. Therefore, accurately separating the foreground is still a challenging problem. Zhou et al. [13] proposed the decomposition algorithm "GoDec", which divided the objective function in low rank sparse decomposition into foreground, background and noise, and used bilateral random projection instead of the singular value decomposition (SVD) in PCP. It improved the robustness of PCP to complex background and noise. In PCP, the nuclear norm, which is the convex envelope of the rank function, is used to constrain the background matrix. Although this transforms the original problem into a convex optimization problem that is easy to solve, it limits the flexibility in dealing with practical problems. Considering that larger singular values have a greater impact on the matrix approximation performance, Gu et al. [14] proposed the weighted nuclear norm minimization (WNNM) model, which adaptively assigns different weights to different singular values and thus improves the flexibility of the model in actual scenes. Zhang et al. [15] proposed the truncated nuclear norm (TNN) model, which keeps the large singular values unchanged and minimizes only the small singular values, and achieves a good recovery effect in some scenarios. Since the nuclear norm is only the optimal convex approximation of the matrix rank function, and non-convex approximations can give better recovery results, Nie et al. [16] proposed the Schatten-p norm, a non-convex surrogate of the rank function that can suppress the noise generated during measurement. However, the Schatten-p norm faces the same problem as the nuclear norm: it cannot treat different singular values flexibly. Inspired by the weighted nuclear norm, Xie et al. [17] proposed the weighted Schatten-p norm minimization (WSNM) model, which can deal with different singular values more flexibly and is a more accurate rank function approximation [18].
PCP uses an L1 norm constraint for the sparse part. Because the L1 norm processes each element of the matrix separately, it ignores the spatial structure of a moving foreground object, a property that clearly distinguishes the foreground from the local and periodic motion of a dynamic background. Therefore, Guyon et al. [19] proposed the $l_{1,2}$ norm with block sparsity, which emphasized the relationship between the low rank of the background and the sparsity of the foreground and separated the foreground objects better. Liu et al. [20] proposed a low rank and structured sparse decomposition (LSD) model, in which the sparse part is a structured sparse norm based on overlapping groups that can adapt to complex and changeable video scenes. Inspired by the weighted Schatten-p norm in reference [18] and the structured sparse norm in reference [20], this paper proposes a new low rank sparse decomposition model that combines the symmetrical frame difference with the structured sparse norm and is solved by the augmented Lagrange multiplier (ALM) method. The experimental results on multiple groups of complex video scenes show that the proposed model separates moving targets well under dynamic background.

Background image generation based on symmetrical frame difference processing
A preliminary moving target image is obtained by symmetrical frame difference processing, and then the block background modeling is applied to generate the background image [21][22][23].

Symmetrical frame difference processing
The input to the symmetrical frame difference processing is three consecutive dimensional-reduction grayscale frames (frames i-1, i and i+1), denoted as $I_{i-1}$, $I_i$ and $I_{i+1}$. [...]
2) Median filtering.
In order to eliminate the noise of the difference image initially, a median filter operator with 2×2 pixel size is used to filter [...]
[...] related to the OTSU optimal segmentation threshold is a foreground pixel; otherwise, it is a background pixel. In this case, let [...] = 25.
So far, the binarization image [...] Similarly, the binarized image [...] figure 2. It can be seen that fusing the difference results of three frames with the symmetrical frame difference method can eliminate or reduce the dilated pixels inside the image, enhance the integrity of the target, and effectively improve the detection of moving targets.
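The idea of fusing two difference images around the middle frame can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact procedure of this paper: the function name `symmetric_frame_difference` is hypothetical, the fusion is assumed to be a logical AND of the two binarized difference images (a common form of symmetrical frame difference), a fixed `threshold` stands in for the OTSU-related threshold, and the median filtering step is omitted.

```python
import numpy as np

def symmetric_frame_difference(prev_frame, cur_frame, next_frame, threshold=25):
    """Return a boolean foreground mask for cur_frame.

    Assumed rule: a pixel is foreground only if it differs from BOTH the
    previous and the next frame by more than `threshold`.
    """
    prev_f = prev_frame.astype(np.int32)   # avoid uint8 wrap-around
    cur_f = cur_frame.astype(np.int32)
    next_f = next_frame.astype(np.int32)
    diff_backward = np.abs(cur_f - prev_f)
    diff_forward = np.abs(next_f - cur_f)
    return (diff_backward > threshold) & (diff_forward > threshold)

def make_frame(col):
    """Synthetic 10 x 20 frame: gray background with a bright 4 x 4 object."""
    frame = np.full((10, 20), 50, dtype=np.uint8)
    frame[3:7, col:col + 4] = 200
    return frame

# The object moves 4 pixels to the right between consecutive frames.
mask = symmetric_frame_difference(make_frame(2), make_frame(6), make_frame(10))
print(np.argwhere(mask).min(axis=0), np.argwhere(mask).max(axis=0))
```

Each single difference image also contains the position the object had in the neighboring frame. The AND fusion keeps only the region that changed in both differences, which in this toy example is the object at its position in the middle frame.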

Block background modeling
This paper proposes a block modeling method based on symmetrical frame difference; the process is shown in figure 3. The main steps are described as follows:
1) The size of the initial background image $I_b$ is consistent with that of the symmetrical frame difference result $B_i$ and the dimensional-reduction grayscale image $I_i$.
2) [...] sub-blocks (the size of the sub-blocks depends on the actual situation; for example, if M=48 and N=27, the size of the sub-blocks is 20×20 pixels). The pixels of $B_i$ are divided into foreground pixels (white pixels represent moving objects) and background pixels (black pixels represent background). Therefore, a sub-block of $B_i$ falls into one of three cases: a background sub-block (all background pixels, sub-block 3 in figure 4), a sub-block containing both background and foreground pixels (sub-block 2 in figure 4), and a foreground sub-block (all foreground pixels, sub-block 1 in figure 4). Normally, the background modeling process enters the i++ loop many times, that is, the symmetrical frame difference processor is called many times before the background image can be successfully generated, as shown in figure [...]

3) Generating background image
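The block-wise procedure can be sketched as follows. The three sub-block cases follow the description above, but the update rule is an assumption, because the details of step 3) are not recoverable here: the sketch copies a sub-block of the current frame into the background image only when the corresponding sub-block of the difference result is all background, and repeats over frames until every sub-block is filled. The names `classify_blocks` and `update_background` are hypothetical, and image sizes are assumed to be multiples of the block size.

```python
import numpy as np

BACKGROUND, MIXED, FOREGROUND = 0, 1, 2

def classify_blocks(mask, block):
    """Label every block x block sub-block of a boolean foreground mask."""
    rows, cols = mask.shape[0] // block, mask.shape[1] // block
    labels = np.empty((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            sub = mask[r * block:(r + 1) * block, c * block:(c + 1) * block]
            if not sub.any():
                labels[r, c] = BACKGROUND      # all background pixels
            elif sub.all():
                labels[r, c] = FOREGROUND      # all foreground pixels
            else:
                labels[r, c] = MIXED           # both kinds of pixels
    return labels

def update_background(background, filled, frame, mask, block):
    """Copy still-missing, all-background sub-blocks of frame into background.

    Returns True once every sub-block of the background image is filled.
    """
    labels = classify_blocks(mask, block)
    for r, c in zip(*np.nonzero((labels == BACKGROUND) & ~filled)):
        rs = slice(r * block, (r + 1) * block)
        cs = slice(c * block, (c + 1) * block)
        background[rs, cs] = frame[rs, cs]
        filled[r, c] = True
    return bool(filled.all())

# Toy run: 8 x 12 images, 4 x 4 sub-blocks, object visible only in frame 0.
block = 4
frame0 = np.full((8, 12), 10.0)
frame1 = np.full((8, 12), 20.0)
mask0 = np.zeros((8, 12), dtype=bool)
mask0[0:4, 0:4] = True      # sub-block (0, 0): all foreground
mask0[1:3, 5:7] = True      # sub-block (0, 1): mixed
mask1 = np.zeros((8, 12), dtype=bool)

background = np.zeros((8, 12))
filled = np.zeros((2, 3), dtype=bool)
done0 = update_background(background, filled, frame0, mask0, block)
done1 = update_background(background, filled, frame1, mask1, block)
print(classify_blocks(mask0, block))
print(done0, done1)
```

In the toy run the first frame fills four of the six sub-blocks, and the second frame, in which the object has left, fills the remaining two. This mirrors the repeated i++ loop described in the text.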
Where ) ( rank R represents the rank function of matrix L, that is, the rank size of matrix L. 0 || ||  represents the quasi zero norm of matrix S, that is, the number of nonzero elements in matrix S. λ>0 is a penalty factor. Because the rank function and quasi-zero norm in equation (4) are non-convex and non-smooth, it is an NP hard problem.
The convex relaxation of equation (4) is transformed into a convex optimization problem, and its mathematical model can be expressed as: Where   || || represents the kernel norm of matrix L, that is, the sum of singular value of matrix L. 1

|| || 
represents the L1 norm of matrix S, that is, the sum of the absolute values of all elements in matrix S.
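For illustration, the convex model can be solved numerically with an inexact augmented Lagrange multiplier iteration. The sketch below assumes that equation (5) is the standard PCP problem $\min_{L,S} \|L\|_* + \lambda\|S\|_1$ subject to $D = L + S$, where the symbol $D$ for the observation matrix is an assumption. The function names, the default $\lambda = 1/\sqrt{\max(m,n)}$ and the step-size schedule are conventional choices for this baseline, not parameters taken from this paper.

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise shrinkage: proximal operator of tau * |x|."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def singular_value_threshold(x, tau):
    """Shrink the singular values of x by tau (proximal operator of the nuclear norm)."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt

def pcp_ialm(D, lam=None, tol=1e-7, max_iter=500):
    """Split D into a low rank part L and a sparse part S with D = L + S."""
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    norm_two = np.linalg.norm(D, 2)
    norm_fro = np.linalg.norm(D, "fro")
    Y = D / max(norm_two, np.abs(D).max() / lam)   # multiplier initialization
    mu, rho = 1.25 / norm_two, 1.5
    mu_max = mu * 1e7
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(max_iter):
        L = singular_value_threshold(D - S + Y / mu, 1.0 / mu)   # fix S, Y
        S = soft_threshold(D - L + Y / mu, lam / mu)             # fix L, Y
        R = D - L - S                                            # constraint residual
        Y = Y + mu * R                                           # multiplier update
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(R, "fro") <= tol * norm_fro:
            break
    return L, S

# Synthetic test matrix (assumption): rank-1 background plus 5% sparse outliers.
rng = np.random.default_rng(0)
L0 = np.outer(rng.standard_normal(60), rng.standard_normal(60))
S0 = np.zeros((60, 60))
support = rng.random((60, 60)) < 0.05
S0[support] = rng.uniform(-10.0, 10.0, size=support.sum())
D = L0 + S0

L_hat, S_hat = pcp_ialm(D)
residual = np.linalg.norm(D - L_hat - S_hat, "fro") / np.linalg.norm(D, "fro")
rel_err = np.linalg.norm(L_hat - L0, "fro") / np.linalg.norm(L0, "fro")
print(residual, rel_err)
```

The two proximal steps correspond to the two norms of the convex model: singular value shrinkage for the nuclear norm and elementwise shrinkage for the L1 norm.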

Low rank modeling
In the RPCA algorithm, the low rank matrix L is constrained by the nuclear norm, which is not the best approximation of the rank function. The weighted Schatten-p norm in reference [24] is a more accurate rank function approximation and can suppress the noise generated during measurement. Therefore, the weighted Schatten-p norm is used in this study to carry out the low-rank constraint, and its expression is:
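A definition commonly used in the WSNM literature is $\|L\|_{w,S_p} = \left(\sum_{i=1}^{\min(m,n)} w_i \sigma_i^p\right)^{1/p}$, where $\sigma_i$ is the i-th singular value of L and $w_i \ge 0$ is its weight. The sketch below assumes this definition; the function name `weighted_schatten_p` and the example matrix are illustrative only.

```python
import numpy as np

def weighted_schatten_p(L, weights, p):
    """Weighted Schatten-p norm: (sum_i w_i * sigma_i**p) ** (1/p)."""
    sigma = np.linalg.svd(L, compute_uv=False)   # singular values, largest first
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * sigma ** p) ** (1.0 / p))

L = np.array([[3.0, 0.0, 0.0],
              [0.0, 4.0, 0.0]])     # singular values 4 and 3
ones = np.ones(2)

nuclear = weighted_schatten_p(L, ones, p=1.0)      # equals the nuclear norm
frobenius = weighted_schatten_p(L, ones, p=2.0)    # equals the Frobenius norm
nonconvex = weighted_schatten_p(L, ones, p=0.7)    # non-convex case, 0 < p < 1
print(nuclear, frobenius, nonconvex)
```

With unit weights, p=1 recovers the nuclear norm and p=2 the Frobenius norm, so the weighted Schatten-p norm can be read as a family that contains the nuclear norm as a special case. Choosing 0<p<1 and unequal weights gives the extra flexibility over singular values discussed above.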

Sparse modeling
In RPCA algorithm, the sparse matrix S is constrained by L1 norm, which processes each element in the matrix independently, but the movement of foreground targets is continuous in space. Using this structural prior information, reference [20] proposed a structured sparse norm based on overlapping groups.
As shown in figure 6, assume that the sparse foreground on an 8×8 pixel image has two different distributions, shown in figure 6 (a) and figure 6 (b). In the figure, white indicates a large pixel gray value and black indicates a small pixel gray value. Since the L1 norm is the sum of the absolute values of all elements, it takes similar values in both cases. For the structured sparse norm, a t×t pixel window slides over the image line by line and column by column, and the $l_\infty$ norm of each window is computed. This paper uses a 3×3 pixel window, so adjacent windows overlap by six pixels.
EAI Endorsed Transactions Scalable Information Systems 03 2022 -04 2022 | Volume 9 | Issue 36 | e2 Hongqiao Gao
On the 8×8 pixel image, 36 groups (subspaces) g of 3×3 pixels can be obtained in advance, that is, G={g1,g2,...,g36}. The $l_\infty$ norm of each subspace, that is, the maximum value in each subspace, is then taken, so that the two distributions produce two significantly different values. For figure 6 (a) and figure 6 (b), since there are more groups containing pixels with larger values in figure 6 (d), the value of figure 6 (c) will be much smaller than that of figure 6 (d). Under the requirement of minimum sparsity of the foreground, figure 6 (a) is therefore more likely to be considered as foreground. In the foreground matrix, each frame is regarded as one column, so the structured sparse norm constraint for the whole foreground matrix S can be expressed as the sum, over all columns j of S and all groups g in G, of the $l_\infty$ norm of the elements of the j-th column covered by g. Here the j-th column of S is the j-th frame of the video, g represents the subspace of the elements covered by the window, G represents the set of all g, and $\|\cdot\|_\infty$ is the $l_\infty$ norm, that is, the maximum value in the subspace.
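The effect of the overlapping-group norm on an 8×8 image can be checked with a short computation. This is a minimal sketch: the function name `structured_sparse_norm` and the two test patterns (nine clustered pixels versus nine scattered pixels) are assumptions chosen for illustration and are not the patterns of figure 6.

```python
import numpy as np

def structured_sparse_norm(image, t=3):
    """Sum of the l_inf norms of all overlapping t x t windows (stride 1).

    Returns the norm value and the number of windows (groups).
    """
    h, w = image.shape
    total, groups = 0.0, 0
    for r in range(h - t + 1):
        for c in range(w - t + 1):
            total += np.abs(image[r:r + t, c:c + t]).max()
            groups += 1
    return total, groups

clustered = np.zeros((8, 8))
clustered[2:5, 2:5] = 1.0          # nine pixels forming one 3 x 3 object

scattered = np.zeros((8, 8))
scattered[0::3, 0::3] = 1.0        # nine isolated pixels at rows/cols 0, 3, 6

print(np.abs(clustered).sum(), np.abs(scattered).sum())   # 9.0 9.0
print(structured_sparse_norm(clustered))                   # (25.0, 36)
print(structured_sparse_norm(scattered))                   # (36.0, 36)
```

Both patterns have the same L1 norm, but the clustered pattern touches only 25 of the 36 windows while the scattered pattern touches all 36. Minimizing the structured sparse norm therefore favors spatially connected foreground over isolated pixels.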

Model establishment and solution
Based on the above two norm constraints, the low rank sparse decomposition model is obtained by replacing the low rank part and the sparse part in equation (5) with the weighted Schatten-p norm and the structured sparse norm, respectively. The resulting optimization problem is solved by the ALM algorithm. The specific solution steps are as follows.
1) The augmented Lagrange function is constructed.
3) Fix S and Y and solve L. The generalized soft threshold algorithm (GST) is used to solve equation (14).
4) Fix L and Y and solve S.
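The generalized soft threshold step can be sketched for the scalar problem $\min_x \tfrac{1}{2}(x-y)^2 + \lambda |x|^p$ with $0<p<1$, which is the form usually solved by GST. The sketch below follows the commonly published GST iteration; how it maps onto equation (14) of this paper (for example, applying it to each singular value with its own weight) is an assumption, and the function name `gst` and the test values are illustrative.

```python
import numpy as np

def gst(y, lam, p, n_iter=10):
    """Generalized soft-thresholding for 0.5*(x - y)**2 + lam*|x|**p, 0 < p < 1."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    # Threshold below which the minimizer is exactly zero.
    base = 2.0 * lam * (1.0 - p)
    tau = base ** (1.0 / (2.0 - p)) + lam * p * base ** ((p - 1.0) / (2.0 - p))
    abs_y = np.abs(y)
    x = np.zeros_like(abs_y)
    active = abs_y > tau
    x_act = abs_y[active].copy()          # start the iteration at |y|
    for _ in range(n_iter):
        # Fixed-point iteration of the stationarity condition x = |y| - lam*p*x**(p-1).
        x_act = abs_y[active] - lam * p * x_act ** (p - 1.0)
    x[active] = x_act
    return np.sign(y) * x

lam, p = 1.0, 0.7
y = np.array([0.5, -1.0, 3.0, -5.0])
x = gst(y, lam, p)
print(x)
```

Inputs with magnitude below the threshold are set to zero, and larger inputs are shrunk toward zero while keeping their sign. Unlike ordinary soft thresholding, the amount of shrinkage decreases as the magnitude grows, which is what lets large singular values be preserved.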

Experimental environment
In order to verify the effectiveness of the proposed algorithm, nine videos with complex background characteristics from three mainstream databases, I2R [25], CDnet2014 [26] and Wallflower [27], are used.
Dataset table (only the last row is recoverable): WavingTrees | 160×120 | Dynamic background, the swinging branch
For each video data set, 100 frames are selected equidistantly, and the proposed algorithm is compared with the mainstream GoDec, PCP, LSD, WNNM and WSNM algorithms. All experiments in this paper run on an Intel(R) Core(TM) i7-8550U CPU at 1.80 GHz with 8 GB memory and MATLAB 2017a [28][29][30]. All the experimental results are post-processed by a median filter.
According to the experience of existing algorithms, the termination condition used in this experiment is [...]. According to [18], when p=0.7, the recovery effect is the best and the sensitivity to rank and noise is the smallest. Therefore, p is set to 0.7 in this paper. Please refer to references [13,14,15,17,20] for the specific experimental parameters of the comparison algorithms. The value of C is discussed in 4.2.

Evaluation index
The comprehensive measurement index F-measure is used to evaluate the separation effect, and its expression is: $F\text{-measure} = \frac{2pr}{p+r}$, where $p = \frac{TP}{TP+FP}$ denotes precision, that is, the ratio of foreground pixels correctly recovered by the algorithm to all foreground pixels recovered by the algorithm, and $r = \frac{TP}{TP+FN}$ denotes recall, that is, the ratio of foreground pixels correctly recovered by the algorithm to the real foreground pixels. TP is the number of foreground pixels correctly judged as foreground by the algorithm. FP is the number of background pixels wrongly judged as foreground by the algorithm. FN is the number of foreground pixels wrongly judged as background, that is, the foreground pixels missed by the algorithm [31]. According to equation (14), the higher the value of F-measure, the better the separation effect.
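The evaluation index can be computed from a predicted foreground mask and a ground-truth mask as follows. This is a minimal sketch that assumes the usual harmonic-mean form of the F-measure, $2pr/(p+r)$; the function name `f_measure` and the toy masks are illustrative.

```python
import numpy as np

def f_measure(pred, truth):
    """Return (precision, recall, F-measure) for two boolean foreground masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)     # foreground judged as foreground
    fp = np.count_nonzero(pred & ~truth)    # background judged as foreground
    fn = np.count_nonzero(~pred & truth)    # foreground judged as background
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)

truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True                      # 4 real foreground pixels

pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True
pred[2, 2] = False                          # one foreground pixel missed
pred[0, 0] = True                           # one false alarm

print(f_measure(pred, truth))               # (0.75, 0.75, 0.75)
```

In the toy example TP=3, FP=1 and FN=1, so precision, recall and F-measure are all 0.75.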
Taking the airport dataset as an example, with the other parameters fixed, the value of C is varied over [0.1,1] in steps of 0.1. As can be seen from figure 7, the F-measure value is highest when C=0.1, so C is set to 0.1 in this paper.

Comparison experiments
The number of real foreground frames provided by the databases varies. I2R provides 20 frames, CDnet2014 provides all the real foreground frames, and Wallflower provides only the single most representative frame. The experimental results are shown in figure 8, and the F values of the quantitative experiment are given in Table 2. Figure 8 shows the foreground results of GoDec, LSD, PCP, WNNM, WSNM and the proposed algorithm in nine scenarios (airport, bootstrap, curtain, switchlight, watersurface, highway, office, camouflage and wavingtrees). The bootstrap dataset shows that, under static background, the proposed algorithm differs little from the other methods in foreground-background separation. The switchlight dataset shows that the proposed method cannot handle sudden illumination changes, but it separates well under other complex and changeable dynamic backgrounds. This is especially clear on the camouflage dataset, whose background is a changing computer screen [32][33][34]: where the other algorithms fail, the proposed algorithm still separates the foreground and background correctly. From the F values in Table 2, it can be seen that, on the data sets with dynamic background, most F values of the proposed algorithm are the highest or the second highest, and its average F value is the highest among all methods [35][36][37][38][39][40]. Therefore, on the whole, the proposed method performs better than the comparison algorithms.

Conclusion
Based on the above two constraints, the low rank and sparse parts are modeled separately, and a low rank sparse decomposition model for video foreground and background separation is proposed. The low rank part adopts the weighted Schatten-p norm, which approximates the rank function more accurately and suppresses the noise generated during measurement, so it is more suitable for background modeling. The sparse part adopts the structured sparse norm based on overlapping groups and uses the structure information of the foreground to judge the foreground target more accurately, which is more conducive to foreground modeling. Experimental results show that the proposed algorithm is not robust to sudden illumination changes, but it achieves good separation under static background and better separation under complex and diverse dynamic backgrounds. Future work should consider not only sudden illumination changes, to make the algorithm more robust, but also camera motion.