The Face Object based HEVC System for Video Call

Guaranteeing the quality of the face object in a video call is a key task for video coding systems under constrained network bandwidth. Conventionally, the face object is divided into dispersed blocks under the hybrid coding framework, so the characteristics of the complete face object are not fully exploited. Meanwhile, it is difficult to predict complex affine transformations, such as rotation and scaling of the face object between neighboring frames, with the current translational motion model. In this paper, we propose an improved video coding scheme for video calls, in which the complex transformation of the complete face object is used to improve compression efficiency. Experimental results show that the proposed method outperforms HM12.0: the bit-rate saving of the face region is up to 19.59% at similar visual quality.


INTRODUCTION
In recent years, video calls have become widely used with the rapid development of Internet and mobile communication technology. Meanwhile, more advanced display devices demand higher video quality for good visual effects. How to guarantee the visual quality of conversational videos under limited bandwidth is a challenging topic.
Most improved methods for conversational video coding are based on Region-Of-Interest (ROI) detection [7]. At the encoder side, the quantization parameter and the Rate Distortion Optimization (RDO) Lagrange multiplier [14] are adapted so that the face region is treated as having higher distortion sensitivity [18] [8]. At the decoder side, the face region is taken as the area of greatest interest when performing error recovery or error resilience [6] [2]. Beyond that, a very low frame-rate video coding strategy for face-to-face teleconferencing was proposed by Wang et al. [15]: to reduce the frame rate, only some selected high-quality frames are encoded. These methods merely weight the face region as a key area; the features of the complete face object, e.g. rotation and scaling, are not fully taken into account for encoding.
In the latest High Efficiency Video Coding (HEVC) [13] standard, rotation transformation has been applied to intra prediction [21] [20], which is effective in reducing the bit rate. However, due to the coding cost, complex affine transformation is hard to apply to inter prediction [19]. Several methods are dedicated to complex motion prediction. Zhang et al. [10] proposed an additional motion candidate for merge mode, in which the additional complex motion is predicted from the motion information of the nearest neighbors. Roman et al. [5] proposed an affine motion field prediction based on translational Motion Vectors (MVs) for better modeling of complex motion, which is used as a post-processing step after mode decision and Motion Estimation (ME). A parametric skip mode based on a higher-order parametric motion model is presented in [3] for better prediction of complex motion. The methods described above estimate a rough complex affine transformation from the translational motions of nearby units, but do not consider the moving content as an independent object. In our previous work [16], we established a personal face patch database for face distortion recovery of conversational videos at the decoder side, which achieves a large quality improvement by utilizing the correlations of the face object, such as complex affine transformation. That scheme works on the reconstructed conversational videos and is independent of the codec.
In this paper, we introduce complex affine transformation into the prediction procedure for video calls. The face objects in the conversational videos are detected first. Then, the exact affine transformation of the face objects between adjacent frames is obtained and used to reset the reference picture set. The inter prediction of our scheme is based on the new reference picture, which is much closer to the current frame.
The rest of this paper is organized as follows. The procedure of the proposed scheme is given in Section 2, together with the details of face processing and the modified encoding procedures. The experimental results are reported in Section 3. Finally, the paper is concluded in Section 4.

THE FACE OBJECT BASED VIDEO CODING SYSTEM
The flowchart of the face object based encoding system for video calls is shown in Figure 1. Assume that the current input frame is I0. First, face detection is performed to locate the face region in I0. If a face is present, its position is recorded.
Then, we check the reference picture set (RPS) of I0. Suppose R is a reference picture of the current frame I0. If R also has a face region, R is aligned to I0 so that the face region of R approximates that of I0 as closely as possible.
The aligned picture R′ is then added to the RPS of I0.
Finally, I0 is compressed using the new RPS, and the alignment parameters are transmitted to the decoder side. Because the new reference picture R′ is used by the whole slice rather than by a single unit, we encode the alignment parameters in the slice header of I0. After that, R′ is deleted from the RPS.
Note that an I-type slice does not have an RPS; therefore, the additional reference picture has no effect on intra prediction. However, I frames can be used as reference pictures, so face detection should also be executed on I frames. For each frame, face detection is performed only once, and the detection results are recorded.
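The encoder-side flow described above can be summarized in a short Python sketch. It is only an illustration under stated assumptions: `Picture`, `EncodedFrame`, `encode_p_frame` and the `align` callback are hypothetical names that do not come from HM12.0, and the actual motion search and entropy coding are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Picture:
    """A picture plus the cached result of its one-time face detection."""
    name: str
    has_face: bool = False


@dataclass
class EncodedFrame:
    """Hypothetical summary of what the encoder used for one frame."""
    name: str
    references: List[str]   # pictures available for inter prediction
    tau_in_header: bool     # True if alignment parameters were signalled


def encode_p_frame(
    current: Picture,
    rps: List[Picture],
    align: Callable[[Picture, Picture], Picture],
) -> EncodedFrame:
    """Encode one P frame with a temporarily extended reference picture set."""
    aligned: Optional[Picture] = None
    if current.has_face:
        # Pick the first reference picture that also contains a face.
        ref = next((r for r in rps if r.has_face), None)
        if ref is not None:
            aligned = align(ref, current)
            rps.append(aligned)          # R' joins the RPS for this frame only
    frame = EncodedFrame(current.name, [p.name for p in rps], aligned is not None)
    if aligned is not None:
        rps.pop()                        # R' was appended last, so remove it again
    return frame


# Example with a stand-in alignment callback (assumption for illustration).
rps = [Picture("R", has_face=True)]
out = encode_p_frame(Picture("I0", True), rps,
                     lambda ref, cur: Picture(ref.name + "'", True))
```

The point of the sketch is the lifetime of R′: it is visible to the current frame only and never stays in the reference picture set.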

Face Processing Procedures
The quality of the face region is the most important factor in a video call, and the proposed coding system is designed for video calls. Therefore, face processing is one of the key steps in our work. It mainly contains two parts: 1) face detection; 2) face alignment.

Face Detection
The first goal of face detection is to check whether there is a large face in the current frame. If there is, the region of the face and the positions of the eyes are needed. Milborrow et al. [9] proposed an effective algorithm to locate facial features with an extended active shape model (STASM). The method in [9] is adopted to detect the face in our paper. A diagram of the detection result is shown in Figure 2.
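The text does not state how the eye positions enter the alignment. One plausible use, shown here purely as an assumption, is to derive an initial guess of scale, rotation and translation from the two eye positions in R and I0 before an optimization refines it. The function name and the uniform-scale simplification (a single scale instead of separate horizontal and vertical factors) are illustrative choices, not the paper's method.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def similarity_from_eyes(
    ref_left: Point, ref_right: Point, cur_left: Point, cur_right: Point
) -> Tuple[float, float, float, float]:
    """Return (scale, theta, tu, tv) mapping the reference eyes onto the current eyes.

    Row-vector convention: [u', v'] = scale * [u, v] * [[c, s], [-s, c]] + [tu, tv].
    """
    rdx, rdy = ref_right[0] - ref_left[0], ref_right[1] - ref_left[1]
    cdx, cdy = cur_right[0] - cur_left[0], cur_right[1] - cur_left[1]
    ref_dist = math.hypot(rdx, rdy)
    if ref_dist == 0.0:
        raise ValueError("reference eye positions coincide")
    scale = math.hypot(cdx, cdy) / ref_dist
    # Angle that turns the reference inter-eye direction into the current one.
    theta = math.atan2(cdy, cdx) - math.atan2(rdy, rdx)
    c, s = math.cos(theta), math.sin(theta)
    # Scale and rotate the reference left eye, then solve for the translation.
    u = scale * (ref_left[0] * c - ref_left[1] * s)
    v = scale * (ref_left[0] * s + ref_left[1] * c)
    return scale, theta, cur_left[0] - u, cur_left[1] - v
```

Two point correspondences determine exactly the four unknowns of a uniform similarity transform, which is why the eye pair is enough for such an initial guess.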

Face Alignment
In this section, the goal is to minimize the difference between the face region of the current frame I0 and that of the aligned reference picture R′. After face detection, the face regions and eye positions of I0 and of the original reference picture R are available. The objective function of face alignment in our paper can be defined as:

$$\min_{\tau} \left\| G_{I_0} - G_{R'} \right\|, \qquad R' = R \cdot \tau \tag{1}$$

where G_x represents the face region of picture x, and τ is an image affine transformation matrix, computed as:

$$\tau = \begin{bmatrix} S_u \cos\theta & S_u \sin\theta & 0 \\ -S_v \sin\theta & S_v \cos\theta & 0 \\ T_u & T_v & 1 \end{bmatrix} \tag{2}$$

which is the product of the image scaling, rotation and translation transformation matrices, in that order. That is, the aligning transformation contains scaling, rotation and translation, but does not handle distortion or perspective transformations.
In order to deal with occlusions, an error matrix E is introduced into Equation (1). Then, formulation (1) can be modified as follows:

$$\min_{\tau, E} \left\| G_{I_0} - G_{R'} - E \right\| \quad \text{s.t.} \quad \| E \|_0 \le k \tag{3}$$

where the ℓ0 norm ‖·‖0 computes the number of nonzero elements in the error matrix E, and k is a given constant that describes the maximum number of corrupted pixels. A diagram of the elements of function (3) is shown in Figure 3. In our previous work [17], we designed a fast method based on the algorithm proposed by Yigang Peng et al. [11] to solve function (3). The details are presented in [17]. After face alignment, the face region in the new reference picture R′ becomes more similar to that in the current frame I0. An example of face alignment results is shown in Figure 4. It can be observed that the residual error is much smaller after face alignment.
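As a hedged illustration of the alignment transform, the sketch below builds τ from five parameters, assuming the 3×3 row-vector form with rows (Su cos θ, Su sin θ, 0), (−Sv sin θ, Sv cos θ, 0) and (Tu, Tv, 1), and lets one check that this equals scaling times rotation times translation. It also counts the nonzero entries of an error matrix, which is what the ℓ0 norm measures. All function names are hypothetical, and the optimization of [17] is not reproduced.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two small dense matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def affine_tau(su: float, sv: float, theta: float, tu: float, tv: float) -> Matrix:
    """Closed form of tau (row-vector convention, assumed from the matrix entries)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[su * c, su * s, 0.0],
            [-sv * s, sv * c, 0.0],
            [tu, tv, 1.0]]


def compose_tau(su: float, sv: float, theta: float, tu: float, tv: float) -> Matrix:
    """Same transform built as scaling * rotation * translation."""
    c, s = math.cos(theta), math.sin(theta)
    scaling = [[su, 0.0, 0.0], [0.0, sv, 0.0], [0.0, 0.0, 1.0]]
    rotation = [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]
    translation = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [tu, tv, 1.0]]
    return matmul(matmul(scaling, rotation), translation)


def warp_point(u: float, v: float, tau: Matrix) -> Tuple[float, float]:
    """Map pixel coordinates (u, v) as the row vector [u, v, 1] times tau."""
    row = matmul([[u, v, 1.0]], tau)[0]
    return row[0], row[1]


def l0_norm(error: Matrix, tol: float = 0.0) -> int:
    """Number of entries of the error matrix whose magnitude exceeds tol."""
    return sum(1 for row in error for e in row if abs(e) > tol)
```

With this convention the origin is mapped to (Tu, Tv), so the last row of τ carries the translation and the upper-left 2×2 block carries scaling and rotation.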

Setting the Referent Picture Set
The RPS in HEVC decides how previously decoded pictures are managed in the decoded picture buffer (DPB) in order to be used for reference [12]. Reference picture list initialization creates the default List 0 and List 1. Due to the low-delay requirement of video calls, we adopt only I-type and P-type slices for coding. Therefore, only List 0 is used.
As shown in Figure 5, each picture in List 0 has gone through the face detection procedure. A flag is defined to mark whether the picture contains a face. A new picture is created for the first facial picture in the list and is appended after the last reference picture. The index of the reference pictures must not exceed the maximum allowed number of reference pictures.
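The list handling described above can be sketched as follows. This is a minimal illustration, not HM12.0 code: `extend_list0`, its arguments and the `align` callback are hypothetical, and pictures are represented by plain strings in the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

P = TypeVar("P")


def extend_list0(
    list0: Sequence[P],
    has_face: Sequence[bool],
    align: Callable[[P], P],
    max_refs: int,
) -> Tuple[List[P], Optional[int]]:
    """Append an aligned copy of the first facial picture to the end of List 0.

    Returns the (possibly extended) list and the index of the picture that
    was aligned, or None when nothing was added.
    """
    extended = list(list0)
    if len(extended) >= max_refs:
        return extended, None            # no free reference index left
    for idx, flag in enumerate(has_face[:len(extended)]):
        if flag:
            extended.append(align(extended[idx]))
            return extended, idx
    return extended, None                # no picture in the list has a face


refs, source = extend_list0(["P3", "P2", "P1"], [False, True, True],
                            lambda pic: pic + "_aligned", max_refs=4)
```

Appending at the end leaves the indices of the ordinary reference pictures unchanged, so only units that choose the new picture pay for the larger reference index.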

Encoding the Slice Header
If the aligned reference picture is added at the encoder, the same step must be performed at the decoder side. Therefore, the image affine transformation matrix τ is coded into the slice header.
At the decoder side, τ is reconstructed by the above rules.
The procedure for setting the RPS is then the same as at the encoder side.
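The exact coding rules for τ are not reproduced in this section, so the following is only a hypothetical sketch of one way to signal the transform. The five parameters are quantized to integers with an assumed fixed-point step, and both encoder and decoder rebuild τ from the dequantized values so that they construct an identical R′. `PARAM_ORDER`, `STEP` and both function names are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical parameter order and precision; neither is taken from the paper.
PARAM_ORDER = ("su", "sv", "theta", "tu", "tv")
STEP = 1.0 / 256


def quantize_params(params: Dict[str, float]) -> List[int]:
    """Encoder side: turn the five alignment parameters into integer levels."""
    return [round(params[name] / STEP) for name in PARAM_ORDER]


def dequantize_params(levels: List[int]) -> Dict[str, float]:
    """Decoder side (and encoder reconstruction): recover the parameter values."""
    return {name: level * STEP for name, level in zip(PARAM_ORDER, levels)}


original = {"su": 0.97, "sv": 1.02, "theta": -0.035, "tu": 3.4, "tv": -1.25}
levels = quantize_params(original)
reconstructed = dequantize_params(levels)
```

In such a scheme the encoder should warp R with the reconstructed values rather than the original ones, otherwise the prediction at the decoder would drift from the one used during encoding.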

VALIDATIONS
To validate the proposed scheme, we integrate it into the HEVC reference software HM12.0 [1]. All test video sequences are in YCbCr 4:2:0 format. The common coding configurations are listed in Table 2. Evaluations are performed to compare the proposed method with HM12.0. Four YUV sequences (Johnny, KristenAndSara, suzie, GreenCloth) are used for testing. Johnny (1280x720), KristenAndSara (1280x720) and suzie (720x480) are standard YUV sequences. Because our method focuses on the face object and the face regions of Johnny and KristenAndSara are small, we resize these two sequences to 480x480 by cropping away part of the background. The resulting frames are shown in Figure 7. The sequence GreenCloth (640x640) was created by ourselves and can be obtained by contacting the authors.
All test YUV sequences are encoded with three different sets of QPs: QP1 = {25, 26, 27, 28}, QP2 = {30, 31, 32, 33} and QP3 = {35, 36, 37, 38}, which represent {high, middle, low} bit-rate coding configurations, respectively. Because the proposed method is designed for the face object, we first analyze the R-D performance of the face region in each frame (the simulation face region is shown in Figure 2). The R-D curves of the face region for each sequence are shown in Figure 6. For all sequences and all QPs, our method achieves better performance. The comparisons are shown in Table 3. The proposed scheme achieves an average rate reduction of 13.59% for QP1, 9.25% for QP2, 9.26% for QP3 and 10.7% over all QPs. The maximum coding gain is a 19.59% rate saving for Johnny. The average ∆PSNR and ∆Rate of the face region over all test sequences are also calculated for the various QPs; the results are shown in Table 4. It can be observed that our method performs better at high video quality. The subjective quality evaluation is given in Figure . We can observe that most frames encoded by our method consume fewer bits and obtain higher quality. Some encoding information is analyzed in Table 5. The extra slice header bits are given in the second column; the average number of additional header bits used by our method is 115. Columns 4 and 5 give the average motion vector of the face region encoded by HM and by our method, respectively; the motion vectors become much smaller with our method. The third column gives the usage of the additional reference picture. On average, 57.29% of the units in the face region are encoded using the new reference pictures. The distribution of usage is shown in Figure 8.
The x axis represents the percentage of units in a face region that reference the new picture. For most face regions, the usage of the new reference picture is more than 90%. Note that in some face regions no unit uses the new reference picture. For these frames, the motion vector is 0, which means that the two adjacent frames are almost the same.
Finally, the R-D performance of the whole frames is given in Table 6. For QP1, our method achieves an 8.94% rate reduction or a 0.22 dB increase in PSNR. For QP2 and QP3, our method also yields improvements. The gain for QP3 is small because, at low bit rates, the total number of coded bits decreases greatly and the bits saved by our method decrease as well, while the additional slice header bits remain unchanged. Therefore, the total bit-rate reduction becomes much smaller, although the bit-rate reduction of the face region is still significant.
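The effect of the fixed header overhead can be illustrated with a small calculation. The sketch assumes a per-frame bit budget, a gross saving proportional to the total bits, and one slice header per frame; the frame sizes and the 10% gross saving are made-up values, and only the 115-bit average overhead comes from the results reported above.

```python
def net_reduction_percent(
    total_bits: float, gross_fraction: float, header_bits: float = 115.0
) -> float:
    """Net bit-rate reduction in percent after paying the slice header overhead."""
    saved = gross_fraction * total_bits - header_bits
    return 100.0 * saved / total_bits


# Hypothetical frame sizes: a large frame at low QP, a small frame at high QP.
high_rate = net_reduction_percent(total_bits=20000, gross_fraction=0.10)
low_rate = net_reduction_percent(total_bits=2000, gross_fraction=0.10)
print(f"high bit rate: {high_rate:.2f}%  low bit rate: {low_rate:.2f}%")
```

With the same gross saving, the net reduction falls from roughly 9.4% for the large frame to a little over 4% for the small one, because the constant overhead takes a larger share of a smaller budget.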

CONCLUSIONS
In this paper, we proposed an improved video coding system based on HEVC for video calls. The goal of the method is to predict the face object more accurately by processing the face region as a complete object. The complex affine transformations, such as rotation and scaling, of the face regions in nearby frames are obtained first. Then, we create a new reference picture for the current frame and add it to the reference picture list. To use the new reference picture at the decoder side, the transformation matrix needs to be written into the bitstream. We design the procedure so as to keep modifications to the decoder to a minimum. The improvement in R-D performance of the proposed method has been validated. In future work, the encoding time should be further optimized.

Figure 1: The flowchart of the face object based coding system at encoder side

Figure 3: A diagram of elements in function (3). The testing images are taken from the Labeled Faces in the Wild (LFW) [4] dataset.

Figure 4: A diagram of the effect of face alignment. (a) Frame I0. (b) Frame R. (c) The minimum residual error of the face regions in I0 and R. (d) The aligned picture R′, R′ = R • τ. (e) The minimum residual error of the face regions in I0 and R′. The horizontal and vertical axes are drawn in (d). It can be observed that R′ is obtained by slightly rotating and shrinking R, and the residual error shown in (e) is smaller than that shown in (c).
After obtaining the new reference picture R′, R′ needs to be added to the RPS and the image affine transformation matrix τ needs to be transferred to the decoder.

Figure 5: The referent picture list

 1 
Su cos θ Su sin θ 0 −Sv sin θ Sv cos θ 0 Tu Tv can be obtained from equation 7. Consecutive 6 frames of each sequence encoded by HM12.0 and the proposed method are shown.The records on the bottom of a picture represent the PSNR and Bits of a face region, respectively.Pictures (a)(c)(e)(g) are encoded by HM, and pictures (b)(d)(f)(h) are encoded by the proposed method.

( a )
Johnny encoded by HM (b) Johnny encoded by Proposed (c) Sara encoded by HM (d) Sara encoded by Proposed (e) suzie encoded by HM (f) suzie by Proposed (g) GreenCloth by HM (h) GreenCloth encoded by Proposed

Figure 7 :
Figure 7: Consecutive 6 frames of each sequence encoded by HM and the proposed method.

Figure 6: The RD curve of test YUV sequences (on face region).

Table 5: Coding information (∆B: additional bits of slice header; Usage: usage of additional reference pictures)
This work was supported in part by National Basic Research Program of China (973 Program): 2015CB351802, in part by National Natural Science Foundation of China: 61472389, 61025011, 61472203.