Double-channel cascade-based generative adversarial network for power equipment infrared and visible image fusion

Visible light imaging sensors and infrared imaging sensors are two commonly used sensor types, widely applied to detection, monitoring and tracking in aviation, navigation and other military fields. Because their working principles differ, so does their performance. An infrared imaging sensor records the infrared radiation emitted by the ground target itself and identifies the target by detecting the thermal radiation difference between the target and the background. It can therefore recognize camouflaged targets, such as people, vehicles and artillery hidden in woods and grass. Although the infrared imaging sensor detects thermal targets well, it is insensitive to brightness changes in the scene and has low imaging resolution, which hinders interpretation by the human eye. A visible light imaging sensor is sensitive to the light reflected by the target scene and is independent of the scene's thermal contrast. The resulting image has high clarity and provides the details of the target scene. Fusing infrared and visible images therefore combines the better target indication of the infrared image with the clear scene information of the visible image. In this paper, we propose a double-channel cascade-based generative adversarial network for power equipment infrared and visible image fusion. The experimental results show that the fused image retains the target information of the infrared image as well as more details of the visible image, and achieves better performance in both subjective and objective evaluation.


Introduction
Image fusion is a process in which images or image sequences of a specific scene, acquired by two or more sensors at the same time or at different times, are synthesized to generate a new interpretation of the scene [1][2][3]. Visible and infrared image fusion is one research field within multi-source sensor information fusion [4], with applications in target tracking [5], target detection [6], medical imaging [7], etc. A visible image provides the color and texture details of the scene but is susceptible to changes in ambient light; an infrared image reflects thermal radiation information and is not susceptible to changes in light, but cannot provide details. By exploiting this complementary information, infrared and visible images can be fused to obtain more comprehensive and accurate information about the target and the scene, with better visual quality.
At present, fusion methods for infrared and visible images fall into traditional methods and methods based on deep learning. Traditional image fusion methods can be divided into two categories. The first is transform-domain fusion, such as multi-scale transform (MST) [8,9] and sparse representation (SR) [10,11]. The second is spatial-domain fusion, such as guided filtering-based fusion (GFF) [12,13]. However, most of these methods require complex fusion rules to be designed manually, and their computation is becoming increasingly complicated. In recent years, methods based on deep learning have been widely used in image fusion. According to the network structure, deep learning fusion methods include those based on the Convolutional Neural Network (CNN) [14][15][16][17] and those based on the Generative Adversarial Network (GAN) [18]. Reference [19] proposed a method based on a convolutional neural network that extracts source image features effectively, but it ignores the middle layers, which causes information loss, and it still requires manual design of the activity level measurement and fusion rules.
Reference [20] proposed a generative adversarial network method for infrared and visible image fusion. The final fused image is obtained through adversarial training between the generator network and the discriminator network. However, relying only on adversarial training to add detail leaves the fused image with insufficient local information, and the target edges in the fusion result are often blurred.
To solve the above problems, this paper proposes a double-channel cascade-based generative adversarial network (DCGAN) for power equipment infrared and visible image fusion. DCGAN is an end-to-end model that does not require complex fusion rules to be designed manually. Infrared and visible images are fused once the adversarial game between the generator and the discriminator reaches equilibrium. At the same time, an evaluation index is introduced into the loss function to guide model training and obtain a higher quality fused image.

DCGAN framework
The framework of the infrared and visible image fusion method based on DCGAN is shown in figure 1. The generator model produces an information-rich fused image from the infrared and visible images. In the generator model, features are extracted from the source images, passed forward and shared between the channels in a cascaded manner. The discriminator network plays an adversarial role against the generator network throughout the framework: through training it learns to distinguish whether an input image is a real image or one produced by the generator, and the final fused image is obtained when the game between the two networks reaches equilibrium. Two discriminators are designed, an infrared image discriminator and a visible image discriminator. During adversarial training, the fused image gains more target information through the infrared image discriminator and more detail information through the visible image discriminator, until the discriminators can no longer distinguish generated images from real images.
The generator network consists of four parts: input source image, feature extraction module, feature connection module and fusion output module. The network structure is shown in figure 2.
The feature extraction module contains four convolutional layers. Each convolutional layer uses a 3×3 convolution kernel with the stride set to 1. A 1×1 convolution kernel is used in the fusion output module. The visible image contains more details, whereas the infrared image contains more target information. To retain more effective information from the source images, the idea of the dense network is introduced and information exchange learning is designed between the two channels to build the generator model. The exchange information is generated by concatenation and convolution between the two paths, and it is concatenated with the output of the previous layer to form the input of the next convolutional layer. To avoid problems such as vanishing gradients, Batch Norm is applied after the convolution in the first four layers, following reference [21], and is followed by an LReLU activation function to improve network performance; a Tanh activation function is used in the last layer.
The main function of the discriminator network is to judge whether the images produced by the generator follow the real sample distribution. Adversarial training between the generator and the discriminators allows the generated fused image to retain more texture details, which improves its quality. In this paper, Discriminator-IR (D-IR) and Discriminator-VIS (D-VIS) share the same network structure, as shown in figure 3. A 3×3 convolution kernel is used in each convolutional layer, and the numbers of filters in the convolutional layers are set to 32, 64, 128 and 256. To avoid introducing noise, convolutional layers with a stride of 2 replace the pooling layers, which improves the classification performance of the discriminator. To prevent the image distribution from being destroyed, which would make training unstable, the discriminator input layer does not use data normalization, following reference [22]. All layers of the discriminator use activation functions to improve image generation, and the last layer is a linear layer used for classification.
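As a rough illustration of how stride-2 convolutions take over the downsampling role of pooling, the sketch below traces the feature-map sizes through four 3×3, stride-2 layers with 32, 64, 128 and 256 filters. The 120×120 input size is borrowed from the training patch size reported later, and zero padding of 0 is an assumption; the source does not state the padding used, so the actual sizes may differ.

```python
def conv_output_size(size, kernel=3, stride=2, padding=0):
    """Spatial size after one convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def discriminator_shapes(size=120, filters=(32, 64, 128, 256)):
    """Trace (height, width, channels) through the stride-2 conv stack.

    Assumes a square input and no padding (both are assumptions).
    """
    shapes = []
    for channels in filters:
        size = conv_output_size(size)
        shapes.append((size, size, channels))
    return shapes


shapes = discriminator_shapes()
# Number of features that would reach the final linear layer.
flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes, flat_features)
```

Under these assumptions each layer roughly halves the spatial size, so the final linear layer sees a compact feature vector without any pooling step.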
In the formula, H and W represent the height and width of the input source image, respectively. According to the subjective evaluation, the fusion results attend only to visible light information and lose the thermal target information of the infrared image. Therefore, to retain more effective information from the source images and to preserve the correlation between the source images and the fused image, the content loss function designed in this paper includes a gradient loss and a similarity loss, namely,
Since the thermal radiation information of the infrared image is characterized by its pixel intensity and the texture detail information of the visible image is characterized by its gradient [23], the image intensity and gradient are calculated and can be expressed as:
In the design of the content loss function, the idea of reference [24] is introduced. Structural similarity is an image quality measure that quantifies the differences between images. To retain valid information from the source images, a similarity weight between the fused image and the source images is defined, which can be expressed as:
Here, SSIM represents the structural similarity between the output fused image and the input source image.
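The intensity and gradient terms described above can be sketched as follows. This is a minimal illustration in the style of the FusionGAN content loss, not the exact loss of this paper: the discrete Laplacian used as the gradient operator, the edge padding, the weight `xi` and its default value are all assumptions, and the SSIM term is omitted.

```python
import numpy as np


def gradient(img):
    """Discrete Laplacian with edge-replicated borders (assumed operator)."""
    padded = np.pad(img, 1, mode="edge")
    return (padded[:-2, 1:-1] + padded[2:, 1:-1]
            + padded[1:-1, :-2] + padded[1:-1, 2:] - 4.0 * img)


def intensity_gradient_loss(fused, ir, vis, xi=5.0):
    """Intensity term against the infrared image plus a weighted gradient
    term against the visible image, averaged over H x W.

    xi is a hypothetical balance weight, not a value from the paper.
    """
    fused = np.asarray(fused, dtype=np.float64)
    ir = np.asarray(ir, dtype=np.float64)
    vis = np.asarray(vis, dtype=np.float64)
    h, w = fused.shape
    intensity = np.sum((fused - ir) ** 2)
    grad = np.sum((gradient(fused) - gradient(vis)) ** 2)
    return float((intensity + xi * grad) / (h * w))


rng = np.random.default_rng(0)
ir = rng.random((8, 8))
vis = rng.random((8, 8))
print(intensity_gradient_loss(0.5 * (ir + vis), ir, vis))
```

The intensity term pulls the fused image toward the infrared pixel values, while the gradient term pulls its texture toward the visible image, which mirrors the division of roles described in the text.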
In the previously proposed image fusion method based on a generative adversarial network, only a single visible image discriminator is designed to compete with the generator, which forces the fused image to keep more details of the visible image. As a result, the fusion results focus only on visible light information and lose the target information of the infrared image. In this paper, two discriminators are designed so that the fused image not only retains more detailed information of the visible image but also contains the thermal target information of the infrared image. The loss functions of discriminator D_IR and discriminator D_VIS can be expressed as:
Here, b, c and d represent the truth data of the infrared image, the visible image and the fused image, respectively.
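One common way to write such discriminator losses is the least-squares form, sketched below for a single discriminator (D_IR or D_VIS). The least-squares form itself is an assumption here, since the loss formula is not recoverable from the text; the function names and the label range used for real images are hypothetical, while the 0 to 0.3 range for fused images follows the parameter settings reported in this paper.

```python
import numpy as np


def discriminator_loss(d_real, d_fused, real_labels, fused_labels):
    """Least-squares loss for one discriminator (assumed form).

    d_real:  discriminator scores on real source images (IR or visible)
    d_fused: discriminator scores on fused images
    """
    real_term = np.mean((np.asarray(d_real) - real_labels) ** 2)
    fused_term = np.mean((np.asarray(d_fused) - fused_labels) ** 2)
    return float(real_term + fused_term)


rng = np.random.default_rng(0)
n = 32  # batch size used in the experiments
# Soft labels: real images near 1 (hypothetical range), fused in [0, 0.3].
b = rng.uniform(0.9, 1.1, size=n)
d = rng.uniform(0.0, 0.3, size=n)
scores_real = rng.random(n)
scores_fused = rng.random(n)
print(discriminator_loss(scores_real, scores_fused, b, d))
```

The same function would be called twice per training step, once with infrared images and labels b for D_IR and once with visible images and labels c for D_VIS.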

Experimental platform and parameter setting
From the TNO Image Fusion Dataset [25] and the VIFB public dataset [26], 41 and 21 groups of registered visible and infrared images of different scenes are selected as training data, and 22 and 21 groups are selected as testing data. To expand the training set, a sliding window is used for cropping: the cropping step is set to 10 and the size of each cropped image block is 120×120. The training parameters are set as follows: the batch size is 32, the number of training iterations is 20, and the initial learning rate of gradient descent is 10. In the generator, the balance parameter is set to 20. The DCGAN network is trained on the TNO and VIFB open datasets. First, a single infrared image and a single visible image are taken as the inputs of the network, and the fused image is produced by the generator network with its two cascaded channels. Under the constraint of the generator loss function, the information of the two source images is preserved as much as possible. Then, the discriminator networks discriminate and classify the fused image against the visible and infrared images. This is repeated for the set number of iterations, so that the generator and discriminator networks are trained adversarially and suitable model parameters are obtained. The parameters of DCGAN are updated by the Adam optimizer.
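The sliding-window augmentation described above can be sketched as follows, using the reported patch size of 120×120 and step of 10. The function names are hypothetical, and cropping the infrared and visible images at identical positions is an assumption that follows from the image pairs being registered.

```python
import numpy as np


def extract_patches(image, patch=120, stride=10):
    """Crop all patch x patch windows from a 2-D image with the given stride."""
    h, w = image.shape[:2]
    if h < patch or w < patch:
        raise ValueError("image is smaller than the patch size")
    patches = [image[r:r + patch, c:c + patch]
               for r in range(0, h - patch + 1, stride)
               for c in range(0, w - patch + 1, stride)]
    return np.stack(patches)


def extract_pair_patches(ir, vis, patch=120, stride=10):
    """Crop a registered infrared/visible pair at identical positions."""
    if ir.shape != vis.shape:
        raise ValueError("registered images must have the same shape")
    return extract_patches(ir, patch, stride), extract_patches(vis, patch, stride)


# Example with a synthetic 140 x 150 image pair.
ir = np.zeros((140, 150), dtype=np.uint8)
vis = np.ones((140, 150), dtype=np.uint8)
ir_patches, vis_patches = extract_pair_patches(ir, vis)
print(ir_patches.shape)  # 3 row offsets x 4 column offsets = 12 patches
```

Because the step is much smaller than the patch size, neighboring patches overlap heavily, which is how a few dozen image pairs can yield a training set of many thousands of samples.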

Subjective evaluation
To demonstrate the effectiveness of the double discriminator, four groups of images ("Men", "Helicopter", "Steamboat" and "Soldier") are used for detailed experimental comparison and analysis, and the results are shown in figure 4. Figure 4(a) and (b) are the four groups of registered infrared and visible images, respectively. Figure 4(c) and (d) are the results obtained with only the infrared image discriminator and only the visible image discriminator, respectively. Taking the "Helicopter" group as an example, the model trained with a single infrared discriminator retains the pixel intensity information of the infrared image but lacks the details of the visible image. The model trained with a single visible discriminator retains the pixel intensity information of the helicopter, but the spatial contrast of the whole image is low, and some visible light detail is missing at the bottom of the helicopter. Figure 4(e) is the result of the double discriminator. The fused image has higher visual information fidelity and looks clearer and more natural.
The DCGAN method in this paper is compared with eight classical image fusion methods, including Cross Bilateral Filter (CBF) [31], Convolutional Sparse Representation (ConvSR) [32], Fourth order Partial Differential Equation (FPDE) [33], a convolutional neural network based method (CNN), A Fusion Approach to Infrared and Visible Images (DenseFuse) [34], the generative adversarial network for image fusion (FusionGAN), a Dual-Discriminator Conditional Generative Adversarial Network for Multi-Resolution Image Fusion (DDCGAN) [35] and A Novel Decomposition Method for Infrared and Visible Image Fusion (MDLatLRR) [36]. The "Men" group is selected for experimental analysis, and the results are shown in figure 5. In the results of the CBF and ConvSR algorithms, the tree background is over-blurred and contains noise artifacts. The FPDE, CNN and DenseFuse algorithms mitigate these problems and show no artifacts in the region with branches, but their results lack the texture details of the branches. In the results of the FusionGAN, DDCGAN and MDLatLRR algorithms, the contrast between the target and the overall scene is weak, and much of the edge and detail information of the infrared and visible images is missing. In the results of the DCGAN algorithm proposed in this paper, the texture structure of the branches in the upper left is rich and distinct, and the fused image retains the edge information of the infrared image well while containing more detailed information of the visible image.
The fusion results for a few scenes may be coincidental. Therefore, another six groups of images of different scenes are selected from the TNO Image Fusion Dataset and VIFB for an expanded comparative experiment, and the results are shown in figure 6. The fusion results of the CBF and ConvSR algorithms are not clear and contain severe noise. Compared with CBF and ConvSR, the fusion results of the FPDE, CNN, DenseFuse and MDLatLRR algorithms contain clearer contour information from the source images, but some information is not distinct. The fusion results of the FusionGAN and DDCGAN algorithms are obtained through the adversarial game between the generator network and the discriminator network. Most of these results retain only the useful information of the visible images and lack the target characteristic information of the infrared images. In contrast, the proposed method obtains high quality fused images with more detailed information and clear targets.

EAI Endorsed Transactions on Scalable Information Systems
Online First

Jihong Wang and Haiyan Yu

Figure 6. Results of infrared and visible image fusion based on TNO and VIFB dataset with different algorithms
The subjective evaluation of fused images relies on the judgment of the human eye, which is one-sided and random to some extent. Therefore, fused image quality evaluation needs to be combined with objective evaluation for a comprehensive analysis and comparison.

Objective evaluation of fused images
Information entropy (EN), edge information retention ($Q^{AB/F}$), difference correlation coefficient sum (SCD), structural similarity (SSIM), visual information fidelity (VIFF) and spatial frequency (SF) are selected for comparative experimental analysis of fused image quality [37].
1) Information entropy is an evaluation index based on information theory, which measures the average amount of information contained in the fused image and is defined as:
Here, L is the number of gray levels of the image and P is the normalized histogram of the corresponding gray levels in the fused image. A larger entropy value denotes more information contained in the fused image.
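The entropy index can be illustrated with the standard Shannon definition over the normalized gray-level histogram, $EN = -\sum_{l=0}^{L-1} p_l \log_2 p_l$. The sketch below assumes this standard form, an 8-bit grayscale image and L = 256; the function name is hypothetical.

```python
import numpy as np


def information_entropy(image, levels=256):
    """Shannon entropy (in bits) of the gray-level histogram of an integer image."""
    values = np.asarray(image).ravel()
    hist = np.bincount(values, minlength=levels).astype(np.float64)
    p = hist / hist.sum()   # normalized histogram
    p = p[p > 0]            # empty bins contribute nothing to the sum
    return float(-np.sum(p * np.log2(p)))


flat = np.full((16, 16), 128, dtype=np.uint8)          # one gray level only
ramp = np.arange(256, dtype=np.uint8).reshape(16, 16)  # every level once
print(information_entropy(flat), information_entropy(ramp))
```

A constant image carries no information and scores 0 bits, whereas an image that uses all 256 gray levels equally often reaches the maximum of 8 bits.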
2) Edge information retention is an evaluation index based on the source images and the fused image, which evaluates how well salient information is expressed in the fused image, and is defined as:
3) The difference correlation coefficient sum is an evaluation index based on the source images and the fused image. Based on the correlation between image differences, it evaluates the false information contained in the fused image, and is defined as:
Here, X = 1, 2. A larger value denotes better fusion performance and less false information.
4) Visual information fidelity is an evaluation index based on human visual perception, which reflects image distortion characteristics by modeling the human eye. A multi-scale method divides the source images and the fused image into blocks, and the visual information of each block is evaluated. Finally, the mutual information between the images is calculated to measure image quality [38][39][40][41].
6) Spatial frequency is an evaluation index based on image features, which measures the structural texture of the fusion result. The larger the spatial frequency, the richer the edge and texture information of the fused image. It is defined as:


Here, M and N represent the height and width of the fused image, respectively. RF and CF represent the row frequency and the column frequency, respectively. $F(i, j)$ represents the value of the pixel in the i-th row and j-th column of the fused image. A larger SF value denotes a clearer image.
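Spatial frequency is commonly computed as $SF = \sqrt{RF^2 + CF^2}$, where $RF^2$ and $CF^2$ are the mean squared differences between horizontally and vertically adjacent pixels, each normalized by $MN$. The sketch below assumes this common form; the function name is hypothetical and the normalization used in the paper may differ slightly.

```python
import numpy as np


def spatial_frequency(image):
    """Spatial frequency from row and column first differences."""
    f = np.asarray(image, dtype=np.float64)
    m, n = f.shape
    # RF^2: squared differences between horizontally adjacent pixels.
    rf2 = np.sum((f[:, 1:] - f[:, :-1]) ** 2) / (m * n)
    # CF^2: squared differences between vertically adjacent pixels.
    cf2 = np.sum((f[1:, :] - f[:-1, :]) ** 2) / (m * n)
    return float(np.sqrt(rf2 + cf2))


flat = np.full((4, 4), 7.0)
stripes = np.array([[0.0, 1.0], [0.0, 1.0]])
print(spatial_frequency(flat), spatial_frequency(stripes))
```

A flat image has zero spatial frequency, while sharper edges and finer texture raise the score, which matches the interpretation that a larger SF denotes a clearer image.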
First, a quantitative analysis is conducted on the "Two men in front of house" experiment group, and the results are shown in Table 1.
Ten groups of infrared and visible images of different scenes are then selected from the TNO dataset, and fused images are obtained with the different algorithms. The fused images are quantitatively analyzed and compared, and the resulting index scores are plotted as line graphs. Figure 7 shows the line graphs of the objective evaluation indexes for the nine algorithms [42][43][44][45][46]. As a widely studied approach in recent years, the deep learning methods achieve higher index scores than the traditional methods. In addition, comparison across the indexes shows that the DCGAN method maintains good visual information fidelity while improving fused image quality, which is consistent with the subjective evaluation.
The objective evaluation experiment is then expanded, and 21 groups of fusion results are selected from the TNO and VIFB datasets for objective comparison. The results are shown in Table 2. The average values of the 6 evaluation indexes are obtained for the proposed method over the 21 groups of different scenes, and it is optimal on 3 of the indexes. Although the method is not optimal on the other 3 indexes, it is second only to individual algorithms, which is consistent with the subjective evaluation results on the whole. In other words, the fusion result of the proposed algorithm contains less false information, is natural and clear, and has a good sense of depth.

Conclusion
In this paper, an infrared and visible image fusion method based on a double-channel cascade adversarial mechanism is proposed by combining the generative adversarial network with the dense network. In the generator model, features are extracted through a double-path cascade to retain more information from the source images, and the final fused image is obtained through the adversarial game between the generator model and the discriminator models. The proposed method is compared with eight classical image fusion methods through qualitative and quantitative analysis. The results show that the proposed method not only reduces noise but also keeps the thermal target information of the infrared image and more texture details of the visible image, and that the fusion results are clear and natural. Future work will continue to improve the network structure, search for optimal parameter values so that every evaluation index reaches the desired level, and apply the method in other fields.

Figure 3. Network structure of the discriminator

$I_r$ represents the input infrared image and $I_f$ represents the fused image. A positive parameter balances the two terms, and $\nabla$ represents the gradient operator.

Here, two weighting parameters control the balance between the two terms. $L_{grad}$ aims to retain the thermal target information of the infrared image and the detail information of the visible image. $L_{SSIM}$ aims to retain the similarity information between the fusion result and the source images, and comprehensively measures the consistency of brightness, contrast and structure.

Two positive parameters control the balance between the two terms. $I_{ir}$ and $I_{vis}$ represent the infrared image and the visible image, respectively. $\nabla I_{ir}$ and $\nabla I_{vis}$ represent the two source images after the gradient operator is applied.

One of the loss balance parameters is set to 500. The truth values of the infrared image b and the visible image c are set as random numbers between 0.7 and 1.2, that is, close to 1. The truth value of the fused image d is set between 0 and 0.3. The training step of the discriminator is set to 2, and the Adam optimizer is used to optimize the network model parameters [27][28][29][30].

Figure 4. Subjective fusion results of infrared and visible images of four different scenes

Figure 5. Comparison of fusion results of 'Two men in front of house'


The weight represents the importance of each source image to the fused image. The larger the edge information retention value, the more edge information is contained in the fused image.

5) Structural similarity is an evaluation index that measures the degree of similarity with the source images in brightness, contrast and structure by calculating the change in image structure information. A new no-reference image fusion performance measure (MS-SSIM) is defined as:
Here, $I_F$ represents the fused image.

As can be seen from the objective evaluation index data in Table 1, the VIFF index of the DCGAN algorithm is 40.31% higher than that of the FusionGAN algorithm, which indicates good visual information fidelity.

Table 1. 'Two men in front of house' objective evaluation of comparative experiments

Table 2. The average quantitative values of the fusion results of 21 groups

Figure 7. Comparison of six objective metrics for different algorithms