Comparative Analysis of Scene Classification Methods for Remotely Sensed Images using Various Convolutional Neural Networks

Remote sensing image (RSI) scene classification has received growing attention from the research community in recent years. Over the past few decades, with the rapid development of deep learning models, particularly the Convolutional Neural Network (CNN), the performance of RSI scene classification has improved drastically owing to the hierarchical feature representations learned by traditional CNNs. However, we found that these models struggle to characterize complex patterns in remote sensing imagery because of small inter-class variations and large intra-class variations. To tackle these problems, we propose and fine-tune three CNN models: a Dilated CNN (D-CNN), an RSI scene classification model (RSISC-16 Net), and a feature fusion of CNN and RSISC-16 Net. The aim of the proposed CNN models is to incorporate more relevant information by increasing the receptive field of the convolutional layer. In addition, we perform feature fusion of two CNN models and fine-tune them by varying hyperparameters such as the activation function, dropout probability and batch size to reduce overfitting and improve performance. To evaluate the proposed approach, we collected 7,000 remote sensing images from the NWPU 45-class dataset and carried out experiments with the different CNN models. The obtained accuracies are 89.85%, 94.7% and 96.5%, respectively.


Introduction
With the rapid development of earth observation technology, image scene classification plays a significant role in the field of RSI. It is used in a variety of applications, ranging from agriculture monitoring, environmental monitoring, land use/land cover planning, urban planning, surveillance, geographic mapping and disaster control to object detection [1][2]. Several techniques have been developed for image scene classification over the last decades. These techniques are broadly categorized into two types based on the features they use: methods based on low-level feature learning and methods based on high-level (deep) feature learning. Earlier image scene classification relied on low-level, handcrafted feature learning methods [4]. These methods focused on designing handcrafted (human-engineered) features such as color [3], shape, texture, and spatial and spectral information. The histogram of oriented gradients (HoG), color histogram (CH), gray level co-occurrence matrix (GLCM), local binary pattern (LBP) and scale-invariant feature transform (SIFT) are some of the familiar handcrafted feature extraction methods used for image scene classification [5][6]. These low-level features produce good results, but they require domain expertise and are time-consuming to design, even for limited data. In addition, handcrafted features require an artificial dilation for extracting the features.
To overcome the limitations of handcrafted features, learning features automatically from images is considered the best approach. In recent years, deep learning methods have had great success in the field of image scene classification [7]. A deep learning model is composed of multiple layers that learn powerful feature representations of the data at multiple levels of abstraction. In addition, deep layers of representation have great potential to characterize robust features with complex patterns and semantics, such as land use, land cover and functional sites. Many deep learning models are currently available, such as the Convolutional Neural Network, Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM), Auto Encoder (AE), Deep Belief Network (DBN) and Generative Adversarial Network (GAN). The main reasons for the popularity of deep learning are the greatly improved parallel processing capability of hardware, especially general-purpose graphics processing units (GPUs), the substantially increased size of the data available for training, and recent improvements in machine learning algorithms. These advancements enable deep learning methods to exploit complex data and compositional nonlinear functions, and to learn distributed, hierarchical features automatically from both labeled and unlabeled data. Figures 1a and 1b show images from the NWPU 45-class dataset that have a similar visual appearance; deep learning models correctly classify them as car in parking lot and ship in harbor. A successful deep learning application therefore requires a very large amount of data to train the model, as well as a GPU to process the data rapidly. CNN models in particular are widely used for image classification and have achieved good results.
The remainder of the paper is structured as follows: Section "Related works" surveys the literature on CNN classification for remote sensing images; Section "Proposed work" presents the newly developed CNN models, namely the dilated convolutional neural network, the RSISC-16 model and the feature fusion of CNN and RSISC-16; Section "Experimental result and analysis" discusses how the performance improves from the traditional CNN to the proposed convolutional neural network models; and Section "Conclusion" reiterates the focus of the paper and summarizes the work presented.

Related Works
The first CNN model was developed by LeCun et al. [8][9]; it resembles the traditional neural network and is the foundation for modern CNNs. The structure of the CNN model is inspired by the neurons in animal and human brains. In recent years, researchers have developed many models for image classification problems. For example, Xuning Liu et al. [10] developed Siamese networks for remote sensing scene classification. Their results showed that the Siamese CNN model is efficient and outperforms VGG-16 (Visual Geometry Group). The research in [11] proposed a CNN model for a road recognition system using remote sensing images. Wong et al. [12] presented a smart object detection system for blind people. This method captured the object scene with a webcam and then extracted features using a convolutional layer. After that, an audio detector was used to analyse the detected object for the blind user.

Proposed Work
In this section, we propose three convolutional neural network models, namely a Dilated Convolutional Neural Network, a fine-tuned RSISC-16 deep CNN model and a feature fusion of CNN and RSISC-16, as replacements for the traditional CNN in order to improve the accuracy of RSI scene classification. To deal with more complex situations and achieve better network performance, we incorporate more relevant information by increasing the receptive field of the convolutional layer in the traditional model, and we also increase the number of convolutional layers. To handle the problems present in previous studies, we introduce these different kinds of CNN instead of the traditional CNN. In general, the traditional CNN consists of four major components: the convolutional layer, the sub-sampling layer, the activation function and the fully connected layer.

Dilated Convolutional Neural Network
The Dilated CNN model consists of N dilated convolution layers followed by N pooling layers and two fully connected layers. A major problem when training such deep structures is overfitting. Data augmentation, optimizer, dense, dropout and drop connect are some of the techniques developed to avoid overfitting.
Figure 2. (a) Traditional convolution with kernel size 3×3; (b) dilated convolution with dilation rate 2 and effective kernel size 5×5; (c) dilated convolution with dilation rate 3 and effective kernel size 7×7.
Figure 3 shows the traditional and dilated convolution kernels over an image of size 10×10, where (a) is a traditional 3×3 convolution kernel. Inserting a zero between each pair of adjacent points of the kernel in (a) gives (b), a kernel with dilation rate 2; similarly, (c) is a kernel with dilation rate 3. As shown in Figure 2, the receptive fields of the kernels are 3×3, 5×5 and 7×7, respectively. The receptive field grows because zeros are inserted between the kernel elements; however, the number of parameters is the same in all the dilated convolution kernels. Therefore, by using such a dilated convolution kernel to process images, we can gather more information per kernel without increasing the computation. In dilated convolution, a small kernel of size $w \times w$ is extended to $w + (w-1)(dr-1)$ with dilation rate $dr$. A traditional 3×3 convolution kernel has a 3×3 receptive field, whereas a dilated 3×3 kernel has a 5×5 receptive field when $dr = 2$ and a 7×7 receptive field when $dr = 3$. The receptive field is generally defined as

Dilated Convolutional Layer
The convolution layer is the most important layer in the CNN and gives the "convolutional neural network" its name. The aim of the convolution layer is to learn feature representations of the inputs. The convolution layer operates on a three-dimensional matrix of size $h \times w \times c$ with a corresponding weight for each point, where $h$ is the height of the input, $w$ is the width of the input and $c$ is the depth (number of channels). A convolution kernel acts as a neuron, and the size of the convolution kernel is called the neuron's receptive field. Unlike traditional neural networks, convolutional networks use the convolution operation rather than general matrix multiplication. The general form of convolution is defined as:
$$ s(i,j) = \sum_{k=1}^{n_{in}} (X_k * W_k)(i,j) \tag{1} $$
where $n_{in}$ is the number of input matrices of the tensor, $X_k$ is the $k$-th input matrix, $W_k$ is the $k$-th sub-convolution kernel matrix of the convolution kernel, and $s(i,j)$ is the value of the output matrix at the element corresponding to the kernel $W$. For example, for a 10×10 two-dimensional input matrix with a padding size of 1 (11×11 input size), a 5×5 convolution kernel and a stride of 1, the output of the convolution has size 7×7; the convolution process is shown in Figure 4.
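The kernel-extension rule $w + (w-1)(dr-1)$ and the output-size arithmetic of the example above can be checked with a minimal plain-Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the kernel weights are all ones, and the convolution is the unflipped sliding-window form commonly used in CNN libraries.

```python
def effective_kernel_size(w, dr):
    """Side length of a w x w kernel after dilation with rate dr."""
    return w + (w - 1) * (dr - 1)


def dilate_kernel(kernel, dr):
    """Insert dr - 1 zeros between neighbouring kernel elements."""
    w = len(kernel)
    size = effective_kernel_size(w, dr)
    dilated = [[0.0] * size for _ in range(size)]
    for r in range(w):
        for c in range(w):
            dilated[r * dr][c * dr] = kernel[r][c]
    return dilated


def conv_output_size(n, k, stride=1):
    """Output side length for an n x n (already padded) input and a k x k kernel."""
    return (n - k) // stride + 1


def conv2d(image, kernel, stride=1):
    """Sliding-window sum of products over a square image (no kernel flip)."""
    n, k = len(image), len(kernel)
    out = conv_output_size(n, k, stride)
    result = []
    for i in range(out):
        row = []
        for j in range(out):
            acc = 0.0
            for r in range(k):
                for c in range(k):
                    acc += image[i * stride + r][j * stride + c] * kernel[r][c]
            row.append(acc)
        result.append(row)
    return result


# Hypothetical data: a 3 x 3 kernel of ones and the 11 x 11 input from the example.
base_kernel = [[1.0] * 3 for _ in range(3)]
dilated = dilate_kernel(base_kernel, 2)      # 5 x 5 footprint, still 9 weights
image = [[1.0] * 11 for _ in range(11)]
feature_map = conv2d(image, dilated)         # 7 x 7 output
```

The dilated kernel covers a 5×5 area while keeping only nine non-zero weights, which is the sense in which the receptive field grows without adding parameters.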

Pooling Layer
The sub-sampling or pooling layer is used to reduce the feature resolution. This layer reduces the number of connections between the convolutional layers, which also lowers the computational time. There are three types of pooling: max pooling, min pooling and average pooling. In each case, the input image is divided into non-overlapping two-dimensional regions. For example, if the input size is 4×4 and the sub-sampling size is 2×2, the 4×4 image is divided into four non-overlapping 2×2 matrices. For max pooling, the maximum of the four values in each region is selected; for min pooling, the minimum is selected; for average pooling, their mean is taken. Figure 5 shows the operation of the max pooling and average pooling processes.
The CNN model ends with a fully connected layer and a softmax function. In these layers, the weighted sum of the features from the previous layer is calculated and the specific output is determined. Finally, the fully connected layers reduce the dimension to 2048 and classify the ten object classes using the softmax function.
The activation function improves the performance of the deep CNN. Many activation functions, such as tanh, sigmoid and ReLU, are available for solving nonlinear problems. The rectified linear unit (ReLU) has been one of the standard and most popular activation functions in the last few years:
$$ b_{i,j,k} = \max(a_{i,j,k}, 0) \tag{2} $$
where $a_{i,j,k}$ is the input of the activation function at location $(i, j)$ on the $k$-th channel. This layer replaces every negative value in the filtered images with zero. Figure 6 illustrates the activation function.
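The 4×4 pooling example and the ReLU of Eq. (2) can be illustrated with a short plain-Python sketch. The function names and the sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def pool2d(image, size=2, mode="max"):
    """Pool non-overlapping size x size regions with max, min or average."""
    reducers = {
        "max": max,
        "min": min,
        "avg": lambda values: sum(values) / len(values),
    }
    reduce_window = reducers[mode]
    pooled = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            window = [image[i + r][j + c] for r in range(size) for c in range(size)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


def relu(feature_map):
    """Eq. (2): replace every negative value with zero."""
    return [[max(value, 0) for value in row] for row in feature_map]


# Hypothetical 4 x 4 input, split into four non-overlapping 2 x 2 regions.
image = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [-1, -3, 0, 2],
    [-5, -7, 4, 6],
]
max_pooled = pool2d(image, 2, "max")
min_pooled = pool2d(image, 2, "min")
avg_pooled = pool2d(image, 2, "avg")
activated = relu(max_pooled)
```

Each pooled output is 2×2, so the feature resolution is halved along both axes regardless of the pooling type.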

Fully Connected Layer
After several convolutional and pooling layers, the two-dimensional data is converted into a one-dimensional vector, which is the input to the fully connected layers. There may be one or more hidden layers, which perform high-level reasoning. Each neuron takes the data from the previous layer, multiplies it by the connection weights and then adds a bias value. The working principle of the flatten layer is shown in Figure 7. The output of the final fully connected layer is fed into the classifier, i.e. the softmax function, which is used to classify the object. The general form of softmax is defined in Eq. (3):
$$ \mathrm{softmax}(sf_j) = \frac{\exp(sf_j)}{\sum_{q} \exp(sf_q)} \tag{3} $$
where $\exp(sf_j)$ is the unnormalized probability of target class $j$ and the sum over $q$ runs over all the target classes.
The convolutional layers extract features from the input images. The 13 convolutional layers are distributed over five blocks. The first two blocks contain two convolutional layers each, and the remaining three blocks contain three convolutional layers each. The convolutional layers of the first block extract low-level features such as lines and edges, while higher layers extract high-level features. Every convolutional filter has a kernel size of 3×3 with stride 1. The number of filters in the convolutional layers is gradually increased from 64 to 512. Overfitting is a non-negligible problem in the RSISC-16 deep CNN model that can be reduced by regularization. In this paper, we use dropout as an effective regularization technique. Dropout was introduced by Hinton et al. and has proved effective in reducing overfitting. Dropout is applied in the fully connected layers, and different levels of the dropout parameter can be specified, like 0.
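To make the softmax of Eq. (3) concrete, the following is a minimal plain-Python sketch. The scores are hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability step that is assumed here rather than taken from the paper; it leaves the result mathematically unchanged.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    shift = max(scores)  # stability shift; cancels out in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores from the last fully connected layer for three classes.
scores = [2.0, 1.0, 0.1]
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
```

The class with the largest score always receives the largest probability, so the predicted class is the index of the maximum.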

Feature Fusion of CNN and RSISC-16 Model
In this work, we focus on an ensemble model to improve the performance and increase the prediction rate of remote sensing image scene classification. We use two standard convolutional neural networks, CNN and RSISC-16, as our base models. For each base model, we fine-tuned the hyperparameters and trained the two models independently on the 10-class dataset. We then averaged the features after the fully connected layers and applied the softmax function to classify the data. The architecture of the proposed method is shown in Figure 9, and the components of the architecture are discussed in the following sections.

Convolutional Neural Network
The CNN model consists of four hidden layers (three convolutional/sub-sampling layers and one fully connected layer), one input layer and one output layer. The input layer contains 224×224×3 neurons, holding the RGB values of a 224×224 image. The convolutional/sub-sampling layers use 3×3 kernels, followed by sub-sampling over 2×2 regions. The fully connected layer contains 256 neurons with the Rectified Linear Unit (ReLU) activation. Finally, the output layer uses the softmax function to produce the class of the objects in RSIs.

RSISC-16 Model
This model is one of the most powerful deep convolutional neural networks and was proposed by Simonyan et al. It consists of 13 convolutional layers with a 3×3 filter size, 5 sub-sampling (max pooling) layers of size 2×2, and two fully connected layers with an activation function, followed by a softmax function. Each block consists of consecutive 3×3 convolutions followed by a max pooling layer. To avoid overfitting, we eliminate redundant features by adding dropout.
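The block structure described above can be traced with a small sketch that counts the convolutional layers and follows the feature-map size through the five pooling stages. Several values are assumptions rather than statements from the paper: the per-block filter counts follow the usual VGG-16 layout, the input is taken to be 224×224, and the 3×3 convolutions are assumed to preserve the spatial size.

```python
# (number of 3x3 convolutional layers, number of filters) per block.
# Assumption: filter counts follow the usual VGG-16 layout (64 up to 512).
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_blocks(input_size=224, blocks=BLOCKS):
    """Return the (height, width, channels) shape after each block."""
    size = input_size
    shapes = []
    for _n_convs, filters in blocks:
        # Assumption: stride-1 convolutions with padding keep the spatial size,
        # so only the 2x2 max pooling at the end of the block halves it.
        size //= 2
        shapes.append((size, size, filters))
    return shapes


shapes = trace_blocks()
total_conv_layers = sum(n_convs for n_convs, _ in BLOCKS)
```

Under these assumptions the five blocks hold 13 convolutional layers in total and reduce a 224×224 input to a 7×7×512 feature map before the fully connected layers.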

Average Feature Fusion Classifiers
The average feature fusion technique is one of the familiar ensemble approaches for classifying remote sensing images. It takes the average of the output scores (probabilities) of all base CNN models and reports it as the predicted score (probability). Owing to the high capacity of deep CNNs, the average feature fusion model improves the performance substantially. Averaging multiple models also reduces the variance. The number of convolutional and pooling layers in each model is shown in Table 1.
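The averaging step can be sketched in a few lines of plain Python. This is a hedged illustration that averages per-class probabilities from two base models; the probability values and the function name are hypothetical, and the authors' pipeline may average at a different point (for example, on fully connected features before softmax).

```python
def average_fusion(probability_vectors):
    """Average the per-class outputs of several base models."""
    n_models = len(probability_vectors)
    n_classes = len(probability_vectors[0])
    return [
        sum(vector[c] for vector in probability_vectors) / n_models
        for c in range(n_classes)
    ]


# Hypothetical outputs of the two base models for one image and three classes.
cnn_probs = [0.50, 0.30, 0.20]
rsisc16_probs = [0.20, 0.70, 0.10]

fused = average_fusion([cnn_probs, rsisc16_probs])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In this example the two models disagree, and the fused vector favours the class on which the more confident model places most of its probability mass.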

Experimental Analysis
In this section, we discuss and analyze various remote sensing image scene classification methods using deep learning techniques based on the Dilated CNN, the RSISC-16 model, and the feature fusion of CNN and RSISC-16. We first introduce the benchmark dataset for RSI scene classification, then analyze the performance of the traditional CNN model, and finally present the experimental results for the three proposed models with the same dataset and parameters.

Dataset Description
For experimental evaluation, we used a publicly available large-scale benchmark dataset for remote sensing image scene classification: the Northwestern Polytechnical University (NWPU) 45-class dataset [4], which contains 45 classes and 31,500 images in total. Each class consists of 700 images with a size of 256×256 pixels. As far as we know, the NWPU 45-class dataset is the most challenging dataset for very high resolution (VHR) image scene classification tasks because it has more scene categories and more images than other datasets. In addition, each image category in the NWPU 45-class dataset has rich variations in illumination, resolution, shooting angle, background, etc., which further increases the difficulty of scene classification. The spatial resolution of the images ranges from 0.2 to 30 m. The images cover more than 100 countries and were extracted from Google Earth. For our proposed work, we chose ten classes for remote sensing image scene classification, namely airplane, baseball diamond, beach, bridge, forest, ground track field, harbor, parking lot, river and storage tank. Sample images from the benchmark dataset are shown in Figure 10.

Performance Metrics
We evaluated the performance of the proposed models using several performance metrics: Accuracy, Precision, Recall, F1-measure and Mean Square Error (MSE). Accuracy is the number of properly classified samples in a dataset divided by the total number of samples, as shown in Eq. (4):
$$ \mathrm{Accuracy} = \frac{t}{n} \tag{4} $$
where $t$ is the number of properly classified samples and $n$ is the total number of samples in the dataset. Precision is the number of properly classified samples of a class divided by the number of samples assigned to that class. The precision of class $c$, $P_c$, is given by Eq. (5):
$$ P_c = \frac{t_c}{k_c} \tag{5} $$
where $t_c$ is the number of properly classified samples in class $c$ and $k_c$ is the number of samples classified as class $c$.
Recall is the number of properly classified samples of a class divided by the number of all relevant samples in that class. The recall of class $c$, $R_c$, is given by Eq. (6):
$$ R_c = \frac{t_c}{n_c} \tag{6} $$
where $t_c$ is the number of properly classified samples in class $c$ and $n_c$ is the total number of samples that belong to class $c$.
The F1-measure (harmonic mean) shows the balance between precision and recall. The F1-score is calculated using Eq. (7):
$$ F1_c = \frac{2 P_c R_c}{P_c + R_c} \tag{7} $$
The MSE gives the prediction error between the original input image and the predicted image. It is calculated as:
$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
where $y_i$ is the original value and $\hat{y}_i$ is the predicted value for the $i$-th sample.
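As an illustration only (not the authors' evaluation code), the metrics of this subsection can be sketched in plain Python using the standard definitions: precision divides the correctly classified samples of a class by the samples predicted as that class, and recall divides them by the samples that truly belong to it. The labels, values and function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, cls):
    """Per-class precision, recall and F1 under the standard definitions."""
    true_positive = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted_as_cls = sum(1 for p in y_pred if p == cls)
    actually_cls = sum(1 for t in y_true if t == cls)
    precision = true_positive / predicted_as_cls if predicted_as_cls else 0.0
    recall = true_positive / actually_cls if actually_cls else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labels for six test images.
y_true = ["beach", "beach", "river", "river", "harbor", "harbor"]
y_pred = ["beach", "river", "river", "river", "harbor", "beach"]

overall_accuracy = accuracy(y_true, y_pred)
river_precision, river_recall, river_f1 = precision_recall_f1(y_true, y_pred, "river")
error = mean_squared_error([1.0, 0.0, 0.0], [0.8, 0.1, 0.1])
```

For the "river" class, every true river image is recovered (recall of 1) but one beach image is also labelled river, so precision falls to two thirds and the F1-score sits between the two.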

Experimental Results of different Scene Classification Methods
The proposed model is trained and tested with the NWPU 45-class dataset using TensorFlow on a Core i7 CPU 2
Figure 14 shows the performance analysis of our three proposed CNN models against the traditional CNN model.

Conclusion
Scene classification in remote sensing images is a challenging problem because objects of the same category often have a diverse appearance. We have therefore proposed three CNN models for remote sensing image scene classification in place of the traditional CNN: a dilated convolutional model, which replaces the kernel of the traditional CNN; an RSISC-16 model with fine-tuned hyperparameters; and a feature fusion of the CNN and RSISC-16 Net models. To demonstrate the efficiency of the proposed models, experiments were conducted on a 10-class dataset selected from NWPU RESISC-45, achieving accuracies of 89.85%, 94.7% and 96.5%, respectively. All three proposed CNN models gave better accuracy than the traditional CNN model. In future, we plan to apply our proposed convolutional neural network models to remote sensing object detection and to implement them in a GPU environment to reduce the computational time.