Scene Classification of Remotely Sensed Images using Optimized RSISC-16 Net Deep Convolutional Neural Network Model

Remote Sensing Image (RSI) analysis has seen a massive increase in popularity over the last few decades, due to the advancement of deep learning models. A wide variety of deep learning models have emerged for the task of scene classification in remote sensing image analysis, and the majority of these models have shown significant success. However, there remains considerable room to improve their efficiency in characterizing the complex patterns found in remote sensing imagery. We address this by extending the architecture of VGG-16 Net and fine-tuning hyperparameters such as batch size, dropout probability, and activation function to create the optimized Remote Sensing Image Scene Classification (RSISC-16 Net) deep learning model for scene classification. The hyperparameter search is carried out with the Talos optimization tool, which increases efficiency and reduces the risk of over-fitting. Experimental results show that our proposed RSISC-16 Net model outperforms the VGG-16 Net model.


Introduction
The evolution of remote sensing image classification from the pixel and object level to the scene level began around 2010, with the emergence of land use/land cover studies. Here the term 'scene' refers to a local area cropped from a large-scale satellite image. The Remote Sensing Image Scene Classification (RSISC) problem, which has received a lot of attention, aims to classify remote sensing images into a set of semantic categories by analyzing differences in the spatial arrangement and structural pattern of ground objects [1]. The aim of RSISC is to correctly label remote sensing images with the corresponding semantic classes. Figure 1 shows how an urban area image can be categorized into commercial building, residential area, and industrial area. In general, many different types of ground objects can be found in remote sensing images [2]; an industrial scene, for example, may contain roads, trees, buildings and so on. Scene classification is a challenging task compared to object-based classification, since scenes contain a variety of complicated spatial ground objects that do not have a consistent shape and structure. The breakthrough deep network presented at Neural Information Processing Systems (NIPS) 2012 and its success in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 competition [3] popularized the use of deep neural networks for image classification and recognition. This improvement in performance motivated many other researchers to apply deep neural networks to their own specific problems, and deep learning is now a hot topic in vision research [4]. Almost every day, a new scientific paper is published that improves deep learning solutions to vision problems. At the same time, the availability of high-quality sensors, combined with advances in the aerospace and satellite industry, allows researchers to collect larger amounts of remote sensing data with higher spectral and spatial resolution.
Thus, the increasing quality and number of remote sensing images allows researchers to attack this problem and makes deep learning for remote sensing possible. Deep learning in remote sensing is used to create fully automated systems that can classify geospatial objects and land cover into distinct classes such as airplane, barren land, building, cultivated field, forest, roadway, runway, ship, storage tank, water, and so on. The ability to classify land use/land cover is critical for monitoring the Earth's constant changes and managing urban development [5][6][7]. Applying machine learning techniques for this purpose is challenging due to the small amount of remote sensing imagery available with ground truth labels. Therefore, many computer vision scientists have proposed a variety of deep learning algorithms to extract information from remote sensing images, contributing significantly to the literature of both computer vision and remote sensing. Though several studies of RSISC have been made in the past few decades, no algorithm has yet been developed to accurately classify RSI scenes. The following are some of the problems faced by researchers while performing remote sensing image scene classification.
❖ There is high intra-class diversity in remote sensing images.
❖ There is high inter-class similarity between scenes.
❖ Aerial images have much larger scale variation than conventional images.
❖ Several ground objects are present in the same scene against a complex background.
In addition, the quantity and quality of the images create a high computational cost, which makes near-real-time applications difficult. The main motivation of our proposed work is to develop a remote sensing scene image classification model using deep learning [21,22] that extracts features automatically and classifies scenes accurately, thereby handling the problems of intra-class diversity, inter-class similarity, the large scale variation present in aerial images, and complex background scenes. Our proposed deep learning RSISC-16 Net model is also optimized with the Talos tool over the hyperparameters of batch size, dropout probability and activation function, which reduces computational cost. The remainder of the paper is structured as follows: Section "Related works" contains the literature survey of CNN classification for remote sensing images. Section "Proposed work" presents the newly developed optimized RSISC-16 Net model; Section "Experimental result and analysis" discusses how the proposed model improves on the VGG-16 Net model; and in Section "Conclusion" we reiterate the focus of the paper and summarize the work presented.

Related Works
Convolutional Neural Networks (CNNs) provide a broad framework within which scene classification methods can be built. LeCun et al. [8,9] created the first CNN model, which is similar to a standard Artificial Neural Network (ANN) and serves as the basis for modern CNNs. The neurons in animal and human brains provide inspiration for the CNN model's structure. In recent years, researchers have developed many models for image classification problems. For example, Liu et al. [10] developed Siamese networks for scene classification using remote sensing images; the results demonstrated that the Siamese CNN model is efficient and superior to the VGG-16 (Visual Geometry Group) model. The research in [11] suggested a CNN model for a road recognition system based on remote sensing images. Cheng et al. [12] presented a discriminative CNN model to improve RSI scene classification performance, which addresses both within-class diversity and between-class similarity issues. Using CNN-based sparse coding learning techniques, Qayyum et al. [13] established an efficient method for scene classification of aerial images. A capsule network for RSI scene classification was introduced by Zhang et al. [14]; to improve classification accuracy, this model first extracts features using a CNN and then feeds the extracted features into a capsule network. Individual scene classification models do not classify remote sensing scenes efficiently, so, in order to improve scene classification, two or three individual CNN models are combined. This is sometimes called ensemble classification. The ensemble classification model, also known as the fusion model, is widely used for image classification from remote sensing images. The goal is to combine the results of two or more individual models. Fusing or combining the features of two or three models is seen as the best solution to overcome the limitations of individual classification models.
Many researchers have developed feature-level and decision-level scene classification of remote sensing images. Chaib et al. [15] combined VGG-16 and Inception model features to create a feature fusion model for high resolution remote sensing image scene classification. In [16], deep learning decision-level fusion was introduced for improving the classification accuracy of remote sensing images. This method combined the decision-level features of three state-of-the-art models, namely traditional CNN, VGG-16, and ResInception, to achieve greater accuracy than the individual models. Dong et al. [17] developed a combined deep learning model for High Resolution-RSI scene classification. This model combines the representation of CNN features with the LSTM model to improve scene classification accuracy. For land cover classification of HRI, Scott et al. [18] introduced a fusion technique in which features were extracted from multiple deep CNN models such as CaffeNet, GoogLeNet, and ResNet50. Travis et al. [19] introduced ensemble-based image classification using the wavelet transform. This model converts the data into the wavelet domain to achieve better accuracy and efficiency for image classification.
None of the above-mentioned individual models classify the scenes efficiently, and the fusion models require more computational time to train and validate the data. Taking these disadvantages into consideration, we propose an optimized RSISC-16 Net model based on VGG-16 Net for scene classification of remote sensing images. Compared with the VGG-16 Net model, our proposed RSISC-16 Net model requires far fewer parameters.

Proposed Work
In this section, we propose the RSISC-16 Net model by extending the architecture of VGG-16 Net. The RSISC-16 Net model consists of a total of 13 convolutional layers, 5 pooling layers, two fully connected layers and one soft-max classifier, arranged in 5 blocks as specified below:
❖ The first two blocks have 2 convolutional layers each.
❖ The remaining three blocks have 3 convolutional layers each.
❖ There are 5 pooling layers, one in each block.
The aim of the convolutional layers is to extract low-, mid- and high-level features from the given training datasets. The proposed baseline architecture of the RSISC-16 Net model is shown in Figure 2. The input images are processed by a sequence of convolutional and pooling layers. The first two blocks of convolutional layers extract low-level features such as lines, edges and shapes, while the next three blocks extract high-level features such as the internal shape of scenes. Every convolutional filter has a kernel size of 3 × 3 and a stride of 1. The depth of the convolutional layers is gradually increased from 32 to 64. Optimization techniques are used to improve the model's efficiency and performance on the target task. The Talos optimization tool was used to optimize the parameters of the RSISC-16 Net model by combining all parameters in a grid. Talos repeats the process until an acceptable model is developed or a model attains maximum predictive accuracy. Figure 3 depicts the flow diagram of the optimization procedure. The RSISC-16 Net model is implemented and tested using a grid search over the parameter dictionary. The optimizable parameters, the corresponding search ranges and the selected values are shown in Table 1. The layer-wise parameters are listed in Table 2, and a visualization of the feature maps is shown in Figure 4. The parameters of each convolutional layer are calculated using the following equation.
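As a sketch of the grid-search setup described above, the search space can be expressed as a Talos-style parameter dictionary. The candidate values below are illustrative assumptions, not the exact ranges from Table 1:

```python
# Illustrative Talos-style hyperparameter dictionary for RSISC-16 Net.
# The candidate values are assumptions for illustration; the actual
# search ranges are those listed in Table 1.
params = {
    "batch_size": [16, 32, 64],             # mini-batch sizes to try
    "dropout": [0.25, 0.5],                 # dropout probabilities to try
    "activation": ["relu", "elu", "tanh"],  # candidate activation functions
}

# A grid search enumerates every combination of the candidate values;
# Talos evaluates each combination and keeps the best-performing model.
from itertools import product

grid = [dict(zip(params, combo)) for combo in product(*params.values())]
print(len(grid))  # 3 batch sizes x 2 dropouts x 3 activations = 18 runs
```

Each entry of `grid` corresponds to one training run; the combination with the highest validation accuracy yields the selected values reported in Table 1.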

P_C = ((F × F × D) + 1) × L
(1) where P_C represents the parameter count of a convolutional layer, F represents the filter size, D represents the depth (number of input channels) of the features and L represents the number of filters in the layer. Similarly, the parameter count of a fully connected layer is calculated by equation (2): P_FC = (D + 1) × L (2) where P_FC represents the parameter count of a fully connected layer, D represents the dimension of the input features and L represents the number of neurons in the layer.

Performance Metrics
To demonstrate the effectiveness of the proposed optimized RSISC-16 Net model, performance metrics such as Precision, Recall, F1-score, and Accuracy are calculated using the confusion matrix of the model.

Precision
Precision is one of the most effective ways to demonstrate how accurate the model is. It is measured as the ratio of correctly predicted positive observations to the total predicted positive observations, as in equation (3): Precision = TP / (TP + FP) (3) where TP and FP denote the numbers of true positives and false positives, respectively.

Recall
Recall is the ratio of correctly predicted positive observations to all observations in the actual class. It measures how many of the actual positives the model catches by labelling them as positive, as in equation (4): Recall = TP / (TP + FN) (4) where FN denotes the number of false negatives.

Accuracy
Accuracy is the number of correctly classified samples divided by the total number of samples, as shown in equation (5): Accuracy = (TP + TN) / (TP + TN + FP + FN) (5) where TN denotes the number of true negatives.

F1-Score
The F1-score (the harmonic mean of precision and recall) is used to show the balance between the precision and recall measures. It is calculated using equation (6): F1-score = 2 × (Precision × Recall) / (Precision + Recall) (6).
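A minimal sketch of equations (3)-(6) for the binary case, computed directly from the entries of a confusion matrix (the function and variable names are our own):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Precision, Recall, Accuracy and F1-score from the
    entries of a binary confusion matrix (equations (3)-(6))."""
    precision = tp / (tp + fp)                          # equation (3)
    recall = tp / (tp + fn)                             # equation (4)
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # equation (5)
    f1 = 2 * precision * recall / (precision + recall)  # equation (6)
    return precision, recall, accuracy, f1

# Example counts: 90 true positives, 10 false positives,
# 85 true negatives, 15 false negatives.
p, r, a, f = classification_metrics(90, 10, 85, 15)
print(round(p, 3), round(r, 3), round(a, 3), round(f, 3))
```

For the multi-class results reported later, the same per-class metrics are computed one class at a time from the confusion matrix and then averaged.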

Dataset Descriptions
The dataset plays a crucial role in developing and evaluating scene classification models. We have used the publicly available NWPU 45-class dataset for scene classification of remotely sensed images. The dataset was extracted from Google Earth and covers high-resolution remotely sensed images from more than 100 countries. It was released by Northwestern Polytechnical University and is currently the largest scene classification dataset. It consists of 31,500 remote sensing images categorized into 45 classes. Each class contains 700 images of 256 × 256 resolution in the Red-Green-Blue (RGB) color space. The spatial resolution varies from 30 m to 0.2 m per pixel. To evaluate the effectiveness of the proposed approach, we have chosen ten classes, namely airplane, beach, commercial area, desert, forest, lake, overpass, river, tennis court and wetland, from the above-mentioned benchmark dataset for scene classification.
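The size of the ten-class subset described above can be sketched as follows. The 80/20 train/test split used here is an illustrative assumption, since the split ratio is not restated in this section:

```python
# Ten classes chosen from the NWPU 45-class dataset, 700 images each.
CLASSES = [
    "airplane", "beach", "commercial_area", "desert", "forest",
    "lake", "overpass", "river", "tennis_court", "wetland",
]
IMAGES_PER_CLASS = 700

total = len(CLASSES) * IMAGES_PER_CLASS
print(total)  # 10 classes x 700 images = 7000 images

# Illustrative 80/20 per-class split (assumed ratio, not from the paper).
train_per_class = int(IMAGES_PER_CLASS * 0.8)
test_per_class = IMAGES_PER_CLASS - train_per_class
print(train_per_class, test_per_class)  # 560 140
```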

Experimental Results of VGG-16 Net Model
The confusion matrix of the VGG-16 Net model is shown in Table 3, where correct responses are represented on the diagonal of the matrix; the airplane, lake and river classes were recognized almost perfectly. The tennis court class was misclassified as commercial area, and vice versa. The experimental results of the VGG-16 Net model, with individual and overall results, are shown in Table 4.

The confusion matrix of the proposed optimized RSISC-16 Net model on the NWPU 45-class dataset is shown in Table 6. All the land use classes are recognized with good results except wetland. Based on the optimized results from the RSISC-16 Net model, Table 7 and Figure 5 show that an average Precision of 95.29%, Recall of 95.06%, F1-score of 95.05% and average Accuracy of 95.06% are obtained for the individual remote sensing image classes. The optimized RSISC-16 Net model performs better than the pre-trained VGG-16 Net model.