Recognition system for fruit classification based on 8-layer convolutional neural network

Abstract
INTRODUCTION: Automatic fruit classification is a challenging task. The types, shapes, and colors of fruits are all essential factors affecting classification. OBJECTIVES: This paper aims to use deep learning methods to improve the overall accuracy of fruit classification, thereby improving the sorting efficiency of fruit factories. METHODS: Our recognition system is based on an 8-layer convolutional neural network (CNN) combined with the RMSProp optimization algorithm to classify fruits. It is verified through ten runs of 10-fold cross-validation. CONCLUSION: Our method achieves an overall accuracy of 91.63%, which is superior to four state-of-the-art methods.


Introduction
The traditional fruit sorting methods, which rely on manpower, consume a lot of time and labor. Fruits can be classified according to different internal structures and the origin of the fruit. In the factory, fruit classification can help employees improve the efficiency of fruit packaging and transportation [1]. In supermarkets, because customers are used to choosing different fruits themselves, fruit classification can help cashiers quickly determine the price of fruits without packaging and barcode scanning [2]. Nowadays, people attach great importance to their health. Eating fruits helps maintain health, and classifying fruits according to their effects can help people pick out the fruits that suit them, especially for patients who need conditioning.
Some researchers have proposed classic methods for the problem of fruit classification. Wei, L. [3] proposed using biogeography-based optimization (BBO) to identify fruits. Tan, K. et al. [4] used histogram-oriented gradients and color features to identify blueberry fruits of different maturity in outdoor scenes. Nyarko, E. K. [5] proposed a k-nearest neighbor classifier based on convex detection for fruit recognition in RGB-D images. Aok, S. [6] proposed a six-layer (6L) convolutional neural network for fruit classification. Li, Y. [7] proposed an improved hybrid genetic algorithm (IHGA) for fruit classification. Zhang, H. et al. [8] used volatile compounds in fruit peels as biomarkers to identify citrus species. Hassoon, I. M. [9] summarized the advantages and disadvantages of various shape-based feature extraction algorithms and techniques in fruit classification, grading, and quality evaluation. Ghazal, S. et al. [10] proposed a new fruit classification method combining Hue, Color-SIFT, Discrete Wavelet Transform, and Haralick features, which better handles the influence of rotation and lighting effects. Sari, A. C. [11] developed an app that can compare the quality of fruits by scanning real fruits and obtaining quality information and 3D images. There have also been successful applications of image processing [12] [13] [14] [15] and of artificial intelligence in other fields [16] [17] [18]. Although the methods above have achieved good results, they still have several flaws. First, they need to collect features in advance and preprocess the images, which is time-consuming. Second, the extracted features and selected indicators may not be accurate in different environments.
With the emergence of convolutional neural network technology, some scholars began to use this method to solve the above problems and applied it in many fields, including face identification [19], cell segmentation [20], mechanical structural damage detection [21], and so on. This paper builds a deep convolutional neural network to classify fruits. Compared with the traditional CNN model, we increased the number of convolutional layers and used the ReLU activation function. By properly adjusting the parameters of the pre-trained model and the number of convolutional layers, the accuracy and precision of image classification can be improved [22]. The sparse activation of the ReLU function can also help reduce overfitting. The main contribution of this research is that the recognition system can identify images of different kinds of fruit mixed together and further improve the accuracy of fruit classification. It reduces the classification time, thereby reducing the cost of classification.

EAI Endorsed Transactions on e-Learning
The rest of this paper is organized as follows. Section 2 shows the dataset. Section 3 describes the standard convolutional neural network model. Section 4 discusses the CNN model we built and the experimental results. In Section 5, we summarize the research in this paper.

Dataset
The experimental dataset of this paper comes from the following three channels: (i) http://images.google.com (ii) http://images.baidu.com (iii) Digital camera shooting.
After three months of collection and processing, we obtained a dataset of 1,800 images. The dataset contains nine kinds of fruit, with an average of 200 images per class. The nine fruits are Anjou pear, blackberry, black grape, blueberry, Bosque pear, cantaloupe, golden pineapple, Granny Smith apple, and green grape. Figure 1 shows sample fruit pictures.

Methodology
Deep learning and transfer learning are widely used in medical image analysis [23]. CNN is a technology that extracts features directly from images and then obtains accurate classification results by training and testing on the extracted feature data. It is a feedforward neural network [24] with a deep structure and convolution computation. CNN was first introduced in 1989, and it received much wider attention after the 2012 ImageNet competition, where applying CNN to a dataset containing many images of different categories roughly halved the error rate compared with the best traditional methods [25] [26] [27]. Our recognition system is based on CNN; the recognition process is shown in Figure 2. CNN technology is the core of the fruit recognition system. Its standard architecture consists of an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer. The convolutional layer convolves the images. The pooling layer reduces parameter dimensions and prevents overfitting. The fully connected layer maps all neurons in the previous layer [28]. According to the size of the dataset, we can choose an appropriate number of layers for the CNN model. The CNN model shown in Figure 3 contains two convolutional layers, two pooling layers, and a fully connected layer.
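As a rough illustration of how such a layer stack transforms an image, the sketch below propagates spatial sizes through a hypothetical two-conv, two-pool stack (the 32×32 input size, 5×5 kernels, and 2×2 pooling windows are illustrative assumptions, not values from the paper):

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial size of a square feature map after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window, stride):
    """Spatial size after a pooling window is applied."""
    return (size - window) // stride + 1

# Hypothetical 32x32 input through a 2-conv / 2-pool stack:
s = 32
s = conv2d_out(s, kernel=5)          # conv 1 -> 28
s = pool_out(s, window=2, stride=2)  # pool 1 -> 14
s = conv2d_out(s, kernel=5)          # conv 2 -> 10
s = pool_out(s, window=2, stride=2)  # pool 2 -> 5
print(s)  # 5
```

The shrinking spatial size is what keeps the final fully connected layer tractable.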

Convolution Layer
The role of the convolutional layer in the CNN model is to extract image feature information. Each convolutional layer has a different convolution kernel, and the images' characteristic data is convolved by the convolution kernel [29]. The working principle of the convolution kernel is to divide the complete image into small pieces, which helps extract feature patterns. The kernel uses a specific set of weights to convolve the image by multiplying its elements with the corresponding elements of its receptive field [30].
The reasonable increase or decrease of the number of convolutional layers can not only adjust the amount of calculation in the training phase but also reduce the storage space requirement of the training model [31] [32] [33]. The basic operation of convolution is shown in Figure 4.

Figure 4. The basic operation of convolution
Suppose I represents the original image data and K is the convolution kernel; then the convolution operation is as follows:

S(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n),

where (m, n) ranges over the size of the convolution kernel and (i, j) is the index into the original image.
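The convolution operation can be sketched directly in NumPy; this is a minimal, unoptimized "valid" convolution for illustration, not the paper's implementation:

```python
import numpy as np

def convolve2d(image, kernel):
    """S(i, j) = sum over (m, n) of I(i + m, j + n) * K(m, n)."""
    m, n = kernel.shape
    H, W = image.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + n] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
print(convolve2d(img, np.ones((2, 2))).shape)  # (3, 3)
```

Real CNN frameworks implement the same operation with vectorized code and learned kernel weights.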

Pooling
After the image data is passed into the pooling layer through the convolution operation, the role of pooling is to reduce the number of parameters in the training model, which helps avoid overfitting [34] [35] [36] [37] [38]. The most commonly used pooling layers are max pooling and average pooling. The max-pooling operation returns the maximum value of each region of the feature map, and the average-pooling operation returns the average value. The basic operations of max pooling and average pooling are shown in Figure 5 and Figure 6.
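A minimal NumPy sketch of both pooling modes (2×2 windows with stride 2, chosen for illustration) is:

```python
import numpy as np

def pool2d(x, window=2, mode="max"):
    """Downsample a feature map by taking the max or mean of each window."""
    out = np.zeros((x.shape[0] // window, x.shape[1] // window))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * window:(i + 1) * window, j * window:(j + 1) * window]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 2., 3.],
              [1., 0., 3., 4.]])
print(pool2d(x, mode="max"))      # max of each 2x2 block: 4, 8, 1, 4
print(pool2d(x, mode="average"))  # mean of each 2x2 block: 2.5, 6.5, 0.5, 3.0
```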

Fully Connected Layer
In the CNN model, convolution and pooling are used for feature extraction, while the purpose of the fully connected layer is classification [39] [40] [41] [42]. Every neuron in the fully connected layer is connected to all neurons in the adjacent layers. The structure of the fully connected layer is shown in Figure 7.
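In matrix form, a fully connected layer is simply y = W·x + b over the flattened features. A tiny sketch follows; the input length of 5 is arbitrary, and the weights are random for illustration (only the nine output classes come from the paper):

```python
import numpy as np

def fully_connected(x, W, b):
    """Every output neuron combines all inputs: y = W @ x + b."""
    return W @ x + b

rng = np.random.default_rng(0)
x = rng.standard_normal(5)       # flattened feature vector (length assumed)
W = rng.standard_normal((9, 5))  # one row of weights per output class (9 fruits)
b = np.zeros(9)
print(fully_connected(x, W, b).shape)  # (9,)
```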

ReLU
The rectified linear unit (ReLU) is an activation function widely used in CNN models during gradient-descent training. Theoretically, leaky ReLU should perform better than ReLU, but a large number of practical results have shown its effect to be unstable, so it is rarely used in practice. Because leaky ReLU applies different functions on different intervals, it cannot provide a consistent prediction for positive and negative inputs with the same absolute value. The ReLU function has sparse activation and can create a sparse representation of the data, which is very helpful for classification [43] [44] [45]. ReLU is computed as

f(x) = max(0, x),

and the activation curve is shown in Figure 8.

SGDM (stochastic gradient descent with momentum) is a popular deep learning optimization method. In experiments, the momentum term is usually set to 0.9, which helps to suppress oscillation [46]. The algorithm is defined as follows:

v_t = γ · v_{t-1} + η · g_t,
θ_t = θ_{t-1} − v_t,

where g_t is the gradient with respect to the current parameters θ at time t, η is the learning rate, and γ is a hyperparameter that controls the momentum.
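The ReLU activation and the SGDM update described above amount to a few lines of NumPy; the learning rate and momentum values below are conventional defaults, not necessarily those used in the paper:

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def sgdm_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    """One SGDM update: v <- momentum * v + lr * grad; theta <- theta - v."""
    velocity = momentum * velocity + lr * grad
    return theta - velocity, velocity

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
theta, v = sgdm_step(np.array([1.0]), np.array([0.5]), np.zeros(1))
print(theta)  # [0.995]
```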
RMSProp is similar to momentum in that it damps the directions of large swings and allows a higher learning rate to accelerate the algorithm's learning [47] [48]. The algorithm is defined as follows:

E[g²]_t = ρ · E[g²]_{t-1} + (1 − ρ) · g_t²,
θ_t = θ_{t-1} − η · g_t / √(E[g²]_t + ε),

where E[g²]_t is the exponentially decaying average of squared gradients and ε is a small constant to avoid division by zero.
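A one-step sketch of the RMSProp update in NumPy (the decay rate 0.9 and learning rate 0.001 are common defaults, assumed here rather than taken from the paper):

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    """E[g^2] <- rho * E[g^2] + (1 - rho) * g^2;
    theta <- theta - lr * g / sqrt(E[g^2] + eps)."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    return theta - lr * grad / np.sqrt(avg_sq + eps), avg_sq

theta, avg = np.array([1.0]), np.zeros(1)
theta, avg = rmsprop_step(theta, np.array([1.0]), avg)
print(avg)  # [0.1]
```

Dividing by the running average of squared gradients shrinks the step in directions that oscillate, which is why a larger base learning rate can be tolerated.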

Experiment Design
Since the dataset in the experiment is not very large, the CNN model was evaluated using 10-fold cross-validation. We divide the dataset into 10 equal parts; nine of them are used for training, and the remaining one is used for testing. The operation is shown in Figure 9. A total of 10 iterations are performed, and the final result is the average over the 10 iterations.

Figure 9. 10-fold cross validation operation
In the experiment, the 10-fold cross-validation is run ten times, and the average over these ten runs is taken to assess the performance of the model. We use overall accuracy (OA) as the performance index for judging the model. The overall accuracy is the ratio of the number of correct predictions across all test sets to the total number of test samples.
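The fold construction and the OA metric can be sketched in plain Python (the fold count and dataset size mirror the paper's setup; the labels at the end are made up purely for illustration):

```python
def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous, roughly equal folds."""
    fold_size, rem = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        stop = start + fold_size + (1 if i < rem else 0)
        folds.append(list(range(start, stop)))
        start = stop
    return folds

def overall_accuracy(y_true, y_pred):
    """OA = correct predictions / total test samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

folds = kfold_indices(1800, 10)
print(len(folds), len(folds[0]))                     # 10 180
print(overall_accuracy([1, 2, 3, 3], [1, 2, 0, 3]))  # 0.75
```

In practice the indices would be shuffled before splitting so that each fold is class-balanced.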

Training algorithms
We use this 8-layer CNN with max pooling and the RMSProp training algorithm. The results are shown in Table 3 and displayed intuitively in Figure 11. After ten runs of 10-fold cross-validation, the overall accuracy of the CNN model reaches 92.52%, which is a very good result.

Pooling method comparison
We then repeated the experiment with average pooling in place of max pooling. The results are shown in Table 4 and displayed intuitively in Figure 12. The overall accuracy achieved by the CNN model combined with average pooling is 91.57%, which shows that using max pooling in the model achieves better results than using average pooling.

Training algorithm comparison
In order to determine which of the RMSProp and SGDM training algorithms provides the better optimized model, we employ the SGDM training algorithm with max pooling in the CNN model and then perform ten runs of 10-fold cross-validation. The results are shown in Table 5 and displayed intuitively in Figure 13. Although the overall accuracy with SGDM reached 91.21%, the RMSProp training algorithm achieves better overall accuracy for the 8-layer CNN model used in this experiment.

Comparison of Different Number of Conv Layers
Based on 3 fully connected layers, we set different numbers of convolutional layers in the model for experimental comparison. The results are shown in Table 6, Table 7, and Figure 14. Experiments show that using 5 convolutional layers in the model obtains the best classification results.

Comparison of different number of FCLs
In the same case where the number of convolutional layers is 5, we set up different numbers of fully connected layers for experimental comparison. The results are shown in Table 8 and Table 9 and displayed clearly in Figure 15. We found that using three fully connected layers in the model achieves the best effect.

Comparison with State-of-the-art Methods
In the experiment, we compared our method with four state-of-the-art methods: kSVM [2], 6-layer CNN [6], BBO [3], and IHGA [7]. The results are shown in Table 10 and expressed intuitively as a histogram in Figure 16. It can be seen that our method has high accuracy: based on the same dataset, the classification effect of our proposed 8-layer CNN is the best, with an overall accuracy of 91.63%.

Conclusion
In this paper, we proposed an 8-layer CNN for the fruit classification recognition system. The main contributions are as follows: (i) We first used an eight-layer CNN for the fruit classification recognition system, improving classification accuracy.
(ii) The RMSProp optimization algorithm can improve the stability of the CNN model.
(iii) Our proposed CNN model is superior to the four latest methods.
In future work, we will conduct fruit classification experiments based on more fruit pictures and further improve the detection accuracy by building an ideal model.