Improving Semi-Supervised Classification using Clustering
J. Arora, M. Tushir and R. Kashyap

Supervised classification techniques broadly depend on the availability of labeled data. However, collecting labeled data is often a tedious and costly process. To reduce this effort and improve classification performance, this paper proposes a new framework that combines a basic classification technique with semi-supervised clustering. Semi-supervised clustering algorithms aim to increase the accuracy of the clustering process by effectively exploiting the supervision available from a limited amount of labeled data, and they help to label the unlabeled data. In this paper, semi-supervised clustering is integrated with the naive Bayes classification technique, which helps to better train the classifier. To evaluate the performance of the proposed technique, we conduct experiments on several real-world benchmark datasets. The experimental results show that the proposed approach surpasses the competing approaches in both accuracy and efficiency.


Introduction
Supervised classification is a process in which sample data, termed the training set, is used as the representative structure of a class. The training set is used to train the model for the subsequent classification process, and it is selected based on the availability of knowledge or labeled data [1][2]. Nowadays, supervised classification techniques are extensively used in pattern matching, including medical diagnosis, face recognition, document classification, the banking sector and many other application areas [3][4][5]. The performance of these techniques mostly depends on the availability of labeled data. However, labeled data is limited and difficult to obtain: labeling is time consuming and requires expert involvement, which often makes it very expensive [6].
To overcome these problems, semi-supervised learning is used in the classification process, and a lot of research has been carried out in this area. Semi-supervised learning is applicable to both clustering and classification. A variety of semi-supervised clustering techniques have been proposed and shown to perform better than the unsupervised approach. Essentially, semi-supervised clustering deals with methods that incorporate additional information gathered from different sources into the unsupervised clustering process. Basu et al. (2002) [7] proposed a semi-supervised clustering algorithm based on a center initialization mechanism, in which labeled data are used as seeds to initialize the cluster centers, which are then updated by the clustering process. Demiriz et al. (1999) [8] used a genetic algorithm along with supervised and unsupervised clustering to design a semi-supervised clustering algorithm. Blum et al. (2001) [9] used a graph-based method to provide labeling information in the process of unsupervised clustering. Gan et al. (2013) [10] presented the concept of self-training, involving semi-supervised clustering along with a classification technique. Zhang et al. (2004) [11] defined semi-supervised clustering with a kernel-based approach, using an objective function that reduces the classification errors of both the labeled and unlabeled data; however, the kernel-based approach is more complex and time consuming. Piekari et al. (2018) [12] used cluster analysis to improve the results of semi-supervised learning and then used a supervised classification approach to classify pathological images of the breast. Dunlop et al. (2019) [13] defined a graph-based semi-supervised learning approach and introduced large data limits of the probit and Bayesian level set problem formulations. Wang et al. (2019) [14] used an unsupervised approach integrated with the classification process for the prediction of day-ahead electricity prices.
Several such algorithms in the literature are quite effective [15][16]. However, the question always remains whether the results can still be improved. It is a well-known fact that no single algorithm is capable of giving good results on all kinds of datasets. Hence there is always scope for improvement over existing algorithms, and we therefore propose a new model for classification in which clustering is integrated with the classification process. In our framework, semi-supervised fuzzy c-means clustering (SSFCM) is used to train the most basic classifier, resulting in a self-training process. The main advantage of the model is the efficient use of the available labeled and unlabeled data. The classification process helps to reveal the internal structure of the data, which the clustering process uses along with the unlabeled data to further improve classification. This paper builds on the following factors:
1. Semi-supervised fuzzy c-means clustering is used to label the unlabeled data, which further helps to improve the training of the classifier.
2. Naive Bayes is used for classification, and it in turn helps to provide a better internal structure of the data for semi-supervised clustering.
3. Both labeled and unlabeled data are used by the model, which helps to classify the unlabeled data with better results.
4. The simplicity of the classification process helps to give better results and makes the method more efficient.
The rest of the paper is organized as follows. Section 2 gives the details of the related work, including semi-supervised fuzzy c-means clustering and the naive Bayes classification technique. Section 3 describes the framework of the proposed algorithm. Section 4 presents the experimental results on various real datasets along with a discussion. Section 5 concludes the paper.

Related Work
The methodologies of different machine learning techniques depend widely on the availability of labeled and unlabeled data. Supervised learning requires close attention to the development of the training data: if the training data is poor or not representative, the classification results will also be poor [17]. Unsupervised learning, on the other hand, suffers from the problem of local traps due to the random initialization of the process [18]. To avoid these problems, semi-supervised approaches provide an intermediate link between supervised and unsupervised learning techniques [19].

Semi-supervised Fuzzy C-means (SSFCM)
A salient feature of partial supervision in clustering algorithms is the availability of labeled data within the given data, or of other constraints that can provide some supervision to the process of unsupervised clustering. The work in [21] extended the definition of SSFCM to better cover the hidden information when some amount of labeled information is available. E. Bair (2013) [15] presented a review of the effectiveness of semi-supervised clustering methods. An algorithm for SSFCM can be defined as follows. The data is represented as $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$, a set of $n$ vectors in feature space, where $p$ denotes the dimension of the dataset. The most basic goal of a clustering algorithm is to assign a label to each data point, indicating the class to which it belongs [22]. Let $c$ denote the number of classes. In semi-supervised clustering, a small set of labeled data is provided along with a large set of unlabeled data, so $X$ consists of a labeled part and an unlabeled part. For a labeled point $x$, the membership in class $i$ is set to 1 if $x$ is a member of class $i$ and 0 otherwise, while the memberships of the unlabeled points are initialized randomly. The initial seed (cluster center) values are then calculated from the labeled data and the labeled membership matrix.
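The initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the seed centers use the standard fuzzy c-means weighted mean with fuzzifier m = 2, restricted to the labeled points, and the function names `init_membership` and `seed_centers` are hypothetical.

```python
import random

def init_membership(labels, n_classes, rng):
    """Build a c x n membership matrix.

    labels[k] is a class index for a labeled point, or None if unlabeled.
    """
    n = len(labels)
    u = [[0.0] * n for _ in range(n_classes)]
    for k, lab in enumerate(labels):
        if lab is not None:
            u[lab][k] = 1.0                      # crisp membership for labeled points
        else:
            raw = [rng.random() + 1e-9 for _ in range(n_classes)]
            total = sum(raw)
            for i in range(n_classes):
                u[i][k] = raw[i] / total         # random memberships that sum to 1
    return u

def seed_centers(points, labels, u, m=2.0):
    """Initial centers as the weighted mean of labeled points only.

    Assumes every class has at least one labeled point.
    """
    dim = len(points[0])
    centers = []
    for row in u:
        num, den = [0.0] * dim, 0.0
        for k, x in enumerate(points):
            if labels[k] is None:
                continue                         # unlabeled points do not seed centers
            w = row[k] ** m
            den += w
            for d in range(dim):
                num[d] += w * x[d]
        centers.append([v / den for v in num])
    return centers

# Toy data (hypothetical): two labeled points per class, one unlabeled point
points = [[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0], [2.0, 2.0]]
labels = [0, 0, 1, 1, None]
u = init_membership(labels, n_classes=2, rng=random.Random(0))
centers = seed_centers(points, labels, u)
```

With crisp labeled memberships, each seed center reduces to the mean of the labeled points of that class, which is why every class needs at least one labeled sample.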

Naive Bayes Classification (NB)
In probability, Bayes' theorem gives the probability of an event based on a prior condition that might be related to the event. Similarly, the naive Bayes classifier is a probabilistic classification model based entirely on Bayes' theorem. It is used in supervised learning, where some information is already available to train the classifier [23]. It is an efficient algorithm that does not need much time to train and classifies data quickly. The most notable advantages of a naive Bayes classifier are that it does not require much training data and that it is not easily affected by outlier values. However, the efficiency of the classification process increases as the training information increases. Bayes' theorem is represented as follows:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Similarly, naive Bayes is represented as follows:
$$P(C_k \mid x_1, \ldots, x_p) = \frac{P(C_k)\,\prod_{j=1}^{p} P(x_j \mid C_k)}{P(x_1, \ldots, x_p)}$$
where $C_k$ represents the class for $k$ possible outcomes and $x_1, \ldots, x_p$ are the feature values of the sample to be classified.
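As an illustration of the formula above, the following sketch fits a naive Bayes classifier and returns the posterior for one sample. It assumes Gaussian class-conditional densities for each feature, which the text does not specify; the function names and the toy data are hypothetical.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        # the small constant keeps every variance strictly positive
        variances = [sum((v - mu) ** 2 for v in col) / len(rows) + 1e-9
                     for col, mu in zip(zip(*rows), means)]
        model[c] = (len(rows) / len(X), means, variances)
    return model

def predict_proba(model, x):
    """Posterior P(C_k | x) under the naive independence assumption."""
    log_scores = {}
    for c, (prior, means, variances) in model.items():
        s = math.log(prior)
        for xd, mu, var in zip(x, means, variances):
            s += -0.5 * math.log(2 * math.pi * var) - (xd - mu) ** 2 / (2 * var)
        log_scores[c] = s
    top = max(log_scores.values())               # subtract the max for stability
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())             # plays the role of P(x)
    return {c: v / total for c, v in exp_scores.items()}

# Toy data (hypothetical): two well-separated classes in two dimensions
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
probs = predict_proba(model, [1.1, 2.0])
predicted = max(probs, key=probs.get)
```

Training only requires one pass over the data to compute priors, means and variances, which is consistent with the short training time noted above.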

Proposed Semi-supervised Fuzzy Bayesian Classification Algorithm (SSFBC)
In our implementation, we use SSFCM for semi-supervised clustering and naive Bayes for classification. Semi-supervised fuzzy Bayesian classification (SSFBC) is thus an extended classification procedure defined by integrating classification with clustering under partial supervision. In partial supervision, clustering is provided with data in which some percentage is labeled and a large amount is unlabeled. In the complete process, the classifier is trained using the labeled data, and semi-supervised fuzzy c-means is used to label the unlabeled data, as shown in Fig. 1. The percentage of labeled data increases with every iteration and is used to better train the classifier: semi-supervised clustering assigns classes to unlabeled data, and the classifier is retrained. The naive Bayes classifier is used for classification. It is known for its simplicity and accuracy, as it requires very little training effort [23]. The simplicity of naive Bayes classification therefore reduces the time complexity of the whole process and is used to obtain accurate results quickly. The complete process is repeated until all unlabeled data are labeled.
On comparing the proposed technique with commonly used techniques such as SVM or naive Bayes, we noticed that these techniques do not adapt to the unlabeled data. An SVM is trained on the labeled data to obtain a decision boundary (a straight line in the linear, two-dimensional case) and then classifies the unlabeled data, whereas naive Bayes fits no such boundary: when the unlabeled data is classified, it simply calculates the most probable class, using the labeled data as its reference. In both cases, only the labeled data is used for learning.
These approaches fail when the unlabeled data is continuously expanding, because in such scenarios the pattern can change. In our proposed algorithm, data is labeled continuously, and only samples with a high membership degree and high confidence are added. By doing so, our algorithm is able to adapt to changes in the pattern of the data and outperforms the usual algorithms. In Fig. 2, synthetic data is generated with partially labeled points. This synthetic data represents the situation where the data is continually expanding and the pattern is changing as well. The models are trained on the partially labeled data using SVM, naive Bayes and the proposed SSFBC technique. Fig. 2(a) shows the result of classification using SVM, where a lot of data is misclassified; Fig. 2(b) shows the result of naive Bayes classification, where the data is learned inaccurately; and Fig. 2(c) shows the result of the proposed technique, which gives a high-quality result with much better accuracy. The algorithm properly adapts to changes in the structure, which results in better classification. It is not affected by the continuously changing trend in the data and gives results with high accuracy. From Fig. 2(a) and Fig. 2(b), we observe that the results continuously degrade, as the classifier does not adapt well to the increasing data.
The algorithm of the proposed technique, Semi-Supervised Fuzzy Bayesian Classification (SSFBC), is given below. We denote the degree threshold and the confidence threshold by e1 and e2, respectively.
Step 2: Initialize e1 and e2 as 0.95 and 0.60, respectively.
Step 3: Repeat until flag == true:
(a) Calculate the membership of the unlabeled data using SSFCM clustering.
(b) Select the samples with high degree, i.e. the samples x whose degree is more than e1.
(c)-(e) [...] break the iteration.
(f) Finally, add these samples to the labeled data and remove them from the unlabeled data.
(g) If size(unlabel) == 0, set flag as true.
Step 4: Train the naive Bayes classifier using the labeled data.
Step 5: Using the trained model, classify the remaining samples in the unlabeled data.
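The loop in Steps 3 to 5 can be sketched as below. Several parts are assumptions rather than the authors' method: the intermediate selection steps are not fully recoverable from the text, so this sketch accepts a sample when its fuzzy membership exceeds e1 and the classifier's posterior for the same class exceeds e2, and it stops when no sample qualifies. The membership is a simplified FCM-style formula (fuzzifier m = 2) around the means of the current labeled classes, standing in for a full SSFCM run, and the classifier is a Gaussian naive Bayes. All function names and data are hypothetical.

```python
import math

def class_means(labeled):
    """Mean feature vector per class, from (x, label) pairs."""
    groups = {}
    for x, lab in labeled:
        groups.setdefault(lab, []).append(x)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in groups.items()}

def fuzzy_membership(x, centers):
    """FCM-style membership of x in each class (fuzzifier m = 2)."""
    inv = {lab: 1.0 / (sum((a - b) ** 2 for a, b in zip(x, c)) + 1e-12)
           for lab, c in centers.items()}
    total = sum(inv.values())
    return {lab: v / total for lab, v in inv.items()}

def nb_posterior(labeled, x):
    """Gaussian naive Bayes posterior, refit on the current labeled set."""
    groups = {}
    for xi, lab in labeled:
        groups.setdefault(lab, []).append(xi)
    log_s = {}
    for lab, rows in groups.items():
        s = math.log(len(rows) / len(labeled))
        for d, xd in enumerate(x):
            mu = sum(r[d] for r in rows) / len(rows)
            var = sum((r[d] - mu) ** 2 for r in rows) / len(rows) + 1e-3
            s += -0.5 * math.log(2 * math.pi * var) - (xd - mu) ** 2 / (2 * var)
        log_s[lab] = s
    top = max(log_s.values())
    ex = {lab: math.exp(s - top) for lab, s in log_s.items()}
    z = sum(ex.values())
    return {lab: v / z for lab, v in ex.items()}

def self_train(labeled, unlabeled, e1=0.95, e2=0.60):
    """Grow the labeled set until nothing is left or nothing qualifies."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        centers = class_means(labeled)
        accepted, rest = [], []
        for x in unlabeled:
            u = fuzzy_membership(x, centers)
            lab = max(u, key=u.get)
            # assumed rule: cluster degree > e1 and classifier confidence > e2
            if u[lab] > e1 and nb_posterior(labeled, x)[lab] > e2:
                accepted.append((x, lab))
            else:
                rest.append(x)
        if not accepted:
            break                      # no sample qualifies: stop iterating
        labeled.extend(accepted)       # move accepted samples to the labeled set
        unlabeled = rest
    return labeled, unlabeled

labeled = [([0.0, 0.0], 0), ([0.4, 0.2], 0), ([5.0, 5.0], 1), ([5.4, 5.2], 1)]
unlabeled = [[0.2, 0.1], [5.2, 5.1], [0.6, 0.4], [4.8, 4.9], [2.7, 2.6]]
grown, remaining = self_train(labeled, unlabeled)

# Steps 4 and 5: classify whatever is still unlabeled with the final classifier
final = []
for x in remaining:
    p = nb_posterior(grown, x)
    final.append((x, max(p, key=p.get)))
```

In this toy run the point midway between the two classes never reaches the degree threshold, so it is left for the final classification pass instead of being added to the training data.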

Experiment Results and Discussions
In our experiments, we tested our algorithm on three real benchmark datasets from the UCI repository: Iris, Wine and Wireless Indoor Localization. All of these are multi-class datasets: Iris and Wine have three classes, whereas Wireless Indoor Localization has four. We divided each dataset in the ratio 1:4, i.e. 20% of the data is used as labeled and the remaining 80% as unlabeled. Table 1 shows the details of the datasets.
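A random 1:4 split of this kind can be sketched as follows. Drawing the labeled 20% separately within each class is an assumption made here so that every class is represented among the labeled samples; the function name `split_labeled_unlabeled` and the seed are hypothetical.

```python
import random

def split_labeled_unlabeled(y, labeled_fraction=0.2, seed=0):
    """Return (labeled_idx, unlabeled_idx) with every class in the labeled part."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(y):
        by_class.setdefault(lab, []).append(idx)
    labeled = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = max(1, round(labeled_fraction * len(idxs)))   # at least one per class
        labeled.extend(idxs[:k])
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(y)) if i not in labeled_set]
    return sorted(labeled), unlabeled

# Class sizes as in Iris: three classes with 50 samples each
y = [0] * 50 + [1] * 50 + [2] * 50
labeled_idx, unlabeled_idx = split_labeled_unlabeled(y)
```

Changing `seed` gives a different random labeling, which is how repeated runs with different labeled subsets could be produced.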

Table 1. Details of real datasets taken from UCI repository
We compared the proposed SSFBC method with a range of successful supervised and semi-supervised classification methods available in the literature: the standard supervised SVM technique, the semi-supervised SSFCM clustering-based classifier and the semi-supervised SSFCM+SVM technique [10]. SSFCM+SVM works along the same lines as our proposed method. It first finds the underlying structure in the data space by applying semi-supervised clustering to both labeled and unlabeled data, and then an SVM classifier is trained on the labeled data.

Error rate of labeling unlabeled data
In this section, we compare the results of our proposed method with the other stated algorithms on the basis of the error rate in classifying the unlabeled data. The results of the experiments are assessed using Huang's accuracy measure [24]. The labeled data allows the clustering process to specify the number of clusters c on the basis of the available class information. The constraint for labeling is that data from every class should be labeled, so that the training set captures training patterns from every class. As the labeling of data is done randomly, we took 10 observations on each dataset and then calculated the mean error and standard deviation. The graphs in Fig. 3 show the percentage error rate of misclassification of data points with respect to their clusters for all ten test cases. This gives an overview of the performance of the above algorithms for different cases. Fig. 3(a) shows the result for the Iris dataset, where the proposed technique shows the minimum error rate compared to the other techniques in every run. Fig. 3(b) shows the result for the Wine dataset and reveals that SSFBC performs comparatively well in most of the cases. Fig. 3(c) shows the result for the Wireless Indoor Localization dataset. In all the cases, the proposed technique SSFBC performs well compared to the other techniques. Table 2 shows the mean and standard deviation of the error rate calculated over the runs. In each pass, labeling is done randomly, and the classification process is carried out for every technique with the same set of labeled data.
Finally, after 10 runs, the mean and standard deviation are calculated to obtain the final results. Table 2 shows that the proposed technique SSFBC gives better results, with the minimum mean and standard deviation of the error rate on all three datasets.
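The aggregation over runs can be sketched as below. The error rate is taken here as the percentage of misclassified points, and the sample standard deviation is used; both are assumptions, since the text does not state which estimator was applied. The run values are placeholders for illustration, not results from the paper.

```python
import statistics

def error_rate(true_labels, predicted_labels):
    """Percentage of points whose predicted label differs from the true one."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * wrong / len(true_labels)

def summarize(run_errors):
    """Mean and sample standard deviation of the error rate over repeated runs."""
    return statistics.mean(run_errors), statistics.stdev(run_errors)

# Placeholder error rates for four hypothetical runs (not measured values)
hypothetical_runs = [4.0, 6.0, 5.0, 5.0]
mean_err, std_err = summarize(hypothetical_runs)
single_run = error_rate([0, 1, 1, 0], [0, 1, 0, 0])
```

Using `statistics.pstdev` instead of `statistics.stdev` would give the population standard deviation, if that is the convention preferred.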

Time consumption for labeling unlabeled data
In this section, we compare our algorithm with the previously used algorithms on the basis of the time taken to label the unlabeled data. Time consumption is defined as the time taken by a particular technique to complete the process. Here, clustering is carried out on data of different sizes and dimensions; as the size of the data increases, more time is required for computation. Table 3 gives the time spent classifying the Iris dataset in each run with the different techniques. The values show that SSFCM takes the least time in every run compared to the other techniques. However, SSFCM fails to classify the data completely into different classes and gives the maximum error rate of all the techniques. Table 4 gives the time spent classifying the Wine dataset, and Table 5 gives the values for the Wireless Indoor Localization dataset. In each case, the proposed technique SSFBC outperforms SVM and SSFCM+SVM in terms of error rate and computational time.
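One way to measure the time taken by a technique is sketched below. How the reported times were actually measured is not stated, so the wall-clock timer and the helper name `timed` are assumptions for illustration.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in practice fn would be one complete labeling run
result, seconds = timed(sorted, [3, 1, 2])
```

Timing each technique on the same labeled subset in every run keeps the comparison between techniques fair.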

Impact of threshold parameter e1
In this section, we discuss how the proposed technique responds to changes in the value of the parameter e1.

Conclusion
In this paper, we discussed the integration of two different algorithms to produce a synergistic effect: clustering provides better insight into the data and helps us train a better classifier. This was done to tackle the problem where labeled data is scarce and the data is continuously growing. To do this, we used SSFCM as the clustering algorithm and naive Bayes as the classifier, as it provides accurate results in a short amount of time. We carried out multiple experiments on different datasets and compared the results with different supervised and semi-supervised algorithms. In all the experiments, we observed that the error rate of the proposed algorithm SSFBC was low compared to the other techniques. In terms of time consumption, we observed that, regardless of dataset size, the time consumed by the proposed algorithm was much lower than that of the SVM and SSFCM+SVM classifiers. Only SSFCM gave results faster, but once the error rate is also taken into account, our algorithm clearly performed better. Hence the proposed technique proved to perform better in terms of accuracy and effectiveness.