Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC

Although there exist various machine learning and text mining techniques to identify the programming language of complete code files, multi-label code snippet prediction has not been considered by the research community. This work aims at devising a tuner for multi-label programming language prediction of stack overflow posts. To that end, a Hyper Source Code Classifier (HyperSCC) is devised along with rule-based automatic labeling by considering the bottlenecks of multi-label classification. The proposed method is evaluated on seven multi-label predictors to conduct an extensive analysis. The method is further compared with three competitive alternatives in terms of one-label programming language prediction. HyperSCC outperformed the other methods in terms of the F1 score. Preprocessing results in a high reduction (50%) of training time when ensemble multi-label predictors are employed. In one-label programming language prediction, Gradient Boosting Machine (gbm) yields the highest accuracy (0.99) in predicting R posts, which have many distinctive words determining labels. The findings support the hypothesis that multi-label predictors can be strengthened with sophisticated feature selection and labeling approaches.


Introduction
Stack overflow helps software developers find solutions to programming problems and hosts millions of questions and answers. Less experienced programmers are the most likely to spend time on stack overflow entering new questions or reviewing past entries. Those activities not only speed up development processes but also improve the programming skills of the programmers.
Platforms like quora or stack overflow provide questions and answers along with tags that make it easier to find a possible solution. Users are required to tag their posts in stack overflow. However, inexperienced users may sometimes choose the wrong tags while posting. For instance, the tag "entity-framework" is generally used for both Java and C#. That tag fits very well with C#. Hibernate [1] is a mapping tool that was developed for Java to manipulate objects as in entity-framework. Therefore, a user should employ "Hibernate" to tag object-relational functions in Java rather than "entity-framework". Further, the tag "google-maps" is preferred when the functions related to the mapping are frequently invoked in the applications. However, "google-maps" cannot give any hint about the type of programming language. Rather, we expect to see more than one tag to determine exactly what the type of programming language is. Moderators either flag or downvote those posts to cope with misleading information. In that case, the following issues emerge: 1. Wrong tags result in a significant increase in the workload of moderators. 2. Finding a wrongly tagged post becomes difficult even though it presents valuable information.
* Corresponding author. Email: muhammedozturk@sdu.edu.tr
EAI Endorsed Transactions on Scalable Information Systems, Research Article, Online First
The majority of the studies finding stack overflow a worthwhile research subject have focused on code analysis [2][3][4], user behavior [5][6][7], and predictive models [8][9][10][11][12][13]. Text mining tools [14,15] and machine learning (ML) techniques [16][17][18] are frequently employed in the assessment of stack overflow posts. Further, some researchers [19,20] investigated to what extent machine learning techniques are beneficial to run a fast and rigorous experiment. Large-scale analysis of the vast data also requires a hyperparameter tuning process [21,22] to obtain reliable results. Integrated Development Environments (IDEs) such as Visual Studio, NetBeans, and Xcode allow practitioners to organize and publish their code. However, those tools cannot predict the language of a given file. Rather, they recognize the source code by checking its file extension. This creates a burden for developers, who must edit file extensions manually. To alleviate that burden, software language prediction methods have been developed in various studies [23][24][25]. However, previous works mostly use data sets including a large number of code lines. ML methods developed for those data sets result in high prediction accuracy since the number of features extracted from the source code is very high. On the other hand, those methods cannot produce promising results when the experimental data sets include a relatively small number of code lines.
As stack overflow posts have small code snippets in question blocks, sophisticated code snippet prediction tools are needed. There exist some works [26][27][28] utilizing tag and question information to predict programming language. Programming Languages Identification (PLI) [29], developed by Algorithmia, is the only commercial tool for predicting the programming language of code snippets. PLI claims high success in programming language prediction (PLP) (>98% accuracy), but that record was mostly obtained on GitHub code, which is larger than the code snippets available in stack overflow.
In this study, multi-label classification of stack overflow questions is conducted. To that end, a novel multi-label label generation method is devised along with a hyperparameter optimization method, namely HyperSCC. The method chooses an optimizer by comparing the prediction results obtained through cross-validation on 10% of all training instances. Hence, the most suitable optimizer is matched with the multi-label predictors, which helps save experiment time.

Motivation
Multi-language coding is common in software development [30]. In this context, stack overflow questions may have multiple language tags. On the other hand, there is no research on the multi-label classification of code snippets. Preceding works focused on predicting one language tag of the stack overflow questions [27,31]. In addition to that research gap, few studies [20,32] analyze the impact of hyperparameter tuning of ML methods handling stack overflow posts. Hence, disregarding the combination of tuning methods with ML is the main drawback of the preceding works.
The development of hyperparameter tuning techniques has given rise to more precise predictive models [33][34][35][36][37][38]. However, a tuning process should be organized to conform with experimental data sets [39]. Further, sometimes hyperparameter optimization is suspended or resumed depending on the effectiveness of the tuning process [40]. In this respect, we need new perspectives to improve source code classification techniques.
To clarify the motivation of the paper, Table 1 is designed by summarizing the studies that are similar to our work. Specifically, tag recommendation studies are mostly tuning-free. It is worthwhile to note that this study combines hyperparameter tuning and multi-label prediction.
Unlike preceding works, this study presents HyperSCC for multi-label classification, which alleviates the computational burden originating from hyperparameter optimization. Revealing which tuning method is beneficial for programming language prediction helps researchers find strategies to use ML methods in new ways. To the best of our knowledge, this research is the first extensive investigation proposing a tuning approach for the multi-label classification of stack overflow questions.

Research objectives
This paper defines four research objectives as follows:
Research objective 1 (RO1): Investigate whether automatic rule-based labeling helps increase the success of hyperparameter optimization. To accomplish this objective, default labels of the posts are modified with a multi-label label data frame. Thereafter, a comparison including seven multi-label predictors is conducted after hyperparameter optimization.
Research objective 2 (RO2): Investigate whether HyperSCC is beneficial for one-label programming language prediction as detected in multi-label prediction. To accomplish RO2, four state-of-the-art methods including HyperSCC are evaluated with F1 score results.
Research objective 3 (RO3): Examine whether preprocessing helps reduce the training time of multi-label predictors. For RO3, the training times of multi-label predictors are compared for an increasing number of instances up to 5000.
Research objective 4 (RO4): Examine which script language yields the highest accuracy when using grid search. For RO4, the accuracy values of eight script languages are compared.

Contribution
The major contributions of the study can be elucidated as follows:
1. A hyperparameter tuning method, which utilizes a small part of the training instances to decide the optimizer, is proposed to set hyperparameters of multi-label classification.
2. We develop a rule-based multi-label label generation technique for stack overflow questions.
3. Empirically, to validate the reliability of our method, extensive experiments, which involve single and multiple programming language predictions, are conducted to evaluate and discuss the method.

Research questions
In this work, four research questions are addressed:
RQ1: What type of multi-label classification technique to choose for better success in programming language predictions? Some multi-label classification techniques produce results depending on label- or size-specific features. Those techniques are discussed and evaluated regarding performance measures in this question.
RQ2: Is HyperSCC superior to the state-of-the-art methods with respect to one-label programming language prediction? In this question, the advantages and disadvantages of HyperSCC versus the state-of-the-art methods are assessed. 24 programming languages are involved in this sub-experiment.
RQ3: To what extent can preprocessing increase prediction time? Multi-label prediction of programming languages is an effort-intensive and time-consuming process. This question aims to check whether preprocessing reduces training times of the predictors. The preprocessing entails removal of the instances featuring infrequent labels (<5), instances without labels, and constant attributes.
RQ4: Which script language is the most feasible for one-label programming language prediction? Script languages include similar words. For instance, "array" is one of the most detected words in Perl, PHP, and Lua. It is of great importance to conduct a rigorous analysis of such languages to enhance the comprehensiveness of the study. For this reason, script languages are evaluated both for one-label and multi-label prediction.
The organization of the rest of the article is as follows: Section 2 presents background and notions. The method is described in Section 3. Experimental configurations are presented in Section 4. The findings are given in Section 5 to discuss them in several aspects. Last, the paper is concluded in Section 6.

Background
This section describes the underlying concepts and notions of the study. To this end, four subsections are devised to present a general view.

Source code classification
The terms 'classification' and 'identification' are sometimes used interchangeably in this field. Source code classification is a challenging issue due to the large number of features extracted from the text corpus. Let $X = \mathbb{R}^m$ denote the $m$-dimensional input space, where $y \in \{0, 1\}$ is the binary label. Here, the objective is to predict $y$ by utilizing a function $f: X \to \{0, 1\}$ that maps the input space according to its mathematical assumptions. If the number of instances is very small compared with $m$, the prediction may not be completed. On the other hand, $m$ should not be very large if prediction results are to be reached in a reasonable time. To address this problem, feature selection is conducted on $\mathbb{R}^m$. Specifically, $X$ is divided into parts $x_1, \ldots, x_t$ such that the total dimension of these parts is equal to $m$. A reduction function $f_r$ takes $X$ and deletes some dimensions. The new dimension is then represented with $m'$, which should satisfy $m > m'$.

Multi-label classification
Consider an input space $X = \mathbb{R}^m$, where $y = \{y_1, y_2, \ldots, y_n\}$ is the label set and $n$ is the number of labels. In programming language classification, $n$ also represents the number of programming languages. Unlike one-label classification, it is necessary to produce multiple outputs for each instance. Hence, performance measures can be extended with hamming-loss, one-error, and subset-accuracy. Feature reduction may also be a feasible solution for multi-label classification. More importantly, infrequent labels should be removed before the training process. Concretely, $n$ is replaced with $n''$, which is obtained with a reduction on $n$. The threshold of that process depends on the objectives of the established model. In some cases, manual label generation leads to the production of unlabeled instances. To remedy that problem, unlabeled instances are removed. If a feature has a single value for all the instances, it is also removed from $X$.
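The three clean-up operations described above (dropping infrequent labels, unlabeled instances, and constant features) can be illustrated with a minimal Python sketch. It is not the study's R implementation; the function name `clean_multilabel`, the list-of-rows data layout, and the default threshold are assumptions made for illustration only.

```python
def clean_multilabel(features, labels, min_label_count=2):
    """Illustrative clean-up for a multi-label data set.

    features: list of rows, each a list of feature values.
    labels:   list of rows, each a list of 0/1 label indicators.
    min_label_count is a hypothetical threshold, chosen per experiment.
    """
    n_labels = len(labels[0]) if labels else 0

    # 1) Keep only labels that occur at least min_label_count times.
    kept_labels = [j for j in range(n_labels)
                   if sum(row[j] for row in labels) >= min_label_count]
    labels = [[row[j] for j in kept_labels] for row in labels]

    # 2) Drop instances that are left without any label.
    keep_rows = [i for i, row in enumerate(labels) if any(row)]
    features = [features[i] for i in keep_rows]
    labels = [labels[i] for i in keep_rows]

    # 3) Drop features that take a single value for all remaining instances.
    n_features = len(features[0]) if features else 0
    kept_features = [j for j in range(n_features)
                     if len({row[j] for row in features}) > 1]
    features = [[row[j] for j in kept_features] for row in features]

    return features, labels, kept_features, kept_labels
```

The returned index lists make it possible to apply the same column selection to a held-out test set.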

Hyperparameter optimization
Let $L$ be a machine learning algorithm whose parameter and hyperparameter sets are represented with $L_p$ and $L_{ph}$, respectively. For a tuning function $f_t$, the main purpose is to configure $L_{ph}$. On the other hand, $L_p$ is out of scope in that process since the parameters of a machine learning algorithm change during the training. For instance, the weights of a neural network are not tuned since they are determined while the network is trained.
Suppose $L_{ph}$ includes three hyperparameters $a$, $b$, and $c$, whose candidate ranges differ in length depending on the type of hyperparameter. During the tuning process, $f_t$ searches $a$, $b$, and $c$ to find the optimal hyperparameter set $\{a_i, b_j, c_k\}$.
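As a hedged illustration of a tuning function searching three hyperparameters with candidate lists of different lengths, the Python sketch below performs an exhaustive search. The names `grid_search` and `toy_score`, as well as the candidate values, are hypothetical; in a real experiment the score would come from training and validating a learner.

```python
from itertools import product


def grid_search(search_space, score):
    """Try every combination in search_space and return the best one.

    search_space: dict mapping hyperparameter name -> list of candidates.
    score:        callable taking a configuration dict, higher is better.
    """
    names = list(search_space)
    best_config, best_score = None, float("-inf")
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        current = score(config)
        if current > best_score:
            best_config, best_score = config, current
    return best_config, best_score


def toy_score(cfg):
    # Stand-in for a validation score; it peaks at a=2, b=0.1, c=30.
    return -((cfg["a"] - 2) ** 2 + (cfg["b"] - 0.1) ** 2 + (cfg["c"] - 30) ** 2)


# Candidate lists of different lengths for a, b, and c (assumed values).
space = {"a": [1, 2, 3], "b": [0.01, 0.1], "c": [10, 20, 30, 40]}
best, value = grid_search(space, toy_score)
```

The number of evaluations is the product of the list lengths (here 3 x 2 x 4 = 24), which is why exhaustive search becomes costly as hyperparameters are added.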

Problem formulation
Let $L_1, L_2, \ldots, L_t$ be a set of learning algorithms to which tuning methods $T_1, T_2, \ldots, T_z$ can be applied. Each tuning algorithm results in a tuning time $\triangle$ along with a performance record $P$. The main objective is to reveal the optimal tuning time $\triangle^*$, which is detected by comparing cost functions produced from $\sum_{i=1}^{z} \triangle_i * P_i$. First and foremost, $D_x$, which is a small part of the training set, is taken from $X$ to analyze $\triangle * P$. The optimal cost function is then represented with $\triangle_o * P_o$, which is calculated for each $L$. In that comparison, the number of instances should be at least five times the size of $L_{ph}$. $\sum_{i=1}^{z} P_i \triangle_i$ is the general effect of the chosen tuning methods. In this context, the aim of tuning process is to maximize

HyperSCC
Step 1: Term document matrix generation. HyperSCC starts by analyzing the raw data set to extract the term document matrix [44], as shown in Figure 1. Each unique word is recognized as a feature in the matrix. Questions and their titles are converted to documents. The text of each post is converted to lowercase. After that, punctuation is removed. Numbers and white spaces are eliminated to finalize the document. If those processes are completed successfully, the term document matrix is produced. Word and character counts are further calculated and added to that matrix for each post since they are not available in raw posts.
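The text-cleaning and counting operations of Step 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the R code of HyperSCC: the function names, the `(title, body)` input format, and the choice to compute word and character counts on the raw text are all hypothetical.

```python
import re
import string
from collections import Counter

# Translation table that replaces every ASCII punctuation mark with a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Lowercase, strip punctuation and numbers, and collapse white space."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    text = re.sub(r"\d+", " ", text)
    return " ".join(text.split())


def term_document_matrix(posts):
    """posts: list of (title, body) pairs. Returns (header, rows)."""
    docs = [clean_text(title + " " + body) for title, body in posts]
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for (title, body), doc in zip(posts, docs):
        counts = Counter(doc.split())
        raw = title + " " + body
        row = [counts[word] for word in vocabulary]
        # Append word and character counts of the raw post (assumption).
        row += [len(raw.split()), len(raw)]
        rows.append(row)
    return vocabulary + ["word_count", "char_count"], rows
```

For example, `term_document_matrix([("Sort array", "How to sort an array in PHP 7?")])` yields one row in which the columns for "array" and "sort" both hold 2.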
Step 2: Labeling. Existing works [29, 31] only consider source code language prediction as a binary classification problem. Contrary to these studies, we aim to complete multi-label language prediction for each instance. In labeling, each instance is processed to yield multi-labels representing 24 different programming languages: javascript, sql, java, C#, python, c++, c, php, ruby, swift, objective-c, vb.net, perl, bash, css, scala, html, lua, haskell, markdown, R, matlab, GO, and kotlin. In the tag data set, each post may be associated with multiple programming languages. To establish rule-based labeling, at least two distinctive words are searched for each programming language. These words were also extracted in [31]. Further, if a post is tagged with one or more of the programming languages, it is labeled as tagged. A labeling algorithm, displayed in Algorithm [9][10][11]. ListKeywords is generated by converting distinctive keywords to a list (line 13). The tag data set is checked to detect whether the analyzed post is tagged with a specific programming language (line 14).
Here, the second column of TagD includes tag information (TagD[, 2]). If distinctive keywords such as "C#", ".net", "sql", and "php" are identified one or two times in the related instance, it is then labeled as 1. Hence, the threshold is either one or two, and these values are set depending on the programming language. Lastly, a data frame is generated by using labeling lists y1 : y24 to return labeling feature vectors LM.
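A rough Python sketch of the rule-based labeling idea follows. It is not the labeling algorithm of the paper: the keyword lists, the thresholds, and in particular the decision to label a post positively when it is either tagged with the language or reaches the keyword threshold are assumptions made for illustration.

```python
def label_post(text, tags, rules):
    """Return a dict language -> 0/1 for one post.

    text:  title and body of the post.
    tags:  iterable of tag strings attached to the post.
    rules: dict language -> (list of distinctive keywords, threshold).
    """
    text = text.lower()
    tags = {tag.lower() for tag in tags}
    labels = {}
    for language, (keywords, threshold) in rules.items():
        # Plain substring counting; a real rule set would need care with
        # keywords that are substrings of other words.
        hits = sum(text.count(keyword.lower()) for keyword in keywords)
        tagged = language.lower() in tags
        labels[language] = 1 if (tagged or hits >= threshold) else 0
    return labels


# Hypothetical rules: keywords and thresholds are illustrative only.
example_rules = {
    "php": (["php", "$_post"], 1),
    "c#": (["c#", ".net"], 2),
    "r": (["ggplot", "data.frame"], 1),
}
```

Applying `label_post` to every post and stacking the resulting dictionaries gives a label data frame with one 0/1 column per language.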
Step 3: Preprocessing. Since each word is regarded as a feature in the posts, feature selection is a must to complete training in a reasonable period. To that end, Pearson correlation analysis is chosen to remove pair-wise correlations. In each step, the correlations are re-evaluated with a specific cutoff (0.7). Character and word counts are not involved in the correlation analysis. The formula of Pearson correlation analysis is given in Equation 1, where $a$ and $b$ denote the variables, $sc(a, b)$ is their sample covariance, and $sv(a)$ and $sv(b)$ are the sample variances.
$$r(a, b) = \frac{sc(a, b)}{\sqrt{sv(a)\, sv(b)}} \quad (1)$$
At the end of feature selection, highly correlated features are removed from the term document matrix. Last, the data frame converted from the term document matrix is exposed to a three-phase process. 1) Infrequent labels, which occur fewer than two times, are disregarded. 2) The instances having no labels are removed. 3) If a feature has a constant value for all the instances, it is also removed from the data frame.
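The correlation-based feature selection can be illustrated with the Python sketch below. It uses a simple greedy rule (a feature is kept only if its absolute Pearson correlation with every previously kept feature does not exceed the cutoff), which is an assumption and may differ from the procedure used in the R implementation. The names `pearson`, `drop_correlated`, and `protected` are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Sample Pearson correlation: covariance over the root of the variances."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((y - mean_b) ** 2 for y in b) / (n - 1)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for constant columns
    return cov / sqrt(var_a * var_b)


def drop_correlated(columns, cutoff=0.7, protected=()):
    """Return the names of the columns that survive the correlation filter.

    columns:   dict name -> list of values (insertion order is respected).
    protected: names excluded from the analysis, e.g. word/character counts.
    """
    kept = []
    for name, values in columns.items():
        if name in protected:
            kept.append(name)
            continue
        candidates = [k for k in kept if k not in protected]
        if all(abs(pearson(values, columns[k])) <= cutoff for k in candidates):
            kept.append(name)
    return kept
```

With a cutoff of 0.7, a column that is an exact multiple of an earlier column is dropped, while weakly correlated columns and the protected count columns are retained.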
Step 4: Selection of optimization method. Firstly, 10% of the instances allocated for training are randomly taken. 80% and 20% of the selected instances are used for training and validation, respectively. Thereafter, the results of four optimization methods, namely Nelder-Mead, Genetic algorithm, Bayes, and Random search, are evaluated on that validation set. They are involved in the experiment for the following reasons: 1) Nelder-Mead requires fewer optimization iterations [45] than the equivalent competitive methods. 2) An intensive data augmentation process is not conducted in Bayesian hyperparameter optimization [46]. 3) Genetic algorithm is a robust hyperparameter optimization technique that reaches a well-tuned algorithm yielding accurate results [47]. 4) If the number of hyperparameters is not large, Random search can achieve promising results in a reasonable time [48]. The best method is selected by comparing the prediction accuracies of the optimization methods.
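The selection logic of Step 4 can be summarized with the following Python sketch. The callables passed as `optimizers` and `evaluate` are hypothetical placeholders for real tuners and for a train-and-validate routine; only the sampling (10% of the training instances, split 80%/20%) and the accuracy comparison follow the description above.

```python
import random


def select_optimizer(train_instances, optimizers, evaluate, seed=0):
    """Pick the tuner with the best validation accuracy on a 10% sample.

    optimizers: dict name -> callable(fit_part) returning a configuration.
    evaluate:   callable(config, fit_part, valid_part) returning accuracy.
    """
    rng = random.Random(seed)
    sample_size = max(1, len(train_instances) // 10)
    sample = rng.sample(train_instances, sample_size)

    # 80% of the sample is used for fitting, 20% for validation.
    cut = int(0.8 * len(sample))
    fit_part, valid_part = sample[:cut], sample[cut:]

    scores = {}
    for name, optimize in optimizers.items():
        config = optimize(fit_part)
        scores[name] = evaluate(config, fit_part, valid_part)

    best = max(scores, key=scores.get)
    return best, scores
```

The chosen optimizer name would then be used for the full tuning run in the next step, so the expensive search is carried out only once.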
Step 5: Optimization and production of results. In this step, the training and testing parts are renewed on the instances (80% training and 20% testing). The training instances are divided into 10 folds. Nine folds are employed for training and one fold is used for validation. That process is repeated ten times. For each iteration, the validation set is changed. The optimal configuration of the multi-label predictor, which is found by the optimization method decided in the previous step, is saved. 10-fold cross-validation is repeated for the training instances by setting the optimal configuration of the multi-label predictors, including ensemble of binary relevance (EBR) [49], random k-labelsets (RAKEL) [50], controlled label correlation exploitation (CTRL) [51], ensemble of classifier chain (ECC) [52], ensemble of single label (ESL) [53], label specific features (LIFT) [54], and metabr (MBR) [55]. Last, the testing set is utilized to yield general prediction results.
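The 10-fold scheme of Step 5 can be illustrated with a small Python sketch that only manipulates instance indices. The function names and the `fit_and_score` callback are assumptions; any multi-label predictor could be plugged in through that callback.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validate(n_instances, fit_and_score, k=10, seed=0):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(training, validation)
              for training, validation in k_fold_indices(n_instances, k, seed)]
    return sum(scores) / len(scores)
```

Each instance appears in exactly one validation fold, so every training instance contributes once to the validation results.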

Data sets
We retrieved experimental data sets including three types of data from two sources: questions and tags data sets that are publicly available (https://www.kaggle.com/stackoverflow/stacksample). The question data set includes 1264216 posts. Since a post of stack overflow may be tagged with multiple programming languages, the tag data set has 3750994 instances, which is higher than the number of instances in the question data set. Because a concurrent analysis of 20000 instances leads to a huge computational burden (162 GB) for RAM, 500 instances are processed in each iteration of the whole experiment. By averaging the results of those parts, an ultimate output is yielded. The increase in the number of features is very fast for the first 500 instances, as shown in Figure 3. This is because the number of new words decreases as the number of processed posts increases. The details of the experimental data sets are given in Table 2. The term document matrix is generated by combining "Title" and "Body" of the Questions data set. The column named "Sparsity" shows whether the related feature has "NA" values. Note that some questions may remain unclosed, so "ClosedDate" is the sole feature having sparsity. Different from the questions, the answers of stack overflow have no tag, as shown in Figure 2. Generally, bodies of questions play an important role in understanding the issue. Questions having an enriched description therefore get a fast and clear response.

Prediction configurations
The proposed method is coded in R [56], which provides fast computing for machine learning experiments. To run Nelder-Mead, the nloptr library [57] is utilized. A function named neldermead is run by giving the initial point along with the upper and lower bounds of the target hyperparameter. The GA function, which is available in the GA library [58] to run the genetic algorithm, is executed with the following configurations: crossover probability: 0.8, mutation probability: 0.1, and maximum number of iterations: 100. The GA library consists of several functions for performing optimization using genetic algorithms. It also enables us to modify genetic operators and run them sequentially or in parallel depending on the experimental design. The GA function maximizes a given fitness function using the basic principles of genetic algorithms. The parameters utilized to run Bayes are as follows: the number of iterations: 50, type of acquisition function: Gaussian process upper confidence bound, kappa: 2.576, and epsilon of expected improvement and probability of improvement: 0. To perform Random Search, each target hyperparameter is generated randomly for 100 iterations. Thereafter, the value yielding the highest accuracy is determined as the optimal value.
Search spaces of the hyperparameters tuned for multi-label predictors and their definitions are presented in Table 3. It is worthwhile to note that the number of hyperparameters changes depending on the type of predictor. The machine we employ to run the experiment has the following technical properties: 32 CPU(s), Intel(R) Xeon(R) CPU E5-2690, 222 GB RAM, CentOS 7 operating system, and NVIDIA Tesla S870 graphic card. A parallelization is further established on that machine to shorten the completion time of the experiment.
1 https://www.kaggle.com/stackoverflow/stacksample
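The Random Search procedure described above (draw each hyperparameter at random for 100 iterations and keep the value with the highest accuracy) can be sketched in Python as follows. The function name, the `(low, high)` range format, and the example ranges are assumptions for illustration; the study itself runs its optimizers in R.

```python
import random


def random_search(search_space, score, iterations=100, seed=0):
    """Sample configurations at random and return the best-scoring one.

    search_space: dict name -> (low, high); integer bounds are sampled as
                  integers, other bounds as floats.
    score:        callable taking a configuration dict, higher is better.
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(iterations):
        config = {}
        for name, (low, high) in search_space.items():
            if isinstance(low, int) and isinstance(high, int):
                config[name] = rng.randint(low, high)  # inclusive bounds
            else:
                config[name] = rng.uniform(low, high)
        current = score(config)
        if current > best_score:
            best_config, best_score = config, current
    return best_config, best_score
```

A call such as `random_search({"depth": (3, 7), "rate": (0.001, 0.008)}, score)` would evaluate 100 random configurations, where `score` stands for a user-supplied validation accuracy function.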
To make a quantitative comparison with other state-of-the-art one-label programming language prediction methods, Xgboost and Random forest algorithms are employed. The mean results of those algorithms are compared with those of PLI, SCC [27], SCC++ [31], and DeepSCC [59]. We requested a public key to run the R script that yields the F1 score results of PLI. The R script devised to execute PLI is given in footnote 2.
One-label and multi-label predictions have the same configurations for applying preprocessing and dividing the data sets to obtain performance measures. Xgboost, which inherits the advantages of parallelization in creating decision trees, has three hyperparameters subject to optimization: max.depth (the depth of the tree): 3-7, eta (control parameter determining the rate of model learning): 0.001-0.008, and nrounds (number of iterations): 19-80. The Random forest algorithm is exposed to a tuning process to set mtry (number of random variables for each split of the training): 1-5. gbm is employed to compare the one-label prediction performances of script languages. Four hyperparameters of gbm are tuned with the Grid search algorithm. The hyperparameters and the search space are as follows: interaction.depth: 1-8, number of trees: 50-100, learning rate: 0.1-0.8, and minimum number of observations: 5-20. The caret library of R is utilized to run gbm along with Grid search.
The source code of HyperSCC was uploaded to GitHub, and the URL is https://github.com/muhammedozturk/HyperSCC/. Detailed explanations for running HyperSCC are given at that address. Further, the link to the processed file that was created after feature extraction is available there.

Hamming-loss shows the ratio of misclassified instance-label pairs. Subset-accuracy is the extended version of accuracy; it is a strict metric that counts an instance as correct only when its predicted label set exactly matches the real one. For $n$ observations, $n.l$ denotes the matrix of the label set. $R$ denotes real memberships and $P$ represents predicted memberships. $I$ is the indicator function that evaluates $R_i = P_i$, where $R_i$ and $P_i$ denote the real and predicted membership values being processed at the related iteration. One-error calculates the ratio of instances whose most confidently predicted label is irrelevant.
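For readers who want to check the three measures on a small example, the Python sketch below implements their commonly used definitions. The exact formulas of the paper are not reproduced here, so the definitions, the function names, and the list-of-rows layout should be read as assumptions.

```python
def hamming_loss(real, predicted):
    """Fraction of instance-label pairs that are predicted incorrectly."""
    total = sum(len(row) for row in real)
    wrong = sum(r != p
                for real_row, pred_row in zip(real, predicted)
                for r, p in zip(real_row, pred_row))
    return wrong / total


def subset_accuracy(real, predicted):
    """Fraction of instances whose whole label vector is predicted exactly."""
    exact = sum(real_row == pred_row for real_row, pred_row in zip(real, predicted))
    return exact / len(real)


def one_error(real, scores):
    """Fraction of instances whose top-scored label is not a real label."""
    errors = 0
    for real_row, score_row in zip(real, scores):
        top = max(range(len(score_row)), key=score_row.__getitem__)
        errors += real_row[top] == 0
    return errors / len(real)
```

Note that hamming-loss and subset-accuracy work on 0/1 predictions, whereas one-error needs the confidence scores that a predictor assigns to each label.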

RQ1: What type of multi-label classification technique to choose for better success in programming language predictions?
To compare multi-label predictors with respect to HyperSCC, the three performance measures presented in Table 4 are produced for each fold of the validations. Note that one-error decreases as the number of validations increases. On the other hand, the results have similar hamming-loss values regardless of the type of predictor. The accuracy results of the multi-label predictors are presented as box-plots in Figure 4. We observe that the predictors show similar accuracies, as detected in Table 4. The following findings are confirmed with RQ1: 1) The number of validation sets is crucial to stabilize prediction errors. 2) A hyperparameter optimization approach developed for multi-label classification creates a similar effect on both prediction accuracy and errors.

RQ2: Is HyperSCC superior to the state-of-the-art methods with respect to one-label programming language prediction?
To answer this research question, HyperSCC is utilized to obtain the combined mean results of Xgboost and Random forest. Table 5 reports the F1 scores of the state-of-the-art programming language prediction methods along with HyperSCC. It is worthwhile to note that HyperSCC outperformed the others for 15 programming languages. The ineffectiveness of HyperSCC for the other six programming languages may have originated from the labeling rules. The GO programming language comprises some common tags such as "api", "sql-server", and "xml" that also belong to other programming languages. That case may have led to a dramatic decline in the success of GO prediction. Although Java and C# share many syntactic and structural properties, the F1 score of Java is 0.1 greater than that of C#. The labeling rule of Java has more distinctive words than that of C#. For example, the words "java" and "jar" are abundant in stack overflow questions. On the other hand, "instance" is one of the three words utilized in constructing the labeling rule of C#. In that case, the training becomes more imbalanced for learning C# instances. SCC and SCC++ are superior to HyperSCC for those languages.
Table 4. The results of HyperSCC of multi-label predictors.
Predictor  Measure          V1      V2      V3      V4      V5      V6      V7      V8      V9      V10
CTRL [51]  hamming-loss     0,0583  0,0571  0,0576  0,0576  0,0571  0,0583  0,0588  0,0592  0,0569  0,0576
CTRL [51]  one-error        0,0760  0,0678  0,0657  0,0637  0,0595  0,0575  0,0554  0,0638  0,0679  0,0617
CTRL [51]  subset-accuracy  0,3758  0,3717  0,3655  0,3676  0,3676  0,3717  0,3676  0,3683  0,3642  0
According to Table 6, there is a conspicuous difference between HyperSCC and the other methods. However, the statistical test shows that the results of SCC and SCC++ presented in Figure 4 are similar. This is because SCC++ is an improved version of SCC.

RQ3: To what extent can preprocessing increase prediction time?
Preprocessing is generally required in such an experiment to prepare data in a way that guarantees yielding performance measures for each instance. Therefore, exposing data to a preprocessing operation is sometimes crucial. To answer RQ3, a data set including 5000 instances is randomly retrieved from the data corpus. A training process is then repeated for various training sizes as follows: 50 [63]. LIFT associates the training instances with one of the class labels [54], so LIFT needs much more time to complete training. On the other hand, MBR needs relatively low training time thanks to its fast decision-making mechanism [55].

RQ4: Which script language is the most feasible for predicting one-label programming language?
One-label programming language prediction is performed with gbm to compare the accuracies obtained for eight programming languages: Javascript, Python, PHP, Ruby, Perl, Bash, Lua, and R. Grid search is one of the most popular hyperparameter optimization techniques [64][65][66], so we prefer Grid search to perform a straightforward evaluation of script languages, excluding HyperSCC from this sub-experiment. The details of the changing accuracy values are given in Figure 6. The highest accuracy is obtained at the optimal maximum depth of the tree. In this respect, the optimal depth of the tree (1) of Perl is quite distinct from all of the others. R outperformed the other script languages, yielding an accuracy of 0.99. The accuracy of Python is the lowest (0.89) among those which produced accuracy values higher than 0.9. Fluctuations in the accuracy of R are vastly clearer than those of the others. Moreover, optimizing the maximum depth of the tree is easier for Perl, which has a large margin between the optimal (1) and ordinary hyperparameters (2-3-4-5-6-7-8).

Discussion and implications
In this section, we discuss to what extent this study differs from existing works by delving into the results presented in the previous sections.
Programming language prediction techniques suffer from various data-centric and experimental drawbacks. The majority of studies focus either on creating a feature set from the raw text [27] or on the robustness of machine learning techniques [59,67]. However, relying on the programming language labels of data sets without making an in-depth tag analysis may result in unreliable predictions. The reason is that tagging is a compulsory operation, especially for posting questions in stack overflow. The automatic labeling process presented in Section 3 aims to enhance the reliability of the labels of training data. The results presented in Table 2 support the hypothesis that a suitable process applied to labeling features has a positive impact on prediction success. Stack overflow data sets had been exposed to the machine learning process by using the same hyperparameter set employed in previous studies [68] or by utilizing one hyperparameter tuning technique [69]. Instead, we have conducted a minibatch training to decide the tuning approach to be used in the rest of the experiment. In this way, we were able to produce high accuracy regardless of the type of predictor in multi-label programming language prediction. Unlike preceding studies, the results of this study assert that performing one-label programming language prediction on Perl and R yields higher accuracy than on other script languages. This validates that programming languages including fewer distinctive keywords do not respond as well to labeling and preprocessing processes.
Figure 6. Accuracy curves of gbm for various boosting iterations.

Conclusion and Future Remarks
In this work, we proposed a new hyperparameter optimization technique, namely HyperSCC, for multi-label code prediction of stack overflow, where wrong labeling is addressed by establishing an automatic multi-labeling process. Compared with the state-of-the-art source code classification methods, the experimental steps we present in the paper do not rely on a specific optimization technique. Instead, the best optimization technique is chosen by analyzing the performance on a smaller part of the training data set, thereby executing HyperSCC. According to the obtained findings:
1) The success of multi-label source code prediction is not related to the predictor.
2) For one-label prediction, HyperSCC is superior to the other three methods in the F1 score.
3) The time needed for training mainly depends on the type of multi-label predictor. MBR shows significant resistance to the increase of training time for a large number of training instances. Further, the effect of data processing on training time is negligible for the majority of the multi-label predictors.
4) The tuning burden of predicting script languages can be alleviated via more robust labeling and tagging approaches.
This paper can be extended with a future agenda encompassing the following items:
1) Raw posts of stack overflow create a remarkable computational burden if each word is recognized as a feature, as performed in the experiment. We rather need to develop feature selection techniques that take exceptional cases of source code prediction into account.
2) The labeling method presented in this work could be leveraged by establishing a fuzzy rule-based model [70].
3) The effectiveness of HyperSCC may be validated by comparing it with the methods developed for distributed data optimization [71] and resource allocation issues [72].

Declarations
Funding Not applicable.
Conflict of interest The authors declare that they have no conflict of interest.
Availability of data and material The data required to replicate the experiment is presented in Section 5.2.
Code availability The link required to access the replication packages is presented in Appendix A.
Authors' contributions Not applicable.
Ethics approval This article does not contain any studies with human participants or animals performed by any of the authors.
Consent to participate Informed consent was obtained from all individual participants included in the study.

Consent for publication
The authors affirm that human research participants provided informed consent for publication.