sis 22(1): e5

Research Article

Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC

  • @ARTICLE{10.4108/eai.27-5-2022.174084,
        author={Muhammed Maruf {\"O}zt{\"u}rk},
        title={Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC},
        journal={EAI Endorsed Transactions on Scalable Information Systems},
        volume={10},
        number={1},
        publisher={EAI},
        journal_a={SIS},
        year={2022},
        month={5},
        keywords={Multi-label classification, hyperparameter optimization, programming language prediction},
        doi={10.4108/eai.27-5-2022.174084}
    }
    
  • Muhammed Maruf Öztürk
    Year: 2022
    Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC
    SIS
    EAI
    DOI: 10.4108/eai.27-5-2022.174084
Muhammed Maruf Öztürk1,*
  • 1: Department of Computer Engineering, Suleyman Demirel University, West Campus, Isparta, 32040, Turkey
*Contact email: muhammedozturk@sdu.edu.tr

Abstract

Although various machine learning and text mining techniques exist for identifying the programming language of complete code files, multi-label prediction for code snippets has not been considered by the research community. This work devises a tuner for multi-label programming language prediction of Stack Overflow posts. To that end, a Hyper Source Code Classifier (HyperSCC) is devised along with rule-based automatic labeling, taking the bottlenecks of multi-label classification into account. The proposed method is evaluated on seven multi-label predictors to conduct an extensive analysis. It is further compared with three competitive alternatives in terms of one-label programming language prediction. HyperSCC outperformed the other methods in terms of F1 score. Preprocessing yields a substantial reduction (50%) in training time when ensemble multi-label predictors are employed. In one-label programming language prediction, the Gradient Boosting Machine (gbm) yields the highest accuracy (0.99) in predicting R posts that contain many distinctive, label-determining words. The findings support the hypothesis that multi-label predictors can be strengthened with sophisticated feature selection and labeling approaches.