Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC

Muhammed Maruf Öztürk

Research Article

@ARTICLE{10.4108/eai.27-5-2022.174084,
    author={Muhammed Maruf {\"O}zt{\"u}rk},
    title={Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC},
    journal={EAI Endorsed Transactions on Scalable Information Systems},
    volume={10},
    number={1},
    publisher={EAI},
    journal_a={SIS},
    year={2022},
    month={5},
    keywords={Multi-label classification, hyperparameter optimization, programming language prediction},
    doi={10.4108/eai.27-5-2022.174084}
}

Muhammed Maruf Öztürk (2022). Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC. SIS. EAI. DOI: 10.4108/eai.27-5-2022.174084

Muhammed Maruf Öztürk¹,*

1: Department of Computer Engineering, Suleyman Demirel University, West Campus, Isparta, 32040, Turkey

*Contact email: muhammedozturk@sdu.edu.tr

Abstract

Although various machine learning and text mining techniques exist for identifying the programming language of complete code files, multi-label prediction for code snippets has not been considered by the research community. This work aims to devise a tuner for multi-label programming language prediction of Stack Overflow posts. To that end, a Hyper Source Code Classifier (HyperSCC) is devised along with rule-based automatic labeling, taking into account the bottlenecks of multi-label classification. The proposed method is evaluated on seven multi-label predictors to conduct an extensive analysis. It is further compared with three competitive alternatives in terms of one-label programming language prediction. HyperSCC outperformed the other methods in terms of F1 score. Preprocessing yields a large reduction (50%) in training time when ensemble multi-label predictors are employed. In one-label programming language prediction, the Gradient Boosting Machine (gbm) yields the highest accuracy (0.99) in predicting R posts, which contain many distinctive words that determine their labels. The findings support the hypothesis that multi-label predictors can be strengthened with sophisticated feature selection and labeling approaches.

Keywords: Multi-label classification, hyperparameter optimization, programming language prediction

Received: 2022-03-21
Accepted: 2022-05-26
Published: 2022-05-27
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eai.27-5-2022.174084

Copyright © 2022 Muhammed Maruf Öztürk, licensed to EAI. This is an open access article distributed under the terms of the Creative Commons Attribution license, which permits unlimited use, distribution and reproduction in any medium so long as the original work is properly cited.