
Research Article
A Three-Level Training Data Filter for Cross-project Defect Prediction
@INPROCEEDINGS{10.1007/978-3-030-69069-4_10,
  author={Cangzhou Yuan and Xiaowei Wang and Xinxin Ke and Panpan Zhan},
  title={A Three-Level Training Data Filter for Cross-project Defect Prediction},
  proceedings={Wireless and Satellite Systems. 11th EAI International Conference, WiSATS 2020, Nanjing, China, September 17-18, 2020, Proceedings, Part I},
  proceedings_a={WISATS},
  year={2021},
  month={2},
  keywords={Machine learning, Cross-project defect prediction, Transfer learning},
  doi={10.1007/978-3-030-69069-4_10}
}
Cangzhou Yuan
Xiaowei Wang
Xinxin Ke
Panpan Zhan
Year: 2021
WISATS
Springer
DOI: 10.1007/978-3-030-69069-4_10
Abstract
Cross-project defect prediction aims to predict whether the modules of a target project contain defects by using a prediction model trained on data from other projects. Because the data distributions of different projects diverge, cross-project defect prediction does not perform as well as within-project defect prediction. To reduce this difference as much as possible, researchers have proposed a variety of methods that filter training data from the perspective of transfer learning. In this paper, we introduce a “project-instance-metric” hierarchical filtering strategy to select training data for the defect prediction model. The three-level filtering method selects, in turn, the candidate projects that are most similar to the target project, the instances that are most similar to the target instances, and the metrics that correlate most strongly with the prediction result. We compared three-level filtering with project-level filtering, instance-level filtering, and the combination of project-level and instance-level filtering, using four classification algorithms on NASA open-source data sets. Our experiments show that the three-level filtering method achieves higher F-measure and AUC values than the single-level training data filtering methods.
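The “project-instance-metric” order described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which similarity or correlation measures the paper uses, so everything concrete here is an assumption made for illustration: projects are compared by the Euclidean distance between their mean metric vectors, instances are kept by a nearest-neighbour rule, and metrics are ranked by absolute Pearson correlation with the defect label. The function names and the parameters `n_projects`, `k`, and `n_metrics` are hypothetical.

```python
import numpy as np


def select_projects(candidates, target_X, n_projects=2):
    """Project level (assumed measure): keep the candidate projects whose
    mean metric vector is closest to the target project's mean vector."""
    target_mean = target_X.mean(axis=0)
    dists = {
        name: np.linalg.norm(X.mean(axis=0) - target_mean)
        for name, (X, _y) in candidates.items()
    }
    return sorted(dists, key=dists.get)[:n_projects]


def select_instances(X, y, target_X, k=2):
    """Instance level (assumed rule): keep the union of the k nearest
    training instances of every target instance."""
    # Pairwise Euclidean distances, shape (n_target, n_train).
    d = np.linalg.norm(target_X[:, None, :] - X[None, :, :], axis=2)
    keep = np.unique(np.argsort(d, axis=1)[:, :k])
    return X[keep], y[keep]


def select_metrics(X, y, n_metrics=2):
    """Metric level (assumed measure): keep the metrics with the highest
    absolute Pearson correlation with the defect label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns or constant labels get correlation 0 instead of NaN.
    corr = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    return np.sort(np.argsort(-np.abs(corr))[:n_metrics])


def three_level_filter(candidates, target_X, n_projects=2, k=2, n_metrics=2):
    """Apply the project, instance, and metric filters in that order.

    candidates maps a project name to (X, y): a metric matrix and 0/1 labels.
    Returns the chosen project names, the filtered training data and labels,
    and the indices of the retained metric columns.
    """
    names = select_projects(candidates, target_X, n_projects)
    X = np.vstack([candidates[n][0] for n in names])
    y = np.concatenate([candidates[n][1] for n in names])
    X, y = select_instances(X, y, target_X, k)
    cols = select_metrics(X, y, n_metrics)
    return names, X[:, cols], y, cols
```

The returned column indices `cols` would also have to be applied to the target project's metrics before a classifier trained on the filtered data is used for prediction.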