
Research Article
Cost-Effective Malware Classification Based on Deep Active Learning
@INPROCEEDINGS{10.1007/978-3-031-25538-0_12,
  author={Qian Qiang and Yige Chen and Yang Hu and Tianning Zang and Mian Cheng and Quanbo Pan and Yu Ding and Zisen Qi},
  title={Cost-Effective Malware Classification Based on Deep Active Learning},
  proceedings={Security and Privacy in Communication Networks. 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  proceedings_a={SECURECOMM},
  year={2023},
  month={2},
  keywords={Deep active learning, Malware classification, Cost-effective},
  doi={10.1007/978-3-031-25538-0_12}
}
Authors: Qian Qiang, Yige Chen, Yang Hu, Tianning Zang, Mian Cheng, Quanbo Pan, Yu Ding, Zisen Qi
Year: 2023
Conference: SECURECOMM
Publisher: Springer
DOI: 10.1007/978-3-031-25538-0_12
Abstract
Malware has become one of the most significant threats to internet security. As the number of malware families has increased rapidly, a malware classification model is needed to classify samples for further analysis. Recent success in deep learning-based malware classification, however, relies heavily on large numbers of labeled training samples, which may require considerable human effort. In this paper, we propose a novel malware classification framework that addresses this labeling cost and can build a competitive classifier from a limited number of labeled training instances in an incremental learning manner. A cost-effective sample selection strategy focuses expert effort on labeling the samples that are most informative for the classifier. We first convert the malware byte sequences into fixed-size grayscale images through data visualization. Then, following a strategy designed for acquiring informative malware, we use a Convolutional Neural Network (ConvNet) to select samples for expert annotation according to the estimated gradients with respect to the last linear layer. The updated labeled dataset is then fed into the network for progressive fine-tuning. To evaluate how well our method acquires informative malware from a pool of unknown samples, we conduct a series of experiments on the BIG 2015 benchmark dataset. Compared with random selection and other existing high-performance strategies, the proposed system achieves a promising performance gain cost-effectively, with less labeling effort wasted. We also analyze the effectiveness of sample selection across different families, which further demonstrates the efficiency in labeling cost. Moreover, we study the initialization methods and the pre-defined number of samples queried for practical implementation.
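The two mechanical steps named in the abstract, mapping bytes to a fixed-size grayscale image and ranking samples by a gradient estimate at the last linear layer, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the 32x32 default size, the truncate-or-zero-pad policy, the use of the predicted class as a pseudo-label, and the top-k ranking by gradient norm are all assumptions made for illustration; the paper's actual selection strategy may differ.

```python
import math


def bytes_to_gray_image(data, width=32, height=32):
    """Map raw bytes to a fixed-size grayscale image (pixel values 0-255).

    Assumption: the sequence is truncated or zero-padded to width*height
    bytes. Resizing by interpolation would be another plausible choice.
    """
    n = width * height
    padded = data[:n] + bytes(max(0, n - len(data)))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_norm_last_layer(features, logits):
    """Norm of the cross-entropy gradient w.r.t. the last linear layer's weights.

    Assumption: the predicted class y_hat stands in for the unknown label.
    For softmax + cross-entropy, dL/dW = (p - onehot(y_hat)) outer h, whose
    Frobenius norm equals ||p - onehot(y_hat)|| * ||h||.
    """
    p = softmax(logits)
    y_hat = max(range(len(p)), key=p.__getitem__)
    err = [pi - (1.0 if i == y_hat else 0.0) for i, pi in enumerate(p)]
    err_norm = math.sqrt(sum(e * e for e in err))
    feat_norm = math.sqrt(sum(h * h for h in features))
    return err_norm * feat_norm


def select_queries(pool, k):
    """Return indices of the k pool samples with the largest gradient norm.

    pool is a list of (features, logits) pairs, where features is the input
    to the last linear layer and logits is that layer's output.
    """
    scores = [grad_norm_last_layer(h, z) for h, z in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

In this sketch a confidently classified sample yields a gradient norm near zero and is ranked last, while a sample with near-uniform class probabilities is ranked first, which matches the intuition of spending expert labels where the classifier is least certain.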