
Research Article
Research on Data Drift and Class Imbalance in Android Malware Detection
@INPROCEEDINGS{10.1007/978-3-031-63989-0_22,
  author={Zhen Liu and Ruoyu Wang and Bitao Peng and Changji Wang and Qingqing Gan},
  title={Research on Data Drift and Class Imbalance in Android Malware Detection},
  proceedings={Mobile and Ubiquitous Systems: Computing, Networking and Services. 20th EAI International Conference, MobiQuitous 2023, Melbourne, VIC, Australia, November 14--17, 2023, Proceedings, Part I},
  proceedings_a={MOBIQUITOUS},
  year={2024},
  month={7},
  keywords={Android malware detection, data drift, class imbalance, feature learning, CNN},
  doi={10.1007/978-3-031-63989-0_22}
}
Zhen Liu
Ruoyu Wang
Bitao Peng
Changji Wang
Qingqing Gan
Year: 2024
MOBIQUITOUS
Springer
DOI: 10.1007/978-3-031-63989-0_22
Abstract
In the Android ecosystem, malware detection remains a nontrivial task. Recent works have applied convolutional neural networks (CNNs) to detect Android malware. However, data drift and class imbalance are still open problems in this field. When data are represented by unstable features, the distribution of malware data may vary significantly over time, leading to data drift; as a result, a trained model may fail to detect malware effectively in future data. In addition, class imbalance may degrade a model's ability to identify malware types that have few training samples. To address both problems, this paper presents a new Android malware detection framework. Specifically, we devise a data distribution-aware feature learning framework that learns features with a stable distribution to handle data drift. We further devise a new loss function for the CNN to handle the class imbalance problem. With this loss function, the model can reinforce its learning of minority-class samples and hard samples. Experimental results on real datasets show that our method outperforms existing works for Android malware detection on datasets with data drift and class imbalance problems.
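The abstract does not specify the proposed loss function. As a loose illustration of the general idea (giving more weight to minority-class and hard samples), the sketch below uses a class-weighted focal loss, which is a different, well-known technique and not the authors' method. The function names, the inverse-frequency weighting, the class counts, and the value of `gamma` are all assumptions chosen for illustration.

```python
import math


def inverse_frequency_weights(class_counts):
    """Per-class weights proportional to 1/count, normalised to mean 1.

    Assumption: inverse-frequency weighting is one common heuristic;
    the paper may weight classes differently.
    """
    inv = [1.0 / c for c in class_counts]
    mean_inv = sum(inv) / len(inv)
    return [w / mean_inv for w in inv]


def weighted_focal_loss(probs, label, class_weights, gamma=2.0):
    """Class-weighted focal loss for a single sample.

    probs:         predicted class probabilities (assumed to sum to 1)
    label:         index of the true class
    class_weights: per-class weights, larger for rarer classes
    gamma:         focusing parameter; gamma=0 gives weighted cross-entropy
    """
    p_t = min(max(probs[label], 1e-12), 1.0)  # clamp to avoid log(0)
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) samples,
    # so hard samples dominate the total loss.
    return -class_weights[label] * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical two-class setting: 900 samples of class 0, 100 of class 1.
weights = inverse_frequency_weights([900, 100])

# An easy majority-class sample versus a hard minority-class sample.
easy_majority = weighted_focal_loss([0.95, 0.05], 0, weights)
hard_minority = weighted_focal_loss([0.70, 0.30], 1, weights)
print(easy_majority < hard_minority)
```

In this sketch the class weight addresses imbalance and the focusing term addresses sample difficulty; a real detector would apply such a loss per batch inside a deep learning framework rather than per sample in pure Python.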