
Research Article
Language and Platform Independent Attribution of Heterogeneous Code
@INPROCEEDINGS{10.1007/978-3-031-25538-0_10,
  author={Farzaneh Abazari and Enrico Branca and Evgeniya Novikova and Natalia Stakhanova},
  title={Language and Platform Independent Attribution of Heterogeneous Code},
  proceedings={Security and Privacy in Communication Networks. 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  proceedings_a={SECURECOMM},
  year={2023},
  month={2},
  keywords={Source code and Binary attribution Authorship attribution},
  doi={10.1007/978-3-031-25538-0_10}
}
Farzaneh Abazari
Enrico Branca
Evgeniya Novikova
Natalia Stakhanova
Year: 2023
SECURECOMM
Springer
DOI: 10.1007/978-3-031-25538-0_10
Abstract
Code authorship attribution aims to identify the author of source or binary code from the author’s unique coding style characteristics. Recently, researchers have attempted to develop cross-platform and language-oblivious attribution approaches. Most of these attempts were limited to small sets of two or three languages or a few platforms. However, the rapid development of cross-platform malware and the general diversity of languages, platforms, and architectures raise concerns about the suitability of these techniques. In this paper, we propose a unified approach that supports attribution of code irrespective of its format. Our approach leverages an image-based code abstraction that preserves the developer’s coding style and lends itself to spatial analysis that reveals hidden patterns. We validate our approach on a set of Android applications, achieving 82.8%–100% accuracy on source and byte code. We further explore the robustness of our approach in attributing developers’ code written in 27 programming languages, compiled on 14 instruction set architecture types and in 18 intermediate compiled versions. Our results on the GitHub dataset show that, in the worst-case scenario, the proposed approach can discriminate authors of code in heterogeneous formats with at least 68% accuracy.
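The abstract does not say how the image-based abstraction is constructed. As an illustration only, one common format-agnostic way to turn code into an image is to treat the file's raw bytes as grayscale pixel values. The sketch below assumes that mapping; the function name `bytes_to_image` and the `width` parameter are hypothetical and are not taken from the paper.

```python
import math


def bytes_to_image(data: bytes, width: int = 16) -> list[list[int]]:
    """Map raw bytes to rows of grayscale pixel values (0-255).

    Assumption: one byte becomes one pixel, laid out row by row.
    The last row is zero-padded so that every row has `width` pixels.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# The same function accepts source text (after encoding) or compiled bytes,
# because it only ever sees a byte sequence.
source_image = bytes_to_image("int main() { return 0; }".encode("utf-8"), width=8)
binary_image = bytes_to_image(bytes([0x7F, 0x45, 0x4C, 0x46, 0x02, 0x01]), width=4)
```

Because the mapping ignores syntax entirely, it applies equally to source, bytecode, and native binaries, which is the property a format-independent approach needs. Whether the paper uses this mapping or a different one is not stated in the abstract.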