Security and Privacy in Communication Networks. 13th International Conference, SecureComm 2017, Niagara Falls, ON, Canada, October 22–25, 2017, Proceedings

Research Article

Guilt-by-Association: Detecting Malicious Entities via Graph Mining

  • @INPROCEEDINGS{10.1007/978-3-319-78813-5_5,
        author={Pejman Najafi and Andrey Sapegin and Feng Cheng and Christoph Meinel},
        title={Guilt-by-Association: Detecting Malicious Entities via Graph Mining},
        proceedings={Security and Privacy in Communication Networks. 13th International Conference, SecureComm 2017, Niagara Falls, ON, Canada, October 22--25, 2017, Proceedings},
        proceedings_a={SECURECOMM},
        year={2018},
        month={4},
        keywords={Belief propagation, Big data analysis for security, Graph inference, Malicious domain and IP detection, Guilt-by-association, Graph mining},
        doi={10.1007/978-3-319-78813-5_5}
    }
    
  • Pejman Najafi
    Andrey Sapegin
    Feng Cheng
    Christoph Meinel
    Year: 2018
    Guilt-by-Association: Detecting Malicious Entities via Graph Mining
    SECURECOMM
    Springer
    DOI: 10.1007/978-3-319-78813-5_5
Pejman Najafi (1,*), Andrey Sapegin (1,*), Feng Cheng (1,*), Christoph Meinel (1,*)
  • 1: Hasso Plattner Institute (HPI)
*Contact email: pejman.najafi@hpi.de, andrey.sapegin@hpi.de, feng.cheng@hpi.de, christoph.meinel@hpi.de

Abstract

In this paper, we tackle the problem of detecting malicious domains and IP addresses using graph inference. To this end, we mine proxy and DNS logs to construct an undirected graph in which vertices represent domains and IP addresses, and edges represent associations between them. More specifically, we investigate three main relationships: , , and . We show that, given minimal ground truth information, it is possible to estimate the marginal probability of a domain or IP node being malicious based on its association with other malicious nodes. This is achieved by adopting belief propagation, an efficient and popular inference algorithm used in probabilistic graphical models. We have implemented our system in Apache Spark and evaluated it using one day of proxy and DNS logs collected from a global enterprise, spanning over 2 of disk space. We show that our approach is not only efficient but also capable of achieving a high detection rate (96% TPR) with a reasonably low false positive rate (8% FPR). Furthermore, it is capable of fixing errors in the ground truth as well as identifying previously unknown malicious domains and IP addresses. Our proposal can be adopted by enterprises to increase both the quality and the quantity of their threat intelligence and blacklists using only proxy and DNS logs.
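
The inference step described in the abstract can be illustrated with a minimal, single-machine sketch of loopy belief propagation on a small undirected domain/IP graph. This is not the authors' Apache Spark implementation. The function name `belief_propagation`, the homophily strength `eps`, the prior values, and the example hostnames and addresses are all assumptions chosen for illustration. Each node has two states (benign or malicious), ground truth enters as a prior on labelled nodes, and unlabelled nodes start at 0.5.

```python
from collections import defaultdict

STATES = (0, 1)  # 0 = benign, 1 = malicious


def belief_propagation(edges, priors, eps=0.1, default_prior=0.5, iterations=10):
    """Estimate P(malicious) per node with synchronous sum-product updates.

    edges   -- iterable of (node, node) pairs of an undirected graph
    priors  -- dict mapping labelled nodes to a prior P(malicious) in (0, 1)
    eps     -- assumed homophily strength, 0 < eps < 0.5
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    def phi(node):
        # Node potential: (P(benign), P(malicious)) from ground truth or default.
        p = priors.get(node, default_prior)
        return (1.0 - p, p)

    def psi(a, b):
        # Edge potential: neighbours are slightly more likely to share a state.
        return 0.5 + eps if a == b else 0.5 - eps

    # One message per direction of every edge, initialised to uniform.
    messages = {(u, v): (0.5, 0.5) for u in neighbors for v in neighbors[u]}

    for _ in range(iterations):
        updated = {}
        for (src, dst) in messages:
            out = []
            for x_dst in STATES:
                total = 0.0
                for x_src in STATES:
                    # Product of messages into src from all neighbours except dst.
                    incoming = 1.0
                    for k in neighbors[src]:
                        if k != dst:
                            incoming *= messages[(k, src)][x_src]
                    total += phi(src)[x_src] * psi(x_src, x_dst) * incoming
                out.append(total)
            norm = out[0] + out[1]
            updated[(src, dst)] = (out[0] / norm, out[1] / norm)
        messages = updated

    beliefs = {}
    for node in neighbors:
        b = list(phi(node))
        for k in neighbors[node]:
            for x in STATES:
                b[x] *= messages[(k, node)][x]
        beliefs[node] = b[1] / (b[0] + b[1])
    return beliefs


# Hypothetical toy data: two domains share one IP, a third domain uses another.
edges = [
    ("evil.example", "203.0.113.7"),
    ("unknown.example", "203.0.113.7"),
    ("good.example", "198.51.100.9"),
]
priors = {"evil.example": 0.99, "good.example": 0.01}
beliefs = belief_propagation(edges, priors)
```

In this toy graph the unlabelled IP shared with the known-malicious domain ends up with a belief above 0.5, and so does the unlabelled domain that shares that IP, while the IP attached only to the known-benign domain ends up below 0.5. On real logs the graph contains cycles, so the resulting beliefs are approximations rather than exact marginals.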