Research Article
A Scalable Sampling Scheme for Clustering in Network Traffic Analysis
@INPROCEEDINGS{10.4108/infoscale.2007.930,
  author={Abdun Mahmood and Christopher Leckie and Parampalli Udaya},
  title={A Scalable Sampling Scheme for Clustering in Network Traffic Analysis},
  proceedings={2nd International ICST Conference on Scalable Information Systems},
  proceedings_a={INFOSCALE},
  year={2010},
  month={5},
  keywords={adaptive sampling network traffic analysis clustering},
  doi={10.4108/infoscale.2007.930}
}
Abdun Mahmood
Christopher Leckie
Parampalli Udaya
Year: 2010
INFOSCALE
ICST
DOI: 10.4108/infoscale.2007.930
Abstract
Sampling is a popular method for improving the scalability of analyzing massive datasets such as network traffic traces, web click traffic and other forms of transaction data. However, in some cases, existing simple sampling strategies fail to capture the underlying distribution of the data. In particular, for network traffic, sampling is influenced by heavy traffic from flash crowds and Denial of Service (DoS) attacks. In such cases, the sample reveals little about the smaller traffic patterns, which may contain interesting and important information about the traffic. We propose an adaptive sampling technique that utilizes a buffer of frequently seen patterns and a combination of sampling steps to build a hierarchical tree of traffic clusters. We show that this sampling technique ensures that smaller and newer patterns are represented in the cluster tree while satisfying the maximum sampling rate imposed by the resource constraints. This technique has two benefits: it preserves the underlying patterns of the data, and it improves efficiency by reducing the sampling of records from known patterns. Through an empirical evaluation on a benchmark dataset, we demonstrate the accuracy of our system in detecting certain types of rare attacks that are otherwise not detected by systematic sampling. We also demonstrate the efficiency of our system in reducing the number of records sampled when detecting frequent patterns.
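The contrast drawn in the abstract between systematic sampling and buffer-based adaptive sampling can be illustrated with a small sketch. This is not the authors' algorithm: the function names, the parameters `new_step`, `known_step` and `promote_after`, and the use of a plain string as a traffic "pattern" are all assumptions made for illustration, and the hierarchical cluster tree is not modelled at all. The sketch only shows the general idea of sampling records from already-frequent patterns sparsely so that rare patterns still reach the sample.

```python
from collections import Counter


def systematic_sample(records, step):
    """Keep every `step`-th record (plain 1-in-N systematic sampling)."""
    return records[::step]


def adaptive_sample(records, new_step=1, known_step=50, promote_after=3):
    """Illustrative adaptive sampler (hypothetical, not the paper's method).

    Patterns sampled at least `promote_after` times are moved into a buffer
    of frequent patterns. Records matching a buffered pattern are sampled at
    1 in `known_step`; all other records are sampled at 1 in `new_step`.
    """
    sampled_counts = Counter()  # times each unbuffered pattern was sampled
    frequent_buffer = set()     # patterns already considered "known"
    known_seen = 0              # records seen that match a buffered pattern
    new_seen = 0                # records seen that match no buffered pattern
    sample = []

    for pattern in records:
        if pattern in frequent_buffer:
            known_seen += 1
            if known_seen % known_step == 0:
                sample.append(pattern)
        else:
            new_seen += 1
            if new_seen % new_step == 0:
                sample.append(pattern)
                sampled_counts[pattern] += 1
                if sampled_counts[pattern] >= promote_after:
                    frequent_buffer.add(pattern)
    return sample


# Synthetic trace (assumption): a flood of "dos" records with two rare records.
trace = ["dos"] * 1000
trace[333] = "rare"
trace[777] = "rare"

systematic = systematic_sample(trace, step=10)
adaptive = adaptive_sample(trace)

print("rare in systematic sample:", "rare" in systematic)
print("rare in adaptive sample:", "rare" in adaptive)
print("sample sizes:", len(systematic), len(adaptive))
```

On this synthetic trace the systematic sampler happens to skip both rare records, because neither falls on a multiple of the step, while the adaptive sketch keeps them and still draws fewer records overall. A real system would derive patterns from flow attributes rather than from a single label, and would need to bound the overall sampling rate explicitly.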