
Research Article
On IT and OT Cybersecurity Datasets for Machine Learning-Based Intrusion Detection in Industrial Control Systems
@INPROCEEDINGS{10.1007/978-3-031-78806-2_3,
  author={Mohammad Pasha Shabanfar and Yiheng Zhao and Jun Yan and Mohsen Ghafouri},
  title={On IT and OT Cybersecurity Datasets for Machine Learning-Based Intrusion Detection in Industrial Control Systems},
  proceedings={Smart Grid and Innovative Frontiers in Telecommunications. 8th EAI International Conference, EAI SmartGIFT 2024a, Santa Clara, United States, March 23-24, 2024, Proceedings},
  proceedings_a={SMARTGIFT},
  year={2025},
  month={1},
  keywords={Information Technology, Operational Technology, Datasets, Cybersecurity, Intrusion Detection System},
  doi={10.1007/978-3-031-78806-2_3}
}
Mohammad Pasha Shabanfar
Yiheng Zhao
Jun Yan
Mohsen Ghafouri
Year: 2025
SMARTGIFT
Springer
DOI: 10.1007/978-3-031-78806-2_3
Abstract
Intrusion detection plays a pivotal role in the cybersecurity of industrial control systems (ICS), helping to protect the safety of individuals, communities, and nations. Recently, machine learning-based intrusion detection models have been adopted to improve the detection of cyberattacks. However, a systematic approach to selecting an appropriate dataset for training these models is lacking. An appropriately selected dataset should match the required collection environment, i.e., Information Technology (IT) or Operational Technology (OT), and include the required specifications of the ICS under study, e.g., the deployed protocols. On this basis, this paper classifies existing intrusion detection datasets into IT and OT datasets. The IT datasets are examined in terms of attack/normal traffic inclusion, anonymity, number of packets, duration, and kind of traffic. The OT datasets are examined in terms of features such as data protocols, distribution, and data domain. The paper then discusses the gap between the detection method and the selection of an appropriate dataset in terms of (i) performance indicators, i.e., detection time and imbalanced data distribution, and (ii) use case, i.e., the communication layers, protocols, and attack types contained in the datasets. Finally, the essential features of an effective cybersecurity dataset are discussed to illustrate how an ideal dataset can be established.