Research Article
A Framework for Utilizing Permutational Multiple Analysis of Variance as a Precursor for Nonparametric Statistical Learning with Cyber Network Data
@ARTICLE{10.4108/eetcasa.v9i1.2929,
  author={Thomas Woolman and John Pickard},
  title={A Framework for Utilizing Permutational Multiple Analysis of Variance as a Precursor for Nonparametric Statistical Learning with Cyber Network Data},
  journal={EAI Endorsed Transactions on Context-aware Systems and Applications},
  volume={9},
  number={1},
  publisher={EAI},
  journal_a={CASA},
  year={2023},
  month={7},
  keywords={Artificial intelligence, machine learning, Internet of Things, Malware, network flows, multinomial, PERMANOVA, NPMANOVA, multinomial classification},
  doi={10.4108/eetcasa.v9i1.2929}
}
- Thomas Woolman
- John Pickard
Year: 2023
CASA
EAI
DOI: 10.4108/eetcasa.v9i1.2929
Abstract
INTRODUCTION: Although scientific hypothesis testing methodologies are well established, their application to falsifiable hypothesis testing for assessing causal relationships potentially identified by machine learning and artificial intelligence models is rare due to the primarily nonparametric statistical nature of these systems.
OBJECTIVES: The primary objective of this study is to demonstrate the potential for applying nonparametric statistical tests to a mixed qualitative and quantitative cyber network dataset as a method to pre-assess the feasibility of applying forms of statistical hypothesis testing before a machine learning algorithm models the data.
METHODS: A mixture of permuted analysis of variance models, augmented by the use of transformed non-Euclidean multivariate distances between curated dependent variable classes, produced this research data. Quasi-experimental data from an enclosed laboratory environment utilizing a monitored, locally unrestricted network that introduced known Internet of Things (IoT) malware software supplied network flow events.
RESULTS: A PERMANOVA model was executed against 62,000 records of the network flow observations with 200 permutations, using Euclidean distance measurements with variable-dependent relationship ordering and terms added sequentially (first to last) in the order encountered in the raw network flow dataset. This precursor test resulted in a p-value of 0.02985 for the PERMANOVA model that incorporated terms added sequentially, providing an F value of 0.00017 with which to determine the ratio of explained to unexplained variance. Utilizing an analysis of the F values for all of the residuals, we show 29,998 degrees of freedom with a residual F model score of 0.99983, indicating that there is a strong proportion of explained to unexplained variance across all of the independent variables contained in the model. The model is thus statistically significant, with a p-value below the significance level (alpha) of 0.05.
CONCLUSION: This research has demonstrated that it is possible to apply tests of falsifiability that incorporate reproducible methods into the quasi-experiment design and apply this to the field of machine learning. Applied to AI/ML (artificial intelligence/machine learning) models, this pre-assessment methodology supports the appropriateness of cyber network flow datasets in which a final test of statistical significance would be required. The authors believe that this represents a substantially useful precursor assessment stage for the suitability and reliability of the utilization of any nonparametric statistical learning algorithms applied to cyber network data predictive analytics.
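As a rough illustration of the kind of permutation test the abstract describes, the following is a minimal sketch of a one-way PERMANOVA on squared Euclidean distances with 200 label permutations. It is not the authors' implementation: the data are synthetic, the function names `pseudo_f` and `permanova` are hypothetical, and the sketch handles a single grouping factor rather than the sequentially added terms used in the study.

```python
import numpy as np


def pseudo_f(dist_sq, labels):
    """Pseudo-F statistic for a one-way design from squared pairwise distances."""
    n = len(labels)
    groups = np.unique(labels)
    a = len(groups)
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = dist_sq[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares, accumulated group by group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = dist_sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(x, labels, permutations=200, seed=0):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Squared Euclidean distance between every pair of observations.
    diff = x[:, None, :] - x[None, :, :]
    dist_sq = (diff ** 2).sum(axis=-1)
    f_obs = pseudo_f(dist_sq, labels)
    # Shuffle the group labels and count statistics at least as extreme.
    exceed = 0
    for _ in range(permutations):
        if pseudo_f(dist_sq, rng.permutation(labels)) >= f_obs:
            exceed += 1
    # The observed labelling is counted as one of the permutations.
    p_value = (exceed + 1) / (permutations + 1)
    return f_obs, p_value


# Synthetic stand-in for flow features: two classes with shifted means.
rng = np.random.default_rng(42)
x = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 3)),
    rng.normal(3.0, 1.0, size=(30, 3)),
])
labels = np.array([0] * 30 + [1] * 30)

f_obs, p_value = permanova(x, labels, permutations=200)
print(f"pseudo-F = {f_obs:.2f}, p = {p_value:.4f}")
```

With 200 permutations the smallest attainable p-value is 1/201, so the permutation count sets the resolution of the test. The sketch also builds the full pairwise distance matrix in memory, which would need a different approach for a dataset of tens of thousands of records.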
Copyright © 2023 Woolman et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.