Research Article
Runtime Performance Prediction of Big Data Workflows with I/O-aware Simulation
@INPROCEEDINGS{10.4108/eai.5-12-2017.2274337,
  author={Faris Llwaah and Jacek Cala and Nigel Thomas},
  title={Runtime Performance Prediction of Big Data Workflows with I/O-aware Simulation},
  proceedings={11th EAI International Conference on Performance Evaluation Methodologies and Tools},
  publisher={ACM},
  proceedings_a={VALUETOOLS},
  year={2018},
  month={8},
  keywords={big data data-intensive simulation workflowsim next generation sequencing pipeline},
  doi={10.4108/eai.5-12-2017.2274337}
}
Faris Llwaah
Jacek Cala
Nigel Thomas
Year: 2018
Venue: VALUETOOLS
Publisher: ACM
DOI: 10.4108/eai.5-12-2017.2274337
Abstract
Modelling and simulating Big Data analytics processes that run in the cloud is difficult and raises many challenges. The major one is collecting training data, which is scarce and costly to obtain because of the large scale and long runtime of those processes. In our previous work, we proposed a methodology to model, simulate and predict the runtime of Big Data processes such as complex Next Generation Sequencing (NGS) pipelines. The major contribution of that methodology is that it provides a reasonable runtime prediction for test data much larger than the training inputs. To further improve prediction accuracy, we now present an extension of our previous work that can model cloud data storage. Our simulation framework is based on CloudSim and WorkflowSim, to which we have added a shared storage component. We present the design and implementation of the storage extension together with an evaluation performed on selected scientific workflows: the Pegasus Montage workflow and an NGS pipeline implemented in e-Science Central. The evaluation shows that the proposed extension works correctly and can improve prediction accuracy for our largest input dataset (390 GB) by about 16% compared to our previous results.
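To illustrate why modelling shared storage can change a runtime prediction, the sketch below estimates a task's runtime as compute time plus the time to move its data through a shared store. It is not the paper's model and does not use the CloudSim or WorkflowSim APIs. The `Task` fields, the `io_aware_runtime` function and the assumption that bandwidth is split evenly among concurrent tasks are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description; every field is illustrative."""
    name: str
    compute_seconds: float  # pure compute time, ignoring I/O
    input_mb: float         # data read from shared storage
    output_mb: float        # data written to shared storage


def io_aware_runtime(task: Task, storage_mb_per_s: float,
                     concurrent_tasks: int = 1) -> float:
    """Estimate runtime as compute time plus storage transfer time.

    Assumption: the shared storage bandwidth is divided evenly among
    the tasks that access it at the same time.
    """
    if storage_mb_per_s <= 0:
        raise ValueError("storage_mb_per_s must be positive")
    if concurrent_tasks < 1:
        raise ValueError("concurrent_tasks must be at least 1")
    effective_bandwidth = storage_mb_per_s / concurrent_tasks
    transfer_seconds = (task.input_mb + task.output_mb) / effective_bandwidth
    return task.compute_seconds + transfer_seconds


# Made-up numbers: 100 s of compute, 500 MB moved over a 50 MB/s store.
example = Task(name="align", compute_seconds=100.0,
               input_mb=400.0, output_mb=100.0)
alone = io_aware_runtime(example, storage_mb_per_s=50.0)
shared = io_aware_runtime(example, storage_mb_per_s=50.0, concurrent_tasks=2)
```

With these made-up numbers, a simulator that ignores I/O would report 100 s for the task, while this estimate adds 10 s of transfer time when the task runs alone and 20 s when two tasks share the store. Under this simplified assumption the gap grows with input size.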