11th EAI International Conference on Performance Evaluation Methodologies and Tools

Research Article

Runtime Performance Prediction of Big Data Workflows with I/O-aware Simulation

  • @INPROCEEDINGS{10.4108/eai.5-12-2017.2274337,
        author={Faris Llwaah and Jacek Cala and Nigel Thomas},
        title={Runtime Performance Prediction of Big Data Workflows with I/O-aware Simulation},
        proceedings={11th EAI International Conference on Performance Evaluation Methodologies and Tools},
        publisher={ACM},
        proceedings_a={VALUETOOLS},
        year={2018},
        month={8},
        keywords={big data data-intensive simulation workflowsim next generation sequencing pipeline},
        doi={10.4108/eai.5-12-2017.2274337}
    }
    
Faris Llwaah1, Jacek Cala1, Nigel Thomas1,*
  • 1: Newcastle University
*Contact email: nigel.thomas@ncl.ac.uk

Abstract

Modelling and simulating Big Data analytics processes running in the cloud is difficult and raises many challenges. The main one is collecting training data, which is scarce and costly to obtain because of the large scale and long runtime of those processes. In our previous work, we proposed a methodology to model, simulate and predict the runtime of Big Data processes such as complex Next Generation Sequencing (NGS) pipelines. The major contribution of that simulation methodology is that it provides a reasonable runtime prediction for testing data much larger than the training inputs. To further improve prediction accuracy, we now present an extension of our previous work that can model cloud data storage. Our simulation framework is based on CloudSim and WorkflowSim, to which we have added a shared storage component. We present the design and implementation of the storage extension together with an evaluation performed on selected scientific workflows: the Pegasus Montage workflow and an NGS pipeline implemented in e-Science Central. The evaluation shows that the proposed extension works correctly and can improve prediction accuracy for our largest input dataset (390 GB) by about 16% compared with our previous results.