1st International ICST Conference on Networks for Grid Applications

Research Article

Reliable DAG scheduling on Grids with Rewinding and Migration

  • @INPROCEEDINGS{10.4108/gridnets.2007.2137,
        author={Israel Hernandez and Murray Cole},
        title={Reliable DAG scheduling on Grids with Rewinding and Migration},
        proceedings={1st International ICST Conference on Networks for Grid Applications},
        publisher={ICST},
        proceedings_a={GRIDNETS},
        year={2007},
        month={10},
        keywords={Fault tolerance, Grid computing, parallel processing, DAG scheduling},
        doi={10.4108/gridnets.2007.2137}
    }
    
  • Israel Hernandez
    Murray Cole
    Year: 2007
    Reliable DAG scheduling on Grids with Rewinding and Migration
    GRIDNETS
    ICST
    DOI: 10.4108/gridnets.2007.2137
Israel Hernandez1,*, Murray Cole1,*
  • 1: Institute for Computing Systems Architecture, School of Informatics, University of Edinburgh
*Contact email: j.i.hernandez@sms.ed.ac.uk, mic@inf.ed.ac.uk

Abstract

Fault tolerance is an important issue in Grid computing, as the availability of Grid resources cannot be guaranteed. Effective scheduling methods must include fault-tolerant mechanisms to preserve the execution of DAG applications despite the presence of a processor failure. To address this, we designed the DAG rewinding mechanism, an event-driven process executed when a failure is detected at some rescheduling point. The rewinding mechanism preserves the execution of the application by recomputing and migrating those tasks that would otherwise disrupt the forward execution of succeeding tasks. The mechanism rewinds the progress of the application to a previous state, thereby preserving the execution despite the failed processor(s). This paper extends our work in the area by adding the rewinding mechanism to our previous dynamic scheduling methods, GTP and GTP/c. We show how to integrate the rewinding mechanism within our dynamic execution models.
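
The general idea of rewinding can be illustrated with a minimal Python sketch. It is not the authors' algorithm and does not model GTP or GTP/c. It assumes that the output of a completed task is stored only on the processor that ran it, so a processor failure loses those outputs. The names `tasks_to_rewind`, `migrate`, `preds`, and `placement` are hypothetical and introduced here only for illustration.

```python
def tasks_to_rewind(preds, placement, completed, failed):
    """Return the completed tasks that must be recomputed after a failure.

    preds:     dict mapping each task to the list of its predecessors (the DAG)
    placement: dict mapping each task to the processor it was assigned to
    completed: set of tasks that had finished before the failure
    failed:    set of failed processors
    Assumption: a task's output lives only on the processor that ran it.
    """
    # Build the successor map from the predecessor map.
    succs = {t: [] for t in preds}
    for task, parents in preds.items():
        for p in parents:
            succs[p].append(task)

    # Completed tasks whose outputs were lost with the failed processors.
    lost = {t for t in completed if placement[t] in failed}

    # Seed with lost tasks whose output is still needed by an unfinished successor.
    stack = [t for t in lost if any(s not in completed for s in succs[t])]
    rewind = set()
    while stack:
        t = stack.pop()
        if t in rewind:
            continue
        rewind.add(t)
        # Recomputing t needs its inputs, so lost predecessors are rewound too.
        stack.extend(p for p in preds[t] if p in lost and p not in rewind)
    return rewind


def migrate(placement, completed, rewind, failed, live):
    """Reassign rewound and unfinished tasks from failed to live processors.

    Uses a simple round-robin over `live` (an illustrative policy only).
    """
    new_placement = dict(placement)
    to_move = sorted(
        t for t in placement
        if placement[t] in failed and (t in rewind or t not in completed)
    )
    for i, t in enumerate(to_move):
        new_placement[t] = live[i % len(live)]
    return new_placement


# Diamond-shaped DAG: a -> b -> d and a -> c -> d.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
placement = {"a": "p0", "b": "p1", "c": "p0", "d": "p1"}
completed = {"a", "b", "c"}

rewound = tasks_to_rewind(preds, placement, completed, failed={"p0"})
new_placement = migrate(placement, completed, rewound, failed={"p0"}, live=["p1"])
```

In this example, the failure of `p0` loses the outputs of `a` and `c`. Task `d` still needs `c`, and recomputing `c` needs `a`, so both are rewound and moved to `p1`. If `p1` failed instead, only `b` would be recomputed, and the unfinished task `d` would be migrated along with it.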