An Adaptive Fault Tolerant Scheduling System for Desktop Grid

In Desktop Grid, managing faults is very crucial and challenging job.So a fault tolerant system is mandatory requirement in desktop grid for fault identification and their resolution.A fault tolerant system allows applications to continue execution despite having faults without termination.In this paper,an adaptive fault tolerant scheduling system is presented that combines dynamic replication with rescheduling.The system initially schedules jobs depending upon the completion time and fault rate of resources and then fault-tolerant strategies are applied depending upon availability of resources.To measure the performance,experiments has been carried out and it has been observed that proposed scheduling system outperforms by a factor of 4.8% w.r.t. Average task response time and 0.02% w.r.t. Average flowtime as compared to existing system.On the other hand,there is no significant improvement is observed when BoT completion time and average execution time are compared to existing system.


Introduction
Internet proves to be a boon when it comes to communication and sharing, along with the availability of powerful computers and high speed communication networks for sharing resources has led to the sharing of spatially distributed resources and using it as an integrated computing machine.A wide range of resources are joining together to form a cluster to use them as a single integrated resource, thereby known as Grid Computing [1].
Specifically Grid computing can be elaborated not only as a coordinated sharing of resources but also as a solution to the complexity of dynamic and multi-organization collaborations [2].It can be inferred that Grid computing can be used to solve a single yet huge problem, with the cluster of multiple resources which act as a single computational machine.
Grid computing has a wide range of applications both in the field of science and industry.Some of the applications include concentrated to data, information and computation [3].A coordinated resource sharing is required for all the applications to compute, as the sharing is not only the exchange of files but rather resources are directly accessed such as software, memory, data, information and so on.Along these lines, sharing is highly restricted with resource contributors and clients those are accessing these resources.Organizations and/or foundations have to delineate sharing regulations, known as Virtual Organization (VO) [4].
In Desktop Grid, Information-free schedulers are using much more resources than necessary, so cannot make use of full potential of Desktop Grid.Prior information about resources and tasks will help in taking better scheduling decisions.For example if the scheduler has the knowledge about the resource reliability then it can be decided that which resource is able to complete the execution of task efficiently.Due to diverse nature of Desktop Grid, managing faults is very crucial and challenging task.So a fault tolerant system is a mandatory requirement in Desktop Grid for identification of faults and their resolution.A fault tolerant system allows applications to continue its execution despite having faults without termination.Also the system has to maintain a bare minimum quality of service.

Related Work
Three fundamental challenges that portray Grid computing are: diversity, scalability and element flexibility.A Grid includes a variety of heterogeneous resources and might traverse various organizational domains.As the size of Grid increases, raises the issue of performance deprivation.Finally, in a Grid with a variety of resources, the likelihood of resource failures is normally high.In addition, the hosts those are contributing in terms of donating their resources can be recovered by them whenever required, and is difficult to know ahead of time whether and when they will get to be distinctly accessible again.The resources administrators or applications must tailor their conduct effectively in order to get the maximum performance from the accessible resources and administrations.
Despite these issues, Grid computing must offer a simple way to use or access unlimited computing and heterogeneous resources, so it must have the ability to find, allot, supervise, control and deal with the utilization of system available abilities so as to accomplish different worldwide Quality of Service (QoS) also called resource administration [5].In conventional computing systems, resource administration is an all-around considered issue.Resource administrators are composed and work under the suspicion that they have complete control of a resource and in this manner can implement or use the approaches required for efficient utilization of that resource in isolation.Tragically, this presumption does not have any significant bearing to Grids in view of the fundamental issues already portrayed.This circumstance is complex by the general absence of information available about the present framework and the contending needs of clients.
Therefore, great parts of the previous work in Grid resource administration concentrated on prevail over the essential problem of diversity, for instance through the various resource administration protocols [6] and communicating resource and task prerequisite mechanisms [7].
The underlying difficulties of Grid computing relating to how to execute a task, how to exchange substantial files, how to deal with various client accounts on various frameworks have been resolved primarily, so clients and scientists can now deal with the issues that will permit more efficient resource utilization.Significant challenges remain, in any case, in seeing how these systems can be effectively consolidated to make consistent virtual perspectives of hidden resources and administrations.The enormous reputation of internet has produced huge prospects for Grid computing i.e., many desktop personal computers, whose idle cycles can be changed to run Grid applications, are joined with wide-zone systems both in the business enterprises and in the home.These new stages for high throughput applications are called Desktop Grids [8].
A Desktop Grid consists of aggregation of spatially distributed resources across diverse associations with varying security constraints and attributes into a single system.In such environment, volunteers allows sharing of resources for solving extensive problems in weather forecasting, high energy physics, etc [9].The distributed management, heterogeneity and inconsistency of desktop grid resources often results unavailability of resources as compared to traditional distributed system [10].In Grid systems, a task is said to be failed if an allocated resource is not able to complete that task within the given time period [11].Due to these failures, application performance degrades in terms of execution time.Moreover providing fault tolerance despite maintaining system performance in terms of application execution time is a challenging job [12].
Anousha et al. [13] presented a scheduling system that depends upon estimating the completion time of the tasks on each of resources.The proposed system will generate the scheduling decisions as per the value of estimated completion time.The results show that the proposed scheduling system improves total completion time in comparison to Min-Min strategy.S.K Panda et al. [14] presented a scheduling system that depends upon utilizing the Round Trip Time (RTT) to discover failures and after identification of failures checkpointing strategy is being used to recover from failures.Results show that the proposed scheduling system improves total completion time and lessen the makespan in comparison to Min-Min and Max-Min strategies.In N.M.Reda et al. [15], the proposed strategy is based on finding appropriate resources by getting the average value via sorting list of completion time of each task.Finally, the task having the maximum average is allocated to the machine that has the minimum completion time.Results show that the proposed strategy outperforms almost other strategies in terms of resources utilization and makespan.K. Kousalya et al. [16] proposed a QoS based Task Rescheduling algorithm (QTR) in which scheduled tasks are collected and then rescheduled using Minimum Completion Time(MCT) value.The results of the computations show that the QTR algorithm reduces makespan in comparison to existing strategies.J.Y Maipan et al. [17] proposed an algorithm MinExt which calculates the average completion time of all tasks.Then tasks having more than average completion time value are scheduled first followed by the set of tasks less than or equal to average completion time value.The results indicate that the proposed strategy minimizes total completion time value and utilizes the idle resources effectively in comparison to existing strategies.
Fault tolerant systems can be proactive or post-active based upon whether they handled prior to scheduling of tasks on desktop grid resources [18] or otherwise.Nazir et al. [18] proposed a fault tolerant scheduling system that works on retaining the value of fault index to handle job failures.In this paper grid scheduler generates the scheduling decisions on the basis of value of fault index.
Nandagopal et al. [12] proposed another scheduler that adds response time in addition to use fault index to make scheduling decisions.In [19] Mohammed Amoon et al. presented a fault tolerant scheduler that is based on using a An Adaptive Fault Tolerant Scheduling System for Desktop Grid 3 scheduling indicator to select resources.This scheduling indicator consists of response time and fault rate of Grid resources.During scheduling the proposed scheduler will use this scheduling indicator to generate scheduling decisions.A most recent scheduling system presented by R.M.R Kovvur et al. [20] incorporated rescheduling on account of resource failures during execution of tasks.
So it has been seen that due to diverse nature of Desktop Grid, a single strategy does not work well in all situations.

Proposed Methodology
The main contribution of the proposed system is to present an adaptive fault tolerant scheduling system that combines dynamic replication with rescheduling.Initially the scheduling is being performed on the basis of completion time and fault rate of resources.Furthermore an appropriate fault tolerant strategy is used depending on the availability of resources.If the number of available resources are more than the number of tasks requesting for allocation then dynamic replication is preferred and if the number of resources are less as compared to number of tasks then rescheduling is preferred.The proposed system is then compared with the system presented in [15].The improvement in terms of terms of average task response time and average flowtime value is clearly presented.
At the first stage the proposed system schedules the resources to tasks on the basis of scheduling criteria.Then in the second stage, the system checks the number of resources available.In case of more number of available resources in comparison to tasks then the system computes a dynamic threshold value for each task and replicates the task on multiple resources and if the number of available resources are less in comparison to tasks requesting for allocation then the proposed system perform rescheduling on account of resources failures during execution.The main components of proposed system and their interconnections are shown in figure 1.

User Interface/Desktop Grid Portal
Through this interface user submits the jobs to the resource broker for execution along with information like number of jobs, type of job and size of jobs etc.

Resource Broker
The Resource Broker performs the following functions: a) Send request to Grid Information Server (GIS) for the number of available resources.
b) Select the best list of resources in terms of scheduling criteria (Sc).c) Sort the list of resources as per the value scheduling criteria (Sc).d) Communicates with Desktop Grid Scheduler and send the sorted list of resources along with the list of tasks to be executed.Allocates tasks to the first available resource according to the list received from resource broker.If it fails then will go to the next available resource and will continue to do until it find a suitable resource for execution of task.b) Maintains a list of faulty resources c) Will check the number of available resources.If number of available resources is more in comparison to number of tasks then the scheduler will do dynamic replication for each task.And in case of less number of resources in comparison to number of tasks then the scheduler will do rescheduling on account of resource failures.
d) Communicates with availability manager and send the list of faulty resources along with fault rate value.

Fault Tolerant Manager
The fault tolerant manager performs the following functions: a) Keep updating the resource history as per the value of fault rate received from desktop grid scheduler.
b) Keep updating Grid information server with updated resource information along with the number of available resources.

Grid Information Server (GIS)
The Grid Information Server performs the following functions: a) Will send the available resource list to the resource broker as per the information received from availability manager.

Outcome Manager
The Grid Information Server performs the following functions: a) Will send the available resource list to the resource broker as per the information received from availability manager.

Adaptive Fault Tolerant Scheduler's Process
Desktop Grid Scheduler receives tasks for execution from user interface along with information about tasks i.e. number of tasks, type of task and size of task and so on.Then it allocates task to the most efficient resource for execution.Efficient resource selection is based upon value of fault rate and completion time value stored in the history of Grid Information server.The fault rate and completion time is defined as follows:-Fault Rate (Fr): It is the ratio of the number of tasks failed to execute to the total number of tasks.Fault rate is maintained and updated dynamically in resource history table.Thus fault Rate (Fr) of jth resource is: Frj= Tf*100/N . ( Here Tf is number of tasks failed to execute (Tf) N is the total number of tasks The value of fault rate (Fr) and completion time (Cab) is used by the desktop grid scheduler during taking scheduling decisions.The resource having the minimum value of fault rate (Fr) and completion time (Cab) is called as most efficient resource.
To achieve its objective, the scheduler will calculate the value of Scheduling criteria (Sc) as per the following formula Scj =Cij* (1+Frj) During generating scheduling decision the value of scheduling criteria (Scj) is utilized.The most efficient resource selection will be done with minimum value of scheduling criteria (Scj).
The flow of work of the proposed system is shown by the following steps. 1.
Resource broker collects tasks for execution from user Interface along with information like number of jobs, type of job and size of jobs etc. 2.
Resource Broker will send request to Grid Information Server (GIS) for the number of available resources.
3. Grid Information Server (GIS) will send the available resource list to the resource broker as per the information received from availability manager. 4.
After getting the resource list from GIS, the resource broker will performs the following functions: a) Select the best list of resources in terms of scheduling criteria (Sc).b) Sort the list of resources as per the value of scheduling criteria (Sc).c) Communicates with Desktop Grid Scheduler and send the sorted list of resources along with the list of tasks to be executed.5.
After getting the resource list along with tasks for execution the Desktop Grid scheduler will do the following tasks:a) Allocates tasks to the first available resource according to the list received from resource broker.If it fails then will go to the next available resource and will continue to do until it find a suitable resource for execution of task.more in comparison to number of tasks then the scheduler will do dynamic replication for each task.And in case of less number of resources in comparison to number of tasks then the scheduler will do rescheduling on account of resource failures.d) Communicates with fault tolerant manager and send the list of faulty resources along with fault rate value.6.
Fault tolerant manager will keep updating the resource history as per the value of fault rate received from desktop grid scheduler and also updates Grid information server(GIS) with updated resource information along with the number of available resources. 7.
Outcome manager will send gathered outcomes of tasks to the desktop grid scheduler those have successfully completed their execution. 8.
Finally the desktop grid scheduler will send the result back to the resource broker and the resource will send the collected result to the user interface.

Results and Discussion
The proposed scheduler, an adaptive fault tolerant scheduling system is compared with existing scheduling system for performance analysis by using Java based simulation environment known as GridSim Toolkit.The performance analysis is based upon average task response time, BoT completion time (makespan), average execution time and average flowtime.
In order to verify proposed scheduling system in comparison with the existing scheduling system [7], number of experiments has been performed.The factors used for simulation are shown in Table 1.For exhaustive analysis of the proposed strategy, four scenarios are taken in which the number of resources are kept constant i.e. 40 in all cases.In first two scenarios, there is less number of tasks in comparison to number of available resources and in last two scenarios, the number of task are more in comparison to number of resources.For comparison purpose 25% and 50% of tasks in comparison to 40 numbers of resources are taken in first two scenarios and 200% and 400% of tasks in comparison to 40 numbers of resources are taken in last two scenarios.In each scenario, 30-35% resource failures are induced for analysis.Random failures are induced with the assumption that resources may fail anytime.The complete comparison analysis w.r.t.average task response time, BoT completion time also called makespan, average execution time and average flowtime with various scenarios is illustrated in figures 2, 3, 4 and 5 respectively.

t Average Flowtime
As per the analysis, it has been observed that the proposed scheduling system outperforms existing scheduling system by a factor of 4.8% w.r.t.Average task response time.Furthermore when Average flowtime is taken into account then the proposed system shows a performance improvement of 0.02% as compared to existing scheduling system.Additionally, when BoT completion time and average execution time are taken in to account, then the performance of the proposed scheduling system remains same with the existing scheduling system.So it has concluded that the proposed system outperforms in every case as compared to the existing scheduling system.Furthermore the proposed system must be preferred for the applications where high response time is desired.

Conclusion and Future Scope
In this paper an adaptive fault tolerant scheduling system is presented that combines dynamic replication with rescheduling.The system initially schedules jobs depending upon the completion time and fault rate of resources.Furthermore, if the number of available resources are more than the number of tasks requesting for allocation then dynamic replication is preferred and if the number of resources are less as compared to number of tasks then rescheduling is preferred.The proposed adaptive scheduling system is evaluated under different scenarios with latest fault tolerant scheduling system that depend upon using the rescheduling on account of resource failures during execution of tasks.The parameters used for evaluation are average task response time, BoT completion time (makespan), average execution time and average flowtime.
As per the analysis, it has been observed that the proposed scheduling system outperforms existing scheduling system by a factor of 4.8% w.r.t.Average task response time.Furthermore when Average flowtime is taken into account then the proposed system shows a performance improvement of 0.02% as compared to existing scheduling system.Additionally, when BoT completion time and average execution time are taken in to account, then the performance of the proposed scheduling system remains same with the existing scheduling system.So it has concluded that the proposed system outperforms in every case as compared to the existing scheduling system.Furthermore the proposed system must be preferred for the applications where high response time is desired.
In future, scheduling can be done for dependent tasks also.Further improvements can also be done by specifying the time zones on which scheduling can be done because especially in night time zones resources are almost free to use.Replication and Checkpointing policies can be improved further.Furthermore, more than one BoTs concurrently can be considered for execution.

Figure 1 .
Figure 1.Adaptive Fault Tolerant Scheduling System b) Maintains a list of faulty resources c) Will check the number of available resources.If number of available resources is EAI Endorsed Transactions on Scalable Information Systems 03 2019 -06 2019 | Volume 6 | Issue 21 | e4

Figure
Figure 2. Comparison of existing & proposed system w.r.tAverage Task Response Time

Table 1 .
Factors for Simulation