Modelling and Simulation for Efficiency Factor Evaluation of Maintenance Strategies in a Computer System

Efficient maintenance extends the operating life of a system and thus helps to increase its performance. Computer systems, like other systems, must operate without interruption, and the maintenance strategies applied to them must be efficient. This paper proposes to define a maintenance strategy for a computer system subjected to both corrective and preventive maintenance. These aspects are modelled by the competing risks concept and the Alert-Delay model, which are generally used for industrial systems. The approach is applied to real data from a computer system located in an industrial company, and different scenarios are generated following various maintenance strategies. These policies are evaluated through minimal, perfect and imperfect maintenance models for corrective and preventive maintenance. Simulation results give the failure intensity assessment and the value of the efficiency factor. The final outcomes are validated by dependability measures in order to select the best maintenance strategies.


Introduction
To ensure the dependability of any system [1], maintenance has become a fundamental process that leads to optimal functioning. Computer networks, like other systems, must operate without interruption, and the maintenance policies applied to them must be efficient.
Hence, a good maintenance management through definition of an appropriate and efficient strategy will have a positive impact on system performance.
Generally, systems are subjected to two kinds of maintenance action: Corrective Maintenance (CM) and Preventive Maintenance (PM). Corrective maintenance is performed after a failure and is intended to put the system back in working condition, but it does not avoid the consequences of the failure. A more defensive approach is to implement preventive maintenance, which is carried out while the system is operating and is intended to reduce and prevent failures. Preventive maintenance can be performed at predetermined intervals, or according to prescribed criteria that assess the degradation state of the system and trigger an intervention when a certain threshold is reached [2,3]. In practice, these two types coexist, and the simplest way of modelling this situation is the theory of competing risks, introduced in the context of maintenance in [4].
Modelling the effect of the performed maintenance is necessary in order to assess maintenance efficiency. Maintenance can be a perfect repair, where the system is renewed, or a minimal repair, where the system is restored to the state it was in before maintenance. However, reality lies between these two extreme cases: maintenance reduces the failure intensity but does not leave the system as good as new. This is known as imperfect maintenance [3].
R. Oudjedi Damerdji and M. Noureddine

This paper proposes to define a maintenance strategy for a computer system subjected to both corrective and preventive maintenance, through the competing risks concept. Several scenarios are generated, including CM and PM strategies. Considering that the effect of corrective maintenance is minimal, the maintenance efficiency factor is estimated following different models of imperfect maintenance for preventive maintenance. Dependability measures are used to validate the obtained results. Figure 1 illustrates the proposed approach.

The remainder of the paper is organized as follows. Section 2 presents the background: the maintenance models, the maintenance efficiency factor and the theory of competing risks for maintenance analysis. Section 3 outlines the proposed approach, with the definition of the adopted competing risks model and its application to a specific computer system; the identification and treatment of data give different scenarios of maintenance strategies. Section 4 describes the simulation of these scenarios through the maintenance models. The most suitable policy is selected according to the value of the efficiency factor and validated by dependability measures. Section 5 concludes the paper with perspectives for this work.

Background
According to standards NF X 60-010 and 60-011 [5], maintenance is the set of actions that make it possible to maintain an item in a specified state, or to restore it, so that it can provide a determined service. This definition contains two keywords: maintain and restore. The first refers to preventive maintenance (PM) and the second to corrective maintenance (CM).
The existence of a maintenance service has the effect of maintaining the system and decreasing failures. Through preventive and corrective actions, the system is continuously monitored, which gives a good knowledge of its state. Interventions are thus minimal and made at the right time, and the system is restored after each failure. Maintenance models allow these effects to be described.

Models of maintenance
After maintenance operations, it is necessary to examine their impact on the maintained system. It is therefore important to build models of maintenance effects in order to evaluate their effectiveness.

Basic models
The most common assumptions on maintenance efficiency are the following. Either the effect of maintenance is minimal, also known as ABAO (As Bad As Old): maintenance restores the operating system to the state it was in just before failure. Or maintenance is perfect, also called AGAN (As Good As New): the system is in a new state after maintenance, which corresponds to the replacement of non-repairable components. The ABAO and AGAN models are represented respectively by the non-homogeneous Poisson process and the renewal process [6,7].

Imperfect maintenance models
Generally, maintenance reduces the failure intensity but does not renew the system. The model that takes into account a maintenance effect between ABAO and AGAN is called the Imperfect Maintenance (Better than Minimal Repair) process [6,7].
A well-known family of imperfect maintenance models is that of virtual age models, originally developed by [8]. The simplifying assumption used in these models is that each maintenance action modifies the age of the system, so that the maintained system is rejuvenated or aged [9]. These models generalize the two previous models through the virtual age of the system:
• If the virtual age after each maintenance is identical to the real age of the system, the maintenance is minimal (ABAO model),
• If the virtual age is zero after each maintenance, the maintenance is perfect (AGAN model).
Most popular models, based on virtual age concept, are the arithmetic reduction of age [6,10] and the Brown Proschan model [11].
(i) Arithmetic Reduction of Age (ARA). The ARA model assumes that the effectiveness of maintenance remains unchanged; these models have been proposed considering that the maintenance effect depends on one or more previous maintenance interventions [9,10]. Two classes are defined [6,10]:
• Reduction of age model with memory one (ARA1). The ARA1 model assumes that maintenance acts on the last inter-maintenance period, so the virtual age is reduced only since the last intervention.
• Reduction of age model with infinite memory (ARA∞). The ARA∞ model assumes that maintenance acts on all inter-maintenance periods, so the virtual age is related to the total life cycle of the system.
(ii) Brown-Proschan model (BP). The Brown-Proschan model is one of the first models proposed in the literature; it considers that, after a system failure, the maintenance is perfect (AGAN) with probability p and minimal (ABAO) with probability (1-p). The effect of maintenance is characterized by a random variable B_i following the Bernoulli distribution with parameter p [6,9,11]. The basic models are particular cases of the BP model: p = 1 gives the AGAN model and p = 0 gives the ABAO model.
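As an illustration of the virtual age idea, the following Python sketch computes the virtual age just after the last maintenance for the ARA1, ARA∞ and BP models. It assumes the usual recursive forms of these models (the age gained over the last inter-maintenance period is reduced by a factor ρ for ARA1, the whole virtual age is reduced by a factor ρ for ARA∞) and a single efficiency parameter for all interventions. The function names are illustrative and do not come from any tool cited in this paper.

```python
import random


def virtual_age_ara1(times, rho):
    """Virtual age just after the last maintenance under ARA1 (assumed form).

    Each maintenance removes a fraction rho of the age accumulated
    since the previous maintenance only.
    """
    age, previous = 0.0, 0.0
    for t in times:
        age += (1.0 - rho) * (t - previous)
        previous = t
    return age


def virtual_age_ara_inf(times, rho):
    """Virtual age just after the last maintenance under ARA-infinity (assumed form).

    Each maintenance removes a fraction rho of the whole virtual age.
    """
    age, previous = 0.0, 0.0
    for t in times:
        age = (1.0 - rho) * (age + (t - previous))
        previous = t
    return age


def virtual_age_bp(times, p, rng):
    """Virtual age just after the last maintenance under Brown-Proschan.

    Each maintenance is perfect (age reset to 0) with probability p
    and minimal (age unchanged) with probability 1 - p.
    """
    age, previous = 0.0, 0.0
    for t in times:
        age += t - previous
        if rng.random() < p:  # Bernoulli draw equal to 1: perfect maintenance
            age = 0.0
        previous = t
    return age


if __name__ == "__main__":
    maintenance_times = [10.0, 25.0, 40.0]  # hypothetical maintenance instants
    for rho in (0.0, 0.5, 1.0):
        print(rho,
              virtual_age_ara1(maintenance_times, rho),
              virtual_age_ara_inf(maintenance_times, rho))
    print(virtual_age_bp(maintenance_times, 0.5, random.Random(0)))
```

With ρ = 0 both ARA functions return the real age (ABAO) and with ρ = 1 they return zero (AGAN), which matches the two limiting cases described above.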

Maintenance efficiency factor
After maintenance, one expects to observe either a decrease or an increase in the failure intensity. The effect of maintenance is assessed by the parameter ρ, called the maintenance efficiency or restoration factor.

Failure intensity
Most maintenance efficiency models include parameters associated with the initial failure intensity, which represents the failure rate of a new, unmaintained system and is denoted λ(t).
In general, industrial systems are assumed to wear out and the initial failure intensity is traditionally increasing [12], following a Weibull distribution. The initial failure intensity is equal to [6,7,12]:

$$\lambda(t) = \alpha \beta t^{\beta - 1}$$

where α is the scale parameter and β is the shape parameter characterizing the speed of the system wear-out. The three phases of the life of a device are specified through the shape parameter β, as follows [3]:
• If β < 1, the system is in the burn-in period (the failure intensity decreases),
• If β = 1, the failure intensity is constant (useful life),
• If β > 1, the system wears out (aging).
The failure intensity function of the system represents its instantaneous failure rate. Depending on the maintenance efficiency factor ρ, the failure intensity is expressed as in [6,7,10]. Following the failure intensity definition, Table 1 presents the variations of the parameter ρ characterizing the maintenance efficiency models.
These models assume that the initial intensity function is increasing; conversely, a decreasing initial intensity function leads to the opposite conclusions on maintenance efficiency [6].
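The roles of β and ρ can be made concrete with a short Python sketch. It assumes the power-law (Weibull) form αβt^(β-1) for the initial intensity and, for the intensity after maintenance, the ARA1 form commonly given in the literature, in which the initial intensity is evaluated at the virtual age t - ρT, where T is the last maintenance time before t. The function names are illustrative, and the sketch assumes ρ ≤ 1 so that the virtual age stays positive.

```python
def initial_intensity(t, alpha, beta):
    """Power-law initial intensity alpha * beta * t**(beta - 1) (assumed form)."""
    return alpha * beta * t ** (beta - 1.0)


def life_phase(beta):
    """Life phase suggested by the shape parameter beta."""
    if beta < 1.0:
        return "burn-in"
    if beta > 1.0:
        return "wear-out"
    return "useful life"


def ara1_intensity(t, maintenance_times, alpha, beta, rho):
    """Failure intensity at time t under ARA1 with efficiency rho (assumed form).

    The virtual age is t - rho * T, where T is the last maintenance
    time strictly before t (0 if no maintenance has occurred yet).
    """
    past = [m for m in maintenance_times if m < t]
    last = max(past) if past else 0.0
    return initial_intensity(t - rho * last, alpha, beta)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    alpha, beta = 0.5, 2.0
    print(life_phase(beta))
    for rho in (0.0, 0.5, 1.0):
        print(rho, ara1_intensity(15.0, [10.0], alpha, beta, rho))
```

For an increasing initial intensity (β > 1), the printed intensity decreases as ρ goes from 0 (minimal) to 1 (perfect), which is the behaviour the efficiency factor is meant to capture.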

Maximum likelihood method
In most practical maintenance models, parameters are estimated by the statistical method of maximum likelihood. According to [7,10], the maximum likelihood method is theoretically the best method when the number of observations is small.
The effects of PM and CM are characterized, respectively, by ρ_p and ρ_c for virtual age models and by p_p for the BP model. The vector θ = (α, β, ρ_c, ρ_p) gathers the parameters to be estimated; its estimate maximizes the log-likelihood function associated with the dates of CM and PM observed on the interval [0, t]. This function depends on the parameters (α and β) of the failure intensity. More details can be found in [13], which describes the implementation of the maximum likelihood method for the maintenance models.
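To give an idea of what such an estimation involves, the sketch below treats only the simplest special case: minimal repair (ABAO) for every intervention, with an assumed power-law initial intensity αβt^(β-1) observed on [0, t_obs]. In that case the log-likelihood has a closed-form maximum. This is not the likelihood used for the ARA or BP models, which also involves ρ_c and ρ_p and is normally maximized numerically; the function names and the data are hypothetical.

```python
import math


def power_law_log_likelihood(times, t_obs, alpha, beta):
    """Log-likelihood of a power-law NHPP (minimal repair) observed on [0, t_obs]."""
    n = len(times)
    return (n * math.log(alpha) + n * math.log(beta)
            + (beta - 1.0) * sum(math.log(t) for t in times)
            - alpha * t_obs ** beta)


def power_law_mle(times, t_obs):
    """Closed-form maximum likelihood estimates (alpha, beta) for the same model."""
    n = len(times)
    beta_hat = n / sum(math.log(t_obs / t) for t in times)
    alpha_hat = n / t_obs ** beta_hat
    return alpha_hat, beta_hat


if __name__ == "__main__":
    failure_times = [10.0, 30.0, 60.0, 100.0, 150.0]  # hypothetical data
    t_obs = 200.0
    alpha_hat, beta_hat = power_law_mle(failure_times, t_obs)
    print(alpha_hat, beta_hat)
    print(power_law_log_likelihood(failure_times, t_obs, alpha_hat, beta_hat))
```

The estimated β can then be read as in the previous subsection: a value above 1 suggests wear-out and a value below 1 suggests a burn-in period.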

Concept of Competing Risks for Maintenance Analysis
This concept is widely applied in many areas (such as medicine, statistics, economics and engineering) where several events are likely to occur and only the time and type of the first event are observed [14,15]. A system subjected to competing risks is a system confronted with several types of mutually exclusive events. The concept of competing risks is introduced in maintenance analysis to model the dependence between the two types of maintenance, PM and CM. The principle of this approach is the following [16]: at the time the system is put back into operation after maintenance, we do not know whether the next failure will occur before or after the next preventive maintenance; in other words, it is not known whether the next maintenance will be preventive or corrective.
Formally, consider two competing risks in the context of maintenance analysis [16]:
• Y, called the risk of PM,
• Z, called the risk of CM.
After the kth maintenance, two risk variables are defined:
• Y_{k+1} is the potential waiting time until the next maintenance, if it is preventive,
• Z_{k+1} is the potential waiting time until the next maintenance, if it is corrective.
In the competing risks concept, the observations are the time to the next maintenance W_{k+1} and the type of the next maintenance U_{k+1} [16][17][18][19]:

$$W_{k+1} = \min(Y_{k+1}, Z_{k+1}), \qquad U_{k+1} = \begin{cases} 1 & \text{if the maintenance is preventive} \\ 0 & \text{if the maintenance is corrective} \end{cases}$$

In this part we present the most common competing risks methods used in maintenance analysis [16], [18][19][20][21]. We classify them according to the dependency between the risk variables Y and Z:

Independent Competing Risks models
• The Independent Competing Risks model. This is the simplest situation, where Y and Z are independent and the laws of Y and Z can be expressed from the observations W and U.
• The Mixture of Exponentials model. This model assumes that Z is a mixture of two exponential distributions while Y is exponential and independent of Z (a particular case of the previous model).
• The Doubling Independent model. This model assumes both that Y and Z are independent and that W and U are independent, through two independent exponential laws or two independent Weibull laws.
• The Conditionally Independent Risk model. This model considers that the competing risks Y and Z share a common quantity C; Y and Z are thus dependent through C and independent conditionally on C.
• The Delay Time model. This model is used in reliability: an alert signal is delivered by the system before failure. After this signal, there remains a waiting time before a PM is performed or a failure is observed.

Dependent Competing Risks models
• The Random Sign model. This model is based on the random sign assumption and uses the idea that the time at which PM can occur is related to the failure time. It assumes that the type U of the next maintenance does not depend on the potential failure time Z.
• The Alert Delay model. This model assumes that an alert is delivered just before the system failure, which makes it possible to situate the preventive maintenance policy with respect to the failure occurrence and after the alert.

Proposed Approach and Application
After a review of the various competing risks models used in the maintenance context, the Alert Delay (AD) model seemed the most interesting. Indeed, it expresses the dependency between the times of the PM and CM strategies considered in this study, it includes particular cases, and it gives an initial overview of the maintenance efficiency.
In the present study, the proposed approach consists in determining maintenance strategies from the warning delivered by the system when its degradation passes a critical threshold, so that failure is avoided by preventive maintenance. To select an effective maintenance strategy for a system within the competing risks concept, we therefore propose an approach based on the Alert Delay model.

Formal Definition of Alert-Delay Model
The Alert Delay (AD) model is based on an alert issued before failure and requires a delay ε to perform PM after the alert [16]. It is defined by:

$$Y = pZ + \varepsilon$$

EAI Endorsed Transactions on Energy Web Online First
where
• Y is the potential time to preventive maintenance,
• Z is the potential time to corrective maintenance,
• p ∈ [0,1], the alert being delivered at pZ,
• ε is the delay (positive random variable).
Several particular cases can be identified in this model, depending on the values of p and ε.
Since competing risks models are generally used for industrial systems [16], the aim of this study is to determine maintenance strategies for a computer system following the concept of the Alert-Delay model. The performance of these strategies is evaluated by the value of the maintenance efficiency factor and by dependability measures.
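The mechanism of the Alert-Delay model can be sketched in Python as follows. The sketch assumes that the potential time to PM is the alert time pZ plus the delay ε, and that the observed pair is the earlier of the two potential times together with its type. The Weibull law for Z, the exponential law for ε, the parameter values and the function names are illustrative assumptions, not the settings used in this study.

```python
import random


def simulate_alert_delay(p, mean_delay, n, rng, scale=100.0, shape=2.0):
    """Draw n observations (W, U) under an assumed Alert-Delay mechanism.

    Z is the potential time to failure (CM), Y = p * Z + eps the potential
    time to PM, W = min(Y, Z) and U = 1 for PM, 0 for CM.
    """
    observations = []
    for _ in range(n):
        z = rng.weibullvariate(scale, shape)  # assumed law for Z
        eps = rng.expovariate(1.0 / mean_delay) if mean_delay > 0 else 0.0
        y = p * z + eps
        w = min(y, z)
        u = 1 if y <= z else 0
        observations.append((w, u))
    return observations


def pm_share(observations):
    """Proportion of preventive maintenances among the observations."""
    return sum(u for _, u in observations) / len(observations)


if __name__ == "__main__":
    rng = random.Random(42)
    for p, mean_delay in [(0.5, 0.0), (0.5, 20.0), (0.9, 20.0)]:
        obs = simulate_alert_delay(p, mean_delay, 1000, rng)
        print(p, mean_delay, pm_share(obs))
```

With a null delay every observation is a PM, and as p approaches 1 with a positive delay the share of CM grows, since the alert then comes too late for PM to be performed before the failure.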

Application on real data
We apply the proposed approach to real data from a computer system; this system, a set of interconnected PCs and a server, is located in an industrial company dedicated to the production of ammonia and fertilizers.

Data Identification
The system was observed through a network surveillance program, over a fixed period from 9 hrs until 15 hrs for seven days, giving t_obs = 1008 hrs.
During the observation of the system, the authors identified the failure data in a previous study [22]. Table 2 presents the instants of the observed failures. The time unit is the second, so t_obs = 3628800 s.

Parameters of Alert-Delay Model
Our approach is to vary the parameters of the Alert-Delay model in order to define several maintenance strategies (scenarios) and to take into account the impact of the model parameters. The aim is to arrive at an adapted and efficient maintenance strategy. These parameters are the warning issued by the system (p) and the time (ε) required to perform preventive maintenance after the alert. We consider standard random variations of the parameters (p, ε), through five values for p and a corresponding value for the delay ε [16]. These specifications are shown in Table 3.
Using the assumptions (3)-(5) of the Alert-Delay model applied to the data presented in Table 2, five scenarios are determined, one for each value of ε and p. All instants (W_k) and types (U_k) of preventive and corrective maintenance are thus obtained (Table 4).
As expected, when p is close to 1, the probability of performing a corrective maintenance is greater than the probability of performing a preventive maintenance, as in the 5th scenario; when the delay is null, only preventive maintenance is observed, as in the 1st scenario. This confirms the particular cases of the Alert-Delay model discussed previously. Preventive maintenance is effective if it is performed as late as possible, that is to say if the alert is not delivered too early and if the delay is short. A maintenance strategy can also be said to be effective if it delays the next corrective maintenance.
Table 4. Instants and types of maintenance.

Experiments and Discussion
Among the obtained scenarios, a maintenance strategy is selected by estimating the maintenance efficiency factor through the maintenance models defined previously. The software MARS (Maintenance Assessment of Repairable Systems) is used [13]. This tool was developed to implement maintenance models and to jointly estimate the effects of aging and of maintenance (preventive and/or corrective) for a repairable system. The MARS tool simulates a data set from experience feedback corresponding to the instants of corrective and/or deterministic preventive maintenance. It can treat different maintenance models (ABAO, AGAN, ARA1, ARA∞ and BP), and the possible combinations of CM and PM effects are included in the software. MARS uses the maximum likelihood method for parameter estimation.
In practice, a repairable system formed of a set of components is maintained as a result of the failure of one of its components; this maintenance allows the system to continue its function but does not reset it to a new state. For this reason, we consider that the effect of corrective maintenance is minimal (ABAO), as in [23]. Models of imperfect maintenance are adopted for the effects of preventive maintenance. So for each scenario, by using the assumptions (1)

Variation of Preventive Maintenance Effects
In this section, experiments on the effectiveness of the maintenance strategy are carried out for the different scenarios.
Simulation results give the failure intensity and the efficiency factor for PM following the ARA models and the BP model (respectively ρ_p and p_p). The effects of CM are assumed to be minimal, thus ρ_c = 0.

Arithmetic Reduction of Age Model
Consider the ARA model with memory one (ARA1) for PM. With the model CM ABAO - PM ARA1, the maximum likelihood method evaluates the maintenance strategy through the three parameters α, β and ρ_p. The results are:
• for scenario 1: α = 3.07×10^-107, β = 20, ρ_p = 1. The value of β indicates that the system is wearing out and the value of ρ_p means that the PM effect is perfect (AGAN). The system improves and PM has delayed the failures. From the obtained values, Figure 2 gives the failure intensity corresponding to the model CM ABAO - PM ARA1. In the figure, the red dotted lines on the y-axis represent the times of PM.
• for scenario 2: the value of β indicates that the system is in the burn-in period and the value of ρ_p means that the PM effect is perfect.
• for scenario 3: α = 0.012, β = 0.489, ρ_p = -1.068. Given the values of the parameters α, β and ρ_p, the interpretation is the same.
• for scenario 4: the value of β indicates that the system is in the burn-in period and the value of ρ_p means that the PM effect is inefficient. The system does not improve, but performing CM decreases the failure intensity. For this scenario, Figure 3 gives the failure intensity corresponding to the model CM ABAO - PM ARA1.
Simulations using the ABAO model for corrective maintenance and the ARA model with infinite memory (ARA∞) for preventive maintenance give similar results.

Model BP
Consider the Brown-Proschan model for preventive maintenance.
With the model CM ABAO - PM BP, the evaluation of the parameters for the maintenance strategies defined in all scenarios gives approximately the same results.
For example, for scenario 4, as previously, the value of the efficiency factor p_p means that the PM effect is inefficient; Figure 4 gives the failure intensity corresponding to the model CM ABAO - PM BP.

Maintenance Strategy Selection
The same approach is applied to the other scenarios listed in Table 4. Following the simulation results, the estimated maintenance efficiency factor of each maintenance strategy is presented in Table 5:

Best scenarios
In the considered models, scenarios 1, 2 and 3 give a perfect effect of preventive maintenance. The system is renewed, and corrective maintenance only restores the system to the state it was in just before failure. Given the assumption in the (CM ABAO - PM ARA1) model, PM does not renew the system but restores it to the state it was in just before the previous CM.
This means that preventive maintenance is not AGAN but what is called AGAP (As Good As Previous) [3]. The maintenance strategies of the scenarios therefore mostly have a perfect preventive maintenance effect, except for scenarios 4 and 5.

Dependability measures
To validate the efficiency of the maintenance strategies on the various scenarios studied, a classical dependability measure is estimated before and after maintenance.
Among the dependability attributes, the reliability of a system, which is the continuity of its function, is closely related to the maintenance function [1]. Indeed, an adequate maintenance strategy will improve the system reliability [24]. In this context, an important measure is the Mean Time Between Failures (MTBF), defined as the time interval during which the system is functioning after installation, proper maintenance and overhaul [24].
Before maintenance, the estimated constant failure rate is evaluated from the failure data [22]; consequently, the initial MTBF is estimated at 6926.6468 s.
After maintenance, the measure is evaluated again, giving a set of five MTBF values relating to the five scenarios. From the instants (W_k) and types (U_k) of preventive and corrective maintenance, MTBF1 to MTBF5 are estimated for scenarios 1 to 5 respectively. Figure 5 presents the estimation results: the MTBF values increase from the initial value up to MTBF1. According to the dependability measures, the first scenario brings a significant improvement in MTBF; this validates the maintenance efficiency on the studied system. Even if scenarios 2 and 3 have lower MTBF values, they cannot be excluded, given that their maintenance effects are similar to those of the first scenario.
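The estimator used for the MTBF is not detailed here, so the following Python sketch is only one plausible reading: under a constant failure rate, the MTBF is taken as the observation time divided by the number of failures, and for a scenario only the corrective maintenances (U = 0) are counted as failures. The function names and the data are hypothetical.

```python
def mtbf_from_failures(failure_times, t_obs):
    """MTBF under a constant failure rate: observation time per failure."""
    if not failure_times:
        raise ValueError("at least one failure is needed to estimate the MTBF")
    return t_obs / len(failure_times)


def mtbf_for_scenario(instants, types, t_obs):
    """MTBF of a scenario, counting only corrective maintenances (U = 0) as failures."""
    failures = [w for w, u in zip(instants, types) if u == 0]
    return mtbf_from_failures(failures, t_obs)


if __name__ == "__main__":
    # Hypothetical instants W_k (in seconds) and types U_k (1 = PM, 0 = CM).
    instants = [100.0, 250.0, 900.0, 1000.0]
    types = [1, 0, 1, 0]
    print(mtbf_for_scenario(instants, types, t_obs=1200.0))
```

Under this reading, a scenario in which preventive maintenance replaces most failures yields fewer counted failures and therefore a larger MTBF; a scenario with no corrective maintenance at all has no finite estimate and the sketch raises an error in that case.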

[Figure 5: MTBF (time) before maintenance and after maintenance for the five scenarios.]

Discussion of results
After studying the complete results, the first scenario appears to be the best maintenance strategy, as validated by the MTBF dependability measure. In this case, preventive maintenance is perfect and corrective maintenance is sufficient to slow the aging of the system and the occurrence of failures. These estimates are obtained through the well-known statistical method of maximum likelihood.
To select the most appropriate model for characterizing the effects of CM and PM on the studied system, there are, in addition to the adopted method, other approaches for making this model choice, such as methods based on an information criterion, like the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [25,26]. Information criterion methods are dedicated to comparing the performance of models, and it would be interesting to use them to choose the maintenance efficiency model.
On the other hand, the imperfect maintenance models used in this study take into account a maintenance effect between minimal and perfect. Among imperfect models, we adopted models based on the virtual age of the system. Although these are the best-known and most popular models, a generalization of imperfect maintenance models has appeared, the Generalized Virtual Age (GVA) model [27]. Its application would expand the field of study of competing risks models for computer systems.
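As a sketch of how such a comparison could work, the Python functions below compute the standard AIC and BIC from a maximized log-likelihood and pick the model with the lowest value. The model names mirror those used in this paper, but the log-likelihood values, parameter counts and number of observations are purely hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def best_model(candidates, n_obs, criterion="aic"):
    """Return the name of the candidate with the lowest criterion value.

    candidates maps a model name to (maximized log-likelihood, number of parameters).
    """
    def score(item):
        log_l, k = item[1]
        return aic(log_l, k) if criterion == "aic" else bic(log_l, k, n_obs)
    return min(candidates.items(), key=score)[0]


if __name__ == "__main__":
    # Hypothetical fitted models: (log-likelihood, number of parameters).
    candidates = {
        "CM ABAO - PM ABAO": (-105.0, 2),
        "CM ABAO - PM ARA1": (-100.0, 3),
        "CM ABAO - PM BP": (-102.5, 3),
    }
    print(best_model(candidates, n_obs=50, criterion="aic"))
    print(best_model(candidates, n_obs=50, criterion="bic"))
```

Both criteria reward a higher likelihood and penalize extra parameters; BIC penalizes them more heavily as the number of observations grows.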

Conclusions
In this paper, we have presented an approach for assessing the effects of maintenance strategies for a computer network subjected to preventive and corrective maintenance, as industrial systems are.
To model this situation, a competing risks model is introduced and, in the maintenance context, the Alert Delay model is used. This model expresses the dependency between the times of the PM and CM strategies. A simulation on a data set from a computer system is proposed, giving several scenarios for different maintenance strategies. These scenarios are evaluated through the efficiency factor of PM and CM with various well-known maintenance models. To validate the obtained results, dependability measures have been estimated for each scenario. The comparison between these two evaluations gives the same best scenarios, and the most efficient maintenance strategies were highlighted.
To complete this study, several perspectives can be considered. As previously noted, methods based on an information criterion will be used to select the maintenance efficiency model. Further, other models of imperfect maintenance could be applied, so that all kinds of assumptions on maintenance effects will have been studied for a computer system. The presented approach can also be applied to another real computer system. After the identification of failures, the same study will be conducted, giving maintenance strategies, and the most efficient will be chosen to improve system performance.