Reliability Estimation in Coherent Systems
Agatha Sacramento Rodrigues
Carlos Alberto de Bragança Pereira
Adriano Polpo.
Preface
Usually, methods for evaluating system reliability require engineers to quantify the reliability of each of the system components. For series and parallel systems, there are some options to handle the estimation of each component’s reliability. We treat the reliability estimation of complex problems involving two classes of coherent systems: series-parallel and parallel-series. In both cases, the component reliabilities may be unknown. We present estimators for reliability functions at all levels of the system (component and system reliabilities). Nonparametric Bayesian estimators of all subdistribution and distribution functions are derived, and a Dirichlet multivariate process is presented as a prior distribution. A parametric estimator of the component’s reliability based on the Weibull model is presented for any kind of system. Some ideas for systems with masked data are also discussed.
This is a first version of the manuscript. We are sure that many improvements to this text are necessary. We are working on them and will probably have a new version soon.
May 23, 2018.
Agatha Sacramento Rodrigues
Carlos Alberto de Bragança Pereira
Adriano Polpo.
Chapter 1 Introduction
In engineering, the quality of the produced system is of great interest, and reliability has accordingly been an active object of research in recent years. Among important works, Barlow and Proschan (1981) stands out on the theory of reliability. Their book presents concepts and theories about reliability, with important results for the case in which component and system reliabilities are known. In situations where the components’ reliabilities are unknown, statistical inference can suitably be used to estimate system and component reliabilities, as discussed in the remainder of this book.
As a motivation, the reliability estimation of an automatic coffee machine is presented. The machine has two causes of failure: 1. the failure of the component that grinds the grain; or 2. the failure of the component that heats the water. Clearly, the failure of either component, 1 or 2, leads to the failure of the coffee machine. The failure of one component implies that the possible future failure time of the other becomes invisible, i.e., it is a censored observation. Statistical inference for the reliability of the machine depends on both marginal component models. The reliability study of components allows for one-off actions on the components that need to be improved in order to maximize system performance, rather than changing the entire system, which reduces cost, time and unnecessary effort. Hence, inferences for both components are needed.
Statistical inference on component reliability is not an easy task: censoring, dependence and unequal distributions are some of the difficulties. Consider a sample for the coffee machine example in which all sample units are observed until failure. Every sample unit produces a failure time for one component and a censored failure time for the other. Both components failing at the same time is considered unlikely in such situations. In this example, the sample produces observed failure times and censored times for the two components under test. Regarding the component failure times, it is reasonable to say that the two components are not identically distributed: one of the components will probably be censored more often than the other. It is common that only one component is responsible for the system failure at time $t$, implying that all the remaining components are also censored at time $t$, although the types of censoring could be different. In general, the number of censored observations should be higher than the number of uncensored ones.
The reliabilities of a system and its components also depend on the structure of the system, that is, the way the components are interconnected. The coffee machine is a series system of two components, a simple case known as the competing risks problem. A system structure can be illustrated by what is known as a block diagram, in which each component of the system is represented by a block. Figure 1.1 is the block diagram of a series system with four components: at the time the system fails, only one component is uncensored and the other three components are right-censored at the system failure time, that is, they could still continue to work after the system fails.
Suppose a system of $m$ components, with $X_j$ denoting the failure time of the $j$th component, $j = 1, \ldots, m$. Let $T$ be the random variable that represents the system failure time. The function that defines $T$ in relation to $X_1, \ldots, X_m$ depends on the system structure. For a system with the structure represented in Figure 1.1, for instance, $m = 4$ and $T = \min(X_1, X_2, X_3, X_4)$.
Consider initially that a random sample of $n$ systems with the structure in Figure 1.1 is observed, with $t_1, \ldots, t_n$ being a sample of the random variable $T$. The goal is to estimate the reliabilities of the components involved in this series structure. At the system failure, however, not all components have their failure times observed. In addition, a particular component may be responsible for the system failure in some sample units and not in the remaining ones, which are cases of right-censored observations.
When a system fails, the failure time of a given component may not be observed, but its censoring time is. For all sample units, the system failure times $t_1, \ldots, t_n$ are recorded. Associated with each sample unit, let $\delta_i$ be the indicator of the component whose failure caused the system to fail, with $\delta_i \in \{1, \ldots, m\}$. Writing $X_{ji}$ for the failure time of the $j$th component of the $i$th system, at the time a series system fails a given component can only be uncensored (responsible for the system failure), that is, $X_{ji} = t_i$, or right-censored at the system failure time, that is, $X_{ji} > t_i$, for $i = 1, \ldots, n$.
The data of the observed systems are presented in Table 1.1. For instance, system ID=1 failed at time $t_1 = 1.92$ and component 1 was the first to fail, that is, $\delta_1 = 1$, and the other components are right-censored at $1.92$.
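Table 1.1: System failure times $t$ and failing-component indicators $\delta$ for the observed series systems.
System ID  t  δ
1  1.92  1
2  1.85  2
3  2.00  4
4  1.74  3
5  1.41  1
6  1.97  2
7  1.65  3
8  2.08  1
9  1.74  4
10  2.40  2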
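Considering a parametric model, let $R(t \mid \theta_j)$ and $f(t \mid \theta_j)$ be the reliability and density functions of the $j$th component, respectively, where $\theta_j$ is the parameter, which can be either a scalar or a vector. For the $j$th component, the likelihood function can be written as
$$L(\theta_j \mid \mathbf{t}, \boldsymbol{\delta}) = \prod_{i=1}^{n} \left[f(t_i \mid \theta_j)\right]^{I(\delta_i = j)} \left[R(t_i \mid \theta_j)\right]^{1 - I(\delta_i = j)}, \tag{1.1}$$
where $I(\delta_i = j) = 1$ if $\delta_i = j$ or $0$ otherwise, and $\mathbf{t} = (t_1, \ldots, t_n)$, $\boldsymbol{\delta} = (\delta_1, \ldots, \delta_n)$.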
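To make the structure of the likelihood for a series system concrete, the following is a minimal sketch that evaluates its logarithm for one component using the data of Table 1.1. The two-parameter Weibull form (with `shape` and `scale`) and all function names are assumptions made only for illustration; uncensored observations contribute the log density and right-censored ones the log reliability.

```python
import math

# Table 1.1: system failure times and failing-component indicators.
t = [1.92, 1.85, 2.00, 1.74, 1.41, 1.97, 1.65, 2.08, 1.74, 2.40]
delta = [1, 2, 4, 3, 1, 2, 3, 1, 4, 2]


def weibull_log_density(x, shape, scale):
    """Log density of an (assumed) two-parameter Weibull lifetime."""
    z = x / scale
    return math.log(shape / scale) + (shape - 1) * math.log(z) - z ** shape


def weibull_log_reliability(x, shape, scale):
    """Log reliability, log R(x) = -(x / scale) ** shape."""
    return -((x / scale) ** shape)


def series_log_likelihood(j, shape, scale, t, delta):
    """Log-likelihood of component j in a series system.

    Component j is uncensored when it caused the failure (delta_i == j)
    and right-censored at t_i otherwise.
    """
    total = 0.0
    for t_i, d_i in zip(t, delta):
        if d_i == j:
            total += weibull_log_density(t_i, shape, scale)
        else:
            total += weibull_log_reliability(t_i, shape, scale)
    return total


# Hypothetical parameter values, chosen only to show a call.
value = series_log_likelihood(1, 1.5, 3.0, t, delta)
```

With `shape = 1` the model reduces to the exponential distribution, which gives a simple closed form against which the function can be checked.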
A parallel system, as in Figure 1.2, works whenever at least one component is working. Again, only one component has its failure time uncensored; the other components are left-censored observations, that is, they had failed before the system failure. For a system with the structure represented in Figure 1.2, $m = 3$ and $T = \max(X_1, X_2, X_3)$.
Consider that a random sample of $n$ systems with the structure in Figure 1.2 is observed, with $t_1, \ldots, t_n$ being a sample of the random variable $T$. Associated with each sample unit, let $\delta_i$ be the indicator of the component whose failure caused the $i$th system to fail, for $i = 1, \ldots, n$ and $\delta_i \in \{1, \ldots, m\}$. Writing $X_{ji}$ for the failure time of the $j$th component of the $i$th system, at the time a parallel system fails a given component can only be uncensored (the last component to fail), that is, $X_{ji} = t_i$, or left-censored at the system failure time, that is, $X_{ji} < t_i$, for $i = 1, \ldots, n$.
The parallel systems data are presented in Table 1.2. For instance, system ID=3 failed at time $t_3 = 4.93$ and component 2 was the last to fail, that is, $\delta_3 = 2$, and the other components are left-censored at $4.93$.
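Table 1.2: System failure times $t$ and indicators $\delta$ of the last component to fail for the observed parallel systems.
System ID  t  δ
1  0.18  1
2  1.15  3
3  4.93  2
4  0.01  2
5  1.01  1
6  1.51  3
7  1.74  2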
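For the $j$th component and considering a parametric model with reliability function $R(t \mid \theta_j)$ and density function $f(t \mid \theta_j)$, the likelihood function can be written as
$$L(\theta_j \mid \mathbf{t}, \boldsymbol{\delta}) = \prod_{i=1}^{n} \left[f(t_i \mid \theta_j)\right]^{I(\delta_i = j)} \left[1 - R(t_i \mid \theta_j)\right]^{1 - I(\delta_i = j)}, \tag{1.2}$$
where $I(\delta_i = j) = 1$ if $\delta_i = j$ or $0$ otherwise, and $\mathbf{t} = (t_1, \ldots, t_n)$, $\boldsymbol{\delta} = (\delta_1, \ldots, \delta_n)$.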
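For a parallel system the censored observations are left-censored, so each of them contributes the probability of having failed before the system failure time. The sketch below evaluates the corresponding log-likelihood for the data of Table 1.2. As before, the two-parameter Weibull form and the function name are illustrative assumptions; `math.expm1` is used so that $\log(1 - e^{-z})$ stays accurate for very small failure times such as $0.01$.

```python
import math

# Table 1.2: parallel-system failure times and indicators of the
# last component to fail.
t = [0.18, 1.15, 4.93, 0.01, 1.01, 1.51, 1.74]
delta = [1, 3, 2, 2, 1, 3, 2]


def parallel_log_likelihood(j, shape, scale, t, delta):
    """Log-likelihood of component j in a parallel system.

    Assumes a two-parameter Weibull lifetime. Component j is uncensored
    when it was the last to fail (delta_i == j) and left-censored at t_i
    otherwise.
    """
    total = 0.0
    for t_i, d_i in zip(t, delta):
        z = (t_i / scale) ** shape
        if d_i == j:
            # Uncensored: log of the Weibull density at t_i.
            total += (math.log(shape / scale)
                      + (shape - 1) * math.log(t_i / scale) - z)
        else:
            # Left-censored: log(1 - R(t_i)) = log(1 - exp(-z)).
            total += math.log(-math.expm1(-z))
    return total


# Hypothetical parameter values, chosen only to show a call.
value = parallel_log_likelihood(2, 0.8, 1.5, t, delta)
```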
The literature on the reliability of either parallel or series systems is abundant, and different solutions have been presented. Salinas-Torres et al. (1997), Salinas-Torres et al. (2002), Polpo and Pereira (2009) and Polpo and Sinha (2011) discussed Bayesian nonparametric statistics for series and parallel systems. Under Weibull probability distributions, Bayesian inferences for system and component reliabilities were introduced by Polpo et al. (2009), and Bhering et al. (2014) presented a hierarchical Bayesian Weibull model for the estimation of components’ reliabilities in series and parallel systems, proposing a useful computational approach. Using simulation for series systems, Rodrigues et al. (2012), considering Weibull families, compared three types of estimators: Kaplan-Meier, maximum likelihood and Bayesian plug-in estimators. Polpo et al. (2012) performed a comparative study of Bayesian estimation of a survival curve.
Series and parallel systems are particular cases of a class of systems called coherent. A system is said to be coherent if all components are relevant, that is, every component plays some role in the functional capacity of the system, and the structure function is nondecreasing in each component, that is, the system will not become worse than before if a failed component is replaced by another that works.
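The two conditions in this definition can be checked by brute force for small systems. The sketch below is only an illustration: it encodes the state of each component as 0 (failed) or 1 (working), takes a hypothetical structure function `phi` mapping a state tuple to the system state, and tests monotonicity and relevance over all $2^m$ configurations. The function names are our own.

```python
from itertools import product


def is_coherent(phi, m):
    """Check coherence of a structure function phi on m binary components."""
    states = list(product([0, 1], repeat=m))

    # Monotonicity: repairing a failed component never makes the system worse.
    for x in states:
        for j in range(m):
            if x[j] == 0:
                repaired = x[:j] + (1,) + x[j + 1:]
                if phi(repaired) < phi(x):
                    return False

    # Relevance: each component changes the system state in some configuration.
    for j in range(m):
        relevant = any(
            phi(x[:j] + (1,) + x[j + 1:]) != phi(x[:j] + (0,) + x[j + 1:])
            for x in states
        )
        if not relevant:
            return False
    return True


def series(x):
    return min(x)


def parallel(x):
    return max(x)


def two_out_of_three(x):
    return int(sum(x) >= 2)


def ignores_last(x):
    # Series system on all but the last component, which is irrelevant.
    return min(x[:-1])
```

For instance, `is_coherent(series, 3)` and `is_coherent(two_out_of_three, 3)` hold, whereas `is_coherent(ignores_last, 3)` fails the relevance condition.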
An important property presented by Barlow and Proschan (1981) is that every coherent system can be written in a series-parallel system (SPS) representation and in a parallel-series system (PSS) representation. Figure 1.3 shows a PSS together with its SPS representation, and Figure 1.4 shows an SPS together with its PSS representation.
Considering this celebrated property, Polpo et al. (2013) introduced Bayesian nonparametric statistics for a class of coherent systems in order to estimate the components’ reliabilities. They restricted themselves to cases in which no component appears more than once in the parallel-series and series-parallel representations, under the assumption that two or more components cannot fail at the same instant of time. However, it is common that, in the representation of the system, some components appear in two different places within it. For instance, consider again the representations in Figure 1.3 (or Figure 1.4). We have the reliabilities of four components to estimate. However, two of them are in fact the same component, and they will fail at the same time, which violates the assumption. For this reason, it is important to have estimators for both SPS and PSS, which together cover a wide variety of representations. If one of these representations does not violate the assumptions, then the proposed Bayesian nonparametric estimator can be used. This nonparametric approach is presented in detail in Chapter 2.
Figure 1.5 is the bridge system described in the literature (Barlow and Proschan, 1981) and Figure 1.6 illustrates its SPS and PSS representations. Note that each of the five components appears twice in both representations. Another interesting structure is the $k$-out-of-$m$ system, which works only if at least $k$ out of the $m$ components work. For instance, Figure 1.7 shows the simple 2-out-of-3 case in its SPS and PSS representations. Note that each of the three components also appears twice in both representations. Situations like these violate the assumption of Polpo et al. (2013), and their approach is no longer suitable for either the PSS or the SPS representation. Thus, a solution for estimating the reliability of components in systems such as those in Figures 1.5 and 1.7 is to consider the parametric approach; the general likelihood function is developed in what follows.
In the likelihood functions (1.1) and (1.2), the $j$th component is susceptible only to right-censored or only to left-censored data, respectively. In a more general case, a component can be susceptible to both kinds of censoring. Consider, for instance, the 2-out-of-3 system (Figure 1.7), which works if at least 2 out of its 3 components work. Suppose that a 2-out-of-3 system is observed in which component 1 failed first and component 3 failed next. At the moment component 3 failed, the system failed. Thus, component 3 is uncensored, component 1 is left-censored and component 2 is right-censored. Another 2-out-of-3 system is observed, but in this one component 2 was the first to fail (left-censored), component 1 was the last to fail (uncensored) and component 3 was still working at the system failure (right-censored). Note that each component is susceptible to having an uncensored, left-censored or right-censored failure time.
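The two scenarios above can be reproduced with a short sketch. It uses the fact that a $k$-out-of-$m$ system fails at the $(m - k + 1)$th component failure, and it assumes, as elsewhere in the text, that no two components fail at the same instant. The function name and the numerical lifetimes are hypothetical; in practice the lifetimes of censored components are of course not observed.

```python
def k_out_of_m_statuses(x, k):
    """System failure time and component statuses for a k-out-of-m system.

    x holds the (unobservable) component lifetimes; ties are assumed absent.
    """
    m = len(x)
    # The system fails at the (m - k + 1)th smallest component lifetime.
    t = sorted(x)[m - k]
    statuses = []
    for x_j in x:
        if x_j == t:
            statuses.append("uncensored")
        elif x_j < t:
            statuses.append("left-censored")
        else:
            statuses.append("right-censored")
    return t, statuses


# First scenario: component 1 fails first, then component 3.
first = k_out_of_m_statuses([1.0, 3.0, 2.0], k=2)
# Second scenario: component 2 fails first, then component 1.
second = k_out_of_m_statuses([2.0, 1.0, 3.0], k=2)
```

Series and parallel systems are recovered as the special cases `k = m` and `k = 1`, respectively.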
Another kind of censoring could also occur: suppose that the failure time of a machine (a sample unit) is only known to lie in an interval $(l, u)$, where $l$ is the observed lower limit and $u$ the observed upper limit. If two or more components failed, they are all interval-censored in $(l, u)$.
To generalize the notation to all cases of component failure and censoring, consider the following: for a specific component $j$ of the system unit $i$, let $(l_{ji}, u_{ji})$ be a general interval of time, in which

$l_{ji} = u_{ji} = t_i$, if the failure time of the $j$th component causes the failure of the $i$th system;

$l_{ji} = t_i$ and $u_{ji} = \infty$, if the $j$th component is right-censored at $t_i$;

$l_{ji} = 0$ and $u_{ji} = t_i$, if the $j$th component is left-censored at $t_i$;

$0 < l_{ji} < u_{ji} < \infty$, if the $j$th component is interval-censored.
Consider that a random sample of systems with the structure in Figure 1.7 is observed. The data are presented in Table 1.3. For instance, system ID=2 failed at time $t_2 = 2.09$ and the failure of component 2 caused the system failure ($l_{22} = 2.09$ and $u_{22} = 2.09$), component 1 had failed before time $2.09$ ($l_{12} = 0$ and $u_{12} = 2.09$) and component 3 is right-censored at time $2.09$ ($l_{32} = 2.09$ and $u_{32} = \infty$).
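The interval notation can be sketched as a small mapping from the status of a component at the system failure time to its pair of limits. The function name and status labels are our own; the interval-censored case is left out because its limits do not follow from the system failure time alone.

```python
import math


def censoring_interval(status, t):
    """Return the limits (l, u) for a component observed at system failure time t."""
    if status == "uncensored":
        return (t, t)
    if status == "right-censored":
        return (t, math.inf)
    if status == "left-censored":
        return (0.0, t)
    raise ValueError(f"unknown status: {status}")


# System ID=2 of the example in the text: component 1 failed earlier,
# component 2 caused the failure, component 3 was still working.
system_2 = [
    censoring_interval("left-censored", 2.09),
    censoring_interval("uncensored", 2.09),
    censoring_interval("right-censored", 2.09),
]
```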
System ID  Component 1  Component 2  Component 3  

l  u  l  u  l  u  
1  1.95  1.95  1.95  0  1.95  
2  0  2.09  2.09  2.09  2.09  
3  3.56  3.56  3.56  0  3.56  
4  2.55  0  2.55  2.55  2.55  
5  1.89  1.89  1.89  0  1.89  
6  3.01  0  3.01  3.01  3.01  
7  2.43  2.43  0  2.43  2.43  
8  0  1.51  1.51  1.51  1.51  
9  3.55  3.55  3.55  0  3.55  
10  2.35  0  2.35  2.35  2.35 
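To complete the theoretical setting, let $f(t \mid \theta_j)$ and $R(t \mid \theta_j)$ be the density and reliability functions of the $j$th component, respectively, where $\theta_j$ is the parameter, which can be either a scalar or a vector. Using the above notation, the likelihood function is as follows:
$$L(\theta_j \mid \mathbf{l}_j, \mathbf{u}_j) = \prod_{i=1}^{n} \left[f(l_{ji} \mid \theta_j)\right]^{I(l_{ji} = u_{ji})} \left[R(l_{ji} \mid \theta_j) - R(u_{ji} \mid \theta_j)\right]^{1 - I(l_{ji} = u_{ji})}, \tag{1.3}$$
where $I(l_{ji} = u_{ji}) = 1$ if $l_{ji} = u_{ji}$ or $0$ otherwise; and $\mathbf{l}_j = (l_{j1}, \ldots, l_{jn})$, $\mathbf{u}_j = (u_{j1}, \ldots, u_{jn})$. Since $R(0 \mid \theta_j) = 1$ and $R(\infty \mid \theta_j) = 0$, the second factor reduces to $R(l_{ji} \mid \theta_j)$ for a right-censored observation and to $1 - R(u_{ji} \mid \theta_j)$ for a left-censored one, so that (1.1) and (1.2) are particular cases.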
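A minimal sketch of how such a general likelihood can be evaluated is given below. It takes a list of limits $(l, u)$ for one component, together with a log-density and a reliability function supplied by the user; an exact failure contributes the log density and any censored observation contributes $\log\{R(l) - R(u)\}$. The exponential model with rate 1 and the numerical intervals are hypothetical and serve only to show a call.

```python
import math


def log_likelihood(intervals, log_density, reliability):
    """Log-likelihood of one component from its censoring intervals (l, u)."""
    total = 0.0
    for l, u in intervals:
        if l == u:
            # Exact failure time.
            total += log_density(l)
        else:
            # Right-, left- or interval-censored observation.
            total += math.log(reliability(l) - reliability(u))
    return total


# Assumed model for illustration only: exponential lifetime with rate 1.
def exp_log_density(x):
    return -x


def exp_reliability(x):
    return math.exp(-x)  # math.exp(-math.inf) evaluates to 0.0


# Hypothetical observations: exact, right-censored, left-censored, interval-censored.
intervals = [(2.09, 2.09), (1.95, math.inf), (0.0, 3.56), (1.0, 2.0)]
value = log_likelihood(intervals, exp_log_density, exp_reliability)
```

Passing the model as two functions keeps the sketch independent of the distribution family, so a Weibull model could be plugged in without changing `log_likelihood`.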
A parametric approach that uses the likelihood function (1.3) to estimate the components’ reliabilities in any kind of coherent system, from the simplest to the most complex structures, is presented in Chapter 3. The available information is the failure time of the system and the status of each component at the instant of system failure. This approach does not require the components’ lifetimes to be identically distributed. The main assumptions are that the components’ lifetimes are mutually independent and that their distributions are three-parameter Weibull, a very general distribution that can approximate most lifetime distributions. The paradigm is the Bayesian one. Another advantage of the Weibull is that, in our paradigm, even with improper priors, the posterior distributions turn out to be proper. The computational mechanism presented can also be used for any other family of distributions whenever proper priors are used.
Chapters 2 and 3 address the problem of component estimation in coherent systems under the Bayesian paradigm with nonparametric and parametric approaches, respectively. In both chapters the status of each component at the time of system failure is considered to be known. However, identifying which component’s failure caused the failure of a given system can be a difficult task. In special situations, we can only establish that the failed components belong to small sets of components. Cases like this are known as masked data on the cause of failure, and they are usually due to limited resources for diagnosing the cause of the failure. As an example of the masked data problem, Basu et al. (1999) cited failures of large computer systems, where the analysis is often performed in such a way that a small subset of components is identified as the cause of failure. In an attempt to repair the system as quickly as possible, the entire subset of components is replaced and the component responsible for the failure cannot be investigated.
In Chapter 4 the formulation of the masked data problem is developed and a Bayesian three-parameter Weibull model for components’ reliabilities in the masked data scenario is presented. This model is general and can be considered for components involved in any coherent system.
Chapter 2 Nonparametric
A nonparametric estimator for all the reliability functions involved in the series-parallel system (SPS) and parallel-series system (PSS) is presented, under the assumption that the components’ reliabilities are unknown; the only available information is the failure times of the systems and the component that produced each failure. The required assumptions are that the component failure times are mutually independent and that two or more components cannot fail at the same instant of time.
In Section 2.1 the probability results necessary for the development of the estimator are presented. Section 2.2 is devoted to the construction of the nonparametric Bayesian estimator for SPS and PSS with three components ($m = 3$). In Section 2.3 the results are extended to a more general case of $m > 3$; and in Section 2.4 the estimator is applied to simulated datasets to illustrate its qualities.
2.1 Probability relations
In this section, important results and properties of the PSS and SPS are presented. Before that, we present these results for series and parallel systems, which are simpler and will facilitate the understanding of the results for PSS and SPS.
2.1.1 Series and parallel systems
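First consider a parallel system with $m$ components. Let $X_j$ be the failure time of the $j$th component, with marginal distribution function (DF) $F_j$, and let $T = \max(X_1, \ldots, X_m)$ be the system failure time. The indicator of the component whose failure caused the system to fail is $\delta = j$ when $T = X_j$, $j = 1, \ldots, m$. The $j$th subdistribution function evaluated at a time $t$ is the probability that the system survives at most to time $t$ and the last component to fail is the $j$th component, that is, $F_j^*(t) = P(T \le t, \delta = j)$.
Let $F(x_1, \ldots, x_m)$ be the joint distribution function of $(X_1, \ldots, X_m)$, for which continuous partial derivatives are assumed over all arguments. The following theorem establishes the relation between the joint distribution function and the $j$th subdistribution function $F_j^*$.
Theorem 1
The derivative of $F_j^*(t)$, $dF_j^*(t)/dt$, is equal to the partial derivative of $F(x_1, \ldots, x_m)$ with respect to the $j$th argument, evaluated at $x_1 = \cdots = x_m = t$.
Because the lifetimes of the components are assumed to be mutually independent,
$$F(x_1, \ldots, x_m) = \prod_{k=1}^{m} F_k(x_k). \tag{2.1}$$
Using (2.1) and Theorem 1,
$$\frac{dF_j^*(t)}{dt} = f_j(t) \prod_{k \ne j} F_k(t) = \mu_j(t) \prod_{k=1}^{m} F_k(t), \tag{2.2}$$
where $f_j$ is the density of $X_j$ and $\mu_j$ is the reversed hazard rate (RHR) of the $j$th component:
$$\mu_j(t) = \frac{f_j(t)}{F_j(t)} = \frac{d}{dt} \log F_j(t). \tag{2.3}$$
From (2.3) one can write
$$F_j(t) = \exp\left\{-\int_t^{\infty} \mu_j(u)\,du\right\}. \tag{2.4}$$
Letting $F(t) = F(t, \ldots, t) = \prod_{k=1}^{m} F_k(t)$ denote the DF of $T$, (2.2) becomes
$$\frac{dF_j^*(t)}{dt} = \mu_j(t) F(t). \tag{2.5}$$
Taking now the sum over $j$ on both sides of (2.5), we obtain
$$\sum_{j=1}^{m} \frac{dF_j^*(t)}{dt} = F(t) \sum_{j=1}^{m} \mu_j(t) = \frac{dF(t)}{dt}. \tag{2.6}$$
Consequently,
$$F(t) = \sum_{j=1}^{m} F_j^*(t),$$
which combined with (2.5) leads to
$$\mu_j(t) = \frac{dF_j^*(t)/dt}{\sum_{k=1}^{m} F_k^*(t)}. \tag{2.7}$$
Finally, (2.4) implies
$$F_j(t) = \exp\left\{-\int_t^{\infty} \frac{dF_j^*(u)}{\sum_{k=1}^{m} F_k^*(u)}\right\}, \tag{2.8}$$
that is, the relationship of interest between marginal distribution functions and subdistribution functions.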
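The relationship between marginal and subdistribution functions can be checked numerically in the continuous case. The sketch below assumes that the relation takes the form $F_j(t) = \exp\{-\int_t^{\infty} dF_j^*(u) / \sum_k F_k^*(u)\}$ and uses a hypothetical parallel system of two independent exponential components. The marginal DFs are used only to build the subdistribution functions on a grid; the recovery step then works from the subdistribution functions alone, with the infinite upper limit truncated at `t_max`.

```python
import math

# Assumed example: two independent exponential components in parallel.
rates = [1.0, 0.5]
h, t_max = 0.002, 30.0
grid = [i * h for i in range(int(round(t_max / h)) + 1)]


def marginal_df(j, u):
    return 1.0 - math.exp(-rates[j] * u)


def density(j, u):
    return rates[j] * math.exp(-rates[j] * u)


def sub_density(j, u):
    """Derivative of F_j^*: component j fails at u after all others have failed."""
    prod = 1.0
    for k in range(len(rates)):
        if k != j:
            prod *= marginal_df(k, u)
    return density(j, u) * prod


# Subdistribution functions on the grid, by the trapezoidal rule.
F_star = []
for j in range(len(rates)):
    values = [0.0]
    for a, b in zip(grid[:-1], grid[1:]):
        values.append(values[-1] + 0.5 * h * (sub_density(j, a) + sub_density(j, b)))
    F_star.append(values)


def recover_marginal(j, t):
    """Recover F_j(t) numerically from the subdistribution functions only."""
    start = int(round(t / h))
    integral = 0.0
    for i in range(start, len(grid) - 1):
        increment = F_star[j][i + 1] - F_star[j][i]
        total_mid = 0.5 * sum(Fs[i] + Fs[i + 1] for Fs in F_star)
        integral += increment / total_mid
    return math.exp(-integral)
```

Here `recover_marginal(0, 1.0)` can be compared with the true value $1 - e^{-1}$; the agreement is limited only by the grid step and the truncation point.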
Unfortunately, the expression in (2.8) does not work for the case with jump points. To obtain a version of (2.8) in the presence of jumps, we introduce the following definition and theorem.
Definition 1
For simplicity, consider the case of . The function based on the subdistributions and is
where is integration over disjoint open intervals that do not include the jump points of and is product over jump points of .
The next result, although restricted to $m = 2$, extends expression (2.8) in the sense that it can include disjoint jump points.
Theorem 2
The subdistribution functions and determine (uniquely) the distribution function for by .
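An analogous development can be performed for a series system with $m$ components, in which $T = \min(X_1, \ldots, X_m)$ and $\delta = j$ if $T = X_j$. The version of (2.8) for a series system is given by (Salinas-Torres et al., 2002):
$$R_j(t) = \exp\left\{\int_0^{t} \frac{dS_j^*(u)}{\sum_{k=1}^{m} S_k^*(u)}\right\}, \tag{2.9}$$
in which $R_j(t) = 1 - F_j(t)$ is the reliability function and $S_j^*(t) = P(T > t, \delta = j)$ is the subreliability function of the $j$th component.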
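As a worked check of the series-system relation, assume (only for illustration) independent exponential components with rates $\lambda_1, \ldots, \lambda_m$, and take the relation in the form $R_j(t) = \exp\{\int_0^t dS_j^*(u) / \sum_k S_k^*(u)\}$ with $S_j^*(t) = P(T > t, \delta = j)$. The subreliability functions are then available in closed form, and the relation returns the exponential reliability of each component:

```latex
\begin{align*}
S_j^*(t) &= \int_t^{\infty} \lambda_j e^{-\lambda_j u} \prod_{k \ne j} e^{-\lambda_k u}\,du
          = \frac{\lambda_j}{\lambda}\, e^{-\lambda t},
          \qquad \lambda = \sum_{k=1}^{m} \lambda_k, \\
\sum_{k=1}^{m} S_k^*(u) &= e^{-\lambda u},
          \qquad dS_j^*(u) = -\lambda_j e^{-\lambda u}\,du, \\
\exp\left\{\int_0^{t} \frac{dS_j^*(u)}{\sum_{k=1}^{m} S_k^*(u)}\right\}
         &= \exp\left\{-\int_0^{t} \lambda_j\,du\right\}
          = e^{-\lambda_j t} = R_j(t).
\end{align*}
```

Note that $dS_j^*$ is a negative measure, which is why the exponent is nonpositive and the result is a proper reliability function.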
Unfortunately, the expression in (2.9) does not work for the case with jump points. To obtain a version of (2.9) in the presence of jumps, we introduce the following definition and theorem.
Definition 2
For simplicity, consider the case of . The function based on the subreliability functions and is
where is integration over disjoint open intervals that do not include the jump points of and is product over jump points of .
The next result, although restricted to $m = 2$, extends expression (2.9) in the sense that it can include disjoint jump points.
Theorem 3
The subreliability functions and determine (uniquely) the distribution function for by .
More details about the relations among the distribution and subdistribution functions (subreliability functions) can be found in Salinas-Torres et al. (2002) and Polpo and Sinha (2011) for series systems and in Polpo and Pereira (2009) for parallel systems.
In the next subsection, the relations among the distribution and subdistribution functions are presented for a more general class of systems: SPS and PSS.
2.1.2 PSS and SPS
Let $X_1$, $X_2$ and $X_3$ be the lifetimes of the three components of a PSS or an SPS, with marginal distribution functions (DF) $F_1$, $F_2$, and $F_3$, respectively. The restriction here is that the three sets of jump points of $F_1$, $F_2$, and $F_3$ must be disjoint. The indicator of the component whose failure caused the system to fail is $\delta = 1$ when $T = X_1$, $\delta = 2$ when $T = X_2$, and $\delta = 3$ when $T = X_3$. Let $F_j^*$ be the subdistribution function of the $j$th component and $F$ the distribution function of the system. The following properties can be proved.
Property 1
The subdistribution functions (SDF) $F_1^*$, $F_2^*$, and $F_3^*$ determine the DF of the system,
$$F(t) = \sum_{j=1}^{3} F_j^*(t). \tag{2.10}$$
Property 2

;

;

;

.
Property 3
The sets of jump points of $F_j$ and $F_j^*$ are the same, where $j = 1, 2, 3$. Because $F_1$, $F_2$, and $F_3$ have disjoint sets of jump points, so have $F_1^*$, $F_2^*$, and $F_3^*$.
Property 4
If for , and 1 for , then is the largest support point of the system.
The lifetime of the SPS is and the system reliability of sindependent components is
(2.11) 
The lifetime of the PSS is and the system reliability of sindependent components is
(2.12) 
Property 5
The SDF of the SPS can be expressed using the marginal DF of the components by
(2.13) 
and the SDF of the PSS can be expressed using the marginal DF of the components by
(2.14) 
Our interest is to obtain the inverses of (2.13) and (2.14); that is, to express the DFs $F_1$, $F_2$ and $F_3$ as functions of the SDFs ($F_1^*$, $F_2^*$, $F_3^*$). These inverses are presented in the following definitions and theorems.
Definition 3
The functions , and based on subdistributions , , and are
The functions (for a series system), and (for a parallel system) are the versions with three components for those presented in Polpo e Sinha (2011) and Polpo e Pereira (2009), respectively. First, Theorem 4 states the relation between and , , and .
Theorem 4
The SDF , , and determine (uniquely) the DF of an SPS for by , and the DF of a PSS for by .
The next definition gives the functions (for the SPS), and (for the PSS).
Definition 4
The functions , and , based on subdistributions , , and , are
Theorem 5
The SDF , , and determine (uniquely) the DF of an SPS for by , and the DF of a PSS for by .
Note that Theorem 5 can be easily rewritten to obtain the relation of DF and the SDF.
Theorem 5 provides an important relation between the SDF and the DF, for both SPS and PSS. Using this result, in the next section we develop the nonparametric Bayesian estimator for the DF of the system’s components.
2.2 Bayesian Analysis
This section describes a Bayesian reliability approach to SPS and PSS. We derive a nonparametric Bayesian estimator of the distribution function using the multivariate Dirichlet process (Salinas-Torres et al., 2002). From Property 1, the subdistribution functions are related to the system distribution function by a sum. Considering that $R(t) = 1 - F(t)$, the four quantities $F_1^*(t)$, $F_2^*(t)$, $F_3^*(t)$ and $R(t)$ have a sum equal to $1$, and the set of possible points for $(F_1^*(t), F_2^*(t), F_3^*(t), R(t))$ is the four-dimensional simplex, or, for the nonsingular form, the corresponding set for $(F_1^*(t), F_2^*(t), F_3^*(t))$. In this case, for a fixed $t$, a natural prior choice is the Dirichlet distribution, and, for any $t$, we have the Dirichlet multivariate process. In this section a nonparametric estimator for the distribution functions of the components in an SPS or in a PSS is derived; using the Dirichlet process, we have a complete distribution for the set of subdistribution functions. In this case, our parameters are the functions that we want to estimate, giving us a nonparametric framework.
Consider a sample of size and the observed data are , in which for SPS and
for PSS. Besides, if , for and . Equivalently, for each ,the random variables are observed:
in which is a indicator function of set .
The function is empirical subdistribution function of th component. If is the empirical distribution function corresponding to the observations , thus for each ,
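The empirical quantities can be sketched as follows, assuming the usual definitions: the empirical subdistribution function of component $j$ at $t$ is the proportion of sample units that failed by time $t$ because of component $j$, and the empirical distribution function is the proportion that failed by time $t$, so that the latter is the sum of the former over the components. The function names are our own, and the data of Table 1.2 are reused purely as illustrative input.

```python
def empirical_subdistribution(t_obs, delta_obs, j, t):
    """Proportion of units with failure time <= t caused by component j."""
    count = sum(1 for t_i, d_i in zip(t_obs, delta_obs) if t_i <= t and d_i == j)
    return count / len(t_obs)


def empirical_distribution(t_obs, t):
    """Proportion of units with failure time <= t."""
    return sum(1 for t_i in t_obs if t_i <= t) / len(t_obs)


# Illustrative input: failure times and indicators from Table 1.2.
t_obs = [0.18, 1.15, 4.93, 0.01, 1.01, 1.51, 1.74]
delta_obs = [1, 3, 2, 2, 1, 3, 2]

values = [empirical_subdistribution(t_obs, delta_obs, j, 1.2) for j in (1, 2, 3)]
total = empirical_distribution(t_obs, 1.2)
```

At $t = 1.2$ four of the seven units have failed, two of them because of component 1, so `values` sums to `total`.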
For each , let the realization of , in which
In this context, for each , the likelihood function corresponds to the likelihood of a multinomial model being , for , and , that is,
(2.15)  
The prior distribution is constructed from the characterization of the multivariate Dirichlet process, defined in Salinas-Torres et al. (1997), and it may have the following simplified version.
Definition 5
Let be a sample space, be finite positive measures defined over , and be a random vector having a Dirichlet distribution with parameters . Consider Dirichlet processes, , with , . All these processes and are mutually independent random quantities. Define . The is a Dirichlet multivariate process with parameter measures .
In the context of SPS and PSS, consider , and , for . Then, the vector of components subdistribution functions is and the prior distribution is given by
(2.16) 
Combining the prior distribution (2.16) and the likelihood function in (2.15), the posterior distribution of is, for each ,
Thus, the posterior means of and are given by
(2.17) 
where , and
(2.18) 
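For a fixed $t$, the update behind these posterior means is the standard conjugacy between a Dirichlet prior and a multinomial likelihood: the posterior is again Dirichlet, with each parameter increased by the corresponding count. The sketch below illustrates only this fixed-$t$ step. The prior parameters are hypothetical stand-ins for the prior measures evaluated at $t$, and the counts are the numbers of units failed by $t = 1.2$ due to components 1, 2 and 3, plus the number still working, taken from Table 1.2 for illustration.

```python
def dirichlet_posterior_mean(prior, counts):
    """Posterior mean of a Dirichlet prior updated with multinomial counts."""
    posterior = [a + c for a, c in zip(prior, counts)]
    total = sum(posterior)
    return [p / total for p in posterior]


# Hypothetical prior parameters for (F_1^*(t), F_2^*(t), F_3^*(t), 1 - F(t)).
prior = [1.0, 1.0, 1.0, 1.0]
# Counts at t = 1.2: failures due to components 1, 2, 3, and surviving units.
counts = [2, 1, 1, 3]

mean = dirichlet_posterior_mean(prior, counts)
```

The first three entries of `mean` play the role of estimates of the subdistribution functions at $t$, and their sum estimates the system distribution function at $t$.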
These Bayesian estimators are strongly s-consistent. For instance, using the Glivenko-Cantelli Theorem (Billingsley, 1985), it can be shown that converges to uniformly with probability 1.
If , the Bayesian estimator of is given by
(2.19) 
Let the distinct order statistics of be . Set , and , . Define
(2.20) 
(2.21) 
(2.22) 
(2.23) 
(2.24) 
(2.25) 