RAM-study


Panos Sarafidis

Reliability - Availability - Maintainability STUDY

Project: MASE Δ906Ν Implementation


Reliability - Availability - Maintainability for the PROJECT: MASE Δ906.Ν

Table of Contents

1. Introduction
   1.1 Types of Analysis
   1.2 Limitations of Analysis
   1.3 Analysis Format
   1.4 Failure Rate Calculation Models
   1.5 Methodology
   1.6 System Availability
       1.6.1 Availability in Series
       1.6.2 Availability in Parallel
2. Theory Behind Model
   2.1 Definitions
   2.2 Single Component Analysis
       2.2.1 Systems with Components in Series
   2.3 Systems with Redundant Components
       2.3.1 Redundant Active Systems - Non-repairable
   2.4 Common Cause
       Common MTBF Misconceptions
3. The RAM model of the offered System
   3.1 Reliability Block Diagram (RBD)
   3.2 Offered Component's Availability List
   3.3 Spare Parts
4. Maintainability
   4.1 Preventive Maintenance
       4.1.1 Value of Preventive Maintenance
       4.1.2 The Fallacy of "Constant Failure Rate" and "Preventive Replacement"
       4.1.3 Determining Preventive Replacement Time
       4.1.4 Optimum Age Replacement Policy
5. Conclusion
   5.1 Cost/Penalty Function
   5.2 Quantifying the Cost/Penalty Function
   5.3 Maximum Achievable Reliability
   5.4 Cost Function
6. General RAM details of SUN V880
   6.1 Modeling Sun Cluster Availability general details (proposed alternatively)

1. Introduction

Various methods are utilized to perform reliability and availability studies, depending on the configuration and parameters of the system being modelled. These methods and models can be adapted to almost any type of electronic/mechanical system. In general, a system uses a combination of serial and parallel configurations. The availability of each configuration is then calculated, based on component mean time between failures (MTBF) and mean time to repair (MTTR), by one or more methods depending on the variables involved. The most common variables are: number and configuration of active chains, number and configuration of backup chains, repairman availability, and the logistic delay time for spare parts.

1.1 Types of Analysis

In order to concentrate on design issues, the availability should be based on the premise that all components are spared and an engineer is available for each failed unit. The objective of this analysis is to reveal any design weaknesses at a very early stage (when changes can readily be made at low cost) and to provide an estimate of achievable availability. An updated analysis should be made when the design work is complete and the actual component manufacturers and models are fixed. This analysis will incorporate operating procedures and maintenance procedures (in so far as they affect availability) and more accurate MTBF and MTTR data. The objectives of this analysis are twofold: i) to provide an availability input to the sparing policy, and ii) to determine the final estimate of system availability.

1.2 Limitations of Analysis

Generally the analyses are modelled to determine the availability of a system during the operating life when the failure rate is considered constant. Obviously the absolute accuracy of the analysis depends on the quality of the input data and modelling constraints, which may result in a significant degree of departure from actual operating experience. However, the usefulness of the analysis lies primarily with its relative accuracy. By varying the system design parameters, operating procedures, and maintenance procedures, significant insights into the importance and effects related to each variable can be inferred.

1.3 Analysis Format

The result of the analysis is presented in a formal report that consists of three major sections. The main section is a written report containing a summary of the results, the postulates on which the analysis is based, availability block diagrams, a discussion of the limitations of the analysis, and an identification of availability-critical components. The methodology section provides a detailed summary of the methods used to calculate the availabilities. The third section consists of supporting documentation. The format can be adapted to conform to customer and contractual requirements.

1.4 Failure Rate Calculation Models

There are two commonly used failure rate models for performing reliability predictions on electronics: MIL-HDBK-217 and Bellcore. Both are accepted standards, developed over several years. Both include mathematical equations for determining the failure rate (and in turn, the MTBF) of various components, such as integrated circuits, transistors, diodes, resistors, and capacitors.

MIL-HDBK-217, also known as the Military Handbook for "Reliability Prediction of Electronic Equipment", is published by the Department of Defense, based on work done by the Reliability Analysis Center and Rome Laboratory at Griffiss AFB, NY. The latest version of MIL-HDBK-217 is MIL-HDBK-217F, Notice 2 (217F-2).

The non-military alternative to MIL-HDBK-217 is Bellcore. The Bellcore reliability prediction model was originally developed by AT&T Bell Labs, which modified the equations from MIL-HDBK-217 to better represent what its equipment was experiencing in the field. The most recent revision of the Bellcore Reliability Prediction Procedure, TR-332, is Issue 6.

While the main concepts behind MIL-HDBK-217 and Bellcore are similar, MIL-HDBK-217 is targeted at military applications and tends to be somewhat pessimistic when used for commercial-quality products. Bellcore adds the ability to take into account burn-in, field, and laboratory testing. These added abilities have made the Bellcore standard very popular with commercial organizations. Even though MIL-HDBK-217 is widely used by both commercial and military companies, many companies making products such as computers, telecommunications systems, medical systems, and power supplies are choosing to use the Bellcore handbook for their reliability predictions, reporting that their predicted MTBF is consistent with their actual field experience.


Because of this, the Bellcore model should be the expected model when predicting MTBF for most commercial applications.
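Although MIL-HDBK-217 and Bellcore differ in their part-level equations and adjustment factors, both ultimately roll part failure rates up into an equipment-level failure rate and MTBF. The sketch below illustrates only that roll-up step; the part counts and failure rates are hypothetical placeholders, not values taken from either handbook.

```python
# Minimal parts-count style roll-up. Failure rates are in FITs
# (failures per 1e9 hours); all numbers are illustrative assumptions.
parts = {
    "integrated_circuit": (25, 12.0),   # (quantity, failure rate per part in FITs)
    "transistor":         (40,  2.0),
    "diode":              (60,  1.0),
    "resistor":           (300, 0.5),
    "capacitor":          (150, 1.5),
}

# Series assumption: the equipment fails if any part fails, so rates add.
lambda_total_fits = sum(qty * fits for qty, fits in parts.values())
lambda_per_hour = lambda_total_fits / 1e9
mtbf_hours = 1.0 / lambda_per_hour

print(f"Total failure rate: {lambda_total_fits:.1f} FITs")
print(f"Predicted MTBF:     {mtbf_hours:,.0f} hours")
```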

1.5 Methodology

The following is a brief summary of the methods usually employed:

- Markov Process - redundant chains are modelled using a Markov process. This method utilizes transient state equations to calculate the availability. The equations are usually solved using matrix representations of the state equations in a steady-state condition, except when calculating mean time to system failure, when non-steady states must be considered.

- Binomial Method - a modified binomial expansion is used to determine the availability of a redundant chain. This method produces the same result as the Markov process using a less intuitive but somewhat more versatile model.

- Monte Carlo Simulation - a proprietary computer program conducts a large number of trials in which the state of the system is determined for each trial (a simple sketch follows this list). The availability is based on the number of system failures encountered. Although the results vary with each simulation and number of trials, this method can be adapted to many complex systems for which Markov and binomial modelling may not be feasible.

- Approximation Formula - an approximation formula is occasionally used to provide a quick estimate of the availability and as a cross-check for one of the above methods.
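To make the Monte Carlo idea concrete, the sketch below simulates the up/down history of a single repairable unit with an assumed MTBF and MTTR (exponentially distributed times, hypothetical values) and estimates availability as the fraction of simulated time the unit is up. It only illustrates the principle; it is not the proprietary program referred to above.

```python
import random

def simulate_availability(mtbf_hours, mttr_hours, horizon_hours, trials=200):
    """Estimate steady-state availability of one repairable unit by
    alternating exponentially distributed up and down periods."""
    up_total = 0.0
    for _ in range(trials):
        t, up = 0.0, True
        while t < horizon_hours:
            mean = mtbf_hours if up else mttr_hours
            duration = min(random.expovariate(1.0 / mean), horizon_hours - t)
            if up:
                up_total += duration
            t += duration
            up = not up
    return up_total / (horizon_hours * trials)

# Hypothetical numbers: MTBF = 10,000 h, MTTR = 8 h, one-year horizon.
est = simulate_availability(10_000, 8, 8_760)
print(f"Simulated availability: {est:.4%}")
print(f"Analytic availability : {10_000 / (10_000 + 8):.4%}")
```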

1.6 System Availability

System availability is calculated by modelling the system as an interconnection of parts in series and parallel. The following rules are used to decide whether components should be placed in series or parallel:

- If failure of a part leads to the combination becoming inoperable, the two parts are considered to be operating in series.

- If failure of a part leads to the other part taking over the operations of the failed part, the two parts are considered to be operating in parallel.

1.6.1 Availability in Series

For a series block diagram, the probability of system success with all independent components operating is the product of the individual component availabilities:

A_series = A_1 x A_2 x ... x A_n

As stated above, two parts X and Y are considered to be operating in series if failure of either of the parts results in failure of the combination. The combined system is operational only if both Part X and Part Y are available. From this it follows that the combined availability is the product of the availabilities of the two parts:

A = A_X x A_Y

EXAMPLE:
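As a stand-in for the worked example, with hypothetical availabilities for two parts X and Y in series:

```python
# Hypothetical availabilities of two parts operating in series.
A_x = 0.99    # 99% (assumed)
A_y = 0.995   # 99.5% (assumed)

A_series = A_x * A_y  # the system is up only if both parts are up
print(f"Series availability: {A_series:.4%}")  # ~98.51%, lower than either part
```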

1.6.2 Availability in Parallel

For a parallel block diagram, the probability of system success with independent components operating in parallel can be shown as:

A_parallel = 1 - (1 - A_1)(1 - A_2) ... (1 - A_n)

As stated above, two parts are considered to be operating in parallel if the combination is considered failed only when both parts fail. The combined system is operational if either part is available. From this it follows that the combined availability is 1 minus the probability that both parts are unavailable:

A = 1 - (1 - A_X)(1 - A_Y)

The implication of the above equation is that the combined availability of two components in parallel is always higher than the availability of either individual component. Consider the system in the figure above, in which two instances of Part X are connected in parallel. EXAMPLE:
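As a stand-in for the worked example, with a hypothetical availability for Part X deployed as two parallel instances:

```python
# One hypothetical part with 99% availability, deployed twice in parallel.
A_x = 0.99
A_parallel = 1 - (1 - A_x) ** 2  # the system fails only if both instances are down
print(f"Single instance: {A_x:.2%}")
print(f"Two in parallel: {A_parallel:.4%}")  # 99.99%
```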

2. Theory Behind Model

The reliability calculations that the model is based upon are relatively straightforward and can be found in most textbooks on the subject. The following is a brief description of the computations used when combining items in series and parallel. First, we will review the definition of a few common terms. Most of these definitions and the equations can be found in the IEEE Gold Book (IEEE Std. 493-1997) and other textbooks.

2.1 Definitions

Availability - A ratio that describes the percentage of time a component or system can perform its required function.

Component - A piece of electrical or mechanical equipment, a line or circuit, or a section of a line or circuit, or a group of items that is viewed as an entity for the purpose of reliability evaluation.

Failure - The termination of the ability of an item to perform a required function.

Failure rate - The mean number of failures of a component per unit exposure time.

Forced downtime - The average time per year a system is unavailable in between failures, expressed in hours per year.

Lambda (λ) - The inverse of the mean exposure time between consecutive failures. Lambda is typically expressed in either failures per year or failures per million hours.

MTBF - The mean exposure time between consecutive failures of a component or system, usually expressed in either years per failure or millions of hours per failure. For some applications, measurement of mean time between repairs (MTBR) rather than mean time between failures may provide more statistically correct information.

MTTR - The mean time to repair a failed component. For a system, it is the total amount of time the system is unavailable in between failures; it is expressed in hours in both cases.

Point - Any place or location within the electrical system. The name or designation for a point is always the same as the name of the zone that the point is located within.

RAM Table - A lookup table in the model that displays the Lambda and MTTR for electrical components.

Reliability - An indication of the ability of a component or system to perform its intended function during a specified time; that is, the probability that a system will perform its required functions under stated conditions for a stated period of time at a given confidence level.

Restore Time - In the model, the time to restore is the sum of the mean time to repair (MTTR) for the failure plus the computed time to restream or restart the connected process unit or load.

System - A group of components connected or associated in a fixed configuration to perform a specified function of distributing power.

2.2 Single Component Analysis

Reliability and availability are necessary to describe the characteristics of the single component shown.

Reliability of components is frequently given as failures per million hours of operating time. Using these numbers, the Mean Time Between Failures (MTBF) can be calculated using:

MTBF = 1,000,000 / (failures per million hours)

The failure rate (λ) is given by:

λ = 1 / MTBF

and the component reliability (R1) for one year (8760 hours) is given by:

R1 = e^(-λ x 8760)

For a single component, the availability (A) is given as the total operating time over the total time, or:

A = MTBF / (MTBF + MTTR)

This formula can be transformed to the following form, which is more suitable for the following solution:

A = 1 / (1 + λ x MTTR)

The next basic task consists of obtaining a relationship between the reliability parameters of the system and the parameters of its sub-systems. For sub-systems in series, the mean time between failures of the system can be expressed as:

MTBF_s = 1 / Σ (1 / MTBF_i),   i = 1 ... n

and the mean time to repair the system is calculated by:

MTTR_s = Σ (MTTR_i / MTBF_i) / Σ (1 / MTBF_i)

Then the relationship between system availability and the R&M parameters of the sub-systems can be obtained by substitution of these equations into A_s = MTBF_s / (MTBF_s + MTTR_s),

which, transformed, gives:

A_s = 1 / (1 + Σ (MTTR_i / MTBF_i))

where:
i = index of the sub-system
n = total number of sub-systems

2.2.1 Systems with Components in Series

In a series configuration, a failure of any component results in failure of the entire system. In most cases, when considering complete systems at their basic subsystem level, it is found that they are arranged reliability-wise in a series configuration. For example, a personal computer may consist of four basic subsystems: the motherboard, the hard drive, the power supply and the processor. These are reliability-wise in series, and a failure of any of these subsystems will cause a system failure. In other words, all of the units in a series system must succeed for the system to succeed. The reliability of the system is the probability that unit 1 succeeds and unit 2 succeeds and all of the other units in the system succeed; all n units must succeed for the system to succeed. The reliability of the system is then given by:

R_s = P(X_1 ∩ X_2 ∩ ... ∩ X_n)

where:
R_s = reliability of the system
X_i = event of unit i being operational
P(X_i) = probability that unit i is operational

In the case where the failure of a component affects the failure rates of other components (i.e. the life distribution characteristics of the other components change when one fails), the conditional probabilities in the equation above must be considered. However, in the case of independent components, the equation becomes:

R_s = P(X_1) x P(X_2) x ... x P(X_n) = R_1 x R_2 x ... x R_n

Effect of Component Reliability in a Series System

In a series configuration, the component with the smallest reliability has the biggest effect on the system's reliability. There is a saying that "a chain is only as strong as its weakest link." This is a good example of the effect of a component in a series system. In a chain, all the rings are in series and if any of the rings break, the system fails. In addition, the weakest link in the chain is the one that will break first. The weakest link dictates the strength of the chain in the same way that the weakest component/subsystem dictates the reliability of a series system. As a result, the reliability of a series system is always less than the reliability of the least reliable component.

For the system shown below with three different components in series, let:

d = MTBF (in hours)
r = MTTR (in hours)

The characteristics of each individual component can be calculated using:

Using these, the combined failure rate (failures per year) becomes:

Reliability of the system for one year, i.e. information about the failure-free interval, expressed as a percentage (Lusser's equation):

System availability:

Probability of failure during one year:

MTBF:

MTTR in hours:

Forced downtime:
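With hypothetical d and r values for the three components, the quantities labelled above can be computed from the standard constant-failure-rate relationships, as in the sketch below (all figures are illustrative, not taken from the offered equipment):

```python
import math

HOURS_PER_YEAR = 8760

# Hypothetical per-component data: d = MTBF (hours), r = MTTR (hours).
components = [
    {"d": 50_000, "r": 8},
    {"d": 120_000, "r": 4},
    {"d": 30_000, "r": 24},
]

for c in components:
    c["lam"] = HOURS_PER_YEAR / c["d"]        # failures per year
    c["A"] = c["d"] / (c["d"] + c["r"])       # component availability

lam_sys = sum(c["lam"] for c in components)                       # combined failure rate
R_year = math.exp(-lam_sys)                                       # one-year reliability (Lusser)
A_sys = math.prod(c["A"] for c in components)                     # system availability
mtbf_sys = HOURS_PER_YEAR / lam_sys                               # system MTBF (hours)
mttr_sys = sum(c["lam"] * c["r"] for c in components) / lam_sys   # system MTTR (hours)
forced_downtime = (1 - A_sys) * HOURS_PER_YEAR                    # hours per year

print(f"Combined failure rate : {lam_sys:.3f} failures/year")
print(f"One-year reliability  : {R_year:.2%},  P(failure) = {1 - R_year:.2%}")
print(f"System availability   : {A_sys:.4%}")
print(f"System MTBF / MTTR    : {mtbf_sys:,.0f} h / {mttr_sys:.1f} h")
print(f"Forced downtime       : {forced_downtime:.1f} h/year")
```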

2.3 Systems with Redundant Components

In a simple parallel system, as shown in the figure, at least one of the units must succeed for the system to succeed. Units in parallel are also referred to as redundant units.

Redundancy is a very important aspect of system design and reliability in that adding redundancy is one of several methods of improving system reliability. It is widely used in the aerospace industry and generally used in mission critical systems. Other example applications include the RAID computer hard drive systems, brake systems and support cables in bridges.

The probability of failure, or unreliability, for a system with n statistically independent parallel components is the probability that unit 1 fails and unit 2 fails and all of the other units in the system fail. So in a parallel system, all n units must fail for the system to fail. Put another way, if unit 1 succeeds or unit 2 succeeds or any of the n units succeeds, then the system succeeds. The unreliability of the system is then given by:

Q_s = P(F_1 ∩ F_2 ∩ ... ∩ F_n)

where:
Q_s = unreliability of the system
F_i = event of failure of unit i
P(F_i) = probability of failure of unit i

In the case where the failure of a component affects the failure rates of other components, the conditional probabilities in the equation above must be considered. However, in the case of independent components, the equation becomes:

Q_s = P(F_1) x P(F_2) x ... x P(F_n) = (1 - R_1)(1 - R_2) ... (1 - R_n)

so that the system reliability is R_s = 1 - Q_s.

Effect of Component Reliability in a Parallel Configuration: When we examined a system of components in series, we found that the least reliable component has the biggest effect on the reliability of the system. In a parallel configuration, however, the component with the highest reliability has the biggest effect on the system's reliability, since the most reliable component is the one that will most likely fail last. This is a very important property of the parallel configuration, specifically in the design and improvement of systems.

2.3.1 Redundant Active Systems - Non-repairable

Reliability can be dramatically increased by installing a parallel (redundant) system. The simplest of these is a system that operates satisfactorily if either one or both of two parallel components function.

The reliability for such a system for one year can be calculated using:

the combined failure rate is:

the system availability is:

and the probability of failure during one year is:
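A sketch of the one-year calculation for two non-repairable units in active parallel, with assumed MTBF values (repair during the mission is not considered, as stated above):

```python
import math

HOURS_PER_YEAR = 8760

# Assumed MTBFs (hours) for the two redundant, non-repairable units.
mtbf_1, mtbf_2 = 40_000, 60_000
lam_1, lam_2 = 1 / mtbf_1, 1 / mtbf_2

R1 = math.exp(-lam_1 * HOURS_PER_YEAR)
R2 = math.exp(-lam_2 * HOURS_PER_YEAR)

R_sys = 1 - (1 - R1) * (1 - R2)   # the system survives if either unit survives
print(f"One-year reliability of the pair : {R_sys:.4%}")
print(f"Probability of failure in a year : {1 - R_sys:.4%}")
```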

Repairable Redundant Systems

If the components can be repaired, the reliability of the systems described above also becomes a function of the time required to repair the system.

Using a constant failure rate for two identical units, the steady-state availability is:

where the repair rate (μ) is 1 / MTTR, and the MTBF is:

For two different components in parallel, the MTBF is:

and the steady-state availability for n blocks is:

or for two parallel components:
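The exact expressions referred to by the labels above come from the Markov model and are not reproduced here. As a simplified sketch, each repairable unit can be treated independently, with its own steady-state availability MTBF/(MTBF + MTTR); the pair's availability then follows the parallel formula used earlier. The MTBF and MTTR values below are assumptions.

```python
def unit_availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one repairable unit (constant rates)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Assumed values for two redundant units, each with its own repair capability.
a1 = unit_availability(40_000, 24)
a2 = unit_availability(60_000, 24)

a_pair = 1 - (1 - a1) * (1 - a2)  # the system is down only when both units are down
print(f"Unit availabilities: {a1:.5%}, {a2:.5%}")
print(f"Pair availability  : {a_pair:.7%}")
```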

2.4 Common Cause

Redundancy calculations frequently lead to reliability numbers that are outside the realm of reason, for example one failure in a thousand years. In reality, even redundancy of components still leaves a chance that the parallel system will fail from a common mode. Examples of this include common electrical connections, common alarm wiring, or the environment. In David Smith's book on reliability, he suggests that these common modes can be considered by using a component in series with the redundant system, as shown in figure 1 below.

Reliability is inextricably entwined with availability. If availability is thought of in terms of a repairable system being “up” and “down” then a number of concepts and terms can be simply defined.

Another metric for measuring reliability is the Failure Rate, defined as the inverse of either the hardware MTBF or system MTBF as displayed in the formula below. If the Failure Rate is high, the MTBF is small. Each MTBF definition has its own corresponding failure rate. For example, the Hardware Failure Rate is the inverse of the hardware MTBF, while the System Failure Rate is the inverse of the system MTBF.

Hardware MTBF is probably the most frequently cited MTBF metric even though it does not reflect the typical end-user view of a reliable service level because hardware component failures do not necessarily result in a system interruption.

For most electronic components, the MTBF and failure rate change during the life cycle of the component; however, since the variance is small, it is safe to assume the value is constant. The curve of a component's failure rate over its life cycle is known as a bathtub curve (see figure).

FIGURE: Component Failure Rate Bathtub Curve

The figure is a plot of hardware component failures over time. It demonstrates a phenomenon known as infant mortality, where a component exhibits a high failure rate during the early stages of its life. To help ensure component failures are detected early, manufacturers often subject components to a burn-in process (literally, exposure to high temperatures). Over time the failure rate gradually reduces until it approaches the constant rate, which is maintained during the component's useful life. Eventually, the component enters the wear-out stage of its life; in this stage, failures increase exponentially. It is important to note that new technologies can exhibit a similar failure rate during the early-life and useful-life time periods.

Common MTBF Misconceptions

MTBF is often confused with a component's useful life, even though the two concepts are not related in any way. For example, a battery may have a useful life of four hours and an MTBF of 100,000 hours. These figures indicate that in a population of 100,000 batteries there will be approximately one battery failure every hour during the four-hour life span. On the other hand, a 64-processor server platform may last 50,000 hours before it enters its wear-out period, while its MTBF value may only be 2,000 hours. Another common misconception is to assume that the MTBF value is higher in a system with strong component redundancy. While component redundancy can increase the system MTBF, it does quite the opposite for the hardware MTBF: generally, the greater the number of components in a system, the lower the system's hardware MTBF will be.
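The battery arithmetic above can be checked directly: with N identical units in service, the expected number of failures per hour across the population is N / MTBF.

```python
population = 100_000   # batteries in service
mtbf_hours = 100_000   # per-battery MTBF (hours)

failures_per_hour = population / mtbf_hours
print(f"Expected failures per hour across the population: {failures_per_hour:.1f}")  # ~1
```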

3. The RAM model of the offered System

In this section, we describe the methodology that can be used to measure and study the availability of the offered systems. We begin by establishing a standard definition of availability and the metrics that can be used to report it, then consider how to construct benchmarks that produce those metrics, and finally describe how the results of those benchmarks can be reported and analysed.

The term "availability" carries with it many possible connotations. Traditionally, availability has been defined as a binary metric that describes whether a system is "up" or "down" at a single point in time. A traditional extension of this definition is to compute the percentage of time, on average, that a system is available ("up") or not ("down"); this is how availability is defined when a system is described as having 99.999% availability, for example.

Because of the complexity of the integration of this installation, we take a different perspective on availability. First, we see availability as a spectrum, and not a binary metric. Systems can exist in a large number of degraded, but operational, states between "down" and "up". In fact, systems running in degraded states are probably more common than "perfect" systems, especially since in our case the system consists of two different concepts that must both work perfectly: the hardware and the software (operating system and applications). For the software (Solaris plus drivers and any other kind of system software, including the applications), our company, as follows directly from the contract, is not allowed to intervene in a suspected case of failure. At the same time, our company is responsible for providing support and for guaranteeing a system availability, that is a hardware availability, of no less than 99,95%.

Because of this particularity, and in order to demarcate the responsibility for the availability of the system software from that of the hardware, we propose that the following assumptions be adopted:

1) The first diagnosis of a possible problem must be made by your trained technical staff.

2) All availability calculations for which our company is responsible concern only the offered hardware and not the total system availability.

3) There are field tests and reports indicating that the software (system S/W and application S/W) affects 70% of the availability of a system.

The total system availability metric must therefore capture possible degraded states between S/W and H/W failures, measuring not only whether the system is up or down, but also its efficacy and the quality of service that it provides. Availability must not be defined at a single point in time or as a simple average over all time. It must instead be examined as a function of the system's quality of service over time. To motivate this, consider that from a user's perspective there is a big difference between a system that refuses requests for two seconds out of every minute and one that is down for one whole day every month, even though the two systems have approximately the same average uptime. That means that the benchmark of total availability must be able to capture possible differences caused by incompatibilities and/or malfunctions between S/W and H/W.

The network life cycle of the offered system refers to the cycle of planning, design, implementation, and operations. In general, many applications have significant constraints that require careful management. Jitter, delay, throughput, and bandwidth requirements for current applications typically impose many constraints, and they influence the total network availability. The way an application was written may also create constraints. Therefore our responsibility is focused, again, only on the availability of the network hardware (Extreme Networks switches), and we do not take into consideration calculations concerning S/W application constraints upon the total network availability.

For the rest of this paper, our calculations regard only the offered hardware, for which our company has full responsibility for operational availability, assuming that the first diagnosis of a problem is disclosed to us by your side, following the diagram below:

Panos Sarafidis

Reliability - Availability - Maintainability STUDY

Project: MASE Δ906Ν Implementation

IN CASE THE DIAGNOSIS REPORTS THAT THERE IS NO HARDWARE PROBLEM THEN NO HARDWARE DOWNTIME IS CONSIDERED

But a focus on reliability, while effective for hardware, isn’t enough when complex software is also involved as in our case. Software is inherently less reliable—every possible configuration of use and data cannot be tested prior to release, so software glitches are often revealed only when the software is actually put into use. As systems became more and more software-driven, fault prevention has become an unrealizable standard, now replaced by fault management. The notion of reliability has had to be reconsidered in system design.

The reliability strategy was: first, build great hardware that does not break down, and then, to be completely covered, supply a backup system just in case. Redundancy (the foundation of all high availability strategies) was introduced to provide system reliability in an increasingly unreliable world. In availability management, hardware or software faults are seen as unavoidable, and the system is designed to anticipate and work around faults before they become system failures. Thus, instead of counting on the hardware to avoid faults, availability design, and especially high availability design, relies heavily on management software to mask and manage faults that are expected to occur. Fault management takes practical precedence over designing for fault avoidance; the goal is to anticipate faults and execute fault recovery as quickly as possible. A Service Availability solution requires that a system be highly available and provide continuity of service. An availability solution consisting of a H/W availability system and a S/W availability system, separated as to different responsibilities, must work without ultimately jeopardizing the reliability or availability metrics of the total system.

3.1 Reliability Block Diagram (RBD)

The presented reliability block diagram (RBD) is a drawing that models the complexity of the system. The goal of an RBD is to produce a series of images representing portions of the system to be analysed. Once the images are configured properly, and data for these images is provided, calculations can be performed in order to determine the failure rate, MTBF, reliability, and availability of the system. As the configuration of the diagram changes, the calculation results also change. A reliability block diagram provides a simple way to compare various configurations in an attempt to find the best overall system design.

In general, reliability analysis starts by listing all the major components of the system and making a block diagram showing how their functions are interrelated, following the flowchart in the figure below.

We built our RAM study based on MIL-STD-721C, Definitions of Terms for Reliability and Maintainability, which gives two definitions for reliability:

1. The duration or probability of failure-free performance under stated conditions.

2. The probability that an item can perform its intended function for a specified interval under stated conditions.

In order to build the RBD, we have to design and understand the data flowchart among the various subsystems of the installation:


With the path-tracing method, every path from a starting point to an ending point is considered. Since system success requires at least one available path from one end of the RBD to the other, the system has not failed as long as at least one such path is available.

The system in the figure is a complex system that can be broken down into a group of series and parallel systems. This is primarily due to the fact that component H (extreme switch) has multiple paths leading away from it, whereas B (Aurora) and A (V880) have only one. Regarding E (HIC) and F (UCC), we have to note and assume that the system works in perfect mode when and only when both the E and F components are working; if only one of the two is working, the system is out of order, and if both (E and F) fail, the system is again out of order. Even though in the diagram the connection is theoretically in parallel, because of the above assumption we treat them in the model as a series connection.

Component D (SunBlade): Reliability of k-out-of-n Independent and Identical Components

The simplest case of components in a k-out-of-n configuration is when the components are independent and identical. In other words, all the components have the same failure distribution and, whenever a failure occurs, the remaining components are not affected. In this case, the reliability of a system with such a configuration can be evaluated using the binomial distribution:

R_s = Σ (r = k ... n) C(n, r) x R^r x (1 - R)^(n-r)

where:

- n is the total number of units in parallel.
- k is the minimum number of units required for system success.
- R is the reliability of each unit.
- C(n, r) is the binomial coefficient n! / (r! (n - r)!).
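A minimal sketch of the k-out-of-n calculation; the values of n, k and R below are illustrative only, not taken from the offered configuration.

```python
from math import comb

def k_out_of_n_reliability(n: int, k: int, r_unit: float) -> float:
    """Reliability of a k-out-of-n system of independent, identical units."""
    return sum(comb(n, j) * r_unit**j * (1 - r_unit) ** (n - j)
               for j in range(k, n + 1))

# Illustrative example: at least 2 of 3 units must work, each 95% reliable.
print(f"{k_out_of_n_reliability(3, 2, 0.95):.4%}")  # ~99.28%
```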

In our case, we have to consider that, regarding the active and the passive network, a redundant (parallel) network has already been installed by our company to maximize availability, but it is not used operationally, at your request. It is nevertheless ready to take over upon a possible failure of the active node, although not automatically, for the reason above.

3.2 Offered Component's Availability List

EQUIPMENT (indicative numbers)                            | Quantity | MTBF  | DownTime/year | Availability

Component A - SFV880 (including S/W)                      | 1        | 9015  | 23,38         | 99,74%
  CPUs (900 MHz), non-redundant                           | 4        |       |               |
  Memory (512 MB) 16 GB, *redundant*, 2nd FC/AL backplane | 1        |       |               |
  RSC                                                     | 1        |       |               |
  DVD drive                                               | 1        |       |               |
  *Mirrored* internal disk drives, DiskDr (73 GB)         | 6        |       |               |
  PCI card, non-redundant (X3768A)                        | 1        |       |               |
  PCI card, non-redundant (X1141A)                        | 1        |       |               |
  X7137A TFT monitor                                      | 1        | 49999 | 10            | 99,98%

Component H - Extreme Networks Summit48si switch, 1 PSU   | 2        | 93350 | 30            | 99,97%

Component D - SUN BLADE 150-2                              | 2        | 22679 | 9,24          | 99,96%
  CPU module (650 MHz)                                     | 1        |       |               |
  Memory (256 MB)                                          | 4        |       |               |
  *Non-redundant* internal disk drive, DiskDr (40 GB)      | 1        |       |               |
  PCI card, non-redundant (X1033A)                         | 1        |       |               |
  PCI card, non-redundant (X3685A)                         | 1        |       |               |
  X7137A TFT monitor                                       | 2        | 49999 | 10            | 99,98%

Component C - SUN BLADE 150                                | 1        | 22814 | 0,89          | 100,00%
  CPU module (550 MHz)                                     | 1        |       |               |
  Memory (256 MB)                                          | 4        |       |               |
  *Non-redundant* internal disk drive, DiskDr (40 GB)      | 1        |       |               |
  PCI card, non-redundant (X1033A)                         | 1        |       |               |
  PCI card, non-redundant (X3685A)                         | 1        |       |               |
  X7137A TFT monitor                                       | 2        | 49999 | 10            | 99,98%

Component B - AURORA LanMultiServer, model 2016P           |          |       |               |

Component F - UCC                                          | 1        | 50000 | 30            | 99,94%

Component E - HIC                                          | 1        | 50000 | 30            | 99,94%

Component G - UPS / system                                 | 1        | 87600 | 20            | 99,98%

3.3 Spare Parts

The required spare parts planning for a system/machine is an integral part of the product support strategy. The number of required spare parts can be effectively estimated on the basis of the product reliability characteristics. The reliability characteristics of an existing machine/system are influenced not only by the operating time, but also by factors such as the environmental parameters (e.g. dust, humidity, temperature, moisture, etc.), which can degrade or improve the reliability. In the product life cycle, consideration of these factors is useful for determining accurate spare parts needs and for minimizing the machine life cycle cost.

Demand driven by failure of parts. When demand is driven by failure of parts, it is usually appropriate to model the probability of failure of parts using a Weibull distribution. Given a population of parts in service, the Weibull distribution predicts an S-shaped curve of cumulative failure of that population of parts. The Weibull model is a three-parameter distribution, but one of these parameters, t0, is often assumed to be zero. Generally, a two-parameter version of the Weibull model is assumed, whose cumulative distribution as a function of time is:

F = 1 - e^(-(t/η)^β)

where F is the probability that a part will fail by time t, and η and β are the scale (characteristic life) parameter and shape parameter respectively. Here t = mission time (1 day, 1 week, 1 month, 1 year, etc.), η = characteristic age to failure, and β = Weibull shape factor, where for components β > 1 implies wear-out failure modes. If you are evaluating a system, then the beta values have no physical relationship to failure modes.

How do you find η and β? Take age-to-failure data for a specific failure mode (this requires that you know the time origin, that you have some system for measuring the passage of time, and that the definition of failure is clear, as described in the New Weibull Handbook), input the data into WinSMITH Weibull software, and the software returns values for η and β. Please note that if you have suspended (censored) data, then you must signify to the software that the data is suspended (simply put a minus sign in front of the age and the software will treat the data as a suspension).

The theoretical distribution is shown as a dashed line, while simulated failures, found using a Weibull random number generator, are shown as circles. It is assumed that most failure-driven, low-demand parts are on the exponentially increasing section of the failure curve. Parts where a significant percentage of the population has already failed would have a higher demand rate and are out of the scope of this study. The actual demand distribution for any period is found by calculating the probability of a failure over that period, ΔF, from the Weibull equation above, then using that probability to find the binomial probability distribution for the number of parts in the fleet:

p(x) = C(n, x) x ΔF^x x (1 - ΔF)^(n-x)

where p(x) is the probability of experiencing a demand for x parts over the period and n is the number of parts in service in the system.

Demand driven by accidental breakage. When parts demand is driven strictly by accidental breakage, the failure rate does not depend on the age of the part. In this case the demand distribution can be modelled using the binomial distribution or the Poisson distribution. Given a constant probability p of breakage of a single part in a quarter, the binomial distribution, given above, is the probability that x parts will break out of a collection of n parts during a quarter, where x = 0, 1, 2, 3, ... broken parts. Given a large collection of parts and a mean breakage rate per quarter from that collection, λ, the Poisson distribution gives the probability that x parts will break during a quarter, and it can also be used to find distributions for periods covering multiple quarters:

p(x) = (e^(-λ) x λ^x) / x!

When the number of parts is large and p is small, the binomial distribution is approximately equal to the Poisson distribution with λ = np. This is typically the case for low-demand parts; there is then no noticeable difference between the distributions. The Poisson distribution, if appropriate, is preferred over the binomial because there is no need to estimate the number of parts in service in the system. However, in the calculations the basic formula

Q = (N x T) / MTBF

is used, combined with assumptions based on the equations above, where:

Q is the quantity of spare parts required
N is the population of the items installed
T is the time that the stock of spare items is required to last, in hours
MTBF is the Mean Time Between Failures, in hours
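A brief sketch combining the basic spares formula with a Poisson check on the resulting stock level; all input values are assumptions for illustration.

```python
import math

# Basic spares estimate: Q = (N x T) / MTBF
N = 12          # items installed (assumed)
T = 2 * 8760    # stock must last two years, expressed in hours (assumed)
MTBF = 50_000   # hours (assumed)

Q = (N * T) / MTBF
print(f"Basic spares estimate Q = {Q:.1f} parts")

# Poisson check: probability that demand exceeds the rounded-up stock level.
lam = Q                 # expected demand over the period
stock = math.ceil(Q)
p_short = 1 - sum(math.exp(-lam) * lam**x / math.factorial(x) for x in range(stock + 1))
print(f"P(demand exceeds {stock} spares) = {p_short:.2%}")
```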

4. Maintainability

Maintainability is defined as the probability of performing a successful repair action within a given time. In other words, maintainability measures the ease and speed with which a system can be restored to operational status after a failure occurs. For example, if it is said that a particular component has a 90% maintainability in one hour, this means that there is a 90% probability that the component will be repaired within an hour. In maintainability, the random variable is time-to-repair, in the same manner as time-to-failure is the random variable in reliability. As an example, consider the maintainability equation for a system in which the repair times are distributed exponentially. Its maintainability M(t) is given by:

M(t) = 1 - e^(-μt)

where μ = repair rate.

Note the similarity between this equation and the equation for the reliability of a system with exponentially distributed failure times. However, since maintainability represents the probability of an event occurring (repairing the system) while reliability represents the probability of an event not occurring (failure), the maintainability expression is the equivalent of the unreliability expression, (1 - R). Furthermore, the single model parameter μ is now referred to as the repair rate, which is analogous to the failure rate, λ, used in reliability for an exponential distribution. Similarly, the mean of the distribution can be obtained by:

MTTR = 1 / μ

This now becomes the mean time to repair (MTTR) instead of the mean time to failure (MTTF).

The same concept can be expanded to other distributions. In the case of the Weibull distribution, maintainability, M(t), is given by:

M(t) = 1 - e^(-(t/η)^β)

while the mean time to repair (MTTR) is given by:

MTTR = η x Γ(1 + 1/β)

and the Weibull repair rate is given by:

μ(t) = (β/η) x (t/η)^(β-1)

As a last example, if a lognormal distribution is chosen, then:

M(t) = Φ( (ln t - μ′) / σ′ )

where:
μ′ = mean of the natural logarithms of the times-to-repair
σ′ = standard deviation of the natural logarithms of the times-to-repair
Φ = the standard normal cumulative distribution function

It should be clear by now that any distribution can be utilized, as well as related concepts and methods used in life data analysis; the only difference is that instead of times-to-failure we are using times-to-repair. What one chooses to include in the time-to-repair varies but can include:

- The time it takes to successfully diagnose the cause of the failure.
- The time it takes to procure or deliver the parts necessary to perform the repair.
- The time it takes to gain access to the failed part or parts.
- The time it takes to remove the failed components and replace them with functioning ones.
- The time involved with bringing the system back to operating status.
- The time it takes to verify that the system is functioning within specifications.
- The time associated with "closing up" a system and returning it to normal operation.

In the interest of being fair and accurate, one should disclose (document) what was and was not included in determining the repair distribution.

Mean time between maintenance (MTBM) is the average interval between maintenance actions, taking into consideration both corrective maintenance and preventive maintenance. The formula for the mean time between maintenance is:

MTBM = 1 / (1/MTBMC + 1/MTBMP)

where:
MTBMC = mean time between corrective maintenance (an equivalent of MTBF), in hours
MTBMP = mean time between preventive maintenance, in hours

Now, let us express the frequencies of occurrence indicated in the previous equation as:

- failure rate: λ = 1 / MTBMC
- preventive maintenance frequency: fpt = 1 / MTBMP

Then the previous equation can be expressed as:

MTBM = 1 / (λ + fpt)

Mean downtime (MDT) represents the total system downtime attributable to maintenance actions and maintenance-related events. It takes into consideration the contributory effects of corrective maintenance, preventive maintenance, and the logistics delays associated with corrective maintenance. This methodology incorporates the intervals, mean task or event times, and frequencies of occurrence in these critical areas. The mean downtime formula is:

MDT = [ λ x (Mct + MLDT) + fpt x Mpt ] / (λ + fpt)

where:
Mct = mean corrective maintenance time, in hours (an equivalent of MTTR)
Mpt = mean preventive maintenance time, in hours
MLDT = mean time of logistics delay, in hours
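A small sketch of the MTBM and MDT relationships using the symbols defined above; all numerical values are placeholders, not figures for the offered equipment.

```python
# Assumed inputs, in hours.
MTBM_C = 8_000   # mean time between corrective maintenance (equivalent of MTBF)
MTBM_P = 2_000   # mean time between preventive maintenance
M_ct   = 6.0     # mean corrective maintenance time (equivalent of MTTR)
M_pt   = 1.5     # mean preventive maintenance time
MLDT   = 24.0    # mean logistics delay associated with corrective maintenance

lam = 1 / MTBM_C   # failure rate
fpt = 1 / MTBM_P   # preventive maintenance frequency

MTBM = 1 / (lam + fpt)
MDT = (lam * (M_ct + MLDT) + fpt * M_pt) / (lam + fpt)

print(f"MTBM = {MTBM:,.0f} h, MDT = {MDT:.2f} h")
```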

4.1 Preventive Maintenance

Preventive maintenance is a schedule of planned maintenance actions aimed at the prevention of breakdowns and failures. The primary goal of preventive maintenance is to prevent the failure of equipment before it actually occurs. It is designed to preserve and enhance equipment reliability by replacing worn components before they actually fail. Preventive maintenance activities include equipment checks, partial or complete overhauls at specified periods, oil changes, lubrication and so on. In addition, workers can record equipment deterioration so they know to replace or repair worn parts before they cause system failure. Recent technological advances in tools for inspection and diagnosis have enabled even more accurate and effective equipment maintenance. The ideal preventive maintenance program would prevent all equipment failure before it occurs.

4.1.1 Value of Preventive Maintenance

There are multiple misconceptions about preventive maintenance. One such misconception is that PM is unduly costly. This logic dictates that it would cost more for regularly scheduled downtime and maintenance than it would normally cost to operate equipment until repair is absolutely necessary. This may be true for some components; however, one should compare not only the costs but the long-term benefits and savings associated with preventive maintenance. Without preventive maintenance, for example, costs for lost production time from unscheduled equipment breakdown will be incurred. Also, preventive maintenance will result in savings due to an increase of effective system service life.

Long-term benefits of preventive maintenance include:

- Improved system reliability.
- Decreased cost of replacement.
- Decreased system downtime.
- Better spares inventory management.

Long-term effects and cost comparisons usually favour preventive maintenance over performing maintenance actions only when the system fails.

When Does Preventive Maintenance Make Sense

Preventive maintenance is a logical choice if, and only if, the following two conditions are met:

Condition #1: The component in question has an increasing failure rate. In other words, the failure rate of the component increases with time, thus implying wear-out. Preventive maintenance of a component that is assumed to have an exponential distribution (which implies a constant failure rate) does not make sense!

Condition #2: The overall cost of the preventive maintenance action must be less than the overall cost of a corrective action. (Note: in the overall cost for a corrective action, one should include ancillary tangible and/or intangible costs, such as downtime costs, loss-of-production costs, lawsuits over the failure of a safety-critical item, loss of goodwill, etc.)

If both of these conditions are met, then preventive maintenance makes sense. Additionally, based on the cost ratios, an optimum time for such an action can be easily computed for a single component.

4.1.2 The Fallacy of "Constant Failure Rate" and "Preventive Replacement"

Even though we alluded to this fact in the previous section, it is important to make it explicitly clear that if a component has a constant failure rate (i.e. its life is defined by an exponential distribution), then preventive maintenance of the component will have no effect on the component's failure occurrences.

To illustrate this, consider a component with an MTTF of 100 hours, i.e. λ = 0.01, and with preventive replacement every 50 hours. The reliability vs. time graph for this case is illustrated in the figure below, in which the component is replaced every 50 hours and its reliability is thus reset to one. At first glance, it may seem that the preventive maintenance action is actually maintaining the component at a higher reliability.

Figure: Reliability vs. time for a single component with an MTTF of 100 hours (λ = 0.01) and preventive replacement every 50 hours.

4.1.3 Determining Preventive Replacement Time

As mentioned earlier, if the component has an increasing failure rate, then a carefully designed preventive maintenance program is beneficial to system availability. Otherwise, the costs of preventive maintenance might actually outweigh the benefits. The objective of a good preventive maintenance program is either to minimize the overall costs (or downtime, etc.) or to meet a reliability objective. In order to achieve this, an appropriate interval (time) for scheduled maintenance must be determined. One way to do that is to use the optimum age replacement model, as presented next. The model adheres to the conditions discussed previously:

- The component is exhibiting behaviour associated with a wear-out mode; that is, the failure rate of the component is increasing with time.
- The cost for planned replacements is significantly less than the cost for unplanned replacements.

Cost curve for preventive and corrective replacement.


The figure shows the cost per unit time vs. time plot. It can be seen that the corrective replacement cost per unit time increases as the replacement interval increases: the less often a PM action is performed, the higher the corrective costs will be, because the longer a component is allowed to operate, the higher its failure rate becomes and the more likely it is to fail, requiring more corrective actions. The opposite is true for the preventive replacement costs: the longer the PM interval, the lower the preventive cost per unit time, whereas performing PM too often drives this cost up. Combining both costs shows that there is an optimum point that minimizes the total cost. In other words, one must strike a balance between the risk (cost) associated with a failure and the desire to maximize the time between PM actions.

4.1.4 Optimum Age Replacement Policy

To determine the optimum time for such a preventive maintenance action (replacement), we need to mathematically formulate a model that describes the associated costs and risks. In developing the model, it is assumed that if the unit fails before time t, a corrective action will occur, and if it does not fail by time t, a preventive action will occur. In other words, the unit is replaced upon failure or after a time of operation t, whichever occurs first. The optimum replacement time can then be found by minimizing the cost per unit time, CPUT(t). In the standard age replacement formulation, CPUT(t) is given by:

CPUT(t) = [C_P · R(t) + C_U · (1 − R(t))] / ∫[0,t] R(s) ds

Where:
R(t) = reliability at time t.
C_P = cost of a planned (preventive) replacement.
C_U = cost of an unplanned (corrective) replacement.

The optimum replacement time interval, t, is the time that minimizes CPUT(t). This can be found by solving for t such that:

d[CPUT(t)]/dt = 0

or, equivalently, by solving for a t that satisfies (with h(t) the failure rate and F(t) = 1 − R(t)):

h(t) · ∫[0,t] R(s) ds − F(t) = C_P / (C_U − C_P)

Interested readers can refer to Barlow and Hunter for more details on this model.
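As a hedged illustration of how such an optimum can be found numerically, the sketch below evaluates CPUT(t) for a hypothetical wear-out component (a Weibull distribution with shape greater than one) and an assumed 1:5 planned-to-unplanned cost ratio. All parameter values are illustrative assumptions, not figures from this project:

```python
import numpy as np

# Illustrative assumptions: a wear-out component modelled by a Weibull
# distribution (beta > 1) and planned/unplanned replacement costs of 100/500.
beta, eta = 2.5, 1000.0    # Weibull shape and scale (hours)
C_P, C_U = 100.0, 500.0    # cost of planned vs. unplanned replacement

def R(t):
    """Weibull reliability function."""
    return np.exp(-(t / eta) ** beta)

def cput(t, n=2000):
    """Cost per unit time when replacing at age t or on failure, whichever is first."""
    s = np.linspace(0.0, t, n)
    expected_cycle_length = np.trapz(R(s), s)            # integral of R(s) ds, 0..t
    expected_cycle_cost = C_P * R(t) + C_U * (1.0 - R(t))
    return expected_cycle_cost / expected_cycle_length

ages = np.linspace(50.0, 2000.0, 400)
costs = np.array([cput(t) for t in ages])
t_opt = ages[costs.argmin()]
print(f"Optimum preventive replacement age: ~{t_opt:.0f} h, "
      f"CPUT = {costs.min():.3f} per hour")
```

With a smaller cost ratio (C_U closer to C_P) the minimum flattens and shifts to longer intervals, reflecting Condition #2 above: when unplanned failures are not much more expensive than planned replacements, aggressive PM buys little.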


5. Conclusion

Reliability engineers are very often called upon to decide whether to improve a certain component or components in order to achieve a minimum required system reliability. (Note: this minimum required system reliability is for a specified time.) There are two approaches to improving the reliability of a system: fault avoidance and fault tolerance. Fault avoidance is achieved by using high-quality and high-reliability components and is usually less expensive than fault tolerance. Fault tolerance, on the other hand, is achieved by redundancy. Redundancy can result in increased design complexity and increased costs through additional weight, space, etc. Before deciding whether to improve the reliability of a system by fault tolerance or fault avoidance, a reliability assessment of each component in the system should be made. Once the reliability values of the components have been quantified, an analysis can be performed to determine whether the system's reliability goal will be met. If it becomes apparent that the system's reliability will not be adequate to meet the desired goal at the specified mission duration, steps can be taken to determine the best way to improve the system's reliability so that it reaches the desired target.

Consider a system with three components connected reliability-wise in series. The reliabilities of the components for a given time are: R1 = 70%, R2 = 80% and R3 = 90%. A reliability goal of RS = 85% is required for this system.


The current reliability of the system is:

RS = R1 × R2 × R3 = 0.70 × 0.80 × 0.90 = 0.504, or 50.4%

Obviously, this falls far short of the system's required reliability performance. It is apparent that the reliability of the system's constituent components will need to be increased in order for the system to meet its goal. First, we will try increasing the reliability of one component at a time to see whether the reliability goal can be achieved.


Change in system reliability of a three-unit series system due to increasing the reliability of just one component.

The figure shows that even by raising an individual component reliability to a hypothetical value of 1 (100% reliability, which implies that the component will never fail), the overall system reliability goal will not be met by improving the reliability of just one component. The next logical step would be to try to increase the reliability of two components. The question now becomes: which two? One might also suggest increasing the reliability of all three components. A basis for making such decisions needs to be found in order to avoid the "trial and error" aspect of altering the system's components randomly in an attempt to achieve the system reliability goal.

As we have seen, the reliability goal for the preceding example could not be achieved by increasing the reliability of just one component. There are cases, however, where increasing the reliability of one component does result in achieving the system reliability goal. Consider, for example, a system with three components connected reliability-wise in parallel. The reliabilities of the components for a given time are: R1 = 60%, R2 = 70% and R3 = 80%. A reliability goal of RS = 99% is required for this system. The initial system reliability is:

RS = 1 − (1 − R1)(1 − R2)(1 − R3) = 1 − (0.40 × 0.30 × 0.20) = 0.976, or 97.6%

The current system reliability is inadequate to meet the goal. Once again, we can try to meet the system reliability goal by raising the reliability of just one of the three components in the system.
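Both worked examples can be reproduced with a few lines of code. The sketch below (plain Python, using the reliability values quoted above) shows why no single-component improvement can rescue the series system, while any one of the parallel components can be raised far enough to meet the 99% goal; the 92% figure used for the parallel case is just one illustrative choice, not a value from this project:

```python
def series(rs):
    """System reliability for components in series: all must survive."""
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(rs):
    """System reliability for components in active parallel: at least one survives."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

# Series example (goal: 85%):
print(series([0.70, 0.80, 0.90]))    # 0.504  - current system
print(series([1.00, 0.80, 0.90]))    # 0.720  - even a perfect unit 1 falls short

# Parallel example (goal: 99%):
print(parallel([0.60, 0.70, 0.80]))  # 0.976  - current system
print(parallel([0.60, 0.70, 0.92]))  # 0.9904 - raising unit 3 alone meets the goal
```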


Meeting a reliability goal requirement by increasing a component's reliability.

From the figure, it can be seen that the reliability goal can be reached by improving Component 1, Component 2 or Component 3. The reliability engineer is now faced with another dilemma: which component's reliability should be improved? This presents a new aspect of the problem of allocating the reliability of the system. Since we know that the system reliability goal can be achieved by increasing the reliability of at least one unit, the question becomes one of how to do this most efficiently and cost-effectively. We will need more information to make an informed decision about how to go about improving the system's reliability. How much does each component need to be improved for the system to meet its goal? How feasible is it to improve the reliability of each component? Would it actually be more efficient to slightly raise the reliability of two or three components rather than radically improving only one? To answer these questions, we must introduce another variable into the problem: cost. Cost does not necessarily have to be in dollars; it could be described in terms of non-monetary resources, such as time. By associating cost values with the reliabilities of the system's components, we can find an optimum design that provides the required reliability at a minimum cost.

5.1 Cost/Penalty Function

There is always a cost associated with changing a design, due to a change of vendors, the use of higher-quality materials, retooling costs, administrative fees, etc. The cost as a function of the reliability of each component must be quantified before attempting to improve the reliability; otherwise, the design changes may result in a system that is needlessly expensive or over-designed. Developing the "cost of reliability" relationship gives the engineer an understanding of which components to improve and of how best to concentrate the effort and allocate resources in doing so. The first step is to obtain a relationship between the cost of improvement and reliability. The preferred approach is to formulate the cost function from actual cost data, which can be done from past experience. If a reliability growth program is in place, the costs associated with each stage of improvement can also be quantified. Defining the different costs associated with different vendors or different component models is also useful in formulating a model of component cost as a function of reliability. However, there are many cases where no such information is available. For this reason, a general (default) model of cost versus component reliability was developed for performing reliability optimisation in BlockSim. The objective of this function is to model an overall cost behaviour for all types of components. Of course, it is impossible to formulate a model that will be precisely applicable to every situation, but the proposed relationship is general enough to cover most applications. In addition to the default model formulation, BlockSim also allows the definition of user-defined cost models.

5.2 Quantifying the Cost/Penalty Function

One needs to quantify a cost function for each component, Ci, in terms of the reliability, Ri, of that component, or:

Ci = Ci(Ri)

This function should:
- Look at the current reliability of the component, Ri,min.
- Look at the maximum achievable reliability of the component, Ri,max.
- Allow for different levels of difficulty (or cost) in increasing the reliability of each component. It can take into account:
  - design issues.
  - supplier issues.
  - state of technology.
  - time-to-market issues, etc.

Thus, for the cost function to comply with these needs, the following conditions should be adhered to:
- The function should be constrained by the minimum and maximum reliabilities of each component (i.e. the reliability must be less than one and greater than the current reliability of the component, or at least greater than zero).
- The function should not be linear, but rather should quantify the fact that it becomes incrementally harder to improve reliability. For example, it is considerably easier to increase the reliability from 90% to 91% than to increase it from 99.99% to 99.999%, even though the increase is larger in the first case.
- The function should be asymptotic to the maximum achievable reliability.

The following default cost function (also used in BlockSim) adheres to all of these conditions and acts like a penalty function for increasing a component's reliability. An exponential behaviour for the cost is assumed, since it should become exponentially more difficult to increase the reliability. In the form proposed by Mettas, it is:

Ci(Ri) = exp[ (1 − fi) · (Ri − Ri,min) / (Ri,max − Ri) ]

Where:
Ci(Ri) is the penalty (or cost) function as a function of the component reliability.
fi is the feasibility (or cost index) of improving the component's reliability relative to the other components in the system.
Ri,min is the current reliability at the time at which the optimisation is to be performed.
Ri,max is the maximum achievable reliability at the time at which the optimisation is to be performed.

Note that this penalty function is dimensionless. It essentially acts as a weighting factor that describes the difficulty of increasing the component reliability from its current value, relative to the other components. Examining this cost function, the following observations can be made:

- The cost increases as the allocated reliability departs from the minimum or current value of reliability. It is assumed that the reliabilities of the components will not take values any lower than they already have; depending on the optimisation, a component's reliability may not need to be increased from its current value, but it will not drop any lower.
- The cost increases as the allocated reliability approaches the maximum achievable reliability. This is a reliability value that is approached asymptotically as the cost increases but is never actually reached.
- The cost is a function of the range of improvement, which is the difference between the component's initial reliability and the corresponding maximum achievable reliability. The exponent in the cost function approaches infinity as the component's reliability approaches its maximum achievable value. This means that it is easier to increase the reliability of a component from a lower initial value; for example, it is easier to increase a component's reliability from 70% to 75% than from 90% to 95%.

The Feasibility Term, f

The feasibility term is a constant (or an equation parameter) that represents the difficulty of increasing a component's reliability relative to the rest of the components in the system. Depending on the design complexity, technological limitations, etc., certain components can be very hard to improve. Clearly, the more difficult it is to improve the reliability of the component, the greater the cost. The figure below illustrates the behaviour of the cost function for different values of f: the lower the feasibility value, the more rapidly the cost function approaches infinity.


Behaviour of the cost function for different feasibility values.

Several methods can be used to obtain a feasibility value. Weighting factors for allocating reliability have been proposed by many authors and can be used to quantify feasibility. These weights depend on certain factors of influence, such as the complexity of the component, the state of the art, the operational profile, the criticality, etc. Engineering judgment based on past experience, supplier quality, supplier availability and other factors can also be used in determining a feasibility value. Overall, the assignment of a feasibility value is going to be a subjective process. Of course, this problem is negated if the relationship between cost and reliability is known for each component, because one can then use regression methods to estimate the parameter value.
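A minimal sketch of this behaviour, using the Mettas-style penalty written out in Section 5.2; the specific reliabilities and feasibility values below are assumptions chosen for illustration only:

```python
import math

def penalty(R, f, R_min, R_max):
    """Mettas-style cost/penalty for moving a component from R_min up to R."""
    return math.exp((1.0 - f) * (R - R_min) / (R_max - R))

# Same component (current reliability 70%, ceiling 99%) under two feasibility values:
for f in (0.9, 0.5):
    costs = [penalty(R, f, 0.70, 0.99) for R in (0.75, 0.85, 0.95, 0.98)]
    print(f"f = {f}: " + ", ".join(f"{c:,.1f}" for c in costs))

# The harder-to-improve component (lower f) sees its penalty grow far more
# steeply as the allocated reliability approaches the 99% ceiling.
```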


5.3 Maximum Achievable Reliability

For the purposes of reliability optimisation, we also need to define a limiting reliability that a component will approach, but not reach. The costs near the maximum achievable reliability are very high, and the actual value of the maximum reliability is usually dictated by technological or financial constraints. In deciding on a value to use for the maximum achievable reliability, the current state of the art of the component in question and other similar factors have to be considered. In the end, a realistic estimate based on engineering judgment and experience is necessary to assign a value to this input.

Note that the time associated with this maximum achievable reliability is the same as that of the overall system reliability goal. Almost any component can achieve a very high reliability value, provided the mission time is short enough. For example, a component with an exponential distribution and a failure rate of one failure per hour has a reliability that drops below 1% for missions longer than five hours, yet it can achieve a reliability of 99.9% as long as the mission is no longer than about 3.6 seconds. For the purposes of optimisation in BlockSim, the reliability values of the components are associated with the time for which the system reliability goal is specified. For example, if the problem is to achieve a system goal of 99% reliability at 1000 hours, the maximum achievable reliability values entered for the individual components would be the maximum reliability that each component could attain for a mission of 1000 hours.

As the component reliability, Ri, approaches the maximum achievable reliability, Ri,max, the cost function approaches infinity. The maximum achievable reliability acts as a scale parameter for the cost function: by decreasing Ri,max, the cost function is compressed between Ri,min and Ri,max, as shown in the figure below.
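The mission-time arithmetic above is easy to verify (a two-line sketch, assuming the stated failure rate of one failure per hour):

```python
import math

lam = 1.0                                # failures per hour
print(math.exp(-lam * 5.0))              # ~0.0067: below 1% for a 5-hour mission
print(-math.log(0.999) / lam * 3600.0)   # ~3.6 s: longest mission with R >= 99.9%
```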


Effect of the maximum achievable reliability on the cost function.

5.4 Cost Function

Once the cost functions for the individual components have been determined, it becomes necessary to develop an expression for the overall system cost. This takes the form of:

C_system = Σ Ci(Ri), summed over all n components

In other words, the cost of the system is simply the sum of the costs of its components, regardless of the form of the individual component cost functions. They can follow the general behaviour model in BlockSim or they can be user-defined.


Once the overall cost function for the system has been defined, the problem becomes one of minimizing the cost function while remaining within the constraints defined by the target system reliability and the reliability ranges for the components. The latter constraints in this case are defined by the minimum and maximum reliability values for the individual components.
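A hedged sketch of this constrained minimisation, reusing the assumed Mettas-style penalty and the three-unit series example from the beginning of this section: the feasibility indices and maximum achievable reliabilities are illustrative assumptions, and the SciPy solver merely stands in for BlockSim's optimiser.

```python
import numpy as np
from scipy.optimize import minimize

r_min = np.array([0.70, 0.80, 0.90])    # current component reliabilities
r_max = np.array([0.99, 0.99, 0.999])   # assumed maximum achievable reliabilities
f     = np.array([0.9, 0.8, 0.7])       # assumed feasibility indices
goal  = 0.85                            # required system reliability (series)

def total_cost(r):
    """System cost: sum of the per-component penalty functions."""
    return float(np.sum(np.exp((1.0 - f) * (r - r_min) / (r_max - r))))

constraints = [{"type": "ineq", "fun": lambda r: np.prod(r) - goal}]
bounds = [(lo, hi - 1e-3) for lo, hi in zip(r_min, r_max)]  # stay below each ceiling

res = minimize(total_cost, x0=r_min + 0.01, bounds=bounds, constraints=constraints)
print("Allocated reliabilities:", np.round(res.x, 4))
print("System reliability:     ", round(float(np.prod(res.x)), 4))
print("Total cost/penalty:     ", round(res.fun, 3))
```

The solver allocates most of the improvement to the components with the higher feasibility indices, which is exactly the behaviour the penalty function was designed to encourage.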

6. General RAM details of SUN V880


The Sun Fire™ V880 Server's reliability, availability and serviceability features include:
- Error correction and parity checking for improved data integrity in memory and on internal data paths
- Internal error detection and diagnosis capability, including via an internal "back door" bus
- Hot-pluggable disks, power supplies, fans and PCI slots
- Easily accessible LED status indicators, especially for PCI cards
- Front panel LED display
- Remote System Console (RSC) for monitoring and administrative capability
- Automatic System Recovery (ASR) for isolation of failed components and automatic reboot capability


- Environmental monitoring and fault protection
- RAID capability for disks

6.1 Modeling Sun Cluster Availability, general details (proposed as an alternative)

The starting point for modeling a Sun Cluster system is to build a top-level system availability behavioural model with RAScad. RAScad is a Sun internal reliability, availability and serviceability architecture modeling and analysis tool for use in computer system design and development phases. It highlights the different variables that contribute to system availability, as well as the degree to which they affect it. FIGURE 1 shows the RAScad Markov model for a two-node Sun Cluster stack for any fault that causes one of the nodes to go down. This fault could be an operating system panic or any hardware fault causing the node to fail. The data service is a scalable service, meaning that each of the two nodes is actively hosting the service. A node going down results in reconfigurations of the clustering framework as well as of the service-related components in this stack. In FIGURE 1, a vertex in the graph represents a system state, and an edge represents the rate of the transition between the source and sink states of that edge. A state is marked with a value of 1 if it represents the service being up (available); otherwise it is marked with 0.


Under steady state, the cluster is in state Both_Up where, as the name implies, both nodes are functional and servicing client requests. The cluster can transition out of this state if one of the nodes goes down; the rate of either of the two nodes failing is 2/MTBF, where MTBF is the Mean Time Between Failures for a node. This event takes the cluster into the Recovery state, where it remains unavailable while the clustering and the service components reconfigure. The reconfiguration completes successfully with a probability p, and takes Recovery_Time to do so. A successful reconfiguration takes the cluster into an available state, One_Up.

In state One_Up, the surviving node services client requests, but at a higher load level, since clients that were being serviced by two nodes are now being serviced by just one. This causes the failure rate of the node to increase by a factor of (1+a), with a corresponding decrease in the effective MTBF of the node. The time taken to repair the failed node is MTTR_1, where MTTR stands for the Mean Time To Repair a node. After being repaired, the node is booted back into the cluster. When it joins, the cluster goes into an unavailable state, Node_Rejoin, while the joining node gets refreshed with the existing cluster state. The average time taken for this is Node_Rejoin_Time, after which the cluster enters the original available state, Both_Up. The node in state One_Up can also fail, with a failure rate of (1+a)/MTBF, taking the cluster into state Both_Down, where both nodes of the cluster have gone down. Following the second outgoing path from state Recovery, an unsuccessful reconfiguration also takes the cluster to the unavailable state Both_Down. Assuming a repair policy where both nodes are repaired at the same time, the average time to repair and boot back the two nodes is MTTR_2, after which the cluster goes back to the original available state, Both_Up.
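The steady-state availability implied by this Markov model can be computed directly from the transition rates. The sketch below is a simplified reimplementation of the model described above (not RAScad itself), and every parameter value is an illustrative assumption rather than a figure from this project:

```python
import numpy as np

# Illustrative parameters (assumed values, in hours).
MTBF = 10_000.0          # mean time between failures of a single node
recovery_time = 0.05     # mean cluster reconfiguration time after a node failure
p = 0.99                 # probability that the reconfiguration succeeds
a = 0.1                  # load-induced increase in the surviving node's failure rate
MTTR_1 = 4.0             # mean time to repair the failed node (one node down)
node_rejoin_time = 0.02  # mean time for the repaired node to rejoin the cluster
MTTR_2 = 8.0             # mean time to repair and boot back both nodes

# State order: Both_Up, Recovery, One_Up, Node_Rejoin, Both_Down.
# "up" flags mark the states in which the service is available.
up = np.array([1, 0, 1, 0, 0])

Q = np.zeros((5, 5))
Q[0, 1] = 2.0 / MTBF                 # Both_Up     -> Recovery
Q[1, 2] = p / recovery_time          # Recovery    -> One_Up (successful reconfig)
Q[1, 4] = (1.0 - p) / recovery_time  # Recovery    -> Both_Down (failed reconfig)
Q[2, 3] = 1.0 / MTTR_1               # One_Up      -> Node_Rejoin (node repaired)
Q[2, 4] = (1.0 + a) / MTBF           # One_Up      -> Both_Down (second node fails)
Q[3, 0] = 1.0 / node_rejoin_time     # Node_Rejoin -> Both_Up
Q[4, 0] = 1.0 / MTTR_2               # Both_Down   -> Both_Up (both nodes repaired)
np.fill_diagonal(Q, -Q.sum(axis=1))  # diagonal makes each row sum to zero

# Steady-state probabilities: solve pi Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(5)])
b = np.append(np.zeros(5), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

availability = float(pi @ up)
print(f"Steady-state availability: {availability:.6f}")
print(f"Expected downtime: {(1 - availability) * 8760 * 60:.1f} minutes/year")
```

Such a model makes it easy to see which inputs dominate the result; with these assumed values, shortening the reconfiguration and rejoin times matters far more than improving the node MTBF.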

References


Langford, J. W. (1995). Logistics Principles and Applications. McGraw-Hill, New York, USA, p. 565.
Blanchard, B. S. (1981). Logistics Engineering and Management. Prentice-Hall, New York, USA, p. 430.
IEC 300-3-3: Dependability Management, Part 3: Application Guide, Section 3: Life Cycle Costing.
Blischke, W. R. and Murthy, D. N. P. (1996). Product Warranty Handbook. Marcel Dekker, New York, USA, p. 925.
Vintr, Z. (1999). "Optimisation of Reliability Requirements from the Manufacturer's Point of View," Proceedings of the Annual Reliability & Maintainability Symposium 1999. IEEE, Washington, USA, pp. 183-189.
George, L. L. (2001). "MTBF Versus Age-Specific Reliability Prediction," ASQ Reliability Review, Vol. 21, No. 2, pp. 13-15, June 2001.
George, L. L. (2001). "MTBF Prediction for Redundant Systems," ASQ Reliability Review, Vol. 21, No. 4, Dec. 2001.
Jones, J. and Hayes, J. (1999). "A Comparison of Electronic-Reliability Prediction Models," IEEE Transactions on Reliability, Vol. 48, No. 2, pp. 127-134, June 1999.
Klion, J. (1977). "A Redundancy Notebook," RADC-TR-77-287, AD A050837.
MIL-HDBK-217F (1992). Reliability Prediction of Electronic Equipment, U.S. Department of Defense, Washington, DC, Notice 1, July 1992.
NASA. "Active Redundancy," NASA Preferred Reliability Practices No. PD-ED-1216, http://www.hq.nasa.gov/office/codeq/relpract/n1216.pdf, NASA Headquarters, Washington, DC.
Trivedi, K. S. Probability and Statistics with Reliability, Second Edition.
