April 2, 2017 | Author: AlexanderPetrov | Category: N/A
Download Reliability Modeling: The RIAC Guide to Reliability Prediction, Assessment ...
Reliability Modeling The RIAC Guide to Reliability Prediction, Assessment and Estimation
L
LI [(Ta1,Tb1 ),...,(TaL ,TbL ) /θ] ∝ ∏[F(Tb i ) − F(Tai )] |θ) i=1
− Ea
λ = λ b e KT S n r
1 − CL = ∑ k =0
(λt ) k!
k
e
−λt
=e
−λt
r −1 ⎡ λt ) ( (λt )r ⎤ + ⎥ ⎢1 + λt + ⋅ ⋅ ⋅ ⋅ + (r −1)! ( r)! ⎥⎦ ⎢⎣
RIAC is a DoD Information Analysis Center sponsored by the Defense Technical Information Center. RIAC is operated by a team of Wyle Laboratories, Quanterion Solutions, the University of Maryland, the Penn State University Applied Research Laboratory and the State University of New York Institute of Technology.
Ordering No.: RPAE
Reliability Modeling The RIAC Guide to Reliability Prediction, Assessment and Estimation Prepared by: Reliability Information Analysis Center 6000 Flanagan Rd. Suite 3 Utica, NY 13502-1348 Under Contract to: Defense Technical Information Center DTIC-AI 8725 John J. Kingman Rd. Suite 0944 Fort Belvoir, VA 22060
RIAC is a DoD Information Analysis Center sponsored by the Defense Technical Information Center. RIAC is operated by a team of Wyle Laboratories, Quanterion Solutions Inc., the University of Maryland, the Penn State University Applied Research Laboratory and the State University of New York Institute of Technology.
The information and data contained herein have been compiled from government and nongovernment technical reports and from material supplied by various manufacturers and are intended to be used for reference purposes. Neither the United States Government nor the Wyle Laboratories contract team warrant the accuracy of this information and data. The user is further cautioned that the data contained herein may not be used in lieu of other contractually cited references and specifications. Publication of this information is not an expression of the opinion of The United States Government or of the Wyle Laboratories contract team as to the quality or durability of any product mentioned herein and any use for advertising or promotional purposes of this information in conjunction with the name of The United States Government or the Wyle Laboratories contract team without written permission is expressly prohibited.
ISBN-10: 1-933904-17-8 ISBN-13: 978-1-933904-17-7 ISBN-10: 1-933904-18-6 ISBN-13: 978-1-933904-18-4
(Hardcopy) (Hardcopy)
(PDF Download) (PDF Download)
Form Approved OMB No. 0704-0188
REPORT DOCUMENTATION PAGE
Public reporting burden for this collection is estimated to average 1 hour per response including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and Reports(0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a current or valid OMB control number. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.
1. REPORT DATE 31 May 2010
2. REPORT TYPE
3. DATES COVERED (From - To) N/A
Technical
4. TITLE AND SUBTITLE
5a. CONTRACT NUMBER HC1047-05-D-4005 5b. GRANT NUMBER N/A
Reliability Modeling – The RIAC Guide to Reliability Prediction, Assessment and Estimation
5c. PROGRAM ELEMENT NUMBER N/A 5d. PROJECT NUMBER N/A 5e. TASK NUMBER N/A 5f. WORK UNIT NUMBER N/A 8. PERFORMING ORGANIZATION REPORT NUMBER
6. AUTHORS
William Denson
7. PERFORMING ORGANIZATIONS NAME(S) AND ADDRESS(ES) Reliability Information Analysis Center 100 Sherman Rd. Suite C101 Utica, NY 13502-1348
RPAE
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) Defense Technical Information Center DTIC-AI 8725 John J. Kingman Rd. STE 0944 Ft. Belvoir, VA 22060
10. SPONSORING/MONITOR’S ACRONYM(S)
Air Force Research Lab/RISE 525 Brooks Rd. Rome, NY 13440
DTIC-AI and AFRL/RISE 11. SPONSORING/MONITOR’S REPORT NUMBERS N/A
12. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release, distribution unlimited. 13. SUPPLEMENTARY NOTES Hardcopies available from Reliability Information Analysis Center, 100 Sherman Rd., Suite C101, Utica, NY 13502-1348. (Price: $85 US/$95 Non-US). PDF Download available from http://theRIAC.org (Price $70). 14. ABSTRACT The intent of this book is to provide guidance on modeling techniques that can be used to quantify the reliability of a product or system. In this context, reliability modeling is the process of constructing a mathematical model that is used to estimate the reliability characteristics of a product. There are many ways in which this can be accomplished, depending on the product or system and the type of information that is available, or practical to obtain, to the analyst. This book will review possible approaches, summarize their advantages and disadvantages, and provide guidance on selecting a methodology based on the specific goals and constraints of the analyst. While this book will not discuss the use of specific published methodologies, in cases where examples are provided, tools and methodologies with which the author has personal experience in their development are used, such as life modeling, NPRD, MIL-HDBK-217 and 217Plus.
15. SUBJECT TERMS
Reliability Modeling NPRD
Reliability Prediction MIL-HDBK-217
Reliability Assessment 217Plus
16. SECURITY CLASSIFICATION OF:
17. LIMITATION OF ABSTRACT
UNCLASSIFIED
Reliability Estimation 18. NUMBER OF PAGES
19a. NAME OF RESPONSIBLE PERSON
David Nicholls a. REPORT
b. ABSTRACT
c. THIS PAGE
UNCLASSIFIED
UNCLASSIFIED
UNCLASSIFIED
UNLIMITED
410
19b. TELEPHONE NUMBER (include area code)
315.351.4202 Standard Form 298 (Rev. 8/98) Prescribed by ANSI Std. Z39.18
The Reliability Information Analysis Center (RIAC), formerly the Reliability Analysis Center (RAC), is a Department of Defense Information Analysis Center sponsored by the Defense Technical Information Center, managed by the Air Force Research Laboratory (formerly Rome Laboratory), and operated by a team of Wyle Laboratories, Quanterion Solutions, the University of Maryland, the Penn State University Applied Research Laboratory and the State University of New York Institute of Technology. RIAC is chartered to collect, analyze and disseminate reliability, maintainability, quality, supportability and interoperability (RMQSI) information pertaining to systems and products, as well as the components used in them. The RIAC addresses both military and commercial perspectives. The data contained in the RIAC databases is collected on a continuous basis from a broad range of sources, including testing laboratories, device and equipment manufacturers, government laboratories and equipment users (government and industry). Automatic distribution lists, voluntary data submittals and field failure reporting systems supplement an intensive data solicitation program. Users of RIAC are encouraged to submit their RMQSI data to enhance these data collection efforts. RIAC publishes documents for its users in a variety of formats and subject areas. While most are intended to meet the needs of RMQSI practitioners, many are also targeted to managers and designers. RIAC also offers RMQSI consulting, training and responses to technical and bibliographic inquiries. REQUESTS FOR TECHNICAL ASSISTANCE AND INFORMATION ON AVAILABLE RIAC SERVICES AND PUBLICATIONS MAY BE DIRECTED TO: Reliability Information Analysis Center 100 Sherman Rd. Suite C101 Utica, NY 13502-1348 General Information:(877) 363-RIAC (877) 363-7422 Technical Inquiries: (315) 351-4200 Fax: (315) 351-4209 E-Mail:
[email protected] Internet: http://theRIAC.org
ALL OTHER RIAC REQUESTS SHOULD BE DIRECTED TO:
Air Force Research Laboratory AFRL – Systems and Information Interoperability Branch Attn: R. Hyle 525 Brooks Road Rome, NY 13441-4505 Telephone: DSN: Fax: E-Mail:
(315) 330-4857 587-4857 (315) 330-7647
[email protected]
Copyright © 2010 by Quanterion Solutions Incorporated. This handbook was developed by Quanterion Solutions Incorporated, in support of the prime contractor (Wyle Laboratories) in the operation of the Department of Defense Reliability Information Analysis Center (RIAC) under Contract HC1047-05-D-4005. The Government has a fully paid up perpetual license for free use of and access to this publication and its contents among all the DOD IACs in both hardcopy and electronic versions, without limitation on the number of users or servers. Subject to the rights of the Government, this document (hardcopy and electronic versions) and the content contained within it are protected by U.S. Copyright Law and may not be copied, automated, re-sold, or redistributed to multiple users without the express written permission. The copyrighted work may not be made available on a server for use by more than one person simultaneously without the express written permission. If automation of the technical content for other than personal use, or for multiple simultaneous user access to a copyrighted work is desired, please contact 877.363.RIAC (toll free) or 315.351.4202 for licensing information.
Table of Contents: Reliability Modeling – The RIAC Guide
Table of Contents 1.
INTRODUCTION
1.1. 1.2. 1.3. 1.4. 1.5. 1.6.
Scope Book Organization Reliability Program Elements The History of Reliability Prediction Acronyms References
2.
Page 1 2 5 7 11 17 18
GENERAL ASSESSMENT APPROACH
Define System 2.1. 2.2. Identify the Purpose of the Model 2.3. Determine the Appropriate Level at Which to Perform the Modeling 2.3.1. Level vs. Data Needed 2.3.2. Using an FMEA as the basis for a reliability model 2.3.3. Model Form vs. Level 2.4. Assess Data Available 2.5. Determine and Execute Appropriate Approach 2.5.1. Empirical
2.5.1.1. Test 2.5.1.2. Field Data 2.5.2.
20 22 25 26 28 34 36 38 44
44 77 106
Physics
2.5.2.1. Stress/Strength Modeling 2.5.2.2. First Principals
106 111
2.6. Combine Data 2.6.1. Bayesian Inference 2.7. Develop System Model 2.7.1. Monte Carlo Analysis 2.8. References
3.
19
114 121 123 127 133
FUNDAMENTAL CONCEPTS
135
Reliability Theory Concepts 3.1. 3.2. Probability concepts 3.2.1. Covariance 3.2.2. Correlation Coefficient 3.2.3. Permutations and Combinations 3.2.4. Mutual Exclusivity
135 142 142 142 143 144 i
Table of Contents: Reliability Modeling – The RIAC Guide
Table of Contents Page 3.2.5. Independent Events 3.2.6. Non‐independent (Dependent) Events 3.2.7. Non‐independent (Dependent) Events: Bayes Theorem 3.2.8. System Models 3.2.9. K‐out‐of‐N Configurations 3.3. Distributions 3.3.1. Exponential 3.3.2. Weibull 3.3.3. Lognormal 3.4. References
4.
DOEBASED APPROACHES TO RELIABILITY MODELING
4.1. 4.2. 4.3. 4.4. 4.5. 4.6. 4.7. 4.8.
Determine the Feature to be Assessed Determine Factors Determine the Factor Levels Design the Tests Perform Tests and Measurements Analyze the Data Develop the Life Model References
5.
LIFE DATA MODELING
171 172 172 172 174 180 181 183 183
185
Selecting a Distribution 5.1. 5.2. Parameter Estimation Overview 5.2.1. Closed Form Parameter Approximations 5.2.2. Least Squares Regression 5.2.3. Parameter Estimation Using MLE
185 186 189 190 192
5.2.3.1. Brief Historical Remarks 5.2.3.2. Likelihood Function 5.2.3.3. Maximum Likelihood Estimator (MLE) 5.2.4.
144 145 146 146 151 153 159 160 166 169
193 193 195 198
Confidence Bounds and Uncertainty
5.2.4.1. Confidence Bounds with MLE 5.2.4.2. Confidence Bounds Approximations 5.3. Acceleration Models 5.3.1. Fundamental Acceleration Models
198 199 206 207
5.3.1.1. Examples
208 ii
Table of Contents: Reliability Modeling – The RIAC Guide
Table of Contents Page 5.3.2. Combined Models 5.3.3. Cumulative Damage Model 5.4. MLE Equations 5.4.1. Likelihood Functions 5.5. References
6.
210 214 216 217 221
INTERPRETATION OF RELIABILITY ESTIMATES
Bathtub Curve 6.1. 6.2. Common Cause vs. Special Cause 6.3. Confidence Bounds 6.3.1. Traditional Techniques for Confidence Bounds 6.3.2. Uncertainty in Reliability Prediction Estimates 6.4. Failure Rate vs pdf 6.5. Practical Aspects of Reliability Assessments 6.6. Weibayes 6.7. Weibull Closure Property 6.8. Estimating Event‐Related Reliability 6.9. Combining Different Types of Assessments at Different Levels 6.10. Estimating the Number of Failures 6.11. Calculation of Equivalent Failure Rates 6.12. Failure Rate Units 6.13. Factors to be Considered When Developing Models 6.13.1. Causes of Electronic System Failure 6.13.2. Selection of Factors 6.13.3. Reliability Growth of Components 6.13.4. Relative vs. Absolute Humidity 6.14. Addressing Data with No Failures 6.15. Reliability of Components Used Outside of Their Rating 6.16. References
7.
EXAMPLES
223 223 225 238 238 240 243 245 245 246 247 248 250 251 252 253 253 255 257 259 259 261 262
263
MIL‐HDBK‐217 Model Development Methodology 7.1. 7.1.1. Identify Possible Variables 7.1.2. Develop Theoretical Model 7.1.3. Collect and QC Data 7.1.4. Correlation Coefficient Analysis iii
264 266 266 267 268
Table of Contents: Reliability Modeling – The RIAC Guide
Table of Contents Page 7.1.5. Stepwise Multiple Regression Analysis 7.1.6. Goodness‐of‐Fit Analysis 7.1.7. Extreme Case Analysis 7.1.8. Model Validation 7.2. 217Plus Reliability Prediction Models 7.2.1. Background 7.2.2. System Reliability Prediction Model
270 271 272 272 273 273 274
7.2.2.1. 217Plus Background 7.2.2.2. Methodology Overview 7.2.2.3. System Reliability Model 7.2.2.4. Initial Failure Rate Estimate 7.2.2.5. Process Grading Factors 7.2.2.6. Basis Data for the Model 7.2.2.7. Uncertainty in Traditional Approach Estimates 7.2.2.8. System Failure Causes 7.2.2.9. Environmental Factor 7.2.2.10. Reliability Growth 7.2.2.11. Infant Mortality 7.2.2.12. Combining Predicted Failure Rate with Empirical Data 7.2.3.
Development of Component Reliability Models
7.2.3.1. Model Form 7.2.3.2. Acceleration Factors 7.2.3.3. Time Basis of Models 7.2.3.4. Failure Mode to Failure Cause Mapping 7.2.3.5. Derivation of Base Failure Rates 7.2.3.6. Combining the Predicted Failure Rate with Empirical Data 7.2.3.7. Estimating Confidence Levels 7.2.3.8. Using the 217Plus Model in a Top-Down Analysis 7.2.3.9. Capacitor Model Example 7.2.3.10. Default Values 7.2.4.
7.2.5.
292
292 294 294 295 296 296 298 298 299 301 303
Photonic Model Development Example
7.2.4.1. 7.2.4.2. 7.2.4.3. 7.2.4.4. 7.2.4.5.
274 277 278 279 280 281 281 282 287 291 292 292
Introduction Model development methodology and results Uncertainty Analysis Comments on Part Quality Levels Explanation of Failure Rate Units
303 306 322 325 325 326
System‐Level Model
7.2.5.1. Model Presentation
326 iv
Table of Contents: Reliability Modeling – The RIAC Guide
Table of Contents 7.2.5.2. 217Plus Process Grading Criteria 7.2.5.3. Design Process Grade Factor Questions 7.2.5.4. Manufacturing Process Grade Factor Questions 7.2.5.5. Part Quality Process Grade Factor Questions 7.2.5.6. System Management Process Grade Factor Questions 7.2.5.7. Can Not Duplicate (CND) Process Grade Factor Questions 7.2.5.8. Induced Process Grade Factor Questions 7.2.5.9. Wearout Process Grade Factor Questions 7.2.5.10. Growth Process Grade Factor Questions 7.3. Life Modeling Example 7.3.1. Introduction 7.3.2. Approach 7.3.3. Reliability Test Plan 7.3.4. Results
350 350 350 350 352
7.3.4.1. Times to Failure Summary 7.3.4.2. Life Models
352 354
7.4. NPRD Description 7.4.1. Data Collection 7.4.2. Data Interpretation 7.4.3. Document Overview
7.4.3.1. 7.4.3.2. 7.4.3.3. 7.4.3.4. 7.4.3.5. 7.4.3.6. Prefix" 7.5.
8.
Page 328 330 336 340 342 346 347 348 349
357 358 361 366
"Part Summaries" Overview 366 "Part Details" Overview 373 Section 4 "Data Sources" Overview 374 Section 5 "Part Number/MIL Number" Index 374 Section 6 “National Stock Number Index with Federal Stock Class” 375 Section 7 "National Stock Number Index without Federal Stock Class 375 375
References
THE USE OF FMEA IN RELIABILITY MODELING
Introduction 8.1. 8.2. Definitions 8.3. FMEA Logistics 8.3.1. When initiated 8.3.2. FMEA Team 8.3.3. FMEA Facilitation
377 377 381 383 383 383 384
v
Table of Contents: Reliability Modeling – The RIAC Guide
Table of Contents Page 8.3.4. Implementation 8.4. How to Perform an FMEA 8.5. Identify System Hierarchy 8.6. Function Analysis 8.7. IPOUND Analysis 8.8. Identify the Severity 8.9. Identify the Possible Effect(s) that Result from Occurrence of Each Failure Mode 8.10. Identify Potential Causes of Each Failure Mode 8.11. Identify Factors for Each Failure Cause 8.11.1. Accelerating Stress(es) or Potential Tests 8.11.2. Occurrence
8.11.2.1. Occurrence Rankings
398
8.11.3. Preventions 8.11.4. Detections 8.11.5. Detectability 8.12. Calculate the RPN 8.13. Determine Appropriate Corrective Action 8.14. Update the RPN 8.15. Using Quality Function Deployment to Feed the FMEA 8.16. References
9.
385 385 387 388 388 390 392 392 398 398 398
CONCLUDING REMARKS
401 401 401 404 405 408 408 410
411
vi
List of Figures: Reliability Modeling – The RIAC Guide
List of Figures Page FIGURE 1.1‐1: PHASES OF A RELIABILITY PROGRAM ..................................................................................... 2 FIGURE 1.1‐2: RELATIVE COST OF FAILURES VS. PHASE ................................................................................ 3 FIGURE 1.1‐3: RELIABILITY PREDICTION, ASSESSMENT AND ESTIMATION.................................................... 4 FIGURE 1.1‐4: PERCENT OF COMPANIES USING RELIABILITY ENGINEERING TOOLS ..................................... 5 FIGURE 1.3‐1: EXAMPLE RELIABILITY PROGRAM APPROACH ........................................................................ 7 FIGURE 2.0‐1: GENERAL MODELING APPROACH ......................................................................................... 20 FIGURE 2.1‐1: FAULT TREE REPRESENTATION OF SYSTEM MODEL ............................................................. 21 FIGURE 2.1‐2: FAULT TREE REPRESENTATION TO THE FAILURE CAUSE LEVEL ............................................ 21 FIGURE 2.2‐1: BREAKDOWN OF POTENTIAL RELIABILITY MODELING PURPOSES ....................................... 23 FIGURE 2.3‐1: TYPICAL DATA REQUIREMENTS VS. LEVEL OF HIERARCHY ................................................... 27 FIGURE 2.3‐2: THE BASIC FMEA APPROACH ................................................................................................. 28 FIGURE 2.3‐3: HIERARCHICAL RELATIONSHIP BETWEEN CAUSE, MODE AND EFFECT ................................. 29 FIGURE 2.3‐4: APPROACH TO IDENTIFYING CAUSES .................................................................................... 29 FIGURE 2.3‐5: FAULT TREE OF PRODUCT OR SYSTEM ................................................................................. 32 FIGURE 2.3‐6: FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE AS THE LOWEST LEVEL ....................... 32 FIGURE 2.3‐7: FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE ABOVE THE LOWEST LEVEL ................ 33 FIGURE 2.3‐8: FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE TWO LEVELS ABOVE THE LOWEST LEVEL ................................................................................................................................................... 33 FIGURE 2.5‐1: BREAKDOWN OF RELIABILITY ASSESSMENT OPTIONS .......................................................... 38 FIGURE 2.5‐2: QUALIFICATION CONCEPTS AND TERMINOLOGY .................................................................. 46 FIGURE 2.5‐3: EVT, DVT AND PVT RELATIONSHIPS....................................................................................... 48 FIGURE 2.5‐4: ACCELERATION LEVELS ......................................................................................................... 51 FIGURE 2.5‐5: UNCERTAINTY IN EXTRAPOLATION ...................................................................................... 52 FIGURE 2.5‐6: ACCELERATION LEVELS ......................................................................................................... 53 FIGURE 2.5‐7: ACCELERATION ALTERNATIVES ............................................................................................. 53 FIGURE 2.5‐8: RELATIVE LIFETIME VS. STRESS ............................................................................................. 54 FIGURE 2.5‐9: RELIABILITY REQUIREMENT VS. SMALL POPULATION RELIABILITY INFERENCE ................... 60 FIGURE 2.5‐10: LIFE MODELING METHODOLOGY ....................................................................................... 62 FIGURE 2.5‐11: IDENTIFICATION OF TEST STRESSES BASED ON THE FMEA ................................................. 64 FIGURE 2.5‐12: USING THE DESTRUCT LIMIT TO DEFINE THE LIFE TEST MAX STRESS ................................ 66 FIGURE 2.5‐13: POSSIBLE STRESS PROFILES ................................................................................................ 67 FIGURE 2.5‐14: MEASUREMENT POINTS FOR AN INFANT MORTALITY FAILURE CAUSE .............................. 69 FIGURE 2.5‐15: MEASUREMENT POINTS FOR A WEAROUT FAILURE CAUSE ............................................... 69 FIGURE 2.5‐16: ACCELERATION WHEN THE DISTRIBUTIONS FOR AT LEAST TWO STRESSES ARE AVAILABLE ............................................................................................................................................................ 71 FIGURE 2.5‐17: ACCELERATION WHEN THE DISTRIBUTIONS FOR LOW STRESSES ARE NOT AVAILABLE ..... 71 FIGURE 2.5‐18: LIFE MODEL SEQUENCE ....................................................................................................... 72 FIGURE 2.5‐19 DEGRADATION MODELING APPROACH ................................................................................ 75 FIGURE 2.5‐20: DEGRADATION DATA EXAMPLE .......................................................................................... 76 FIGURE 2.5‐21: DEGRADATION DATA CONVERSION TO TIMES TO FAILURE ................................................ 77 FIGURE 2.5‐22: RELIABILITY ESTIMATES FROM FIELD DATA ........................................................................ 78 vii
List of Figures: Reliability Modeling – The RIAC Guide
List of Figures Page FIGURE 2.5‐23: FMEA AS A TOLL FOR ASSESSING SIMILARITY ..................................................................... 81 FIGURE 2.5‐24: MIL‐HDBK‐217 PART COUNT EXAMPLE ............................................................................... 85 FIGURE 2.5‐25: MIL‐HDBK‐217 PART STRESS EXAMPLE ............................................................................... 86 FIGURE 2.5‐26: TELCORDIA SR‐332 (BELLCORE) ........................................................................................... 87 FIGURE 2.5‐27: RAC PRISM REPLACED BY RIAC 217PLUS ............................................................................. 88 FIGURE 2.5‐28: CNET/RDF 2000 ................................................................................................................... 89 FIGURE 2.5‐29: CNET/RDF 2000 MODEL EXAMPLE ...................................................................................... 90 FIGURE 2.5‐30: FIDES .................................................................................................................................... 91 FIGURE 2.5‐31: USES OF PROGRAM DATA ELEMENTS ................................................................................. 93 FIGURE 2.5‐32: PROGRAM DATABASE STRUCTURE ..................................................................................... 93 FIGURE 2.5‐33: DATABASE INFORMATION FLOW ........................................................................................ 95 FIGURE 2.5‐34: HIERARCHY OF MAINTENANCE ACTIONS ............................................................................ 97 FIGURE 2.5‐35: CALCULATION OF PART LIFE UNIT ..................................................................................... 100 FIGURE 2.5‐36: FAILURE TIMES BASED ON OPERATING TIME .................................................................... 101 FIGURE 2.5‐37: FAILURE TIMES BASED ON CALENDAR TIME ..................................................................... 102 FIGURE 2.5‐38: FAILURE RATE SIMULATION WITH WEIBULL BETA = 20 .................................................... 103 FIGURE 2.5‐39: FAILURE RATE SIMULATION WITH WEIBULL BETA = 5.0 ................................................... 103 FIGURE 2.5‐40: FAILURE RATE SIMULATION WITH WEIBULL BETA = 2.0 ................................................... 104 FIGURE 2.5‐41: FAILURE RATE SIMULATION WITH WEIBULL BETA = 1.0 ................................................... 104 FIGURE 2.5‐42: FAILURE RATE SIMULATION WITH WEIBULL BETA = 0.5 ................................................... 105 FIGURE 2.5‐44: STRESS/STRENGTH INTERFERENCE ................................................................................... 108 FIGURE 2.5‐45: STRESS/STRENGTH INTERFERENCE VS. TIME .................................................................... 109 FIGURE 2.6‐1: 217PLUS APPROACH TO FAILURE RATE ESTIMATION ......................................................... 114 FIGURE 2.6‐3. BAYESIAN INFERENCE OUTLINE .......................................................................................... 122 FIGURE 2.7‐1: COMBINING SEVEN FAILURE CAUSE DISTRIBUTIONS .......................................................... 125 FIGURE 2.7‐2: POSSIBLE FAULT TREE REPRESENTATION OF A SERIES RELIABILITY BLOCK DIAGRAM ........ 126 FIGURE 2.7‐3: PDF OF NORMAL DISTRIBUTION WITH MEAN OF 10 AND STANDARD DEVIATION OF 3. ... 128 FIGURE 2.7‐4: CUMULATIVE NORMAL DISTRIBUTION WITH MEAN OF 10 AND STANDARD DEVIATION OF 3 .......................................................................................................................................................... 128 FIGURE 2.7‐5: VALUE SELECTION FROM A DISTRIBUTION ......................................................................... 129 FIGURE 2.7‐6: VALUE SELECTION FROM A WEIBULL DISTRIBUTION .......................................................... 130 FIGURE 2.7‐7: RELIABILITY BLOCK DIAGRAM OF REDUNDANT EXAMPLE .................................................. 131 FIGURE 2.7‐8: SYSTEM MONTE CARLO EXAMPLE....................................................................................... 131 FIGURE 2.7‐9: MONTE CARLO SIMULATION OF EXAMPLE SYSTEM ........................................................... 132 FIGURE 3.1‐1: DISCRETE PROBABILITY DISTRIBUTION ............................................................................... 135 FIGURE 3.1‐2: CONTINUOUS PROBABILITY DISTRIBUTION ........................................................................ 136 FIGURE 3.2‐1: EXAMPLES OF CORRELATION COEFFICIENTS ....................................................................... 142 FIGURE 3.2‐2: VENN DIAGRAM OF MUTUALLY EXCLUSIVE EVENTS ........................................................... 144 FIGURE 3.2‐3: INDEPENDENT EVENTS ........................................................................................................ 145 FIGURE 3.2‐4: FAULT TREE OR GATE .......................................................................................................... 147 FIGURE 3.2‐5: RELIABILITY BLOCK DIAGRAM FOR AN OR GATE ................................................................. 147 FIGURE 3.2‐6: FAULT TREE AND GATE ........................................................................................................ 148 viii
List of Figures: Reliability Modeling – The RIAC Guide
List of Figures Page FIGURE 3.2‐7: RELIABILITY BLOCK DIAGRAM FOR AN AND GATE .............................................................. 149 FIGURE 3.2‐8: FAULT TREE OF AN AND/OR COMBINATION ....................................................................... 150 FIGURE 3.2‐9: RBD OF AND/OR COMBINATION ......................................................................................... 150 FIGURE 3.3‐1: SHAPES OF FAILURE DENSITY AND RELIABILITY FUNCTIONS OF COMMONLY USED DISCRETE DISTRIBUTIONS (FROM MIL‐HDBK‐338B) ......................................................................................... 157 FIGURE 3.3‐2: SHAPES OF FAILURE DENSITY, RELIABILITY AND HAZARD RATE FUNCTIONS FOR COMMONLY USED CONTINUOUS DISTRIBUTIONS (FROM MIL‐HDBK‐338B) ........................................................ 158 FIGURE 3.3‐3: EXAMPLE PDF PLOTS FOR THE WEIBULL DISTRIBUTION .................................................... 164 FIGURE 3.3‐4: EXAMPLE HAZARD RATE PLOTS FOR THE WEIBULL DISTRIBUTION .................................... 164 FIGURE 3.3‐5: EXAMPLE PROBABILITY PLOTS FOR WEIBULL DISTRIBUTION ............................................. 165 FIGURE 3.3‐6: EXAMPLE PDF PLOTS FOR THE LOGNORMAL DISTRIBUTION .............................................. 167 FIGURE 3.3‐7: EXAMPLE HAZARD RATE PLOTS FOR THE LOGNORMAL DISTRIBUTION .............................. 168 FIGURE 3.3‐8: EXAMPLE PROBABILITY PLOTS FOR THE LOGNORMAL DISTRIBUTION ............................... 168 FIGURE 4.0‐1: THE DOE CONCEPT .............................................................................................................. 171 FIGURE 4.3‐1: POSSIBLE RESPONSE‐FACTOR LEVEL RELATIONSHIP ........................................................... 173 FIGURE 4.4‐1: DOE TERMINOLOGY ............................................................................................................ 174 FIGURE 4.4‐2: ONE‐FACTOR‐AT‐A‐TIME EXPERIMENTS ............................................................................. 176 FIGURE 4.4‐3: STANDARD DOE NOMENCLATURE ...................................................................................... 177 FIGURE 4.4‐4: POTENTIAL INTERACTIONS .................................................................................................. 178 FIGURE 4.6‐1: ANALYSIS OF MEANS ........................................................................................................... 182 FIGURE 4.6‐2: LINEARIZATION OF THE ARRHENIUS RELATIONSHIP ........................................................... 182 FIGURE 4.6‐3: OPTIMAL FACTOR SETTINGS................................................................................................ 183 FIGURE 5.4‐1: LIKELIHOOD CONTOUR EXAMPLE........................................................................................ 220 FIGURE 6.1‐1: BATHTUB CURVE ................................................................................................................. 223 FIGURE 6.2‐1: EXAMPLE OF NON‐MONOMODAL DISTRIBUTION .............................................................. 228 FIGURE 6.2‐2: MULTIMODAL DISTRIBUTION EXAMPLE 1 ........................................................................... 229 FIGURE 6.2‐3: MULTIMODAL DISTRIBUTION EXAMPLE 2 ........................................................................... 230 FIGURE 6.2‐4: MULTIMODAL DISTRIBUTION EXAMPLE 3 ........................................................................... 231 FIGURE 6.2‐5: MULTIMODAL DISTRIBUTION EXAMPLE 4 ........................................................................... 232 FIGURE 6.2‐6: MULTIMODAL DISTRIBUTION EXAMPLE 5 ........................................................................... 233 FIGURE 6.2‐7: MULTIMODAL DISTRIBUTION EXAMPLE OF POOLED DATA SET ......................................... 234 FIGURE 6.2‐8: AGE AT DEATH DATA ........................................................................................................... 235 FIGURE 6.2‐9: PDF OF MULTIMODE DISTRIBUTION OF AGES .................................................................... 236 FIGURE 6.2‐10: FAILURE RATE OF AGE DATA ............................................................................................. 236 FIGURE 6.2‐11: PROBABILITY PLOT OF AGE DATA ...................................................................................... 237 FIGURE 6.2‐12: SINGLE MODE WEIBULL FIT TO THE AGE DATA ................................................................. 238 FIGURE 6.3‐1: SOURCES OF ERROR IN EMPIRICAL MODELS ....................................................................... 241 FIGURE 6.3‐2: CONFIDENCE LEVEL THROUGH PREDICTION, ASSESSMENT AND ESTIMATION .................. 243 FIGURE 6.6‐1: WEIBAYES EXAMPLE ............................................................................................................ 246 FIGURE 6.13‐1: NOMINAL FAILURE CAUSE DISTRIBUTION OF ELECTRONIC SYSTEMS ............................... 254 ix
List of Figures: Reliability Modeling – The RIAC Guide
List of Figures Page FIGURE 6.13‐2: IPO MODEL ........................................................................................................................ 256 FIGURE 6.13‐3: RELATIONSHIP BETWEEN ABSOLUTE AND RELATIVE HUMIDITY....................................... 259 FIGURE 6.14‐1: ESTIMATED UPPER BOUND FAILURE RATES VS OPERATING TIME AT 60 AND 90% CONFIDENCE ..................................................................................................................................... 260 FIGURE 7.1‐1: MIL‐HDBK‐217 MODEL DEVELOPMENT METHODOLOGY ................................................... 265 FIGURE 7.2‐1: FAILURE CAUSE DISTRIBUTION OF ELECTRONIC SYSTEMS .................................................. 275 FIGURE 7.2‐2: OPTICAL AMPLIFIER FAILURE CAUSE DISTRIBUTION ........................................................... 277 FIGURE 7.2‐3: ΠG VS. TIME AND GROWTH RATES ..................................................................................... 291 FIGURE 7.2‐4: MODEL DEVELOPMENT METHODOLOGY FLOWCHART ...................................................... 306 FIGURE 7.2‐5: DISTRIBUTION OF LOG10 PREDICTED/OBSERVED FAILURE RATE RATIO FOR ALL DATA .... 323 FIGURE 7.2‐6: DISTRIBUTION OF LOG10 PREDICTED/OBSERVED RATIO FOR FIELD DATA ONLY ............... 324 FIGURE 7.2‐7: DISTRIBUTIONS OF THE PREDICTED/OBSERVED FAILURE RATE RATIO FOR ALL DATA AND FOR FIELD DATA ONLY ...................................................................................................................... 324 FIGURE 7.3‐1: TIMES TO FAILURE DISTRIBUTIONS ..................................................................................... 354 FIGURE 7.3‐2: PROBABILITY OF FAILURE VS. TEMPERATURE AND RELATIVE HUMIDITY AT 50,000 HOURS .......................................................................................................................................................... 357 FIGURE 7.4‐1: APPARENT FAILURE RATE FOR REPLACEMENT UPON FAILURE........................................... 362 FIGURE 7.4‐3: EXAMPLE OF PART DETAIL ENTRIES ................................................................................... 374 FIGURE 8.1‐1: TWO BASIC TYPES OF FMEA ................................................................................................ 378 FIGURE 8.4‐1: FMEA PROCESS FLOW ......................................................................................................... 386 FIGURE 8.7‐1: FAILURE CAUSE‐MODE EFFECT RELATIONSHIP ................................................................... 390 FIGURE 8.10‐1: FAILURE CAUSE, MODE AND EFFECT HIERARCHY ............................................................. 393 FIGURE 8.10‐2: FAILURE CAUSES ................................................................................................................ 395 FIGURE 8.11‐1: OCCURRENCE DEFINITIONS ............................................................................................... 399 FIGURE 8.11‐2: OCCURRENCE GUIDELINES ................................................................................................ 400 FIGURE 8.11‐3: DETECTABILITY DEFINITIONS ............................................................................................. 402 FIGURE 8.11‐4: LIFE CYCLE VS DETECTABILITY DIMENSION ....................................................................... 403 FIGURE 8.13‐1: POTENTIAL CORRECTIVE ACTIONS .................................................................................... 407 FIGURE 8.15‐1: QFD‐TO‐FMEA LINKS ......................................................................................................... 408 FIGURE 8.15‐2: QFD‐FMEA ......................................................................................................................... 410
x
List of Tables: Reliability Modeling – The RIAC Guide
List of Tables Page TABLE 1.3‐1: RANGES OF POTENTIAL CUSTOMER REACTIONS...................................................................... 8 TABLE 2.2‐1: RELIABILITY ASSESSMENT PURPOSES ..................................................................................... 24 TABLE 2.2‐2: PROGRAM PHASE VS. RELIABILITY ASSESSMENT PURPOSE ................................................... 25 TABLE 2.3‐1: EXAMPLES OF INITIAL CONDITIONS, STRESSES AND MECHANISMS ...................................... 30 TABLE 2.3‐2: RELATIONSHIP BETWEEN CAUSE, MODE AND EFFECT. .......................................................... 31 TABLE 2.5‐1: SUMMARY OF RELIABILITY ASSESSMENT OPTIONS ............................................................... 39 TABLE 2.5‐1: SUMMARY OF ASSESSMENT OPTIONS (CONTINUED) ............................................................ 40 TABLE 2.5‐2: RELEVANCY OF APPROACH TO PREDICTION, ASSESSMENT AND ESTIMATION....................... 41 TABLE 2.5‐3: IDENTIFICATION OF APPROPRIATE APPROACHES BASED ON THE PURPOSE ......................... 43 TABLE 2.5‐4: RANKING THE ATTRIBUTES OF EMPIRICAL DATA ................................................................... 44 TABLE 2.5‐5: EVT, DVT AND PVT PURPOSE AND APPROACH ....................................................................... 47 TABLE 2.5‐6: RELIABILITY DEMONSTRATION EXAMPLE ............................................................................... 50 TABLE 2.5‐7: EXAMPLE OF A QUALIFICATION PLAN FOR AN ASSEMBLY ..................................................... 57 TABLE 2.5‐8: QUALIFICATION EXAMPLE FOR A LASER DIODE ..................................................................... 58 TABLE 2.5‐9: STRESS PROFILE OPTION ADVANTAGES AND DISADVANTAGES ............................................. 68 TABLE 2.5‐10: SIMILARITY ANALYSIS ............................................................................................................ 80 TABLE 2.5‐11: DIGITAL CIRCUIT BOARD FAILURE RATES (IN FAILURES PER MILLION PART HOURS) ........... 83 TABLE 2.5‐12: TEST CONDITIONS ............................................................................................................... 111 TABLE 2.5‐13: DATA TO ESTIMATE DIFFUSION RATE ................................................................................. 112 TABLE 2.5‐14: PREDICTED LIFETIMES VS. OBSERVED ................................................................................. 113 TABLE 3.1‐1: PROBABILITY DISTRIBUTION NOTATION & MATHEMATICAL REPRESENTATIONS ............... 141 TABLE 3.2‐1: COMBINATIONS EXAMPLE .................................................................................................... 143 TABLE 3.2‐2: COMBINATIONS OF AN OR CONFIGURATION ....................................................................... 147 TABLE 3.2‐3: COMBINATIONS OF AN AND CONFIGURATION ..................................................................... 149 TABLE 3.2‐4: EXAMPLE OF “K‐OUT‐OF‐N” PROBABILITY CALCULATIONS................................................... 151 TABLE 3.2‐5: EXAMPLE OF “2‐OUT‐OF‐3” REQUIRED FOR SUCCESS .......................................................... 152 TABLE 3.3‐1: PROBABILITY DISTRIBUTIONS APPLICABLE TO RELIABILITY ENGINEERING .......................... 154 TABLE 3.3‐2: EXPONENTIAL DISTRIBUTION PARAMETERS ........................................................................ 160 TABLE 3.3‐3: CONFUSING TERMINOLOGY OF THE WEIBULL DISTRIBUTION ............................................. 162 TABLE 3.3‐4: WEIBULL DISTRIBUTION PARAMETERS ................................................................................ 163 TABLE 4.3‐1: POSSIBLE CONCLUSIONS FOR A NON‐LINEAR RESPONSE‐FACTOR RELATIONSHIP ............... 173 TABLE 4.4‐1: FULL‐FACTORIAL EXAMPLE .................................................................................................... 175 TABLE 4.4‐2: FULL AND HALF FACTORIAL EXAMPLE FOR CORROSION ...................................................... 179 TABLE 5.2‐1: TERMINOLOGY USED IN PARAMETER ESTIMATION ............................................................. 187 TABLE 5.2‐2: TECHNIQUES FOR PARAMETER ESTIMATION ....................................................................... 188 TABLE 5.2‐3: PARAMETERS TYPICALLY ESTIMATED FROM STATISTICAL DISTRIBUTIONS ......................... 189 TABLE 5.2‐4: CONFIDENCE BOUNDS FOR THE POISSON DISTRIBUTION ................................................... 200 TABLE 5.2‐5: CONFIDENCE BOUNDS FOR THE BINOMIAL DISTRIBUTION ................................................. 201 TABLE 5.2‐6: CONFIDENCE BOUNDS FOR THE EXPONENTIAL DISTRIBUTION ........................................... 202 TABLE 5.2‐8: CONFIDENCE BOUNDS FOR THE NORMAL DISTRIBUTION ................................................... 203 TABLE 5.3‐10: CONFIDENCE BOUNDS FOR THE WEIBULL DISTRIBUTION ................................................. 205 xi
List of Tables: Reliability Modeling – The RIAC Guide
List of Tables Page TABLE 6.1‐1: CATEGORIES OF FAILURE EFFECTS ........................................................................................ 227 TABLE 6.2‐2: BIMODAL POPULATION EXAMPLE 1 ...................................................................................... 229 TABLE 6.2‐3: BIMODAL POPULATION EXAMPLE 2 ...................................................................................... 230 TABLE 6.1‐4: BIMODAL POPULATION EXAMPLE 3 ...................................................................................... 231 TABLE 6.1‐5: BIMODAL POPULATION EXAMPLE 4 ...................................................................................... 232 TABLE 6.1‐6: BIMODAL POPULATION EXAMPLE 5 ...................................................................................... 233 TABLE 6.1‐7: FOUR MODE WEIBULL DISTRIBUTION PARAMETERS ............................................................ 235 TABLE 6.3‐1: FAILURE RATE UNCERTAINTY LEVEL MULTIPLIERS ................................................................ 242 TABLE 6.9‐1: EXAMPLE OF COMBING DIFFERENT TYPES OF MODELS........................................................ 248 TABLE 6.13‐1: FACTORS TO BE CONSIDERED IN A RELIABILITY MODEL ..................................................... 256 TABLE 6.13‐2: FAILURE RATE DATA SUMMARY ......................................................................................... 258 TABLE 7.1‐1: DATA COLLECTED FOR MODEL DEVELOPMENT .................................................................... 269 TABLE 7.1‐2: DATA TRANSFORMS .............................................................................................................. 270 TABLE 7.1‐3: REGRESSION DATA INCLUDING CATEGORICAL VARIABLES ................................................... 271 TABLE 7.2‐1: UNCERTAINTY LEVEL MULTIPLIER ......................................................................................... 282 TABLE 7.2‐2: PERCENTAGE OF FAILURES ATTRIBUTABLE TO EACH FAILURE CAUSE .................................. 283 TABLE 7.2‐3: WEIBULL PARAMETERS FOR FAILURE CAUSE PERCENTAGES ................................................ 283 TABLE 7.2‐4: MULTIPLIERS AS A FUNCTION OF PROCESS GRADE ............................................................. 284 TABLE 7.2‐5: EXAMPLE OF FAILURE MODE‐TO‐FAILURE CAUSE CATEGORY MAPPING ............................. 295 TABLE 7.2‐6: CAPACITOR PARAMETERS ..................................................................................................... 301 TABLE 7.2‐7: DEFAULT ENVIRONMENTAL STRESS VALUES ........................................................................ 302 TABLE 7.2‐8: DEFAULT OPERATING PROFILE VALUES................................................................................. 303 TABLE 7.2‐9: FAILURE CAUSE SUMMARY FOR CONNECTORS .................................................................... 308 TABLE 7.2‐10: FAILURE MODE TO FAILURE CAUSE CATEGORY FOR CONNECTORS (SC AND FC) .............. 309 TABLE 7.2‐11: FAILURE CAUSE PERCENTAGES FOR CONNECTORS ............................................................. 311 TABLE 7.2‐12: DATA COLLECTED FOR CONNECTORS.................................................................................. 312 TABLE 7.2‐13: CATEGORIES OF ACCELERATION MODEL PARAMETERS ...................................................... 315 TABLE 7.2‐14: ACCELERATION MODEL PARAMETERS ................................................................................ 315 TABLE 7.2‐15: DEFAULT MODEL PARAMETERS .......................................................................................... 316 TABLE 7.2‐16: SUMMARY OF PI‐FACTOR CALCULATIONS .......................................................................... 317 TABLE 7.2‐17: APPLICABILITY OF TEST DATA .............................................................................................. 318 TABLE 7.2‐18: BASE FAILURE RATES (FAILURES PER MILLION CALENDAR HOURS) .................................... 319 TABLE 7.2‐19: PART QUALITY PROCESS GRADE FACTOR QUESTIONS FOR PHOTONIC DEVICE MODELS .. 320 TABLE 7.2‐20: SUMMARY OF UNCERTAINTY METRICS ............................................................................... 323 TABLE 7.2‐21: PARAMETERS FOR THE PROCESS GRADE FACTORS ............................................................. 327 TABLE 7.2‐22. INDEX OF PROCESS GRADE TYPE QUESTIONS .................................................................... 328 TABLE 7.2‐23: DESIGN PROCESS GRADE FACTOR QUESTIONS .................................................................. 330 TABLE 7.2‐24: MANUFACTURING PROCESS GRADE FACTOR QUESTIONS ................................................. 336 TABLE 7.2‐25: PART QUALITY PROCESS GRADE FACTOR QUESTIONS ....................................................... 340 TABLE 7.2‐26: SYSTEM MANAGEMENT PROCESS GRADE FACTOR QUESTIONS ........................................ 342 TABLE 7.2‐27: CAN NOT DUPLICATE (CND) PROCESS GRADE FACTOR QUESTIONS .................................. 346 TABLE 7.2‐28: INDUCED PROCESS GRADE FACTOR QUESTIONS ............................................................... 347 xii
List of Tables: Reliability Modeling – The RIAC Guide
List of Tables Page TABLE 7.2‐29: WEAROUT PROCESS GRADE FACTOR QUESTIONS ............................................................. 348 TABLE 7.2‐30: GROWTH PROCESS GRADE FACTOR QUESTIONS ............................................................... 349 TABLE 7.3‐1: PARAMETER LEVELS .............................................................................................................. 350 TABLE 7.3‐2: TEST PLAN SUMMARY ........................................................................................................... 351 TABLE 7.3‐3: LIFE TEST RESULTS ................................................................................................................. 352 TABLE 7.3‐4: TIMES TO FAILURE DISTRIBUTION PARAMETERS .................................................................. 353 TABLE 7.3‐5: ESTIMATED PARAMETER 80% 2‐SIDED CONFIDENCE BOUNDS ............................................ 356 TABLE 7.4‐1: DATA SUMMARIZATION PROCESS ........................................................................................ 359 TABLE 7.4‐2: TIME AT WHICH ASYMPTOTIC VALUE IS REACHED ............................................................... 363 TABLE 7.4‐3 α/MTTF RATIO AS A FUNCTION OF β ..................................................................................... 363 TABLE 7.4‐4: PERCENT FAILURE FOR WEIBULL DISTRIBUTION ................................................................... 364 TABLE 7.4‐5: FIELD DESCRIPTIONS ............................................................................................................. 367 TABLE 7.4‐6: APPLICATION ENVIRONMENTS DEFINED IN NPRD ............................................................... 368 TABLE 8.7‐1: FAILURE MODE RELATIONSHIP TO TAGUCHI LOSS FUNCTION ............................................. 389 TABLE 8.8‐1: DIMENSIONS OF FUNCTIONAL SEVERITY .............................................................................. 391 TABLE 8.8‐2: DIMENSIONS OF SEVERITY .................................................................................................... 392 TABLE 8.11‐1: CATEGORIES OF FAILURE EFFECTS ...................................................................................... 401 TABLE 8.11‐2: RECOMMENDED DETECTABILITY RATING CRITERIA ............................................................ 404
xiii
List of Tables: Reliability Modeling – The RIAC Guide
This page intentionally left blank
xiv
Chapter 1: Introduction
1.
Introduction
Few engineering techniques have caused as much controversy in the last several decades as the topic of reliability prediction. One of the primary reasons for this is the stochastic nature of reliability. Whereas many engineering disciplines are governed by deterministic processes, reliability is governed by a complex interaction of stochastic processes. As a result, the metrics of interest in other engineering disciplines are generally much more quantifiable by their very nature. While there is always a stochastic element in any engineering model, the topic of reliability quantification must address its extreme stochastic nature. Many highly respected reliability engineering texts treat the topic of reliability modeling thoroughly and in great detail. Included in these texts are detailed ways to model system reliability using techniques like Failure Modes and Effects Analysis (FMEA), Fault Tree Analysis (FTA), Markov models, fault tolerant design techniques, etc. The techniques that are addressed in detail in these texts often gloss over a fundamental requirement in order to effectively utilize these techniques, i.e., the ability to quantify the reliability of the constituent components and subsystems comprising the system. The intent of this book is to provide guidance on reliability modeling techniques that can be used to quantify the reliability of a product or system. In this context, reliability modeling is the process of constructing a mathematical model that is used to estimate the reliability characteristics of an item. There are many ways in which this can be accomplished, depending on the item and the type of information that is available to, or practical to obtain by, the analyst. This book will review possible approaches, summarize their advantages and disadvantages, and provide guidance on selecting a methodology based on specific goals and constraints. While this book will not discuss the use of specific published methodologies, in cases where examples are provided, tools and methodologies with which the author has personal experience in their development are used, such as life modeling, NPRD, MIL-HDBK-217 and 217Plus. The Reliability Information Analysis Center (RIAC) has prepared many documents in the past relating to many different reliability engineering techniques, such as FMEA, FTA, Worst Case Analysis (WCA), etc. However, one noteworthy omission from this list is reliability modeling. This, coupled with (1) the RIAC’s history of providing reliability modeling data and solutions, and (2) the need to objectively address some of the confusion and misconceptions related to this topic, formed the inspiration for this book.
Reliability Information Analysis Center 1
Chapter 1: Introduction
In years past, DoD contracts would require specific reliability prediction methodologies, usually MIL-HDBK-217, be used. This resulted in system developers having very little flexibility in applying different reliability prediction practices. Since the DoD has not, until very recently, supported updates to MIL-HDBK-217, companies were encouraged to use best practices in quantifying product reliability. The difficult question to be addressed is “what are the best practices that should be used?” This book attempts to provide guidance on selecting an appropriate methodology based on the specific conditions and constraints of the company and its products or systems. It is hoped that the author’s experience gained by attempting many different reliability assessment approaches, including physics and empirical approaches, can be used to the advantage of the reader in a practical way.
1.1. Scope The intent of a reliability program is to identify and mitigate failure modes/mechanisms, verify their removal through reliability testing, implement corrective actions for “discovered” failures, and maintain reliability levels after reliability has been designed in. These correspond to the designing-in reliability, reliability growth and ensuring on-going reliability goals, respectively, as illustrated in Figure 1.1-1.
Figure 1.1-1: Phases of a Reliability Program 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 2
Chapter 1: Introduction
The cost to an organization increases exponentially as a function of when failure causes are discovered, as illustrated in Figure 1.1-2. It is most efficient to discover failure modes and mechanisms as early as possible, when they can be effectively mitigated. If failure modes and mechanisms are discovered late in development or, worse, in the field, organizations can be faced with staggering costs associated with corrective actions.
Figure 1.1-2: Relative Cost of Failures vs. Phase The use of reliability engineering techniques early in the development cycle of a system is critical to achieving high reliability. An important part of these efforts is the modeling of reliability before the product or system is fielded. The term “Reliability Prediction” has had a relatively narrow connotation, primarily associated with “handbook” approaches. This document attempts to take a broader view of this topic by investigating the various approaches for quantifying reliability, and their effectiveness when used to achieve specific objectives. For this reason, the book is entitled “Reliability Modeling – the RIAC Guide to Reliability Prediction, Assessment and Estimation”. The definitions of these are: Prediction - something that is predicted, forecasted Assessment - to determine the importance, size, or value of Estimation - A tentative evaluation or rough calculation, as of worth, quantity, or size Reliability Information Analysis Center 3
Chapter 1: Introduction
Predictions are performed very early, before there is any empirical data on the item under analysis. Reliability assessments are made to determine the affects of certain factors on reliability and to identify failure causes. Reliability estimates are made based on empirical data. This book covers all three areas, as illustrated in Figure 1.1-3.
Figure 1.1-3: Reliability Prediction, Assessment and Estimation Figure 1.1-4 summarizes the results of a benchmarking study of best commercial reliability practices (Reference 9). In this study, reliability predictions were identified by more than 90% of the participants as being an appropriate reliability task during the product/system development life cycle. Approximately 70% of the survey respondents felt that reliability predictions were effective, supporting the proposition that, while generally perceived as beneficial, there are problems associated with their use. This information highlights the importance that organizations often place on assessing and predicting reliability.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 4
Chapter 1: Introduction
Figure 1.1-4: Percent of Companies Using Reliability Engineering Tools
1.2. Book Organization Chapter 1 of this book presents background information on reliability modeling. The next section of this chapter includes a description of a typical reliability program, the intent of which is to present the elements that should be considered when developing a program, and to highlight how reliability modeling fits into such a program. Also included is a section on the history of reliability prediction, to provide a historical perspective of its evolution. Chapter 2 covers the primary topic of this book, and includes information on the various ways in which a product can be modeled and guidance on selecting an approach. It presents a generic approach, and describes the elements of this approach. Chapter 3 presents fundamental concepts of reliability theory, probability and statistics. In many books, these topics are presented first. However, in this book, it is presented after Chapter 3 because it is not the primary topic. Rather, it is presented to provide the fundamental foundation for the concepts used in reliability modeling. It is also the foundation for Design of Experiments (DOE) and Life Modeling techniques, which are further detailed in Chapters 4 and 5.
Reliability Information Analysis Center 5
Chapter 1: Introduction
Approaches like using a “Multi-cell”-based designed experiment to generate data from which a life model is developed are presented in Chapter 2. Here, a generic approach to this topic is presented. Since the topic of life modeling is central to reliability modeling, important elements of it are presented in more detail in Chapters 4 and 5. One of the critical aspects of life modeling is reliability testing. Design of Experiments is a technique to maximize the usefulness of the data resulting from DOE tests, and is the topic of Chapter 4. Chapter 5 presents information relative to development of the mathematical models that form the basis of the reliability model, and includes information pertaining to parameter estimation. Chapter 6 presents a variety of topics pertaining to the interpretation of reliability models. This is provided to allow the reader to gain a better appreciation for what can, and cannot, be concluded from a model. Chapter 7 is a compilation of examples of reliability models. Presented here are the following examples: 1. 2. 3. 4.
A typical MIL-HDBK-217 model development process Information on the development of the RIAC’s 217Plus methodology A life modeling example A description of RIAC’s Nonelectronic Parts Reliability Data (NPRD), provided as an example of the use of field data in reliability modeling
These examples are provided to give the reader a better appreciation for the tools, techniques and limitations of various approaches to reliability modeling. A discussion of FMEA is presented in Chapter 8. Although FMEA is secondary to the primary intent of this book, it can form the basis for many elements of a reliability program, including reliability modeling. Therefore, Chapter 8 is intended to present FMEA concepts in this context, as well as provide practical information on performing FMEAs that this author has found to be useful.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 6
Chapter 1: Introduction
1.3. Reliability Program Elements In order to allow a perspective on how reliability modeling fits into a reliability program, this section presents a generic reliability program, with a description of its various elements. It is presented to highlight how reliability modeling fits into such a program. There are many possible approaches to “designing in” reliability. The specific approach used will depend on the needs of the specific organization. Figure 1.3-1 presents one possible approach, and includes the elements that should be included in all approaches. The premise of this approach is to identify the critical parts and material which warrant detailed attention. Since it is impractical to perform some reliability modeling approaches on all system parts, it is imperative to identify the critical parts which are the highest risk. Since one of the most effective ways to verify the robustness of parts or materials is from experience, an effective reliability program must leverage knowledge gained in the development and deployment of previous systems. It will be shown that reliability assessments impact many of the elements of this approach.
Figure 1.3-1: Example Reliability Program Approach Reliability Information Analysis Center 7
Chapter 1: Introduction
Elements of the reliability program are summarized as follows: 1. Design requirements: The first step in any product development process is the identification of requirements. These requirements include items pertaining to Performance, Reliability (failure rate, life), Maintainability, Diagnostics, and Use Environment and Operational stresses (i.e., mission profiles). Typically, the medium for communicating these requirements is the product specification. While the specification usually contains details regarding the require performance of the product or system, it is often lacking relative to quantifying the reliability attributes required. The following questions should be answered to determine these reliability requirements: • • • • •
What is the required failure rate of the item in its useful life? What is the service life required? What criteria will be used to determine when the requirements are not met? Whose responsibility will it be to take corrective action if these requirements are not met? What are the operating and environmental profiles expected in field deployed conditions?
A valuable tool to assist in understanding the requirements is Quality Function Deployment (QFD). The reliability that is considered acceptable will, of course, be specific to the industry, criticality of failure, etc. The specific value may be specified, or it may not be, depending on the industry and the maturity of the product. The range of potential customer reactions to various scenarios are summarized in Table 1.3-1. Table 1.3-1: Ranges of Potential Customer Reactions Outcome Best
Worst
Field reliability No failures Failures occur at an acceptable rate Recurring failures, but on a relatively small percent of items Recurring failures on a high percent of items An unexpected failure mechanism is discovered that will affect the entire population, or critical safety related failures
Likely Customer reaction Pleased Tolerant Annoyed Angry Legal action, loss of business
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 8
Chapter 1: Introduction
If the requirement is not specified, an estimate of the requirement must be made so that there is a goal that can be used in the development process. 2. Initial Design: After the product requirements are understood, the design team generally derives an initial, or preliminary, design for the product or system. Inputs to this initial design should be in the form of design rules and a Standard parts list. Design rules are the culmination of lessons learned from previous development activities, from both empirical field or test data, and from analysis. These design rules should be a living document which is continuously updated based on current information. Effective use of design rules also saves much effort since reliability attributes which have a reliability history or which have been previously studied do not need to be addressed in detail, thus saving resources to be applied to the study of critical parts. 3. Similarity analysis: Once an initial design is available, a similarity analysis can be performed to identify attributes which are similar to those for which a reliability history is available, and those for which it is not. A FMEA can be a valuable technique for this analysis, and will be discussed later. In this analysis, each reliability attribute identified in the FMEA is reviewed to determine if a reliability history exists or not. 4. Identify attributes that are similar: Similar attributes are those that have a reliability history 5. Assess robustness of attribute: If the part or attribute does have a history, previous test data or field experience data can be used to assess the robustness of the part or attribute. 6. Identify attributes that are not similar: Attributes that are not similar do not have a reliability history. 7. Perform design analysis: Although any attribute that is potentially different in the new design relative to the previous design must be analyzed, particular attention is given to the attributes that are not similar. Design techniques that are used for this purpose are FMEA, tolerance or worst case analysis, thermal analysis, stress analysis, and reliability predictions. 8. Implement corrective action: From the results of the design analysis, corrective action should be taken to improve the robustness of the design. 9. Identify critical parts/materials: Based on the results of the analysis, critical parts or materials are identified. Reliability Information Analysis Center 9
Chapter 1: Introduction
10. Model critical parts/materials: Once critical parts are identified, action must be taken to ensure that the parts or materials are robust enough to meet the reliability and durability requirements. More details of the approach used for this purpose will be presented later in the book. 11. Identify effective tests for non-similar attributes: Based on the identification of critical parts and the design analysis that was performed, specific tests that will assess the reliability and durability of the attribute can be determined. Part of the FMEA should include identification of stresses that will accelerate the attribute under analysis and therefore, this analysis is important for identifying the appropriate stress tests. 12. Develop a test plan and execute tests: Based on the design analysis performed and the identification of tests for non-similar attributes, a test plan can be determined. In the context of this approach, the goal of these tests is to assess the robustness of the product by subjecting the product to test stresses that are intended to accelerate the critical parts and non-similar attributes to failure. In addition to these tests, other test requirements should be incorporated into this test plan. These additional test requirements include any tests required by the customer, such as qualification or reliability demonstration tests. 13. Document the test results: Once the tests have been performed and the data analyzed, the results should be fully documented, since they subsequently will be used for a variety of purposes. 14. Monitor field reliability: Once the product is deployed, field reliability experience data should be carefully gathered, since it will be used for a variety of purposes. Elements of the data to be gathered include: 1. Product or system deployment history by serial number, including when deployed, when fielded 2. Failure information, including failure date, root failure cause, results of failure analysis 3. Product or system re-deployment information 15. Update reliability database: A database is required to manage the reliability data, and should include both test data and field data. This data can be used to generate a company-specific reliability prediction methodology.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 10
Chapter 1: Introduction
16. Update Design Rules: Data acquired from tests and field surveillance should be used to update the design rules. Field data is probably the most valuable type of data for this purpose since it represents the actual product or system in the intended use environment. The process of maintaining design rules and ensuring that they are used in new designs is the cornerstone of the means by which reliability is improved in a reliability growth process. Critical parts are those which may result in a significant risk to the project. This risk can be related to reliability, lifetime, availability, or maintainability. Some of the factors that constitute critical parts are: • • • • • •
New, unproven technology New, unproven manufacturing processes Performance limitations: stringent environmental conditions or non-robust design practices Reliability limitations: components/materials with life limitations Vendors with a past history of delivery, cost performance or reliability problems Old technology with availability problems
These critical parts or items warrant additional attention in assessing their reliability, as they generally will represent the greatest reliability risk.
1.4. The History of Reliability Prediction The term “reliability prediction” has historically be used to denote the process of applying mathematical models and data for the purposes of estimating field reliability of a product or system before empirical data is available on that product or system. This section will review some of the developments in the area of reliability prediction from the 1950’s to the present. While there are several techniques available to reliability practitioners to perform reliability predictions, the discussion inevitably centers around MIL-HDBK-217 due to its historical prominence as a reliability prediction tool. During World War II, electronic tubes were by far the most unreliable component used in DoD electronic systems. This observation led to various studies and ad hoc groups whose purpose was to identify ways that their reliability, and the reliability of the systems in which they operated, could be improved. One group in the early 1950’s concluded that: 1. There needs to be better reliability data collected from the field 2. Better components need to be developed Reliability Information Analysis Center 11
Chapter 1: Introduction
3. Quantitative reliability requirements need to be established 4. Reliability needs to be verified by test before full scale production 5. A permanent committee needs to be established to guide the reliability discipline Item 5, above, was implemented in the form of the Advisory Group on Reliability of Electronic Equipment (AGREE), whose charter was to identify actions that could be taken to provide more reliable electronic equipment. This time period was the advent of the reliability engineering discipline. It soon became clear that the emerging discipline was using several different methods to achieve its goal of higher reliability. One was the identification of root causes of field failure and determination of mitigating actions. Another was the specification of quantitative reliability requirements. The specification of requirements in turn led to the desire to have a means of estimating reliability before an equipment is built and tested so that the probability of achieving its reliability goal could be estimated. This, of course, was the beginning of reliability prediction. The 1950’s also saw much pioneering work in the reliability discipline, including; • • • • •
A variety of efforts to improve device reliability through data collection and design The establishment of reliability programs Symposiums devoted to quality and reliability engineering Statistical techniques development such as the Weibull distribution Military handbooks that provided guidance on the reliable application of electronic components
In addition to these accomplishments, the 50’s also included pioneering work in the area of quantitative reliability prediction. In 1956, RCA released TR-1100, “Reliability Stress Analysis for Electronic Equipment”, which presented mathematical models for the estimation of component failure rates. This report turned out to be the predecessor of MIL-HDBK-217. Several additional early works in the area of reliability prediction were produced in the early 1960’s, including D.R. Erles’ report (Reference 2) and the Erles and Edins paper (Reference 3). In 1962, the first version of MIL-HDBK-217 was published by the Navy. Once issued, MIL HDBK-217 quickly became the standard by which reliability predictions were performed, and other sources of failure rates gradually disappeared. Part of the reason for the demise of other sources was the fact that MIL-HDBK-217 was often a contractually cited document and defense contractors did not have the option of using other sources of data. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 12
Chapter 1: Introduction
These early sources of failure rates also often included design guidance on the reliable application on electronic components. However, subsequent versions of the documents, primarily MIL-HDBK-217, would delete the application information because it was treated in more detail elsewhere. By now, the reliability discipline was working under the tenet that reliability was a quantitative discipline that needed quantitative data sources to support its many statistically based techniques, such as allocations and redundancy modeling. However, another branch of the reliability discipline focused on the physical processes by which components were failing. The first symposium devoted to this topic was the “Physics of Failure In Electronics” Symposium sponsored by the Rome Air Development Center (RADC) and IIT Research Institute (IITRI) in 19621. This symposium later became known as the International Reliability Physics Symposium (IRPS). In this period of time, the two branches of reliability engineering seemed to be diverging, with the “systems” engineers devoted to the tasks of specifying, allocating, predicting and demonstrating reliability, while the physics-of-failure (PoF) engineers and scientists were devoting their efforts to identifying and modeling the physical causes of failure. Both branches were integral parts of the reliability discipline, and both were hosted at RADC (later to become Rome Laboratory). The physics-based information was necessary to develop part qualification, screening and application requirements, and the “systems” tasks of specifying, allocating, predicting and demonstrating reliability were necessary to insure that reliability requirements were met. The component research efforts of the 1950’s and 1960’s culminated with the implementation of the “ER” and “TX” families of specifications. This complicated the issue of predicting their reliability because there were now many different combinations of quality levels and environments that needed to be addressed in MIL-HDBK-217. In the early 1970’s, the responsibility for preparing MIL-HDBK-217 was transferred to RADC, who published revision B in 1974. However, other than the transition to RADC, the 1970’s maintained the status quo in the area of reliability prediction. MIL-HDBK217 was updated to reflect the technology at that time, but there were few other efforts that changed the manner in which predictions were performed. One exception, however, was that there was a shift in the complexity of the models being developed for MILHDBK-217. There were several efforts to develop new and innovative models for reliability prediction. The results of these efforts were extremely complex models that may have been technically sound, but were criticized by the user community as being too 1
IITRI was the original contractor of the Reliability Analysis Center (RAC). In 2005, the RAC contract was awarded as RIAC to the current team of Wyle Labs (prime), Quanterion Solutions Incorporated, the University of Maryland Center for Risk and Reliability, the Pennsylvania State Applied Research Laboratory (ARL), and the State University of New York Institute of Technology (SUNYIT)
Reliability Information Analysis Center 13
Chapter 1: Introduction
complex, too costly, and unrealistic given the low level of detailed design information available at the point in time when the models were needed. RCA, under contract to RADC, had developed PoF-based models which were rejected as unusable, since the detailed design and construction data for microcircuits were simply unavailable to typical model users. These models were never incorporated into MIL-HDBK-217. While MIL-HDBK-217 was updated again several times in the 1980’s, there were agencies that were developing reliability prediction models unique to their industries. As an example, the automotive industry, under the auspices of the Society of Automotive Engineers (SAE) Reliability Standards Committee, developed a series of models specific to automotive electronics. The SAE committee felt that there was no existing prediction methodologies that were applicable to the specific quality levels and environments of automotive applications. The Bellcore reliability prediction standard is another example of a specific industry developing methodologies for their unique conditions and equipment. It originally was developed by modifying MIL-HDBK-217 to better reflect the conditions of interest of the telecommunications industry. It has since taken on its own identity with models derived from telecommunications equipment and is now used widely within that industry. The 1980’s also saw explosive growth in integrated circuit technology. Very dense circuits were being fabricated using feature sizes as small as 0.5 microns. This presented unique challenges to reliability modelers. The VHSIC (Very High Speed Integrated Circuit) program was the government’s attempt to leverage from the technological advancements of the commercial industry and, at the same time, produce circuits capable of meeting the unique requirements of military applications. From the VHSIC program came the Qualified Manufacturers List (QML) - a qualification methodology that qualified an integrated circuit manufacturing line, unlike the traditional qualification of specific parts. The government realized that it needed a QML-like process if it were to leverage from the advancements in commercial technologies and, at the same time, have a timely and effective qualification scheme for military parts. A reliability prediction model was also developed for VHSIC devices in 1989 (Reference 9) in support of a MILHDBK-217 update. An interesting observation was made during that study that deviated from the premise on which most of the MIL-HDBK-217 models were based. The traditional approach to developing models was to collect as much field failure rate data as possible, statistically analyze it, and quantify model factors based on the results of the statistical analysis. For integrated circuits, one of the factors that was quantified was inevitably device complexity. This complexity was measured by the number of gates or transistors and was the primary factor on which the models were based. The correlation between failure rate and complexity was strong and could be quantified because the 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 14
Chapter 1: Introduction
failure rate of circuits was much higher than they are today and the defect rate was directly proportional to the complexity. As technology has advanced, the gate or transistor count became so high that it could no longer effectively be used as the measure of complexity in a reliability model. Furthermore, transistor or gate count data was often difficult or impossible to obtain. Therefore, the model developed for VHSIC microcircuits needed another measure of complexity on which to base the model. The best measures, and the ones most highly correlated to reliability are defect density and silicon area. It can be shown that the failure rate (for small cumulative percent failure) is directly proportional to the product of the area and defect density. However, another factor that is highly correlated to defect density and area is the yield of the die, or the percent of die that are functional upon manufacture. Ideally, a reliability model would use either yield or defect density/area as the primary factor(s) on which to base the model. The problem in using these factors in a model is that they are considered highly proprietary parameters from a market competition viewpoint and, therefore, are rarely released by the manufacturers. Therefore, the single most important driver of reliability cannot be obtained by the user of the device, which is unfortunate because the accuracy of the model suffers. The conflict between the usability of a model and its accuracy has always been a difficult tradeoff to address for model developers. Much of the literature in the 1990’s on the topic of reliability prediction has centered around the debate as to whether the reliability discipline should focus on PoF-based or empirically-based models (such as MIL-HDBK-217) for the quantification of reliability. In the author’s opinion, many of the primary criticisms of MIL-HDBK-217 stem from the fact that it was often used for purposes for which it was not intended. For example, it was often used as a means by which the reliability of a product was demonstrated. Since its use was contractually required, contractors would try to demonstrate compliance to the specified reliability requirements by “adjusting” factors in the model to make it appear that the reliability would meet requirements. Sometimes these adjustments had a technical basis, and sometimes they did not. Les Gubbins, one of the government’s first project managers for the handbook, once made the analogy that engaging in the use of these adjustment factors is like pushing the needle on your car’s speedometer up, and convincing yourself you’re going faster. This, of course, is not good engineering practice, but rather was done for nontechnical reasons. Another key development in the area of reliability predictions was related to the implications of acquisition reform. In 1994, Military Specifications and Standards Reform (MSSR) was initiated which decreed the adoption of performance-based specifications as a means of acquiring and modifying weapon systems. It also overhauled Reliability Information Analysis Center 15
Chapter 1: Introduction
the military standardization process which, in turn, led to a list of standardization documents that required priority action because they were identified as barriers to commercial processes, as well as major cost drivers in defense acquisitions. The list included only one handbook, MIL-HDBK-217. Over the years, critics of MIL-HDBK217 have complained about its utility as an effective method for assessing reliability. While the claim is made that it is inaccurate and costly, to date there is no viable replacement in the public domain. As the DoD Lead Standardization Activity for reliability and maintainability (R&M), Rome Laboratory (RL) was responsible for implementing the R&M segment of MSSR. Within this context, RL initiated a project to develop a new reliability assessment technique to supplement MIL-HDBK-217, and to overcome some of its perceived problems. Utilizing standardization reform funding, RL awarded a contract to the Reliability Analysis Center and Performance Technology, Inc. The objective of the work was to develop new and innovative reliability assessment methods that are flexible enough to suit the needs of system reliability analysts regardless of their preferred (or required) initial prediction methods. The intent was to use the final model to supplement or possibly replace MIL-HDBK-217. The premise of traditional methods, such as MILHDBK-217, is that the failure rate is primarily determined by components comprising the system. This was a good premise in the 1960’s and 1970’s when components exhibited higher failure rates and systems were less complex than they are today. Increased system complexity and component quality have resulted in a shift of system failure causes away from components to more “system level” factors including manufacturing, design, system requirements, interface, and software problems. Historically, these factors have not been explicitly addressed in prediction methods. The intent of this study was to develop a structure for an electronic system reliability assessment methodology. The term “system” was used because the methodology accounted for all predominant causes of system failure. The new model adopted a broader definition of reliability. An integral part of the methodology was the assessment of processes used in the design and manufacture of the system, including factors contributing to the following failure causes: parts, design, manufacturing, system management, induced, wearout, no defect found and software. The results of this study became the basis for the current RIAC 217Plus methodology. The 2000’s was a time in which there was progress on development of new standards, some of which will be summarized in this book. Also, the DoD has initiated efforts to resurrect MIL-HDBK-217 by updating it with models reflecting state-of-the-art technologies.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 16
Chapter 1: Introduction
1.5. Acronyms Acronyms and abbreviations that are used in his book are defined as follows: AL ALM ALT CA CDF CRR D DoD DPA DVT ED ELFR EPRD ESD EV EVT FMEA FMECA FRU GFL HALT HASS HAST HTB HTOL HTRB IOL IPL IWV KPSI LI MCMC MLE MS MTTF NPRD O PD PDF PVT RBD RPN RSH S SD TBD TC TR TST TTF VVF
Accelerated Life Accelerated Life Model Accelerated Life Testing Constant acceleration Cumulative Distribution Function Center for Risk and Reliability Detectability Department of Defense Destructive Physical Analysis Design Verification Test Electrical distributions Early life failure rate Electronic Parts Reliability Data Electrostatic discharge External visual Engineering Verification Test Failure Mode and Effect Analysis Failure Mode and Effect Criticality Analysis Field Replaceable Unit Gross/fine leak Highly Accelerated Life Test (simultaneous temperature cycling and vibration) Highly Accelerated Stress Screening Highly Accelerated Stress Testing High temperature bake High temperature operating life High temp. reverse bias Intermittent operational life Inverse Power Law Internal water vapor Pounds per square inch, in thousands Lead integrity Markov Chain Monte Carlo Maximum Likelihood Estimator Mechanical shock Mean Time to Failure Non-Electronic Parts Reliability Data Occurrence Physical dimensions Probability Density Function Process Verification Test Reliability Block Diagram Risk Priority Number Resistance to solder heat Severity Solderability To Be Defined Temperature cycling Thermal resistance Pre and post electrical test Time to Failure Vibration - variable freq.
Reliability Information Analysis Center 17
Chapter 1: Introduction
1.6. References 1. Coppola, A., Reliability Engineering of Electronic Equipment, A Historical Perspective,” IEEE Transactions on Reliability. Vol. R-33. No. 1, April 1984. 2. Erles, D.R., “Reliability Application and Analysis Guide,” The Martin Company, July 1961. 3. Erles D.R. and M.F. Edins, “Failure Rates,” AVCO Corp. April, 1962. 4. Knight, C.R., “Four Decades of Reliability Progress,” 1991 Proceedings Annual Reliability and Maintainability Symposium. 5. “Reliability Prediction Methodologies For Electronic Equipment,” AIR 5286, SAE G11 Committee, Electronic Reliability Prediction Committee, 31 Jan. 1998 6. “Reliable Application of Plastic Encapsulated Microcircuits,” Reliability Analysis Center Publication PEM2. 7. Morris, S.F. and J.F. Reilly (Rome Laboratory), “MIL-HDBK-217 - A Favorite Target.” 8. Denson, W. And P. Brusius, “VHSIC and VHSIC-Like Reliability Modeling,” RADCTR-89-177. 9. Reliability Analysis Center, “Benchmarking Commercial Reliability Practices”
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 18
Chapter 2: General Assessment Approach
2.
General Assessment Approach
Prior to developing a reliability model for a product or system, the analyst should consider the following questions: • • • • • •
• •
What is the goal of the model, and what decisions will be made based on it? What data is currently available on the product? Is field data available? If so, is it from the product or system operating in the same manner and environment as the one under analysis? Is test data available? If so, what types of tests (i.e., accelerated life tests, nonaccelerated life tests, qualification tests, etc.) Is data, either field or test, available on a predecessor (i.e., earlier version) of the product? Have models been developed for specific failure modes, mechanisms and/or causes of the product? o Life models? o Stress-strength models? o Models from first principals? Have critical failure causes of the product been identified? How much support can be expected from suppliers regarding identification and quantification of the failure causes of their product?
A suggested approach to modeling the reliability of a product is shown in Figure 2.0-1.
Reliability Information Analysis Center 19
Chapter 2: General Assessment Approach
Define system
Identify the purpose of the model
Determine the appropriate level at which to perform the assessment (System, assembly, part, failure cause) Assess data available
Determine appropriate approach and execute
Assess feasibility of performing reliability tests Combine data
Develop System Model
Figure 2.0-1: General Modeling Approach Each of the elements of this approach is discussed below.
2.1. Define System The first step in assessing the reliability of a product or system is to clearly define the scope of the assessment. A model is then generated that describes the breakdown of the product or system. This breakdown can be in accordance with a physical hardware hierarchy of the system, or a functional breakdown. Either way, the goal is to define the “items” for which a reliability estimate is required. If handbook reliability prediction methodologies such as 217Plus or MIL-HDBK-217 are used, the definition of the items to address in the prediction is generally accomplished with a hardware-based hierarchical breakdown, since those prediction methodologies are based on the physical components comprising the system. In other approaches, such as life modeling from accelerated test data, the product or system breakdown can be based on functionality or hardware, with the exception that the breakdown continues down to 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 20
Chapter 2: General Assessment Approach
the root failure mode cause or mechanism level. Tools for this “system model” include FMEA and FTA. Fault tree representation of a system breakdown in which the level at which reliability estimates are made are the components, represented by circles (basic events) is illustrated in Figure 2.1-1. This Figure represents a reliability prediction performed using MILHDBK-217 or 217Plus. System
Assembly 1
Assembly 2
Subassembly 1a
Comp. 1a1
Comp. 1a2
Subassembly 1b
Comp. 1a3
Comp. 1b1
Subassembly 2b
Comp. 1b2
Comp. 2b1
Comp. 2b2
Subassembly 2c
Comp. 2b3
Comp. 2c1
Comp. 2c2
Figure 2.1-1: Fault Tree Representation of System Model Fault tree representation of a system breakdown in which the level at which reliability estimates are made are the failure mechanisms of the components, represented by circles (basic events) is illustrated in Figure 2.1-2. This would be the representation of a reliability prediction performed using a physics approach in which the intent is to estimate the reliability of specific root-cause failure mechanisms. System
Assembly 1
Assembly 2
Subassembly 1a
Comp. 1a1
FM1
FM2
Subassembly 1b
Comp. 1a2
FM2
FM1
Comp. 1a3
FM1
FM2
Comp. 1b1
FM3
FM1
Subassembly 2b
Comp. 1b2
FM1
FM2
Comp. 2b1
FM1
FM2
Subassembly 2c
Comp. 2b2
FM3
FM1
FM2
Comp. 2b3
FM1
Comp. 2c1
FM1
Comp. 2c2
FM2
Figure 2.1-2: Fault Tree Representation to the Failure Cause Level Reliability Information Analysis Center 21
FM1
FM2
Chapter 2: General Assessment Approach
Approaches such as this, in which the reliability of each failure mechanism is estimated, are practical if: 1. The product or system under analysis has a manageable number of failure mechanisms that can be estimated 2. The approach can be practically applied for all failure mechanisms over the entire supply chain. In other words, each organization responsible for their component or assembly has the ability to estimate the reliability of all failure mechanisms within their component or assembly. This same representation is relevant to performing FMEAs. In this case, the lowest level events in the fault tree are the constituent failure modes of the component. If a failure mechanism modeling approach is to be used, it needs to be applied to all failure mechanisms in order for the assessment to quantify the reliability of the entire system.
2.2. Identify the Purpose of the Model Perhaps the single most important factor contributing to a successful reliability assessment is an unambiguous definition of the specific purpose to be accomplished in the assessment. Only by knowing the purpose of an assessment can an appropriate methodology be selected. If the purpose is not made clear, there is little chance that the assessment will be successful. In the author’s opinion, this unclear definition of purpose is the root cause of many of the controversies found in the reliability discipline over the last twenty years as to selecting and using the appropriate approach. All of the approaches described in this book have merit. All have their strengths and weaknesses. A successful assessment will leverage the strengths of specific methodologies toward the specific goals of the assessment. Toward this end, the intent of this section (and the following sections) is to provide guidance on the applicable approaches for specific assessment purposes. A breakdown of the possible purposes for developing a reliability model is shown in Figure 2.2-1.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 22
Chapter 2: General Assessment Approach Purpose of model
Risk assessment Reliability demo Observed failure
Anticipated failure
Determine if minimum robustness is achieved
Input to FMEA/FTA for ID of failure cause priority Compare competing designs
Determine if reliability rqmt is achieved PM schedules Warranty cost predictions
Model Reliability growth
Maintainability
Design aid
Determine feasibility of meeting rel. rqmt
Determine fault tolerance, redundancy
Determine impact of factors on reliability
Allocate maintenance personnel
Spares allocation
Determine screening rqmt
Determine testability requirements
Figure 2.2-1: Breakdown of Potential Reliability Modeling Purposes Each of these purposes is described in Table 2.2-1.
Reliability Information Analysis Center 23
Chapter 2: General Assessment Approach
Table 2.2-1: Reliability Assessment Purposes Purpose Anticipated failure Risk Assessment Observed failure
Input to FMEA/FTA for ID of failure cause priority Compare competing designs
Design Aid
Model reliability growth Determine feasibility of meeting reliability rqmt. Determine impact of factors on reliability (derating) Determine screening rqmt.
Reliability Demo
Determine if minimum robustness is achieved Determine if reliability rqmt. is achieved Warranty cost predictions Preventive Maintenance (PM) schedules
Maintainability Spares allocation Allocate maintenance personnel
Description Risk assessments are performed to quantify the reliability of critical- or safety-related failure modes before the product is fielded. This is often done to meet industry or customer requirements. Risk assessments are performed on fielded products that experience failures. Factors that usually need to be quantified are (a) determination of the root cause, (b) lifetime, (c) percent failure at a given time, (d) the percent of the population at risk (i.e. whether the root cause is special cause or common cause), (e) whether the defect is lot- or batch-related, (f) whether the defective portion can be contained, and (g) what the reliability will be as a function of the level of corrective actions (for example, 1 – if nothing is done; 2 - if a complete recall is done, and 3 - an approach in between). Techniques such as FMEA and FTA are used to assess and prioritize failure causes. Part of this prioritization includes the identification of the probability of occurrence, either qualitatively or quantitatively. For this purpose, reliability modeling is performed to quantify the relative reliabilities of several competing designs. This analysis is then used as one criterion from which the final design is chosen. In this case, reliability is only one of the factors to be accounted for in this comparison, and needs to be traded off against all of the other factors. A natural part of the development process is to grow the reliability to a point that it meets its reliability requirement. For this purpose, the reliability metric of choice is quantified as a function of time. This provides Program Management with the information to assess the reliability status of the project and to estimate the date at which the requirements will be met. In many cases, reliability requirements are levied upon suppliers and contractors. For this purpose, the reliability assessment is performed to determine if there is a reasonable probability of achieving the reliability requirements. If it is highly likely that requirements cannot be met, then management must make decisions regarding the future of the program. For this purpose, the effects of specific factors are assessed. For example, the effects of temperature may be assessed to determine how much cooling is required. This purpose relates to quantifying reliability as a function of possible screening options, so that it can be determined which screening options will result in the reliability requirements being met. This purpose is to provide quantitative data that proves, within acceptable confidence limits, that predefined robustness levels are achieved. These robustness levels usually correspond to a “qualification” requirement, and may not be highly correlated to field reliability. This purpose is to provide quantitative data that proves, within acceptable confidence limits, that the reliability requirements are met. For this purpose, the assessment is performed so that the costs associated with warranty repairs or replacements can be estimated. The assessment is performed so that effective preventive maintenance schedules can be derived. For repairable systems, since the replacement of failed items requires the availability of spare items, the question of how many spares to keep on hand inevitably arises. The reliability characteristics of the item is one piece of information required. Others are repair rates, a reliability block diagram, etc. For repairable systems, organizations need to determine the personnel required to keep up with maintenance demands. One input to this is the frequency of various types of failures
1-only for the specific failure causes modeled
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 24
Chapter 2: General Assessment Approach
Specific reliability modeling purposes are generally suited to specific program phases, as summarized in Table 2.2-2. Table 2.2-2: Program Phase vs. Reliability Assessment Purpose Stage Purpose Concept Risk Assessment
Design Aid
Reliability Demo
Maintainability
Anticipated failure Observed failure Input to FMEA/FTA for ID of failure cause priority Compare competing designs Model reliability growth Determine feasibility of meeting reliability rqmt. Determine impact of factors on reliability (derating) Determine screening rqmt. Determine if minimum robustness is achieved Determine if reliability req. is achieved Warranty cost predictions PM schedules Spares allocation Allocate maintenance personnel
x
Development
Early Production
Production
Deployment
x x
x
x
x
x x x
x
x x x x
x
x
x
x
x
x
x
x x
x x
x x
x x
x
x
x
x
2.3. Determine the Appropriate Level at Which to Perform the Modeling The first thing to determine is the hierarchical level at which the assessment will be performed. A generic hierarchy is shown below: Reliability Information Analysis Center 25
Chapter 2: General Assessment Approach
System Subsystem Assembly Component Failure Modes (Root) Failure Causes/Mechanisms (Root) 2.3.1. Level vs. Data Needed
Traditional handbook approaches for reliability predictions will generally be applied at the component level. In this case, a failure rate is estimated for each component, based on the factors accounted for in the specific model used. In some cases, this predicted failure rate will be apportioned amongst the component’s failure modes in a FMEA (if the MIL-STD-1629 method is used, in which the criticality is determined by the modal failure rate, i.e., the component failure rate multiplied by the failure mode percentage of occurrence). This approach can be used based on readily accessible data, such as that found in the handbooks. This approach also allows for the estimation of a failure rate associated with each “failure severity”. This is accomplished by adding the failure rates for the failure modes that result in a specific “severity” call of failure. If the level to be analyzed is failure causes, then additional detailed data and information is required. Therefore, the practicality of obtaining the required data must be a consideration when choosing an appropriate approach. The degree of difficulty of obtaining required data generally increases as you go lower in the hierarchy. This concept is illustrated in Figure 2.3-1.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 26
Chapter 2: General Assessment Approach
Data required for reliability assessment
System level System
Subsystem Parts lists Environmental conditions Part stresses
Assembly
Component
Failure mode distributions Yield Defect density Internal part stresses & distributions
Failure modes
Failure causes/mechanisms
Figure 2.3-1: Typical Data Requirements vs. Level of Hierarchy As shown in Figure 2.3-1, the data required for the assessment of specific failure causes can be factors like yield, defect density, internal part stresses and distributions. Because these are factors often difficult to obtain by outside organizations, the best approach is generally to have the manufacturer assess the reliability of the causes in the event that the selected approach requires this sort of data. The appropriate approaches for a reliability assessment will, therefore, generally depend on the location of a company’s product in the hierarchy of the product or system.
Reliability Information Analysis Center 27
Chapter 2: General Assessment Approach
2.3.2. Using an FMEA as the basis for a reliability model
A FMEA can be an effective tool in identifying specific root failure causes that need to be quantified in a reliability model. A generic FMEA approach is shown in Figure 2.3-2. System Hierarchy How functions can fail Functions Functions
Failure effects Failure modes
Identify failure causes
Occurrence Detectability
Risk Priority Number (RPN)
Improve design
Figure 2.3-2: The Basic FMEA Approach The hierarchical relationship between cause, mode and effect is shown in Figure 2.3-3. For example, a failure mode can have any number of potential effects, and also can have any number of potential causes.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 28
Chapter 2: General Assessment Approach Failure Effect #1
Failure Effect #2
Failure Effect #3
Failure Mode
Cause #2
Cause #1
Cause #1b
Cause #1a
Cause #1c
Cause #2a
Cause #2a1
Cause #4
Cause #3
Cause #2b
Cause #3a
Cause #3b1
Cause #2a2
Cause #3b
Cause #4b
Cause #4a
Cause #3b2
Figure 2.3-3: Hierarchical Relationship Between Cause, Mode and Effect If the reliability assessment is to be performed at the failure cause level, then all possible causes need to be identified. One of the FMEA objectives is to identify all conceivable failure causes. One way to accomplish this is to identify all combinations of initial conditions, stresses and mechanisms, as illustrated in Figure 2.3-4 and Table 2.3-1.
Initial conditions Defect Free
Mechanism
Stresses
Defects Operational
Intrinsic
Extrinsic
Environmental
Mechanical
Electrical
Figure 2.3-4: Approach to Identifying Causes Reliability Information Analysis Center 29
Chemical
Chapter 2: General Assessment Approach
Table 2.3-1: Examples of Initial Conditions, Stresses and Mechanisms Defect Free Intrinsic Initial Conditions
Defects Extrinsic Operational
Stresses Environmental
Electrical
Mechanism
Mechanical
Chemical
Voids Material property variation Geometry variation Contamination Ionic contamination Crystal defects Stress concentrations Organic contamination Nonconductive particles Conductive particles Contamination Ionic contamination Thermal Electrical Chemical Optical Chemical exposure Salt fog Mechanical shock UV exposure Drop Vibration Temperature-high &low Temperature cycling Humidity Pressure – low &high Radiation – EMI, cosmic Sand and dust Electromigration Dielectric breakdown Dendritic growth Tin whiskers Electro-thermo-migration Second breakdown Metal fatigue Stress corrosion cracking Melting Creep Warping Brinelling Fracture Fretting fatigue Pitting corrosion Spalling Crazing Abrasive wear Adhesive wear Surface fatigue Erosive wear Cavitation pitting Stress corrosion cracking Elastic deformation Material migration Cracking Plastic deformation Elastic deformation Brittle fracture Expansion Contraction Emod change Outgas Corrosion Chemical attack Fretting corrosion Oxidation Crystallization
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 30
Chapter 2: General Assessment Approach
One of the keys to a successful FMEA is to understand the relationship between cause, mode and effect. In general, there is a natural tiering effect that occurs in an FMEA as a function of the product or system level, as illustrated in Table 2.3-2. For example, at the most basic level, the part manufacturing process, the cause of failure may be a process step that is out of control. The ultimate effect of that cause becomes the failure mode at the part level, the failure effect of the part becomes the failure mode at the next level of assembly, and so forth. It is very important that the cause, mode and effect are not confounded in the analysis. Table 2.3-2: Relationship Between Cause, Mode and Effect. System
Assembly
Part
Part Manufacturing Process
Effect Mode
Effect
Cause
Mode
Effect
Cause
Mode
Effect
Cause
Mode Cause
More detail regarding an FMEA approach is provided in Chapter 8. Figures 2.3-5 through 2.3-8 illustrate, with fault trees, how the relationship between cause, mode and effect scale up or down the product or system hierarchy, depending on the hierarchical level at which the analysis is to take place. In this example, failure “cause” is considered to be at the lowest level at which a modeling effort will occur. If the cause corresponds to a fundamental mechanism of failure (i.e., the mechanism represents the fundamental physical failure of the item), then the term “cause” is considered synonymous with the term “mechanism”.
Reliability Information Analysis Center 31
Chapter 2: General Assessment Approach TOP
OR
AND
AND
AND
OR
VT
AND
AND
OR
AND
OR
OR
OR
OR
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Figure 2.3-5: Fault Tree of Product or System
TOP
OR
AND
AND
AND
OR
VT
Effect
AND
Mode
OR
OR
OR
OR
OR
Event
Event
Event
Event
Event
Event
Event
Cause
Event
Event
Event
Event
Event
Event
Event
Event
Figure 2.3-6: Fault Tree of Product or System with Cause as the Lowest Level
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 32
Chapter 2: General Assessment Approach TOP
OR
Effect
AND
OR
AND
VT
Mode
AND
Cause
OR
OR
OR
OR
OR
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Figure 2.3-7: Fault Tree of Product or System with Cause Above the Lowest Level
Effect
OR
Mode
AND
OR
AND
VT
Cause
AND
AND
OR
OR
OR
OR
OR
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Figure 2.3-8: Fault Tree of Product or System with Cause Two Levels Above the Lowest Level Therefore, if the FTA view of the product or system is to be consistent with the reliability assessment, then the lowest level in the tree must be the level at which reliability estimates are made. The section above describes the hierarchical level at which a reliability model will be developed, whether it be a failure cause, failure mode, a component or an assembly. Once this physical level is determined, there are several model forms possible to Reliability Information Analysis Center 33
Chapter 2: General Assessment Approach
construct a model to describe its reliability. This form will, of course, depend on the specific approach and data used to develop the reliability model. Some of these forms are described below. More detail on each of these is provided in subsequent sections. 2.3.3. Model Form vs. Level
The form of the model to be developed will depend on the level and the approach. For example, if empirical data is used directly without a model developed from it, assuming constant failure rate, the best estimate of the failure rate is simply:
λ=
Failures operating time
If a life model is developed from life tests performed at various stress levels, the result will be a time-to-failure (TTF) distribution (described by the Weibull, lognormal or other statistical distributions) that is a function of stress levels. If a Weibull distribution is used, the general model will be:
R(t ) = e
⎛t ⎞ −⎜ ⎟ ⎝α ⎠
β
If models are to be derived from the analysis of field data, there are several possible model forms. Traditional methods of reliability prediction model development have included the statistical analysis of empirical failure rate data. When using multiple linear regression techniques with highly variable data (which is often the case with empirical field failure rate data), a requirement of the model form is that it be multiplicative (i.e. the predicted failure rate is the product of a base failure rate and several factors that account for the stresses and component variables that influence reliability). An example of a multiplicative model is as follows:
λ p = λbπ eπ qπ s where: λp = λb = πe = πq = πs =
Predicted failure rate Base failure rate Environmrntal factor Quality factor Stress factor 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 34
Chapter 2: General Assessment Approach
However, a primary disadvantage of the multiplicative model form is that the predicted failure rate value can become unrealistically large or small under extreme value conditions (i.e., when all factors are at their lowest or highest values). This is an inherent limitation of multiplicative models, primarily due to the fact that individual failure mechanisms, or classes of failure mechanisms, are not explicitly accounted for. Another possible approach to model reliability is to segment the failure rate for each group of failure causes that are accelerated by stresses incurred during specific portions of a mission. Each of these failure rate terms are then accelerated by the appropriate stress or component characteristic. This is the model form used in the RIAC 217Plus methodology. This model form is as follows;
λ p = λ oπ o + λ eπ e + λ cπ c + λ i + λ sj π sj where: λp = predicted failure rate λo = failure rate from operational stresses πo = Product of failure rate multipliers for Operational Stresses λe = failure rate from environmental stresses πe= Product of failure rate multipliers for Environmental Stresses λc = failure rate from power or temperature cycling stresses πc = Product of failure rate multipliers for Cycling stresses λI = failure rate from induced stresses, including electrical overstress and ESD λsj = failure rate from solder joints πsj = Product of failure rate multipliers for solder joint stresses The concept of this approach is that the occurrence of each group of failure causes is mutually exclusive, and their failure rates can be modeled separately and summed. By modeling the failure rate in this manner, factors that account for the application and component-specific variables that affect reliability (π factors) can be applied to the appropriate additive failure rate term. Additional advantages to this approach are that they: o Address Operating-, Non-Operating- and Cycling-related Failure Rates in an additive model. These individual failure rates are weighted in accordance with the operational profile (duty cycle and cycling rate). The Pi factors modify only Reliability Information Analysis Center 35
Chapter 2: General Assessment Approach
the applicable failure rate term, thereby eliminating many of the extreme value problems that plague multiplicative models. o Are based on observed failure mode distributions, so that observed component root failure causes are empirically modeled o Can be tailored with test data (if available) by applying it in a Bayesian fashion to the appropriate failure rate term. As examples, temperature cycling data can be combined with the failure rate from power or temperature cycling stresses (λc), or high temperature operating life can be combined with the failure rate from operational stresses term (λo).
2.4. Assess Data Available A predominant factor that will dictate the options that an analyst has in modeling the reliability of a product is the availability of data. The analyst should consider the following questions when assessing the availability of test data: • • •
•
Is field data available on the specific product or system? Is data on a similar product or system available? If so, is it field data or test data? If data is available, is it: o Relevant? o Of sufficient quantity? o Of sufficient quality? If physics-based models are to be employed, is the required detailed data and information available, such as: o Defect rates o Material properties (e.g., functional characteristics) o Defect (flaw) distributions o Material variation quantification (e.g., purity, yields, dimensions) o Etc.
Perhaps the most important element of a reliability program is the reliability testing of the product. Reliability test data is, in turn, a critical element for assessing reliability. In this context, a reliability test consists of two primary elements: measurement and exposure. The measurement is the means of assessing the performance of the product or system relative to its requirements. It usually consists of quantifying parameters that are specifiable attributes. It may include both continuous variables (i.e. gain, power output, etc.) or attribute data (i.e. a binomial representation of whether a product possesses an attribute or not). Exposure is the application of a stress or stresses. These stresses may consist of operational stresses or environmental stresses. Operational stresses are defined as those stresses to which the product will be exposed by the act of operating the product. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 36
Chapter 2: General Assessment Approach
For example, a transistor is designed to have a voltage applied, and pass a given amount of current. As such, these are operational stresses. It will also be exposed to externally applied environmental stresses such as temperature, temperature cycling, vibration, etc. Reliability tests can be performed either by sequentially performing repeated cycles of a measurement, exposure, measurement, etc., or by continuously measuring performance parameters in-situ during exposure. It is usually desirable to perform in-situ measurement so that times to failure can be accurately determined. In practical cases, however, it is not always feasible due to the complexities of setting up such measurement capabilities. If repeated cycles of a measurement, exposure, and measurement are used, the measurement intervals should be frequent enough so that sufficient resolution in the times-to-failure data is available. Practical considerations for assessing the feasibility of testing products are: • • • •
Are samples available? If so, are they available in sufficient quantity? Are measurement systems available for continuous, in-situ, measurements during exposure? If not, repeated cycles of a measurement and exposure may be required. Are laboratory facilities available to perform the exposure? Are the measurement and exposure facilities available to support a multi-cell test at various stress levels (i.e., application of various combination of stresses)?
Additional considerations for testing products and systems are provided in Chapter 5.
Reliability Information Analysis Center 37
Chapter 2: General Assessment Approach
2.5. Determine and Execute Appropriate Approach This section discussed the various options that an analyst has to predict, assess, and estimate the reliability of a product. Figure 2.5-1 illustrates the breakdown of various approaches.
Figure 2.5-1: Breakdown of Reliability Assessment Options Table 2.5-1 describes the approach, it strengths and its weaknesses. This information is presented in the context of the intent of this book, which is to present options for quantifying the reliability of a product as it is used by customers in actual use conditions. Effective techniques also include using a combination of the approaches in this section. The manner in which these approaches can be combined will be addressed in Section 2.6.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 38
Chapter 2: General Assessment Approach
Table 2.5-1: Summary of Reliability Assessment Options Approach
Description
1 Highly Accelerated Life Test (HALT)
Exposure to severe levels of thermal cycling and vibration
Strengths Can quickly identify failure causes that are accelerated by thermal cycling and vibration
Only accelerates specific failure causes accelerated by the test
Accounts for the interaction of the two stresses
Large extrapolations to use conditions is required
Can be used as a screening basis
Can excite non-relevant failure modes (i.e., those that are not representative under field environmental conditions)
Reflects the actual reliability Test data can be collected and applied before the system is fielded
2 Qual
Exposure to industry standard “trade and commerce” tests
Weaknesses
Can demonstrate a degree of robustness to the specific qualification tests Reflects the actual reliability Test data can be collected and applied before the system is fielded
Cannot quantify special cause failure modes Correlation to field use conditions is difficult Can excite non-relevant failure modes (i.e., those that are not representative under field environmental conditions) Cannot quantify special cause failure modes
Can accurately model lifetime due to common cause mechanisms Can quantify acceleration factors
3 DOE multicell
Life tests under a variety of stress levels
Can estimate reliability at use conditions Reflects the actual reliability
4 Reliability demo
5 Reliability demo
6 Field data – same product
Demonstration of reliability via life tests at accelerated conditions
Demonstration of reliability via life tests at non-accelerated conditions
Use of field experience data on the product or system under analysis
Can be expensive to execute Difficult to quantify special cause failure modes due to large sample sizes sometimes required
Test data can be collected and applied before the system is fielded Can demonstrate required reliability in a statistically significant way Reflects the actual reliability
Correlation to field use conditions is difficult
Test data can be collected and applied before the system is fielded Can demonstrate required reliability in a statistically significant way Reflects the actual reliability
Large sample sizes usually required
Test data can be collected and applied before the system is fielded The most representative data Can quantify failure causes that exhibit low percent failures
Usually, the data is not available in time for use in product or system development Collecting field data is prone to errors
Reliability Information Analysis Center 39
Chapter 2: General Assessment Approach
Table 2.5-1: Summary of Assessment Options (continued) Approach
Description
Strengths Can be reasonably sensitive to various stresses Represents field use
7 Models
Models developed from field experience data on similar products
Can be a good indicator of field reliability performance Based on easily obtainable data Easy to use Can quantify failure causes that exhibit low percent failures Represents field use
8 Raw data (EPRD, NPRD)
The direct use of field experience data on similar products
Easy to use Can quantify failure causes that exhibit low percent failures
Weaknesses Difficult to keep updated Actual failures are impacted by factors not considered by the model Models become outdated by new technology Misapplication of models by the analyst No uncertainty estimates available Difficult to collect good quality field data Difficult to distinguish correlated variables(i.e. quality and environment) Extrapolations to specific use conditions required Not feasible to collect data representing all conceivable situations Difficult to account for material defects
9 Stress/Strength modeling
Calculation of failure probabilities based on the strength distribution and the stress distribution
Good approach for fundamental material behavior Can model fatigue behavior Models specific failure mechanisms Valuable for predicting end-of-life for known failure mechanisms
May require information that’s difficult to obtain Difficult to use for estimating field reliability Can be complex and costly to apply Difficult to use for modeling defect-driven failure mechanisms Not practical to use for the assessment of an entire system Can only be applied in rare cases In practice, difficult to derive fundamental equations
Scientifically robust
10 First principals
Calculation of failure probabilities based on a fundamental understanding of the physics of the failure cause
Good approach for fundamental material behavior Can model fatigue behavior Models specific failure mechanisms Valuable for predicting end-of life for known failure mechanisms t
Empirical data is usually required to validate the model, or to estimate model constants Difficult to account for material defects May require information that’s difficult to obtain Can be complex and costly to apply Difficult to use for modeling defect-driven failure mechanisms Not practical to use for the assessment of an entire system
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 40
Chapter 2: General Assessment Approach
Selecting a methodology The various approaches summarized here are suited to various program phases, corresponding to prediction, assessment and estimation. This is shown in Table 2.5-2 (note that the shaded area indicates where the approach can be applied). For example, MIL-HDBK-217 should only be used for prediction, meaning that its usefulness is limited for assessment and estimation. Conversely, 217Plus was designed to provide a framework for all three reliability modeling phases.
Accelerated Test NonAccelerated Same product
Empirical Field data
Physics
Similar product
Estimation
Approach
Assessment
Prediction
Table 2.5-2: Relevancy of Approach to Prediction, Assessment and Estimation
HALT Qualification DOE multicell Reliability demo Reliability demo
Models 217Plus MIL-HDBK-217 Bellcore Raw data (EPRD, NPRD)
Stress/Strength modeling First principals
The appropriate approach(es) to modeling reliability will depend on several factors, including: •
The severity of product failure. In this context, severity can mean that there are significant financial ramifications of failure, that there are safety-related risks, or that the system is not maintainable. For all of the reasons that high reliability may Reliability Information Analysis Center 41
Chapter 2: General Assessment Approach
• •
be required in the first place, are the same reasons that the reliability model must be acceptably accurate. Since reliability is a stochastic process, reliance on any one of the methodologies discussed in this book is susceptible to uncertainties. Sometimes these uncertainties can be very large. This is true for any of the methods. If, however, several methodologies can be employed, and their results are consistent with each other, then this adds much more credibility to the modeled reliability of the product. This is especially true if a physics approach is coupled with an empirical approach. The amount and level of detailed information available to the analyst. Often, this will dictate the available choices for the analysis. Complexity of the product. If the product or system is very complex, has many levels of indenture, and there is a complex supply chain involving many suppliers, then the available suitable choices for analysis at the top of the supply chain will be limited. For example, as discussed previously, it is very difficult to obtain the data required to utilize one of the physics approaches by organizations higher in the supply chain. If, however, the entire supply chain utilizes the PoF approach for the product or system, it can be a viable approach.
Table 2.5-3 provides general guidance on the identification of appropriate approaches based on the purpose of the assessment. If empirical data is to be used as a basis for one or more of the approaches, there are various factors that will influence the uncertainty in assessments made with this empirical data. These include the following data attributes: Relevancy – how close is the product or system architecture and complexity on which data is to be used to the item under analysis Quantity – this pertains to the statistical uncertainty of reliability estimates based on the quantity of data. For example, if the TTF distribution is exponential, this uncertainty is usually modeled with the Chi-squared distribution. Quality – this pertains to the accuracy inherent in the data itself. For test data, the accuracy is generally much better than with field data, since test data is usually much better controlled with known sample sizes, failure times, etc. Field data, on the other hand, is usually fraught with many problems and sources of uncertainty. This will be discussed in Section 5.2.1.2.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 42
Chapter 2: General Assessment Approach
Table 2.5-3: Identification of Appropriate Approaches Based on the Purpose Approach Empirical
Physics
Reliability demo
Maintainability
x
x
x x
Raw data
x x
x
x
x
x
x x
x x
x x
x
x
x
x
x
x
x
x
x
x
1
x
Models
Same product
NonAccelerated Reliability demo
Reliability demo
DOE multicell x x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x x
x x
x x
x
x
x
x
x
x
x
x
x
First principals
Design aid
Anticipated failure Observed failure Input to FMEA/FTA for ID of failure cause priority Compare competing designs Model reliability growth Determine fault tolerance, Determine redundancy feasibility of Determine meeting rel req. testability requirements Determine impact of factors on reliability (e.g. derating) Determine screening req. Determine if minimum robustness is achieved Determine if rel. req. is achieved Warranty cost predictions PM schedules Spares allocation Allocate Maintenance personnel
Stress/Strength modeling
Risk assessment
Qualification
HALT
Purpose
Similar product
Field data
Accelerated
Test
x x
x x
x
x
x
x
1
1
1
1
x
x
x
1 – for those failure causes addressed by the approach Reliability Information Analysis Center 43
x
Chapter 2: General Assessment Approach
Relevancy is a function of the type of data that is available, and the product or system on which that data is available. To further address the relevancy issue for assessments made with empirical data, consider the information in Table 2.5-4, which summarizes the various attributes of empirical data. This notion is valid, regardless of the level of assembly, ranging from root failure causes to the system level. Table 2.5-4: Ranking the Attributes of Empirical Data Type of data Field
Product or System on which data is available
Same
Similar
Same mfg/process Different mfg/process Same mfg/process Different mfg/process
Same environment Best
Different environment
Same stress
Test Different stress
Worst
There has been much information published in the literature comparing and contrasting empirical and physics-based models. However, they are not mutually exclusive methodologies. For example, empirical models generally utilize PoF principals in their derivation, and PoF models utilize empirical data in their derivation and parameter estimation. The majority of component field failures are a result of special causes. These causes may be an anomaly in the manufacturing process, an application anomaly, or a host of other assignable causes. They are rarely the result of a common cause failure mechanism, which can generally be modeled by life modeling techniques. Guidelines and examples are provided in the following sections for each of the approaches. 2.5.1. Empirical 2.5.1.1. Test
Testing product ors system reliability is performed for many reasons, including: 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 44
Chapter 2: General Assessment Approach
• • • • • • • • • • • •
Quantifying reliability (infant mortality, wear-out) Demonstrating reliability Growing reliability Lot acceptance Developing screens Performing screens Determining the limits of the technology Determining stress bounds for subsequent tests Determining predominant accelerating stresses Identifying “weak points” in the design Identifying failure causes Demonstrate compliance to industry standard qualification tests
An important consideration for all of the tests described above is the definition of “failure”, i.e., the "failure criteria" that will be used to determine if a product passes or fails. Industry guidelines, specifications, or an understanding of end-use application tolerances are often used to set pass/fail criteria. A common form of empirical testing is the performance of qualification tests. “Qualification” is usually defined as demonstrating that a product will meet performance requirements in its intended application, as used by customers, over the expected lifetime of the product. There are two primary elements to performance qualification: Validation and Verification (Reference 1), as follows: Validation – Confirmation by examination and provision of objective evidence that the particular requirements for specific intended use are fulfilled. Verification – Confirmation by examination and provision of objective evidence that specified requirements have been fulfilled. Therefore, for a product or system to be considered fit for use for a specific application, it must conform to the requirements of its specification over its intended life (verification) and the specification must adequately capture the requirements of the end user (validation). The various elements of qualification are illustrated in Figure 2.5-2.
Reliability Information Analysis Center 45
Chapter 2: General Assessment Approach
Qualification
Verification
Validation
Specification Compliance
Reliability Testing
EVT DV PVT
Root cause analysis and corrective action
Figure 2.5-2: Qualification Concepts and Terminology Verification ensures that the product or system meets the specified requirements both initially (specification compliance) and over its intended lifetime (reliability testing). Specification compliance ensures that, at the beginning of its lifetime, the item meets specified performance requirements and that the distribution of performance parameters over the population of items is within acceptable limits. Reliability testing ensures that the product is robust and that it meets the specified performance requirements over its intended lifetime. Reliability testing consists of several test phases, each of which has its own purposes and approaches. The testing sequence can be grouped into three categories: Engineering Verification Tests (EVT), Design Verification Tests (DVT), and Production Verification Tests (PVT). These are further explained below, along with their relationship to the establishment of a life model, the prediction of product reliability and the various elements of each test approach. This is provided specifically to highlight how reliability testing can be used in a reliability program.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 46
Chapter 2: General Assessment Approach
Engineering Verification Tests (EVT) are intended to identify and assess “high risk” critical items so that corrective action can be taken, if necessary. The intent is to uncover weaknesses or to identify product capability, not to pass a set of predefined tests, as is the case with traditional qualification testing. The purpose and approach of these tests are described in Table 2.5-5. These tests are also used to identify the maximum stress capability of a product, which is a prerequisite for developing a complete test plan (in DVT) to assess lifetime. Step stress tests are often used for this purpose, and the results can support establishment of an upper bound on subsequent test stresses. One of the primary purposes of DVT testing is to provide the data required to develop life models. Often, there are multiple accelerating stresses, in which case life tests must be conducted for various stress combinations. Design of Experiments (DOE) is used to develop an effective and cost-efficient test plan. DOE concepts, as they pertain to reliability testing, are covered in Chapter 4. PVT tests demonstrate that the robustness of production units is equivalent to that of the EVT/DVT samples. Whereas EVT and DVT demonstrate the intrinsic robustness, PVT demonstrates the “as-built” robustness. Table 2.5-5: EVT, DVT and PVT Purpose and Approach Test Program Element
EVT
DVT
PVT
Purpose Determine limits of the technology Determine stress bounds for subsequent tests Determine predominant accelerating stresses Identify “weak points” in the design Identify failure causes Quantify elements of the bathtub curve (infant mortality, wear-out) so that effective screens can be developed Provide data to assess product lifetime under various combinations of stresses Verify the robustness of production units are as good as EVT/DVT samples
Approach
Relative Sample Sizes Required
Test to failure Step stress (to determine limits) Test a broad range of stressors to determine the stresses that accelerate predominant failure causes
Low
Development of a life model that estimates time to failure as a function of pertinent accelerants Use DOE to design statistically valid life tests, and perform long term life tests using stresses that will be experienced in the intended application Test relatively small samples of parts using a broad range of stressors. These are traditional “qualification” tests
Reliability Information Analysis Center 47
High
Low
Chapter 2: General Assessment Approach
Some reliability practitioners choose to separate qualification tests from reliability tests. In this case, reliability tests are those that have a purpose similar to the DVT tests. The reason for separation is that the reliability tests are more of an engineering test that are not dictated by industry standards. As such, the results may or may not be shared with customers. Likewise, qualification tests are required and thus shared with customers to demonstrate compliance. Root cause analysis and corrective action A critical part of any reliability program is the ability to learn from failures and improve the product or system. Failure analysis is performed to ensure that the root cause is identified and understood, and corrective actions are implemented and verified. This is done throughout product development, including EVT or DVT and PVT. With a product that is comprised of a number of subassemblies, there is a time offset between EVT, DVT or PVT tests performed on components of the product or system and those tests performed on the end item. This is illustrated in Figure 2.5-3.
Time
Component (EVT)
(DVT)
Early Screening
Prequalification
Ongoing Reliability Test (ORT)
(PVT)
Full qualification
(EVT)
(DVT)
Early Screening
Prequalification
Mass Production
(PVT)
Full qualification
Ongoing Reliability Test (ORT) Mass Production
Assembly Figure 2.5-3: EVT, DVT and PVT Relationships
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 48
Chapter 2: General Assessment Approach 2.5.1.1.1.
Non-Accelerated
Nonaccelerated reliability tests are those in which samples are tested in a manner that recreates the use conditions the product will experience in its intended use environment as used by customers. These tests may be performed for several reasons: 1. To uncover any unexpected failure causes 2. To demonstrate that the product meets its reliability requirement Generally, if the purpose is #1, a more effective way of acheiving this is with accelerated testing, discussed in the next section of this book. If the purpose is #2, then concepts of reliability demostration can be used, as discussed in the next section. 2.5.1.1.2.
Reliability Demonstration
The fundamental concept of reliability demonstration is the following:
1 − CL = R This is essentially a hypothesis test in which the hypothesis is that the true product reliability is “R” or greater. For example, consider a case in which the reliability requirement is 0.95 at 5000 hours, and the desired confidence level is 0.80 (80%). In this case, the implied failure rate is 0.0000103 failures per hour. If the hypothesis is true and the test is run such that there less than a 20% probability of experiencing the observed number of failures (or fewer), then the analyst can be 80% certain that the reliability requirements have been met. Table 2.5-6 summarizes the probability as a function of the number of failures and cumulative operating time. The values in the cells are the Poisson probability that there will be “F” or fewer failures, under the hypothesis that the true failure rate is 0.0000103 (failures per hour). In this example, if the test can be run until 200,000 hours are accumulated, with no failures, then the test is passed and the hypothesis is verified. This is the first opportunity to pass the test, as this is the shortest time at which the Poisson probability falls below 0.20 (i.e., 0.13). In this example, 0.20 is the risk of concluding that the failure rate is less than 0.0000103 when it is not. The test is run until the number of failures and time combinations falls either above or below the shaded red area. If it falls above the red area, then the null hypothesis is confirmed (that the failure rate is greater than the required). If it falls below the red area, the hypothesis is confirmed. If the combination of hours and failures remains in the red area, the hypothesis cannot be confirmed or denied, and further testing is required. Reliability Information Analysis Center 49
Chapter 2: General Assessment Approach
The probability values are generally calculated from the binomial or Poisson distributions, depending on whether the probability is time-based (Poisson) or attributebased (binomial). Poisson is used in the case of constant failure rates. Table 2.5-6: Reliability Demonstration Example
250
300
350
400
450
500
550
600
800
200
750
150
700
100
650
50
Number of Failures
Cumulative operating time (in thousands of hours)
10
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
0.99
0.98
0.97
0.95
0.92
0.89
0.85
0.79
9
1.00
1.00
1.00
1.00
1.00
1.00
1.00
0.99
0.98
0.96
0.94
0.90
8
1.00
1.00
1.00
1.00
1.00
1.00
0.99
0.98
0.95
0.92
0.88
0.83
0.86
0.81
0.75
0.69
0.77
0.71
0.64
7
1.00
1.00
1.00
1.00
1.00
0.99
0.97
0.94
0.90
0.85
0.79
0.72
0.56
0.65
0.57
0.50
0.42
6
1.00
1.00
1.00
0.99
0.98
0.96
0.93
0.88
0.82
0.74
0.66
5
1.00
1.00
0.99
0.98
0.95
0.91
0.85
0.77
0.68
0.59
0.50
0.58
0.50
0.42
0.35
0.29
0.42
0.35
0.28
0.22
4
1.00
1.00
0.98
0.94
0.88
0.80
0.71
0.61
0.51
0.42
0.17
0.34
0.26
0.21
0.16
0.12
0.09
3
1.00
0.98
0.93
0.85
0.74
0.63
0.52
0.41
0.32
2
0.98
0.91
0.80
0.66
0.53
0.41
0.30
0.22
0.16
0.25
0.19
0.14
0.10
0.07
0.05
0.04
0.11
0.08
0.06
0.04
0.03
0.02
1
0.91
0.73
0.54
0.39
0.27
0.19
0.13
0.08
0.01
0.06
0.04
0.02
0.02
0.01
0.01
0.00
0.00
0
0.60
0.36
0.21
0.13
0.08
0.05
0.03
0.02
0.01
0.01
0.00
0.00
0.00
0.00
0.00
0.00
The Microsoft EXCEL® functions for these are: =1-BINOMDIST(x,y,z,TRUE), where BINOMDIST(number_s,trials,probability_s,cumulative) =POISSON(x,y,TRUE), where POISSON(x,mean,cumulative)
2.5.1.1.3.
Accelerated Testing
Accelerated testing is an enormous part of a reliability program. It is used for many purposes, including: • • • •
Identification of failure causes Qualification Life characterization Reliability demonstration
One of the critical aspects of accelerated testing is the degree to which acceleration takes place. Consider the situation depicted in Figure 2.5-4. The reliability requirement, in terms of lifetime in this example, will be specified at a specific stress condition. If tests are performed at the accelerated conditions of Test 1, there will be some extrapolation to 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 50
Chapter 2: General Assessment Approach
lifetimes at use conditions (if the purpose is to quantify life). If tests are performed at the accelerated conditions of Test 2, there will be additional extrapolation to lifetimes at use conditions. Life modeling is the means of performing this extrapolation, and will be covered in Section 2.5.1.1.2.3 and Chapter 5.
Figure 2.5-4: Acceleration Levels The larger the extrapolation distance, the larger the uncertainty in the reliability estimate at use conditions. This is illustrated in Figure 2.5-5.
Reliability Information Analysis Center 51
Chapter 2: General Assessment Approach
Figure 2.5-5: Uncertainty in Extrapolation The relevancy of failure causes must be considered when using accelerated test data to model product or system reliability in field deployed conditions. For example, if failures occur in an accelerated test, the questions to be addressed are: 1. Can the failure cause occur under field conditions? Or has it been induced by the test? 2. If the failure cause is relevant, can its reliability characteristics be scaled to field use conditions with an acceleration model? For example, consider several scenarios illustrated in Figure 2.5-6. Case 1 illustrates the situation in which the failure cause observed in accelerated testing is relevant, and its probability of occurrence can be extrapolated to use conditions with an acceleration model. Case 2 illustrates the situation in which the failure cause observed in accelerated testing is not relevant, and its probability of occurrence cannot be extrapolated to use conditions with an acceleration model. Case 2 is representative of a situation in which there is a “threshold” stress, above which the failure cause has been induced by the test. The higher the acceleration, the higher the risk is that Case 2 will occur. For this reason, for the purposes of quantifying reliability under field use conditions, highly accelerated tests (like HALT) must be used with caution. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 52
Chapter 2: General Assessment Approach
Figure 2.5-6: Acceleration levels Alternatives are also available that will cover more of the “life-stress” space, as shown in Figure 2.5-7. This approach is desirable because there is minimal extrapolation to field use conditions, and validity of the acceleration models over a broader stress range can be ascertained.
Figure 2.5-7: Acceleration Alternatives Reliability Information Analysis Center 53
Chapter 2: General Assessment Approach
Another factor to consider in accelerated testing, when used to quantify reliability at use conditions, is the relative probability of occurrence of various failure causes as a function of stress level. Each failure cause will have unique acceleration characteristics as a function of stress, depicted as the slope of the life-stress line. They will also have unique probabilities of occurrence, as depicted as the vertical position of the life-stress line. These factors together indicate that the relative probabilities of the causes require a model for each. This is illustrated in Figure 2.5-8. In this life-stress plot, the slope represents the dependency of life as a function of stress, and the position of the line represents the absolute life. As can be seen, the relative probabilities of the causes will depend on the stress level.
Figure 2.5-8: Relative Lifetime vs. Stress 2.5.1.1.4.
Highly Accelerated Life Test (HALT)
Highly Accelerated Life Test (HALT) is a popular technique in reliability testing. It is useful to achieve very large acceleration factors. HALT is a test methodology that simultaneously subjects an item to highly accelerated levels of thermal cycling and vibration. It can be a useful tool in identifying mechanical design weaknesses. It is a particularly valuable technique in the identification of the weakest area(s) of a new design in the shortest possible time. Therefore, is if often used as a tool to grow the product reliability through a test analyze and fix sequence.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 54
Chapter 2: General Assessment Approach
Such tests can: • •
• • • • •
Provide a means of sampling inspection for incoming component lots Be used for burn-in screening tests. This is called Highly Accelerated Stress Screening (HASS). For HASS, care must be taken to ensure that high levels of the accelerating stress will not damage or remove excessive life from units that are to be put into service. Be used for pilot tests to get information needed for planning a more extensive accelerated life test (ALT) at lower levels of the accelerating variable Be used to assess the relevance of specific failure modes Be used to obtain shorter test times to allow design engineering to remain focused on the product or system (resulting in a highly intensive and uninterrupted engineering effort) Serve as a cooperative workshop that involves both suppliers and the customer(s) Support collaboration between design and test engineers to address design weaknesses
HALT requires a different mindset than "conventional" accelerated testing. One is not trying to predict or demonstrate life, but rather to induce failures of the weakest links in the design, strengthen those links, and thereby greatly extend the life of the design. Root cause failure analyses are conducted and repairs and redesign are carried out, as feasible and cost-effective. Output results from HALT may include a Pareto chart showing the weak links in the design, and design guidance that can be used to create a more robust design. Testing a new design and comparing it against a proven previous generation design using the same accelerated test provides an efficient benchmarking test. Based on HALT results, a determination of "optimum" design characteristics can be made using statistical design of experiments (DOE). A generic HALT process starts with a temperature survey: 1. Start at room temperature 2. Step down temperature to -100°C in 20° increments, with each dwell time long enough to stabilize the product's internal temperature (the thermal rate of change between each temperature transition step should be ~100°C/minute) 3. Step up temperature from -100°C to +40°C at 100°C/min 4. Step up temperature from +40°C in 20°C increments to 100°C or the maximum temperature for the materials involved, with each dwell time long enough to Reliability Information Analysis Center 55
Chapter 2: General Assessment Approach
stabilize the product's internal temperature (the thermal rate of change between each temperature transition step should be ~100°C/minute) Next, a vibration survey is performed: 1. 2. 3. 4.
Begin vibration testing at room temperature Start six-axis random vibration at 5 Grms from 2Hz to 12kHz Step up the vibration level in 5 Grms increments, to a maximum of about 50 Grms Dwell for 10 minutes at each level
The vibration stress is provided by mechanically impacting the table with “hammers”. As such, the frequency spectrum is not truly random, but rather is “pseudorandom”. The purpose of the vibration survey is to detect weakness in the design as a function of the stresses created by the increased vibration levels. A combined environment HALT may also be performed: 1. Superimpose simultaneous temperature cycling from -100°C to +100°C at ~100°C/min of circulating air temperature. Dwell at each temperature only long enough to “semi-stabilize” the internal temperature of the part 2. During temperature dwells, subject the test unit to vibration at 5 Grms 3. During subsequent thermal cycles, step the vibration level up in 5 Grms increments In this example, the vibration is applied during temperature dwells, but if failure causes are possible that are accelerated by vibration stresses during temperature transitions, the stress profile can be modified to apply vibration continuously throughout the temperature cycle. This is a typical stress profile, and will be varied (and should be tailored) based on the limits of the product or system being tested. The purpose of the step-stress temperature test is to detect sensitivity of design functionality to temperature and temperature change rates. The purpose of the combined environment test should highlight weaknesses that result from the interaction effects of simultaneous exposure to temperature and vibration. Quantifying reliability is generally not the objective of HALT. The ability to improve the inherent reliability/robustness of the product or systems design is. However, in some cases it can be used as an indicator of field reliability performance. The fundamental 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 56
Chapter 2: General Assessment Approach
question to address is this: Does the HALT test excite failure causes that the item may experience in the field? The answer to this question will depend entirely on the characteristics of the item under test, and the stresses to which it will be exposed in field use. For example, if the product or system critical failure causes are accelerated by thermal cycling and random vibration, and the item will experience these stresses in the field, then HALT test results may be indicative of field reliability. Likewise, if the product or system critical failure causes are not accelerated by thermal cycling and vibration, and/or the product will operating in a benign environment, then the HALT results will provide very little information regarding field reliability. 2.5.1.1.5.
Qualification Testing
Qualification testing is a term used to describe a series of tests that a product or system must be exposed to, and pass, for it to be considered “qualified” by the industry or standards body governing the qualification requirements. Several examples of qualification requirements are provided in Tables 2.5-7 and 2.5-8, for an assembly, and for a laser diode component, respectively. Table 2.5-7: Example of a Qualification Plan for an Assembly Group/Test
SS/Failures
Group 1 Test Set Impact (packaged w/ mates removed for test)
Cat A: 30" Drop (nominal, based on weight, see table 4-7) 10-orientations as specified
3/0
Impact (not packaged)
4" Drop (nominal, based on weight, see table 4-9) 5-orientations as specified
3/0
Temperature Cycling
-40 °C to 85°C / 100Cycles
3/0
Vibration
10-55 Hz, 1.52mm (max=10G), 1min/cycle, 120 cycles, 3axis
3/0
Electro-Magnetic Interference
Compliance with MIL-STD
1/0
Electro-Static Discharge
Compliance with MIL-STD-883
Group 2 Test Set
Group 3 Test Set Damp Heat
75C/90%RH: 500 hrs qual, 1000 hrs info only
3/0
Toperating Max, Pnominal Full Qualification = 2000hrs Information = 5000hrs
3/0
Group 4 Test Set Endurance
Reliability Information Analysis Center 57
Chapter 2: General Assessment Approach
Table 2.5-8: Qualification Example for a Laser Diode Test Description High Temperature Aging at Ambient Condition
GR-468, Hermetic Laser Module (active) 70°C Q=2000 hrs I =5000 hrs
Low Temperature Aging at Ambient Condition
Min. storage temp. Q=2000 hrs
Damp Heat Aging
85°C/85% RH Q=1000 hrs
Thermal Cycling
-40 to 70°C Q=100 cycles I =500 cycles
Thermal Shock
ΔT=100°C 20 cycles
Vibration
20G, 20-2000 Hz, 4 min/cy, 4 cy/axis
Shock
500G, 0.5 ms, 5 times/axis
Electrostatic Discharge
MIL-STD-883, Method 3015
There are many qualification standards in existence, governed by standards bodies within specific industries. Some noteworthy standards organizations are IEC (International Electrochemical Commission), the U.S. Military (via MIL-specs), ISO, and Telcordia (for telecommunication components and equipment. There are several factors which will impact the usefulness of qualification data as an indicator or field reliability. These are: • • •
The degree to which the stress is accelerated, and the acceleration factor between the test and field environments The degree to which the stress accelerates critical failure causes that the product or system will experience in the field The sample sizes used, which impacts the statistical significance of the data
The first two bullets are treated in detail elsewhere in this book. The last bullet is discussed next.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 58
Chapter 2: General Assessment Approach
A common way in which sample size requirements are identified in standards is with a Lot Tolerance Percent Defective (LTPD) methodology. This concept is identical to the reliability demonstration idea presented previously. In this case, two parameters are specified: 1. The percent of allowable defects 2. The confidence level From before:
1 − CL = R In this case, the value of “R” is the reliability of the entire sample size. So, if the test plan is established to allow no failures (this will require the minimum sample size), the equation becomes:
1 − CL = R n where “n” is the sample size. For example, if the allowable percentage of defects is 20%, and the desired confidence level is 0.90 (i.e. 90%), then n = 11 is the minimum sample size required, as shown: 11
1 − 0.9 = 0.8
So, if the test is performed on 11 samples with no failures, then there is a 90% confidence that the true reliability is greater than 0.8 (i.e., the probability of failure is less than 0.2). Other plans are also available that allow a certain number of failures. These require larger sample sizes, and are determined with binomial statistics. Since the LTPD is generally less than the required reliability, qualification data is usually not sufficient, in and of itself, to demonstrate reliability requirements. It can, however, be valuable data when used in combination with other data sources. As an example, consider a case in which a reliability requirement is that a product or system must have less than 3% cumulative failures after 1000 hours of operation. This is shown as the star in Figure 2.5-9. Now, let’s say that the item is represented by a multimode Weibull distribution (notice the three distinct portions of the curve representing the bathtub curve), characterized by the probability line called “Case 1”in Figure 2.5-9. If 11 parts were tested, and zero failures occurred after 300 hours of Reliability Information Analysis Center 59
Chapter 2: General Assessment Approach
operation, the only statistical statement that can be made is that there is a 90% confidence that the true unreliability is less than 0.2 at 300 hours, shown as the solid star and arrow. Here, the data is not sufficient to determine if the actual distribution is “Case 1,” or that the reliability requirement is met. However, testing 11 samples may be sufficient to determine if we have a wearout mode occurring at a time less than 300 hours, as illustrated in “Case 2.” Probability - Weibull 99.000
90.000
Unreliability, F(t)
50.000
Case 2
Case 1
10.000
5.000
1.000
0.500
0.100 0.100
1.000
10.000
100.000
1000.000
10000.000
Time, (t)
Figure 2.5-9: Reliability Requirement vs. Small Population Reliability Inference If the goal of the test is to demonstrate the infant mortality percent fail value from the first of the distribution modes is for example, less than 1%, it can be seen that testing 11 samples will not come close to demonstrating this requirement. This example is shown to illustrate the fact that the demonstration of reliability due to wearout related failure causes can be done with relatively small populations, whereas low percent fail values typical of infant mortality cannot. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 60
Chapter 2: General Assessment Approach
In any case, however, the goal of a reliability program is to ensure that the actual probability line is to the right of the reliability requirement point. 2.5.1.1.6.
DOE-Based Multicell
The methodology of a DOE (Design of Experiment)-based multicell involves subjecting a sample of products to a combination of factors, or accelerants. These factors can be stresses or categorical variables. The intent of these tests is to generate the data that is required to develop a life model that is capable of predicting reliability under a variety of use conditions. Life modeling is usually performed for specific failure causes. A goal of a reliability program is to identify those causes that warrant the work required to develop a life model. Characteristics of these “critical failure causes” often include: • • • • • • • • •
Failures experienced in EVT tests New, unproven technology New, unproven manufacturing processes Items exposed to stringent/severe environmental conditions Items exposed to stringent/severe operating stresses Items designed or manufactured with non-robust practices Items with known life limitations Items from suppliers with a history of delivery, cost, performance or reliability problems Old technology with availability problems (obsolescence and/or diminishing manufacturing sources
After the identification of critical failure causes of a product or system that require life modeling, action must be taken to ensure that those items are sufficiently robust to meet product/system reliability and durability requirements. Life modeling is used for this purpose, and involves the characterization and quantification of specific failure causes, making it a critical element of a reliability program. A generic life modeling methodology is shown in Figure 2.5-1.
Reliability Information Analysis Center 61
Chapter 2: General Assessment Approach
Tools
Measurement: DOE
FMEA
Life Modeling
FTA
• • • •
Environment Stresses Duty Cycle Extreme Event Statistics
FTA
Characterize operating stresses Identify Factors
Reliability Tests
Develop Life Model
Predict Reliability under Use Conditions
Model of System Reliability
Actions Figure 2.5-10: Life Modeling Methodology Each of the elements in Figure 2.5-10 are further examined below. Additionally, the topics of Design of Experiments (DOE) and life modeling are treated in more detail in Chapters 4 and 5, due to their relatively complex nature and their importance to life modeling. A detailed example of a life model developed is also provided in Chapter 7. Identify Factors Factors are the independent variables that can influence the product reliability, and the response variable is the dependent variable. DOE is a common technique used to study the relationships amongst many types of factors. In the context of this book, the response variables specifically refer to the reliability metric of interest. Critical failure causes and the factors that potentially affect their probability of occurrence need to be identified. This can be done through testing, through analysis, or both. EVT testing that is performed as part of the overall product/system reliability program can be used for the identification of these factors, as previously described. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 62
Chapter 2: General Assessment Approach
FMEA is also a popular analytical technique for this and will be used in the upcoming example. Factors fall into one of several categories: • •
Stresses o Environmental o Operational Product/System Attributes o Design factors o Manufacturing processes
Each of these factors can be a continuous or a categorical variable: • •
A continuous variable is one that can assume any value within a given range Categorical variables are those that assume a discrete number of possibilities
Some factors can be modeled as either. For example, environmental stress can be modeled with continuous variables of the specific environmental stresses (i.e., temperature, vibration, humidity, etc.), or it can be modeled as a categorical variable. The latter case is the approach that has historically been used in MIL-HDBK-217, which uses environmental categories like “Ground, Benign,” Airborne, Inhabited”, etc. The 217Plus methodology treats them as continuous variables, but default values are provided for the categorical values of environment. There are several ways in which these factors can be identified. One method that has proven to be an efficient means of accomplishing this is to utilize the FMEA. This involves modifying the FMEA to include several additional columns that correspond to the above listed factors. At the analysts discretion, from one to four additional columns can be included. This will depend on the type of product or system under analysis and the level of rigor desired. In this approach, the FMEA team (or at least someone knowledgeable with the item design and process attributes) identifies the specific stresses or attributes that will affect the probability of occurrence of the specific failure cause that was identified in the FMEA. Since each failure cause will generally have an associated risk priority number (RPN), the cumulative RPN can be calculated for all failure causes affected by the specific stress or product/system attribute. For example, consider the case in which an FMEA was accomplished in this manner, and the results in Figure 2.5-11 were obtained. Here, only the environmental stresses are Reliability Information Analysis Center 63
Chapter 2: General Assessment Approach
shown, but the same methodology would apply to whichever additional factors are included in the FMEA. A more detailed discussions of the FMEA methodology is provided in Chapter 8.
Figure 2.5-11: Identification of Test Stresses Based on the FMEA In this case, the sum of the RPN values for all failure causes accelerated by mechanical shock is about 500. This cumulative RPN value is a relative number only, but can provide valuable insight into the most important stresses to be addressed in the reliability test plan. In this example, the test stresses shown pertain to all of the failure causes addressed in the FMEA. In performing life tests on specific failure causes, the information identified in the FMEA should be used to identify the test stresses to be considered in the DOE plan.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 64
Chapter 2: General Assessment Approach
Reliability Tests If critical item failure mechanisms are time dependent, then time-based life tests are required. Life tests are conducted by subjecting test samples to a defined stress level and measuring the times when failure occurs. The process is repeated for various combinations of factor levels. Considerations for the reliability tests are described below. Test Plan If there are multiple accelerating stresses, then life tests must be conducted at various combinations of stress magnitudes. A plan should be developed using an effective tool such as Design of Experiments. The plan should consider all aspects of testing so that the test program generates data in a cost effective way. It is easy to lapse into the mentality of testing “one factor at a time”, in which tests are conducted to assess specific factors, but this approach is generally not time- or cost-effective. Factors to consider in establishing an appropriate DOE include (1) the sample size per test cell, (2) stress levels, (3) the number of stress levels for each stress, (4) stress interactions, (5) stress durations, (6) failure criteria, and (7) measurement methodology (i.e., in-situ or periodic). The principals of DOE are treated in more detail in Chapter 4. Maximum Test Stress A prerequisite for developing a complete test plan to assess the lifetime of a product or system attribute is knowledge of the maximum stress magnitude that can be tolerated by the item prior to catastrophic failure. This knowledge supports establishment of an upper bound on subsequent test stresses that may be a part of step-stress testing. These tests are generally performed as part of the EVT tests. In many cases, it is desirable to establish the upper bound of the test stress for each specific stressor. An efficient way to determine this stress level, often called the “destruct limit”, is to perform a step stress test. Here, a sample of units is exposed to a stress level well below the suspected destruct limit. Then, the stress is increased until the product is overstressed. This step-stress test can include a linearly ramped stress, or a stepped-stress in which the samples are exposed to a constant stress for a given dwell time, after which the stress is increased, dwelled, and so on until failure. An example of the identification of these maximum stresses was mentioned previously in the HALT discussion. The destruct limit can be used as the upper limit of all subsequent life tests. Usually, the actual life tests will be performed at a maximum stress that is a certain percentage level Reliability Information Analysis Center 65
Chapter 2: General Assessment Approach
below the destruct limit. This percentage is dictated primarily by the sensitivity of the TTFs to the stress. For example, consider the two cases illustrated in Figure 2.5-12. Case 1 is a situation in which the lifetime, and subsequent reliability, is moderately sensitive to the stress level. Case 2 is a situation in which the lifetime has an extreme sensitivity to the stress level.
Figure 2.5-12: Using the Destruct Limit to Define the Life Test Max Stress For example, if a power law acceleration model is used, the life – stress relationship is:
Life =
A Sn
where “A” is a life constant and “S” is the stress. A typical value of “n” for Case 1 would be 1 to 3, whereas a typical value of “n” for Case 2 would be greater than 20. In case 1, the maximum stress for the life tests may be 10-20% below the destruct limit. For Case 2, however, the maximum stress should be only a few percentage points below the destruct limit. Otherwise, the risk is taken that the product or system will not fail within a reasonable time period, which is required for reliability model development. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 66
Chapter 2: General Assessment Approach
Stress Profile The two main types of stress profiles are steady-state and time varying. Steady state tests are those in which a sample set is exposed to constant stress levels, and the response (performance parameter(s)) is measured. Several examples are shown in Figure 2.5-13.
Figure 2.5-13: Possible Stress Profiles Any of the profiles in Figure 2.5-13 can be used to develop life models. If the timevarying stress profiles are used, a cumulative damage model is usually appropriate. In this case, the stress function is integrated to obtain the cumulative damage. This will be explained in more detail later. Some of the advantages and disadvantages of the two generic approaches are listed in Table 2.5-9.
Reliability Information Analysis Center 67
Chapter 2: General Assessment Approach
Table 2.5-9: Stress Profile Option Advantages and Disadvantages Approach Steady State Stress
Stepped (or Linear Ramped) Stress
Advantage Results can be easily interpreted
Disadvantage Longer test times required
Facilitates the de-convolution of time and stress effects more easily Short test times possible
Requires knowledge of destruct limits
A good approach when the time to failure characteristics as a function of stress are unknown
Can be difficult to model parameters Software required for modeling
Does not require knowledge of destruct limits
Optimum Measurement Intervals When testing is performed on products or systems whose performance cannot be monitored in-situ, the test needs to be run such that performance measurements are done at periodic intervals. These intervals need to be frequent enough to bracket the TTFs tightly enough such that life model parameters can be estimated accurately enough. The objective of the measurement intervals is to obtain as much resolution as possible in the regions of time that exhibit high failure rates. The measurement intervals should be an order of magnitude shorter than the failure times. There are several approaches to determining the appropriate measurement intervals: 1. Use constant intervals. While this approach may not be optimal, it can be appropriate in cases where the failure characteristics are completely unknown 2. If the rate of occurrence of failure (ROCOF) is expected to decrease over time, the measurement intervals can start out very frequent, and decrease in frequency as the failure rate decreases. This is shown in Figure 2.5-14.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 68
Chapter 2: General Assessment Approach
Failure Rate
Measurement Points
Figure 2.5-14: Measurement Points for an Infant Mortality Failure Cause
If the ROCOF is expected to increase over time, the measurement intervals can start out very infrequent, and increase in frequency as the failure rate decreases. This is shown in Figure 2.5-15.
Failure Rate
Measurement Points
Figure 2.5-15: Measurement Points for a Wearout Failure Cause This case is generally much more difficult to implement because the failure characteristics need to be known before the tests. Therefore, one of the first two approaches is usually desirable. Reliability Information Analysis Center 69
Chapter 2: General Assessment Approach
Sample Size Requirements The determination of adequate sample sizes will depend on several factors, the most important being whether the failure cause is special cause or common cause. If it is special cause, the sample size needed will depend entirely on the percent of the population affected by the failure cause. For example, if the failure cause manifests itself in 0.1% of the population, then at least 1000 items would be required in order to expect a single failure. Since multiple failures are required for true quantification, an order of magnitude more items, or about 10,000, would be required. The specific number can be calculated by using the principals of reliability demonstration, as explained elsewhere in this book. If the failure cause is a common cause mechanism, meaning that the entire population is at risk, then many fewer items would be required. In this case, test data on enough samples is required such that differences in reliability as a function of the factors (i.e., stresses, indicator variables) can be determined in a statistically significant manner. This will be a function of how much inherent variability there is in the population, and how sensitive the reliability is as a function of the factors under analysis. Essentially, if these variabilities are known, then statistical techniques, like the Fisher F-test, could be used. However, in practice, these variabilities are rarely known a priori. Therefore, sample sizes as large as possible are preferred. In practice, the sample sizes are usually dictated by programmatic constraints, in which case it is the reliability practitioner’s responsibility to lobby program managers for the required samples. Test Time The question as to how long tests should be run before stopping them inevitably needs to be addressed. This is especially true in cases where the stress levels are low and the resulting lifetimes are long. While it is usually difficult to determine an appropriate test duration before the test is run, a general rule of thumb is that tests should be run for durations sufficient to cause at least 50% of the items to fail. This facilitates quantification of the median life. Keep in mind that tests are used to characterize the statistical distribution at a specific stress level, and therefore enough failures need to be experienced to quantify the distribution. Consider the illustration in Figure 2.5-16. In this case, tests were performed at two stress levels, and the resulting TTF distributions were obtainable for each level. The acceleration in this case can be quantified, along with confidence bounds around the acceleration model parameters. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 70
Chapter 2: General Assessment Approach
Figure 2.5-16: Acceleration When the Distributions for at Least Two Stresses are Available Now, consider the case in which the lower stress samples are not tested until enough failures have occurred. This is shown in Figure 2.5-17. In this case, the distribution cannot be quantified. All that is possible is the estimation of the lower bound of life, via techniques like Weibayes analysis (shown as the star).
Figure 2.5-17: Acceleration When the Distributions for Low Stresses are Not Available Reliability Information Analysis Center 71
Chapter 2: General Assessment Approach
This 50% objective can sometimes be offset if enough data is available in at least two other, more stressful conditions, to compensate for the lack of data in the low stress condition. Develop Life Model After the life data is generated from implementing the DOE plan, a reliability model can be constructed. Factors that must be quantified include: • • •
Time-to-failure (TTF) distribution Acceleration factors for the primary stress variables Characterization of the impact of specific design attributes on reliability
A generic sequence of events for model development is shown in Figure 2.5-18. Collect data • TTFs • Acceleration variables • Stress(es) • Indicator Select TTF distribution
Select acceleration model(s) Estimate model parameters
Analyze goodness of fit and parameter significance Figure 2.5-18: Life Model Sequence
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 72
Chapter 2: General Assessment Approach
The TTF distribution can typically be modeled using the Weibull, exponential or lognormal distributions. For sample "subpopulations" that exhibit different reliability behavior than the main population, TTF distributions may manifest themselves as bimodal. It is important that bimodal distributions be characterized. If one of the two "modes" in the distribution appears to be the result of early failures from workmanship, materials or process defects, then this information should be used to develop an appropriate reliability screen. This topic is discussed in detail later in this book. Characterize Operating Stresses In order to estimate the field reliability of the product, in addition to the life model (which will predict the life characteristics as a function of the chosen factors), information regarding the stresses to which the product or system will be exposed in the field is also necessary. There are a variety of sources that can be used to estimate the stresses to which an item will be exposed. First, customers will usually specify nominal and worst case environmental requirements in the product or system specification. However, the data in specifications are often very generic and lack sufficient detail for reliability analysis. Another source of information is from direct measurement, either by directly measuring stresses in the item use environment, or by equipping the item with sensors and data logging features. Field maintenance personnel can also often provide qualitative information pertaining to stresses, especially when those stresses have resulted in failures. There is a wealth of information available in both commercial and military handbooks and standards. Many industries also have their own source material from the products or systems used in their industry. A summary of sources include: • • • • • •
Customer specifications Customer usage information Measurement of conditions: Stresses Duty cycle Extreme event statistics Reliability Information Analysis Center 73
Chapter 2: General Assessment Approach
• • • •
Using a sample of fielded products fitted with sensors and data-recording electronics Discussions with field maintenance personnel Handbooks and standards MIL-STD-210, “Climatic Information to Determine Design and Test Requirements for Military Systems and Equipment”
Predict Reliability Under Use Conditions Once life models have been developed for all pertinent failure causes, the specific combinations of design attributes and stresses that result in reliability requirements being met can be identified. These attributes/stresses define the item "safe operating region," which should then be added to the system/product design rules so that reliability requirements for future designs can be met without having to repeat the reliability modeling process for that item. Model of System Reliability Once life models have been developed for all pertinent failure causes, they need to be combined such that a reliability estimate of the entire product can be made. Section 2.7 describes this process and the appropriate tools in more detail. Degradation Modeling In many cases, the reliability response variable will not be a TTF, but rather it will be the behavior of a critical parameter as a function of time. In these cases, there are several choices: 1. Develop a model that predicts the parameter as a function of all factors that need to be quantified. 2. Derive a simple model (linear, logarithmic, exponential or power law) model that describes the parameter as a function of time, and then use this model to estimate a time to failure (i.e. the time the parameter is predicted to degrade to some predefined failure threshold. In many cases, Option 2 is a good choice. Option 1 is a good choice in the following cases: 1. When the failure mechanism can reach an asymptotic value of degradation. This condition is difficult to model using the conventional life modeling techniques 2. If the goal of the analysis is to feed other analytical techniques, like worst case analysis (WCA). 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 74
Chapter 2: General Assessment Approach
A general approach for degradation modeling is shown in Figure 2.5-19.
Regression Non linear model parameter estimation
Life modeling
Data of a performance parameter vs. time
Model of performance vs. time
Prediction of delta
Prediction of percent fail Life model
Prediction of life distribution
Figure 2.5-19 Degradation Modeling Approach This approach starts with data pertaining to the value of a critical parameter as a function of an independent parameter. This independent parameter is usually time, but can be other parameters, such as cycles. An example of such data is shown in Figure 2.5-20, in which five samples were put on test and the critical parameter was measured in situ.
Reliability Information Analysis Center 75
Chapter 2: General Assessment Approach
Figure 2.5-20: Degradation Data Example Next, models of performance vs. time are modeled. This can be accomplished by using some standard model forms like, linear, exponential, logarithmic, polynomial, or a more sophisticated non linear model form. The standard model forms can be quantified by applying a linear transform to the data and applying regression techniques. The models are easily performed in MS EXCEL with the trend line functions. Non linear model forms can be quantified using numerical methods. The “Solver” utility in MS EXCEL is, again, an example of this solution type. Once these degradation models are available, predictions can be made regarding the degradation value or the percent of the population failing in accordance to a predefined failure criterion (i.e. percent degradation). Or, another option is to convert the degradation data to failure times, as shown in Figure 2.5-21. The estimated TTFs are then used to generate a life model using the techniques covered elsewhere in this book.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 76
Chapter 2: General Assessment Approach
Figure 2.5-21: Degradation Data Conversion to Times-to-Failure Note that the resulting TTF distribution can sometimes be counterintuitive. For example, when dealing with what is believed to be a wearout-related phenomenon, the conversion of degradation into TTFs can reveal a TTF distribution that is not usually considered to be a wearout characteristic (for example a Weibull distribution with a shape parameter that is less than one). 2.5.1.1.7.
Reliability Demonstration
An accelerated reliability demonstration is conceptually the same as a non-accelerated test, but the level of acceleration needs to be quantified. For this, life modeling approaches are used. 2.5.1.2. Field Data
Reliability data obtained from the field experience of products or systems is an invaluable source of data. When using empirical field reliability data from a similar item as the basis of the reliability estimate, there are two fundamental approaches, as illustrated below.
Reliability Information Analysis Center 77
Chapter 2: General Assessment Approach
The first approach is to utilize the field data directly, and the second is to utilize the data, via an interim model developed from the data. This is shown in Figure 2.5-22.
Empirical Field Data
Reliability Estimate Model
Figure 2.5-22: Reliability Estimates from Field Data This data has been the primary source of data used to develop most of the empirical prediction methodologies such as MIL-HDBK-217, 217Plus, etc. Due to the author’s experience with these prediction methodologies, they will be used as examples in Chapter 7 to illustrate the concepts discussed in this section. 2.5.1.2.1.
Same Product
Field data on the exact item under analysis is the best information on which to estimate the reliability of the product or system. Unfortunately, it is usually available too late to do any good. Reliability predictions and estimates are required long before product or system deployment. This type of data is a lagging indicator of reliability, whereas the other techniques discussed in this book are leading indicators. In other words, we need leading indicators to estimate the reliability that will ultimately be observed with the field data. This data, however, which should always be collected on products, is valuable in the reliability assessment of future products. 2.5.1.2.2.
Similar product
When using data on a similar product or system to assess a new product or system, the degree of similarity needs to be accounted for to estimate the new item reliability based on the empirical data available on the similar product. There are several ways in which similarity can be assessed. The first approach is to utilize a reliability prediction technique. This technique can be any of those covered in this document. The technique’s ability to assess similarity is dependent on the ability of the specific methodology to: 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 78
Chapter 2: General Assessment Approach
1. Address the factors that drive the reliability for the two products under analysis 2. Be reasonably sensitive to these drivers. Regarding #2, for example, if a system is being developed that represents an evolutionary change to the system for which a reliability estimate is available, estimating the reliability of the new system based on the data from the old system requires that the prediction methodology be sensitive to the design differences between the old and new systems. If these differences consist of the addition of new components, an increase in the operating temperature, and the addition of software, then the methodology used to assess the “delta” in reliability between the new and old system must be capable of assessing these elements, and the reliability prediction approach must be reasonably sensitive to these factors. The methodology of 217Plus was designed to accommodate this type of situation, and is further detailed in Section 2.6. Additionally, it is not necessary that a single methodology be used to assess this “delta”. Different techniques can be used to assess each of the elements of the design, and the cumulative effect can be pooled together to form a complete system model. The techniques used to assess each of the design elements will generally fall into the categories described in this document. Another more qualitative technique is to simply list the general attributes of the design, as shown in Table 2.5-10. The relative expected reliability of each of these elements for the new and old designs are then listed. This is a qualitative method, but can be useful in some cases.
Reliability Information Analysis Center 79
Chapter 2: General Assessment Approach
Table 2.5-10: Similarity Analysis Reliability Ratio for the Design and Process elements of the Old and New Designs
Design Elements
Process elements
Size Weight General design Number of components of type “A” Number of components of type “B” Number of components of type “C” Number of optical components Thermal dissipation Number of connections Manufacturing site Equipment Screening Component attachment Screening tests QC tests
This approach needs to be developed for each product or system, since the reliability attributes will be unique to that particular item type. Another approach that can be used to assess similarity is to utilize the FMEA, if available. This is illustrated in Figure 2.5-23. Here, the FMEA is performed on both the new and the predecessor system. The failure causes identified represent a cumulative listing of all failure causes, whether they are applicable to either or both items. Then, the Occurrence rating is determined for each failure cause for both items. If a specific failure cause is not applicable to one of the items, then it gets a rating of zero. The sum of the Occurrence ratings are then calculated for each of the products or systems. The ratio of this sum is an indicator of the relative reliability levels of the two items, and is a good measure of the degree to which the items are similar.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 80
Chapter 2: General Assessment Approach
Recommended Actions
Detectability
Occurrence
Causes
Severity
Failure effects
Failure mode
function
Component
Applicable to:
This section represents the FMEA results for both the new and old systems, and lists a cumulative set of failure modes, causes, etc.
Old System
New System
In these two columns, it is identified whether the failure cause is applicable to the old, new or both systems. Sum of “O” Sum of “O” values of values of causes causes applicable to applicable to the old system the new system
Figure 2.5-23: FMEA as a Toll for Assessing Similarity 2.5.1.2.3.
Raw Empirical Field data - Similar Product or System
Raw field reliability data has been a very popular source of data on which to base reliability estimates. This “similar data” can be based on a specific company’s own field experience on previous products or systems, or it can be a pooled set of data based on a variety of companies and organizations. As an example of the latter, one of the RIAC’s most popular documents has been the “Nonelectronic Parts Reliability Data, (NPRD)” publication. NPRD is a compilation of observed field reliability data on a wide variety of components. A summary of NPRD is provided in Section 7.4, to provide the reader with a guide to the interpretation of this type of data. For the most part, methodologies such as EPRD (Electronic Parts Reliability Data), NPRD, MIL-HDBK-217, and 217Plus rely on field data from similar products or systems in order to make reliability estimates. The manner in which they do this differs, but they all share the same fundamental type of data as their basis.
Reliability Information Analysis Center 81
Chapter 2: General Assessment Approach
2.5.1.2.4.
Models
The use of models derived from empirical data to estimate the reliability of a product or system is just one option for estimating reliability. Empirical models can be developed and used by the analyst, or he/she can use empirical models developed by others. Models developed by others include the industry standards or methodologies that many reliability analysts are familiar with. This section of the book deals with such models that are derived from the analysis of empirical field data. Modeling is the means by which mathematical equations are developed for the purpose of estimating the reliability of a specific item used and applied in a specific manner. There are many ways in which models can be derived, and there is no single “correct” way to develop these models. There are many such models in existence. These models are generally easy to use, in that they are of a closed form and simply require the analyst to identify the appropriate values of the input variables. The developers of each of these models had their own perspective in terms of the user community to be served, the variables that were to be modeled, the data that was available, etc. It is not the intent of this book to review the specifics of these models, or to compare them in detail. It is the intent, however, to discuss the rationale and options for development of the models, and to provide some examples. The analyst must first decide what variables are to be modeled. Factors that should be considered as indicators of reliability include: • • • • • • • •
Environmental stresses Operational stresses Reliability growth Time dependency o Infant mortality o Wearout Engineering practices Technology o Feature sizes o Materials Defect rates Yields
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 82
Chapter 2: General Assessment Approach
Which ones are actually included depend on whether the data is available to support the quantification of a factor, if a valid theoretical basis exists for its inclusion, and whether the factor can be empirically shown to be an indicator of reliability. There are always many more potential factors influencing reliability that can realistically be included in a model. The analyst must choose which ones are considered to be the predominant reliability drivers, and include them in the model. The next step of model development is to theorize a model form. This is generally accomplished by attempting to establish a model consistent with the fundamental physics of reliability. Examples of the development of several empirically-based models are provided in Chapter 7. To compare various empirical methodologies, Table 2.5-11 contains the predicted failure rate of various empirical methodologies for a digital circuit board. The failure rates in this table were calculated for each combination of environment, temperature and stress. As can be seen from the data, there can be significant differences between the predicted failure rate values, depending on the method used. Differences are expected because each methodology is based on unique assumptions and data. The RIAC data in the last row of the table is based on observed component failure rates in a ground benign application. Table 2.5-11: Digital Circuit Board Failure Rates (in Failures per Million Part Hours) Environment Temperature Stress ALCATEL Bellcore Issue 4 Bellcore Issue 5 British Telecom HDR4 British Telecom HDR5 MIL-HDBK-217 E Notice 1 MIL-HDBK-217 F Notice 1 MIL-HDBK-217 F Notice 2 217Plus Version 2.0 RIAC data
Ground Benign 10 Deg. C 70 Deg. C 10% 50% 10% 50% 6.59 10.18 13.30 19.89 5.72 7.09 31.64 35.43 8.47 9.25 134.45 137.85 6.72 6.72 6.72 6.72 2.59 2.59 2.59 2.59 10.92 20.20 94.37 111.36 9.32 18.38 20.15 35.40 6.41 9.83 18.31 26.76 0.28 4.89 3.3
Ground Fixed 10 Deg. C 70 Deg. C 10% 50% 10% 50% 22.08 29.79 32.51 47.27 8.56 10.63 47.46 53.14 16.94 18.49 268.90 275.70 9.84 9.84 9.84 9.84 2.59 2.59 2.59 2.59 36.38 56.04 128.98 165.91 28.31 48.78 45.44 79.46 24.74 40.15 73.63 119.21 0.51 6.04
For electronic systems, generic handbook models such as MIL-HDBK-217 or Telcordia SR-332 can be separated into two basic approaches, Parts Count and Parts Stress. When the models for these handbooks were developed, researchers performed statistical analyses on collected test and field data to determine major influencing factors for the Reliability Information Analysis Center 83
Chapter 2: General Assessment Approach
class of components being considered. For example, for most all electronic components, the predicted failure rate is found to be a function of operating temperature and applied electrical stress. In general, the lower the operating temperature and applied electrical stress, the lower the predicted failure rate will be. Therefore, the parts stress method includes model factors for these specific stresses. However, if specific stress values cannot be determined, it is still possible to perform a prediction using the more general parts count methodology. For the parts count method, model stress levels have been set to typical default levels to allow a failure rate estimate simply by knowing the generic type of component (such as chip resistor) and its intended use environment (such as ground mobile). It should be noted that these reliability prediction handbook approaches are, by necessity, generic in nature. Actual test or field data from other similar items is always more desirable, given sufficient similarity, as was discussed previously. MIL-HDBK-217 MIL-HDBK-217, “Reliability Prediction of Electronic Equipment”, has historically been the most widely used of all of the empirically-based reliability prediction methodologies. The basic premise of the handbook is the use of historical piece part test and field failure rate data as the basis for predicting future system reliability. The handbook includes failure rate models for most electronic part types. The latest released version of MILHDBK-217 is “F, Notice 2”, dated 28 February 19952. The handbook was almost a casualty of the DoD Acquisition Reform initiative, but it survived primarily because of its widespread use, the dependency on it throughout the military-industrial complex, and the lack of a suitable replacement. Figure 2.5-24 presents a brief example of the MIL-HDBK-217 parts count method, where the product or system failure rate is the sum of the failure rates of the generic electrical and electromechanical components of which it is comprised3. Each piece-part failure rate is derived by assigning “typical” defaults to the generic component category stress models. The only factors considered in these parts count component models are (1) the generic base failure rate for that part type (represented by λg) that is based on an assumed application environment and default temperature, (2) a generic quality factor (πq) that is used to modify this part type base failure rate, and (3) the quantity of that part type used in the equipment. In the example shown here, the λg for a bipolar microcircuit comprised of between 1 and 100 gates in a 16-pin dual-in-line package used in a ground, fixed (i.e., GF) environment and operating at an assumed junction temperature of 60 degrees C is 2 As of the publication date of this book, a Draft version of MIL-HDBK-217G is in development and is expected to be released some time in 2010. 3 The current version of MIL-HDBK-217 does not predict the reliability of mechanical components or non-hardware reliability elements, such as software, human reliability, and processes. Field failures of mechanical components and non-hardware items should not be scored against MIL-HDBK-217 or any other electronics-based empirical methodologies.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 84
Chapter 2: General Assessment Approach
0.012 failures per million hours. The quality factor for a parts count prediction is also determined from a table (not shown in this example). The parts count prediction approach is intended for use early in the design phase of the equipment life cycle, prior to the start of detailed design, when there is little known about the specific characteristics of the parts being used, or how they will be applied (such as individual operating and environmental stresses).
Figure 2.5-24: MIL-HDBK-217 Part Count Example The parts stress approach of MIL-HDBK-217 is applied later in the design development phase of the equipment life cycle, when more details of the design are becoming available and specific part applications are being identified. The use of this approach requires detailed knowledge of the applied stresses and physical characteristics of the device, including ambient and/or operating junction temperatures, electrical stress levels (such as voltage, power, or current) vs. rated parameters, device complexity (such as gate counts or transistor counts for semiconductors), etc. An example of a MIL-HDBK-217 parts stress model is shown in Figure 2.5-25 for gate/logic arrays and microprocessors.
Reliability Information Analysis Center 85
Chapter 2: General Assessment Approach
Figure 2.5-25: MIL-HDBK-217 Part Stress Example The form of the model separately addresses failure rate contributions from the microcircuit die and the microcircuit package. The quantitative values for C1 (shown here) and C2 (not shown) are device- and package technology-dependent. The values for the temperature and environmental factors, πT and πE respectively, are also technology dependent and are seen to independently impact the die and package failure rate contributions. The quality factor, πQ, represents the amount of pre-condition screening or testing that the part might get, and the learning factor (πL) reflects the level of maturity associated with the manufacture of the device. Component maturity has been shown to be a predominant reliability driver of components, since the maturity is inversely proportional to the defect density, which in turn is proportional to the failure rate. . Telcordia SR-332 (Bellcore) The Telcordia SR-332, shown in Figure 2.5-26, was formerly known under the name of Bellcore. The models are similar in concept and purpose to MIL-HDBK-217, including the fact that they are empirically based.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 86
Chapter 2: General Assessment Approach
Figure 2.5-26: Telcordia SR-332 (Bellcore) The primary discriminator between the two approaches is that the Telcordia models have been tailored specifically for the telecommunications industry, meaning that the breadth of available environmental factors in the Telcordia models is much narrower than in MIL-HDBK-217. There are three basic reliability prediction methods in the Telcordia methodology: •
• •
Method I represents a parts count reliability prediction approach, and is applicable to new technology where no field data exists. Since it does reflect new technology, the model includes a “first year multiplier” factor to account for infant mortality failure rates. Method II incorporates the characteristics of Method I, but expands its scope to include the impact of lab test data. Method III is based on field tracking of failure rates.
It uses the Bayesian methodology to combine data from various sources, in a manner similar to the RAC PRISM/RIAC 217Plus methodology.
Reliability Information Analysis Center 87
Chapter 2: General Assessment Approach
PRISM/217Plus The original RAC PRISM®4 system reliability assessment tool was developed and released by the RAC in January 2000 as a potential replacement for MIL-HDBK-217. With the subsequent transition to RIAC in June 2005, the RIAC 217Plus methodology replaced the RAC PRISM tool and added additional component models. As a result, the 217Plus methodology currently addresses all the major component types found in MILHDBK-217. Figure 2.5-27 symbolizes the replacement of the RAC PRISM tool by 217Plus.
Figure 2.5-27: RAC PRISM Replaced by RIAC 217Plus With no DoD sponsorship and funding, the models contained within MIL-HDBK-217F, Notice 2, and the data upon which they were based, were becoming increasingly outdated, thereby subject to increasing criticism. The part failure rate models incorporated within PRISM, and ultimately within 217Plus, are based on a much larger and more recent dataset, reflecting the improvements made in semiconductor device and packaging technologies, resulting in more “accurate” part failure rate predictions. The PRISM/217Plus software tool incorporates many of the ideas contained in the “New System Reliability Assessment Study” performed by the RIAC for what was then Rome Laboratory (Reference 7). These ideas include the ability to update an analytical reliability prediction using in-house test data or field experience through Bayesian techniques, and the ability to factor in system-level reliability impacts resulting from the robustness (or lack, thereof) of the system development process. This methodology is discussed in more detail in Chapter 7.
4
PRISM is a registered trademark of Alion Science and Technology.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 88
Chapter 2: General Assessment Approach
CNET/RDF 2000 The CNET/RDF 2000 Reliability Prediction Standard, shown in Figure 2.5-28, covers most of the same component categories as MIL-HDBK-217.
Figure 2.5-28: CNET/RDF 2000 An example of an integrated circuit model is shown in Figure 2.5-29. This model has many similarities to the PRISM/217Plus models, in that it addresses the year of manufacture, dormancy failure rates, thermal cycling characteristics and electrical overstress failure rates. The form of the integrated circuits model is shown here, and bears a resemblance to the format of the MIL-HDBK-217F, Notice 2 microcircuit failure rate model, in that it partitions the predicted device failure rate into die- and packagerelated contributions. As can be seen from Figure 2.5-29, there is quite a bit of information that the analyst must have access to in order to use the model, but this is typical for virtually all parts stress reliability prediction models.
Reliability Information Analysis Center 89
Chapter 2: General Assessment Approach
Figure 2.5-29: CNET/RDF 2000 Model Example FIDES The FIDES methodology, illustrated in Figure 2.5-30, was created by a French consortium of reliability experts from various companies. It has similarities to the CNET, RAC PRISM and RIAC 217Plus methods.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 90
Chapter 2: General Assessment Approach
Figure 2.5-30: FIDES Other Methodologies Other methodologies include: UTE C 80-810, RELIABILITY DATA HANDBOOK: RDF 2000 – A universal model for reliability prediction of electronic components, PCBs and equipment. IEC 62380 TR Ed.1 (2003), Reliability Data Handbook - A universal model for reliability prediction of electronics components, PCBs and equipment. IEC 1709, Electronic components – Reliability – Reference conditions for failure rates and constraints influence models for conversion. The VITA51 committee is also an organization that is addressing reliability prediction. Its approach has been to adapt and modify existing methodologies, like MIL-HDBK-217, by tailoring various factors of the MIL-HDBK-217F, Notice 2 models so that there is a closer correlation between predicted values and field experience. Reliability prediction models such as those summarized above can easily become outdated as technology advances. As previously mentioned, maintaining the currency and accuracy of a reliability prediction model can be a prohibitively costly and laborReliability Information Analysis Center 91
Chapter 2: General Assessment Approach
intensive effort. Failure to invest in this activity, however, will doom a reliability prediction methodology to eventual irrelevancy and obsolescence. 2.5.1.2.5.
Collecting Field Data
Since field data is critical to the reliability assessment process, it is explored in this section. The nuances of collecting and interpreting it are discussed. Some of the issues encountered in collecting field data are discussed in the NPRD discussion included in Chapter 7. The intent of this section is to present guidelines on how to approach field data collection. Good data collection is the key to an effective process for utilizing data obtained from a reliability tracking system. This information includes: • • • •
Failure statistics (i.e., TTF, MTBF) Application information (i.e. stress, environment, etc.) Failure modes Failure causes
The intent of this section is to outline a reliability data collection and analysis system that can provide the data required. Although a reliability tracking system outlined herein has similarities to a FRACAS program, there are distinct differences. While a FRACAS program is intended identify the causes of failures so that corrective action can take place, the program outlined herein is intended to be more comprehensive in that it assists its user in more than the implementation of corrective actions, as it also provides the data required to quantify reliability, in accordance with the methodologies outlined in this book. This concept is illustrated in Figure 2.5-31.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 92
Chapter 2: General Assessment Approach
Reliability
TTF Analysis
Vendor Selection
Failure Verification
Warranty Claims
MTBF Analysis
Root Cause Identification
RCM Implementation
Implement Design Improvements
Figure 2.5-31: Uses of Program Data Elements A data system consists of several basic elements: a database, software analysis tools, and an interface to the data system users. The database is the core of the system that captures the raw maintenance data that is necessary to perform the required data analysis. A typical structure of a database is provided in Figure 2.5-32.
System Information
Parts Breakdown Maintenance Data Root Failure Cause/Analysis Data
Figure 2.5-32: Program Database Structure The blocks in the above figure correspond to records in a relational database structure. The data elements associated with each record are defined below. The System Reliability Information Analysis Center 93
Chapter 2: General Assessment Approach
Information record consists of population statistics and needs to be updated whenever the product or system status changes. Such a change occurs when new or modified items are fielded. The parts breakdown data element consists of a hierarchical description of the system. This description is necessary to avoid confusion as to which FRUs (Field Replaceable Units) belong to which assemblies and the number of FRUs in the assembly, as well as in the entire system. The maintenance data element consists of a record of the maintenance action taken to maintain or repair the system. It also consists of a description of the anomaly, the failure mode, and the failure mechanism of the failed unit as determined by the maintenance technician. One record corresponds to a single maintenance action, and there can be any number of them for each FRU in the system (i.e., a FRU in the system can be replaced any number of times over the life cycle of the system). The root failure cause/analysis data element consists of information on the results of the detailed failure analysis that may be performed on the failed unit. It is a separate record because not all maintenance actions will result in the failure analysis of a removed unit. There are two primary interfaces required of the system. The first is the maintenance technician interface. This interface is the means by which maintenance data is entered into the database. Ideally, this interface would consist of computers located within the maintenance facility for direct data entry. The second interface is the one utilized by individuals that need the results of the data analysis. The flow of the interface to the system from the perspective of the system user is given in Figure 2.5-33.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 94
Chapter 2: General Assessment Approach
System maintenance is required and maintenance commences Maintenance technician identifies the part requiring maintenance and enters the part data into the database
System user enters part breakdown and maintains system usage status Central Database User runs appropriate analysis and obtains necessary reliability metric(s)
Technician performs required maintenance
Maintenance technician enters maintenance data into the database Figure 2.5-33: Database Information Flow Important elements of the data system that should be considered for inclusion are summarized below: • • • • •
System information Number of systems fielded Dates of fielding for each system j Location of operation (optional) System Numbers (unique identifier for each system)
Critical elements of a data collection system are discussed below. Parts Breakdown A description of every level of assembly must be available, down to the lowest level of repair. For the purposes of this example, this assembly will be called a FRU (Field Reliability Information Analysis Center 95
Chapter 2: General Assessment Approach
Replaceable Unit). This product or system description is critical to the unique identification of parts so that the data that is reported at various levels is not confounded. It is also critical if maintenance actions are not consistently performed at the same level. At the lowest level of indenture, the following FRU information is required. • • • • • • •
Part number Serial number Part identification code (unique descriptor of part in hierarchical breakdown of system; sometimes referred to as a Reference Designator) Number of parts in the product or system Applicable Life Unit (i.e. hours, miles, cycles, operations, etc.) Identification as to if there is an individual elapsed time meter (or miles, cycles, operations) on the specific part or whether system life units must be used Manufacturer name
Maintenance Information A critical element to an effective reliability data collection and analysis system is the accurate quantification of the failure cause. Not all perceived failures are real failures and, therefore, it is important to identify whether part removals are indeed true failures. Figure 2.5-34 illustrates the hierarchy of maintenance actions.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 96
Chapter 2: General Assessment Approach
Maintenance Action
Unscheduled
Scheduled
Perform Routine Maintenance
Remove/Replace
Correct Diagnosis
Necessary Repair
Failure Analysis Performed to Identify Root Cause
Real Failure
Incorrect Diagnosis
Faulty Unit Gets Put Back into Field
False Alarm
Correct Diagnosis
Cannot Duplicate
Incorrect Diagnosis
Unnecessary Repair
Failure Analysis Not Performed
Figure 2.5-34: Hierarchy of Maintenance Actions The following is a list of required data elements in the capture of maintenance information: •
Job number (unique identifier) Reliability Information Analysis Center 97
Chapter 2: General Assessment Approach
• • • • • • • •
• •
• • • • • • •
Calendar date and time that system is taken out of operation Calendar date of maintenance action System serial or configuration control number Number of total life units (i.e. hours, miles, cycles, operations) on the FRU at the start of the maintenance action (if life unit meter is on FRU) Number of total life units (i.e. hours, miles, cycles, operations) on the product or system at the start of the maintenance action (if life unit meter is not on part) Number of total life units (FRU or product/system, depending on which of the above two items are applicable) on the part at the start of the maintenance action. This is a calculated field generated by the database software. Initial description of the anomaly Initiating event (only one is chosen): o Failure of system to perform (unscheduled maintenance) o Condition monitoring-based event o Scheduled maintenance When discovered Action taken (only one is chosen): o Remove/replace o Maintain o Remove, re-test OK, and replace FRU on which action is taken (description and serial/configuration control number) Maintenance technician (name) Man-hours required for maintenance action Calendar date and time that the system is put back into service Cause of failure identified by the maintenance technician Failure mode description Failure mechanism description. There could be a standardized listing of the possible failure mechanisms from which the technician could scan and identify the appropriate mechanism.
Failure Analysis Information The failure analysis record is used when there is a detailed failure analysis performed on a removed FRU. The data contained in this record generically consists of the following: • • •
Summary of the analysis performed Results of the analysis Failure cause (should be the root failure cause, not a failure symptom cause) 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 98
Chapter 2: General Assessment Approach
Analysis From the data collected and captured in the database, several fundamental reliability parameters, including those listed below, can be calculated. • • • • • •
•
• • • • • • • • •
Operating hours (or life unit) of each FRU Cumulative operating hours of the population Cumulative system calendar hours of the population Cumulative FRU calendar hours of the population Individual calendar times for each product or system For scheduled removals: o Number of scheduled removals o Total number of man-hours associated with scheduled removals o Individual operating times for scheduled removals o Individual calendar times for scheduled removals o Number of man-hours for each scheduled removal For unscheduled removals: o Number of unscheduled removals o Total number of man-hours associated with unscheduled removals o Individual operating times for unscheduled removals o Individual calendar times for unscheduled removals o Number of man-hours for each unscheduled removal Number of total removals Total number of man-hours Individual number of man hours Individual operating times of all removals Individual calendar times of all removals Number of removals for each failure cause Individual operating times of removals for each failure cause Individual calendar times of removals for each failure cause Total time that each individual product or system is unavailable
For many of these parameters, it is necessary to calculate the number of life units to which each part has been exposed. This is done by calculating the number of life units on the part since the last time that the part was replaced. This calculation procedure is illustrated in Figure 2.5-35.
Reliability Information Analysis Center 99
Chapter 2: General Assessment Approach
Is there a life unit meter on the part? Yes
No
Use part life unit meter
Use system life unit meter
Record life unit meter reading (i.e., Part hours/miles/cycles)
Has the part been previously removed? (i.e., is there a maintenance record for that part in the database?)
Yes
Subtract the system life unit from that of the last maintenance record from the current life unit
No
Record the system life unit
Figure 2.5-35: Calculation of Part Life Unit Outputs A list of typical output parameters are listed below: • • •
Mean Operating Hours Between Scheduled Removals Mean Calendar Hours Between Scheduled Removals Mean Operating Hours Between Unscheduled Removals 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 100
Chapter 2: General Assessment Approach
• • • • • • • • • •
Mean Calendar Hours Between Unscheduled Removals Mean Man Hours per Maintenance Action (MMH/MA) Distribution of maintenance man hours per maintenance action Weibull parameters of individual operating times for unscheduled maintenance actions Weibull parameters for failures of a specific cause Pareto ranking of part failure rates (or of any of the above listed parameters) Failure cause distribution Pareto ranking of failure causes Mean system availability for each system Distribution of system availability
Drenick’s Theorem An important aspect of interpreting field reliability data is distinguishing between calendar time and operating time. Consider a situation in which five items are fielded at the same time, as illustrated in Figure 2.5-36. They will each have a failure time (or other appropriate life unit) that is described by the TTF distribution as a function of operating time.
1 2 3 4 5
Operating Time
Failure Times Figure 2.5-36: Failure Times Based on Operating Time
Reliability Information Analysis Center 101
Chapter 2: General Assessment Approach
Now, consider the same five items that were placed in the field at different calendar times, as illustrated in Figure 2.5-37. They will have the same failure times relative to their operating time, but the apparent failure times relative to calendar time will be quite different.
1 2 3 4 5
Calendar Time Failure Times Figure 2.5-37: Failure Times Based on Calendar Time Furthermore, if the product or system is repairable (in which case the failed items are replaced upon failure with a new item), an interesting effect occurs in which the apparent failure rate will reach an asymptotic value that appears to represent a constant failure rate. This occurs as the “time zero” values become randomized as items fail and are replaced with new items. To illustrate the relationship between the beta value (Weibull shape) and the instantaneous failure rate as a function of calendar time when parts are replaced upon failure, a simulation was performed. In this simulation example, the failure rate of 1100 items as a function of calendar time was calculated. Figures 2.5-38 through 2.5-42 illustrate the results. These figures correspond to Weibulldistributed TTFs with shape parameters of 20, 5, 2, 1 and 0.5, respectively. The time axis is calendar time, normalized to a time unit of one characteristic life.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 102
Chapter 2: General Assessment Approach
Figure 2.5-38: Failure Rate Simulation with Weibull Beta = 20
Figure 2.5-39: Failure Rate Simulation with Weibull Beta = 5.0 Reliability Information Analysis Center 103
Chapter 2: General Assessment Approach
Figure 2.5-40: Failure Rate Simulation with Weibull Beta = 2.0
Figure 2.5-41: Failure Rate Simulation with Weibull Beta = 1.0 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 104
Chapter 2: General Assessment Approach
Figure 2.5-42: Failure Rate Simulation with Weibull Beta = 0.5 Consider the case where the Weibull beta = 20 (Figure 2.5-38). When the populations start operating at the same time at t = 0, the failures occur at a rate described by the Weibull distribution with a beta value of 20. The peak of the failure rate occurs at approximately the characteristic life value of time. As units fail and are replaced, the “time zeros” start to become randomized. As enough time passes, the “times zeros” will eventually become completely randomized. At this point, the asymptotic value of failure rate is reached, which is the reciprocal of the characteristic life (in this case, 100). Figure 2.5-39, depicting the simulation results for a beta value of 5.0, indicates a similar effect. The asymptotic failure rate, however, is reached sooner. This happens because the variance in failure time is greater for a beta of 5.0 relative to a beta of 20, which, in turn, means that the population “time zeros” become randomized sooner. The plot illustrating a beta of 2.0 (Figure 2.5-40) is similar, with a corresponding asymptotic value reached sooner. The plot corresponding to a beta of 1.0 (Figure 2.5-41) indicates that the random failure rate occurs at t=0 which intuitively make since it has, by definition, a randomly occurring failure rate. However, when the beta is less than 1.0 (Figure 2.5-42), the asymptotic failure rate value is zero. This occurs because, when enough time has passed, the failed items have been replaced with items that have a higher probability of living longer. The lower the beta value, the shorter the time period required to achieve a zero failure rate. Reliability Information Analysis Center 105
Chapter 2: General Assessment Approach
Because this is an important factor in interpreting field reliability data, a methodology was derived for the NPRD data to estimate the characteristic life based on field data with varying “time zero” values. This methodology is discussed in Chapter 7, Section 4. 2.5.2. Physics
The generic approaches covered here in using a physics approach are stress strength interference models and models from first principals. Each is described below. 2.5.2.1. Stress/Strength Modeling
Stress/strength interference theory is a technique used to quantify the probability that the strength of an item is less than the stress to which it is subjected. For example, if the distribution of the strength of an item can be quantified, and the distribution of the stress it is under can be quantified, the area of intersection of the two stresses represents the probability that the strength is less than the stress. This technique is general in nature and applies equally to any situation that the two distributions can be quantified, as long as the X-axis represents the same variable for both distributions. The variable can be electrical, such as voltage or current, or it can be mechanical strength, for example, in units of KPSI. The goal of any design for robustness effort is to minimize the variance of both distributions, and maximize the separation of the distribution means. In this manner, the probability of distribution intersection, or failure, is minimized. An example of this approach is illustrated in Figure 2.5-43.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 106
Chapter 2: General Assessment Approach
Material Properties
Design Dimension Dimensions
FEA
Extrinsic Stresses
Modulus
Strength Data
Stress
CTE
Fatigue Data
Strength
Probability of Failure vs Time Figure 2.5-43: Stress Strength Methodology In this example, a mechanical item has certain physical properties, for example its modulus and its coefficient of thermal expansion (CTE). These material properties are used in addition to the design variable (i.e. dimensions, extrinsic stresses) to estimate the stresses to which the item is exposed. This stress can be modeled in several ways. One is the use of handbooks that contain closed-form equations that estimate the stress to which a material is exposed as a function of dimensions, force, deflections, etc. This is usually only viable for simple structures. For more complex mechanical structures, finite element models and analysis (FEA) may be required to simulate stresses. For the strength portion of the model, two factors need to be considered: • •
The inherent strength distribution of the material The strength properties as a function of time
Reliability Information Analysis Center 107
Chapter 2: General Assessment Approach
An example of strength as a function of time is the fatigue properties of the material. The fatigue properties pertain to the strength degradation over time. At time = 0, the probability of failure is the intersection of the stress and the strength distributions, as illustrated in Figure 2.5-44.
Figure 2.5-44: Stress/Strength Interference The calculation for Normally-Distributed Stress and Strength Distributions is: Z= where: Z=
μx = μy = σx = σy =
ux − u y
σ x2 + σ y2
Standard Normal variant (i.e., the number of standard deviations from the normal standardized distribution). The value for “Z” can be obtained from: 1. Tables of the Standard Normal distribution 2. MS EXCEL formula = Normdist(Z) the mean of the strength the mean of the stress the standard deviation of the strength the standard deviation of the stress 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 108
Chapter 2: General Assessment Approach
In many real situations, distributions other than the Normal are used, requiring alternate methods of calculating the interference probability. Readily available software tools can be used for this purpose (Reference 3). As stated previously, in addition to the probability of failure at t=0, it is also critically important to understand how this interference between stress and strength behaves as a function of time. Items will sometimes age (due to mechanisms such as fatigue), which essentially means that the strength distribution changes such that its mean is lowered. Assuming that the stress to which the item is exposed remains constant, the result is that there is more interference, and the failure probability increases with time. To properly account for this aging phenomenon, the characteristics of this strength distribution and the interference must be quantified as a function of time. This concept is illustrated in Figure 2.5-45.
Figure 2.5-45: Stress/Strength Interference vs. Time
Reliability Information Analysis Center 109
Chapter 2: General Assessment Approach
An example of a model that has been successfully used for brittle materials is the following: ⎛ ⎜ ⎛V P = 1 − exp⎜ − ⎜⎜ ⎜ ⎝ V0 ⎝
⎞⎛ σ ⎟⎟⎜⎜ ⎠⎝ S 0
⎞ ⎟⎟ ⎠
m
⎛t ⎜⎜ ⎝ t0
m ⎞ ⎞n ⎟ ⎟⎟ ⎟ ⎠ ⎟ ⎠
where: P= probability of failure m = Weibull slope of the initial strength S0 = characteristic strength n= fatigue constant V and V0 are volume parameters to account for the effects of size (i.e., they account for the effect that the more volume or surface area that there is, the more likely it is to have a strength limiting flaw) σ= stress Now, if a screen is applied to the material to eliminate defects having strength values below the applied screen stress threshold (Sth), the probability of failure becomes: ⎛ ⎜ ⎜ ⎛V P = 1 − exp⎜ − ⎜⎜ ⎜ ⎝ V0 ⎜ ⎜ ⎝
1 m ⎞ ⎞ ⎛ ⎟ ⎜ ⎛ t0 ⎞ n ⎟ m σ S − ⎜ ⎟ ⎟ th ⎞⎜ t ⎠ ⎟ ⎛ t ⎞n ⎟ ⎝ ⎟ ⎜ ⎟ ⎟⎟⎜ S0 ⎟ ⎜⎝ t 0 ⎟⎠ ⎟ ⎠⎜ ⎟ ⎟ ⎜ ⎟ ⎜ ⎟ ⎠ ⎝ ⎠
This is only one example of a stress strength model. Many others can be found in the literature. Models such as these can be invaluable in understating the sensitivity of reliability as a function of the factors accounted for in the model. However, as is the case with any physics-based model, it is important to validate the model based on empirical evidence. This is critical because there is ample opportunity to introduce large errors in the analysis, based on extreme sensitivity to assumptions, sample variability, etc. Additionally, while the approach may be grounded in physics, the model parameters usually need empirical data for their quantification.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 110
Chapter 2: General Assessment Approach 2.5.2.2. First Principals
The premise of First Principals is that the fundamental physics that govern a failure mechanism can be characterized, and that the reliability of the mechanism can be accurately predicted from these equations. This is best illustrated with an example from References 4 and 5. In this example, the reliability of a Fused Biconic Splitter was modeled. This is a passive optical component used to split optical signals in fiber optic telecommunication systems. The observed failure mode was a degradation of the coupling ratio over time. The original test plan included Accelerated Aging Tests on Fused Splitters for 3 conditions, as shown in Table 2.5-12. Table 2.5-12: Test Conditions Test Conditions
Temperature (°C)
85°C / 85% RH 85°C / 16% RH 45°C / 85% RH
X X
Relative Humidity (RH) X X
Absolute Humidity (AH) X X
The “X” values in the cells indicate which test conditions have a constant value for the stress indicated in each column. The values were chosen to assess whether relative humidity or absolute humidity was the predominant mechanism of the failure mode. In this case, two of the three conditions have equivalent relative humidity and two of three have equivalent absolute humidity. The results of the accelerated tests did not agree with a previously hypothesized failure mechanism that proposed epoxy creep as the coupling ratio drift mechanism. Therefore, in an effort to obtain a model that was consistent with empirical evidence, the fundamental physics were investigated. This process is described below: From optical component physics, it can be shown that the coupling between two fibers is:
c=
3πλ 1 2 32n2 a (1 + 1 / V )2
where:
V ≡ ak (n22 − n32 )1 / 2 Reliability Information Analysis Center 111
Chapter 2: General Assessment Approach
Additionally, the diffusion of water vapor into silica can be represented as: C (r , t ) = C 0⎜⎜1 − ∑ BnJ 0( jnr / b) exp{− jn 2 [ DH 2O(T )t / b 2 ]}⎟⎟ ⎛ ⎜
∞
⎞ ⎟
⎜ ⎝
n =1
⎟ ⎠
where:
Bn ≡ 2 /[ jnJ 1( jn)] The hypothesis of the physical mechanism is that water diffuses into the outer surface of the fused region very slowly and slightly decreases the index of refraction of this outer surface. This increases the coupling coefficient, thereby increasing the coupling ratio. As time goes by, more and more water diffuses in, and the coupling ratio increases until the device goes out of spec. The amount of water in the silica is simply the number of water molecules hitting the surface of the silica per unit time (directly proportional to the absolute humidity) and the diffusion rate at that temperature. Therefore, if the time to failure at a specific condition is known, the time to failure at a new condition is the known TTF multiplied by a ratio of the absolute humidity level times the ratio of the diffusion rates. The data obtained in the tests were used to estimate the diffusion rate and the temperature dependence of this diffusion rate, as shown in Table 2.5-13. Table 2.5-13: Data to Estimate Diffusion Rate
85
ABS HUM grams H2O per m3 297.1
DIFFUSION CONSTANT cm2/sec 6.63x10-18
45
85
55.4
High Temp/Medium Humidity Chamber
85
16
Underground
25
Footway Box
15
TEMP °C.
RH %
High Temp/High Humidity Chamber
85
Med Temp/High Humidity Chamber
SERVICE CONDITION
RATIO
MTBF (Years)
1
0.579
2.60x10-19
137
79
56.0
6.63x10-18
5
3
85
19.6
3.73x10-20
2691
1559
93
11.9
1.73x10-20
9552
5535
The predictions from the model were then obtained. The predicted and observed lifetimes are shown in Table 2.5-14. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 112
Chapter 2: General Assessment Approach
Table 2.5-14: Predicted Lifetimes vs. Observed ENVIRONMENTAL CONDITION
High Temp/High Humidity Chamber High Temp/Medium Humidity Chamber
TEMP °C.
RH %
MTBF in Hours (Predicted)
MTBF in Hours (Measured)
% Difference
85
85
5072
5072
0
85
16
26,909
27,500
2
As can be seen above, the model is extremely accurate in predicting the failure mechanism behavior. Models developed from first principals like the one shown in this example can be very accurate and, thus, beneficial to a reliability program. However, several pieces of information were required in order to make this approach a viable alternative: •
•
Detailed component information, including: o Index of refraction of the core and cladding on the fiber used in the component o Fiber dimensions, and model constants in the above equations The ability to generate a closed form equation that describes: o Water diffusion rates into silica o Optical coupling ratio as a function of component design parameters
While it would be desirable to model the reliability of every conceivable failure mechanism in this manner, practical constraints of most reliability practitioners make this difficult to apply to complex systems. The primary reason for this is that information like that summarized above is not practical to obtain in many cases. Additionally, with complex systems, there can be thousands of possible failure causes which would need to be modeled in order to obtain a system reliability estimate. The primary difference between this approach and the DOE-based life modeling approach previously described is the manner in which the model form is determined. In the DOE approach, the model forms are assumed and are based on standard forms like the power law or the Arrhenius law. In the physics approach, the model forms are determined from first principals of physics. In both cases, however, certain model parameters are generally estimated from empirical data. Reliability Information Analysis Center 113
Chapter 2: General Assessment Approach
2.6. Combine Data Once the data for each item has been analyzed and reliability estimates have been made using any of the methods descried previously, the information needs to be combined to form the best estimate of product or system reliability. The methodology of 217Plus was developed for this specific purpose and can be used as a framework from which to perform this combination. Figure 2.6-1 summarizes the 217Plus methodology for estimating the failure rate of a product or system. In this example, only constant failure rates are addressed. If specific items are described by non constant failure rates, the mathematics become more difficult, but the basic approach remains the same.
Figure 2.6-1: 217Plus Approach to Failure Rate Estimation The specific approach that can be used depends on several factors, including: • •
Whether information exists on a predecessor product or system The amount of empirical reliability data available on that product or system 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 114
Chapter 2: General Assessment Approach
•
Whether the analyst chooses to evaluate and assess the processes used in the development of the product or system
The types of data that may be available can be any of the types summarized previously in this section of the book. If the product or system under analysis is an evolution of a predecessor item, the field experience of the predecessor product can be leveraged and modified to account for the differences between the new product and the predecessor product. A predecessor is defined as a product or system that is based on similar technology and uses design/manufacturing processes similar to the new item under development for which a reliability prediction is desired. In this case, the new product or system is an evolution of its predecessor. In this analysis, a prediction is performed on both the predecessor item and the new item under development. These two predictions form the basis of a ratio that is used to modify the observed failure rate of the predecessor, and account for the degree of similarity between the new and predecessor products pr systems. The result of the predecessor analysis is expressed as λ1, as presented in Figure 2.6-1. If enough empirical data (field, test, or both) is available on the new product or system under development, it can be combined with the reliability prediction on the new item to form the best failure rate estimate possible. A Bayesian approach is used for this combination, which merges the reliability prediction with the available data. As the quantity of empirical data increases, the failure rate using the Bayesian combination will be increasingly dominated by the empirical data. The result of the Bayesian combination is defined as λ2, as presented in Figure 2.6-1. The minimum amount of analysis required to obtain a predicted failure rate for a product or system is the summation of the component estimated failure rates. The component failure rates are determined from the component models, along with other data that may be available to the analyst. The result of this component-based prediction is λIA,new. This value can be further modified by incorporating the optional data, resulting in λpredicted,new, as shown in Figure 2.6-1. All methods of analysis require that a prediction be performed on the new product or system under development in accordance with the component prediction methodology. Predictions based solely on the component analysis should be used only when there is no field or test reliability history for the new item and no suitable predecessor item with a field reliability history. In this case, the reliability model is purely predictive in nature. After a product or system has been fielded, and there has been a significant amount of operating time, the best data on which to base a failure rate estimate is field observed data, or a combination of prediction and observed Reliability Information Analysis Center 115
Chapter 2: General Assessment Approach
failure data. In this case, the reliability model yields an estimate of reliability, because the reliability is estimated from empirical data. Each element of the 217Plus methodology is further described in the following sections.
λIA,predecessor
λIA,predecessor is the initial reliability assessment of the predecessor product or system. It is the sum of the predicted component failure rates, and uses any of the methods described in this book.
λobserved, predecessor
λobserved, predecessor is the observed failure rate of the predecessor product or system. It is
the point estimate of the failure rate, which is equal to the number of observed failures divided by the cumulative number of operating hours5.
Optional data Optional data is used to enhance the predicted failure rate by adding more detailed data pertaining to environmental stresses, operating profile factors, and process grades (the concept of process grades is explained in detail in Chapter 7). The 217Plus models contains default values for the environmental stresses and operational profile, but in the event that actual values of these parameters are known, either through analysis or measurements, they should be used. The application of the process grades is also optional, in that the user has the option of evaluating specific processes used in the design, development, manufacturing and sustainment of a product or system. If process grades are not used in a 217Plus analysis, default values are provided for each process (failure cause), so that the user can evaluate any or all of the processes.
λpredicted, predecessor
λpredicted, predecessor is the predicted failure rate of the predecessor product or system after combining the initial assessment with any optional data, if appropriate.
λIA,new
λIA,new is the initial reliability assessment of the new product or system. This is the sum of the predicted component failure rates, and uses the 217Plus component failure rate models or other methods (such as data from NPRD or other data sources). A reliability 5 Note that “operating hours” can be replaced by any other life unit, such as calendar hours, miles, cycles, etc. The 217Plus methodology predicts failure rates in terms of calendar hours. The important point is that all life units used in the assessment must be consistent.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 116
Chapter 2: General Assessment Approach
prediction performed in accordance with this method is the minimum level of analysis that will result in a predicted reliability value. Applying any optional data can further enhance this value.
λpredicted, new
λpredicted, new is the predicted failure rate of the new system after combining the initial
reliability assessment with any optional data, if used. If optional data is not used, then
λpredicted, new is equal to λIA,new
λ1
λ1 is the failure rate estimate of the new system after the predicted failure rate of the new system is combined with the information on the predecessor product (predicted and observed data). The equation that translates the failure rate from the old product or system to the new one is:
λ1 = λ predicted , new ×
λobserved , predecessor λ predicted , predecessor
The values for λpredicted,new and λpredicted,predecessor are obtained using the component reliability prediction procedures. The ratio of λobserved,predecessor /λpredicted,predecessor inherently accounts for the differences in the predicted and observed failure rates of the predecessor system, i.e., it inherently accounts for the differences in the products or systems analyzed in the component reliability prediction methodology. This methodology can be used when the new product or system is an evolutionary extension of predecessor designs. If similar processes are used to design and manufacture a new item, and the same reliability prediction processes and data are used, then there is every reason to believe that the observed/predicted ratio of the new system will be similar to that observed on the predecessor system. This methodology implicitly assumes that there is enough operating time and failures on which to base a value of λobserved,predecessor. For this purpose, the observance of failures is critical to derive a point estimate of the failure rate (i.e., failures divided by hours). A single-sided confidence level estimate of the failure rate should not be used.
ai
ai is the number of failures for the ith set of data on the new product or system. Reliability Information Analysis Center 117
Chapter 2: General Assessment Approach
bi
bi is the cumulative number of operating hours for the ith set of data on the new product or system.
AFi
AFi is the acceleration factor (AF) between the conditions of the test or field data on the new product or system and the conditions under which the predicted failure rate is desired. If the data is from a field application in the same environment for which the prediction is being performed, then the AF value will be 1.0. If the data is from accelerated test data or from field data in a different environment, then the AF value needs to be determined. If the applied stresses are higher than the anticipated field use environment of the new system, AF will have a value greater than 1.0. The AF can be determined by performing a reliability prediction at both the test and use conditions. The AF can only be determined in this manner, however, if the reliability prediction model is capable of discerning the effects of the accelerating stress(es) of the test. As an example, consider a life test in which the product was exposed to a temperature higher than what it would be exposed to in field-deployed conditions. In this case, the AF can be calculated as follows: AF =
λT 1 λT 2
where:
λT1 = the predicted failure rate at the test conditions obtained by performing a reliability prediction of the system at temperature 1 λT2 = the predicted failure rate at the use conditions obtained by performing a prediction at temperature 2
b i’
bi’ is the effective cumulative number of hours of the test or field data used. If the tests were performed at accelerated conditions, the equivalent number of hours needs to be converted to the conditions of interest, as follows:
bi ' = bi × AFi
ao
ao is the effective number of failures associated with the predicted failure rate. If this value is unknown, then use a default value of 0.5. In the event that predicted and 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 118
Chapter 2: General Assessment Approach
observed data is available on enough predecessor products or systems, this value can be tailored. See the next section for the appropriate tailoring methodology.
λ2
λ2 is the best estimate of the new system failure rate after using all available data and information. As much empirical data as possible should be used in the reliability assessment. This is done by mathematically combining λ1 with empirical data. Bayesian techniques are used for this purpose. The technique accounts for the quantity of data by weighting large amounts of data more heavily than small quantities. λ1 forms the “prior” distribution, comprised of a0 and ao/λ1. If empirical data (i.e., test or field data) is available on the system under analysis, it is combined with λ1 using the following equation: n
λ2 =
a 0 + ∑ ai a0
λ1
i =1 n
+ ∑ bi ' i =1
where λ2 is the best estimate of the failure rate, and ao is the “equivalent” number of failures of the prior distribution corresponding to the reliability prediction. For these calculations, 0.5 should be used unless a tailored value can be derived. An example of this tailoring is provided in the next section. ao/λ1 is the equivalent number of hours associated with λ1. a1 through an are the number of failures experienced in each source of empirical data. There may be “n” different sources of data available (for example, each of the n sources corresponds to individual tests or field data from the total population of products or systems). b1’ through bn’ are the equivalent number of cumulative operating hours experienced for each individual data source. These values must be converted to equivalent hours by accounting for any accelerating effects between the use conditions.
Reliability Information Analysis Center 119
Chapter 2: General Assessment Approach
Tailoring the Bayesian Constant, ao This section discusses tailoring of the ao value used in the Bayesian equations. The value of ao is proportional to the degree of weighting given to the predicted value (λ1). The value of the constant, a0, is chosen such that the uncertainty in the failure rate estimate, as calculated with the chi square distribution, equates to the observed uncertainty. The default value of 0.5 to be used in the equation is based on the observed/predicted ratio derived from a wide variety of systems, applications, industries, etc. As such, there are many “noise factors” contributing to the variability in this ratio. However, if the user of the 217Plus model has enough data on which to derive a tailored value of a0, it should be derived and used. While the default value of 0.5 represents the large degrees of uncertainty inherent when a diverse data set is used, a specific 217Plus user will generally be analyzing products with a much more narrow focus, in terms of product type, environment, operating profile, etc. As such, with enough data, the value of a0 can be increased. As an example of calculating a value for a specific application, consider the an example for a product used in a telecommunications system.
To estimate the value of ao that should be used, a distribution of the following metric is calculated for all products for which both predicted and observed data is available:
λobserved, predecessor λ predicted, predecessor The lognormal distribution will generally fit this metric well, but others (for example, Weibull) can also be used. The cumulative value of this distribution is then plotted. Next, failure rate multipliers (as calculated by a chi square distribution) are calculated and plotted. This chi-square distribution should be calculated and plotted for various numbers of failures, to ensure that the distribution of observed/predicted failure rate ratios falls between the chi-square values. In most cases, one, two and three failures should be sufficient. Next, the plots are compared to determine which chi-square distribution most closely matches the observed uncertainty values. The number of failures associated with that distribution then becomes the value of a0. Figure 2.6-2 illustrates an example for which this analysis was performed.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 120
Chapter 2: General Assessment Approach
Figure 2.6-2: Comparison of Observed Uncertainty with the Uncertainty Calculated With the Chi-square Distribution As can be seen from Figure 2.6-2, the observed uncertainty does not precisely match the Chi-square calculated uncertainty for any of the one, two or three failures used in this analysis. This is likely due to the fact that the population of products on which this analysis is based is not homogeneous, as assumed by the chi-square calculation. However, the confidence levels of interest are generally in the range 60 to 90 percent. In this range, the chi-square calculated uncertainty with 2 failures most closely approximates the observed uncertainty. Therefore, in this example, an a0 value of 2 was used. This value is also consistent with the Telcordia GR-332 reliability prediction methodology (Reference 6). The uncertainties represented by the distribution of observed/predicted failure rates are typical of what can be expected when historical data on predecessor products or systems are collected and analyzed to improve the reliability prediction process. Using this example, one can be 80% certain that the actual failure rate for a product or system will be less than 2.2 times the predicted value. 2.6.1. Bayesian Inference
Figure 2.6-3 depicts the outline of the Bayesian inference approach. The available information about the model parameter vector, θ, in the form of prior distribution, f0(θ), Reliability Information Analysis Center 121
Chapter 2: General Assessment Approach
is transformed to a new state of knowledge, represented by posterior distribution, f(θ). The likelihood function represents data in the Bayesian framework, and determines how much the data may influence the prior knowledge.
Model for Failure Data Failure Data Prior f0 (θ)
Likelihood L( Failure Data | θ )
Posterior
Bayesian Inference
f (θ) = L( θ | Failure Data)
Figure 2.6-3. Bayesian Inference Outline The mathematical description of the Bayesian transformation is defined by the equation below. The normalization factor appearing in the denominator is inevitable when dealing with conditional probability calculations.
f (θ ) = f (θ DATA) =
f 0 (θ ) × L(DATA θ )
∫θ f (θ ) × L(DATA θ ) dθ 0
where,
θ = f(θ) = f0(θ) =
the vector of model parameters, (θ1, θ2, …, θn) the posterior joint distribution of parameters the prior joint distribution of parameters L(data|θ) = the likelihood of data given the model parameters
In practice, the features of this distribution include the updated marginal and conditional distribution of each parameter given the provided information. The marginal distribution of a single parameter is defined by the next equation. The marginal distribution is estimated by integrating the posterior joint distribution, f(θ), over the range of other parameters, as shown. The other important outcome of the posterior joint distribution is 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 122
Chapter 2: General Assessment Approach
the conditional distribution of each parameter, when other elements of vector θ are given. The conditional distribution is constructed by substituting the known parameters in the joint distribution, f(θ). Here again, the function needs to be scaled by a normalization factor, as demonstrated in the equation below, in order to be consistent with the basic characteristics of the distribution functions.
f j (θ j ) = ∫
θ− j
f (θ1 ,θ 2 ,...,θ j ,...,θ n )dθ − j
(
)
g j θ j | θˆ− j =
∫θ
f (θ j , θˆ− j ) f (θ , θˆ )dθ j
−j
j
j
where:
θ-j =
) θi =
(θ1, … , θj-1, θj+1, … , θn) the given value for θi
The integrals necessary for Bayesian computation usually require analytic or numerical approximations. While the computations for non-constant failure rate distributions can get quite involved, they are relatively straightforward for the exponential distribution. The method explained in the previous section details this situation.
2.7. Develop System Model There are several options that the analyst has for merging the reliability models of all of the failure causes, components, etc. In decreasing order of rigor, they are: 1. Perform a Monte Carlo analysis, where the TTF distribution of each element is preserved, and the operating time and number of failures is modeled 2. If all of the failure causes are independent, the options are: a. Calculate the reliability of each cause at a specific time of interest, and then calculate the reliability as: n
R(t ) = ∏ Ri (t ) i =1
where there are n items, and Ri is the reliability of each item b. Convert the reliability estimate of each element to a constant failure rate, and calculate the reliability as: Reliability Information Analysis Center 123
Chapter 2: General Assessment Approach n
λ = ∑ λi i =1
if the following conditions are satisfied: 1. The analysis is performed only to the component level, without modeling the specific failure causes 2. A constant failure rate distribution is used 3. All components are required for the product or system to meet its requirements (i.e., failure probability values are independent) Then, the product reliability is simply the product of the reliabilities of the individual components, or likewise the failure rate is the sum of the failure rates of the individual constituent components. This has been the traditional approach when using the “handbook” types of methodologies. If all of the above listed conditions are not present, then more sophisticated techniques are required. For example, consider the situation in which Condition 1 and 2 are not satisfied, but Condition 3 is. In this example, let’s say that there are seven failure causes for which the life modeling has resulted in an estimate of the TTF distribution under field use conditions. These distributions can be any arbitrary shape, dependent entirely on the characteristics of the specific failure causes. This situation is depicted in Figure 2.7-1, where the reliability block diagram is shown as a series configuration. Each failure cause is represented by Events 1 through 7, each of which has its own probability density function.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 124
Chapter 2: General Assessment Approach
Event 1
Event 2
0.001
0.002
8.000E-4
0.001
6.000E-4
Event 5
Event 7
Probability Density Function 0.001
9.000E-4
Probability Density Function
9.000E-4
2.000E-4
Event 6
Probability Density Function
Probability Density Function
Probability Density Function
Probability Density Function
Probability Density Function 0.002
Event 4
Event 3
0.003
8.000E-4
7.200E-4 1.600E-4
7.200E-4
1.200E-4
5.400E-4
0.002
6.000E-4
5.400E-4
f(t)
f(t)
f(t)
f(t)
f(t)
f(t)
f(t)
0.002
4.000E-4
3.600E-4
8.000E-4
4.000E-4
4.000E-4
2.000E-4
8.000E-5
3.600E-4
4.000E-5
1.800E-4
0.001
2.000E-4
1.800E-4
0.000 0.000
1000.000
2000.000
3000.000
Time, (t)
4000.000
5000.000
0.000 0.000
1000.000
2000.000
3000.000
Time, (t)
4000.000
5000.000
0.000 0.000
1000.000
2000.000
3000.000
Time, (t)
4000.000
5000.000
0.000 0.000
6.000E-4
0.000 0.000
600.000
1200.000
1800.000
2400.000
3000.000
0.000 0.000 1000.000
2000.000
3000.000
4000.000
Time, (t)
Time, (t)
5000.000
0.000 0.000
600.000
1200.000
1800.000
2400.000
3000.000
1000.000
2000.000
3000.000
4000.000
5000.000
Time, (t)
Time, (t)
Time to first failure defines system Time to Failure (TTF) Probability Density Function 0.002
0.002
f(t)
0.001
8.000E-4
4.000E-4
0.000 0.000
1000.000
2000.000
3000.000
4000.000
5000.000
Time, (t)
Figure 2.7-1: Combining Seven Failure Cause Distributions For repairable systems, in which repairs are made as failures occur, the system reliability would be simulated over a given time period, such as the mission duration or the warranty period. In this case, failure times are simulated from time = 0 to the specified time period. In this simulation, multiple systems are simulated, for which the failure times of the constituent components are also simulated. As failures occur, new replacement components are installed which have a new component time zero (the system operating time will not be zero, but will be the cumulative operating time). This continues until the duration is exceeded for each of the simulated systems. The resulting failure times for the system can then be analyzed, and the distribution parameters defined. The resultant distribution will generally not be a mono-modal distribution; rather, it will be a distribution of an arbitrary shape that is usually represented by a multi-modal distribution. It is noteworthy that the above model is valid for any situation in which all items are critical, i.e., the failure of any one item results in product or system failure. For example, the Fault Tree for this situation may look like Figure 2.7-2. In this case, all gates are OR gates, which means that all failures of items represented by Events 1 through 7 constitute critical failures. This is shown to illustrate the fact that the analysis does not necessarily Reliability Information Analysis Center 125
Chapter 2: General Assessment Approach
need to be performed at the same level of hierarchy. The most important thing is that all of the critical failure causes are accounted for.
TOP
OR
Event 3
OR
OR Event 1
OR
Event 2
Event 4
Event 5
Event 6
Event 7
Figure 2.7-2: Possible Fault Tree Representation of a Series Reliability Block Diagram For non repairable systems, in which the first failure causes system failure and all items represented by each failure cause are required for the product or system to function, this becomes a competing risk situation in which the first failure cause to occur will define the item’s TTF distribution. The Type 1 extreme value distribution, also known as the Gumbel distribution, is sometimes used to model this situation when components have the same reliability distribution. This competing risk situation, modeled with times to first failure (TTFF), will not yield the same results as taking either the product of the reliability values or the sum of the failure rates (in the case of constant failure rates) because, in the latter cases, there is a probability that multiple failures will occur in the time period analyzed, which is not the case for the competing risk situation. For all but the simplest of situations, closed-form solutions cannot be obtained. These require solutions with numerical simulation, like Monte Carlo analysis, which is described in the next section. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 126
Chapter 2: General Assessment Approach
2.7.1. Monte Carlo Analysis
Monte Carlo analysis is a powerful analytical technique that allows for the estimation of parameters or factors in cases where closed-form statistical derivations are not possible. This occurs in many reliability engineering analyses, making it an invaluable tool. Monte Carlo analysis can be used for several purposes: 1. To determine the time to first failure, as in the previous example 2. To determine the probability of failure from a stress/strength interference model For #2, there are handbooks available which provide estimates of interference probability based on the individual stress and strength distributions. Or, a statistical simulation can be performed to estimate the degree of interference via numerical techniques. This is generally a more efficient and effective way of performing the simulation, given software tools that are readily available. The basic principal behind Monte Carlo analysis, as applied to stress/strength interference analysis is shown here: 1. First, the stress and strength distributions are determined 2. A randomly selected value from each distribution is obtained 3. The randomly selected values are compared, and if the selection from the strength distribution is less than the selection from the stress distribution, a failure is considered to have occurred. If it is not, then success is considered to have occurred. 4. This process is repeated many times, and the number of trials and the number of failures are counted. The number of trials needs to be large enough to result in a good estimate of the failure probability. The failure probability is equal to the total number of failures divided by the total number of trials. F =
N strength < stress N
where: F= N=
the failure probability the total number of trials
More detail regarding the process is described on the next page. Reliability Information Analysis Center 127
Chapter 2: General Assessment Approach
The first step is to randomly select a value from each of the stress and strength distributions. As an example, consider a normally distributed strength with a mean of 10 and standard deviation of 3, the pdf of which is shown in Figure 2.7-3.
0.14 0.12 0.1 0.08 0.06 0.04 0.02 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Figure 2.7-3: pdf of Normal Distribution with Mean of 10 and Standard Deviation of 3. Next, the cumulative function of this distribution is calculated, as shown in Figure 2.7-4.
1 0.8 0.6 0.4 0.2 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Figure 2.7-4: Cumulative Normal Distribution with Mean of 10 and Standard Deviation of 3 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 128
Chapter 2: General Assessment Approach
Next, the randomly selected value from the distribution is obtained by: • •
Selecting a random number between 0 and 1. This number is displayed on the yaxis. Then, the value on the x-axis corresponding to this y-value is determined a shown in Figure 2.7-5.
Figure 2.7-5: Value Selection From a Distribution Distributions typically used in stress/strength analysis include the Normal distribution and the Weibull distribution. The Normal cumulative distribution does not have a closedform solution, and requires the solution of an integral for its computation. However, software programs have simplified this calculation. For example, the MS EXCEL function for this calculation is. NORMINV(rand(),mean,standard deviation) where: Rand() returns a random number between 0 and 1 The mean and standard deviation are the values from the sampled distribution
Reliability Information Analysis Center 129
Chapter 2: General Assessment Approach
The Weibull distribution is simpler to use than the Normal distribution since an integral of the pdf is not required to derive the CDF. The closed-form pdf of the Weibull distribution is:
β⎛t⎞ f (t ) = ⎜ ⎟ α ⎝α ⎠
β −1
e
⎛ t ⎞ −⎜ ⎟ ⎝α ⎠
β
The reliability function (1 - cumulative function (CDF)) is:
R(t ) = e
⎛t ⎞ −⎜ ⎟ ⎝α ⎠
β
The Weibull distribution is one of the most widely used distributions in reliability engineering due to its versatility. It also has the advantage of having a closed-form solution for its cumulative function. To select a random value from this distribution, a random number between 0 and 1 is selected, this value is substituted for R(t) and the corresponding TTF is determined from the equation. In this example, time(t) is shown as the independent variable, but the specific parameter could be any parameter whose distribution is used in a Monte Carlo analysis. The inverse cumulative function is shown in Figure 2.7-6, along with the selection of the random value.
Figure 2.7-6: Value Selection From a Weibull Distribution 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 130
Chapter 2: General Assessment Approach
Now, let’s consider another application of Monte Carlo simulation. In this example, a simple relationship between items for a repairable system is shown in Figure 2.7-7. Here, the items can be failure causes, components or assemblies, in accordance with the level to which the analysis is performed.
Figure 2.7-7: Reliability Block Diagram of Redundant Example For this example, A and (B or C) need to be operational for the system to function TTFi and TTRi are the times to failure (TTF) and times to repair (TTR), taken from the governing distributions of each The behavior for each item, along with the resultant system behavior, is shown in Figure 2.7-8.
Figure 2.7-8: System Monte Carlo Example For example, item A operates until it fails at TTFA1. At that point in time, it takes TTRA1 to repair it. Items B and C fail and get repaired at rates determined by the simulated Reliability Information Analysis Center 131
Chapter 2: General Assessment Approach
times for each, and governed by the specific distribution of each. The resultant system availability (Asystem) is shown on the bottom. A simulation was performed on this hypothetical system using a software tool, the results of which are shown in Figure 2.7-9. In this case, the following metrics were calculated from the Monte Carlo analysis: • • • • • • • • •
Ao: MTBDE: MDT: MTBM: MRT: % green time: % yellow time:
Availability (% of total time the system is available) Mean Time Between Downing Events Mean Down Time Mean Time Between Maintenance Mean Repair Time Percent of time that all units are operational Percent of time that at least one unit is not operational, but the system still operates % red time: Percent of time that at least one critical item is not operational Number of failures: The number of simulated failures per run
Figure 2.7-9: Monte Carlo Simulation of Example System
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 132
Chapter 2: General Assessment Approach
Simulations of product reliability, as described above, are generally the best way to combine life estimates of constituent parts in a system. If a system is comprised of redundant elements, closed-form equations are available that calculate the effective failure rate of the redundant elements. However, care must be taken when using these equations. For example, the manner in which they are generally derived is to calculate the failure characteristics as time approaches infinity. Only in this manner are closedform solutions possible. The results are “effective” failure rate estimates that often underestimate the benefits of redundancy. This is especially true when mission times are relatively short. As a result, calculating reliability based on the failure probability examples described above is generally a more sound approach. Additionally, the availability of software tools has made it much easier to perform these calculations.
2.8. References 1. “Production Part Approval Process (PPAP)”, Third Edition, Daimler-Chrysler, Ford , General Motors, 1999) 2. Modarres, M., “Accelerated Testing”, ENRI 641, Univ. of Maryland, May 2005 3. Weibull++, Reliasoft Corp. 4. Colm V. Cryan, James R. Curley, Frederick J. Gillham, David R. Maack, Bruce Porter, and David W. Stowe, “Long Term Splitting Ratio Drifts in Singlemode Fused Fiber Optic Splitters”, NFOEC 95 5. David R. Maack, David W. Stowe and Frederick J. Gillham, “Confirmation of a Water Diffusion Model For Splitter Coupling Ratio Drift Using Long Term Reliability Data”, NFOEC 96 6. Telcordia GR-332, “Reliability Prediction Methodology” 7. Denson, W.K. and S. Keene, “A New System Reliability Assessment Methodology – Final Report”, Available from the Reliability Information Analysis Center, 1998
Reliability Information Analysis Center 133
Chapter 2: General Assessment Approach
This page intentionally left blank
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 134
Chapter 3: Fundamental Concepts
3.
Fundamental Concepts
The intent of this book is not to cover the basics of probability or reliability theory. The understanding of some of these fundamental concepts, however, is critical to the interpretation of reliability estimates. The definition of reliability is a probability, the value of which is estimated by the techniques covered in this book. Therefore, the basics of reliability terminology, and the basis for various theoretical concepts are covered in this section.
3.1. Reliability Theory Concepts There are two basic types of variables: Discrete and Continuous. A discrete variable is one that is limited to integer values (i.e., 0, 1, 2, 3,…). The probability distribution describing this type of variable is called a discrete distribution. For example, the distribution of the number of defects remaining in software programs after 6 months of development would be a discrete distribution, since a partial defect cannot exist. Figure 3.1-1 illustrates a discrete probability distribution.
p(x5) Probability - p(xi)
p(x4)
p(x6)
p(x3)
p(x7) p(x8)
p(x2) p(x1) x1
p(x9) x2
x3
x4
x5
x6
x7
x8
x9
Number of Remaining Defects (x)
Figure 3.1-1: Discrete Probability Distribution The probability that a random variable “x” takes on a specific value “xi” is expressed as:
P{x = xi } = p(xi ) Reliability Information Analysis Center 135
Chapter 3: Fundamental Concepts
A continuous variable is one that is measured on a continuous scale, and its probability distribution is defined as a continuous distribution. For example, the distribution of the TTF would be a continuous distribution, since an infinite number of positive time values can be represented in the distribution. Figure 3.1-2 illustrates a continuous distribution.
Figure 3.1-2: Continuous Probability Distribution The probability that a random variable “x” lies between the interval from “a” to “b” is expressed as: b
P{a ≤ x ≤ b} = ∫ f ( x)dx a
A probability distribution is characterized by a probability density function (pdf), f(t). The pdf is essentially a histogram of the random variable, often the TTF. For a discrete random variable, the pdf at a given value of the random variable is the probability that the realization of the random variable will take on that value. For a continuous random variable, the area under the pdf for a given interval is the probability that a realization of the random variable will fall within that interval (Figure 3.1-2). The probability density functions are non-negative for all values and the sum of the probabilities over all values for discrete random variables, or the total area under the pdf for continuous random variables, always equals 1.0. The cumulative distribution function F(t) is defined as the probability in a random trial that the random variable is not greater than t: 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 136
Chapter 3: Fundamental Concepts
F (t ) =
t
∫ f (t )dt
−∞
If the random variable is discrete, the integral is replaced by a summation. The Cumulative Distribution Function (CDF) is the probability that the value of a corresponding random variable will not be exceeded. Cumulative distribution functions are non-negative and non-decreasing. Given a random variable that cannot be negative, the value of the CDF at the origin is zero. The upper limit of a CDF is always 1.0, as illustrated in Figure 3.1-3. The CDF is the integral of the pdf, and is illustrated in Figure 3.1-3 for discrete and continuous distributions, respectively.
Figure 3.1-3: The Cumulative Distribution Function (CDF) The reliability function, R(t), is the probability of a device surviving (not failing) prior to time “t”, and is given by:
Reliability Information Analysis Center 137
Chapter 3: Fundamental Concepts ∞
R(t ) = 1 − F (t ) = ∫ f (t )dt t
Note that for the reliability, the integral of the pdf is from “t” to infinity for the probability of success, as opposed to minus infinity to “t” as in the case of the failure probability. The sum of the probability of success and the probability of failure needs to be 1.0, consistent with the definition of a pdf. By differentiating the above equation:
− dR(t ) = f (t ) dt The probability of failure in a given time interval between t1 and t2 can be expressed by the reliability function: ∞
∫ t1
∞
f (t )dt − ∫ f (t )dt = R (t1 ) − R (t 2 ) t2
The rate at which failures occur in the interval t1 to t2, the failure rate “λ(t)”, is defined as the ratio of the probability that a failure occurs within the interval, given that it has not occurred prior to t1 (the start of the interval), divided by total the interval length. Thus:
λ (t ) =
R (t1 ) − R (t 2 ) R (t ) − R (t + Δt ) = (t 2 − t1 )R (t1 ) (Δt )R (t )
where t = t1 and t2 = t + Δt. The hazard rate, h(t), or instantaneous failure rate, is defined as the limit of the failure rate as the interval length approaches zero, or:
⎡ R(t ) − R(t + Δt ) ⎤ 1 ⎡ − dR(t ) ⎤ h(t ) = lim(Δt → 0)⎢ = ⎥ ⎢ ⎥ ⎣ (Δt )R(t ) ⎦ R(t ) ⎣ dt ⎦ Since it was already shown that:
− dR(t ) = f (t ) dt 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 138
Chapter 3: Fundamental Concepts
Then, h (t ) =
f (t ) R (t )
In an attempt at providing an interpretation of the hazard rate function, consider the following: • • •
The hazard rate, h(t), is the rate at which failures occur, providing the item has not failed before the time h(t) is evaluated f(t) is the normalized percentage of the population failing in a given time interval (Δt), such that the population size times value of f(t) is equal to the number of failures in the interval of time. The denominator, R(t), is the probability of survival at t, which is equivalent to the percentage of the population surviving at time t.
Multiplying R(t) by the population size yields the total number of units surviving until “t”. This is the binomial probability, or expected value of the number of survivors at “t”. Since this population will have accrued an operating time of “RN*Δt”, the denominator is equivalent to the cumulative operating time on the population in the time interval. Therefore, h(t ) =
f (t ) f (t )× N # failures in Δt Failures = = = = Failure rate R(t ) R(t )× N # units surviving Δt item hours
Integrating both sides of the h(t) function results in:
h(t ) =
1 ⎡ − dR(t ) ⎤ R(t ) ⎢⎣ dt ⎥⎦
Resulting in:
R(t ) = e
⎡ t ⎤ ⎢ − ∫ h (t )dt ⎥ ⎣ 0 ⎦
This is the general expression for the reliability function. If h(t) can be considered a constant failure rate (λ), which is often the case, the equation becomes: Reliability Information Analysis Center 139
Chapter 3: Fundamental Concepts
R(t ) = e − λt The mean time to failure (MTTF) is the expected value of the time to failure, and is: ∞
MTTF = ∫ R(t )dt 0
If the reliability function can be easily integrated, this is a convenient way to calculate the mean time to failure (MTTF). If not, then numerical techniques can be used. If all parts in a population are operated until failure, the mean life is: n
θ=
∑t i =1
i
n
where: ti = n=
the time to failure of the ith item in the population total number of items in the population
The mean time between failure (MTBF) is:
MTBF =
T (t ) r
where: T(t) = total operating time r= number of failures Failure rate and MTBF are applicable only to the situation in which the failure rate is constant, i.e., the exponential TTF distribution. Per the definitions above, it can be seen that the failure rate and MTBF are reciprocals of each other:
λ=
1 MTBF
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 140
Chapter 3: Fundamental Concepts
The failure rate is the number of failures divided by the cumulative operating time of the entire population (failure/part hours), whereas the MTBF is the cumulative operating time of the entire population divided by the number of failures (part hours per failure). Table 3.1-1 provides an overview of the basic notation and mathematical representations that are common among the various types of probability distributions. Table 3.1-1: Probability Distribution Notation & Mathematical Representations Notation
X
x
Pr( X ∈ S ) f (x)
Definition Random Variable Realization of a Random Variable Probability That the Random Variable “ X ” is in the Set “ S ” Probability Density Function (PDF)
F (x)
Cumulative Distribution Function (CDF)
h(x)
Hazard Rate
R (x )
Mathematical Representation
⎧ ∑ f ( x), Discrete Distribution ⎪⎪ x∈S Pr( X ∈ S ) = ⎨ ⎪ ∫ f ( x ) dx , Continuous Distribution ⎪⎩ S ⎧ x ⎪ ∑ f ( w ), Discrete Distribution ⎪⎪ w =0 F (x) = ⎨x ⎪ ⎪ ∫ f ( w ) dw , Cumulative Distribution ⎩⎪ 0
h( x) =
f ( x) 1 − F(x)
Reliability
=
f ( x) R( x)
=
1 dF( x ) R ( x ) dx x
∞
R ( x ) = 1 − F ( x ) = ∫ f ( t ) dt = e
− ∫ h ( t ) dt 0
x
E[u( X )]
Expected Value
μ
Mean
σ
Standard Deviation
⎧ ⎪ ∑ u(w) f(w), Discrete Distribution ⎪⎪ w =0 E[ u ( X )] = ⎨ ∞ ⎪ ⎪ ∫ u ( w ) f ( w ) dw , Continuous Distribution ⎪⎩ 0 ∞
μ = E (X ) σ = E[( X − μ ) 2 ]
Note: These definitions are based on the assumption that all realizations of a random variable must be non-negative.
Reliability Information Analysis Center 141
Chapter 3: Fundamental Concepts
3.2. Probability concepts This section discusses some of the basic probability concepts that are important in reliability modeling. 3.2.1. Covariance
Covariance is a measure of the extent to which one variable is related to another, and is expressed as: Cov( X , Y ) =
n
∑
(x − x )(y − y ) i
i =1
i
n −1
3.2.2. Correlation Coefficient
The correlation coefficient is a defined as the standardized Covariance: r=
Cov( X , Y )
σ Xσ Y
Examples of various correlation coefficients are shown in Figure 3.2-1.
Figure 3.2-1: Examples of Correlation Coefficients 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 142
Chapter 3: Fundamental Concepts
3.2.3. Permutations and Combinations
A permutation is defined the number of ways of ordering “n” items taken “x” at a time, and is mathematically expressed as: Pr = n
n! (n − x )!
A combination is defined as the number of distinct combinations of “n” items taken “x” at a time, when ordering is not relevant, and is mathematically expressed as: Pr = n
n! x!(n − x )!
As an example of permutations and combinations, define n=4 and x=2. The number of combinations is: Pr = n
n! 4! = =6 x!(n − x )! 2!(4 − 2)!
Consider these combinations, as illustrated in Table 3.2-1. Here, there are 4 items (n=4), each of which can have two possible values (blank or “x”). Table 3.2-1: Combinations Example n 1
2
x
x
x
3
4
x
x
x x
x
x
x x
x
Reliability Information Analysis Center 143
Chapter 3: Fundamental Concepts
The corresponding number of permutations is: Pr = n
n! 4! = = 12 (n − x )! (4 − 2)!
Each set of 2 can be reversed, thus the number of permutations is double the number of combinations for n=4. 3.2.4. Mutual Exclusivity
Items are mutually exclusive when the occurrence of one event precludes the other. In other words, if one event occurs, the other cannot. This is the only case in which probabilities can be added. Mutual exclusivity is defined as: P(a or b ) = P(a ) + P(b )
where: P(a or b) = P(a) = P(b) =
probability of either event a or event b occurring probability of event a occurring probability of event b occurring
Mutually exclusive sets are those with no common members, shown in the Venn diagram in Figure 3.2-2.
Figure 3.2-2: Venn Diagram of Mutually Exclusive Events The expression “A∩B” signifies the “Empty” or “Null” set. 3.2.5. Independent Events
An independent event is one in which the probability of one event has no effect on the other, and is expressed as follows: P(a and b ) = P(a )P(b )
where: 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 144
Chapter 3: Fundamental Concepts
P(a and b) = P(a) = P(b) =
probability of both event a and event b occurring probability of event a occurring probability of event b occurring
The probability of either event a or b occurring is: P(a or b ) = P(a ) + P(b ) − P(a )P(b )
This is illustrated in Figure 3.2-3.
Figure 3.2-3: Independent Events 3.2.6. Non-independent (Dependent) Events
Non-independent (or dependent) events indicate that the probability of one event is dependent on the other, as shown: P(a and b ) = P(a )P(b a )
or P(a and b ) = P(b )P(a b )
where: P(a and b) = P(a) = P(b) = P(b|a) = P(a|b) =
probability of both event a and event b occurring probability of event a occurring probability of event b occurring probability of event b occurring, given that event a has occurred probability of event a occurring, given that event b has occurred Reliability Information Analysis Center 145
Chapter 3: Fundamental Concepts
3.2.7. Non-independent (Dependent) Events: Bayes Theorem
For non-independent (dependent) events, one event may have several different outcomes, each affecting the other event differently. This situation is mathematically described as: P (a1 b ) =
P(b a1 )∗ P(a1 )
∑ P(b a )∗ P(a ) i
i
where: P(b|a1) = Σ P(b|ai)*P(ai) =
probability of event b occurring, given that event a1 has occurred the total probability of event b occurring
The event set a is mutually exclusive. Therefore their probabilities can be added. 3.2.8. System Models
For independent failure causes, the reliability of a system is the product of the reliability values for the constituent failure causes, as shown: R = R1 R2 R3 .........Rn
If the failure rate is constant, the probability of survival for a specific cause is: R = e − λt
The system reliability is: e − λtotal t = e − λ1t e − λ2t e − λ3t .........e − λnt
Taking the natural log of both sides yields:
λtotal = λ1 + λ2 + λ3 + ..........λn The above equations are relevant to a series configuration of items, each with a constant failure rate. The fault tree representation of this configuration is shown in Figure 3.2-4. Here, the system reliability is represented by a logical OR gate, since the failure of A or B or C will cause system failure.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 146
Chapter 3: Fundamental Concepts
OR
A
B
C
Figure 3.2-4: Fault Tree OR Gate
The corresponding reliability block diagram representation for this scenario is shown in Figure 3.2-5. A
B
C
Figure 3.2-5: Reliability Block Diagram for an OR Gate
All possible outcomes for this example are shown in Table 3.2-2. Table 3.2-2: Combinations of an OR Configuration A
B
C
Output of OR Gate
Fail
Fail
Fail
Fail
Fail
Fail
Pass
Fail
Fail
Pass
Fail
Fail
Fail
Pass
Pass
Fail
Pass
Fail
Fail
Fail
Pass
Fail
Pass
Fail
Pass
Pass
Fail
Fail
Pass
Pass
Pass
Pass
Reliability Information Analysis Center 147
Chapter 3: Fundamental Concepts
Note that each of these eight possible outcomes in the table are mutually exclusive, in that there is only one possible way in which each of the eight can occur. As an example, if events A, B and C have the following reliability values: RA = 0.95 RB = 0.92 RC = 0.99 The reliability of the series configuration (i.e., the probability of exactly zero failures) of the three items is: R = RA RB RC R = .95 × .92 × .99 = .87
Now, suppose that several items must fail in order for the system to fail. This scenario is represented by an AND gate in a fault tree representation, as is shown in Figure 3.2-6.
AND
A
B
C
Figure 3.2-6: Fault Tree AND Gate
The corresponding Reliability Block Diagram (RBD) representation is shown in Figure 3.2-7. Note the parallel nature of this configuration.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 148
Chapter 3: Fundamental Concepts
B
C Starting Block
Ending Block A
Figure 3.2-7: Reliability Block Diagram for an AND Gate
All possible outcomes for this example are shown in Table 3.2-3. Table 3.2-3: Combinations of an AND Configuration A
B
C
Output of AND Gate
Fail
Fail
Fail
Fail
Fail
Fail
Pass
Pass
Fail
Pass
Fail
Pass
Fail
Pass
Pass
Pass
Pass
Fail
Fail
Pass
Pass
Fail
Pass
Pass
Pass
Pass
Fail
Pass
Pass
Pass
Pass
Pass
The reliability of this parallel configuration of three items is: RA = 0.95 RB = 0.92 RC = 0.99 R = 1 − (1 − RA )(1 − RB )(1 − RC )
Reliability Information Analysis Center 149
Chapter 3: Fundamental Concepts
R = 1 − (1 − 0.95)(1 − 0.92)(1 − 0.99) = 0.99996
As an example of a slightly more complex situation, consider the fault tree representation of a system in Figure 3.2-8.
TOP
AND
Event 3
OR
AND Event 1
OR
Event 2
Event 4
Event 5
Event 6
Event 7
Figure 3.2-8: Fault Tree of an AND/OR Combination
The RBD is shown in Figure 3.2-9.
Event 1 Extra Starting Block
Event 4 Event 3
Event 2
Event 6 Event 5
Figure 3.2-9: RBD of AND/OR combination 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 150
Event 7
Chapter 3: Fundamental Concepts
Combining the series and parallel events yields the following reliability expression for this configuration R = (1 − (1 − R1 )(1 − R2 ) )R3 (1 − (1 − R4 )(1 − R5 ) )R6 R7 3.2.9. K-out-of-N Configurations
A system consisting of “n” components or subsystems, of which only “k” need to be functioning for system success, is called a “k-out-of-n” configuration. For such a system, the integer value of “k” is always less than the integer value of “n”. Define the following as: R= Q= R+Q=
reliability of one unit for a specified time period unreliability of one unit for a specified time period 1
As an example, let us assume that there are three units operating in parallel, two of which are required for the system to perform adequately. If R=0.9 and Q=0.1, then the probabilities associated with each possible combination of outcomes is summarized in Table 3.2-4. Table 3.2-4: Example of “k-out-of-n” Probability Calculations
Probability
Prob of pass or fail of A
Prob of pass or fail of B
Prob of pass or fail of C
Fail
QAQBQC
0.1
0.1
0.1
0.1*0.1*0.1
0.001
Fail
Pass
QAQBRC
0.1
0.1
0.9
0.1*0.1*0.9
0.009
Fail
Pass
Fail
QARBQC
0.1
0.9
0.1
0.1*0.9*0.1
0.009
4
Fail
Pass
Pass
QARBRC
0.1
0.9
0.9
0.1*0.9*0.9
0.081
5
Pass
Fail
Fail
RAQBQC
0.9
0.1
0.1
0.9*0.1*0.1
0.009
6
Pass
Fail
Pass
RAQBRC
0.9
0.1
0.9
0.9*0.1*0.9
0.081
7
Pass
Pass
Fail
RARBQC
0.9
0.9
0.1
0.9*0.9*0.1
0.081
8
Pass
Pass
Pass
RARBRC
0.9
0.9
0.9
0.9*0.9*0.9
0.729
Outcome
A
B
C
1
Fail
Fail
2
Fail
3
Reliability Information Analysis Center 151
Total System Probability
Chapter 3: Fundamental Concepts
In this example, the probability of each combination of possible outcomes (in this case, eight) is calculated. Note that the sum of the probabilities for all possible outcomes is 1.0, since each of the eight possibilities is mutually exclusive and their probabilities can, therefore, be added. This approach of calculating the probability of every possible outcome is always valid, regardless of whether the reliability values of each of the elements are the same or not. For example, if two of the three units are required for the system to perform adequately, the system will “pass” if there are either no failures or if there is one failure, as shown below. This is summarized in Table 3.2-5. Table 3.2-5: Example of “2-out-of-3” Required for Success Outcome
A
B
C
Probability
Total Probability
System Pass or Fail
1
Fail
Fail
Fail
QAQBQC
0.001
Fail
2
Fail
Fail
Pass
QAQBRC
0.009
Fail
3
Fail
Pass
Fail
QARBQC
0.009
Fail
4
Fail
Pass
Pass
QARBRC
0.081
Pass
5
Pass
Fail
Fail
RAQBQC
0.009
Fail
6
Pass
Fail
Pass
RAQBRC
0.081
Pass
7
Pass
Pass
Fail
RARBQC
0.081
Pass
8
Pass
Pass
Pass
RARBRC
0.729
Pass
It can be seen that the system will pass with outcomes 4, 6, 7 and 8. Outcomes 4, 6 and 7 correspond to exactly one failure (i.e., there are three ways in which one failure can occur), and outcome 8 corresponds to exactly zero failures (there is only one way in which this can occur). If the probability of failure of all of the units is the same and they are independent, then the binomial or Poisson distributions can be used: • •
If the metric used in the reliability analysis is the probability of failure, use the binomial distribution If the metric is a failure rate, use the Poisson distribution
Since this example pertains to items with defined probabilities, the binomial distribution applies. As defined previously:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 152
Chapter 3: Fundamental Concepts r r n! ⎛n⎞ F (x; r ) = ∑ ⎜ ⎟ p x q n− x = ∑ p x q n− x x =0 ⎝ x ⎠ x =0 (n − x!)x!
where: n= x= r=
total number of items (3) good items (2 or 3) failed items (0 or 1)
The probability of exactly no failures (i.e., the first term in the above summation) is: F (3,0 ) =
n! 3! x n− x 3 3−3 p q = .9 q = 1 * 0.729 = 0.729 (3 − 3)!3! (n − x )! x!
The probability of exactly one failure (i.e. the second term in the above summation) is: F (2,1) =
n! 3! x n− x 2 3− 2 p q = .9 .1 = 3 * 0.81 * 0.1 = 0.243 (3 − 2 )!2! (n − x )! x!
Therefore, the cumulative binomial expression for 0 or 1 failures (r = 0 or 1) is:
F ( x; r ) =
r
n!
∑ (n − x)! x!p q
x n− x
= 0.729 + 0.243 = 0.972
x =0
Because the first term in the binomial probability expression is the number of combinations of a specific number of failures (or survivals) occurring, the number of combinations (as calculated by the first term) essentially adds the probabilities associated with the mutually exclusive events.
3.3. Distributions Reliability distributions are at the heart of a reliability model. They represent the fundamental relationship between the reliability metric of interest (probability of failure, failure rate, etc.) and the independent variable (TTF, cycles to failure, etc.). This independent variable is called the “life unit”. Table 3.3-1 summarizes probability distributions often used in reliability modeling, along with a description of their primary uses. Reliability Information Analysis Center 153
Chapter 3: Fundamental Concepts
Table 3.3-1: Probability Distributions Applicable to Reliability Engineering Probability Distribution Binomial
Type
Primary Uses
Discrete
Used to find the probability of “x” events occurring in a total of “n” trials, e.g., the number of failures in a sequence of a specified number of equal-length time intervals
Poisson
Discrete
Used to model the probability of a specified number of events occurring in a specified time interval
Exponential
Continuous
Used to describe the distribution of the time to failure when the failure rate is constant
Gamma
Continuous
Used to determine the distribution of the time by which a specified number of failures will occur when the failure rate is constant
Normal
Continuous
Used to describe the statistical mean of a sample taken from any population with a finite mean and variance. Often used to model parameter distributions. Rarely used for time to failure distributions.
Standard Normal
Continuous
The Standard Normal distribution (Z) is derived from the Normal for ease of analysis and interpretation (mean = 0; standard deviation = 1).
Lognormal
Continuous
Used to model many wear out failure causes
Weibull
Continuous
Used to describe the distribution of failures representing constant (i.e., exponential), increasing, or decreasing failure rates, depending on the value of the slope parameter (β). Increasing popularity due to its versatility. Applicable only when no repair is performed following failure.
Student t
Continuous
Used to test for statistical significance of the difference between the means of two samples
F Distribution
Continuous
Used to test for statistical significance of differences between the variances of two samples
Chi-Square
Continuous
A special case of the Gamma distribution, used to estimate confidence intervals around reliability test data, and to test to see whether measured data reflects a constant failure rate.
The following section discusses several of the distributions used in reliability assessment. While the intent of this book is not to cover the statistical aspects of distributions, some 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 154
Chapter 3: Fundamental Concepts
fundamental concepts are critical to the understanding of the basis for certain techniques pertaining to reliability assessment, namely confidence level calculations and demonstrating reliability levels. In particular, the binomial and Poisson distribution are critical for these purposes. The binomial distribution is used when there are only two outcomes, such as success or failure, and the probability remains the same for all trials. The probability density function (pdf) of the binomial distribution is:
⎛n⎞ f (x) = ⎜ ⎟ p x q (n− x ) ⎝ x⎠ where:
n! ⎛n⎞ ⎜ ⎟= ⎝ x ⎠ (n − x!)x! and q = 1 – p. The function “f(x)” is the probability of obtaining exactly “x” good items and “(n-x)” bad items in a sample of “n” items, where “p” is the probability of obtaining a good item (success) and “q” (or 1-p) is the probability of obtaining a bad item (failure). The CDF, i.e., the probability of obtaining “r” or fewer successes in “n” trials, is given by: r r n! ⎛n⎞ F (x; r ) = ∑ ⎜ ⎟ p x q n− x = ∑ p x q n− x x =0 ⎝ x ⎠ x =0 (n − x!)x!
The Poisson distribution is an extension of the binomial distribution when “n” is infinite. In fact, it is used to approximate the binomial distribution when n ≥ 20 and p ≤ 0.05. If events are Poisson-distributed, they occur at a constant average rate and the number of events occurring in any given time interval is independent of the number of events occurring in any other time interval. Since the TTF distribution for this situation is the exponential (i.e., constant failure rate), the Poisson distribution will predict the number of failures for specific values of time and failure rates. The number of failures in a given time would be given by:
Reliability Information Analysis Center 155
Chapter 3: Fundamental Concepts
f (x ) =
a x e −a x!
where “x” is the actual number of failures and “a” is the expected number of failures. Since the expected number of failures (i.e., the expected value) for the exponential distribution is “λt”, the Poisson expression becomes:
( λt )x e −λt f (x ) = x!
where: λ= t= x=
failure rate length of time being considered number of failures
The reliability function, R(t), or the probability of zero failures in time “t” is given by:
( λt )0 e − λt R(t ) = 0!
= e −λt
This is the reliability for the exponential distribution. There are many cases where the probability of experiencing a given number of failures (r) or fewer is required. Examples are reliability demonstration, test planning, etc. For these cases, the CDF is used: r
(λt )x e −λt
x =0
x!
R(x) = ∑
A summary of the distributions most commonly used in reliability engineering are presented in Figures 3.3-1 and 3.3-2, for discrete and continuous distributions, respectively.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 156
Chapter 3: Fundamental Concepts
Figure 3.3-1: SHAPES OF FAILURE DENSITY AND RELIABILITY FUNCTIONS OF COMMONLY USED DISCRETE DISTRIBUTIONS (from MIL-HDBK-338B) Reliability Information Analysis Center 157
Chapter 3: Fundamental Concepts
Figure 3.3-2: SHAPES OF FAILURE DENSITY, RELIABILITY AND HAZARD RATE FUNCTIONS FOR COMMONLY USED CONTINUOUS DISTRIBUTIONS (from MILHDBK-338B) 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 158
Chapter 3: Fundamental Concepts
Continuous distributions are used when analyzing time to failure data, since times to failure is a continuous variable. The most common distributions used in reliability modeling to describe times to failure characteristics are the exponential, Weibull and lognormal distributions. These are described in more detail n the following sections. 3.3.1. Exponential
The exponential distribution is most commonly applied in reliability to describe the times to failure for repairable items. For non-repairable items, the Weibull distribution is popular due to its flexibility. In general, the exponential distribution has numerous applications in statistics, especially in reliability and queuing theory. The exponential distribution describes products whose failure rates are the same (constant) at each point in time (i.e., the “flat” portion of the reliability bathtub curve, where failures occur randomly, by “chance”). This is also called a Poisson process. This means that if an item has survived for "t" hours, the chance of it failing during the next hour is the same as if it had just been placed in service. It is sometimes referred to as the distribution with no memory. It is an appropriate distribution for complex systems that are comprised of different electronic and electromechanical component types, the individual failure rates of which may not follow an exponential distribution. Since the exponential distribution is relatively easy to fit to data, it can be misapplied to data sets that would be better described using a more complex distribution. Table 3.3-2 lists the parameters for the exponential distribution: the probability density function (pdf), the cumulative distribution function (CDF), the mean, the variance, and the standard deviation. Another useful parameter of continuous distributions is the 100pth percentile of a population, i.e., the age by which a portion of the population has failed. The 50% point is the median life. The mean of the exponential distribution is equal to the 63rd percentile. Thus, if an item with a 1000 hour MTBF had to operate continuously for 1000 hours, there would only be a 0.37 probability of success. As an example, consider a software system with a failure rate (λ) of 0.0025 failures per processor hour. Its corresponding mean time between failure (MTBF) is calculated as:
MTBF = θ =
1
λ
=
1 = 400 processor hours 0 .0025
Reliability Information Analysis Center 159
Chapter 3: Fundamental Concepts
Table 3.3-2: Exponential Distribution Parameters Parameters
Mathematical Expression (based on failure rate)
Probability Density Function
Cumulative Distribution Function Failure rate Mean
f (t) = λ e −λ t ,
F( t ) = 1 − e − λ t , t > 0
λ μ=
Variance
100 pth Percentile Reliability Function
σ=
t
f (t ) =
1 −θ e , t>0 θ
F (t ) = 1− e
−
t θ ,
t>0
1 θ
μ=θ
1 λ
σ2 =
Standard Deviation
t>0
Mathematical Expression (based on MTBF)
1 λ
2
σ=θ
1 λ
yP =−
σ2 = θ2
1 ln(1 − P ) λ
R (t ) = e−λ t
y P = −θ ln(1 − P)
R (t) = e
−
t θ
The reliability function (i.e., the probability, or population fraction that survives beyond age “t”) at 100 and 1000 processor hours is: R ( t ) = e − ( 0.0025 )(100 ) = 0.7788 = 77.88% R ( t ) = e −( 0.0025 )(1000 ) = 0.0821 = 8.21%
Which can be seen to be R(t) = 1 – F(t). 3.3.2. Weibull
The Weibull distribution is important in reliability modeling since it represents a general distribution which can model a wide range of life characteristics. It can accommodate increasing, decreasing and constant failure rates. Weibull analysis assumes that there has been no repair of failed items and is often used to model single failure causes. The basic features of the Weibull are: •
The shape parameter, β, which describes the shape of the pdf 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 160
Chapter 3: Fundamental Concepts
• •
• •
The scale (or characteristic life) parameter, α, is the value at which 63rd percentile of the distribution occurs The location parameter, γ (or gamma), is only used in the three parameter version of the Weibull distribution, and is the value that represents the failure free period for the item. If an item does not have a period where the probably of failure is zero, then γ= 0 and the Weibull distribution becomes a two parameter distribution. This third parameter is used when there are threshold effects. Determination of β, η, and γ can easily be estimated using Weibull probability paper or by using available Weibull software programs A multi-mode version of the Weibull distribution can be used to determine the points on the bathtub curve where the failure rate is changing from decreasing, to constant, to increasing
There are two general versions of the Weibull distribution, the first being the twoparameter Weibull and the second being the three-parameter Weibull. The twoparameter Weibull uses a shape parameter that reflects the tendency of the failure rate (increasing, decreasing, or constant) and a scale parameter that reflects the characteristic life of items being measured ( ≅ 63.2% of the population will have failed). The threeparameter Weibull adds a location parameter used to represent the minimum life of the population (e.g., a failure mode that does not immediately cause system failure at time zero, such as a software algorithm whose degrading calculation accuracy does not cause system failure until four calls to the algorithm have been made). Note that in most cases, the location parameter is set to zero (failures assumed to start at time zero) and the Weibull distribution reverts to the two-dimensional case. The three parameter Weibull distribution is also commonly used to characterize strength distributions (i.e., when using a stress/strength model), where the γ-value represents a screen value, or proof test, in which case this value of stress is applied to the item as a screen. It is also used to model failure causes that are not initiated until a time equal to the gamma value has passed. As with the gamma distribution, the definition of Weibull parameters is inconsistent throughout the literature. Table 3.3-3 illustrates how some sources define these parameters.
Reliability Information Analysis Center 161
Chapter 3: Fundamental Concepts
Table 3.3-3: Confusing Terminology of the Weibull Distribution Reference Montgomery, D.C., “Introduction to Statistical Quality Control – 2nd Edition”, John Wiley & Sons, 1991 Musa, J.D.; Iannino, A.; and Okumoto, K.; “Software Reliability: Measurement, Prediction, Application”, McGraw-Hill, May 1987 Nelson, W., “Applied Life Data Analysis”, John Wiley & Sons, 1982 MIL-HDBK-338, Section 5.3.6 This book
Weibull Form
Random Variable
Shape Parameter
Scale Parameter
Location Parameter
3-P
X
β
δ
γ
2-P
T
α
β
2-P
Y
β
α
3-P
T
β
η
2-P
X
β
α
γ
For much life data, the Weibull distribution is more suitable than the exponential, normal and extreme value distributions, so it should be the distribution of first resort. The characteristics of various shape parameters are summarized below: • • • • •
For shape parameter < 1.0, the Weibull pdf takes the form of the gamma distribution (see Section 3.7.1.4) with a decreasing failure rate (i.e., infant mortality) For shape parameter = 1.0, the failure rate is constant so that the Weibull pdf takes the form of the simple exponential distribution with failure rate parameter “λ” (the flat part of the reliability bathtub) For shape parameter = 2.0, the Weibull pdf takes the form of the lognormal or Rayleigh distribution, with a failure rate that is linearly increasing with time (i.e., wearout). This is often used to model software reliability. For 3 < shape parameter < 4, the Weibull pdf approximately takes the form of the Normal distribution For shape parameter > 10, the Weibull distribution is close to the shape of the smallest extreme value distribution
The basic parameters of the 2-parameter Weibull distribution are presented in Table 3.34. To have the mathematical expressions reflect a 3-parameter Weibull, replace all values of “x” with “(x-x0)”, where x0 represents the γ value as described above. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 162
Chapter 3: Fundamental Concepts
Table 3.3-4: Weibull Distribution Parameters Parameter
Probability Density Function
Mathematical Expression ⎡ ⎛ x ⎞β ⎤ ⎢ ⎥ β −1 ⎢ − ⎜⎜ α ⎟⎟ ⎥ ⎝ ⎠ ⎦ β ⎛ x ⎞ ⎣ f ( x ) = ⎜⎜ ⎟⎟ e ,
α⎝α⎠
Cumulative Distribution Function
⎡ ⎛ x ⎞β ⎤ ⎢ −⎜⎜ ⎟⎟ ⎥ ⎢ ⎝α⎠ ⎥ ⎦ F(x) = 1− e ⎣
Shape parameter
β
Scale parameter
α
Failure Rate
λ ( x) =
β⎛x⎞ ⎜ ⎟ α ⎜⎝ α ⎟⎠
x>0
β−1
Mean
⎛ 1⎞ μ = α Γ ⎜⎜ 1 + ⎟⎟ β ⎝ ⎠
Variance
2⎤ ⎡ ⎛ ⎛ 2⎞ 1⎞ σ 2 = α 2 ⎢ Γ ⎜⎜ 1 + ⎟⎟ − Γ ⎜⎜ 1 + ⎟⎟ ⎥ ⎢ ⎝ β⎠ β⎠ ⎥ ⎝ ⎣ ⎦
Standard deviation
2⎤ ⎡ ⎛ ⎛ 2⎞ 1⎞ σ = α ⎢ Γ ⎜⎜ 1 + ⎟⎟ − Γ ⎜⎜ 1 + ⎟⎟ ⎥ ⎢ ⎝ β⎠ β⎠ ⎥ ⎝ ⎣ ⎦
100 Pth Percentile
y P = α[− ln(1 − P )]1 β
Reliability
⎡ ⎛ x ⎞β ⎤ ⎢ −⎜⎜ ⎟⎟ ⎥ ⎢ ⎝α⎠ ⎥ ⎦ R (x) = e ⎣
0.5
Figure 3.3-3 provides a graphical example of the Weibull distribution pdf with a characteristic life of 1000 hours for a variety of shape parameters (β). Figures 3.3-4 and 3.3-5 illustrate the hazard rate and probability plot, respectively, for the same values of the shape parameter.
Reliability Information Analysis Center 163
Chapter 3: Fundamental Concepts
Figure 3.3-3: Example pdf Plots for the Weibull Distribution
Figure 3.3-4: Example Hazard Rate Plots for the Weibull Distribution 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 164
Chapter 3: Fundamental Concepts
Figure 3.3-5: Example Probability Plots for Weibull Distribution
For an example, consider that very early in the system integration phase of a large software development effort, there have been numerous failures due to software that have caused the system to crash (the predominant system failure cause). Plotting the failure times of this specific failure mode (other failure modes are ignored for now) on Weibull probability paper resulted in a shape parameter value of 0.77 and a scale parameter value of approximately 32 hours. Based on these parameters, the calculated reliability and failure rate of the software at 10 system hours is expected to be: λ (10 ) =
0 . 77 32
⎛ 10 ⎞ ⎜⎜ ⎟⎟ ⎝ 32 ⎠
R (10 ) = e
0 .77 −1
= 0 . 0314 failures per hour ⎡ ⎛ 10 ⎢ −⎜⎜ ⎢ ⎝ 32 ⎣
⎞ ⎟⎟ ⎠
0.77
⎤ ⎥ ⎥ ⎦ = 0.6647
Reliability Information Analysis Center 165
Chapter 3: Fundamental Concepts
3.3.3. Lognormal
The lognormal distribution is the distribution of a random variable whose natural logarithm is distributed normally; in other words, it is the normal distribution with “ln t” as the independent variable. The probability density function is
f (t ) =
1
σt 2π
e
1 ⎛ ln (t )− μ ⎞ − ⎜ ⎟ 2⎝ σ ⎠
2
The mean is:
e And the standard deviation is:
(e
⎛σ2 ⎞ ⎟ ⎟ ⎝ 2 ⎠
μ + ⎜⎜
2 μ + 2σ 2
−e
)
1 2 μ +σ 2 2
where μ and σ are the mean and standard deviation (SD), respectively, of ln (t). The lognormal distribution is used in the reliability analysis of semiconductors and the fatigue life of certain types of mechanical components. This distribution is also commonly used in maintainability analysis. The CDF for the lognormal distribution is:
⎡ 1 ⎛ ln (t ) − μ ⎞ 2 ⎤ F (t ) = ∫ exp ⎢− ⎜ ⎟ ⎥ dt ⎠ ⎥⎦ ⎢⎣ 2 ⎝ σ 0 tσ 2π t
1
This can be related to the Standard Normal variant Z by:
⎛ ⎛ ln(t ) − μ ⎞ ⎞ F (t ) = P⎜⎜ Z ≤ ⎜ ⎟ ⎟⎟ σ ⎝ ⎠⎠ ⎝ The reliability function is 1-F(t) or:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 166
Chapter 3: Fundamental Concepts
⎛ ⎛ ln(t ) − μ ⎞ ⎞ R(t ) = P⎜⎜ Z > ⎜ ⎟ ⎟⎟ ⎝ σ ⎠⎠ ⎝ The hazard function, h(t), is given as follows
⎛ ln(t ) − μ ⎞ ⎟ f (t ) σ ⎠ ⎝ = h(t ) = R(t ) tσR(t )
φ⎜
where φ is the standard normal probability function, and μ and σ are the mean and standard deviation of the natural logarithm of the random variable, t. Figures 3.3-6 through 3.3-8 illustrate the lognormal distribution for a mean value of 1000 and standard deviations of 0.1, 1 and 3. Shown are the pdf, the hazard rate, and the cumulative unreliability function, F(t), respectively.
Figure 3.3-6: Example pdf Plots for the Lognormal Distribution
Reliability Information Analysis Center 167
Chapter 3: Fundamental Concepts
Figure 3.3-7: Example Hazard Rate Plots for the Lognormal Distribution
Figure 3.3-8: Example Probability Plots for the Lognormal Distribution 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 168
Chapter 3: Fundamental Concepts
3.4. References 1. Lyu, M.R. (Editor), “Handbook of Software Reliability Engineering”, McGrawHill, April 1996, ISBN 0070394008 2. Musa, J.D., “Software Reliability Engineering: More Reliable Software, Faster Development and Testing”, McGraw-Hill, July 1998, ISBN 0079132715 3. Nelson, W., “Applied Life Data Analysis”, John Wiley & Sons, 1982, ISBN 0471094587 4. Musa, J.D.; Iannino, A.; and Okumoto, K.; “Software Reliability: Measurement, Prediction, Application”, McGraw-Hill, May 1987, ISBN 007044093X 5. Montgomery, D.C., “Introduction to Statistical Quality Control – 2nd Edition”, John Wiley & Sons, 1991, ISBN 047151988X 6. Shooman, M., "Probabilistic Reliability, An Engineering Approach," McGrawHill, 1968. 7. Abernethy, Dr. R.B., "The New Weibull Handbook," Gulf Publishing Co., 1994.
Reliability Information Analysis Center 169
Chapter 3: Fundamental Concepts
This page intentionally left blank
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 170
Chapter 4: DOE-Based Approaches to Reliability Modeling
4.
DOE-Based Approaches to Reliability Modeling
The use of Design of Experiment (DOE) principals is critical to reliability modeling, particularly as it pertains to designing reliability tests from which life models will be derived. As such, it is treated as a separate topic in this book. The tenets of DOE is that one or more of a products’ or systems’ responses is observed as a function of pertinent factors that may affect that response, as illustrated in Figure 4.0-1.
Figure 4.0-1: The DOE Concept At the heart of this technique is the product/system or process under analysis. This is the feature for which we want to quantify the behavior. The independent variables are called the factors. These represent the inputs to the product/system or process and are the things that can potentially change how the product behaves. The output of the DOE activity is the response, and is a measure of how good the product/system or process behaves. The levels for each factor are varied, tests are performed, and the resulting response is measured. The resultant data is analyzed to quantify the item or process response as a function of the factor levels. The generic steps in applying DOE to generate life models are: 1. Determine the product/system or process feature to be assessed Reliability Information Analysis Center 171
Chapter 4: DOE-Based Approaches to Reliability Modeling
2. 3. 4. 5. 6. 7.
Determine the factors Determine the factor levels Design the tests Perform tests and measurements Analyze the data Develop the life model
Each of these steps is described below.
4.1. Determine the Feature to be Assessed The product/system or process feature to be assessed can be any characteristic of the entity that is important to the end user or the producer. It can be related to the performance of the entity, or it can be related to its reliability or durability. In the context of this book, the primary features of interest are reliability and durability. The basic premise of the DOE approach dictates that the feature to be assessed must be quantifiable.
4.2. Determine Factors A factor is any variable that can potentially influence the feature being analyzed. It can be a design attribute, manufacturing attribute, process attribute, environmental stress, operational stress, or any other influencing factor. The output of this determination is a list of factors that will be varied in the DOE tests to be performed. A variety of tools can be used to assist in determining the factors that are to be included in the experiments. Some of these tools are: • • • • •
Quality Function Deployment (QFD) Brainstorming sessions Ishikawa diagram Design FMEA Process FMEA
The FMEA is treated in more detail in Chapter 8.
4.3. Determine the Factor Levels After the factors are identified, the next step is to determine the levels of each factor that will be used in the subsequent tests. The simplest and most common approach is the use of two levels, one at the high end of the operating space (defined below) and the other at the low end. However, there are risks associated with using only two levels. The main 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 172
Chapter 4: DOE-Based Approaches to Reliability Modeling
drawback is that it cannot detect non-linearity in the relationship between the factor and the response. For example, consider the relationship in Figure 4.3-1.
Figure 4.3-1: Possible Response-Factor Level Relationship In this example, the levels “a” and “d” represent the operating space of the product. The conclusions will be very different, depending on the levels chosen within this operating space. For example, if levels “a” and “b” are chosen, the conclusion will be that there is a strong positive relationship; if levels “b” and “d” are chosen, the conclusion will be that the factor has no effect on the response; and if “a” and “d” are chosen, which is a typical approach, the conclusion will be that there is a moderate relationship. These results are summarized in Table 4.3-1. Table 4.3-1: Possible Conclusions for a Non-Linear Response-Factor Relationship Levels
Conclusion
a-b
High positive relationship
c-d
High negative relationship
b-d
No relationship
a-d
Moderate positive relationship
The number of levels for each factor should be chosen, in part, based on knowledge of the physics of the manner in which the factor affects the response. Otherwise, there can be large uncertainty in using the resulting model to interpolate or extrapolate the response behavior as a function of the factor. For example, if the response under analysis is Reliability Information Analysis Center 173
Chapter 4: DOE-Based Approaches to Reliability Modeling
corrosion, and the relationship between the factor, temperature, and the corrosion rate is expected to be governed by the Arrhenius relationship over the entire operating space, then a two-level temperature test may be appropriate. If, however, it is hypothesized that there is a temperature threshold within the operating space, then more than two levels may be required.
4.4. Design the Tests The next step is to design the experiment itself. The tests must be designed to determine the specific factor level combinations to be tested, and the order in which they will be tested. There are many things that will influence the design of the experiment, including sample availability, the cost of running the tests, the time allotted for the tests, and test equipment availability. As an example of a simple experimental design, consider Figure 4.4-1.
Figure 4.4-1: DOE Terminology In this example, there are three factors to be assessed, A, B and C, represented by the three right-hand columns. Each factor has two levels, a “+” indicating the high level and a “–“ indicating the low level. This experiment has four runs, each one representing a treatment. A treatment refers to the combination of levels used in the tests. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 174
Chapter 4: DOE-Based Approaches to Reliability Modeling
Repetition and replication are techniques used to increase the number of runs. The advantage of increasing the number of runs is that obtaining multiple responses with exactly the same factor levels is valuable in quantifying the amount of variability and error in the measurements obtained. Repetition is the practice of repeating the same run sequentially. Replication is the practice of repeating a set of runs sequentially. Both practices will result in multiple responses for a given set of factor levels, but the advantage of replication over repetition is that it is better able to quantify measurement error in the event when there is a gradually changing parameter in the test or measurement system. The full-factorial approach will be used as an example for illustrating the concepts of data analysis, followed by a discussion of other approaches. A full-factorial design, an example of which is shown in Table 4.4-1, is the most comprehensive experimental design. It includes runs which represent all possible combinations of factor levels. The primary drawback to the full-factorial approach is that it requires many runs. In some cases, this may be practical, but in many cases, the cost and time required to carry out the experiments are prohibitive. Table 4.4-1: Full-Factorial Example Run
A
B
C
R (response)
1
+
+
+
R1
2
+
+
-
R2
3
+
-
+
R3
4
+
-
-
R4
5
-
+
+
R5
6
-
+
-
R6
7
-
-
+
R7
8
-
-
-
R8
The number of required runs is calculated as yx, where “y” is number of levels per factor (2, for this example), and “x” is number of factors (3). In Table 4.4-1, then, the number of runs is 23=8. Reliability Information Analysis Center 175
Chapter 4: DOE-Based Approaches to Reliability Modeling
There are many alternatives to the full factorial approach. “One-Factor-at-a-Time” experiments, illustrated in Figure 4.4-2, refer to experiments in which each run varies the level of one factor. In this manner, the effects of each factor can be assessed by comparing the response between the two successive runs in which the factor was varied. This is generally a brute force way to perform experiments, and is usually very inefficient.
Figure 4.4-2: One-Factor-at-a-Time Experiments Fractional Factorial Orthogonal Array Experiments can be used when it is impractical to perform a full factorial experiment. Characteristics of orthogonal experiments are as follows: • • • •
They use a fraction of the number of full-factorial combinations The treatments are chosen to provide enough information to analyze the effects of a factor using analysis of means “Orthogonal” means that the combination of factors are balanced such that the weight of all factors are equal “Orthogonal” also means that the effects of the factors can be assessed independently of the others
A full-factorial array can be scaled such that the resultant array has the characteristics of orthogonality. These are referred to as fractional factorial arrays, since only a fraction of the full-factorial runs are required, yet are still orthogonal. The naming convention for these arrays is determined from:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 176
Chapter 4: DOE-Based Approaches to Reliability Modeling
La ( y x ) where: a= y= x=
the number of experimental runs the number of levels the number of factors
In the previous examples, “y” and “x” were the number of factors and the number of runs, respectively. In the standard DOE nomenclature, however, “La” refers to the number of runs. For example, a seven-factor, two-level experiment for which there will be eight runs is shown in Figure 4.4-3.
Figure 4.4-3: Standard DOE Nomenclature Another critical element that must be considered when defining reliability tests is the potential interactions between factors. Everything discussed thus far in this section has assumed that the effects of each of the factors are independent of each other. In practice, there are often interactions between factors that must be accounted for. Graphical representations of potential interactions are shown in Figure 4.4-4. Referring to the Figure, if the responses for the two levels of the “B-factor” plotted against the two levels of the “A-factor” are parallel, then this is an indication that there is no interaction between the two factors. This is shown on the top left. In other words, the relative magnitudes of the B-response are independent of the level of “A”. If however, when the plots of the same factors result in the plot on the top right, then this is an indication that there is a strong interaction between factors A and B. In this example, the levels of “A” change the entire relationship between the B-levels and the response. The plot on the bottom indicates that there is a mild interaction between the two factors.
Reliability Information Analysis Center 177
Chapter 4: DOE-Based Approaches to Reliability Modeling
Figure 4.4-4: Potential Interactions If the potential interactions are not accounted for in the reliability test plan, the risk is that the effects of the factors cannot be deconvolved (separated) from the interactions between the factors. There are many DOE test plans and tools that assist in identifying the capability of various plans to identify main effects and interactions. A detailed treatment of DOE principals is beyond the scope of this book, as this has been done extensively in the literature, but it is important to understand the impact of some of the principals as they pertain to reliability testing. Resolution is a term that describes the degree to which the main effects of factors are aliased, or confounded, with the interactions amongst factors. In general, the resolution number of a design is one more than the smallest order interaction with which some main effects are aliased. For example, if some main effects are confounded with some 2-level interactions, the resolution number of the DOE is 3. Since full-factorial designs test the response of every possible combination of factors, there is no confounding and, therefore, they have infinite resolution. As stated previously, since the implementation of a fullfactorial test is often not practical, weaker tests are often necessary. The key is to select the aliasing structure of the test such that the actual critical interactions can be deconvolved from the main effects. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 178
Chapter 4: DOE-Based Approaches to Reliability Modeling
To illustrate this, consider an example of a corrosion failure mechanism that is accelerated by temperature, humidity and the level of ionic contamination. A full factorial, 2-level per factor, plan would be as shown in Table 4.4-2. The “-1” and “1” designation represent the low and high levels of the factors, respectively. For this fullfactorial, 2-level plan, eight runs are sufficient to test all possible combinations. Table 4.4-2: Full and Half Factorial Example for Corrosion
FullFactorial
HalfFactorial (Resolution = 3)
Temperature (T) 1 -1 1 -1 -1 1 -1 1 1 1 -1 -1
Main effects Humidity Ionic (H) contamination (I) -1 1 1 -1 -1 -1 -1 1 1 1 1 1 -1 -1 1 -1 1 1 -1 -1 -1 1 1 -1
Interactions T*H -1 -1 -1 1 -1 1 1 1 1 -1 1 -1
T*I 1 1 -1 -1 -1 1 1 -1 1 -1 -1 1
H*I -1 -1 1 -1 1 1 1 -1 1 1 -1 -1
Another possible plan would be a half factorial, also shown in Table 4.4-2. Notice that, for the half-factorial design, the temperature-humidity (T*H) interaction (i.e., the product of the two) is the same as for ionic contamination (I). Also, the T*I interaction is the same as H, and the H*I interaction is the same as T. Therefore, this Resolution 3 plan is incapable of deconvolving the main effects of T, H or I with the interactions of the other two. From physics, we know that both humidity and ionic contamination are required for corrosion. Therefore, the fact that H*I is the same as T (i.e., they are confounded) is unacceptable, since we would not be able to determine if the lifetime is governed by temperature, or the combination of humidity and ionic contamination. Therefore, we need a better DOE test plan. The full-factorial plan would be the best, if it could be executed, since none of this confounding exists. For the full-factorial plan, notice that none of the interaction terms are the same as the main effects.
Reliability Information Analysis Center 179
Chapter 4: DOE-Based Approaches to Reliability Modeling
If we were to actually model this failure cause based on the tests defined in these plans, the general form of the reliability model may be based on the two parameter Weibull distribution, which is:
R=e
⎛t ⎞ −⎜ ⎟ ⎝α ⎠
β
where: R= α= β=
the reliability, or probability of survival, at time “t” the characteristic life (i.e., time to 63% failure) the Weibull shape parameter
The characteristic life is then developed as a function of the applicable variables. The model in this case is: α0
α1
α = e e T H α I α HI α 2
4
3
where: α 0 through α 4 = parameter coefficients estimated in the life modeling process T= the temperature in degrees K (degrees C+273) H= the relative humidity I= the ionic contamination HI = the product of humidity and ionic contamination All model parameters, α 0 through α 4, could be adequately quantified with the fullfactorial design, but not with the half-factorial. There are many other potential test plans that would be adequate, providing that the required model variables can be quantified and are not confounded with one another (Reference 1).
4.5. Perform Tests and Measurements The next step in the process is to perform the tests. The test for each run is performed, and the response is measured. All variables that are not factors being addressed in the 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 180
Chapter 4: DOE-Based Approaches to Reliability Modeling
experiment must be kept as constant as possible. Make sure that all results are fully documented. This also must include any anomalies or potential sources of error that may have occurred. The order of the runs must be kept intact, per the experimental plan. If repetition is used, the same run or treatment is repeated sequentially. If replication is used, then the set of runs to be repeated have been identified in the experimental design. For in-situ measurements, careful time stamping of the data is required. Life models to be developed from the collected data often represent parameter degradation data and not actual TTF data. As a result, a model of degradation rate as a function of time may be used as the response to predict failure times. All test samples should be carefully stored, as root-cause failure analysis may be required at some future time.
4.6. Analyze the Data The data that is generated from the tests is then analyzed to identity the impact that each factor has on the response, and the interactions between each factor. The simplest way to analyze the data and the effects of each factor is to perform an analysis of means. This can be done only if the experimental design is orthogonal. In this case, the average value of the response is calculated for each level of each factor. From the previous example, if the effects of A are to be determined, then the average of the responses when A is “+” and when A is “-” are calculated. Likewise, the mean of each level of each factor is calculated in the same manner, as shown below:
The means can be pictorially represented, as shown in Figure 4.6-1. This is a convenient way to illustrate the sensitivity of the response to each factor. Data analysis techniques more sophisticated than the analysis of means shown here are also often used, and there are many good software tools available to aid in this analysis. However, if a balanced, orthogonal design is used, analysis of means can be very straightforward and effective.
Reliability Information Analysis Center 181
Chapter 4: DOE-Based Approaches to Reliability Modeling
Figure 4.6-1: Analysis of Means In the event that it is known that the response does not behave linearly with the factor level, the response can sometimes be linearized by making the appropriate data transformation. For example, if the response under analysis is corrosion that is governed by the Arrhenius relationship over the entire operating space, then the response, life in this case, would be exponential with temperature. However, if the transformation shown in Figure 4.6-2 is applied, the response will be linear. This is especially useful when a goal of the analysis is to determine the activation energy.
Figure 4.6-2: Linearization of the Arrhenius Relationship 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 182
Chapter 4: DOE-Based Approaches to Reliability Modeling
After the data has been analyzed, the optimal combination of factor levels can be determined. The goal of this approach is to determine the factor levels that will result in minimal variability of the product response and maximum probability of the product meeting its requirements. This is the payoff in this approach, since it results in a more robust design. In this example, if the desirable response is high, then a high value of A and B with a low value of C provides the best response, as shown in Figure 4.6-3.
Figure 4.6-3: Optimal Factor Settings
4.7. Develop the Life Model The reliability data is then analyzed and the life model is developed. This process is discussed in detail in Chapter 5
4.8. References 1. William Y. Fowlkes and Clyde M. Creveling, “Engineering Methods For Robust Product Design: Using Taguchi Methods In Technology And Product Development,”
Reliability Information Analysis Center 183
Chapter 4: DOE-Based Approaches to Reliability Modeling
This page intentionally left blank
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 184
Chapter 5: Life Data Modeling
5.
Life Data Modeling
This section addresses the topic of life modeling, after the life data has been generated. Life data modeling is treated as separate topic in this book since its principals pertain to many of the types of data previously discussed. The purpose of modeling the reliability of critical components or failure causes was previously described, and includes a variety of objectives. Life modeling is simply a means of constructing a mathematical model that predicts, assesses, or estimates the reliability of a product or system. A methodology was previously presented for developing life models from tests performed at multiple combinations of stress (DOE Multicell). From this data, a reliability model can be constructed. If all samples are tested to failure, or have been tested in exactly the same manner, then traditional statistical analysis techniques (like regression, F-tests, T-tests, AVOVA, etc.) will generally suffice for reliability modeling purposes. However, most real world cases include censored data, unbalanced datasets, uncertain failure times, etc. It is these cases where life modeling techniques are most effectively used. Life modeling requires simultaneous characterization of: 1. TTF distributions 2. Acceleration factors (which provide a relative value of the reliability parameter as a function of the stress level) Each of these two major elements is discussed in the following sections and presents more detailed information regarding development of the models after the life data has been obtained.
5.1. Selecting a Distribution While there is no specific distribution type that should be used in specific situations, there are some rules of thumb that are helpful when selecting an appropriate distribution. If the failure mechanism of interest is a manifestation of a positive feedback situation, then the lognormal distribution is often applicable. These positive feedback situations are recursive cases in which a flaw starts, the presence of the flaw results in an increased stress level, the flaw propagates resulting in further increased stress, and so on, until catastrophic failure occurs.
Reliability Information Analysis Center 185
Chapter 5: Life Data Modeling
If the failure process is governed by a distribution of defects present in the product or system at time zero, then the Weibull distribution is usually appropriate. In cases where the failure mechanism is random in nature, the exponential is applicable.
5.2. Parameter Estimation Overview Life modeling, using statistical concepts, involves drawing inferences from observations of random variables, such as observed failure times. Typical inferences consist of point and interval estimates of distribution parameters and decisions in statistical hypothesis testing. Parameter estimation provides a means for the effective use of data to aid in life modeling and the estimation of constants appearing in those models. The constants that appear in distribution functions (e.g., “p” in the binomial distribution; “λ” in the Poisson distribution; “μ” and “σ” in the Normal distribution; “λ” or “θ” in the exponential distribution; and “α” and “β” in the Weibull distribution) are called parameters. The true value of the parameters from a given distribution may not be known or measurable, so it becomes more practical to obtain approximate or estimated values of these parameters from a sample of data. In the larger context, parameter estimation is typically applied to one of the following scenarios. Point estimation is frequently used in reliability analysis to quantify parameters like the failure rate in the exponential distribution. Formally, a statistic, Y, is a function of random variables that does not depend on any unknown parameter:
Y = u( X 1 , K , X n ) Let “θ” denote the parameter to be estimated. Consider functions w(Y) of the statistic, which might serve as point estimates of the parameter. Since w(Y) is a random variable, it has a probability distribution. Statisticians have defined certain properties for assessing the quality of estimators. These properties are defined in terms of this probability distribution. A loss function, L[θ,w(Y)], assigns a number to the deviation between a parameter and an estimator. A typical loss function is the square of the difference, and is the value used in least squares regression: 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 186
Chapter 5: Life Data Modeling
L[θ , w(Y )] = [θ − w(Y )] 2
The risk function is the expected value of the loss function:
R(θ , w) = E{L[θ , w(Y )]} An unbiased estimator that minimizes the risk function for the above loss function is referred to as a minimum variance unbiased estimator. An estimator that minimizes this risk function uniformly in θ is called a minimum mean squared estimator. Table 5.2-1 summarizes the terms most commonly used in parameter estimation. Table 5.2-1: Terminology Used In Parameter Estimation Term Confidence Level Consistent Estimator Estimator Interval Estimator Likelihood Loss Function Maximum Likelihood Estimate Minimum Mean Squared Estimate Minimum Variance Unbiased Estimator Risk Function Sample Size Unbiased Estimator
Definition The theoretical percentage (or probability) of an interval estimate containing the parameter, and in which the endpoints of the interval are constructed from sample data The estimate converges to the true value of the parameter as the sample size increases to infinity A function of a statistic used to estimate a parameter in a probability model Estimates of the endpoints of an interval around a parameter The probability weight for given values of parameters at observed data points A function that provides a measure of the distance between a parameter value and its estimator An estimate that maximizes the probability that given parameter values will occur at observed data points An estimator that uniformly minimizes the expected value of the square of the difference between a parameter and an estimator Of all unbiased estimators, none has a smaller variance. Sometimes called a “best” estimator The mathematical expectation of the loss function The number of random variables from which a statistic is calculated An estimator with a mathematical expectation equal to the parameter being estimated
Table 5.2-2 includes a brief discussion of common parameter estimation techniques.
Reliability Information Analysis Center 187
Chapter 5: Life Data Modeling
Table 5.2-2: Techniques for Parameter Estimation Technique
Discussion
•
Process
Maximum Likelihood Estimation (MLE)
In all practical cases, MLE’s converge stochastically to the population value. If a MLE exists uniquely and a sufficient statistic for the parameter exists, the MLE is a function of the sufficient statistic. Sometimes the MLE is impossible to find in closed form, and numerical methods must be used (typical of time-domain software reliability models). MLE’s are the best estimators for large sample sizes.
• Express the joint probability density function of the random variables of interest as a function of the unknown parameters (i.e., the likelihood function) • Where appropriate, take the natural logarithm of the likelihood function • Differentiate the likelihood (or log likelihood) function with respect to each parameter • Set all derivatives equal to zero and solve for the parameters as functions of realizations of the random variables • Check second-order conditions
Least Squares
Least square estimators may be better when small or medium sample sizes are involved, since they may have smaller bias, or approach normality faster. Least squares estimation minimizes the variance around the estimated parameter. The technique is familiar to those comfortable with linear regression modeling.
• Express the sum of the squared distance between actual and predicted values as a function of parameter estimates • Determine the parameter estimators that minimize the sum of this squared distance (typically using differential calculus)
Method of Moments
This technique works by equating statistical sample moments calculated from a data set to actual population moments. Population moments are determined by the parameters to be estimated. As many moments are equated as there are parameters to be estimated. In most cases of practical interest, these can be found in closed form., but their theoretical justification is not as rigorous as for other parameter estimation methods.
• Determine the distribution whose parameters are to be estimated (suppose there are “n” parameters to be estimated) • Find the first “n” moments of the distribution, either around zero, or around the mean for moments higher than the first • Equate these moments to sample moments • Solve for the parameters as a function of the realizations of the random variables in the sample.
Bayesian
Provides an efficient method for incorporating various subjective and objective data sources into parameter estimation. It is a much less practical method than MLE, as the analysis is much more complex and the computation is much more complicated. The validity of the approach is dependent on validity of the model and prior distributions.
• Assign a non-informative or subjective distribution to the parameters of the model (the “priors”). The priors express the uncertainties in the parameter values. • Combine actual data with the “priors” to obtain new parameter distributions (the “posteriors”). The posteriors provide estimates and Bayesian confidence limits for the parameters, producing more precise estimates.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 188
Chapter 5: Life Data Modeling
5.2.1. Closed Form Parameter Approximations
Simple equations that approximate parameters have been developed and are summarized in Table 5.2-3, which provides an overview of the parameter estimates for commonly used distributions. Table 5.2-3: Parameters Typically Estimated from Statistical Distributions Distribution
True Parameter
Poisson
Occurrence Rate, λ
Binomial
Proportion, p
Estimated Parameter Sample Occurrence Rate: λˆ = n / t n = number of observed failures t = period (time, length, volume) over which failures are observed ˆ Sample Proportion: p = x / n x = number of “successful” trials n = number of statistically independent sample units n
∑xi
Exponential
Mean, θ
θˆ = x = i =1 n Sample Mean: xi = individual times to failure for each of the observations of sample size “n” n = number of statistically independent sample observations n
x=
Mean,
x
∑x i i =1
n Sample Mean: xi = individual times to failure for each of the observations of sample size “n” n = number of statistically independent sample observations
Normal
n
s2 =
Variance, s2
∑ (x i
− x )2
i =1
n −1 Sample Variance: 2 s = sample variance (standard deviation, s, equals (s2)0.5) xi = individual measurements for each of the observations of sample size “n” n = number of statistically independent sample observations
Reliability Information Analysis Center 189
Chapter 5: Life Data Modeling
Table 5.2-3: Parameters Typically Estimated from Statistical Distributions (continued) Distribution
True Parameter
Estimated Parameter The estimate of the Weibull shape parameter is: 1.283 βˆ = s where,
⎛ n 2 ⎜ ∑ (x i − x ) s = ⎜ i =1 ⎜ n −1 ⎜ ⎝
Shape Parameter, β
⎞ ⎟ ⎟ ⎟ ⎟ ⎠
0.5
n
Weibull
x=
∑xi i =1
n s = sample standard deviation xi = individual times to failure for each observation of sample size “n” n = number of statistically independent sample observations The estimate of the Weibull scale parameter is: αˆ = exp ( x + ( 0.5772 )( 0.7797 ) s )
Scale Parameter, α
s= sample standard deviation xi = individual measurements for each observation of sample size “n” n = number of statistically independent sample observations
The parameter estimates shown in Table 5.2-3 are rather simplistic and easy to use, and often provide adequate estimates. There are more rigorous techniques available that do a better, more accurate job of estimating parameters, but their complexity requires the use of software tools. The most popular techniques used in reliability modeling are least squares regression and maximum likelihood. These are described in the next sections. 5.2.2. Least Squares Regression
Least squares regression is often used to estimate model parameters in cases when a function can be linearized. The following steps are required for this approach: 1. Select the distribution type 2. Linearize the distribution 3. Determine the plotting positions of each data point 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 190
Chapter 5: Life Data Modeling
4. Determine the parameters using a least squares technique For example, if a two parameter Weibull distribution is used (Step 1), the linear transform is performed as follows (Step 2):
R=e
⎛t ⎞ −⎜ ⎟ ⎝α ⎠
β
Taking the natural log (base e) of both sides, twice, yields:
ln(− ln(R)) = β ln(t ) − ln(α ) This is now a linear model with ln(t) being the independent variable, β being the slope, and ln(α) being the intercept. Step 3 calculates the plotting position (i.e., the estimated percent fail of the population at the failure time of each) for each data point. A common way to accomplish this is by using Bernard’s formula:
F=
i − 0.3 N + 0.4
where: i= N=
the cumulative number of failures the total sample size
For example, if there are ten items, the value of F after the second failure is:
F=
i − 0.3 2 − 0.3 = = 0.163 N + 0.4 10 + 0.4
The value of F is calculated for each failure. These pairs of x-y points are the values to which a linear model will be fit. The values of the slope and intercept are then: Reliability Information Analysis Center 191
Chapter 5: Life Data Modeling
β=
∑ (x − x )(y − y ) ∑ (x − x ) 2
ln(α ) = y − β x In this case, y = ln(α), and x = time (t) 5.2.3. Parameter Estimation Using MLE
This section addresses the use of Maximum Likelihood Estimation (MLE) techniques for estimating TTF distribution parameters, such as the parameter “λ” of the exponential pdf, or “μ” and “σ” of the Normal and lognormal pdf. The objective is to find a point estimate, as well as a confidence interval, for the parameters of these distributions based on the data available from test or field observation. Quantification of confidence intervals is very important in the estimation process because there is almost always a limited amount of data (e.g., on TTFs), and, thus, we cannot state our point estimation with certainty. Therefore, the confidence interval is a statement about the range within which the actual (“true”) value of the parameter resides. This interval is greatly influenced by the amount of data available. Of course, other factors such as diversity and accuracy of the data sources and adequacy of the selected model can also influence the state of our uncertainty regarding the estimated parameters. When discussing goodnessof-fit tests, we are trying to address the uncertainty due to the choice of the probability model form by using the concept of levels of significance. However, uncertainty due to diversity and accuracy of the data sources is a more difficult issue to deal with. Times-to-failure data are seldom complete. A complete sample is one in which all items observed have failed during a given observation period, and all the failure times are known. When “n” items are placed on test or observed in the field, whether with replacement or not, it is sometimes necessary (due to the long life of certain components) to terminate the test and perform the reliability analysis based on the observed data up to the time of termination. There are two basic types of possible life observation termination. The first type is time terminated (which results in Type I right censored data), and the second is failure terminated (resulting in Type II right-censored data). In the time-terminated life observation, “n” units are monitored and the observation is terminated after a predetermined time has elapsed. The number of items that failed during the observation time, and the corresponding TTF of each component, are recorded. In the failureterminated life observations, “n” units are monitored and the observation is terminated 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 192
Chapter 5: Life Data Modeling
when a predetermined number of component failures have occurred. The time to failure of each failed item, including the time that the last failure occurred, are recorded. The MLE method is one of the most widely used methods for estimating reliability model parameters. I n the first part of this section, a brief historical review of the MLE method is presented. The likelihood function concept for different types of failure data, as well as the mathematical approach to solve likelihood equations, is presented next. The last part of this section reviews the basic equations of the MLE approach for specific case studies, including exponential, Weibull and lognormal distribution likelihood functions. 5.2.3.1. Brief Historical Remarks
The use of regression techniques has many shortcomings when it comes to reliability modeling. In particular, it is weak when it comes to analyzing interval or censored data. The Maximum Likelihood Estimation method was originally introduced by Fisher (Reference 5). Fisher used the conditional probability of occurrence for each failure event as a measure for his mathematical curve fitting. He argued that, using a subjective assumption about the TTF model, one can characterize the probability of each failure event, conditioned to the model parameter. He then derived the posterior probability of failure events in a Bayesian framework using a uniform distribution as a prior for the model parameters. He later calculated the best estimate for model parameters by maximizing the posterior. Note that, in a Bayesian framework, a uniform distribution cancels out from the equation since it is a constant. The normalizing factor in the denominator is also a constant, which has no impact when one is interested in the extremes of the function. Therefore, this method was eventually called the maximum likelihood estimator, because it is basically the likelihood function that is maximized in this process. 5.2.3.2. Likelihood Function
Fisher (Reference 3) based his maximum likelihood measure on an implied Bayesian uniform prior for the parameters, and he names the method as leading to “the most probable set of values” for the parameters (Reference 9). He suggested that the ratio of the likelihood function and its maximum may be used to find confidence intervals for the model parameters, and derived it for the case of Normal sampling curves. Let “[f (t) × dt]” be the chance of a failure observation falling within the range “dt”. Fisher introduces the method of maximum likelihood by claiming that the factor “dt” is independent of the theoretical curve, and the probability is proportional to “f(t)”.
Reliability Information Analysis Center 193
Chapter 5: Life Data Modeling
Therefore, the likelihood of “N” independent TTF observations will be proportional to the product of the probability distribution function at the TTFs, as shown below: N
LF (t1 , t 2 ,..., t N / θ ) ∝ ∏ f (ti | θ ) i =1 M
LR (T1 , T2 ,..., TM / θ ) ∝ ∏ R (Ti | θ ) i =1 K
LL (T1 , T2 ,..., TK / θ ) ∝ ∏ F (Ti | θ ) i =1
L
LI [(Ta1 , Tb1 ),..., (TaL , TbL ) / θ ] ∝ ∏ [ F (Tb i ) − F (Tai )] | θ ) i =1
where: θ= N= M= K, L = Ti = ti = f (t) = R (t) = Tai = Tbi = LF = LR = LL = LI =
the vector of model parameters, (θ1, θ2, …, θn) the number of complete failure observations the number of right censored observations the number of left and interval data observations censored observations complete failure observations probability density function reliability function the lower bound of time interval the upper bound of time interval the likelihood function for complete failure data the likelihood function for right censored data the likelihood function for left censored data the likelihood function for interval data
Using the notion of the conditional probability density function, f(t|θ), helps to integrate many different types of failure data into the likelihood function. For example, the likelihood of the right-censored observations will be the reliability function, because this is the probability that the component remains reliable up to the censored time. Therefore, the likelihood of “M” independent right-censored observations will be the product of the reliability functions as illustrated in the second equation above. For left-censored times (that is, the time before which a failure has occurred) the likelihood is also the definition of probability of failure at that time. In the case of many left-censored times, the total likelihood will be the multiplication of the likelihood values of individual components 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 194
Chapter 5: Life Data Modeling
using the independency assumption, as shown in the third equation on the previous page. For interval times, the likelihood is the probability of having one failure in that interval (which is basically the integral of the probability density function between the upper and lower abounds of interval). This is simply the difference between the cumulative distribution function when it is evaluated at the upper and lower bounds, respectively, as shown in the final equation from the previous page. Assuming the independency of failure or censored time events, these likelihood functions can be multiplied with the likelihood of the complete failure data in order to build the likelihood function for the entire population. 5.2.3.3. Maximum Likelihood Estimator (MLE)
The likelihood function is used differently in the Bayesian and MLE frameworks. In the Bayesian method, the prior knowledge that is available for the model parameters is updated using this function as the conditional likelihood of data. In the MLE approach, the most probable set of values of the parameter vector, θ, s estimated by maximizing this likelihood as a standalone function. The practical way to find the modes of the likelihood function is derivation. A multivariable function has its maximum value at a point in which the first-order partial derivative of the function with respect to each variable becomes zero, as shown below:
⎧ ∂Λ ⎪ ∂θ = 0 ⎪ 1 ⎪ ∂Λ =0 ⎪ Λ = ln( L ) ⇒ ⎨ ∂θ 2 ⇒ θˆ = (θˆ1 ,θˆ 2,..., θˆn ) ⎪... ⎪ ⎪ ∂Λ ⎪ ∂θ = 0 ⎩ n where: Λ= θˆ =
the log likelihood function the best estimate parameter vector
Note that the likelihood function, as explained before, is based on a multiplication format. This makes the derivation process very complex. The likelihood is always positive, so Reliability Information Analysis Center 195
Chapter 5: Life Data Modeling
one may take the natural logarithm of this function to convert these multiplication operators to summation. This will significantly reduce the mathematical derivation complexity, while still providing the same best estimates for the mode of the likelihood function. Constructing the likelihood function, L, as explained before, one may set up the equations that need to be solved for the modes of this function. In the following sections, three examples of the likelihood function, representing the exponential, Weibull and lognormal distributions, are presented for further clarification. 5.2.3.3.1.
Exponential Distribution
If failures are expected to randomly occur at a constant rate in time, the TTF distribution follows an exponential distribution. The exponential distribution assumes a constant hazard rate for the item. This constant hazard rate is the only parameter of the exponential distribution. The likelihood of complete failure and right-censored data, as explained in previous sections, can be represented based on the probability density and the cumulative distribution functions of the exponential distribution. The equation below shows the log-likelihood function in case of “F” complete (i.e., failed) and “S” rightcensored (i.e., survived or suspended) observations. F
(
L = ∑ N i ln λe i =1
− λt i
) − ∑ N λT S
j =1
j
j
The only variable in this equation is λ. In the MLE method, the best estimate of λ is evaluated by maximizing the likelihood (or log-likelihood) function. The next equation shows the criteria to estimate the best estimate of λ. The uncertainty of the calculation can be illustrated as confidence bounds over λ, which is calculated using the corresponding local Fisher information matrix. This step will be explained in detail later. F ∂L ⎞ S ⎛1 = ∑ N i ⎜ − ti ⎟ − ∑ N jT j = 0 ∂λ i =1 ⎝ λ ⎠ j =1
5.2.3.3.2.
Weibull Distribution
The Weibull distribution can be used for non-repairable hardware units exhibiting increasing, decreasing, or constant hazard rate functions. Similar to the lognormal distribution, it is a two-parameter distribution and its estimation, even in the case of 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 196
Chapter 5: Life Data Modeling
complete (uncensored) data, is not a trivial problem. It can be easily shown that, in the situation where all “r” units out of “n” observed units fail, the log-likelihood estimates of the Weibull distribution are represented by the equation below: β −1 ⎛ t i ⎞ ⎛ − ⎜⎜ ⎟⎟ ⎜ β ⎛ ti ⎞ L = ∑ N i ln ⎜ ⎜ ⎟ e ⎝ α ⎠ ⎜α ⎝α ⎠ i =1 ⎝ F
β
β ⎞ S ⎛ Tj ⎞ ⎟ ⎟⎟ − ∑ N j ⎜⎜ α ⎟⎟ j =1 ⎝ ⎠ ⎠
The best estimate of the parameters “α” and “β” are made using the first derivative of the log-likelihood function, as shown in the next equations below. The best estimates will be the unique answer for the set of two equations and two unknowns, as shown: β
β F ⎛T ⎞ ⎛T ⎞ ∂L 1 F ⎛ ti ⎞ F ⎛ ti ⎞ ⎛ ti ⎞ S = ∑ Ni + ∑ Ni ln⎜ ⎟ − ∑ Ni ⎜ ⎟ ln⎜ ⎟ − ∑ N j ⎜⎜ j ⎟⎟ ln⎜⎜ j ⎟⎟ = 0 ∂β β i =1 ⎝ α ⎠ i =1 ⎝ α ⎠ ⎝ α ⎠ j =1 ⎝ α ⎠ ⎝ α ⎠ i =1
∂L − β = ∂α α
β Ni + ∑ α i =1 F
β S ⎛Tj ⎛ ti ⎞ ⎜ N N + ⎜ ⎟ ∑ ∑ i j⎜ α i =1 j =1 ⎝ ⎠ ⎝α F
β
⎞ ⎟ =0 ⎟ ⎠
Note that, despite the complexity of the mathematical representations of the likelihood and log-likelihood functions and their derivatives, the basic concept is fairly simple. In advanced numerical approaches using computers, the entire mathematical derivation is done through numerical simulations using predefined tool boxes and library functions. 5.2.3.3.3.
Lognormal Distribution
In the case of estimating the parameters of the lognormal distribution, the only difference is in the construction of the likelihood function for which the pdf and CDF of the distribution are used for complete and suspended failure data, respectively. The equation below shows the log-likelihood function for a combination of complete failure and suspended (right-censored) data. F ⎛ ⎛ 1 ⎛ ln(ti ) − μ ⎞ ⎞ S ⎛ ln(T j ) − μ ⎞ ⎞ ⎟⎟ ⎟ L = ∑ N i ln⎜⎜ φ ⎜ ⎟ ⎟⎟ + ∑ N j ln⎜⎜1 − Φ⎜⎜ ⎟ σ σ ⎠ ⎠ j =1 i =1 ⎝ ⎠⎠ ⎝ σti ⎝ ⎝
Having the log-likelihood function of failure data, the MLE approach can be executed using the first derivative approach, as explained in previous sections. The first derivative Reliability Information Analysis Center 197
Chapter 5: Life Data Modeling
of the log-likelihood function with respect to the mean and standard deviation is illustrated in the following two equations:
⎛ ln(T j ) − μ ⎞ ⎟⎟ σ ∂L 1 F 1 S ⎝ ⎠ =0 = 2 ∑ Ni (ln(ti ) − μ ) + ∑ N j ∂μ σ i =1 ⎛ ln(T j ) − μ ⎞ σ j =1 ⎟⎟ 1 − Φ⎜⎜ σ ⎝ ⎠
φ ⎜⎜
F ⎛ (ln(t i ) − μ ) ∂L = ∑ Ni ⎜ ∂σ i =1 ⎜⎝ σ3
2
⎛ ln(T j ) − μ ⎞ ⎛ ln(T j ) − μ ⎞ ⎜ ⎟φ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ σ σ 1 ⎞⎟ 1 S ⎝ ⎠ ⎝ ⎠ =0 − − ∑Nj ⎟ σ ⎠ σ j =1 ⎛ ln(T j ) − μ ⎞ ⎟ 1 − Φ⎜⎜ ⎟ σ ⎠ ⎝
where:
1
φ(x) = Φ(x ) =
2π 1 2π
x
∫
1 − (x)2 e 2 1 − (t )2 e 2 dt
−∞
The capital Φ in the above equation is basically the cumulative Normal distribution, which is defined as the integral of the small φ (i.e., normal pdf). The derivative of the CDF always becomes the pdf, since the derivative operator cancels out the integration. 5.2.4. Confidence Bounds and Uncertainty
Since point estimates are constructed from data that exhibits random variation, these estimates will not be exactly equal to the unknown population parameters. Confidence bounds provide a convention for making statements about the random variation in the estimates of parameters. 5.2.4.1. Confidence Bounds with MLE
The variance and covariance of the parameters, calculated using MLE equations, can be found using the local Fisher information matrix. Fisher assumed a Normal distribution for the parameters when deriving these equations. Using the following local information matrix, one can relate the likelihood function to the variance and covariance of the model 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 198
Chapter 5: Life Data Modeling
parameters. The next equation represents these uncertainties for a general case for which there are “n” parameters in the likelihood function.
⎡ Var (θˆ1 ) ... Cov (θˆ1 , θˆ2 ) Cov (θˆ1 , θˆn ) ⎤ ⎥ ⎢ ˆ ˆ ... Var (θˆ2 ) Cov (θˆ2 , θˆn )⎥ ⎢Cov (θ 2 , θ 1 ) −1 . . . . ⎥ = [F ] ⎢ ⎥ ⎢ . . . . ⎥ ⎢ ... ... Cov (θˆn , θˆn −1 ) Var (θˆn ) ⎥⎦ ⎢⎣ (17) where: Var = Cov = Λ= F=
variance of the parameter of interest covariance of the two parameters the log likelihood function the local Fisher information matrix as defined below:
⎡ ∂ 2Λ ⎢ − 2 ⎢ ∂θ2 1 ⎢ ∂ Λ ⎢− ∂θ ∂θ 2 1 F=⎢ . ⎢ ⎢ . ⎢ ⎢ ... ⎢⎣
∂2Λ − ∂θ1∂θ 2 ∂2Λ − 2 ∂θ 2 . . ...
... ... . . ∂2Λ − ∂θ n ∂θ n −1
∂2Λ ⎤ − ⎥ ∂θ1∂θ n ⎥ ∂2Λ ⎥ − ∂θ 2 ∂θ n ⎥ ⎥ . ⎥ ⎥ . ⎥ ∂2Λ ⎥ − 2 ∂θ n ⎥⎦
Having the variance and the best estimate of each parameter, one may estimate the uncertainty bounds for any given confidence bounds. Note that the important underlying assumption here is independency, as well as the Normal distribution for all parameters. 5.2.4.2. Confidence Bounds Approximations
Tables 5.2-4 through 5.2-11 present a summary of equations for calculating the confidence bounds around the parameters for various distributions. Reliability Information Analysis Center 199
Chapter 5: Life Data Modeling
Table 5.2-4: Confidence Bounds for the Poisson Distribution Parameter One-Sided Confidence Interval Two-Sided Confidence Interval Given: The estimate for a the true occurrence rate, λ, is the sample occurrence rate: λˆ = n / t where, n= number of observed failures t= period (time, length, volume) over which failures are observed Poisson Limits (approximate only): Exact confidence levels cannot be conveniently obtained for discrete distributions
True Occurrence Rate, λ
λ L = 0.5 χ 2 [1 − γ ; 2 n ] / t
λ L = 0.5 χ 2 [(1 − γ ) 2 ; 2 n ] / t
Normal Approximation When “n” is large (say, >10) λ ≅ λˆ − z ( λˆ / t ) 0.5
λ L ≅ λˆ − z ( 1+ γ )
2
( λˆ / t ) 0.5
λ U ≅ λˆ + z γ ( λˆ / t ) 0.5
λ U ≅ λˆ + z ( 1+ γ )
2
( λˆ / t ) 0.5
λ U = 0 .5 χ 2 [ γ ; ( 2 n + 2 ) ] / t
γ
L
λ U = 0.5 χ 2 [(1 + γ ) 2 ; ( 2 n + 2 ) ] / t
Given: Given the observed rate of occurrence above, the prediction for the future rate of occurrence is: yˆ = λˆ s = (n / t ) s where, n, t = as defined above s= period (time, length, volume) over which future observation is predicted Poisson Limits (approximate only) Closest integer solutions for yL and yU from the following equations ( n + 1) F [γ ; ( 2 n + 2 ); 2 y U ] s t s t = F [γ ; ( 2 y L + 2 ); 2 n ] ( y L + 1) n yU
Future Occurrence Rate, y
=
( n + 1) F [(1 + γ ) 2 ; ( 2 n + 2 ); 2 y U ] s t s t = F [(1 + γ ) 2 ; ( 2 y L + 2 ); 2 n ] ( y L + 1) n yU
=
Normal Approximation When “n” and “y” are large (e.g., each is > 10)
( ) (λˆ s ( t + s ) t )
y L ≅ yˆ − z γ λˆ s ( t + s ) t
0.5
y U ≅ yˆ + z γ
0.5
(λˆ s ( t + s ) t ) (λˆ s ( t + s ) t )
0.5
y L ≅ yˆ − z ( 1+ γ )
2
y U ≅ yˆ + z ( 1+ γ )
2
0.5
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 200
Chapter 5: Life Data Modeling
Table 5.2-5: Confidence Bounds for the Binomial Distribution Parameter One-Sided Confidence Interval Two-Sided Confidence Interval Given: The estimate of the true population proportion, p, is the sample proportion: pˆ = x / n
where, x= number of “successful” trials n= number of statistically independent sample units Binomial Limits (approximate only): Exact confidence levels cannot be conveniently obtained for discrete distributions 1 1 + ( n − x + 1)(1 x ) F [γ ; ( 2 n − 2 x + 2 ); 2 x ] 1 pU = 1 + ( n − x )(1 (( x + 1) F [γ ; ( 2 x + 2 ); 2 n − 2 x ] pL =
True Proportion, p
pL =
1 1 + ( n − x + 1)(1 x ) F [(1 + γ ) 2 ; ( 2 n − 2 x + 2 ); 2 x ]
pU =
1 1 + ( n − x )(1 (( x + 1) F [(1 + γ ) 2 ; ( 2 x + 2 ); 2 n − 2 x ]
Normal Approximation When “x” and “n-x” are large (e.g., each is > 10) p L ≅ pˆ − z γ ( pˆ (1 − pˆ ) / n ) 0.5
p L ≅ pˆ − z ( 1+ γ ) 2 ( pˆ (1 − pˆ ) / n ) 0.5
p U ≅ pˆ + z γ ( pˆ (1 − pˆ ) / n ) 0.5
p U ≅ pˆ + z ( 1+ γ ) 2 ( pˆ (1 − pˆ ) / n ) 0.5
Poisson Approximation When “n” is large and “x” is small (e.g., when “x” < n/10) p L ≅ 0.5 χ 2 [(1 − γ ); 2 x ] n
p U ≅ 0 .5 χ
2
[γ ; 2 x + 2 ]
p L ≅ 0.5 χ 2 [(1 − γ ) 2 ; 2 x ] n
p U ≅ 0.5 χ 2 [(1 + γ ) 2 ; 2 x + 2 ] n
n
Given: Given the observed probability above, the prediction for the number of “y” future category units is: yˆ = mpˆ = m ( x / n )
where, x, n = as defined above m= future sample size Normal Approximation When “x”, “n-x”, “y” and “m-y” are all large (say, > 10)
[ ] 0.5 y U ≅ yˆ + z γ [m pˆ (1 − pˆ )( m + n ) n ] 0.5 y L ≅ yˆ − z γ m pˆ (1 − pˆ )( m + n ) n
Prediction of Future Probability of “Success”, y
[ ] ˆ (1 − pˆ )( m + n ) n ]0.5 2 [m p
y L ≅ yˆ − z (1+ γ ) 2 m pˆ (1 − pˆ )( m + n ) n 0 .5 y U ≅ yˆ + z ( 1+ γ )
Poisson Approximation When “n” is large and “x” is small (e.g., when “x” < n/10) Closest integer solutions for yL and yU from the following equations ( x + 1) F [γ ; 2 x + 2; 2 y U ] n m n = F [γ ; ( 2 y L + 2 ); 2 x ] ( y L + 1) x yU m
=
( x + 1) F [(1 + γ ) 2 ; ( 2 x + 2 ); 2 y U ] n m n = F [(1 + γ ) 2 ; ( 2 y L + 2 ); 2 x ] ( y L + 1) x yU m
=
Reliability Information Analysis Center 201
Chapter 5: Life Data Modeling
Table 5.2-6: Confidence Bounds for the Exponential Distribution Parameter One-Sided Confidence Interval Two-Sided Confidence Interval Given: The estimate of the true population mean, θ, is the sample mean: n
θˆ = x =
∑xi i =1
n
where, xi = n=
individual times to failure for each of the observations of sample size “n” number of statistically independent sample observations Exponential Limits (exact) for Failure Truncated Tests θL = θU =
2 nx χ
2
θL =
[γ ; 2 n ] 2 nx
θU =
χ 2 [(1 − γ ); 2 n ]
2 nx
χ
2
[(1 + γ )
2 ;2n ]
2 nx
χ 2 [(1 − γ ) 2 ; 2 n ]
Exponential Limits (exact) for Time Truncated Tests True value of the mean, θ
2nx χ 2 [γ;2(n + 1)] 2nx θU = 2 χ [(1 − γ);2(n + 1)]
θL =
θL =
θU =
2n x
χ [(1 + γ) 2 ;2(n + 1) ] 2n x 2
χ [(1 − γ) 2;2(n + 1) ] 2
Normal Approximation for Failure Truncated Tests When “n” is large (say, > 15) θL ≅
(
x
exp z γ
(
θ U ≅ x * exp z γ
n
)
θL ≅ n
)
(
x
exp z ( 1+ γ )
(
n
2
θ U ≅ x * exp z ( 1+ γ )
)
2
n
)
Given: The estimate of the true population failure rate, λ, is the sample failure rate: 1 1 λˆ = = n ˆθ ∑xi i =1
n
where, θhat= sample mean xi = individual times to failure for each of the observations of sample size “n” n= number of statistically independent sample observations Exponential Limits (exact) for Failure Truncated Tests True value of the failure rate, λ
λL = λU =
χ 2 [(1 − γ ); 2 n ] 1 = 2 nx θU χ 2 [γ ; 2 n ] 1 = 2 nx θL
λL = λU =
χ 2 [(1 − γ ) 2; 2 n ] 1 = θU 2 nx
χ 2 [(1 + γ ) 2; 2 n ] 1 = θL 2 nx
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 202
Chapter 5: Life Data Modeling
Table 5.2-7: Confidence Bounds for the Exponential Distribution (continued) Parameter One-Sided Confidence Interval Two-Sided Confidence Interval Given: The usual estimate of the 100 pth percentile, yp, is calculated as: y p = − x * ln(1 − p ) where,
probability at the 100 pth percentile
p= True value of the 100-pth percentile, yp
y p ,L = − θ L * ln(1 − p ) = y p , U = − θ U * ln(1 − p ) =
−2 nx * ln(1 − p )
y p ,L = − θ L * ln(1 − p ) =
χ 2 [γ ; 2 n ] − 2 nx * ln(1 − p )
y p , U = − θ U * ln(1 − p ) =
χ 2 [(1 − γ ); 2 n ]
−2 n x * ln(1 − p )
χ 2 [(1 + γ ) 2 ; 2 n ] − 2 n x * ln(1 − p )
χ 2 [(1 − γ ) 2 ; 2 n ]
Given: The usual estimate of the reliability, R(t), at any age, t, is: R * (t) = e −(t
x)
where, R= t=
reliability as a function of time, distance, etc. period at which reliability is assessed (time, distance, etc.)
( { = exp (− t * {χ
R L ( t ) = e − ( t θ L ) = exp − t * χ 2 [γ ; 2 n ] 2 nx
True value of reliability at end of period, R(t)
R U (t ) = e
−( t / θ U )
2
[(1 − γ );2n ]
})
2 nx
})
( { ( {
R L ( t ) = e − ( t θ L ) = exp − t * χ 2 [(1 + γ ) 2; 2 n ] 2 nx
Table 5.2-8: Confidence Bounds for the Normal Distribution Parameter One-Sided Confidence Interval Two-Sided Confidence Interval Given: The estimate of the true population mean, μ, is the sample mean: n
x=
∑x i i =1
n
where, xi = n=
True value of the mean, μ
individual times to failure for each of the observations of sample size “n” number of statistically independent sample observations Normal Limits (exact) Also serve as approximate intervals for the mean of a distribution that is not normal ⎞ ⎛ ⎟ μ L = x − t [γ ; n − 1] * ⎜⎜ s n ⎟⎠ ⎝ ⎞ ⎛ ⎟ μ U = x + t [γ ; n − 1] * ⎜⎜ s n ⎟⎠ ⎝
}) })
R U ( t ) = e −( t / θ U ) = exp − t * χ 2 [(1 − γ ) 2; 2 n ] 2 nx
⎛ ⎞ ⎟ μ L = x − t [(1 − γ ) 2; n − 1] * ⎜⎜ s ⎟ n ⎝ ⎠ ⎛ s ⎞ ⎟ μ U = x + t [(1 − γ ) 2; n − 1] * ⎜⎜ n ⎟⎠ ⎝
Reliability Information Analysis Center 203
Chapter 5: Life Data Modeling
Table 5.2-9: Confidence Bounds for the Normal Distribution (continued) Parameter One-Sided Confidence Interval Two-Sided Confidence Interval Given: The estimate of the true population variance, σ2, is the sample variance: n
s2 =
where,
s2= xi = n=
True value of the variance, σ2
∑ (x i
− x )2
i =1
n −1
sample variance (standard deviation, s, equals (s2)0.5) individual measurements for each of the observations of sample size “n” number of statistically independent sample observations Normal Limits (exact) ⎫⎪ ⎧⎪ n −1 σL = s* ⎨ ⎬ 2 ⎪⎩ χ [γ ; n − 1] ⎪⎭
0.5
⎧⎪ ⎫⎪ n −1 σU = s*⎨ ⎬ 2 ⎪⎩ χ [(1 − γ ); n − 1] ⎪⎭
0.5
⎧⎪ ⎫⎪ n −1 σL = s * ⎨ ⎬ ⎪⎩ χ 2 [(1 + γ ) 2 ; n − 1] ⎪⎭
0.5
⎧⎪ ⎫⎪ n −1 σU = s*⎨ ⎬ 2 ⎪⎩ χ [(1 − γ ) 2 ; n − 1] ⎪⎭
0.5
Given: The estimate of the reliability at any age “t”, R(t), is: R * (t ) = 1 − Φ( z )
where, R= reliability as a function of time, distance, etc. t= period at which reliability is assessed (time, distance, etc.) Φ(z) = estimate of the fraction of a population failing by age “t” True value of reliability at end of period, R(t)
R L ( t ) = 1 − FU ( t ) = 1 − Φ ( z U )
R L ( t ) = 1 − FU ( t ) = 1 − Φ ( z U ) where ,
where , (x − x) z= s 2 zγ ⎛ ⎜ 1 + z (n / 2) zU ≅ z + ⎜ n −1 n ⎝
z=
⎞ ⎟ ⎟ ⎠
0 .5
zU ≅ z +
2 z (1+ γ ) 2 ⎛ ⎜1 + z (n / 2) ⎜ n −1 n ⎝
⎞ ⎟ ⎟ ⎠
0 .5
R U ( t ) = 1 − FL ( t ) = 1 − Φ ( z L )
R U ( t ) = 1 − FL ( t ) = 1 − Φ ( z L )
where ,
where , (x − x) z= s 2 zγ ⎛ ⎜1 + z (n / 2) zL ≅ z − ⎜ n −1 n ⎝
(x − x) s
z= ⎞ ⎟ ⎟ ⎠
0. 5
(x − x) s
2 z (1+ γ ) 2 ⎛ ⎜1 + z (n / 2) zL ≅ z − ⎜ n −1 n ⎝
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 204
⎞ ⎟ ⎟ ⎠
0 .5
Chapter 5: Life Data Modeling
Table 5.3-10: Confidence Bounds for the Weibull Distribution Parameter One-Sided Confidence Interval Two-Sided Confidence Interval Given: The estimate of the Weibull shape parameter, β, is given as: 1.283 βˆ = s where, ⎛ n 2 ⎜ ∑ (x i − x ) = 1 i s=⎜ ⎜ n −1 ⎜ ⎝
⎞ ⎟ ⎟ ⎟ ⎟ ⎠
0.5
n
x=
∑xi i =1
n
where, s= sample standard deviation xi = individual times to failure for each observation of sample size “n” n= number of statistically independent sample observations Weibull Limits (approximate) Limits are crude unless “n” is quite large (say, “n” > 100) True value of the Weibull shape parameter, β
βL ≅
βU ≅
1 ⎛ 1.049 z γ 0.7797s * exp⎜ ⎜ n ⎝
βL ≅
⎞ ⎟ ⎟ ⎠
0.7797s ⎛ 1.049 z γ exp⎜ ⎜ n ⎝
βU ≅
⎞ ⎟ ⎟ ⎠
1 ⎛ 1.049z (1+ γ ) 2 ⎞ ⎟ 0.7797s * exp⎜ ⎜ ⎟ n ⎝ ⎠ 0.7797s ⎛ 1.049z (1+ γ ) 2 ⎞ ⎟ exp⎜ ⎜ ⎟ n ⎝ ⎠
Given: The estimate of the Weibull scale parameter, α, is: αˆ = exp ( x + ( 0.5772 )( 0.7797 ) s ) where, s= sample standard deviation xi = individual measurements for each observation of sample size “n” n= number of statistically independent sample observations Weibull Limits (approximate) Limits are crude unless “n” is quite large (say, “n” > 100) True value of ⎛ ⎛ (1.081)( 0.7797 ) s ⎞⎟ (1.081)( 0.7797 ) s ⎞⎟ the Weibull α L ≅ exp ⎜ ( x + 0.45 s ) − z ( 1+ γ ) 2 α L ≅ exp ⎜ ( x + 0.45s ) − z γ ⎜ ⎟ ⎜ ⎟ n n ⎝ ⎠ ⎝ ⎠ scale ⎛ parameter, α ⎛ (1.081)( 0.7797 ) s ⎞⎟ (1.081)( 0.7797 ) s ⎞ ⎜ α U ≅ exp ⎜ ( x + 0.45s ) + z α ⎜ ⎝
n
⎟ ⎟ ⎠
α U ≅ exp ( x + 0.45 s ) + z ( 1+ γ ) ⎜ ⎝
Reliability Information Analysis Center 205
2
n
⎟ ⎠
Chapter 5: Life Data Modeling
Table 5.2-11: Confidence Bounds for the Weibull Distribution (continued) Parameter One-Sided Confidence Interval Given: The estimate of the reliability at any age “t”, R(t), is:
R * (t ) = e
Two-Sided Confidence Interval
β ⎛ t ⎞ −⎜⎜ ⎟⎟ ⎝α⎠
where,
True value of reliability at end of period, R(t)
R= reliability as a function of time, distance, etc. t= period at which reliability is assessed (time, distance, etc.) α= Weibull scale parameter β= Weibull shape parameter Limits are crude unless “n” is quite large (say, “n” > 100) One-sided approximate Weibull limits: 0.5 ⎞⎤ ⎡ ⎛ 2 ⎡ ⎜ ⎛ t − ( x + 0.45s) ⎞ ⎛ t − ( x + 0.45s) ⎞ ⎤ ⎟⎥ ⎢ + − 1 . 168 ( 1 . 1 ) ( 0 . 1913 ) ⎢ ⎥ ⎜ ⎟ ⎜ ⎟ ⎜ ⎛ t − ( x + 0.45s) ⎞ ⎢ ⎝ 0.7797s ⎠ ⎝ 0.7797s ⎠ ⎥ ⎟⎟⎥ RL (t ) = exp⎢− exp⎜ ⎜ ⎟ + zγ ⎢ ⎢ ⎥ ⎟⎥ n ⎜ ⎝ 0.7797s ⎠ ⎢ ⎢ ⎥ ⎟⎥ ⎜ ⎢ ⎣ ⎦ ⎠⎥⎦ ⎝ ⎣ 0.5 ⎞⎤ ⎡ ⎛ 2 ⎡ ⎜ ⎛ t − ( x + 0.45s) ⎞ ⎛ t − ( x + 0.45s) ⎞ ⎤ ⎟⎥ ⎢ + − 1 . 168 ( 1 . 1 ) ( 0 . 1913 ) ⎢ ⎜ ⎟ ⎜ ⎟⎥ ⎜ ⎛ t − ( x + 0.45s) ⎞ ⎢ ⎝ 0.7797s ⎠ ⎝ 0.7797s ⎠ ⎥ ⎟⎟⎥ RU (t ) = exp⎢− exp⎜ ⎜ ⎟ − zγ ⎢ ⎢ ⎥ ⎟⎥ n ⎜ ⎝ 0.7797s ⎠ ⎢ ⎢ ⎥ ⎟⎥ ⎜ ⎢ ⎣ ⎦ ⎠⎥⎦ ⎝ ⎣
Two-sided approximate Weibull limits: 0.5 ⎞ ⎤ ⎡ 2 ⎛ ⎡ ⎜ ⎛ t − ( x + 0.45s ) ⎞ ⎤ ⎟⎥ ⎛ t − ( x + 0.45s ) ⎞ ⎢ 1 . 168 ( 1 . 1 ) ( 0 . 1913 ) − + ⎢ ⎥ ⎟ ⎟ ⎜ ⎜ ⎜ ⎛ t − ( x + 0.45s) ⎞ ⎢ ⎝ 0.7797 s ⎠ ⎥ ⎟⎟⎥ ⎝ 0.7797 s ⎠ RL (t ) = exp ⎢− exp⎜ ⎜ ⎟ + z(1−γ ) 2 ⎢ ⎢ ⎥ ⎟⎥ 0 . 7797 s n ⎠ ⎝ ⎜ ⎢ ⎢ ⎥ ⎟⎥ ⎜ ⎢ ⎣ ⎦ ⎠⎥⎦ ⎝ ⎣ 0.5 ⎞⎤ ⎡ 2 ⎛ ⎡ ⎜ ⎛ t − ( x + 0.45s) ⎞ ⎛ t − ( x + 0.45s) ⎞ ⎤ ⎟⎥ ⎢ 1 . 168 ( 1 . 1 ) ( 0 . 1913 ) + − ⎥ ⎢ ⎜ ⎟ ⎜ ⎟ ⎜ ⎛ t − ( x + 0.45s) ⎞ ⎢ ⎝ 0.7797s ⎠ ⎝ 0.7797s ⎠ ⎥ ⎟⎟⎥ RU (t ) = exp⎢− exp⎜ ⎜ ⎟ − z(1−γ ) 2 ⎢ ⎥ ⎟⎥ ⎢ n ⎜ ⎝ 0.7797s ⎠ ⎢ ⎥ ⎟⎥ ⎢ ⎜ ⎢ ⎦ ⎠⎥⎦ ⎣ ⎝ ⎣
5.3. Acceleration Models Acceleration models are needed to determine how the TTF distribution behaves as a function of the accelerant. The accelerant can be a stress (such as temperature, voltage, pressure, etc.), or it can be an “indicator” variable (such as a product feature or design attribute). These are also sometimes called “categorical” variables. One of the most common ways to quantify acceleration factors is to perform tests at various stress levels (and, in the case of indicator variables, for various product features or design attributes). 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 206
Chapter 5: Life Data Modeling
Accelerated testing is often used for this purpose, in which case tests are performed at stress levels higher than the item will experience in use, to speed up failure processes. Acceleration models consist of two generic types: Physical Acceleration Models: For well-understood failure mechanisms, one may have a model based on physical/chemical theory that describes the failurecausing process over the range of the data and provides extrapolation to use conditions. Empirical Acceleration Models: Empirical acceleration models are used when there is little understanding of the chemical or physical processes leading to failure, and a model can be empirically determined to describe the observed data. 5.3.1. Fundamental Acceleration Models
In practice, the acceleration models used are a combination of physical and empirical, in that theory may be used to determine the appropriate form of the acceleration model, but the specific model constants are almost always determined empirically. There are four basic forms of accelerated life models. Combinations of these are also possible: The linear model is:
y = ax + b
The exponential model is:
y = be ax The power law model is:
y = bx a The Logarithmic model is:
y = a ln(x) + b
In all of these equations, “y” is the dependent variable, usually either lifetime (as measured by characteristic life or mean life, depending on the TTF distribution used), or failure rate. Since the failure rate is the reciprocal of the mean life (in the case of the Reliability Information Analysis Center 207
Chapter 5: Life Data Modeling
exponential distribution), the constant “a” will generally be positive in one case and negative in the other. The most commonly used reliability models are the power law and exponential models. Several points regarding acceleration models are: • • •
There is no “correct” acceleration factor to use for a specific application Several different acceleration factor model forms are often equally applicable The “best” acceleration factor model is often the one that best fits the empirical data
5.3.1.1. Examples
Several commonly used acceleration models are summarized in this section. Arrhenius The Arrhenius relationship is a widely used model describing the effect that temperature has on the rate of a simple chemical reaction:
L ∝ Ae
⎡ Ea ⎤ ⎢ KT ⎥ ⎣ ⎦
where: L= A= Ea = T=
the lifetime a life constant the activation energy in eV the absolute temperature in degrees Kelvin
It can be seen that this is the exponential model, with the reciprocal of temperature used as the stress. The Arrhenius model is the most widely used for evaluating the effect of temperature on reliability. It is applicable to situations in which the failure mechanism is a function of the steady state temperature, such as corrosion, diffusion, etc. Notable observations about the Arrhenius acceleration model are that: • •
It was derived centuries ago to model chemical reaction rates Over the last few decades, it has been applied to electronics reliability modeling, since it often empirically fit the data reasonable well
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 208
Chapter 5: Life Data Modeling
• •
In the formative years of the electronics industry, many failure mechanisms were related to corrosion and contamination, which are inherently chemical reaction rates for which the Arrhenius factor applies reasonably well It has since been applied to many other failure mechanisms, with an assumed applicability
Eyring The Eyring model is: ⎡
B⎤
1 − ⎢ A− ⎥ L∝ e ⎣ T⎦ T Coffin-Manson A form of fatigue life strain models is the Coffin-Manson “life vs. plastic strain”, which is often used for solder joint reliability modeling:
⎛ ΔT AF = ⎜⎜ S ⎝ ΔTU
⎞ ⎟⎟ ⎠
β
where: AF = ΔTU = ΔTS = β=
acceleration factor product temperature in service use, °K product temperature in stress conditions, °K constant for a specific failure mechanism
The number of cycles to failure is expressed as: ⎛ 1 N f = A⎜ ⎜ Δe ⎝ p
⎞ ⎟ ⎟ ⎠
β
where: Nf = A= Δep = β=
number of cycles to failure a material constant plastic strain range a material constant Reliability Information Analysis Center 209
Chapter 5: Life Data Modeling
Since ΔT ∝ Δep, a simplified acceleration factor for temperature cycling fatigue testing is:
⎛ ΔT N AF = use = ⎜⎜ test N test ⎝ ΔTuse
⎞ ⎟⎟ ⎠
β
The Coffin-Manson model is also sometimes used to model the acceleration due to vibration stresses. Random vibration input and response curves are typically plotted on log-log paper, with the power spectral density (PSD) expressed in squared acceleration units per hertz (G2/Hz), plotted along the vertical axis, and the frequency (Hz) plotted along the horizontal axis. G2 Δf → 0 Δ f
P = lim
In the above equation, “G” is the root mean square (RMS) of the acceleration, expressed in gravity units, and “Δf” is the bandwidth of the frequency range expressed in hertz. Since “G” is the agent of failure that causes fatigue, the following inverse power model applies:
L(G ) ∝
1 1 ⇒ Life = β G KG β
The acceleration factor for vibration based on Grms for similar product responses is represented by:
⎛G N AF = use = ⎜⎜ test N test ⎝ G use
⎞ ⎟⎟ ⎠
β
5.3.2. Combined Models
Acceleration models with more than one accelerating variable might be suggested when it is known that two or more potential accelerating variables contribute to degradation and failure. Several examples follow.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 210
Chapter 5: Life Data Modeling
Temperature and Non-Thermal Stress When temperature and a second non-thermal stress (e.g., voltage) are the accelerated stresses of a test, then the Arrhenius and the Inverse Power Law models can be combined to yield the Temperature-Non-Thermal (T-NT) model (Reference 10):
L(U ,V ) =
C n
U e
−
B V
where: U= non-thermal stress (i.e., voltage, vibration, etc.) V= temperature (in °K) B, C, n = parameters to be determined The T-NT relationship can be linearized and plotted on a Life vs. Stress plot by taking the natural logarithm of both sides:
ln[L(U ,V )] = ln(C ) − n ln(U ) +
B V
Here, the log of the life is equal to a linear relationship, where the intercept is ln(C), the slope of ln(U) is “n” and the slope of 1/V is “B”. The acceleration factor for the T-NT relationship is given by: B
AF =
LUse LAccelerated
=
C Vu e U un B
C VA e U An
n
⎛ 1
where: LUse = LAccelerated = Vu = VA = Uu =
1 ⎞
⎛ U ⎞ B ⎜⎜ V −V ⎟⎟ = ⎜⎜ A ⎟⎟ e ⎝ u A ⎠ ⎝ Uu ⎠
the life at use stress level the life at the accelerated stress level the use temperature level the accelerated temperature level the use non-thermal level Reliability Information Analysis Center 211
Chapter 5: Life Data Modeling
UA =
the accelerated non-thermal level
Temperature-Humidity Models A variation of the Eyring relationship is the Temperature-Humidity (TH) relationship. This combination model is expressed as:
L(V , U ) = Ae
⎛φ b ⎞ ⎜ + ⎟ ⎝V U ⎠
where, “φ” and “b” are parameters to be determined (the parameter “b” is also known as the activation energy for humidity), “A” is a constant, “U” is the relative humidity (decimal or percentage), and “V” is the temperature (in absolute units, °K). Note that the relative humidity can be expressed in either a decimal format or as a percentage, as long as it is consistent throughout the analysis. The relationship is linearized by taking the natural logarithm of both sides of the equation:
ln[L(V , U )] = ln( A) +
φ V
+
b U
The acceleration factor for the TH relationship is:
AF =
LUse LAccelerated
=
Ae Ae
⎛ φ b ⎜ + ⎜V U u ⎝ u
⎞ ⎟ ⎟ ⎠
⎛ φ b ⎞ ⎜⎜ ⎟⎟ + ⎝ VA U A ⎠
=e
⎛ 1 1 ⎞ ⎛ 1 1 ⎞ ⎟ − ⎟⎟ + b ⎜⎜ − ⎟ ⎝ Vu V A ⎠ ⎝ U u U A ⎠
φ ⎜⎜
where: LUse = LAccelerated = Vu = VA = Uu = UA =
the life at use stress level the life at the accelerated stress level the use temperature level the accelerated temperature level the use humidity level the accelerated humidity level
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 212
Chapter 5: Life Data Modeling
Peck Model The Peck model (Reference 6) is: −n
L ∝ ( RH ) e
⎡ Ea ⎤ ⎢ KT ⎥ ⎣ ⎦
where: RH = T= n= Ea = K=
Relative Humidity temperature constant activation energy Boltzman’s constant = 8.617 x 10-5 eV/°K
Note that this is a multiplicative model consisting of a power law for humidity and the Arrhenius model for temperature. The British Telecom Model The British Telecom model, also used in the Telcordia standards (Reference 7) is:
L∝e
⎡ Ea ⎤ 2 ⎢ KT ⎥ + n ( RH ) ⎣ ⎦
This model includes the effects of both temperature and relative humidity. Harris Model Wearout data published by the Harris Corporation shows a good fit to Peck’s model (Reference 8) in representing aluminum corrosion. This model is:
AF = e
⎡ Ea ⎢ ⎣⎢ k
⎛ 1 1 ⎞ ⎤ ⎛ RH S ⎜⎜ ⎟⎟ ⎥ ⎜⎜ − T T ⎥ ⎝ RH U S ⎠⎦ ⎝ U
⎞ ⎟⎟ ⎠
a
⎛ VS ⎜⎜ ⎝ VU
⎞ ⎟⎟ ⎠
b
where: AF = Ea = k= TU =
acceleration factor activation energy Boltzman’s constant = 8.617 x 10-5 eV/°K product temperature in service use, °K Reliability Information Analysis Center 213
Chapter 5: Life Data Modeling
TS = RHU = RHS = VU = VS = a= b=
product temperature in stress conditions, °K relative humidity in service use relative humidity in stress conditions voltage in service use voltage in stress conditions 2.66 based on Peck 1.4 (from Harris data)
Fatigue and S-N curves With metals and alloys, the fatigue process starts with dislocations, or crystallographic irregularities, that ultimately result in crack formation. It is a probabilistic phenomenon with a significant variation in lifetime. The S-N curves quantify the relationship between stress and the number of stress cycles to failure. It is essentially a life-stress relationship for metals. The curves are generally obtained by testing samples of the metal or alloy, and have been published in various handbooks.
In estimating fatigue life for materials, the model is used as the analytical representation of the so-called “S-N” curves, where “S” is stress amplitude and “N” is life (in cycles to failure), such that N = kS-b, where “b” and “k” are material parameters either estimated from test data or published in handbooks. Miner’s Rule Miners rule states that the amount of damage sustained by a metal is proportional to the number of cycles it experiences, as follows: k
ni
∑N i =1
=C
i
There are “k” stress levels (one for each contribution “n” cycles), “N” is the total number of cycles at a constant stress reversal, and “C” is usually assumed to be 1.0. It essentially estimates the percentage of life used by each stress reversal at each specific magnitude. 5.3.3. Cumulative Damage Model
Many situations arise in which there is cumulative damage inflicted on an item when subjected to a stress. For those situations where a Weibull distribution is appropriate, the reliability function is expressed as:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 214
Chapter 5: Life Data Modeling
R (t ) = e
⎛t ⎞ −⎜ ⎟ ⎝α ⎠
β
where: R(t) = reliability – the probability of survival at time t β= Weibull shape parameter (in time space) α= Characteristic life as a function of the stressor If it is assumed that the acceleration can be described by a power law, then: ⎛a⎞ α =⎜ ⎟ ⎝S⎠
n
where: S= a= n=
stressor life constant fatigue exponent in time space
Combining the two equations yields:
R (t ) = e
⎛ ⎜ ⎜ t −⎜ n ⎜⎛a⎞ ⎜ ⎟ ⎜ S ⎝⎝ ⎠
⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠
β
The modeling process estimates β, “a” and “n”. The premise of the cumulative damage model is that the amount of life used per cycle is proportional to the stressor raised to the “n” power:
⎛S ⎞ te = t1 ⎜⎜ 1 ⎟⎟ ⎝ S0 ⎠
n
Reliability Information Analysis Center 215
Chapter 5: Life Data Modeling
where: te = S0 =
equivalent time at stressor 1 relative to S0 normalization stressor
This cumulative damage model is particularly useful when the stresses are time varying, since an equivalent amount of damage can be estimated per unit time, regardless of the behavior of stress as a function of time. This model is also consistent with fatigue, which is essentially a cumulative damage scenario.
5.4. MLE Equations The previous sections summarized information relative to the selection of a specific distribution and acceleration factors. Once these have been determined for a particular situation, and tests have been performed at various levels of acceleration, the next step is to quantify model parameters. Previously, in the discussion on distributions and parameter estimation with MLE, only the distribution parameters were considered, not the acceleration model parameters. A life model needs parameter estimates for both the distribution parameters and the acceleration model parameters. In this section, the likelihood equations for various combinations of distributions and acceleration models are presented. The form of a life model is the distribution equation, with the mean or characteristic life (depending on the distribution) replaced with the acceleration model. For example, if a Weibull distribution is used, the reliability function is:
R (t ) = e
⎛t ⎞ −⎜ ⎟ ⎝α ⎠
β
where: R(t) = reliability – the probability of survival at time t β= Weibull shape parameter (in time space) α= Characteristic life as a function of stressor: And, if the acceleration model is the power law:
⎛a⎞ α =⎜ ⎟ ⎝S⎠
n
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 216
Chapter 5: Life Data Modeling
where: S= a= n=
stressor life constant fatigue exponent in time space
Then, combining the two equations yields:
R(t ) = e
⎛ ⎜ ⎜ t −⎜ n ⎜⎛ a⎞ ⎜⎜S ⎟ ⎝⎝ ⎠
⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠
β
The modeling process estimates β, “a” and “n”. Once these parameters are estimated, the life distribution for any stress level can be obtained. 5.4.1. Likelihood Functions
The likelihood functions for the six combinations of distribution (exponential, Weibull, lognormal) and acceleration model (Arrhenius, Inverse Power Law) are provided below. Exponential-Arrhenius Reaction Rate Model: B
B
M B t − T − L = ∑ N i (− − ln(C ) − i e Vi ) − ∑ N i Ri e Vi Vi C C i =1 i =1 N
−B ⎞ −B ⎞ ⎛ ⎛ ⎡ − ⎛⎜ Tbie−VBi ⎞⎟ ⎡ ⎜ T e Vi ⎟ ⎤ ⎜ T e Vi ⎟ ⎤ − ⎜ ai C ⎟ − ⎜ Li C ⎟ ⎟ ⎜ C K L ⎢ ⎜ ⎢ ⎟⎥ ⎜ ⎟ ⎟⎥ ⎜ ⎠ ⎠ ⎠ −e ⎝ + ∑ N i Ln ⎢1 − e ⎝ ⎥ ⎥ + ∑ N i Ln ⎢e ⎝ i =1 ⎥ ⎢ ⎥ i =1 ⎢ ⎦⎥ ⎣⎢ ⎦⎥ ⎣⎢
Exponential-Inverse Power Law (IPL): M
N
L = ∑ N i (ln K + n ln(S i ) − KS t ) − ∑ N i KS inTRi n i i
i =1
K
+ ∑ N i Ln(1 − e i =1
− KS inTLi
L
i =1
) + ∑ N i Ln(e − KSi Tbi − e − KSi Tai ) n
i =1
Reliability Information Analysis Center 217
n
Chapter 5: Life Data Modeling
Weibull-Arrhenius: ⎛ ⎜ N ⎜ β L = ∑ N i ln⎜ B i =1 ⎜ Ce Vi ⎜ ⎝
⎛ ⎜ ti ⎜ B ⎜ Vi ⎝ Ce
⎛ ⎞ ⎛ ⎜ T ⎟ ⎜ − ⎜ LiB ⎟ K ⎜⎜ ⎟ ⎜ Vi ⎟ + ∑ N i ln⎜1 − e ⎝ Ce ⎠ i =1 ⎜ ⎜ ⎝
β
⎞ ⎟ ⎟ ⎟ ⎠
β −1
e
⎛ ⎜ t −⎜ i B ⎜⎜ Vi ⎝ Ce
⎞ ⎟ ⎟ ⎟⎟ ⎠
β
⎞ ⎟ ⎛ ⎟ M ⎜ TRi N − ⎟ ∑ i⎜ B ⎟ i =1 ⎜⎝ Ce Vi ⎟ ⎠
⎛ ⎞ ⎞ ⎛ ⎛⎜ T ⎞⎟ ⎜ T ⎟ ⎟ ⎜ −⎜ aiB ⎟ − ⎜ biB ⎟ ⎜⎜ ⎟⎟ ⎟ L ⎜ ⎜⎜⎝ Ce Vi ⎟⎟⎠ ⎝ Ce Vi ⎠ N ln e e + − ⎟ ∑ i ⎜ ⎟ i =1 ⎜ ⎟ ⎜ ⎠ ⎝ β
⎞ ⎟ ⎟ ⎟ ⎠ β
β
⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠
Weibull-IPL: N
(
L = ∑ N i ln⎛⎜ βKS in KS in t i ⎝ i =1
)
e(
β −1 − KS nt i i
)β ⎞⎟ − N (KS nT )β ∑ i i Ri M
⎠
i =1
n n n + ∑ N i ln⎛⎜1 − e −(KSi TLi ) ⎞⎟ + ∑ N i ln⎛⎜ e −(KSi Tai ) − e −(KSi Tbi ) ⎞⎟ ⎝ ⎠ i =1 ⎝ ⎠ i =1
K
β
L
β
β
Lognormal-Arrhenius: ⎛ ⎛ B ⎞⎞ B ⎞⎞ ⎛ ⎛ ⎜ ⎜ ⎜ ln(TRi ) − ln(C ) − ⎟ ⎟ ⎜ ln(t i ) − ln(C ) − ⎟ ⎟ M Vi ⎟ ⎟ Vi ⎟ ⎟ ⎜ ⎜ 1 L = ∑ N i ln⎜ φ ⎜ + ∑ N i ln⎜1 − Φ⎜ ⎟ ⎜ ⎟⎟ ⎜ ⎟ t σ σ σ i =1 i =1 ⎜⎜ ⎜⎜ i ⎜ ⎜ ⎟ ⎟⎟ ⎟ ⎟⎟ ⎝ ⎠⎠ ⎝ ⎠⎠ ⎝ ⎝ ⎛ ⎛ ⎛ ⎛ B ⎞⎞ B⎞ B ⎛ ⎜ ⎜ ln(TLi ) − ln(C ) − ⎟ ⎟ L ⎜ ⎜ ln(Tbi ) − ln(C ) − ⎟ ⎜ ln(Tai ) − ln(C ) − K Vi ⎟ ⎟ Vi ⎟ Vi ⎜ ⎜ + ∑ N i ln⎜ Φ⎜ + ∑ N i ln⎜ Φ⎜ − Φ⎜ ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ σ σ σ i =1 i =1 ⎜⎜ ⎜ ⎜⎜ ⎜ ⎟ ⎟⎟ ⎟ ⎜ ⎠⎠ ⎠ ⎝ ⎝ ⎝ ⎝ ⎝ N
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 218
⎞⎞ ⎟⎟ ⎟⎟ ⎟⎟ ⎟ ⎟⎟ ⎠⎠
Chapter 5: Life Data Modeling
Lognormal-IPL: N ⎛ 1 ⎛ ln(t i ) + ln( K ) + n ln(S i ) ⎞ ⎞ φ⎜ L = ∑ N i ln⎜⎜ ⎟ ⎟⎟ σ ⎠⎠ i =1 ⎝ σt i ⎝ M ⎛ ⎛ ⎛ ln(TLi ) + ln( K ) + n ln(S i ) ⎞ ⎞ ⎛ ln(TRi ) + ln( K ) + n ln(S i ) ⎞ ⎞ K + ∑ N i ln⎜⎜1 − Φ⎜ ⎟ ⎟⎟ + ∑ N i ln⎜⎜ Φ⎜ ⎟ ⎟⎟ σ σ ⎝ ⎠ ⎠ i =1 ⎠⎠ i =1 ⎝ ⎝ ⎝ L ⎛ ⎛ ln(Tbi ) + ln( K ) + n ln(S i ) ⎞ ⎛ ln(Tai ) + ln( K ) + n ln(S i ) ⎞ ⎞ + ∑ N i ln⎜⎜ Φ⎜ ⎟ ⎟⎟ ⎟ − Φ⎜ σ σ ⎝ ⎠⎠ ⎠ i =1 ⎝ ⎝
Solutions for the parameters’ stress-life log-likelihood functions can be obtained by setting their first order partial derivatives equal to zero and applying iterative methods. The second order partial differential equations for each of the six combinations are presented below. The advantage to using the second order partials is the potential for dual use in the Fisher information matrix and the required iterative methods for attaining the necessary parameter solutions. Exponential Arrhenius: ∂2L ∂2L ∂2L , , ∂B 2 ∂ C 2 ∂B ∂ C
Exponential IPL: ∂2L ∂2L ∂2L , , ∂K 2 ∂n 2 ∂K ∂n
Weibull Arrhenius:
∂2L ∂2L ∂2L ∂2L ∂2L ∂2L , , , , , ∂β 2 ∂B 2 ∂C 2 ∂β∂B ∂β∂C ∂B∂C Weibull IPL:
∂2L ∂2L ∂2L ∂2L ∂2L ∂2L , , , , , ∂β 2 ∂K 2 ∂n 2 ∂β∂K ∂β∂n ∂K∂n Reliability Information Analysis Center 219
Chapter 5: Life Data Modeling
Lognormal Arrhenius: ∂2L ∂2L ∂2L ∂2L ∂2L ∂2L , , , , , ∂ B 2 ∂C 2 ∂ σ 2 ∂ B ∂C ∂B ∂σ ∂ C ∂σ
Lognormal IPL: ∂2L ∂2L ∂2L ∂2L ∂2L ∂2L , , , , , ∂ K 2 ∂n 2 ∂ σ 2 ∂ K ∂ n ∂K ∂ σ ∂ n ∂ σ
The likelihood function will yield a value for all possible combinations of parameter values. A useful tool in data analysis is a plot of the likelihood value. As an example, Figure 5.4-1 illustrates a contour plot of the likelihood value for an exponential-IPL model.
Figure 5.4-1: Likelihood Contour Example 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 220
Chapter 5: Life Data Modeling
In this example, the plot lines represent values of equal likelihood as a function of the two parameters of interest (i.e., the Weibull slope and the exponent in the power law acceleration model). The center position represents the combination of beta and “n” at which the maximum value of likelihood occurs. The height of the likelihood value increases as the center of the contour lines is approached. The spread in the contour lines of equal likelihood are proportional to the uncertainty in the parameter estimates, and in fact are one way to estimate confidence bounds on the model parameters. Also, the dispersion of the likelihood values on the “n” axis can be thought of as the spread of the TTFs in the stress dimension, and the dispersion of the likelihood values on the “beta” axis can be thought of as the spread of the TTFs in the time dimension.
5.5. References 1. Lyu, M.R. (Editor), “Handbook of Software Reliability Engineering”, McGraw-Hill, April 1996, ISBN 0070394008 2. Musa, J.D.; Iannino, A.; and Okumoto, K.; “Software Reliability: Measurement, Prediction, Application”, McGraw-Hill, May 1987, ISBN 007044093X 3. Musa, J.D., “Software Reliability Engineering: More Reliable Software, Faster Development and Testing”, McGraw-Hill, July 1998, ISBN 0079132715 4. Nelson, W., “Applied Life Data Analysis”, John Wiley & Sons, 1982, ISBN0471094587 5. Fisher, R. A., 1912, “On an Absolute Criterion for Fitting Frequency Curves”, Messenger of Mathematics, Vol. 41, pp. 155-160. [Reprinted in Statistical Science, Vol. 12, (1997) pp. 39-41.] 6. Peck, S., IRPS tutorial, 1990 7. Telcordia GR1221 8. Peck and Hallberg, “Quality and Reliability Engineering International”, 1991 9. Hald, A., 1999, “On the Maximum Likelihood in Relation to Inverse Probability and Least Squares.” Statistical Science, Vol. 14, No. 2, pp. 214-222. 10. Accelerated Life Testing Analysis (ALTA), Reliasoft Corp.
Reliability Information Analysis Center 221
Chapter 5: Life Data Modeling
This page intentionally left blank
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 222
Chapter 6: Interpretation of Reliability Estimates
6.
Interpretation of Reliability Estimates
This chapter presents topics related to the interpretation of various aspects of reliability models. It is hoped that this information will provide the reader with information that allows for a better intuitive understanding of reliability predictions, assessments and estimations.
6.1. Bathtub Curve The “bathtub curve” is a general reliability model of failure rate as a function of time that, for hardware, has three distinct periods. It is often misunderstood and misinterpreted. It should be thought of as a concept rather than an actual failure rate function. A generic bathtub curve is shown in Figure 6.1-1.
Figure 6.1-1: Bathtub Curve The three regions are: Infant Mortality. In this first portion of the bathtub curve, the failure rate is relatively high because a portion of the population may contain parts with defects. These parts Reliability Information Analysis Center 223
Chapter 6: Interpretation of Reliability Estimates
generally fail earlier than those in the main population. The shape of the failure rate curve is decreasing, with its rate of decrease dependent on the maturity of the design and manufacturing processes, as well as the applied stresses. Useful Life. The second portion of the bathtub curve is known as the “useful life” and is characterized by a relatively constant failure rate caused by randomly occurring failures. It should be noted that the failure rate is only related to the height of the curve, not to the length of the curve, which is a representation of product or system life. If items are exhibiting randomly occurring failures, then they fail according to the exponential distribution, in accordance with a Poisson process. Since the exponential distribution exhibits a constant hazard rate, we can simply add the failure rates for all items making up an item to estimate the overall failure rate of that item during its useful life. Wearout. The last part of the curve is the wearout portion. This is where items start to deteriorate to such a degree that they are approaching, or have reached, the end of their useful life. This is often relevant to mechanical parts, but can also apply to any failure cause that exhibits wearout behavior.
It is important to understand the difference between the MTBF of an item and the useful life of that same item. Items that experience wearout failure modes/mechanisms will have some period of useful life before they fail as a result of wearout. This useful life is not the same as the item MTBF. During useful life, an item may also experience randomly occurring “freak” failures caused by weak components or faulty workmanship, especially if the item is subjected to high stress conditions. The occurrence of these random failures during an item’s useful life results in higher failure rates, or lower MTBF, for that item. Mechanical items are usually most prone to wearout and, therefore, we are usually most concerned with the useful life, or MTTF, associated with these items. Electronic items usually become obsolete long before any significant wearout takes place6. Therefore, the infant mortality and constant failure rate portions of the bathtub curve are of the most interest for these items. The bathtub curve conceptually offers a good view of the three primary types of failure categories. It is essentially a composite failure rate curve comprised of three generic types of failure causes. In practice, however, the well defined curve of Figure 6.1-1 is rare. The actual curve for a product or system will depend on many factors. A specific 6
It should be noted, however, that with the progressively decreasing feature sizes of current state-of-the-art microelectronic devices, the issues associated with wearout and useful life are becoming of greater concern.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 224
Chapter 6: Interpretation of Reliability Estimates
failure cause will generally exhibit characteristics of only one segment of the bathtub curve, but when the characteristics of all of the other failure causes for that product or system are considered, and a composite model is generated, the curve will have a shape that deviates from the classic bathtub curve, even though it will often contain elements of each of the three portions. Usually, the composite curve will be dominated by the characteristics of those failure causes that dominate the overall reliability of the item. It is also important to note that defects do not always manifest themselves as infant mortality failures. They can appear to be infant mortality, random or wearout, depending on the specific characteristics of the failure mechanism and factors, such as defect severity distributions.
6.2. Common Cause vs. Special Cause The fact that a failure rate can be predicted for a given part under a specific set of conditions does not imply that a failure rate is an inherent quality of the part. The probability of failure is a complex interaction between the inherent defect density, defect severity, and stresses incurred in operation. Failure rates predicted using empirical models are, therefore, typical failure rates only and represent typical defect rates, design characteristics and use conditions. The accuracy of these prediction models is dependent on: • • •
The model developers’ ability to identify the variables (component- or userelated) that most heavily influence reliability The level of detailed data to which the model user has access The quantity and quality of the data on which the models are based
The accuracy of a reliability model is a strong function of the manner in which defects are accounted for. Therefore, there is a trade-off between the usability of the model and the level of detailed data that it requires. This highlights the fact that the purpose of a reliability prediction must be clearly understood before a methodology is chosen. Practical considerations for choosing an approach will inevitably include the types and level of detail of information available to the analyst. Given the practical time and cost constraints that most reliability practitioners face, it is usually important that the chosen reliability prediction methodology be based on data and information accessible to them. Model developers have long known that many of the factors which had a major influence on the reliability of the end product were not included in traditional methods like MILHDBK-217, but under the “constraints” of handbook users, these factors could not be Reliability Information Analysis Center 225
Chapter 6: Interpretation of Reliability Estimates
included in the models. For example, it was known that manufacturing processes had a major impact on end item reliability, but those are the factors which corporations hold most proprietary. As an example of this, a physics-of-failure-like model was developed several years ago for small-scale CMOS technology. This model required many input variables, such as metallization cross-sectional area, silicon area, oxide field strength, oxide defect density, metallization defect density etc. While the model has the potential to be much more accurate than the other MIL-HDBK-217 models, it is essentially unusable by anyone other than the component manufacturers who have access to such information. The model is useful, however, for these manufacturers to improve the reliability of their component designs. The two primary purposes for performing a quantitative reliability assessment of systems are (1) to assess the capability of the parts and design to operate reliably in a given application (robustness), and (2) to estimate the number of field failures or the probability of mission success. The first does not require statistically-based data or models, but rather sound part and materials selection/qualification and robust design techniques. It is for this purpose that physics approaches have merit. The second, however, requires empirical data and models derived from that data. This is due to the fact that field component failures are predominantly caused by component and manufacturing defects which can only be quantified through the statistical analysis of empirical data. This can be seen by observing the TTF characteristics of components and systems, which are almost always decreasing, indicating the predominance of defect-driven failure mechanisms. The “handbook” models described in this book provide the data to quantify average failure rates which are a function of those defects. It has been shown that system reliability failure causes are not driven by deterministic processes, but rather by stochastic processes that must be treated as such in a successful model. There is a similarity between reliability prediction and chaotic processes. This likeness stems from the fact that the reliability of a complex system is entirely dependent upon initial conditions (e.g., manufacturing variation) and use variables (i.e., field application). Both the initial conditions and the use application variables are often unknowable to any degree of certainty. For example, the likelihood of a specific system containing a defect is often unknown, depending on the defect type, because the propensity for defects is a function of many variables and deterministically modeling them all is virtually impossible. However, the reliability can be predicted within bounds by using empirically based stochastic models. A critical factor that must be considered when choosing a reliability assessment method is whether the failure mechanism under analysis is a special cause or a common cause 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 226
Chapter 6: Interpretation of Reliability Estimates
mechanism. In other words, a special cause mechanism means that there is an assignable cause to the failure and that only a subpopulation of the item is susceptible to this failure mechanism. Common cause mechanisms are those affecting the entire population. Table 6.1-1 summarizes the characteristics of various categories of failure causes, and identifies whether they are typically common cause or special cause. The categories of failure types encompass the ways a failure cause can manifest itself. These are also categories that can be used in a FMEA. Table 6.1-1: Categories of Failure Effects Failure cause type
Always (Common Cause) Sometimes (Special Cause)
Category of Failure Type Screen Infant Fallout/Out-ofMortality the-Box Failure
Design Not Capable
Process Not Capable
x
x
x
x
x
x
Random Failure
Wearout
x x
x
x
If it is erroneously assumed that special cause mechanisms will affect the entire population, gross errors in the reliability estimates of the population will result. This error results from the assumption of a mono-modal TTF distribution when, in fact, the actual distribution is multimodal. If the distribution is truly mono-modal, only the parameters applicable to a single mode distribution need to be estimated. However, if there are really several sub-populations within the entire population, the parameters of each of the distributions needs to be estimated, along with the percentage of the entire population represented by each distribution. This is especially critical when dealing with defects. In this case, it is critical to understand the percentage of the population that is at risk of failure. To illustrate this, consider the probability plot in Figure 6.2-1. As can be seen in this plot, there is an apparent “knee” in the plot at about 400 hours, an indication of several subpopulations. If a mono-modal distribution is assumed (i.e., the straight line), errors in the cumulative percent fail at a given time will occur. Likewise, if a multimodal distribution is assumed, a much more accurate representation of the situation results (the line through the data points).
Reliability Information Analysis Center 227
Chapter 6: Interpretation of Reliability Estimates ReliaSoft Weibull++ 7 - www.ReliaSoft.com
Probability - Weibull
99.000
Probability-Weibull Data 1 Weibull-Mixed MLE SRM MED FM F=98/S=139 Data Points Susp Points Probability Line
90.000
Unreliability, F(t)
50.000
10.000
5.000
1.000
0.500
0.100 10.000
100.000
1000.000
10000.000
100000.000
Bill Denson Corning 1/15/2008 5:24:32 PM 1000000.000
Time, (t) β[1]=1.3341, η[1]=307.1460, Ρ[1]=0.0646; β[2]=0.7505, η[2]=2.1367Ε+4, Ρ[2]=0.4240; β[3]=4.2735, η[3]=1.1624Ε+5, Ρ[3]=0.5114
Figure 6.2-1: Example of a Non-Mono-Modal Distribution The quantification of subpopulations usually requires data on many more samples relative to the mono-modal situations. If accelerated tests are used to model life, the risk in assuming mono-modality must be considered. For this reason, techniques like stress/strength and first principals are often difficult to use to quantify multimodality. Examples of multimodal distributions The plots presented in Figures 6.2-2 through 6.2-6 illustrate the characteristics of several different types of multimodal distributions. Before each plot, the information on each of the two distributions comprising the multimodal distribution is presented in a table (Tables 6.2-2 through 6.2-6, respectively). Included in this description are the beta value, the eta value (characteristic life) and the portion of the population represented by the distribution. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 228
Chapter 6: Interpretation of Reliability Estimates
Table 6.2-2: Bimodal Population Example 1 Population Beta Eta Portion
Re lia So f t W e ibull+ + 7 - w w w . Re lia So f t. com
1 0.60 61.1 0.42
2 0.59 918.7 0.58
Probability - Weibull
9 9. 9 00
9 0. 0 00
Unreliability, F(t)
5 0. 0 00
1 0. 0 00
5. 0 00
1. 0 00
0. 5 00
0. 1 00 0 . 1 00
1. 00 0
10 . 00 0
1 00 . 00 0
1 000 . 0 00
T ime, (t) F olio 1\1-. 5: β[ 1 ] =0 .6 0 1 0 , η[ 1 ] =6 1 .0 8 6 0 , Ρ[1 ]= 0 .4 2 1 9 ; β[ 2 ]= 0 .5 9 3 6 , η[2 ]= 9 1 8 .6 9 3 5 , Ρ[ 2 ] = 0 .5 7 8 1
Figure 6.2-2: Multimodal Distribution Example 1
Reliability Information Analysis Center 229
10 00 0. 00 0
Chapter 6: Interpretation of Reliability Estimates
Table 6.2-3: Bimodal Population Example 2 Population Beta Eta Portion
Re lia Sof t W e ibull+ + 7 - w w w . Re lia Sof t. co m
1 0.86 341.4 0.63
2 1.4 863.25 0.37
Probability - Weibull
99 . 9 0 0
90 . 0 0 0
Unreliability, F(t)
50 . 0 0 0
10 . 0 0 0
5.000
1.000
0.500
0.100 0. 1 00
1 . 00 0
1 0. 00 0
10 0 . 0 00
1 0 00 . 00 0
T ime, (t) F o lio 1 \1 -1 : β[1 ]= 0 .8 6 3 3 , η[1 ]= 3 4 1 .4 6 5 6 , Ρ[1 ]= 0 .6 3 0 3 ; β [2 ]= 1 .4 0 6 2 , η[2 ]= 8 6 3 .2 7 6 7 , Ρ[2 ]=0 .3 6 9 7
Figure 6.2-3: Multimodal Distribution Example 2
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 230
1 0 00 0. 0 00
Chapter 6: Interpretation of Reliability Estimates
Table 6.1-4: Bimodal Population Example 3 Population Beta Eta Portion
Re lia So f t W e ibull+ + 7 - w w w . Re lia So f t. co m
1 1.81 98.44 0.19
2 1.23 679.4 0.81
Probability - Weibull
9 9. 9 00
9 0. 0 00
Unreliability, F(t)
5 0. 0 00
1 0. 0 00
5. 0 00
1. 0 00
0. 5 00
0. 1 00 0.100
1. 00 0
1 0. 00 0
1 0 0. 00 0
1 00 0. 0 00
T ime, (t) F o lio 1\5-1 : β [1 ]= 1 .8 1 4 4 , η[1 ]= 9 8 .4 4 2 8 , Ρ[1 ]= 0 .1 8 8 1 ; β[ 2 ]= 1 .2 3 8 5 , η[2 ]=6 7 9 .4 4 6 9 , Ρ[ 2 ] =0 .8 1 1 9
Figure 6.2-4: Multimodal Distribution Example 3
Reliability Information Analysis Center 231
10 0 00. 0 00
Chapter 6: Interpretation of Reliability Estimates
Table 6.1-5: Bimodal Population Example 4 Population Beta Eta Portion
Re lia Sof t W e ibull+ + 7 - w w w . Re lia Sof t. co m
1 1.18 206.2 0.19
2 4.69 497.6 0.81
Probability - Weibull
99 . 9 0 0
90 . 0 0 0
Unreliability, F(t)
50 . 0 0 0
10 . 0 0 0
5.000
1.000
0.500
0.100 0. 1 00
1 . 00 0
1 0. 00 0
10 0 . 0 00
1 0 00 . 00 0
T ime, (t) F o lio 1 \. 5-5: β [1 ]=1 .1 8 0 8 , η[1 ]=2 0 6 .1 9 6 8 , Ρ[1 ]= 0 .1 9 4 0 ; β [2 ] =4 .6 9 4 3 , η[2 ]= 4 9 7 .6 3 5 9 , Ρ[2 ]=0 .8 0 6 0
Figure 6.2-5: Multimodal Distribution Example 4
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 232
1 0 00 0. 0 00
Chapter 6: Interpretation of Reliability Estimates
Table 6.1-6: Bimodal Population Example 5 Population Beta Eta Portion
Re lia So f t W e ibull+ + 7 - w w w . Re lia So f t. co m
1 5.71 44.7 0.10
2 4.29 483.7 0.90
Probability - Weibull
99.900
90.000
Unreliability, F(t)
50.000
10.000
5 . 0 00
1 . 0 00
0 . 5 00
0 . 1 00 0. 10 0
1. 00 0
10 . 0 0 0
10 0 . 0 00
1 00 0 . 0 0 0
1 0 00 0. 0 00
T ime, (t) F o lio 1\5 -5 : β [ 1 ]=5 .7 1 6 3 , η[1 ]=4 4 .7 8 9 1 , Ρ[1 ]= 0 .0 9 9 9 ; β[ 2 ]= 4 .2 9 3 2 , η[2 ]= 4 8 3 .7 0 4 2 , Ρ[2 ] =0 .9 0 0 1
Figure 6.2-6: Multimodal Distribution Example 5 A distribution was then obtained by pooling all of the individual distributions described previously. This is shown in Figure 6.2-7. The effect of pooling the various distributions from many failure causes has the effect of randomizing the apparent failure characteristics of the resultant pooled population. This is one of the reasons that a Reliability Information Analysis Center 233
Chapter 6: Interpretation of Reliability Estimates
constant failure rate distribution (i.e., exponential) is usually a reasonably good representation of a complex system’s failure rate characteristics. Re lia So f t W e ibull+ + 7 - w w w . Re lia So f t. co m
Probability - Weibull
99.900
Proba bility-W e ibull a ll da ta W e ibull-Mixe d NLRR SRM MED F M F = 4 98 /S= 0 Da ta Points Pro ba bility Line
90.000
Unreliability, F(t)
50.000
10.000
5 . 0 00
1 . 0 00
0 . 5 00
0 . 1 00 0. 01 0
0 . 1 00
1.000
1 0 . 0 00
1 00 . 00 0
10 0 0. 00 0
1 0 00 0. 0 00
Bill De nson Co rning 3 /14 /2 0 10 9 :43 :2 8 PM 10 0 000 . 0 00
T ime, (t) β [1 ]=0 .7 2 0 8 , η[1 ]=4 3 2 .4 6 1 4 , Ρ[1 ]= 0 .6 6 9 4 ; β[ 2 ]= 4 .4 2 3 2 , η[2 ]= 4 9 1 .0 1 2 0 , Ρ[2 ]=0 .3 3 0 6
Figure 6.2-7: Multimodal Distribution Example of Pooled Data Set To illustrate the reliability theory concepts discussed above, consider an example in which the lifetimes of people are analyzed. The data on which this analysis was based is from http://www.mortality.org (Reference 1) and considers the lifetimes of individuals that died in 2006. The raw data is contained in Figure 6.2-8, which presents the number of deaths occurring at each age. This is the discrete version of the pdf.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 234
Chapter 6: Interpretation of Reliability Estimates 4000
3500
Number of Deaths
3000
2500
2000
1500
1000
500
0 0
20
40
60
80
100
120
Age
Figure 6.2-8: Age at Death Data From this graphic, it can be seen that there are several distinct distributions present. First, is the infant mortality period, which is represented by Mode 1. The second mode, Mode 2, represents deaths in the late teens and early twenties. Then, the third and fourth modes represent deaths from old age. Next, a multimode Weibull distribution was fit to the data, using Reliasoft’s Weibull++ software tool, which allows fitting failure data to multimode distributions. The results are summarized in Table 6.1-7. Table 6.1-7: Four Mode Weibull Distribution Parameters Parameter
Mode 1
Mode 2
Mode 3
Mode 4
Beta
0.184
4.25
4.74
9.61
Eta
0.1030
24.81
67.84
87.67
Portion
0.0090
0.012
0.194
0.784
The composite pdf is shown in Figure 6.2-9. Reliability Information Analysis Center 235
Chapter 6: Interpretation of Reliability Estimates Probability Density Function 0.040
0.032
f(t)
0.024
0.016
0.008
0.000 0.100
22.100
44.100
66.100
88.100
110.100
Time, (t)
Figure 6.2-9: pdf of multimode distribution of ages The failure rate is shown in Figure 6.2-10. Failure Rate vs Time Plot 0.040
0.032
Failure Rate, f(t)/R(t)
0.024
0.016
0.008
0.000 0.100
22.100
44.100
66.100
88.100
110.100
Time, (t)
Figure 6.2-10: Failure Rate of Age Data
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 236
Chapter 6: Interpretation of Reliability Estimates
The Weibull probability plot is shown in Figure 6.2-11. Note that, in this graph, the plot is shown using Weibull scales, i.e. the log of time on the x-axis and double log of unreliability on the y-axis. If this plot was close to a straight line, it would indicate that the distribution could be described adequately with a mono-modal Weibull distribution. Clearly, this is not the case. Probability - Weibull 99.990
90.000
Unreliability, F(t)
50.000
10.000
5.000
1.000
0.500
0.100 0.100
1.000
10.000
110.000
Time, (t)
Figure 6.2-11: Probability Plot of Age Data Figure 6.2-12 illustrates a single mode Weibull fit (straight line) to the data. As can be seen, if the single mode fit is used to estimate probability of death at a specific age, significant errors would result. For example, it would imply that about 20% of the population would live to 110 years. And, it would imply that there is less than .001% probability of death in the first year. This example illustrates the fact that, if there is a “sub-population” of samples with different reliability behavior than the main population, then the TTF distributions may manifest themselves as bimodal or multimodal. It is important that these multimodal distributions be characterized. If one of the two “modes” in the distribution appears as early failures resulting from defects, this information is required to develop an appropriate reliability screen.
Reliability Information Analysis Center 237
Chapter 6: Interpretation of Reliability Estimates Probability - Weibull 99.990
90.000
50.000
10.000
Unreliability, F(t)
5.000
1.000 0.500
0.100 0.050
0.010 0.005
0.001 1.000
10.000
110.000
Time, (t)
Figure 6.2-12: Single Mode Weibull Fit to the Age Data
6.3. Confidence Bounds The topic of confidence bounds has always been important in reliability engineering due to the fact that estimating the uncertainty associated with a reliability estimate is important when making decisions based on that estimate. In this case, the risk associated with being wrong must be assessed. It is a topic that has received a tremendous amount of attention by reliability practitioners and academicians alike. 6.3.1. Traditional Techniques for Confidence Bounds
The traditional manner in which confidence levels are calculated around failure rates is the use of the chi-square distribution, as follows:
λ=
χ 2 (1 − CL ,2 r + 2 ) 2t
where the numerator is a value taken from a chi-square table, and “t” is the number of device hours. A question sometimes arises as to how the confidence bounds calculated in this manner compare to those calculated with the use of the Poisson distribution. From the binomial and Poisson distributions, Farachi (Reference 2) has shown that: 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 238
Chapter 6: Interpretation of Reliability Estimates
n! (1 − q )n−k q k k =0 k!(n − k )! r
1 − CL = ∑
Using the Poisson approximation of the binomial:
n! (1 − q )n − k q k ≈ (nq ) e − nq k !(n − k )! k! k
Combining the above two equations yields: k r −1 ⎡ ( ( (nq)r ⎤ nq) −nq nq) −nq 1 − CL = ∑ e = e ⎢1 + nq + ⋅ ⋅ ⋅ ⋅ + + (r − 1)! (r )! ⎥⎦ k! k =0 ⎣ r
Since:
nq = λ t Then:
⎡ ( ( λt )k −λt λt )r −1 (λt )r ⎤ −λt 1 − CL = ∑ e = e ⎢1 + λt + ⋅ ⋅ ⋅ ⋅ + + ( ) (r )! ⎥⎦ r 1 ! − k =0 k! ⎣ r
The chi-square value is the exact solution to the above equation. The chi-square values are for “λt”, not “λ” alone. Therefore, for a given confidence level and number of failures, the chi-square tables provide the value for “λt”. Therefore, the chi-square values are entirely consistent with the Binomial and Poisson distributions. It is important to note that the confidence bounds based on the chi-square distribution summarized above pertain to the uncertainty from statistical considerations alone. They do not account for variations in failure rate due to other noise factors, such as: • • •
Uncertainty in the number of hours or failures Whether the failure causes are truly relevant Time dependencies of the failure rate Reliability Information Analysis Center 239
Chapter 6: Interpretation of Reliability Estimates
Additional information on confidence bounds is included in the section on life modeling. 6.3.2. Uncertainty in Reliability Prediction Estimates
One of the limitations of reliability predictions that are based on handbook models is that they can only provide “point estimates” of failure rates. These failure rates are based on whatever data was available to make up the model, and the model development approach. There are no statistical confidence limits or intervals that can be associated with handbook model data. Traditional methods are not applicable because there are many more factors contributing to the uncertainty than the statistical-only considerations of traditional techniques. For example, consider the following summary of the model development and use approach, along with the potential sources of error, as shown in Figure 6.3-1. The sources of error are highlighted in the gray boxes. From this, it can be seen that there are many sources of noise. The model output results reflect the cumulative effects of the uncertainties in all of the noise sources shown. Although a theoretical basis for the calculation of the confidence bounds around reliability predictions is extremely difficult to derive, it is possible to empirically observe the degree of uncertainty. Reliability predictions performed using empirical models developed from field data result in a failure rate estimate with relatively wide confidence bounds. Table 6.3-1 presents the multipliers of the failure rate point estimate as a function of confidence level. This data was obtained by analyzing data on systems for which both predicted and observed data was available. For example, using traditional approaches, one could be 90% certain that the true failure rate was less than 7.57 times the predicted value.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 240
Chapter 6: Interpretation of Reliability Estimates
Raw Input Data Item Information: Manufacturer Manufacturing Date Quality Defect rate Data: Operating Hours Time to Failure Number of Failures Failure Relevancy Degradation vs Catastrophic
Unmodeled Noise Factors Modeled Factors: Environmental: Temperature Humidity Delta T Radiation Contaminants Operational Profile: Duty Cycle Cycling Rate Operating Stress Electrical Stress Mechanical Extreme Events
User
Item Information: Manufacturing Date Quality Defect Rate Environmental Stresses: Temperature Humidity Delta T Radiation Contaminants Operational Profile: Duty Cycle Cycling Rate Operating Stress Electrical Stress Mechanical Extreme Events
Model Development Model Censored Data; Biased Estimators; Assumptions Made in Modeling
Model Output
Figure 6.3-1: Sources of Error in Empirical Models Reliability Information Analysis Center 241
Chapter 6: Interpretation of Reliability Estimates
Table 6.3-1: Failure Rate Uncertainty Level Multipliers Percentile 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Multiplier 0.13 0.26 0.44 0.67 1.00 1.49 2.29 3.78 7.57
An interesting effect occurs when combining the distributions that describe the uncertainties of the individual components comprising a system. The uncertainties are wider at the piece-part level than at the system level. If one were to take the distributions of failure rate from the regression analysis used to derive the component model (i.e., standard error estimate), and statistically combine them with a Monte Carlo summation, the resultant distribution describing the system prediction uncertainty will have a variance much smaller than that of the individual components comprising the system. The reason for this is the effect of the Central Limit Theorem which quantifies the variance of summed distributions. For example, the variance around the component failure rate estimate is higher than the variance suggested by the above table. However, the variance in the above table is observed to be much larger than that theoretically derived by summing the component failure rate distributions. This implies that there are system-level effects that contribute to the uncertainty that are not accounted for in the component-based estimate. Bayesian techniques, such as those used in the 217Plus system reliability assessment methodology, allow the refinement of analytical predictions over time to reflect the experienced reliability of an item as it progresses through in-house testing, initial field deployment and subsequent use by the customer. In-house testing can be comprised of accelerated tests at the component or equipment level, reliability growth tests, and reliability screens or accelerated screening techniques. We will not discuss Bayesian methods in detail here. The primary benefit of using Bayesian techniques can be implied from Figure 6.3-2, however. As more and more test and experience data is factored into the initial analytical reliability prediction, the 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 242
Chapter 6: Interpretation of Reliability Estimates
statistical confidence levels represented by the outside (red) lines on the graph continue to converge on the “True MTBF” of the subject item. Using Bayesian techniques, as time approaches infinity the predicted inherent MTBF and the true MTBF of the device, product or system population become one and the same. This, of course, assumes that MTBF is the appropriate metric, but the same situation conceptually applies to other metrics such as failure rate and reliability (R).
Prediction Assessment Estimation
Paper Analysis
MTBF
In-House Testing
Field Data Upper Confidence “True MTBF” Lower Confidence Level
TIME
Figure 6.3-2: Confidence Level Through Prediction, Assessment and Estimation
6.4. Failure Rate vs pdf The biggest distinction to be made when assessing reliability is whether the time period of interest for the item under analysis is in the “meat” of the TTF distribution, or whether it is in the extreme left tail of the distribution. For example, consider a system that has a five year design life. If an item has a mean life of three years, clearly precautions would Reliability Information Analysis Center 243
Chapter 6: Interpretation of Reliability Estimates
be required, such as preventive maintenance. The reliability of these types of items is usually easier to predict because they can be tested to failure in relatively short times, and small sample sizes will usually suffice. On the other hand, consider a component that has a failure rate of 2 FITs7, typical for many modern electronic components. In the five year design life, assuming continuous operation, the reliability would be:
R=e
(− λ t )
= 0 .999912
Or, a probability of failure of 0.000088. Therefore, if there were 10,000 of these components operating in a system, the expected number of failures in the five year period would be less than one. Predicting the reliability behavior of a failure cause based on the extreme left tail of the TTF distribution of the main population is dangerous, since the accuracy of the distribution breaks down in its extreme tails. As an example, consider a state-of-the-art integrated circuit. One failure mechanism is electromigration of the metal lines. Manufacturers will typically perform life tests of the metal line structures to assess their lifetime. These tests are done in a manner similar to the practices detailed in this book. They are accelerated tests performed under a variety of temperature and current density conditions. Failure times are collected and models are developed to predict lifetimes under deployment conditions. A goal of a good manufacturer is to design the metal lines such that the probability of failure is acceptably low when the part is used under specified conditions. While, as stated, the models developed can be used to estimate the reliability under deployment conditions, rarely will the prediction be reasonably close to the observed failure data. The reasons for this are: • • •
The distribution is usually not mono-modal Manufacturing variability is difficult to account for in the model Extreme events, such as defects in the metal lines, will only manifest themselves after very large sample sizes are tested or fielded
A multimode distribution can be used to model this situation, the first mode being applicable to the defects, and the second being applicable to the main population.
7
Two FITs is defined as 2.0 failures per billion hours. This corresponds to 0.002 failures per million hours.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 244
Chapter 6: Interpretation of Reliability Estimates
However, in many cases, it is only the first mode that will impact the field reliability within the useful life of the component. Some researchers have attempted to use extreme value statistics for such cases, but they also have limited usefulness because the data on low failure rate items, like electronic components, is generally not consistent with these distributions. As a result, low failure rate items are usually modeled with a constant failure rate (exponential distribution), or a Weibull distribution. The Weibull is usually used in this case to model the effects of infant mortality.
6.5. Practical Aspects of Reliability Assessments There are very often serious constraints put on practicing reliability engineers. Due to limitations of time, cost, test resources, availability of data, limitation of modeling capabilities, and lack of understanding of failure physics, analysts often need to do the best they can with what they have to work with. This is usually compounded in smalland medium-sized companies, which often lack the resources needed to execute many of the analysis techniques described in this book. Companies engaged in highly competitive industries face extreme time pressures, which is in stark contrast to the tenets of good reliability engineering practices. The goal of the reliability engineer should be to select an optimal approach that achieves the desired purpose of the analysis, while conforming to the practical constrains to which he or she is subjected.
6.6. Weibayes There are many cases in reliability modeling in which there are few or no failures. For these, a Weibayes technique can be used. This approach is practical when there are few or no failures and a reasonable shape parameter can be estimated. This approach essentially fixes a plotting position using: 1. One failure assumed at the end of the test duration 2. A line drawn through the median rank point with an assumed beta The result of this analysis is a lower single-sided bound of the life distribution. As an example, consider the following case: 1. 50 samples are tested for 1000 hours, with no failures 2. Data from other testing indicates a beta of 3 is appropriate Reliability Information Analysis Center 245
Chapter 6: Interpretation of Reliability Estimates
3. The median rank at 1000 hours is 1.39%. A line is drawn through this point with a beta slope of 3. This is shown in Figure 6.6-1.
Figure 6.6-1: Weibayes Example
6.7. Weibull Closure Property In cases where it is desired to estimate the time-to first-failure (TTFF) of a product or system comprised of multiple items, the Weibull closure property can be used. Here, the characteristic life of the Weibull distribution of time to first failure is:
⎡ n 1 α s = ⎢∑ β ⎢⎣ i =1 α i
⎤ ⎥ ⎥⎦
−
1
β
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 246
Chapter 6: Interpretation of Reliability Estimates
where αi and β represent the Weibull distribution parameters for individual items. This is applicable when β is the same, but αi can be different for each item.
6.8. Estimating Event-Related Reliability Many cases arise in estimating reliability where a failure cause of a device under analysis is event-related. For example, if a hand-held device is susceptible to failure when it is dropped, the failure rate (or hazard rate, if a time-varying failure rate distribution is used) is a function of the: • • • •
rate at which drops occur distribution of the drop height relationship between drop height and G-shock level probability of failure as a function of G-level
The failure rate is expressed as:
λ (t ) = λd (t )hd
G Gth h
where: λ(t) = λd(t) = hd = G/h = Gth =
the failure rate of the device due to shock-related failure causes the rate at which the drops occur the drop height distribution the relationship between the G-level and the drop height the failure threshold distribution
Since hd and Gth are random variables described by distributions, λ(t) can generally be estimated with a Monte Carlo analysis, as described earlier in this book. In this case, the conditional probability of failure if the device is dropped is:
hd
G Gth h
This is essentially a stress/strength interference model.
Reliability Information Analysis Center 247
Chapter 6: Interpretation of Reliability Estimates
6.9. Combining Different Types of Assessments at Different Levels Practicing reliability engineers are usually faced with the challenge of making reliability estimates of a product or system based on imperfect, noisy data and information. The engineer must utilize the data that is available, and additional data that is feasible to obtain, and combine this information to estimate the product or system reliability. The 217Plus methodology summarized previously in this book presents one possible approach for using the initial estimated reliability based on the predictions made from empirical models, and combining it with empirical data on the same product or system. This combination is done using Bayesian principals. This is a general approach that can be extended to include the combination of estimates from different methods that are made at different levels. For example, consider the case summarized in Table 6.9-1. It may be possible to characterize specific failure causes with one of the physics-based techniques summarized herein, but it also may be unlikely that all failure causes can be modeled in this manner. Table 6.9-1: Example of Combing Different Types of Models Item Assembly Component A Failure Cause 1 Failure Cause 2 Failure Cause 3 Component B Failure Cause 1 Failure Cause 2 Failure Cause 3
Available Reliability Estimate Life Test Data Life Test Data Physics Model Field Data on Similar Item Physics Model Field Data Life Test Data Field Physics Model
In this example, the objective is to estimate the reliability of the assembly, which is comprised of two components. Component A has physics-based models available for two of the three primary failure causes. An estimate of the failure rate of component A is:
λA −preliminary = λ1 + λ2 + λ3 where λ1, λ2 and λ3 are the failure rates obtained from the model or data available on each failure cause. Of course, these values should represent the failure rate under the use 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 248
Chapter 6: Interpretation of Reliability Estimates
conditions for which the assessment is to be made. In this example, λ is used, which indicates a constant failure rate. However, if the failure rates are time-dependent, the corresponding time-dependent failure rates or hazard rates can be used. Also, the methodology to be illustrated in this example is similar to the data combination methodology described in the 217Plus section, the main difference being that this example deals with the situation in which there are different types of data at different hierarchical levels of the product or system, whereas the 217Plus methodology deals with different types of data within the same configuration item. Now, since Component A has life test data available from tests performed on the component, λA-preliminary is the failure rate estimate before accounting for the life test data on the entire component. This life data will account for any failure causes not included in the three failure causes considered, and it will also provide additional data on the three failure causes considered. A better estimate of reliability can be obtained by combining λA-preliminary with the life test data, using Bayesian techniques. This technique accounts for the quantity of data by weighting large amounts of data more heavily than small amounts. λA-preliminary forms the “prior” distribution, comprised of a0 and ao/λA-preliminary . The empirical data (i.e., test data in this case) is combined with λA-preliminary using the following equation:
a0 +
λA =
n
∑a
i
i =1
a0
λA−preliminary
+
n
∑b ' i
i =1
λA is the best estimate of the Component A failure rate, while ao is the “equivalent” number of failures of the prior distribution corresponding to λA-preliminary. For these calculations, 0.5 should be used unless a tailored value can be derived. An example of this tailoring is provided in the Section 2.6 of this book. The equivalent number of hours associated with λA-preliminary is represented by ao/λA-preliminary. The number of failures experienced in each source of empirical data is a1 through an. There may be “n” different sources of data available (for example, each of the “n” sources corresponds to individual tests or field data from the population of products). The equivalent number of cumulative operating hours experienced for each individual data source is b1’ through bn’. These values must be converted to equivalent hours by accounting for any accelerating effects between the use conditions. The same methodology is applied to Component B, and λB is obtained. Reliability Information Analysis Center 249
Chapter 6: Interpretation of Reliability Estimates
The same methodology is, in turn, applied at the parent level assembly, in which case, the preliminary estimate is:
λAssembly= preliminary = λA + λB and the parent assembly failure rate becomes:
a0 +
λA =
n
∑a
i
i =1
a0
λAssembly- preliminary
+
n
∑b ' i
i =1
where ao is the “equivalent” number of failures of the prior distribution corresponding to λAssembly-preliminary, and the values for ai and bi correspond to the Assembly life test data.
6.10. Estimating the Number of Failures There are many cases in which the desired outcome of a reliability analysis is the expected number of failures. This is appropriate, for example, when calculating spares requirements or warranty returns. The techniques described in this book are useful for estimating either failure rates or probability of failure. If the outcome of the analysis is a failure rate, then the expected number of failures is:
N f = λt where: Nf = λ= t=
the number of expected failures the failure rate the cumulative operating time
This can be seen by reviewing the units in this relationship:
N = λt =
⎛ operating time ⎞ Failures ⎜⎜ ×# parts ⎟⎟ = Failures operating time ⎝ part ⎠ # parts × part
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 250
Chapter 6: Interpretation of Reliability Estimates
This equation is usually used for repairable systems. If the output of the analysis is a life model that describes the distribution of TTFs for a specific set of conditions, the number of failures is:
N f = N [F (t 2 ) − F (t1 )] where: Nf i= the number of expected failures N = the total number of parts in the population F(t1) = the cumulative probability function at time t1 F(t2) = the cumulative probability function at time t2 t1 and t2 are the times between which the failure probability is to be evaluated In this case, since “F” is a (unitless) probability value, the total population is scaled by the probability of failure in the time interval of interest. This is identical to the expected value of the binomial distribution of the number of failures.
6.11. Calculation of Equivalent Failure Rates In many cases, it is advantageous to calculate an “equivalent” failure rate from the results of a reliability model that yield a non-constant failure rate as its output. For example, if a reliability model estimates that a certain percent fail will occur at a given time (based on the non-constant failure rate model), the equivalent constant failure rate can be calculated as follows: The reliability function for a constant failure rate is:
R = e − λt The equivalent failure rate can be obtained by solving the above equation for the failure rate:
λ=
− ln(R ) t
Reliability Information Analysis Center 251
Chapter 6: Interpretation of Reliability Estimates
The resulting failure rate value is equal to a failure rate that will result in the same cumulative percent fail as predicted by the non-constant model at the specific time that the reliability is calculated. If a different time is chosen, a different value will be obtained. This technique can be used when the reliability of some parts of a system is calculated with non-constant failure rate models and others are calculated with a constant failure rate. It can also be used when modeling “one-shot” devices, which will simply have a probability of failure instead of a failure rate.
6.12. Failure Rate Units The output of a reliability model can include a host of potential metrics, including: • • • • • • •
Mean life Median life MTBF Failure rate Time to X% fail B10 life Distribution parameters: o Weibull characteristic life and shape parameter o Lognormal mean and standard deviation
If a constant failure rate distribution is used, there are various units of failure rate possible. Some of these are: • • • •
Failures per hour Failures per million hours Failures per billion hours Percent failure per thousand hours
“Failures per hour” is the fundamental unit. All of these failure rate units can be translated to each other with a constant multiplication factor. For example, “Failures per million hours” times 1000 equals “Failures per billion hours” and “Percent failure per thousand hours” is equivalent to “Failures per ten thousand hours”. In the above cases, the “life unit” shown is in hours (i.e., time), but it does not necessarily need to be. Other possible life units are cycles, miles, missions, operations, etc. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 252
Chapter 6: Interpretation of Reliability Estimates
Additionally, if the life unit in the above listed metrics is time (hours), it can refer to the number of operating hours, calendar hours, flight hours, etc. Reliability prediction methods like MIL-HDBK-217 use operating hours as the life unit, whereas 217Plus uses calendar hours as the life unit. Calculation of the operating failure rate using MILHDBK-217 makes the implicit assumption that the failure rate during non-operating periods is zero, unless the non-operating failure rate is otherwise accounted for. However, in all cases the life unit refers to the cumulative value of the population. For example, if the failure rate unit of “Failures per million hours” is used, the million hours refers to the cumulative time of the entire population, i.e. the sum of each component’s number of hours.
6.13. Factors to be Considered When Developing Models This section discusses a few of the factors that should be considered in the development of a reliability model. It is by no means an exhaustive list, but it is included here to give the reader ideas on the types of factors that should be considered. 6.13.1. Causes of Electronic System Failure
An assumption often made when using traditional reliability prediction methodologies is that the failure rate of a product or system is primarily determined by the components comprising the system. A significant number of failures also stem from non-component causes such as defects in design and manufacturing. Historically, these factors have not been explicitly addressed in prediction methods. The data in Figure 6.13-1 contains the nominal percentage of failures attributable to each of eight identified predominant failure causes based on failure mode data collected by the RIAC on electronic systems.
Reliability Information Analysis Center 253
Chapter 6: Interpretation of Reliability Estimates
Figure 6.13-1: Nominal Failure Cause Distribution of Electronic Systems The definitions of failure causes are: •
Parts (22%): Failures resulting from a part (i.e., microcircuit, transistor, resistor, connector, etc.) failing to perform its intended function. Examples include part failures due to poor quality; manufacturer or lot variability; or any process deficiency that causes a part to fail before its expected wearout limit is reached.
•
Design (9%): Failures resulting from an inadequate design. Examples include tolerance stack-up, unanticipated logic conditions (e.g., sneak paths), a non-robust design for given environmental stresses, etc.
•
Manufacturing (15%): Failures resulting from anomalies in the manufacturing process that are not related to the inherent reliability of a part, i.e., faulty solder joints, inadequate wire routing resulting in chafing, bent connector pins, etc.
•
System Management (4%): Failures traceable to faulty interpretation of system requirements, imposition of “bad” requirements (missing, inadequate, ambiguous or contradictory), or failure to provide the resources (funding and/or personnel) required to design and build a reliable product or system. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 254
Chapter 6: Interpretation of Reliability Estimates
•
Wearout (9%): Failures resulting from wearout-related failure mechanisms due to basic device physics. Examples of electronic components exhibiting wearout-related failure mechanisms are electrolytic capacitors, solder joints, microwave tubes (such as TWTs), and switch and relay contacts.
•
No defect (20%): Perceived failures that cannot be reproduced upon further testing. These may or may not be an actual failure; however they are removals and, therefore, are typically counted toward the logistic failure rate (or MTBF). Examples include the inability of the maintenance environment to recreate the operational environmental stresses under which the original failure occurred, or “looser” tolerances on the test equipment than on the platform or system from which the defective unit was taken.
•
Induced (12%): Failures resulting from an externally applied stress. Examples are electrical overstress and maintenance-induced failures (i.e., dropping, bending pins, etc.).
•
Software (9%): Failures of a system to perform its intended function due to the manifestation of a software fault
While there are reliability assessment methods for specific causes listed above, (i.e., components, software, etc.) there are few methodologies that attempt to take a holistic view of system reliability and integrate them into a single methodology. One example of a methodology that attempts to do this is 217Plus, which is described in Chapter 7. 6.13.2. Selection of Factors
The process of reliability assessment can be viewed as an IPO model, which has input parameters (I), the process or models used to assess the reliability as a function of those input parameters (P), and an output (O). This is illustrated in Figure 6.13-2.
Reliability Information Analysis Center 255
Chapter 6: Interpretation of Reliability Estimates
Input
Process
Output
Initial Conditions
Reliability Metrics
Stresses Figure 6.13-2: IPO Model Examples of the IPO variables, as applied to reliability modeling, are shown in Table 6.13-1. Table 6.13-1: Factors to be Considered in a Reliability Model Defect-Free
Intrinsic Initial Conditions
Defects
Extrinsic
Input
Operational
Stresses Environmental
• • • • • • • • • • • • • • • • • • • • • • • • • • • •
Voids Material Property Variation Geometry Variation Contamination Ionic Contamination Crystal Defects Stress Concentrations Organic Contamination Nonconductive Particles Conductive Particles Contamination Ionic Contamination Thermal Electrical Chemical Optical Chemical Exposure Salt Fog Mechanical Shock UV Exposure Drop Vibration Temperature - High and Low Temperature Cycling Humidity Atmospheric Pressure – Low and High Radiation – EMI, Cosmic Sand and Dust
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 256
Chapter 6: Interpretation of Reliability Estimates
Table 6.13-1: Factors to be Considered in a Reliability Model (continued) Process Output
This is the reliability assessment process using the various techniques described in this book. • Mean Life • Median Life • MTBF • Failure Rate • Time to X% Fail • B10 Life • Distribution Parameters • Weibull Characteristic Life and Shape Parameter • Lognormal Mean and Standard Deviation
Additional information will be discussed in Chapter 8. Given the stochastic nature of reliability prediction for many failure causes, it is impossible to develop a model that is adequately sensitive to all conceivable factors. At least, this is true in all but the simplest of cases. That’s why model developers need to select what are believed to be the most relevant factors, and then model accordingly. This highlights the fact that reliability assessment falls into two distinct categories: the modeling of intrinsic and extrinsic failure causes. Intrinsic failure causes are generally those whose root cause is from a known failure mechanism that affects the entire population of product. These can often be predicted within acceptable bounds by understanding the stresses, the material properties, etc. Extrinsic failure causes are those resulting from unpredictable causes, often a complex sequence of events that ultimately results in the failure cause. Unfortunately, many real world situations fall into this category. It’s unfortunate because these are the ones whose likelihood is most difficult to predict. Generally, components that have very low failure rates are governed by these mechanisms. It is often those unexpected, unpredictable things that happen somewhere upstream in the process, or in the supplier’s process. This is the premise behind the 217Plus system assessment methodology. While it is difficult to predict the likelihood of these “extreme events”, or even identify the failure cause a priori, it is possible to assess controllable factors that have a relationship to the likelihood of experiencing the failure cause. 6.13.3. Reliability Growth of Components
Another issue facing reliability model developers is the manner in which reliability growth is accounted for. A good model reflects state-of-the-art technology. However, empirical models are usually developed from the analysis of field data, which takes time Reliability Information Analysis Center 257
Chapter 6: Interpretation of Reliability Estimates
to collect. The faster the growth, the more difficult it is to derive an accurate (i.e., “current”) model. As an example of this reliability growth effect, Table 6.13-2 contains, for each generic component electronic type, the growth rate that has been observed from data collected by the RIAC. These reliability growth factors are included in the 217Plus component models. The growth rate model used for each component for this purpose is:
λ ∝ e − β (t −t 1
2
)
where: λ= β= t1 = t2 =
the estimated failure rate as a function of year of manufacture the growth rate the year of part manufacture for which a failure rate is estimated the year of manufacture of parts on which the data was collected Table 6.13-2: Failure Rate Data Summary Component Type Capacitor, Ceramic Capacitor, Electrolytic Capacitor, Tantalum Connectors Diode, General Purpose Diode, Schottky Diode, Zener IC, Digital, Nonhermetic IC, Hermetic (All Types) IC, Linear, Nonhermetic IC, Memory/Microprocessor, Nonhermetic Inductors LED Optoelectronic Devices Relays Resistors, All Types Switches Thyristors Transformers Transistor, Bipolar Transistor, FET, N-Channel Transistor, Microwave
Growth Rate (β) 0.0082 0.229 0.229 0.23 0.223 0.297 0.150 0.473 0.33 0.293 0.479 0.0 0.34 0.087 0.0 0.00089 0.0 0.20 0.0 0.281 0.397 0.269
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 258
Chapter 6: Interpretation of Reliability Estimates
6.13.4. Relative vs. Absolute Humidity
There are many failure mechanisms that are accelerated by the combination of temperature and humidity. When modeling failure causes that are a function of humidity, a question arises as to whether the model should be a function of relative humidity or absolute humidity. The appropriate metric to use will depend on whether the failure cause is a function of the absolute amount of water at the surface of the item under analysis. If this is the case, absolute humidity is probably the appropriate measure. The relationship between absolute and relative humidity is illustrated in Figure 6.13-3.
Figure 6.13-3: Relationship Between Absolute and Relative Humidity
6.14. Addressing Data with No Failures In many cases, reliability estimates are made with data containing few or no failures. The analyst must be careful when using this data to estimate reliability. The true failure rate of a component is only available after prolonged operation, but reliability estimates are usually required before this data becomes available. In other words, the analyst needs a leading indicator of reliability, not a lagging indicator. Therefore, before the component has experienced enough operating time to estimate the true failure rate, there may be some data available. This data is often a certain number of operating hours with no Reliability Information Analysis Center 259
Chapter 6: Interpretation of Reliability Estimates
observed failures. A common way of utilizing this data is to estimate a single-sided confidence level of the failure rate, based on the observed number of operating hours. As an example, consider a situation in which a component’s true failure rate is 0.1 failures per million operating hours. Figure 6.14-1 illustrates the 60% and 90% upper bound estimates as a function of the number of operating hours. For example, if there are 1 million observed operating hours, then the upper bound of the failure rate, at a 60% confidence level, is 0.916 F/10e6. I n other words, there is 60% confidence that the true failure rate is less than 0.916 F/10e6 hours. Only after there have been a total of 6 to 8 million operating hours is the 60% upper bound a reasonable estimate.
Figure 6.14-1: Estimated Upper Bound failure Rates vs Operating Time at 60 and 90% Confidence Using a single-sided failure rate bound for reliability estimates can be dangerous, because they can be very pessimistic. Exactly how pessimistic is determined by the number of operating hours relative to the true failure rate. Moreover, if the upper bound is used on multiple components in an assembly, then the pessimism in the assembly failure rate estimate is compounded. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 260
Chapter 6: Interpretation of Reliability Estimates
The Bayesian techniques described previously are a way to address the issue of few or no failures. This is, in fact, the premise of the 217Plus methodology. This approach, while it requires a prior estimate, can alleviate the pessimistic nature of reliability estimates made only from an observed number of hours with no failures. Another related approach is to pool “like” data together for the purpose of estimating a failure rate. For example, if a component has no failures, but there is also data available on other components within the “family” of components, the data can be combined. An example of this approach is described in the section on NPRD (Section 7.4). In that case, the pooling occurs as a function of part type, quality and environment. The algorithm used in that case was similar to a Bayesian approach, but was tailored to the specific constraints of the data.
6.15. Reliability of Components Used Outside of Their Rating A significant issue with the application of commercial microcircuits used in severe environments is the temperature rating of the part. The associated temperature range over which a manufacturer will guarantee performance is limited to that of a commercial part, i.e. typically 0 to 70 degrees C. Military and aerospace applications often require guaranteed performance over wider temperature ranges, i.e. -55 to 125 degrees C. While this is not a reliability prediction issue per se, it does confound the definition of failure criteria. For example, although a part may not perform beyond it rated temperature, it usually does not catastrophically fail and, therefore, is not considered a reliability failure. However, many practitioners do consider this a reliability issue and, as such, turn to reliability models for the quantification of the microcircuit reliability in their specific extended range application. There are no reliability models currently available that can quantify the reliability of parts when used beyond their rating. All existing models make the implicit assumption that parts are used within their rating. A separate, but critical, requirement for the reliable application of components is the qualification of parts and manufacturers to insure that specific parts will function reliably in the intended application. The application of a component beyond its rated value of stress can result in one or more undesired effects. First, there can be reliability ramifications, which can manifest themselves in a variety of ways: either as a sudden, catastrophic failure or as a latent failure. The detectability of the first is much better, since it can be observed with product or system testing. Latent failures are much more difficult to detect, and require more testing and modeling using the techniques described in this book. The second type of undesired effect is related to component performance. Performance characteristics can either be permanently degraded or they may be subject to a “reversible” process in which Reliability Information Analysis Center 261
Chapter 6: Interpretation of Reliability Estimates
the performance recovers after the overstress condition is taken away. In any event, these possible undesired effects should be studied and understood before applying components beyond their rated stress values.
6.16. References 1. http://www.mortality.org 2. Farachi, V., “Electronic Component Failure Rate Prediction Analysis,” RIAC Journal, Nov., 2006.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 262
Chapter 7: Examples
7.
Examples
This chapter presents several examples of reliability models that are intended to provide a cross section of several different methodologies. The focus of the examples is to present methodologies that the author has personally developed, and ,thus, can provide insight into the logic and rationale for their development. Several examples were previously presented in Chapter 2, but not in detail. This section presents more detail regarding model factors, development methods, etc. The following examples are provided: 1. MIL-HDBK-217 Model Development Methodology – The generic modeling methodology for many of the models contained in MIL-HDBK-217 is presented in this section. Not all of the models in the handbook have been developed using this methodology, but the majority have been. This is presented so that the reader can gain an understanding of the approach and methodology used, and to provide insight into the decisions faced by the model developer. 2. 217Plus Reliability Models – 217Plus is the methodology developed by the RIAC to fill the void left after MIL-HDBK-217 was no longer scheduled to be updated. The approach taken in the development of this methodology was quite different than the methodology for MIL-HDBK-217. It was intended to be a holistic approach in which all primary causes of electronic system failure were accounted for. Therefore, factors addressing non-component reliability were considered. It was also intended to be holistic in terms of its ability to leverage experience from predecessor systems, and utilize information from empirical testing. The general approach for this methodology was previously presented in Chapter 2 in the “Combining Data” section. The additional information presented in this section presents the details on the remaining portions of the methodology. Additionally, the development of models for several different components is presented. First is the development of the original twelve electronic part types. For these models, sufficient field reliability data was available. The second component models presented are for photonic component types. For these, very little field data was available, and, therefore, the original 217Plus approach needed to be tailored. 3. Life Model Example – The intent of the life modeling example that will be presented is to illustrate an application of the life modeling methodologies previously discussed. This is a hypothetical example, but provides information pertaining to the various elements of life modeling. Reliability Information Analysis Center 263
Chapter 7: Examples
4. NPRD – This section, covering the RIAC “Nonelectronic Parts Reliability Data (NPRD)” publication, is presented to illustrate the nuances of field reliability data, the manner in which data is merged, and the manner in which it is used in reliability modeling. Some of this information was previously presented in Chapter 2 in the section on the use of field data, but more detail will be presented here. This will hopefully provide the user with an appreciation for both the uses and limitations of this type of data. The examples presented in this section were selected to provide a cross-section of various methodologies, including prediction, assessment and estimation. It is presented to complement the information previously provided in Chapter 2.
7.1. MIL-HDBK-217 Model Development Methodology MIL-HDBK-217 is probably the most widely used of the empirically-based reliability prediction methodologies. The basic premise of the handbook is the use of historical piece-part test and field failure rate data as the basis for predicting future product or system reliability. The handbook includes failure rate models for most electronic part types, and many electromechanical part types. The latest version of MIL-HDBK-217 is “F, Notice 2”, dated 28 February 19958. The handbook was almost a casualty of Perry’s DoD Acquisition Reform initiative, but it survived primarily on the wide use of, and dependency on, the methodology throughout the military-industrial complex and the lack of a suitable replacement. The models that are currently contained in MIL-HDBK-217 have been developed by various organizations, which use various techniques for their development. However, Reference 1 will be used to illustrate a typical model development methodology. The study documented in this report developed the models for discrete semiconductor devices. Excerpts from this report are summarized within this section. The model development methodology is shown in Figure 7.1-1. Each of the elements in this methodology is further examined below.
8
As noted previously, as of the publication date of this book, a Draft of MIL-HDBK-217G is currently in the works, with an anticipated release in 2010.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 264
Chapter 7: Examples
Figure 7.1-1: MIL-HDBK-217 Model Development Methodology Reliability Information Analysis Center 265
Chapter 7: Examples
7.1.1. Identify Possible Variables
The first step in the modeling methodology is to identify possible model factors. In this example, the possible factors were: • • • • • • • • • • • • • •
Device Style Power Rating Package Type Semiconductor Material Structure (NPN, PNP) Electrical Stress Circuit Application Quality Level Duty Cycle Operating Frequency Junction Temperature Application Environment Complexity Power Cycling
7.1.2. Develop Theoretical Model
A series of theoretical failure rate prediction models is hypothesized to provide the resultant models with a sound theoretical/engineering backing. Basically, theoretical model development involves evaluation of the effects of the parameters identified in the previous phase. In addition, the optimal model form (i.e., additive, multiplicative, or a combination) is determined and the time dependency of the discrete semiconductor failure rates is studied. The development of the theoretical device failure rate prediction models is an integral part of the overall model development process. Information collected through literature searches and discrete semiconductor user and vendor surveys is reviewed and evaluated to aid in the development of theoretical models for each discrete semiconductor device type group. The theoretical models serve the following functions: 1. Assure that the prediction models conform to physical and chemical principles 2. Select variables when not possible to determine sing purely statistical techniques In general terms, the theoretical models were of the following form. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 266
Chapter 7: Examples n
λ = λbπ T π E π Q ∏π i i =1
where: λ= theoretical failure rate prediction λb = base failure rate, dependent on device style πT = temperature factor (based on the Arrhenius relationship) πE = environment factor πQ = quality factor based upon device screening level and hermeticity Product of πi = the product of Pi factors based upon variables from the potential list of input variables found to have a significant effect on the discrete semiconductor failure rate. 7.1.3. Collect and QC Data
The collection of empirical reliability data is integral to the approach used in model development. Four specific data collection tasks were defined. The first task was a system/equipment identification process. A survey of numerous military equipments was conducted to identify system/equipments meeting predetermined criteria established to ensure plentiful and accurate data. The second task was an extensive survey of discrete semiconductor manufacturers and users. The third task was in-person visits to organizations where data could not be accessed by other means. The final data collection task was the compilation of data referenced in the literature and documented technical studies. Also, as part of this task, additional contact was made between the authors and/or study sponsors to determine whether more data was available. The results of the four specific data collection tasks are described in the following sections. Five minimum criteria were established to define an acceptable data source. Each potential equipment selection was evaluated with these criteria before proceeding with data summarization. These five criteria were: Reliability Information Analysis Center 267
Chapter 7: Examples
1. 2. 3. 4. 5.
Data available to the part level Primary failures could be separated from total maintenance actions Sufficient detail, including stress levels, could be identified for the components Part hours could be precisely determined Sufficient equipment hours existed to expect discrete semiconductor failures
In addition to these criteria, the following factors were considered: 1. 2. 3. 4.
Number of different discrete semiconductor part types Existence of low-population and state-of-the-art parts Application environment Age of data
Data summarization consisted of the extraction and compilation of the desired data elements from the source reports and/or supporting documentation, and coding the data for computer entry. Data summarization consisted of the following five tasks for sources of field data: 1. 2. 3. 4. 5.
Identification of discrete semiconductor part types within the chosen equipment Determination of part characterization information Identification of relevant part failures Determination of applicable electrical and environmental stress levels Determination of equipment operating histories
The data collected for this effort is summarized on the next page, in Table 7.1-1. Included are, for each part type, the number of observed failures and operating hours. In addition to this data, other information was captured, such as quality level, environment, etc. 7.1.4. Correlation Coefficient Analysis
Using the multiple linear regression technique makes the implicit assumption that the variables under analysis are independent, and not correlated. In practice, however, factors are often highly correlated, thus making it difficult to deconvolve the effects that each factor has.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 268
Chapter 7: Examples
Table 7.1-1: Data Collected for Model Development Part Class Switching Diode Rectifier Diode Voltage Regulator Diode Voltage Reference Diode Current Regulator Diode Transient Suppressor Diode PNP Transistor, 5W Dual Transistor Darlington Transistor JFET MOSFET Unijunction Device Thyristor Schottky Microwave Diode Tunnel Diode Varactor PIN Diode Microwave Power Transistor LED Infrared Emitting Diode (IRED) Alphanumeric Display (Segment) Alphanumeric Display (Display) Photodetector Opto-isolator
Failures
Part Hours (Millions)
86 471 228 282 2 7 2330 246 52 89 1 57 878 209 19 245 18 72 30 1857 2612 22 0 144 4 7 170
916.91 7745.48 1154.84 2951.22 13.54 6.58 24706.61 1845.35 75.10 112.24 7.05 76.58 5177.81 431.77 68.23 1013.18 129.39 234.45 173.2 13413.37 1138.70 4827.08 39.1 636689.67 646.09 47.0 595.96
An example of this is the correlation between quality and environment. This correlation exists because higher quality parts are often used in the more severe environments. As such, the analyst’s options are to: 1. Keep the factors as derived, with the caveat that they may be in error 2. Treat the factors as a combined, “pooled” factor representing the correlated variables 3. Use alternate approaches to quantifying the effects of either or all correlated variables Reliability Information Analysis Center 269
Chapter 7: Examples
7.1.5. Stepwise Multiple Regression Analysis
This step in the analysis consists of the following: 1. Each factor is linearized in accordance with the desired acceleration model 2. The regression is performed and coefficients are estimated For example, consider the following model in which the factors to be included are the base failure rate, a temperature factor and a stress factor:
λ = λbπ T π s or:
λ = λbe
− Ea KT
Sn
Taking the log of both sides yields: − Ea
ln λ = ln λb + ln e KT + ln S n ln λ = ln λb +
− Ea + n ln S KT
or:
λ=a
ln λb +
− Ea + n ln S KT
The transforms are shown in Table 7.1-2. Table 7.1-2: Data Transforms Variable Observed failure rate Temperature Stress
Transform ln λ -1/T ln S
When the regression is performed, the intercept is “ln λ b”, and the temperature factor and stress coefficients are “–Ea/K” and “n”, respectively. In MS Excel, the LINEST function 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 270
Chapter 7: Examples
is used to determine the model coefficients. These are the values used in the original equation:
λ = λbe
− Ea KT
Sn
If categorical variables are to be modeled, they can be modeled with regression analysis by assigning a “1” or a “0” to the variable, and performing the regression as described above. As an example, consider the case in which the product or system to be modeled has temperature, stress, environment and quality as the four variables affecting the reliability. This is shown in Table 7.1-3. Table 7.1-3: Regression Data Including Categorical Variables Variable
Environment
Independent variable (e.g., λ)
Temperature
Stress
ln(λ1) ln(λ2) ln(λ3) ln(λ4) ln(λ5)
1/T1 1/T2 1/T3 1/T4 1/T5
lnS1 lnS2 lnS3 lnS4 lnS5
GB 0 1 0 0 1
AI 1 0 0 1 0
GM 0 0 1 0 0
Quality Commercial 1 0 1 0 1
Industrial 0 0 0 1 0
Military 0 1 0 0 0
The equation above, expanded with the inclusion of the categorical variables, becomes:
λ =e
ln λb +
− Ea + n ln S + a1GB + A2 AI + A3GM + A4Comm.+ A5 Ind .+ A6 Mil KT
where Ai are the coefficients of the categorical variables determined from the regression analysis. 7.1.6. Goodness-of-Fit Analysis
There are several ways to analyze how good the model fits the data. The standard error provides an indication of the significance of the specific factor under analysis. The standard error is the standard deviation of the coefficient estimate. Therefore, if the standard error is small relative to the coefficient estimate, this is an indication that the factor is statistically significant. Likewise the opposite is also true.
Reliability Information Analysis Center 271
Chapter 7: Examples
Residual plots are also useful in assessing how good the model is as a predictor of reliability. The smaller the residuals, the better the model is. Another useful plot, similar to a residual plot, is obtained when plotting the log10 of the observed-to-predicted ratio. If this metric is relatively tightly clustered and centered around zero, this is an indication of a good model. 7.1.7. Extreme Case Analysis
One of the potential problems in using a multiplicative model form is that extreme value problems can arise. For example, when all input factors are simultaneously at their high or low values, the resultant predicted failure rate can be unrealistically high or low. This situation can be addressed with the use of different model forms, such as in the case of the RIAC 217Plus models, in which a combination additive and multiplicative model form is used. 7.1.8. Model Validation
The last step in the process is to validate the model. This is accomplished by ensuring that the resulting models fit the observed data to a reasonable degree. Additionally, the models can be checked against observed data not used in the model development. Valuable data for this purpose is data at levels above the component level. In many cases, high quality data can be obtained on systems or assemblies, but not at the part level. This occurs due to the level at which maintenance is performed and data is captured. Therefore, while the data cannot be used for model development, it can be used for model validation. Another thing that must be accounted for in the model validation effort is the scaling of base failure rates to account for data in which there were no observed failures. The methodology presented in this section is based on the premise that there exists a point estimate of the dependent variable, in this case the failure rate. In cases where there are no failures, a point estimate is not possible, i.e., only a lower single-sided confidence bound is possible. The use of this confidence bound value cannot be used to represent the data since the resultant model will be pessimistic (i.e., the failure rate will be artificially increased). Only using the data points for which there are failures is also not appropriate because it also will artificially bias the model pessimistically. Potential solutions to this situation include: •
Scaling the base failure rates to reflect the zero failure data. One possible alternative to accomplish this is to scale the base failure rates with the boundary 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 272
Chapter 7: Examples
•
condition that the predicted number of failures in the entire dataset equals the observed number. Use of maximum likelihood (MLE) parameter estimation techniques. These MLE techniques are especially suited to censored data such as zero failures.
7.2. 217Plus Reliability Prediction Models 7.2.1. Background
In 1994, Military Specifications and Standards Reform (MSSR) decreed the adoption of performance-based specifications as a means of acquiring and modifying weapons systems. This led to the cancellation of many military specifications and standards. This, coupled with the fact that the Air Force had re-directed the mission of Rome Laboratory (now called the Air Force Research Laboratory (the preparing activity for MIL-HDBK217)) away from reliability, resulted in MIL-HDBK-217 becoming obsolete, with no government plans to update it. The RIAC believed that there was a need for a reliability assessment technique that could be used to estimate the reliability of systems in the field. A viable assessment methodology needed: 1. Updated component reliability prediction models, since MIL-HDBK-217 was not to be updated 2. A methodology for quantifying the effect that non-component variables have on system reliability 3. To be useable by reliability engineers with data that is typically available during the system development process The RIAC is chartered with the collection, analysis and dissemination of reliability data and information. To this end, it publishes quantitative reliability data such as failure rate and failure mode/mechanism compendiums, as well as failure rate models. It is not required to provide these services, but does so because there is a need for this data in the reliability engineering community. It will continue to engage in such activities as long as there appears to be this need by reliability practitioners. For this reason, the 217Plus models and methodology were developed. There are two primary elements to 217Plus, component reliability prediction models and system-level models. A system failure rate estimate is first made by using the component models to estimate the failure rate of each component. These failure rates are then summed to estimate the system failure rate. This is the traditional methodology used in many reliability predictions, and represents the reliability prediction, i.e., a reliability estimate that is made before empirical data or detailed assessments are available. This Reliability Information Analysis Center 273
Chapter 7: Examples
prediction is then modified in accordance with system level factors, which account for non-component, or system level, effects. This is an example of a reliability “assessment”, in which the process and design factors are assessed. Finally, the prediction and assessment are combined with empirical data to form the reliability “estimate” of the product, which is the best estimate of reliability based on all analysis and data available to the analyst. The goal of component reliability models is to estimate the “rate of occurrence of failure”, or ROCOF, and accelerants of a component’s primary failure mechanisms within an acceptable degree of accuracy. Toward this end, the models should be adequately sensitive to operating scenarios and stresses, so that they allow the user the ability to perform tradeoff analysis amongst these variables. For example, the basic premise of the 217Plus models is that they have predicted failure rates for operating periods, non-operating periods and cycling. As a result, the user can perform tradeoff analysis amongst duty cycle, cycling rate, and other variables. As an example, a question that frequently arises is whether a system will have a higher failure rate if it is continuously powered on, or whether it is powered off during periods of non-use. The models in 217Plus are structured to facilitate the tradeoff analysis required to answer this question. A flow diagram of the entire approach was presented in Chapter 2, which guides the user in the application of the component models and the system level models. The basis for the 217Plus methodology is the component reliability models, which estimate a system’s reliability by summing the predicted failure rates of the constituent components in the system. This estimate of the system reliability is further modified by the application of “System-Level” factors, called Process Grade Factors (PGF). Development of the component models is presented in Sections 7.2.3 through 7.2.5. The primary intent of this section is to detail the development of the 217Plus methodology. It is provided to familiarize the reader with the issues faced by model developers in order to allow a better understanding of 217Plus and similar models. It provides details related to certain aspects of model development. 7.2.2. System Reliability Prediction Model 7.2.2.1. 217Plus Background
The premise of traditional methods of reliability predictions, such as MIL-HDBK-217, is that the failure rate of a product or system is primarily determined by the components comprising it. Historically, a significant number of failures also stem from noncomponent causes such as design deficiencies, manufacturing defects, inadequate 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 274
Chapter 7: Examples
requirements, induced failures, etc., that have not been explicitly addressed in prediction methods. The data in Figure 7.2-1, presented previously, contains the nominal percentage of failures attributable to each of eight identified predominant failure causes based on data collected by the RIAC. The data in this figure represents nominal percentages. The actual percentages can vary significantly around these nominal values. Softw are 9%
Parts 22%
No Defect 20%
Manufacturing 15%
Induced 12% Wearout System 9% Management 4%
Design 9%
Figure 7.2-1: Failure Cause Distribution of Electronic Systems The definitions of failure causes, as presented in an earlier chapter, are: •
Parts (22%): Failures resulting from a part (i.e., microcircuit, transistor, resistor, connector, etc.) failing to perform its intended function. Examples include part failures due to poor quality; manufacturer or lot variability; or any process deficiency that causes a part to fail before its expected wearout limit is reached.
Reliability Information Analysis Center 275
Chapter 7: Examples
•
Design (9%): Failures resulting from an inadequate design. Examples include tolerance stack-up, unanticipated logic conditions (e.g., sneak paths), a non-robust design for given environmental stresses, etc.
•
Manufacturing (15%): Failures resulting from anomalies in the manufacturing process that are not related to the inherent reliability of a part, i.e., faulty solder joints, inadequate wire routing resulting in chafing, bent connector pins, etc.
•
System Management (4%): Failures traceable to faulty interpretation of system requirements, imposition of “bad” requirements (missing, inadequate, ambiguous or contradictory), or failure to provide the resources (funding and/or personnel) required to design and build a reliable product or system.
•
Wearout (9%): Failures resulting from wearout-related failure mechanisms due to basic device physics. Examples of electronic components exhibiting wearout-related failure mechanisms are electrolytic capacitors, solder joints, microwave tubes (such as TWTs), and switch and relay contacts.
•
No defect (20%): Perceived failures that cannot be reproduced upon further testing. These may or may not be an actual failure; however they are removals and, therefore, are typically counted toward the logistic failure rate (or MTBF). Examples include the inability of the maintenance environment to recreate the operational environmental stresses under which the original failure occurred, or “looser” tolerances on the test equipment than on the platform or system from which the defective unit was taken.
•
Induced (12%): Failures resulting from an externally applied stress. Examples are electrical overstress and maintenance-induced failures (i.e., dropping, bending pins, etc.).
•
Software (9%): Failures of a system to perform its intended function due to the manifestation of a software fault
Another example that this author has experience with is shown in Figure 7.2-2, which represents the distribution observed for Erbium Doped Fiber Amplifiers (EDFAs) used in long haul telecommunications systems. The distribution is different than the above chart, which is a pooled result from various system types and manufacturers. This example is 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 276
Chapter 7: Examples
provided to illustrate the notion that the system type and manufacturing practices will dictate the specific distribution obtained.
8% No Fault Found 21% Manufacturing Defect
1% - Component Mechanical 7% - Component Electrical
63% - Component - Pumps and Other Optical Components
Figure 7.2-2: Optical Amplifier Failure Cause Distribution 7.2.2.2. Methodology Overview
The 217Plus methodology is structured to allow the user the ability to estimate the reliability of a product or system in the initial design stages when little is known about it. For example, a reliability prediction early in the development phase of a system can be made based on a generic parts list, using default values for operational profiles and stresses. As additional information becomes available, the model allows the incremental addition of empirical test and field data to supplement the initial prediction. The purpose of 217Plus is to provide an engineering tool to assess the reliability of electronic systems. It is not intended to be the "standard" prediction methodology, and it can be misused if applied carelessly, just as any empirical or physics-based model can. Also, it is a tool to allow the user the ability to estimate the failure rate of parts, assemblies and systems. It does not consider the effect of redundancy or perform FMEAs. The intent of 217Plus is to provide the data necessary as an input to these analyses. The methodology allows for the modification of a base reliability estimate with Process Grading Factors for the failure causes listed in Section 7.2.2.1. Reliability Information Analysis Center 277
Chapter 7: Examples
These process grades correspond to the degree to which actions have been taken to mitigate the occurrence of product or system failure due to these failure categories. Once the base estimate is modified with the process grades, the reliability estimate is further modified by empirical data taken throughout item development and testing. This modification is accomplished using Bayesian techniques that apply the appropriate weights for the different data elements. Advantages of the 217Plus methodology are that it uses all available information to form the best estimate of field reliability, it is tailorable, it has quantifiable confidence bounds, and it has sensitivity to the predominant product or system reliability drivers. The methodology represents a holistic approach to predicting, assessing and estimating product or system reliability by accounting for all primary factors that influence the inability of an item to perform its intended function. It factors in all available reliability data as it becomes available on the program. It, thus, integrates test and analysis data, which provides a better prediction foundation and a means for estimating variances from different reliability measures. 7.2.2.3. System Reliability Model
The fundamental 217Plus failure rate model for a system is as follows:
λ P = λ IA (Π P + Π D + Π M + ΠS + Π I + Π N + Π W ) + λ SW The sum of the Pi-factors in the parenthesis represents the cumulative multiplier that accounts for all of the processes used in system development and sustainment. The sum of these values is normalized to unity for processes that are considered to be the mean of industry practices. The individual model factors are: λP
=
λIA
=
ΠP ΠD ΠM ΠS ΠI ΠN
= = = = = =
Predicted failure rate of the product or system (in failures per million calendar hours) Initial assessment of the failure rate based on component failure rate estimates Parts process multiplier Design process multiplier Manufacturing process multiplier System management process multiplier Induced process multiplier No-defect process multiplier
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 278
Chapter 7: Examples
ΠW λSW
= =
Wearout process multiplier Software failure rate prediction
Additional factors included in the model account for the effects of infant mortality, environment, and reliability growth. Since each of these factors does not influence all of the factors in the above equation, they are applied selectively to the applicable factors. For example, environmental stresses will generally accelerate part defects and manufacturing defects to failure. These additional factors are normalized to unity under average conditions, so that the value inside the parenthesis is one under nominal conditions and for nominal processes.
λ P = λ IA (Π P Π IM Π E + Π D Π G + Π M Π IM Π E Π G + ΠS Π G + Π I + Π N + Π W ) + λ SW where, ΠIM ΠE ΠG
= = =
Infant mortality factor Environmental factor Reliability growth factor
The initial assessment of the failure rate, λIA, is the seed failure rate value, which is obtained by using the 217Plus component reliability prediction models, along with other available data. This failure rate is then modified by the Pi-factors that account for specific processes used in the design and manufacture of the product or system, along with the environment, reliability growth and infant mortality characteristics of the item. The above failure rate expression represents the total failure rate of the system, which includes "induced" and "no defect found" failure causes. If the inherent failure rate is desired, then the "induced" and "no defect found" Pi-factors should be set to zero, since they represent operational and non-inherent failure causes. 7.2.2.4. Initial Failure Rate Estimate
An initial estimate of a system failure rate is based on a combination of the component failure rate models, the empirical field failure rate data contained in the RIAC databases, or user-defined failure rates from other sources that are entered directly by the user. This initial failure rate is then used as a seed value that represents a typical failure rate for the product or system. It is then adjusted in accordance with the PGFs, infant mortality characteristics, reliability growth characteristics, and environmental stresses. In addition, software is modeled as a separate failure rate. Reliability Information Analysis Center 279
Chapter 7: Examples
All variables in the model default to average values, not worst-case values. As a result, the user has the option of applying any or all factors, depending on the level of knowledge of the product or system and the amount of time or resources available for the assessment. If a traditional reliability prediction is desired, the user can perform it using the component models and the RIAC database failure rates contained in 217Plus9. As additional data and information becomes available, the analysis can be expanded to include these system-level factors. 7.2.2.5. Process Grading Factors
An objective of the 217Plus system model is to explicitly account for the factors contributing to the variability in traditional reliability prediction approaches. This is accomplished by grading the process for each of the failure cause categories. The resulting grade for each cause corresponds to the level to which an organization has taken the action necessary to mitigate the occurrence of failures of that cause. This grading is accomplished by assessing the processes in a self-audit fashion. Any or all failure causes can be assessed and graded. If the user chooses not to address a specific failure cause, the model simply reverts to the default "average" value. If the user chooses to apply the PGF methodology for any failure cause, there are a minimum number of questions that should be assessed and graded. Beyond this minimum, the user can selectively assess and grade additional criteria. If answers to the grading questions are not known, the model simply ignores those criteria. Process grading is used to quantify the following factors: • • • • • • •
ΠP (parts process multiplier) ΠD (design process multiplier) ΠM (manufacturing process multiplier) ΠS (system management process multiplier) ΠI (induced process multiplier) ΠN (no-defect process multiplier) ΠW (wearout process multiplier)
The sum of the Π factors within the parentheses in the failure rate model is equal to unity for the average grade. Each factor will increase if "less than average" processes are in used and decrease if “better than average” processes are in used.
9
The RIAC 217Plus software contains databases that hold the RIAC’s NPRD and EPRD failure rate data, converted to failures per million calendar hours. The RIAC “Handbook of 217Plus Reliability Prediction Models” does not contain this supplementary data. The RIAC NPRD and ERPD databooks are available for separate purchase from the RIAC, and are in units of failures per million operating hours.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 280
Chapter 7: Examples
Features of this PGF methodology are that it: • • • • •
Explicitly recognizes and accounts for special (assignable) cause problems Models reliability from the user (or total system-level) perspective Promotes cross-organizational commitment to Reliability, Availability and Maintainability (RAM) Quantitatively grades developers' efforts to affect improved reliability Maintains continuing organizational focus on RAM throughout the development cycle
Reference 2 presents the results of the study in which the process grades were determined. 7.2.2.6. Basis Data for the Model 7.2.2.7. Uncertainty in Traditional Approach Estimates
A goal of 217Plus is to model predominant system reliability drivers. The premise of traditional methods such as MIL-HDBK-217 is that the failure rate is primarily determined by the technology and application stress of the components comprising the product or system. This was a good premise many years ago, when components exhibited higher failure rates and systems were not as complex as they are today. Increased item complexity and component quality have resulted in a shift of system failure causes away from components to more system-level factors, including system requirements, interface problems and software problems. A significant number of failures also stem from non-component causes such as defects in design and manufacturing. Historically, these factors have not been explicitly addressed in prediction methods. The approach used to develop the 217Plus model was to (1) quantify the uncertainty in predictions using "component-based" traditional approaches and (2) explicitly model the factors contributing to that uncertainty. Data was collected by the RIAC on systems for which both predicted and observed MTBF data was available. This was done for the purpose of quantifying the uncertainty in traditional component-based predictions. Table 7.2-1 presents the multipliers of a failure rate point estimate as a function of confidence level that was derived from analysis of this data. For example, using traditional approaches, one could be 90% certain that the true failure rate was less than 7.575 times the predicted value.
Reliability Information Analysis Center 281
Chapter 7: Examples
Table 7.2-1: Uncertainty Level Multiplier Percentile 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90
Multiplier 0.132 0.265 0.437 0.670 1.000 1.492 2.290 3.780 7.575
7.2.2.8. System Failure Causes
The premise of the 217Plus model developed in the RIAC study was that the failure rate attributable to the predominant system-level failure causes could be quantified. In addition to the intrinsic variability associated with the failure rate prediction, there is additional variability associated with the variance in the distribution of failure causes. This requires that there be baseline data that quantifies the failure rate of each cause. The data in Table 7.2-2 was used for this purpose. This table contains, for each source of data, the percentage of failures attributable to each of the eight identified predominant failure causes. It should be noted here that the reported percentages of failure due to some failure causes might be underestimated. For example, system management and software may be under-reported because failures are usually not attributed to those categories, even when they are the root cause of failure. This also means that the percentages from the other causes may be overestimated. Although the authors recognize that this is likely, the values in the model reflect the reported values. However, if a user of the model has failure cause distribution information from which the model factors can be tailored, this data should be used instead of the nominal values.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 282
Chapter 7: Examples
Table 7.2-2: Percentage of Failures Attributable to Each Failure Cause Survey Respondent 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Part Defect 5 34 13 9 46 46 19 28 42 64 24 15 32 13 19 61 38 30
Mfg. Defect 38 28 5 31 10 25 39 28 42 0 28 13 1 10 3 5 15 19
Design 0 0 5 38 19 2 10 28 16 0 0 4 5 10 5 5 17 10
System Wearout No Induced Software Mgt. Defect 0 0 42 8 8 0 39 0 0 0 0 3 30 43 0 0 6 0 16 0 0 12 0 14 0 0 12 0 14 0 0 10 0 22 0 0 0 0 17 0 0 0 0 0 0 0 17 0 20 0 0 6 34 8 0 12 6 17 32 1 11 27 16 7 0 1 13 0 34 20 0 5 40 7 20 1 15 10 3 0 0 12 0 18 0 1 11 11 15 3
An analysis was then performed on the Table 7.2-2 data to quantify the distributions of percentages for each failure cause. This was accomplished by performing a Weibull analysis of each column. The resulting distributions are summarized in Table 7.2-3. Table 7.2-3: Weibull Parameters for Failure Cause Percentages Failure Cause Parts Manufacturing Design System Management Wearout Induced No Defect Software
Characteristic Percentage 33.9 23.2 13.9 7.1 14.7 19.8 31.9 15.0
Weibull Shape Parameter (beta) 1.62 0.96 1.29 0.64 1.68 1.58 1.92 0.70
Reliability Information Analysis Center 283
Chapter 7: Examples
Table 7.2-4 summarizes the failure rate multiplier values for each of the eight failure causes as a function of the grade for each of the eight. The generic formula for the multiplier is given as:
Πi = −α × (ln Ri )1/ β In this calculation, the characteristic percentages listed in Table 7.2-3 are scaled by a factor of 1.11 to ensure that the sum of the multipliers is equal to one when each grade is equal to 0.50. In this case, a grade of 0.50 represents an "average" process, and since the model is normalized to an average process, the total multiplier of the initial assessment failure rate is equal to one under these conditions.
Parts
Manufacturing
Design
System Management
Wearout
Induced
No Defect
Table 7.2-4: Multipliers as a Function of Process Grade
0.725 0.655 0.612 0.581 0.556 0.535 0.516 0.500 0.486 0.472 0.460 0.449 0.438 0.428 0.419 0.410 0.402 0.394 0.386 0.379 0.372
0.948 0.800 0.714 0.653 0.606 0.567 0.535 0.507 0.482 0.461 0.441 0.423 0.406 0.391 0.376 0.363 0.351 0.339 0.328 0.317 0.307
0.378 0.333 0.306 0.286 0.271 0.258 0.247 0.237 0.229 0.221 0.214 0.207 0.201 0.195 0.190 0.185 0.180 0.176 0.171 0.167 0.163
0.643 0.498 0.420 0.367 0.328 0.298 0.273 0.251 0.233 0.218 0.204 0.191 0.180 0.170 0.161 0.152 0.145 0.137 0.131 0.124 0.119
0.304 0.276 0.258 0.245 0.235 0.227 0.219 0.212 0.207 0.201 0.196 0.191 0.187 0.183 0.179 0.176 0.172 0.169 0.166 0.162 0.160
0.433 0.391 0.365 0.346 0.330 0.317 0.306 0.296 0.288 0.279 0.272 0.265 0.259 0.253 0.247 0.242 0.237 0.232 0.227 0.223 0.219
0.588 0.540 0.511 0.488 0.470 0.455 0.442 0.430 0.420 0.410 0.401 0.393 0.385 0.378 0.371 0.364 0.358 0.352 0.346 0.340 0.335
Cumulative Percentage (Grade) 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 284
Parts
Manufacturing
Design
System Management
Wearout
Induced
No Defect
Chapter 7: Examples
0.365 0.358 0.352 0.345 0.339 0.333 0.328 0.322 0.317 0.311 0.306 0.301 0.296 0.291 0.286 0.281 0.277 0.272 0.267 0.263 0.259 0.254 0.250 0.246 0.241 0.237 0.233 0.229 0.225 0.221 0.217 0.213 0.209 0.205 0.202 0.198
0.298 0.288 0.280 0.271 0.263 0.256 0.248 0.241 0.234 0.228 0.221 0.215 0.209 0.203 0.198 0.192 0.187 0.181 0.176 0.171 0.167 0.162 0.157 0.153 0.148 0.144 0.140 0.136 0.132 0.128 0.124 0.120 0.117 0.113 0.109 0.106
0.160 0.156 0.152 0.149 0.146 0.143 0.140 0.137 0.134 0.131 0.128 0.125 0.123 0.120 0.118 0.115 0.113 0.110 0.108 0.106 0.104 0.101 0.099 0.097 0.095 0.093 0.091 0.089 0.087 0.085 0.083 0.081 0.080 0.078 0.076 0.074
0.113 0.108 0.103 0.098 0.094 0.090 0.086 0.083 0.079 0.076 0.072 0.069 0.067 0.064 0.061 0.059 0.056 0.054 0.052 0.049 0.047 0.045 0.043 0.042 0.040 0.038 0.036 0.035 0.033 0.032 0.030 0.029 0.028 0.026 0.025 0.024
0.157 0.154 0.151 0.149 0.146 0.144 0.141 0.139 0.137 0.134 0.132 0.130 0.128 0.126 0.124 0.122 0.120 0.118 0.116 0.114 0.112 0.111 0.109 0.107 0.105 0.104 0.102 0.100 0.098 0.097 0.095 0.093 0.092 0.090 0.088 0.087
0.214 0.210 0.206 0.203 0.199 0.196 0.192 0.189 0.185 0.182 0.179 0.176 0.173 0.170 0.167 0.164 0.161 0.159 0.156 0.153 0.151 0.148 0.146 0.143 0.140 0.138 0.136 0.133 0.131 0.128 0.126 0.124 0.121 0.119 0.117 0.114
0.330 0.325 0.320 0.315 0.310 0.306 0.301 0.297 0.293 0.288 0.284 0.280 0.276 0.272 0.269 0.265 0.261 0.257 0.254 0.250 0.247 0.243 0.240 0.236 0.233 0.229 0.226 0.223 0.219 0.216 0.213 0.210 0.206 0.203 0.200 0.197
Cumulative Percentage (Grade) 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57
Reliability Information Analysis Center 285
Parts
Manufacturing
Design
System Management
Wearout
Induced
No Defect
Chapter 7: Examples
0.194 0.190 0.186 0.183 0.179 0.175 0.172 0.168 0.164 0.160 0.157 0.153 0.149 0.146 0.142 0.138 0.135 0.131 0.127 0.123 0.119 0.116 0.112 0.108 0.104 0.100 0.096 0.092 0.088 0.084 0.079 0.075 0.070 0.066 0.061 0.056
0.103 0.099 0.096 0.093 0.090 0.086 0.083 0.080 0.077 0.074 0.072 0.069 0.066 0.063 0.061 0.058 0.055 0.053 0.050 0.048 0.045 0.043 0.040 0.038 0.036 0.034 0.031 0.029 0.027 0.025 0.023 0.021 0.019 0.017 0.015 0.013
0.072 0.071 0.069 0.067 0.065 0.064 0.062 0.060 0.059 0.057 0.055 0.054 0.052 0.050 0.049 0.047 0.046 0.044 0.042 0.041 0.039 0.038 0.036 0.035 0.033 0.031 0.030 0.028 0.027 0.025 0.023 0.022 0.020 0.019 0.017 0.015
0.023 0.022 0.021 0.020 0.019 0.018 0.017 0.016 0.015 0.014 0.013 0.013 0.012 0.011 0.010 0.010 0.009 0.008 0.008 0.007 0.007 0.006 0.006 0.005 0.005 0.004 0.004 0.003 0.003 0.003 0.002 0.002 0.002 0.001 0.001 0.001
0.085 0.084 0.082 0.080 0.079 0.077 0.076 0.074 0.073 0.071 0.069 0.068 0.066 0.065 0.063 0.062 0.060 0.058 0.057 0.055 0.053 0.052 0.050 0.048 0.047 0.045 0.043 0.042 0.040 0.038 0.036 0.034 0.032 0.030 0.028 0.026
0.112 0.110 0.108 0.106 0.103 0.101 0.099 0.097 0.095 0.092 0.090 0.088 0.086 0.084 0.081 0.079 0.077 0.075 0.073 0.071 0.068 0.066 0.064 0.062 0.059 0.057 0.055 0.052 0.050 0.047 0.045 0.042 0.040 0.037 0.034 0.031
0.194 0.190 0.187 0.184 0.181 0.178 0.174 0.171 0.168 0.165 0.162 0.158 0.155 0.152 0.149 0.145 0.142 0.139 0.135 0.132 0.129 0.125 0.122 0.118 0.114 0.111 0.107 0.103 0.099 0.095 0.091 0.087 0.082 0.078 0.073 0.068
Cumulative Percentage (Grade) 0.58 0.59 0.60 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.90 0.91 0.92 0.93
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 286
Parts
Manufacturing
Design
System Management
Wearout
Induced
No Defect
Chapter 7: Examples
0.051 0.045 0.039 0.033 0.025 0.016
0.011 0.009 0.007 0.005 0.003 0.002
0.013 0.012 0.010 0.008 0.006 0.003
0.001 0.001 0.000 0.000 0.000 0.000
0.023 0.021 0.018 0.015 0.012 0.008
0.028 0.025 0.022 0.018 0.014 0.009
0.062 0.057 0.050 0.043 0.035 0.024
Cumulative Percentage (Grade) 0.94 0.95 0.96 0.97 0.98 0.99 7.2.2.9. Environmental Factor
MIL-HDBK-344 (Reference 6) defines the stress screening strength (SS) to be “the probability that a specific screen will precipitate a latent defect to failure and detect it by test, given that a latent defect susceptible to the screen is present. It is the product of the precipitation efficiency (PE) and detection efficiency (DE).” It is equivalent to the percentage of defects that are removed from the prescreened population:
SS =
Dremoved Din
where:
Dremoved = D in − Dremaining The failure rate is, therefore: λ=
D field (t ) t
where: t = the period, in hours, over which the MTBF is to be measured Dfield = the number of field failures due to latent defects occurring during the interval “t”. Since SS is the percentage of defects removed from the population, it follows that: Reliability Information Analysis Center 287
Chapter 7: Examples
Dfield = Dremaining* SS field The SSfield is the effective screening strength of the stresses that the product or system will encounter in the field, and SSESS is the screening strength that the system is exposed to during environmental stress screening (ESS). It also follows that Dfield is equal to the cumulative (integral of) field failure rate:
D field = ∫ λ (t ) D field = ∫ λ postscreened (t )
∫
D field = SS * λ prescreened (t )
λ postsceened = SS * λ prescreened This indicates that, in addition to estimating the effect that ESS has on system reliability, the screening strength calculated from field stresses (SSfield) can be effectively used as a failure rate multiplier that accounts for the environmental stresses: SS field (t ) =
1 − e − kt t
where, SSfield(t) = k =
equivalent screening strength of the field environment field precipitation rate
The total screening strength, SStotal , after accounting for both the temperature cycling and vibration-related portions, is:
SStotal = PTC * SS(TC) + PRV * SS(RV) where: the percentage of failures resulting from temperature cycling stresses PTC = the percentage of failures resulting from random vibration stresses PRV = SS(TC) = the screening strength applicable to temperature cycling 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 288
Chapter 7: Examples
SS(RV) = the screening strength applicable to random vibration. Algorithms for calculating screening strength are given in a subsequent section. If the actual values of PTC and PRV are unknown, the default values that should be used are:
PTC = 0.80 PRV = 0.20 Since the component failure rates described above are relative to a ground benign environment, the failure rate multiplier is the ratio of the SS value in the use environment to the SS value in a ground benign environment:
ΠE =
PTC * SS (TCuse ) + PRV * SS ( RVuse ) PTC * SS (TCGb ) + PRV * SS ( RVG ) b
where: PTC = percentage of failures resulting from temperature cycling stresses PRV = percentage of failures resulting from random vibration stresses SS = screening strength applicable to the application environmental values As previously indicated, the SS value is the screening strength and has been derived from MIL-HDBK-344. It is an estimate of the probability of both precipitating a defect to failure and detecting it once it is precipitated by the test.
SS TC = 1 − e (− kTC t )
SSRV =1 − e(−k RV t ) k TC = 0.0017 ( ΔT + .6) .6 [ln (RATE + 2.718) ]
3
where: ΔT = Tmax − Tmin
(in degrees C)
RATE = degrees C/minute t = # of cycles k RV = 0.0046 G 1.71 Reliability Information Analysis Center 289
Chapter 7: Examples
The parameter “G” is the magnitude of vibration stress, in units of Grms. Whenever possible, the actual values of delta T (ΔT) and vibration (Grms) should be used for the use application environment when calculating SS values. If the actual values are not known, then the default values of ΔT (summarized in the component model descriptions later) can be used. A discussion of the values of “k” follows. For RV screens it is necessary to include an axis sensitivity factor. The RV applied in the axis perpendicular to the plane of the board will have the greatest effect. When selecting and modeling RV stress, the precipitation efficiency is, thus, given by:
[1− exp (-kt)]* (Axis Sensitivity Factor) where the “axis sensitivity factor” is the defect density in the sensitive axis divided by the total defect density. Transmissibility and resonance effects must be considered, and the frequency spectrum may need to be suitably notched to avert overstress or wearout effects. Similarly, thermal mass and conductivities must be considered when determining temperature cycle (TC) transition rates and required dwell times. The stress levels for all of these equations pertain to the product or system being screened and not the test chamber conditions. It should also be noted that the expressions and tables for precipitation efficiency are only approximate and, as in the estimation of initial defects, should be refined based upon actual user data according to the techniques of Procedure D of MIL-HDBK-344. Under the average temperature cycling and random vibrations conditions that represent the data used in development of the models, the denominator is 0.205. This value is a normalization constant such that the environment factor is equal to 1.0 when a product or system is subjected to the average stress levels. The values assumed for the rate and duration are 2 degrees C per minute and 10 hours, respectively. Therefore, the environment factor is: 1.71 ( ( − 0.065 (ΔT + 0.6 )0.6 ) ⎞ − 0.046 G ) ⎞ ⎞ ⎛ ⎛ ⎛ 0.855 × ⎜ 0.8⎜1 − e ⎟ + 0.2⎜1 − e ⎟⎟ ⎝ ⎠ ⎝ ⎠⎠ ⎝ ∏E = 0.205
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 290
Chapter 7: Examples 7.2.2.10.
Reliability Growth
The 217Plus model includes a factor for assessing the reliability growth characteristics of a product or system10. The premise of this factor is that the processes that contribute to system reliability growth in the field may or may not exist. The degree to which growth exists is estimated by a grading factor that assesses the processes contributing to growth. The growth factor calculation is given by the formula: ΠG =
1.12(t + 2) −α 2 −α
The denominator in the above expression is necessary to ensure that the value of the factor is 1.12 at the time of field deployment, regardless of the growth rate (α). Figure 7.2-3 illustrates the growth Pi-factor multiplier for various values of growth rates as a function of time. 1.2
1
Pi (Growth)
0.8
0 0.2 0.5
0.6
0.7 1 0.4
0.2
1.9
1.8
1.7
1.6
1.5
1.4
1.3
1.2
1.1
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0 Time (years)
Figure 7.2-3: ΠG vs. Time and Growth Rates The value of “α” is estimated by determining the degree to which the potential for growth exists. This estimation is accomplished in a manner similar to the process grading 10 The system reliability growth factor is different from, and in addition to, the reliability growth factors used in the 217Plus component models to reflect component technology improvements from their respective baseline years.
Reliability Information Analysis Center 291
Chapter 7: Examples
methodology by assessing and grading the processes that can contribute to reliability growth. 7.2.2.11.
Infant Mortality
Infant mortality is accounted for in the model with a time-variant factor that is a function of the level to which ESS has been applied. The infant mortality correction factor, ΠIM, is calculated as:
t - 0.62 Π IM = (1 - SSESS ) 1.77 where: t = SSESS =
time in years the screening strength of the screen(s) applied, if any.
The value of SS can be determined by using the stress screening strength equations as presented in Section 7.2.2.9. The above expression represents the instantaneous failure rate. If the average failure rate for a given time period is desired, this expression must be integrated and divided by the time period. 7.2.2.12.
Combining Predicted Failure Rate with Empirical Data
The user of this model is encouraged to collect as much empirical data as possible and use it in the 217Plus reliability assessment. This was summarized in Section 2.6, and is done by mathematically combining the initial assessment made (based on the initial assessment and the process grades) with empirical data. This step combines the best "pre-build" failure rate estimate obtained from the initial assessment (plus the influence of the PGFs) with the metrics obtained from the empirical data. Bayesian techniques are used for this purpose. This technique accounts for the quantity of data by weighting large amounts of data more heavily than small amounts. The failure rate estimate obtained above forms the "prior" distribution, comprised of a0 and b0. 7.2.3. Development of Component Reliability Models 7.2.3.1. Model Form
Traditional methods of reliability prediction model development, as discussed earlier in the section on MIL-HDBK-217, have included the statistical analysis of empirical failure rate data. Statistical methods have included ANOVA, multiple linear regression, sensitivity analysis, etc. When using multiple linear regression techniques with highly 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 292
Chapter 7: Examples
variable data (which is often the case with empirical failure rate data), a requirement of the model form is that it be multiplicative (i.e., the predicted failure rate is the product of a base failure rate and several factors that account for the stresses and component variables that influence reliability). An example of a multiplicative model is as follows:
λ p = λbπ eπ qπ s where: λp = λb = Πe = Πq = Πs =
predicted failure rate base failure rate environmental factor quality factor stress factor
However, a primary disadvantage of the multiplicative model form is that the predicted failure rate value can become unrealistically large or small under extreme value conditions (i.e., when all factors are at their lowest or highest values). This is an inherent limitation of multiplicative models, primarily due to the fact that individual failure mechanisms, or classes of failure mechanisms, are not explicitly accounted for. A better approach is an additive model which predicts a separate failure rate for each generic class of failure mechanisms. Each of these failure rate terms are then accelerated by the appropriate stress or component characteristic. This model form is as follows;
λ p = λ oπ o + λ eπ e + λ cπ c + λ i + λ sj π sj where: λp = λo = πo = λe = πe= λc = πc = λi = λsj = πsj =
predicted failure rate failure rate from operational stresses product of failure rate multipliers for operational stresses failure rate from environmental stresses product of failure rate multipliers for environmental stresses failure rate from power or temperature cycling stresses product of failure rate multipliers for cycling stresses failure rate from induced stresses, including electrical overstress and ESD failure rate from solder joints product of failure rate multipliers for solder joint stresses Reliability Information Analysis Center 293
Chapter 7: Examples
By modeling the failure rate in this manner, factors that account for the application and component specific variables that affect reliability (π factors) can be applied to the appropriate additive failure rate term. Additional advantages to this approach are that they: •
• • • •
Address operating, non-operating and cycling-related failure rates in an additive model which are weighted in accordance with the operational profile (duty cycle and cycling rate). The Pi-factors modify only the applicable failure rate term, thereby eliminating many of the extreme value problems that plague multiplicative models Are based on observed failure mode distributions so that observed component failure causes are empirically modeled Are based on quantitative stresses (and not on qualitative environmental categories), but default to average stress conditions as a function of environment Are industry-independent and predict the average failure rates of best commercial practices Can be tailored with test data, if available, by applying the test data to appropriate additive term via the Bayesian method
7.2.3.2. Acceleration Factors
Acceleration factors (also called Pi-factors) are used in the 217Plus models to estimate the effect on failure rate of various stress and component variables. Since the traditional technique of multiple linear regression was not used in the derivation of the failure rate models, the Pi-factors were derived by utilizing either industry accepted values, values determined separately from data available to the RIAC, or values from previous modeling efforts. For example, the models typically include both an operating and non-operating temperature factor based on the Arrhenius relationship, which require an activation energy for operating and non-operating conditions. To estimate these values for the models, previous modeling studies (along with existing prediction methodologies) were used. Similarly, some factors were based on test data. For example, the exponent used in the delta T Pi-factor for the 217Plus integrated circuit model is based on fallout rate data from temperature cycling tests that were performed at various levels of delta T. 7.2.3.3. Time Basis of Models
Traditional reliability prediction models have been based on the operating time of the part, and the units were typically failures per million (or billion) operating hours (F/106H). The RIAC 217Plus models (and the empirical data contained in the RIAC databases included with the RIAC 217Plus software) predict the failure rate in units of 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 294
Chapter 7: Examples
failures per million calendar hours (F/106CH). This is necessary (and appropriate) because it is the common basis for all failure rate contribution terms used in the model (operating, non-operating, cycling, and induced). If an equivalent operating failure rate is desired (in units of failures per million operating hours), the failure rate (in F/106CH) can be divided by the duty cycle to yields a failure rate in F/106operating hours. 7.2.3.4. Failure Mode to Failure Cause Mapping
There are two primary types of data on which the RIAC 217Plus component models are based: failure rate and failure mode. The model development process required that the failure rate data be apportioned into four failure cause categories. Since the failure mode data contained in the RIAC databases was typically not defined by these categories, it was necessary to transform the RIAC failure mode data into a failure cause distribution. This was accomplished by assessing the stresses that accelerate the specific class of failure categories, and estimating the percentage of failures that could be attributed to those stresses. The primary stresses that potentially accelerate operational failure modes are operating temperature, vibration, current and voltage. The stresses that accelerate environmental failure causes are non-operating (i.e., dormant) ambient temperature, corrosive stresses (contaminants/heat/humidity), ageing stresses (time), and humidity. As an example, Table 7.2-5 summarizes this process for a resistor. Each of the six failure modes included in the analysis are listed across the top of the table, i.e. EOS, contamination, etc., along with their associated observed relative percentage of occurrence. This data was collected by the RIAC and was based primarily on the root cause failure analysis results of parts that had failed in the field. Table 7.2-5: Example of Failure Mode-to-Failure Cause Category Mapping Failure Category
Operational Stresses
Accelerating Stresses/ Causes
Failure Mode Contamination Cracked Chip Leakage out 41.20% 23.50% 17.60% 7.10% 5.90% EOS
Operating Temperature Vibration
p
Voltage Ambient (Dormant) Temp. Corrosion
p p
p
%
Total %
0.00
0.05
4.70% s
Current Environmental Stresses
TNI
s
0.04
s
0.00
s
0.00
s
0.08
s
0.09
Ageing
s
p
s
0.05
Humidity
p
p
s
0.09
Power Cycling
Power Cycling
Induced/EOS
Induced/EOS
p
p
p
Reliability Information Analysis Center 295
s
0.31
s
0.22
0.22
s
0.42
0.42
Chapter 7: Examples 7.2.3.5. Derivation of Base Failure Rates
Once the Pi-factors were defined for each component type that was modeled, and once the failure rate was apportioned amongst the failure causes, the base failure rate could be determined. This was accomplished by (1) gathering all failure rate data, (2) estimating the model input variables (temperatures, stresses, etc.) for each source of data, (3) calculating the associated Pi-factor for each failure rate, and (4) deriving a base failure rate for each of the failure cause categories. For example, the failure rate associated with operational stresses is equated to the product of the base failure rate and the operational Pi-factors: PFC * λobs = λbπ o where: PFC = λobs = λb = πo =
percentage of failure rate attributable to operational failure causes observed failure rate base failure rate to be derived product of model Pi-factors
Solving for λb, and adding a factor to account for data points which have had no observed failures, yields:
λb =
PFC * λobs
πo
* PF
The PF parameter is the percentage of total observed calendar hours associated with components that have had observed failures. This factor is necessary to pro-rate the base failure rate which was calculated from those data records containing failures. Once this value of λb was calculated for each data record, the geometric mean was used as the best estimate of the base failure rate. 7.2.3.6. Combining the Predicted Failure Rate with Empirical Data
The user of the 217Plus model is encouraged to collect as much empirical data as possible and use it in the assessment. This is done by mathematically combining the prediction made (based on the initial assessment and the process grades) with empirical data, resulting in a reliability estimate. This step will combine the best “pre-build” failure rate estimate obtained from the initial assessment (with process grading) with the metrics obtained from the empirical data. Bayesian techniques are used for this purpose. This technique accounts for the quantity of data by weighting large amounts of data more 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 296
Chapter 7: Examples
heavily than small quantities. The failure rate estimate obtained above forms the “prior” distribution, comprised of a0 and b0. If empirical data (i.e., test or field data) is available on the system under analysis, it can be combined with the best pre-build failure rate estimate using the following equation:
λ=
ao + a1 + ....an bo + b1 + ....bn
where:
λ = ao =
the best estimate of the predicted failure rate the equivalent number of failures of the prior distribution corresponding to the reliability prediction (after process grading has been accounted for). The default value is:
a0 = 0.5 bo =
the equivalent number of hours associated with the reliability prediction (after process grading) After a0 is calculated, the value of b0 can be calculated by: b0 =
a0
λp
a1 through an = the number of failures experienced in the empirical data. There may be “n” different types of data available b1 through bn = the equivalent number of cumulative operating hours (in millions) experienced in the empirical data. These values must be converted to equivalent hours by accounting for the accelerating effects between the test and use conditions. If test data is available that was taken at accelerated conditions, it needs to be converted to the conditions of interest. A traditional reliability prediction can be performed at both the test and use conditions, and the equivalent number of hours (bi) can be accelerated by the failure rate ratio between the test and use temperatures, as follows:
Reliability Information Analysis Center 297
Chapter 7: Examples
H Eq =
λT 1 * HT λT 2
where: H Eq =
λ T1 =
λT2 =
HT =
the equivalent number of test hours the predicted failure rate at the test conditions, obtained by performing a reliability prediction of the product or system at the test conditions the predicted failure rate at the use conditions, obtained by performing a reliability prediction of the product or system at the use conditions the actual number of test hours
The benefits of including empirical data in the failure rate estimate are that it: • •
Integrates all reliability data that is available at the point in time when the estimate is performed (analogous to the statistical process called “meta-analysis”) Provides flexibility for the user to customize the reliability model with actual historical experience data
7.2.3.7. Estimating Confidence Levels
The 217Plus methodology also estimates confidence levels around the failure rate. Before empirical data is available on a system, the levels are assessed based on a distribution that was derived by analyzing data on a variety of systems for which both reliability predictions and field data were available. After test or field data becomes available and failures are accrued, traditional Chi-square techniques can be used to estimate the uncertainty in the reliability prediction. 7.2.3.8. Using the 217Plus Model in a Top-Down Analysis
If empirical data exists on a predecessor system, the equation that translates the failure rate from the old system to the new system is as follows:
λ predicted = λ predecessor *
λ predicted , new λ predicted , predecessor
The (predicted, new)/(predicted, predecessor) failure rate ratio accounts for the differences in application environment, complexity, stresses, date, etc. The predicted failure rates for the predecessor and the new system are determined using the complete 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 298
Chapter 7: Examples
detailed 217Plus methodology previously described. The observed predecessor failure rate is used as the baseline against which the new system failure rate is estimated. 7.2.3.9. Capacitor Model Example
This section presents an example of the 217Plus component model for capacitors. The failure rate equation for capacitors is:
λP = π Gπ C (λOBπ DCOπ TOπ S + λEBπ DCN π TE + λTCBπ CRπ DT ) + λSJBπ SJDT + λEOS λP = πG =
predicted failure rate, failures per million calendar hours reliability growth failure rate multiplier:
π G = e (− β (Y −1993 )) β= πC =
growth constant. A function of capacitor type (see Table 7.2-6) capacitance failure rate multiplier:
⎛C⎞ π C = ⎜⎜ ⎟⎟ ⎝ C1 ⎠
CE
C= capacitance, in microfarads C1 = constant. A function of capacitor type (see Table 7.2-6) CE = constant. A function of capacitor type (see Table 7.2-6) λOB = base failure rate, operating πDCO = failure rate multiplier for duty cycle, operating:
π DCO =
DC DC1op
πTO = Failure rate multiplier for temperature, operating:
π TO = e
⎛ − Eaop ⎛ 1 1 ⎞ ⎞⎟ ⎜ ⎜ ⎟ ⎜ .00008617⎜ T + 273 − 298 ⎟ ⎟ ⎝ AO ⎠⎠ ⎝
Reliability Information Analysis Center 299
Chapter 7: Examples
Eaop = activation energy, operating. A function of capacitor type (see Table 7.26) πS = failure rate multiplier for electrical stress:
⎛S ⎞ π S = ⎜⎜ A ⎟⎟ ⎝ S1 ⎠
n
SA = stress ratio, the applied voltage stress divided by the rated voltage S1 = constant. A function of capacitor type (see Table 7.2-6) n= constant. A function of capacitor type (see Table 7.2-6) λEB = base failure rate, environmental (see Table 7.2-6) πDCN = failure rate multiplier, duty cycle – nonoperating: π DCN =
1 − DC DC 1nonop
πTE = failure rate multiplier, temperature-environment :
π TE = e
⎛ − Ea nonop ⎛ 1 1 ⎞ ⎞⎟ ⎜ ⎟ ⎜ ⎜ .00008617 ⎜ T + 273 − 298 ⎟ ⎟ ⎠⎠ ⎝ AE ⎝
Eanonop = activation energy, nonoperating. A function of capacitor type (see Table 7.2-6) λTCB = base failure rate, temperature cycling (see Table 7.2-6) πCR = failure rate multiplier, cycling rate:
π CR =
CR CR1
πDT = Failure rate multiplier, delta temperature:
π DT
⎛T −T ⎞ = ⎜⎜ AO AE ⎟⎟ ⎝ DT1 ⎠
2
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 300
Chapter 7: Examples
λSJB = base failure rate, solder joint (see Table 7.2-6) πSJDT = failure rate multiplier, solder joint delta temperature:
π SJDT
⎛ T − TAE ⎞ = ⎜ AO ⎟ 44 ⎠ ⎝
2.26
λEOS = failure rate, electrical overstress (see Table 7.2-6)
DC1op
TRdefault
DC1nonop
Eanonop
CR1 DT1 n C1 S1 CE
Eaop
Table 7.2-6: Capacitor Parameters
Aluminum
0.000465
0.00022
0.000214
0.000768
.00095
0.229
0.17
0.5
0
0.83
0.4
1140.35
21
5
7.6
0.6 0.23
Ceramic
0.001292
0.000645
0.000096
0.00014
.00095
0.0082
0.17
0.3
0
0.83
0.3
1140.35
21
3
0.1
0.6 0.09
General
0.000634
0.000351
0.000083
0.000259
.00095
0.033
0.17
0.3
0
0.83
0.3
1140.35
21
7
0.1
0.6 0.09
Mica/Glass
0.000826
0.000997
0.000888
0.000764
.00095
0.0082
0.17
0.4
0
0.83
0.4
1140.35
21
10
0.1
0.6 0.09
Paper
0.000663
0.000075
0.000882
0.000042
.00095
0.0082
0.17
0.2
0
0.83
0.2
1140.35
21
5
0.1
0.6 0.09
Plastic
0.000994
0.001462
0.001657
0.002531
.00095
0.0082
0.17
0.2
0
0.83
0.2
1140.35
21
6
0.1
0.6 0.09
Tantalum
0.000175
0.000049
0.000032
0.000816
.00095
0.229
0.17
0.2
0
0.83
0.2
1140.35
21
17
7.6
0.6 0.23
Tantalum
0.000175
0.000049
0.000032
0.000816
.00095
0.229
0.17
0.2
0
0.83
0.2
1140.35
21
17
7.6
0.6 0.23
Variable, Air
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0
0.83
0.3
1140.35
21
6
0.35 0.5 0.09
Variable, Ceramic
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0
0.83
0.1
1140.35
21
3
0.35 0.5 0.09
Variable, FEP
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0
0.83
0.2
1140.35
21
6
0.35 0.5 0.09
Variable, General
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0
0.83
0.2
1140.35
21
6
0.35 0.5 0.09
Variable, Glass
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0
0.83
0.2
1140.35
21
3
0.35 0.5 0.09
Variable, Mica
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0
0.83
0.2
1140.35
21
10 0.35 0.5 0.09
Variable, Plastic
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0
0.83
0.2
1140.35
21
6
Part Type
7.2.3.10.
λOB
λEB
λTCB
λIND
λSJB
β
0.35 0.5 0.09
Default Values
The default values for the environmental and operating profile factors are summarized in Tables 7.2-7and 7.2-8.
Reliability Information Analysis Center 301
Chapter 7: Examples
Table 7.2-7: Default Environmental Stress Values Environment Airborne Airborne, Fixed Wing Airborne, Fixed Wing, Inhabited Airborne, Fixed Wing, Uninhabited Airborne, Missile Airborne, Missile, Flight Airborne, Missile, Launch Airborne, Rotary Wing Airborne, Rotary Wing, Inhabited Airborne, Rotary Wing, Uninhabited Airborne, Space Ground Ground, Man Pack Ground, Mobile Ground, Mobile, Heavy Wheeled Ground, Mobile, Heavy Wheeled, Chassis Mounted Ground, Mobile, Heavy Wheeled, Engine Compartment Ground, Mobile, Heavy Wheeled, Engine Mounted Ground, Mobile, Heavy Wheeled, Instrument Panel Closed Ground, Mobile, Heavy Wheeled, Instrument Panel Open Ground, Mobile, Heavy Wheeled, Trunk Ground, Mobile, Light Wheeled Ground, Mobile, Light Wheeled, Chassis Mounted Ground, Mobile, Light Wheeled, Engine Compartment Ground, Mobile, Light Wheeled, Engine Mounted Ground, Mobile, Light Wheeled, Instrument Panel Closed Ground, Mobile, Light Wheeled, Instrument Panel Open Ground, Mobile, Light Wheeled, Trunk Ground, Mobile, Tracked Ground, Stationary Ground, Stationary, Indoors Ground, Stationary, Outdoors Naval Naval, Shipboard Naval, Shipboard, Sheltered Naval, Shipboard, Unsheltered Naval, Submarine
TAO TAE Humidity 55 14 40 55 14 40 55 14 40 71 14 50 55 14 40 55 14 40 55 14 40 55 14 40 55 14 40 71 14 50 55 14 40 35 17 40 55 14 40 55 14 40 55 14 40 55 14 40 55 14 40 55 14 40 55 14 40 55 14 40 55 14 40 55 14 40 34 14 40 40 14 40 58 14 40 31 14 40 24 14 40 17 14 40 55 14 40 35 19 40 30 23 40 40 14 50 55 14 80 55 14 80 40 20 70 60 14 90 55 23 50
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 302
Vibration (GRMS) 9 9 9 9 10 1.3 16 3.3 3.3 3.3 0 0 1 10 10 10 10 10 10 10 10 4 4 4 4 4 4 4 2 0 0 0 0.7 0.7 0.7 0.7 1
Chapter 7: Examples
Table 7.2-8: Default Operating Profile Values Equipment type Automotive Commercial Aircraft Computer Consumer Emergency Power Industrial Military Aircraft Military Ground Naval Telecommunications
Operating profile DC CR (C/yr) 5 1000 25 2982 80 1491 30 368 10 50 80 184 25 1008 45 263 80 50 80 368
7.2.4. Photonic Model Development Example 7.2.4.1. Introduction 7.2.4.1.1.
Component Reliability Models Form
This section summarizes the manner in which photonic device models were derived (Reference 3). It is included to demonstrate the development of models when little field data is available. The photonic component model form is:
λP = π Q (λOBπ DCOπ TOπ V + λEBπ DCN π TEπ RH + λTCBπ CRπ DT + λind ) where: λp = predicted failure rate πQ = multiplier for photonic device quality λOB = base failure rate from operational stresses πDCO = failure rate multiplier for duty cycle:
π DCO =
DC DC1op
Reliability Information Analysis Center 303
Chapter 7: Examples
πTO = factor for operating temperature:
π TO = e πV =
⎛ − Eaop ⎛ 1 1 ⎞ ⎞⎟ ⎜ ⎜ ⎟ ⎜ .00008617 ⎜ T +T + 273 − 298 ⎟ ⎟ ⎝ AO R ⎠⎠ ⎝
vibration factor: ⎛ V +1⎞ π V = ⎜⎜ a ⎟⎟ ⎝ Vc ⎠
nvib
λEB = base failure rate from environmental stresses πDCN = failure rate multiplier for nonoperating duty cycle:
π DCN =
1 − DC 1 − DC1op
πTE = nonoperating temperature factor:
π TE = e
⎛ − Eanonop ⎛ 1 1 ⎞ ⎞⎟ ⎜ ⎟ ⎜ ⎜ .00008617 ⎜ T + 273 − 298 ⎟ ⎟ ⎠⎠ ⎝ AE ⎝
πRH = humidity factor:
π RH
⎛ RH a + 1 ⎞ ⎟⎟ = ⎜⎜ ⎝ RH c ⎠
n RH
λTCB = base failure rate from power or temperature cycling stresses πcr =
cycling rate factor:
π CR =
CR CR1
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 304
Chapter 7: Examples
πDT = delta Temperature factor:
π DT
⎛ T + T − TAE ⎞ = ⎜ AO R ⎟ 14 ⎝ ⎠
n PC
λi = failure rate from induced stresses The model parameters are defined as follows: λP = πQ = = λOB πDCO = DC = DC1op = πTO = Eaop = = TAO TR = πV = = VA VC = = nvib = λEB πDCN = = πTE Eanonop = = TAE πRH = RHa = RHc = = nRH λTCB = πCR = CR = CR1 =
predicted failure rate, failures per million calendar hours failure rate multiplier for quality base failure rate, operating failure rate multiplier for duty cycle, operating duty cycle (fraction of calendar time in operation) 0.25 failure rate multiplier, temperature – operating activation energy - operating ambient operating temperature temperature rise above TAO failure rate multiplier, vibration level max vibration level applied (Grms) 1.0 vibration exponent base failure rate, environment failure rate multiplier, duty cycle – nonoperating failure rate multiplier, Temperature – environment activation energy, nonoperating ambient environmental temperature failure rate multiplier, relative humidity relative Humidity (%) 50% relative humidity exponent base failure rate, temperature cycling failure rate multiplier, cycling rate cycling rate (cycles per year) 1000 Reliability Information Analysis Center 305
Chapter 7: Examples
πDT nPC 7.2.4.1.2.
= =
failure rate multiplier, delta temperature temperature cycling exponent
Model Development Methodology
The modeling methodology that was used in the photonics device modeling study is summarized in Figure 7.2-4. This methodology is similar to the 217Plus model development methodology, but was tailored for the specific needs of photonic components. Each element of this methodology is explained in the following sections.
Collect reliability data and populate spreadsheet Collect failure mode data
Map observed failure modes into the failure cause categories
Identify the base percentage of failure rate attributable to each cause
Estimate stresses to which the parts were exposed Estimate acceleration model constants
Calculate a normalization stress accelerating stress
Estimate acceleration factors (Pi factors) for each part from each data source
Calculate the based failure rates for each cause such that observed = predicted failure rates
Figure 7.2-4: Model Development Methodology Flowchart 7.2.4.2. Model development methodology and results
This section details the model development methodology and also presents the results of each task in this methodology. Each task in Figure 7.2-4 is described in the following sections. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 306
Chapter 7: Examples 7.2.4.2.1.
Collect Failure Mode Data
There are two primary types of data upon which the component models are based, failure rate and failure mode. The model development process required that the failure rate data be apportioned into the following four defined failure cause categories: • • • •
Failures from operational stresses Failures from environmental stresses Failures from power or temperature cycling stresses Failures from induced stresses
Since failure mode data is typically not classified according to these categories, it is necessary to transform the failure mode distribution data into the failure cause distribution. This failure mode distribution data was obtained from several sources: • • •
Data collected during the photonic device study Data obtained from the literature Analysis similar to a Failure Mode and Effects Analysis (FMEA), in which failure causes are hypothesized.
An example of this is summarized in Table 7.2-9, in which the failure causes for a connector are hypothesized (2nd column), and then an occurrence rating is given for each cause. This rating is in the 3rd column, and is scored as a 1, 3 or 9. This weighting scheme is often used in FMEA analysis. The result is a fractional value for each failure cause that is proportional to the weighting. The sum of all of these values for each component type equals 1.0. The methodology used in the photonics device models to derive the fraction of occurrence differs from the methodology presented previously for the 217Plus components, in that failure mode distributions were not available during the photonics model development effort. For the 217Plus models, the components were more mature and therefore, there was considerable history of both failure mode and failure rate data to draw upon.
Reliability Information Analysis Center 307
Chapter 7: Examples
Table 7.2-9: Failure Cause Summary for Connectors Component Type
Connector (SC and FC)
7.2.4.2.2.
Failure Cause Spring failure Wear of the connector resulting in misalignment Wear of the end face Contamination of facet (sand, dust, grease) Contamination on outside that wicks in Eccentric wear on the ferrule causes misalignment Crimping too tight causes pinching Crimping too loose causes it to fall apart O-ring failure Contraction of the outer jacket causes fiber pistoning Fracture of the end face Misalignment of cable end due to sleeve wear Misalignment of cable end due to buckling from tolerance stack up Misalignment of cable end due to separation from tolerance stack up Insufficient cure of epoxy Corrosion, pitting or facets Embrittlment of organic materials due to UV exposure
Occurrence 3 3 1 9 1 1 3 1 1 3 1 1
Fraction of Occurrence 0.073 0.073 0.024 0.220 0.024 0.024 0.073 0.024 0.024 0.073 0.024 0.024
3
0.073
3 3 3 1
0.073 0.073 0.073 0.024
Map Observed Failure Modes into the Failure Cause Categories
To transform the failure mode distribution data into the failure cause distribution, the following process was used: • • • •
Identify failure modes and their relative percentages (summarized above) Identify the accelerating factors applicable to each failure cause Identify the accelerating stresses applicable to each failure cause category (for example, accelerating stresses from device operation applicable to many photonic components will be optical power, temperature, etc.) Map the accelerating stress to the appropriate failure modes (identify them as being a primary, secondary or no accelerant driver)
The last item is accomplished by assessing whether each stress is a primary accelerant of the failure mode, a secondary accelerant, or is not an accelerant. A 3:1 weighting between primary and secondary accelerant was then used in estimating the percentage of failures that could be attributed to those stresses. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 308
Chapter 7: Examples
The primary stresses that potentially accelerate operational failure modes are operating temperature, vibration, current/voltage and optical power. The stresses that accelerate environmental failure causes are nonoperating ambient temperature, corrosive stresses (contaminants/heat/humidity), and aging stresses (time). As an example, Table 7.2-10 summarizes this process for our connector example.
Environmental
Accelerating Stress or Cause Operating temperature Vibration s p p Current/voltage Optical power
Ambient temperature Corrosion Ageing Humidity
p p p s
Power Cycling
Power Cycling
s s
Induced/handling
Induced/handling
p
s
p s s s
s p p
p p p
p p s
p s
s s
s p p s p
TOTAL
Total 100 %
Failure Cause Category Operational Stresses
7.32% 7.32% 2.44% 21.95% 2.44% 2.44% 7.32% 2.44% 2.44% 7.32% 2.44% 2.44% 7.32% 7.32% 7.32% 7.32% 2.44%
Spring failure Wear of the connector resulting in misalignment Wear of the end face Contamination of facet (sand, dust, grease) Contamination on outside that wicks in Eccentric wear on the ferrule causes misalignment Crimping too tight causes pinching Crimping too loose causes it to fall apart O-ring failure Contraction of the outer jacket causes fiber pistoning Fracture of the end face Misalignment of cable end due to sleeve wear Misalignment of cable end due to buckling from tolerance stack up Misalignment of cable end due to separation from tolerance stack up Insufficient cure of epoxy Corrosion, pitting or facets Embrittlment of organic materials due to UV exposure
Table 7.2-10: Failure Mode to Failure Cause Category for Connectors (SC and FC)
0.00 0.11 0.10 0.00 0.01
s p 0.07 0.30 p 0.04 0.09 p 0.10 0.23 0.23 0.36 0.36 1.00 1.00
Reliability Information Analysis Center 309
Chapter 7: Examples
Each of the failure modes is listed across the top of the table, and each of the accelerating stresses/causes is listed down the left side. Each combination is identified with a “blank” (no acceleration from the factor), a "p" (primary) or an "s" (secondary). The associated relative percentage of failures attributable to the accelerating stress/cause is listed down the right columns. The % column (second from the right) is calculated as follows: ⎛ ⎞ ⎜ ⎟ wi ⎟ ⎜ % = ∑ FM % n ⎜ ⎟ FM1 ⎜ ∑ wi ⎟ ⎝ AC1 ⎠ n
where: FM% = the percentage associated with the ith failure mode wi = the weight of the specific combination of failure mode and accelerating stress or cause (0 for none, 1 for secondary, and 3 for primary)
For example, the % value for ambient temperature (as part of the environmental failure cause category) is:
⎛1⎞ ⎛1⎞ ⎛1⎞ ⎛1⎞ 7.32%⎜ ⎟ + 7.32%⎜ ⎟ + 7.32%⎜ ⎟ + 2.44%⎜ ⎟ = 0.07 ⎝1⎠ ⎝ 11 ⎠ ⎝ 4⎠ ⎝4⎠ Therefore, an estimate of the percentage of failure causes accelerated by ambient temperature is 7%. 7.2.4.2.3.
Identify the Base Percentage of Failure Rate Attributable to Each Cause
The base percentages of failure rate are calculated by summing the accelerating stress/cause percentages associated with each failure cause. For our connector example, the four percentages associated with the “operating” accelerating stresses/causes is 11%, or 0.11. These percentages are an estimate of the percent of failures that can be expected for each cause under nominal stress conditions. In this case, nominal stresses are the average stresses to which the models are normalized. Table 7.2-11 summarizes the failure cause percentages (in fractional form).
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 310
Chapter 7: Examples
Table 7.2-11: Failure Cause Percentages for Connectors Failure Rate Term Operational Environmental Power Cycling Induced 7.2.4.2.4.
Percentage (Fraction) 0.11 0.30 0.23 0.36
Collect Reliability Data and Populate Spreadsheet
As previously summarized, the approach that was taken in photonics device model development methodology relied on the collection of quantitative failure mode and failure rate data. Literature searches were performed toward the goal of collecting the quantitative data required for model development. Sources searched for applicable data included: • • • • • • •
Optical Society of America (OSA) SPIE RIAC databases Total Electronic Migration System (TEMS) (a database of government-related research from IACs and other sources) Government-Industry Data Exchange Program (GIDEP) Manufacturers data Data mined from the Web
The results of this data collection effort, for connectors, are summarized in Table 7.2-12.
Reliability Information Analysis Center 311
Chapter 7: Examples
Failures
Lambda Observed
0 0.8 368 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1752 0 1 1752 0 1 1752 0 1 1752 20 1 0
Hours
12 0 0 0 0 0 0 0 0 125 110 125 125 0
RHa
5 0 0 0 0 0 0 0 0 0 0 0 0 0
CR
23 85 60 85 85 85 85 85 -40 -40 -40 -40 -40 25
DC
30 85 60 85 85 85 85 85 -40 85 70 85 85 25
VA
Delta T
Connector
Field Damp heat Damp heat Damp heat Damp heat Damp heat High temperature storage High temperature storage Low temperature storage Thermal Cycling Thermal Cycling Thermal Cycling Thermal Cycling Vibration
TR
Data Type
TAE
Part Type
TAO
Table 7.2-12: Data Collected for Connectors
40 85 95 85 85 85 2
33333333.33 20000 20160 1056 22000 22000 20160 1056 1056 5000 12600 50 5500 33
1 0 12 0 0 0 8 0 0 0 16 0 0 0
30
The first column is the part type; the second is the data type. Data types used in the photonics device study included: • •
Field data Test data Thermal cycling o Vibration o Damp heat o High temperature storage o Low temperature storage o Operating life test
The 3rd through tenth columns are the estimates of the actual stresses to which the part was exposed in the field or during the test. These stresses are defined as follows: TAO TAE
= =
ambient operating temperature ambient environmental temperature
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 312
Chapter 7: Examples
TR VA DC CR RHa 7.2.4.2.5.
= = = = =
temperature rise above TAO maximum vibration level applied (Grms) duty cycle (fraction of calendar time in operation) cycling rate (cycles per year) relative humidity (%)
Estimate Stresses to Which the Parts were Exposed
For each source of data that was collected, an estimate of the stresses and operating profiles to which the component was exposed was required so that the failure rates could be normalized to the actual stresses. These stresses were summarized in the previous section. For test data, these values were generally readily available. For data collected from fielded systems, the actual stress values were not available. Therefore, they had to be estimated. The default values of the environmental and operating profile factors were summarized in Tables 7.2-7 and 7.2-8. Only field data from telecommunication applications used in a ground, stationary, indoors environment was available to the photonics device modeling study, so only the values pertaining to those conditions were estimated in this manner. 7.2.4.2.6.
Estimate Acceleration Model Constants for Each Part
Acceleration factors (or Pi-factors) were used in the component models to estimate the effects of various stress and component variables on the failure rate. The two predominant forms of acceleration factors are the Arrhenius and the power law models. The Arrhenius model is generally used for modeling temperature effects and is:
AFT = e
Ea KT
where “AFT” is the temperature acceleration factor, “Ea” is the activation energy, “K” is Boltzman’s constant, and “T” is the temperature (in degrees K). The power law model is:
AF = S n where “S” is the stress and “n” is a constant. Reliability Information Analysis Center 313
Chapter 7: Examples
The specific forms of these acceleration factors that were used in the models are summarized below. πTO = factor for operating temperature:
π TO = e πV =
⎛ − Eaop ⎛ 1 1 ⎞ ⎞⎟ ⎜ ⎜ ⎟ ⎜ .00008617 ⎜ T +T + 273 − 298 ⎟ ⎟ ⎝ AO R ⎠⎠ ⎝
vibration factor:
⎛ V +1⎞ π V = ⎜⎜ a ⎟⎟ ⎝ Vc ⎠
nvib
πTE = nonoperating temperature factor:
π TE = e
⎛ − Eanonop ⎛ 1 1 ⎞ ⎞⎟ ⎜ ⎜ ⎟ ⎜ .00008617 ⎜ T + 273 − 298 ⎟ ⎟ ⎝ AE ⎠⎠ ⎝
πRH = humidity factor:
π RH
⎛ RH a + 1 ⎞ ⎟⎟ = ⎜⎜ ⎝ RH c ⎠
n RH
πDT = delta temperature factor:
π DT
⎛ T + T − TAE ⎞ = ⎜ AO R ⎟ 14 ⎝ ⎠
n PC
The temperature factors based on the Arrhenius relationship were normalized to 25 degrees C. The acceleration factors for vibration and relative humidity that are based on the power law were normalized to a specific value, i.e. the denominator, and include a value of 1.0 in the numerator to ensure that the factor does not go to zero with a stress level of zero. Each model has a single factor that needs to be estimated, i.e., “Ea” for the Arrhenius and “n” for the power law. These were estimated in one of the following ways: 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 314
Chapter 7: Examples
1. Values generated from information that was available in the literature 2. Engineering judgment based on the known behavior of similar failure mechanisms For #2, the accelerations were categorized from “no acceleration” to “very high acceleration” for each specific accelerating stress. Table 7.2-13 summarizes the values of the applicable parameters as a function of the relationship.
Table 7.2-13: Categories of Acceleration Model Parameters Dependency Very High High Medium Low None
n (PC) 10 5 2 1 0
Ea (op) Ea (nonop) 1 1 0.7 0.7 0.5 0.5 0.1 0.1 0 0
n (RH) 10 5 2 1 0
n (Vibration) 10 5 2 1 0
Table 7.2-14 summarizes the specific parameter values used in the connector models.
Table 7.2-14: Acceleration Model Parameters Component Type Connector 7.2.4.2.7.
n (PC) 2
Ea (op) Ea (nonop) 0.1 0.1
n (RH) 10
n (Vibration) 5
Calculate a Normalization Stress Accelerating Stress
The Pi factors needed to be normalized to a fixed set of conditions. This approach makes it convenient to derive default Pi-factors. By normalizing the factors in this manner, the Pi-factor is equal to 1.0 when the stress is equal to the default stress. Therefore, if an analyst chooses to ignore the effects of a particular stress, the failure rate will be representative of the default stress levels. The default values for the applicable photonics device model Pi-factors are summarized in Table 7.2-15.
Reliability Information Analysis Center 315
Chapter 7: Examples
7.2.4.2.8.
Default Vibration
Default RH
Default DT
0 10 0 5 20 0 0 0 15 5 15 15 15
Default CR
Connector Passive Micro-Optic Component Passive Fiber-Based Component Isolator VOA Fiber Splice Cable Laser Diode Module Photodiode Transmitter Receiver Transceiver
Default DC
Model Category
Default Tr
Table 7.2-15: Default Model Parameters
0.25
1000
1
50
20
Estimate the Acceleration Factors (Pi-factors) for Each Part from Each Data Source
The acceleration factors used in the models are Pi-factors, which are the acceleration factors normalized to a given stress level. These factors were calculated for each part from each data source. To derive these factors, two pieces of information were required: 1. The estimate of the stress for each data point (in this case, a data point is a single observation of reliability (failures and hours) at a known set of stress conditions). The manner in which these were quantified was previously explained. 2. The default stress level of the data for each stress parameter in the model The Pi-factor was then the acceleration model normalized to the default stress level. An example of this calculation is shown in Table 7.2-16. Every data point available from field or test data had its associated Pi-factor values calculated. Note that some of the Pifactors were zero. This occurs because test data was not applicable to all failure causes. This concept will be further explained in the next section.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 316
Chapter 7: Examples
Table 7.2-16: Summary of Pi-factor Calculations
Pi TE
Pi RH
Pi CR
Pi DT
1.000 32.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 4084101 1.000 496874 1.000 785027 1.000 4084101 1.135 1.000 1.921 1.000 1.506 1.000 1.921 1.000 1.921 1.000 1.921 1.000 1.921 1.000 1.921 1.000 0.337 1.000
Pi DCN
Pi TO
3.200 4.000 4.000 4.000 4.000 4.000 4.000 4.000 3.200 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Pi V
Pi DCO
Pi factors
0.267 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.267 1.333 1.333 1.333 1.333 1.333 1.333 1.333 1.333
1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.974 1.921 1.506 1.921 1.921 1.921 1.921 1.921 0.337
0.137 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.137 227 681 227 227 227 0.000 0.000 0.000
0.368 2.037 4.037 4.037 0.000 0.000 0.000 0.000 0.368 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1.063 1.180 1.149 1.162 0.000 0.000 0.000 0.000 0.360 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Cable Cable Cable Cable Cable Cable Cable Cable Connector Connector Connector Connector Connector Connector Connector Connector Connector
Field Thermal Cycling Thermal Cycling Thermal Cycling Vibration Vibration Vibration Vibration Field Damp heat Damp heat Damp heat Damp heat Damp heat High temperature storage High temperature storage Low temperature storage
7.2.4.2.9.
Calculate the Base Failure Rates for Each Cause Such That the Observed Failure Rates = the Predicted Failure Rates
In the case of the 217Plus models, which were based solely on field data, the base failure rates for the photonic device models were obtained, as follows, for each failure cause category: m
λ Bi =
∑ (Fobs × %i ) field 1 m
k
1
1
∑ H obs field × ∏ π
where: λBi = Fobs = Hobs = Ππ= i= m= k= %i =
the base failure rate for the ith failure rate term the number of observed field failures the number of observed field hours the product of the applicable Pi-factors to the applicable field environment the number of failure causes the number of field data sources the number of correction factors the percentage of failure rate attributable to the specific failure causes Reliability Information Analysis Center 317
Chapter 7: Examples
The product of the Pi-factors converts the actual hours to an equivalent “effective” number of hours normalized to the default stress values. However, in the case of the photonic models developed for the study, it was necessary to utilize a significant amount of test data since there was not enough field data available. This is due to the fact that there are few field data sources for photonic components. Therefore, the modeling methodology needed to be tailored to accommodate the specific data available on the parts addressed in the photonics device study. This was accomplished by using a Bayesian technique in which the field data becomes the prior distribution, and the summation of the failure and hours from all data sources forms the basis of the posterior distribution. The failure rate parameter of the exponential distribution was, therefore: j
m
λ Bi =
∑ (Fobs × %i ) field + ∑ Fobstest 1
1
m
k
j
k
1
1
1
1
∑ H obs field × ∏ π + ∑ H obstest × ∏ π
where there were “j” test data sources. Each specific type of test data that was collected for the study was applicable to only one of the four specific failure causes, as summarized in Table 7.2-17. Field data, however, encompassed all four failure causes.
Table 7.2-17: Applicability of Test Data Data Type Field Operating Life Test High Temperature Storage Low Temperature Storage Damp Heat Vibration Thermal Cycling
Operating X X
Failure Cause Category Environmental Cycling X X
Induced X
X X X X X
One of the advantages to the model structure was this ability to modify the base failure rates of specific failure causes with test data applicable to only that failure cause. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 318
Chapter 7: Examples
The connector base failure rates resulting from this analysis are listed in Table 7.2-18.
Table 7.2-18: Base Failure Rates (Failures per Million Calendar Hours) Base Failure Rate (failures per million calendar hours)
Component Connector 7.2.4.2.10.
Operating 0.0002
Environmental 0.3053
Cycling 2.7952
Induced 0.0110
Adjust the Base Failure Rates
The last step in the process was to adjust the base failure rates to ensure that the predicted number of failures was equal to the observed number. The manner in which this was accomplished was to scale the base failure rates to ensure that the cumulative predicted number of failures of the entire population of observed data points was equal to the observed number of failures. This was accomplished by using the MS Excel “goal seek” function, which finds the value of a “correction factor” that satisfies this boundary condition. This approach is conceptually similar to a maximum likelihood method. 7.2.4.2.11.
Treatment of Quality and Environmental Stresses
There were several options for modeling the effects of environmental stresses. Early in the study, it was decided that the effects of quality and environment would be treated such that the photonic component models would be “stand-alone”. This approach differed from the form of the 217Plus methodology, in that quality and environment were treated as “system” level effects. This concept was based on the premise that quality and environmental effects were manifested more at the assembly or system level than they were at the component level. The photonic component models include the effects of their pertinent environmental stresses in the component models, instead of applying the environment factor in the assembly or system model, as was the case with 217Plus. The primary environmental stresses included in the photonic component models are temperature, humidity and vibration. The quality factor ( Q) is calculated in a manner similar to the 217Plus methodology, but tailored to the unique concerns of photonic components. This factor is calculated as follows: 1
π q = α i (− ln (R i )) β
i
Where αi and βi are Weibull parameters representing the distribution of the percentage of failures attributable to components (parts). The quality factor is scaled within this Reliability Information Analysis Center 319
Chapter 7: Examples
distribution based on how good the parts control program is. The parameter “Ri” is the rating of the parts control program and is calculated from: ni
Ri =
∑ j =1
GijWij
ni
∑W j =1
ij
where, rating of the process for the ith failure cause, from 0.0 to 1.0 the grade for the jth item of the ith failure cause. This grade is the rating between 0.0 and 1.0 (worst to best). Wij = the weight of the jth item of the ith failure cause n i = the number of grading criteria associated with the ith failure cause
Ri = Gij =
The 217Plus grading criteria, as applied to the photonics device models, are provided in Table 7.2-19. These were tailored specifically for photonic components.
Table 7.2-19: Part Quality Process Grade Factor Questions for Photonic Device Models Highest Actual Possible Score Score
Rating
Input Range
User Input
Is there a documented part selection and part management process?
yes = 5 no = 0
Y,N
N
5
0.0
Are part evaluation and qualification processes established to add parts to the PPL?
yes = 3 no = 0
Y,N
N
3
0.0
yes = 3 no = 0
Y,N
N
3
0.0
yes = 6 no = 0
Y,N
N
6
0.0
Will new parts be added to the PPL to design this FRU?
yes = 4 no = 0
Y,N
N
4
0.0
Are procedures in place to detect part problems in both manufacturing and the field?
yes = 10 no = 0
Y,N
N
10
0.0
yes = 10 no = 0
Y,N
N
10
0.0
yes = 10 no = 0
Y,N
N
10
0.0
Parts Contribution to Reliability
Does a cross functional development team (CFDT) review and approve new candidate parts for addition to the PPL? Is this a commercial off-the-shelf (COTS) purchased assembly with a good history of operational reliability?
Are quality and reliability data tracked on parts and fed back to suppliers so they know their performance on this product? Is there a design compliance checklist to ensure that all parts are properly applied, operating at sufficient margin with respect to environmental and operational stresses, and take into account lessons learned?
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 320
Chapter 7: Examples Highest Actual Possible Score Score
Rating
Input Range
User Input
yes = 7 no = 0
Y,N
N
7
0.0
yes = 7 no = 0
Y,N
N
7
0.0
yes = 10 no = 0
Y,N
N
10
0.0
yes = 7 no = 0
Y,N
N
7
0.0
Is there a device specification for all critical and custom parts?
yes = 5 no = 0
Y,N
N
5
0.0
Has the supplier reviewed the part application for all critical and custom parts?
yes = 7 no = 0
Y,N
N
7
0.0
yes = 7 no = 0
Y,N
N
7
0.0
yes = 7 no = 0
Y,N
N
7
0.0
yes = 7 no = 0
Y,N
N
7
0.0
Is there a first article inspection and acceptance test planned?
yes = 7 no = 0
Y,N
N
7
0.0
Have key suppliers identified their part failure mechanisms?
yes = 10 no = 0
Y,N
N
10
0.0
Have the sources and the extent of part variation been identified?
yes = 7 no = 0
Y,N
N
7
0.0
Have mitigations been identified to handle the effects of part's variations?
yes = 8 no = 0
Y,N
N
8
0.0
Will a design of experiments part evaluation, considering variations, as well as manufacturing variations, be conducted?
yes = 7 no = 0
Y,N
N
7
0.0
Will developers' quality organization audit the supplier's processes and facility capabilities?
yes = 6 no = 0
Y,N
N
6
0.0
A,B,C,D,E
B
10
0.0
Parts Contribution to Reliability Are teaming relationships established with all critical component suppliers? Will all suppliers provide timely failure reporting and corrective action support (FRACAS) for both critical and custom parts? (Timely reporting implies a 2 week turnaround with faster response on priority demand.) Have supplier identified the likely failure modes on critical and custom parts, and does the design take these failure modes into account? Are operational failure rate and failure mode data provided by the suppliers of critical and custom parts being used?
Will critical suppliers provide timely notice of impending part changes to allow the developer to assess the impact? Is a change history log maintained to provide traceability of engineering change actions and their associated rationale for critical and custom parts? Will part identification (revision numbers) be shown on the part to identify the particular part configuration, including the level of the part’s firmware?
Is an optical path adhesive (OPA) used in the component
A. No OPA = 10 B. yes, MFD 10 um = 8
Reliability Information Analysis Center 321
Chapter 7: Examples
Parts Contribution to Reliability
Are there thin films (AR coatings, filter elements) in the light path?
Rating A. No Thin film = 0 B. yes, and surface is prepared by sputtering = 2 C. yes, and surface is not prepared by sputtering = 3
Highest Actual Possible Score Score
Input Range
User Input
A,B,C
A
3
0.0
Does the component contain fused fiber?
yes = 5 no = 0
Y,N
N
5
0.0
Does the component contain fiber?
yes = 5 no = 0
Y,N
N
5
0.0
Was the package thermally designed to safely dissipate heat by understanding and modeling the thermal characteristics?
yes = 3 no = 0
Y,N
N
3
0.0
Has the manufacturer characterized the power handling capability of the component?
yes = 5 no = 0
Y,N
N
5
0.0
yes = 5 no = 0
Y,N
N
5
0.0
yes = 4 no = 0
Y,N
N
4
0.0
A,B
B
4
0.0
Y,N
N
3
0.0
A,B,C
C
3
0.0
Y,N
N
3
0.0
Have acceleration factors for power and temperature been quantified and are they used to determine the derating requirements? Does the component contain absorbers at wavelengths for which the component will be exposed (i.e. garnet, shutter, etc.) How is dissipated power intended to be dumped? Does the component rely on alignment of free space components attached with organics Cleanliness precautions
For components that have a fiber/epoxy interface, is the fiber tip inspected to ensure it is free of defects and contamination?
A. with a heat sink = 4 B. dissipation not actively managed = 0 yes = 3 no = 0 A. stringent cleaning procedures =3 B. some cleaning procedures = 2 C. no cleaning procedures = 0 yes = 3 no = 0
7.2.4.3. Uncertainty Analysis
An analysis was performed to quantify the degree of uncertainty in the predicted failure rates. This was accomplished by calculating the predicted failure rate and comparing it to the observed failure rate. The metric that was used for this analysis was the log10 of the value: predicted failure rate/observed failure rate. The value of this metric should cluster around zero if the prediction models are approximating the observed data. Calculation of the standard deviation of this metric also provides a quantification of the uncertainty levels present in the predictions made with these models. Table 7.2-20 summarizes the mean and standard deviation of this metric for all of the data and for only the field data.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 322
Chapter 7: Examples
Table 7.2-20: Summary of Uncertainty Metrics Mean Standard deviation
All Data Field Data -0.68 0.20 1.07 0.44
Figures 7.2-5 and 7.2-6 illustrate the distribution of this metric for all data and for just field data. For this analysis, only data for which failures occurred were included, since data with no observed failures only have a single-sided bound on the failure rate and, therefore, cannot be compared to the predicted value. The result of not including zero failure data is that the metric is biased. As can be seen in these figures, the distribution of all failures is significantly wider than the distribution of just the field failure rates. This is due to the fact that the non-field data, i.e. test data, is typically at extreme conditions. Therefore, the uncertainty in these extreme cases is typically larger than for nominal conditions.
Histogram 14 12 Frequency
10 8 6 4 2 0 -3
-2
-1
0
1
2
3
LOG 10 (PREDICTED/OBSERVED)
Figure 7.2-5: Distribution of Log10 Predicted/Observed Failure Rate Ratio for All Data
Reliability Information Analysis Center 323
Chapter 7: Examples
Histogram 7 6 Frequency
5 4 3 2 1 0 -0.25
0.25 0.75 1.25 1.75 More LOG 10 (PREDICTED/OBSERVED)
Figure 7.2-6: Distribution of Log10 Predicted/Observed Ratio for Field Data Only The distributions of the predicted/observed failure rate ratio are illustrated in Figure 7.27. With this metric, the value should be centered about one, since the log of this ratio has not been taken. Re lia So f t W e ibu ll+ + 7 - w w w . Re lia So ft. com
distribution of predicted/observed failure rate ratio
cumulative probability
99 . 0 00
Prob a bility-Lo gn orma l
50 . 0 00
10 . 0 00
5. 00 0
1. 00 0 0. 00 1
0 . 0 10
0. 10 0
1. 000
10. 00 0
1 00 . 0 00
predicted/observed ratio F olio1\Da ta 1: μ= −1 .5 5 8 5 , σ=2 .4 9 2 5 , ρ= 0 .9 8 8 0 F olio1\Da ta 2: μ= 0 .4 5 5 6 , σ=0 .9 5 4 7 , ρ=0 .8 8 1 3
Figure 7.2-7: Distributions of the Predicted/Observed Failure Rate Ratio for All Data and For Field Data Only 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 324
Chapter 7: Examples 7.2.4.4. Comments on Part Quality Levels
Part quality level has traditionally been used as one of the primary variables affecting the predicted failure rate of a component. The quality level categories were usually those defined by the applicable military specification. One of the problems that developers had when developing MIL-HDBK-217 models was de-convolving the effects of quality and environment. For example, multiple linear regression analysis of field failure rate data was usually used to quantify model variables as a function of independent variables such as quality and environment. A basic assumption of such techniques is that the independent variables are statistically independent of each other. However, in reality they are not, since the “higher” quality components are generally used in the severe environments and the commercial quality components are used in the more benign environments. This correlation makes it difficult to discern the effects of each of the variables individually. Additionally, there are several attributes pooled into the quality factor, including qualification, process certification, screening and quality systems. The approach used in the 217Plus model to quantify the effects of part quality is to treat it as one of the failure causes for which a process grade is determined. In this manner, issues related to qualification, process certification, screening and quality systems were individually addressed. 7.2.4.5. Explanation of Failure Rate Units
The 217Plus models predict the failure rate in units of failures per million calendar hours. This is necessary because the 217Plus methodology accounts for all failure rate contribution terms (i.e., operating, nonoperating, cycling and induced), and the appropriate manner in which they can be combined is to use a common time basis for the failure rate, which is calendar hours. If an equivalent operating failure rate is desired in units of failures per million operating hours, the 217Plus reliability prediction should be performed with the actual duty cycle to which the unit will be subjected, then divide the resulting failure rate (in f/106 calendar hours) by the duty cycle to yield a failure rate in terms of f/106 operating hours. The resulting “operating” failure rate will be artificially increased to account for the nonoperating and cycling failures that would not otherwise be accounted for. The incorrect way to predict a 217Plus failure rate in units of failures per million operating hour is to set the duty cycle equal to 1.0. The resulting failure rate in this case would be valid only if the actual duty cycle is 100%. If the actual duty cycle is not 100%, then the failures during non-operating periods will not be accounted for. Reliability Information Analysis Center 325
Chapter 7: Examples
7.2.5.
System-Level Model
7.2.5.1. Model Presentation
As a reminder, the total 217Plus system model is:
λ P = λ IA (Π P Π IMΠ E + Π D Π G + Π M Π IMΠ E Π G + ΠSΠ G + Π I + Π N + Π W ) + λ SW where: λp =
predicted failure rate of the system
λIA =
initial assessment of the failure rate. This failure rate is based on new component failure rate models derived by the RIAC presented in Section 2.2, whose derivations are discussed in the next section
Each of the following model factors represents a failure cause:
ΠP ΠD ΠM ΠS ΠI ΠN ΠW
= = = = = = =
parts process factor design process factor manufacturing process factor system management process factor induced process factor no-defect process factor wearout process factor
Each of these factors is calculated as follows: 1
π i = αi (− ln(Ri ))β
i
where αi and βi are constants for each failure cause category, as given in Table 7.2-21. The parameter Ri is calculated as: ni
Ri =
∑ j =1
GijWij
ni
∑W j =1
ij
where:
Ri =
rating of the process for the ith failure cause, from 0.0 to 1.0. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 326
Chapter 7: Examples
the grade for the jth item of the ith failure cause. This grade is the rating between 0.0 and 1.0 (worst to best). Wij = the weight of the jth item of the ith failure cause n i = the number of grading criteria associated with the ith failure cause
Gij =
Table 7.2-21: Parameters for the Process Grade Factors Model Factor Symbol (Πi) ΠD ΠM ΠP ΠS ΠN ΠI ΠW
Name Design process factor Manufacturing process factor Parts Quality process factor Systems Management process factor CND process factor Induced process factor Wearout process factor
α
β
0.12 0.21 0.30 0.06
1.29 0.96 1.62 0.64
Default value for factor if Ri is unknown 0.094 0.142 0.243 0.036
0.29 0.18 0.13
1.92 1.58 1.68
0.237 0.141 0.106
ΠIM = infant mortality factor
Π IM =
t - 0.62 (1 - SSESS ) 1.77
where: t=
time in years. This is the instantaneous time at which the failure rate is to be evaluated. If the average failure rate for a given time period is desired, this expression must be integrated and divided by the time period. SSESS = the screening strength of the screen(s) applied, if any ΠE = environmental factor
((
) (
.6 1.71 .855 × .8 1 − e(−.065(ΔT +.6 ) ) + .2 1 − e(−.046G ) πE = .205
))
where:
ΔT = the change in temperature between operating and non-operating periods (TAO-TAE) Reliability Information Analysis Center 327
Chapter 7: Examples
G = the magnitude of random vibration while the system is operating, in GRMS ΠG = reliability growth factor, given by the formula: ΠG =
1 .12 (t + 2 ) −α 2 −α
where:
α= Ri =
the growth constant, which is equal to Ri for reliability growth processes the rating of the growth process using the criteria in Table 7.2-30, and is given as: ni
Ri =
∑ j =1
GijWij
ni
∑W j =1
ij
7.2.5.2. 217Plus Process Grading Criteria
This section contains a listing of all of the criteria that comprise the definition and scoring for the individual 217Plus Process Grades. An index of the tables included within this section is listed in Table 7.2-22.
Table 7.2-22. Index of Process Grade Type Questions Table Number 7.2-23 7.2-24 7.2-25 7.2-26 7.2-27 7.2-28 7.2-29 7.2-30
Process Grade Type Design Manufacturing Part Quality System Management CND Induced Wearout Growth
The rating for each process grade type, Ri,is given as: ni
Ri =
∑ j =1
GijWij
ni
∑W j =1
ij
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 328
Chapter 7: Examples
where: rating of the process for the ith failure cause, from 0.0 to 1.0. the grade for the jth item of the ith failure cause. This grade is the rating between 0.0 and 1.0 (worst to best). Wij = the weight of the jth item of the ith failure cause n i = number of grading criteria associated with the ith failure cause
Ri = Gij =
These tables are organized as follows. Column 1 contains the criteria associated with the specific Process Grade Type. Column 2 is the grading criteria (Gij). Most of the questions are designated with a Y/N in this column. In these cases, a Yes (Y) answer equals "1" and a "No" answer equals “0”. The question will receive the full weighted score for a "Yes" answer and a zero for a "No" answer. In some cases, the grading criteria is not binary, but rather can be one of three or four possible values. The grading criteria for these are noted in this column. Column 3 identifies the scoring weight (Wij) associated with the specific question. In the event that a model user does not wish to answer all of the questions, he/she can choose a subset of the most important questions by using only those with weight values of seven or higher. Questions that are not scored should not be counted in the number of grading criteria (ni) associated with the ith failure score.
Reliability Information Analysis Center 329
Chapter 7: Examples
7.2.5.3. Design Process Grade Factor Questions
Table 7.2-23: Design Process Grade Factor Questions Question
Gij
Wij
What is the % of lead design engineering people with cross training experience in manufacturing or field operations (thresholds at 10, 20%)?
20 = 1
5
What is the % of team members having relevant product experience (thresholds at 25, 50%)?
50 = 1
5
What is the % of team members having relevant process experience, i.e., they have previously developed a product under the current development process (thresholds at 20, 40%)?
40 = 1
4
What is the % of development team that have 4-year technical degrees (thresholds at 20, 40%)?
40 = 1
3
What is the % of engineering team having advanced technical degrees (thresholds at 10, 20%)?
20 = 1
3
What is the % of engineering team members involved in professional activities in the past year; hold patents; authored/presented papers; are registered professional engineers, or professional society offices at the National level (thresholds at 10, 20%)?
20 = 1
2
What is the % of engineering team members who have taken engineering courses in the past year (thresholds at 10, 20%)?
20 = 1
2
Are resource people identified for program technology support across key technology and specialty areas such as optoelectronics, servo control, Application Specific Integrated Circuits (ASIC) design, etc., to provide program guidance and support as needed?
Yes = 1 No = 0
7
Are resource people identified, for program tools support, to provide guidance and assistance with Computer Aided Design (CAD), simulation, etc.?
Yes = 1 No = 0
6
How many (0,1,2,3) of the program objectives of cost, schedule and reliability did the manager successfully meet for the last program that he/she was responsible?
3=1 2 = .5 1 = .25 0=0
10
Is this development program organized as "Cross Functional Development Teams" (CFDT) involving: design, manufacturing, test, procurement, etc.?
Yes = 1 No = 0
8
Does this Field Replaceable Unit (FRU) depend more on mature technology than state of the art technology?
Yes = 1 No = 0
3
Is design of experiments (DOE) used to ensure robustness of the FRU in the product under all operational and environmental variations?
Yes = 1 No = 0
5
Are critical components identified along with plans to mitigate their risks?
Yes = 1 No = 0
5
Have designs been reviewed and plans made for part obsolescence during the product's life cycle?
Yes = 1 No = 0
6
Are considerations made to accommodate part form factor evolution? This applies particularly to those parts deemed likely to change during the production life of the fielded system.
Yes = 1 No = 0
5
Are predominantly standard tools required for maintenance (limited-to-no use of special tools)?
Yes = 1 No = 0
2
Will the design application be modeled by variational analysis to ensure design centering?
Yes = 1 No = 0
5
Will timing analysis be performed on digital circuits?
Yes = 1 No = 0
5
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 330
Chapter 7: Examples
Table 7.2-23: Design Process Grade Factor Questions (continued) Question
Gij
Wij
Will a network modal analysis be performed on analog circuits?
Yes = 1 No = 0
5
Will electrical stress analysis be performed on electronic circuits?
Yes = 1 No = 0
6
Will mechanical stress analysis be performed on relevant components, materials and structures?
Yes = 1 No = 0
3
Will a prototype be developed in time to have user feedback impact the design?
Yes = 1 No = 0
10
Will customer feedback on the prototype be sought?
Yes = 1 No = 0
10
Will design personnel participate in a Failure Modes and Effects Analysis (FMEA), Failure Modes Effects and Criticality Analysis (FMECA), or Fault Tree Analysis (FTA) that is performed concurrently with the design effort?
Yes = 1 No = 0
6
Will the design engineer also design the diagnostic code for this FRU?
Yes = 1 No = 0
4
Will a worst-case analysis be performed?
Yes = 1 No = 0
4
Will the product support tasks be ergonomically evaluated (human factors) from an Operations & Maintenance standpoint?
Yes = 1 No = 0
4
Will the product be analyzed using a human factor task analysis to ensure the Operations and Maintenance Tasks are tailored to human capabilities?
Yes = 1 No = 0
4
Will the chassis that this FRU is mounted in be thermally measured and analyzed and operating temperatures assured to be at a safe margin below device limits?
Yes = 1 No = 0
6
Is electrical/mechanical power by electronic logic or physical action (switches)?
Yes = 1 No = 0
4
Do control procedures ensure that the system and its software are put in a safe state during power shut down?
Yes = 1 No = 0
4
125 = 0
8
Will environmental analyses and profiling (thermal, dynamic) be performed on the product to ensure it is used within its design strength capabilities?
Yes = 1 No = 0
5
Will the product be analyzed/tested for electromagnetic compatibility (EMC) and radiated/conducted susceptibility and emissions?
Yes = 1 No = 0
6
Will the product be EMC-certified, per the European CE (Conformity European) regulatory compliance criteria for equipment used in Europe, or under a similarly rigorous standard such as DO-160 (commercial aircraft)?
Yes = 1 No = 0
4
Are the size of equipment orifices (cover openings) less than 1/10 of the wavelength of the signal frequencies that the equipment will generate within its enclosure or be exposed to in its environment?
Yes = 1 No = 0
4
Do traces on a Printed Wiring Board (PWB) run over a ground plane or an impedance control layer (e.g., power planes) and never over reference plane or power plane voids?
Yes = 1 No = 0
4
Do traces on alternate PWB layers run orthogonal to one another, when a reference plane or power plane is not interposed between them?
Yes = 1 No = 0
4
Are adjacent traces separated by at least twice their width, except for minor adjacencies that run less than a half inch?
Yes = 1 No = 0
4
Is the power source filtered over the range of 1KHz to 100 MHz for military or 150KHz to 30 MHz for commercial power, and utilize surge suppression devices where appropriate?
Yes = 1 No = 0
4
Are all interconnect cables emerging from a shielded cabinet grounded to the chassis for operating frequencies greater than 1 MHz or capacitively decoupled to the chassis for frequencies less than 1 MHz?
Yes = 1 No = 0
4
Are traces set back at least 2 widths from the edge of the reference or ground plane?
Yes = 1 No = 0
4
Is there a shared product development vision that includes Design for Manufacturability (DFM) goals?
Yes = 1 No = 0
8
What is the maximum silicon junction temperature on this FRU in degrees C (thresholds at 90 and 125 degrees C)? If not applicable select “0”.
th
Reliability Information Analysis Center 331
Chapter 7: Examples
Table 7.2-23: Design Process Grade Factor Questions (continued) Question
Gij
Wij
Are part types standardized via a Preferred Parts List (PPL)?
Yes = 1 No = 0
5
Is there continuing focus to keep the PPL up to date and to minimize the number of parts on the PPL, by increasing part standardization, encouraging designers to use the PPL and requiring analysis to justify adding a new part to the PPL?
Yes = 1 No = 0
6
Is this product to be built on an existing manufacturing platform that makes use of existing process capabilities?
Yes = 1 No = 0
6
Do plans for follow-on products and product retirement exist?
Yes = 1 No = 0
2
Are new, critical parts qualified for by test and analysis prior to their inclusion in the system?
Yes = 1 No = 0
6
100% = 0
4
Yes = 1 No = 0
6
100 = 0
4
Are PWB traces at least 5 mils in width?
Yes = 1 No = 0
5
Is the development process documented?
Yes = 1 No = 0
5
Is the process documentation on-line with the recognition that the on-line version is the only standard? (All printed copies are for reference only).
Yes = 1 No = 0
6
Does each process activity have clear entry and exit criteria?
Yes = 1 No = 0
2
Is the system configuration documented on-line, with changes since the last baseline highlighted to keep the entire team current with the design?
Yes = 1 No = 0
6
Are there functional block diagrams of the system, subsystems, etc., down to the FRU level?
Yes = 1 No = 0
3
Are examples of good development products (e.g., specs, plans, documentation) provided to the engineering team, typifying the desired work products for each stage of development?
Yes = 1 No = 0
6
Are examples of past problems provided to the engineering team that typify those found at each stage of development?
Yes = 1 No = 0
4
Is there a closed-loop problem database to track development problems to closure?
Yes = 1 No = 0
5
Does development activities planning include the identification of critical path tasks?
Yes = 1 No = 0
5
Are critical path tasks planned to minimize cycle time impacts and improve schedule robustness?
Yes = 1 No = 0
4
Are individual developers encouraged to make contact with their customer counterpart?
Yes = 1 No = 0
6
Will Cross-Functional Development Team (CFDT) phase reviews/sign-offs follow each product development phase: requirements, preliminary design, final design, and test?
Yes = 1 No = 0
9
Are formal reviews documented and defect data analyzed and tracked, along with any action items, to completion?
Yes = 1 No = 0
5
Do design reviewers share responsibility for the performance of the design once they have reviewed it?
Yes = 1 No = 0
3
Are developers rated on the success of the overall product in the field?
Yes = 1 No = 0
3
Is there a technical review board in place to minimize design changes and maintain cost, schedule and reliability goals?
Yes = 1 No = 0
5
How does the part count on this project compare with predecessor products or competitive products? (Thresholds at 75 and 100%). Are there DFM guidelines provided that the program must adhere to? (e.g., a good DFM design is fabricated on a uni-axis assembly orientation, preferably built from the bottom) What % of inter-connections are there compared to the predecessor version of this FRU (thresholds at 70 and 100%)?
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 332
Chapter 7: Examples
Table 7.2-23: Design Process Grade Factor Questions (continued) Question
Gij
Wij
Are engineering change (EC) costs budgeted, measured and tracked against their associated design driver?
Yes = 1 No = 0
6
Is reliability and/or quality a significant goal or the number one goal placed on the entire development organization? This occurs on safety critical applications such as air traffic control, nuclear or critical medical applications.
Yes = 1 No = 0
10
20 = 1
4
Are individual developers empowered by having input and control over resources to accomplish their job, such as having a travel budget (if travel is required)?
Yes = 1 No = 0
2
Are engineering team members dedicated full time to the project?
Yes = 1 No = 0
4
Are process owners identified across the development team for each configuration item (CI) and its components?
Yes = 1 No = 0
3
Is there tracking of open problems, action items and cross system dependencies?
Yes = 1 No = 0
4
Is there a change review board/process?
Yes = 1 No = 0
4
Is development creativity fostered through planned creativity exercises to spawn breakthrough thinking with respect to design simplicity, cost, schedule and reliability?
Yes = 1 No = 0
4
Are failures traced to their root cause and managed to resolution?
Yes = 1 No = 0
4
Is this development process ISO rated?
Yes = 1 No = 0
4
How many (0,1,2,3) of the cost, schedule and reliability goals did the last product developed by this organization meet?
3=1 2 = .5 1 = .25 0=0
10
Do you know the reliability performance of your current products in the field versus their predicted reliability?
Yes = 1 No = 0
5
If so, are previous reliability estimates greater than 15% of the predicted reliability? When not applicable select "No".
Yes = 1 No = 0
7
Is there a 15% staffing buffer on the program, i.e., will the program be staffed to 115% of the needed baseline to allow for contingencies?
Yes = 1 No = 0
9
Are in-process metrics maintained to track actual vs. planned defect rates, schedule and resource targets?
Yes = 1 No = 0
4
Can continuous measurable improvement (CMI) be demonstrated for the development processes?
Yes = 1 No = 0
4
Are development processes maintained on-line with all printed paper copies designated "for reference use only"?
Yes = 1 No = 0
6
Are there procedures to ensure that documentation stays current with the design?
Yes = 1 No = 0
6
Is there a requirements document for this program?
Yes = 1 No = 0
4
Is there a Functional Specifications document?
Yes = 1 No = 0
4
Are there document owners or points-of-contact identified for these documents so the development team knows who it can go to for a specific need?
Yes = 1 No = 0
3
Do the team members contribute to the creation and/or review and approval of these documents?
Yes = 1 No = 0
4
Are documentation standards promoted with examples to demonstrate what is considered adequate documentation?
Yes = 1 No = 0
4
What is the % of FRU reuse across the system? (thresholds at 10 and 20%).
Reliability Information Analysis Center 333
Chapter 7: Examples
Table 7.2-23: Design Process Grade Factor Questions (continued) Question
Gij
Wij
Is product documentation field-tested prior to product delivery or general availability?
Yes = 1 No = 0
5
Is there a procedure for field feedback on product operations and maintenance documentation?
Yes = 1 No = 0
4
Does product documentation maximize pictures and minimize words (fallibility of natural language)?
Yes = 1 No = 0
4
Is product documentation kept at reading grade level 10 or less?
Yes = 1 No = 0
4
Is there an operational concept document developed prior to high level design that is maintained throughout development?
Yes = 1 No = 0
8
Is there a set of hardware and process design guidelines that provide general and component-specific design guidance practices?
Yes = 1 No = 0
6
Is there an assumptions/dependencies database that is maintained and reviewed prior to each development stage exit?
Yes = 1 No = 0
8
Is a distributed architecture used?
Yes = 1 No = 0
5
Does the design exclude electro-optical devices?
Yes = 1 No = 0
5
Does the design exclude electro-mechanical devices?
Yes = 1 No = 0
5
Is chemical processing excluded from this design?
Yes = 1 No = 0
5
Is hot fusing (toner) excluded from this design?
Yes = 1 No = 0
5
Are all voltages used in this design less than 110 VAC?
Yes = 1 No = 0
2
Are all operating frequencies less than 50 MHz
Yes = 1 No = 0
5
What is the number of developers on this project (thresholds of 20 and 100)?
100 = 0
5
What is the development schedule in months? (thresholds at 18, 36 and 48 months)
48 = 0
8
Is there a 24-hour/day availability requirement?
Yes = 1 No = 0
6
Does the operational concept call for a remote operations and maintenance (O&M) operator to be able to diagnose system problems as part of the system concept?
Yes = 1 No = 0
6
Is this PWB of standard size or dimension?
Yes = 1 No = 0
5
Does this FRU have a 25% reduction in parts count over its predecessor or competitor?
Yes = 1 No = 0
5
Are stuck faults required to be isolated down to a single failing FRU 90% of the time?
Yes = 1 No = 0
4
Does this FRU report its status via a "Management Information Data Bit" (MIB) capability for fault determination and isolation?
Yes = 1 No = 0
2
Is there over-voltage / under-voltage detection and reporting?
Yes = 1 No = 0
3
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 334
Chapter 7: Examples
Table 7.2-23: Design Process Grade Factor Questions (continued) Question
Gij
Wij
Will FRUs be "hot-pluggable"?
Yes = 1 No = 0
2
Is there an independent test team?
Yes = 1 No = 0
5
Is the customer directly involved in defining the product's operational profile and in reviewing the test plans?
Yes = 1 No = 0
5
Does test planning take into account the "lessons learned" database?
Yes = 1 No = 0
4
Does a problem tracking database exist and is it being used on this program?
Yes = 1 No = 0
5
Will accelerated testing be performed during development that combines temperature and vibration?
Yes = 1 No = 0
4
Will alpha tests be conducted, whereby the final product is robustly tested against probable extensions of its operational environment?
Yes = 1 No = 0
6
Will beta tests be conducted, whereby customers can use and test pre-release versions of the product, feeding back their results to the developer?
Yes = 1 No = 0
6
Are test procedures, set-up conditions, results, etc., documented so that measurements can be verified, failures reproduced, test conditions recreated, and corrective actions confirmed?
Yes = 1 No = 0
5
Will a gold standard (tested product) be preserved for comparative regression analysis?
Yes = 1 No = 0
6
Will product changes be regression tested?
Yes = 1 No = 0
5
Will the FRU be reliability or endurance tested (at any assembly level)?
Yes = 1 No = 0
4
Can parts (ASIC, EPROM) be reprogrammed in the circuit?
Yes = 1 No = 0
4
Can active elements be backwardly driven for more complete coverage?
Yes = 1 No = 0
4
95 = 1
4
Yes = 1 No = 0
4
>40 = 1 32-40 = .75 25-32 = .25 95 = 1 80-95 = .5 50-80 = .25 3 = 0 3 = .5 2 = .75 1 = 1 >20 = 0 10-20 = .5 2)?
Is it easy to visually distinguish between fasteners (e.g., no minor differences in length) prior to installation? Is there only one type fastener drive (torx, Phillips, etc.) needed in the assembly, installation and maintenance of this FRU? Are mounting guides or registration pins provided for aligning and securing electro-mechanical or electro-optical parts? Are development personnel, including manufacturing, all co-located? Does this project have a built-in 15% staffing buffer, i.e., staffing is at least 115% of base requirements? Is the project organized around self-directed work teams? Are workers rated on both total output and quality? Are there process improvement teams with continuous measurable improvement (CMI) goals? Are employees rated on field performance of the product? Is there an advanced manufacturing engineering (AME) support department to help bridge between engineering and production? Has Cross Functional Development Team (CFDT) been implemented such that the manufacturing manager is able to explain the design concept? Are manufacturing people encouraged to ask questions of development people (identified points of contact) when questions arise? Are enterprise points-of-contact (POCs) identified (development, manufacturing, test, field, marketing) to help answer questions and address issues across the organization? Can any of the line or quality personnel "stop the line" if that person believes a serious problem exists?
Has the majority of the manufacturing leadership had direct field or customer contact in the past year? Do manufacturing people have measurable goals to improve production metrics, including quality and cycle time? If answer to 3.2.13 is yes, do direct manufacturing people participate in developing the goals? Do manufacturing personnel have goals for continuous quality improvement? Are there quality circles that meet regularly? Are teams rewarded or recognized for improving quality? Are key metrics for quality and cost monitored and tracked? Is the cost of defect prevention measures tracked (proactive quality)?
Reliability Information Analysis Center 337
Gij
Wij
Yes = 1 No = 0 >2 = 0 2 = .25 1 = .5 0 = 1 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes No Yes No Yes No
= = = = = =
1 0 1 0 1 0
Yes No Yes No Yes No Yes No Yes No
= = = = = = = = = =
1 0 1 0 1 0 1 0 1 0
Yes = 1 No = 0 Yes No Yes No Yes No
= = = = = =
1 0 1 0 1 0
Yes No Yes No Yes No Yes No Yes No
= = = = = = = = = =
1 0 1 0 1 0 1 0 1 0
3
3
3 3 3 6 6 3 4 5 4 3 1 5 5
5
3 5 4 3 3 3 5 3
Chapter 7: Examples
Table 7.2-24: Manufacturing Process Grade Factor Questions (continued) Question Is the cost of problem corrections tracked (corrective quality)? Do the process operators collect and interpret their own statistical process control (SPC) operational data? Is machine-level configuration control practiced? Is the cost of engineering changes (EC's) tracked and allocated back to the responsible development entity that caused the EC? Are root cause failure analyses performed on Pareto-significant manufacturing line problems? Are root cause failure analyses performed on Pareto-significant field problems? Is there a continuing focus on eliminating test escapes so as to find problems when they are created rather than when the customer receives the system? Is a lessons-learned database maintained based upon problem post mortem analysis? Are lessons learned fed back to development personnel at the corresponding development phase where particular, significant fault types have been found to occur? Will this FRU have an (expected) yield of over 90%? Are examples of field manufacturing defects displayed for production personnel? Do manufacturing people have current awareness of the field performance of their products, in terms of problem types and problem rates? Are the manufacturing processes based upon sensitivity analyses, process Failure Modes and Effects Analysis (FMEA) or Design of Experiments (DOE)? Has a declared manufacturing vision that incorporates reliability and quality been established, documented and communicated to personnel? Is leadership rotated among manufacturing personnel participating in a quality circle? Does management promote quality circles with continuous measurable improvement (CMI) targets? Do employees' personal development/assessment plans emphasize product and process quality? Are team-building exercises promoted as lead-ins to the production phases? Do manufacturing personnel get 40 hours of training a year? Do you visit suppliers, review their processes, and make suggestions for process improvement? Do you invite suppliers, or customers to review your company's processes and allow them to suggest ways the company can do things better? Does manufacturing participate in design reviews? Is management aware and involved in day to day manufacturing operations on a regular basis? Is management located in proximity to line people and accessible to them? Do part suppliers manage their stock at your production facility? Will this product be built on an existing manufacturing line vs. a new manufacturing process that will have to be developed to support the manufacture this product?
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 338
Gij
Wij
Yes No Yes No Yes No Yes No Yes No Yes No Yes No Yes No
= = = = = = = = = = = = = = = =
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
Yes No Yes No Yes No
= = = = = =
1 0 1 0 1 0
Yes No Yes No Yes No Yes No Yes No Yes No Yes No Yes No Yes No Yes No Yes No
= = = = = = = = = = = = = = = = = = = = = =
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
Yes No Yes No Yes No
= = = = = =
1 0 1 0 1 0
Yes = 1 No = 0
3 5 5 4 6 7 6 6 4 3 3 4 6 4 2 3 5 3 2 3 3 5 4 4 3 6
Chapter 7: Examples
Table 7.2-24: Manufacturing Process Grade Factor Questions (continued) Question Has the manufacturing process been mistake- proofed? Is there an EC budget for this FRU, and will results be measured against this budget? Do manufacturing personnel know the projected average cost of an EC once product is in the field? Has a demand-based, pull system been established for manufacturing processing stations? Are Printed Wiring Boards (PWBs) conformal coated? Are there tighter tolerances than 0.020" on unaided hand assembly operations associated with manufacturing this FRU, or integrating it into the next higher level assembly? Are there tighter tolerance requirements than 0.005" for fixtured assembly operations with measurement capability? Are there tighter tolerances than 0.0005" for automated assembly operations? Is the manufacturing process documented? Has manufacturing provided a product design checklist of their concerns to the development team at the start of development? Is the checklist identified above reviewed for compliance at each development milestone review? Is 90% test coverage achieved on the components in this FRU? Is a shipping test performed on samples of the packaged product? Are FRUs burned in for at least 24 hours? Is a "gold standard" of the qualified item maintained for regression test purposes? Is Design of Experiments (DOE) used in setting up and controlling testing? Is there an Operational Reliability Test conducted to simulate the customer application?
How many elements (0 to 4) of environmental stress screening (ESS) are run: 1. temperature bake, 2. temperature cycle, 3. temperature shock, 4. vibration?
Are production test stress screens conducted? Does this PWB have fewer than 6 layers? Is this PWB small enough so that it cannot bow or "oil can" in handling and usage? Is it a one-sided or two-sided board? If this is a PWB, are the majority of components attached via methods other than surface mount technology (SMT)? Does this FRU have at least 25% fewer solder joints than its predecessor or competitor?
Reliability Information Analysis Center 339
Gij
Wij
Yes No Yes No Yes No Yes No Yes No Yes No Yes No Yes No
= = = = = = = = = = = = = = = =
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
Yes No Yes No Yes No
= = = = = =
1 0 1 0 1 0
Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 0 = 0 1 = .25 2 = .5 3 = .75 4 = 1 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 1 = 1 2 = 0 Yes = 1 No = 0 Yes = 1 No = 0
6 3 3 4 3 3 3 3 5 5 5 5 4 5 4 7 5
5
5 3 3 3 3 3
Chapter 7: Examples
Table 7.2-24: Manufacturing Process Grade Factor Questions (continued) Question
Gij
Wij
100 = 1 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0
What is the solder joint spacing (in mils) (thresholds at 30,50,100)?
Is ball grid array (BGA) technology excluded from this design? Has your organization previously implemented BGA technology into a design? Have card insertion guides been used in the design? Have board stiffeners been used in the design?
5
3 6 3 3
7.2.5.5. Part Quality Process Grade Factor Questions
Table 7.2-25: Part Quality Process Grade Factor Questions Question
Gij
Wij
Is there a documented part selection and part management process?
Yes = 1 No = 0
5
Is there a Preferred Parts List (PPL)?
Yes = 1 No = 0
5
Are part evaluation and qualification processes established to add parts to the PPL?
Yes = 1 No = 0
5
Does a cross-functional development team (CFDT) review and approve new candidate parts for addition to the PPL?
Yes = 1 No = 0
5
Is this a commercial off-the-shelf (COTS) purchased assembly with a good history of operational reliability? If the assembly is not COTS, select "Yes".
Yes = 1 No = 0
6
Will new parts be excluded from being added to the PPL to design this FRU?
Yes = 1 No = 0
4
Are procedures in place to detect part problems in both manufacturing and the field?
Yes = 1 No = 0
5
Are quality and reliability data tracked on parts and fed back to suppliers so they know their performance on this product?
Yes = 1 No = 0
5
Is there a design compliance checklist to ensure that all parts are properly applied, operating at sufficient margin with respect to environmental and operational stresses, and take into account lessons learned?
Yes = 1 No = 0
6
Are there processes in place that specifically address precautions and handling of parts/components susceptible to electrostatic discharge (ESD)?
Yes = 1 No = 0
5
Do part specifications reflect environmental and regulatory compliance requirements for the specific intended application?
Yes = 1 No = 0
5
Has mechanical interfacing of critical parts been facilitated by providing mating parts/assemblies to the part supplier?
Yes = 1 No = 0
4
Is there an end of life plan to recycle or dispose of this part?
Yes = 1 No = 0
4
Are teaming relationships established with all critical component suppliers?
Yes = 1 No = 0
6
Are critical parts ISO 9000 certified?
Yes = 1 No = 0
4
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 340
Chapter 7: Examples
Table 7.2-25: Part Quality Process Grade Factor Questions (continued) Question
Gij
Wij
Are critical parts QS 9000 (automobile manufacturer certification) certified?
Yes = 1 No = 0
6
In the case of commercial off-the-shelf (COTS) equipment, is the purchased assembly certified and marked to sell in Europe (CE marked)? If the assembly is not COTS, select “Yes”.
Yes = 1 No = 0
6
Is the FRU under configuration management control by the time it enters system test?
Yes = 1 No = 0
5
Are critical parts burned in for at least 24 hours?
Yes = 1 No = 0
5
Will the supplier manage the developer’s inventory, in the case of high volume production?
Yes = 1 No = 0
4
Will all suppliers provide timely failure reporting and corrective action support (FRACAS) for both critical and custom parts (timely reporting implies a 2 week turnaround with faster response on priority demand)?
Yes = 1 No = 0
4
Have vendor dependencies been identified for critical and custom components?
Yes = 1 No = 0
4
Have suppliers identified the likely failure modes on critical and custom parts, and does the design take these failure modes into account?
Yes = 1 No = 0
4
Are operational failure rate and failure mode data provided by the suppliers of critical and custom parts being used?
Yes = 1 No = 0
4
Is there a part control drawing for critical and custom parts?
Yes = 1 No = 0
5
Is there a device specification for all critical and custom parts?
Yes = 1 No = 0
5
Has the supplier reviewed the part application for all critical and custom parts?
Yes = 1 No = 0
4
Has the developer met with suppliers to discuss the application of all critical and custom parts?
Yes = 1 No = 0
4
Has a supplier’s technical point of contact (POC) been identified for addressing reliability concerns?
Yes = 1 No = 0
3
Will critical suppliers provide timely notice of impending part changes to allow the developer to assess the impact?
Yes = 1 No = 0
4
Is a change history log maintained to provide traceability of engineering change actions and their associated rationale for critical and custom parts?
Yes = 1 No = 0
5
Will part identification (revision numbers) be shown on the part to identify the particular part configuration, including the level of the part’s firmware?
Yes = 1 No = 0
4
Will suppliers routinely update firmware on parts returned for repair?
Yes = 1 No = 0
4
If suppliers update firmware will the part identification reflect this change?
Yes = 1 No = 0
4
Will suppliers’ part support timing horizon meet program development, manufacture, and field support component requirements?
Yes = 1 No = 0
4
Will vendor provide timely notice of production/support cessation and provide an “end of life” buy opportunity?
Yes = 1 No = 0
4
Will future releases of this part be compatible with respect to form, fit and function?
Yes = 1 No = 0
4
Is there a first article inspection and acceptance test planned?
Yes = 1 No = 0
4
Do critical and custom parts on this FRU all have at least a 12-month warranty?
Yes = 1 No = 0
4
Have likely part developments, evolution, and extensions of critical/custom parts been identified by the supplier?
Yes = 1 No = 0
6
Are there 32 Kbytes or more of firmware embedded in this FRU?
Yes = 1 No = 0
4
Reliability Information Analysis Center 341
Chapter 7: Examples
Table 7.2-25: Part Quality Process Grade Factor Questions (continued) Question
Gij
Wij
Have development personnel meet with supplier's technical personnel?
Yes = 1 No = 0
4
Has a functional block diagram been developed for COTS or purchased complex part assemblies?
Yes = 1 No = 0
4
Has a failure history been collected for critical parts, complex assemblies, or COTS items?
Yes = 1 No = 0
4
Have key suppliers identified their part failure mechanisms?
Yes = 1 No = 0
4
Have suppliers, in the case of complex part assemblies, supported the developer in performing a Failure Modes and Effects Analysis (FMEA) on those assemblies?
Yes = 1 No = 0
6
Have the sources and the extent of part variation been identified?
Yes = 1 No = 0
5
Have mitigations been identified to handle the effects of part's variations?
Yes = 1 No = 0
5
Do you know the supplier's dependencies and needs?
Yes = 1 No = 0
4
Will a design of experiments part evaluation, considering variations, as well as manufacturing variations, be conducted?
Yes = 1 No = 0
6
Have mechanical interfacing components been provided to the key vendors to assure proper mechanical mating?
Yes = 1 No = 0
5
Will the developer's quality organization audit suppliers' processes and facility capabilities?
Yes = 1 No = 0
5
Will the developer receive notice of pending part changes?
Yes = 1 No = 0
5
Will the developer have approval rights of part changes?
Yes = 1 No = 0
4
Are procedures and processes in place for the identification and handling of critical reliability components (derating, screening, failure response, etc.)?
Yes = 1 No = 0
8
7.2.5.6. System Management Process Grade Factor Questions
Table 7.2-26: System Management Process Grade Factor Questions Question
Gij
Wij
Does the customer participate with the developer in developing/validating a requirements statement?
Yes = 1 No = 0
5
Is Quality Function Deployment (QFD) used to help develop requirements and requirements traceability?
Yes = 1 No = 0
7
If QFD is not used, is there another systematic way used, such as a Pugh chart, to identify and document customer needs and preferences?
Yes = 1 No = 0
4
Is there a system specification?
Yes = 1 No = 0
5
Does an "operations concept" document exist?
Yes = 1 No = 0
6
Has a comprehensive literature study been done of relevant design and reliability technology advancements?
Yes = 1 No = 0
5
Have previous or similar products been reviewed for their advantages and pitfalls?
Yes = 1 No = 0
4
Has a "lessons learned" database been studied to ensure the product will not repeat past problems?
Yes = 1 No = 0
5
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 342
Chapter 7: Examples
Table 7.2-26: System Management Process Grade Factor Questions (continued) Question
Gij
Wij
Have aggressive requirements (particularly reliability, availability, and/or safety) been explicitly specified?
Yes = 1 No = 0
10
Have regulatory agency compliance requirements been included?
Yes = 1 No = 0
5
Does the requirements definition also account for what the product is supposed to "not do" (for example, air bags should not deploy except on impact)?
Yes = 1 No = 0
5
Is there a plan as to how to retire or recycle this new system at the end of its life?
Yes = 1 No = 0
2
Does a requirements database exist to capture opportunistic requirements for future consideration?
Yes = 1 No = 0
4
Have future expansion requirements been identified (such as loading growth) and can the system handle the projected growth in demand?
Yes = 1 No = 0
4
Are requirements deemed achievable within program budget and schedule restraints, with a 90% confidence level?
Yes = 1 No = 0
5
Are product requirements allocated to a useful level of indenture (considering complexity, level of design flexibility, and safety concerns)?
Yes = 1 No = 0
5
Has a project level Failure Modes and Effects Analysis (FMEA) been done in conjunction with designers and system engineers at the planning stage?
Yes = 1 No = 0
5
Will the FMEA be refined down to the Field Replaceable Unit (FRU) level during design?
Yes = 1 No = 0
5
Does this product have to meet CE (European) standards?
Yes = 1 No = 0
6
Have likely product extensions been identified in the planning stage?
Yes = 1 No = 0
3
Are creativity and team building exercises being conducted during the planning stage?
Yes = 1 No = 0
2
Are future product releases planned in order to systematically integrate new requirements and features?
Yes = 1 No = 0
4
Are trade studies shared with the customer to broaden the base of inputs and support for design decisions?
Yes = 1 No = 0
5
Does a vision statement that speaks to reliability exist for the product?
Yes = 1 No = 0
5
Does a functional block diagram exist for this system?
Yes = 1 No = 0
4
Do sketches, drawings, or models exist for the delivered product?
Yes = 1 No = 0
4
Is the development team provided guidelines for acceptable deliverables at kick-off meetings for each development stage?
Yes = 1 No = 0
6
Are prototypes planned for early design?
Yes = 1 No = 0
5
Is this design an incremental improvement over an existing design?
Yes = 1 No = 0
6
Will state diagrams be developed before detail design to depict control flows?
Yes = 1 No = 0
4
Will data flow diagrams be developed before detail design begins?
Yes = 1 No = 0
4
Are entity-relationship diagrams developed prior to detail design?
Yes = 1 No = 0
4
Will a list identifying the capabilities and advantages that this product provides the customer be developed and maintained?
Yes = 1 No = 0
2
Is there a system transition plan to replace the current system with the new system, in a smooth, non-disruptive manner? When not applicable select “Yes”.
Yes = 1 No = 0
8
Reliability Information Analysis Center 343
Chapter 7: Examples
Table 7.2-26: System Management Process Grade Factor Questions (continued) Question
Gij
Wij
Are requirements allocated to a useful level of indenture (considering complexity, level of design flexibility, and design autonomy)?
Yes = 1 No = 0
5
Are requirements verification activities planned for the appropriate stages of product development?
Yes = 1 No = 0
5
Are entrance and exit criteria established for each development stage?
Yes = 1 No = 0
5
Is requirements traceability verified and maintained throughout development?
Yes = 1 No = 0
5
Is requirements compliance verified prior to the exit of each phase and prior to shipment?
Yes = 1 No = 0
5
Are test cases developed concurrently with the design and reviewed by the designers?
Yes = 1 No = 0
5
Is there a log of key product decisions and accompanying rationale for traceability?
Yes = 1 No = 0
6
Does the specified reliability represent an improvement of 10% or greater over its predecessor or competitive products?
Yes = 1 No = 0
5
Is there definition and agreement as to what constitutes successful product reliability performance by the customer?
Yes = 1 No = 0
5
Can this product be built using existing manufacturing processes (line)?
Yes = 1 No = 0
8
Are development and reliability requirements developed by a cross-functional development team (CFDT)?
Yes = 1 No = 0
8
Are system issues routinely documented as action items?
Yes = 1 No = 0
5
Do design reviews have technical representation from all interfacing areas?
Yes = 1 No = 0
5
Is prototype interconnection hardware routinely provided to interfacing subsystems and suppliers to guide their packaging?
Yes = 1 No = 0
3
Is there a requirement to detect and isolate faults to a single FRU 90% of the time?
Yes = 1 No = 0
6
Is there a system failure modes and effects analysis (FMEA) done during planning stage, and is it updated throughout the program?
Yes = 1 No = 0
6
Customer and process Q1: Have I identified who are my internal customers and my external customers?
Yes = 1 No = 0
3
Customer and process Q2: Have I identified what deliverables my customers need (plans, prototypes, documentation,…)?
Yes = 1 No = 0
3
Customer and process Q3: Do I know when my customers require my deliverables?
Yes = 1 No = 0
2
Customer and process Q4: Is there a customer centered quality initiative that will be incorporated to differentiate your deliverables?
Yes = 1 No = 0
3
Customer and process Q5: Is there an identified tool or process improvement that the reliability section or the development organization will gain from this effort?
Yes = 1 No = 0
3
Have the customers been notified and concur on items Q1-Q3 above?
Yes = 1 No = 0
4
Is there a database that documents cross-functional dependencies that is managed to closure?
Yes = 1 No = 0
6
Is a database on cross-functional dependencies maintained?
Yes = 1 No = 0
5
Do the developers, reviewers, testers, QA, manufacturing, and customer program office, all share in the accountability for getting a successful program to the field?
Yes = 1 No = 0
5
Are developers and the entire product team rated or rewarded based upon the field performance of the product?
Yes = 1 No = 0
6
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 344
Chapter 7: Examples
Table 7.2-26: System Management Process Grade Factor Questions (continued) Question
Gij
Wij
Are there designated points of contact in each product development area?
Yes = 1 No = 0
3
Are there designated facilitators to manage cross-system issues?
Yes = 1 No = 0
5
Are there checklists covering reliability concerns for each program phase?
Yes = 1 No = 0
4
Is the technical staff encouraged to talk directly to its customer counterparts?
Yes = 1 No = 0
6
Are periodic informal activities, such as brown bag lunches, promoted to encourage team member technical exchange in an informal atmosphere?
Yes = 1 No = 0
2
Can a technical employee call for a technical review board of peers when it is felt appropriate to address a broad-impact technical concern?
Yes = 1 No = 0
5
Does this equipment not require an interface with other vendors' equipment or government furnished equipment (GFE)?
Yes = 1 No = 0
5
Is the % of product reuse from previous products 25% or more of the lines of code for software?
Yes = 1 No = 0
6
Is the % of product reuse from previous products 50% or more of the FRU count or cost for hardware?
Yes = 1 No = 0
6
Do program planning sessions have cross-functional representation?
Yes = 1 No = 0
5
Do technical reviews have cross-functional representation?
Yes = 1 No = 0
5
Are there program development plans that show timing of activities and deliverables (this should be done during the requirements phase and maintained throughout the program)?
Yes = 1 No = 0
4
Is there a non-management person designated to work full-time as the program technical lead, who works as a cross-team facilitator?
Yes = 1 No = 0
8
Is there a team-building effort and project brain storming at each program phase?
Yes = 1 No = 0
3
Are documentation products maintained on-line and accessible to all program personnel?
Yes = 1 No = 0
5
Is there a program database of "action items" that is maintained and managed to closure?
Yes = 1 No = 0
6
Is there a formal documented change process?
Yes = 1 No = 0
5
Are self-audits periodically performed on the change process?
Yes = 1 No = 0
4
Are business cases always run to evaluate the benefits and impacts of making a change (e.g., Reinertsen's model)?
Yes = 1 No = 0
5
Are total cost estimates made for ECs, including scrap, rework, tooling, and the potential slippage of schedule?
Yes = 1 No = 0
5
Are there two or less EC's planned during the first year of shipping?
Yes = 1 No = 0
5
Are ECs blocked into sections and scheduled ahead on periodic intervals to promote timely integration of changes?
Yes = 1 No = 0
6
Are ECs at or below the plan to date?
Yes = 1 No = 0
3
Do change review meetings have cross-functional representation?
Yes = 1 No = 0
5
Are there any ECs that are modifying previous ECs on the FRU?
Yes = 1 No = 0
5
Is there an EC meeting log maintained that includes the change rationale, the analysis provided, and meeting participants?
Yes = 1 No = 0
5
Reliability Information Analysis Center 345
Chapter 7: Examples
Table 7.2-26: System Management Process Grade Factor Questions (continued) Question
Gij
Wij
Are EC management metrics collected with a focus on continual, measurable process improvement?
Yes = 1 No = 0
5
Are the program development, integration, and test activities charted, showing tasks, their timing, operational dependencies and identification of critical path activities?
Yes = 1 No = 0
5
Are critical path elements identified (e.g., long-lead items)?
Yes = 1 No = 0
5
Is there a focus to get items off the critical path?
Yes = 1 No = 0
5
Are there risk assessment and contingency plans to minimize critical path risk?
Yes = 1 No = 0
4
Has the program met its targeted dates so far?
Yes = 1 No = 0
5
Is this product architecture based upon a distributed architecture?
Yes = 1 No = 0
8
Are there no future product inventions required on this program?
Yes = 1 No = 0
8
Are the R&M design goals sufficiently defined and allocated to ensure that customer needs are met?
Yes = 1 No = 0
5
Has development committed to support the required tasks for meeting the customer's R&M needs?
Yes = 1 No = 0
5
Does the design approach emphasize R&M as a major goal?
Yes = 1 No = 0
5
Has an agreed-to process been defined to assess progress towards meeting R&M goals and requirements?
Yes = 1 No = 0
5
Have adequate means been agreed upon to ensure that the R&M objectives of the product will have been achieved?
Yes = 1 No = 0
5
Have processes been defined and implemented to ensure that the designed-in (inherent) reliability does not degrade during manufacturing and operational use?
Yes = 1 No = 0
5
7.2.5.7. Can Not Duplicate (CND) Process Grade Factor Questions
Table 7.2-27: Can Not Duplicate (CND) Process Grade Factor Questions Question Is the system required to isolate to a single Field Replaceable Unit (FRU) on 90% % of failures? Is there a specified time limit to isolate a fault, effect a repair and restore the system? Is there a requirement for 90% or greater test coverage within the FRU being analyzed? Does the system promote remote serviceability with failure status communicated via Ethernet, serial port, parallel port, serial bus, etc., to a central maintenance station? Is there any remote failure protection for this FRU residing on a separate FRU (e.g., an arc suppression circuit that is located on a different FRU than the relay FRU)? Is this FRU designed to be hot-pluggable? Does the FRU designer also design the fault isolation software that supports fault diagnosis? Are multiple occurrences of "Can Not Duplicate" (CND) incidents analyzed for root cause of the problem?
Gij Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 346
Wij 6 5 6 4 5 4 6 6
Chapter 7: Examples
Table 7.2-27: Can Not Duplicate (CND) Process Grade Factor Questions Question Are test, warranty, early-life, and high fallout FRUs subjected to double fault verification (this procedure re-inserts the faulted FRU to ensure the problems track the replaced FRU)? Do your current products experience 40% or less Can Not Duplicate (CND) failures (note that CNDs are synonymous with No Defects Found (NDF) and No Trouble Found (NTF))? Is a failure mode and effect analysis (FMEA) performed down to the FRU level or the Circuit Card Assembly (CCA) level, whichever is lower? Do design personnel participate directly in performing the FMEA? Are maintenance analysis procedures (MAPs) developed to map failure symptoms to the failing FRU? Are the MAPs verified by inserting faults in a maintainability test? Are the MAPs updated with actual test and field data? Has your company established the cost impact of a field failure? Does the system contain error logging and reporting capability? Does the system promote ongoing analysis of soft error conditions that might predict when a likely failure will occur? Will the contractor developing this equipment also be responsible for maintaining it? Does the repair facility have the ability to recreate the conditions under which a true false alarm occurred (sequence of events, operator error, sneak circuit, etc.) and are these techniques used to try to recreate the failure? Does the repair facility have the ability to recreate the conditions under which a real failure occurred (high/low temperature, thermal cycling/shock, vibration/ mechanical shock, etc.) and are these techniques used to try to recreate the failure? Will the maintainer be motivated to provide timely and complete documentation of the diagnosis and repair action? Do the system maintenance personnel receive feedback on their repair reports and the actions taken to mitigate the failure reoccurrence? Are the performance specification limits of the test equipment used to troubleshoot/repair the system, FRU, etc., equal to or more stringent than the performance specification limits of the system, FRU, etc., in its actual application? Are CND failures included in the Failure Reporting and Corrective Action System (FRACAS) system and closed out through corrective action verification?
Gij Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0
Wij 10 8 5 5 5 4 5 3
Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0
5 4 5
Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0
5 5 5 5 5 5
7.2.5.8. Induced Process Grade Factor Questions
Table 7.2-28: Induced Process Grade Factor Questions Question
Gij
Wij
Are parts/materials selected, as appropriate to meet design performance requirements that minimize the risk of induced failure through electrostatic discharge?
Yes = 1 No = 0
6
If parts/materials are susceptible, are procedures used to protect them during handling, test, assembly, packaging, storage, transportation and use (i.e., wrist straps, non-conductive work areas, ionized air, warning labels, maintenance manuals, etc.)?
Yes = 1 No = 0
4
Are electronic circuits designed and analyzed to minimize secondary failures attributable to electrical overstress resulting from another primary failure?
Yes = 1 No = 0
4
Are electronic circuits designed and analyzed to minimize secondary failures attributable to electrical transients generated within the system/FRU, or received from outside the system/FRU (via cable/wiring harnesses)?
Yes = 1 No = 0
6
Are maintenance manuals/procedures written such that risk of Electrostatic Discharge/Electrical Overstress (ESD/EOS) during troubleshooting and repair activity is identified (warning labels, etc.)?
Yes = 1 No = 0
4
Reliability Information Analysis Center 347
Chapter 7: Examples
Table 7.2-28: Induced Process Grade Factor Questions (continued) Question
Gij
Wij
Has the operating environment that the part/FRU/system is to be used in been evaluated to determine the potential for mishandling of the equipment that could result in induced mechanical failure (weather; personnel capabilities; training needs)?
Yes = 1 No = 0
6
Are parts/materials selected, as appropriate to meet design performance requirements that minimize the risk of induced (mechanical) secondary failure resulting from the primary failure of another part/assembly?
Yes = 1 No = 0
4
If parts/materials are susceptible to induced mechanical damage, are procedures in-place to protect them during handling, test, assembly, packaging, storage, transportation and use?
Yes = 1 No = 0
4
Is the part, FRU, and/or system designed such that it can be handled and transported in a manner that minimizes the risk of induced mechanical failure (proper location/use of handles; orientation labels – “This Side Up”; etc.)?
Yes = 1 No = 0
4
Are shipping tests run to ensure adequacy of packaging and shipping procedures to protect the product during transportation?
Yes = 1 No = 0
4
Are maintenance manuals/procedures written such that the risk of induced mechanical damage during troubleshooting and repair activity is identified (warning labels, etc.)?
Yes = 1 No = 0
4
Do maintenance manuals include detailed instructions for removing and replacing parts/components/assemblies from sockets and/or soldered PCB and multiplayer boards, etc.?
Yes = 1 No = 0
4
Do maintenance manuals include detailed instructions for disconnecting and reconnecting wires, harnesses, cables, hoses, etc.?
Yes = 1 No = 0
4
Is the FRU/system ergonomically designed such that it can be used by the customer in normal operation without unnecessary risk of induced mechanical damage?
Yes = 1 No = 0
4
Is the FRU designed to withstand normal handling and expected mishaps (e.g., a drop off a 36-inch high table top) without induced mechanical damage?
Yes = 1 No = 0
4
Are wires color coded, and connectors keyed or of differing configuration such that FRUs cannot be misplugged?
Yes = 1 No = 0
4
7.2.5.9. Wearout Process Grade Factor Questions
Table 7.2-29: Wearout Process Grade Factor Questions Question
Gij
Wij
Have all parts and materials been selected for use in the design that extend the wearout life of the part/Field Replaceable Unit (FRU)/system to meet/exceed its required useful life?
Yes = 1 No = 0
6
Has the expected reliability of parts subjected to significant mechanical loading been modeled to ensure the capability to endure the mission, e.g., using Miner’s life expectation rule for components subjected to cyclical loads?
Yes = 1 No = 0
4
Have wearout failure modes and mechanisms at the part, FRU and system level been identified and mitigated during the Failure Modes and Effects Analysis (FMEA) process?
Yes = 1 No = 0
6
Do the relevant failure modes/mechanisms include fatigue (solder joints for electronic components/assemblies; welds for bonded materials; fractures in mechanical parts/assemblies/materials; etc.)?
Yes = 1 No = 0
4
Do the relevant failure modes/mechanisms include leaks (electrolyte loss in electrolytic capacitors; worn seals in hydraulic systems; etc.)?
Yes = 1 No = 0
4
Do the relevant failure modes/mechanisms include chafing (wires in electrical harnesses; wear in hydraulic lines and hoses; etc.)?
Yes = 1 No = 0
4
Do the relevant failure modes/mechanisms include cold flow of insulation (wires wrapped around sharp edges or subjected to pressure points; etc.)?
Yes = 1 No = 0
4
Do the relevant failure modes/mechanisms include wearout resulting from cyclic operations (activation in electronic switch/relay contacts; mating/unmating of electronic or mechanical connectors; etc.)?
Yes = 1 No = 0
4
Do the relevant failure modes/mechanisms include wearout resulting from breakdown of insulation in wires, or dielectric materials in semiconductors?
Yes = 1 No = 0
4
Do the relevant failure modes/mechanisms include wearout resulting from moving parts (bearings, gears, belts, springs, seals, etc.)?
Yes = 1 No = 0
4
Has the system/FRU/part design been modified based on the wearout modes/mechanisms identified in the FMEA to reduce or minimize their occurrence to the maximum extent feasible?
Yes = 1 No = 0
4
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 348
Chapter 7: Examples
Table 7.2-29: Wearout Process Grade Factor Questions (continued) Question
Gij
Wij
Are process FMEAs performed to determine the failure modes/mechanisms of critical processes during manufacturing?
Yes = 1 No = 0
4
Is data collected and analyses performed to determine the process capability of manufacturing processes?
Yes = 1 No = 0
4
Is statistical process control (SPC) applied to manufacturing processes to control the process mean and variability?
Yes = 1 No = 0
4
Is the measured mean of each manufacturing process parameter equal to, or better than, the parameter value used to calculate the wearout failure rates of the system/FRU parts/components?
Yes = 1 No = 0
4
If required, has this product been hardened to withstand adverse environmental stresses such as corrosion, radiation, humidity, etc.?
Yes = 1 No = 0
4
Are procedures defined/implemented to ensure that assembly/test steps during manufacturing do not contribute to early wearout of susceptible items (i.e., minimize connector matings/unmatings; stress relief/tie downs to minimize chafing during test; etc.)?
Yes = 1 No = 0
6
Do maintenance manuals/procedures instruct repair personnel to check that wire harnesses are properly secured, seals are properly reinstalled, connectors are properly mated, etc., following troubleshooting/repair?
Yes = 1 No = 0
4
Is preventative maintenance planned to replace wear out-susceptible parts/materials at or before their L10 life (where no more than 10% of the units should experience wearout)?
Yes = 1 No = 0
6
Are wearout-susceptible parts/materials inspected during each corrective maintenance action to find and replace items exhibiting premature wearout?
Yes = 1 No = 0
4
Are wearout-susceptible parts/materials inspected during each preventive maintenance action to find and replace items exhibiting premature wear out?
Yes = 1 No = 0
4
Are wearout failures (both valid and premature) included in the Failure Reporting and Corrective Action System (FRACAS) and closed out through corrective action, which could include life-extension opportunities?
Yes = 1 No = 0
4
Is field data tracked and analyzed to detect FRUs displaying increasing failure rate tendencies, i.e., wearout?
Yes = 1 No = 0
4
7.2.5.10.
Growth Process Grade Factor Questions
Table 7.2-30: Growth Process Grade Factor Questions Question Is there an effective Failure Reporting and Corrective Action System (FRACAS) in place for the fielded system? What is the percentage of field failures for which the root cause is determined? Is analysis performed to determine if the failure is recurring? Are design, manufacturing, or system management related potential corrective actions identified? Are the original designers or manufacturing personnel consulted regarding the potential corrective action? Is there a field support infrastructure in place that can affect the necessary changes? Are systems adequately tested to insure that the changes were made properly without inducing other defects or damage?
Reliability Information Analysis Center 349
Gij Yes = 1 No = 0 Yes = 1 No = 0 G = percentage/100 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0 Yes = 1 No = 0
Wij 8 8 6 6 4 10 5
Chapter 7: Examples
7.3. Life Modeling Example 7.3.1. Introduction
This section presents the results of an analysis in which the intent was to quantify the reliability of a seal used in an assembly. The approach taken in the analysis was to perform life tests under a variety of conditions, and to develop life models from this data so that lifetimes could be predicted as a function of the appropriate stress and product variables. In this manner, estimates of reliability under a wide range of use conditions could be made. This is an example of an assessment methodology, the results of which would be more accurate than a prediction method applied to the seal. If the analyst is able to develop a model like the one presented here for a specific component or failure cause, the resulting model should be weighed more heavily than a prediction on the specific component. 7.3.2. Approach
All samples were tested under a variety of temperature and relative humidity conditions. In addition, samples included two factors which were varied in the life tests: Process Force and Hardness. These stresses and product/process variables were expected to be the ones that most heavily influenced the product reliability. 7.3.3. Reliability Test Plan
The Reliability Test Plan required that the lifetime be measured at various magnitudes of these variables, such that life model parameters (including acceleration factors) could be quantified. Table 7.3-1 summarizes, for each variable, the number of levels, and the level values.
Table 7.3-1: Parameter Levels Variable Temperature Humidity Process force Hardness
Number of Levels 2 2 2 3
Levels 85, 130 C 85, 100% 2, 20 N 25, 50, 100 V
Table 7.3-2 summarizes the tests performed.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 350
Chapter 7: Examples
Table 7.3-2: Test Plan Summary Sample Size 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
Temperature
Humidity
Hardness
Process Force
85 85 85 85 85 85 130 130 130 130 130 130 130 130 130 130 130 130
85 85 85 85 85 85 85 85 85 85 85 85 100 100 100 100 100 100
25 50 100 25 50 100 25 50 100 25 50 100 25 50 100 25 50 100
2 2 2 20 20 20 2 2 2 20 20 20 2 2 2 20 20 20
The tests were performed by first inspecting each sample, then exposing them to the specific combination of variables as previously summarized, and, finally, re-inspecting them at various intervals. The exposure times and inspection intervals were structured such that short lifetimes could be observed in the event that acceleration factors were higher than anticipated. Therefore, more frequent inspections were performed early in the test, followed by less frequent inspections for the surviving samples. Failed samples were removed from the test. Data was then summarized in a format suitable for life modeling. The required data elements included stress and product/process variables, plus life variables, as follows:
•
Variables: o Temperature o Humidity o Process force o Hardness Reliability Information Analysis Center 351
Chapter 7: Examples
•
Life variables o Last known good time o First known bad time
7.3.4. Results 7.3.4.1. Times to Failure Summary
The test results for the seal samples are presented in Table 7.3-3. Included in this table is the sample number, the temperature (in degrees C), the relative humidity, the Hardness, the process force, whether the sample failed (F) or survived (S), and the time at which it failed or survived.
Table 7.3-3: Life Test Results T
RH
Thickness
Speed
F or S
Time to F/S
T
RH
Thickness
Speed
F or S
Time to F/S
85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85
85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85
25 25 25 25 25 25 25 25 25 25 25 25 25 25 50 50 50 50 50 50 50 50 50 50 50 50 50 50 100 100 100 100 100 100 100 100
2 2 2 2 2 2 2 20 20 20 20 20 20 20 2 2 2 2 2 2 2 20 20 20 20 20 20 20 2 2 2 2 2 2 2 20
S S S F S S S S S S S S S S S S S S S S S S S S F S S S S S S S S S S S
1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 778 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159 1159
85 85 85 85 85 85 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130
85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85
100 100 100 100 100 100 25 25 25 25 25 25 25 25 25 25 25 25 25 25 50 50 50 50 50 50 50 50 50 50 50 50 50 50 100 100
20 20 20 20 20 20 2 2 2 2 2 2 2 20 20 20 20 20 20 20 2 2 2 2 2 2 2 20 20 20 20 20 20 20 2 2
S S S S S S F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
1159 1159 1159 1159 1159 1159 278 158 130 237.5 158 196.5 130 158 196.5 237.5 428 237.5 130 158 237.5 196.5 278 196.5 278 158 158 196.5 158 196.5 428 158 237.5 158 278 158
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 352
Chapter 7: Examples T
RH
Thickness
Speed
F or S
Time to F/S
T
RH
Thickness
Speed
F or S
Time to F/S
130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130
85 85 85 85 85 85 85 85 85 85 85 85 85 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
100 100 100 100 100 100 100 100 100 100 100 100 100 25 25 25 25 25 25 25 25 25 25 25 25 25 25 50 50 50 50
2 2 2 2 2 2 20 20 20 20 20 20 20 2 2 2 2 2 2 2 20 20 20 20 20 20 20 2 2 2 2
F F F F F F F F F F F F F F F F F F F F F S F F F F F F S F S
278 220 371 278 325 428 58 325 428 325 428 278 196.5 58 59 34 58 34 34 58 58 70 34 34 58 1.5 58 58 70 58 70
130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130
100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
50 50 50 50 50 50 50 50 50 50 100 100 100 100 100 100 100 100 100 100 100 100 100 100
2 2 2 20 20 20 20 20 20 20 2 2 2 2 2 2 2 20 20 20 20 20 20 20
F F S F S F F F F F S F F F F F F S S F F F F S
58 58 70 58 70 34 34 58 34 58 70 58 58 58 58 58 34 70 70 58 58 58 58 70
The 2-parameter Weibull distribution parameters for the TTF distributions for the samples are shown in Table 7.3-4.
Table 7.3-4: Times to Failure Distribution Parameters Test Condition 85C/85%RH 130C/85%RH 130C/100%RH
Characteristic Life 2109 268 62.1
Shape Parameter 5.1 2.71 3.2
The TTF distributions for each of the three test conditions are illustrated in Figure 7.3-1.
Reliability Information Analysis Center 353
Chapter 7: Examples Re lia So f t W e ibull+ + 7 - w w w . Re lia So f t. co m
Probability - Weibull
9 9. 0 00
P ro ba bility-W e ibull F o lio1 \SL-130 , 1 00 W e ibull-2P MLE SRM MED F M F = 3 3/S= 9 Da ta Po ints Susp Po ints Pro ba bilit y Line
9 0. 0 00
F o lio1 \SL-130 , 8 5 W e ibull-2P MLE SRM MED F M F = 4 3/S= 0 Da ta Po ints Pro ba bilit y Line
Unreliability, F(t)
5 0. 0 00
F o lio1 \SL-85, 85 W e ibull-2P MLE SRM MED F M F = 2 /S= 40 Da ta Po ints Pro ba bilit y Line
1 0. 0 00
5. 00 0
1. 00 0 1. 0 0 0
10 . 00 0
1 00 . 0 00
10 0 0. 0 00
Bill De nson Co rning 1 1 /24 /2 00 8 5 :0 5:2 2 PM 1 0 00 0. 00 0
T ime, (t) F olio 1\SL-13 0, 10 0: β =3 .2 1 8 3 , η=6 2 .1 1 9 5 F olio 1\SL-13 0, 85 : β= 2 .7 2 2 1 , η= 2 6 8 .2 4 7 9 F olio 1\SL-85 , 85: β =5 .0 5 0 5 , η=2 1 0 9 .0 6 3 5
Figure 7.3-1: Times To Failure Distributions 7.3.4.2. Life Models
Life models were generated from the data summarized above. These life models estimate the TTF distribution as a function of the variables used in the experiments. A general form of the Weibull reliability function used is:
R=e
⎛t ⎞ −⎜ ⎟ ⎝α ⎠
β
where: R= α=
the reliability, or probability of survival, at time “t” the Weibull characteristic life (i.e., the time to 63% failure) 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 354
Chapter 7: Examples
β=
the Weibull shape parameter
The characteristic life is then developed as a function of the applicable variables. The model form is: α0
α1
α = e e T RH α H α F α 2
4
3
Where: α 0 through α 4 = T= RH = H= F=
parameter coefficients estimated in the life modeling process temperature in degrees K (C+273) relative humidity ionic contamination the process force
Maximum likelihood analysis was performed to determine the values of β, α0, α 1, α 2, α 3, and α 4 that maximize the value of the above likelihood function. These parameter estimates then become the coefficients in the life model. The Likelihood Function is:
(
(
))
L = ∏ f (ti , β ,α 0 ,α1,α 2 ,α 3 ,α 4 ) * ∏ 1 − F t j , β ,α 0 ,α1,α 2 ,α 3 ,α 4 * ∏(F (tk , β ,α 0 ,α1,α 2 ,α 3 ,α 4 ))(1 − F (tk −1, β ,α 0 ,α1,α 2 ,α 3 ,α 4 )) where:
f = F= ti = tj tk and tk-1 =
Weibull pdf (probability density function) cumulative Weibull function (probability of failure) failure times survival times times that bracket the failure interval
The first of the three product terms represent failures at known times, the second represents survivals, and the third represent failures that occur within intervals but the precise failure times are not known Once the model parameters are estimated in this fashion, the reliability at any time, and for any combination of variables, can be estimated. Reliability Information Analysis Center 355
Chapter 7: Examples
The estimated parameters are summarized in Table 7.3-5. In this table, the best estimate is provided along with the 80% 2-sided confidence levels around the estimate. A small variation between the lower and upper confidence bound are indicative of significant variables.
Table 7.3-5: Estimated Parameter 80% 2-Sided Confidence Bounds Parameter
Lower 80% CL
Best Estimate
Upper 80% CL
β
2.737
3.073
3.450
α0
19.68
23.98
28.28
α1
6957.2
8015.7
9074.3
α2
-9.45
-8.83
-8.21
α3
0.131
0.215
0.299
α4
-0.0031
0.0388
0.0807
The resulting equation for the characteristic life is then:
α =e
23.98
e
8015.7 T
RH
−8.83
H
0.2150
F
0.0388
Once the model parameters are estimated, then a variety of output formats are possible. For example, Figure 7.3-2 illustrates the probability of failure as a function of temperature and relative humidity at a time of 50,000 hours.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 356
Chapter 7: Examples Unreliability vs Stress Surface
Figure 7.3-2: Probability of Failure vs. Temperature and Relative Humidity at 50,000 Hours
7.4. NPRD Description Information from the RIAC document “Nonelectronic Parts Reliability Data (NPRD)” (Reference 4) is presented here to provide the reader with: 1. An understanding of the issues involved in the collection and interpretation of field reliability data 2. A summary of alternatives that can be used to combine data from various sources The purpose of NPRD is to present failure rate data on a wide variety of electromechanical and mechanical parts and assemblies (including many types of electronic assemblies). While there are reliability prediction methodologies for standard electronic components such as MIL-HDBK-217 and the RIAC 217Plus methodology, Reliability Information Analysis Center 357
Chapter 7: Examples
there are few sources of failure rate data for other component types. All part types and assemblies for which RIAC has data are included in NPRD with the exception, of standard electronic component types. Although the data contained in NPRD were collected from a wide variety of sources, RIAC has screened the data such that only high quality data is added to the database and presented in this document. In addition, only field failure rate data is included. The intent of this section is to provide the user with information to adequately interpret and use data to supplement standard reliability prediction methodologies. It is not feasible for documents like MIL-HDBK-217 or other prediction methodologies to contain failure rate models on every conceivable type of component and assembly. Traditionally, reliability prediction models have been primarily applicable only for generic electronic components. Therefore, NPRD serves a variety of needs:
• •
To provide failure rates on assemblies in cases where piece-part level analyses are not feasible or required To complement other prediction methodologies by providing data on part types not addressed by its models
7.4.1. Data Collection
The failure rate data contained in the newest version of NPRD (NPRD-2010) will represent a cumulative compilation of data collected from the early 1970's through December 2008. RIAC is continuously soliciting new field data in an effort to keep the databases current. The goals of these data collection efforts are as follows: 1. To obtain data on relatively new part types and assemblies. 2. To collect as much data on as many different data sources, application environments, and quality levels as possible. 3. To identify as many characteristic details as possible, including both part and application parameters. The following generic sources of data were used for this publication: 1. 2. 3. 4. 5.
Published reports and papers Data collected from government-sponsored studies Data collected from military maintenance data collection systems Data collected from commercial warranty repair systems Data from commercial/industrial maintenance databases 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 358
Chapter 7: Examples
6. Data submitted directly from military or commercial organizations that maintain failure databases An example of the process by which RIAC identifies candidate systems and extracts reliability data on military systems is summarized in Table 7.4-1.
Table 7.4-1: Data Summarization Process (1)
Identify System Based On: • • • •
Environments/Quality Age Component Types Availability of Quality Data
(2)
Build Parts List:
Obtain Illustrated Parts Breakdown (IPB) Ensure Correct Version of System Consistent with Maintenance Data Identify Characteristics of Components (Part Numbers, Federal Stock Number, Vendor Catalogs, etc.) Enter Part Characteristics into Database
• • • •
(3)
Obtain Failure Data:
•
(4)
Obtain Operating Data:
(5)
Transform Data to Common RIAC Database Template
• • •
Reliability Improvement Warranty, DO56, Warranty Records, etc. Match Failures to IPB Insure Part Replacements were Component Failures Add Failure Data to Database
• • •
Verify Equipment Inventory Equipment Hours/Miles, Part Hours/Miles Application Environment
Perhaps the most important aspect of this data collection process is identifying viable sources of high quality data. Large automated maintenance databases, such as the Air Force REMIS system or the Navy's 3M and Avionics 3M systems, typically will not provide accurate data on piece parts. They can, however, provide acceptable data on assemblies or LRUs, if used judiciously. Additionally, there are specific instances in which they can be used to obtain piece-part data. Piece-part data from these maintenance systems is used in the RIAC's data collection efforts only when it can be verified that they accurately report data at this level. Reliability Improvement Warranty (RIW) data are another high quality data source which has been used. Reliability Information Analysis Center 359
Chapter 7: Examples
Completeness of data, consistency of data, equipment population tracking, failure verification, availability of parts breakdown structures, and characterization of operational histories are all used to determine the adequacy of the data. In many cases, data submitted to the RIAC is discarded since an acceptable level of credibility does not exist. Inherent limitations in data collection efforts can result in errors and inaccuracies in summary data. Care must be taken to ensure that the following factors are considered when using a data source. Some of the sources of error are: 1. There are many more factors affecting reliability than can be identified 2. There is a degree of uncertainty in any failure rate data collection effort. This uncertainty is due to the following factors: a. Uncertainty as to whether the failure was inherent (common cause) or event-related (special cause) b. Difficulty in separating primary and secondary failures c. Much of the collected data is generic and not manufacturer specific, indicating that variations in the manufacturing process are not accounted for d. It is very difficult to distinguish between the effects of highly correlated variables. For example, the fact that higher quality components are typically used in more severe environments makes it impossible to distinguish the effect that each has, independently, on reliability. e. Operating hours can be reported inaccurately f. Maintenance logs can be incomplete Actual component stresses are rarely known. Even if nominal stresses are known, actual stresses which significantly impact reliability can vary significantly about this nominal value. The impacts of complex environmental stresses on reliability during field operation of a product or system is also extremely difficult, if not impossible, to discern. When collecting field failure data, a very important variable is the criteria used to define, detect and classify failures. Much of the failure data presented in NPRD-2010 were identified by maintenance technicians performing a repair action, indicating that the criteria for failure is that a part in a particular application has failed in a manner that makes it apparent to the technician. In some data sources, the criteria for failure were that the component replacement must have remedied the failure symptom. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 360
Chapter 7: Examples
7.4.2. Data Interpretation
Data contained in NPRD-2010 reflects industry average failure rates, especially the summary failure rates which were derived by combining several failure rates on similar parts/assemblies from various sources. In certain instances, reliability differences can be distinguished between manufacturers or between detailed part characteristics. Although the summary section of NPRD cannot be used to identify these differences (since it presents summaries only by generic type, quality, environment, and data source), the listings in the detailed section of NPRD contain all of the specific information that was known for each part and, therefore, can sometimes be used to identify such differences. Data in the summary section of NPRD represent an "estimate" of the expected failure rate. The "true" value will lie within some confidence interval about that estimate. The traditional method of identifying confidence limits for components with exponentially distributed lifetimes has been the use of the Chi-Square distribution. This distribution relies on the observance of failures from a homogeneous population and, therefore, has limited applicability to merged data points from a variety of sources. To give users of NPRD a better understanding of the confidence they can place in the presented failure rates, an analysis of RIAC data in the past concluded that, for a given generic part type, the natural logarithm of the observed failure rate is normally distributed with a standard deviation of 1.5. This means that 68 percent of the actual experienced failure rates will be between 0.22 and 4.5 times the mean value. Similarly, 90% of actual failure rates will be between 0.08 and 11.9 times the presented mean value. As a general rule-of-thumb, this type of precision is typical of probabilistic reliability prediction models and point-estimate failure rates such as those contained within NPRD. It should be noted that this precision is applicable to predicted failure rates at the component level, and that confidence will increase as the statistical distributions of components are combined when analyzing modules or systems. In virtually all of the field failure data collected for NPRD, TTF was not available. Few current DoD or commercial data tracking systems report elapsed time indicator (ETI) meter readings that would allow TTF compilations. Those that do lose accuracy following removal and replacement of failed items. To accurately monitor these times, each replaceable item would require its own individual time recording device. Data collection efforts typically track only the total number of item failures, part populations, and the number of system operating hours. This means that the assumed underlying TTF distribution for all failure rates presented in NPRD is the exponential distribution. Unfortunately, many part types for which data are presented typically do not follow the exponential failure law, but rather exhibit wearout characteristics, or an increasing failure Reliability Information Analysis Center 361
Chapter 7: Examples
rate in time. While the actual TTF distribution may be Weibull or lognormal, it may appear to be exponentially distributed if a long enough time has elapsed. This assumption is accurate only under the condition that components are replaced upon failure, which is true for the vast majority of data contained in NPRD. To illustrate this, refer to Figure 7.4-1, which depicts the apparent failure rate for a population of components that are replaced upon failure, each of which follow the Weibull TTF distribution. This illustrates Drenick’s theorem that was discussed earlier in this book.
MTTF = Mean-Time-to-Failure, α = Weibull Characteristic Life
Figure 7.4-1: Apparent Failure Rate for Replacement Upon Failure At t = 0, the population of parts has not experienced operation. As operating time increases, parts in the original population are replaced and the failure rate increases. The failure rate then decreases as the majority of parts have been replaced with new parts. The population of replaced parts undergo the same process with the exception that the deviation of the second distribution is greater due to the fact that the "time zeros" of the replaced parts, themselves, are spread over time. This process continues until the "time zeros" of the parts have become sufficiently randomized to result in an apparent exponentially distributed population. The approximate time at which this asymptotic value is reached as a function of beta is given in Table 7.4-2. The asymptotic value of failure rate is 1/alpha, regardless of beta.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 362
Chapter 7: Examples
Table 7.4-2: Time at Which Asymptotic Value is Reached β 2 4 6 8
Asymptote 1.0 2.4 4.2 7.0
Additionally, since MTTF is often used instead of characteristic life, their relationship should be understood. The ratio of alpha/MTTF is a function of beta and is given in Table 7.4-3.
Table 7.4-3 α/MTTF Ratio as a Function of β β 1.0 2.0 2.5 3.0 4.0
Asymptote 1.00 1.15 1.12 1.10 1.06
Based on the previous discussion, it is apparent that the time period over which data is collected is very important. For example, if the data is collected from “time zero” to a time which is a fraction of alpha, the failure rate will be increasing over that period and the average failure rate will be much less than the asymptotic value. If however the data is collected during a time period after which the failure rate has reached its asymptote, the apparent failure rate will be constant and will have the value 1/alpha. The detailed data section in NPRD presents part populations which provide the user the ability to further analyze the time logged to an individual part or assembly, and to estimate the characteristic life. For example, the detailed section presents the population and the total number of operating hours for each data record. Dividing the part operating hours by the population yields the average number of operating hours for the system/equipment in which the part/assembly was operating. An entry for a commercial quality mercury battery in a ground, fixed (GF) environment indicates that a population of 328 batteries had experienced a total of 0.8528 million part hours of operation. This indicates that each battery had experienced an average of 0.0026 million hours of operation in the time period over which the data was collected. If a shape parameter, beta, of the Weibull distribution is known for a particular part/assembly, the user can use this data to extrapolate the average failure rate presented in NPRD to a Weibull characteristic life Reliability Information Analysis Center 363
Chapter 7: Examples
(alpha). If the percent failure rate is relatively low, the methodology is of limited value. If a significant percent of the population has failed, the methodology will yield results for which the user should have a higher degree of confidence. The methodology presented is useful only in cases where TTF characteristics are needed. In many instances, knowledge of the part characteristic life is of limited value if the logistics demand is the concern. This data can, however, be used to estimate characteristic life in support of preventive maintenance efforts. The assumptions in the use of this methodology are: 1. Data were collected from "time zero" of the part/assembly field usage 2. The Weibull distribution is valid and β is known Table 7.4-4 contains cumulative percent failure as a function of the Weibull beta shape parameter and the time/characteristic life ratio (t/α). The percent failure from the NPRD detailed data section can be converted to a (t/alpha) ratio using the data in Table 7.4-4. Once this ratio is determined, a characteristic life can be determined by dividing the average operating hours per part (part hours/population) by the (t/alpha) ratio. It should be noted here that the percentage failures in the table can be greater than 100, since parts are replaced upon failure and there can be an unlimited number of replacements for any given part.
Table 7.4-4: Percent Failure for Weibull Distribution
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 364
Chapter 7: Examples
As an example, consider the NPRD detailed data for “Electrical Motors, Sensor”; Military Quality Grade; Airborne, Uninhabited (AU) environment; and a Population Size of 960 units. Assume for this data entry that there were 359 failures in 0.7890 million part-operating hours. The data may be converted to a characteristic life in the following manner: 1. Determine the Percent Failure: 359 % Failure = 960 = 37.4%
2. Determine a typical Weibull shape parameter (β). For motors, a typical beta value is 3.0 (Reference 5). 3. Convert the Percent Failure to a t/alpha ratio using Table 7.4-4 (for % fail = 37.4 and β = 3) t
α
≅ 0.65
(extrapolating between 31 and 42)
4. Calculate average operating hours per part: Part Hours 0.7890 = = 0.00082 million hours Population Count 960
5. Calculate α:
⎞ ⎛ Part Hours ⎟⎟ ⎜⎜ Population Count ⎠ 0.00082 α=⎝ = = 0.00126 million hours 0.65 ⎛t⎞ ⎜ ⎟ ⎝α ⎠ Based on this data, an approximate Weibull characteristic life is 1260 hours. The user of this methodology is cautioned that this is a very approximate method for determining the characteristic life of an item when TTF data is not available. It should also be noted that for small values of time (i.e.; t < 0.1 alpha), random failures can predominate, effectively masking wearout characteristics and rendering the methodology inaccurate. Reliability Information Analysis Center 365
Chapter 7: Examples
Additionally, for small operating times relative to α, the results are dependent on the extreme tail of the distribution, thus significantly decreasing the confidence in the derived alpha value. For part types exhibiting wearout characteristics, the failure rate presented represents an average failure rate over the time period in which the data was collected. It should also be noted that for complex nonelectronic devices or assemblies, the exponential distribution is a reasonable assumption. The user of this data should also be aware of how data on cyclic devices such as circuit breakers is presented in NPRD. Ideally, these devices should have failure rates presented in terms of failures per operating cycles. Unfortunately, from the field data collected, the number of actuations is rarely known and, therefore, the listed failure rates are presented in terms of failures per operating hour for the equipment in which the part is used. 7.4.3. Document Overview
The RIAC NPRD databook is organized into the following sections: Section 1: Section 2: Section 3: Section 4: Section 5: Section 6: Section 7: Section 8:
Introduction Part Summaries Part Details Data Sources Part Number/Mil Number Index National Stock Number Index with Federal Stock Class Prefix National Stock Number Index without Federal Stock Class Prefix Part Description Index
Sections 2 through 8 are described in detail in the following sections. 7.4.3.1. "Part Summaries" Overview
The summary section of NPRD contains combined failure rate data, presented in order of Part Description, Quality Level, Application Environment, and Data Source. The Part Description itself is presented in a hierarchical classification. The known technical characteristics, in addition to the classification, are contained in Section 3 of the book, “Part Details”. All data records were combined by totaling the failures and operating hours from each unique data source. In some cases, only failure rates were reported to RIAC. These data points do not include specific operating hours and failures, and have dashes in the Total Failed and Operating Hours/Miles fields. Table 7.4-5 describes each field presented in the summary section. 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC 366
Chapter 7: Examples
Table 7.4-5: Field Descriptions Field Name Part Description
Field Description Description of the part, including the major family of parts and specific part-type breakdown within the part family. The RIAC does not distinguish parts from assemblies within NPRD. Information is presented on parts/assemblies at the indenture level at which it was available. The description of each item for which data exists is made as clear as possible so that the user can choose a failure rate on the most similar part or assembly. The parts/assemblies for which data is presented can be comprised of several part types, or they can be a constituent part of a larger assembly. In general, however, data on the part type listed first in the data table is representative of the part type listed and not of the higher level of assembly. For example, a listing for “Stator, Motor” represents failure experience on the stator portion of the motor and not the entire motor assembly. Added descriptors to the right, separated by commas, provide further details on the part type listed first. Additional detailed part/assembly characteristics can be found, if available, in the Part Details section of NPRD.
Quality Level
The Quality Level of the part, as indicated by: Commercial - Commercial quality parts Military - Parts procured in accordance with MIL specifications Unknown - Data resulting from a device of unknown quality level
App. Env.
The Application Environment describes the conditions of field operation. See Table 7.4-6 for a detailed list of the application environments and their descriptions. These environments are consistent with MIL-HDBK-217. In some cases, environments more generic than those used in MIL-HDBK-217 are used. For example: "A" indicates the part was used in an Airborne environment, but the precise location and aircraft type was not known. Additionally, some environments are more specific than the current version of MIL-HDBK-217, since the current version has merged many of the environment categories and the NPRD data was originally categorized into the more specific environment. Environments preceded by the term "NO" are indicative of components used in a non-operating product or system in the specified environment.
Data Source
Source of data comprising the NPRD data entry. The source number may be used as a reference to Section 4 of NPRD to review the specific data source description.
Failure Rate Fails / (E6)
The failure rate presented for each unique part type, environment, quality, and source combination. It is the total number of failures divided by the total number of life units. No letter suffix indicates that the failure rate is in failures per million operating hours. An "M" suffix indicates the unit is failures per million miles. For roll-up data entries (i.e., those without sources listed), the failure rate is derived using the data merge algorithm described in this section. A failure rate preceded by a "